Node stalls & goes offline every day

I have two nodes on the same server, running in an LXC container. This setup has worked well for years.

Each node:

  • Runs v1.104.5
  • Has a mirrored-vdev ZFS partition passed in
  • Container has 3 GiB of RAM and 4 cores (the dashboard screenshots below don't hint at any resource constraints)

My second node on this server (storj-04) keeps getting into a stalled state and going offline, usually in the early morning (02:00-04:00 UTC). I just noticed that the first node (storj-03) restarts around the same time, so while it appeared to be running "fine", it is probably suffering from the same problem and simply restarting itself, which is why I didn't notice any downtime.

Here’s a log showing storj-03 restarting:

root@storj:~# docker events --since=8h
2024-06-10T04:20:06.371598774Z network disconnect cbd0403d8c1077cbed54bc52679cd335b945e2f696741a6bf7aed2a10991cec5 (container=fdfe23a138a2d432a49a5dfc4618a29bb37b16e596eea58809dcd6e5055b3a5f, name=root_default, type=bridge)
2024-06-10T04:20:06.423572403Z container die fdfe23a138a2d432a49a5dfc4618a29bb37b16e596eea58809dcd6e5055b3a5f (com.docker.compose.config-hash=37d7fb12adf599151407b598397ec8281371bb3768e81c1ebe3170cefb1473a7, com.docker.compose.container-number=1, com.docker.compose.depends_on=, com.docker.compose.image=sha256:a0f1d1ec77544b1f69db1e2ce49e71b26114d47af721a65bc2f48add150d5051, com.docker.compose.oneoff=False, com.docker.compose.project=root, com.docker.compose.project.config_files=/root/docker-compose.yml, com.docker.compose.project.working_dir=/root, com.docker.compose.service=storj-03, com.docker.compose.version=2.27.0, execDuration=11705, exitCode=0, image=storjlabs/storagenode:latest, name=storj-03)
2024-06-10T04:20:06.567252972Z network connect cbd0403d8c1077cbed54bc52679cd335b945e2f696741a6bf7aed2a10991cec5 (container=fdfe23a138a2d432a49a5dfc4618a29bb37b16e596eea58809dcd6e5055b3a5f, name=root_default, type=bridge)
2024-06-10T04:20:06.773811260Z container start fdfe23a138a2d432a49a5dfc4618a29bb37b16e596eea58809dcd6e5055b3a5f (com.docker.compose.config-hash=37d7fb12adf599151407b598397ec8281371bb3768e81c1ebe3170cefb1473a7, com.docker.compose.container-number=1, com.docker.compose.depends_on=, com.docker.compose.image=sha256:a0f1d1ec77544b1f69db1e2ce49e71b26114d47af721a65bc2f48add150d5051, com.docker.compose.oneoff=False, com.docker.compose.project=root, com.docker.compose.project.config_files=/root/docker-compose.yml, com.docker.compose.project.working_dir=/root, com.docker.compose.service=storj-03, com.docker.compose.version=2.27.0, image=storjlabs/storagenode:latest, name=storj-03)
2024-06-10T06:25:01.107099945Z container prune  (reclaimed=0)
2024-06-10T06:25:01.108142210Z network prune  (reclaimed=0)
2024-06-10T06:25:01.109906294Z image prune  (reclaimed=0)
2024-06-10T06:25:01.123711023Z builder prune  (reclaimed=0)

Container dashboard / resource graphs for the last week (you can see where I increased the RAM from 2 GiB to 3 GiB):


I first noticed this a few weeks ago. Restarting the Docker container or the LXC container fixes it until the next day, when it goes offline again.

Running docker compose logs storj-04 | grep -i error | grep -vE "GET|PUT", I'm seeing a lot of "database is locked" errors - so is it slow I/O? I haven't had an issue before now. (A quick disk-latency check follows the log excerpt below.)

root@storj:~# docker compose logs storj-04 | grep -i error | grep -vE "GET|PUT"
storj-04  | 2024-06-08T07:33:22Z        INFO    lazyfilewalker.used-space-filewalker    subprocess exited with status   {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "status": -1, "error": "signal: killed"}
storj-04  | 2024-06-08T07:33:22Z        ERROR   pieces  failed to lazywalk space used by satellite      {"Process": "storagenode", "error": "lazyfilewalker: signal: killed", "errorVerbose": "lazyfilewalker: signal: killed\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:704\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
storj-04  | 2024-06-08T07:33:22Z        ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "error": "context canceled"}
storj-04  | 2024-06-08T07:33:22Z        ERROR   pieces  failed to lazywalk space used by satellite      {"Process": "storagenode", "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:704\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
storj-04  | 2024-06-08T07:33:22Z        ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "error": "context canceled"}
storj-04  | 2024-06-08T07:33:22Z        ERROR   pieces  failed to lazywalk space used by satellite      {"Process": "storagenode", "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:704\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}
storj-04  | 2024-06-08T07:33:23Z        ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "error": "context canceled"}
storj-04  | 2024-06-08T07:33:23Z        ERROR   pieces  failed to lazywalk space used by satellite      {"Process": "storagenode", "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:704\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo"}
storj-04  | 2024-06-08T07:33:26Z        ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "error": "context canceled"}
storj-04  | 2024-06-08T07:33:26Z        ERROR   pieces  failed to lazywalk space used by satellite      {"Process": "storagenode", "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:704\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
storj-04  | 2024-06-08T07:33:29Z        ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "error": "context canceled"}
storj-04  | 2024-06-08T07:33:29Z        ERROR   pieces  failed to lazywalk space used by satellite      {"Process": "storagenode", "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:704\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
storj-04  | 2024-06-08T07:33:29Z        ERROR   piecestore:cache        error getting current used space:       {"Process": "storagenode", "error": "filewalker: context canceled; filewalker: context canceled; filewalker: context canceled; filewalker: context canceled; filewalker: context canceled; filewalker: context canceled", "errorVerbose": "group:\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:713\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:713\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:713\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:713\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:713\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:713\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
storj-04  | 2024-06-08T07:33:29Z        ERROR   failure during run      {"Process": "storagenode", "error": "database is locked"}
storj-04  | Error: database is locked
storj-04  | 2024-06-08T08:26:17Z        ERROR   orders  cleaning DB archive     {"Process": "storagenode", "error": "ordersdb: database is locked", "errorVerbose": "ordersdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*ordersDB).CleanArchive:325\n\tstorj.io/storj/storagenode/orders.(*Service).CleanArchive:162\n\tstorj.io/storj/storagenode/orders.(*Service).Run.func2:146\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
storj-04  | 2024-06-08T09:46:56Z        ERROR   services        unexpected shutdown of a runner {"Process": "storagenode", "name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:178\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:167\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
storj-04  | 2024-06-08T11:10:51Z        INFO    lazyfilewalker.used-space-filewalker    subprocess exited with status   {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "status": -1, "error": "signal: killed"}
storj-04  | 2024-06-08T11:10:51Z        ERROR   pieces  failed to lazywalk space used by satellite      {"Process": "storagenode", "error": "lazyfilewalker: signal: killed", "errorVerbose": "lazyfilewalker: signal: killed\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:704\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
storj-04  | 2024-06-08T11:10:54Z        ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "error": "context canceled"}
storj-04  | 2024-06-08T11:10:54Z        ERROR   pieces  failed to lazywalk space used by satellite      {"Process": "storagenode", "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:704\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
storj-04  | 2024-06-08T11:10:54Z        ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "error": "context canceled"}
storj-04  | 2024-06-08T11:10:54Z        ERROR   pieces  failed to lazywalk space used by satellite      {"Process": "storagenode", "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:704\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}
storj-04  | 2024-06-08T11:10:55Z        ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "error": "context canceled"}
storj-04  | 2024-06-08T11:10:55Z        ERROR   pieces  failed to lazywalk space used by satellite      {"Process": "storagenode", "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:704\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo"}
storj-04  | 2024-06-08T11:10:58Z        ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "error": "context canceled"}
storj-04  | 2024-06-08T11:10:58Z        ERROR   pieces  failed to lazywalk space used by satellite      {"Process": "storagenode", "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:704\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
storj-04  | 2024-06-08T11:11:01Z        ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "error": "context canceled"}
storj-04  | 2024-06-08T11:11:01Z        ERROR   pieces  failed to lazywalk space used by satellite      {"Process": "storagenode", "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:704\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
storj-04  | 2024-06-08T11:11:01Z        ERROR   piecestore:cache        error getting current used space:       {"Process": "storagenode", "error": "filewalker: context canceled; filewalker: context canceled; filewalker: context canceled; filewalker: context canceled; filewalker: context canceled; filewalker: context canceled", "errorVerbose": "group:\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:713\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:713\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:713\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:713\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:713\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:713\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
storj-04  | 2024-06-08T12:33:11Z        ERROR   gracefulexit:chore      error retrieving satellites.    {"Process": "storagenode", "error": "satellitesdb: context canceled", "errorVerbose": "satellitesdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits.func1:200\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Service).ListPendingExits:59\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).AddMissing:55\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).Run:48\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
storj-04  | 2024-06-08T20:23:58Z        ERROR   services        unexpected shutdown of a runner {"Process": "storagenode", "name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:178\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:167\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
storj-04  | 2024-06-08T20:23:58Z        ERROR   gracefulexit:chore      error retrieving satellites.    {"Process": "storagenode", "error": "satellitesdb: context canceled", "errorVerbose": "satellitesdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits.func1:200\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Service).ListPendingExits:59\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).AddMissing:55\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).Run:48\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
storj-04  | 2024-06-08T22:05:00Z        INFO    lazyfilewalker.used-space-filewalker    subprocess exited with status   {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "status": -1, "error": "signal: killed"}
storj-04  | 2024-06-08T22:05:00Z        ERROR   pieces  failed to lazywalk space used by satellite      {"Process": "storagenode", "error": "lazyfilewalker: signal: killed", "errorVerbose": "lazyfilewalker: signal: killed\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:704\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
storj-04  | 2024-06-08T22:05:03Z        ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "error": "context canceled"}
storj-04  | 2024-06-08T22:05:03Z        ERROR   pieces  failed to lazywalk space used by satellite      {"Process": "storagenode", "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:704\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
storj-04  | 2024-06-08T22:05:07Z        ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "error": "context canceled"}
storj-04  | 2024-06-08T22:05:07Z        ERROR   pieces  failed to lazywalk space used by satellite      {"Process": "storagenode", "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:704\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}
storj-04  | 2024-06-08T22:05:07Z        ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "error": "context canceled"}
storj-04  | 2024-06-08T22:05:07Z        ERROR   pieces  failed to lazywalk space used by satellite      {"Process": "storagenode", "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:704\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo"}
storj-04  | 2024-06-08T22:05:07Z        ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "error": "context canceled"}
storj-04  | 2024-06-08T22:05:07Z        ERROR   pieces  failed to lazywalk space used by satellite      {"Process": "storagenode", "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:704\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
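
Since the "database is locked" errors point at slow I/O, I can also watch disk latency on the host while the node is struggling. A rough check (iostat comes from the sysstat package, assuming it's installed on the host):

# watch per-disk latency and utilization in 5-second intervals; high await / %util
# during the 02:00-04:00 UTC window would point at the disks being the bottleneck
iostat -x 5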

Here’s my docker-compose.yml:

x-logging: &default-logging
  logging:
    driver: "json-file"
    options:
      max-size: "10m"
      max-file: "10"
      compress: "true"

services:
  storj-03:
    image: storjlabs/storagenode:latest
    container_name: storj-03
    restart: always
    ports:
      - "28967:28967/tcp"
      - "28967:28967/udp"
      - "14002:14002"
    environment:
      - WALLET=REDACTED
      - EMAIL=REDACTED
      - ADDRESS=REDACTED:28967
      - STORAGE=3.5TB
    volumes:
      - /root/.local/share/storj/identity/storj-03:/app/identity
      - /storj-03:/app/config
    <<: *default-logging

  storj-04:
    image: storjlabs/storagenode:latest
    container_name: storj-04
    restart: always
    ports:
      - "28968:28967/tcp"
      - "28968:28967/udp"
      - "14003:14002"
    environment:
      - WALLET=REDACTED
      - EMAIL=REDACTED
      - ADDRESS=REDACTED:28968
      - STORAGE=3.5TB
    volumes:
      - /root/.local/share/storj/identity/storj-04:/app/identity
      - /storj-04:/app/config
    <<: *default-logging

  watchtower:
    image: storjlabs/watchtower
    container_name: watchtower
    restart: unless-stopped
    command: "storagenode watchtower --stop-timeout 300s --interval 21600"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
    <<: *default-logging

So, things have changed recently: ingress is up a lot. I'm seeing TBs of ingress where before there were only a few hundred GBs - it's a good time to optimize your setup.

I posted this yesterday; it covers the basics of ZFS on Proxmox and the memory requirements. It's good that you have two disks, so reads will be OK :eyes: You might benefit from adding an L2ARC on an SSD or NVMe drive. Also check the memory being allocated to ZFS - hopefully you have 32 GB plus in the server. I'm afraid that's where my ZFS expertise ends; I know SNOs did some comparisons of whether L2ARC helps much, but I can't find them in search…
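
If it helps, a rough way to check what ZFS is currently using and is allowed to use, straight from arcstats (this is just my sketch; values are in bytes):

# current ARC size vs. the configured ceiling
awk '$1 == "size" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats
# 0 here means "use the built-in default", which is roughly half of RAM
cat /sys/module/zfs/parameters/zfs_arc_max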


It seems you need to increase the writability check timeout while you haven't yet solved the slowness of the ZFS configuration (many ZFS users say that adding a special device on SSD should help).
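
For example, in the node's config.yaml (the file in the directory mounted at /app/config), something along these lines should raise the checks from their 1m0s default - assuming the option names haven't changed in your node version:

# give the storage-directory checks more headroom on a slow pool
storage2.monitor.verify-dir-writable-timeout: 1m30s
storage2.monitor.verify-dir-readable-timeout: 1m30s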
See

about FS:


Reading up on L2ARC vs ARC here, I found my arcstats:

# head -n20 /proc/spl/kstat/zfs/arcstats                                                                                                             ✔  root@server  10:32:01
9 1 0x01 147 39984 11132165908 339164642810737
name                            type data
hits                            4    301736836
iohits                          4    397423
misses                          4    29499176
demand_data_hits                4    22695491
demand_data_iohits              4    75973
demand_data_misses              4    1348063
demand_metadata_hits            4    240275254
demand_metadata_iohits          4    206336
demand_metadata_misses          4    15379229
prefetch_data_hits              4    40013
prefetch_data_iohits            4    1724
prefetch_data_misses            4    3665682
prefetch_metadata_hits          4    38726078
prefetch_metadata_iohits        4    113390
prefetch_metadata_misses        4    9106202

Doing some math (hit rate = hits / (hits + misses)):

  • ARC data hit ratio = 94%
  • ARC metadata hit ratio = 94%
  • ARC prefetch data hit ratio = 1%
  • ARC prefetch metadata hit ratio = 81%

The article suggests that a high ARC hit ratio means the ARC and L2ARC are sufficient. It also suggests disabling prefetch if its hit ratio is low, and mine is abysmal.
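
For reference, here is the same hits / (hits + misses) math as a small awk sketch over the counters above, in case anyone wants to recompute it:

awk '
  /^demand_data_hits /          {ddh=$3}
  /^demand_data_misses /        {ddm=$3}
  /^demand_metadata_hits /      {dmh=$3}
  /^demand_metadata_misses /    {dmm=$3}
  /^prefetch_data_hits /        {pdh=$3}
  /^prefetch_data_misses /      {pdm=$3}
  /^prefetch_metadata_hits /    {pmh=$3}
  /^prefetch_metadata_misses /  {pmm=$3}
  END {
    printf "demand data hit ratio:       %.1f%%\n", 100 * ddh / (ddh + ddm)
    printf "demand metadata hit ratio:   %.1f%%\n", 100 * dmh / (dmh + dmm)
    printf "prefetch data hit ratio:     %.1f%%\n", 100 * pdh / (pdh + pdm)
    printf "prefetch metadata hit ratio: %.1f%%\n", 100 * pmh / (pmh + pmm)
  }' /proc/spl/kstat/zfs/arcstats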

Despite the good ARC hit ratio, arc_summary suggests the ARC max is too low: the current ARC size is at 132.5% of the target right now, even with storj-04 down since early this morning:

 ~  arc_summary                                                                                                                                        ✔  root@server  10:48:47

------------------------------------------------------------------------
ZFS Subsystem Report                            Tue Jun 11 10:49:20 2024
Linux 6.8.4-3-pve                                             2.2.3-pve2
Machine: server (x86_64)                                      2.2.3-pve2

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                   132.5 %   10.1 GiB
        Target size (adaptive):                       100.0 %    7.6 GiB
        Min size (hard limit):                         25.5 %    1.9 GiB
        Max size (high water):                            3:1    7.6 GiB
        Anonymous data size:                           35.8 %    3.3 GiB
        Anonymous metadata size:                        4.4 %  412.6 MiB
        MFU data target:                                3.2 %  302.8 MiB
        MFU data size:                                < 0.1 %    4.1 MiB
        MFU ghost data size:                                     3.6 GiB
        MFU metadata target:                           63.8 %    5.9 GiB
        MFU metadata size:                             47.9 %    4.4 GiB
        MFU ghost metadata size:                                 1.4 GiB
        MRU data target:                               10.4 %  980.1 MiB
        MRU data size:                                  1.4 %  136.2 MiB
        MRU ghost data size:                                     2.9 GiB
        MRU metadata target:                           22.6 %    2.1 GiB
        MRU metadata size:                             10.4 %  979.7 MiB
        MRU ghost metadata size:                                 3.4 GiB
        Uncached data size:                             0.0 %    0 Bytes
        Uncached metadata size:                         0.0 %    0 Bytes
        Bonus size:                                     1.1 %  114.8 MiB
        Dnode cache target:                            10.0 %  777.0 MiB
        Dnode cache size:                              46.8 %  363.5 MiB
        Dbuf size:                                      2.3 %  234.0 MiB
        Header size:                                    1.6 %  168.7 MiB
        L2 header size:                                 0.0 %    0 Bytes
        ABD chunk waste size:                         < 0.1 %  153.5 KiB

At this point I've increased the ARC max to 24 GiB, thanks to your other comment, @CutieePie:

echo "$((24 * 1024 * 1024 * 1024))" >/sys/module/zfs/parameters/zfs_arc_max
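
To make that survive a reboot on Proxmox/Debian, my understanding is the module option also needs to go into /etc/modprobe.d (and the initramfs), roughly:

# persist the 24 GiB ARC cap (value in bytes) across reboots
echo "options zfs zfs_arc_max=$((24 * 1024 * 1024 * 1024))" >> /etc/modprobe.d/zfs.conf
update-initramfs -u   # so the option is also applied early at boot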

I'll see how this changes things. If the node still struggles, my next step will be to migrate storj-04 from the mirrored vdev to ext4; my plan was always to move away from the mirrored vdev as the drives filled up anyway.


You may also take a look at