Nodes randomly turn off

Since there are now a lot of processes running on the HDD, they jam each other from time to time.
So I think it is time to make the default writability check timeout a bit longer than 1 minute.

I have several nodes that shut down with this error.

|2024-08-25T11:20:17+03:00|ERROR|services|unexpected shutdown of a runner|{name: piecestore:monitor, error: piecestore monitor: timed out after 1m0s while verifying writability of storage directory, errorVerbose: piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:181\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:168\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78}|
|2024-08-25T11:20:17+03:00|ERROR|gracefulexit:chore|error retrieving satellites.|{error: satellitesdb: context canceled, errorVerbose: satellitesdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits.func1:200\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Service).ListPendingExits:59\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).AddMissing:55\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).Run:48\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78}|
|2024-08-25T11:20:32+03:00|WARN|services|service takes long to shutdown|{name: retain}|
|2024-08-25T11:20:32+03:00|WARN|services|service takes long to shutdown|{name: piecestore:cache}|
|2024-08-25T11:20:32+03:00|WARN|services|service takes long to shutdown|{name: collector}|
|2024-08-25T11:20:32+03:00|WARN|servers|service takes long to shutdown|{name: server}|
|2024-08-25T11:20:32+03:00|WARN|services|service takes long to shutdown|{name: pieces:trash}|
|2024-08-25T11:21:34+03:00|ERROR|pieces|used-space-filewalker failed|{Satellite ID: 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE, Lazy File Walker: false, error: filewalker: context canceled, errorVerbose: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78}|
|2024-08-25T11:21:34+03:00|ERROR|piecestore:cache|encountered error while computing space used by satellite|{error: filewalker: context canceled, errorVerbose: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78, SatelliteID: 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE}|
|2024-08-25T11:21:34+03:00|ERROR|pieces|used-space-filewalker failed|{Satellite ID: 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6, Lazy File Walker: false, error: filewalker: context canceled, errorVerbose: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78}|
|2024-08-25T11:21:34+03:00|ERROR|piecestore:cache|encountered error while computing space used by satellite|{error: filewalker: context canceled, errorVerbose: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78, SatelliteID: 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6}|
|2024-08-25T11:21:34+03:00|ERROR|pieces:trash|emptying trash failed|{error: pieces error: filestore error: context canceled; context canceled, errorVerbose: pieces error: filestore error: context canceled; context canceled\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).EmptyTrash:205\n\tstorj.io/storj/storagenode/blobstore/statcache.(*CachedStatBlobstore).EmptyTrash:114\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:432\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:443\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1.1:84\n\tstorj.io/common/sync2.(*Workplace).Start.func1:89}|
|2024-08-25T11:21:34+03:00|ERROR|retain|retain pieces failed|{cachePath: C:\\Program Files\\Storj11\\Storage Node/retain, error: retain: filewalker: context canceled, errorVerbose: retain: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash:181\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash:582\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:380\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:265\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78}|
|2024-08-25T11:21:34+03:00|ERROR|pieces|used-space-filewalker failed|{Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Lazy File Walker: false, error: filewalker: context canceled, errorVerbose: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78}|
|2024-08-25T11:21:34+03:00|ERROR|piecestore:cache|encountered error while computing space used by satellite|{error: filewalker: context canceled, errorVerbose: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78, SatelliteID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S}|
|2024-08-25T11:21:34+03:00|ERROR|pieces|used-space-filewalker failed|{Satellite ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs, Lazy File Walker: false, error: filewalker: context canceled, errorVerbose: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78}|
|2024-08-25T11:21:34+03:00|ERROR|piecestore:cache|encountered error while computing space used by satellite|{error: filewalker: context canceled, errorVerbose: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78, SatelliteID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs}|
|2024-08-25T11:21:34+03:00|ERROR|piecestore:cache|error getting current used space for trash: |{error: filestore error: failed to walk trash namespace 7b2de9d72c2e935f1918c058caaf8ed00f0581639008707317ff1bd000000000: context canceled, errorVerbose: filestore error: failed to walk trash namespace 7b2de9d72c2e935f1918c058caaf8ed00f0581639008707317ff1bd000000000: context canceled\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).SpaceUsedForTrash:273\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:100\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78}|
2024-08-25T11:21:58+03:00	ERROR	collector	error during expired pieces collection	{"count": 11000, "error": "pieces error: pieceexpirationdb: context canceled", "errorVerbose": "pieces error: pieceexpirationdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).getExpiredPaginated:103\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).GetExpired:66\n\tstorj.io/storj/storagenode/pieces.(*Store).GetExpiredBatchSkipV0:612\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:68\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run:64\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-08-25T11:21:58+03:00	ERROR	collector	error during collecting pieces: 	{"error": "pieces error: pieceexpirationdb: context canceled", "errorVerbose": "pieces error: pieceexpirationdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).getExpiredPaginated:103\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).GetExpired:66\n\tstorj.io/storj/storagenode/pieces.(*Store).GetExpiredBatchSkipV0:612\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:68\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run:64\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-08-25T11:21:58+03:00	ERROR	failure during run	{"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:181\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:168\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-08-25T11:21:58+03:00	FATAL	Unrecoverable error	{"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:181\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:168\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}

That’s the culprit: you need to either improve the disk subsystem or increase the timeout mentioned in the error.
In this case you need to increase the writable timeout a little bit (by 30s, for example),
and repeat that until the node stops crashing. However, if it ends up greater than 5 minutes, it’s time to rethink your setup.
Keep in mind that the timeout for a GET_AUDIT request is now 15 seconds, so if you ever start seeing readability check failures as well… then you have a real problem. Not right now, but if it ever happens (hopefully never)…
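For example, both timeouts can be set in config.yaml (or passed as command-line flags of the same names). A minimal sketch, with the option names written from memory — verify them against the commented defaults shipped in your own config.yaml before relying on them:

  # how long the writability check may run before the node gives up (default 1m0s)
  storage2.monitor.verify-dir-writable-timeout: 1m30s
  # the matching timeout for the readability check (default 1m0s)
  storage2.monitor.verify-dir-readable-timeout: 1m30s

Restart the node after changing these so the new values are picked up.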


It is a problem only when deletes are happening; the rest of the time it works fine.


Ok.
“When I became a firefighter, it was so cool! Until, you know, you have to put out a fire… Then I prayed to get fired…”


I think the problem will be partly solved once the badger cache can be moved to an SSD. It will lower the pressure on the HDD, like it happened with the DBs.


It can, but you have to put in some effort:


Do you think the badger cache has high I/O pressure? Did you see that in Resource Monitor? I believe it’s cached by the OS.


It is cached when there is enough RAM for it. Yes, it has big pressure: for example, a node on NVMe with the cache completes the startup scan in less than a minute and draws 25 MB/s from the NVMe; an HDD can’t sustain those speeds.


If the whole node is on NVMe, then that’s the total amount of operations; not sure you can compare throughput with IOPS, though.
Only the load can show this, but you need to separate the load driven by the badger cache from the total load. Use either Resource Monitor or Performance Monitor (the latter has many more metrics, and you can record the graphs over a period of time and review what the load was in the past).


Hey vadim, there are some options in config.yaml about deletes: the number processed simultaneously and the number of files to delete. If you are doing more than 1 simultaneously, maybe stop and go down to 1.
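For reference, a rough sketch of what that might look like in config.yaml; I’m writing these keys from memory, so treat the names as assumptions and verify them against the commented defaults in your own file:

  # how many retain (garbage collection) jobs may run at the same time
  retain.concurrency: 1
  # delete worker/queue settings — the exact keys vary by version, so check your config.yaml
  # storage2.delete-workers: 1
  # storage2.delete-queue-size: 10000

If your file shows different names, go by the comments in the file itself.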

What filesystem are you running on?

Also, if you are using docker, then it’s pretty easy right now to move the badger cache to an SSD. Just add a separate bind mount for filestatcache.

My docker compose looks like this:

  # node config and piece storage on the data disk
  - type: bind
    source: /srv/storj2
    target: /app/config
  # databases on their own path (the container's /app/dbs)
  - type: bind
    source: /home/ubuntu/storj/storage
    target: /app/dbs
  # badger cache bound over storage/filestatcache so it lives on the SSD
  - type: bind
    source: /home/ubuntu/storj/badger
    target: /app/config/storage/filestatcache
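Note that the cache itself also has to be switched on in config.yaml; if I remember the option name correctly it’s pieces.file-stat-cache (treat the exact key as an assumption and check your own file):

  # enable the badger file-stat cache; its data lands in storage/filestatcache,
  # which is exactly the path the bind mount above redirects to the SSD
  pieces.file-stat-cache: badger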

I had a similar problem. I noticed that the amount of free RAM became extremely low after the computer had been running for a while. Then I enlarged the swap file in Windows and the problem went away.

Yep, I get this error on my new node with a slower 5600 rpm disk as well.

The problem was swept under the rug; it did not go away. The only way to “fix” it is to make the amount of I/O reaching the disk smaller than what it can sustain. You can find specific advice on the forum here.

Mindless file deletion causes problems for the node; watch the MFT size and location.

I found that this node has files in the trash from 28.07 and it can’t clean them. I deleted some manually; some files were being deleted at a rate of 1 per second or even slower.