Nodes randomly turn off

Since there are now a lot of processes running on the HDD, they jam each other from time to time.
So I think it is time to make the default writability check timeout a bit longer than 1 minute.

I have several nodes that shut down with this error.

|2024-08-25T11:20:17+03:00|ERROR|services|unexpected shutdown of a runner|{name: piecestore:monitor, error: piecestore monitor: timed out after 1m0s while verifying writability of storage directory, errorVerbose: piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:181\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:168\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78}|
|2024-08-25T11:20:17+03:00|ERROR|gracefulexit:chore|error retrieving satellites.|{error: satellitesdb: context canceled, errorVerbose: satellitesdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits.func1:200\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Service).ListPendingExits:59\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).AddMissing:55\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).Run:48\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78}|
|2024-08-25T11:20:32+03:00|WARN|services|service takes long to shutdown|{name: retain}|
|2024-08-25T11:20:32+03:00|WARN|services|service takes long to shutdown|{name: piecestore:cache}|
|2024-08-25T11:20:32+03:00|WARN|services|service takes long to shutdown|{name: collector}|
|2024-08-25T11:20:32+03:00|WARN|servers|service takes long to shutdown|{name: server}|
|2024-08-25T11:20:32+03:00|WARN|services|service takes long to shutdown|{name: pieces:trash}|
|2024-08-25T11:21:34+03:00|ERROR|pieces|used-space-filewalker failed|{Satellite ID: 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE, Lazy File Walker: false, error: filewalker: context canceled, errorVerbose: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78}|
|2024-08-25T11:21:34+03:00|ERROR|piecestore:cache|encountered error while computing space used by satellite|{error: filewalker: context canceled, errorVerbose: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78, SatelliteID: 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE}|
|2024-08-25T11:21:34+03:00|ERROR|pieces|used-space-filewalker failed|{Satellite ID: 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6, Lazy File Walker: false, error: filewalker: context canceled, errorVerbose: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78}|
|2024-08-25T11:21:34+03:00|ERROR|piecestore:cache|encountered error while computing space used by satellite|{error: filewalker: context canceled, errorVerbose: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78, SatelliteID: 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6}|
|2024-08-25T11:21:34+03:00|ERROR|pieces:trash|emptying trash failed|{error: pieces error: filestore error: context canceled; context canceled, errorVerbose: pieces error: filestore error: context canceled; context canceled\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).EmptyTrash:205\n\tstorj.io/storj/storagenode/blobstore/statcache.(*CachedStatBlobstore).EmptyTrash:114\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:432\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:443\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1.1:84\n\tstorj.io/common/sync2.(*Workplace).Start.func1:89}|
|2024-08-25T11:21:34+03:00|ERROR|retain|retain pieces failed|{cachePath: C:\\Program Files\\Storj11\\Storage Node/retain, error: retain: filewalker: context canceled, errorVerbose: retain: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash:181\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash:582\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:380\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:265\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78}|
|2024-08-25T11:21:34+03:00|ERROR|pieces|used-space-filewalker failed|{Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Lazy File Walker: false, error: filewalker: context canceled, errorVerbose: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78}|
|2024-08-25T11:21:34+03:00|ERROR|piecestore:cache|encountered error while computing space used by satellite|{error: filewalker: context canceled, errorVerbose: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78, SatelliteID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S}|
|2024-08-25T11:21:34+03:00|ERROR|pieces|used-space-filewalker failed|{Satellite ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs, Lazy File Walker: false, error: filewalker: context canceled, errorVerbose: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78}|
|2024-08-25T11:21:34+03:00|ERROR|piecestore:cache|encountered error while computing space used by satellite|{error: filewalker: context canceled, errorVerbose: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78, SatelliteID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs}|
|2024-08-25T11:21:34+03:00|ERROR|piecestore:cache|error getting current used space for trash: |{error: filestore error: failed to walk trash namespace 7b2de9d72c2e935f1918c058caaf8ed00f0581639008707317ff1bd000000000: context canceled, errorVerbose: filestore error: failed to walk trash namespace 7b2de9d72c2e935f1918c058caaf8ed00f0581639008707317ff1bd000000000: context canceled\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).SpaceUsedForTrash:273\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:100\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78}|
2024-08-25T11:21:58+03:00	ERROR	collector	error during expired pieces collection	{"count": 11000, "error": "pieces error: pieceexpirationdb: context canceled", "errorVerbose": "pieces error: pieceexpirationdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).getExpiredPaginated:103\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).GetExpired:66\n\tstorj.io/storj/storagenode/pieces.(*Store).GetExpiredBatchSkipV0:612\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:68\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run:64\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-08-25T11:21:58+03:00	ERROR	collector	error during collecting pieces: 	{"error": "pieces error: pieceexpirationdb: context canceled", "errorVerbose": "pieces error: pieceexpirationdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).getExpiredPaginated:103\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).GetExpired:66\n\tstorj.io/storj/storagenode/pieces.(*Store).GetExpiredBatchSkipV0:612\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:68\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run:64\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-08-25T11:21:58+03:00	ERROR	failure during run	{"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:181\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:168\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-08-25T11:21:58+03:00	FATAL	Unrecoverable error	{"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:181\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:168\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}

That’s the culprit: you need to either improve the disk subsystem or increase the timeout mentioned in the error.
In this case you need to increase the writable timeout a little bit (by 30s, for example),
and repeat that until the node stops crashing. However, if it ends up greater than 5 minutes, it’s time to rethink your setup.
Keep in mind that the timeout for a GET_AUDIT request is now 15 seconds, so if you ever start seeing readability check failures as well… then you have a real problem. Not right now, but if it ever happens (hopefully never)…
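For example, both timeouts can be set in config.yaml (or passed as command-line flags of the same names). A minimal sketch, with the option names written from memory — verify them against the commented defaults shipped in your own config.yaml before relying on them:

  # how long the writability check may run before the node gives up (default 1m0s)
  storage2.monitor.verify-dir-writable-timeout: 1m30s
  # the matching timeout for the readability check (default 1m0s)
  storage2.monitor.verify-dir-readable-timeout: 1m30s

Restart the node after changing these so the new values are picked up.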


It is a problem only when deletes are happening; the rest of the time it works fine.


Ok.
“When I became a firefighter, it was so cool! Until, you know, you have to put out a fire… Then I prayed to get fired…”


I think the problem will be partly solved once the badger cache can be moved to an SSD. It will lower the pressure on the HDD, like it happened with the DBs.


It can, but you have to put in some effort:


Do you think the badger cache has high I/O pressure? Did you see that in Resource Monitor? I believe it’s cached by the OS.


It is cached when there is enough RAM for it. Yes, it has big pressure: for example, a node on NVMe with the cache completes the startup scan in less than a minute and draws 25 MB/s from the NVMe; an HDD can’t sustain those speeds.


If the whole node is on NVMe, then that’s the total amount of operations; not sure you can compare throughput with IOPS, though.
Only the load can show this, but you need to separate the load driven by the badger cache from the total load. Use either Resource Monitor or Performance Monitor (the latter has many more metrics, and you can record the graphs over a period of time and review what the load was in the past).


Hey vadim, there are some options in config.yaml about deletes: the number processed simultaneously and the number of files to delete. If you are doing more than 1 simultaneously, maybe stop and go down to 1.
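For reference, a rough sketch of what that might look like in config.yaml; I’m writing these keys from memory, so treat the names as assumptions and verify them against the commented defaults in your own file:

  # how many retain (garbage collection) jobs may run at the same time
  retain.concurrency: 1
  # delete worker/queue settings — the exact keys vary by version, so check your config.yaml
  # storage2.delete-workers: 1
  # storage2.delete-queue-size: 10000

If your file shows different names, go by the comments in the file itself.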

What filesystem are you running on?

Also, if you are using docker, then it’s pretty easy right now to move the badger cache to an SSD. Just add a separate bind mount for filestatcache.

My docker compose looks like this:

  # node config and piece storage on the data disk
  - type: bind
    source: /srv/storj2
    target: /app/config
  # databases on their own path (the container's /app/dbs)
  - type: bind
    source: /home/ubuntu/storj/storage
    target: /app/dbs
  # badger cache bound over storage/filestatcache so it lives on the SSD
  - type: bind
    source: /home/ubuntu/storj/badger
    target: /app/config/storage/filestatcache
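Note that the cache itself also has to be switched on in config.yaml; if I remember the option name correctly it’s pieces.file-stat-cache (treat the exact key as an assumption and check your own file):

  # enable the badger file-stat cache; its data lands in storage/filestatcache,
  # which is exactly the path the bind mount above redirects to the SSD
  pieces.file-stat-cache: badger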

I had a similar problem. I noticed that the amount of free RAM became extremely low after the computer had been running for a while. Then I enlarged the swap file in Windows and the problem went away.

Yep, I get this error on my new node with a slower 5600 rpm disk as well.

The problem was swept under the rug; it did not go away. The only way to “fix” it is to make the amount of I/O reaching the disk smaller than what it can sustain. You can find specific advice on the forum here.

Mindless file deletion causes problems for the node; watch the MFT size and location.

I found that this node has files in the trash from 28.07 and it can’t clean them. I deleted some manually; some files were being deleted at a rate of 1 per second or even slower.