So I was troubleshooting some issues with the lazy filewalker not working properly and found this log entry from June 4th, 2024 that was 4 MB long. It appears to be the same stack trace repeated over and over (an endless loop?).
I parsed it to replace the escaped newline characters with regular line breaks and have attached it here if anyone wants to take a peek.
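For reference, the parsing was done with roughly the following PowerShell; this is a sketch from memory rather than the exact command, and the paths are examples.
# Expand the escaped \n and \t sequences in the 4 MB log entry into real line breaks
# so a normal text viewer can open it (paths are examples)
$raw = Get-Content -Raw "C:\Program Files\Storj\Storage Node\storagenode.log"
$raw -replace '\\n', "`n" -replace '\\t', "`t" | Set-Content "C:\temp\storagenode-expanded.log"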
This caused my node to go offline, and a restart of the Windows service was required.
If your node is working properly, then there is nothing to worry about in the stack trace. Also, it's never a good idea to upload logs to Google Drive in a way that requires others to request access.
I was providing the stack trace so that the issue could potentially be resolved if it hasn't been already, as I can't think of a good reason for a 4 MB stack trace with the same functions being called over and over.
I am experiencing some node issues on v1.105.4 with the lazy filewalker not seeming to want to start (I can inspect the filewalker process and it isn't reading any files), but that's probably a topic for another thread and unrelated to this error.
Could you please copy the head of the error here? It should start with a datestamp. You do not need to post everything, just the beginning of the error before the long stack trace.
I'm guessing this is related to the increased ingress that has been swamping the drive. I grabbed a few lines before it as well. The log file is not easy to parse (I had to resort to PowerShell; the approach is sketched after the excerpt), as the extremely long line crashes Large Text File Viewer.
2024-06-04T21:52:36-04:00 WARN services service takes long to shutdown {"name": "gracefulexit:chore"}
2024-06-04T21:52:36-04:00 WARN servers service takes long to shutdown {"name": "server"}
2024-06-04T21:52:36-04:00 WARN services service takes long to shutdown {"name": "orders"}
2024-06-04T21:52:37-04:00 INFO servers slow shutdown {"stack": "goroutine 1173 [running]:
storj.io/storj/private/lifecycle.(*Group).logStackTrace.func1()
/go/src/storj.io/storj/private/lifecycle/group.go:107 +0x71
sync.(*Once).doSlow(0xc0078421a0?, 0xc00e7901c0?)
/usr/local/go/src/sync/once.go:74 +0xbf
sync.(*Once).Do(...)
/usr/local/go/src/sync/once.go:65
storj.io/storj/private/lifecycle.(*Group).logStackTrace(0xc0001f7eb8?)
/go/src/storj.io/storj/private/lifecycle/group.go:104 +0x3c
storj.io/storj/private/lifecycle.(*Group).Run.func1({0x14e3a70?, 0xc0006dc870?})
/go/src/storj.io/storj/private/lifecycle/group.go:77 +0x2a5
runtime/pprof.Do({0x14e3cd8?, 0xc0006f80a0?}, {{0xc0000fc1e0?, 0xc0001f7fd0?, 0xa03025?}}, 0xc0008ac1c0)
/usr/local/go/src/runtime/pprof/runtime.go:51 +0x9d
created by storj.io/storj/private/lifecycle.(*Group).Run in goroutine 86
/go/src/storj.io/storj/private/lifecycle/group.go:64 +0x509
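For anyone else whose viewer chokes on that line, this is roughly the PowerShell approach; the path and pattern are examples, so adjust them to your own log.
# Show the few lines preceding the giant "slow shutdown" entry, plus the start of the entry itself
Select-String -Path "C:\Program Files\Storj\Storage Node\storagenode.log" -Pattern "slow shutdown" -Context 5,0 |
    ForEach-Object {
        $_.Context.PreContext
        $_.Line.Substring(0, [Math]::Min(200, $_.Line.Length))
    }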
Somewhere earlier the node should have received a stop event, either because of an external request (from storagenode-updater) or from the OS (due to OOM, for example, or a memory error, i.e. a hardware issue).
Could you please check a few lines before this line:
I see no FATAL errors in the 5,000 lines above the one reported, but there are some errors related to failed uploads before it quit (those 5,000 lines start at 21:41:43). If I recall correctly, the node was completely unresponsive and I tried to restart the service, but nothing happened, so I ended the task on the exe a minute or so later. This box has 128 GB of RAM, so I doubt it was OOM related. I can review more of the logs if you would like, or do a findstr on FATAL (the search itself is shown after the two entries below). I assume that is what you are looking for?
2024-06-04T21:52:53-04:00 FATAL Unrecoverable error {"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:178\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:167\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-06-05T11:43:50-04:00 FATAL Unrecoverable error {"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:178\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:167\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
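For reference, the FATAL search is just a plain pattern match; either of these does it (the path is an example).
# Both of these list every FATAL entry in the log
findstr /C:"FATAL" "C:\Program Files\Storj\Storage Node\storagenode.log"
Select-String -Path "C:\Program Files\Storj\Storage Node\storagenode.log" -Pattern "FATAL" -SimpleMatch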
Yes, thank you!
This is what I expected as the start of the issue with the long stack trace.
This message indicates that the disk was unable to write a small file even after 1m0s.
See
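If you want to sanity-check the disk outside the node, a rough equivalent of what the writability monitor does is to time a tiny write into the storage location. This is only a sketch that approximates the real check, and the path is an example.
# Time a small write into the storage directory, similar in spirit to the node's writability check
$dir = "D:\storagenode\storage"   # example - use your actual storage location
Measure-Command {
    $testFile = Join-Path $dir "write-test.tmp"
    Set-Content -Path $testFile -Value "test"
    Remove-Item $testFile
}
# Anything that takes more than a few seconds here means the disk is badly overloaded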
To me that seems like an issue Storj needs to address somehow. These drives are pretty solid, but they are getting hit with some serious abuse from the increased ingress. Things are a little slower than they were then, but I've also added additional nodes to help mitigate the issues. At the moment I'm fighting with the filewalker because of similar issues as well (it can't get enough I/O bandwidth to get through the walk). Unless running the node on an SSD is going to become a new requirement? My equipment is pretty solid: a 2990WX with 128 GB of RAM and a 1 Gbit pipe, which might actually make things worse in terms of drive abuse.
Unfortunately the root cause (the FATAL error due to the writability check failure) cannot be solved in the code; it needs to be addressed in your setup.
No, we do not have such plans.
If you are using one node per disk, you can improve the I/O throughput a bit (the corresponding Windows commands are sketched after this list):
perform a defragmentation and enable the automatic one (it’s enabled by default);
disable 8dot3 short-name generation;
disable atime (last-access time) updates.
The next step would be to enable the write cache on the drive, but that requires a UPS to be safe.
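On Windows, the first three items usually come down to the following commands. This is a sketch: run them from an elevated prompt, the drive letter is an example, and you should double-check each one against your setup before applying it.
# 1. One-off defragmentation of the storage volume (replace D: with your drive letter)
defrag D: /U /V
# 2. Stop NTFS from generating 8.3 short names for every new file
fsutil behavior set disable8dot3 1
# 3. Stop NTFS from updating the last-access (atime) timestamp on reads
fsutil behavior set disablelastaccess 1
Note that disabling 8dot3 only affects files created afterwards; existing short names remain until they are stripped.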
@ACarneiro is right; however, this option was intended for weak devices, and it seems @loueradun has a much more powerful device than a Raspberry Pi 2/3…
And since it affects customers, I would prefer not to use it, especially now that "choice of n" has been implemented and deployed to production: the load will be adjusted automatically without affecting the customers.
But there is something odd going on when even some very high-performance machines are struggling. I am still bearing @Qwinn's setup in mind: it sounds very powerful and still struggles. Some of my NUCs are running with an SSD for the databases and spinning rust for storage without a single problem (and I suspect that when I can physically reach my Pi 5 and make similar changes, it'll be the same story).
It doesn’t make sense to me to think this is only related to hardware issues.
Yes, this setup is very capable in my opinion; it's better than mine in many respects. However, my system has started to have this issue too, which was a surprise for me, because I didn't see this problem before. And that means the default vanilla setup is not capable (!) of handling the load. Yes, I didn't move the DBs to an SSD, even though I could, but I did that deliberately, to keep a reference setup for cases like this…
Of course it's not; @Qwinn has convinced me of that, and my reference setup has too, though the latter was just a confirmation. So, @Qwinn, thank you!
To add to the discussion here, I have switched over to an SSD for the databases and hoped that would improve the situation. I just went in to check how things were going and saw that the service is not running. I am going to investigate the logs now. This drive has 11 TiB used of its 12.6 TiB allocation and is 14 TiB total.
The crash seems pretty similar, but without the extended stack trace: lots of failed uploads increasing in frequency leading up to the full stop, and none of the satellites finished a filewalk (I've been monitoring the filewalker process via Resource Monitor and honestly haven't seen it read any files since 1.105.4; a counter-based check is sketched after the log excerpt). 66 uploads were cancelled at the time of the crash, which seems like a lot of simultaneous jobs for a single spinning hard disk, especially one nearing capacity.
A monitor that watches for failing uploads/downloads and decreases the number of connections until things settle would seem appropriate IMHO, but to be fair, I'm not intricately familiar with how the jobs are executed.
2024-06-22T22:27:36-04:00 ERROR services unexpected shutdown of a runner {"name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-06-22T22:27:46-04:00 ERROR pieces failed to lazywalk space used by satellite {"error": "lazyfilewalker: exit status 1", "errorVerbose": "lazyfilewalker: exit status 1\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-06-22T22:27:46-04:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {"satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "error": "context canceled"}
2024-06-22T22:27:46-04:00 ERROR pieces failed to lazywalk space used by satellite {"error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-06-22T22:28:06-04:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "error": "context canceled"}
2024-06-22T22:28:06-04:00 ERROR pieces failed to lazywalk space used by satellite {"error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-06-22T22:28:06-04:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {"satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "error": "context canceled"}
2024-06-22T22:28:06-04:00 ERROR pieces failed to lazywalk space used by satellite {"error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-06-22T22:28:06-04:00 ERROR piecestore:cache error getting current used space: {"error": "filewalker: context canceled; filewalker: context canceled; filewalker: context canceled; filewalker: context canceled", "errorVerbose": "group:\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-06-22T22:28:06-04:00 ERROR pieces:trash emptying trash failed {"error": "pieces error: lazyfilewalker: exit status 1", "errorVerbose": "pieces error: lazyfilewalker: exit status 1\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkCleanupTrash:187\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:422\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1.1:84\n\tstorj.io/common/sync2.(*Workplace).Start.func1:89"}
2024-06-22T22:28:06-04:00 ERROR failure during run {"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
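In case it helps anyone else watching for this: instead of keeping Resource Monitor open, the per-process disk counters can be polled from PowerShell. A rough sketch follows; the counter paths are standard Windows ones, but the process instance names are examples and may differ on your machine.
# Sample disk I/O for the storagenode processes every 5 seconds for one minute.
# The lazy filewalker runs as a second storagenode process, so it shows up as e.g. storagenode#1.
$counters = '\Process(storagenode*)\IO Read Bytes/sec',
            '\Process(storagenode*)\IO Write Bytes/sec'
Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 12 |
    ForEach-Object { $_.CounterSamples | Format-Table InstanceName, CookedValue -AutoSize }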