Looping stack trace discovered in a 4 MB line in the storagenode logs that resulted in a crashed node

So I was troubleshooting some issues with the lazy filewalker not working properly and found this log entry from June 4th, 2024 that was 4 MB long. It appears to be the same stack trace repeated over and over (an endless loop?).

I parsed it to replace the escaped newline characters with real line breaks and have attached it here if anyone wants to take a peek.

This caused my node to go offline, and a restart of the Windows service was required.

https://drive.google.com/file/d/13ygkSTHwqkknYDrpTy_Jbq4fP5hjxO8j/view?usp=drive_link

Welcome to the forum @loueradun !

If your node is working properly, then there is nothing to worry about regarding the stack trace. Also, it’s never a good idea to upload logs to Gdrive in a way that requires requesting access.

Can you not access the file?

I was providing the stack trace so that the issue could potentially be resolved, if it hasn’t been already, as I can’t think of a good reason for a 4 MB stack trace with the same functions being called recursively.

I am experiencing some node issues on v1.105.4 with the lazy filewalker not seeming to want to start (I can inspect the filewalker process and it isn’t reading any files), but that’s probably a topic for another thread and unrelated to this error.


Could you please copy the head of the error here? It should start with a datestamp. You do not need to post everything, just the beginning of the error before the long stack trace.

Sorry about that, the file should be public now. Been a long time since I’ve used Drive.

I’m guessing this is related to the increased ingress that has been going on, swamping the drive. I grabbed a few lines before it as well. The log file is not easy to parse (I had to resort to PowerShell), as the extra-long line crashes Large Text File Viewer.

2024-06-04T21:52:36-04:00	WARN	services	service takes long to shutdown	{"name": "gracefulexit:chore"}
2024-06-04T21:52:36-04:00	WARN	servers	service takes long to shutdown	{"name": "server"}
2024-06-04T21:52:36-04:00	WARN	services	service takes long to shutdown	{"name": "orders"}
2024-06-04T21:52:37-04:00	INFO	servers	slow shutdown	{"stack": "goroutine 1173 [running]:
storj.io/storj/private/lifecycle.(*Group).logStackTrace.func1()
/go/src/storj.io/storj/private/lifecycle/group.go:107 +0x71
sync.(*Once).doSlow(0xc0078421a0?, 0xc00e7901c0?)
/usr/local/go/src/sync/once.go:74 +0xbf
sync.(*Once).Do(...)
/usr/local/go/src/sync/once.go:65
storj.io/storj/private/lifecycle.(*Group).logStackTrace(0xc0001f7eb8?)
/go/src/storj.io/storj/private/lifecycle/group.go:104 +0x3c
storj.io/storj/private/lifecycle.(*Group).Run.func1({0x14e3a70?, 0xc0006dc870?})
/go/src/storj.io/storj/private/lifecycle/group.go:77 +0x2a5
runtime/pprof.Do({0x14e3cd8?, 0xc0006f80a0?}, {{0xc0000fc1e0?, 0xc0001f7fd0?, 0xa03025?}}, 0xc0008ac1c0)
/usr/local/go/src/runtime/pprof/runtime.go:51 +0x9d
created by storj.io/storj/private/lifecycle.(*Group).Run in goroutine 86
/go/src/storj.io/storj/private/lifecycle/group.go:64 +0x509

There’s a fix for this coming: private/lifecycle: condense stack traces for slow shutdown · storj/storj@b59ef96 · GitHub. Thanks for the report.


Somewhere earlier the node should have received a stop event, either because of an external request (from the storagenode-updater, or from the OS due to an OOM for example), or because of a memory error, i.e. a hardware issue.
Could you please check a few lines before this line:

Is there a FATAL error?

I see no FATAL errors in the 5000 lines above the one reported, but there are some errors related to failed uploads before it quit (those 5k lines above start at 21:41:43). If I recall correctly, the node was completely unresponsive and I tried to restart the service, but nothing happened, so I ended the task on the exe a minute or so later. This box has 128 GB of RAM, so I doubt it was OOM related. I can review more of the logs if you would like, or do a findstr on FATAL. I assume that is what you are looking for?

A findstr on the log file found only these two:

2024-06-04T21:52:53-04:00	FATAL	Unrecoverable error	{"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:178\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:167\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-06-05T11:43:50-04:00	FATAL	Unrecoverable error	{"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:178\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:167\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}

Yes, thank you!
This is what I expected as the start of the issue with the long stack trace.
This message indicates that the disk was unable to write a small test file even after 1m0s.
See

To me that seems like an issue Storj needs to address somehow. These drives are pretty solid, but they are getting hit with some serious abuse from the increased ingress. Things are a little slower than they were then, but I’ve also added additional nodes to help mitigate the issue. At the moment I’m also fighting with the filewalker because of similar issues (it can’t get enough I/O bandwidth to get through the walk). Unless, of course, running the node on an SSD is going to become a new requirement. My equipment is pretty solid: a 2990WX with 128 GB of RAM and a 1 Gbit pipe, which might actually make things worse in terms of the drive abuse.


Unfortunately the root cause (the FATAL error due to the failed writability check) cannot be solved in the code; it needs to be addressed in your setup.

No, we do not have such plans.
If you are using one node per disk, you can improve the I/O throughput a bit:

  • perform a defragmentation and keep the automatic (scheduled) one enabled (it’s enabled by default);
  • disable 8dot3 names;
  • disable atime (last-access timestamp updates); example commands are sketched below.
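
For example, on Windows these can be done from an elevated prompt roughly like this (the drive letter is only an example, and please double-check the fsutil syntax for your Windows version):

defrag D: /O
fsutil behavior set disable8dot3 1
fsutil behavior set disablelastaccess 1

Note that disabling 8dot3 only affects newly created files; existing files keep their short names unless you strip them separately.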

The next step would be to enable the write cache on the drive, but that requires a UPS to be safe.

You may mitigate the issue somewhat by reducing the number of allowed concurrent connections.
You can tweak this with

storage2.max-concurrent-requests

in your config.yaml file.
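
For example (the value below is just an illustration; pick something your disk can actually sustain, and restart the node afterwards):

# config.yaml (illustrative value)
storage2.max-concurrent-requests: 40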

This is not without its own shortcomings, but if you’re struggling after trying other things it might be an option.


@ACarneiro is right; however, this option was meant for weak devices, but it seems @loueradun has a much more powerful device than a Raspberry Pi 2/3…
And since it affects customers, I would prefer not to use it, especially after the “choice of n” has been implemented and deployed to production: the load will be adjusted automatically without affecting the customers.

Indeed, it’s not ideal.

But there is something odd going on when even some very high-performance machines are struggling. I am still bearing @Qwinn’s setup in mind: it sounds very powerful and still struggles. Some of my NUCs are running with an SSD for the databases and spinning rust for storage and have not had a single problem (and I suspect that when I can physically reach my Pi5 and make similar changes it’ll be the same thing).
It doesn’t make sense to me to think this is only related to hardware issues.

Yes, this setup is very capable in my opinion. It’s better than mine in many aspects. However, my system has started to have this issue too, which was a surprise for me, because I didn’t see this problem before. And this means that the default vanilla setup is not capable (!) of handling the load. Yes, I didn’t move the DBs to an SSD, even though I could. But I did that deliberately, to keep a reference setup for each case…

Of course it’s not; @Qwinn has convinced me of that. And my reference setup has too, of course, but that was just a confirmation. So, @Qwinn, thank you!


To add to the discussion here, I have switched over to an SSD for the databases and hoped that would improve the situation. I just went in to check how things were going and saw that the service is not running. I am going to investigate the logs now. This drive has 11 TiB full of the 12.6 TiB allocated, and is 14 TiB total.

The crash seems pretty similar, but without the extended stack trace. There were lots of failed uploads increasing in frequency leading up to the full stop, and none of the satellites finished a filewalk (I’ve been monitoring the filewalker process via Resource Monitor and, to be honest, haven’t seen it read any of the files since 1.105.4). 66 uploads were cancelled at the time of the crash, which seems like a lot of simultaneous jobs for a single spinning hard disk, especially one nearing capacity.

It seems to me that a monitor for failing uploads/downloads which decreases the number of connections until things settle would be appropriate, but to be fair, I’m not intimately familiar with how the jobs are executed.
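
To illustrate the idea (this is purely a sketch of the concept, not actual Storj code, and all the names are made up): track the outcome of each transfer, halve the allowed concurrency after failures, and grow it back slowly on success.

package main

import (
	"errors"
	"fmt"
	"sync"
)

// adaptiveLimiter is a hypothetical helper (not part of the storagenode):
// it caps concurrent transfers and adapts the cap to observed failures.
type adaptiveLimiter struct {
	mu    sync.Mutex
	limit int // current allowed concurrency
	inUse int
	min   int
	max   int
}

func newAdaptiveLimiter(min, max int) *adaptiveLimiter {
	return &adaptiveLimiter{limit: max, min: min, max: max}
}

// acquire reserves a slot or reports that the node is saturated.
func (l *adaptiveLimiter) acquire() error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inUse >= l.limit {
		return errors.New("too many concurrent requests")
	}
	l.inUse++
	return nil
}

// release frees the slot and adapts the cap: halve it (down to min)
// after a failed or timed-out transfer, creep it back up after a success.
func (l *adaptiveLimiter) release(failed bool) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.inUse--
	if failed {
		l.limit /= 2
		if l.limit < l.min {
			l.limit = l.min
		}
	} else if l.limit < l.max {
		l.limit++
	}
}

func main() {
	lim := newAdaptiveLimiter(4, 64)
	if err := lim.acquire(); err == nil {
		lim.release(true) // pretend the upload failed: the cap drops from 64 to 32
	}
	fmt.Println("current cap:", lim.limit)
}

The real node would obviously have to hook something like this into the piecestore endpoints; this is only meant to show the shape of the feedback loop.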

2024-06-22T22:27:36-04:00	ERROR	services	unexpected shutdown of a runner	{"name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-06-22T22:27:46-04:00	ERROR	pieces	failed to lazywalk space used by satellite	{"error": "lazyfilewalker: exit status 1", "errorVerbose": "lazyfilewalker: exit status 1\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-06-22T22:27:46-04:00	ERROR	lazyfilewalker.used-space-filewalker	failed to start subprocess	{"satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "error": "context canceled"}
2024-06-22T22:27:46-04:00	ERROR	pieces	failed to lazywalk space used by satellite	{"error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-06-22T22:28:06-04:00	ERROR	lazyfilewalker.used-space-filewalker	failed to start subprocess	{"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "error": "context canceled"}
2024-06-22T22:28:06-04:00	ERROR	pieces	failed to lazywalk space used by satellite	{"error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-06-22T22:28:06-04:00	ERROR	lazyfilewalker.used-space-filewalker	failed to start subprocess	{"satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "error": "context canceled"}
2024-06-22T22:28:06-04:00	ERROR	pieces	failed to lazywalk space used by satellite	{"error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-06-22T22:28:06-04:00	ERROR	piecestore:cache	error getting current used space: 	{"error": "filewalker: context canceled; filewalker: context canceled; filewalker: context canceled; filewalker: context canceled", "errorVerbose": "group:\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-06-22T22:28:06-04:00	ERROR	pieces:trash	emptying trash failed	{"error": "pieces error: lazyfilewalker: exit status 1", "errorVerbose": "pieces error: lazyfilewalker: exit status 1\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkCleanupTrash:187\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:422\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1.1:84\n\tstorj.io/common/sync2.(*Workplace).Start.func1:89"}
2024-06-22T22:28:06-04:00	ERROR	failure during run	{"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}