So, how to debug failing filewalkers with "context canceled"

So various people, including me, have had the filewalker fail, often with an error of “context canceled”. Alexey has often repeated that this is a symptom of limited I/O. So much so that I can hear it ringing in my ears.

And yes, in my case the disks are mounted as NFS shares. I can already hear Alexey’s scolding. :slight_smile:

But like… still… why? It’s on a node and disk that seem to be operating well. What’s the timeout in question, and is it adjustable? I tried searching around on GitHub, but I am not a programmer and got totally lost. It may be tied to some other timeout in the settings, but I could be way off.

But… failing filewalkers are like… a problem that affects a significant chunk of SNOs, and often they may not even be aware of it if they’re not looking at logs. Maybe there is a way to fix it (with code, or settings, or node setup).

Here is a typical example of the error I get:

2024-08-14T04:01:45-07:00       INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-08-14T04:01:45-07:00       INFO    lazyfilewalker.used-space-filewalker    starting subprocess     {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-08-14T04:01:45-07:00       ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess     {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "error": "context canceled"}
2024-08-14T04:01:45-07:00       ERROR   pieces  used-space-filewalker failed    {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Lazy File Walker": true, "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:715\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-08-14T04:01:45-07:00       ERROR   pieces  used-space-filewalker failed    {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Lazy File Walker": false, "error": "filewalker: context canceled", "errorVerbose": "filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:724\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}


Hmm, interesting. In my case the databases are still stored on SSD, so I didn’t have lock problems with that. Also, the NFS shares are configured ‘async’, not ‘sync’, which was really slow when it was on sync.

I do not recall any locking on the storage directory whatsoever in the source code.

Do they fail because of context canceled?

I think that’s a common reason I see when people post, yeah… lemme try searching the forum.

Okay, I searched for used-space-filewalker and looked at the first couple of pages of results to see if there were logs. I know, highly scientific statistical sampling.

4 people had context canceled errors:

Disk usage discrepancy? - Node Operators / troubleshooting - Storj Community Forum (official)

HDD failing, filewalker / bloom filter issue? What am I looking at? - Node Operators / troubleshooting - Storj Community Forum (official)

POWER outage = 3 nodes not coming back online - Node Operators / troubleshooting - Storj Community Forum (official)

Storj restart always - Node Operators / troubleshooting - Storj Community Forum (official)

Filewalker crashes node - Node Operators / troubleshooting - Storj Community Forum (official)

One guy had a context canceled error, but there were… a lot… of other problems simultaneously:

Node crashes after restart - Node Operators / troubleshooting - Storj Community Forum (official)

One guy had a context canceled error, but the cause was definitely not performance; it was the storage-dir-verification.

POWER outage = 3 nodes not coming back online - Node Operators / troubleshooting - Storj Community Forum (official)

I didn’t see any log fragments for used-space-filewalker that didn’t say context canceled, which may be part of the problem with the error message… it gets used for possibly lots of different underlying problems.

Context canceled is just a way to signal to goroutines (like the file walker) that there’s no point in them working anymore, for external reasons. So for us it’s just an observation suggesting that we should look for critical failures in other parts of the node that lead to node shutdown.
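A minimal Go sketch of that mechanism (illustrative only, not the actual storagenode code): the walker itself is healthy, but as soon as something else cancels the shared context, it stops and reports “context canceled”.

```go
// Illustrative only: how a cancelled parent context shows up as
// "context canceled" inside a long-running worker like the filewalker.
package main

import (
	"context"
	"fmt"
	"time"
)

// walkPieces stands in for a long-running job such as the used-space filewalker.
func walkPieces(ctx context.Context) error {
	for {
		select {
		case <-ctx.Done():
			// The walker did nothing wrong; it merely observed the cancellation.
			return ctx.Err() // "context canceled"
		case <-time.After(100 * time.Millisecond):
			// pretend to stat another piece file
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// Simulate some other part of the node failing (database lock, failed
	// readability check, shutdown request, ...) and tearing everything down.
	go func() {
		time.Sleep(300 * time.Millisecond)
		cancel()
	}()

	fmt.Println("walker:", walkPieces(ctx)) // prints: walker: context canceled
}
```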

Node killed by a database lock.

Node killed because of a database lock… hard to say where exactly, longer logs would be needed.

Node killed because of readability check.

Node killed because of disk space requirements not met.

I’d assume the actual error would be hidden in earlier log lines; the operator did not grep for errors.

Node killed because of disk space requirements not met.


Actually, if I run grep -i "context canceled" against all my logs, these errors appear to happen now and then without any consequences, as far as I can see. They especially appear on the slower nodes. And that is probably also the reason for a lot of the errors mentioned by @Toyoo, including database locks and readability and writeability timeouts.

As these bugs have existed for a long time now, I don’t think they will be publicly solved with a fix any time soon. The ‘bug bounty program’ seems to be for security purposes only, and from my observation it has not garnered a single acknowledged reward to date (however anonymous it is), on this forum or elsewhere. I can’t imagine that something like a partial security-related exploit, also affecting and caused by these bugs, will come to light any time soon, considering I expect it to affect a further 20% of overall capacity. The scenario may further manifest and frustrate more people, yet it may simply be a matter of competitive advantage for others. I shall bookmark this for further developments. Not unlike the X-Files, I think there is a solution out there (and it may well be pretty obvious in the end), but why would I say so, or why or how? Or anyone, for that matter?

I will not reply, because all my nodes are pretty in pink.

5 cents

Toyoo, thank you for explaining the meaning of the error more to my smooth brain.

Sorry, context canceled happens ALL THE TIME for uploads and downloads, when the client either just aborts the transfer or the transfer is finished faster by other nodes. I am more concerned about the filewalkers bonking out.

“Context canceled” errors specifically in the filewalkers are usually the result of a timeout: the filewalker was not able to finish an operation before the timeout occurred.
For network filesystems this is kind of expected, and it is one of the reasons why we do not recommend using them and do not support them. However, I have read success stories where the operator was able to configure NFS to work smoothly. But as far as I remember, it was only over localhost, used as a way to connect disks to a VM.
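Schematically (a simplified sketch with made-up names, not the real node code), the pattern looks like this: a disk operation races against a context deadline, and on slow storage the caller only ever sees the context error.

```go
// Simplified sketch (made-up names, not the real storagenode code): a disk
// operation is raced against a context deadline. On slow storage such as an
// overloaded NFS mount the stat loses the race and the caller only sees a
// context error.
package main

import (
	"context"
	"fmt"
	"os"
	"time"
)

// checkDir stands in for something like a readability check: stat the storage
// directory, but give up as soon as the context expires.
func checkDir(ctx context.Context, dir string) error {
	done := make(chan error, 1)
	go func() {
		_, err := os.Stat(dir) // on a struggling mount this can take a long time
		done <- err
	}()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		// "context deadline exceeded" here, or "context canceled" if a parent
		// (e.g. the node shutting down) gave up first.
		return ctx.Err()
	}
}

func main() {
	// Hypothetical one-second budget for the check; the path is made up too.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	if err := checkDir(ctx, "/mnt/storagenode"); err != nil {
		fmt.Println("check failed:", err)
	}
}
```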

@EasyRhino do you also have an “exit 1” error for the filewalker? If yes, then you prepared your NFS wrong :wink: according to those success stories.
And in this case I can only offer to try a badger cache (not sure that it would work on a network filesystem, because it’s kind of a database) with the lazy mode disabled, or to just disable the lazy mode. Maybe then the filewalker could survive and finish its work; a rough config sketch is below.
The best way would be to avoid the network filesystem altogether and run the node directly on the file server. If that’s not an option, then the next best solution would be to use iSCSI.
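For reference, the relevant entries would look roughly like this in config.yaml (the exact option names here are from memory, so please double-check them against the config.yaml generated by your node version):

```yaml
# Option names from memory -- verify against your own config.yaml.
# Run the filewalkers in the main process with normal I/O priority:
pieces.enable-lazy-filewalker: false
# Experimental badger file-stat cache (may itself misbehave on a network filesystem):
pieces.file-stat-cache: badger
```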

All these problems are because of timeouts when something tries to write to or read from the disk.
Of course we will keep improving the code to use fewer IOPS for some operations, but we are unlikely to be able to resolve the underlying issue, which is a suboptimal setup, which you call a bug for some reason.

Following from this, does that mean that if we set a lower assigned storage space, the filewalkers will eventually run successfully?

Nope, no “exit 1” anywhere in my logs.

Do we have any idea how long the timeout actually is in the filewalkers, and if it’s configurable?

The only node where I have any filewalker failures in the last few days was one where the disk activity was, for whatever reason, showing up as busier than on some of the other ones. But usually it was spurts of 100% busy that would die down after a few seconds.

I do not think that it’s configurable.
It seems you need to disable the lazy mode for this node to allow it to finish the calculations at least once.