So various people, including me, have had the filewalker fail, often with a “context canceled” error. Alexey has often repeated that this is a symptom of limited I/O, so much so that I can hear it ringing in my ears.
And yes, in my case the disks are mounted as NFS shares. I can already hear Alexey’s scolding.
But like… still… why? It’s on a node and disk that seem to be operating well. What’s the timeout in question, and is it adjustable? I tried searching around on GitHub, but I am not a programmer and got totally lost. It may be tied to some other timeout in the settings, but I could be way off.
But… failing filewalkers are a problem that affects a significant chunk of SNOs, and often they may not even be aware of it if they’re not looking at logs. Maybe there is a way to fix it (with code, settings, or node setup).
Hmm, interesting. In my case the databases are still stored on an SSD, so I didn’t have lock problems with those. Also, the NFS shares are configured ‘async’, not ‘sync’; it was really slow when it was on ‘sync’.
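In case it helps anyone comparing setups: the ‘async’/‘sync’ choice I mean is the NFS export option on the server (there are client-side mount options with the same names too). A rough example of what that looks like in /etc/exports; the path and client address here are just placeholders:

```
# /etc/exports on the NFS server (path and client IP are examples only)
# 'async' lets the server acknowledge writes before they reach the disk:
# much lower latency than 'sync', at the cost of data safety on a crash
/mnt/storj  192.168.1.10(rw,async,no_subtree_check)
```

After editing /etc/exports you’d re-export with exportfs -ra for the change to take effect.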
Okay, I searched for used-space-filewalker and looked at the first couple of pages of results to see if there were logs. I know, highly scientific statistical sampling.
I didn’t see any log fragments for used-space-filewalker that didn’t say context canceled, which may be part of the problem with the error message… it gets used for a lot of different underlying problems.
Context canceled is just a way to signal to goroutines (like the filewalker) that there’s no point in them continuing to work, for external reasons. So for us it’s just an observation suggesting we should look for critical failures in other parts of the node that lead to node shutdown.
Node killed by a database lock.
Node killed because of a database lock… hard to say where exactly, longer logs would be needed.
Node killed because of readability check.
Node killed because of disk space requirements not met.
I’d assume the actual error is hidden in earlier log lines; the operator did not grep for errors.
Node killed because of disk space requirements not met.
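To make the mechanics concrete, here is a minimal Go sketch (not the actual storagenode code) of what “context canceled” means: something else on the node decides to shut down, and every worker holding that context gets the same error, whatever it happened to be doing at the time.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// walk stands in for a long-running worker like the used-space filewalker.
func walk(ctx context.Context) error {
	for i := 0; ; i++ {
		select {
		case <-ctx.Done():
			// The worker itself did nothing wrong; it was just told to stop.
			return ctx.Err()
		case <-time.After(100 * time.Millisecond):
			fmt.Println("scanned piece", i)
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// Simulate some other part of the node failing and triggering a shutdown.
	go func() {
		time.Sleep(350 * time.Millisecond)
		cancel()
	}()

	err := walk(ctx)
	fmt.Println("filewalker stopped:", err)                           // context canceled
	fmt.Println("was it canceled?", errors.Is(err, context.Canceled)) // true
}
```

So the error only names the mechanism, never the cause; the interesting part is always whatever called cancel().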
Actually, if I run grep -i "context canceled" against all my logs, these errors appear now and then without any consequences as far as I can see. They especially appear on the slower nodes, and that is probably also the reason for a lot of the errors mentioned by @Toyoo, including database locks and readability and writability timeouts.
These bugs have existed for a long time now, so in my opinion they won’t be publicly solved with a fix any time soon. The bug bounty program seems to be only for security issues, and from what I’ve seen it has not produced a single acknowledged reward to date (however anonymous it may be), on this forum or elsewhere. So I can’t imagine a partially security-related exploit connected to these bugs coming to light any time soon, even though I’d expect it to affect a further 20% of overall capacity. The situation may keep frustrating more people, yet for others it may simply be a matter of competitive advantage. I’ll bookmark this thread for further developments. Not unlike the X-Files, I think there is a solution out there (and it may well turn out to be pretty obvious in the end), but why or how would I say so? Or anyone, for that matter?
I will not reply, because all my nodes are pretty in pink.
Toyoo, thank you for explaining the meaning of the error to my smooth brain in a bit more detail.
Sorry, context canceled happens ALL THE TIME for uploads and downloads, either when the client just aborts the transfer or when the transfer is finished faster by other nodes. I am more concerned about the filewalkers bonking out.
“Context canceled” errors specifically in the filewalkers are usually the result of a timeout: the filewalker was not able to finish an operation before the timeout occurred.
For network filesystems that is kind of expected, and this is one of the reasons why we do not recommend using them and do not support them. However, I have read success stories where the operator was able to configure NFS to work smoothly. But as far as I remember that was only for localhost, used as a way to connect disks to a VM.
@EasyRhino do you also have an “exit 1” error for the filewalker? If yes, then your NFS is not prepared in line with those success stories.
In that case I can only suggest trying a badger cache (not sure that it would work on a network filesystem, because it’s kind of a database) with lazy mode disabled, or just disabling lazy mode. Maybe the filewalker could then survive and finish its work.
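For reference, the relevant settings should look roughly like this in the node’s config.yaml (option names from memory, so please verify them against your own config or the storagenode binary’s --help output before relying on them):

```
# run the filewalker at normal I/O priority instead of the lazy, low-priority mode
pieces.enable-lazy-filewalker: false

# optional: cache file metadata in a badger database to reduce filewalker IOPS
# on subsequent runs (as noted above, it may not behave well on a network filesystem)
pieces.file-stat-cache: badger
```

The same options can usually be passed as command-line flags (e.g. --pieces.enable-lazy-filewalker=false) in Docker setups.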
The best way would be to avoid the network filesystem entirely and run the node on the file server itself. If that is not possible, then the next best solution would be to use iSCSI.
All of these problems come from timeouts when something is trying to write to or read from the disk.
Of course we will improve the code anyway to use fewer IOPS for some operations, but we are unlikely to be able to resolve the underlying issue, which is a suboptimal setup that you for some reason call a bug.
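To tie the two explanations together: the filewalker itself usually isn’t the thing that times out. A disk check is, and that check failing (with “context deadline exceeded”) makes the node shut down, which in turn cancels the filewalker’s context and produces the “context canceled” line in the log. A rough Go sketch of that chain, again not the real storagenode code:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// checkDisk stands in for the node's readability/writability check: it must
// complete within its own timeout or it reports a failure.
func checkDisk(ctx context.Context, simulatedLatency time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
	defer cancel()
	select {
	case <-time.After(simulatedLatency): // pretend this is a read or write on the disk
		return nil
	case <-ctx.Done():
		return ctx.Err() // context deadline exceeded: the disk was too slow
	}
}

func main() {
	nodeCtx, shutdown := context.WithCancel(context.Background())

	// A filewalker-like worker idling on the node's context.
	walkerDone := make(chan error, 1)
	go func() {
		<-nodeCtx.Done()
		walkerDone <- nodeCtx.Err()
	}()

	// An overloaded disk makes the check exceed its timeout...
	if err := checkDisk(nodeCtx, 500*time.Millisecond); err != nil {
		fmt.Println("disk check failed:", err) // context deadline exceeded
		shutdown()                             // ...so the node shuts everything down...
	}

	// ...and the filewalker logs "context canceled" even though the disk it was
	// walking may have looked perfectly healthy from the outside.
	fmt.Println("filewalker stopped:", <-walkerDone)
}
```

As far as I know, the timeouts that matter here in the real node are the readability/writability check timeouts, and they are configurable; I believe the options are storage2.monitor.verify-dir-readable-timeout and storage2.monitor.verify-dir-writable-timeout, but check your config for the exact names and defaults.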
Do we have any idea how long the timeout actually is in the filewalkers, and if it’s configurable?
The only node where I had any filewalker failures in the last few days was one where the disk activity was, for whatever reason, showing up as busier than on some of the other nodes. But usually it was just spurts of 100% busy that would die down after a few seconds.