Ignore files that cannot be lstat-ed during initial file crawling + Add a notification system for serious errors?

Hi! :slight_smile:

I’m facing the very same issue I had in the past, this time with node version v1.37.1.
In short:
Whenever an issue is encountered during the initial file crawl (when the node starts), the whole node crashes, and it starts all over again after Docker has restarted it. This can go on, with disk I/O at around 100%, until the SNO notices and investigates, which could take months depending on how one monitors their nodes. Meanwhile the disk is being worn down, and the node is struggling to serve its purpose.

Error example:

2021-03-30T17:15:25.255Z        FATAL   Unrecoverable error     {"error": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message", "errorVerbose": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:787\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:280\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:496\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:661\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:81\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:80\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

See my post from my past issue for more details:


Although I know the way forward (scan the disk for errors; I’m also assuming my disk is starting to fail), I’d like to suggest to the Storj team to maybe handle this issue in a different way, because right now whenever the file crawler encounters an issue with an lstat command, the whole node seems to crash and it starts all over again, in an infinite loop.

Maybe the node should log a warning/error, ignore the file, and carry on, to avoid stopping the whole node?
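To illustrate the suggestion, here is a minimal, hypothetical Go sketch (not Storj’s actual walker code) of a directory walk that logs and skips entries whose lstat fails, instead of returning a fatal error. The function name `walkTolerant` and the use of `filepath.WalkDir` are my own illustration:

```go
package main

import (
	"io/fs"
	"log"
	"path/filepath"
)

// walkTolerant crawls root, logging and skipping any entry that fails
// to stat (e.g. "lstat ...: bad message" on a corrupted file) instead
// of aborting the whole walk with a fatal error.
func walkTolerant(root string) error {
	return filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			// Warn and keep crawling rather than propagating the error,
			// which would stop the walk (and, in the node, crash it).
			log.Printf("WARN: skipping %s: %v", path, err)
			return nil
		}
		// ... accumulate piece sizes here, as the space-usage cache does ...
		return nil
	})
}

func main() {
	if err := walkTolerant("."); err != nil {
		log.Fatal(err)
	}
}
```

The key detail is that `fs.WalkDirFunc` receives the stat error as its third argument, so the callback can decide per-entry whether to continue (`return nil`) or abort (return the error).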


EDIT:

Also as said by @BrightSilence:

I agree. He also made a good point: even though the node shouldn’t crash, the error shouldn’t go unnoticed either, so I think we do need a way to configure nodes so notifications can be sent to SNOs whenever serious errors are triggered.

I agree that this should be handled better. But it is a serious enough issue that it shouldn’t just be hidden away. The SNO needs to take action to check the file system. Not sure how we can prevent hiding the underlying problem from the node operator if the node just continues on.

Well, as long as Storj does not provide a way to receive notification e-mails whenever a certain alert level is triggered in a node’s logs, I guess the only way to know there is something to do is to monitor our nodes’ logs ourselves with advanced tools. Personally, I do not do that :confused:

But right now, most SNOs wouldn’t notice anything in this particular case either, because simple monitoring tools like UptimeRobot wouldn’t catch it: the node gets restarted by Docker and happily starts its file crawler again for hours and hours (depending on your disk performance), until it crashes again on the same file. And on, and on, and on…

But I agree with you: even though this issue should be handled better, we should also consider adding a system so SNOs are notified of problems: a configurable notification system of some sort, so SNOs receive alerts by e-mail whenever serious issues arise, for instance.
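In the meantime, something like the following shell sketch could serve as a stopgap watchdog. It is only an illustration: the container name `storagenode`, the address `sno@example.com`, and the availability of a working `mail` command are all assumptions, and `--line-buffered` requires GNU grep:

```shell
#!/bin/sh
# Hypothetical watchdog: follow the node's Docker logs and send an
# e-mail for every FATAL line. Assumes a container named "storagenode"
# and a configured `mail` command on the host.
docker logs --follow storagenode 2>&1 \
  | grep --line-buffered 'FATAL' \
  | while read -r line; do
      echo "$line" | mail -s "Storj node FATAL error" sno@example.com
    done
```

A cron job grepping `docker logs --since` for FATAL lines would work just as well and avoids keeping a long-running process around.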


With that nuance added, I voted for this. Fatal error seems over the top if there is an issue for a single file and goes against the intention of allowing for some minor corruption or bitrot.


Pretty sure this was supposed to be fixed a long time ago.

But apparently it never happened… I think the plan was simply to offset the filewalker, so the file walking happened at a somewhat random time rather than on initial boot.

That would still crash the node whenever the corrupted file gets lstat-ed, though.


6 posts were merged into an existing topic: I got nodes that seem to shut themselves down when encountering high iowait’s