Ignore files that cannot be lstat-ed during initial file crawling + Add a notification system for serious errors?

Hi! :slight_smile:

I’m facing the very same issue again that I had in the past, with node version v1.37.1.
In short:
Whenever an issue is encountered during the initial file crawling (when the node starts), the whole node crashes and starts all over again after Docker has restarted it. This can go on with disk I/O at around 100% until the SNO notices it and investigates, which could take months depending on how one monitors their nodes… while the disk is being killed and the node is struggling to serve its purpose in the meantime.

Error example:

2021-03-30T17:15:25.255Z        FATAL   Unrecoverable error     {"error": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message", "errorVerbose": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:787\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:280\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:496\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:661\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:81\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:80\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

See my post from my past issue for more details:


Although I know the way forward (scan the disk for errors; also, I’m assuming my disk is starting to fail), I’d like to suggest that the Storj team maybe handle this issue differently, because right now, whenever the file crawler encounters an issue with an lstat call, the whole node seems to crash and it starts all over again, in an infinite loop.

Maybe the node should log a warning/error, ignore the file, and carry on, to avoid stopping the whole node?
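
Just to illustrate the idea (this is only a rough sketch in Go, not the actual storj.io/storj filewalker code, and the root path is a placeholder): the space-usage walk could treat a failed stat as a per-file warning and keep going, instead of returning a fatal error.

// Rough sketch only: log and skip entries that fail to stat, instead of
// aborting the whole crawl with a fatal error.
package main

import (
	"io/fs"
	"log"
	"os"
	"path/filepath"
)

func main() {
	root := "config/storage/blobs" // placeholder path for this example
	if len(os.Args) > 1 {
		root = os.Args[1]
	}

	var total int64
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			// e.g. "lstat ...: bad message" on a corrupted entry:
			// warn and keep walking instead of failing the whole run.
			log.Printf("WARN skipping unreadable entry %q: %v", path, err)
			return nil
		}
		info, statErr := d.Info()
		if statErr != nil {
			log.Printf("WARN could not stat %q: %v", path, statErr)
			return nil
		}
		if info.Mode().IsRegular() {
			total += info.Size()
		}
		return nil
	})
	if err != nil {
		log.Fatalf("walk failed: %v", err)
	}
	log.Printf("total bytes seen: %d", total)
}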


EDIT:

Also as said by @BrightSilence:

I agree. He also made a good point when saying that even though the node shouldn’t crash, the error shouldn’t go unnoticed, so I think we do need a way to configure nodes so notifications can be sent to SNOs whenever serious errors get triggered by nodes.

I agree that this should be handled better. But it is a serious enough issue that it shouldn’t just be hidden away. The SNO needs to take action to check the file system. I’m not sure how we can avoid hiding the underlying problem from the node operator if the node just carries on.

Well, as long as Storj does not provide a way to receive notification e-mails whenever a certain level of alert is triggered within nodes’ logs, I guess the only way to know there is something to do is to monitor nodes’ logs ourselves with advanced tools. Personally, I do not do that :confused:

But right now, most SNOs wouldn’t notice anything either in this particular case, because simple monitoring tools like UptimeRobot wouldn’t catch it: the node gets restarted by Docker and happily starts its file crawler again for hours and hours (depending on your disk performance) until it crashes again on the same file. And on, and on, and on…

But I agree with you: even though this issue should be handled better, we should also consider adding a system so problems get reported; a configurable notification system of some sort, for instance, so SNOs receive alerts by e-mail whenever serious issues arise.
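
In the meantime, something along these lines could work as an external watcher (again, just a sketch of the idea, not a built-in storagenode feature; the ALERT_WEBHOOK endpoint is a made-up placeholder). It reads log lines from stdin, e.g. piped from docker logs -f storagenode, and forwards FATAL lines:

// Sketch of an external log watcher an SNO could run themselves; not an
// official feature. Pipe node logs into stdin and it forwards FATAL lines
// to a webhook (e-mail gateway, chat hook, etc.).
package main

import (
	"bufio"
	"log"
	"net/http"
	"os"
	"strings"
)

func main() {
	webhook := os.Getenv("ALERT_WEBHOOK") // hypothetical alert endpoint

	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.Contains(line, "FATAL") {
			continue
		}
		log.Printf("serious error spotted: %s", line)
		if webhook == "" {
			continue
		}
		resp, err := http.Post(webhook, "text/plain", strings.NewReader(line))
		if err != nil {
			log.Printf("failed to send alert: %v", err)
			continue
		}
		resp.Body.Close()
	}
	if err := scanner.Err(); err != nil {
		log.Printf("error reading log stream: %v", err)
	}
}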


With that nuance added, I voted for this. A fatal error seems over the top if there is an issue with a single file, and it goes against the intention of allowing for some minor corruption or bitrot.


Pretty sure this was supposed to be fixed a long time ago.

But apparently it never happened… I think the plan was simply to add an offset so the filewalking was done a bit more randomly rather than on initial boot.

That would still crash the node whenever the corrupted file gets lstat-ed, though.



Update: I’m still facing this in version 1.44.1.

I know that the problem is probably that my disk is dying… but still, I think the node software should log the error and carry on, instead of crashing and restarting the filewalker process forever, which will end up killing dying disks even faster! oO

I had a problem with my nodes crashing during boot or during the finishing part of the filewalker… (I think).

It happened when using the Docker storage driver overlay2. It turned out it was because I had changed the Docker storage driver a while back, and though it could seem stable… it would crash when encountering high iowait. It took a month for me to actually figure out why…

Because initially I hadn’t rebooted my server, and my L2ARC and such on ZFS most likely kept it from crashing for a few weeks, until I was tinkering with other stuff and then all hell broke loose.

I’m not sure the problem went away completely, but I haven’t seen much of the issue since I switched back to the default Docker storage driver.

I forced the Docker storage driver back to vfs by using the config from here.
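
(For reference, on a typical setup that kind of change goes in /etc/docker/daemon.json; I can’t show the exact config from that link, but a minimal version would look roughly like this, followed by a restart of the Docker daemon:)

{
  "storage-driver": "vfs"
}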

Hm really? :thinking: Interesting.

My Docker is using overlay2 apparently:

$ sudo docker info
Client:
 Debug Mode: false

Server:
[...]
 Images: 2
 Server Version: 19.03.8
 Storage Driver: overlay2
  Backing Filesystem: <unknown>
  Supports d_type: true
[...]

But it seems to be the recommended configuration, according to Docker storage drivers | Docker Documentation:

Driver: overlay2
Description: overlay2 is the preferred storage driver for all currently supported Linux distributions, and requires no extra configuration.

And I read that vfs is supposed to be for test purposes and has poor performance…

So I’m not sure I want to tinker with that ^^
Besides, when I run fsck -y /dev/sd** it does change stuff on the filesystem and finds orphaned inodes and such… So I’m pretty sure something is actually wrong with my disk ^^’

My setup is a bit weird: I run my node in a container, with locations mounted from my ZFS pool using Proxmox. The reason you mention, overlay2 being the recommended driver, was why I wanted to switch to it; I had been wanting to do that for ages, but only recently figured out how.

And at least for me it didn’t work, so I’m stuck with vfs.
I have no idea why it doesn’t work, but everything runs fine on vfs, so I just stopped trying to fix something that isn’t really broken…

It was hell working out why my system didn’t work. After a month I was down near 70% uptime because I just couldn’t keep my nodes online, and I had blissfully forgotten that I had even changed the Docker storage driver. Initially (maybe because of low network traffic in summer) it worked fine… until it didn’t.

But my setup is… special… to say the least, lol.
So I wouldn’t claim vfs is the answer for anyone else, but it sure was for me.
Without a doubt.
