Node disqualified after FATAL Unrecoverable error {"error": "readdirent: not a directory

Hello,

One of my nodes got disqualified early this morning :frowning: I had a bad feeling when going to bed, and I hope it is not related to the clock change here in Europe. The reason I am making this post is that after some research in the SNO troubleshooting guide, I haven't found any matching pattern.

Now the facts: during the night I spent some time figuring out why the audit rate dropped. The disk had been ejected for some reason and the node kept going up and down. I remounted the volume, but then the node stopped because of time sync issues with the satellite. I finally got the node stable after changing the NTP server. By the way, the storage volume was checked and no error was found in the file system.

Now it is still running but restarting from time to time, and I can't understand what this means:
FATAL Unrecoverable error {"error": "readdirent: not a directory;

The node is running on macOS with an HFS volume.
Node ID is 12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB

I would appreciate it if you could help me understand what happened, just to prevent this from happening again.
Regards,

I'm sad to hear that.
I can however assure you that the time change in Europe is not causing this; my node is fine, and most nodes are in Europe. Besides, time on PC systems is kept in UTC, so local time changes don't matter to them.
Time sync issues, as far as I recall, would only stop your node or prevent it from starting, but not disqualify it.
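If you want to double-check the clock yourself, a minimal sketch on macOS (assuming the built-in sntp client and Apple's public time server; swap in your own NTP source):

sntp time.apple.com            # print the current offset against the reference server
sudo sntp -sS time.apple.com   # step/slew the system clock to that server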

It seems like you have some problem with disk reliability and the macOS HFS volume handling. I can't help with that though; I have no knowledge of that environment.

If you are using Linux, use fsck to check the disk for errors. If you are using Windows, use chkdsk for the same.
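For reference, a quick sketch of those checks (the device name /dev/sdb1 and drive letter D: are only examples; use your own data disk, and make sure it is not mounted while checking):

sudo umount /dev/sdb1    # Linux: unmount the data volume first
sudo fsck -f /dev/sdb1   # Linux: check and repair the file system
chkdsk D: /f             # Windows: check and fix errors on drive D: (run from an elevated prompt)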

How did you mount your disk?

Hello @nerdatwork
I shut down my server, stopped the USB device, and started it again.
I checked that the file system was OK using the macOS Disk Utility repair functionality, which does almost the same as an fsck.
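For the record, the same check can also be run from the Terminal (a sketch, assuming the data volume is the one named "storm"):

diskutil list                  # find the identifier of the data volume
diskutil verifyVolume storm    # read-only check of the volume
diskutil repairVolume storm    # attempt repairs, equivalent to First Aid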

Hello @Bob,

When you receive this error, the node will just crash to prevent a quick DQ.
But if your node does eventually start and is missing files during audits, it will fail those audits and can be DQed very quickly (within a few hours).

I can only assume that you run a docker container without logs redirected, so we can't find the reason in your logs anymore.
If they still existed, you could check the reason as described in
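For the future, here is a hedged sketch of how to redirect the logs to a file on the data disk so they survive a container recreation (assuming the usual docker setup where the storage directory is mounted to /app/config inside the container and the container is named storagenode):

# add this line to config.yaml in the storage directory:
# log.output: "/app/config/node.log"
docker stop -t 300 storagenode
docker rm storagenode
# re-run your usual docker run command; the log will now be written to node.log on the data disk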

Hello @Alexey
Thanks for your prompt answer.
I will leave the node down until I have further instruction.
Here are my last debug actions.
Logs are redirected, and I use the latest versions:
docker desktop 2.4.0 with docker engine 19.03.13 and storagenode:latest
Thanks for your help
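In case it helps, errors like the one below can be searched for in the redirected log (assuming the log file is node.log in the storage directory; adjust the path to your setup):

grep FATAL /path/to/storage/node.log | tail -n 5              # most recent fatal errors
grep GET_AUDIT /path/to/storage/node.log | grep failed | tail # recent failed audits, if any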

DISK CHECK with Disk Utility, macOS Catalina up to date.
Running First Aid on "storm" (disk4s1)
Repairing file system.
Volume is already unmounted.
Performing fsck_apfs -y -x /dev/rdisk4s1
Checking the container superblock.
Checking the space manager.
Checking the space manager free queue trees.
Checking the object map.
Checking volume.
Checking the APFS volume superblock.
The volume storm was formatted by diskmanagementd (1412.61.1) and last modified by apfs_kext (1412.141.1).
Checking the object map.
Checking the snapshot metadata tree.
Checking the snapshot metadata.
Checking the extent ref tree.
Checking the extent ref tree.
Checking the fsroot tree.

ERROR DETAIL:

2020-10-25T09:10:33.674Z FATAL Unrecoverable error {"error": "lstat config/storage/blobs/6r2fgwqz3manwt4aogq343bfkh2n5vvg4ohqqgggrrunaaaaaaaa/2j/24de3pic34od3yvc4se7fxs25uep3bq6gx2cfav7kv33ir4cqa.sj1: not a directory; readdirent: input/output error; lstat config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/2u/227qw2y6vmqk2ucnc2mcs56mrq2g6ltuzwant724utwdapm4wa.sj1: not a directory; readdirent: input/output error; lstat config/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/2g/24sxwb7sjwbn2qzhflgaucdmjsd3mgw277sh5pyewv3ntaqvsq.sj1: not a directory", "errorVerbose": "group:\n— lstat config/storage/blobs/6r2fgwqz3manwt4aogq343bfkh2n5vvg4ohqqgggrrunaaaaaaaa/2j/24de3pic34od3yvc4se7fxs25uep3bq6gx2cfav7kv33ir4cqa.sj1: not a directory\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:787\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:280\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:489\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:654\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func1:57\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57\n— readdirent: input/output error\n— lstat config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/2u/227qw2y6vmqk2ucnc2mcs56mrq2g6ltuzwant724utwdapm4wa.sj1: not a directory\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:787\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:280\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:489\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:654\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func1:57\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57\n— readdirent: input/output error\n— lstat config/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/2g/24sxwb7sjwbn2qzhflgaucdmjsd3mgw277sh5pyewv3ntaqvsq.sj1: not a directory\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:787\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:280\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:489\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:654\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func1:57\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
storj.io/private/process.cleanup.func1
/go/pkg/mod/storj.io/private@v0.0.0-20200925142346-4c879709882f/process/exec_conf.go:399
github.com/spf13/cobra.(*Command).execute
/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:842
github.com/spf13/cobra.(*Command).ExecuteC
/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950
github.com/spf13/cobra.(*Command).Execute
/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887
storj.io/private/process.ExecWithCustomConfig
/go/pkg/mod/storj.io/private@v0.0.0-20200925142346-4c879709882f/process/exec_conf.go:88
storj.io/private/process.ExecCustomDebug
/go/pkg/mod/storj.io/private@v0.0.0-20200925142346-4c879709882f/process/exec_conf.go:70
main.main
/go/src/storj.io/storj/cmd/storagenode/main.go:335
runtime.main
/usr/local/go/src/runtime/proc.go:204

Unfortunately there are no other instructions. If you checked your disk for errors and the OS corrected them, then there is nothing we can do.
If your node is disqualified on all satellites, there is only one way: start from scratch with a new identity, a new authorization token, and clean storage.
If your node is disqualified on only some satellites, you can decide to keep it running and receive payouts for serving the customers of the remaining satellites.
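To see which satellites are affected, you could query the local dashboard API (a sketch, assuming the dashboard port 14002 is reachable on the host and jq is installed; field names may differ between versions):

curl -s http://localhost:14002/api/sno | jq '.satellites[] | {id, disqualified, suspended}'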