Node crashing upon realizing data storage isn't perfect

Got a node that crashes when it can't read a file. There's some data loss/corruption on the drive (about 0.5% of its total size). Why does it crash?

2024-07-07T17:11:39+02:00	FATAL	process/exec_conf.go:429	Unrecoverable error	{"Process": "storagenode", "error": "filewalker: unrecoverable error accessing data on the storage file system (path=/Storj/data/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/ac/utdoz6lux7uhsvphpklt45ylywopllhjjoymlf2bnx7cyzcj4q.sj1; error=lstat /Storj/data/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/ac/utdoz6lux7uhsvphpklt45ylywopllhjjoymlf2bnx7cyzcj4q.sj1: errno 97). This is most likely due to disk bad sectors or a corrupted file system. Check your disk for bad sectors and integrity; filewalker: unrecoverable error accessing data on the storage file system (path=/Storj/data/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/ak/p45faihdd2aizdh7yzhlqxjoo6vor2tpzoiqdwvfekluwuwx4q.sj1; error=lstat /Storj/data/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/ak/p45faihdd2aizdh7yzhlqxjoo6vor2tpzoiqdwvfekluwuwx4q.sj1: errno 97). This is most likely due to disk bad sectors or a corrupted file system. Check your disk for bad sectors and integrity; filewalker: unrecoverable error accessing data on the storage file system (path=/Storj/data/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/ab/wxrvb3zelabaaefc7a6wbz5zfglp5uoj2j3jsehtcpefetkcdq.sj1; error=lstat /Storj/data/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/ab/wxrvb3zelabaaefc7a6wbz5zfglp5uoj2j3jsehtcpefetkcdq.sj1: errno 97). This is most likely due to disk bad sectors or a corrupted file system. Check your disk for bad sectors and integrity; filewalker: unrecoverable error accessing data on the storage file system (path=/Storj/data/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/aa/ak27ty7kny3fi4cyyxku6ahlijoma6okrjtvcxc6evausjh44a.sj1; error=lstat /Storj/data/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/aa/ak27ty7kny3fi4cyyxku6ahlijoma6okrjtvcxc6evausjh44a.sj1: errno 97). This is most likely due to disk bad sectors or a corrupted file system. Check your disk for bad sectors and integrity", "errorVerbose": "group:\n--- filewalker: unrecoverable error accessing data on the storage file system (path=/Storj/data/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/ac/utdoz6lux7uhsvphpklt45ylywopllhjjoymlf2bnx7cyzcj4q.sj1; error=lstat /Storj/data/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/ac/utdoz6lux7uhsvphpklt45ylywopllhjjoymlf2bnx7cyzcj4q.sj1: errno 97). This is most likely due to disk bad sectors or a corrupted file system. Check your disk for bad sectors and integrity\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:720\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: unrecoverable error accessing data on the storage file system (path=/Storj/data/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/ak/p45faihdd2aizdh7yzhlqxjoo6vor2tpzoiqdwvfekluwuwx4q.sj1; error=lstat /Storj/data/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/ak/p45faihdd2aizdh7yzhlqxjoo6vor2tpzoiqdwvfekluwuwx4q.sj1: errno 97). 
This is most likely due to disk bad sectors or a corrupted file system. Check your disk for bad sectors and integrity\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:720\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: unrecoverable error accessing data on the storage file system (path=/Storj/data/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/ab/wxrvb3zelabaaefc7a6wbz5zfglp5uoj2j3jsehtcpefetkcdq.sj1; error=lstat /Storj/data/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/ab/wxrvb3zelabaaefc7a6wbz5zfglp5uoj2j3jsehtcpefetkcdq.sj1: errno 97). This is most likely due to disk bad sectors or a corrupted file system. Check your disk for bad sectors and integrity\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:720\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: unrecoverable error accessing data on the storage file system (path=/Storj/data/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/aa/ak27ty7kny3fi4cyyxku6ahlijoma6okrjtvcxc6evausjh44a.sj1; error=lstat /Storj/data/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/aa/ak27ty7kny3fi4cyyxku6ahlijoma6okrjtvcxc6evausjh44a.sj1: errno 97). This is most likely due to disk bad sectors or a corrupted file system. Check your disk for bad sectors and integrity\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:720\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
storj.io/common/process.cleanup.func1
	/go/pkg/mod/storj.io/common@v0.0.0-20240604134154-517cce55bb8c/process/exec_conf.go:429
github.com/spf13/cobra.(*Command).execute
	/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
	/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115
github.com/spf13/cobra.(*Command).Execute
	/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039
storj.io/common/process.ExecWithCustomOptions
	/go/pkg/mod/storj.io/common@v0.0.0-20240604134154-517cce55bb8c/process/exec_conf.go:112
main.main
	/go/src/storj.io/storj/cmd/storagenode/main.go:34
runtime.main
	/usr/local/go/src/runtime/proc.go:271

The logs mention potential filesystem corruption. I'd run 'fsck -y' on that disk and try again.
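
On a non-ZFS filesystem (ext4 and friends) that would look roughly like the sketch below; /dev/sdX1 is a placeholder for whatever device the data directory lives on, and the node should be stopped and the volume unmounted first.

    umount /Storj          # stop the node first, then unmount the data volume
    fsck -y /dev/sdX1      # answer "yes" to all repair prompts (placeholder device name)
    mount /Storj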

fsck won't work on ZFS, from the looks of it…

Ah, you'll want the 'scrub' command for ZFS.
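
Something like this, assuming the data sits on a pool named, say, "storj" (substitute your own pool name):

    zpool scrub storj        # start (or restart) a scrub of the whole pool
    zpool status -v storj    # shows scrub progress and lists files with unrecoverable errors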

Yeah, there IS corruption - a small amount of data was lost. The problem is that the node refuses to, well… node.

Try renaming that file. If there's a bad block under it, it'll stay f#d, but storj will stop throwing the fatal error at that point on every (time-delayed) restart of the crash loop and will just skip it. Note that's only the 'ac' folder of that satellite's prefix - there's likely much more fun ahead if a scrub doesn't help.
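
As a sketch of that workaround, taking the first path from the FATAL message above (the quarantine folder is a made-up location outside the blobs tree, and this only works if the filesystem still lets you rename the directory entry):

    mkdir -p /Storj/quarantine
    mv /Storj/data/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/ac/utdoz6lux7uhsvphpklt45ylywopllhjjoymlf2bnx7cyzcj4q.sj1 \
       /Storj/quarantine/
    # repeat for the other paths listed in the FATAL message, then restart the node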

The node should do something with it, not me. There's almost a percent of the data missing; nobody in their right mind is going to remove these files by hand…

I seriously hope the node isn’t designed to throw away 100% of the data when it encounters expected corruption.

There's no such thing as 'expected corruption'. The file can have invalid contents (and fail an audit): that's OK. The file can be missing entirely: also OK. But the file shouldn't throw OS-level errors when you try to open it (which is what is happening now) - filesystem errors are something a scrub should have dealt with.
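
A rough way to see which case a given piece falls into, using the first path from the FATAL message (purely illustrative):

    stat /Storj/data/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/ac/utdoz6lux7uhsvphpklt45ylywopllhjjoymlf2bnx7cyzcj4q.sj1
    # "No such file or directory" -> the "missing file" case described above, which the node tolerates
    # any other error from stat   -> the same OS-level failure the filewalker reports as unrecoverable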

Has it not finished running yet?

We expect the node to act selfishly to some degree. If there is a type of problem the node cannot interpret on its own, it's preferable for it to go down, in case the problem would result in audit failures. Bad audits can disqualify a node in an hour or two if you're unlucky. If a node goes down, it can wait for days until you have the time and means to diagnose it - plus it's easier to notice a node being down than a node showing errors in its logs from time to time.

(It would be better if known problems weren't logged as errors and SNOs instead had a reliable way of getting notified about them, but that's a topic for other threads that already exist on the forum.)

This whole system is built on the presumption that storage drives are fallible and error-prone. Corruption is expected. I know there's a protection against the storage location becoming unwritable, but I didn't know it would also refuse to work when a single file is unreadable. The latest scrub is still running; it's useless though - it won't accomplish anything except tallying up all the errors.

Yes, the way ZFS deals with it is by doing exactly this - making the affected files unreadable. That's how it works; ZFS is very bad at dealing with corrupted data. I had all kinds of errors before it would start working again.


That's understandable. I thought it would do this only when the directory becomes unwritable - I remember when that change was made. However, saving my node from disqualification in this fashion will get me disqualified anyway. Is there a switch to turn off crashing the node when files are unreadable?

We are trying to protect the node from disqualification. If you are aware of the issue, you can fix it or apply a workaround (like renaming). The node software is not designed to work around hardware issues, that's true. And I believe we shouldn't reinvent what the OS or specialized recovery software already does at the low hardware level.

It may result in disqualification if the node stays offline for more than 30 days, yes. However, that is much longer than the few hours it takes to get disqualified for answering audits but not providing the requested pieces.
There is no switch to disable this kind of safety check. Many SNOs asked us to implement this safety check, so we did.
Now it's your turn to fix the issue as best you can.

How?

For example, if you do not monitor the node and do not notice that it is offline until it receives a DQ message.
