Node crashing upon realizing data storage isn't perfect

Got a node that crashes when it can't read a file. I have some data loss/corruption (about 0.5% of the total drive size). Why does it crash?

2024-07-07T17:11:39+02:00	FATAL	process/exec_conf.go:429	Unrecoverable error	{"Process": "storagenode", "error": "filewalker: unrecoverable error accessing data on the storage file system (path=/Storj/data/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/ac/utdoz6lux7uhsvphpklt45ylywopllhjjoymlf2bnx7cyzcj4q.sj1; error=lstat /Storj/data/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/ac/utdoz6lux7uhsvphpklt45ylywopllhjjoymlf2bnx7cyzcj4q.sj1: errno 97). This is most likely due to disk bad sectors or a corrupted file system. Check your disk for bad sectors and integrity; filewalker: unrecoverable error accessing data on the storage file system (path=/Storj/data/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/ak/p45faihdd2aizdh7yzhlqxjoo6vor2tpzoiqdwvfekluwuwx4q.sj1; error=lstat /Storj/data/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/ak/p45faihdd2aizdh7yzhlqxjoo6vor2tpzoiqdwvfekluwuwx4q.sj1: errno 97). This is most likely due to disk bad sectors or a corrupted file system. Check your disk for bad sectors and integrity; filewalker: unrecoverable error accessing data on the storage file system (path=/Storj/data/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/ab/wxrvb3zelabaaefc7a6wbz5zfglp5uoj2j3jsehtcpefetkcdq.sj1; error=lstat /Storj/data/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/ab/wxrvb3zelabaaefc7a6wbz5zfglp5uoj2j3jsehtcpefetkcdq.sj1: errno 97). This is most likely due to disk bad sectors or a corrupted file system. Check your disk for bad sectors and integrity; filewalker: unrecoverable error accessing data on the storage file system (path=/Storj/data/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/aa/ak27ty7kny3fi4cyyxku6ahlijoma6okrjtvcxc6evausjh44a.sj1; error=lstat /Storj/data/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/aa/ak27ty7kny3fi4cyyxku6ahlijoma6okrjtvcxc6evausjh44a.sj1: errno 97). This is most likely due to disk bad sectors or a corrupted file system. Check your disk for bad sectors and integrity", "errorVerbose": "group:\n--- filewalker: unrecoverable error accessing data on the storage file system (path=/Storj/data/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/ac/utdoz6lux7uhsvphpklt45ylywopllhjjoymlf2bnx7cyzcj4q.sj1; error=lstat /Storj/data/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/ac/utdoz6lux7uhsvphpklt45ylywopllhjjoymlf2bnx7cyzcj4q.sj1: errno 97). This is most likely due to disk bad sectors or a corrupted file system. Check your disk for bad sectors and integrity\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:720\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: unrecoverable error accessing data on the storage file system (path=/Storj/data/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/ak/p45faihdd2aizdh7yzhlqxjoo6vor2tpzoiqdwvfekluwuwx4q.sj1; error=lstat /Storj/data/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/ak/p45faihdd2aizdh7yzhlqxjoo6vor2tpzoiqdwvfekluwuwx4q.sj1: errno 97). 
This is most likely due to disk bad sectors or a corrupted file system. Check your disk for bad sectors and integrity\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:720\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: unrecoverable error accessing data on the storage file system (path=/Storj/data/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/ab/wxrvb3zelabaaefc7a6wbz5zfglp5uoj2j3jsehtcpefetkcdq.sj1; error=lstat /Storj/data/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/ab/wxrvb3zelabaaefc7a6wbz5zfglp5uoj2j3jsehtcpefetkcdq.sj1: errno 97). This is most likely due to disk bad sectors or a corrupted file system. Check your disk for bad sectors and integrity\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:720\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: unrecoverable error accessing data on the storage file system (path=/Storj/data/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/aa/ak27ty7kny3fi4cyyxku6ahlijoma6okrjtvcxc6evausjh44a.sj1; error=lstat /Storj/data/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/aa/ak27ty7kny3fi4cyyxku6ahlijoma6okrjtvcxc6evausjh44a.sj1: errno 97). This is most likely due to disk bad sectors or a corrupted file system. Check your disk for bad sectors and integrity\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:720\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
storj.io/common/process.cleanup.func1
	/go/pkg/mod/storj.io/common@v0.0.0-20240604134154-517cce55bb8c/process/exec_conf.go:429
github.com/spf13/cobra.(*Command).execute
	/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
	/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115
github.com/spf13/cobra.(*Command).Execute
	/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039
storj.io/common/process.ExecWithCustomOptions
	/go/pkg/mod/storj.io/common@v0.0.0-20240604134154-517cce55bb8c/process/exec_conf.go:112
main.main
	/go/src/storj.io/storj/cmd/storagenode/main.go:34
runtime.main
	/usr/local/go/src/runtime/proc.go:271

The logs mention potential filesystem corruption. I’d ‘fsck -y’ that disk and try again.

fsck doesn’t work on ZFS, from the looks of it…

Ah, you’ll want the ‘zpool scrub’ command for ZFS.
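Something along these lines (a rough sketch; ‘Storj’ is just a guess at the pool name based on the mount path, substitute your actual pool):

zpool scrub Storj     # start a full scrub of the pool
zpool status Storj    # check scrub progress and per-device error counters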

Yeah, there IS corruption and a small amount of data was lost. The problem is that the node refuses to, well… node.

Try renaming that file. If there’s a bad block under it, it’ll stay f#d, but Storj will stop fatal-erroring at that point on every restart (time-delayed looping) and will just skip it. Note that’s only the ‘ac’ prefix of that satellite’s blobs; there’s likely much more fun ahead if a scrub doesn’t work.
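For example, for the first path from the log above (a sketch only; the ‘.corrupt’ suffix is arbitrary, and if the rename itself fails with an I/O error then the directory metadata is damaged too):

mv /Storj/data/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/ac/utdoz6lux7uhsvphpklt45ylywopllhjjoymlf2bnx7cyzcj4q.sj1 \
   /Storj/data/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/ac/utdoz6lux7uhsvphpklt45ylywopllhjjoymlf2bnx7cyzcj4q.sj1.corrupt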

The node should do something with it, not me. There’s almost a percent of data missing; nobody in their right mind is going to remove these files by hand…

I seriously hope the node isn’t designed to throw away 100% of the data when it encounters expected corruption.

There’s no such thing as ‘expected corruption’. The file can have invalid contents (that fail an audit): that’s OK. The file might not exist at all: also OK. But the file shouldn’t throw OS-level errors when you try to open it (which is what is happening now); filesystem errors are something a scrub should have dealt with.

Has the scrub not finished running yet?

We expect the node to act selfishly to some degree. If there is a type of problem the node cannot interpret on its own, it’s preferable for it to go down in case the problem would result in audit failures. Bad audits can disqualify a node in an hour or two if you’re unlucky. If a node goes down, it can wait for days until you have the time and means to diagnose it; plus, it’s easier to notice a node being down than a node showing errors in its logs from time to time.

(It would be better if known problems weren’t logged as errors and SNOs just had a reliable way of getting notified about them, but this is a topic for other threads that already exist on the forum.)

This whole system is built on the presumption that storage drives are neither infallible nor error-proof. Corruption is expected. I know there’s a protection against the storage location becoming unwritable, but I didn’t know it would also refuse to work when a single file is unserviceable. The last scrub is still running; it’s useless though, nothing will be accomplished by it except accounting for all the errors.

Yes, the way ZFS deals with it is by doing this: making the files unreadable. That’s how it works; ZFS is very bad at dealing with corrupted data. I had all kinds of errors before it would start working.


This is understandable; I thought it would do so only when the directory is unwritable, and I remember when that change was made. However, saving my node from disqualification in this fashion will get me disqualified. Is there a switch to turn off crashing the node when a file is unreadable?

We are trying to protect the node from a disqualification. If you are aware of the issue, you can fix it or apply a workaround (like renaming). The node software is not designed to work around hardware issues, it’s true. And I believe we shouldn’t reinvent what the OS or specialized recovery software does at a low hardware level.

It may result in disqualification if the node stays offline for more than 30 days, yes. However, that’s much longer than the few hours it takes to be disqualified for answering audits but not providing the requested pieces.
There is no switch to disable this kind of safety check. Many SNOs asked us to implement it, so we did.
Now it’s your turn to fix the issue as best you can.

How?

For example, by not monitoring a node and not noticing that it is offline until it receives a DQ message.

By making the node unrunnable; it’s constantly exiting. I ran out of time to fix the issue due to having too little free time to troubleshoot.

Lost the node.

The node should ignore inaccessible data and consider it missing. Don’t lose 10 TB of data because you cannot access 4096 bytes; this is idiocy. Hopefully somebody changes this behaviour or adds a switch for it.

Sorry to barge into the conversation, but hopefully Storj never implements a switch to compensate for broken nodes. There is the (slight) possibility that all the nodes storing a file’s pieces have that switch turned on = the client loses their file.

On a side note, how did ZFS manage to corrupt data? Sounds like an underlying hardware issue to me, and the node did what it’s supposed to do: get off the network so it can’t mess everything up (from the client’s POV).

No, you do not understand the issue: the node shuts down all the pieces if one piece isn’t accessible = all clients lose their pieces on this node. The node should continue running and let the filewalker find the pieces that aren’t accessible and delete them from the database = 0.5%-1% of data lost in my case. In reality, 100% of the node’s data was lost.

Actually, I understand perfectly. The node shut down = went offline = the network will know within 5 hours that the node is unavailable and will trigger repair of the pieces it stores (using other nodes that have the pieces).

The issue was that instead of leaving the node offline while you repaired the corrupted filesystem (again, how did ZFS manage to corrupt data?), you rushed to bring it online, further adding to the damage.

The node could have stayed offline for 10 days while you repaired the 20 TB drive (assuming that size) and still have made it back in time before being disqualified.

Actually you don’t, because the node WAS offline for 11-12 days while I worked on the issues. The network was never the issue; the network will survive without those pieces. The issue is that good, redundant data gets thrown out of the network, and also that damage is done to the SNO.

ZFS didn’t corrupt the data; the data got corrupted. However, once data gets corrupted, ZFS doesn’t deal well with it. ZFS still isn’t designed well enough to handle corruption, because nearly nobody ever gets to that point with ZFS.
Also, the node should never add any further damage to the data.

You couldn’t fix the node in 12 days? I find that hard to believe without a hardware issue.

How was the data corrupted? Was the disk failing? Was your RAM bad? Was the controller messing up?

If ZFS got to the point where corrupted data made it onto the disk (FYI: ZFS verifies a checksum on every read, and if it finds bad data it immediately tries to write back the correct data; a scrub isn’t needed for this), then the corruption came from somewhere outside of the software’s control (= a hardware issue, as I said in my first reply). The node got a read error from ZFS (again, this cannot happen in normal working conditions even with corrupted data, see above), which the node interpreted as “the sky is falling”, hence it shut down.
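If you want to see exactly which files ZFS has flagged as permanently damaged (i.e. it has no good copy left to heal them from), something like this should show it (pool name again a placeholder):

zpool status -v Storj    # the 'Permanent errors' section lists the affected file paths
zpool clear Storj        # optionally reset the error counters once you've dealt with them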
