Stop node whilst array is rebuilt or risk disqualification?

My node is not in a good way :slightly_frowning_face: Last night a 6TB drive failed and my controller started rebuilding the array using a hot spare. Since then my audit score has been slowly falling as the rebuild process takes priority over any drive reading/writing…

image

So my question… should I stop the node until the array has been rebuilt? I’m leaning towards this, as I’m not aware of any actual data loss; it seems to be purely a timeout issue.

Update - I take it back, there does appear to be corruption, so I’ve stopped the node and am now running a chkdsk…

image

Doesn’t look promising.
What kind of RAID and underlying file system is on this pool?
A rebuild should not affect audits, unless the data is already corrupted (and the rebuild is now replicating that corruption).


An Adaptec RAID controller with a RAID6 configuration of 11 x 6TB drives and 1 hot spare. It’s NTFS formatted; it used to run on Windows Server 2012 R2 but is now just Windows 10. The logical drive was shared between Storj data and personal archive files.

The rebuild finished in 27 hours, which was quicker than I thought it might be; however, bad blocks were found during the array rebuild:

image

I ended up keeping the node offline and stopping the chkdsk until the rebuild completed, and am now running the chkdsk again. There’s definitely some corruption or data loss on this array :cry:

I’m sorry. Perhaps ZFS would have helped in this case, but unfortunately that’s not possible on Windows.

Just to update in case anyone else has this issue… The chkdsk took over 3 days to run:

image

It left 1.2TB of files (approximately 10% loss) in the found.000 folder in the root of the drive. The files were still in folders…

image

As the filenames within the folders are still correct, I am manually restoring the folders to the blobs directory. I copy the name of a file…

image

Search the storagenode.log for the original PUT request…

Find the satellite ID, cross-match it with the blobs folder (link), rename the dirxxxx.chk folder to the first two letters of the Piece ID (z6 in this case), then move the folder back to the correct location…

image
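
For anyone who would rather script this lookup than repeat it by hand, here is a minimal Python sketch of the same idea. It assumes the original PUT entries are still in the log and that they use the same JSON-style format as the GET lines further down; the log path and the recovered filename are hypothetical placeholders.

# Minimal sketch of the manual process above: given the filename of a
# recovered piece, find its original PUT entry in the storagenode log and
# print the Satellite ID plus the two-character blobs subfolder it belongs in.
# LOG_FILE and recovered_file are hypothetical - adjust to your own setup.
import re
from pathlib import Path

LOG_FILE = Path(r"D:\storagenode\storagenode.log")   # assumption: your log location
recovered_file = "EXAMPLEPIECENAME.sj1"               # assumption: a file from a dirxxxx.chk folder

# The found.000 filename may be the full Piece ID or only the part after the
# two-letter prefix; a substring match covers both cases.
stem = Path(recovered_file).stem                      # drop the .sj1 extension

with LOG_FILE.open(encoding="utf-8", errors="ignore") as log:
    for line in log:
        if '"Action": "PUT"' not in line:
            continue
        piece = re.search(r'"Piece ID": "([^"]+)"', line)
        sat = re.search(r'"Satellite ID": "([^"]+)"', line)
        # compare case-insensitively in case the on-disk name differs in case
        if piece and stem.upper() in piece.group(1).upper():
            print("Piece ID     :", piece.group(1))
            print("Satellite ID :", sat.group(1) if sat else "not found")
            print("Blobs subdir :", piece.group(1)[:2].lower())
            break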

400 folders to go and then I’ll start the node again to see if it survives :man_shrugging:


I hope that works out for you. If you’re lucky, it was only the directory structure that was messed up. If you’re unlucky, the files you are restoring are all corrupt. Let’s hope it’s the former, in which case your node will likely survive.

How many log files are you searching through?

Just one, the latest month (May). I’ve assumed that the recovered folders are relatively intact.

Maybe you will find this script useful: How to get the name of a piece having its file - #4 by Toyoo


Thanks for the link, but Python is out of my depth, so I stuck to manually moving the recovered folders back.

The node has been online for 10 minutes and so far there are 14 ERROR lines in the log…

2023-06-21T18:50:53.841+0100	ERROR	piecestore	download failed	{"Piece ID": "VALGEGMFD6P4ED5LSYSCABUZI6PWOTVGY5RGLLVJJ5GZVA3SUBTA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 1040896, "Size": 0, "Remote Address": "184.104.224.98:57410", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:655\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2023-06-21T18:52:02.660+0100	ERROR	piecestore	download failed	{"Piece ID": "PL6LMBEHLGFYVUNL4333OUNNMK7UDKYKP2FTA3D5LUI6FSX466FQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET", "Offset": 0, "Size": 0, "Remote Address": "184.104.224.99:10118", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:655\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2023-06-21T18:52:37.863+0100	ERROR	piecestore	download failed	{"Piece ID": "RGCRV2I5HNKINVY2YCCEYQBY3MW2LICUQIXYOKNN6KOSZDMZ7YRQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 238080, "Size": 0, "Remote Address": "184.104.224.99:52110", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:655\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2023-06-21T18:56:22.200+0100	ERROR	piecestore	download failed	{"Piece ID": "PL6LMBEHLGFYVUNL4333OUNNMK7UDKYKP2FTA3D5LUI6FSX466FQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET", "Offset": 0, "Size": 0, "Remote Address": "149.6.140.107:51670", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:655\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2023-06-21T18:56:43.531+0100	ERROR	piecestore	download failed	{"Piece ID": "PL6LMBEHLGFYVUNL4333OUNNMK7UDKYKP2FTA3D5LUI6FSX466FQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET", "Offset": 0, "Size": 0, "Remote Address": "149.6.140.107:18376", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:655\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2023-06-21T18:57:03.740+0100	ERROR	piecestore	download failed	{"Piece ID": "PL6LMBEHLGFYVUNL4333OUNNMK7UDKYKP2FTA3D5LUI6FSX466FQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET", "Offset": 0, "Size": 0, "Remote Address": "149.6.140.107:12124", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:655\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2023-06-21T18:57:21.989+0100	ERROR	piecestore	download failed	{"Piece ID": "PL6LMBEHLGFYVUNL4333OUNNMK7UDKYKP2FTA3D5LUI6FSX466FQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET", "Offset": 0, "Size": 0, "Remote Address": "149.6.140.107:49474", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:655\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2023-06-21T18:57:25.017+0100	ERROR	piecestore	download failed	{"Piece ID": "PL6LMBEHLGFYVUNL4333OUNNMK7UDKYKP2FTA3D5LUI6FSX466FQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET", "Offset": 179968, "Size": 0, "Remote Address": "149.6.140.107:49484", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:655\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2023-06-21T18:57:36.975+0100	ERROR	piecestore	download failed	{"Piece ID": "ZYEZ57YVYDM2PTGXGFKDPH6FT7E3IMNAVHBQ6S6EHQA25BMVMTFQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 320256, "Size": 0, "Remote Address": "50.7.22.66:36608", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:655\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2023-06-21T18:58:59.885+0100	ERROR	piecestore	download failed	{"Piece ID": "PL6LMBEHLGFYVUNL4333OUNNMK7UDKYKP2FTA3D5LUI6FSX466FQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET", "Offset": 0, "Size": 0, "Remote Address": "149.6.140.107:26474", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:655\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2023-06-21T18:59:03.352+0100	ERROR	piecestore	download failed	{"Piece ID": "PL6LMBEHLGFYVUNL4333OUNNMK7UDKYKP2FTA3D5LUI6FSX466FQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET", "Offset": 0, "Size": 0, "Remote Address": "184.104.224.98:45732", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:655\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2023-06-21T18:59:05.590+0100	ERROR	piecestore	download failed	{"Piece ID": "PL6LMBEHLGFYVUNL4333OUNNMK7UDKYKP2FTA3D5LUI6FSX466FQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET", "Offset": 0, "Size": 0, "Remote Address": "149.6.140.107:33690", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:655\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2023-06-21T19:00:52.664+0100	ERROR	piecestore	download failed	{"Piece ID": "PL6LMBEHLGFYVUNL4333OUNNMK7UDKYKP2FTA3D5LUI6FSX466FQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET", "Offset": 0, "Size": 0, "Remote Address": "72.52.83.202:11730", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:655\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2023-06-21T19:01:14.879+0100	ERROR	piecestore	download failed	{"Piece ID": "IL4YQ73XDPYY3EIIZHIPJAMJBAJEPDVMYVCYT4QAKUGCDIXWJIIA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 921600, "Size": 0, "Remote Address": "149.6.140.107:62074", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:655\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
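
As a side note for anyone doing the same triage: rather than counting ERROR lines by hand, a small Python sketch like the one below can tally how many distinct pieces are reporting “file does not exist” in the log. The log path is an assumption.

# Rough tally of "download failed" / "file does not exist" errors per Piece ID,
# to gauge whether the loss is a handful of pieces or something widespread.
# LOG_FILE is an assumption - point it at your own storagenode.log.
import re
from collections import Counter
from pathlib import Path

LOG_FILE = Path(r"D:\storagenode\storagenode.log")   # hypothetical path

missing = Counter()
with LOG_FILE.open(encoding="utf-8", errors="ignore") as log:
    for line in log:
        if "download failed" in line and "file does not exist" in line:
            match = re.search(r'"Piece ID": "([^"]+)"', line)
            if match:
                missing[match.group(1)] += 1

print("Distinct missing pieces:", len(missing))
for piece_id, count in missing.most_common(10):
    print(f"{count:>4}  {piece_id}")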

I’ll run the successrate script after an hour and see how the node is doing. I feel lucky, but at the same time I could easily get disqualified with over 4% loss.


1 hour later and the node is still running. The audit failures are less than 2% at the moment:

image

Node Dashboard update:


If a satellite discovers that a node reports “file doesn’t exist”, does it probe for the piece again a few times? Or does it mark it as non-existent on that node at the first attempt and not probe for it again?

Curious that they are listed as recoverable failed. That doesn’t seem right if the files are missing. But also, your node won’t log an error for the transfer if it sends corrupt data, so you might not see all failures.

I also wasn’t expecting it to hit your suspension score.

I believe it only tries again for “unknown errors”. The ones that hit your suspension score. But don’t quote me on that.

I didn’t pick that terminology but agree that the ‘file not found’ error is not recoverable. Looking at the successrate code, the difference between critical and recoverable is whether the log line contains “open”.
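
To spell that out, here is a rough Python paraphrase of the check as I understand it from this thread; it is not the actual successrate script, and the marker strings come from this discussion rather than the script itself.

# Hedged paraphrase of the audit-failure classification discussed above,
# not the real successrate code: a failed audit line is counted as "critical"
# when it contains the marker for a missing file, otherwise as "recoverable".
# The version discussed here matches "open"; newer versions reportedly match
# "exist" instead (see the replies below).
def classify_failed_audit(log_line: str, marker: str = "open") -> str:
    return "critical" if marker in log_line else "recoverable"

# One of the error messages seen in this thread:
line = '"error": "file does not exist"'
print(classify_failed_audit(line))                  # recoverable - the mismatch seen here
print(classify_failed_audit(line, marker="exist"))  # critical - with the updated marker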

I assume this was when the array was being rebuilt, chkdsk was running, and the node was also egressing data - all of which caused the drive to be unresponsive?

14 hours later… the suspension score has now recovered, but annoyingly the node updated, which dropped my online score a little further:

Audit failure rate has also improved from where it was…

image

Ahh, yeah, that’s based on older terminology. I created the original Linux script and adjusted that one to work with the new terminology. It looks for “exist” now, I believe. I guess the Windows version was never updated by @Alexey to reflect this.

Ahh, yes, that would make sense.

You still have some margin there. So you should be fine if you keep it online from now on.

Yeah, that doesn’t look too bad. Your node might just survive.


Ok, I must have an old version. I’ll update my script for the new terminology.

:crossed_fingers:


There are cases where it is recoverable. This error reports only an observation, not something that physically happened on the drive. The error is recoverable if, for example, the drive was simply not mounted, and hence the file could not be found despite physically existing. This was a common occurrence in the days when the node did not verify the existence of basic files at startup. It is still possible in some custom setups as well.

You are right, I did not update it; it’s fixed now. Thanks!
Honestly, I don’t remember when I last used it.


Recoverable in this case means an audit error that puts your node in containment and is retried later. It’s not about whether the underlying problem is recoverable, but about whether it immediately affects your audit score or whether an impact on that score can still be prevented.
