Node Disqualified - File Not Found - Understanding cause

mbaker3 · August 25, 2022, 5:59pm

Hi There,

Yesterday I received a notification that my node was disqualified from us1.storj.io.

Reading through other threads I’ve collected some information and it looks like I have a bunch of “File does not exist” errors from “GET_REPAIR”.

docker logs storagenode 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep failed

returns 105 matches
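To see how those failures break down, a small shell sketch along these lines can group them by action and error message. The function name `summarize_failures` is made up for illustration, and it assumes the JSON-style log format shown in the error example in this post.

```shell
# Hypothetical helper: reads storagenode log lines on stdin and prints
# a count per (action, error) pair for failed audit/repair downloads.
summarize_failures() {
  grep -E 'GET_AUDIT|GET_REPAIR' \
    | grep 'failed' \
    | sed -n 's/.*"Action": "\([A-Z_]*\)".*"error": "\([^"]*\)".*/\1 | \2/p' \
    | sort | uniq -c | sort -rn
}
```

It can be fed the same way as the command above, for example `docker logs storagenode 2>&1 | summarize_failures`.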

Error example:

2022-08-25T12:09:16.726Z	ERROR	piecestore	download failed	{"Process": "storagenode", "Piece ID": "DT2HVVVT2IJ3HR6LGFJICGN6HHEJAUAKG726WOFM6T7M4WPUOJXQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_REPAIR", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:73\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:546\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}

If I understand correctly, the network was asking for repair blocks that my node was supposed to have but didn’t, and I was disqualified for too many infractions.

Fair enough, but how can I dig into why this happened?

Context

I’m running two 1TB nodes in separate LXC containers running docker on Proxmox.
The first node (the disqualified one) was full, which is why I spun up the second node in August.
The second node does not report any GET_AUDIT or GET_REPAIR errors in the logs.
This is all backed by a 3x3TB RAID5 array.
In May I did have extended offline time when my city lost power for a week but the nodes were gracefully shut down during the power outage. My scores were not great after that but up until yesterday everything was coming back to 100%
Otherwise, aside from the very occasional 1min of downtime during a switch/router firmware upgrade, the nodes have been online.

I checked the SMART reports on the drives in question and they’re all OK. I’m currently running a long test on them just to be sure.

What else can I check to get to the bottom of the cause so this doesn’t happen again?

Is it just that the new audit scoring logic evaluates my one extended outage more severely and now causes immediate disqualification?

BrightSilence · August 25, 2022, 6:48pm

You can find the pieces in the corresponding satellite folder.

Folder names are listed here: Satellite info (Address, ID, Blobs folder, Hex)
The first 2 letters of the piece ID are the subfolder, the rest is the filename. If they are missing, you lost some files.
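As a rough sketch of that lookup, the helper below builds the expected location of a piece from its ID. The name `piece_path` and its arguments are hypothetical. It also assumes that the on-disk names are the lowercase form of the ID shown in the log and that the file carries an extension, so it only prints the path stem, to be completed with a glob.

```shell
# Hypothetical helper: piece_path BLOBS_DIR SATELLITE_FOLDER PIECE_ID
# Prints BLOBS_DIR/SATELLITE_FOLDER/<first 2 chars>/<remaining chars>.
# Assumption: on-disk names are the lowercase form of the logged piece ID.
piece_path() (
  id=$(printf '%s' "$3" | tr '[:upper:]' '[:lower:]')
  prefix=$(printf '%s' "$id" | cut -c1-2)   # subfolder
  rest=$(printf '%s' "$id" | cut -c3-)      # filename stem
  printf '%s/%s/%s/%s\n' "$1" "$2" "$prefix" "$rest"
)
```

For example, `ls -l "$(piece_path /path/to/storage/blobs SATELLITE_FOLDER PIECE_ID)"*` lists the file if it exists. The blobs path and folder name here are placeholders, to be filled in from your own setup and the satellite list.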

You can also grep the logs for the piece ID to see what happened to that specific piece in the past. However, unless you have redirected logs to a file, there likely won’t be enough history to find everything.
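If the logs are redirected to a file, something like the following sketch could pull out the affected piece IDs and then show everything logged for one of them. Both function names are invented for illustration, and the log format is assumed to match the error example earlier in the thread.

```shell
# Hypothetical helper: reads log lines on stdin and prints the distinct
# piece IDs that appear in failed GET_AUDIT/GET_REPAIR downloads.
failed_piece_ids() {
  grep -E 'GET_AUDIT|GET_REPAIR' \
    | grep 'failed' \
    | sed -n 's/.*"Piece ID": "\([A-Z0-9]*\)".*/\1/p' \
    | sort -u
}

# Hypothetical helper: piece_history PIECE_ID LOG_FILE
# Prints every log line that mentions the given piece ID.
piece_history() {
  grep -F -- "$1" "$2"
}
```

With a log file at a path such as `node.log` (a placeholder), `failed_piece_ids < node.log` gives the list, and `piece_history PIECE_ID node.log` shows whether the piece was ever uploaded, deleted, or moved to trash within the retained history.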

Also check your file system for issues.

Downtime doesn’t count towards disqualification.

No, downtime has nothing to do with it. You have either lost or corrupted files, as the log lines show. What has changed is that nodes with up to 10-15% data loss used to be able to survive for a long time. This was never the intention and is no longer the case. 2% is acceptable; beyond that it gets dicey and disqualification can happen. Beyond 4% will lead to definitive disqualification.
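For a very rough idea of where a node sits relative to those percentages, the logged failure rate can be computed as below. This is only a sketch: the name `audit_failure_pct` is made up, it assumes each audit or repair attempt logs a "download started" line and each failure a "download failed" line, and it only covers whatever history the logs retain. It is not the score the satellite calculates.

```shell
# Hypothetical helper: reads log lines on stdin and prints the share of
# GET_AUDIT/GET_REPAIR downloads that failed, relative to those started.
audit_failure_pct() {
  awk '
    /GET_AUDIT|GET_REPAIR/ {
      if (/download started/) started++
      else if (/download failed/) failed++
    }
    END {
      if (started == 0) { print "no audit or repair downloads found"; exit }
      printf "%d failed of %d started (%.2f%%)\n", failed, started, 100 * failed / started
    }'
}
```

It takes the same input as the other commands in this thread, for example `docker logs storagenode 2>&1 | audit_failure_pct`.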

My guess is that your data loss was already larger than it should have been and you got caught in this update. It sucks, but you can keep running the node on the other satellites and I’m sure that 1TB will fill back up. Removing the disqualified satellite’s data manually isn’t really recommended, but if you do, you can look up its folder with the link I posted previously. Be very careful to remove the right folder though. Also check the scores for other satellites to see if they show dips. Unlike before, those should be visible for much longer now. Anything below 100% on audit score or suspension score means there are issues with your node.

mbaker3 · August 26, 2022, 1:58am

This is perfect. Thank you so much for setting my bearings. I’ll run through these tasks this weekend and see what I come up with.

The extended SMART tests didn’t turn up any issues so that’s good.

That’s interesting. It’s worth a shot. Worst case I destroy the reputation of the node on all satellites and start over. On the upside, I’ve noticed that even before this event my second node was filling up much faster than my first.

Will report back how things go!