Hi there,
Yesterday I received a notification that my node was disqualified from us1.storj.io.
Reading through other threads I’ve collected some information and it looks like I have a bunch of “File does not exist” errors from “GET_REPAIR”.
docker logs storagenode 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep failed
returns 105 matching lines.
Error example:
2022-08-25T12:09:16.726Z ERROR piecestore download failed {"Process": "storagenode", "Piece ID": "DT2HVVVT2IJ3HR6LGFJICGN6HHEJAUAKG726WOFM6T7M4WPUOJXQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_REPAIR", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:73\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:546\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}
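To break that count down by action I used something like the following. The sample log here is a made-up stand-in for illustration; on the real node you would pipe `docker logs storagenode 2>&1` instead of reading the temp file:

```shell
# Write a tiny sample log for illustration (on a real node, use
# `docker logs storagenode 2>&1` instead of this temp file).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2022-08-25T12:09:16.726Z ERROR piecestore download failed {"Action": "GET_REPAIR", "error": "file does not exist"}
2022-08-25T12:10:02.101Z ERROR piecestore download failed {"Action": "GET_AUDIT", "error": "file does not exist"}
2022-08-25T12:11:45.330Z ERROR piecestore download failed {"Action": "GET_REPAIR", "error": "file does not exist"}
EOF

# Count failed downloads per action.
AUDIT_FAILS=$(grep failed "$LOG" | grep -c GET_AUDIT)
REPAIR_FAILS=$(grep failed "$LOG" | grep -c GET_REPAIR)
echo "GET_AUDIT failures:  $AUDIT_FAILS"
echo "GET_REPAIR failures: $REPAIR_FAILS"
```

In my case nearly all of the 105 lines were GET_REPAIR, which is what made me suspect missing pieces rather than a transient problem.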
If I understand correctly, the network was requesting repair pieces that my node was supposed to have but didn't, and I was disqualified after too many such failures.
Fair enough, but how can I dig into why this happened?
Context
- I’m running two 1TB nodes in separate LXC containers with Docker on Proxmox.
- The first node (the disqualified one) was full, which is why I spun up the second node in August.
- The second node does not report any GET_AUDIT or GET_REPAIR errors in the logs
- This is all backed by a 3x3TB RAID5 array
- In May I did have extended offline time when my city lost power for a week, but the nodes were gracefully shut down during the outage. My scores were not great after that, but up until yesterday everything was coming back to 100%.
- Otherwise, aside from the very occasional minute of downtime during a switch/router firmware upgrade, the nodes have been online.
I checked the SMART reports on the drives in question and they’re all OK. I’m currently running a long self-test on them just to be sure.
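For reference, these are the attributes I’m watching in the smartctl output, since reallocated or pending sectors would be the most likely explanation for silently lost pieces. The pasted sample below is illustrative; the real values come from `smartctl -a /dev/sdX` on each member of the array:

```shell
# Sample smartctl attribute lines (illustrative; on a real system run
# `smartctl -a /dev/sdX` and grep its output instead).
SMART_SAMPLE='
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Always       -       0
'

# Pull out the raw values (last column) for the two most telling attributes.
REALLOC=$(echo "$SMART_SAMPLE" | awk '/Reallocated_Sector_Ct/ {print $10}')
PENDING=$(echo "$SMART_SAMPLE" | awk '/Current_Pending_Sector/ {print $10}')
echo "reallocated=$REALLOC pending=$PENDING"
```

All zeros across all three drives in my case, which is why I’m puzzled about where the pieces went.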
What else can I check to get to the bottom of the cause so this doesn’t happen again?
Is it just that the new audit scoring logic evaluates my one extended outage more severely and now causes immediate disqualification?