The repair queue is increasing. If you have a minute, please check your storage node. I would like to know:
Is your node offline? If yes, please share more details.
Any GET_REPAIR or AUDIT errors? If yes, please track down the pieceID. I would like to know whether it was deleted by GC in the last few days. If so, we have 4-5 more days to disable the new GC job and recover everything from trash.
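If it helps, here is a quick throwaway sketch for pulling those piece IDs out of a saved log. It assumes you have dumped the node log to a plain text file first (for example with docker logs); the file name is just a placeholder, adjust it for your own setup:

```python
import re

LOG_FILE = "node.log"  # assumption: plain-text dump of your node log

# Field patterns as they appear in the storagenode log lines below.
piece_re = re.compile(r'"Piece ID":\s*"([A-Z0-9]+)"')
sat_re = re.compile(r'"Satellite ID":\s*"([A-Za-z0-9]+)"')

with open(LOG_FILE, encoding="utf-8") as f:
    for line in f:
        if "ERROR" not in line:
            continue
        if "GET_REPAIR" not in line and "AUDIT" not in line:
            continue
        piece = piece_re.search(line)
        sat = sat_re.search(line)
        timestamp = line.split()[0]
        print(timestamp,
              sat.group(1) if sat else "?",
              piece.group(1) if piece else "?")
```

With the piece IDs and timestamps in hand, you can then search your logs (and trash folder) for when GC touched those pieces.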
On Nov 18, three of my nodes, which run behind the same IP, were offline for 5 hours, presumably due to a power outage or internet connection problems; they have been running fine since then. These nodes now have a total of 164 GB in trash; yesterday it was a little higher.
I see drops in other countries as well, not as significant, but at exactly the same time.
@littleskunk: Yesterday's garbage collection catch-up hit my nodes pretty hard, with big spikes in CPU, RAM, and IO. us1 had a lot of work to do. It's possible some nodes could not keep up.
It would be useful to examine what these nodes have in common. Not to warrant any specific action, but to better understand risks to the system. It is nice to see that even 400 simultaneous node dropouts don't affect availability.
2022-11-17T14:02:33.631Z ERROR piecestore download failed {"Process": "storagenode", "Piece ID": "GWW256VCTMGBKNUG7MN7E5HSQGIN77LEB5EHX6YZHII36GGFC42A", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET_REPAIR", "error": "pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)", "errorVerbose": "pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)\n\tstorj.io/storj/storagenode/pieces.NewReader:230\n\tstorj.io/storj/storagenode/pieces.(*Store).Reader:273\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:550\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}
2022-11-18T00:41:21.552Z ERROR piecestore download failed {"Process": "storagenode", "Piece ID": "OMQGWGL442QOCPWGUZY2T55NYUTFDN2R6GXYL3SIQQV2NJFNYCEA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_REPAIR", "error": "pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)", "errorVerbose": "pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)\n\tstorj.io/storj/storagenode/pieces.NewReader:230\n\tstorj.io/storj/storagenode/pieces.(*Store).Reader:273\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:550\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}
2022-11-20T01:04:18.721Z ERROR piecestore download failed {"Process": "storagenode", "Piece ID": "BABIRUAF2MBZA6FXUJYE7WZQZVYLXDWSJXUHODFTEYDVZZ5ZECZQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_REPAIR", "error": "pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)", "errorVerbose": "pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)\n\tstorj.io/storj/storagenode/pieces.NewReader:230\n\tstorj.io/storj/storagenode/pieces.(*Store).Reader:273\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:550\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}
2022-11-21T17:47:48.744Z ERROR piecestore download failed {"Process": "storagenode", "Piece ID": "MHZCAAOKETCULPCUYWQLMINBNT76JEHT45TCC4IRTGKZFRWXZLGQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_REPAIR", "error": "pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)", "errorVerbose": "pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)\n\tstorj.io/storj/storagenode/pieces.NewReader:230\n\tstorj.io/storj/storagenode/pieces.(*Store).Reader:273\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:550\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}
Funny you’d say that:
I have 4 nodes running on a relatively low powered machine.
It has been absolutely hammered by the garbage collection for a couple of days, to the point of being barely reachable even on the LAN.
My online score also dropped from 100% to 97%, so it seems like it wasn't reachable from the STORJ network.
Seems to have gone back to normal now.
Have you guys changed anything with the garbage collection?
That's what I feared. The timing seems to line up, since that started this Monday.
It hasn't necessarily changed; it had just been turned off for a long time while they were fixing an issue. Since Monday it has been running again, but with massive amounts of data to catch up on.
That’s really interesting.
Thank you for that. As long as it's a one-off, I won't worry too much.
I was starting to wonder if one of the disks was going bad.
The next run may still be quite big. GC catches about 90% of the garbage with each run, so you're going to see a fairly big GC run again at about 1/10th the size of this one, then 1/100th, and so on until it's back to normal levels.
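Back-of-the-envelope, with made-up numbers, that decay looks like this:

```python
# If each GC run catches ~90% of the garbage still on the node, every run
# is roughly a tenth the size of the previous one. Numbers are arbitrary.
garbage = 100.0      # whatever amount of uncollected garbage is left today
catch_rate = 0.90    # rough figure quoted in this thread, not a guarantee

for run in range(1, 5):
    collected = garbage * catch_rate
    garbage -= collected
    print(f"run {run}: collects ~{collected:.2f}, ~{garbage:.2f} left over")
```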
Some nodes on some satellites started reporting an online score of ~90%, but I see it mostly on test satellites, and I had some unplanned downtime this month (~6h twice), so I just assumed I was unlucky with the timing of the connectivity checks. One of them even went down to 75%, but that was saltlake… It's not a new thing, though; I noticed it last weekend.
Yeah, the satellites prepare and send the bloom filters for garbage collection, so they could also spread the sending out more so it doesn't hit so many nodes at the same time. That would help a lot for repair, even if it doesn't necessarily help individual nodes.
A quick idea: how about having the node pause the file walker process whenever it detects heavy I/O load in some form (either, let's say, pure drive I/O statistics, or maybe just a simple "we have more than X concurrent uploads/downloads")? Processing bloom filters ought to be less important than accepting audits, repairs, and downloads.
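Something along these lines is what I have in mind; it's purely an illustration of the idea, not actual storagenode code, and the thresholds and helper functions are made up:

```python
import time

MAX_CONCURRENT_TRANSFERS = 5   # the "X" from the suggestion above, made up
CHECK_INTERVAL_SECONDS = 10

def active_transfers() -> int:
    """Placeholder: in a real node this would come from the piecestore's own
    bookkeeping of concurrent uploads/downloads (or from drive I/O stats)."""
    return 0

def process_for_gc(piece):
    """Placeholder for the bloom-filter check / move-to-trash step."""
    pass

def walk_pieces(pieces):
    for piece in pieces:
        # Back off whenever customer-facing traffic is heavy, then resume.
        while active_transfers() > MAX_CONCURRENT_TRANSFERS:
            time.sleep(CHECK_INTERVAL_SECONDS)
        process_for_gc(piece)
```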
Correct, this last batch was the worst of it. We’ve been monitoring network performance and haven’t seen any slowdowns even with the GC this week.
We could look into adding rate limiting to the bloom filter checking (it would need to be smart enough that it never rate-limits so much that it never finishes). The node could suspend other tasks as well; that is a good suggestion. We could also "roll out" the bloom filters to the network kind of like node updates, where only some percentage of the network gets them at a time and we can control the rate, so we can do large GC batches slowly.
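As a very rough illustration of that rollout idea (not actual satellite code; the names and numbers are made up), the satellite could gate sending on a stable hash of the node ID and widen the percentage over time:

```python
import hashlib

def in_rollout(node_id: str, rollout_percent: float, seed: str = "gc-batch-1") -> bool:
    """Place a node into a stable 0..65535 bucket; for a given seed the same
    node always lands in the same bucket, so widening the percentage only
    ever adds nodes, never removes them."""
    digest = hashlib.sha256((seed + node_id).encode()).digest()
    bucket = (digest[0] << 8) | digest[1]            # 0..65535
    return bucket < rollout_percent / 100 * 65536

# e.g. start by sending bloom filters to ~10% of nodes, then widen later.
nodes = ["nodeA", "nodeB", "nodeC"]                  # placeholder node IDs
first_batch = [n for n in nodes if in_rollout(n, 10)]
```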
Considering that this GC batch was abnormally large, I don’t think we need to add those things immediately.
Well, you seemed worried enough to ask us if there was something going on…
Was this what caused it, or have you found another issue? Was it “just” the ISP-taking-400-nodes-offline thing?