Repair queue has been increasing for 2 days. Any clue why?

The repair queue is increasing. If you have a minute, please check your storage node. I would like to know:

  1. Is your node offline? If yes, please share more details.
  2. Any GET_REPAIR or AUDIT errors? If yes, please track down the pieceID (see the sketch below this list). I would like to know if it has been deleted by GC in the last few days. If so, we have 4-5 more days to disable the new GC job and recover everything from trash.
  3. Any other related observations?
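
To track a piece down, one quick way (a rough sketch, not official tooling; adjust the log path and piece ID to your setup) is to pull every log line that mentions the piece ID, so its whole history shows up in one place:

# Rough sketch: print every log line mentioning a given piece ID, so you can see
# its history (upload, GET_REPAIR/AUDIT errors, deletes, trash moves if your log
# level captures them) in one place.
import sys

def piece_history(log_path: str, piece_id: str) -> None:
    with open(log_path, "r", encoding="utf-8", errors="replace") as log:
        for line in log:
            if piece_id in line:
                print(line.rstrip())

if __name__ == "__main__":
    # usage: python piece_history.py node.log <PIECE_ID>
    piece_history(sys.argv[1], sys.argv[2])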
3 Likes

On Nov 18, 3 of my nodes, which are running behind the same IP, were offline for 5 hours. I suppose this was due to a power outage or problems with the internet connection; since then they have been running fine. These nodes now have a total of 164 GB in trash; yesterday it was a little higher.

According to https://storjnet.info/, some 400+ nodes went offline recently. They appear to be in France. Perhaps an ISP there is having issues?

2 Likes

I see drops in other countries as well at exactly the same time, though not as significant.

@littleskunk : Yesterday’s catch-up on garbage collection hit my nodes pretty hard, with big spikes in CPU and RAM as well as IO. us1 had a lot of work to do. It’s possible some nodes could not keep up.

2 Likes

Log file from 2022-11-14 to now

I only see these REPAIR errors:

2022-11-16T17:22:34.713Z        ERROR   piecestore      download failed {"Process": "storagenode", "Piece ID": "4XUU36EHSASPJSKBN6PYED5CRDEB6FCPAVZR5GG2URB24GV7NX3A", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR", "error": "write tcp 172.17.0.2:28967->5.161.130.154:38214: use of closed network connection", "errorVerbose": "write tcp 172.17.0.2:28967->5.161.130.154:38214: use of closed network connection\n\tstorj.io/drpc/drpcstream.(*Stream).rawWriteLocked:326\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:404\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:628\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2022-11-16T17:22:34.721Z        ERROR   piecestore      download failed {"Process": "storagenode", "Piece ID": "A6JU2URN2ZFQ2DFVGSX2I5H7TNPA4WHGGHXKFJ2I6UI7LK7W53GQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR", "error": "write tcp 172.17.0.2:28967->5.161.130.154:37916: write: connection reset by peer", "errorVerbose": "write tcp 172.17.0.2:28967->5.161.130.154:37916: write: connection reset by peer\n\tstorj.io/drpc/drpcstream.(*Stream).rawWriteLocked:326\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:404\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:628\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2022-11-16T17:43:30.278Z        ERROR   piecestore      download failed {"Process": "storagenode", "Piece ID": "YRKE752LYUHVKKABI6O4SNSHQNBWUIYDWAKZTMYZQ3BIPDLAZRNQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR", "error": "write tcp 172.17.0.2:28967->5.161.130.152:52570: use of closed network connection", "errorVerbose": "write tcp 172.17.0.2:28967->5.161.130.152:52570: use of closed network connection\n\tstorj.io/drpc/drpcstream.(*Stream).rawWriteLocked:326\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:404\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:628\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}

Otherwise:

========== REPAIR DOWNLOAD ==== 
Failed:                3 
Fail Rate:             0.002%
Canceled:              0 
Cancel Rate:           0.000%
Successful:            195455 
Success Rate:          99.998%
========== REPAIR UPLOAD ====== 
Failed:                0 
Fail Rate:             0.000%
Canceled:              3 
Cancel Rate:           0.004%
Successful:            67126 
Success Rate:          99.996%

No AUDIT failures.
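
As a sanity check on percentages like the ones above, each rate is just the counter divided by the total number of attempts; a minimal sketch, using the numbers copied from the REPAIR DOWNLOAD block:

# Sanity check: rate = count / (failed + canceled + successful), as a percentage.
failed, canceled, successful = 3, 0, 195455      # from the REPAIR DOWNLOAD block above
total = failed + canceled + successful
print(f"Fail Rate:    {100 * failed / total:.3f}%")      # ~0.002%
print(f"Success Rate: {100 * successful / total:.3f}%")  # ~99.998%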

It would be useful to examine what these nodes have in common. Not to warrant any specific action, but to better understand risks to the system. It is nice to see that even 400 simultaneous node dropouts don’t affect availability.

1 Like

GET_REPAIR error from this month:

2022-11-23T22:20:23.315Z	INFO	piecestore	download started	{"Piece ID": "IVT2UM43M275XSV5QUCJDEMXRJFXXTUZM7OQOWKBRRBBWBQFLKMA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR"}
2022-11-23T22:20:23.447Z	ERROR	piecestore	download failed	{"Piece ID": "IVT2UM43M275XSV5QUCJDEMXRJFXXTUZM7OQOWKBRRBBWBQFLKMA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:73\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:554\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}

Interestingly, it does look as though I lost this piece:

2022-08-26T05:16:42.025+0100	INFO	piecestore	upload started	{"Piece ID": "IVT2UM43M275XSV5QUCJDEMXRJFXXTUZM7OQOWKBRRBBWBQFLKMA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Available Space": 431835076480}
2022-08-26T05:16:43.105+0100	INFO	piecestore	uploaded	{"Piece ID": "IVT2UM43M275XSV5QUCJDEMXRJFXXTUZM7OQOWKBRRBBWBQFLKMA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Size": 399104}

No GET_AUDIT errors.

That one might date back to a previous event. I wouldn’t worry too much about it.

2022-11-17T14:02:33.631Z        ERROR   piecestore      download failed {"Process": "storagenode", "Piece ID": "GWW256VCTMGBKNUG7MN7E5HSQGIN77LEB5EHX6YZHII36GGFC42A", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET_REPAIR", "error": "pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)", "errorVerbose": "pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)\n\tstorj.io/storj/storagenode/pieces.NewReader:230\n\tstorj.io/storj/storagenode/pieces.(*Store).Reader:273\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:550\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}
2022-11-18T00:41:21.552Z        ERROR   piecestore      download failed {"Process": "storagenode", "Piece ID": "OMQGWGL442QOCPWGUZY2T55NYUTFDN2R6GXYL3SIQQV2NJFNYCEA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_REPAIR", "error": "pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)", "errorVerbose": "pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)\n\tstorj.io/storj/storagenode/pieces.NewReader:230\n\tstorj.io/storj/storagenode/pieces.(*Store).Reader:273\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:550\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}
2022-11-20T01:04:18.721Z        ERROR   piecestore      download failed {"Process": "storagenode", "Piece ID": "BABIRUAF2MBZA6FXUJYE7WZQZVYLXDWSJXUHODFTEYDVZZ5ZECZQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_REPAIR", "error": "pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)", "errorVerbose": "pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)\n\tstorj.io/storj/storagenode/pieces.NewReader:230\n\tstorj.io/storj/storagenode/pieces.(*Store).Reader:273\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:550\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}
2022-11-21T17:47:48.744Z        ERROR   piecestore      download failed {"Process": "storagenode", "Piece ID": "MHZCAAOKETCULPCUYWQLMINBNT76JEHT45TCC4IRTGKZFRWXZLGQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_REPAIR", "error": "pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)", "errorVerbose": "pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)\n\tstorj.io/storj/storagenode/pieces.NewReader:230\n\tstorj.io/storj/storagenode/pieces.(*Store).Reader:273\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:550\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}

Funny you’d say that:
I have 4 nodes running on a relatively low-powered machine.
It has been absolutely hammered by the garbage collection for a couple of days, to the point of being barely reachable even on the LAN.
My online score also dropped from 100% to 97%, so it seems like it wasn’t reachable from the STORJ network.

Seems to have gone back to normal now.

Have you guys changed anything with the garbage collection?

That’s what I feared. The timing seems to line up, since that started this Monday.

It hasn’t necessarily changed; it had just been turned off for a long time while they were fixing an issue. Since Monday it has started up again, but with massive loads of data to catch up on.

More info here:

2 Likes

That’s really interesting.
Thank you for that. So long as it’s a one-off, I won’t worry too much.
I was starting to wonder if one of the disks was going bad :wink:

The next run may still be quite big. GC catches about 90% with each run, so you’re going to see a fairly big GC run again at about 1/10th of this one, then 1/100th, and so on until it’s back to normal levels.
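
A toy back-of-the-envelope for that decay, using the 164 GB trash figure mentioned earlier in the thread as a stand-in starting point:

# Toy illustration of the "~90% caught per run" decay; 164 GB is just the trash
# figure quoted earlier in the thread, not a prediction for any particular node.
backlog_gb = 164.0
for run in range(1, 5):
    collected = backlog_gb * 0.9   # roughly 90% of the remaining garbage per GC run
    backlog_gb -= collected
    print(f"run {run}: collected ~{collected:.1f} GB, ~{backlog_gb:.1f} GB left")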

2 Likes

Thank you for the heads-up, I won’t panic :wink:

I have a smaller node running on an SSD, and that one coped just fine; all I see is quite a lot more trash reported. Now I know why :slight_smile:

Nope.

Some nodes on some satellites started reporting an online score of ~90%, but I see it mostly on test satellites, and I had some unplanned downtime this month (~6h twice), so I just assumed I was unlucky with the timing of connectivity checks. One of them even went down to 75%, but it was saltlake… It’s not a new thing though; I noticed it last weekend.

Cascading failures are scary!

Yeah, satellites prepare and send the bloom filters for garbage collection, so they could also spread them out more so they don’t hit so many nodes at the same time. That would help a lot for repair, even if it doesn’t necessarily help individual nodes.
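
For anyone unfamiliar with that mechanism, here is a conceptual sketch of what a bloom-filter GC pass does (illustrative Python, not the actual storagenode code):

# Conceptual sketch of a bloom-filter GC pass: the satellite sends a filter
# describing pieces it still knows about, and the node trashes everything it
# stores that is NOT in the filter. A real bloom filter has false positives,
# so some garbage survives each run (hence the "~90% per run" figure), but it
# never rejects a piece the satellite still wants, so live data isn't trashed.
from typing import Callable, Iterable, List

def gc_pass(stored_piece_ids: Iterable[str],
            maybe_contains: Callable[[str], bool]) -> List[str]:
    """Return the piece IDs that should be moved to trash."""
    return [pid for pid in stored_piece_ids if not maybe_contains(pid)]

# Toy stand-in for the bloom filter: a plain set has no false positives,
# so this toy version "catches" 100% of garbage instead of ~90%.
live = {"PIECE-A", "PIECE-B"}
print(gc_pass(["PIECE-A", "PIECE-B", "PIECE-C"], live.__contains__))  # ['PIECE-C']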

1 Like

A quick idea: how about having the node pause the file walker process whenever it detects a large I/O load in some form (either, let’s say, pure drive I/O statistics, or maybe just a simple “we have more than X concurrent uploads/downloads”)? Bloom filters ought to be less important than accepting audits, repairs, and downloads.
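
Something like the following, purely to illustrate the idea (the load signal and the threshold X are made-up placeholders, not actual storagenode code):

# Sketch: pause the file walker while the node is busy serving transfers.
import time
from typing import Callable, Iterator

MAX_CONCURRENT_TRANSFERS = 10  # the "X" from the suggestion; value is illustrative

def throttled_walk(pieces: Iterator[str],
                   active_transfers: Callable[[], int],
                   check_interval: float = 1.0) -> Iterator[str]:
    for piece in pieces:
        # Back off while the node is under load; audits, repairs and downloads win.
        while active_transfers() > MAX_CONCURRENT_TRANSFERS:
            time.sleep(check_interval)
        yield piece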

2 Likes

Correct, this last batch was the worst of it. We’ve been monitoring network performance and haven’t seen any slowdowns even with the GC this week.

We could look into adding rate limiting on the bloom filter checking (it would need to be smart enough that it never rate limits so much that it never finishes). The node could suspend other tasks as well; that is a good suggestion. We could also “roll out” the bloom filters to the network, kind of like node updates, where only some % of the network gets them and we control the rate, so we can do large GC batches slowly.
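
Purely to illustrate that percentage-based rollout idea (not how the satellite actually does it), a stable hash of the node ID could decide which fraction of nodes gets a given bloom filter, so the percentage can be ramped up batch by batch:

# Illustrative rollout gate: hash the node ID into a stable bucket in [0, 100)
# and only send the bloom filter once the rollout percentage reaches that bucket.
import hashlib

def in_rollout(node_id: str, rollout_percent: int) -> bool:
    digest = hashlib.sha256(node_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable per node
    return bucket < rollout_percent

nodes = [f"node-{i}" for i in range(1000)]
for percent in (10, 50, 100):
    selected = sum(in_rollout(n, percent) for n in nodes)
    print(f"{percent}% rollout -> {selected} of {len(nodes)} nodes get the filter")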

Considering that this GC batch was abnormally large, I don’t think we need to add those things immediately.

5 Likes

Well, you seemed worried enough that it prompted you to ask us if there was something going on…
Was this what caused it, or have you found another issue? Was it “just” the ISP-taking-400-nodes-offline thing?

That could be long-tail cancellation doing what it’s supposed to do. It doesn’t mean some nodes aren’t hurting under the pressure.

3 Likes