Suspended because of low online score

Adding another data point, I got an email that one of my nodes was suspended – only one of my five nodes on this one Internet connection. When I checked, the audit score was nearly 100%.

When I grep the node’s log for failed audits, there is no output.

Failed GET_REPAIRs:

2022-02-21T06:28:51.778Z        ERROR   piecestore      download failed {"Piece ID": "XIDLCNVJVIJ63TUSVHGYLLEUCQOWYGZ5SH24QSSVJ55HX4AAWUIQ", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET_REPAIR", "error": "write tcp 172.19.0.4:28967->78.47.138.30:44542: write: broken pipe", "errorVerbose": "write tcp 172.19.0.4:28967->78.47.138.30:44542: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:347\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:396\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func4:570\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2022-02-21T06:28:51.988Z        ERROR   piecestore      download failed {"Piece ID": "HXHQLFBK32ZTOM3E74ENTIB2CCUOUCL5TVSHPMMI77W5XFHRHUIQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_REPAIR", "error": "write tcp 172.19.0.4:28967->157.90.234.115:33124: write: broken pipe", "errorVerbose": "write tcp 172.19.0.4:28967->157.90.234.115:33124: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:347\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:396\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func4:570\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2022-02-21T06:28:52.085Z        ERROR   piecestore      download failed {"Piece ID": "IPBGUALDM2ER7TRMCGDALS6HYDJVX34WLL4MNFHUOH37L6KN3JIA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "write tcp 172.19.0.4:28967->5.161.82.71:52758: write: broken pipe", "errorVerbose": "write tcp 172.19.0.4:28967->5.161.82.71:52758: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:347\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:396\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func4:570\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2022-02-21T06:28:52.350Z        ERROR   piecestore      download failed {"Piece ID": "K5NQSZGCQOQUHBKE64HTXLZUVQOPNBGH3PPVWZIRLZB4NYWR5KQQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "write tcp 172.19.0.4:28967->5.161.82.39:55348: write: broken pipe", "errorVerbose": "write tcp 172.19.0.4:28967->5.161.82.39:55348: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:347\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:396\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func4:570\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2022-02-21T06:28:52.463Z        ERROR   piecestore      download failed {"Piece ID": "ZJXZVDJCPO2ZLQ6JOZY3QFSJ5HOBFP32WJD6OP546A4Q63EJVM4Q", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "write tcp 172.19.0.4:28967->5.161.45.235:52468: write: broken pipe", "errorVerbose": "write tcp 172.19.0.4:28967->5.161.45.235:52468: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:347\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:396\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func4:570\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2022-02-21T06:28:52.752Z        ERROR   piecestore      download failed {"Piece ID": "446LBHFPCY7ATV665SAZYGTOHANDKUSJAIYE3ABUERJ3GKJWCMXA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "write tcp 172.19.0.4:28967->5.161.82.35:33370: write: broken pipe", "errorVerbose": "write tcp 172.19.0.4:28967->5.161.82.35:33370: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:347\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:396\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func4:570\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2022-02-21T06:29:03.206Z        ERROR   piecestore      download failed {"Piece ID": "L7C67O6DRNMYZHT4PH5KAPAN4TEYPPRAC65URSILBCTVOBS7PS5Q", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "write tcp 172.19.0.4:28967->5.161.82.73:50454: write: broken pipe", "errorVerbose": "write tcp 172.19.0.4:28967->5.161.82.73:50454: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:347\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:396\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func4:570\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2022-02-21T06:29:03.206Z        ERROR   piecestore      download failed {"Piece ID": "XJI4WOWHAXSOEUHHX4WGTMO2DXOMJ47P4W4XFWZQZEZGQ6ONHNLQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_REPAIR", "error": "write tcp 172.19.0.4:28967->23.88.104.55:41354: write: broken pipe", "errorVerbose": "write tcp 172.19.0.4:28967->23.88.104.55:41354: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:347\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:396\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func4:570\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2022-02-21T06:29:03.207Z        ERROR   piecestore      download failed {"Piece ID": "H2NVPTGGVRURVQZIOEKK7US7RYWJEAWUYIA6BUUQ43D4PYUACAXQ", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET_REPAIR", "error": "write tcp 172.19.0.4:28967->78.47.138.30:49808: write: broken pipe", "errorVerbose": "write tcp 172.19.0.4:28967->78.47.138.30:49808: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:347\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:396\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func4:570\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2022-02-21T06:29:03.210Z        ERROR   piecestore      download failed {"Piece ID": "QZK3G6LB7H2N7UK3K3CATVWOFJVO6WHNKZSSJPSZCVFZECV2C52Q", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "write tcp 172.19.0.4:28967->5.161.45.235:51592: write: broken pipe", "errorVerbose": "write tcp 172.19.0.4:28967->5.161.45.235:51592: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:347\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:396\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func4:570\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2022-02-21T06:29:03.267Z        ERROR   piecestore      download failed {"Piece ID": "SED73YBVQBOHNGDDYVX4UANVNNXUDCDDJ2QC5F4XB2SNFDBCGEUA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET_REPAIR", "error": "write tcp 172.19.0.4:28967->195.201.119.188:45604: write: broken pipe", "errorVerbose": "write tcp 172.19.0.4:28967->195.201.119.188:45604: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).rawWriteLocked:317\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:392\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:619\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2022-02-21T06:29:03.356Z        ERROR   piecestore      download failed {"Piece ID": "QZ7BYEOUAFKHPDCBCACOQ33W5EIE5OE76CKZGGPG3OKLBVYMEWCQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_REPAIR", "error": "write tcp 172.19.0.4:28967->157.90.17.108:60366: write: broken pipe", "errorVerbose": "write tcp 172.19.0.4:28967->157.90.17.108:60366: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:347\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:396\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func4:570\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2022-02-21T06:29:03.434Z        ERROR   piecestore      download failed {"Piece ID": "V5GMKU5LFJY3S6EIDJSHBURILUDFFNWYHOINLHZ5HZPB6UDNHHFQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_REPAIR", "error": "write tcp 172.19.0.4:28967->116.203.197.184:45942: write: broken pipe", "errorVerbose": "write tcp 172.19.0.4:28967->116.203.197.184:45942: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:347\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:396\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func4:570\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2022-02-21T06:29:03.481Z        ERROR   piecestore      download failed {"Piece ID": "NHBHTSHYFAGQH53RJ5EHVAXLX2RZMFUDGZ2F3XUZJH7476SXQBIA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "write tcp 172.19.0.4:28967->5.161.45.235:52622: write: broken pipe", "errorVerbose": "write tcp 172.19.0.4:28967->5.161.45.235:52622: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:347\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:396\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func4:570\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}

We are sending nodes into suspension and sending email to SNOs for network connectivity issues spanning less than 15 seconds now?

What?
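For anyone who wants to check the same thing on their own log, here is a rough Python sketch that pulls the failed GET_REPAIR lines out of a log and reports how many seconds they span and how many came from each satellite. It assumes the line layout shown above (timestamp first, JSON fields at the end). The function names are made up for illustration, the piece IDs in the sample are placeholders, and the `GET` line is a hypothetical entry added only to show the filter.

```python
import json
from collections import Counter
from datetime import datetime


def parse_failed_downloads(lines, action="GET_REPAIR"):
    """Return (timestamp, fields) for each 'download failed' line with the given Action."""
    events = []
    for line in lines:
        if "download failed" not in line or "{" not in line:
            continue
        stamp = line.split()[0]  # e.g. 2022-02-21T06:28:51.778Z
        try:
            fields = json.loads(line[line.index("{"):])
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if fields.get("Action") != action:
            continue
        when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
        events.append((when, fields))
    return events


def summarize(events):
    """Return (seconds between first and last event, count per satellite)."""
    times = [when for when, _ in events]
    span = (max(times) - min(times)).total_seconds() if times else 0.0
    per_satellite = Counter(fields["Satellite ID"] for _, fields in events)
    return span, per_satellite


# Shortened sample: timestamps and satellite IDs are from the log above,
# piece IDs are placeholders, and the GET line is hypothetical.
sample = [
    '2022-02-21T06:28:51.778Z\tERROR\tpiecestore\tdownload failed {"Piece ID": "PIECE-A", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET_REPAIR", "error": "write: broken pipe"}',
    '2022-02-21T06:28:52.085Z\tERROR\tpiecestore\tdownload failed {"Piece ID": "PIECE-B", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "write: broken pipe"}',
    '2022-02-21T06:29:00.000Z\tERROR\tpiecestore\tdownload failed {"Piece ID": "PIECE-C", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "error": "write: broken pipe"}',
    '2022-02-21T06:29:03.481Z\tERROR\tpiecestore\tdownload failed {"Piece ID": "PIECE-D", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "write: broken pipe"}',
]

span, per_satellite = summarize(parse_failed_downloads(sample))
print(f"{sum(per_satellite.values())} failed GET_REPAIRs over {span:.3f} s")
for satellite, count in per_satellite.most_common():
    print(f"  {satellite}: {count}")
```

To run it against a real log, pass the open file instead of `sample`, e.g. `parse_failed_downloads(open("node.log"))`, where `node.log` stands for wherever your node's log is written.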

2 posts were merged into an existing topic: Now it hit me: ‘Your node has been suspended’

3 posts were merged into an existing topic: Suspension score dropped with no errors, then reverted

Just faced the same problem. Suddenly, while the node was operating, I got the message “Your node has been suspended”…


What’s happening?

i’ve also seen suspension emails, but i don’t think that was without cause… i’ve had a few system issues, most likely due to the high traffic we are seeing.

the graph showing near 15k/all nodes suspended is kinda crazy… not sure if that can be right…

but we have been seeing unusually high traffic, so everyone whose nodes are only borderline able to keep up would see issues now.

i have resolved my issues: i had to remove some 16GB RAM modules which were causing some sort of memory issue with my zfs, so i’ll keep an eye on whether i see any more suspension emails…

however i think the suspended nodes most likely just aren’t able to keep up with the traffic io,
which in turn could cause a cascade-like effect: the ingress would move to the remaining nodes, which would then be overloaded in turn, putting more ingress load on the even fewer remaining nodes.

the traffic might not look like much, but it represents a substantial io load.

nearly 1/3 of all files are 4k or less, so 1 IO each.
so of 400KB/s avg ingress about 133KB/s would be small files, written in a ratio of roughly 2 x 1k files
to 1 x 2k and 1 x 4k… so those 133KB/s of writes equal on avg about 38 writes for the 1k files, 19 writes for the 2k files and 19 writes for the 4k files.

giving us about 76 write IO per second.

so that’s just accounting for the small-file 1/3 of the ingress io. ofc the io count for the larger files would be lower… but hdd’s aren’t good at this type of random io… even the top tier ones will struggle at these levels.

and we can basically double that number because there would also be reads… so 152. and ofc as you can see on my graphs, my egress is much higher than the ingress, so if we call that 1MB/s and apply the same ratio, then we can just multiply the 76 by 2.5, so just dealing with my less-than-4k IO would be 190 IO, and that’s just the heavy 1/3 of the traffic…

ofc the real question is whether it caused the suspensions…
but the random write IOPS of an SMR HDD can be something like 40 under sustained load…
even a good HDD will peak at around 400 at the best of times, and those are optimistic numbers.

not saying this is exactly right, it’s a very rough back-of-the-napkin estimate using numbers i had easily accessible.
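if anyone wants to plug in their own numbers, the estimate can be written out as a few lines of Python. this is only a sketch under the same assumptions as above: 1/3 of the ingress bytes are small files, they come in a 2:1:1 mix of 1k, 2k and 4k files, and each file costs 1 IO. the function name and parameters are made up for illustration. with exactly these inputs it comes out at about 67 writes per second rather than 76, and about 167 for 1MB/s of egress rather than 190, which is the same ballpark and doesn't change the point.

```python
def small_file_ios(throughput_kb_s, small_share=1 / 3, mix=((1, 2), (2, 1), (4, 1))):
    """Estimate IO per second for the small-file part of the traffic.

    mix is a sequence of (file size in KB, relative count); the default is
    the assumed 2 x 1k : 1 x 2k : 1 x 4k ratio, with 1 IO per file.
    """
    small_kb_s = throughput_kb_s * small_share
    # KB moved by one "group" of files in the mix: 2*1 + 1*2 + 1*4 = 8 KB
    kb_per_group = sum(size_kb * count for size_kb, count in mix)
    groups_per_s = small_kb_s / kb_per_group
    return {size_kb: groups_per_s * count for size_kb, count in mix}


ingress = small_file_ios(400)    # 400 KB/s average ingress
egress = small_file_ios(1000)    # egress called 1 MB/s

write_iops = sum(ingress.values())
read_iops = sum(egress.values())

for size_kb, per_s in ingress.items():
    print(f"{size_kb}k files: {per_s:.1f} writes/s")
print(f"small-file writes: {write_iops:.1f} IO/s")
print(f"small-file reads:  {read_iops:.1f} IO/s")
print(f"writes vs. an SMR budget of 40 IO/s: {write_iops / 40:.1f}x")
```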

Not directly addressing the original question, but for ZFS users this is perhaps a good argument for adding a separate, fast device as an SLOG if using a separate dataset for the DB? I know I have a couple of used SSDs lying around that might be better than nothing, but usually something like an NVMe or Optane is recommended. It only makes sense if this makes a difference for the HDD performance.

* Edit: SLOG may be best suited for the db dataset if the Storj db makes synchronous write calls. Can any programmers/developers comment on whether synchronous writes are used?

Your node was offline. You can check when:

The online score will recover after 30 days online. Each new downtime will require another 30 days online.