Suspended by satellites - how to clear that and why is it happening

EvilC0P · May 8, 2020, 3:06pm

Good day Operators,

I ran into an issue two nights ago: my drive dropped while I was asleep, and when I woke up my node was suspended by 3 satellites.
I fixed the issue, but 24 hours later I am still suspended, and I also got an email during the night saying that another satellite has suspended my node.

The only error I can see in my log is this:
2020-05-08T10:42:00.597-0400 INFO piecestore upload started {"Piece ID": "C4UADX2B4MNQQ77BD6ODPHIBQ4PNQ43IYBACOV2ZERIPFOAYEVMA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT", "Available Space": 19905158906880}
2020-05-08T10:42:01.851-0400 INFO piecestore upload canceled {"Piece ID": "C4UADX2B4MNQQ77BD6ODPHIBQ4PNQ43IYBACOV2ZERIPFOAYEVMA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT"}
2020-05-08T10:42:02.473-0400 INFO bandwidth Performing bandwidth usage rollups
2020-05-08T10:42:09.494-0400 INFO piecestore upload started {"Piece ID": "VANLBZXS3LSMLHKMEH7RX2SL5YW7TQIQSXPHVR2LZ6HC7OLSKMQA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT", "Available Space": 19905156587008}
2020-05-08T10:42:10.724-0400 INFO piecestore upload started {"Piece ID": "MXEX3JHJBEKDKFTU2JXGYFPBXPO4OWPL3264BT7A2PJZLH7XTDVA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT", "Available Space": 19905156587008}
2020-05-08T10:42:10.882-0400 INFO piecestore upload canceled {"Piece ID": "VANLBZXS3LSMLHKMEH7RX2SL5YW7TQIQSXPHVR2LZ6HC7OLSKMQA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT"}

I see it attempts an upload but then cancels it. There is ~19 TB free, as shown in the log message. I got many of these entries.

My node is running on this:
i5-7600, 8 GB DDR4, 120 GB M.2 for the OS with a RAID1 of 20TB (3x 8TB WD drives, 256 MB cache), Win10
I have a 500/500 Mbps fibre line.

How do I clear the suspensions (now I’m suspended from 4 satellites)?
What do I need to do or look for?

thank you for your help

BrightSilence · May 8, 2020, 3:19pm

You need to look for audit errors. Those would be lines that contain both "Action": "GET_AUDIT" and an error.

If you have indeed fixed the underlying issue, you just have to wait. Suspension scores need to recover before your node comes out of suspension. The only way to do that is to respond successfully to audits, which will take some time.

EvilC0P · May 8, 2020, 6:17pm

thanks for the hint,

I looked for GET_AUDIT and found these:

2020-05-08T08:24:07.983-0400	INFO	piecestore	download started	{"Piece ID": "CVYPFNDCOXUOPQBPX5NSHMVTRYPVLJBD4RG5PY6UGVWNWSIXJSTQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2020-05-08T08:24:08.116-0400	INFO	piecestore	downloaded	{"Piece ID": "CVYPFNDCOXUOPQBPX5NSHMVTRYPVLJBD4RG5PY6UGVWNWSIXJSTQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2020-05-08T08:24:17.144-0400	INFO	piecestore	upload started	{"Piece ID": "KY5G2WTSYXFDIW3VYQ4HRSNZUJLTROLP7DUUZ4XHXGU4DQM7372A", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT", "Available Space": 19905600560640}
2020-05-08T08:24:19.601-0400	INFO	piecestore	upload canceled	{"Piece ID": "KY5G2WTSYXFDIW3VYQ4HRSNZUJLTROLP7DUUZ4XHXGU4DQM7372A", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT"}
2020-05-08T08:24:46.341-0400	INFO	piecestore	upload started	{"Piece ID": "ETYBR5WLY2PRLVSCTUAHNX4YHDYOWHIDPCWPUGSSFOH4P5QYQGIA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT", "Available Space": 19905598240768}
2020-05-08T08:24:48.452-0400	INFO	piecestore	upload canceled	{"Piece ID": "ETYBR5WLY2PRLVSCTUAHNX4YHDYOWHIDPCWPUGSSFOH4P5QYQGIA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT"}

It doesn’t tell me why the uploads were cancelled. I see a bunch of uploads that worked, then a few that were cancelled. I guess it’s testing to see if it’s working?

I’m just trying to understand how this works. I am relatively new, so sorry for the noob questions.
thanks

BrightSilence · May 8, 2020, 6:25pm

Canceled uploads are normal, don’t worry about those. Only the lines with GET_AUDIT matter for this. The one audit you included went fine. You’re looking for a download failed line with GET_AUDIT.
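
A minimal sketch of that search, assuming a bash-like shell and that failed audits show up as log lines containing both GET_AUDIT and "download failed". The function name and the node.log path are made up for illustration. It counts failed audit lines per satellite, so you can see which satellites are affected:

```shell
# Hypothetical helper: reads node log lines on stdin and prints
# "<count> <satellite id>" for every satellite with failed audits.
# Assumes straight double quotes in the raw log (the forum may render them curly).
failed_audits_by_satellite() {
  grep 'GET_AUDIT' \
    | grep 'download failed' \
    | grep -o '"Satellite ID": "[^"]*"' \
    | cut -d'"' -f4 \
    | sort | uniq -c \
    | awk '{ print $1, $2 }'
}

# Usage (node.log is a placeholder for wherever your log is written):
#   failed_audits_by_satellite < node.log
```

No output means no failed audits were found in the part of the log you searched.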

Alexey · May 8, 2020, 8:16pm

Pac · August 14, 2020, 11:01pm

Hey there.

After all my disks got disconnected, all my nodes got suspended on the 12th of August (around 5~6 AM GMT).
Since then, I fixed the problem and all my nodes got back to normal in roughly 24H, except for the newest one, with the least data.

It is still suspended on 2 satellites.

I have had logging enabled on this particular node for the past 24 hours, and here is the result of the command suggested by @Alexey in "Suspension mode – Storj" (no result):

pi@raspberrypi:~ $ sudo docker logs storj_node_5 2>&1 | grep GET_AUDIT | grep failed
pi@raspberrypi:~ $

Without grep failed, the very few audits that were made seem okay:

pi@raspberrypi:~ $ sudo docker logs storj_node_5 2>&1 | grep GET_AUDIT
2020-08-14T14:21:12.697Z        INFO    piecestore      download started        {"Piece ID": "OXGGQKAVJBYTRJMQ45WW6NNYLBCHNU6QYQLTYR6S2QZ6DD2RQ6HQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_AUDIT"}
2020-08-14T14:21:13.414Z        INFO    piecestore      downloaded      {"Piece ID": "OXGGQKAVJBYTRJMQ45WW6NNYLBCHNU6QYQLTYR6S2QZ6DD2RQ6HQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_AUDIT"}
2020-08-14T21:12:47.957Z        INFO    piecestore      download started        {"Piece ID": "LTCSCK47U2CAIR7M4STUXJRWWMFBD5YLETWVD6G7OCMFBKH3SSFA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2020-08-14T21:12:48.718Z        INFO    piecestore      downloaded      {"Piece ID": "LTCSCK47U2CAIR7M4STUXJRWWMFBD5YLETWVD6G7OCMFBKH3SSFA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2020-08-14T22:13:58.980Z        INFO    piecestore      download started        {"Piece ID": "KVGZFHKMFIJ27VROELA4RYN7I7YP6S77Z5ELDFEKH7L7JJY7GXHQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2020-08-14T22:13:59.326Z        INFO    piecestore      downloaded      {"Piece ID": "KVGZFHKMFIJ27VROELA4RYN7I7YP6S77Z5ELDFEKH7L7JJY7GXHQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}

And here are the scores those 2 satellites have for this node:

pi@raspberrypi:~ $ for sat in $(sudo docker exec -i storj_node_5 wget -qO - localhost:14002/api/sno | jq -r '.satellites[].id'); do sudo docker exec -i storj_node_5 wget -qO - "localhost:14002/api/sno/satellite/$sat" | jq '.id,.audit'; done
[...]
"121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"
{
  "totalCount": 1,
  "successCount": 0,
  "alpha": 1,
  "beta": 0,
  "unknownAlpha": 0.95,
  "unknownBeta": 1,
  "score": 1,
  "unknownScore": 0.48717948717948717
}
[...]
"12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"
{
  "totalCount": 1,
  "successCount": 0,
  "alpha": 1,
  "beta": 0,
  "unknownAlpha": 0.95,
  "unknownBeta": 1,
  "score": 1,
  "unknownScore": 0.48717948717948717
}
[...]
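
For reference, the unknownScore values in the output above match alpha / (alpha + beta) computed from the unknownAlpha and unknownBeta fields: 0.95 / (0.95 + 1) ≈ 0.487. The sketch below reproduces that number and also estimates the score after a few successful audits. The update rule it uses (alpha = 0.95 * alpha + 1 and beta = 0.95 * beta per success) is an assumption chosen because it is consistent with the numbers shown here, not something confirmed in this thread, and the function names are made up:

```shell
# score ALPHA BETA: prints alpha / (alpha + beta), rounded to 3 decimals.
score() {
  LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", a / (a + b) }'
}

# after_successes ALPHA BETA N: score after N consecutive successful audits.
# ASSUMPTION: each success updates alpha = lam * alpha + 1, beta = lam * beta,
# with lam = 0.95. This rule is a guess that fits the values in the thread.
after_successes() {
  LC_ALL=C awk -v a="$1" -v b="$2" -v n="$3" -v lam=0.95 'BEGIN {
    for (i = 0; i < n; i++) { a = lam * a + 1; b = lam * b }
    printf "%.3f\n", a / (a + b)
  }'
}

# Usage:
#   score 0.95 1              # prints 0.487
#   after_successes 0.95 1 1  # prints 0.667
```

Under that assumed rule, a single successful audit from the satellite would lift the score from about 0.487 to about 0.667, so the limiting factor would be how soon the next audit arrives.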

Should I be concerned? Aren’t suspended nodes being frequently checked by satellites to find out whether they recovered or not? How long before this node gets disqualified if its suspension score does not improve (I believe it’s 7 days)?

Alexey · August 15, 2020, 7:27am

The audit frequency is the same for suspended nodes. If your node has started to pass audits, its reputation should recover.

Yes, the grace period is 7 days. Then it will be disqualified if not recovered.
If you do not see any improvement in the “unknown” metrics after a few hours, please enable debug-level logging and redirect the logs to disk.

Pac · August 15, 2020, 9:44am

Turns out it recovered, finally. Took more than 3 days to merely go above 60% for the suspension score (compared to just one day for the other ones to almost fully recover).

Maybe because it had almost no data.
Anyways, everything is back to normal, pheww. Time to add some crons to handle disk disconnections in the future… ^^’
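
A rough sketch of what such a cron check could look like, assuming a Docker node and that a dropped disk leaves the storage directory missing or empty. The function names, the DOCKER override variable, the script path, and the /mnt/storj_5 path are all made up for illustration; `docker stop -t 300` is the regular Docker CLI call with a 300-second timeout:

```shell
# disk_is_present DIR: succeeds only if DIR exists and is not empty.
# ASSUMPTION: a disconnected disk leaves an empty or missing mount point.
disk_is_present() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# guard_node DIR CONTAINER: stop the container if the storage looks gone.
# DOCKER can be overridden (e.g. DOCKER="sudo docker"); defaults to "docker".
guard_node() {
  if ! disk_is_present "$1"; then
    echo "storage missing at $1, stopping $2"
    ${DOCKER:-docker} stop -t 300 "$2"
  fi
}

# Example crontab entry (paths are placeholders), running every 5 minutes:
#   */5 * * * * . /home/pi/guard_node.sh && guard_node /mnt/storj_5 storj_node_5
```

The idea is that a stopped node only counts as offline, while a running node without its disk answers audits with errors.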

BrightSilence · August 15, 2020, 5:54pm

Is this in place yet?

There was some discussion around only disqualifying on the next failed audit after the grace period. But I’m not sure it was implemented like that. That would help in this scenario.

@Pac: It took longer on the newer node because audits are far less frequent if you have very little data. Eventually the score will reach 100 again, but it’ll just take time.

Pac · August 15, 2020, 6:31pm

That’s what I thought, got it. Thanks for confirming.

Which means that an almost empty node that has only just been started could potentially fail to recover from a suspension within 7 days. But… that wouldn’t really be a problem considering how new and empty it would be, so I think it’s not an issue in the end.