Suspended node message vague and needs some additional detail

I received emails this morning indicating I was suspended from the following satellites:

  • us-central-1
  • europe-west-1
  • asia-east-1

The message has the following line:

You won’t receive any new data on this Satellite until you resolve the issue causing audit failures on your node.

However, there is no indication of what issue needs to be fixed to address these audit failures. As written, the message is of little value.

Regards,
Gary

Check your log for download failed and GET_AUDIT entries. These are audit failures.

When I run the following command it shows 50 GET_AUDIT entries:

docker logs -t storagenode 2>&1 | grep -c GET_AUDIT

When I look at these specific log entries they are all like the following:

docker logs -t storagenode 2>&1 | grep GET_AUDIT

2020-05-14T15:44:10.942145874Z 2020-05-14T15:44:10.941Z INFO piecestore download started {"Piece ID": "NPQD2DRYPNYC7DPSIAPPJ7NTBWBKLEZGYACFSR3YJCO7DN5ITFSQ", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}

2020-05-14T15:44:11.138385877Z 2020-05-14T15:44:11.137Z INFO piecestore downloaded {"Piece ID": "NPQD2DRYPNYC7DPSIAPPJ7NTBWBKLEZGYACFSR3YJCO7DN5ITFSQ", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}

2020-05-14T15:51:55.012695688Z 2020-05-14T15:51:55.011Z INFO piecestore download started {"Piece ID": "QJNJFJKSQJZTYP3W7D64DRFPHUACZAUAZ3RREVIEJLJTD5SVA5LA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}

2020-05-14T15:51:55.225571664Z 2020-05-14T15:51:55.223Z INFO piecestore downloaded {"Piece ID": "QJNJFJKSQJZTYP3W7D64DRFPHUACZAUAZ3RREVIEJLJTD5SVA5LA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}

These do not indicate to me that there are errors. What am I missing?

You are missing the second keyword. It’s “download failed” AND “GET_AUDIT” together, on the same line.

@nerdatwork, thx…

There are also 11 of these entries:

2020-05-14T16:56:41.142764658Z 2020-05-14T16:56:41.141Z ERROR piecestore download failed {"Piece ID": "IGJOLYY3BTNAK3BNYMYYFTTDWLKNSMI3OZMRSIXAWSR3T7Q4QZFA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET", "error": "write tcp 172.17.0.3:28967->35.236.66.70:50224: use of closed network connection", "errorVerbose": "write tcp 172.17.0.3:28967->35.236.66.70:50224: use of closed network connection\n\tstorj.io/drpc/drpcstream.(*Stream).pollWrite:221\n\tstorj.io/drpc/drpcwire.SplitN:29\n\tstorj.io/drpc/drpcstream.(*Stream).RawWrite:276\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:318\n\tstorj.io/common/pb.(*drpcPiecestoreDownloadStream).Send:1080\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).doDownload.func5.1:640\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}

What does use of closed network connection mean and how would/should I fix it? Is there some sort of timeout value needed for whatever connection this may be referring to that I can adjust?

That error is shown when the timeout is hit for that operation. The TCP connection is closed, which results in the “download failed” message.

Storj is looking into it.

Try docker logs -t storagenode 2>&1 | grep "GET_AUDIT" | grep "download failed"

@donald.m.motsinger thx… I’ve already used the following:

docker logs -t storagenode 2>&1 | grep -E "download failed|GET_AUDIT"

it achieves the same thing… well it gives both… :slight_smile:

No, look closer. My command returns only the lines where both terms occur. Your command returns all lines where either term occurs.
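To make the difference concrete, here is a minimal sketch that runs both filters over four sample lines. The lines are hypothetical and shortened from the log format quoted in this thread; they are not real node output.

```shell
# Hypothetical sample lines, shortened from the log format shown in this thread.
sample_log='INFO piecestore download started {"Action": "GET_AUDIT"}
INFO piecestore downloaded {"Action": "GET_AUDIT"}
ERROR piecestore download failed {"Action": "GET"}
ERROR piecestore download failed {"Action": "GET_AUDIT"}'

# OR: count lines that contain either term (alternation inside one grep).
either_count=$(printf '%s\n' "$sample_log" | grep -c -E 'download failed|GET_AUDIT')

# AND: count lines that contain both terms (one grep chained after the other).
both_count=$(printf '%s\n' "$sample_log" | grep 'GET_AUDIT' | grep -c 'download failed')

echo "either: $either_count, both: $both_count"
```

On this sample the alternation matches all four lines, while the chained filter matches only the last one, which is the only line that would count as a failed audit.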

@donald.m.motsinger

Yeah, you are correct. However, there are no lines that have both.

That’s a good thing. It means no audit failures.
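Since the suspension emails named several satellites, it may help to tally failed audits per satellite. The following is only a sketch: the function name `audit_failures_by_satellite` is made up for illustration, and it assumes the real log uses plain double quotes around keys and values, as in the entries quoted earlier in this thread.

```shell
# Hypothetical helper: reads log lines on stdin and prints, per Satellite ID,
# how many lines contain both GET_AUDIT and "download failed".
audit_failures_by_satellite() {
  grep 'GET_AUDIT' | grep 'download failed' \
    | grep -o '"Satellite ID": "[^"]*"' \
    | sort | uniq -c
}

# Sample run on made-up lines (sat1 and sat2 are placeholder IDs).
printf '%s\n' \
  'ERROR piecestore download failed {"Piece ID": "A", "Satellite ID": "sat1", "Action": "GET_AUDIT"}' \
  'INFO piecestore downloaded {"Piece ID": "B", "Satellite ID": "sat1", "Action": "GET_AUDIT"}' \
  'ERROR piecestore download failed {"Piece ID": "C", "Satellite ID": "sat2", "Action": "GET"}' \
  | audit_failures_by_satellite
```

Against a running node the same function could be fed with docker logs -t storagenode 2>&1 | audit_failures_by_satellite. No output means no line matched both terms.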

@nerdatwork

Sorry, what do you mean by “Storj is looking into it”?

Is this being looked at by an official Storj support person or group?

There is a known bug regarding the “database is locked” error, but you do not have such errors.

@Alexey

Any idea what this error could be? Are there any knobs I could set to adjust whatever is timing out?

Did you recently recreate the container? If so you may not have any logs that actually show the error at this point. Most likely it’s the database locked issue. This is something that needs to be fixed in software, but you can vacuum the usedserials.db and defrag it. For some users that eliminates the problem.
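As a rough sketch of the vacuum step, the sqlite3 command-line tool can run SQLite's VACUUM statement on a database file. This is not an official procedure: the helper name `vacuum_db` is made up, the location of usedserials.db depends on your own storage path, and the node should be stopped first (for example with docker stop) so the file is not in use. Defragmenting is a separate, filesystem-specific step that is not shown here.

```shell
# Hypothetical helper: compacts one SQLite database file in place.
# Assumes the sqlite3 CLI is installed and the storage node is stopped.
vacuum_db() {
  db_path="$1"
  if [ ! -f "$db_path" ]; then
    echo "no such file: $db_path" >&2
    return 1
  fi
  # VACUUM rebuilds the database file and reclaims unused space.
  sqlite3 "$db_path" 'VACUUM;'
}
```

It would be called as vacuum_db /path/to/storage/usedserials.db, where the path is a placeholder for wherever your node keeps its databases. Taking a copy of the file beforehand is a sensible precaution.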

@BrightSilence

Yes, I did remove/pull/restart after receiving the emails, so I suppose you are correct that the real error messages have been thrown away.

Anyway, how do I do this vacuum and defrag you speak of? And after that how do I remove the suspension?

BTW, knowing that there is a known bug out there, shouldn’t there be some sort of grace period before being suspended?

Yes, the disqualification that would follow a suspension is disabled.

@Alexey

I’ve run the commands above…

How can I tell my suspension status?

Take a look at the dashboard. It will show on which satellite (if any) your node is suspended.

Please be aware it can take some time until the suspension is resolved. You need to respond successfully to a few audits before it recovers.
