Suspended node message vague and needs some additional detail

thebadcat · May 14, 2020, 3:27pm

I received emails this morning indicating I was suspended from the following satellites:

us-central-1
europe-west-1
asia-east-1

The message has the following line:

You won’t receive any new data on this Satellite until you resolve the issue causing audit failures on your node.

However, there is no indication of which issues need to be fixed to address these audit failures. As it stands, the message is of little value.

Regards,
Gary

nerdatwork · May 14, 2020, 3:59pm

Check your log for entries that contain both “download failed” and “GET_AUDIT”. These are audit failures.

thebadcat · May 14, 2020, 4:08pm

When I run the following command it shows 50 GET_AUDIT entries:

docker logs -t storagenode 2>&1 | grep -c GET_AUDIT

When I look at these specific log entries they are all like the following:

docker logs -t storagenode 2>&1 | grep GET_AUDIT

2020-05-14T15:44:10.942145874Z 2020-05-14T15:44:10.941Z INFO piecestore download started {"Piece ID": "NPQD2DRYPNYC7DPSIAPPJ7NTBWBKLEZGYACFSR3YJCO7DN5ITFSQ", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}

2020-05-14T15:44:11.138385877Z 2020-05-14T15:44:11.137Z INFO piecestore downloaded {"Piece ID": "NPQD2DRYPNYC7DPSIAPPJ7NTBWBKLEZGYACFSR3YJCO7DN5ITFSQ", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}

2020-05-14T15:51:55.012695688Z 2020-05-14T15:51:55.011Z INFO piecestore download started {"Piece ID": "QJNJFJKSQJZTYP3W7D64DRFPHUACZAUAZ3RREVIEJLJTD5SVA5LA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}

2020-05-14T15:51:55.225571664Z 2020-05-14T15:51:55.223Z INFO piecestore downloaded {"Piece ID": "QJNJFJKSQJZTYP3W7D64DRFPHUACZAUAZ3RREVIEJLJTD5SVA5LA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}

These do not indicate to me that there are errors. What am I missing?

nerdatwork · May 14, 2020, 5:05pm

You are missing the second keyword. It’s “download failed” AND “GET_AUDIT” together.

thebadcat · May 14, 2020, 5:21pm

@nerdatwork, thx…

There are also 11 entries like this one:

2020-05-14T16:56:41.142764658Z 2020-05-14T16:56:41.141Z ERROR piecestore download failed {"Piece ID": "IGJOLYY3BTNAK3BNYMYYFTTDWLKNSMI3OZMRSIXAWSR3T7Q4QZFA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET", "error": "write tcp 172.17.0.3:28967->35.236.66.70:50224: use of closed network connection", "errorVerbose": "write tcp 172.17.0.3:28967->35.236.66.70:50224: use of closed network connection\n\tstorj.io/drpc/drpcstream.(*Stream).pollWrite:221\n\tstorj.io/drpc/drpcwire.SplitN:29\n\tstorj.io/drpc/drpcstream.(*Stream).RawWrite:276\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:318\n\tstorj.io/common/pb.(*drpcPiecestoreDownloadStream).Send:1080\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).doDownload.func5.1:640\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}

What does “use of closed network connection” mean, and how should I fix it? Is there a timeout value I can adjust for whatever connection this refers to?

nerdatwork · May 14, 2020, 5:44pm

That error is shown when the timeout is hit for that operation. The TCP connection is closed, which results in the “download failed” message.

Storj is looking into it.

donald.m.motsinger · May 14, 2020, 6:04pm

Try:

docker logs -t storagenode 2>&1 | grep "GET_AUDIT" | grep "download failed"

thebadcat · May 14, 2020, 6:08pm

@donald.m.motsinger thx… I’ve already used the following:

docker logs -t storagenode 2>&1 | grep -E "download failed|GET_AUDIT"

It achieves the same thing… well, it gives both…

donald.m.motsinger · May 14, 2020, 6:36pm

No, look closer. My command returns all lines where both terms occur. Your command returns all lines where either term occurs.
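The difference between the two commands can be seen on a small sample. This is a minimal sketch, assuming four shortened, made-up log lines in place of real `docker logs` output; the helper names `sample_log`, `both_terms` and `either_term` are hypothetical.

```shell
#!/bin/sh
# Made-up, shortened lines standing in for `docker logs -t storagenode 2>&1`.
sample_log() {
cat <<'EOF'
INFO piecestore download started {"Action": "GET_AUDIT"}
INFO piecestore downloaded {"Action": "GET_AUDIT"}
ERROR piecestore download failed {"Action": "GET"}
ERROR piecestore download failed {"Action": "GET_AUDIT"}
EOF
}

# AND: chained greps keep only lines that contain both terms.
both_terms() {
    grep "GET_AUDIT" | grep "download failed"
}

# OR: extended-regex alternation keeps lines that contain either term.
either_term() {
    grep -E "download failed|GET_AUDIT"
}

# `grep -c .` counts the non-empty lines that survive each filter.
echo "both:   $(sample_log | both_terms | grep -c .)"    # both:   1
echo "either: $(sample_log | either_term | grep -c .)"   # either: 4
```

Only the chained form isolates failed audits; the alternation also matches successful audits and failed ordinary downloads.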

thebadcat · May 14, 2020, 6:41pm

@donald.m.motsinger

Yeah, you are correct. However, there are no lines that have both.

donald.m.motsinger · May 14, 2020, 6:43pm

That’s a good thing. It means no audit failures.

thebadcat · May 14, 2020, 6:59pm

@nerdatwork

Sorry, what do you mean by “Storj is looking into it”?

Is this being looked at by an official Storj support person or group?

Alexey · May 14, 2020, 7:19pm

There is a known bug regarding the “database is locked” error, but you do not have such errors.

thebadcat · May 14, 2020, 7:34pm

@Alexey

Any idea what this error could be? Are there any knobs I could set to adjust whatever is timing out?

BrightSilence · May 14, 2020, 8:10pm

Did you recently recreate the container? If so, you may not have any logs that actually show the error at this point. Most likely it’s the “database is locked” issue. This is something that needs to be fixed in software, but you can vacuum the usedserials.db and defrag it. For some users that eliminates the problem.
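For reference, vacuuming a SQLite database file generally looks like the sketch below. It assumes the `sqlite3` command-line tool is installed and that the node has been stopped first so nothing else has the file open. The `vacuum_db` helper name and the example path are hypothetical and not taken from this thread; it covers only the vacuum step, not the defrag.

```shell
#!/bin/sh
# vacuum_db: hypothetical helper that rebuilds a SQLite database file in place.
# Assumes the storage node is stopped, so no other process holds the file.
vacuum_db() {
    db="$1"
    if [ ! -f "$db" ]; then
        echo "no such database: $db" >&2
        return 1
    fi
    # Keep a copy next to the original in case anything goes wrong.
    cp -p "$db" "$db.bak" || return 1
    # VACUUM rewrites the database file, dropping unused pages.
    sqlite3 "$db" "VACUUM;"
}

# Example call (the path is an assumption; use your node's storage location):
# vacuum_db /path/to/storage/usedserials.db
```

The backup copy is a precaution rather than a requirement; remove it once the node is running normally again.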

thebadcat · May 14, 2020, 8:18pm

@BrightSilence

Yes, I did remove/pull/restart after receiving the emails, so I suppose you are correct: the real error messages have been thrown away.

Anyway, how do I do this vacuum and defrag you speak of? And after that how do I remove the suspension?

BTW, knowing that there is a known bug out there, shouldn’t there be some sort of grace period before being suspended?

Alexey · May 14, 2020, 8:24pm

Yes, the disqualification that would follow a suspension is disabled.

thebadcat · May 14, 2020, 8:41pm

@Alexey

I’ve run the commands above…

How can I tell my suspension status?

Alexey · May 14, 2020, 9:03pm

Take a look at the dashboard; it will show on which satellites (if any) your node is suspended.

BrightSilence · May 14, 2020, 9:05pm

Please be aware it can take some time until the suspension is resolved. You need to respond successfully to a few audits before it recovers.