Got Disqualified from saltlake

Yes, that’s why I said the node had a silent issue that I did notice.

Until 06/08 around 10:00 everything looks fine and the node responds to GET_AUDIT requests with “downloaded”.
Then it stops responding successfully, and I see multiple audits that have “download started” but no matching “downloaded”.
The satellite keeps requesting audits.
Between 15:16 and 18:00 there are no audits at all; the node was probably unable to handle anything due to the silent error. It resumed around 18:00, and the satellite picked up requesting pieces for audits again.
Starting at 18:13 I see it requesting the same piece for auditing over and over (a quick grep sketch for spotting this pattern follows the log excerpt):

2021-08-06T18:13:31.696Z        INFO    piecestore      download started        {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T18:46:22.176Z        INFO    piecestore      download started        {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T18:55:42.670Z        INFO    piecestore      download started        {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T19:02:44.963Z        INFO    piecestore      download started        {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T19:06:42.317Z        INFO    piecestore      download started        {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T19:40:15.145Z        INFO    piecestore      download canceled       {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T19:42:53.615Z        INFO    piecestore      download canceled       {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T19:47:26.911Z        INFO    piecestore      download canceled       {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T19:49:06.452Z        INFO    piecestore      download canceled       {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T20:02:53.329Z        INFO    piecestore      download started        {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T20:06:52.313Z        INFO    piecestore      download started        {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T20:16:46.677Z        INFO    piecestore      download started        {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T20:17:19.117Z        INFO    piecestore      download started        {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T20:45:40.431Z        INFO    piecestore      download started        {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T21:10:01.886Z        INFO    piecestore      download started        {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T21:10:54.685Z        INFO    piecestore      download started        {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T21:20:38.667Z        INFO    piecestore      download started        {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T21:28:12.355Z        INFO    piecestore      download started        {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T21:38:27.972Z        INFO    piecestore      download started        {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T22:21:45.268Z        INFO    piecestore      download canceled       {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T22:27:18.063Z        INFO    piecestore      download canceled       {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T22:30:46.174Z        INFO    piecestore      download canceled       {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T22:30:46.676Z        ERROR   piecestore      download failed {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT", "error": "tls: use of closed connection", "errorVerbose": "tls: use of closed connection\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:276\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:318\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:302\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:608\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2021-08-06T22:30:46.773Z        ERROR   piecestore      download failed {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT", "error": "tls: use of closed connection", "errorVerbose": "tls: use of closed connection\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:276\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:318\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:302\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:608\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2021-08-06T22:30:49.735Z        INFO    piecestore      download canceled       {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2021-08-06T22:30:50.037Z        INFO    piecestore      download canceled       {"Piece ID": "FIQ3PWFNVTUIGT5NYSSEYKLEZ3IUR3UU3IIHPIOAYLTRWZVTZATA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
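If it helps anyone else, here is a rough sketch of how such “started but never finished” GET_AUDIT pieces can be pulled out of the log. The container name is a placeholder and the log layout is assumed to match the excerpt above:

# List GET_AUDIT piece IDs that logged "download started" but never "downloaded".
# "storagenode" is a placeholder container name; adjust to your setup.
docker logs storagenode 2>&1 | grep GET_AUDIT > audit.log
grep "download started" audit.log | grep -o '"Piece ID": "[^"]*"' | sort -u > started.txt
grep -w "downloaded"    audit.log | grep -o '"Piece ID": "[^"]*"' | sort -u > finished.txt
comm -23 started.txt finished.txt    # pieces requested for audit that never completed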

For Europe-North the log shows a similar pattern.
I think I restarted the node the next day, but it was too late: the node had already been disqualified on both satellites. So in the end, disqualification took less than 24 hours from correctly responding to audits to being disqualified.

Then there is nothing we can do. If the audit score drops below 60%, it’s permanent.
The reason: we cannot be sure that all pieces are intact. We have discussed that case many times on the forum.

I would not suggest searching for the reason on the satellites - in your case it’s more obvious than in the case of @allenyllee. The problem is that such requests consume a lot of research time and cannot be made very often at this time.
I hope the Team will implement an easy tool for investigation, or maybe logging on the storagenode’s side too (though that could make life easier for bad actors).

Oh, I haven’t mentioned that not only saltlake encountered this problem. In fact, almost all satellites dropped the audit score during that time. If I had found this situation too late, the node would have been disqualified on all the other satellites too. But the logs don’t show anything strange for the other satellites either. So if some condition caused this problem, it would be a systematic failure. If malware could trigger this situation across the whole Storj network, it could break down the whole SNO ecosystem permanently, with no way to recover.

So, to avoid this kind of situation, I think there should be some recovery mechanism to recover from this “disqualification attack”.

It’s unlikely to be a systematic failure - only your node is affected, which usually suggests a problem with the local configuration.
You can see that your logs lack audits during that time. This means the node was not responsive enough to provide a piece for audit after three attempts with a 5-minute timeout each, and it did the same for other pieces over a span of 3 hours (maybe more, but after 3 hours of being unresponsive it was disqualified).
I would suggest checking which part of the hardware or software could be involved.
Just note: your node was online and answered audit requests, but it didn’t log them and didn’t provide the pieces.
So the disk is the first subject of investigation.
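For example, a minimal sketch of what to check first on a Linux host (the device name and mount path are placeholders for your actual setup):

# Does the disk report SMART problems, and does a simple read hang?
sudo smartctl -H /dev/sdX                                          # overall health verdict
sudo smartctl -A /dev/sdX | grep -Ei 'pending|realloc|uncorrect'   # suspicious counters
time ls /mnt/storagenode/storage/blobs > /dev/null                 # a stalled read points to the disk
dmesg | grep -iE 'i/o error|ata|reset' | tail                      # kernel-level disk errors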

It seems we have a similar case here, in which it took 4 hours until disqualification:

And I am really wondering whether Storj expects its home-user SNOs to spot and fix an issue within 4 hours (12 hours in my case) to prevent permanent disqualification?


I reported a similar support case (request 11735) on Monday, 2021-08-02. It took just a few hours for the audit score to drop below 60%; the evening before, all was fine.

At 7:41 am I got the “Your Node Is Disqualified on the saltlake Satellite” mail.

At 10:00 am the next mail: “Your Node Is Disqualified on the asia-east-1 Satellite”.

No more logs were available, so I couldn’t check. Restarting prevented the other satellites from disqualifying the node.

In the following days I saw more node-disqualification forum posts, so I presume it’s not only a local SNO issue; the satellite/software might be involved.


It’s not related to hours, it’s related to failing audits. We cannot demand that SNOs set up monitoring, but it is up to the SNO to keep their node running and healthy. If there are problems with the hardware, OS, or software, it’s better to fix them.

For example, my own Windows 10-based home server started falling into BSODs when Windows tried to forcibly upgrade itself to the next version; thanks to the MS developers, it was smart enough to detect that and roll back the upgrade. However, the scheduled forced upgrade did not go away and it kept trying again and again. I noticed that and was a little bit angry; as a result, I no longer have Windows 10 there. Now there is Ubuntu. All my nodes are migrated, but their online score is affected, and some audit and suspension scores too - some files were corrupted by these Windows upgrades.
I do not have special monitoring; I just check the status from time to time to make sure that it’s still running and responsive.
I could set up monitoring, but it’s not required, so I won’t. I want to have an “average setup” to be sure that it’s possible for others.

Hello @zed,
Welcome to the forum!

We are currently not aware of any “mass error situation” related to audit failures.
It’s not more than usual, see https://storjstats.info/

Yeah.

And Storj could help with that by pre-loading some simple monitoring into the node software itself. The lone fact that they don’t is reassuring, as it means that in Storj’s view the risk is actually very low.

On the other hand, our Community provides many monitoring configurations: Search results for '#sno-category:sno-start tags:monitoring' - Storj Community Forum (official)

By the way - do you know whether the dashboard or API are still available when the node cannot provide pieces for audits?

When I had the situation where the disk had been dismounted because of a general protection fault, neither the dashboard nor the API was available. I should say that the node was still able to answer audit requests, but it returned an error and the suspension score dropped a little. Then it was stuck in a restart cycle: the pre-flight check detects the disk failure and shuts down the node, docker restarts it, repeat.
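In that state, even a trivial external probe of the dashboard API would at least have flagged the problem. A minimal sketch, assuming the default dashboard port 14002 and the /api/sno endpoint (adjust the port and the notification command to your setup):

# Alert if the local storagenode API stops answering.
if ! curl -sf --max-time 10 http://localhost:14002/api/sno > /dev/null; then
    echo "storagenode dashboard/API not responding" | mail -s "node alert" you@example.com
fi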

I understand that there is no time limit set by Storj. But the result is that nodes can get disqualified in as little as 4 hours. I think that is totally unacceptable.

So maybe you have been lucky. But this example shows the problem: operating a Storj node turns into a lottery with the odds against the SNO. What you are describing is a perfect example of how every single Windows 10 update could turn a node into a zombie node through no fault of the SNO. And if that happens, it seems to be pure luck what kind of failure occurs, whether it leads to failing audits, and whether you are able to spot and fix the condition in time before permanent disqualification.

I’ll say it again: SNOs are home users with home-grade equipment, and you cannot treat them like professionals who monitor their nodes 24/7 and have the personal availability, the technical skills, and the hardware replacement options that professionals do.

Storj should find a solution to avoid such quick disqualifications within hours. If audits are failing, then instead of quick disqualification the node should go offline/into suspension so the node operator can have a look at it and fix the issue. If it cannot be fixed within a deadline, then disqualification can happen.


Perhaps this could help:

But we need to create a feature request for that to give it to the Team.


@BrightSilence
I am just trying to figure out whether the successrate script would have helped to warn before disqualification.
For that I have grepped for the Saltlake satellite and the day in question, to produce log files that contain only Saltlake results for a specific day.
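For reference, the filter was roughly this (the log path is a placeholder; the satellite ID is the Saltlake one from the log excerpt above):

# Keep only Saltlake lines for one day, then feed them to the script.
grep "2021-08-06" /path/to/node.log | grep "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE" > saltlake_2021-08-06.log
./successrate.sh saltlake_2021-08-06.log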

The results are:
5th of August
========== AUDIT =============
Successful: 59
Recoverable failed: 1
Unrecoverable failed: 0
Success Rate Min: 98.333%
Success Rate Max: 100.000%

6th of August (date of disqualification)
========== AUDIT =============
Successful: 40
Recoverable failed: 2
Unrecoverable failed: 0
Success Rate Min: 95.238%
Success Rate Max: 100.000%

So according to these results the node looked fine. Does the script take into account audits that have timed out and uploads/downloads that have been canceled? If not, maybe it would be an idea to count canceled uploads and downloads as well, as these might hint that there is an issue with the node?

As an idea, it would be handy if the successrate command accepted a date and satellite as optional parameters, something like: ./successrate.sh /path_to_logfile -s SATELLITE_ID -d DATE

It doesn’t take into account transfers that started but never finished; it only looks at result lines. But I don’t think it matters. From what I understand, neither is logged in this unresponsive state.

I’d say it depends on the state the node is in. I see canceled downloads and uploads in the logs as well.
It is just an idea that it would help to diagnose a node if transfers that have timed out or been cancelled were displayed as well.
I don’t know how the satellite takes those into account for disqualification, but it would certainly ring an alarm bell for the SNO if the rate of broken transfers suddenly jumps.

Canceled downloads for GET_AUDIT are treated as not succeeded; the node will be put into containment mode and will be asked for the same piece two more times. If those attempts do not succeed either, the audit will be accounted as failed.

Canceled is a result line, so those are taken into account. Though you may be using an old version of the script, as they aren’t listed in your results. Get the latest one here: Success rate script - Now updated for new delete terminology since v1.29.3

I don’t think I’ve ever seen a cancelled GET_AUDIT, so I think @jammerdan was talking about normal uploads and downloads. In that case they don’t count against your score. The satellite isn’t even really aware of those, as they are direct transfers between the uplink and the node. They may still be an early indication of trouble, though. The satellite only uses audits (and I believe now also repair) to determine your node’s score, and it doesn’t give up on those for a long time. So by the time it does, I think you can be pretty sure your node is actually unresponsive and also won’t log it.

You are right, the new version shows the cancellations as well.

I am starting to think that this script could become really helpful for detecting problems before disqualification.
Now that you say there might be no logging in an unresponsive state, could the script detect such a condition? For example, when the time between log entries grows, it could report that as well.

If a node ran a script like this every 10 minutes and compared the results against predefined values or previous results, it might provide an early indication that something is wrong. Based on that, SNOs could be warned, or even automated actions could be taken, like taking the node offline (a rough sketch of such a check follows below).
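A rough sketch of the log-gap idea, assuming a docker setup and GNU date (the 10-minute threshold is arbitrary):

# Warn if the node has not written a single log line for more than 10 minutes.
last_ts=$(docker logs --tail 1 storagenode 2>&1 | awk '{print $1}')
last_epoch=$(date -d "$last_ts" +%s 2>/dev/null || echo 0)
now_epoch=$(date +%s)
if [ $(( now_epoch - last_epoch )) -gt 600 ]; then
    echo "No log activity for over 10 minutes - node may be unresponsive"
fi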

I mean, that would be a completely different script, but I suppose something like that could be built. The question is: if the node is unresponsive, what makes you think that script would still work? It has to read rather massive log files, so if there is an I/O issue, that would definitely not work either.

That’s the most difficult question. I have no suggestions for how to make a hanging node shut itself down if it’s only partially functioning. Moreover, there are gaps even in syslog.
I met such a problem after a general protection fault on an Ubuntu system. The system was “working”, but it constantly printed the panic log on the screen, and the only option was to reset it.