Suspension mode and disqualification emails

Later today we are going to send out a batch of automated emails to storage nodes in suspension mode and to disqualified storage nodes. You might receive several of these emails, one from each satellite. Here is what you should do if you get one.

First of all, give us feedback. Is the email useful for you? Is there something you would like us to change? Yes, the first email might arrive a bit late. In the future, we will send out these emails a few hours after you get disqualified or suspended. Is your storage node dashboard showing a suspension warning (new with v1.3.2) even though you didn’t receive the email warning?

If you receive the suspension email, it is time to fix your storage node. The suspension email is the last warning. Soon (in about 2 weeks) we will start disqualifying storage nodes that didn’t manage to exit suspension mode in time. Suspension mode uses a separate score that is neither displayed on the storage node dashboard nor available through the storage node API, so you can get suspended and later disqualified even with a 100% audit score. Please take the suspension warning seriously.

Now you might ask how you can get out of suspension mode. Please take a look at your storage node log to find out why audits are failing. If you need any help, you can ask here in the forum. A storage node in suspension mode will not get selected for new uploads but will still receive audit requests. If you can fix the problem on your storage node, you should get out of suspension mode within a few successful audits.
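If you want a quick overview of why audits are failing, something like the following could help. It is only a rough sketch: it assumes each log line ends in a JSON object with "Action" and "error" fields, like the log excerpt posted further down this thread. The function names and the sample lines are made up for illustration.

```python
import json
from collections import Counter


def failed_audits(lines):
    """Yield the JSON payload of every failed GET_AUDIT log line."""
    for line in lines:
        # Cheap text filter first, so most lines are never parsed.
        if "GET_AUDIT" not in line or "failed" not in line:
            continue
        start = line.find("{")
        if start == -1:
            continue
        try:
            fields = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if fields.get("Action") == "GET_AUDIT":
            yield fields


def summarize(lines):
    """Count failed audits per error message."""
    return Counter(f.get("error", "unknown") for f in failed_audits(lines))


# Made-up sample lines (assumed format, not real node output).
SAMPLE = [
    '2021-07-23T07:01:11.551-0700 ERROR piecestore download failed '
    '{"Piece ID": "EXAMPLE", "Action": "GET_AUDIT", "error": "file does not exist"}',
    '2021-07-23T07:01:12.000-0700 INFO piecestore downloaded '
    '{"Piece ID": "EXAMPLE", "Action": "GET_AUDIT"}',
    '2021-07-23T07:01:13.000-0700 ERROR piecestore upload failed '
    '{"Piece ID": "EXAMPLE", "Action": "PUT", "error": "context canceled"}',
]

if __name__ == "__main__":
    for error, count in summarize(SAMPLE).items():
        print(f"{count:5d}  {error}")
```

To run it against a real log, pass an open file object to `summarize` instead of `SAMPLE`. Keep in mind that not every audit error may show up in the log.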

What is the difference between suspension mode and disqualification?
Corrupted or missing data => permanent disqualification
Other audit errors => suspension mode with the option to recover
Don’t get out of suspension mode in time => permanent disqualification
(Credits to @BrightSilence for this easy explanation :slight_smile: )
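The three rules above can also be read as a small state machine. The sketch below is deliberately simplified: the state and event names are hypothetical, and a single event stands in for what is really a score crossing a threshold over several audits.

```python
from enum import Enum


class NodeState(Enum):
    OK = "ok"
    SUSPENDED = "suspended"
    DISQUALIFIED = "disqualified"


class Event(Enum):
    DATA_LOST = "corrupted or missing data"
    OTHER_AUDIT_ERROR = "other audit error"
    AUDITS_PASS_AGAIN = "audits pass again"
    DEADLINE_MISSED = "did not exit suspension in time"


def next_state(state: NodeState, event: Event) -> NodeState:
    """Apply one event to a node state, following the three rules above."""
    if state is NodeState.DISQUALIFIED:
        return state  # disqualification is permanent
    if event is Event.DATA_LOST:
        return NodeState.DISQUALIFIED
    if event is Event.OTHER_AUDIT_ERROR:
        return NodeState.SUSPENDED
    if state is NodeState.SUSPENDED:
        if event is Event.AUDITS_PASS_AGAIN:
            return NodeState.OK  # the option to recover
        if event is Event.DEADLINE_MISSED:
            return NodeState.DISQUALIFIED
    return state


if __name__ == "__main__":
    state = NodeState.OK
    for event in (Event.OTHER_AUDIT_ERROR, Event.AUDITS_PASS_AGAIN):
        state = next_state(state, event)
        print(f"{event.value} -> {state.value}")
```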

More technical information about suspension mode can be found here: Design Draft: Storage Node "Suspended" State

It’s great to get informed when there’s an error. My last node got disqualified after 9 months and I still don’t know why. (No failed audits in the log file, see here) A detailed warning could have helped to avoid this.

I do have some “feature requests” (if not already implemented):

  1. Include more detail on what exactly triggered the email. This would help a lot.
  2. Also send an email once the satellite is accepting the node again

It would certainly be helpful to know when your node is ok again so you can stop worrying and searching for (additional) reasons for the suspension.

I agree with what was said above: more details on the why, and another email when everything goes back to normal again.

I just got one of these suspension e-mails from Saltlake satellite. Says suspension happened on 23Apr2020 at 0650 UTC, and that my node won’t receive any additional data from the satellite until the audit issues are resolved. So all of that is good information.

After running the successrate script for the node, which has been online for over 88h, I have 0 failed audits total. API dashboard shows 99.9% audit checks for Saltlake, and I’m still receiving tons of data from that satellite.

So I’m at a complete loss of what the issue is for the node…

EDIT: so I had totally forgotten that I never set up the static mount on the node…so perhaps that was the issue? Although again, at the time of receiving the e-mail I had 99.9% audit on Saltlake and was still receiving data from the satellite. I just went through the process of shutting down the node, setting up the static mount for the HDD and then restarted the node and sure enough, I’m now not getting any new data from Saltlake.

EDIT 2: sorry, so now I’m starting to receive data from Saltlake again, so I don’t know if the issue is corrected or not. I guess we’ll see.

Ultimately, if the SNOs get these e-mails and if the troubleshooting tips in the link of the e-mails don’t point to any clear indicator of the problem, I have a bad feeling that the network is going to end up disqualifying a bunch of SNOs and we’ll never know why. Which would be quite a bummer.

I got the exact same message from Saltlake satellite. Looking at the logs, I see nothing wrong… I restarted the node in the hope that everything comes back to normal.

Meanwhile we found out that the storage node is not logging all error messages. I am working on a pull request to improve the storage node logging.

I hope that will help. Sending the audit error with the email is nothing we can easily achieve.

That would happen after 3 days. There would be a second mail shot that reminds you that you are still suspended or tells you that you have resolved the issue. I would postpone that question for a moment. Let’s fix the root cause first and then reevaluate the emails.

Sorry I don’t understand that improvement idea. Please rephrase your feedback.

Thank you for clarifying this – your help desk apparently doesn’t know this – you might want to update them as well.

@littleskunk where’s my email? I wanted to be included too.

Are you saying you haven’t received an email but you have expected to get one?

Yes, I did expect one. I’ve been refreshing my email and checking the spam folder.

The automated emails would hit only nodes that are getting disqualified or suspended now. The system will not send out emails to storage nodes that got disqualified or suspended yesterday before the rule was activated.

Oh ok, I honestly thought for sure one of my nodes would have been suspended though, because it was struggling with an SMR hard drive. Means I’m not in as bad of shape as I thought. Good news.

I had two mails today. Using SMR drives :frowning: (did not know this until 2 days ago)
Now using a cache drive, no problems at the moment.

Saltlake is pushing data like crazy

Greetz Peter

Support is already aware of this and has been actively involved in addressing the email spam issue. If you have not yet received a final response to your support ticket, it is most likely assigned to a support agent who is currently off shift. I am sorry that we cannot answer questions 24 hours a day. If you would like me to take over your support ticket, I will be happy to do so; just mention your ticket number or email so I can identify the right ticket. All tickets that I have in my own queue have already been notified.

Two of the 3 issues have been solved.

  1. The satellite was not writing the audit reputation back into the database. With the default reputation, a single audit failure will trigger suspension mode. That is now fixed and only a few audit failures in a row will trigger suspension mode.
  2. The storage node didn’t write a log message. With the new storage node version that will be fixed.
  3. Most likely the reason for the audit failures is a locked SQLite database. That is still under investigation. If the suspension mode is working as expected it should tolerate a few audit failures.
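To see why a reputation that is never written back makes a single failure fatal, here is a toy model. It assumes a score of the form alpha / (alpha + beta) with exponential forgetting. The update rule, the default values, and the 0.6 threshold are illustrative assumptions and are not taken from the satellite code.

```python
from dataclasses import dataclass

LAMBDA = 0.95    # assumed forgetting factor
WEIGHT = 1.0     # assumed weight of a single audit
THRESHOLD = 0.6  # assumed score below which a node is suspended


@dataclass
class Reputation:
    alpha: float = 1.0  # assumed default for a node without stored history
    beta: float = 0.0

    @property
    def score(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def update(self, success: bool) -> None:
        """Fold one audit result into the reputation."""
        v = 1.0 if success else -1.0
        self.alpha = LAMBDA * self.alpha + WEIGHT * (1 + v) / 2
        self.beta = LAMBDA * self.beta + WEIGHT * (1 - v) / 2


def failures_until_suspended(rep: Reputation) -> int:
    """Count consecutive failures until the score drops below THRESHOLD.

    Note: this mutates the reputation that is passed in.
    """
    count = 0
    while rep.score >= THRESHOLD:
        rep.update(success=False)
        count += 1
    return count


if __name__ == "__main__":
    # Reputation never saved: every audit starts from the default.
    print(failures_until_suspended(Reputation()))
    # Reputation saved: a long history of successes is remembered.
    print(failures_until_suspended(Reputation(alpha=20.0, beta=0.0)))
```

With these assumed numbers, a node starting from the default drops below the threshold after one failure, while a node with a long stored history of successful audits survives several failures in a row.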

We will not rush the storage node rollout. That means we have to hold the next email round until next week.

My node was suspended after I moved it to a new location. I got the nodes back online but they still say suspended a week later. Here is one of the errors in the log. Any help would be appreciated. There appears to be a lot of traffic on this node. Suspension and Audit for each satellite are both at 100%.

C:\Program Files\Storj\Storage Node\storagenode.log:53039986:2021-07-23T07:01:11.551-0700 ERROR piecestore download failed {"Piece ID": "R63ZXYPZXFNM4E4A7TIWZGBUC5IVF6S6IZ3O6H3PCXTKJTLOG6AA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET_AUDIT", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:73\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:534\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:217\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:102\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:95\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

Were you suspended for downtime or audit suspension? Could you post the scores on your dashboard?

The audit error you posted could be related to a recent issue on that satellite, so it may not be related to your suspension. Either way, that failure would only affect your audit score, which would lead to disqualification if it dropped too low, not suspension.


The online percentages haven’t changed in over a week and the nodes have been online.