[SOLVED] FAQ: Node suspension emails?
Hi everybody, I have been seeing confusion about the emails some SNOs have received. Here’s the answer and the breakdown.
Updated Apr 24:
Two of the three issues have been solved.
- The satellite was not writing the audit reputation back into the database, so nodes stayed at the default reputation. With the default reputation, a single audit failure triggers suspension mode. That is now fixed, and it takes a few audit failures in a row to trigger suspension mode.
- The storage node didn’t write a log message for failed audits. The new storage node version will fix that.
- The most likely reason for the audit failures is a locked SQLite database. That is still under investigation. If suspension mode is working as expected, it should tolerate a few audit failures.
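To illustrate the first point, here is a minimal sketch of how a reputation score that always starts from its default can flip into suspension after one failed audit, while a score that has been saved across audits absorbs several. This is not the satellite's actual code. The update rule, `LAMBDA`, `WEIGHT`, the default values, and `SUSPENSION_THRESHOLD` are all assumptions chosen for illustration.

```python
# Illustrative sketch only, not the satellite's implementation.
# All constants and function names below are assumed for this example.

LAMBDA = 0.95               # assumed forgetting factor for older audits
WEIGHT = 1.0                # assumed weight of a single audit
SUSPENSION_THRESHOLD = 0.6  # assumed score below which a node is suspended


def update(alpha, beta, success):
    """Apply one audit result to an (alpha, beta) reputation pair."""
    v = 1.0 if success else -1.0
    alpha = LAMBDA * alpha + WEIGHT * (1 + v) / 2
    beta = LAMBDA * beta + WEIGHT * (1 - v) / 2
    return alpha, beta


def score(alpha, beta):
    """Reputation score in the range 0..1."""
    return alpha / (alpha + beta)


def failures_until_suspension(alpha, beta):
    """Count consecutive failed audits until the score drops below the threshold."""
    failures = 0
    while score(alpha, beta) >= SUSPENSION_THRESHOLD:
        alpha, beta = update(alpha, beta, success=False)
        failures += 1
    return failures


# Case 1: reputation is never written back, so the node is always at the default.
default_alpha, default_beta = 1.0, 0.0
print(failures_until_suspension(default_alpha, default_beta))  # 1

# Case 2: reputation is persisted, so 50 earlier successful audits have built up.
alpha, beta = default_alpha, default_beta
for _ in range(50):
    alpha, beta = update(alpha, beta, success=True)
print(failures_until_suspension(alpha, beta))
```

Under these assumed numbers, the default reputation is suspended after a single failure, while the persisted reputation needs several failures in a row before it crosses the threshold.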
We will not rush the storage node rollout. That means the next round of emails has to wait until next week.
- SNOs are receiving emails about a suspension
- This is confusing, because the dashboard reflects different information
- Additionally some people are receiving more than one of these emails
- We agree and apologize for the confusion this has caused. Keeping a healthy node is something you work hard for
- Yesterday the emails were turned on, but because something in the query was broken, they went to some people who shouldn’t have received them.
- An engineer is working to fix this, and the emails are paused while we work out the kinks
For more info, I am pasting an excerpt from a post by our product manager Brandon. Since his response was posted in a much longer thread, it may have been buried in some folks’ feeds.
Hey everyone, we are sorry for all of the confusion with these suspension mode emails. Your nodes are not going to be DQ’ed because of this. We prematurely set up automatic emails for suspension mode. These emails have now been paused, so you should no longer be receiving them, and we will enable them again once we are able to fix the known issues.
The issue is that the storage node logs are missing some error messages that tell you whether you failed an audit. Suspension mode is also currently triggering too rapidly, so nodes are going in and out of suspension mode within minutes. Don’t worry, your node will NOT be DQ’ed because of this. We also found an issue with the storage node database locking up, which we are investigating.
Thank you for all of your feedback on this topic. This feedback has been tremendously helpful for us to track down and resolve the last issues with this feature!
I will also sticky this post for a while in hopes it will be helpful. I may also be reaching out to some folks on the longer threads. We care about your experience and are working now to straighten out the emails. Thank you very much.