We are working on better and faster email notifications for storage nodes, sent directly from the satellite.
Blueprint: Design Draft: Node Status Emails From Satellite
Testplan: Testplan for storage node email notification
It is deployed on the QA satellite but without sending the final email.
In phase 1 I would like to verify in the database that the following emails get queued up correctly, without queuing so many that they would amount to spam.
In phase 2 I would like to enable the final email sending and take a look at the email wording.
In phase 3 we are going to enable this feature on the production satellites without sending the final emails. (Might overlap with phase 2)
Finally, in phase 4 we will enable the final email sending for all production satellites.
In any case, these email notifications will also help to identify problems with new storage node versions. We managed to break the QA network once by pushing a bad storage node binary, and it took us a while to notice. With email notifications, we will get a warning earlier. So even if you are not planning to take a look at this tech preview, I would still recommend setting up a test storage node: Please join our public test network
And now my plan for the QA satellite:
1. Take a node offline. → Email about downtime.
2. Node comes back online. → Another email about downtime end / back online.
3. Delete all pieces from my node. → Disqualification for lost pieces.
4. No email on manually removing the disqualification. (Out of scope)
5. Disable system clock check and offset system clock so far that audits are getting invalid. → Suspension email for unknown audit errors.
6. Sync system clock. → Unsuspension email.
7. Increase config for downtime suspension on QA satellite. Take a node offline. → Suspension email for downtime.
8. Downgrade storage node to an older version < satellite node selection. → Version outdated email.
9. Downgrade storage node to minimum version >= satellite node selection but < version control minimum and suggested version. → No email. Config for satellite node selection should be important here.
10. Graceful exit. Should send a disqualification email, right? That might be confusing for the operator. I might have to create a ticket.
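The two version downgrade cases in the plan can be written down as a small expectation check. This is only a sketch of the outcome I expect from the plan, not the satellite's code: the function names and the version numbers in the usage example are made up for illustration.

```python
from __future__ import annotations


def parse_version(text: str) -> tuple[int, ...]:
    """Turn 'v1.2.3' or '1.2.3' into (1, 2, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def expect_version_email(node_version: str, node_selection_minimum: str) -> bool:
    """Expected test outcome (hypothetical helper, not the satellite implementation).

    An email is expected only when the node is below the satellite's
    node selection minimum. Being below the version control minimum or
    suggested version alone should not trigger an email.
    """
    return parse_version(node_version) < parse_version(node_selection_minimum)
```

For example, with an assumed node selection minimum of `v1.60.0`, `expect_version_email("v1.50.0", "v1.60.0")` is `True` and `expect_version_email("v1.60.0", "v1.60.0")` is `False`.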
I can see in the table already that we have an issue with number 3. The email queue contains way too many disqualification emails. The system keeps adding more events to the queue, and the email address is empty for these events. That is a bug for sure.
Number 1 looks very good. There are just 2 times 3 offline events as intended. → We can test all the other situations without having to wait for a fix for the disqualification email.
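The check I am doing by eye on the queue table could be sketched roughly like this. The row layout (`node_id`, `event`, `email`) is an assumption for illustration and is not the real table schema.

```python
from collections import Counter


def check_queue(rows):
    """Summarize queued events (hypothetical row shape, not the real schema).

    rows: list of dicts with the assumed keys 'node_id', 'event', 'email'.
    Returns (rows with an empty email address, event count per node and event type).
    """
    missing_email = [row for row in rows if not row["email"]]
    counts = Counter((row["node_id"], row["event"]) for row in rows)
    return missing_email, counts


def over_limit(counts, limit):
    """Return the (node_id, event) pairs that were queued more than `limit` times."""
    return sorted(key for key, count in counts.items() if count > limit)
```

Rows with an empty address and pairs returned by `over_limit` are the two symptoms described above: events that cannot be delivered and events that keep piling up.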
The plan looks good to me.
We just need to solve the empty email address issue that we’re seeing in QA.
Additionally, though we would see the events show up in the DB table, and the data is sent to the ESP, we haven’t configured the email sending on the ESP side.
@cameron I have tested basically all the email triggers except downtime suspension and unsuspension. I could test that on the QA satellite with the help of a config change but I also realized that I will have to repeat all the tests once we allow the QA satellite to actually send the emails. What would be needed to do that?
So far I believe we have identified 2 bugs.
1. Something was wrong with the disqualification emails. The email address was empty and we would send way too many emails. We might want to fix that before moving on to the next step on the QA satellite.
2. Storage node version outdated is also sending an email every few hours or so. Not a blocker on the QA satellite. Unless we manually create that situation, the QA satellite wouldn’t send these emails at all. We could fix that bug later, before enabling it in production.
So when do you think we can enable email sending on the QA satellite?
To actually send the emails, there is still work that needs to be done to implement it on the ESP, which falls to the data team I believe. I “think” the plan was to implement that last, but I think your point about needing to test twice in that case is valid, so maybe that work should be started sooner.
For #1, we identified the root cause and a ticket has been created for team satellite. As of this morning, that ticket appears to still be in “Todo”.
For #2, I personally have a ticket this sprint to investigate.
So, there are a few different teams working on aspects of this project, and I’m not sure of the ETAs. We should discuss with project management to make sure everyone is on the same page.
I have received my first email. Looks promising so far. The text looks a bit wrong:
Your Node has gone offline Your Node is Below the Minimum Version
@cameron is that text about my node being offline maybe in the template? If we delete that part it should be correct.
The email template has been fixed. Time for the last test round and maybe we can enable it on all production satellites next week.
- Take a node offline. → Email about downtime.
- Node comes back online. → Another email about downtime end / back online.
- Delete all pieces from my node. → Disqualification for lost pieces.
- No email on manually removing the disqualification. (Out of scope)
- Disable system clock check and offset system clock so far that audits are getting invalid. → Suspension email for unknown audit errors.
- Sync system clock. → Unsuspension email.
- Increase config for downtime suspension on QA satellite. Take a node offline. → Suspension email for downtime.
- Downgrade storage node to an older version < satellite node selection. → Version outdated email.
- Downgrade storage node to minimum version >= satellite node selection but < version control minimum and suggested version. → No email. Config for satellite node selection should be important here.
- Graceful exit. Should send a disqualification email, right? That might be confusing for the operator. I might have to create a ticket.
- Take 2 nodes offline. → One email containing both nodeIDs.
What about a disqualification message? Will it be sent?
And the suspension message as well?
I can’t trigger all of the emails at the same time. It takes a few hours if not days to get my node suspended. Disqualification email is the last one I am going to trigger.
Good news. I was able to trigger almost all of the emails. They look good to me.
The offline notification took a bit longer on the QA satellite. I had to wait a few hours for the next audit round. This could also happen in production, especially with smaller nodes. I would say that is still an improvement. → Next step will be to enable all of the emails in production.
There are a few combinations I was not able to test. I was unable to trigger 2 different emails at the same time to see if they would get combined. My expectation would be to receive 2 emails, with events only combined if the reason is the same. I was able to trigger 2 emails for getting back online, and they are getting combined as expected. So the only untested case is 2 different emails at the same time.
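To make my expectation concrete, here is a minimal sketch of the combining rule as I understand it: events are merged only when both the recipient and the reason match. The tuple layout and the reason names are assumptions for illustration, not how the satellite actually does it.

```python
from collections import defaultdict


def combine_events(events):
    """Group events into planned emails (hypothetical shape, not satellite code).

    events: iterable of (email, reason, node_id) tuples.
    Returns one planned email per (email, reason) pair, listing every node ID
    that hit that reason, in the order the events arrived.
    """
    grouped = defaultdict(list)
    for email, reason, node_id in events:
        grouped[(email, reason)].append(node_id)
    return [
        {"to": email, "reason": reason, "node_ids": node_ids}
        for (email, reason), node_ids in grouped.items()
    ]
```

Under this rule, two "online" events for the same address produce one email with both node IDs, while an "offline" event and a "version outdated" event for the same address produce two separate emails.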
I also couldn’t test the downtime suspension email. My test node is already suspended for downtime and I need to get out of suspension first. Since I am not getting any spam emails for being suspended, I see no reason to stop the final rollout.