[Tech Preview] Storage node email notification on QA satellite

We are working on better and faster email notifications for storage nodes, sent directly from the satellite.
Blueprint: Design Draft: Node Status Emails From Satellite
Testplan: Testplan for storage node email notification

It is deployed on the QA satellite but without sending the final email.

In phase 1 I would like to verify in the database whether the following emails get queued up correctly, without flooding operators with too many emails.

In phase 2 I would like to enable the final email sending and take a look at the email wording.

In phase 3 we are going to enable this feature on the production satellites without sending the final emails. (Might overlap with phase 2)

Finally, in phase 4 we will enable the final email sending for all production satellites.

In any case, these email notifications will also help to identify problems with new storage node versions. We managed to break the QA network once by pushing a bad storage node binary, and it took us a while to notice. With email notifications, we will get a warning earlier. So even if you are not planning to take a look at this tech preview, I would still recommend setting up a test storage node: Please join our public test network


And now my plan for the QA satellite:

  1. Take a node offline. → Email about downtime.
  2. Node comes back online. → Another email about downtime end / back online.
  3. Delete all pieces from my node. → Disqualification for lost pieces.
  4. No email on manually removing the disqualification. (Out of scope)
  5. Disable system clock check and offset the system clock so far that audits become invalid. → Suspension email for unknown audit errors.
  6. Sync system clock. → Unsuspension email.
  7. Increase config for downtime suspension on QA satellite. Take a node offline. → Suspension email for downtime.
  8. Downgrade storage node to an older version < satellite node selection. → Version outdated email.
  9. Downgrade storage node to minimum version >= satellite node selection but < version control minimum and suggested version. → No email. Config for satellite node selection should be important here.
  10. Graceful exit. That should send a disqualification email, right? That might be confusing for the operator. I might have to create a ticket.
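
The plan above can be written down as a checklist and compared against what actually shows up in the queue. This is only a rough sketch: the event labels (`offline`, `disqualified`, and so on) are names I made up for illustration and are not the satellite's real event names. Step 10 is left out because the expected behavior is still an open question.

```python
# Hypothetical labels for the expected email per step; None means "no email expected".
EXPECTED_EMAILS = [
    ("take node offline", "offline"),
    ("node comes back online", "online"),
    ("delete all pieces", "disqualified"),
    ("manually remove disqualification", None),
    ("offset system clock", "unknown_audit_suspended"),
    ("sync system clock", "unknown_audit_unsuspended"),
    ("offline past downtime suspension config", "offline_suspended"),
    ("downgrade below node selection version", "version_outdated"),
    ("downgrade to >= node selection, < version control minimum", None),
]


def check_plan(observed):
    """Compare observed events with the plan.

    `observed` maps a step number (1-based) to the event label found in the
    queue for that step; a missing key counts as "no email".
    Returns a list of (step, action, expected, actual) for every mismatch.
    """
    mismatches = []
    for step, (action, expected) in enumerate(EXPECTED_EMAILS, start=1):
        actual = observed.get(step)
        if actual != expected:
            mismatches.append((step, action, expected, actual))
    return mismatches
```

An empty result means every tested step produced the email the plan expects; anything else lists the steps that need a closer look.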

I can see in the table already that we have an issue with number 3. The email queue contains way too many disqualification emails. The system keeps adding more events to the queue. The email address is empty for these events. That is a bug for sure.
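
For anyone who wants to run the same kind of check on an export of the queue, here is a minimal sketch. It assumes each queued row has a node ID, an email address, and an event name; the `QueuedEvent` type and its field names are hypothetical and do not reflect the real table schema.

```python
from collections import Counter
from typing import NamedTuple


class QueuedEvent(NamedTuple):
    # Hypothetical shape of one row in the email queue.
    node_id: str
    email: str
    event: str


def find_queue_problems(rows, max_per_event=1):
    """Return (rows with an empty email address, over-queued events).

    The second value maps (node_id, event) to its count for every pair that
    was queued more than `max_per_event` times.
    """
    rows = list(rows)
    missing_email = [r for r in rows if not r.email.strip()]
    counts = Counter((r.node_id, r.event) for r in rows)
    repeated = {key: n for key, n in counts.items() if n > max_per_event}
    return missing_email, repeated
```

`max_per_event` is whatever count is intended for the scenario being tested, so it needs to be set per test rather than treated as a fixed limit.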

Number 1 looks very good. There are just 2 times 3 offline events as intended. → We can test all the other situations without having to wait for a fix for the disqualification email.


The plan looks good to me.
We just need to solve the empty email address issue that we’re seeing in QA.
Additionally, although the events show up in the DB table and the data is sent to the ESP, we haven’t yet configured the email sending on the ESP side.


@cameron I have tested basically all the email triggers except downtime suspension and unsuspension. I could test that on the QA satellite with the help of a config change but I also realized that I will have to repeat all the tests once we allow the QA satellite to actually send the emails. What would be needed to do that?

So far I believe we identified 2 bugs.

  1. Something was wrong with the disqualification emails. The email address was empty and we would send way too many emails. We might want to fix that before moving on to the next step on the QA satellite.
  2. The storage node version outdated event is also sending an email every few hours or so. Not a blocker on the QA satellite. Without us manually creating that situation the QA satellite wouldn’t send these emails at all. We could fix that bug later before enabling it in production.
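
To illustrate what "not sending the same email every few hours" could look like, here is a small cooldown sketch. To be clear, this is not the fix that was decided on (the root cause of the second bug has not been investigated yet), just one possible approach. The class name, the in-memory dictionary, and the cooldown value are all assumptions for illustration.

```python
from datetime import datetime, timedelta


class NotificationThrottle:
    """Allow at most one notification per (node, event) within a cooldown."""

    def __init__(self, cooldown):
        self.cooldown = cooldown
        self._last_sent = {}  # (node_id, event) -> time of the last allowed send

    def should_send(self, node_id, event, now):
        key = (node_id, event)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # still inside the cooldown window, skip this one
        self._last_sent[key] = now
        return True


# Usage: a repeat after 3 hours is suppressed, a repeat after 8 days is allowed.
throttle = NotificationThrottle(cooldown=timedelta(days=7))
start = datetime(2023, 1, 1, 12, 0)
first = throttle.should_send("node-a", "version_outdated", start)
repeat = throttle.should_send("node-a", "version_outdated", start + timedelta(hours=3))
later = throttle.should_send("node-a", "version_outdated", start + timedelta(days=8))
```

A real implementation would need to keep the last-sent time somewhere persistent; the dictionary here only survives for the lifetime of the process.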

So when do you think we can enable email sending on the QA satellite?

To actually send the emails, there is still work that needs to be done to implement it on the ESP, which falls to the data team I believe. I “think” the plan was to implement that last, but I think your point about needing to test twice in that case is valid, so maybe that work should be started sooner.
For #1, we identified the root cause and a ticket has been created for team satellite. As of this morning, that ticket appears to still be in “Todo”.
For #2, I personally have a ticket this sprint to investigate.

So, there are a few different teams working on aspects of this project, and I’m not sure of the ETAs. We should discuss with project management to make sure everyone is on the same page.
