Design Draft: Node Status Emails From Satellite

cameron · September 22, 2022, 1:58pm

Hey all, we want to change how we’re sending emails to SNOs. Check it out:

https://review.dev.storj.io/plugins/gitiles/storj/storj/+/refs/changes/53/8353/3/docs/blueprints/satellite-node-emails.md

Toyoo · September 22, 2022, 3:58pm

Just to be sure: do you know how much effort it takes to keep emails from being not just marked as spam, but instantly blackholed by major email account providers?

Regarding «send an email at the time an event occurs», this would be fine for me, but it also increases the chances of emails being blackholed. If someone operates more than one node, they might get many emails, and receiving many emails in quick succession is one of the things mail account providers don’t like. Given that you seem set on implementing a chore for offline emails, maybe implementing some simple form of aggregation by email address, plus a chore to periodically send the aggregated events, wouldn’t be much more work?

Instead of «Rather, it is the lack of an event, namely the node successfully checking in with the satellite.», maybe it would be easier for you to consider an audit not being able to access the node as the event? That would save the effort of implementing a chore.

I’d also like an event such as “node degrading”, triggered, let’s say, every time a node’s reputation crosses some predefined threshold downwards (maybe ¼ of the way towards disqualification), to serve as an early warning before the node is disqualified.
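A minimal sketch of how such a downward threshold crossing could be detected. The function name, the scores, and the threshold values are hypothetical and chosen only for illustration; real values would come from the satellite's reputation configuration.

```python
def crossed_downwards(previous_score, current_score, thresholds):
    """Return the thresholds passed while the score dropped from previous to current.

    A threshold t counts as crossed when the previous score was at or above t
    and the current score is strictly below it. Upward moves return nothing.
    """
    return [
        t
        for t in sorted(thresholds, reverse=True)
        if current_score < t <= previous_score
    ]


# Hypothetical warning levels between a perfect score and disqualification.
WARNING_THRESHOLDS = [0.99, 0.98, 0.96]

for threshold in crossed_downwards(1.0, 0.97, WARNING_THRESHOLDS):
    print(f"node degrading: reputation fell below {threshold}")
```

Because the check compares the previous and current score, a warning fires once per crossing rather than on every check-in while the score stays low.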

I’d also suggest four more events: the node’s first successful check-in after setup, the node’s vetting being finished, the node passing 9 months (and hence no more profits being held), and the node passing 16 months (held amount being paid out). The first would serve as a simple one-time test of whether email notifications work for a new operator. All would additionally provide some psychological token of success to the operator. All of them could be triggered on check-ins.

Alexey · September 22, 2022, 7:56pm

Check-in happens every hour by default. Too many emails.

      --contact.interval duration   how frequently the node contact chore should run (default 1h0m0s)

cameron · October 28, 2022, 1:44pm

Good points! Thanks for the feedback! We’re looking into a solution to aggregate emails for nodes with the same email address.

moby · October 28, 2022, 9:21pm

Here is a summary of our updated plan. It has the benefits of triggering email sending events directly from the satellite (notifications should be more prompt), while maintaining Customer.io as the email sender (rather than sending emails directly from the satellite). In addition, it should be easy to batch events this way, as @cameron mentioned, and to add new events in the future, such as the ones @Toyoo suggested.

The plan outline:

Add a new satellite table, called node_events or similar. It will have the columns email, node_id, event_type (e.g. an enum representing “offline”, “disqualified”, “online”, etc.), email_sent (nullable timestamp), and created_at.
When a “reputation event” occurs (a node gets disqualified, for example), add a new row to the node_events table. We already have code for these “triggers” written. We just need to replace the line that sends the email with a line that adds a row to the new table.
Add a new satellite chore which does the following:
- select the oldest row in node_events where email_sent is null and where created_at is at least 5 minutes ago (or some other configured buffer time) - call this r
- select all rows in node_events where email_sent is null and email=r.email, grouped by event_type - this way, if one email is associated with 10 nodes that go offline at the same time, these events will be grouped together
- for each event type for this email address, send an event to customer.io indicating that this email address needs to be sent an email for event_type for one or more nodes (providing a list of node IDs to customer.io)
- set email_sent to the current time for all these rows
- repeat - if no rows are returned, wait 5 minutes (or some other configured buffer time) and repeat
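
One pass of the chore described above can be sketched in Python against an in-memory SQLite database. This is only an illustration under assumptions, not the satellite's actual code: the exact schema, the function name process_one_batch, and the send callback standing in for the customer.io call are all hypothetical, and "not yet notified" is modelled as email_sent being NULL.

```python
import sqlite3
from collections import defaultdict

# Hypothetical schema based on the columns listed in the plan.
SCHEMA = """
CREATE TABLE node_events (
    id         INTEGER PRIMARY KEY,
    email      TEXT NOT NULL,
    node_id    TEXT NOT NULL,
    event_type TEXT NOT NULL,
    email_sent REAL,            -- NULL until the notification has gone out
    created_at REAL NOT NULL    -- seconds since some epoch
)
"""


def process_one_batch(db, now, send, buffer_seconds=300):
    """Run one chore pass; return True if a batch was sent, False if nothing is due."""
    # Oldest unsent event that has aged past the buffer time.
    oldest = db.execute(
        "SELECT email FROM node_events "
        "WHERE email_sent IS NULL AND created_at <= ? "
        "ORDER BY created_at, id LIMIT 1",
        (now - buffer_seconds,),
    ).fetchone()
    if oldest is None:
        return False
    email = oldest[0]

    # All unsent events for that address, regardless of age.
    rows = db.execute(
        "SELECT id, node_id, event_type FROM node_events "
        "WHERE email_sent IS NULL AND email = ? ORDER BY created_at, id",
        (email,),
    ).fetchall()

    # Group node IDs by event type so each type yields one notification.
    nodes_by_type = defaultdict(list)
    for _row_id, node_id, event_type in rows:
        nodes_by_type[event_type].append(node_id)
    for event_type, node_ids in nodes_by_type.items():
        send(email, event_type, node_ids)  # placeholder for the customer.io call

    # Mark every row in the batch as sent.
    db.executemany(
        "UPDATE node_events SET email_sent = ? WHERE id = ?",
        [(now, row[0]) for row in rows],
    )
    db.commit()
    return True
```

Calling this in a loop, and sleeping for the buffer time whenever it returns False, would correspond to the final "repeat" step.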

Advantages of this approach:

- customer.io can still handle the emails. We don’t need to spend time engineering our own solution to deal with unsubscribing, checking open rate, getting off spam lists, etc…
- “reputation change” events are triggered directly from satellite. No more dataflow/customer.io segment/redash query annoyances
- should guarantee prompt email sending (within 5 or 10 mins of event occurring, which is much better than our current process)
- should combine multiple emails of the same type (e.g. node offline) for the same email address when they occur close to each other (within 5 minutes). Less spam, in other words.
- new useful table node_events which has utility outside of email sending. We can get a detailed history of any node’s reputation events
- it is not very different from the original design, and a lot of the code that has already been written can be preserved
- while it still makes use of customer.io, the end-to-end process of how these emails are triggered and sent should be a lot more understandable to the average developer

littleskunk · November 2, 2022, 8:38pm

@moby each satellite would have its own queue, right? Will customer.io send me one email per satellite or combine all satellites into one email?

Pentium100 · November 3, 2022, 9:03am

Can you send emails if the node DQ/suspension score is going down, before the node is actually disqualified or suspended? I’m sure node operators would like advance warning.

moby · November 3, 2022, 10:58pm

Yes, the queue would still be separate for each satellite. So if you have one node on us1, eu1, and ap1, you can expect three different emails when this node goes offline.

moby · November 3, 2022, 11:00pm

This design should set us up to be able to accomplish what you are requesting, but warning node operators in advance is outside the scope of the initial change we are making.
But I think this would be a great idea for something to support in the future.

Vadim · March 18, 2023, 9:14pm

Hello.

I would like to thank you for the node status notifications; they notify me if my node has been offline for some time. As I have a lot of nodes, it is hard to check every day that they are all online.
Looks like it is working perfectly.

Pentium100 · March 21, 2023, 4:19am

I got the “your node has been suspended” and “your node has been unsuspended” emails from the QA satellite. Cool, but it would be better, I think, if I got notified of audit failures before my node is suspended or disqualified.

Toyoo · March 21, 2023, 7:22pm

Yeah, something like:

Got some “gone offline” emails recently as well, that was nice! One additional suggestion:

It would be nice if, in addition to the node ID, there was also a human-friendly semi-identifier, maybe the node’s listening address:port? That would help quickly figure out which node of many is failing.