Design Draft: Node Status Emails From Satellite

moby · October 28, 2022, 9:21pm

Here is a summary of our updated plan. It triggers email events directly from the satellite, so notifications should be more prompt, while keeping Customer.io as the email sender rather than sending emails directly from the satellite. It should also be easy to batch events this way, as @cameron mentioned, and to add new events in the future, such as the ones @Toyoo suggested.

The plan outline:

Add a new satellite table called node_events, or similar. It will have the columns email, node_id, event_type (e.g. an enum representing “offline”, “disqualified”, “online”, etc.), email_sent (a nullable timestamp, null until the email has been requested), and created_at.
When a “reputation event” occurs (a node gets disqualified, for example), add a new row to the node_events table. We already have code written for these “triggers”. We just need to replace the line that sends the email with a line that adds a row to the new table.
Add a new satellite chore which does the following:
- select the oldest row in node_events where email_sent is null and where created_at is at least 5 minutes ago (or some other configured buffer time) - call this r
- select all rows in node_events where email_sent is null and email=r.email, grouped by event_type - this way, if one email is associated with 10 nodes that go offline at the same time, these events will be grouped together
- for each event type found for this email address, send an event to customer.io indicating that the address needs to be sent an email for that event_type covering one or more nodes (providing the list of node IDs to customer.io)
- set email_sent to the current time for all these rows
- repeat - if no rows are returned, wait 5 minutes (or some other configured buffer time) and repeat
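
To make the chore steps above concrete, here is a rough sketch of one pass in Python against SQLite. It is only an illustration under assumptions, not the satellite implementation: the exact column types, the `BUFFER` constant, and the helpers `record_event`, `process_next_batch`, and `notify` are hypothetical names, "not yet notified" is taken to mean `email_sent IS NULL`, and `notify` stands in for whatever call sends the event to customer.io.

```python
import sqlite3
from collections import defaultdict
from datetime import datetime, timedelta, timezone

BUFFER = timedelta(minutes=5)  # assumed: the configurable buffer time

# Assumed layout for node_events; column names follow the plan above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS node_events (
    id         INTEGER PRIMARY KEY,
    email      TEXT NOT NULL,
    node_id    TEXT NOT NULL,
    event_type TEXT NOT NULL,  -- e.g. 'offline', 'disqualified', 'online'
    created_at TEXT NOT NULL,  -- UTC timestamp, fixed-width ISO-8601
    email_sent TEXT            -- NULL until the email has been requested
)
"""


def _ts(moment):
    # Fixed-width ISO-8601 so that string order matches time order.
    return moment.astimezone(timezone.utc).isoformat(timespec="microseconds")


def record_event(db, email, node_id, event_type, now):
    """What a reputation trigger would do instead of sending an email."""
    db.execute(
        "INSERT INTO node_events (email, node_id, event_type, created_at) "
        "VALUES (?, ?, ?, ?)",
        (email, node_id, event_type, _ts(now)),
    )


def process_next_batch(db, notify, now):
    """Run one chore pass. Returns False when there was nothing to send."""
    # Oldest unsent row that is at least BUFFER old (row "r" in the plan).
    oldest = db.execute(
        "SELECT email FROM node_events "
        "WHERE email_sent IS NULL AND created_at <= ? "
        "ORDER BY created_at LIMIT 1",
        (_ts(now - BUFFER),),
    ).fetchone()
    if oldest is None:
        return False
    email = oldest[0]

    # All unsent rows for that email address, grouped by event type.
    pending = db.execute(
        "SELECT id, node_id, event_type FROM node_events "
        "WHERE email_sent IS NULL AND email = ?",
        (email,),
    ).fetchall()
    nodes_by_type = defaultdict(list)
    for _row_id, node_id, event_type in pending:
        nodes_by_type[event_type].append(node_id)

    # One notification per event type, carrying the list of node IDs.
    for event_type, node_ids in nodes_by_type.items():
        notify(email, event_type, node_ids)

    # Mark the rows as handled by stamping email_sent.
    db.executemany(
        "UPDATE node_events SET email_sent = ? WHERE id = ?",
        [(_ts(now), row_id) for row_id, _node, _type in pending],
    )
    db.commit()
    return True
```

The chore itself would call `process_next_batch` in a loop and sleep for the buffer time whenever it returns `False`. Note that the second query deliberately has no age filter, matching the outline: once an address is picked, all of its pending events go out together, including ones newer than the buffer.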

Advantages of this approach:

customer.io can still handle the emails. We don’t need to spend time engineering our own solution to deal with unsubscribing, checking open rate, getting off spam lists, etc…
“reputation change” events are triggered directly from satellite. No more dataflow/customer.io segment/redash query annoyances
should guarantee prompt email sending (within 5 or 10 mins of event occurring, which is much better than our current process)
should combine multiple emails of the same type (e.g. node offline) for the same email address when they occur close to each other (within 5 minutes). Less spam, in other words.
new useful table node_events which has utility outside of email sending. We can get a detailed history of any node’s reputation events
it is not very different from the original design, and a lot of the code that has already been written can be preserved
while it still makes use of customer.io, the end-to-end process of how these emails are triggered and sent should be a lot more understandable to the average developer
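
As a small illustration of the node history point in the list above, a lookup over node_events could be as simple as the following sketch. The function name `node_history` is hypothetical, and it assumes the columns described in the plan outline with created_at stored in a sortable form.

```python
import sqlite3


def node_history(db, node_id):
    """Return (created_at, event_type) rows for one node, oldest first."""
    return db.execute(
        "SELECT created_at, event_type FROM node_events "
        "WHERE node_id = ? ORDER BY created_at",
        (node_id,),
    ).fetchall()
```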