Hey folks, we’re looking to start using downtime for node disqualification again. Here’s a blueprint we’ve been working on for that purpose. Please share your feedback!
Sounds good to me! Giving SNOs a chance to get their node sorted out before full DQ is a good feature.
I have a suggestion: before starting real DQ, please run a “trial period” with notifications only. If everything goes as expected, you can then start real DQ activity. This “trial period” would be like a “testing a new feature” period and would prevent wrong DQs if there are any bugs in this mechanism.
That sounds reasonable to me. I should also say, we are planning on being quite generous with the allowed downtime and grace periods to start off.
But we’ll definitely need a warning by email if our node got into suspension mode. Otherwise some might not notice and get DQed anyway.
That’s a good point. I’ll add it to the document. Thanks for bringing that up!
I am confident that before turning DQ on, it is necessary to make use of the redundancy SNOs already have. For example, correctly process DNS records that resolve to multiple addresses, or let a SNO specify a backup IP address in the config.
If the total is greater than the allowed downtime, the node is suspended for a length of time equal to the grace period + one tracking period.
Suspended nodes are not selected for uploads, and all pieces they hold are considered unhealthy.
If I understand correctly, with these parameters:
- tracking period 30 days
- allowed downtime 6 hours
- grace period 60 days
say my node is suddenly offline for 8 hours (power surge, circuit breaker goes down, I put it back up when I come back from work or wake up in the morning, for example).
Then I would be suspended for 90 days, during which I would get no uploads and no downloads at all (is this what “all pieces are considered unhealthy” means)? What happens to those pieces I’m storing? Do they become obsolete? Automatically deleted? Would they start being used by the network again, for upload only, if not modified for 90 days?
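To make my reading of the numbers concrete, here is a minimal sketch of the suspension arithmetic as I understand it from the doc (the function and parameter names are mine, not from the actual code):

```python
from datetime import timedelta

# Parameters from the example above (variable names are my own)
TRACKING_PERIOD = timedelta(days=30)
ALLOWED_DOWNTIME = timedelta(hours=6)
GRACE_PERIOD = timedelta(days=60)

def suspension_length(total_downtime: timedelta) -> timedelta:
    """If total downtime within the tracking period exceeds the allowance,
    the node is suspended for the grace period + one tracking period."""
    if total_downtime > ALLOWED_DOWNTIME:
        return GRACE_PERIOD + TRACKING_PERIOD
    return timedelta(0)

# 8 hours offline exceeds the 6-hour allowance -> 90 days suspended
print(suspension_length(timedelta(hours=8)).days)  # 90
```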
Are there any statistics on what would have happened if these rules had been in place in the past? I.e. how many nodes would be disqualified, how many nodes would be affected by suspension and for how long, and what would those numbers mean in terms of storage?
Also, is it possible to estimate one’s own storage node’s current uptime score under these rules, based on either data from the log or the database?
What is the planned maximum interval between checks? The scoring system looks like it would cope well if a node fails a check 5 times in 5 minutes (only recording 5 minutes worth of downtime). However, let’s say the checks are once every two hours and my node goes down (due to the ISP for example) twice for 5 minutes exactly at the wrong time - I would get 2 hours of downtime even though my node was actually down for 10 minutes total.
So, what is the proposed maximum check interval?
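To illustrate the concern with a simplified model (this is entirely my own assumption about how an interval-based checker might charge downtime, not how the satellite actually works):

```python
from datetime import timedelta

def recorded_downtime(outage: timedelta, check_interval: timedelta) -> timedelta:
    """Simplified model: if a short outage happens to span a check, the node
    is charged from the failed check until the next successful one, i.e. up
    to a full check interval, regardless of how brief the outage was."""
    if outage >= check_interval:
        return outage  # long outages are measured roughly correctly
    return check_interval  # short outage at exactly the wrong time

# A 5-minute blip straddling a check, with checks every 2 hours:
print(recorded_downtime(timedelta(minutes=5), timedelta(hours=2)))  # 2:00:00
```

The shorter the check interval, the smaller the worst-case over-counting, which is why the maximum interval matters so much here.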
I’m going to need to confer with the team and get back to you on that one
I see that “unhealthy” is vague. I’ll edit the document to be more descriptive. You would not receive any new uploads, yes. You would still be able to receive downloads unless one of the pieces you hold belongs to a segment which needs repair. If a segment is repaired, the piece you are holding has a chance to be placed on a different node, in which case the garbage collection service should tell you to put that piece into the trash. If you made it through suspension and became reinstated, and your pieces were not repaired, they would remain on the segment just like normal.
It depends on how far back you want to look. Downtime tracking was implemented a few months ago, which means we should have some data to look at with regards to how much suspension/DQ we might have right out of the gate. Further back than that I don’t think is possible, given that this system measures node uptime/downtime differently than the previous system, which was based on a ratio of successful/failed pings.
It should be possible to add this information to the NodeStats service, which gives you details about audit results.
To clarify, are you saying that storage nodes should be allowed to have multiple IP addresses, and that it should be the satellite’s responsibility to check all IPs before deciding that a node is unavailable?
In order to fully address this, I’d like to give some context.
In the code we have two relevant values for downtime tracking: last_contact_success and last_contact_failure. These are timestamps of the last time a contact was successful and unsuccessful, respectively.
Downtime tracking currently does two things:
- It searches for nodes where last_contact_success > last_contact_failure, but where last_contact_success is more than one hour ago. The purpose of this is to reach out to nodes which might be offline, since they have not checked in, and update last_contact_failure. This plays into part 2.
- It searches for nodes where last_contact_failure > last_contact_success. We reach out to these nodes, and if they are still offline, we mark the elapsed time as downtime. If they respond, we do not mark any downtime!
Now that we have some context, let’s look at how this might play out in your example.
Your node goes down for 5 minutes, maybe you’re audited in the meantime, and we update your last_contact_failure. One of the first things the storage node does when it starts up is check in with the satellites, which updates last_contact_success. So if you were really only down for 5 minutes, the only way we’re going to mark you for that is if downtime tracking tries to reach you in that 5-minute window.
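The two parts above could be sketched like this (a simplified Python model of my description, not the actual satellite code; `ping` is a hypothetical stand-in for actually contacting the node):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Node:
    last_contact_success: datetime
    last_contact_failure: datetime

def track(nodes, now, ping):
    """ping(node) -> bool is a placeholder for contacting the node."""
    marked = timedelta(0)
    for n in nodes:
        # Part 1: node looks online but hasn't checked in for over an hour;
        # try to reach it, and record a failure timestamp if it's offline.
        if (n.last_contact_success > n.last_contact_failure
                and now - n.last_contact_success > timedelta(hours=1)):
            if not ping(n):
                n.last_contact_failure = now  # feeds into part 2
        # Part 2: node's last contact was a failure; if it's still offline,
        # the elapsed time since that failure is marked as downtime.
        elif n.last_contact_failure > n.last_contact_success:
            if ping(n):
                n.last_contact_success = now  # no downtime marked
            else:
                marked += now - n.last_contact_failure
                n.last_contact_failure = now
    return marked
```

Note how a node that checks in again before part 2 runs never accumulates any marked downtime, which is the point of the 5-minute example above.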
To a first approximation, yes.
Actually, when you deal with DNS you must be prepared for the fact that a DNS name can resolve to several addresses, and the unavailability of one of those IP addresses does not mean the unavailability of the service.
But Storj Labs, with its own /24 filter, made things very complicated. To get out of this you need to consider specific use cases. This filter encourages a SNO with several ISP channels to keep one node per channel rather than to use those channels for redundancy or load balancing on a single, more reliable node. You should think about that.
I think, if that is the intention, then it makes sense. DNS is a core component of the internet, and if you don’t understand that, perhaps hosting a node is not the greatest idea. Hosting multiple A records within the same subnet is a 100% match for being with the same ISP (i.e. not having PI addresses with BGP).
All my nodes have multiple A records, with multiple providers for redundancy and currently that is not reflected on the satellites.
What if the node is not restarted? That is, internet connection stops working for a few minutes (at exactly the “wrong” time) but I do not restart the node after the connection comes back up?
I like the approach outlined in this doc, but I feel like the suspended state may take too long if you wait out the entire grace period + tracking period.
This would result in:
- higher repair costs incurred (as unhealthy pieces don’t count towards the repair threshold)
- higher chance of node churn, because SNOs may think it’s not working anymore
- longer periods of no ingress for SNOs
In short, it’s bad for both sides.
It seems to me the problem with taking nodes out of suspension early is not so much the possibility of going in and out of suspension, but rather the resetting of the monitoring time frame. So I would suggest taking nodes out of suspension as soon as their downtime drops below the maximum allowed, but not resetting that monitoring window, and sticking with the grace period + tracking period starting when the node was first suspended.

During this time the node would be reinstated while its downtime is below the threshold, and it could receive new data and its pieces could be marked healthy. But it would be in an “under review” state. If it goes in and out of suspension a few times, it would not matter too much.

When the grace period + tracking period since the initial suspension expires, the decision would be made to either reinstate the node completely (no longer under review) or disqualify it. This would limit the time the node is effectively not taking part in the network, and reduce the repair needed as a result.
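A rough sketch of the logic I’m proposing (all names are mine, reusing the example parameter values from earlier in the thread):

```python
from datetime import timedelta

# Example values from earlier in the thread (names are my own)
TRACKING_PERIOD = timedelta(days=30)
ALLOWED_DOWNTIME = timedelta(hours=6)
GRACE_PERIOD = timedelta(days=60)
REVIEW_PERIOD = GRACE_PERIOD + TRACKING_PERIOD  # clock starts at first suspension

def node_state(downtime_in_window: timedelta,
               since_first_suspension: timedelta) -> str:
    """Suspension toggles with the rolling downtime total, but the review
    clock keeps running from the first suspension instead of resetting."""
    if since_first_suspension >= REVIEW_PERIOD:
        # End of review: make the final call
        if downtime_in_window > ALLOWED_DOWNTIME:
            return "disqualified"
        return "reinstated"
    if downtime_in_window > ALLOWED_DOWNTIME:
        return "suspended, under review"
    return "active, under review"

print(node_state(timedelta(hours=4), timedelta(days=10)))  # active, under review
print(node_state(timedelta(hours=4), timedelta(days=90)))  # reinstated
```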
Please do add this. It’s nice that we can use a downtime robot, but it’s not accurate, and we SNOs really need to know what data the satellite uses for this determination. Either a downtime percentage or the actual amount of downtime measured during the current rolling tracking period would be great. I would definitely incorporate this info into the earnings calculator as well if it were available. Some of us geeks want to know exactly how close to suspension we are before it happens.