Blueprint: Downtime Disqualification

Thanks for your reply!
I have been using DDNS for about a year to keep my services available in disaster situations, and it has saved me, especially at night while I’m sleeping. Disaster situations are rare, but they can happen 1-3 times per year.

With the suspension mode it shouldn’t be needed at all. Even if you are offline for a day you can still recover.


So, mark it as offline if it is offline when the satellite tries to audit it, and ignore the previous check-in.

Something like this:
00:00 audit success
00:05 audit attempt, unable to connect, node offline
00:10 node check-in
00:12 audit attempt, node offline. Ignore the previous check-in, mark the node offline since 00:05
00:20 node check-in
00:25 audit attempt, success. Mark the node as online since 00:20

In this scenario the node would rack up 15 minutes of downtime (00:05 - 00:20).

@cameron this would mean the easiest implementation would be your first idea in combination with the audit job starting the downtime clock.

The first failed audit will mark the storage node as being offline. The satellite starts tracking the downtime. Only a successful ping can stop the downtime clock; we ignore the storage node check-in message. We wouldn’t even need to ping the storage node at a random time later. We just wait for the next regular downtime tracking ping. Downtime would be the time between failed pings, excluding the first and the last successful ping, as described in the test plan.

Small correction: as described in the test plan, we can’t just assume that the node was offline until 00:20; we only know it was offline until 00:12, because that is when we last tried to contact it. The node could have had an ISP outage until 00:13 and then been running just fine without knowing that the satellite was waiting for a check-in message.
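
To make that concrete, here is a minimal sketch in Go (illustrative only, not the actual satellite code; the names are made up) that sums only the downtime the satellite can prove, i.e. the time between consecutive failed contacts:

```go
package main

import (
	"fmt"
	"time"
)

// contact is one satellite-initiated contact attempt (audit or downtime tracking ping).
// Storage node check-ins are intentionally not part of this list.
type contact struct {
	at time.Time
	ok bool
}

// trackedDowntime sums only the downtime the satellite can prove: the time
// between two consecutive failed contacts. A single failed contact tells us
// nothing about when the node came back online.
func trackedDowntime(contacts []contact) time.Duration {
	var total time.Duration
	for i := 1; i < len(contacts); i++ {
		if !contacts[i-1].ok && !contacts[i].ok {
			total += contacts[i].at.Sub(contacts[i-1].at)
		}
	}
	return total
}

func main() {
	t := func(hhmm string) time.Time {
		v, _ := time.Parse("15:04", hhmm)
		return v
	}
	// The scenario from above: audits at 00:05 and 00:12 failed, 00:25 succeeded.
	contacts := []contact{
		{t("00:00"), true},
		{t("00:05"), false},
		{t("00:12"), false},
		{t("00:25"), true},
	}
	fmt.Println(trackedDowntime(contacts)) // 7m0s -> only 00:05 to 00:12 is provable
}
```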


Yeah, missed that. Still, the idea is the same and it should catch that type of cheating (the node would have to respond to audits to be considered online).

Updated list.

  1. Apply the rules of containment mode on the first offline audit after the last contact success. Because the cheater node checks in every 10 minutes, we would need only 3 audit checks to apply an audit penalty and a few more iterations to DQ the node. A storage node that is offline for a few hours would be added to containment mode only once. Additional audit checks will notice that the node wasn’t online, and the containment mode counter should not be increased.
    1.1. Downside: We would DQ nodes that have connection issues and restart frequently. Maybe we should use suspension mode for failed audits and for containment mode as well. That way they wouldn’t get DQed without the suspension mode grace period. We could send them a suspension warning email and they would be able to fix the problem without getting DQed.
    1.2. Personal note: Getting DQed for frequent restarts is unlikely. How about adding some metrics to see if that ever happens in production before we worry about it?

  2. Limit updating last contact success to only once per hour. Removed from the list.

  3. Store the time of each audit contact failure to figure out how many audits in a row a node has missed. At some point, DQ the node or trigger some kind of suspension mode.
    3.1. Downside: I don’t know the rules for DQ here, so it is hard to say which side effects we might get.

  4. Trigger an uptime check a few minutes after the storage node check-in. If the node is offline again, continue tracking the downtime and ignore the check-in even if it was successful. Audits should add the storage node to the downtime detection.
    4.1. Downside: Additional complexity. We can’t send the ping immediately after the check-in; we would have to wait a few minutes. That means we have to store the pending pings somewhere and we need another job to execute them.

  5. Ignore storage node check-in messages. Only the downtime tracking pings are allowed to mark a storage node as online. If the next ping fails, we still count it as offline regardless of the successful check-in. (See the sketch after this list.)
    5.1. Downside: This is a bit tricky. The downtime tracking uses the last contact success timestamp. The check-in still needs to update that timestamp because otherwise the audit system wouldn’t try to contact the node. The node would still be able to avoid DQ by just waiting for the next downtime tracking ping and going offline after that. How do we make sure the node is getting audits between the check-in and the following downtime tracking ping?
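
For option 5, the bookkeeping could look roughly like this (a hypothetical sketch with made-up types, only meant to illustrate that a check-in never closes the downtime window):

```go
package main

import (
	"fmt"
	"time"
)

// offlineTracker is a hypothetical model of option 5: only satellite-initiated
// contacts (downtime tracking pings or audits) can open, extend, or close an
// offline window. Node check-ins never do.
type offlineTracker struct {
	offlineSince *time.Time    // last failed contact of the currently open window
	downtime     time.Duration // total provable downtime so far
}

// satelliteContact records the result of a downtime tracking ping or audit.
func (t *offlineTracker) satelliteContact(at time.Time, ok bool) {
	switch {
	case ok:
		// We can't prove anything between the last failed contact and now,
		// so the window closes without adding more downtime.
		t.offlineSince = nil
	case t.offlineSince == nil:
		// First failed contact: open the window, nothing to count yet.
		t.offlineSince = &at
	default:
		// Consecutive failure: the whole span since the previous failed
		// contact is provable downtime.
		t.downtime += at.Sub(*t.offlineSince)
		t.offlineSince = &at
	}
}

// nodeCheckin is deliberately a no-op for downtime accounting. It would only
// refresh last_contact_success so the audit system keeps contacting the node.
func (t *offlineTracker) nodeCheckin(at time.Time) {}

func main() {
	var tr offlineTracker
	base := time.Now()
	tr.satelliteContact(base, false)                     // first failed ping
	tr.nodeCheckin(base.Add(10 * time.Minute))           // ignored for downtime
	tr.satelliteContact(base.Add(12*time.Minute), false) // still offline: +12m
	tr.satelliteContact(base.Add(25*time.Minute), true)  // back online, window closes
	fmt.Println(tr.downtime) // 12m0s
}
```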


The problem still is that we would need a pointer with at least 28 other nodes for an audit. In that situation I would prefer to use containment mode to remember the missed audit and repeat it without having to call the other 28 nodes. That would be option 1.

Why? Unless you mark an offline node as having failed an audit.
The way I understand this, the audit score should be used for tracking whether the node still has the data, and the uptime score should be used for tracking whether the node is online.

If an audit is unable to connect to a node, then it would mean that the node is offline (uptime score should be affected), but there is no way to know if it has the data or not (audit score should not be affected).

The audit score of an offline node (or one that restarts a lot) should not be affected: if the audit process is unable to connect to the node, just mark the node as offline without increasing its total audit count. The node should be counted as offline from the first time it failed to respond until the last, and stay that way until it manages to respond to that audit (which then affects the score).

^ Please remember that containment mode is not new. Why are we discussing the consequences of containment mode?

Because it may be confusing (if I see a failed audit, it means that something is really wrong with my setup, whereas I expect my internet connection to have some downtime)? That is, my audit score should not drop because of a reboot or ISP downtime.
The audit score should track data integrity (if it’s not 1, there is a big problem that needs to be fixed), while the uptime score should track uptime (if it’s less than 1, it’s not such a big problem, as long as it stays above some value).

It is currently implemented that way. So if it hasn’t been confusing up to now, why should it get confusing in the future? Containment mode is active in production!

Isn’t this already taken care of by the design doc discussed here?

If it’s just used to extend offline time, that’s not confusing. But if the satellite reported that back as an audit failure, I would be a lot more concerned to see that in the data.

Do these downtime tracking pings happen at a high enough frequency for this to be accurate enough? If not, since the node was now marked as offline, you could still add back in that random ping after the node checks in to limit the impact. But in general I like this idea if the checks are frequent enough.

Tricky indeed, as this would be an argument to lower the frequency of those downtime check-ins. I think if the node has to be online for several minutes, it would start failing audits pretty fast. So perhaps something in that time frame?

How’s that to avoid the 3 posts in a row limit? :wink:

No. The current implementation of suspension mode only cares about unknown audit errors. Let’s take the example of a disconnected USB drive: the storage node will get DQed for issues like that without ever getting into suspension mode.

The satellite will try to send you the same audit 3 times in a row, and if you are offline or the transfer times out all 3 times, the satellite will count that as an audit failure. That is already active in production. The only change we would apply is extending the definition of going offline. We wouldn’t touch the rest of containment mode, which makes it very confusing for me that we are discussing that part. It is already implemented, and if you haven’t noticed it, that means it works as expected. Containment mode should punish cheaters and not healthy nodes.
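
Purely as an illustration of that retry logic, here is a toy sketch (hypothetical names and constants, not the real containment code):

```go
package main

import "fmt"

// maxReverifyCount is an assumption for this sketch; the real limit is
// satellite configuration.
const maxReverifyCount = 3

// pendingAudit stands in for a contained audit: the satellite remembers which
// piece it wanted and how many times the node has missed the retry.
type pendingAudit struct {
	pieceID      string
	reverifyMiss int
}

// reverify is called each time the satellite retries the contained audit.
// It returns true once the misses add up to a real audit failure.
func reverify(p *pendingAudit, nodeResponded bool) bool {
	if nodeResponded {
		p.reverifyMiss = 0 // the audit gets verified normally, containment ends
		return false
	}
	p.reverifyMiss++
	return p.reverifyMiss >= maxReverifyCount
}

func main() {
	p := &pendingAudit{pieceID: "example-piece"}
	for i := 0; i < 3; i++ { // node offline or timing out on every retry
		if reverify(p, false) {
			fmt.Println("counted as an audit failure after", p.reverifyMiss, "misses")
		}
	}
}
```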

The interval shouldn’t matter. Let’s say we ping the node 30 or 5 minutes after the check-in message. If it is a success, we can’t increase the downtime because we don’t know when the node came back online; we have to assume it was a second after the previous failed ping. Only if the new ping fails as well can we add downtime. A shorter interval would be more accurate, but that also means the storage node will end up with more downtime. Let’s say the storage node was back online after 21 minutes. With a 30 minute interval that would be no downtime, but with a 5 minute interval we would apply a 20 minute downtime.
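
Here is a small sketch of that arithmetic (purely illustrative; it assumes the first failed ping happens right when the node goes offline and that pings continue at a fixed interval until one succeeds):

```go
package main

import (
	"fmt"
	"time"
)

// downtimeWithInterval works through the example above: the node is actually
// back after actualOutage, but the satellite only credits downtime between
// consecutive failed pings spaced interval apart.
func downtimeWithInterval(actualOutage, interval time.Duration) time.Duration {
	var total time.Duration
	last := time.Duration(0) // the first failed ping happens at t=0
	for t := interval; ; t += interval {
		if t >= actualOutage {
			// The node is back by now, so this ping succeeds and nothing
			// after the previous failed ping is provable downtime.
			return total
		}
		// Another failed ping: the span since the previous failure counts.
		total += t - last
		last = t
	}
}

func main() {
	outage := 21 * time.Minute // the node comes back online after 21 minutes
	fmt.Println(downtimeWithInterval(outage, 30*time.Minute)) // 0s    -> no provable downtime
	fmt.Println(downtimeWithInterval(outage, 5*time.Minute))  // 20m0s -> pings at 5, 10, 15, 20 min all fail
}
```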

Works just fine. Thank you :smiley:

Then I completely agree. The example you outline has unfortunately happened to quite a few people who hadn’t actually lost the data. Suspension allows them to recover without the network taking on additional risk, as it already assumes the data is lost when it comes to repair work.

I always took this as what happens when the node is online but doesn’t respond on time. I can see how going offline during this check would have the same effect. I didn’t realize that if your node goes offline, but the satellite isn’t aware it’s offline yet, this could actually lead to an audit failure. Ideally you would want the satellite to ping the node and detect it as offline before all 3 audit failures happen.
To be fair, I haven’t noticed this, because my node has near 100% uptime. So there have been very few opportunities to be able to notice this. I still think failing audits because you’re offline is not the most elegant solution, but I guess you have to avoid giving the node an out by just going offline during an audit check. You’re right, this part should not be under discussion right now. I’m glad you clarified it though, since I probably had a slight misunderstanding of how it worked exactly.

I understand it’s skewed in favor of the node. But it does assume continuous downtime between two failed pings. Due to unfortunate timing, a node could be online for 25 minutes in between 2 failed uptime pings 30 minutes apart. This is pretty acceptable at relatively short intervals between pings, but could become a bit more of an issue if the interval is quite long.

That said, these all seem like acceptable downsides to me, so I think you’re closing in on a great solution with option 5.



Maybe I’m just ignorant… but it sounds to me like merging the uptime checks and the audits.
Sure, a ping is easier and faster, but you do audits anyway… and if an audit counts as an uptime check, then one cannot avoid the audit, and people with connection issues are simply offline because of their inability to complete audits. Also, you already have all this traffic running to and from the storage nodes all the time.

The system that makes sure the data isn’t lost should be pretty good at evaluating bad nodes.

But yeah, just my 10 cents, because I have little idea about the actual nitty-gritty… anyway, I hope someone might find a grain of inspiration in my rambling.

I am planning to open a different topic for that. For my own storage node, the system registered 30 minutes of downtime in the last 30 days. That’s fine for me. We can run the DB query for a few other nodes and see how the system would behave.

That doesn’t work out. There is a minimum interval we need in order to be sure we can count the time between 2 failed uptime checks as downtime. If we send a node 2 audits per day, that doesn’t mean it was offline for 12 hours! If we send the storage node a ping every 30 minutes, this little inaccuracy doesn’t matter.

I checked my logs; my node has done 40 audits in 4 hours.
I just don’t understand the point of bothering to send a ping every 10 minutes or whatever to see if the node is online, when an audit could be the ping and thus negate the whole problem with no. 5 on the list.

But I’m sure you have reasons I don’t understand that merit creating more subsystems for the satellites to run.

Nodes with less data get significantly fewer audits. It takes about a month to get through the first 100 audits during vetting, so that should give you an idea of how low that frequency can be. Relying on audits alone would not be enough.

Then maybe let pings be the method for new nodes… it’s not like you want to store anything important on them anyway before they are vetted.