Blueprint: Downtime Disqualification

I would also like to bring up another point for discussion related to the current downtime tracking implementation (https://github.com/storj/storj/blob/61b6ff518626a0d89d1b417ae9002bf73cae8d2c/docs/blueprints/storage-node-downtime-tracking.md).

We are aware of a loophole which would allow nodes to effectively remain offline and go undetected. This is due to how downtime tracking looks for nodes whose last contact was a failure.
When a node checks in, its last_contact_success is updated. As long as a node comes online periodically to check in, it can shut down in the meantime.

Now, a node doing this is not going to be receiving any download payouts, so the incentive isn’t quite so high, although the potential for exploitation does exist.

We have had a few quick ideas about how we might address this, which I can describe here in just a bit, but we were also hoping the community might help us brainstorm a good solution.

Exponential back-off was invented to avoid collisions, i.e. multiple nodes trying to do the same thing at the same time, not just to avoid fruitless contacts, which are probably not a big issue.

Let’s consider a case where a link goes down for many nodes (e.g. a major ISP whose users host many nodes). These nodes will likely notice the downtime at roughly the same time, and they will all try to contact the satellite at roughly the same intervals. This might lead to congestion¹ within the ISP’s network.

Now, the link goes up. All storage nodes will wait for the next contact period and will all spam the satellite at, again, roughly the same time. Congestion².

More than that: all those users will have their downtime period correlated, as opposed to some lucky ones getting back right after connectivity is restored, and some less lucky maybe a bit later. The nice Bayesian equations in the whitepaper assume lack of correlation though, so you risk that the bounds for data availability will not hold.

It would be even worse in the (hopefully hypothetical) case where a satellite itself has downtime.

A little bit of randomization will fix this. Try e.g. the old Ethernet idea: after the nth unsuccessful attempt, generate a random number from [0, 2^n] and wait that many seconds, keeping an upper bound on the wait.

¹ likely small, as the packets are small and the ISP doesn’t analyze them.
² likely worse, as the satellite needs crypto to handle them.
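
To make the Ethernet-style suggestion concrete, here is a minimal Go sketch of truncated binary exponential backoff with jitter. The constants and function names are my own illustration, not Storj code; the base delay and cap would need tuning.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	baseDelay = time.Second      // length of one backoff slot
	maxDelay  = 30 * time.Minute // keep an upper bound on the wait
)

// backoffDelay returns a random wait after the n-th failed attempt:
// a uniform draw from [0, 2^n] slots, capped at maxDelay.
func backoffDelay(attempt int) time.Duration {
	slots := int64(1) << uint(attempt) // 2^attempt
	d := time.Duration(rand.Int63n(slots+1)) * baseDelay
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Printf("attempt %d: wait %v before retrying\n", attempt, backoffDelay(attempt))
	}
}
```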

We actually did have the contact chore sleep for a random period before running in the past to avoid that. I’m not sure why it was removed. I’ll have to look into that.

It appears we got rid of this because it caused a delay in SNOs discovering that they had connection issues, i.e. they could wait up to 45 minutes before the first attempted connection. I think we could add some randomness to the backoff. That would also preclude the frequent retries Pentium100 would like, though.

For the check in shutdown problem:
I should say, neither of these ideas is fully thought out.

  1. Rate-limit the storage node check-in’s ability to manipulate the last_contact_success value.
    i.e. a node could check in every ten minutes, but we would only change the last_contact_success once per X. This way a node is playing a much riskier game by shutting down, since it can’t trick us by checking in all the time. Ultimately, this probably would not be a great idea, since one reason we can have a fair amount of confidence in how we mark node downtime is that when nodes come online, one of the first things they do is contact the satellite.
  2. Design a way to keep track of the interplay between last_contact_success and offline audits. This exploit involves only coming online long enough to contact the satellite, but doing so repeatedly to not get caught. We need to be able to tell that the node alternates between last_contact_success > last_contact_failure and vice versa; a rough sketch of what that detection could look like follows below.
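
For idea 2, here is a rough sketch of what that detection could look like, assuming the satellite keeps (or could keep) a short history of contact events per node. The type and field names are hypothetical, purely for illustration:

```go
package downtimedetection

import "time"

// ContactEvent records one contact with a node: either a node-initiated
// check-in or a satellite-initiated audit/ping, and whether it succeeded.
// All names here are hypothetical.
type ContactEvent struct {
	Time      time.Time
	Initiator string // "node" (checkin) or "satellite" (audit/ping)
	Success   bool
}

// SuspiciousAlternation flags the "check in, then shut down" pattern:
// within the window, the node's own check-ins succeed while
// satellite-initiated contacts keep failing.
func SuspiciousAlternation(window []ContactEvent, minFailedAudits int) bool {
	failedAudits := 0
	for _, e := range window {
		switch {
		case e.Initiator == "satellite" && !e.Success:
			failedAudits++
		case e.Initiator == "satellite" && e.Success:
			// A successful satellite-initiated contact breaks the pattern.
			return false
		}
	}
	return failedAudits >= minFailedAudits
}
```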

I am strongly against high upper bounds. Retry contacting the satellite every minute or two. If the satellite cannot handle a lot of nodes checking in at once, there may be other performance problems, unless the check-in process requires more resources than, say, a client uploading a file.

However many hours of downtime are allowed, I would not want the node to just not contact the satellite and give me even more downtime, even though everything is working. Do I have to restart the node every time my connection goes down for a few minutes?

Hmm… Maybe there should be an API call to get the last time my node has pinged the satellite or the scheduled next ping, so, when my internet connection goes back up, if the nextPing is far away, I can just restart the node.

I might have missed the explanation (which is surely somewhere), but what speaks against a constant connection between the node and the satellite? The satellite can request random pings, but theoretically a constant connection should be stable and covered by TCP, so that uptime can easily be calculated from the time the connection is closed.
For example, MQTT keeps a constant connection with all its clients, like many other services (IMAP, chats, …).

Or what about simple polling at random intervals of 1–15 minutes? Like sending a ping, except that you have to open the connection first.

In the early days when we didn’t have a way to check uptime scores my router was actually having an issue that caused very similar behavior. For months my node seemed to be running fine but I actually got suspended for a bit because it turns out my node failed 10% of uptime checks at random. Now I would agree that the SNO should be notified of such issues, but the assumption shouldn’t necessarily be that someone is trying to trick the system.

So, back to the question posed. After a node has been offline (last_contact_failure > last_contact_success) and a successful contact is made, also store a value for when the node came back online. Over the next hour, have the satellite check back at random times. When the hour passes and the node has successfully responded to all uptime checks, use the back-online value to determine how long the node was actually offline.

This basically requires the node to be online for an hour before it counts as back online, but doesn’t count that hour as offline time unless you fail uptime checks during that hour.
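
A minimal sketch of how that verification window might look on the satellite side, assuming a hypothetical Pinger interface; this is just to illustrate the flow, not an actual implementation:

```go
package satelliteverify

import (
	"context"
	"math/rand"
	"time"
)

// Pinger is whatever the satellite uses to reach out to a node.
type Pinger interface {
	Ping(ctx context.Context, nodeID string) error
}

// VerifyBackOnline re-pings the node a few times at random offsets spread
// over the next hour. It returns the time the node originally reported
// back (so the caller can end the offline period there), or an error if
// any verification ping failed, in which case the node stays "offline".
func VerifyBackOnline(ctx context.Context, p Pinger, nodeID string, backOnline time.Time, checks int) (time.Time, error) {
	if checks <= 0 {
		checks = 3 // default to a few verification pings
	}
	for i := 0; i < checks; i++ {
		// Sleep a random slice of the hour between checks.
		wait := time.Duration(rand.Int63n(int64(time.Hour / time.Duration(checks))))
		select {
		case <-ctx.Done():
			return backOnline, ctx.Err()
		case <-time.After(wait):
		}
		if err := p.Ping(ctx, nodeID); err != nil {
			return backOnline, err
		}
	}
	return backOnline, nil
}
```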

This question actually has been asked internally. I admittedly am not knowledgeable in the pros and cons here, though here’s a quote from one engineer:
keeping long lived connections over the open wan can be challenging in a number of ways. file descriptor exhaustion, weird network failures, weird routing changes, NAT tables expiring, tons of insane things.

Though, I don’t think the discussion is necessarily over.

That makes sense, especially since an idle TCP connection may fail, but you may not notice it. It could be done with regular keepalives and relatively short timeouts and retries. Still though, since the reason for this is to make sure that nodes are accessible, the current “I’m alive” pings look good enough as long as the node does not wait for 30 minutes after the internet connection comes back up to notify the satellite.
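
For reference, here is roughly what "regular keepalives and relatively short timeouts" could look like at the TCP level in Go; the address and intervals are made up for illustration:

```go
package main

import (
	"log"
	"net"
	"time"
)

// dialWithKeepalive opens a TCP connection with OS-level keepalive probes
// enabled, so a silently dead link is detected after a few missed probes
// instead of lingering as an idle "open" connection.
func dialWithKeepalive(addr string) (net.Conn, error) {
	d := net.Dialer{
		Timeout:   10 * time.Second,
		KeepAlive: 30 * time.Second, // probe the idle connection every 30s
	}
	return d.Dial("tcp", addr)
}

func main() {
	// "satellite.example:7777" is a placeholder address.
	conn, err := dialWithKeepalive("satellite.example:7777")
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}
```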

Thanks for the answer. Yes, there are a lot of weird issues with long-lived connections. Using regular keepalives would help to detect the connection state.
And if the client tries to reconnect (like it currently does), most problems (routing changes, NAT tables expiring, …) shouldn’t be a big deal.
I understand that file descriptor exhaustion might be a problem, but the limit for file descriptors is typically >1M, so having 10k nodes and 10k customers connected at the same time wouldn’t even make a difference.
But I have never worked with any system that has this number of clients connected, so there might be some other strange issues.

But, as this is apparently a well-known and already-discussed option for your engineers, there might not be any benefit in discussing this approach on the forum. (I doubt I can contribute anything new anyway :smiley: )

Maybe it would be a good idea if the Storj network actually tried to make contact with the SNO in case of a suspension pending disqualification. It could be automated, but it would still make sense to actually check in and figure out what is going on… like, what if someone is located in a disaster area and is offline for weeks? That doesn’t mean the data is corrupted, or that the node is invalid, just that the data isn’t online…

I think purely disqualifying nodes could be a step towards data loss, especially in rare special cases, since after a DQ the data the node stores is no longer useful to the SNO once their systems are restored.

I’m not saying that there needs to be payment or such… just that permanently removing the node from the network would lead SNOs to simply delete the data after a DQ, and then, if a huge region is flooded, the data for that region could in theory be lost… all due to disqualifying nodes permanently based on extended downtime…

Of course, unreliable nodes are a whole other matter…

I would like to share the high-level test plan with you. Please let me know if something is missing.

The following tests are not specific to downtime suspension mode. We might have them already for the new downtime tracking service. It would be nice to double-check that we are not missing any tests.

  1. Storage node didn’t send the hourly checkin message, satellite successfully pings the storage node: no penalty.
  2. Storage node didn’t send the hourly checkin message, 2 pings from the satellite to the storage node failed, and at the end the storage node checks in again; double-check the offline time calculation.
    Expected: 0:00 last hourly checkin, 0:40 and 0:50 pings failed, 1:00 checkin = 0:10 offline time.
  3. Storage node didn’t send the hourly checkin message, 2 pings from the satellite to the storage node failed, and at the end there is one successful ping without a storage node checkin; double-check the offline time calculation.
    Expected: 0:00 last hourly checkin, 0:40 and 0:50 pings failed, 1:00 successful ping = 0:10 offline time.
  4. Satellite is offline for an hour.
    Expected: 0:00 last hourly checkin, 0:40 and 0:50 pings failed, 0:55 satellite goes offline, 1:40 satellite comes back online, 2:00 failed ping, 2:15 successful ping = 0:10 offline time. (This is the exact expectation. If we apply a higher penalty, even if it is only an additional 10 minutes, we should talk about this example. I am happy to explain my expectation.)

Note: We can count only the time between 2 failed pings as offline time. The storage node will send a checkin message on startup but not on internet reconnect. A storage node might go offline and fail 2 pings because of an ISP outage. It comes back online and is not aware that the satellite is waiting for a checkin message; it might send the next checkin message half an hour later. The same applies to a successful ping at the end: we can’t count that time as offline either, for the same reasons.
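
A small sketch of that counting rule, with illustrative names: only the span between the first and last failed ping is counted, which yields the 0:10 in the examples above (failures at 0:40 and 0:50).

```go
package downtime

import "time"

// OfflineTime returns the duration between the earliest and latest failed
// pings in the window. Time before the first failure and after the last
// failure is not counted, per the rule described in the note above.
func OfflineTime(failedPings []time.Time) time.Duration {
	if len(failedPings) < 2 {
		return 0
	}
	first, last := failedPings[0], failedPings[0]
	for _, t := range failedPings[1:] {
		if t.Before(first) {
			first = t
		}
		if t.After(last) {
			last = t
		}
	}
	return last.Sub(first)
}
```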

The following tests are specific to downtime suspension mode.

  1. A suspended storage node should:
    1.1 not get any uploads, including repair uploads
    1.2 have repair triggered if a segment falls below the repair threshold because of suspension mode
    1.3 still get audits and uptime checks
    1.4 still get downloads
    1.5 not be allowed to initiate graceful exit
  2. Storage node has some downtime 30 days ago which brings it close to suspension mode; the storage node is now offline and gets into suspension mode (this also checks that we are counting the offline time correctly); meanwhile, the downtime from 30 days ago moves out of scope; the storage node should leave suspension mode as soon as it comes back online (early reinstatement).
  3. Storage node gets disqualified if it can’t leave downtime suspension mode in time.
  4. Storage node was offline for a very long time and was also suspended for unknown audit errors. Storage node comes back online.
    4.1 It leaves audit suspension mode with successful audits but should still be suspended for downtime.
    4.2 It leaves downtime suspension mode but should still be suspended for unknown audit errors.
    4.3 It leaves suspension mode for both and should get uploads again.
  5. Storage node has dropped all data; to avoid an audit DQ it pretends to be offline, and to avoid a downtime DQ it checks in every 10 minutes but goes offline immediately afterwards. The storage node should somehow get DQed. How do we detect that behavior?

And on top of that we still have to solve the open issue of dodging DQ. The following ideas are on the table and we would like to collect more feedback about these ideas or additional ideas on how to solve the problem:

  1. Apply the rules of containment mode on the first offline audit after the last contact success. Because the cheater node checks in every 10 minutes, we would need only 3 audit checks to apply an audit penalty and a few more iterations to DQ the node. A storage node that is offline for a few hours would be added to containment mode only once; additional audit checks will notice that the node wasn’t online, and the containment mode counter should not get increased.
    1.1 Downside: We would DQ nodes that have connection issues and restart frequently. Maybe we should use suspension mode for failed audits and for containment mode as well. That way they wouldn’t get DQed without the suspension mode grace period. We could send them a suspension warning email and they would be able to fix the problem without getting DQed.
    1.2 Personal note: Getting DQed for frequent restarts is probably unlikely. How about adding some metrics to see if that ever happens in production before we worry about it?

  2. Limit updates to last_contact_success to only once per hour.
    2.1 Downside: If the storage node checks in exactly every hour, we would still not ping it and not DQ it. Does an offline audit result count as a failed ping? Then we would catch the offline time, but small 500 GB nodes would still get a low number of audits, and with the cheater strategy the remaining time would still look like uptime.

  3. Store the time of audit contact failures to figure out how long in a row a node has missed audits. At some point, DQ it or trigger some kind of suspension mode.
    3.1 Downside: I don’t know the rules for DQ here, so it is hard to say which side effects we might get.

  4. Trigger an uptime check a few minutes after the storage node checkin. If the node is offline again, continue tracking the downtime and ignore the checkin event.
    4.1 Downside: Additional complexity. We can’t send the ping immediately after checkin; we would have to wait a few minutes. That means we have to store the pending pings somewhere, and we need another job to execute them (see the sketch after this list).
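
Here is a rough sketch of how option 4 could be structured, with a queue drained by a separate job. All names are hypothetical, and this ignores persistence, which a real satellite job would need:

```go
package downtime

import (
	"context"
	"sync"
	"time"
)

type pendingPing struct {
	nodeID string
	due    time.Time
}

// DelayedPingQueue stores nodes that recently checked in so another job
// can verify them a few minutes later.
type DelayedPingQueue struct {
	mu      sync.Mutex
	pending []pendingPing
}

// OnCheckin is called when a node checks in; the verification ping is
// scheduled rather than sent immediately.
func (q *DelayedPingQueue) OnCheckin(nodeID string, delay time.Duration) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending = append(q.pending, pendingPing{nodeID: nodeID, due: time.Now().Add(delay)})
}

// RunOnce is the separate job: it pings every node whose delay has
// elapsed and returns the ones that are offline again, so the caller can
// ignore their earlier checkin and keep tracking the downtime.
func (q *DelayedPingQueue) RunOnce(ctx context.Context, ping func(context.Context, string) error) []string {
	// Take the due entries out of the queue under the lock...
	q.mu.Lock()
	now := time.Now()
	var ready, remaining []pendingPing
	for _, p := range q.pending {
		if now.Before(p.due) {
			remaining = append(remaining, p)
		} else {
			ready = append(ready, p)
		}
	}
	q.pending = remaining
	q.mu.Unlock()

	// ...then ping outside the lock.
	var offline []string
	for _, p := range ready {
		if err := ping(ctx, p.nodeID); err != nil {
			offline = append(offline, p.nodeID)
		}
	}
	return offline
}
```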

Could these two things not be incorporated into the same job?

I think I like approach 4 best, mostly because it’s the only one that relies on a check that isn’t initiated by the node. It really simplifies the problem. It’s elegant and can be explained by simple language.

  • Node: Hey satellite, I’m back online!
  • Satellite: We’ll be the judge of that, I’ll check back with you at a random time in the future
    [random time later]
  • Satellite: Hey are you there?
  • Node: Yep, all good!
  • Satellite: Looks like you were indeed back online, noted. Carry on!

In my opinion 2 doesn’t really even solve it. You could make the node look exactly like a normal running node would and make contact every hour.

Option 3 likely also gets you a long way there, though. It would be useful to get some stats on how often smaller nodes get audits and how long it would take for that task to detect downtime. However, it is a lot harder to explain this scenario to SNOs. Your node was offline, but it did check in, so it’s not a downtime DQ. You didn’t fail any audits, so it’s not an audit DQ. But you didn’t respond to too many audits over a period of time, so it’s a somewhere-in-between DQ? It gets a little messy.

I like this idea. We’ll have to hammer out the details, but I think it could work. I don’t think it necessarily even has to mark downtime itself, as long as it forces the node to stay online long enough that the incentive to shut down is just not worth the trouble.

I think option 1 could work too, but the potential of DQing flapping nodes is there.

Right, I skipped over that one. I think it could work, but it’s a little confusing. If I understand it right, the first audit that detects a node is offline would, after 3 attempts, count as an audit failure. This would eventually lead to a DQ based on audits, but the actual problem was that the node wasn’t online.

Now… I have absolutely no problem saddling cheaters with this slightly confusing scenario, but it would also mean that people who have their node go offline would see audit failures and audit scores dropping. And it may send them looking for issues in the wrong place.

I’m honestly not entirely sure I understand option 1, so please correct me if I interpreted it wrong.

I think the idea is that if an audit detects that the node is offline, but it was last seen to be online, we put it into containment mode. If the node is audited again and is still found to be offline, there is no penalty. If it’s audited again and is offline, but again we find that last_contact_success > last_contact_failure, we increment the containment counter. After 3, it counts as an audit failure. I suppose the only way to leave this containment mode is to pass the audit.
That was my understanding, anyway.
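
Based on that reading, here is a rough sketch of the counter logic (hypothetical names and thresholds, just to check my understanding). The caller is assumed to update the last-contact timestamps separately.

```go
package audit

import "time"

// NodeContactState holds the per-node fields this sketch relies on.
type NodeContactState struct {
	LastContactSuccess time.Time
	LastContactFailure time.Time
	ContainmentCount   int
}

const containmentLimit = 3

// RecordOfflineAudit updates the containment counter for a node that did
// not respond to an audit. It returns true when the miss should be
// treated as an audit failure.
func RecordOfflineAudit(s *NodeContactState) bool {
	// Only count the miss if the node claimed to be online since its last
	// failure, i.e. it checked in and then disappeared again. A node that
	// simply stayed offline is not penalized further.
	if s.LastContactSuccess.After(s.LastContactFailure) {
		s.ContainmentCount++
	}
	if s.ContainmentCount >= containmentLimit {
		s.ContainmentCount = 0
		return true // treat as a failed audit
	}
	return false
}

// RecordSuccessfulAudit is the only way out of containment in this sketch.
func RecordSuccessfulAudit(s *NodeContactState) {
	s.ContainmentCount = 0
}
```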

I see, that approach would make it less likely that a node with spotty connection issues would run into this. I can see that working, though you’d still be registering failed audits instead of offline time, while the node is effectively offline.

@littleskunk I have the opposite question:

How can an SNO avoid downtime?

Can it still use a backup channel for disaster situations, with a DDNS name on the storage node side and switching the IP via DDNS when the main ISP has an outage?