Blueprint: Downtime Disqualification

Why? Assuming the node can stay online for GE to complete and has not lost any data (also a requirement for GE), why shouldn’t it be allowed to GE? I could think of some scenarios where I would want that.

Initiate an audit a few minutes after the node checks in (a few minutes, because sometimes the node takes a few minutes to start just to see the CLI dashboard):

  • If the audit times out, disregard the check-in.
  • If the audit fails, mark the node as online, but with a failed audit.
  • If the audit succeeds, mark the node as online.

However, I think that the checks should be made in such a way that a node either fails an uptime check or an audit, but not both for the same problem.
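
Something like this is what I have in mind on the satellite side (just a sketch; all type and function names below are invented and not from the actual satellite code):

```go
// Sketch of the proposed check-in verification flow. All names are hypothetical.
package main

import (
	"context"
	"time"
)

type AuditResult int

const (
	AuditSuccess AuditResult = iota
	AuditFailure
	AuditTimeout
)

// Auditor stands in for whatever service can audit a single node on demand.
type Auditor interface {
	AuditNode(ctx context.Context, nodeID string) (AuditResult, error)
}

// Reputation stands in for whatever service records node status.
type Reputation interface {
	MarkOnline(nodeID string)
	MarkOnlineWithFailedAudit(nodeID string)
	DisregardCheckIn(nodeID string)
}

// VerifyCheckIn waits a few minutes after a check-in, audits the node,
// and records the outcome as described above.
func VerifyCheckIn(ctx context.Context, auditor Auditor, rep Reputation, nodeID string, delay time.Duration) error {
	select {
	case <-time.After(delay): // e.g. a few minutes after the check-in
	case <-ctx.Done():
		return ctx.Err()
	}

	result, err := auditor.AuditNode(ctx, nodeID)
	if err != nil {
		return err
	}

	switch result {
	case AuditTimeout:
		// Node did not answer: the check-in does not count as "online".
		rep.DisregardCheckIn(nodeID)
	case AuditFailure:
		// Node is reachable but the data is bad: online, but with a failed audit.
		rep.MarkOnlineWithFailedAudit(nodeID)
	case AuditSuccess:
		rep.MarkOnline(nodeID)
	}
	return nil
}

func main() {}
```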

The satellite should use DNS a bit differently than standard. DDNS entries have a low TTL because the IP can change without warning and, obviously, a long TTL here would be bad. However, if the DDNS service itself goes down, a short TTL would make the node unreachable. So the satellite should cache the last IP of the node and re-check it after the TTL expires, but if that re-check fails, keep the last known IP.
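
As a sketch of what I mean (purely illustrative; this is not how the satellite actually resolves node addresses):

```go
// Sketch of a "stale on failure" DNS cache, purely illustrative.
package main

import (
	"context"
	"fmt"
	"net"
	"sync"
	"time"
)

type cachedIP struct {
	ip        string
	expiresAt time.Time
}

// StaleCache resolves a hostname, caches the result for a TTL, and keeps
// serving the last known IP if a re-resolution attempt fails.
type StaleCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	cache map[string]cachedIP
}

func NewStaleCache(ttl time.Duration) *StaleCache {
	return &StaleCache{ttl: ttl, cache: make(map[string]cachedIP)}
}

func (c *StaleCache) Lookup(ctx context.Context, host string) (string, error) {
	c.mu.Lock()
	entry, ok := c.cache[host]
	c.mu.Unlock()

	// Fresh entry: use it as-is.
	if ok && time.Now().Before(entry.expiresAt) {
		return entry.ip, nil
	}

	// TTL expired (or never resolved): try to re-resolve.
	ips, err := net.DefaultResolver.LookupHost(ctx, host)
	if err != nil || len(ips) == 0 {
		if ok {
			// DNS (or the DDNS provider) is down: keep the last known IP.
			return entry.ip, nil
		}
		return "", fmt.Errorf("no cached IP for %s and resolution failed: %v", host, err)
	}

	c.mu.Lock()
	c.cache[host] = cachedIP{ip: ips[0], expiresAt: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return ips[0], nil
}

func main() {
	cache := NewStaleCache(5 * time.Minute)
	ip, err := cache.Lookup(context.Background(), "example.com")
	fmt.Println(ip, err)
}
```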

Or drop the DDNS requirement completely and just note the IP every time the node checks in (and the node should check in after noticing no traffic from a satellite, even before the 1-hour mark).


Sorry, but my question is a little bit different: how can an SNO avoid downtime?

I (as an SNO) don't care how it works on the satellite side, and I have no influence over the satellite, but I would like to know, so I asked the question with a solution as an example.

Do you have another answer for this question? :grinning:

Redundant internet connections (preferably separate ISPs), UPSes, a generator, redundant switches, a server with two PSUs, ECC RAM, etc. You can also do an HA setup with shared iSCSI storage and multiple hosts for the node VM.

Pretty much do what they do in data centers :slight_smile: since the goal is the same - avoiding downtime.

EDIT: Oh, and not using automatic updates.


Thanks! :slight_smile:

You are right, and I agree with you: an SNO can (and should) apply standard HA practices on the storage node side.
I have already applied everything you listed and caught myself thinking that I had built a homebrew data center :smiley:

One difference is that it is not possible to back up the node, so I don’t. I also currently use one server for the node instead of HA with shared storage.

I think we can all agree that this shouldn't be a requirement, and you're going to spend a long time trying to get any ROI on these expenses. I really think the uptime requirement should be lenient enough that ISP outages never get your node disqualified, and I'm not willing to spend money on any of these measures. Nor should any SNO, in my opinion.

This new system already allows you to recover, and I'm hoping that new uptime requirements are introduced along with it. Either way, I'm not going to build a data-center-like setup, especially since Storjlabs has consistently said they don't require that, and I trust they will build something that doesn't require it.

Storjlabs has consistently said that you should use a single hard drive instead of RAID, but the system is built in such a way that if your drive gets some bad sectors you are screwed (backups are not possible, GE requires 100% of the data to be present, and it does not take a lot of failed audits to get DQed).

We already do that :+1:


That is not entirely correct. I think it’s in the realm of 90%!


I was under the impression that if a piece is missing during GE, then GE fails and the node gets disqualified. If I was wrong, then great.

We tolerate a small percentage of pieces being corrupted/missing, same as with your audit rate!

The problem is that the storage node is checking in every 10 minutes. The satellite will not detect that the storage node was offline in the first place. If we want to detect that, we would have to do it on every check-in for every storage node. I don't think the satellite is able to send that many ping messages.
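
Just to put rough numbers on it (the node count here is made up for illustration): with 10,000 nodes each checking in every 10 minutes, that is 10,000 / 600 s ≈ 17 extra pings or audits per second, sustained around the clock, on top of everything else the satellite already does.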

I agree and will remove it from the list. I just wanted to have a second opinion. Thank you.

Ahh, I see the problem. It would be tough to even get into that situation where you're perfectly cheating the system without ever being seen as offline, but it's theoretically not impossible.
Perhaps you could use an offline response to an audit to trigger this check-back system.
So:

  • Offline audit happens
  • Node checks back in
  • Satellite performs an uptime check a random number of minutes later
    • Node responds => Accept the previous check-in
    • Node doesn't respond => Discard the previous check-in and record downtime between the first offline audit and the failed uptime check by the satellite.

In theory this could lead to some more offline time on spotty connections, but I think if the period for the next check is short enough that's a reasonable trade-off, and it seems not all that likely to happen.
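
As a rough sketch of that flow (all names below are invented, not from the actual satellite code):

```go
// Sketch of the "check back after an offline audit" idea. All names are invented.
package main

import (
	"context"
	"math/rand"
	"time"
)

type Pinger interface {
	// Ping returns nil if the node answered the uptime check.
	Ping(ctx context.Context, nodeID string) error
}

type UptimeLedger interface {
	AcceptCheckIn(nodeID string, at time.Time)
	RecordDowntime(nodeID string, from, to time.Time)
}

// HandleCheckInAfterOfflineAudit is called when a node that previously missed
// an audit (because it was offline) checks back in.
func HandleCheckInAfterOfflineAudit(ctx context.Context, p Pinger, ledger UptimeLedger,
	nodeID string, offlineAuditAt, checkInAt time.Time) {

	// Wait a random number of minutes so the node can't predict the re-check.
	delay := time.Duration(1+rand.Intn(10)) * time.Minute
	select {
	case <-time.After(delay):
	case <-ctx.Done():
		return
	}

	if err := p.Ping(ctx, nodeID); err == nil {
		// Node responded: accept the earlier check-in.
		ledger.AcceptCheckIn(nodeID, checkInAt)
		return
	}

	// Node did not respond: discard the check-in and count downtime from the
	// first offline audit until this failed uptime check.
	ledger.RecordDowntime(nodeID, offlineAuditAt, time.Now())
}

func main() {}
```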

Option 1 would take a long time to DQ a node. I think you need more than 10 failed audits to get DQed, and because we use containment mode we would have to multiply that by 3. That means a "healthy" node would need to miss 30 audits in a row. The question is: would we still call that healthy?

Let's say a node managed to get into a crash loop and is missing 30 audits; that also means it is not responding to any download requests. I agree that DQ might be a bit too aggressive here. I would love to replace DQ with suspension mode everywhere. Even a node that dropped all data could simply get a second chance via suspension mode. That way storage nodes can't complain about getting DQed, because they had a chance to fix the issue no matter what it was.
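
To make the suspension idea a bit more concrete, roughly this kind of state machine is what I mean (just a sketch; the thresholds and names are invented, not what the satellite actually uses):

```go
// Sketch of "suspension instead of immediate DQ". Thresholds and names are
// invented for illustration only.
package main

import "time"

type NodeStatus int

const (
	StatusHealthy NodeStatus = iota
	StatusSuspended
	StatusDisqualified
)

type NodeRecord struct {
	Status         NodeStatus
	AuditScore     float64 // 0..1, drops on failed/missed audits
	SuspendedSince time.Time
}

const (
	suspendBelow = 0.6                // made-up threshold
	reinstateAt  = 0.95               // made-up threshold
	gracePeriod  = 7 * 24 * time.Hour // made-up grace period
)

// Evaluate is called after every audit result has been folded into AuditScore.
func Evaluate(n *NodeRecord, now time.Time) {
	switch n.Status {
	case StatusHealthy:
		if n.AuditScore < suspendBelow {
			// Instead of disqualifying right away, give the operator a chance
			// to notice and fix the problem.
			n.Status = StatusSuspended
			n.SuspendedSince = now
		}
	case StatusSuspended:
		if n.AuditScore >= reinstateAt {
			n.Status = StatusHealthy // problem fixed, the second chance worked
		} else if now.Sub(n.SuspendedSince) > gracePeriod {
			n.Status = StatusDisqualified // never recovered
		}
	}
}

func main() {}
```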

What do you think about that idea?

Fun fact: that is already the case with the current implementation. Containment mode is used to remember nodes that go offline on an audit request. On the storage node side you wouldn't see that. We would only extend that offline detection a bit.


Yes. If the data is really gone (dead drive), then the node will be disqualified anyway, just a bit later, but if the data is “gone” because the USB cable fell out, the operator will be able to reconnect it and pass audits again.

It is important to keep in mind that the storage nodes are running on home setups. The node may be left unattended for a significant amount of time (operator at work, on vacation etc), this is different from a datacenter where some employee can get to the servers rather quickly at any time.


Yes, you can do that. We are caching IP addresses, but only for uplinks. The audit service and the repair service will make sure to use your DDNS. The IP cache should catch up a few minutes later.


@cameron That is a valid question. I think the plan was to avoid giving a bad node an escape plan, but even if they start GE we would still apply the downtime DQ rules. We would have to update the documentation around GE. Careful if you are in suspension mode: you can initiate GE, but it might be a waste of resources, because if you don't fix the problem you will get DQed in the middle of GE.

That is even more expensive for the satellite than the idea of pinging the node a random time after check-in. We would still have the same problem with it: the cheater node is avoiding the detection in the first place, which would force us to ping/audit all nodes on check-in. We can't just do it for nodes that have been offline.

Edit: Now I have posted 3 comments in a row and I am not allowed to continue until someone else posts something :smiley:

Wouldn’t such a node be caught by regular audits? The way I understand this, with an audit you can get multiple results:

  1. Audit pass
  2. Audit fail (the node returns wrong data, or an error saying it cannot access the data).
  3. Audit unknown (the node returns some other error).
  4. Node offline (timed out trying to contact the node).

If the cheater node passes audits, then what is it cheating? If it fails audits, then it will be DQed for that. If the cheater sends check-in pings but does not respond to audits, then a timed-out audit attempt should be treated as a failed uptime check.
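
To make that concrete, the mapping I have in mind looks something like this (names and score formulas are made up, not taken from the real code):

```go
// Sketch of how the four audit outcomes above could feed into reputation.
// All names and formulas are invented to make the argument concrete.
package main

type AuditOutcome int

const (
	AuditPass        AuditOutcome = iota
	AuditFail        // wrong data, or an error saying it cannot access the data
	AuditUnknown     // some other error
	AuditNodeOffline // timed out trying to contact the node
)

type Reputation struct {
	AuditScore   float64
	UptimeScore  float64
	UnknownScore float64
}

// Apply folds one audit outcome into the node's reputation.
func Apply(r *Reputation, o AuditOutcome) {
	switch o {
	case AuditPass:
		r.AuditScore = bump(r.AuditScore)
		r.UptimeScore = bump(r.UptimeScore) // it answered, so it was online
	case AuditFail:
		r.AuditScore = drop(r.AuditScore) // data problem -> audit reputation
	case AuditUnknown:
		r.UnknownScore = drop(r.UnknownScore) // unknown error -> its own score
	case AuditNodeOffline:
		// The point above: a timed-out audit is an availability problem,
		// so it should hit the uptime score, not the audit score.
		r.UptimeScore = drop(r.UptimeScore)
	}
}

// bump/drop stand in for whatever moving-average formula is actually used.
func bump(s float64) float64 { return s*0.95 + 0.05 }
func drop(s float64) float64 { return s * 0.95 }

func main() {}
```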

Also, I wonder, if I started a node and forgot to forward the port, would it behave like the “cheater” in this scenario? Pinging the satellite, but not accepting any incoming connections.

The cheater node will skip audits because it is offline by the time the audit hits the node. The cheater node will also avoid the downtime tracking by pinging the satellite once per hour (or every 10 minutes for additional safety).