Blueprint: Downtime Disqualification

Why? Assuming the node can stay online for GE to complete and has not lost any data (also a requirement for GE), why shouldn’t it be allowed to GE? I could think of some scenarios where I would want that.

Initiate an audit a few minutes after the node checks in (a few minutes, because sometimes the node takes a few minutes to start just to see the CLI dashboard):

  • If the audit times out, disregard the check-in.
  • If the audit fails, mark the node as online, but with a failed audit.
  • If the audit succeeds, mark the node as online.

However, I think that the checks should be made in such a way that a node either fails an uptime check or an audit, but not both for the same problem.
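
Something like this is what I have in mind on the satellite side (just a sketch; all type and function names below are invented and not from the actual satellite code):

```go
// Sketch of the proposed check-in verification flow. All names are hypothetical.
package main

import (
	"context"
	"time"
)

type AuditResult int

const (
	AuditSuccess AuditResult = iota
	AuditFailure
	AuditTimeout
)

// Auditor stands in for whatever service can audit a single node on demand.
type Auditor interface {
	AuditNode(ctx context.Context, nodeID string) (AuditResult, error)
}

// Reputation stands in for whatever service records node status.
type Reputation interface {
	MarkOnline(nodeID string)
	MarkOnlineWithFailedAudit(nodeID string)
	DisregardCheckIn(nodeID string)
}

// VerifyCheckIn waits a few minutes after a check-in, audits the node,
// and records the outcome as described above.
func VerifyCheckIn(ctx context.Context, auditor Auditor, rep Reputation, nodeID string, delay time.Duration) error {
	select {
	case <-time.After(delay): // e.g. a few minutes after the check-in
	case <-ctx.Done():
		return ctx.Err()
	}

	result, err := auditor.AuditNode(ctx, nodeID)
	if err != nil {
		return err
	}

	switch result {
	case AuditTimeout:
		// Node did not answer: the check-in does not count as "online".
		rep.DisregardCheckIn(nodeID)
	case AuditFailure:
		// Node is reachable but the data is bad: online, but with a failed audit.
		rep.MarkOnlineWithFailedAudit(nodeID)
	case AuditSuccess:
		rep.MarkOnline(nodeID)
	}
	return nil
}

func main() {}
```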

The satellite should use DNS a bit differently than standard. DDNS entries have a low TTL because the IP can change without warning and, obviously, a long TTL here would be bad. However, if the DDNS service itself goes down, a short TTL would make the node unreachable. So the satellite should cache the last IP of the node and re-check it after the TTL expires, but if that re-check fails, keep the last known IP.
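
As a sketch of what I mean (purely illustrative; this is not how the satellite actually resolves node addresses):

```go
// Sketch of a "stale on failure" DNS cache, purely illustrative.
package main

import (
	"context"
	"fmt"
	"net"
	"sync"
	"time"
)

type cachedIP struct {
	ip        string
	expiresAt time.Time
}

// StaleCache resolves a hostname, caches the result for a TTL, and keeps
// serving the last known IP if a re-resolution attempt fails.
type StaleCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	cache map[string]cachedIP
}

func NewStaleCache(ttl time.Duration) *StaleCache {
	return &StaleCache{ttl: ttl, cache: make(map[string]cachedIP)}
}

func (c *StaleCache) Lookup(ctx context.Context, host string) (string, error) {
	c.mu.Lock()
	entry, ok := c.cache[host]
	c.mu.Unlock()

	// Fresh entry: use it as-is.
	if ok && time.Now().Before(entry.expiresAt) {
		return entry.ip, nil
	}

	// TTL expired (or never resolved): try to re-resolve.
	ips, err := net.DefaultResolver.LookupHost(ctx, host)
	if err != nil || len(ips) == 0 {
		if ok {
			// DNS (or the DDNS provider) is down: keep the last known IP.
			return entry.ip, nil
		}
		return "", fmt.Errorf("no cached IP for %s and resolution failed: %v", host, err)
	}

	c.mu.Lock()
	c.cache[host] = cachedIP{ip: ips[0], expiresAt: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return ips[0], nil
}

func main() {
	cache := NewStaleCache(5 * time.Minute)
	ip, err := cache.Lookup(context.Background(), "example.com")
	fmt.Println(ip, err)
}
```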

Or drop the DDNS requirement completely and just note the IP every time the node checks in (and the node should check in after noticing no traffic from a satellite, even before the 1-hour mark).


Sorry, but my question is a little bit different: how can an SNO avoid downtime?

I (as an SNO) don't care how it works on the satellite side, and I have no influence over the satellite, but I would like to know, so I asked the question with a solution as an example.

Do you have another answer for this question? :grinning:

Redundant internet connections (preferably separate ISPs), UPSes, a generator, redundant switches, a server with two PSUs, ECC RAM, etc. You can also do an HA setup with shared iSCSI storage and multiple hosts for the node VM.

Pretty much do what they do in data centers :slight_smile: since the goal is the same - avoiding downtime.

EDIT: Oh, and not using automatic updates.


Thanks! :slight_smile:

You are right, and I agree with you: an SNO can (and should) apply standard HA practices on the storage node side.
I have already applied everything you listed and caught myself thinking that I had built a homebrew data center :smiley:

One difference is that it is not possible to back up the node, so I don’t. I also currently use one server for the node instead of HA with shared storage.

I think we can all agree that this shouldn't be a requirement, and you're going to spend a long time trying to get any ROI on these expenses. I really think the uptime requirement should be lenient enough that ISP outages never get your node disqualified, and I'm not willing to spend money on any of these measures. Nor should any SNO, in my opinion.

This new system already allows you to recover, and I'm hoping that new uptime requirements are introduced along with it. Either way, I'm not going to build a data-center-like setup, especially since Storjlabs has consistently said they don't require that, and I trust they will build something that doesn't require it.

Storjlabs has consistently said that you should use a single hard drive instead of RAID, but the system is built in such a way that if your drive gets some bad sectors you are screwed (backups are not possible, GE requires 100% of the data to be present, and it does not take a lot of failed audits to get DQed).

We already do that :+1:


That is not entirely correct. I think it’s in the realm of 90%!


I was under the impression that if a piece is missing during GE, then GE fails and the node gets disqualified. If I was wrong, then great.

We tolerate a small percentage of pieces being corrupted/missing, same as with your audit rate!

The problem is that the storage node is checking in every 10 minutes. The satellite will not detect that the storage node was offline in the first place. If we want to detect that, we would have to do it on every check-in for every storage node. I don't think the satellite is able to send that many ping messages.
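
Just to put rough numbers on it (the node count here is made up for illustration): with 10,000 nodes each checking in every 10 minutes, that is 10,000 / 600 s ≈ 17 extra pings or audits per second, sustained around the clock, on top of everything else the satellite already does.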

I agree and will remove it from the list. I just wanted to have a second opinion. Thank you.

Ahh, I see the problem. It would be tough to even get into that situation where you're perfectly cheating the system without ever being seen as offline, but it's theoretically not impossible.
Perhaps you could use an offline response to an audit to trigger this check-back system.
So:

  • Offline audit happens
  • Node checks back in
  • Satellite performs an uptime check a random number of minutes later
    • Node responds => Accept the previous check-in
    • Node doesn't respond => Discard the previous check-in and record downtime between the first offline audit and the failed uptime check by the satellite.

In theory this could lead to some more offline time on spotty connections, but I think if the period for the next check is short enough that's a reasonable trade-off, and it seems not all that likely to happen.
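
As a rough sketch of that flow (all names below are invented, not from the actual satellite code):

```go
// Sketch of the "check back after an offline audit" idea. All names are invented.
package main

import (
	"context"
	"math/rand"
	"time"
)

type Pinger interface {
	// Ping returns nil if the node answered the uptime check.
	Ping(ctx context.Context, nodeID string) error
}

type UptimeLedger interface {
	AcceptCheckIn(nodeID string, at time.Time)
	RecordDowntime(nodeID string, from, to time.Time)
}

// HandleCheckInAfterOfflineAudit is called when a node that previously missed
// an audit (because it was offline) checks back in.
func HandleCheckInAfterOfflineAudit(ctx context.Context, p Pinger, ledger UptimeLedger,
	nodeID string, offlineAuditAt, checkInAt time.Time) {

	// Wait a random number of minutes so the node can't predict the re-check.
	delay := time.Duration(1+rand.Intn(10)) * time.Minute
	select {
	case <-time.After(delay):
	case <-ctx.Done():
		return
	}

	if err := p.Ping(ctx, nodeID); err == nil {
		// Node responded: accept the earlier check-in.
		ledger.AcceptCheckIn(nodeID, checkInAt)
		return
	}

	// Node did not respond: discard the check-in and count downtime from the
	// first offline audit until this failed uptime check.
	ledger.RecordDowntime(nodeID, offlineAuditAt, time.Now())
}

func main() {}
```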

Option 1 would take a long time to DQ a node. I think you need more than 10 failed audits to get DQed, and because we use containment mode we would have to multiply that by 3. That means a "healthy" node would need to miss 30 audits in a row. The question is: would we still call that healthy?

Let's say a node managed to get into a crash loop and is missing 30 audits; that also means it is not responding to any download requests. I agree that DQ might be a bit too aggressive here. I would love to replace DQ with suspension mode everywhere. Even a node that dropped all data could simply get a second chance via suspension mode. That way storage nodes can't complain about getting DQed, because they had a chance to fix the issue no matter what it was.
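
To make the suspension idea a bit more concrete, roughly this kind of state machine is what I mean (just a sketch; the thresholds and names are invented, not what the satellite actually uses):

```go
// Sketch of "suspension instead of immediate DQ". Thresholds and names are
// invented for illustration only.
package main

import "time"

type NodeStatus int

const (
	StatusHealthy NodeStatus = iota
	StatusSuspended
	StatusDisqualified
)

type NodeRecord struct {
	Status         NodeStatus
	AuditScore     float64 // 0..1, drops on failed/missed audits
	SuspendedSince time.Time
}

const (
	suspendBelow = 0.6                // made-up threshold
	reinstateAt  = 0.95               // made-up threshold
	gracePeriod  = 7 * 24 * time.Hour // made-up grace period
)

// Evaluate is called after every audit result has been folded into AuditScore.
func Evaluate(n *NodeRecord, now time.Time) {
	switch n.Status {
	case StatusHealthy:
		if n.AuditScore < suspendBelow {
			// Instead of disqualifying right away, give the operator a chance
			// to notice and fix the problem.
			n.Status = StatusSuspended
			n.SuspendedSince = now
		}
	case StatusSuspended:
		if n.AuditScore >= reinstateAt {
			n.Status = StatusHealthy // problem fixed, the second chance worked
		} else if now.Sub(n.SuspendedSince) > gracePeriod {
			n.Status = StatusDisqualified // never recovered
		}
	}
}

func main() {}
```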

What do you think about that idea?

Fun fact: that is already the case with the current implementation. Containment mode is used to remember nodes that go offline on an audit request. On the storage node side you wouldn't see that. We would only extend that offline detection a bit.


Yes. If the data is really gone (dead drive), then the node will be disqualified anyway, just a bit later, but if the data is “gone” because the USB cable fell out, the operator will be able to reconnect it and pass audits again.

It is important to keep in mind that the storage nodes are running on home setups. The node may be left unattended for a significant amount of time (operator at work, on vacation etc), this is different from a datacenter where some employee can get to the servers rather quickly at any time.


Yes, you can do that. We are caching IP addresses, but only for uplinks. The audit service and the repair service will make sure to use your DDNS. The IP cache should catch up a few minutes later.


@cameron That is a valid question. I think the plan was to avoid giving a bad node an escape plan, but even if they start GE we would still apply the downtime DQ rules. We would have to update the documentation around GE. Careful if you are in suspension mode: you can initiate GE, but it might be a waste of resources, because if you don't fix the problem you will get DQed in the middle of GE.

That is even more expensive for the satellite than the idea of pinging the node a random time after check-in. We would still have the same problem with it: the cheater node is avoiding the detection in the first place, which would force us to ping/audit all nodes on check-in. We can't just do it for nodes that have been offline.

Edit: Now I have posted 3 comments in a row and I am not allowed to continue until someone else posts something :smiley:

Wouldn’t such a node be caught by regular audits? The way I understand this, with an audit you can get multiple results:

  1. Audit pass
  2. Audit fail (the node returns wrong data, or an error saying it cannot access the data).
  3. Audit unknown (the node returns some other error).
  4. Node offline (timed out trying to contact the node).

If the cheater node passes audits, then what is it cheating? If it fails audits, then it will be DQed for that. If the cheater sends check-in pings but does not respond to audits, then a timed-out audit attempt should be treated as a failed uptime check.
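
To make that concrete, the mapping I have in mind looks something like this (names and score formulas are made up, not taken from the real code):

```go
// Sketch of how the four audit outcomes above could feed into reputation.
// All names and formulas are invented to make the argument concrete.
package main

type AuditOutcome int

const (
	AuditPass        AuditOutcome = iota
	AuditFail        // wrong data, or an error saying it cannot access the data
	AuditUnknown     // some other error
	AuditNodeOffline // timed out trying to contact the node
)

type Reputation struct {
	AuditScore   float64
	UptimeScore  float64
	UnknownScore float64
}

// Apply folds one audit outcome into the node's reputation.
func Apply(r *Reputation, o AuditOutcome) {
	switch o {
	case AuditPass:
		r.AuditScore = bump(r.AuditScore)
		r.UptimeScore = bump(r.UptimeScore) // it answered, so it was online
	case AuditFail:
		r.AuditScore = drop(r.AuditScore) // data problem -> audit reputation
	case AuditUnknown:
		r.UnknownScore = drop(r.UnknownScore) // unknown error -> its own score
	case AuditNodeOffline:
		// The point above: a timed-out audit is an availability problem,
		// so it should hit the uptime score, not the audit score.
		r.UptimeScore = drop(r.UptimeScore)
	}
}

// bump/drop stand in for whatever moving-average formula is actually used.
func bump(s float64) float64 { return s*0.95 + 0.05 }
func drop(s float64) float64 { return s * 0.95 }

func main() {}
```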

Also, I wonder, if I started a node and forgot to forward the port, would it behave like the “cheater” in this scenario? Pinging the satellite, but not accepting any incoming connections.

The cheater node will skip audits because it is offline by the time the audit hits the node. The cheater node will also avoid the downtime tracking by pinging the satellite once per hour (or every 10 minutes for additional safety).