Blueprint: Downtime Disqualification

We might end up allowing multiple IPs per node, but it would need to be shown that it helps a significant number of SNOs. The need to hit multiple IPs in some cases would add a noticeable cost to both satellite and uplink operations. A single multihomed IP might be a better solution.

Oh, don’t worry, nodes also check in with satellites at regular intervals while running. You shouldn’t need to restart the node if the internet connection has been down.

I can’t parse this: “A single multihomed IP”.
But possibly it is a new Internet meme :slight_smile:

This is not about SNOs; we have adapted already. It is about Tardigrade reliability and node churn. Right before you turn on DQ, half of the SNOs will do a Graceful Exit, because Storj Labs’ algorithms are not so good.
Not so good, so old SNOs will count on receiving their held amounts from escrow before your lambdas, epsilons and scales DQ us.

4 Likes

Ideally, nodes would work by ID.
I have several pools of IP addresses from different providers.
I would like to specify all of them and not care which of them my node works on.
I support running a node on several IP addresses.

Does the node retry the check-in attempt quickly if one fails? If the check-ins are one attempt per hour, then it would be possible to lose a few in a row even though the rest of the time the connection works fine.

First of all, I think that before implementing Downtime Disqualification, Storj should implement a Planned Maintenance feature for SNOs to reduce the chance of disqualification for necessary downtime. I have experienced some offline time recently, and not all of it was unexpected or avoidable.
For example, we had a power outage: maintenance by our power supplier that was completely outside our control. With one week’s prior notice they set a date when the power would be shut off for some urgent maintenance, so I had to shut the server down beforehand and restart it later when the power was back on. That would have been a perfect case for notifying the satellites that the node would have a planned maintenance period and would not be reachable for some time.

Second, before implementing this, I think Storj should list all the likely causes of node downtime to get a better idea of what nodes have to deal with and in which cases penalties are justified. It should always be taken into account that we are dealing with home equipment and home operators here, so there is no enterprise-grade hardware, normally no UPS, no ISP high availability, no SLAs and things like that.

A node can go offline because of a power outage (as noted above), but ISPs also frequently have issues. Hardware like home/DSL routers can fail, and where I live (Germany), due to shop closing laws, if something fails on a Saturday evening you cannot buy a replacement until Monday morning, and if that is a holiday it is even later. So this can easily add up to 40 hours of downtime in a single weekend if some hardware fails. It is also common here to use a DSL modem provided by your ISP, so if that fails, they have to send a new one, which can take a couple of days.

We are also dealing with SNOs that do not monitor their node frequently. It has been said many times on this forum that being an SNO should be set-it-and-forget-it. So if a node goes down while the operator is at work or during the night, it might remain unnoticed for some time.

And of course there is planned maintenance. Node servers will have to install (Windows) updates and do hardware upgrades. They will have to move node data from one hard drive to another, and so on. I recently had to do that, and moving hundreds of gigabytes or even terabytes around can take some time.

So with all that said, when trying to be a “good” SNO could still lead to something like a 90-day suspension, I am not sure it would be worth bringing the node back online once it has been suspended for downtime that was beyond my influence. So a node should not be disqualified for downtime too easily.

So basically what I am trying to say is that Storj must be very careful to find a balance between the enterprise-grade availability it advertises to its customers and the fact that the data runs on SNOs with low-end hardware, no SLAs and little influence over the availability of their nodes. The risk is that a node, once disqualified, will not come back online.

The underlying truth, of course, is that if Storj needs highly available nodes, it would have to pay for them. With higher payments, SNOs could (and I believe they would) invest in better and redundant hardware, UPSs, enterprise-grade disks, ISP redundancy and so on.

So maybe instead of suspending a node, reduce its payout to the level of availability it reached in a month: if a node was up only 50%, pay it only 50%. Or reduce the payout by the repair costs it has caused.
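
To make that concrete, here is a rough sketch of the calculation I have in mind (made-up numbers, and obviously not how Storj actually computes payouts):

```go
package main

import "fmt"

// Rough illustration of the idea: scale the monthly payout by the
// availability the node actually achieved, then subtract the repair
// cost its downtime caused. All numbers here are made up.
func adjustedPayout(basePayoutUSD, uptimeFraction, repairCostUSD float64) float64 {
	payout := basePayoutUSD*uptimeFraction - repairCostUSD
	if payout < 0 {
		payout = 0
	}
	return payout
}

func main() {
	// 50% uptime on a $20 base payout, minus $1.25 of caused repair: $8.75
	fmt.Printf("$%.2f\n", adjustedPayout(20.0, 0.5, 1.25))
}
```
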
Also, one thing is true: data that is not frequently accessed does not need high availability. Take backup data that is uploaded to Tardigrade once and never touched again. So maybe data organisation within Tardigrade could put data that requires high availability on nodes with better availability, while data that does not need to be frequently accessed can be put on nodes with lower availability. This would also result in lower repair activity.

5 Likes

While I agree with most of what you said, I think there should be a lower limit on uptime. I mean if your node is up only every second day then there is probably some problem you need to fix.

However, rare long downtimes are to be expected. While I have two ISPs, a UPS and server-class hardware, it may be that my node hardware would fail when I am far away from home and just have no way to get back and fix it in 5 hours. Even more so if I do not have a spare part of whatever failed.

Unlike a datacenter where someone is always nearby even when some employees are on vacation.

Higher payment or not, the official recommendation for running a node is a Raspberry Pi or similar (definitely not a server-class device) with a single hard drive connected via USB.

Meanwhile, the requirements read more like “server-class hardware, RAID, multiple uplinks, UPS and a generator”. Probably redundant switches as well, since switches (at least the good ones), while reliable, can still fail.

Assuming there is no other problem in the last 30 days, 5 hours to bring the node back online is a very short time unless you have redundant hardware or at least keep configured spares on hand and never go away from home so far that you cannot return in 3 hours. Well, with the virus and quarantine it may be easier to stay at home, but hopefully the quarantine will not last forever.

6 Likes

I understand that the exact numbers used do not belong in a design doc. But we’re having a discussion about them without knowing them. I see someone assuming the grace period would be 60 days and others starting to repeat it. I was thinking it would more likely be closer to 7 days, but I could be equally wrong. Additionally, a discussion about what represents reasonable downtime has started up again under the assumption that the requirements won’t change. I’m going to have to agree with the previous posts that 5 or 6 hours is simply not enough given that most nodes are in homes somewhere. Personally, I would argue for something closer to 48 hours of maximum downtime, which is much more reasonable to ensure for home users.

It would be nice to get some kind of indication for what sort of numbers you are intending to use for this process so we don’t discuss things that aren’t relevant.

4 Likes

Without talking about actual numbers it is quite simple:

STORJ can only demand an SLA from SNOs that is lower than the SLA of their ISPs.
Otherwise every SNO would be (more or less) in violation of the ToS, because SNOs cannot provide a higher SLA than the one they receive (availabilities multiply, so a node can never be more available than its uplink) and would need to stop being an SNO (or be forced out at some point due to downtime), resulting in probably the majority of SNOs having to leave.

The other thing to consider is that ISP SLAs often aren’t scoped to a single month. You might have 99.9% availability for 11 months, but in one month there is a (possibly planned) downtime of one day. A 30-day window for SNOs could then result in node suspension. A “planned downtime” feature for storagenodes would be good in this case.

Furthermore, as already mentioned, almost nobody has redundant hardware. So every hardware failure will lead to a downtime of 1-7 days (e.g. if your CPU drops dead) but should not occur more often than 2-4 times a year. This is not a “planned downtime” but one that could be reported manually to the satellite.

However, even those long-downtime scenarios are currently covered by “suspension” mode. We don’t know how long it will take for a node to be considered “healthy” again afterwards, but if a good timeframe is chosen, this could already be enough for now.

I’m confident STORJ is considering these scenarios and will choose good values for uptime requirements, grace period and suspension mode.

1 Like

I don’t really see the added value of the suspension period, besides punishing the SNO for the downtime (which may have been planned, or have had good reasons beyond his control). It doesn’t “solve” the consequences of a node being offline (reduced redundancy, potential repair cost to the network).

When a node runs a graceful exit, it uploads all its pieces to be stored on other nodes for free (I assume it is not paid extra for this egress) before being DQed. I understand the idea behind this as a way to keep the pieces on the network (and the redundancy) without needing to perform a costly repair.

Could we apply a similar logic to a node returning from being offline, instead of suspending it?
I mean, could it perform a “partial graceful exit” of the pieces it holds for the segments with the lowest redundancy, so that the pieces most “at risk” in this node’s storage are safely handed over to more reliable nodes?

When back online, the node would “relocate” to other nodes, for free, all pieces belonging to a segment which has less than X pieces available.

The satellite repairs a segment when it has less than 35 pieces available.
So we could, for example, have X = 40 for downtime less than 1 day, X = 45 for downtime between 1 and 7 days, and X = 50 for downtime more than 7 days. (Of course these numbers would need to be fine tuned)

This would still result in a punishment for the node, because it will generate unpaid egress and eventually the node will have less data stored and less chance for paid egress (because it will be holding pieces with higher average redundancy available on the network, so less chance to be selected for a download).

We could apply the same logic to planned downtime support, maybe with slightly lower X values (depending on planned downtime duration) and the “pieces relocation” happening before the node actually goes offline.
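
To sketch what I mean (pseudo-code on my side, not anything that exists in the storagenode today; the thresholds are just the placeholder numbers from above, and the per-segment piece counts would of course have to come from the satellite):

```go
package main

import (
	"fmt"
	"time"
)

// A piece this node holds, together with how many healthy pieces the
// satellite still reports for that piece's segment.
type heldPiece struct {
	segmentID       string
	availablePieces int
}

// redundancyThreshold picks X based on how long the node was offline,
// using the placeholder values from the post above.
func redundancyThreshold(downtime time.Duration) int {
	switch {
	case downtime < 24*time.Hour:
		return 40
	case downtime < 7*24*time.Hour:
		return 45
	default:
		return 50
	}
}

// piecesToRelocate returns the pieces whose segments have fallen below X,
// i.e. the ones most at risk, which the node would hand over for free.
func piecesToRelocate(held []heldPiece, downtime time.Duration) []heldPiece {
	x := redundancyThreshold(downtime)
	var atRisk []heldPiece
	for _, p := range held {
		if p.availablePieces < x {
			atRisk = append(atRisk, p)
		}
	}
	return atRisk
}

func main() {
	held := []heldPiece{{"seg-a", 38}, {"seg-b", 52}, {"seg-c", 44}}
	// 3 days offline -> X = 45, so the pieces for seg-a and seg-c get relocated.
	fmt.Println(piecesToRelocate(held, 3*24*time.Hour))
}
```
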

How does this sound?

1 Like


This could result in having to upload a significant amount of data while the node is still in an untrusted state. It would take a significant amount of time to do that, and it would actually impose a harsher penalty on nodes than suspending them for a bit. Please note that downloads still take place in the suspended state.

Yes, if a node fails to contact a satellite it will retry connections to that satellite very quickly with an increasing backoff interval.
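
A simplified sketch of that retry pattern (not the actual storagenode code; the interval values in the example are only illustrative):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// checkInWithBackoff retries a failed check-in quickly at first and then
// doubles the wait after each failure, up to maxWait.
func checkInWithBackoff(ctx context.Context, checkIn func(context.Context) error, initial, maxWait time.Duration) error {
	wait := initial
	for {
		err := checkIn(ctx)
		if err == nil {
			return nil
		}
		fmt.Printf("check-in failed (%v), retrying in %s\n", err, wait)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(wait):
		}
		wait *= 2
		if wait > maxWait {
			wait = maxWait
		}
	}
}

func main() {
	attempts := 0
	_ = checkInWithBackoff(context.Background(), func(ctx context.Context) error {
		attempts++
		if attempts < 4 {
			return errors.New("satellite unreachable")
		}
		fmt.Println("check-in succeeded on attempt", attempts)
		return nil
	}, time.Second, 8*time.Second)
}
```
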

A multihomed IP is one that can be reached by more than one network. Multihoming - Wikipedia

I’m sorry I’m not entirely following your message here, but the gist of it seems to be that you feel that [a] nodes have been DQ’d unfairly in the past, or that [b] the proposed changes here will cause unfair DQ, or that [c] the uptime requirements are too onerous. Or some combination of those.

If [a], then we agree! That’s a big part of the motivation for this blueprint.

If [b], then we’re interested in seeing some specifics as to why it might happen. Saying it’s “Not so good” is not particularly helpful.

If [c], then I’m sorry it’s not working out for you. Certainly people in different circumstances and different local markets will have different costs involved in achieving good availability, and for some people those costs will mean it is not profitable to run a Storj storage node. We would like to tune things so that as many people as possible can run profitable nodes, but at the same time we need the network as a whole to remain viable. If we tolerate too much downtime from storage nodes, general availability and performance and durability all suffer and few people will want to store data on the network.

2 Likes

While I agree that the network as a whole needs to be protected, I think that the past months with no consequences for downtime whatsoever, as well as a lower-than-expected amount of repair (which has been mentioned in town halls as well as in the time series database presentations), point towards some room to move here.
Is the plan to still stick to a maximum of 5 hours? Given the above, I don’t see a need to stick to that. Am I overlooking something?

1 Like

That is a useful data point. It might be possible to allow nodes to have multiple IPs in the future. If that happens, though, it will be a separate effort from this downtime disqualification blueprint. You might want to start a new thread about it.

This might be a helpful sidenote: your node can change its effective IP anytime by contacting all relevant satellites and passing on the new address. If you have a system for detecting failure on one IP and you’d like to fail over to another, that could be automated on the SN end without adding to the satellite and uplink costs.
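
For example, a small watchdog along these lines could do it. This is purely a sketch: it assumes a Docker container named "storagenode", a config.yaml using the usual contact.external-address setting, example paths and addresses you would replace with your own, and the actual health check is left as a stub.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"regexp"
	"time"
)

// Sketch of SN-side failover: when your own health check decides the
// primary address is unusable, point the node at a backup address and
// restart it, so it announces the new address at its next check-in.
// All paths, addresses and the container name are examples.
const (
	configPath  = "/mnt/storagenode/config.yaml"
	primaryAddr = "node.example.com:28967"
	backupAddr  = "backup.example.net:28967"
)

// primaryUsable is a stub: plug in whatever check fits your setup
// (e.g. probing the primary uplink's gateway). It never fails here.
func primaryUsable() bool { return true }

// setExternalAddress rewrites contact.external-address in config.yaml
// and restarts the container so the node re-contacts the satellites.
func setExternalAddress(addr string) error {
	cfg, err := os.ReadFile(configPath)
	if err != nil {
		return err
	}
	re := regexp.MustCompile(`(?m)^contact\.external-address:.*$`)
	cfg = re.ReplaceAll(cfg, []byte("contact.external-address: "+addr))
	if err := os.WriteFile(configPath, cfg, 0o644); err != nil {
		return err
	}
	return exec.Command("docker", "restart", "storagenode").Run()
}

func main() {
	for {
		if !primaryUsable() {
			// Failing back to primaryAddr once it recovers is left out for brevity.
			if err := setExternalAddress(backupAddr); err != nil {
				fmt.Fprintln(os.Stderr, "failover failed:", err)
			}
		}
		time.Sleep(5 * time.Minute)
	}
}
```
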

1 Like

Yes, there has definitely been room to move with the network so far. However, you need to consider two points: One is that Storj Labs has so far been operating at a loss, effectively subsidizing the growth of the v3 network. To some extent, we can make up for less reliable nodes by doing more work on the satellite side. We won’t be able to continue that indefinitely. The second point is that while the v3 network was being developed, the network dynamics were (most likely) entirely different from what they will be when there is more heavy and widespread usage. But of course we need to make sure it’s possible for people to run profitable SNOs too, so I wouldn’t worry too much about us making the requirements way too stringent.

Having said all that, I don’t actually know what the current plan is, myself. I just know that we are putting a lot of careful design work into balancing the needs of SNOs and customers.

3 Likes

That is a very interesting idea. Possibly we’d even keep the suspension period by default, but give the node the option of skipping the suspension period if it does perform the “partial graceful exit”? I haven’t run any models for your idea, but it’s possible it could work. It would involve a lot of complexity, though. Graceful exit already adds a surprising amount of complexity to the system, and this would compound that. I think we have to put this idea in the bucket of “future possible improvements”. I’ll think about it some more, though.

1 Like

Keep in mind that node operators have been running at a “loss” too. With such low demand and low payouts you are expecting enterprise-grade redundant systems while not even covering electricity costs, let alone hardware costs. It’s normal for an ISP to have issues that are out of the SNO’s hands: my ISP has been down for a couple of days because of a fiber cut, and data centers that host servers of mine can take up to a day to troubleshoot hardware issues, etc. Maybe look at giving a “bonus” as an incentive rather than starting to kick people off for what is going to be normal downtime in a consumer setting that is not protected by an SLA.

Yes, if a node fails to contact a satellite it will retry connections to that satellite very quickly with an increasing backoff interval.

What are the intervals?