Blueprint: Downtime Disqualification

For point 5, as you already said, I’m not sure how we reconcile distrust of storage-node-initiated check-ins with the need for last_contact_success to be reasonably fresh for nodes to receive data/not be considered unhealthy/etc. For this, it seems that we would need the satellites to be checking in with everyone on a regular basis.
I’m concerned about how this would scale the more nodes we get. I’m also concerned about downtime tracking for the same reason. Our confidence that the duration between two failed pings was 100% downtime is reliant upon how long the interval was. 5 minutes, sure. 10 minutes, sure. 30 minutes, maybe so. Longer than that? Nodes should be getting random audits in the meantime which can help prevent the interval from becoming too long as well. We’re actually working on some changes to try to make sure no one falls through the cracks for audits too.
As the network grows, I think we’re going to encounter slowdown in how often we can ping nodes. Maybe we can scale it out, but it’s a concern I have. Do you have thoughts about that?
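
To make the interval concern concrete, here is a minimal sketch in Go, with made-up names and a made-up cap (nothing from the actual codebase), of the kind of policy being described: only credit the gap between two consecutive failed contacts as downtime when the gap is short enough to be trusted.

```go
package downtime

import "time"

// MaxTrustedGap is a hypothetical cap: gaps between failed contacts longer
// than this are not assumed to be 100% downtime.
const MaxTrustedGap = 30 * time.Minute

// EstimateOffline returns how much of the gap between two consecutive failed
// contact attempts we are willing to count as downtime.
func EstimateOffline(prevFailed, currFailed time.Time) time.Duration {
	gap := currFailed.Sub(prevFailed)
	if gap <= 0 {
		return 0
	}
	if gap > MaxTrustedGap {
		// Too long to be confident the node was offline the whole time;
		// only credit the trusted portion (one possible policy among many).
		return MaxTrustedGap
	}
	return gap
}
```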

Dunno if this is something that could be made to work, but the satellites could broadcast/stream a sort of randomized satellite key, from which the nodes then generate a proof-of-uptime token. Of course, the trick would be how to make it so that a node cannot use the randomized satellite key to generate the proof-of-uptime token at a later time…

Public-key cryptography comes to mind, though thinking about that stuff kinda breaks my mind… :smiley:
But it sure would scale well, if it were something that could be streamed instead of having to ping each storagenode.

Basically, doesn’t it come down to moving the job of establishing actual uptime onto the storagenodes instead, which then send proof of it periodically in nice compact packages?
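
Roughly sketching the idea in Go (all names here are hypothetical, and this glosses over the hard part, e.g. nodes relaying challenges to each other): the satellite broadcasts a signed, timestamped nonce, the node countersigns it, and the satellite later only accepts proofs whose nonce is still fresh, which is one way to keep a node from generating the token long after the fact.

```go
package uptimeproof

import (
	"crypto/ed25519"
	"encoding/binary"
	"errors"
	"time"
)

// Challenge is what the satellite broadcasts: a random nonce, an issue time,
// and the satellite's signature over both. (Hypothetical structure.)
type Challenge struct {
	Nonce    [32]byte
	IssuedAt int64 // unix seconds
	SatSig   []byte
}

// Proof is what the node sends back: the challenge plus the node's signature
// over the same bytes, showing it held the nonce while it was fresh.
type Proof struct {
	Challenge Challenge
	NodeSig   []byte
}

func challengeBytes(c Challenge) []byte {
	buf := make([]byte, 40)
	copy(buf, c.Nonce[:])
	binary.BigEndian.PutUint64(buf[32:], uint64(c.IssuedAt))
	return buf
}

// BuildProof runs on the node when it receives a broadcast challenge.
func BuildProof(nodeKey ed25519.PrivateKey, c Challenge) Proof {
	return Proof{Challenge: c, NodeSig: ed25519.Sign(nodeKey, challengeBytes(c))}
}

// VerifyProof runs on the satellite when the node submits its proofs.
// A proof only counts if the nonce is the satellite's own and still fresh,
// so a node that was offline when the challenge went out cannot mint the
// proof later.
func VerifyProof(satPub, nodePub ed25519.PublicKey, p Proof, maxAge time.Duration) error {
	msg := challengeBytes(p.Challenge)
	if !ed25519.Verify(satPub, msg, p.Challenge.SatSig) {
		return errors.New("challenge was not issued by this satellite")
	}
	if !ed25519.Verify(nodePub, msg, p.NodeSig) {
		return errors.New("node signature invalid")
	}
	if time.Since(time.Unix(p.Challenge.IssuedAt, 0)) > maxAge {
		return errors.New("challenge expired; proof submitted too late")
	}
	return nil
}
```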

We would need to scale up the number of ping workers the same way we have to scale up the audit workers. In both cases we need a minimum throughput in order to identify bad nodes in time.
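
For intuition on the scaling (with made-up numbers): if a single contact attempt takes on the order of a second and the target interval is, say, an hour, one worker only covers a few thousand nodes, so the worker count has to grow with the network. A minimal worker-pool sketch in Go, with a hypothetical PingFunc standing in for the real contact call:

```go
package contact

import (
	"context"
	"sync"
)

// PingFunc is whatever actually dials a node; hypothetical signature.
type PingFunc func(ctx context.Context, nodeID string) error

// PingAll fans a list of node IDs out over `workers` goroutines and returns
// the IDs that failed the contact attempt. The worker count is the knob that
// has to scale with the size of the network.
func PingAll(ctx context.Context, nodeIDs []string, workers int, ping PingFunc) []string {
	jobs := make(chan string)
	var mu sync.Mutex
	var failed []string
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range jobs {
				if err := ping(ctx, id); err != nil {
					mu.Lock()
					failed = append(failed, id)
					mu.Unlock()
				}
			}
		}()
	}
	for _, id := range nodeIDs {
		jobs <- id
	}
	close(jobs)
	wg.Wait()
	return failed
}
```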

I get the point. You are concerned that a node comes back online and sends a check-in message, but for the next 30 minutes we would still count the node as offline, or in some state between online and offline. During these 30 minutes we would still have to repair pieces, which would be an additional small penalty for the storage node. Yeah, that sounds like something we can’t guarantee.

This downside would affect ideas 4 and 5. So should we go on with idea 1?

How about the question regarding graceful exit? I agree that there is no downside to letting suspended nodes initiate graceful exit. Graceful exit will DQ them anyway, so they can’t use it to escape suspension mode.


AFAIK graceful exit should be fine, yes. We do still have some strong internal opinions in favor of sticking to using audits for downtime, so I might need to go back and give some more thought to different ways we might achieve that. I will bring up option 1 while I’m at it.


Hi all!

How can we know whether Downtime Disqualification is activated or not?
If there is no way to tell, could you please make a big announcement when it is activated, just to make sure every SNO is aware of it?

Thanks a lot!

I expect an email notification specific to the nodes that are in danger of getting DQed. Why should we make a big announcement for the nodes that have good uptime?

A global announcement would be useful if you plan to shut down your node for more than 5 hours.
If you know that downtime DQ is activated, you will for sure try to find another solution.

You should try to find a solution because, even without DQ activated, the repair job will still kick in and start moving data away from your node.

You should always assume DQ is active; any downtime is bad. More than 5 hours of downtime is just shutting down your node because you know DQ isn’t active.

Agreed.
The thing is that I have to move my node to a different location, where I have a better connection (fiber). Today I had to change my setup, and it took more than 3 hours.
So even though I am serious and really want to offer the best quality of service for my nodes, I want to make sure that I can move them without being disqualified. If I know that DQ is activated, I will wait another month to move them.

I agree with you: we always have to provide the best quality of service, and any downtime is bad.
But having the right information always helps to make the right decision.

Hey everyone, thanks for the discussion. Internally, we had some pretty strong opinions in favor of using audits to track downtime, which prompted a redesign here: Blueprint: tracking downtime with audits

Check it out!
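
For anyone skimming the thread, the rough shape of audit-based downtime tracking (this is only a toy illustration with made-up window sizes; the linked blueprint is the authoritative description) is that audit attempts double as contact attempts, and a node’s online score becomes the fraction of recent time windows in which it answered at least one audit:

```go
package auditscore

import "time"

// AuditContact records whether a node was reachable when an audit tried it.
type AuditContact struct {
	At     time.Time
	Online bool
}

// OnlineScore buckets audit contacts into fixed windows and returns the
// fraction of windows in which the node answered at least once. The window
// size is a placeholder, not the value from the blueprint.
func OnlineScore(contacts []AuditContact, window time.Duration) float64 {
	// A window counts as "online" if the node answered at least one audit in it.
	windows := map[int64]bool{}
	for _, c := range contacts {
		key := c.At.UnixNano() / int64(window)
		if c.Online {
			windows[key] = true
		} else if _, ok := windows[key]; !ok {
			windows[key] = false
		}
	}
	if len(windows) == 0 {
		return 1.0 // no data yet: don't penalize
	}
	online := 0
	for _, answered := range windows {
		if answered {
			online++
		}
	}
	return float64(online) / float64(len(windows))
}
```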


My head is spinning reading through this chain (and it is late here), but I want to offer some less technical observations, first as an SNO and then from a Storj business perspective. I would love to offer a customer view too, but I don’t have real insight to offer for that at this point.

Reading through this thread, I imagine a spectrum where one end is the enterprise-class, bulletproof, HA/multihomed data-center type setup some SNOs have. At the other end is more of a home setup that feels like the original marketing brief, where a person has some extra storage and can just plug in whenever they have the ability to, sort of like the old SETI@home project (or perhaps modern crypto mining). The reality would be a bell curve of SNOs with the majority bunched somewhere in the middle. The bulletproof SNOs would be rewarded for their exceptional uptime, and the casual home user would really just be doing it for fun, not profit.

As someone in the ‘middle’ (your SNO base), I might be down for a whole day a few times each year (where ‘few’ determines which end of the spectrum you tilt towards), and I am not punished for it. The first few months are a learning curve for each SNO as you accept a place along that spectrum (more investment or more casual).

The model in my mind is one where customer data sits on three different SNOs’ nodes so that it is always available, because the chance of all three being down on the same day in a year is highly improbable. So as an SNO, when my rig (all of my nodes) is down, I don’t earn income, but I am not penalized unless I am down for more than a few days. I would hope that the architecture could support this and adapt to SNOs sliding along this spectrum dynamically.

Ideally there is no such thing as DQ; your node simply moves along the spectrum from ‘unverified’ to more trusted and reliable. If I then fall back to less availability, I end up back at the ‘casual’ end, and data activity would be minimal and income non-existent until I ramped availability back up. After no reachability for a month, maybe I go ‘inactive’ (my stored data flushed) until I become active again. (From a crypto-mining perspective this is the norm: I am not expected to mine; when I do, I may make some income; and more continuous mining on the same pool provides more bonus through PPLNS.)

From a business perspective, it appears your SNOs will be in the middle of that bell curve, generally available but not at the extreme ends. Does the architecture support this? What does the target SNO profile look like (based on real-world supply, not on an ideal demand)? Where does the significant reducible cost come from? It sounds like the satellites… A good business will try to attack those costs, so would a more decentralized architecture make sense? Perhaps blockchain-based, or peer-to-peer, or a DApp, or a distributed operating system?

I want Storj to succeed, and that means potential customers’ and SNOs’ needs/capabilities should be understood. A wider range of SNOs and customers will allow for greater scale and potential, as well as the agility to adapt to a rapidly changing world. I can’t say how much uptime I can reach; I strive for less than 5 hours of downtime per month, but in the last 6 months I know I exceeded that at least once. I am not ready to invest in a backup generator with continuous UPS, and I don’t know how many would.

Thanks for giving me a chance to offer my views as well as participate in Storj, it is inspiring.


5 hours per month isn’t easy to keep. Sure, if all goes well one can keep going for a time… 3 months, a year… two years… but eventually something will fail, and if one doesn’t have redundancy/failovers then one will get downtime… which of course will always happen when one is away…
and in the thing that was never expected to break…

Then it can easily take a few days to get back online… like, say, somebody somewhere digs through the primary internet connection to the local area…
So two days of downtime burns roughly 10 months’ worth of the allowance (at 5 hours per month)…
So let’s add 2 months to that and call it 1 year, with 10 hours of planned downtime + 2 days for a larger downtime event.

The 10 hours is for updates, general maintenance, reboots… xD My server takes like 8 minutes for a reboot; I might be able to cut it down to 6, but then it’s just a managed reboot going flawlessly.
Usually it takes more like 15 minutes, because there is always something I need to check or do when rebooting, and thus I often need to run through the boot cycle a couple of times.
So that gives me time for, let’s say, 8 reboots in a year… that should be manageable.
Which subtracts 2 hours.

Then if we imagine one wanted to, say, copy the primary storagenode from one place to another, that would mean shutting it down for, let’s be optimistic and say, 30 minutes.
Sounds low, I know, but the last time I tried it went pretty smoothly, because I rsync’ed the directories like 8 times before shutting down the node, which meant that on the last run it was over in like 15 minutes…

So adding 1 node migration per year…
giving us a total of 2 days of emergency shutdown time a year, for if shit hits the fan…
and 8 reboots a year, because when one is actively using a server and isn’t enterprise level,
reboots happen.
This leaves 7.5 hours of downtime for whatever else people want to do to their nodes in a year… which puts us at roughly 58 hours of downtime out of 8,760 hours in a year (summed up in the sketch below),
let’s call that 0.66%.
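
Summing up the numbers above (purely back-of-the-envelope assumptions on my part, not any official allowance):

```go
package main

import "fmt"

// Back-of-the-envelope annual downtime budget from the assumptions above.
func main() {
	emergency := 48.0   // hours: one larger outage of 2 days
	reboots := 8 * 0.25 // hours: 8 reboots at ~15 minutes each
	migration := 0.5    // hours: one node migration per year
	misc := 7.5         // hours: everything else

	total := emergency + reboots + migration + misc
	fmt.Printf("total: %.1f h of %d h/year = %.2f%%\n", total, 8760, total/8760*100)
	// Output: total: 58.0 h of 8760 h/year = 0.66%
}
```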

IMO that’s very, very difficult for a home user to keep. Not that many of them can’t do it, but some will always fail through no fault of their own, even when trying really hard and having good abilities;
their location in the world might simply mean they are in disaster areas that make it unlikely for them to succeed long term.

I think it might be wise to have multiple classes of SNOs.

That was a fun project to have running on my old PC.

There’s a newer version of the downtime-tracking algorithm:

…I see now that this was posted earlier…