Blueprint: tracking downtime with audits

If we can successfully dial the node, but it just takes too long to transfer the data, it doesn’t count as offline.
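
Roughly, that distinction could look like the sketch below; every name and the timeout value are invented for illustration, not taken from the actual satellite code.

```go
package audit

import (
	"context"
	"errors"
	"time"
)

// Outcome of a single audit attempt (names invented for illustration).
type Outcome int

const (
	Success            Outcome = iota
	Offline                    // could not even dial the node
	RecoverableFailure         // dial succeeded, but the transfer timed out
	Failed                     // dial succeeded, piece missing or corrupt
)

// Conn is a stand-in for whatever connection type the auditor uses.
type Conn interface {
	DownloadPiece(ctx context.Context) error
}

// classify maps one audit attempt onto an outcome. Only a dial failure
// counts as "offline"; a slow transfer after a successful dial does not.
func classify(ctx context.Context, dial func(context.Context) (Conn, error)) Outcome {
	conn, err := dial(ctx)
	if err != nil {
		return Offline // node unreachable: this is the offline case
	}

	// Bound the transfer time (the value here is made up).
	dlCtx, cancel := context.WithTimeout(ctx, 5*time.Minute)
	defer cancel()

	switch err := conn.DownloadPiece(dlCtx); {
	case err == nil:
		return Success
	case errors.Is(err, context.DeadlineExceeded):
		return RecoverableFailure // too slow, but not offline
	default:
		return Failed
	}
}
```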

But yeah, if it can’t be reached at all due to congestion… I kind of agree with SGC. If this is happening so often that the node becomes disqualified, then it seems that node isn’t really useful to us.

To your point though, I agree; it depends on the numbers. It might be beneficial to have different requirements for high-audit vs low-audit nodes.

A quick idea that comes to mind: if a node has fewer than x audits over the tracking period, it is evaluated against a more lenient standard. This could help counteract random issues popping up and damaging the overall score.
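
For concreteness, a check along those lines might look something like this; the cut-off and the two limits are made-up numbers, just to show the shape of the rule.

```go
package audit

// OfflineStats holds the per-node counts over the tracking period
// (field names are invented for this sketch).
type OfflineStats struct {
	TotalAudits   int // audits attempted during the tracking period
	OfflineAudits int // audits where the node could not be dialed
}

const (
	minAuditsForStrict = 100  // "x": below this, use the lenient limit
	strictLimit        = 0.05 // allowed offline ratio for well-audited nodes
	lenientLimit       = 0.20 // looser limit when there are few audits
)

// exceedsAllowedOffline reports whether a node is over its offline limit,
// applying the looser limit when there are too few audits for the ratio
// to mean much, so a couple of random hiccups don't wreck the score.
func exceedsAllowedOffline(s OfflineStats) bool {
	if s.TotalAudits == 0 {
		return false // nothing to judge yet
	}
	ratio := float64(s.OfflineAudits) / float64(s.TotalAudits)
	if s.TotalAudits < minAuditsForStrict {
		return ratio > lenientLimit
	}
	return ratio > strictLimit
}
```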

If I’m interpreting this tool correctly, I believe “recoverable failed” refers to the situation where a connection is successfully established, but it times out before the piece is fully transferred for audit. In this event, the node is put into “containment mode”: the next time it would be audited, we instead retry the timed-out audit, because the node could be hiding a deleted or lost piece. After (5?) retries, it counts as a regular failure.

So currently, I don’t think that specific retry mechanism would be very helpful for retrying offline audits; the retry could come hours later.
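
For reference, my mental model of that containment behaviour is roughly the sketch below; the names are invented, and the retry limit is just the “(5?)” guess from above, not a confirmed value.

```go
package audit

// ContainedNode tracks a node whose audit timed out mid-transfer
// (all names here are invented for the sketch).
type ContainedNode struct {
	PendingPieceID string // the piece whose transfer timed out
	ReverifyCount  int    // how many retries we have burned so far
}

const maxReverifyAttempts = 5 // the "(5?)" guess above, not a confirmed number

// onNextAudit decides what happens when a contained node comes up for audit:
// retry the timed-out piece instead of picking a new one, and convert the
// containment into a regular failure once the retries run out.
func onNextAudit(n *ContainedNode, reverify func(pieceID string) (ok, reachable bool)) (regularFailure bool) {
	ok, reachable := reverify(n.PendingPieceID)
	switch {
	case ok:
		*n = ContainedNode{} // piece verified this time: release containment
		return false
	case !reachable:
		return false // node offline right now: stay contained, retry later
	default:
		n.ReverifyCount++
		if n.ReverifyCount >= maxReverifyAttempts {
			// Repeated timeouts on the same piece: treat it as a real
			// failure, since the node could be hiding a lost piece.
			return true
		}
		return false
	}
}
```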

Retry in a few minutes; maybe the node is just restarting. The node pings the satellite after it starts, but I did not see that used in the design document. Wouldn’t it be possible to use that, at least to trigger another audit to confirm?
It would not be fun to update my node and find out that I “earned” an hour of downtime even though the node was only offline for 5 minutes.
However, if you have special rules for nodes with small amounts of data, it would probably be OK.
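
Something like the idea above could be sketched as follows; the hook name and the timing are pure guesses on my part.

```go
package audit

import "time"

// OfflineEvent records the last time a node failed an audit because it
// could not be dialed (invented for this sketch).
type OfflineEvent struct {
	NodeID string
	At     time.Time
}

// onCheckIn would hang off the satellite's contact/check-in handling:
// if a node that was recently marked offline pings us after a restart,
// schedule a quick confirmation audit instead of waiting hours, so a
// 5-minute reboot is not booked as an hour of downtime.
func onCheckIn(nodeID string, recentOffline map[string]OfflineEvent, schedule func(nodeID string, delay time.Duration)) {
	ev, ok := recentOffline[nodeID]
	if !ok {
		return // no recent offline mark, nothing to confirm
	}
	if time.Since(ev.At) < time.Hour {
		schedule(nodeID, 5*time.Minute) // re-audit shortly to confirm
	}
}
```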

Interestingly, I had to physically move servers into another building, along with my node for StorJ. The node would have been down for several hours (maybe 2 or 3) as I also needed to change the IP and forgot about the port forwarding :frowning:

This was early in May, and I do wonder how this downtime was recorded.

i have way more dt than i should admit to… lol
maybe a day or 1½ days’ worth during the last month, but i seem to have ironed out all the issues now… so i wouldn’t worry about a few hours of dt

I’m almost certain this new system would start measuring when implemented, so downtime right now shouldn’t impact it. However, you should still aim to keep downtime as low as possible.

Hey everyone, thanks for all the discussion.
The design has been approved and we’re going to begin implementing pieces of it, but I want to say that we’re not going to switch on disqualification to begin with.
To start, we want to observe what kind of data we get back from the system and how it’s working out for different kinds of nodes, and based on that information we’ll make adjustments as necessary.

So after how long offline will a node be suspended and DQ’d?

This is really good. I’m quite happy with the solution proposed.

What is the status of this? I am moving cities early in August and my node will likely be offline for several days due to service provisioning and such.

i doubt that one will ever get DQ’d for 2 days of downtime… ofc your node will lose some data if the satellites end up doing repair jobs… but aside from that… limited downtime can happen… even for the best setups…

even google datacenters crash from time to time… even tho it’s very rare xD
but real world problems can arise that simply cannot be avoided and thus extended downtime should be accounted for… so long as it’s not a regular thing…

personally i wouldn’t worry too much about it… so long as it’s not like a week…
but yeah i dunno… just guessing here…

one alternative can be a wireless broadband connection and then setting your storagenode to like 1TB or whatever… a number much lower than it actually is, so it won’t allow ingress… then you can basically break even on the egress value vs the cost of most mobile broadband providers… your mobile phone can be set up to do this… or an old spare one…

We’ve added the new DB tables and columns, though we haven’t yet added the logic to write any data to them or do any evaluations. We are looking to start that work soon, though.
Even so, once everything is set up, we’ll be running it for a bit to collect data and make sure everything looks to be working smoothly before turning it on for suspensions and DQs.
And even then, we’ll be setting the threshold of allowed offline audit percentage pretty high to start off.
We also have plans to implement planned downtime, which would be perfect for your case, but getting the tracking system working is going to come before that.
By August the offline threshold will probably still be pretty relaxed, if the consequences are even switched on, so I think you’d probably be okay.
We’ll definitely announce it a few weeks in advance before we start penalizing for offline audits again.
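
Purely as a guess at how the staged rollout might eventually fit together (none of these names, thresholds, or grace periods are from the actual schema or design doc):

```go
package audit

import "time"

// OfflineTracking mirrors the kind of per-node columns being added:
// counts over the tracking window plus the timestamps that would drive
// suspension and, eventually, DQ.
type OfflineTracking struct {
	TotalAudits      int
	OfflineAudits    int
	UnderReviewSince *time.Time // set when the node first exceeds the limit
	Disqualified     *time.Time
}

const (
	allowedOfflineRatio = 0.30               // deliberately high to start off
	reviewGracePeriod   = 7 * 24 * time.Hour // time to fix things before DQ
)

// evaluate is what a periodic chore might do once enforcement is switched on.
func evaluate(t *OfflineTracking, now time.Time) {
	if t.TotalAudits == 0 || t.Disqualified != nil {
		return
	}
	ratio := float64(t.OfflineAudits) / float64(t.TotalAudits)

	switch {
	case ratio <= allowedOfflineRatio:
		t.UnderReviewSince = nil // back under the limit: clear the review
	case t.UnderReviewSince == nil:
		t.UnderReviewSince = &now // over the limit: start the grace period
	case now.Sub(*t.UnderReviewSince) > reviewGracePeriod:
		t.Disqualified = &now // still over the limit after the grace period
	}
}
```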

i don’t even understand why a node needs to be DQ’d for downtime… i mean if one has enough dt, all the data or most of it will end up being repaired… and to my understanding that data is simply deleted when the node gets back online…

ofc Storj may have some reasons i don’t understand to do it this way…
but the fundamental concept of downtime DQ doesn’t make sense imho

so what if a node that went offline for ages comes back online and gets 80% of its data deleted because it was repaired while the node was away…
but then simply resumes normal operation…

maybe it’s just me not understanding the technology, but DQing a node whose data could still be viable, just because it had extended downtime, might one day lead to data that could have been recovered being destroyed…

but it’s easy to see things as being a bad solution when one doesn’t understand it…

i’m sure horseless carts didn’t seem a very good idea at one time…

It’s an issue of trust.
If a node is offline for more than four hours, we consider it to be unreliable (at that moment). If enough pieces in a segment are unreliable, we trigger repair, which we pay for.
DQ essentially means this node was offline for so long that we no longer trust it. We think it will cause more repair and cost us more money if we let it rejoin.
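
As a very rough illustration of that (the four-hour figure comes from the post above; the threshold name and everything else is made up):

```go
package repair

import "time"

const offlineGrace = 4 * time.Hour // offline longer than this => unreliable right now

// Node is the minimum we need to know about a piece's holder here.
type Node struct {
	LastContactSuccess time.Time
}

func reliable(n Node, now time.Time) bool {
	return now.Sub(n.LastContactSuccess) <= offlineGrace
}

// needsRepair reports whether a segment should be queued for repair because
// too many of its pieces sit on currently-unreliable nodes. Every repair
// costs the satellite money, which is why chronically offline nodes
// eventually stop being trusted at all.
func needsRepair(holders []Node, repairThreshold int, now time.Time) bool {
	healthy := 0
	for _, n := range holders {
		if reliable(n, now) {
			healthy++
		}
	}
	return healthy < repairThreshold
}
```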

that’s a very valid point… that planned downtime thing sounds pretty cool…

i’ve had some pretty long downtimes because of hardware issues early on… it’s bad when one reaches the point of spending 16 hours on the system and it’s still not working… then one is a bit like … maybe i should just give up, go to bed and approach the problem tomorrow…

not that i think my setup will go down, but it’s nice to know that there at least is a little wiggle room for well-performing nodes…
i mean how many years of uptime does one need to have to be able to get a couple of days of dt lol… sure any operation can basically be done while online… but then one is into having like a cluster… and then it might be somebody digging through the fiber cable and killing the internet for days until they can figure out why…

ofc then one can just have a redundant internet connection… but it all adds up… 24/7/365 uptime is hard and expensive… if one wants to be 95% covered… i’ll leave 5% for acts of God…

ofc the last 5% can mostly be negated with a second location… :smiley: ofc to be really safe, that should most likely be located outside national borders…

Yeah, I agree with you there. The node’s pieces may all still be perfectly good, but we would DQ them…
Then again, that’s what suspension mode is supposed to protect against. If a node encounters some problems, we give the operator some time to fix them and get back online before DQ.

Then once we allow planned downtime, even better!

Oh sorry, @Beddhist, I didn’t see this.
I don’t have a number for you right now. Once it’s implemented and we get some data around offline percentages, we’ll be able to figure out where we want to draw the line.

Yeah, uptime gets diminishing returns pretty fast. 6 hours (the previous limit) is not enough time for a home-based operator to react and fix the problem (he may not be at home, the ISP contract does not promise such fast repairs, the operator may not have the spare parts at home, etc.), so it pretty much requires everything to be redundant (internet connections, power supplies, UPSs, switches, etc.), except that making the node itself run in a cluster is a bit difficult.

However, if I understand correctly, those limits are going to be increased and there is going to be a planned-downtime option (so I will be able to get the dust out of the server once in a while), so it may even be possible to run a node without large UPSs, a generator and multiple uplinks.

Like it always has been…

It’s clearly not the intention of Storj Labs to make setups without all that impossible. There has been so much weight put on what was merely listed in the ToS but was actually never enforced. Let’s focus on the intentions outlined here rather than on a theoretical, never-used limit of 6 hours. That never applied and is never going to.

While every SNO hopes that this is the case, it being in the ToS is exactly the reason why it can be applied and enforced.