Blueprint: tracking downtime with audits

Hey everyone, we have a new design here for tracking downtime with audits

We had a previous discussion on a related document, Blueprint: Downtime Disqualification, but we decided to reconsider using audits to measure storage node downtime.

Let us know what you think


This is really good. I’m quite happy with the solution proposed.

Agreed with @beast, this is quite good and I think an excellent proposal that isn’t too complex from your end to manage or for SNOs to understand either.


I have a question: does the offline audit impact the audit score of my node? I mean, if my internet connection has failed, that does not mean I have lost data.
I have set my monitoring system to alert me if the audit score drops below 1 and the number of failed audits increases. That would be a bit useless if it got triggered every time there was a connection problem.

I may have more questions after I have time to properly read the document :slight_smile:

Very good, seems fair! I like the use of sliding windows.

It’s advisable that you fix the email notification system to be sure a SNO gets notifications as soon as possible. At the moment the email notification system is broken: I have received suspension emails when I was not suspended, and have been suspended without receiving any notification. Moreover, even when the email triggers correctly, there’s a lot of delay between the event and the corresponding email, which makes the late arrival almost useless.

Nodes stuck in suspension

Since a suspended node cannot receive new pieces, and it can only be evaluated for reinstatement after an audit, if it happened to have all of its pieces deleted, it would be stuck in a limbo state where it would never leave suspension.

I don’t understand this. Even if a node’s data is deleted and only the identity remains, I kind of assumed the satellite has some record of the pieces being stored and would send audits for that data, which would then fail, even if very infrequently… and if it’s that infrequent, is it really a node worth keeping anyway?
Also, does it really matter if a bad node falls to its death in a day or a week… I suppose it might be relevant for repair jobs… (I’m not that familiar with all the ins and outs of this.)

Oh, it just dawned on me: what is the difference between a suspended node with an offline iSCSI drive and a suspended node with all its data lost?

Maybe a node could run a preflight check before attempting to launch into the cloud; that way it would simply refuse to launch without at least being able to see that its data seems intact.
It doesn’t have to read the whole thing… just a few spot checks here and there. Of course that would delay launch time a bit… so it would need to be sleek… like maybe it pairs with its data, kind of like Bluetooth… each part has a shared key or data block.
So say one for every 100k files, which would mean a 24TB node would have 240 files with a checksum matching something kept with the identity… let’s call them data certificates, each being one part of a pair, written in between the stored data as data is added to the node…

So at launch / boot, a node would start by checking data certificates 1, 2, 3, 4… every check verifying that its data stores exist and are readable / non-corrupt.

Checking and comparing, or doing some kind of encryption-key verification, on 480 tiny files to verify 24TB of data would be insignificant.

This method, or something similar, would:

  1. make sure a node cannot launch without its data stores in accessible form… maybe it should do a write test also…

  2. ensure any node that has lost its data cannot launch; the satellites would then, after email contact with the owner and a set time, say 1 month (it is possible that in the future you might lose data and then want back pieces that were lost in whatever disaster),
    disqualify the node for pure offline time, or, if proper cause exists, maybe extend the suspension or find other ways to restore the data to the network…

(I don’t believe it’s possible to always rely on repair jobs; this would be the failsafe for the repair job… maybe Storj would pay for the upload from a suspended node to recover the data, in case of disasters.
Let’s face it, repair jobs are fine… until we get outside the scope of what they were designed for… like, say, a meteor destroys most of Europe; can all your data then be restored… would you want to…)
Alas, I digress… what can I say… I like fault tolerance.

Not sure this is a good solution, but if nothing else it might inspire someone…
It seems like a lot of nodes often try to connect without their data being hooked up or working right… this would prevent damage from an active node without proper access to its data…
It might even be useful to check contact to the storage from time to time… but yeah, I dunno.
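The preflight idea could be sketched roughly like this. This is a minimal sketch, not Storj's actual design: the "data certificate" files, their layout, and the function names are all hypothetical, assuming each certificate lists a small sample of stored pieces with their expected SHA-256 digests.

```python
import hashlib
import os

def preflight_check(storage_dir, cert_paths):
    """Spot-check a few hypothetical 'data certificate' files before launch.

    Each certificate is assumed to hold lines of the form
    '<relative piece path> <expected sha256 hex>' describing a small
    sample of stored pieces. Returns False on any missing or corrupt
    piece, in which case the node would refuse to start.
    """
    for cert in cert_paths:
        with open(cert) as f:
            for line in f:
                parts = line.split()
                if not parts:
                    continue  # skip blank lines
                rel_path, expected = parts
                piece = os.path.join(storage_dir, rel_path)
                if not os.path.exists(piece):
                    return False  # data store not hooked up / piece missing
                with open(piece, "rb") as p:
                    if hashlib.sha256(p.read()).hexdigest() != expected:
                        return False  # piece exists but is corrupt
    return True
```

Since only a handful of sampled pieces are read, this stays cheap even for a multi-terabyte node, while still catching the "launched without the data drive mounted" case.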

I don’t even understand how a node can be stuck in suspension without having any data… but in my view it should stay there for a while before being DQed. Maybe, as stated above, this should involve contact from Storj to the SNO, to help protect against data loss from disasters, whatever they may be…

I’m sure I’ll have more to say, but yeah, I’ve got work to do…

I would say… so long as the data is good, DT should be almost irrelevant… (and of course not paid for storage). I know you might have other considerations to take into account, like live data…
And reintegrating a month-old node into the network might not be worthwhile… but isn’t it always like that with redundancy? It’s never worthwhile… until you need it.

SNOs need an incentive to bring back working-but-offline storage nodes in case of disasters, instead of just pulling the plug and formatting everything. If that is done on a large scale because that is the consensus created by Storj, then data will be lost… it’s just a matter of how long before it happens and how wide the data blackout will be.

The audit thing is good; it makes much better sense than wasting a lot of energy on pinging everything, which wasn’t doing the job it was supposed to anyway.

This is looking very good. I love that learnings from previous discussions are incorporated in this design, like setting a fixed review period, but allowing a node to be removed from suspension before the review is over.

I have one question: how do you ensure nodes (especially new ones) get enough audits to determine uptime? New nodes take about a month to get to 100 audits. It can even take days before a node gets its first audit. That doesn’t seem enough for a representative score.

I also have a comment on the “nodes stuck in suspension” issue. Many SNOs have some kind of problem when first setting up their node. It’s very well possible they got it to work for a little bit but then ran into issues and picked it up a few days later. Hence it might be a while until the node is actually up and running. This may increase the number of times this scenario actually occurs. Just something to keep in mind.

@SGC I think the problem is that this node will never be disqualified because it will never get an audit again and never be evaluated again. So no, it’s not a node worth keeping, but the satellite keeps it stuck in suspension anyways. A lot of the rest of your comment seems to be about file availability, which is outside of the scope of this blueprint.

How would you not include that in downtime evaluations? A single offline node could be the difference between being able to repair a sizable chunk of network data or not.

File availability is why suspending and disqualifying nodes is a necessary part of the network; without such features, data would either not be live or bit rot could be introduced into the network.

A bad node is worse than no node at all in most cases. One could simply do an automated or direct contact to the SNO to evaluate it, and then, after say 1 month of suspension, just let the node time out…

In normal network operations (not the Storj network, but Ethernet and such)… when a connection or peer times out, that’s just it. I know this has to be on a much longer time scale, but I would draw upon existing, verified conceptual solutions from how network connections have been done for decades: solutions that are known to work and be good, which have already taken into account too many considerations for a few people to cover over a short time.

Of course it would be on an hourly and daily scale… but that might not matter… anyway, it would be a place I would look for inspiration and guidance.

So I don’t see how file availability and downtime can be two different things, because they are basically the same.

Granted, a node with no files isn’t really relevant, but could you tell the difference between a node with no data and… a node with its iSCSI data drive disconnected?

Also, the suspension of a node might mean the node never even connects to the network again…
so it would be obvious for satellites to have it time out after a while and just forget it ever existed…

It would be the node’s problem to get back onto the network and get itself verified…
One could also make a sort of scan function for long-suspended nodes, to verify their existing data against whatever bits the network is missing.

Like a checksum of the data stored, which the satellites also keep, so they can cross-reference what looks like data integrity without having to scan every file on a node, or many nodes.

Of course, to really come up with something like that I would have to understand how the whole system works in detail… but the concept is pretty basic: use checksums to verify data. A suspended node essentially becomes a checksum of the 240 checksums a 24TB node would have under my earlier suggestion… then, when the suspended node logs in, it computes a checksum of its data stores; reading data is pretty fast, so essentially one could scan the whole thing and recompute the checksums, which of course would take a while…
It’s just like a key to open the door into the network… and I suspect that’s essentially what audits do as well…

Maybe the solution is so simple that a node should just be able to prepare for a full audit, scanning its entire dataset, like ZFS does a scrub… scrubs are pretty fast compared to how much data they check.
For every 100k files (per datastore, time-based, or whatever), it computes a block audit checksum; collectively these are like the uberblock in ZFS. When this process is done, everything seems okay as far as the node can see, and it’s ready to perform its full audit, it connects to the network and requests one. The satellite and node verify and compare checksums of the different blocks; if they match, the data seems good, regular audits can continue, and the node is allowed to rejoin. Otherwise, it would time out after a set amount of time in suspension mode and be disqualified.
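The block-checksum idea could look something like this. This is a minimal sketch under assumed rules, purely for illustration: pieces are grouped into fixed-size blocks, each block gets a SHA-256 over its piece hashes, and the satellite would keep only the top-level checksum to compare against what a returning node reports; the function names and grouping scheme are made up.

```python
import hashlib

def block_checksums(pieces, pieces_per_block=3):
    """Hash each piece, then group piece hashes into blocks and hash each
    block (ZFS-uberblock-style aggregation, assumed grouping rule)."""
    hashes = [hashlib.sha256(p).hexdigest() for p in pieces]
    blocks = []
    for i in range(0, len(hashes), pieces_per_block):
        joined = "".join(hashes[i:i + pieces_per_block]).encode()
        blocks.append(hashlib.sha256(joined).hexdigest())
    return blocks

def top_checksum(blocks):
    """Single checksum over all block checksums; a satellite could keep just
    this value and cross-reference it without scanning every file."""
    return hashlib.sha256("".join(blocks).encode()).hexdigest()
```

If any piece changes or goes missing, its block checksum changes and so does the top checksum, so a mismatch pinpoints which block needs a closer look without re-reading everything up front.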

Maybe this belonged in the downtime disqualification thread I started out in…
but it was here that the description of how it was supposed to work was, so it was here I did my mind dump.

Definitely an improvement over the current audit uptime calculation.

We currently don’t have any kind of uptime calculation.

lol, it’s meant to say downtime calc. I was talking in another thread about uptime at the same time :smiley:

We also don’t track downtime

I am not saying that. We do display something on the dashboard, but that is not time related.

The point is, let’s just assume we don’t have any uptime or downtime information at the moment, because that is basically the current state. It makes no sense to compare the design document against that. The question in this thread should be: what are the edge cases? Can the new system fail, and if so, how can we prevent that?


@Pentium100
This design will not affect your audit reputation at all

@pietro
Oh yes, if that’s the case we will definitely need to fix that (if someone isn’t already working on it!)

@SGC
Parsing…

@BrightSilence
Yes, I’m also somewhat concerned about how new nodes will fare under this design.

Under the “Open Issues” section I briefly mentioned that one potential bandaid here would be to wait until a node has some minimum number of entries in the windows table before doing evaluations. This would prevent a node from being suspended for being offline during its very first audit, or first few audits.

Maybe I didn’t make this quite clear in the document, but we also want to ensure that all nodes will be audited at least once per window (if they have any pieces, that is), and we have some changes planned to make this happen.

However, I think there’s another potential issue here. The stricter we are with the offline threshold, the higher the relative price nodes with low audit frequencies will pay for mistakes.

If a node is only being audited once per window, and it happens to go offline at just the wrong moment, it’s too late. Of course, this goes both ways. A node could be online just for the audit, then go offline and we don’t know.

It’s only one window out of maybe 30 (I imagine keeping data for 30 days, with 24h windows), but if we had something like a 10% offline tolerance, then 3 or 4 mistakes in 30 days and that’s it. I’m not saying that’s how strict we’ll be; it’s just an example.

Perhaps tuning the window size so each node can be audited at least twice would help this case.
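As a rough illustration of the numbers in this post (30 windows, a 10% offline tolerance), the evaluation could be sketched like this. The exact thresholds are not settled, and the rule that a window counts as "offline" when the node missed at least half its audits in it is an assumption for the sketch, not the blueprint's actual rule.

```python
def offline_fraction(windows):
    """windows: one list of audit results per window (True = node answered).
    A window counts as offline if the node missed at least half its audits
    in that window -- an assumed rule, not the blueprint's actual one."""
    offline = sum(1 for w in windows if w and sum(w) / len(w) <= 0.5)
    return offline / len(windows)

def should_suspend(windows, tolerance=0.10):
    """Suspend when the fraction of offline windows exceeds the tolerance."""
    return offline_fraction(windows) > tolerance
```

With one audit per 24h window and 30 windows kept, a node that misses 3 windows sits exactly at a 10% tolerance; the 4th miss tips it into suspension, which is why low audit frequency makes each mistake so expensive.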


I suppose a possible simple solution to the punishment of new nodes would be to not punish them with anything more than a suspension for the 30-odd days of the vetting process, unless they start spewing corrupt data into the network, which I assume is one of the worst things that can happen; bad data can often be worse than no data at all…
There is a vetting process, so why use exactly the same rules for a node only getting test data for the first month… or however exactly it works… then, after vetting is completed successfully, things get more serious.

But I dunno… sorry if my ideas and comments are a bit extensive… it’s not always easy to relay advanced concepts simply. If nothing else, so long as it makes people think a few new thoughts or gain a different perspective, it’s not all wasted effort on my side.

IMO it also really depends on the numbers overall. If the allowed downtime is short, then any “rounding up” of downtime is really bad (for example, 5 hours max downtime per month, but a single failed check counts as 1 hour of downtime for you, even if that check failed because a packet was dropped by the network).

Come to think of it, what happens to nodes with a saturated uplink? On one hand, the node is transferring a lot of data, so it is definitely online, but it can still fail the check.

By the way, a brand new node is more likely to go offline for short periods of time as the (new) SNO figures out settings etc.

TL;DR
If one fails an audit, then from what successrate.sh suggests, it retries later… it would make sense for the uptime check to make use of that…

So if you drop an audit, it just tries again shortly after, or however it works… it should be part of the audit recovery mechanism… because the whole point was to improve performance and not add additional workloads to the network. Thus, if there is a recovery path for failed audits, making that a recoverable uptime check should be “trivial”, at least in theory…

[ramblings and reasons]
Well, it’s for tracking uptime; a node that cannot be contacted is still kind of useless for the network if a customer wants access to their data… granted, @Pentium100, you are right… at one point, just before the network went almost silent a few days ago, I was getting up to something like an audit a minute. Granted, my storage node isn’t well representing a stressed one-HDD node, but even in that case I don’t think, after the first month or so, that lack of audits is a real issue… of course it would depend on network traffic, which is a downside, but when the network is at full tilt, the performance benefit from utilizing audits for tracking uptime should be worthwhile.

And no matter how long the downtime, repair jobs would be started, and thus one would be punished by the existence of more copies of the data one had.

Right now on the network, a single audit might also represent a long while for most nodes.
I got 1000 audits in the last 10-11 hours… that was actually more than I would have expected… that’s nearly equal to my uploads at 1160.

Maybe we should find some low-performing nodes that are like 1-2 weeks old… maybe a month… and see what their numbers of audits actually are…

1k in 10 hours is more than 1 a minute; that’s pretty decent tracking, and then the system just needs a certain error tolerance… though I do have 0 failed audits… better to post the successrate output, so much easier.

My node is 9 weeks old…

$ ./successrate.sh /zPool/logs/storagenode_2020-05-16.log
========== AUDIT ==============
Critically failed:     0
Critical Fail Rate:    0.000%
Recoverable failed:    0
Recoverable Fail Rate: 0.000%
Successful:            966
Success Rate:          100.000%
========== DOWNLOAD ===========
Failed:                24
Fail Rate:             0.428%
Canceled:              24
Cancel Rate:           0.428%
Successful:            5564
Success Rate:          99.145%
========== UPLOAD =============
Rejected:              0
Acceptance Rate:       100.000%
---------- accepted -----------
Failed:                1
Fail Rate:             0.065%
Canceled:              379
Cancel Rate:           24.610%
Successful:            1160
Success Rate:          75.325%
========== REPAIR DOWNLOAD ====
Failed:                0
Fail Rate:             0.000%
Canceled:              0
Cancel Rate:           0.000%
Successful:            58
Success Rate:          100.000%
========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              267
Cancel Rate:           22.723%
Successful:            908
Success Rate:          77.277%
========== DELETE =============
Failed:                0
Fail Rate:             0.000%
Successful:            351
Success Rate:          100.000%

It seems very much like audits rarely fail… but I’ll check my logs and see what the worst one is:

15-05-2020 - 13k requests, 2k cancelled, 109 failed downloads, 2k audits, 1 recoverable audit failed
14-05-2020 - 100k requests, 25k cancelled, 45 failed downloads, 872 audits, 3 recoverable audits failed
That’s weird, I would have figured the more traffic, the more audits…
13-05-2020 - 100k requests, 20k cancelled, 77 failed downloads, 20 rejected :smiley:, 841 audits
12th - 900 audits, no fails, regular numbers
11th - 877 + 2 recoverable fails
10th - 1214 + 10 recoverable failed audits
9th - 534 + 6 recoverable failed audits (might be one of the days I crashed hard for an extended period… had some issues with my server turning itself off)
8th - 658 + 6 recoverable failed audits (most likely the same issue causing my numbers to be outside the norm)
7th - 824 + 0 failed (day of the deletions, 79000 deleted); got 36 rejected uploads because I’ll decide just how fast I deal with stuff, thank you…
6th

Worst one yet:

========== AUDIT ==============
Critically failed:     0
Critical Fail Rate:    0.000%
Recoverable failed:    7
Recoverable Fail Rate: 2.154%
Successful:            318
Success Rate:          97.846%
========== DOWNLOAD ===========
Failed:                94
Fail Rate:             1.260%
Canceled:              48
Cancel Rate:           0.644%
Successful:            7317
Success Rate:          98.096%
========== UPLOAD =============
Rejected:              34
Acceptance Rate:       99.964%
---------- accepted -----------
Failed:                2
Fail Rate:             0.002%
Canceled:              15070
Cancel Rate:           16.144%
Successful:            78278
Success Rate:          83.854%
========== REPAIR DOWNLOAD ====
Failed:                0
Fail Rate:             0.000%
Canceled:              0
Cancel Rate:           0.000%
Successful:            17
Success Rate:          100.000%
========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              116
Cancel Rate:           13.892%
Successful:            719
Success Rate:          86.108%
========== DELETE =============
Failed:                0
Fail Rate:             0.000%
Successful:            107716
Success Rate:          100.000%

My logs go a bit further back like this… then they are in bigger chunks; I kind of like being able to do this…
It seems very much like audits, at least in my case, always get through… the average comes out to maybe a 99.5% success rate on audits, with most failures recoverable… which might be something one could use…
If one fails an audit, then from what successrate.sh suggests, it retries later… it would make sense for the uptime check to make use of that…

So if you drop an audit, it just tries again shortly after, or however it works… it should be part of the audit recovery mechanism… because the whole point was to improve performance and not add additional workloads to the network. Thus, if there is a recovery path for failed audits, making that a recoverable uptime check should be “trivial”, at least in theory…

From what I can see, I don’t think there is a problem with using audits… they do vary quite a bit… but mine are anywhere from 500 to 2000, with an average of pretty much 1k a day, ±200.

My node gets an audit on average every two minutes, though recently the audit rate has increased

However, my node has about 10TB of data. For a node with 100GB of data the average would be less than once per hour and the maximum would be once every 7 minutes.
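If audit frequency scales roughly with the amount of data a node stores (an assumption, though consistent with the numbers in this thread), the expected interval between audits scales inversely with the data held. A tiny sketch, using the figures from this post (10TB node, one audit every ~2 minutes) as the reference point:

```python
def expected_audit_interval_min(data_tb, ref_data_tb=10.0, ref_interval_min=2.0):
    """Assuming audits arrive in proportion to data stored, a node holding
    less data waits proportionally longer between audits on average.
    Reference values come from one SNO's observations, not official numbers."""
    return ref_interval_min * ref_data_tb / data_tb
```

A 100GB (0.1TB) node would then average one audit every 200 minutes, i.e. well under once per hour, which is why window sizing matters so much for small and new nodes.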


Good point; that explains why my audits seem so stable over extended periods…