Blueprint: tracking downtime with audits

Hey everyone, we have a new design here for tracking downtime with audits

We had a previous discussion on a related document here, Blueprint: Downtime Disqualification, but we decided to reconsider using audits to measure storage node downtime.

Let us know what you think


This is really good. I'm quite happy with the solution proposed.

Agreed with @anon27637763, this is quite good, and I think it's an excellent proposal that isn't too complex for you to manage or for SNOs to understand either.


I have a question: does failing an audit while offline impact the audit score of my node? I mean, if my internet connection fails, that does not mean I have lost data.
I have set my monitoring system to alert me if the audit score drops below 1 and the number of failed audits increases. That would be a bit useless if it got triggered every time there was a connection problem.

I may have more questions after I have time to properly read the document :slight_smile:

Very good, seems fair! I like the use of sliding windows.

It's advisable that you fix the email notification system so that an SNO gets notifications as soon as possible. At the moment the email notification system is broken: I received suspension emails when I was not suspended, and I was suspended without receiving any notification. Moreover, even when the email does trigger, there's a long delay between the event and the corresponding email, which makes it arrive almost too late to be useful.

Nodes stuck in suspension

Since a suspended node cannot receive new pieces, and it can only be evaluated for reinstatement after an audit, if it happened to have all of its pieces deleted, it would be stuck in a limbo state where it would never leave suspension.
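For illustration, the edge case can be sketched like this (a toy Go sketch with made-up names, not the actual satellite code): reinstatement is only re-evaluated inside the audit path, and a node holding zero pieces never enters that path.

```go
package main

import "fmt"

// node is a toy stand-in for what a satellite tracks; the fields are made up
// for illustration and are not the actual satellite data model.
type node struct {
	suspended bool
	pieces    int // pieces the satellite believes this node still holds
}

// onAuditCompleted is the only place suspension gets re-evaluated in this
// sketch, mirroring the "evaluated for reinstatement after an audit" rule.
func onAuditCompleted(n *node, onlineScore, threshold float64) {
	if n.suspended && onlineScore >= threshold {
		n.suspended = false
	}
}

func main() {
	n := &node{suspended: true, pieces: 0}

	// Audits are triggered by selecting stored pieces; a node with zero
	// pieces is never selected, so onAuditCompleted never runs for it.
	for i := 0; i < 1000; i++ {
		if n.pieces == 0 {
			continue // no pieces, no audit, no re-evaluation
		}
		onAuditCompleted(n, 1.0, 0.6)
	}
	fmt.Println("still suspended:", n.suspended) // true: stuck in limbo
}
```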

I don't quite understand this. Even if a node's data is deleted and only the identity remains, I kinda assumed the satellites keep some record of the pieces that were stored and would still send audits for that data, which would then fail, even if very infrequently… and if audits are that infrequent, is it really a node worth keeping anyway…
Also, does it really matter whether a bad node falls to its death in a day or a week… I suppose it might be relevant for repair jobs… (I'm not that familiar with all the ins and outs of this.)

Oh, it just dawned on me: what is the difference between a suspended node with an offline iSCSI drive and a suspended node with all its data lost…

Maybe a node could run a preflight check before attempting to launch into the cloud, so that it would simply refuse to start unless it can at least see that its data appears intact.
It wouldn't have to read the whole thing… just a few spot checks here and there. Of course that would delay launch time a bit, so it would need to be lean… maybe it pairs with its data, kind of like Bluetooth… each part holds a shared key or data block.
So, say, one checkpoint for every 100k files, which would mean a 24TB node would have around 240 files, each with a checksum matching something kept with the identity… call them data certificates, each being one part of a pair that is written in between the data storage while data is being added to the node…

So at launch/boot a node would start by checking data certificates 1, 2, 3, 4… each check verifying that its data stores exist and are legible / not corrupt.

Checking and comparing (or doing some kind of cryptographic verification of) 480 tiny files to verify 24TB of data would be an insignificant amount of work.
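A rough Go sketch of that spot-check idea (every path, file name, and helper here is hypothetical; nothing like this exists in the storagenode today):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// spotCheck verifies that each sampled piece file exists, is readable, and
// still matches the checksum that was recorded for it when it was written.
func spotCheck(storageDir string, expected map[string]string) error {
	for name, wantHex := range expected {
		data, err := os.ReadFile(filepath.Join(storageDir, name))
		if err != nil {
			return fmt.Errorf("piece %s not accessible: %w", name, err)
		}
		sum := sha256.Sum256(data)
		if hex.EncodeToString(sum[:]) != wantHex {
			return fmt.Errorf("piece %s is corrupt", name)
		}
	}
	return nil
}

func main() {
	// expected would be loaded from small "data certificate" files kept
	// next to the identity; left empty here so the example compiles and runs.
	expected := map[string]string{}
	if err := spotCheck("/mnt/storagenode/storage", expected); err != nil {
		fmt.Println("preflight failed, refusing to start:", err)
		os.Exit(1)
	}
	fmt.Println("preflight passed, starting node")
}
```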

This method, or something similar, would:

  1. Make sure a node cannot launch without its data stores being accessible… maybe it should do a write test also…

  2. Any node that has lost its data will not be able to launch. The satellites would then, after an email contact with the owner and a set time, say 1 month (it's possible that in the future data could be lost and you'd want back pieces that were lost in whatever disaster),
    disqualify the node for pure offline time, or, if proper cause exists, maybe extend the suspension or find other solutions for restoring the data to the network…

(I don't believe it's possible to always rely on repair jobs; this would be the fail-safe for the repair
job… maybe Storj would pay for the upload from a suspended node to recover the data, in case of disasters.
Let's face it, repair jobs are fine… until we get outside the scope of what they were designed for… say a meteor destroys most of Europe, can all your data then be restored… would you want to…)
Alas, I digress… what can I say… I like fault tolerance.

Not sure this is a good solution, but if nothing else it might inspire someone…
It seems like a lot of nodes try to connect without their data being hooked up or working right… this would prevent the damage done by an active node that doesn't have proper access to its data…
It might even be useful to check access to storage from time to time… but yeah, I dunno.

I don't even understand how a node can be stuck in suspension without having any data… but in my view such nodes should stay there for a while before being DQed. Maybe, like stated above, this should involve contact from Storj to the SNO, to help protect against data loss in disasters, whatever they may be…

I'm sure I'll have more to say, but yeah, I've got work to do…

I would say… so long as the data is good, downtime should be almost irrelevant… (and of course not paid for storage). I know you might have other considerations to take into account, like live data…
Reintegrating a month-old node into the network might not seem worthwhile… but isn't it always like that with redundancy: it's never worthwhile… until you need it.

SNOs need an incentive to bring back working-but-offline storagenodes in case of disasters, instead of just pulling the plug and formatting everything… If that is done on a large scale because that's the consensus Storj has created, then data will be lost… it's just a matter of how long before it happens and how wide the data blackout will be.

The audit approach is good; it makes much better sense than wasting a lot of energy on pinging everything, which wasn't doing the job it was supposed to anyway.

This is looking very good. I love that learnings from previous discussions are incorporated in this design, like setting a fixed review period but allowing a node to be removed from suspension before the review is over.

I have one question: how do you ensure nodes (especially new ones) get enough audits to determine uptime? New nodes take about a month to get to 100 audits. It can even take days before a node gets its first audit. That doesn't seem enough for a representative score.

I also have a comment on the "nodes stuck in suspension" issue. Many SNOs have some kind of problem when first setting up the node. It's very well possible they got it to work for a little bit, then ran into issues and picked it back up a few days later. Hence it might be a while until the node is actually up and running. This may increase the number of times this scenario actually occurs. Just something to keep in mind.

@SGC I think the problem is that this node will never be disqualified because it will never get an audit again and never be evaluated again. So no, it's not a node worth keeping, but the satellite keeps it stuck in suspension anyways. A lot of the rest of your comment seems to be about file availability, which is outside of the scope of this blueprint.

How would you not include that in downtime evaluations? There could be one offline node standing between being able or not being able to repair a sizable chunk of network data.

File availability is why suspending and disqualifying nodes is a necessary part of the network; without such features, data would either not be live or could introduce bit rot into the network.

A bad node is worse than no node at all in most cases. One could simply make automated or direct contact with the SNO to evaluate it, and then, after say 1 month of suspension, just let the node time out…

In normal network operations (not the Storj network, but Ethernet and such)… when a connection or peer times out, that's just it… I know this has to be on a much longer time scale, but I would draw upon existing, verified conceptual solutions from how network connections have been handled for decades: solutions that are known to work well and that already account for more considerations than a few people can weigh over a short time.

Of course it would be on an hourly and daily scale… but that might not matter… anyway, it would be a place I would look for inspiration and guidance.

So I don't see how file availability and downtime can be treated as two different things, because they are basically the same.

Granted, a node with no files isn't really relevant, but how would you tell the difference between a node with no data and… a node with its iSCSI data drive disconnected?

Also, the suspension of a node might mean the node never even connects to the network again…
So it would be obvious for satellites to have it time out after a while and just forget it ever existed…

It would be the node's problem to get back onto the network and get itself verified…
One could also make a sort of scan function for long-suspended nodes, just to verify whether their existing data covers any bits the network is missing.

Like a checksum of the data stored, which the satellites also keep, so they can cross-reference what looks like data integrity without having to scan every file on a node, or across many nodes.

Of course, to really come up with something like that I would have to understand how the whole system works in detail… but the concept is pretty basic: use checksums to verify data. A suspended node essentially becomes a checksum of the 240 checksums a 24TB node would have in my previously suggested concept… then when the suspended node comes back, it computes a checksum of its data stores, because reading data is pretty fast… essentially one could scan the whole thing and recompute the checksums, which of course would take a while…
It's just like a key to open the door back into the network… and it's essentially what audits do too, I suspect…

Maybe the solution is as simple as a node being able to prepare for a full audit by scanning its entire dataset, like ZFS does a scrub… scrubs are pretty fast compared to how much data they check.
For every 100k files (per datastore, time-based, or whatever), it computes a block audit checksum; collectively these are like the uberblock in ZFS. When this process is done, everything seems okay as far as the node can see, and it's ready to perform its full audit, it connects to the network and requests a full audit. The satellite and node verify and compare the checksums of the different blocks; if they match, the data seems good, regular audits can continue, and the node is allowed to rejoin. Otherwise it would time out after a set amount of time in suspension and be disqualified.
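To make the block-checksum idea a bit more concrete, here is a small hypothetical Go sketch (not an existing Storj mechanism) that folds per-piece hashes into one checksum per block and then into a single top-level digest a satellite could keep and compare against:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

const piecesPerBlock = 100_000 // e.g. one block checksum per ~100k pieces

// blockChecksums folds per-piece hashes into one digest per block.
func blockChecksums(pieceHashes [][32]byte) [][32]byte {
	var blocks [][32]byte
	for start := 0; start < len(pieceHashes); start += piecesPerBlock {
		end := start + piecesPerBlock
		if end > len(pieceHashes) {
			end = len(pieceHashes)
		}
		h := sha256.New()
		for _, ph := range pieceHashes[start:end] {
			h.Write(ph[:])
		}
		var sum [32]byte
		copy(sum[:], h.Sum(nil))
		blocks = append(blocks, sum)
	}
	return blocks
}

// topDigest folds the block checksums into the single value a satellite
// could keep and compare against after a "full self-audit".
func topDigest(blocks [][32]byte) [32]byte {
	h := sha256.New()
	for _, b := range blocks {
		h.Write(b[:])
	}
	var sum [32]byte
	copy(sum[:], h.Sum(nil))
	return sum
}

func main() {
	// In reality the per-piece hashes would come from scanning the data store.
	pieces := [][32]byte{sha256.Sum256([]byte("piece-1")), sha256.Sum256([]byte("piece-2"))}
	fmt.Printf("top digest: %x\n", topDigest(blockChecksums(pieces)))
}
```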

Maybe this belonged in the downtime disqualification thread I started out in…
But the description of how this is supposed to work was here, so this is where I did my mind dump.

Definitely an improvement over the current audit uptime calculation.

We currently don't have any kind of uptime calculation.

lol, it's meant to say downtime calc. I was talking in another thread about uptime at the same time :smiley:

We also don't track downtime.

That's not to say we don't display something on the dashboard, but what we display is not time related.

The point is, let's just assume we don't have any uptime or downtime information at the moment, because that is basically the current state. It makes no sense to compare the design document against that. The question in this thread should be what the edge cases are: can the new system fail, and if so, how can we prevent that?


@Pentium100
This design will not affect your audit reputation at all

@pietro
Oh yes, if that's the case we will definitely need to fix that (if someone isn't already working on it!)

@SGC
Parsing…

@BrightSilence
Yes, I'm also somewhat concerned about how new nodes will fare under this design.

Under the "Open Issues" section I briefly mentioned that one potential bandaid here would be to wait until a node has some minimum number of entries in the windows table before doing evaluations. This would prevent a node being suspended for being offline for its very first audit, or first few audits.

Maybe I didn't make this quite clear in the document, but we also want to ensure that all nodes will be audited at least once per window (if they have any pieces, that is), and we have some changes planned to make this happen.

However, I think there's another potential issue here. The stricter we are with the offline threshold, the higher the relative price nodes with low audit frequencies will pay for mistakes.

If a node is only being audited once per window, and it happens to go offline at just the wrong moment, it's too late. Of course, this goes both ways. A node could be online just for the audit, then go offline and we don't know.

It's only one window out of maybe 30 (I imagine keeping data for 30 days, with 24h windows), but if we had something like a 10% offline tolerance, then 3 or 4 mistakes in 30 days and that's it. I'm not saying that's how strict we'll be, just an example.

Perhaps tuning the window size so each node can be audited at least twice would help this case.
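To put numbers on the one-audit-per-window case, here is a rough sketch of how window evaluation could work, using the example figures from this thread (24h windows, 30 days of history, 10% offline tolerance); the exact scoring in the blueprint may differ:

```go
package main

import "fmt"

// window is a per-node tally for one tracking window (e.g. 24h).
type window struct {
	totalAudits   int
	offlineAudits int // audits missed because the node was unreachable
}

// onlineScore averages the per-window online fraction, skipping windows
// that contain no audits (they carry no information either way).
func onlineScore(windows []window) float64 {
	var sum float64
	var counted int
	for _, w := range windows {
		if w.totalAudits == 0 {
			continue
		}
		counted++
		sum += float64(w.totalAudits-w.offlineAudits) / float64(w.totalAudits)
	}
	if counted == 0 {
		return 1
	}
	return sum / float64(counted)
}

func main() {
	const threshold = 0.90 // hypothetical: tolerate up to 10% offline

	// 30 daily windows with a single audit each; 4 of them missed entirely.
	windows := make([]window, 30)
	for i := range windows {
		windows[i].totalAudits = 1
	}
	for _, i := range []int{3, 11, 19, 27} {
		windows[i].offlineAudits = 1
	}

	score := onlineScore(windows)
	fmt.Printf("online score: %.3f, suspend: %v\n", score, score < threshold)
	// 26/30 ≈ 0.867 < 0.90: with only one audit per window, four bad days
	// in a month are already enough to cross the example threshold.
}
```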


I suppose a possible simple solution to the punishment of new nodes would be to not punish them with anything more than a suspension during the 30-odd days of the vetting process, unless they start spewing corrupt data into the network, which I assume is one of the worst things that can happen; bad data can often be worse than no data at all…
There is a vetting process, so why use exactly the same rules for a node that is only getting test data for the first month… or however exactly it works… then, after vetting completes successfully, things get more serious.

But I dunno… sorry if my ideas and comments are a bit extensive… it's not always easy to relay advanced concepts simply… if nothing else, so long as it makes people think a few new thoughts or gain a different perspective, it's not all wasted effort on my side.

IMO it also really depends on the numbers overall. If the allowed downtime is short, then any "rounding up" of downtime is really bad (for example, 5 hours max downtime per month, but a single failed check counts as 1 hour of downtime for you, even if that failed check happened because a packet was dropped by the network).

Come to think of it, what happens to nodes with a saturated uplink? On one hand, the node is transferring a lot of data, so it is definitely online, but it can still fail the check.

By the way, a brand new node is more likely to go offline for short periods of time as the (new) SNO figures out settings etc.

TL;DR
If a node fails an audit then, from what successrate.sh suggests, the audit is retried later… it would make sense for the uptime tracking to make use of that…

So if you drop an audit, it just tries again shortly after, or however it works… this should be part of the audit recovery mechanism… because the whole point was to improve performance and not add additional workloads to the network, so if there is a recovery of failed audits, treating that as a recoverable uptime check should be "trivial", at least in theory…
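Roughly, the retry idea could look like this (a hypothetical sketch, not how the satellite's auditor actually behaves today):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errUnreachable = errors.New("node unreachable")

// auditWithRetry records the node as offline only if both the initial
// contact and one delayed retry fail, so a single dropped connection
// doesn't count as downtime.
func auditWithRetry(contact func() error, retryDelay time.Duration) (offline bool) {
	if err := contact(); err == nil || !errors.Is(err, errUnreachable) {
		return false // reached the node (or failed for a non-connectivity reason)
	}
	time.Sleep(retryDelay)
	return errors.Is(contact(), errUnreachable)
}

func main() {
	attempts := 0
	flaky := func() error { // first attempt drops, second succeeds
		attempts++
		if attempts == 1 {
			return errUnreachable
		}
		return nil
	}
	fmt.Println("counted offline:", auditWithRetry(flaky, 10*time.Millisecond))
}
```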

[ramblings and reasons]
Well, it's for tracking uptime; a node that cannot be contacted is still kinda useless for the network if a customer wants access to their data… granted, @Pentium100, you are right… though at one point it wouldn't have mattered too much: just before the network went almost silent a few days ago, I was getting up to about an audit a minute… granted, my storagenode isn't a good representation of a stressed single-HDD node, but even so I don't think that, after the first month or so, lack of audits is a real issue… of course it would depend on network traffic, which is a downside, but when the network is at full tilt, the performance benefit of using audits for tracking uptime should be worthwhile.

And if the downtime is long enough, repair jobs would be started anyway, and thus one would be punished by the existence of extra copies of the data one held.

Right now on the network, a single audit might also represent a long while for most nodes.
I got 1000 audits in the last 10-11 hours… that was actually more than I would have expected… that's nearly equal to my uploads at 1160.

Maybe we should find some low-performing nodes that are 1-2 weeks old… maybe a month, and see what their numbers of audits actually are…

1k in 10 hours is more than 1 a minute; that's pretty decent tracking, especially if the system has a certain error tolerance… though I do have 0 failed audits… better to post the successrate output… so much easier.

My node is 9 weeks old…

$ ./successrate.sh /zPool/logs/storagenode_2020-05-16.log
========== AUDIT ==============
Critically failed: 0
Critical Fail Rate: 0.000%
Recoverable failed: 0
Recoverable Fail Rate: 0.000%
Successful: 966
Success Rate: 100.000%
========== DOWNLOAD ===========
Failed: 24
Fail Rate: 0.428%
Canceled: 24
Cancel Rate: 0.428%
Successful: 5564
Success Rate: 99.145%
========== UPLOAD =============
Rejected: 0
Acceptance Rate: 100.000%
---------- accepted -----------
Failed: 1
Fail Rate: 0.065%
Canceled: 379
Cancel Rate: 24.610%
Successful: 1160
Success Rate: 75.325%
========== REPAIR DOWNLOAD ====
Failed: 0
Fail Rate: 0.000%
Canceled: 0
Cancel Rate: 0.000%
Successful: 58
Success Rate: 100.000%
========== REPAIR UPLOAD ======
Failed: 0
Fail Rate: 0.000%
Canceled: 267
Cancel Rate: 22.723%
Successful: 908
Success Rate: 77.277%
========== DELETE =============
Failed: 0
Fail Rate: 0.000%
Successful: 351
Success Rate: 100.000%

It seems very much like audits rarely fail… but I'll check my logs and see what the worst days look like:

15-05-2020 - 13k requests, 2k cancelled, 109 failed downloads, 2k audits, 1 recoverable audit failed
14-05-2020 - 100k requests, 25k cancelled, 45 failed downloads, 872 audits, 3 recoverable audits failed
That's weird, I would have figured the more traffic, the more audits…
13-05-2020 - 100k requests, 20k cancelled, 77 failed downloads, 20 rejected :smiley:, 841 audits
12th - 900 audits, no fails, regular numbers
11th - 877 audits + 2 recoverable fails
10th - 1214 audits + 10 recoverable failed audits
9th - 534 audits + 6 recoverable failed audits (might be one of the days I crashed hard for an extended period… had some issues with my server turning itself off)
8th - 658 audits + 6 recoverable failed audits (most likely the same issue causing my numbers to be outside the norm)
7th - 824 audits + 0 failed (day of the deletions, 79000 deleted); got 36 rejected uploads because I'll decide just how fast I deal with stuff, thank you…
6th

The worst one yet:

========== AUDIT ==============
Critically failed:     0
Critical Fail Rate:    0.000%
Recoverable failed:    7
Recoverable Fail Rate: 2.154%
Successful:            318
Success Rate:          97.846%
========== DOWNLOAD ===========
Failed:                94
Fail Rate:             1.260%
Canceled:              48
Cancel Rate:           0.644%
Successful:            7317
Success Rate:          98.096%
========== UPLOAD =============
Rejected:              34
Acceptance Rate:       99.964%
---------- accepted -----------
Failed:                2
Fail Rate:             0.002%
Canceled:              15070
Cancel Rate:           16.144%
Successful:            78278
Success Rate:          83.854%
========== REPAIR DOWNLOAD ====
Failed:                0
Fail Rate:             0.000%
Canceled:              0
Cancel Rate:           0.000%
Successful:            17
Success Rate:          100.000%
========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              116
Cancel Rate:           13.892%
Successful:            719
Success Rate:          86.108%
========== DELETE =============
Failed:                0
Fail Rate:             0.000%
Successful:            107716
Success Rate:          100.000%

The logs go a bit further back like this… then they're in bigger chunks; I kinda like being able to do this…
It seems very much like audits, at least in my case, almost always get through… the average comes out to maybe a 99.5% success rate on audits, and most failures are recoverable… which might be something one could use…
If a node fails an audit then, from what successrate.sh suggests, the audit is retried later… it would make sense for the uptime tracking to make use of that…

So if you drop an audit, it just tries again shortly after, or however it works… this should be part of the audit recovery mechanism… because the whole point was to improve performance and not add additional workloads to the network, so if there is a recovery of failed audits, treating that as a recoverable uptime check should be "trivial", at least in theory…

From what I can see, I don't think there is a problem with using audits… they do vary quite a bit… but mine are anywhere from 500 to 2000 a day, with an average of pretty much 1k a day… ±200.

My node gets an audit on average every two minutes, though recently the audit rate has increased

However, my node has about 10TB of data. For a node with 100GB of data the average would be less than once per hour and the maximum would be once every 7 minutes.
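If audit frequency scales roughly with the share of network data a node holds, a node with 1/100th of the data should see about 1/100th of the audits: one audit every ~2 minutes at 10TB works out to roughly one every 200 minutes (over 3 hours) at 100GB, so a 24h window could contain only a handful of audits for small or new nodes.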


Good point, that explains why my audits seem so stable over extended periods…