New audit scoring is live

Again, this new audit score is not supposed to fix such issues; there are other failsafes to take care of that…

Even running with the old audit score system, which allowed only 10 consecutive audit failures, I haven’t had any issues, even when my ZFS storage was stalled out for hours.

Also, I don’t think inaccessible counts as failed exactly; if it did, I would have had nodes fail long ago.

I get that you are worried about such things, as am I… but thus far I haven’t seen any signs that I should be worried about that on my setup.

However, it is very worrying when the node keeps running while the storage is basically inaccessible…

Yeah, I am worried more about this than about losing 4% of data (which would be something like 900GB for my node). It is possible, but it is more likely that my entire pool would fall apart than that I would lose “just” 900GB of data.
900GB is quite a lot; I could even back up my node and my backups would stay valid for something like a month.

On the other hand, it is possible to rsync a node to a different server, run the new one for a while and then start the old one for some reason (forgetting to turn off autostart) and not notice it for 10 minutes.
Or, having more than one node on a VM, start a node and give it the wrong directory and not notice that for 10 minutes.

Yeah, I did start the same node twice one time… then they will share uptime… lol
Basically the network will switch between the two instances, because only one will be accepted at a time…
so uptime will start to drop on both nodes, afaik.

And ofc after a while the files will be slightly different, but again… the odds of those being audited are slim to none in most cases, except for new nodes.

If you give a node the wrong folder, it will refuse to start; same if the folder is write protected or such…
or it will shut down near immediately because it fails its write check.

Imagine I have two nodes A and B on the same VM and manage to start node A with the directory of node B.

In any case, the permanent disqualification (apparently very fast) for a temporary problem is what I am worried about the most. If I actually lose data, then DQ is the correct response, and I would probably be more upset about losing the other data in the pool than about the node.

You can’t. It checks for matching node ID between identity and data. The node won’t start that way.
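Roughly, that guard works like the sketch below. This is a minimal illustration with hypothetical file and function names, not the actual storagenode code, but the idea is the same: the storage directory remembers which node ID it belongs to, and startup is refused on a mismatch.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// verifyStorageDir illustrates the startup guard: the storage directory keeps
// a marker recording which node ID it belongs to, and the node refuses to
// start if that ID does not match the identity it was given.
func verifyStorageDir(storageDir, identityNodeID string) error {
	marker := filepath.Join(storageDir, "storage-dir-verification") // assumed marker file name
	data, err := os.ReadFile(marker)
	if err != nil {
		return fmt.Errorf("storage dir not initialized or unreadable: %w", err)
	}
	if strings.TrimSpace(string(data)) != identityNodeID {
		return errors.New("node ID in storage dir does not match identity; refusing to start")
	}
	// A writability check would follow here: create and delete a temp file to
	// make sure the directory can actually be written to before serving pieces.
	return nil
}

func main() {
	if err := verifyStorageDir("/mnt/storj/storage", "1ExampleNodeID"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("storage dir matches identity, starting node")
}
```

So pointing node A at node B’s directory fails this kind of check immediately, long before any audits are involved.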

Understandable, but a lot of checks are already in place, like the one I mentioned above plus readability/writability checks. And I know that dealing with IO timeouts was also in the works.

That said, this change wasn’t about that. However, it does bring other big improvements we can all celebrate.

3 Likes

@thepaul are changes to the dashboard to incorporate the new scoring planned already?

Screenshots posted here: Your Storage Node on the us-central-1 satellite has been disqualified and can no longer host data on the network - #2 by faga
The screenshots show that red highlighting isn’t applied when a node is below 96% (not even a yellow warning), and the info text still describes 60% as the threshold. I don’t see any changes on GitHub for this yet.

Given the relatively small allowed range (96-100%), the yellow warning should probably appear below 99.5%, or at least below 99%.
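To illustrate what I mean, something along these lines; the 96% level is the new DQ threshold, but the 99% warning cut-off and the function itself are only my suggestion, not actual dashboard code:

```go
package main

import "fmt"

// auditScoreColor sketches a possible color coding for the dashboard under the
// new scoring. The 0.96 disqualification threshold comes from the new system;
// the 0.99 warning cut-off is only a suggestion.
func auditScoreColor(score float64) string {
	switch {
	case score < 0.96:
		return "red" // below the disqualification threshold
	case score < 0.99:
		return "yellow" // warning: the node is failing audits
	default:
		return "green"
	}
}

func main() {
	for _, s := range []float64{1.0, 0.995, 0.97, 0.95} {
		fmt.Printf("audit score %.3f -> %s\n", s, auditScoreColor(s))
	}
}
```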

Just checking that these adjustments are in scope.

3 Likes

Good call. @heunland noted this problem too. We’re getting it prioritized and assigned now.

4 Likes

Created an issue about color coding for the audit score on the dashboard:

3 Likes

I see a few more posts popping up now, which was to be expected. But so far the story seems to be the same: file not found errors, pointing to actual data loss. It’s not exactly a flood of complaints, so that’s good. But I guess it does point out that the old system let nodes with fairly significant loss just hang around. Luckily there has always been plenty of redundancy to ensure that wasn’t really a problem.

I remember someone on the original topic saying they had a node with more than 4% loss too. I’m guessing they were caught by this as well, but at least they knew this was coming.

Found the quote:

How is this one doing @CutieePie ?

I had to transplant a node from one HDD to a new one a few days ago; apparently enough data got lost during the copy to get disqualified, but it’s not a big deal.

Luckily it was a little secondary node. I need to learn how to move a node safely, though.

There is documentation on that here: How do I migrate my node to a new device? | Storj Docs

Sorry to hear you lost the node though.

2 Likes

It’s cool, because it happened on the little one that has worked as a guinea pig :rofl:

Checked my nodes today, just for good measure since the new audit score is live.
Everything was at 100%, even my node which had a minor data loss a good while ago now… but it was very minor… though that didn’t keep the audit score from jumping around for like half a year…

So no signs of any problems from what I can tell, at least, and the new audit score system does make me feel a lot more confident that my nodes won’t be unjustly DQ’ed.

I have also been keeping an eye on the forum; so far a lot fewer complaints… ofc I’m sure those will keep trickling in for weeks or months… many might not keep much track of their nodes.

Yes or no? :innocent:

It’s possible those pieces have since been deleted. Or it’s just a minuscule fraction of data that is unlikely to be audited again.

Well… Both.

This effect is happening, and over long periods of time you would see it reflected. But the score still has a little random noise, inherent to any sampling method, just much less than before. There could still be a swing of about 1% either way, which is why I said the score moving more than 1% is a good indication. If you had 2% data loss, you would literally have to double your node’s size before the dilution becomes really noticeable, which could take a very long time.
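A quick back-of-the-envelope sketch of that dilution effect, with purely illustrative node sizes and assuming the lost pieces stay lost while all new data is intact:

```go
package main

import "fmt"

// lossFraction returns the fraction of held data that is lost after the node
// has grown, assuming the lost pieces stay lost and all new data is intact.
func lossFraction(lostTB, originalTB, addedTB float64) float64 {
	return lostTB / (originalTB + addedTB)
}

func main() {
	original := 10.0        // TB stored when the loss happened (illustrative)
	lost := 0.02 * original // 2% data loss
	for _, added := range []float64{0, 5, 10, 20} {
		fmt.Printf("after adding %4.1f TB: %.2f%% of held data is lost\n",
			added, 100*lossFraction(lost, original, added))
	}
}
```

With these made-up numbers, the expected audit failure rate only drops to about the 1% noise floor once the node has roughly doubled in size.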

1 Like

Interesting :nerd_face:

So, as expected, the 10% data loss node was DQ’ed within around 24 hours of the new audit system being enabled, as below (over a period of roughly 30 days).

The dedicated satellite AUDIT workers failed to pick up the 10% loss, with the audit workers on average picking up about 1% loss per day. The graph below shows AUDIT-only downloads (yes, the title is wrong :stuck_out_tongue: ): on average there are around 200-300 audits against ~1 million files, of which less than 1% on average fail.

The one SNOs should be thinking about is the satellite REPAIR worker, as this is what caused the DQ. Of interest: the 10% simulation node was over 12 months old, and so sees more repair activity. Another 10% simulation node, which is under 6 months old, has still not been DQ’ed; this is because the repair process hasn’t really kicked in for it yet, as the network currently seems able to achieve 6+ month durability on node pieces, and given current traffic patterns most pieces have a sub-14-day lifespan (sad :frowning: )

While percentages are being handed out, it is important to note that consecutive failures on the repair worker seem to be the issue. As below, in the DQ period there were 15k REPAIR requests, of which 252 failed… as a percentage that is roughly 1.7% (252 / 15,000), however over 150 of those failures occurred within a 1 hour window.

(screenshot: repair request results over the DQ period)
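To show why a burst matters so much more than the same number of failures spread out, below is a rough sketch of a beta-reputation style score update, similar in spirit to how the satellites track audit results; lambda, the weight and the starting history are illustrative values, not the real parameters.

```go
package main

import "fmt"

// A sketch of a beta-reputation style audit score: every outcome first decays
// the running alpha/beta counters and then adds weight to one of them, so
// recent results count more than old ones.
const (
	lambda = 0.99 // forgetting factor (illustrative)
	w      = 1.0  // weight per audit (illustrative)
)

type reputation struct{ alpha, beta float64 }

func (r *reputation) update(success bool) {
	r.alpha *= lambda
	r.beta *= lambda
	if success {
		r.alpha += w
	} else {
		r.beta += w
	}
}

func (r *reputation) score() float64 { return r.alpha / (r.alpha + r.beta) }

func run(outcomes []bool) float64 {
	r := &reputation{alpha: 100, beta: 0} // start from a healthy history (illustrative)
	for _, ok := range outcomes {
		r.update(ok)
	}
	return r.score()
}

func main() {
	// 1000 audits with 20 failures (2%): spread evenly vs. bunched at the end.
	spread := make([]bool, 1000)
	burst := make([]bool, 1000)
	for i := range spread {
		spread[i] = (i+1)%50 != 0 // every 50th audit fails
		burst[i] = i < 980        // the last 20 audits fail in a row
	}
	fmt.Printf("2%% failures spread out:   score %.3f\n", run(spread))
	fmt.Printf("2%% failures in one burst: score %.3f\n", run(burst))
}
```

With these made-up numbers the spread-out case ends around 0.97, while the same 2% bunched into one burst drops the score to roughly 0.82, far past a 96% cut-off.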

I will summarise:

1) Yes, the new DQ scale works much better in my view, as it removes bad nodes quicker. I would still say it’s not aggressive enough, as there is a cost to repair, but it is a good first step.

a) The display in the node GUI is super confusing for SNOs: a 96% AUDIT score now meaning DQ is not intuitive, remembering this forum doesn’t represent more than a few hundred SNOs. Maybe rename this stat to something like “percentage of the way to being DQ’ed” and make it run 0% - 100%, so at 100% you are DQ’ed (see the sketch after this list). I think this would prevent a lot of issues and reduce the flood of queries on the forum, which just create noise and lots of posts.

b) This change now impacts older nodes (6+ months) more, due to the REPAIR worker retrieving older pieces from them.

c) This could create a negative wave of SNOs in 6 months… Imagine you join as an SNO and get through vetting; your node’s disk is bad, but the REPAIR worker doesn’t really pull your pieces yet because the data is so new… Then, as your node approaches the 6 month mark, your older pieces start to be retrieved, and suddenly you go from seemingly fine to a DQ’ed node.
2) I think this can be covered by pulling the AUDIT system out of the satellite code base; the AUDIT workers in my view should be dedicated workers. I would then split the worker base to address 3 scenarios to start with:
a) New node auditors: dedicated to more aggressive auditing of nodes under 3 months old. This way, if someone new joins with a bad node, they will be DQ’ed much quicker because their node is being audited more. That will actually make them feel better, as no one wants to find out their node is bad after 6+ months.
b) Standard auditors: these handle nodes older than 3 months.
c) Old-age auditors: these handle pieces which are more than 6 months old.
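As a sketch of what I mean in 1a, this is how the 96% - 100% survival band could be mapped onto a 0% - 100% “distance to DQ” figure; the 96% threshold is the real one, but the function name and display idea are just my suggestion:

```go
package main

import "fmt"

// dqProgress maps an audit score in the 96%-100% survival band onto a
// 0%-100% "how close am I to disqualification" figure for display purposes.
func dqProgress(auditScore float64) float64 {
	const dqThreshold = 0.96
	p := (1.0 - auditScore) / (1.0 - dqThreshold) * 100
	if p < 0 {
		p = 0
	}
	if p > 100 {
		p = 100 // at or below 96% the node is already DQ'ed
	}
	return p
}

func main() {
	for _, s := range []float64{1.0, 0.995, 0.98, 0.96} {
		fmt.Printf("audit score %.1f%% -> %.0f%% of the way to DQ\n", s*100, dqProgress(s))
	}
}
```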

CP

Sorry, just realised I’ve done an SGC-sized post, sorry :rofl:

11 Likes

My node has been disqualified on just one satellite (us1); the other satellites are OK. In the logs I have seen just this type of error:

2022-08-24T21:39:53.306+0500 ERROR piecestore download failed {Piece ID: MWKDV522B2RPYLAX7ZAI6ESWJMQZKDYROAADZWRSKPQFRC3C2QPA, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Action: GET, error: write tcp 10.201…230:48298: use of closed network connection, errorVerbose: write tcp 10.201…198.230:48298: use of closed network connection\n\tstorj.io/drpc/drpcstream.(*Stream).rawWriteLocked:326\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:404\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:621\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22}
2022-08-24T21:39:53.338+0500 INFO piecestore upload started {Piece ID: HCJAPKMALWJETUNZNOBDE5D3JFQE2TMN6MDZUVWN4WXKRGVMYXAQ, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Action: PUT, Available Space: 1385922376448}

I love this graph as it shows the reset of the score and gradual degradation from there compared to the wild fluctuations before it. It’s nice to see it working in such a good visual representation.

I understand, but you should still see the score gradually dropping. It will happen at some point, it just takes a little longer on nodes with less repair. Could you maybe show the same graph for that node? I’d be interested to see it progress.

That’s something to keep an eye out for. If the repair worker clusters because of how it goes through pieces in a non-random way, it may DQ nodes a little faster than intended. In your case with about 1.7% failures. Just pinging @thepaul here to take a look at that. The audit process is built to be random, but it seems repair may cause some clustering that needs to be taken into account.

If you’re only failing 1% of audits, I would say your data loss at this point isn’t close to 10% anymore though.

I don’t think there is currently a need to be overly aggressive on this. Even before, when lots of worse nodes survived, it didn’t really cause an issue. Keep in mind that this change came with a reset of scores, so we may see more nodes DQ’ed over the coming days. It’s better for the satellites to collect enough data about a node before DQ and take a little more time than to risk DQ’ing nodes that weren’t intended to be DQ’ed.

I’ve advocated for this before and some nice bars on the dashboard. This is already how the numbers work in my earnings calculator. So yeah, I agree this would be better.

The impact will be the same, it will just take longer on smaller nodes. Which isn’t really a big deal as those also hold less data. There is also plenty of redundancy to cover for that.

Agreed, the network could do with a bit more auditing in general. Using repair is a partial and delayed solution that doesn’t work too well on new nodes.

I love these suggestions and I think that would help a lot. If this could be scaled up enough, perhaps using repair for this won’t be necessary at all. Given that repair is actually a non-random approach and leads to some clustering as you showed.

When they’re packed with useful information like this, I welcome that :rofl: (I can hardly say anything else about it as I’m prone to doing it myself, like right now, haha)

3 Likes

The error you posted isn’t one that impacts your audit score.
Have a look at this page to learn more and find out which errors caused the DQ.

1 Like

Even going by the graph before, this doesn’t look like a 10% loss anymore. It’s possible that the data that was once lost has for a large part been removed by customers and additional new data has made the loss percentage drop significantly. It almost looks like the score is already kind of settling where it is. I’m curious to see where this one is going over time. It may even survive.