New audit scoring is live

I had to transplant a node from one HDD to a new one a few days ago. Apparently enough data got lost during the copy to get it disqualified, but it’s not a big deal.

Luckily it was only a little secondary node. I do need to learn how to move a node safely though.

There is documentation on that here. How do I migrate my node to a new device? | Storj Docs

Sorry to hear you lost the node though.


It’s cool, because it happened on the little one, which has worked as a guinea pig :rofl:

checked my nodes today, just for good measure since the new audit score is live.
everything was at 100%, even my node which had a minor data loss a good while ago now… it was very minor… but that didn’t keep the audit score from jumping around for like ½ a year…

so no signs of any problems from what i can tell at least, and the new audit score system does make me feel a lot more confident that my nodes won’t be unjustly DQ’ed.

have also been keeping an eye on the forum, so far a lot fewer complaints… ofc i’m sure those will keep trickling in for weeks or months… many might not keep much track of their nodes.

Yes or no? :innocent:

It’s possible those pieces have since been deleted. Or it’s just a minuscule fraction of data that is unlikely to be audited again.

Well… Both.

This effect is happening, and over long periods of time you would see it reflected. But the score still has a little random noise inherent in any sampling method. Just much less. There could still be a swing of 1% either way, which is why I said the score moving more than 1% is a good indication. If you had 2% data loss, you would literally have to double your node’s size before it becomes really noticeable. Which could take a very long time.
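To put a rough number on that sampling noise, here’s a minimal sketch (all figures invented for illustration) of how much the observed failure rate can swing when a node with a true 2% loss only gets a few hundred random audits in a window:

```python
import random

def audit_failure_estimate(true_loss, n_audits, trials=1_000, seed=42):
    """Simulate many audit windows against a node with a fixed true loss
    rate; each audit hits a random piece, so a window is just n_audits
    Bernoulli draws. Returns the (min, max) observed failure rate."""
    rng = random.Random(seed)
    rates = []
    for _ in range(trials):
        fails = sum(rng.random() < true_loss for _ in range(n_audits))
        rates.append(fails / n_audits)
    return min(rates), max(rates)

lo, hi = audit_failure_estimate(true_loss=0.02, n_audits=300)
print(f"true loss 2%, observed per-window rate: {lo:.1%} .. {hi:.1%}")
```

With only ~300 samples per window, the observed rate easily swings a percent or more either way, which is the ±1% noise described above.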


Interesting :nerd_face:

So as expected, the 10% data loss node was DQ’ed around 24 hours after the new audit system was enabled, as below (period of roughly 30 days)

The dedicated Sat AUDIT workers failed to pick up the 10% loss, with the average audit worker picking up 1% loss per day. As below, the graph shows AUDIT-only downloads (yes, the title is wrong :stuck_out_tongue: ). On average there are around 200-300 audits on ~1 million files, of which less than 1% on average fail.

The one SNOs should be thinking about is the Sat REPAIR worker, as this is what caused the DQ. Of interest, the 10% simulation node was over 12 months old, and so has more repair activity. Another 10% simulation node which is under 6 months old has still not been DQ’ed. This is due to the repair process not really kicking in yet, as the network currently seems able to achieve 6+ months of durability on node pieces, and given current traffic patterns most pieces have a sub-14-day lifespan (sad :frowning: )

While percentages are being handed out, it’s important to note that consecutive failures on the repair worker seem to be the issue. As below, in the DQ period there were 15k REPAIR requests, of which 252 failed… as a percentage this is roughly 1.7% (sorry, my math is bad, so please correct me), however over 150 of those occurred within a 1-hour window.
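Quick sanity check on that percentage, using the figures above (15k repair requests, 252 failed):

```python
repair_requests = 15_000
failed = 252
rate = failed / repair_requests
print(f"{rate:.2%}")  # -> 1.68%
```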


I will summarise:

1) Yes, the new DQ scale works much better in my view as it removes bad nodes quicker - I would still say it’s not aggressive enough, as there is a cost to repair, but it’s a good first step.

a) The display in the node GUI is super confusing for SNOs - “96% of AUDIT is now a DQ” is not intuitive, remembering this forum doesn’t represent more than a few hundred SNOs. Maybe rename this stat to “Percentage of being DQ’ed” or something, and make it 0% - 100%, so at 100% you are DQ’ed - I think that would save a lot of issues - it would also reduce the flood of queries on the forum, which is just creating noise and lots of posts.

b) This change now impacts older nodes (6+ months) more, due to the REPAIR worker process getting older pieces from them.

c) This could create a negative wave of SNOs in 6 months… Imagine you join as an SNO and go through vetting; your node disk is bad, but the REPAIR worker doesn’t really pull your pieces as your data is so new… Then as your node approaches the 6-month mark, your older pieces start to be retrieved, and suddenly you go from fine to a DQ’ed node.
2) I think this can be covered by pulling the AUDIT system out of the SAT code base - the AUDIT workers in my view should be dedicated workers. I would then split the worker base to address, to start with, 3 scenarios.
a) New Node Auditors - dedicated to more aggressive auditing of nodes under 3 months old - This way, if someone new joins, they will be DQ’ed much quicker as their node is being audited more - This will make them feel better, as no one wants to find out their node is bad after 6+ months
b) Standard Auditor - This does nodes older than 3 months
c) Old Age Auditor - This does pieces which are more than 6 months old


Sorry, just realised I’ve done an SGC-sized post :rofl:


my node has been disqualified on just one satellite (us1), the other satellites are ok. in the logs i have seen just this type of error

2022-08-24T21:39:53.306+0500 ERROR piecestore download failed {Piece ID: MWKDV522B2RPYLAX7ZAI6ESWJMQZKDYROAADZWRSKPQFRC3C2QPA, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Action: GET, error: write tcp 10.201…230:48298: use of closed network connection, errorVerbose: write tcp 10.201…198.230:48298: use of closed network connection\n\*Stream).rawWriteLocked:326\n\*Stream).MsgSend:404\n\*drpcPiecestore_DownloadStream).Send:317\n\*Endpoint).Download.func5.1:621\n\}
2022-08-24T21:39:53.338+0500 INFO piecestore upload started {Piece ID: HCJAPKMALWJETUNZNOBDE5D3JFQE2TMN6MDZUVWN4WXKRGVMYXAQ, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Action: PUT, Available Space: 1385922376448}

I love this graph as it shows the reset of the score and gradual degradation from there compared to the wild fluctuations before it. It’s nice to see it working in such a good visual representation.

I understand, but you should still see the score gradually dropping. It will happen at some point, it just takes a little longer on nodes with less repair. Could you maybe show the same graph for that node? I’d be interested to see it progress.

That’s something to keep an eye out for. If the repair worker clusters because of how it goes through pieces in a non-random way, it may DQ nodes a little faster than intended. In your case with about 1.7% failures. Just pinging @thepaul here to take a look at that. The audit process is built to be random, but it seems repair may cause some clustering that needs to be taken into account.

If you’re only failing 1% of audits, I would say your data loss at this point isn’t close to 10% anymore though.

I don’t think there is currently a need to be over aggressive on this. Even before when lots of worse nodes survived it didn’t really cause an issue. Keep in mind that this change came with a reset of scores, so we may see more nodes DQ’ed over the coming days. It’s best for the satellites to collect enough data about the node before DQ and take a little more time than risk DQ’ing nodes that weren’t intended to be DQ’ed.

I’ve advocated for this before and some nice bars on the dashboard. This is already how the numbers work in my earnings calculator. So yeah, I agree this would be better.

The impact will be the same, it will just take longer on smaller nodes. Which isn’t really a big deal as those also hold less data. There is also plenty of redundancy to cover for that.

Agreed, the network could do with a little more audits in general. Using repair is a partial and delayed solution that doesn’t work too well on new nodes.

I love these suggestions and I think that would help a lot. If this could be scaled up enough, perhaps using repair for this won’t be necessary at all. Given that repair is actually a non-random approach and leads to some clustering as you showed.

When they’re packed with useful information like this, I welcome that :rofl: (I can hardly say anything else about it as I’m prone to doing it myself, like right now, haha)


The error you posted isn’t one that impacts your audit score.
Have a look at this page to learn more and find out which errors caused the DQ.


Even going by the graph before, this doesn’t look like a 10% loss anymore. It’s possible that the data that was once lost has for a large part been removed by customers and additional new data has made the loss percentage drop significantly. It almost looks like the score is already kind of settling where it is. I’m curious to see where this one is going over time. It may even survive.

Seconded! That is awesome. We don’t have historical tracking of any particular node’s reputation score, so that’s very interesting to see.

That does sound really good- it could be called “audit failure rate” maybe, even though it’s not exactly a rate. Or just “data loss”.

Interesting. Repair isn’t very random, as it traverses the segment ID keyspace in order, but the piece IDs that get repaired ought to be indistinguishable from random.

If there is a cluster of errors like that, it must be from something else wrong in the system. It could be network problems between the satellite and the node, for example, or temporary high load on the node that made it unable to service requests as fast as usual.

You’ll be happy to know this is already underway. Not removing from the satellite code base; the auditors will still be a part of the satellite, but they will be dedicated workers and will be runnable on dedicated VMs. But we will soon be able to scale up audit workers. (Currently, due to an early shortcut we took, we can only safely run one audit worker at a time.)

This might be harder to implement with the existing architecture. It helps us to be able to audit all the pieces for a given segment at the same time. That way, we don’t need to download entire pieces to check their hash. Instead, we can reconstruct a single stripe using Berlekamp-Welch forward-error correction and be able to identify which inputs don’t match the others. We could have different audit workers operate on segments which are new, 6 months old, etc, and that would naturally have a tendency to hit nodes of different ages as you suggest. Maybe that is an option to explore.
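As a toy illustration of that idea (this is not Berlekamp-Welch and not Storj code, just a hypothetical k=2 polynomial code decoded by brute force), the share that disagrees with the others can be located from the redundancy alone:

```python
from itertools import combinations
from collections import Counter

# Toy 2-of-n Reed-Solomon-style stripe: shares are points on a line
# f(x) = a + b*x. Any 2 honest shares recover (a, b); decoding every
# pair and majority-voting exposes the share that doesn't fit.

def make_shares(a, b, n):
    return {x: a + b * x for x in range(1, n + 1)}

def decode_pair(shares, x1, x2):
    # Solve a + b*x1 = y1 and a + b*x2 = y2 for (a, b)
    y1, y2 = shares[x1], shares[x2]
    b = (y2 - y1) / (x2 - x1)
    return (y1 - b * x1, b)

def locate_bad_share(shares):
    votes = Counter(decode_pair(shares, x1, x2)
                    for x1, x2 in combinations(shares, 2))
    a, b = votes.most_common(1)[0][0]   # majority reconstruction
    return [x for x, y in shares.items() if y != a + b * x]

shares = make_shares(a=7, b=3, n=5)
shares[4] = 999                        # one node returns a bad share
print(locate_bad_share(shares))        # -> [4]
```

Real Berlekamp-Welch does this algebraically rather than by trying subsets, but the end result is the same: the satellite can tell which input didn’t match without downloading and hashing whole pieces.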


Hi Thanks for the response.

I’m currently running a workflow to take a deep look at the pieces involved in the DQ… it’s going to take a long time, as there are lots of systems to work through, and many rows.

Looking at early results, there is a trend which I wasn’t expecting on AUDIT.

As an example, one of the pieces involved in the 10% DQ traced is as below.

Aug 8th - GET_AUDIT - Piece ID AAA, node responded with “ERROR - file not found” - Time Delta hh:mm:30.421
Aug 8th - GET_AUDIT - Piece ID AAA, node responded with “ERROR - file not found” - Time Delta hh:mm:30.421

This was a duplicate request processed, call trace below.

file does not exist*Endpoint).Download:546*Mux).HandleRPC:33*Handler).HandleRPC:58*Server).handleRPC:122


Aug 16th - GET_AUDIT - Piece ID AAA, node responded with “ERROR - file not found” - Time Delta hh:mm:05.341
Aug 16th - GET_AUDIT - Piece ID AAA, node responded with “ERROR - file not found” - Time Delta hh:mm:05.341

Again, a duplicate request was processed for the same PIECE_ID (this really confuses me; my understanding was that a failed audit on the piece marks it off the node, and could trigger a repair)


Aug 24th - GET_REPAIR - Piece ID AAA , node responded with “ERROR - file not found” - Time Delta hh:mm:18.040

No duplicate on this one; again the same piece ID that had already failed 2 audits, and this repair is part of the “cluster of non-random pieces”


  1. Why are we sending duplicate GET_AUDITs in some cases? I’m really surprised the delta is exactly the same - again, I haven’t got all the details, but the firewall segment offset stamped on the data is the same, which means either only 1 request was received and the node managed to duplicate it in the StorageNode code, or two parallel GET_AUDIT requests were sent over clearnet at EXACTLY the same time (this is hard, but not impossible - it can often be a side effect of leaking clusters, bad network teaming, or even network card drivers)

  2. Why, after a clear failed audit where the file is not available, do we retry again days later? It’s clearly going to fail again - my understanding was the file would be marked as lost on that node and no further requests would be made, but this doesn’t seem to be the behaviour seen.

  3. The starting point of this investigation was the GET_REPAIR; it was part of the 10% node’s burst of pieces and shares the same piece ID as the previously failed audits… why was the piece still tagged to the node, with 2 failed audits days before…

…I’m still checking node and disk IO - it would be interesting if the duplicates are linked to busy disks causing timeouts.

#edit: nope, the latency from the start of the download request to the node replying with a failed download is < 200ms on average

#edit: also thinking, is this a database issue with transaction locking and rollback when the duplicate requests happen - so the synchronised update to the satellite causes no update to the database, or a rollback as a locking or sync issue detects a failed transaction and the default is to roll back, therefore no update to the piece - that would be bad, as failed pieces would not be represented as lost. (again, I haven’t looked at the schema, so this might not even be possible through constraints)


This isn’t the case. The point of audits has never been to judge whether pieces were available, but it’s only meant to judge whether the node as a whole is reliable enough to keep around.

Even when you include repair, audits only touch a very small subset of your node. So it’s not possible to reliably mark pieces as lost anyway. And as long as the node has lost an acceptable amount, the redundancy on the network can easily handle those losses.

Is that in order of segment id or in the order they are listed in the table? The first should be quite random I think, assuming segment id’s are randomly generated like piece id’s. But if it’s in order of how it’s been added to a table, there would be time wise clustering, which could be problematic for nodes that lost data because of a temporary problem, like a mistake during migration or something. If you lost a few hours of ingress and repair goes through them in chronological order, it would cluster those errors together and fail a node faster than intended as a result.
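To visualise that concern, here’s a quick hypothetical simulation (all numbers invented): a node lost every piece ingested during one bad window, and we compare the longest streak of consecutive repair failures when the queue is walked in ingest order versus random order:

```python
import random

def longest_failure_run(order, lost):
    """Longest streak of consecutive failed requests while walking the
    repair queue in the given order."""
    run = best = 0
    for piece in order:
        run = run + 1 if piece in lost else 0
        best = max(best, run)
    return best

random.seed(1)
pieces = list(range(10_000))          # queue in ingest order
lost = set(range(4_000, 4_200))       # 2% lost, all from one bad window

chronological = longest_failure_run(pieces, lost)
shuffled = list(pieces)
random.shuffle(shuffled)
randomized = longest_failure_run(shuffled, lost)
print(chronological, randomized)      # ingest order: one run of 200; shuffled: a short run
```

Same 2% overall loss either way, but the ordered walk concentrates all the failures into one burst, which is exactly the kind of cluster that could trip a score faster than intended.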

node size is also a major factor when it comes to new nodes; as a node finishes vetting its size starts to grow at an immense rate, and thus 10% data loss a week ago might only represent 1% now…
i know this is only relevant for a short period, but still, early on the ingress of a month can be a substantial part of a node’s total data stored.

so unless one is getting rid of or corrupting 10% of the data on a continual basis, a new node could have moved back to a state where it won’t be DQ’ed.
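The dilution effect is just arithmetic; with made-up numbers for illustration:

```python
# Made-up numbers, just to illustrate the dilution effect described above.
lost = 50            # GB lost in a one-off incident shortly after vetting
size_then = 500      # node size at the time -> 10% of pieces missing
ingress = 4_500      # healthy data added in the weeks since

loss_then = lost / size_then
loss_now = lost / (size_then + ingress)
print(f"{loss_then:.0%} then, {loss_now:.0%} of stored pieces now")
# -> 10% then, 1% of stored pieces now
```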

agreed the score should be uniform or easy to read…
personally i might just make it the same as the suspension score and online score.
since they are already known numbers…

not sure why they even changed it to 96% anyway… i mean, it’s not like DQ would only happen at 40% data loss before…
it should be changed back or revamped.

complex topics or ideas aren’t easy to do in short form. :smiley:

If repairs are not random, then maybe they should affect the audit score less? Something like two failed repairs = 1 failed audit. Otherwise there would be unfair DQs.

I must say, even though the “audit score” could probably be renamed for better clarity (and/or span on a larger scale depending on what it should reflect), I really like this new system better :ok_hand:

Before, one of my nodes that did lose data in the past was randomly at 100, then 90, then back to 100 a few days later, then 95, rarely at 85… It was difficult to get a good idea of its health.
Now, it’s more or less steady around 99.85% which is very nice and more usable.

I say, nice job to the Team and all forum members who actively participated in improving the audit score system! :+1: :smile:


Nice to see it working as intended! Thanks for posting your experience.


I don’t know why that is happening. I think the most likely explanation is that the line was simply emitted twice to the log, because of some minor bug with logger setup or confusion about whether the line had been logged already. Nothing in the Storj network is sent over clearnet; everything is encrypted and authenticated over TLS, so duplicate packets or leaky networks or bad drivers aren’t to blame.

@BrightSilence is right; that’s not the way things work at the moment. It could be made to work that way, but audit failures are not so frequent that it seems like an important thing to do.

  1. For one thing, it is sometimes the case that a clear failed audit becomes a successful audit in the future, when mount points are fixed or data moved to the right location, etc.
  2. Then, even if we changed the code, it would still be possible that the satellite does not receive the error from the failed audit on your node before the audit times out. The error might be logged on the node but the audit resulted in a timeout error on the satellite side. This would put the node in “containment”, where we keep trying to audit the same piece on purpose until we get a timely response one way or the other, or until the node is disqualified because of too many audit timeouts. That is to say, it might still look from the node side like we are auditing the same piece multiple times, even after a clear failure.

That’s a well-reasoned theory! I like the way you think. But no, that doesn’t match with the way we deal with the databases. The explanation is simply that we don’t mark pieces as missing on audit failure. (Yet. It could certainly change.)

That is in order of segment ID. You’re right that it is essentially random as far as the user or the SNO is concerned; I meant it is not random from our perspective, where we know what the segment IDs are already. But I think it’s not an important difference in this case.

There can indeed be clustering of errors within a segment; if one piece had a problem, it is more likely that other pieces of the same segment will have errors. But since no node should have more than one piece of that segment, that shouldn’t have any clustering effect on nodes themselves.

I think you’re entirely right, but I think it’s an acceptable situation. If a node loses 10% of its data very early on but they fix whatever the problem was and accumulate lots more data that is not lost, before we have time to discover the data loss by audit, it is probably fair to treat that node as though it had only lost the 1%, which of course is the case. If we didn’t have time to notice the early data loss by audit, then it is also less likely that the missing pieces had any effect on their respective segments’ durability during that time.

Not sure what you mean here. Do you know of any nodes with 40% data loss that weren’t disqualified? The old audit reputation was not as closely correlated with data loss percentage as it is now, with the newer beta model parameters, so it would have been harder to identify the nodes with 40% loss, but still it’s quite likely they would have been disqualified.


As long as they are not generated in order that should still make the order random though. I can’t really explain the clustering in the graphs then though. Unless these errors aren’t caused by missing data, but by slowdown on the node itself. But in this case I believe missing data was what caused the errors.

I think @SGC meant that it didn’t require 40% data loss to be disqualified before. Personally I would consider that part of the problem with the old system. The threshold had to be that low because of the erratic score behavior, not because 40% data loss was actually allowed to survive. It became way too much a game of chance rather than a clear cutoff of what’s acceptable. The new scores still have a small margin of chance, but it dropped from a 20% range to a 2% range (roughly). So while the old system suggested that up to 40% loss was acceptable, anything beyond 5% loss could in some cases lead to disqualification after a long time. The new system may suggest 4% loss is ok, but anything beyond 2% loss is dangerous. That seems much more reasonable to me.
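That shrinking range falls out of the score model itself. As I understand it, the audit score is a beta-reputation filter: score = α/(α+β), where α decays by a forgetting factor λ and gains 1 on a passed audit, and β gains the 1 on a failure. The λ values, audit counts, and 2% loss rate below are illustrative assumptions, not the production parameters, but they show why a longer memory (λ closer to 1) shrinks the random swings:

```python
import random

def min_score(loss, lam, n_audits, seed):
    """Worst audit score seen while a node with a fixed true loss rate
    is audited n_audits times under forgetting factor lam."""
    rng = random.Random(seed)
    a, b = 1 / (1 - lam), 0.0   # start from a fully healthy steady state
    lowest = 1.0
    for _ in range(n_audits):
        failed = rng.random() < loss
        a = lam * a + (0.0 if failed else 1.0)
        b = lam * b + (1.0 if failed else 0.0)
        lowest = min(lowest, a / (a + b))
    return lowest

results = {}
for lam in (0.95, 0.999):       # short memory vs long memory (assumed values)
    results[lam] = min(min_score(0.02, lam, 5_000, s) for s in range(20))
    print(f"lambda={lam}: worst score seen {results[lam]:.3f}")
```

With the short memory, a 2% loss node can swing far below its true health on a streak of bad luck; with the long memory, the same node stays in a narrow band, which is what allows a tight threshold like 96% to work.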

In the end, I think the reasons were clearly outlined in the linked topic and in my summary here. So @SGC, have a look at that summary, and if something is still not clear as to why this was chosen, maybe you can be more specific?