Why did my audit percentage drop?

iops · October 19, 2020, 1:47pm

Would this cause audit percentages to drop? Was fine a few hours ago, checked my dashboard now and I see this, the rest of my satellites are all at 100%. Nothing wrong with the drive.

europe-west-1.tardigrade.io:7777

Suspension: 100 %
Audit: 89.01586945079346 %

Running an audit script:

for sat in $(docker exec -i storagenode wget -qO - localhost:14002/api/sno | jq -r '.satellites[].id'); do docker exec -i storagenode wget -qO - "localhost:14002/api/sno/satellite/$sat" | jq '.id, .audit'; done
"1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"
{
  "totalCount": 11232,
  "successCount": 11232,
  "alpha": 20,
  "beta": 0,
  "unknownAlpha": 20,
  "unknownBeta": 0,
  "score": 1,
  "unknownScore": 1
}
"121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"
{
  "totalCount": 23014,
  "successCount": 23014,
  "alpha": 20,
  "beta": 0,
  "unknownAlpha": 20,
  "unknownBeta": 0,
  "score": 1,
  "unknownScore": 1
}
wget: server returned error: HTTP/1.1 500 Internal Server Error
wget: server returned error: HTTP/1.1 500 Internal Server Error
wget: server returned error: HTTP/1.1 500 Internal Server Error

EDIT: Audit back up to 95%. Weird.

SGC · October 19, 2020, 4:44pm

my audit is stable at 100% for europe west
you should check your disk for errors or check your logs for failed audits to try to identify what the problem was…

Pac · October 19, 2020, 9:26pm

That’s pretty concerning… ?

You should check your logs for lines containing both “GET_AUDIT” and “failed” to find out what’s causing your audit scores to drop before your node gets disqualified.
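
A minimal sketch of that log check, assuming the node runs in a Docker container named `storagenode` (as in the script earlier in the thread) and writes its log to the container output rather than to a file. The function name `failed_audits` is made up for illustration.

```shell
# Print only log lines that mention both GET_AUDIT and "failed".
# Reads the log from stdin, so it works with docker logs or a log file.
failed_audits() {
  grep 'GET_AUDIT' | grep 'failed'
}

# Usage (assumes a container named "storagenode"):
#   docker logs storagenode 2>&1 | failed_audits
```

If the node logs to a file instead, the same function can be fed with `cat` on that file.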

iops · October 20, 2020, 7:23am

Nothing concerning in my logs, what’s throwing the 500 here exactly? Nothing from my side.

SGC · October 20, 2020, 7:32am

maybe it’s an old script?
that would be the first thing I would check anyway… otherwise I don’t know

iops · October 20, 2020, 7:53am

Grabbed a newer script, this one, for anyone interested: https://github.com/ReneSmeekes/storj_success_rate

========== AUDIT ============== 
Critically failed:     3 
Critical Fail Rate:    1.068%
Recoverable failed:    0 
Recoverable Fail Rate: 0.000%
Successful:            278 
Success Rate:          98.932%
========== DOWNLOAD =========== 
Failed:                124 
Fail Rate:             1.163%
Canceled:              535 
Cancel Rate:           5.019%
Successful:            10001 
Success Rate:          93.818%
========== UPLOAD ============= 
Rejected:              0 
Acceptance Rate:       100.000%
---------- accepted ----------- 
Failed:                1 
Fail Rate:             0.002%
Canceled:              2112 
Cancel Rate:           3.542%
Successful:            57512 
Success Rate:          96.456%
========== REPAIR DOWNLOAD ==== 
Failed:                1 
Fail Rate:             0.239%
Canceled:              0 
Cancel Rate:           0.000%
Successful:            418 
Success Rate:          99.761%
========== REPAIR UPLOAD ====== 
Failed:                0 
Fail Rate:             0.000%
Canceled:              0 
Cancel Rate:           0.000%
Successful:            1889 
Success Rate:          100.000%
========== DELETE ============= 
Failed:                0 
Fail Rate:             0.000%
Successful:            1399 
Success Rate:          100.000%
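
Each percentage in this report is a count divided by the total for its section. A small sketch that reproduces the audit figures; the helper name `rate` is made up here and is not taken from the linked script.

```shell
# Percentage of `count` out of `total`, with three decimals as in the report.
rate() {
  awk -v c="$1" -v t="$2" \
    'BEGIN { if (t > 0) printf "%.3f%%\n", 100 * c / t; else print "n/a" }'
}

# Audit section: 3 critical + 0 recoverable + 278 successful = 281 audits
rate 3 281     # critical fail rate
rate 278 281   # success rate
```

With only 281 audits in the log, three failures already move the rate by about one percentage point, so the figure shifts noticeably as more successful audits come in.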

SGC · October 20, 2020, 8:31am

seems like you are missing some files, you should check your hdd for problems…

though it doesn’t have to be from disk problems, it could also be from system instability / random reboots causing some files to be lost… or you migrated the node recently and didn’t get everything transferred, or a thousand similar problems.

a critical failed audit means you got an audit request for a piece and weren’t able to provide it… the satellite will then wait for a while… I forget how long… or until it has contact with the node again, then request an audit for the same piece again, and if your storagenode again fails to provide the piece it counts as a critical failed audit…

this would mean it’s not a temporary state of your storagenode: it literally is missing a piece which the satellite expects the node to have. this is very bad and can kill your node very quickly.
ofc if it’s from an issue that is already solved, which it can be… then the audit % will just jump up and down and hopefully the node won’t get DQ’d for dropping too low… but that’s very random… it depends on which pieces the satellite decides to audit.

so if you know you have had an issue that could have caused loss of files, then it’s most likely due to that…

if you aren’t aware of a problem you should try to track down the issue sooner rather than later, because it can get your node DQ’d in something like a day or two, maybe even less…

Alexey · October 20, 2020, 8:49am

The “file not found” error fails the audit immediately; there are no other attempts, unless the piece is audited again later (and it will eventually fail again).
Only a timeout can cause containment mode, where the same piece is checked three more times.
If the storagenode is unable to provide the piece even then, the audit is considered failed.

Since the node has critically failed audits, that means the pieces were unavailable and threw the “file not found” error.

So it is either a disk failure, or files got lost during an abnormal reboot (a power loss, for example).
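
To see which of the two cases applies, the failed audit lines can be split by error text. This is a rough sketch: the function name `audit_failure_summary` is made up, and the default pattern `file does not exist` is an assumption about how the missing-piece error is worded in the log, so check it against your own log lines first.

```shell
# Split failed GET_AUDIT lines into "missing piece" and "other" failures.
# $1: text that marks a missing piece. The default is an ASSUMPTION about
#     the log wording; pass your own text if your log says it differently.
audit_failure_summary() {
  pattern="${1:-file does not exist}"
  grep 'GET_AUDIT' | grep 'failed' | awk -v p="$pattern" '
    index($0, p) { missing++; next }   # line contains the missing-piece text
    { other++ }                        # any other failure, e.g. a timeout
    END { printf "missing=%d other=%d\n", missing, other }'
}

# Usage (assumes a container named "storagenode"):
#   docker logs storagenode 2>&1 | audit_failure_summary
```

A non-zero `missing` count points at lost pieces, while failures counted under `other` are more likely timeouts or similar temporary conditions.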

iops · October 20, 2020, 9:54am

Jumped back to 98% again, still 3 critical fails but the fail rate is down to 0.926%
The HDD is fine and uptime has been excellent on this machine. I did have intermittent internet issues a while back; could that have caused it?

How do I fix it though?

baker · October 20, 2020, 12:50pm

Internet issues shouldn’t cause missing files. I am pretty sure the transfer would have failed and your node would not be asked for the piece during audit.

There is nothing to fix in terms of the missing files. They are gone. But you should run a file system check and make sure everything else in your system is working properly. If there are no further issues and you aren’t missing too many more pieces, your node may be okay. About once every couple of months I see a failed audit due to some unclean shutdowns I had a while back, but the node keeps running.

Pac · October 20, 2020, 3:05pm

True that!
A node can even survive losing thousands of files in some situations…

Not that I would recommend losing any files at all, obviously…