The score in the earnings calculator shows how close you are to disqualification. Which is very close. At 100% you’ll be disqualified. I suggest stopping your node while you debug, since you’re right on the edge of being disqualified.
If you have recently removed and recreated the container, your logs from before that are gone. I recommend redirecting the logs to a file to prevent losing vital information for debugging this issue. In the meantime, stop your node. Check all paths, check your identity. Make sure your databases are OK and check the file system.
Make sure you haven’t used the same identity on any of the new nodes or the same port. Either of those could lead to the wrong node receiving data. If you find a mistake like that, write down what went wrong and try to determine where data from this node might have ended up.
I can’t really think of anything else to check, better redirect those logs to make sure you actually have something to use for debugging in the future. We’re kind of diagnosing blind right now.
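One way to do that redirection, assuming the standard Docker setup where the config directory is mounted into the container at `/app/config`: the storagenode's `config.yaml` has a `log.output` option that can point at a file on the mounted volume, so logs survive container re-creation. (The exact path below is an example from the default Docker layout — adjust to your setup.)

```yaml
# in config.yaml: send node logs to a file on the mounted volume instead of stdout,
# so they are not lost when the container is removed and recreated
log.output: "/app/config/node.log"
```

After editing, restart the container for the setting to take effect.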
10 nodes… how much RAM does that thing have… i mean if memory serves… xD no pun intended… the RPi 4 comes in 1GB, 2GB, 4GB and 8GB variants… if you are running 10 nodes on a 1GB board it might be asking for trouble even if it isn’t what caused the current issue…
i would argue even on the 4GB version i wouldn’t feel too safe… yesterday, because of issues, my node reached 2.1GB RAM utilization… without that spare memory the node would have run into some sort of trouble.
Success Rate: 88.230% is usually a hint of high load on the drive(s) or the system
i think the lowest i’ve ever seen mine is 94-96%
should be close to the 99% range in most cases… maybe a few % off
might not seem like much but i would say that could easily be a pointer towards what is wrong
Well, if you go back to the thread title, I think the problem is due to past bad sectors (on a drive that is not currently in use)
The other 9 nodes don’t have the problem. (5 of these were created recently while testing create identity.)
All I keep going back to is that all audits are good and nothing is logged about fails.
Maybe there were no bad sectors after all and these audit scores are just corrupt at the satellite end
all HDDs have SMART data unless it’s turned off… that should record any issues and write errors; not sure how good the logging is, but at least for the most recent operation time there will be a log of issues…
it also keeps a count of stuff like write errors spanning the lifetime of the disk… not aware if one can reset that…
very good way to see if the disk has issues or has had issues…
smartctl is pretty good, ofc it’s not always clearly visible if there are issues… then you should ofc benchmark it; SMART also has some integrated self-tests one can use.
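As a sketch of what to look for, assuming `smartctl -A /dev/sda` output saved to a file — the attribute values below are made up for illustration, and the two attributes shown are the usual red flags (pending sectors hint at a failing disk, CRC errors often mean a bad cable):

```shell
# hypothetical excerpt of `smartctl -A /dev/sda` output, saved for parsing
cat > smart.txt <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3
EOF

# the raw value is the last column; nonzero pending sectors or CRC errors are worth chasing
pending=$(awk '$2 == "Current_Pending_Sector" {print $NF}' smart.txt)
crc=$(awk '$2 == "UDMA_CRC_Error_Count" {print $NF}' smart.txt)
echo "pending sectors: $pending, CRC errors: $crc"
```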
lots of stuff can cause bad data… it’s why server gear generally has redundancy on most stuff.
something as simple as a bad cable can sometimes give errors on a disk, even if it seems to run fine…
kevink recently had a file get corrupted, even though he is running a raidz1 and there was no reason why it should have happened… we ended up concluding that it was most likely a degradation of data in RAM before it was written to disk.
it just goes to show how little it takes for an error to slip through; 300 bad files seems high though…
since you say scrub… are you using zfs?
and how much ram does the RPI have?
and did replacing the hdd fix the issue?
i doubt it’s satellite related… i mean, i don’t think i’ve ever seen a bad file or failed audit unless it was a local hardware / configuration / user fault…
An “unsafe shutdown” error (something like that) was recorded in SMART.
Yes, ECC RAM is normal for Sun servers, which is where Solaris and ZFS came from.
300 was an estimate; there were a few pages of errors when I copied the files to another disk.
No, I don’t use ZFS. I imagine the scrub would consist of trying to read all the blobs and checking the checksums that were sent with the data during ingress.
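The read-everything part of such a scrub could be sketched like this (this is not the node's actual mechanism, and the blobs directory and sample file here are made up so the sketch is self-contained; verifying the ingress checksums would additionally require the per-piece hash, which this skips):

```shell
# toy blobs directory so the sketch can run anywhere
BLOBS=./blobs-example
mkdir -p "$BLOBS"
echo "piece data" > "$BLOBS/example.sj1"

# try to read every blob end to end and report files that fail
errors=0
for f in "$BLOBS"/*; do
  if ! dd if="$f" of=/dev/null bs=1M 2>/dev/null; then
    echo "READ ERROR: $f"
    errors=$((errors + 1))
  fi
done
echo "scrub done, $errors read errors"
```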
This RPi4 has 4GB, but others are available with 1GB, 2GB, 4GB, or 8GB.
The current disk does not have unsafe shutdown errors, and I don’t think that attribute is even applicable to the new disk, but the storagenode continues to get bad audit scores.
I don’t see how it can be the satellite at fault but perhaps, after the audit data is received, the satellite checks the data and finds that it is bad but does not inform the storagenode.
The cryptography guarantees that the hash matches the piece; the order is signed by the uplink and the storagenode, otherwise the order will be rejected after submission and will not be counted as valid.
The only possible situation is that the storagenode corrupted this piece, and thus the hash is wrong.
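The principle can be illustrated with an ordinary hash (Storj's actual piece-hash scheme differs in detail; this is just the idea that any local corruption breaks the match):

```shell
# the uplink sends a hash along with the piece; a corrupted piece no longer matches it
printf 'piece data' > piece.bin
expected=$(sha256sum piece.bin | cut -d' ' -f1)

printf 'X' >> piece.bin   # simulate a stray byte of corruption on the node's disk
actual=$(sha256sum piece.bin | cut -d' ' -f1)

if [ "$actual" = "$expected" ]; then
  echo "hash ok"
else
  echo "hash mismatch: audit would fail"
fi
```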
Please provide the errors from the log for the failed audits.
Please count started audit downloads and finished audits.
If you find audits that were started but never finished, those will also count as audit failures after 3 retries.
That would mean your node is failing audits because of continuous timeouts. It could be the disk (more likely) or your internet connection (rare, because it would be failing exactly on audits).
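That count can be done by grepping for `GET_AUDIT` actions in the node log — the lines below are simplified stand-ins for the storagenode's "download started" / "downloaded" log messages:

```shell
# simplified stand-in for a storagenode log
cat > node.log <<'EOF'
2023-01-01T00:00:00Z INFO piecestore download started {"Piece ID": "aaa", "Action": "GET_AUDIT"}
2023-01-01T00:00:01Z INFO piecestore downloaded {"Piece ID": "aaa", "Action": "GET_AUDIT"}
2023-01-01T00:00:02Z INFO piecestore download started {"Piece ID": "bbb", "Action": "GET_AUDIT"}
EOF

started=$(grep -c 'download started.*GET_AUDIT' node.log)
finished=$(grep -c 'downloaded.*GET_AUDIT' node.log)
# a gap between the two numbers points at audits that timed out mid-transfer
echo "audits started: $started, finished: $finished"
```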
I happen to have a recent node which lost as much as 5% of its files, because of a dying disk (or dying enclosure, not sure yet, tests are on-going).
Switched everything to a new disk, but rsync could copy only 95% of the files (that is roughly 40000 files out of ~42100 files - yes, more than 2000 files lost!).
(also, the config.yaml was only binary gibberish and I lost half of the databases after the copy was “complete”… Luckily these are recoverable)
3 satellites did notice something was off, as its Audit score went down to 80% for a couple of days, but it seems it is slowly going back up now, as it’s between 88% and 93% on those satellites. Edit: In fact one of them went back down a bit recently: to 85%. And all audit failures I see since the “rsync” operation are “file does not exist”, as expected…
I’m surprised this node is surviving this incident for the moment. We’ll see how it goes…
Once a file audit fails once, I guess the satellite is never going to audit this file again and it registers the fact that this node doesn’t have it anymore, right? If that’s the case, I guess the Node will have less and less chance to fail future audits, and it should be okay.
Negative. The same file can get audited multiple times. The chance to hit the same file again is almost 0 for your storage node. You are simply holding too many pieces to hit it again.
What is slowly increasing your audit rate is the repair job. The audit system notices the lost pieces, but we can tolerate it. We trigger repair at 35 (currently 52) because we expect a low error rate, exactly for the reasons you have mentioned. The repair job will notice that you can’t deliver the piece and, while it is at it, it will replace that piece. That way the missing pieces get replaced over time, and that also means fewer audit errors. As soon as the repair job has touched all pieces you should get back to a 100% audit rate.
So, what we’re saying here is that as long as a “segment” (made of 80 pieces if I’m not mistaken) does not fall under the repair threshold (52 pieces currently), the satellite is just “remembering” any audit errors as unhealthy/missing pieces, but may still audit them (in case they are available again on Nodes’ side, who knows).
On the other hand, once a segment falls below the repair threshold, then all missing pieces get repaired elsewhere and from that point on, initially missing or corrupted pieces will not get audited anymore as they’ve been repaired on other nodes; and as a side effect any old corrupted pieces would then get removed from the nodes eventually thanks to the garbage collection.
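To put numbers on the thresholds discussed above — the 80 pieces per segment and the repair trigger of 52 are from this thread, while the minimum of 29 pieces needed to reconstruct a segment is my assumption about the default Reed-Solomon parameters:

```shell
total=80      # pieces per segment (from the thread)
repair_at=52  # current repair threshold (from the thread)
minimum=29    # pieces needed to reconstruct a segment (assumed RS parameter)

# a segment tolerates this many lost pieces before repair kicks in...
echo "losable before repair triggers: $((total - repair_at))"
# ...and repairing that early still leaves this safety margin above the minimum
echo "margin kept by repairing early: $((repair_at - minimum))"
```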