Audit weirdness (us2 satellite). (Team aware. No additional information is needed)

SGC · July 20, 2021, 10:47pm

My 16 - 17 month old node failed an audit.
i don’t believe i lost a file, however it is an old node and has been migrated a fair number of times … maybe 7 … so it’s not totally impossible that i could have lost a file… and have like most other long term SNO’s ofc had various issues over my time running the node… so i suppose that could have cause a snag along the way.

that being said… the numbers i’m seeing in the storagenode dashboard doesn’t quite corollate to my logs… i’m seeing a 95% audit failure on the dashboard.

however when i run a ./successrate.sh over the last 3 days, i’m getting a 1 audit failed of 7000+
even if this was over a 1 day period 95% is many multiples above what happened…
splitting the audits over 30 days it would be 200 audits per day and thus failing 1 would still only be like 0.5%

I started a scrub to check my data integrity of my pool, which are running on a zfs with a raidz1 with redundancy and atleast on the 12th this month didn’t show any signs of data corruption.
so should know sometime tomorrow if the pool is fine… but i’m pretty sure it is… if there was a file missing zfs should have complained about it corruption…

i doubt this will escalate, most likely i assume this to vanish and recover…

but with brightsilence’s new audit system possibly being introduced, and it supposely failing nodes in short order for more than 2% audit failures… my dashboard showing a 5% failure rate is unacceptable as i’m very sure i haven’t lost files nor are any corruption or such involved.

anyone else seeing anything like this… seems very weird to me… also i just updated to the new version less than 24 hours ago… so i really hope it’s not an issue with the new version…
since i did feel jumping onboard only 12 hours in was a bit rash.

there is ofc the off chance that its just my old node actually missing a file… but its not that likely either… pretty sure all my migrations have been flawless.

hope this is not a network issue, because then it could be rather serious.

5% failure in audits on the dashboard for 1 failed audit in the logs over 3 days
seems excessive…
and this hasn’t show up before… this if the first time i see the audits successes drop below 100% ever… last checked it like 24-48 hours ago.

peem · July 20, 2021, 11:28pm

I confirm, my friend SNO asked me what was going on, the same satellite

SGC · July 20, 2021, 11:43pm

@deathlessdd also confirmed over discord seeing about the same issue with atleast one of his nodes.

calling in the reinforcements.
this might be pretty important…
@littleskunk @Alexey @BrightSilence

i duno whats going on… but doesn’t look good…
looks around for the emergency button

my node does seem to already recovering a bit.
within less than an hour of noticing it… but like i said… i doubt i lost a file…
seems to happen just after i updated, whatever it was…

BrightSilence · July 21, 2021, 6:22am

That’s not how that number works. Which is why the first time they started showing this score as a percentage I immediately pointed out that that percentage sign has no right to be there. It isn’t a percentage. It’s a score. The way that score works atm always drops it to 0.95 on the first audit failure and then slowly recovers on subsequent successes. You should also keep in mind that us2 is the smallest satellite, so a failure there would take much longer to recover from and a very small number of those 7000 audits is likely from that satellite.

That 2% actually does point to an actual percentage. But more importantly a sustained percentage of at least 2% failures. It would still take a lot of failures to disqualify your node. In fact, it would take slightly longer to disqualify a node with my suggested changes due to the weight to remember the old score having been increased. You shouldn’t worry too much about those changes. If you just have an incidental audit failure like this those changes will help you more than they hurt.

Well, instead of just looking at the success rate, find out what error was logged. That should tell you whether you are missing a file. And if you are, look through the logs for that piece id to find out what happened with it. Now we’re just speculating based on scores. While the actual reason is in the logs.

Alexey · July 21, 2021, 6:40am

We investigate an issue, thank you for sharing it!
I have two nodes are affected too - the oldest one and the youngest one. All on US2 satellite.
For me it looks like a bug.

BrightSilence · July 21, 2021, 7:10am

Good to know, I’ll keep an eye on it, but so far all my nodes look good. So was this an audit of a deleted or expired piece or something? Is there anything you can share about the issue?

Alexey · July 21, 2021, 7:23am

I think it was an audit of expired piece.

Now I’m sure:

$ unzip -p /mnt/x/storagenode3/node.log-2021-03-28-06-52.zip | grep OJDV2HA2UL3EGAYZRCTEF4EPNFWNJKTW57DSCB4PIQZKTEXSFBDQ
2021-03-16T20:33:48.951+0300    INFO    piecestore      upload started  {"Piece ID": "OJDV2HA2UL3EGAYZRCTEF4EPNFWNJKTW57DSCB4PIQZKTEXSFBDQ", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Action": "PUT", "Available Space": 83817351680}
2021-03-16T20:34:19.934+0300    INFO    piecestore      uploaded        {"Piece ID": "OJDV2HA2UL3EGAYZRCTEF4EPNFWNJKTW57DSCB4PIQZKTEXSFBDQ", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Action": "PUT", "Size": 2319360}
2021-03-17T22:18:59.546+0300    INFO    collector       delete expired  {"Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Piece ID": "OJDV2HA2UL3EGAYZRCTEF4EPNFWNJKTW57DSCB4PIQZKTEXSFBDQ"}

$ grep OJDV2HA2UL3EGAYZRCTEF4EPNFWNJKTW57DSCB4PIQZKTEXSFBDQ /mnt/x/storagenode3/node.log
2021-07-21T04:40:35.173+0300    INFO    piecestore      download started        {"Piece ID": "OJDV2HA2UL3EGAYZRCTEF4EPNFWNJKTW57DSCB4PIQZKTEXSFBDQ", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Action": "GET_AUDIT"}
2021-07-21T04:40:35.184+0300    ERROR   piecestore      download failed {"Piece ID": "OJDV2HA2UL3EGAYZRCTEF4EPNFWNJKTW57DSCB4PIQZKTEXSFBDQ", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Action": "GET_AUDIT", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:73\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:534\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:217\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:102\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:95\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

Arkina · July 21, 2021, 7:57am

Jepp i saw that too the last days, only on US2 (dropped to 99.2), but i am back to 100% again, so i didn’t cared about it that much.

Just checked, another node had ist too on US2, dropped to 96.4% and is recovering now.

SGC · July 21, 2021, 8:09am

okay so im not in danger of DQ, this audit score is weird tho… jumping that much from so little.
hasn’t been 12 hours since i noticed anything was up and it already moved over 3%…
that i know off…

thanks this was why i didn’t really bother looking much at my logs, it seemed endemic, and it atleast for me happened in the log from the day of running the new update.

nope 4% didn’t even get around to finishing up my post before it changed again… seems rather erratic… imo

i mean this was one file… how would it look with more…

gets me kinda worried and i know my files are safe… and the system is running without issues.
dropping 5% for one file being missing, doesn’t really tell the story of that this is not a real issue but something one maybe should dedicate some time to looking at in the near future…

5% or more in one step, is sort of an OMG ITS GOING TO DIE IF YOU DON’T IMMEDIATELY TAKE ACTION TO CORRECT THE ISSUES… kinda step

i mean online score being the reference most people have, it is like 60% before suspension / failure essentially… so applying that logic to the audit score, then 5% drop is 1/8 th of what the system can handle before failure… (DQ)

in the case of my storagenode… it’s not even a blip on the radar… i got 15.3TB one file being lost is nothing… sure my audit score should move… but maybe say 99.9% or whatever the near minimum would be at… so that people can see a node is failing audits, but not actually in danger of failing…

and so that one can see it in a simple look at the number, which is what it is there for …
ease of monitoring…

BrightSilence · July 21, 2021, 10:04am

That was the entire point of my suggested changes. It currently jumps all over the place with the slightest fluctuation. It also recovers way too fast. My suggested changes make it drop more slowly and recover MUCH more slowly. The result is that incidental failures don’t make as much of a dent, but it will be less forgiving with sustained issues.

If you’re worried, you SHOULD have a look at your logs. It will be trivial to find the piece that was missing and see that it was deleted because of expiration like @Alexey suggested.

Well, that’s because it IS a real issue. Normal nodes shouldn’t have any audit failures. It just so happens to most likely be an issue that isn’t your fault. I’m sure there will be a solution for this soon and I doubt it actually DQ’ed any nodes if it’s this incidental. The only way this could DQ anyone is if it just so happens to be the last straw that tipped an already failing node over the edge.

The lower the score is, the smaller the drop will be on failed audits and the bigger the rise will be on successful ones. It’s non-linear. Trust me, you are REALLY far from DQ.

You’re still confusing percentages and scores. And I don’t blame you, there is a percentage sign right there. But it doesn’t represent a percentage of failure. If it did, it would be insane to allow 40% of audit failure and only disqualify at 60%. So stop thinking of it as a percentage and start thinking of it as a score on a test. You need a score of 0.6+ to not fail the test. And that corresponds to the highest failure percentage that is still acceptable to be a node operator (in the future around 2% probably). You scored a 0.95 on that test. So you’re in good shape.

SGC · July 21, 2021, 12:29pm

trying to argue that maybe it should be simplified, so people can understand how to read it without needing a lot of knowledge on the subject.

true, but i think there should be a uniform approach to how to read the numbers, just for simplicity sake…

also wasn’t aware your new system wasn’t implemented yet, hopefully we will see that soon, and it will all go smoothly

BrightSilence · July 21, 2021, 12:56pm

Looks like I actually had a similar audit failure. There goes my clean record…

2021-03-16T17:36:26.016Z       INFO    piecestore      upload started  {"Piece ID": "5TIL6XHVZDDBWQ34UHC72GQNO56BC4ZBS7EKTMLGAAE5X4ZV5EWA", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Action": "PUT", "Available Space": 7858822696960}
2021-03-16T17:36:55.034Z       INFO    piecestore      uploaded        {"Piece ID": "5TIL6XHVZDDBWQ34UHC72GQNO56BC4ZBS7EKTMLGAAE5X4ZV5EWA", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Action": "PUT", "Size": 2319360}
2021-03-17T18:55:02.214Z       DEBUG   pieces  deleted piece   {"Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Piece ID": "5TIL6XHVZDDBWQ34UHC72GQNO56BC4ZBS7EKTMLGAAE5X4ZV5EWA"}
2021-03-17T18:55:02.214Z       INFO    collector       delete expired  {"Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Piece ID": "5TIL6XHVZDDBWQ34UHC72GQNO56BC4ZBS7EKTMLGAAE5X4ZV5EWA"}
2021-07-13T14:33:06.479Z       INFO    piecestore      download started        {"Piece ID": "5TIL6XHVZDDBWQ34UHC72GQNO56BC4ZBS7EKTMLGAAE5X4ZV5EWA", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Action": "GET_AUDIT"}
2021-07-13T14:33:06.486Z       ERROR   piecestore      download failed {"Piece ID": "5TIL6XHVZDDBWQ34UHC72GQNO56BC4ZBS7EKTMLGAAE5X4ZV5EWA", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Action": "GET_AUDIT", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:74\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:508\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:217\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

Indeed as @Alexey mentioned it was an expired piece. Surprisingly, it was a LOOONG expired piece. Good thing I still have all my logs.

The failure happened a while ago and I guess I didn’t spot it at the time. My score already shows 100% again in the dashboard. But I noticed the API said "auditScore":0.9999995.

I wonder how this happened though, it must be something specific to this satellite or else it would have happened much more often. It is a testing satellite after all.

hewicker · July 21, 2021, 1:24pm

Same for my node. I was also able to retrace the audit failure to an expired piece in my logs. I am at 99.5% audit score right now. Let’s hope this was a one time occurrence.

SGC · July 21, 2021, 2:41pm

if i was to hazard a guess… it almost looks like a test i mean why just one…
whatever happened here might be used maliciously… if one could DQ nodes for data that is long deleted.

also the error seems to still be permeating through the network… got another node that has just failed an audit since this morning… might look like it will end up going through all nodes beyond a certain age…

or i got two, but the one is the one i accidentally ran a rsync --delete on after running it for a minut… so might not be relevant.

Ted · July 21, 2021, 10:37pm

Exactly same happened to me. I figured its the sattelite. Because i just installed a brand new HDD and transferred with 3 robocopies with /MIR the contents.

LinuxNet · July 23, 2021, 6:14am

same here yesterday for us1 and ap1

But is growing again.

SGC · July 23, 2021, 8:27am

whatever the causes, this is bad… very bad…
seems to affect nodes around the 7-8 month mark the worst…
atleast with never having a failed audit before then, my nodes are pretty well positioned to survive.
this might get worse before it gets better