Cancelled audits

Alexey · August 25, 2024, 7:44am

Then I have no idea. Maybe containment mode isn’t used anymore.

jammerdan · August 25, 2024, 7:46am

Yes, so it is either broken or the implementation is different somehow or containment mode is no longer used for audits at all.

Alexey · August 25, 2024, 7:46am

Could you please include only PUT requests? Or exclude PUT_REPAIR. I suspect that the PUT_REPAIR may ignore the containment mode…

jammerdan · August 25, 2024, 7:52am

I can see both PUT and PUT_REPAIR in the logs.
Here is the result with PUT_REPAIR excluded:

docker logs --since 2024-08-25T00:55:34Z --until 2024-08-25T06:55:53Z storagenode | grep "upload started" | grep -v "PUT_REPAIR" | grep 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S | wc -l
17917

Alexey · August 25, 2024, 10:26am

OK, this is unexpected to me.
I will ask the team, but it looks like we have sunset containment mode.

jammerdan · October 11, 2024, 3:50am

Now that the reason is displayed in the logs, I can see two types of errors, and not just for audits:

"reason": "downloaded size (0 bytes) does not match received message size (256 bytes)"
"reason": "context canceled"

I see far more downloaded size errors than context canceled. Same node, same log time span:

docker logs storagenode | grep "context canceled" | wc -l
16
docker logs storagenode | grep "match received" | wc -l
62461

The usual explanation is that context canceled means a slow disk. That does not seem to be the case here, unless the size mismatch errors mean the same thing. But I also see the mismatch errors on other nodes. So what do they mean?
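The two separate counts above could also be produced in a single pass. This is only a sketch: `tally_reasons` is a hypothetical helper name, and it assumes every relevant log line carries a `"reason": "..."` field in the form quoted above. Byte counts are replaced with `N` so that mismatches of different sizes are grouped together.

```shell
# Hypothetical helper: tally the "reason" values found in log lines on stdin.
# Assumes the field looks like  "reason": "some text"  as in the lines quoted above.
tally_reasons() {
  grep -o '"reason": "[^"]*"' \
    | sed -E 's/\([0-9]+ bytes\)/(N bytes)/g' \
    | sort | uniq -c | sort -rn
}
```

It would be used as `docker logs storagenode 2>&1 | tally_reasons`. The `2>&1` is there in case the node writes its log to stderr.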

snorkel · October 11, 2024, 4:55am

But which one is which in the case where the size didn’t match?
Of those two, which one is the expected size and which one is the actual stored piece size?
And could this be related to the async mode recently introduced?

jammerdan · October 11, 2024, 5:02am

The error message is probably not the best ever. A more complete line is:

"Action": "GET", "Offset": 0, "Size": 10240, "Remote Address": "REDACTED:49270", "reason": "downloaded size (0 bytes) does not match received message size (10240 bytes)"}

From that I assume that 10240 is the size on the disk, or at least what is expected to be stored on the disk. I assume downloaded size is the amount of data that was actually transferred, and received message size is the amount that was expected to be received.

Interesting question. I don’t know. Maybe?
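One way to test the assumption about the sizes is to put the three numbers side by side for every mismatch line. The sketch below uses a hypothetical helper, `summarize_mismatch`, and assumes the field order of the line quoted above ("Action" before "Size" before "reason"). Lines in any other layout are silently skipped.

```shell
# Hypothetical helper: for each size-mismatch line on stdin, print the action,
# the "Size" field, the downloaded size and the received message size.
# Assumes the field order of the quoted log line; other lines produce no output.
summarize_mismatch() {
  sed -nE 's/.*"Action": "([A-Z_]+)".*"Size": ([0-9]+).*downloaded size \(([0-9]+) bytes\) does not match received message size \(([0-9]+) bytes\).*/\1 size=\2 downloaded=\3 message=\4/p'
}
```

Piping `docker logs storagenode 2>&1 | summarize_mismatch | sort | uniq -c | sort -rn` would then show whether "Size" always equals the received message size, and which actions are affected most.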

snorkel · October 11, 2024, 5:17am

Maybe some Storjling could look into this, because it can happen to everyone; we just haven’t been looking for these things. And many operators have their log level set to error or fatal, so they would see no such entries.

jammerdan · October 11, 2024, 5:37am

This would be very good because my audits are tanking. I don’t know how much longer this node is going to survive:

ap1: 96.26%

nerdatwork · October 11, 2024, 5:38am

Are you seeing GET_AUDIT as failed or context canceled?

jammerdan · October 11, 2024, 5:47am

I cannot rule out the possibility that a node somewhere has a failed GET_AUDIT.
Mostly, however, it is download canceled with the reason size mismatch for the GET_AUDIT request as well as for the subsequent GET_REPAIR request. Sometimes I see a context canceled there.

nerdatwork · October 11, 2024, 6:05am

An audit is smaller than 1KB so having context canceled does not make sense.

Could you post the failed audit with context canceled?

jammerdan · October 11, 2024, 6:09am

Maybe it isn’t an audit then.
As audits are delegated to the repair workers after a first failed attempt, it could also be a regular repair request that failed. I would need to check all logs to see whether I have a GET_AUDIT request with a context canceled somewhere.

nerdatwork · October 11, 2024, 6:14am

You are a veteran in Storj. I expect you to pinpoint the issue as usual rather than give a maybe.

I hope you have read this before, or perhaps you have forgotten about it.

jammerdan · October 11, 2024, 6:36am

No, I am confused now.

I was under the impression that the audit worker requests a part of the piece. If that fails, the task of re-verifying gets delegated to the repair worker. It seems logical that this repair worker should request only a small part of the piece, like the audit worker does, and not the entire piece. So your point that a context canceled is unlikely for an audit because the requested size is so small should then hold for such a repair request as well. I don’t know for sure, but that is what I am assuming.
What I described before is a GET_AUDIT for a specific piece, followed later by the expected GET_REPAIR request for the same piece ID. This repair request is the one that I saw with the error context canceled. I am assuming this is the delegated audit re-verification request.

However, what I don’t know is whether every single GET_REPAIR impacts the audit score or only the ones that have been delegated from the audit workers. This is where I am confused.

I am aware of the threshold of 96%, if that was part of your question.
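The pattern of a canceled GET_AUDIT followed by a GET_REPAIR for the same piece could be searched for across a whole log with something like the sketch below. `audit_then_repair` is a hypothetical helper. It assumes that each entry carries a `"Piece ID": "..."` field and that unsuccessful downloads are logged with the text "download canceled" or "download failed". A match only shows that both requests hit the same piece. It does not prove that the repair request was the delegated re-verification.

```shell
# Hypothetical helper: print the piece ID of every unsuccessful GET_REPAIR that
# comes after an unsuccessful GET_AUDIT for the same piece (log lines on stdin).
# Assumes a "Piece ID": "..." field and the messages "download canceled" / "download failed".
audit_then_repair() {
  awk '
    /download canceled|download failed/ {
      if (!match($0, /"Piece ID": "[^"]+"/)) next
      # Skip the 13 characters of the field name prefix and drop the closing quote.
      id = substr($0, RSTART + 13, RLENGTH - 14)
      if ($0 ~ /"Action": "GET_AUDIT"/) audit[id] = 1
      else if ($0 ~ /"Action": "GET_REPAIR"/ && (id in audit)) print id
    }
  '
}
```

Running `docker logs storagenode 2>&1 | audit_then_repair | sort -u` would list the affected pieces, which can then be grepped individually to compare the reasons on both requests.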

nerdatwork · October 11, 2024, 6:47am

Did this GET_AUDIT show up as failed or context canceled, or did it succeed?

jammerdan · October 11, 2024, 6:50am

It shows download canceled with "reason": "downloaded size (0 bytes) does not match received message size (256 bytes)"

snorkel · October 11, 2024, 12:49pm

I wonder if you get this message for pieces too, not just audits?

jammerdan · October 11, 2024, 12:51pm

Yes, not just audits.