Then I have no idea. Maybe containment mode isn’t used anymore.
Yes, so it is either broken or the implementation is different somehow or containment mode is no longer used for audits at all.
Could you please include only PUT requests? Or exclude PUT_REPAIR. I suspect that the PUT_REPAIR may ignore the containment mode…
I can see both PUT and PUT_REPAIR in the logs.
Here is the result with excluded PUT_REPAIR:
docker logs --since 2024-08-25T00:55:34Z --until 2024-08-25T06:55:53Z storagenode | grep "upload started" | grep -v "PUT_REPAIR" | grep 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S | wc -l
17917
OK, this is unexpected to me.
I will ask the team, but it looks like we have sunset containment mode.
Now with the implemented display of the reason I can see 2 types of errors and not just for the audits:
"reason": "downloaded size (0 bytes) does not match received message size (256 bytes)"
"reason": "context canceled"
I see far more downloaded size errors than context canceled. Same node, same log time span:
docker logs storagenode | grep "context canceled" | wc -l
16
docker logs storagenode | grep "match received" | wc -l
62461
It is always said that context canceled means slow disk. So this does not seem to be the case here unless the match errors mean that too. But I also see the match errors on other nodes. So what do they mean?
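To see which request types each reason belongs to, the two counts above can be broken down by action. This is only a minimal sketch: it assumes every relevant log line carries an "Action" and a "reason" field as in the excerpts in this thread, and `tally_cancel_reasons` is a hypothetical helper name, not part of the storagenode tooling.

```shell
# Tally log lines by action and reason.
# Assumption: lines contain "Action": "<action>" followed later by "reason": "<text>".
# Byte counts are replaced by N so that all size mismatches group together.
tally_cancel_reasons() {
  grep '"Action": "' | grep '"reason": "' |
    sed -E 's/.*"Action": "([A-Z_]+)".*"reason": "([^"]*)".*/\1 \2/' |
    sed -E 's/[0-9]+ bytes/N bytes/g' |
    sort | uniq -c | sort -rn
}

# Usage (2>&1 also captures log lines written to stderr):
#   docker logs storagenode 2>&1 | tally_cancel_reasons
```

Each output line is a count followed by the action and the normalized reason, most frequent first.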
But which one is which in the case where the size didn’t match?
Which one is the expected size and which one is the actual stored piece size from those 2?
And could this be related to the async mode recently introduced?
The error message is probably not the best ever. A more complete line is "Action": "GET", "Offset": 0, "Size": 10240, "Remote Address": "REDACTED:49270", "reason": "downloaded size (0 bytes) does not match received message size (10240 bytes)"}
From that I assume that 10240 is the size on the disk, or at least what is expected to be stored on the disk. I assume the downloaded size is how much data was actually transferred, and the received message size is what was expected to be received.
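One way to test that assumption is to put the three numbers side by side for every mismatch line and check whether the message size always equals the requested "Size". A rough sketch, relying only on the field layout of the line quoted above; `size_mismatch_fields` is a hypothetical name.

```shell
# Print requested size, downloaded bytes and message size for each mismatch line.
# Assumption: "Size": <n> appears before the reason text, as in the quoted log line.
size_mismatch_fields() {
  sed -nE 's/.*"Size": ([0-9]+),.*downloaded size \(([0-9]+) bytes\) does not match received message size \(([0-9]+) bytes\).*/requested=\1 downloaded=\2 message=\3/p'
}

# Usage:
#   docker logs storagenode 2>&1 | size_mismatch_fields | sort | uniq -c
```

If `requested` and `message` match on every line while `downloaded` stays at 0, that would support reading the message size as the expected amount and the downloaded size as what was actually transferred.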
Interesting question. I don’t know. Maybe?
Maybe some Storjling could look into this, because it can happen to everyone, but we haven’t searched for these things. And many have their log level set to error or fatal, so they have no such entries.
This would be very good because my audits are tanking. I don’t know how much longer this node is going to survive:
ap1: 96.26%
Are you seeing GET_AUDIT as failed or context canceled?
I cannot rule out the possibility that a node somewhere has a failed GET_AUDIT.
Mostly, however, it is download canceled with the reason size mismatch for the GET_AUDIT request as well as for the subsequent GET_REPAIR request. Sometimes I see a context canceled there.
An audit is smaller than 1KB, so having context canceled does not make sense.
Could you post the failed audit with context canceled?
Maybe it isn’t an audit then.
As audits are delegated to the repair workers after a first failed attempt, it could also be a regular repair request that failed. I would need to check all logs to see whether I have a GET_AUDIT request with a context canceled somewhere.
You are a veteran in Storj. I expect you to pinpoint the issue as usual rather than answer with a maybe
I hope you have read this before, or perhaps you have forgotten about it.
No, I am confused now.
I was under the impression that the audit worker requests a part of the piece. If it fails, the task of re-verifying gets delegated to the repair worker. It seems logical that this repair worker should request only a small part of the piece, like the audit worker does, and not the entire piece. So what you are saying, that a context canceled is unlikely for an audit because the requested size is so small, should then be true for such a repair request as well. I don’t know for sure, but that is my assumption.
And what I described before is a GET_AUDIT for a specific piece and, later, the expected subsequent GET_REPAIR request for the same piece ID. This repair request is the one that I saw with the error context canceled. I am assuming this is the delegated audit re-verification request.
However, what I don’t know is whether every single GET_REPAIR impacts the audit score or only the ones that have been delegated from the audit workers. This is where I am confused.
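To check how often a canceled GET_AUDIT is followed by a GET_REPAIR for the same piece, the log can be scanned once in order. This is a sketch under assumptions: it expects each line to contain a "Piece ID": "<id>" field (that field name is not shown in the excerpts in this thread) and the text download canceled on the canceled audit line; `audit_then_repair` is a hypothetical helper name.

```shell
# Print every GET_REPAIR line whose piece ID was seen earlier
# in a canceled GET_AUDIT line.
# Assumption: lines contain "Piece ID": "<id>" and "Action": "<action>".
audit_then_repair() {
  awk '
    match($0, /"Piece ID": "[^"]+"/) {
      # Skip the 13-character prefix and drop the closing quote.
      id = substr($0, RSTART + 13, RLENGTH - 14)
      if ($0 ~ /"Action": "GET_AUDIT"/ && $0 ~ /download canceled/)
        audited[id] = 1
      else if ($0 ~ /"Action": "GET_REPAIR"/ && (id in audited))
        print
    }
  '
}

# Usage:
#   docker logs storagenode 2>&1 | audit_then_repair
```

This only pairs the requests by piece ID. It cannot tell whether a given GET_REPAIR was a delegated audit or a regular repair, so it does not answer the audit score question on its own.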
I am aware of the threshold of 96%, if that was part of your question.
Did this GET_AUDIT show up as failed or context canceled, or did it succeed?
It shows download canceled with "reason": "downloaded size (0 bytes) does not match received message size (256 bytes)"
I wonder if you get this message for pieces too, not just audits?
Yes, not just audits.