I am seeing the following pattern too many times not to worry:
2024-07-30T07:40:28Z INFO piecestore download started GET_AUDIT
2024-07-30T07:40:52Z INFO piecestore download canceled GET_AUDIT
2024-07-30T14:54:22Z INFO piecestore download started GET_REPAIR
2024-07-30T14:55:08Z INFO piecestore download canceled GET_REPAIR
2024-07-30T21:40:04Z INFO piecestore download started GET_REPAIR
2024-07-30T21:40:32Z INFO piecestore download canceled GET_REPAIR
It tries the same piece several times, sometimes succeeding, sometimes not, and afterwards continues with a new piece for the next audit.
Cancelled audits look unusual to me; I cannot remember seeing that under normal circumstances.
Is it normal to see cancelled audits now? Maybe something has changed. And is it a normal pattern to see one GET_AUDIT for a piece followed by GET_REPAIR requests for the same piece, perhaps as a substitute for the audit? I am asking because it seems to affect the audit scores on several of my nodes.
Are cancelled audits the new normal?
And is the pattern of a GET_AUDIT followed later by one or more GET_REPAIR requests also a normal one?
Great, such an info message will be helpful, I guess.
Context canceled means a slow disk, correct? I think this is the most probable cause in my case, since many uploads are happening at the same time.
Now that audits can get canceled, what is the effect on the audit score, and why are we seeing this pattern where a GET_AUDIT request is followed by subsequent GET_REPAIR requests?
But looking at your log lines and their timestamps, not that much time elapses between them. Besides, that log line is only emitted when the context is directly canceled, so it excludes timeouts etc.
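Roughly, the distinction looks like this in Go (a sketch for illustration, not the actual storagenode code):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// logIfCanceled logs only when the context was explicitly canceled,
// not when it expired due to a deadline/timeout.
func logIfCanceled(err error) {
	if errors.Is(err, context.Canceled) {
		fmt.Println("INFO piecestore download canceled")
	} else if errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("timeout - would not produce the 'canceled' log line")
	}
}

func main() {
	// Explicit cancel: matches the first branch.
	ctx1, cancel1 := context.WithCancel(context.Background())
	cancel1()
	logIfCanceled(ctx1.Err())

	// Deadline exceeded: matches the second branch, not the first.
	ctx2, cancel2 := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel2()
	<-ctx2.Done()
	logIfCanceled(ctx2.Err())
}
```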
A canceled audit isn't a failed audit. It's delegated to the reverifier, which uses GET_REPAIR. The reverifier currently tries 3 times before marking the piece as "lost", and your node will be held accountable for it; however, we allow a small amount of bitrot.
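In sketch form, the behaviour described above looks roughly like this (reverifyPiece and maxReverifyAttempts are made-up names for illustration, not the satellite's actual code):

```go
package main

import (
	"errors"
	"fmt"
)

// maxReverifyAttempts illustrates the described policy: a canceled GET_AUDIT
// is handed to the reverifier, which retries via GET_REPAIR a few times
// before the piece is treated as lost.
const maxReverifyAttempts = 3

func reverifyPiece(fetch func() error) (lost bool) {
	for attempt := 1; attempt <= maxReverifyAttempts; attempt++ {
		err := fetch()
		if err == nil {
			return false // piece verified; the audit score is unaffected
		}
		fmt.Printf("reverify attempt %d failed: %v\n", attempt, err)
	}
	return true // only now counted against the node
}

func main() {
	// Simulate a node that only answers on the third GET_REPAIR attempt.
	calls := 0
	lost := reverifyPiece(func() error {
		calls++
		if calls < 3 {
			return errors.New("context canceled")
		}
		return nil
	})
	fmt.Println("piece lost:", lost)
}
```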
What makes this weird is that I see the "download canceled" error on many nodes, but there is one specific machine where it seems to impact the audit score, which is in a steady decline. I don't see this happening on other nodes. This is strange.
Instead, this new value of 150kB is still reasonably generous while limiting transfer time for a full-size piece to around 15s.
We shorten that time here as well, from 5m to 15s.
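For reference, here is the arithmetic behind that, assuming the 150 kB figure is a minimum transfer rate of 150 kB/s and a full-size piece of roughly 2.3 MB (both values are my assumptions, not stated above):

```go
package main

import "fmt"

func main() {
	// Assumed values: a "full-size" piece of roughly 2.3 MB and a
	// minimum transfer rate of 150 kB/s.
	const pieceSizeBytes = 2_300_000.0
	const minRateBytesPerSec = 150_000.0

	// Worst-case transfer time at the minimum rate: about 15.3 s,
	// which matches the "around 15s" mentioned above.
	fmt.Printf("max transfer time: %.1f s\n", pieceSizeBytes/minRateBytesPerSec)
}
```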
How does this align with the config setting storage2.monitor.verify-dir-readable-timeout?
15s does not seem to be sufficient when a node is experiencing high load.
They are independent and have different goals. The writability/readability checkers are used to detect hardware issues before it's too late.
150 kB / 15 s seems reasonable; usually disks should be able to read such a small amount of data even under high load (unless there is also a potential or ongoing hardware issue).
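Conceptually, the readability check is just a small timed read of the storage location, along the lines of the sketch below (not the storagenode's actual implementation; the timeout value here is arbitrary):

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"
)

// verifyDirReadable is an illustrative sketch of a readability check:
// list the storage directory and fail if that does not complete within
// the timeout. A slow or failing disk will stall the ReadDir call.
func verifyDirReadable(ctx context.Context, dir string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	done := make(chan error, 1)
	go func() {
		_, err := os.ReadDir(dir)
		done <- err
	}()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return fmt.Errorf("readability check timed out after %s: %w", timeout, ctx.Err())
	}
}

func main() {
	if err := verifyDirReadable(context.Background(), ".", 30*time.Second); err != nil {
		fmt.Println("readability check failed:", err)
	} else {
		fmt.Println("readability check passed")
	}
}
```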
I still did not get confirmation that a "context canceled" GET_AUDIT affects the audit score, given that there are also at least two retries for the same piece after a while.
I thought the audit was not about the time a node needs to provide the data but about whether it provides the correct data. With these stricter time settings you are mixing two different things together.
It had better not. If it does, then 15 seconds is not auditing, it is penalizing slow nodes that might be running a huge parallel rsync, a RAID rebuild, or something else on the same hardware that is busy. 15 seconds is way too strict.
Maybe there should be additional retries immediately following the first canceled attempt. Chances are that after 2 or 3 retries the file is in the cache and can be served quickly and successfully.
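Something along these lines is what I mean (retryRead and all values are purely illustrative, and the page-cache effect is an assumption, not a guarantee):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryRead retries the same read a few times with a short pause, on the
// assumption that the first (possibly canceled) attempt has already pulled
// the data into the page cache, so a later attempt can succeed quickly.
func retryRead(read func() error, attempts int, pause time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = read(); err == nil {
			return nil
		}
		time.Sleep(pause)
	}
	return err
}

func main() {
	calls := 0
	err := retryRead(func() error {
		calls++
		if calls < 2 {
			return errors.New("context canceled") // first attempt too slow
		}
		return nil // data now cached, read succeeds quickly
	}, 3, 100*time.Millisecond)
	fmt.Println("attempts:", calls, "err:", err)
}
```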