Cancelled audits

I am seeing the following pattern too many times not to worry:

2024-07-30T07:40:28Z    INFO    piecestore      download started GET_AUDIT
2024-07-30T07:40:52Z    INFO    piecestore      download canceled GET_AUDIT
2024-07-30T14:54:22Z    INFO    piecestore      download started GET_REPAIR
2024-07-30T14:55:08Z    INFO    piecestore      download canceled GET_REPAIR
2024-07-30T21:40:04Z    INFO    piecestore      download started GET_REPAIR
2024-07-30T21:40:32Z    INFO    piecestore      download canceled GET_REPAIR

It tries the same piece several times, sometimes succeeding, sometimes not, and afterwards continues with a new piece for the next audit.

Cancelled audits look unusual to me; I cannot remember seeing them under normal circumstances.
Is it normal to see cancelled audits now? Maybe something has changed. And is it normal to see one GET_AUDIT for a piece, followed by GET_REPAIR requests for the same piece, maybe as a substitute for the audit?


@Alexey: Is there anything you can tell regarding this?

It seems to affect audit scores on several of my nodes.
Are cancelled audits the new normal?
And is the pattern of a GET_AUDIT followed later by one or more GET_REPAIR requests also a normal one?

Could you please post the complete log lines for the same piece?

Here you go. I have removed the remote address IP. It’s all the same piece.

2024-07-30T07:40:28Z    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "V7MROJLCHWRQMWIYPJXZ26YBCCD2SZ2MKMTHIOMBUFA4TQGD2DEQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_AUDIT", "Offset": 8448, "Size": 256, "Remote Address": "REDACTED"}
2024-07-30T07:40:52Z    INFO    piecestore      download canceled       {"Process": "storagenode", "Piece ID": "V7MROJLCHWRQMWIYPJXZ26YBCCD2SZ2MKMTHIOMBUFA4TQGD2DEQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_AUDIT", "Offset": 8448, "Size": 256, "Remote Address": "REDACTED"}
2024-07-30T14:54:22Z    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "V7MROJLCHWRQMWIYPJXZ26YBCCD2SZ2MKMTHIOMBUFA4TQGD2DEQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "Offset": 0, "Size": 181504, "Remote Address": "REDACTED"}
2024-07-30T14:55:08Z    INFO    piecestore      download canceled       {"Process": "storagenode", "Piece ID": "V7MROJLCHWRQMWIYPJXZ26YBCCD2SZ2MKMTHIOMBUFA4TQGD2DEQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "Offset": 0, "Size": 181504, "Remote Address": "REDACTED"}
2024-07-30T21:40:04Z    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "V7MROJLCHWRQMWIYPJXZ26YBCCD2SZ2MKMTHIOMBUFA4TQGD2DEQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "Offset": 0, "Size": 181504, "Remote Address": "REDACTED"}
2024-07-30T21:40:32Z    INFO    piecestore      download canceled       {"Process": "storagenode", "Piece ID": "V7MROJLCHWRQMWIYPJXZ26YBCCD2SZ2MKMTHIOMBUFA4TQGD2DEQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "Offset": 0, "Size": 181504, "Remote Address": "REDACTED"}

The reason is one of the following:

  • Context canceled
  • Stream closed by peer (i.e. Auditor)
  • Downloaded size and received size don’t match

Which one corresponds to your node? I don’t know.

The log message comes from storj/storagenode/piecestore/endpoint.go at a5e2ec3e711db5db7dcd32624b07aec92f7f03ba · storj/storj · GitHub

I added that information in this commit https://review.dev.storj.io/c/storj/storj/+/14032
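
For illustration only, here is a minimal Go sketch of how a download handler could tell those three reasons apart; the function and parameter names are hypothetical, and this is not the actual storagenode endpoint code:

    package main

    import (
    	"context"
    	"errors"
    	"fmt"
    	"io"
    )

    // cancelReason is a hypothetical helper mapping the state of a canceled
    // download to one of the three reasons listed above.
    func cancelReason(ctx context.Context, streamErr error, sent, requested int64) string {
    	switch {
    	case errors.Is(ctx.Err(), context.Canceled):
    		return "context canceled"
    	case errors.Is(streamErr, io.EOF):
    		return "stream closed by peer (the auditor hung up)"
    	case sent != requested:
    		return fmt.Sprintf("downloaded size %d and sent size %d do not match", requested, sent)
    	default:
    		return "unknown"
    	}
    }

    func main() {
    	ctx, cancel := context.WithCancel(context.Background())
    	cancel() // simulate the peer canceling mid-transfer
    	fmt.Println(cancelReason(ctx, nil, 256, 256)) // prints "context canceled"
    }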


Great, such an info message will be helpful, I guess.
Context canceled means a slow disk, correct? I think this is the most probable cause in my case, when many uploads are happening at the same time.
Now that audits can get canceled, what is the effect on the audit score, and why are we seeing the pattern that a GET_AUDIT request is followed by subsequent GET_REPAIR requests?

Is this what @Alexey has suggested?

Which would mean that after the second canceled GET_REPAIR request an audit would be considered failed?

The context cancellation could come from either the storagenode or the satellite side.

For this particular function, and looking only directly at it, there is a timeout on the storagenode side (storj/storagenode/piecestore/endpoint.go at 052d7694ad0c61b55050118b955e1597ff4f2cf4 · storj/storj · GitHub) with a default configuration of 30 minutes.

But looking at your log lines and their timestamps, nowhere near that much time elapses. Besides, the log line conditionally checks whether the context was directly canceled, so it excludes timeouts, etc.
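
As a side note, here is a small, self-contained Go example (not the storj code) of the distinction that check relies on: a directly canceled context versus one whose deadline expired. The timeout is shortened here so the example runs quickly; the real default mentioned above is 30 minutes.

    package main

    import (
    	"context"
    	"errors"
    	"fmt"
    	"time"
    )

    // describe reports how a request's context ended.
    func describe(ctx context.Context) string {
    	switch {
    	case errors.Is(ctx.Err(), context.Canceled):
    		// only this case matches the "download canceled" path described above
    		return "directly canceled"
    	case errors.Is(ctx.Err(), context.DeadlineExceeded):
    		// a timeout looks different to the code and is excluded
    		return "deadline exceeded (timeout)"
    	default:
    		return "still running"
    	}
    }

    func main() {
    	// Direct cancellation, e.g. the remote side giving up on the request.
    	canceledCtx, cancel := context.WithCancel(context.Background())
    	cancel()
    	fmt.Println(describe(canceledCtx)) // directly canceled

    	// Timeout, shortened from the 30-minute default so the example finishes quickly.
    	timeoutCtx, stop := context.WithTimeout(context.Background(), 10*time.Millisecond)
    	defer stop()
    	time.Sleep(20 * time.Millisecond)
    	fmt.Println(describe(timeoutCtx)) // deadline exceeded (timeout)
    }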

Now that audits can get canceled, what is the effect on the audit score, and why are we seeing the pattern that a GET_AUDIT request is followed by subsequent GET_REPAIR requests?

A canceled audit isn’t a failed audit. It’s delegated to the reverifier, which uses GET_REPAIR. The reverifier currently tries 3 times before marking the piece as “lost”, and your node will be held accountable for it; however, we allow a small amount of bitrot.
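
A hypothetical sketch of that flow, with illustrative names only; the real satellite-side reverifier is considerably more involved:

    package main

    import (
    	"errors"
    	"fmt"
    )

    // maxReverifyAttempts mirrors the "tries 3 times" behaviour described above.
    const maxReverifyAttempts = 3

    // reverify is an illustrative stand-in for the satellite-side reverifier:
    // after a canceled GET_AUDIT it retries the piece via GET_REPAIR, and only
    // after all attempts fail is the piece counted as lost for the node.
    func reverify(fetchPiece func() error) (lost bool) {
    	for attempt := 1; attempt <= maxReverifyAttempts; attempt++ {
    		if err := fetchPiece(); err == nil {
    			return false // piece delivered, the audit is not failed
    		}
    	}
    	return true // counted against the node (subject to the small bitrot allowance)
    }

    func main() {
    	alwaysCanceled := func() error { return errors.New("download canceled") }
    	fmt.Println("piece lost:", reverify(alwaysCanceled)) // piece lost: true
    }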


It’s not new in v1.109, so moved here.

What makes this weird is that I see the download canceled error on many nodes. But there is one specific machine where it seems to impact the audit score, which is on a steady decline. I don’t see this happening on other nodes. This is strange.

How long does it take between GET_AUDIT “started” for a piece and GET_AUDIT “canceled” for that piece?

It differs. I took 3 random examples:

2024-08-17T13:32:34Z    INFO    piecestore      download started GET_AUDIT
2024-08-17T13:33:23Z    INFO    piecestore      download canceled GET_AUDIT
2024-08-18T01:42:57Z    INFO    piecestore      download started GET_AUDIT
2024-08-18T01:43:16Z    INFO    piecestore      download canceled GET_AUDIT
2024-08-18T01:37:07Z    INFO    piecestore      download started GET_AUDIT
2024-08-18T01:38:15Z    INFO    piecestore      download canceled GET_AUDIT

Please find them for the same piece, not just any piece.

Those are 3 different pieces, each with its started and canceled time.


https://review.dev.storj.io/c/storj/storj/+/12221

Oh that’s interesting!
This could explain it.

Instead, this new value of 150kB is still reasonably generous while limiting transfer time for a full-size piece to around 15s.
We shorten that time here as well, from 5m to 15s.

How does this align with the config setting storage2.monitor.verify-dir-readable-timeout?
15s does not seem to be sufficient when a node is experiencing high load.

They are independent and have different goals. The writability/readability checkers are used to detect hardware issues before it’s too late.
150 kB in 15 s seems reasonable; disks should usually be able to read such a small amount of data even under high load (unless there is also a potential or ongoing hardware issue).
I still did not get a confirmation that a “context canceled” GET_AUDIT affects the audit score, since there are also at least two retries after a while for the same piece.
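
For scale, and as my own arithmetic rather than a figure from the changelist: transferring a full 150 kB piece within 15 s corresponds to a sustained rate of only 150 kB / 15 s = 10 kB/s.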

I thought the audit was not about the time a node needs to provide the data, but about whether it provides the correct data. With these stricter time settings you are mixing 2 different things together.

It had better not. If it does, then 15 seconds is not auditing, it is penalizing slow nodes that might be running a huge parallel rsync, a RAID rebuild, or something else on the same busy hardware. 15 seconds is way too strict.

Maybe there should be additional retries immediately following the first canceled attempt. Chances are that after 2 or 3 retries the file is in the cache and can be served quickly and successfully.

Nope. There always was a timeout, but earlier it was 5 minutes.

Yes, 5 minutes compared to 15 seconds now. That is penalizing slower nodes.