Cancelled audits

I am seeing the following pattern too many times not to worry:

2024-07-30T07:40:28Z    INFO    piecestore      download started GET_AUDIT
2024-07-30T07:40:52Z    INFO    piecestore      download canceled GET_AUDIT
2024-07-30T14:54:22Z    INFO    piecestore      download started GET_REPAIR
2024-07-30T14:55:08Z    INFO    piecestore      download canceled GET_REPAIR
2024-07-30T21:40:04Z    INFO    piecestore      download started GET_REPAIR
2024-07-30T21:40:32Z    INFO    piecestore      download canceled GET_REPAIR

It tries the same piece several times, sometimes succeeding, sometimes not, and afterwards continues with a new piece for the next audit.

Cancelled audits look unusual to me; I cannot remember seeing them under normal circumstances.
Is it normal to see cancelled audits now? Maybe something has changed. And is it normal to see one GET_AUDIT for a piece, followed by GET_REPAIR requests for the same piece, maybe as a substitute for the audit?


@Alexey: Is there anything you can tell regarding this?

It seems to affect audit scores on several of my nodes.
Are cancelled audits the new normal?
And is the pattern of a GET_AUDIT followed later by one or more GET_REPAIR requests also a normal one?

Could you please post the complete log lines for the same piece?

Here you go. I have removed the remote address IP. It’s all the same piece.

2024-07-30T07:40:28Z    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "V7MROJLCHWRQMWIYPJXZ26YBCCD2SZ2MKMTHIOMBUFA4TQGD2DEQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_AUDIT", "Offset": 8448, "Size": 256, "Remote Address": "REDACTED"}
2024-07-30T07:40:52Z    INFO    piecestore      download canceled       {"Process": "storagenode", "Piece ID": "V7MROJLCHWRQMWIYPJXZ26YBCCD2SZ2MKMTHIOMBUFA4TQGD2DEQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_AUDIT", "Offset": 8448, "Size": 256, "Remote Address": "REDACTED"}
2024-07-30T14:54:22Z    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "V7MROJLCHWRQMWIYPJXZ26YBCCD2SZ2MKMTHIOMBUFA4TQGD2DEQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "Offset": 0, "Size": 181504, "Remote Address": "REDACTED"}
2024-07-30T14:55:08Z    INFO    piecestore      download canceled       {"Process": "storagenode", "Piece ID": "V7MROJLCHWRQMWIYPJXZ26YBCCD2SZ2MKMTHIOMBUFA4TQGD2DEQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "Offset": 0, "Size": 181504, "Remote Address": "REDACTED"}
2024-07-30T21:40:04Z    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "V7MROJLCHWRQMWIYPJXZ26YBCCD2SZ2MKMTHIOMBUFA4TQGD2DEQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "Offset": 0, "Size": 181504, "Remote Address": "REDACTED"}
2024-07-30T21:40:32Z    INFO    piecestore      download canceled       {"Process": "storagenode", "Piece ID": "V7MROJLCHWRQMWIYPJXZ26YBCCD2SZ2MKMTHIOMBUFA4TQGD2DEQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "Offset": 0, "Size": 181504, "Remote Address": "REDACTED"}

The reason is one of the following:

  • Context canceled
  • Stream closed by peer (i.e. Auditor)
  • Downloaded size and received size don’t match

Which one corresponds to your node? I don’t know.

The log message comes from storj/storagenode/piecestore/endpoint.go at a5e2ec3e711db5db7dcd32624b07aec92f7f03ba · storj/storj · GitHub

I added that information in this commit https://review.dev.storj.io/c/storj/storj/+/14032
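
For illustration only, here is a minimal Go sketch of how a download handler could tell those three reasons apart; the function and parameter names are hypothetical, and this is not the actual storagenode endpoint code:

    package main

    import (
    	"context"
    	"errors"
    	"fmt"
    	"io"
    )

    // cancelReason is a hypothetical helper mapping the state of a canceled
    // download to one of the three reasons listed above.
    func cancelReason(ctx context.Context, streamErr error, sent, requested int64) string {
    	switch {
    	case errors.Is(ctx.Err(), context.Canceled):
    		return "context canceled"
    	case errors.Is(streamErr, io.EOF):
    		return "stream closed by peer (the auditor hung up)"
    	case sent != requested:
    		return fmt.Sprintf("downloaded size %d and sent size %d do not match", requested, sent)
    	default:
    		return "unknown"
    	}
    }

    func main() {
    	ctx, cancel := context.WithCancel(context.Background())
    	cancel() // simulate the peer canceling mid-transfer
    	fmt.Println(cancelReason(ctx, nil, 256, 256)) // prints "context canceled"
    }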


Great, such an info message will be helpful, I guess.
Context canceled means a slow disk, correct? I think this is the most probable cause in my case, when many uploads are happening at the same time.
Now that audits can get canceled, what is the effect on the audit score, and why are we seeing the pattern that a GET_AUDIT request is followed by subsequent GET_REPAIR requests?

Is this what @Alexey has suggested?

Which would mean that after the second canceled GET_REPAIR request an audit would be considered failed?

The context cancellation could come from either the storagenode or the satellite side.

For this particular function, and looking only directly at it, there is a timeout on the storagenode side (storj/storagenode/piecestore/endpoint.go at 052d7694ad0c61b55050118b955e1597ff4f2cf4 · storj/storj · GitHub) with a default configuration of 30 minutes.

But looking at your log lines and their timestamps, nowhere near that much time elapses. Besides, the log line conditionally checks whether the context was directly canceled, so it excludes timeouts, etc.
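
As a side note, here is a small, self-contained Go example (not the storj code) of the distinction that check relies on: a directly canceled context versus one whose deadline expired. The timeout is shortened here so the example runs quickly; the real default mentioned above is 30 minutes.

    package main

    import (
    	"context"
    	"errors"
    	"fmt"
    	"time"
    )

    // describe reports how a request's context ended.
    func describe(ctx context.Context) string {
    	switch {
    	case errors.Is(ctx.Err(), context.Canceled):
    		// only this case matches the "download canceled" path described above
    		return "directly canceled"
    	case errors.Is(ctx.Err(), context.DeadlineExceeded):
    		// a timeout looks different to the code and is excluded
    		return "deadline exceeded (timeout)"
    	default:
    		return "still running"
    	}
    }

    func main() {
    	// Direct cancellation, e.g. the remote side giving up on the request.
    	canceledCtx, cancel := context.WithCancel(context.Background())
    	cancel()
    	fmt.Println(describe(canceledCtx)) // directly canceled

    	// Timeout, shortened from the 30-minute default so the example finishes quickly.
    	timeoutCtx, stop := context.WithTimeout(context.Background(), 10*time.Millisecond)
    	defer stop()
    	time.Sleep(20 * time.Millisecond)
    	fmt.Println(describe(timeoutCtx)) // deadline exceeded (timeout)
    }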

Now that audits can get canceled, what is the effect on the audit score, and why are we seeing the pattern that a GET_AUDIT request is followed by subsequent GET_REPAIR requests?

A canceled audit isn’t a failed audit. It’s delegated to the reverifier, which uses GET_REPAIR. The reverifier currently tries 3 times before marking the piece as “lost”, and your node will be held accountable for it; however, we allow a small amount of bitrot.
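
A hypothetical sketch of that flow, with illustrative names only; the real satellite-side reverifier is considerably more involved:

    package main

    import (
    	"errors"
    	"fmt"
    )

    // maxReverifyAttempts mirrors the "tries 3 times" behaviour described above.
    const maxReverifyAttempts = 3

    // reverify is an illustrative stand-in for the satellite-side reverifier:
    // after a canceled GET_AUDIT it retries the piece via GET_REPAIR, and only
    // after all attempts fail is the piece counted as lost for the node.
    func reverify(fetchPiece func() error) (lost bool) {
    	for attempt := 1; attempt <= maxReverifyAttempts; attempt++ {
    		if err := fetchPiece(); err == nil {
    			return false // piece delivered, the audit is not failed
    		}
    	}
    	return true // counted against the node (subject to the small bitrot allowance)
    }

    func main() {
    	alwaysCanceled := func() error { return errors.New("download canceled") }
    	fmt.Println("piece lost:", reverify(alwaysCanceled)) // piece lost: true
    }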


It’s not new in v1.109, so moved here.

What makes this weird is that I see the download canceled error on many nodes. But there is one specific machine where it seems to impact the audit score, which is on a steady decline. I don’t see this happening on other nodes. This is strange.

How long does it take between GET_AUDIT “started” for a piece and GET_AUDIT “canceled” for that piece?

It differs. I took 3 random examples:

2024-08-17T13:32:34Z    INFO    piecestore      download started GET_AUDIT
2024-08-17T13:33:23Z    INFO    piecestore      download canceled GET_AUDIT
2024-08-18T01:42:57Z    INFO    piecestore      download started GET_AUDIT
2024-08-18T01:43:16Z    INFO    piecestore      download canceled GET_AUDIT
2024-08-18T01:37:07Z    INFO    piecestore      download started GET_AUDIT
2024-08-18T01:38:15Z    INFO    piecestore      download canceled GET_AUDIT

Please find them for the same piece, not just any piece.

Those are 3 different pieces, each with its started and canceled time.


https://review.dev.storj.io/c/storj/storj/+/12221

Oh that’s interesting!
This could explain it.

Instead, this new value of 150kB is still reasonably generous while limiting transfer time for a full-size piece to around 15s.
We shorten that time here as well, from 5m to 15s.

How does this align with the config setting storage2.monitor.verify-dir-readable-timeout?
15s does not seem to be sufficient when a node is experiencing high load.

They are independent and have different goals. The writability/readability checkers are used to detect hardware issues before it’s too late.
150 kB in 15 s seems reasonable; disks should usually be able to read such a small amount of data even under high load (unless there is also a potential or ongoing hardware issue).
I still did not get a confirmation that a “context canceled” GET_AUDIT affects the audit score, since there are also at least two retries after a while for the same piece.
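
For scale, and as my own arithmetic rather than a figure from the changelist: transferring a full 150 kB piece within 15 s corresponds to a sustained rate of only 150 kB / 15 s = 10 kB/s.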

I thought the audit was not about the time a node needs to provide the data, but about whether it provides the correct data. With these stricter time settings you are mixing 2 different things together.

It had better not. If it does, then 15 seconds is not auditing, it is penalizing slow nodes that might be running a huge parallel rsync, a RAID rebuild, or something else on the same busy hardware. 15 seconds is way too strict.

Maybe there should be additional retries immediately following the first canceled attempt. Chances are that after 2 or 3 retries the file is in the cache and can be served quickly and successfully.

Nope. There always was a timeout, but earlier it was 5 minutes.

Yes, 5 minutes compared to 15 seconds now. That is penalizing slower nodes.