Cancelled audits

How sure are you that it nets out to only 10 kB/s? Because the GitHub link literally has the variable named MinBytesPerSecond, and I was just taking that at face value.

And again, 1 megabit seems fine for a client’s upload. However, since a node could be answering an arbitrarily large (unlimited, right?) number of upload requests, I think it could start failing a lot if there were many simultaneous requests when an audit happened to hit.
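A rough back-of-the-envelope, assuming that 10 kB/s reading is correct: 1 Mbps of upstream is about 125 kB/s, so once roughly a dozen transfers share that link each of them drops to around 10 kB/s, and an audit arriving at that moment would be competing with all of them.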

We are talking about client downloads, so it’s called upstream or egress. The audit requests are downloads (egress from the node) too.
The minimum requirement for downstream bandwidth (customers’ uploads, i.e. ingress to the node) is 3 Mbps; for upstream bandwidth (customers’ downloads, i.e. egress from the node) it is 1 Mbps.

If a node takes all 15 seconds to return data, then it’s slow, and that’s OK. If it can’t return data in that time: it’s offline. If a SNO is doing anything that causes data to take more than those 15 seconds: they’ve made the node unavailable… whether they stopped the process or not.

A node doesn’t need exclusive access: sharing is fine! Slow is fine! Losing uploads and downloads is fine! But taking more than 15 seconds to return data means it’s offline, and its scores will start to be impacted.

But you need to fail a lot of times to be disqualified: Storj is quite generous, so potato nodes are fine. It’s only broken nodes that have to worry…
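If you want to check how a node is doing against that limit, counting how the GET_AUDIT downloads end in its log is enough (“downloaded” and “download canceled” are taken from the log excerpts in this thread; “download failed” is my assumption for the hard-error case):

docker logs storagenode | grep GET_AUDIT | grep -c "downloaded"
docker logs storagenode | grep GET_AUDIT | grep -c "download canceled"
docker logs storagenode | grep GET_AUDIT | grep -c "download failed"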


I don’t know about scrubs. But I know that processes like a RAID rebuild or verify run automatically; in the case of a rebuild you can’t plan when a drive fails.
My node has received around 6500 upload requests during the last hour. I don’t know if this is much or if this is even a good value. But it completed over 6000 successfully.
So is this a bad node or a slow node?
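For reference, this is roughly how I counted (I’m assuming the success message is "uploaded", by analogy with "upload started"):

docker logs --since 1h storagenode | grep -c "upload started"
docker logs --since 1h storagenode | grep -c "uploaded"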

How does this align with the requirements:
Step 1. Understand Prerequisites - Storj Docs

  • Uptime (online and operational) of 99.3% per month, max total downtime of 5 hours monthly
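(For context: 100% − 99.3% = 0.7% of a 30-day month, i.e. 0.007 × 720 h ≈ 5 h, which is where the 5-hour figure comes from.)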

And what if factors that are not external contribute to the performance? Like the number of uploads and downloads, piece deletions, filewalkers, or bugs that cause high IOPS, of which we had many?
Turn off the node as well? Maybe I should hard-kill upload requests instead… Customers will like it.

It would definitely be a good idea to publish that somewhere, somehow.

I have done that. It shows 1 download started and canceled. After that there is still ingress. So what is wrong?

I get that, but what is the idea behind it? Why is the ingress supposed to stop? What is the reason for doing this?

But maybe I should rather kill the node after 14 seconds than fail an audit?

Thanks, I actually missed that the default for this value is not 5 minutes like for the write check.

This does not really help if the lower scores are permanent and cannot recover.

I don’t agree that a node with 93% upload success rate and 75% download success rate can be considered unavailable.


There’s nothing wrong with those numbers. Your node could have 1% success rates and not be disqualified… as long as it continues to return audit data in 15 seconds or less: no worries!


If it cannot provide a small part of the piece within 15 seconds, it is now considered incredibly slow. It would be audited for the same piece three more times (instead of two in the previous implementation). If those fail as well, this audit will be marked as failed. Time to change something in your setup, I would assume.

This is a goal. Currently you can be up to 12 days offline before suspension. But please fix the issue as soon as possible: every offline hour costs your node several GBs of stored data.

Perhaps. But even with the 5-minute timeout we didn’t have this. Maybe it’s time, but I’m not sure yet.

Then your node has passed this audit.

It’s explained in the blueprint I linked before. But in short: to reduce the load on your node and allow it to pass the audit.

Perhaps. But the audit frequency is definitely lower than the default readability check frequency (it happens every 1m0s by default). Your node might be overloaded right now, but not when the audit request actually comes.
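If you want to compare with your own setup, the relevant keys in config.yaml should be the following (option names from memory, please double-check them against your config file):

storage2.monitor.verify-dir-readable-interval   (1m0s by default, as mentioned above)
storage2.monitor.verify-dir-readable-timeout
storage2.monitor.verify-dir-writable-interval
storage2.monitor.verify-dir-writable-timeout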

You are welcome! Yes, it confuses me as well… However, I’m aware of it and always point to this:

It can, if your node didn’t lose these pieces. Every successful audit will increase the audit score.

Not yet. I guess the first thing I will do is reduce concurrent uploads. They seem out of balance with the downloads. It won’t be a great experience for customers, but there is no score impact on the SNO. This seems like the way to go first.

Well, it is stated as a requirement. I am just saying this looks like Storj does not like offline nodes better than slow nodes.

True. But it was aligned with the other 5-minute check, so it was at least not a surprise.

This cannot be the case.
I have found a piece that has a canceled audit:

2024-08-24T11:16:09Z    INFO    piecestore      download started "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"
2024-08-24T11:16:51Z    INFO    piecestore      download canceled "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"

The piece did not get retried yet:

docker logs storagenode | grep ***Piece ID*** | wc -l
2

But since the canceled audit the node received many upload requests:

docker logs --since 2024-08-24T11:16:51Z storagenode | grep "upload started" | wc -l
88869

And it also received uploads from 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE:

docker logs --since 2024-08-24T11:16:51Z storagenode | grep "upload started" | grep 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE | wc -l
23

So if I understand you correctly (that the node should be put into containment mode after a canceled audit and not receive ingress), then something seems broken.

This is great information and what I had hoped for. So Storj wants to help a node pass the audit. But it also shows that the time to pass an audit in normal mode is not of much importance; otherwise there would be no point in allowing a node to pass an audit in a special reduced mode with less load from uploads.
Here is what I think would truly help:

  1. Increase the time. You changed two values, the minimum speed and the time. Maybe increasing the minimum speed would have been enough, perhaps combined with only a slight reduction of the time.
  2. My idea is that if you audited the same piece several times in quick succession, the node could serve it from its cache, which might help. So it would be like this: 1. Audit → canceled. 2. Immediately retry up to 3 times in a row, trying to get it from the node’s cache. 3. Only after this put the node in containment and try again later.

It didn’t. There is no indication in the logs of any issues with audits other than the canceled ones. So if the score keeps going down because of that, I expect a node to get disqualified over time.

Ideally we want to have fast and reliable nodes. But we are also aware that Operators cannot always control their internet connection, so right now the requirement of 5h offline per month is not strictly enforced, because it’s not causing trouble yet.
The speed of audits is starting to be a problem: we need to audit nodes more frequently as the number of nodes keeps growing, and we have slow nodes.

It is expected that the reverifier will check it after some time, not immediately, because right now the node was not able to respond within 15s (not even within 42s, based on the timestamps).
But it would be interesting to see when it gets reverified (it should be reverified with GET_REPAIR as far as I understand, so it’s better to check with your command).

Or it has changed. I didn’t receive confirmation about the containment mode; maybe it’s not used anymore.

Not possible. The time has been reduced precisely to speed up audits and conduct them more frequently with the ever-growing number of nodes.

I do not know whether the OS caches a file if only some random range of it has been requested. The audit requests not the whole piece, but only a random part of it.

Not sure that it’s possible to implement; the checks are asynchronous and there are multiple auditors (they audit separate segment ranges so they do not audit the same piece twice, three times, and so on). And it would again slow down the auditors. I think the main reason to reassign the check to the repair reverifier is to avoid slowing down the auditor. Unfortunately the recheck would happen much later, because the repair reverifier has to download the whole piece, not just a part like the auditor, and it normally handles only those segments whose number of healthy pieces is below the threshold, so this request would likely be queued and processed much later, as far as I understand.

I would guess that the containment mode would be applied after the first failed GET_REPAIR attempt for this piece.

You also need to check the GET_REPAIR requests.
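Something like this will list them for that piece, in the same style as your earlier commands:

docker logs storagenode | grep ***Piece ID*** | grep GET_REPAIR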

Yes, we all want that. But many parameters are out of the SNO’s control. Even the satellites are independent. It looks like a single satellite with heavy activity could impact the performance, and therefore the score, on other satellites, up to disqualification. So one satellite could get your node disqualified from all of them.

Yes, I am just showing that this is still an “open case” and therefore containment would be expected.

Yes, as said, there are currently only the 2 instances of this piece in the log, which are the lines with GET_AUDIT. Nothing else has happened yet with that piece.

Ok, could be changed. I am not aware of all changes.

Good question. I have assumed it does. If it does, then a quick succession of retries would help to get it into the cache and serve it from there. This could help when the storage is busy with uploads.

My idea was to put the retries that load it into the cache on the first audit by the auditors, to prevent the containment mode for the node. I don’t know when and if the containment mode will be applied, but of course the multiple retries in a row could also come from the repair workers to load the file into the cache and serve it from there.

I don’t see any significant errors related to audited pieces there either other than “download canceled”.

It’s still under your control: you may blacklist it or call a graceful exit.
Of course, this idea would be good to implement:
Per Satellite Available Space, but it hasn’t been submitted as a feature request.

If it’s for the GET_REPAIR, I think it could affect the audit score.

Yes, what I am trying to say is that the issue is not some other audit failure like “file not found” or something.

Yes, it was like this for a long time. Both GET_AUDIT and GET_REPAIR used those conditions:

It seems there are only two changes in this process:

  1. The timeout is now 15s, not 5 minutes.
  2. The first failed audit does not trigger the containment mode right away; it delegates the reverification to GET_REPAIR. GET_REPAIR allows three attempts before the audit is considered failed, and the first GET_REPAIR failure triggers the containment mode.

So, in essence, everything remains the same; it’s just that the audits are being sped up. The second change is likely not really a change, since GET_REPAIR works as it did before. It’s just that now the auditor may add a piece for reverification.
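If I read it correctly, the flow now looks roughly like this (my own summary, not an official description; a successful response at any step means the audit passes):

GET_AUDIT with a 15s limit
  -> canceled or failed: the piece is queued for reverification
     -> GET_REPAIR attempt 1 fails: the node is placed in containment (ingress paused)
        -> GET_REPAIR attempts 2 and 3 also fail: the audit is counted as failed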

No, it does not:

2024-08-24T18:47:29Z    INFO    piecestore      download started ***PIECE ID*** "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_AUDIT"
2024-08-24T18:47:50Z    INFO    piecestore      download canceled ***PIECE ID*** "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_AUDIT"
2024-08-25T00:55:15Z    INFO    piecestore      download started  ***PIECE ID*** "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR"
2024-08-25T00:55:34Z    INFO    piecestore      download canceled ***PIECE ID*** "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR"

The first canceled GET_REPAIR was at 2024-08-25T00:55:34Z, so there should not be any ingress since then.
But:

docker logs --since 2024-08-25T00:55:34Z --until 2024-08-25T06:55:53Z storagenode | grep "upload started" | grep 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S | wc -l
28420

On 2024-08-25T06:55:53Z it downloaded successfully:
2024-08-25T06:55:53Z INFO piecestore downloaded ***PIECE ID*** "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR"

This is egress, because all wording is from the customers’ point of view. A download means egress from the node, not ingress.
You need to check uploads.

Sorry, maybe I misunderstood:

Containment mode means all ingress is stopped?
Ingress = Uploads?

Yes, the containment mode means the same as suspended: all ingress (uploads) to your node is temporarily stopped to allow the node to answer the audit/repair request.
And unlike the previous implementation, your node has 4 chances instead of three: the first one with a GET_AUDIT, and three more with GET_REPAIR (as before). The previous implementation meant 3 GET_AUDIT requests, with the second and the third in containment mode.

Then I understood correctly and this:

should then be 0, because the node should have been put into containment mode after the first canceled GET_REPAIR at 2024-08-25T00:55:34Z

It should be zero until it passes (or fails two more times) the GET_REPAIR request for this piece.

Which was at 2024-08-25T06:55:53Z:
2024-08-25T06:55:53Z INFO piecestore downloaded ***PIECE ID*** "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR"

That’s why I put --since 2024-08-25T00:55:34Z --until 2024-08-25T06:55:53Z in the docker logs command when I counted the upload requests.