If a node can’t respond with data within 15 seconds it’s not penalizing slow nodes: it’s penalizing broken ones.
Just another step to weed out potatoes. I like that…
I don’t like this change. Not because it will kill potato nodes: that might even be a good idea… I don’t care about that part.
I don’t like this change because it was not communicated. There should have been at least a proper separate post (if not an email) saying: yes, we are making this change, here is how you can check whether it affects your node, and if it does, here is a list of things you can try to fix it.
This is yet another event that implicitly says we don’t really want casual node operators.
If my setup can barely send data back to customers faster than the post office… I may be many things… handsome… charismatic… humble… but I am not a node operator.
I don’t see that node as broken. Stats are saying it has 90% upload and a 75% download success rate.
But I don’t see a reason to audit more strictly than the downloads to customers are allowed to take.
This does not make sense.
There are many reasons why the storage can be slow. But as long as it is fast enough for serving customers, it should be fast enough for audits and should not be penalized artificially.
This is a bad move, because now it seems that any action on the underlying media (a RAID rebuild, a RAID verify, cloning, a node move with rsync or in parallel with robocopy, or whatever) can hurt your node score, up to disqualification.
This is not true. Every change is communicated in the Release notes. But yes, not everyone reads them. Usually the change also comes with a description and the reason for it.
Perhaps we do not want broken nodes which cannot provide a part of the piece for an audit, that’s true, but this is not (and never has been) aimed at Storage Node Operators. Please do not substitute meaning.
I’m not sure that every change should be supported by a long article explaining it and announced as news. But I would pass your opinion to the team.
Exactly, as long as it is fast enough for serving customers.
So, do you think that a node which cannot provide a part of the piece even at less than 10 kB/s is able to serve customers normally?
The minimum requirement is pretty low already: 1 Mbps, which translates to at least 131 kB/s.
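For the curious, a quick sketch of the conversion (the 131 kB/s figure matches reading 1 Mbps as 2^20 bits per second; the decimal convention would give 125 kB/s):

```go
package main

import "fmt"

func main() {
	// Two common readings of "1 Mbps", converted to kB/s.
	const decimalBits = 1_000_000.0 // 10^6 bits per second
	const binaryBits = 1 << 20      // 2^20 bits per second
	fmt.Printf("decimal: %.0f kB/s\n", decimalBits/8/1000)         // 125 kB/s
	fmt.Printf("binary:  %.3f kB/s\n", float64(binaryBits)/8/1000) // 131.072 kB/s
}
```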
From your logs it seems that it takes a little longer than 15 seconds to receive a part of the piece, almost a minute before it’s canceled.
If you believe that the audit score is affected by these canceled audits (meaning that the retry didn’t help either), then I think it’s better to limit the number of requests which can be handled by the node. I’m not sure whether this limit is also applied to audit requests or not.
Shouldn’t long tail cancellation take care of slow performance?
And we also know the customer experience can differ depending on location. Audits come from just one location out of the many possible locations customers download from.
From the 75% overall download success rate I would say the node serves customers well enough on the majority of downloads; otherwise the transfers would get long-tail canceled.
AFAIK you cannot limit download requests.
I have seen slow storage in many different situations: massive uploads and downloads, endless filewalkers of different kinds, slow devices, copy and transfer actions, RAID rebuilds and many others. Storj should take into account that it does not necessarily have exclusive access to the storage, so there are also causes outside of its scope that it nevertheless depends on.
It is interesting that we can allow minutes for the read and write checks but can get disqualified when, for example, the RAID a node is running on gets rebuilt or the node gets moved to another location. This does not align well for me. To me, audits are for proving that the data is correct, not for how fast it gets delivered, despite the readability checks in place.
BTW:
In the past you have made clear the link between these settings and the audit timeouts:
You may increase a timeout, but please be careful - do not put more than 5 minutes, otherwise you will risk to start to fail audits.
Shouldn’t long tail cancellation take care of slow performance?
Yes, it should. But we also want to reduce the tail before it starts, as you may see in this thread:
I don’t think I can follow. So let me explain it in a different way. Current RS numbers are 16/20/30/38 and choiceof is 6 (both might get changed in the future to match the required throughput with the remaining number of nodes). This combination of RS numbers and choiceof factor will select 38x6 nodes in total. Now each subnet has an equal chance to make it into this selection group. If there are 2 nodes on the same subnet, they get 50% of that rate. The 2 nodes on the same subnet will never get…
It is interesting that we can have minutes for read and write checks but can get disqualified when for example a Raid a node is running on gets rebuilt or a node gets moved to another location.
I do not think it would be disqualified in that case if it’s functioning normally; there are also retries for the same piece after reducing the load (the node is temporarily placed into containment mode to allow it to respond). However, it may be the case if the hardware has issues.
In the past you have made clear the link between these settings and the audit timeouts:
Yes, because it was a good indicator at that time. It seems it isn’t anymore. At the moment you may ignore that the node cannot read a file for more than a minute, to allow it to continue working instead of crashing, but now that seems to greatly increase the risk of disqualification (because the check wouldn’t be fast enough).
Anyway, I still did not get a confirmation from the team on whether canceled audits are counted against the audit score or not.
Logically they should be, but then I would expect more than 3 attempts to audit the same piece.
Anyway, I still did not get a confirmation from the team on whether canceled audits are counted against the audit score or not.
I see one node has fallen to 97.08 on AP1 right now. Last time I checked I had not seen any other issues with audits, only canceled ones. I did not see errors like “file does not exist” or similar.
So for me it is apparent that this has an impact on the score. Given the many circumstances in which storage can be slow at times, it would be much better to put such a node into suspension instead. That way, if you for example run a RAID rebuild that takes 3 weeks, you can recover afterwards. The current 15 seconds seems like a terrible idea.
Every change is communicated in the Release notes.
You mean this post? Release preparation v1.97. I see, it clearly explains how to check whether your node is affected, and brings the important change into the light! And it’s so clear that everyone understood it immediately.
Except, possibly, for some otherwise technically adept people:
Anything that is not a known error. Known errors for the audit system are: file not found; file is corrupted; 3x timeout (5 minutes for a few kB on each attempt) to get a piece.
Usually only the online score should be affected. However, if your node cannot provide a piece for an audit after 3 attempts with a 5-minute timeout each, this audit will be considered failed. But I also see that the suspension score is affected too; this means that your node responded with an unknown error to an audit request. I would suggest checking your logs for errors related to GET_AUDIT/GET_REPAIR to see what’s going on. Unfortunately the node wouldn’t detect failures because of timeouts, but…
And I really do assume here that you didn’t want to deceive people; you just haven’t understood this very clear message yourself.
Please do not substitute meaning.
I didn’t. When I stated that casual operators are not truly welcome, I really meant it. It seems that to be aware of changes to how their nodes are evaluated, node operators need to put more effort into tracking changes in commits than even Storj employees do. This is not what casual means.
This is an important note to me as well. I do carefully look at the commits posted in release notes, from the general and storage node sections. Now I know that the satellite section, despite mostly covering client-facing development, and despite being the longest one time after time, also needs careful study.
But I would pass your opinion to the team.
Thank you!
The canceled audit is not counted as a failed audit
A canceled audit isn’t a failed audit. It’s delegated to the reverifier that uses GET_REPAIR. The reverifier currently tries 3 times before marking the pieces as “lost” and your node will be accounted for it, however, we allow a small amount of bitrot.
@ifraixedes could you please explain what this statement means?
your node will be accounted for it, however, we allow a small amount of bitrot.
Does it mean that the audit score will be affected or not?
If not, when will it be affected?
We have had these conditions for a failed audit:
1. The node is online and responding to audit requests.
2. The node cannot provide the piece, or it’s corrupted:
2.1. The node cannot provide the requested piece within a 5-minute timeout. We put it into containment mode (all ingress is temporarily stopped) and the node will be requested for the same piece two more times with a 5-minute timeout each. If the node still cannot provide the piece, the audit is considered failed and the audit score is affected.
2.2. The node provided the piece, but it’s corrupted; this affects the audit score.
2.3. The node returned a “file not found” error (or similar, including “i/o error”); in this case the audit score is affected.
3. All other responses are considered an unknown error and affect the suspension score instead of the audit score.
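To illustrate these conditions, here is a rough Go sketch; the names and structure are purely illustrative, not the actual satellite code:

```go
package main

import "fmt"

// Outcome is a simplified label for what a single audit attempt does to the
// node's scores, per the conditions listed above. Illustrative only.
type Outcome int

const (
	Passed             Outcome = iota
	Contained                  // piece will be retried later by the reverifier
	AuditScoreHit              // counts against the audit score
	SuspensionScoreHit         // counts against the suspension score
)

// classify maps one audit attempt to an outcome. attempt is 1-based;
// after the third timed-out attempt the audit is treated as failed.
func classify(errMsg string, timedOut bool, attempt int) Outcome {
	switch {
	case timedOut && attempt < 3:
		return Contained // ingress paused, same piece requested again later
	case timedOut: // third timeout in a row
		return AuditScoreHit
	case errMsg == "file not found", errMsg == "i/o error", errMsg == "corrupted":
		return AuditScoreHit
	case errMsg != "":
		return SuspensionScoreHit // unknown error
	default:
		return Passed
	}
}

func main() {
	fmt.Println(classify("", true, 1) == Contained)                    // first timeout: containment
	fmt.Println(classify("", true, 3) == AuditScoreHit)                // third timeout: audit score
	fmt.Println(classify("file not found", false, 1) == AuditScoreHit) // known error: audit score
	fmt.Println(classify("disk busy", false, 1) == SuspensionScoreHit) // unknown error: suspension
}
```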
Does it mean that the audit score will be affected or not?
From how I understand the code changes at satellite/audit: much stricter audit transfer speed reqs · storj/storj@5ec1232 · GitHub, it seems that the same rules that applied when the setting was 128b and 5m still hold true. Specifically, if an audit request timed out previously and retries failed, it appears to have been considered a failed audit. It looks like only the values have been changed but the behavior remains the same.
So, this is… interesting.
And at face value, it seems completely reasonable from a client perspective that a Storj node should be able to offer up files with a bandwidth of 150 KB/s per file and also be able to return an audit chunk in 15 seconds. I mean, 15 seconds is an eternity when trying to download data.
However, I have two concerns.
- If the min bandwidth becomes 150 KB/s per chunk, then my math says this is over 1 megabit per chunk, and lots of download requests can happen in parallel. This may limit (or eliminate) someone who has an internet connection of only, say, 50 megabits upload speed unless they can tightly ration their download requests. And I don’t think they can? (There is storage2.max-concurrent-requests but that’s only for uploads?)
- As mentioned, a node may have its performance hampered when doing something like a scrub or an rsync (operator induced) or when encountering particularly nasty filewalkers/garbage collection (Storj induced). In the second case, it wouldn’t be “fair” to fail an audit request. The data really is safe there, it’s just not accessible quickly. It shouldn’t be penalized as harshly as an audit failure; it’s more like an offline fault.
General idea: if storj’s attempt is to start qualifying nodes based on performance, then maintain a separate performance metric. Because it’s different than auditing the data quality, and different than auditing being merely online.
A node has one job: return data when asked for it. Even accepting new data is only optional. 15 seconds does sound like a reasonable time.
The reasons why a node may fail are legion, perhaps nuanced, sometimes interesting, and often great fun to talk about in these forums! But ultimately they’re all excuses. What is fair is not paying SNOs with nodes that don’t meet minimum service levels.
If the min bandwidth becomes 150KB/s per chunk,
The bandwidth required to download 150 kB within the 15-second timeout is just 10 kB/s.
So I guess a node with a 1 Mbps (131 kB/s) upstream is unlikely to handle more than one audit request at a time, and should have plenty of upstream bandwidth to handle other downloads, most of which would likely be canceled with such a low upstream anyway.
However, I personally think that we should require at least 4 Mbps upstream; 1 Mbps has always seemed too low to me.
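To make that arithmetic explicit, a small sketch (the 125 kB/s figure uses the decimal reading of 1 Mbps; the headroom number only illustrates the margin, not how audits are actually scheduled):

```go
package main

import "fmt"

func main() {
	// Audit requirement discussed above: up to ~150 kB must arrive within 15 s.
	const pieceKB, timeoutS = 150.0, 15.0
	perAudit := pieceKB / timeoutS
	fmt.Printf("required throughput per audit: %.0f kB/s\n", perAudit) // 10 kB/s

	// Headroom of a 1 Mbps uplink (~125 kB/s, decimal convention) against that requirement.
	const uplinkKBs = 125.0
	fmt.Printf("theoretical headroom: ~%.0fx the audit requirement\n", uplinkKBs/perAudit) // ~12x
}
```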
A node has one job: return data when asked for it.
According to the stats that I have posted, it does that with a 75% download success rate. So let’s disqualify it?
A node cannot deny downloads. What it does is serve them as best it can, no matter when they arrive and whatever else it is currently doing.
But ultimately they’re all excuses
No. Ask yourself: why should a node’s score be permanently hurt because it is running a RAID rebuild to keep the data safe? Or doing a node move? At the same time nodes can be offline for 30 days and are allowed to fully recover their scores.
Is it better to have a node offline than slow? So maybe it’s better to turn your node off when you do an rsync move, because staying online could permanently hurt your node’s score?
And I still do not see where Storj requires exclusive access to the storage that is used. I always thought I was free to use the underlying hardware, which can be under load at times.
And as I always say: nodes are getting bigger and will have to handle more uploads and downloads. 30 TB HDDs have already been launched, and larger ones will follow. The situation will get worse for nodes, and to scale, nodes would rather need the audit rules to be loosened, not tightened.
At the same time nodes can be offline for 30 days and are allowed to fully recover their scores.
And this is an answer to
because it is running a RAID rebuild to keep the data safe? Or doing a node move?
especially when you see that it’s affecting an audit score, which means that it has permanent issues reading 150 kB of data in less than 15 s, even with three retries later in a containment mode (when the ingress is stopped).
By the way, what’s your readable check timeout for that node?
And this is an answer to
Sorry I don’t quite understand what you want to tell me with my own quotes. Please explain.
My question was whether Storj prefers offline nodes over slow nodes.
especially when you see that it’s affecting an audit score
I don’t think most SNOs monitor their nodes closely enough to notice every hundredth of a point change and correlate it to slowness. So far, audit score changes have mainly been due to pieces not found or corrupted.
which means that it has permanent issues
I have seen and experienced many issues that affect a node, lots of them introduced by Storj itself. This does not mean the node has permanent issues.
three retries later in a containment mode (when the ingress is stopped).
I don’t see that. I see the audited piece being canceled the first time, but after that ingress still comes in.
What is the idea behind stopping ingress after a timed out audit?
By the way, what’s your readable check timeout for that node?
storage2.monitor.verify-dir-readable-timeout 2m0s
Maybe I should set it to 14 seconds now?
Sorry I don’t quite understand what you want to tell me with my own quotes. Please explain.
You can solve the problem of failing audits when the system is overloaded by an external process like a scrub job or a migration: just stop the node. And run it periodically, so that the online score doesn’t drop too low, until you finish this high-load maintenance.
My question was whether Storj prefers offline nodes over slow nodes.
Basically, yes. We would be happy if you can keep it online, but if you can’t, it’s better to be offline than slow. We want to provide a great customer experience: an offline node will be excluded from any node selection.
I don’t think most SNOs monitor their nodes closely enough to notice every hundredth of a point change and correlate it to slowness.
Then maybe they will notice a yellow audit score and try to figure out what’s going on. Then they will notice the scary audit errors, and I hope they would come here and ask.
When I have a clear picture and several confirmations that this change is affecting a noticeable number of systems, we may reconsider it. Otherwise I would add an FAQ article here and/or on the support portal and/or the docs portal.
I don’t see that. I see the audited piece being canceled the first time, but after that ingress still comes in.
You may search your logs for the same PieceID to see how many times it has failed.
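For example, a minimal sketch of such a search (the log path and piece ID are placeholders, and a plain grep over the log would do the same job):

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	// Placeholders: point these at your own log file and the piece ID from the canceled audit.
	const logPath = "/path/to/node.log"
	const pieceID = "PUT-THE-PIECE-ID-HERE"

	f, err := os.Open(logPath)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	count := 0
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024) // some log lines are long
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, pieceID) &&
			(strings.Contains(line, "GET_AUDIT") || strings.Contains(line, "GET_REPAIR")) {
			count++
			fmt.Println(line)
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d GET_AUDIT/GET_REPAIR lines mention this piece\n", count)
}
```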
What is the idea behind stopping ingress after a timed out audit?
The retry wouldn’t come immediately.
However, maybe there has been a change in the containment mode too, because according to design-docs/20190909-auditing-and-node-containment.md at ed8bfe8d4c66a587f5322237f06208f74e301b7a · storj/design-docs · GitHub there were no repairs involved.
Maybe I should set it to 14 seconds now?
I do not think so. If you changed it to a higher value than the default, that means the disk on that node cannot keep up, and reducing it would likely crash the node even more often than before. This check runs every 1m0s by default, and I do not think your node is audited with the same frequency. By the way, you also need to increase the readable check interval by the same amount, otherwise it doesn’t make sense: these checks will overlap if the disk truly only responds after 2 minutes.
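For reference, this is how that pair of settings could look in config.yaml; the interval key name here is given from memory, mirroring the timeout key quoted above, so please verify the exact spelling in your own config file:

```yaml
# Readability check: how long one check may take, and how often it runs.
# Keep the interval at least as large as the timeout so the checks do not overlap.
storage2.monitor.verify-dir-readable-timeout: 2m0s
storage2.monitor.verify-dir-readable-interval: 3m0s
```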
Since it’s not disqualified, I can assume that the percentage of failed audits is still low.