That’s why all my nodes performed GE on Stephan’s satellite :wink:. The main part of your held amount ($150-180) was from Stephan’s satellite.
GE = Graceful Exit
This is a problem with Storj, not with SNOs.
If the node has not actually lost data, it should be switched to a suspended state before DQ.
The SNO should have a few days to fix the node.
As an example: one of my servers lost its SAS adapter. The node was alive, but the storage was not, so it failed audits. All the data was intact; all I needed to do was replace the adapter.
What evidence are you basing this on? I don’t see any logs or indication to know what went wrong with this node. If files got lost, that’s not something that can be fixed. Let’s not jump to conclusions without additional information.
This is not true. For egress, the data needs to be on the node. You can’t select a node “more often” if it doesn’t have the requested pieces. A piece is only stored once behind a single IP.
Egress will not increase if you suddenly have one node less.
The result would be the same as disqualification, but with a full disk of unpaid data.
The simpler fix for configuration errors is to prevent the node from starting (or continuing to work) if the storage is unavailable (or becomes unavailable while running).
This has been taken into development, but I do not have implementation details.
For me the simplest solution would be to separate setup and run, i.e. the node would not try to create a storage location during a normal run; it should be created only during setup.
If the storage location is empty, the node should simply crash.
My evidence is the DQ without suspension, and the author’s text:
the word “fix” can be interpreted as “the problem is fixed and the data is intact”.
I have written about this problem many times. Storj Labs ignores it, and as a result SNOs think that the Labs just need a reason not to pay held amounts.
Checks at startup are not enough; they don’t even cover all configuration problems.
Who will pay for repair while the pieces are unavailable?
In the current implementation of suspension, the pieces are still available; the audit just returns an unknown error (otherwise it would return a “file not found” error and the audit would be marked as failed).
So, in the current implementation of suspension mode, the node can still be used to retrieve data and its pieces are not counted as lost by the checker service; the node is only excluded from uploads of any kind.
If we treated “file not found” as a reason for suspension, it would put the data in danger of being completely lost, because pieces on such a node are not actually available for downloads.
So it should be a different kind of suspension: it should not allow any GET, and all pieces must be marked as lost.
The checker service must treat them as lost too, so the repair service can be triggered at the threshold. As a result, the cost of repair should be taken from that node’s held amount.
Since repair costs are high, the held amount would be used up entirely after a few trigger events. The node would then start again from the 75% held-back level.
The exit from suspension would be via successful audits. However, since there is no way for the satellite to know whether the problem is actually fixed, the node should go through vetting again.
The node is an untrusted entity from the beginning and can’t be taken at its word, especially after a failure. Thus pieces marked as “temporarily lost” would remain marked as lost until they are audited.
While the audit is in progress, the repair trigger doesn’t sleep, so the node would keep losing repaired pieces (they now live on other, healthy nodes). The space on that node would remain occupied by those temporarily unhealthy pieces until they are audited.
The repaired pieces would be removed by the garbage collector (i.e. slowly).
Auditing the whole node could take months. All that time the node would keep losing pieces while still holding the unpaid data.
Why is this better than disqualification?
If the author of the post had detected the problem in time and simply turned off the node before solving it, the data would also have been inaccessible, but the node would not have been disqualified.
Did that answer your point? Do you already conduct a full audit of nodes after they come back from downtime? Of course not!
You’re right, my reasoning was wrong; I didn’t think it all the way through…
Just suspend the node at a 0.7 audit score and leave the SNO a chance to fix the problem.
After the node leaves the suspension state, audits resume: the audit score either continues down to 0.6 and DQ, or goes up and the node is fully restored at 1.0.
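The proposed thresholds can be sketched as a tiny state machine in Go. The smoothing factor `lambda` is an assumed value, not the satellite’s real reputation math; only the 0.7 / 0.6 / 1.0 thresholds come from the suggestion above.

```go
package main

import "fmt"

const (
	lambda       = 0.95 // assumed smoothing factor, not the real satellite value
	suspendBelow = 0.7  // suspend here: give the SNO a chance to fix things
	dqBelow      = 0.6  // DQ only if the score keeps falling
	restoreAt    = 1.0  // fully restored once the score climbs back up
)

// update folds one audit result into the score as an exponentially
// weighted average (the general shape of reputation scoring; the real
// formula and weights may differ).
func update(score float64, pass bool) float64 {
	v := 0.0
	if pass {
		v = 1.0
	}
	return lambda*score + (1-lambda)*v
}

// nextState implements the proposed transitions: suspend at 0.7,
// disqualify at 0.6, and keep a previously suspended node suspended
// until its score is fully restored.
func nextState(state string, score float64) string {
	switch {
	case score < dqBelow:
		return "disqualified"
	case score < suspendBelow:
		return "suspended"
	case state == "suspended" && score < restoreAt:
		return "suspended" // fixed but not yet proven: keep auditing
	default:
		return "healthy"
	}
}

func main() {
	score, state := 1.0, "healthy"
	audits := 0
	// Every audit fails (e.g. a dead SAS adapter). With these assumed
	// values the node is suspended at audit 7 (score ~0.698) and
	// disqualified at audit 10 (score ~0.599).
	for state != "disqualified" {
		score = update(score, false)
		state = nextState(state, score)
		audits++
		fmt.Printf("audit %2d: score %.3f, state %s\n", audits, score, state)
	}
}
```

The point of the middle branch is exactly the suggestion above: leaving suspension requires climbing all the way back, not just crossing 0.7 again.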
They have not failed audits with a “file not found” error, so there is no reason to mark them as failed.
When offline, those nodes are not selected for downloads or uploads and are treated as unhealthy by the checker. So they lose pieces as well, but at the moment the payment for repair comes out of the satellite operator’s pocket, because the held amount is only used from disqualified nodes.
This will change with disqualification for downtime.
The Labs have no chance of correctly handling storage errors, primarily because they know nothing about these storage setups.
“File not found” means the file is not available at the moment (a SAS/SATA adapter died, a SAS cable was unplugged, a USB bridge died), just as a non-responding node means the node is not available at the moment. There is no difference.
In the meantime, everything is so unreliable that many SNOs perform GE once a significant amount accumulates in their held amount.
Which is the right course of action if data is actually lost. There is nothing you can repair if that is the case and suspension would be dangerous for file availability and expensive for the satellite operator.
Fix implies the problem is fixed, not that the data is intact. That would really depend on what caused this issue to begin with.
Which problem would that be, as it is still completely unclear what the problem with this node was?
I don’t disagree there. Checks need to happen during runtime as well and the node should be taken offline if the data location becomes unavailable. I even wrote a feature suggestion for that.
But without more information I can’t know whether that was the problem with this node.
Exactly, which is why the satellite has to assume the worst in order to make sure the data is safe. I honestly think that the examples you are mentioning would all be fixed with simple checks on availability of the storage location. Keep in mind though, it would still take your node offline and not fixing that in time would still get your node disqualified. But it would buy you time to fix things.
It’s not enough. That should not be the suspension state but a different one, because it must be almost equivalent to disqualification: all GET and PUT are forbidden except GET_AUDIT, the held amount can be used for repair, and all pieces are marked as unhealthy.
After resuming, vetting should be enabled again (to prevent mass losses if the problem is not actually fixed and the audit service just happened to select pieces from the unbroken part of the disk).
If the held amount is completely used up, the node should start again from the 75% held-back level.
So it’s much simpler to crash the node if the storage is unavailable.
This is why pro SNOs run their nodes with a ton of scripts on top of them. And there is always room for another script in response to a problem that the Labs do not want to solve.
I believe they are actually solving this one.
The software is open source, so anyone with the required knowledge could just fork the project and make the required changes for the storagenode to crash if it can’t find the requested data (or after a few consecutive failed audits). You could even make it not respond to audits if the data is missing, so the satellite counts it as an unknown audit error instead of an audit error counting towards DQ.
But at the moment I think Storj Labs is reasonable enough to acknowledge our problems, and they try to find good solutions. So hopefully it’s just a matter of time.