Is there such a thing as an "un-disqualify"?

Just put it here:


I get your view @BrightSilence and I tend to agree with you.

However, as @SGC said:

(even though “random bullet” is a debatable notion here)

In my humble opinion, there is only one situation where disqualification could be fast and without warning: if a node is clearly responding with the wrong data when audited. That would mean it’s trying to cheat and is poisoning the network, so it needs to be killed fast.

All other problems should simply trigger the suspension mode, and notify the SNO that something’s wrong and that they should check what’s going on within a few weeks (or months… dunno) before getting disqualified for good.

That said, I do agree that @JohnSmith should (or could) have taken a look at the logs anyway to see what happened, especially because if it turned out to be a new problem, it could have helped the team make the software better.


true… but maybe it should be made clearer what goes wrong then; some sort of system could be implemented so that the node protects / disconnects itself upon a high number of errors / audit failures… so that people actually have a chance to diagnose the issue…

it’s not nice to know that months and even years of maintenance and work could be lost to a random fluke that wasn’t actually anything serious… just because the software doesn’t know how to preserve itself at least long enough for humans to respond.

and the random bullet could very well be a carefully considered one from a sniper… it just seems random to the guy in the trench


Well, I think my previous posts show that I think some things can be improved. But I would say that if data is removed, that should also lead to a hard disqualification, given that the storage location is available. The same goes for unreadable data: if the storage location can be read and written to, but files are either unreadable, no longer there, or return wrong information, there is nothing you can do to fix it anyway. In those scenarios, disqualification after failing too many audits is just the right thing to do.

@SGC I see you responded as well. But I think this message applies to what you said too. The node needs to be better at knowing whether it’s just that the data location isn’t available or that actual data is lost. Disqualification should only happen in the latter case. If that’s fixed, then there is no question about why the disqualification happened, because it could only be one thing.

For what it’s worth, I’m certain the dashboard will already show an audit score below 60% on the node that is disqualified. But you do have to bother to look.


I disagree. The SNO could have tried to migrate their nodes and misconfigured the target path. It’d be better to let them know something is wrong, so they can face-palm themselves and solve the issue.

If the node does return data that proves to be invalid, then it’s different: considering each file is identified by some kind of UUID (if I’m not mistaken), I guess that if a node were to target a folder containing data from another node, it shouldn’t find a single file matching requests coming from satellites. Which means that a node returning wrong data probably is cheating. That’s my take anyway.

As @BrightSilence said, there is room for improvement indeed, some may want to upvote one of his suggestions to make the software more robust: Make the node crash with a fatal error if the storage path becomes unavailable

I would count that as the storage location not being available. The node could place a file in the storage location that it can poll for availability. If it’s not there, the node shouldn’t start, or it should shut down. That would catch misconfigurations as well. Even better would be a file that stores the public identity of the node in the storage location, to test whether the data it points to matches the identity being used. That would even catch issues where node A points to the storage location of node B.

But if files are missing but that test file is there, then disqualifying is still the right thing to do.
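The check described above could be sketched in a few lines. This is purely illustrative, assuming a hypothetical `check_storage` function and marker file name; it is not the actual storagenode implementation:

```python
import json
import os


def check_storage(storage_path: str, node_id: str) -> None:
    """Refuse to run if the storage path is unavailable or was
    previously claimed by a different node identity."""
    if not os.path.isdir(storage_path):
        # Unmounted disk or misconfigured path: don't start at all.
        raise RuntimeError(f"storage path unavailable: {storage_path}")
    marker = os.path.join(storage_path, "storage-identity.json")
    if not os.path.exists(marker):
        # First start: claim this storage location for our identity.
        with open(marker, "w") as f:
            json.dump({"node_id": node_id}, f)
        return
    with open(marker) as f:
        owner = json.load(f)["node_id"]
    if owner != node_id:
        # Node A pointed at node B's data: shut down instead of
        # silently failing audits until disqualification.
        raise RuntimeError(f"storage belongs to {owner}, not {node_id}")
```

With a check like this, a missing mount or a swapped storage path results in a refusal to start rather than audit failures.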


I would like to recommend reading the whitepaper: , sections “4.8 Structured file storage” and “4.14.1 Piece hashes”.
It’s not a UUID at all. From the storagenode’s perspective, it indeed can’t find any audited piece and will answer with “file not found”, not with wrong data. From the satellite’s point of view, the node has lost all the data and must be immediately disqualified; otherwise it will be offered to customers, and they could receive the scary “file not found” message too.
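The idea behind the piece hashes mentioned above can be illustrated with a simplified sketch (this is not the actual storagenode code, just the general content-verification principle): the downloader hashes the received piece and compares it to the expected hash, so a node cannot substitute arbitrary wrong data; it can only answer correctly or, effectively, with “file not found”.

```python
import hashlib


def verify_piece(piece_bytes: bytes, expected_hash_hex: str) -> bool:
    """Return True only if the received bytes match the expected hash."""
    return hashlib.sha256(piece_bytes).hexdigest() == expected_hash_hex


piece = b"some stored piece data"
expected = hashlib.sha256(piece).hexdigest()

print(verify_piece(piece, expected))       # correct data verifies
print(verify_piece(b"tampered", expected))  # substituted data is detected
```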

Sounds like a pretty decent solution to me! :slightly_smiling_face:

I’m not questioning that; that’s why I assumed that only a cheating node could reply with wrong data.
If a mechanism like the one @BrightSilence suggested is implemented, then yes, not finding files should lead to disqualification. But currently, a simple misconfiguration could cause that.

If we’re sure the node is cheating or has lost all files, then yes! Otherwise it could be suspended immediately to avoid sending scary ‘file not found’ messages to customers. Don’t you think?

But, I thought that maybe it’d be simpler and easier to pull out a node (by suspending it) from the network when anything goes wrong, anything… And if it gets back to normal within a certain amount of time, un-suspend it. Otherwise, disqualify it.

I’m truly having a hard time seeing why, in some cases, it should be immediately and definitively killed. ^^’

I already said what I think about suspension instead of disqualification:

It should be extended to work as a slow disqualification:

The recovered pieces will be removed from such node with a Garbage collector.


Very educational answer @Alexey, many thanks.

I see why it’s not as simple as I thought.

Well then… I’m not sure what to suggest :sweat_smile:

If I understand what you’re suggesting, SNOs should pay for repairs triggered while their nodes are suspended.

That sounds pretty fair to me. I’d rather receive a poor payment one month because my node got temporarily suspended, instead of getting disqualified.

Yes, but the held amount could be wiped very fast, because of

And thus your node would fall back to the 75% earnings-held-back level very fast too.
Then we would be forced to enable the vetting process again, so 5% of potential traffic during the next month or two.
And pieces would get deleted too, because your node is untrusted and the pieces are still marked as lost. And of course they are not paid anymore.
So, this is the same as a disqualification, but with a slow clean-up of the unpaid space.

I’m not sure it’s a good suggestion at all.

Do you mean a single failed audit?

You’d have to fail quite a few audits, it isn’t immediate. Notifications about failed audits could be helpful though.

Right now repairs cost storj around $9/TB stored. From what I see on my 8TB nodes, the held amount will probably be really close to the repair cost between months 10 and 15; after that it will only cover half of the repair cost. And if I exclude the early surge payments, it would only cover ~70% at its maximum and ~35% after I get half of it back.
It’s even worse in the case of nodes that started small and then migrated to a big drive.

So I’m really surprised that people get disqualified because of a misconfiguration or other equally simple and minor issues, and there’s no way to recover the node after a DQ that happened for reasons like that.

That’s what I am thinking. A misconfigured node should not even be able to go online.
Why does the node not check that its data directories are valid and reachable and in the correct path?

This could be either a self test, or an external test, that is run before the server is marked as online for the network.

That’s not what I meant: there should be some leeway. The problem should raise some kind of “error-meter” so the node gets suspended after a certain number of errors; but whatever the root cause of the issue, the result should be suspension, not disqualification.

But @Alexey kind of showed me it would be great from an SNO point of view, but not that simple from a satellite point of view, so… I’m not sure anymore what would be the better approach for everyone to be happy.

It’s a complicated matter.

Yes, being notified would be nice, but still: what is a major issue for me is the fact that in some situations nodes can be disqualified in a few days (sometimes even quicker), leaving no time for SNOs to check what’s going wrong, especially if they do not have their node at hand (if they are on holiday, for instance).

Besides, if it is acceptable in some cases to suspend a node for a few weeks so there’s time to investigate what’s going on before disqualification, I don’t see why it could not be the solution for all kinds of problems.

@Alexey’s answer seems to suggest that it would never be a good thing for the network though (and I see why now, the network could be endangered in the meantime), in which case I really don’t see how to go forward on that matter :confused:

It really is a complicated matter.

The $10/TB repair cost is only what is paid to Storage Node Operators; it is not the full cost. Please read the explanation from an engineer in my post.
The held amount is not enough to cover all costs of the repair process.
So, if a node fails too many audits, there is only one way: mark all pieces on the failed node as unhealthy and use the whole held amount to cover at least part of the costs. As a result, the reputation of the node would have to be reset to the initial state, i.e. 75% of earnings held back, the node going through the vetting process again (5% of potential traffic for at least a month), and the garbage collector slowly removing the repaired pieces.
I do not see any advantages compared to starting from scratch.

No, please read my post again :slight_smile:
Repairing customer data costs storj $13.5/TB, which translates to $9/TB of data stored; that is the amount that has to be covered by the held amount in case of DQ.
My example node is 7.2TB; in the event of DQ, repairing its content will cost storj $64.8. The held amount on it right now is $50, and if the traffic pattern doesn’t change, it will increase to $63.3 (almost enough to cover repairs), and then half of that amount will be returned to me after month 15.
Another example node with the same amount of space has $44 held. Its held amount will get to around $58 at most, which is not enough to cover the repair cost.
It’s even worse in the case of an SNO who started with a little 1TB drive and upgraded to a huge drive later.

So this is why I’m surprised DQ happens easily because of issues like a missing mount or a disconnected usb drive.
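The arithmetic behind the figures quoted above can be sanity-checked in a few lines, using the numbers from the post ($9 per TB stored, a 7.2 TB node, $50 currently held):

```python
# Figures as quoted in the post above.
repair_cost_per_tb_stored = 9.0  # $/TB of data actually stored
node_size_tb = 7.2               # size of the example node
held_amount_now = 50.0           # current held amount on that node

# Cost to repair the node's content if it were disqualified today.
repair_cost = node_size_tb * repair_cost_per_tb_stored
shortfall = repair_cost - held_amount_now

print(f"repair cost: ${repair_cost:.2f}")   # $64.80
print(f"shortfall today: ${shortfall:.2f}")  # $14.80
```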

Please, read the Storage Node Operator Terms and Conditions in

The $10/TB is only what Storj Labs pays to SNOs when downloading pieces from storage nodes to the repair job service. The repair job then recombines the files and uploads them back to the network, and that upload is charged by the cloud provider of the compute service where the repair job runs (all cloud providers charge for egress traffic). So the total cost of repairing 7.2TB of data includes both the payouts to SNOs and the cloud provider’s egress charges (1.19€ per TB on Hetzner, $90 per TB on Google).
For example:

  • let’s assume that 45 nodes are marked as unhealthy and all files need to be repaired, so the repair job gets triggered (we still have 35 healthy nodes);
  • the repair job needs 30 pieces to reconstruct each file, so it will download 7.2TB 30 times, i.e. the total payout to healthy nodes is 7.2 * 30 * $10 = $2,160.00;
  • the repair job reconstructs the missing pieces (80 − 35 = 45) for each file and uploads them to the network, i.e. 45 * 7.2 TB = 324 TB. On Google Cloud this would cost $29,160.00; on Hetzner, 385.56€.

The total cost of the repair job is $2,160.00 + $29,160.00 = $31,320.00 (Google) or $2,160.00 + 385.56€ * 1.13 $/€ = $2,595.68 (Hetzner)

Even if we assume that the unhealthy nodes are 7 months old and each has $16.20 held back, that only gives $16.20 * 45 = $729.00.
This is not enough to cover all the costs. And I calculated the worst-case scenario with 0 egress traffic from nodes to customers; if they had egress, the held amount would be higher, but we can’t really count on that.
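The worked example above can be reproduced directly from the stated assumptions (45 of 80 pieces unhealthy, 30 pieces needed per file, 7.2TB of files, and the per-TB rates from the post):

```python
# Assumed numbers, all taken from the worked example in the post.
data_tb = 7.2
pieces_needed = 30            # downloads per file to reconstruct it
pieces_to_rebuild = 80 - 35   # unhealthy pieces to re-upload per file
sno_rate = 10.0               # $/TB paid to nodes for repair downloads

download_cost = data_tb * pieces_needed * sno_rate  # payouts to SNOs
upload_tb = pieces_to_rebuild * data_tb             # re-upload volume, TB

google_rate = 90.0   # $/TB cloud egress (Google, per the post)
hetzner_rate = 1.19  # EUR/TB cloud egress (Hetzner, per the post)
eur_usd = 1.13       # exchange rate used in the post

total_google = download_cost + upload_tb * google_rate
total_hetzner = download_cost + upload_tb * hetzner_rate * eur_usd

print(f"${total_google:,.2f}")   # $31,320.00
print(f"${total_hetzner:,.2f}")  # $2,595.68
```

This matches the totals in the post and makes the dominant term obvious: almost all of the Google-hosted cost is cloud egress for re-uploading the reconstructed pieces.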
