Is there such a thing as an "un-disqualify"?

Pac · July 10, 2020, 10:45pm

But, I thought that maybe it’d be simpler and easier to pull out a node (by suspending it) from the network when anything goes wrong, anything… And if it gets back to normal within a certain amount of time, un-suspend it. Otherwise, disqualify it.

I’m truly having a hard time to see why, in some cases, it should be immediately and definitely killed. ^^’

Alexey · July 10, 2020, 10:46pm

I already said what I think about suspension instead of disqualification:

Node disqualified?

The suspension mode is a good suggestion from the SNO point of view, but a very bad from the satellites’ and customers’ points of view:

your node doesn’t have requested files, but they can be requested by customers and they got an error “file not found”;

it can trigger the repair job, because all pieces on your node will be marked as temporary unhealthy;

if the repair job has been triggered, the payment for repairing would be taken from the satellite’s operator pocket, not from the held amount of the suspended node;

the suspended node will lose the repaired pieces, because they will be stored on more reliable nodes;

the satellite operator will pay for inaccessible data from its pocket until you fix the problem or until a week is gone.

The current implementation of suspension will be used only if your node would answer with unknown error. It’s not so great number of nodes as if were taken into account the “no such file” errors.

It should be extended to work as a slow disqualification:

forbid not only PUT, but GET too (see storj/docs/blueprints/audit-suspend.md at 5786595d141ff8b8c8be21e5e8696dd73e04d08d · storj/storj · GitHub);
take any costs from your held amount to recover missed pieces if repair trigger is fired;
reduce reputation and levels of trust (i.e. - put 25% to held amount, then 50%, then 75% depending on level of loses);
enable vetting again.

The recovered pieces will be removed from such node with a Garbage collector.

Pac · July 10, 2020, 10:48pm

Very educational answer @Alexey, many thanks.

I see why it’s not as simple as I thought.

Well then… I’m not sure what to suggest

Pac · July 10, 2020, 10:53pm

If I understand what you’re suggesting, SNOs should pay for repairs triggered while their nodes are suspended.

That sounds pretty fair to me. I’d rather receive a poor payment one month because my node got temporarily suspended, instead of getting disqualified.

Alexey · July 10, 2020, 10:56pm

Yes, but the held amount could be wiped very fast, because of

And thus your node will fall back to 75% level of earnings to held amount very fast too.
Then we will be forced to enable the vetting process again. So - 5% of potential traffic during the next month or two.
And pieces become deleted too every time, because your node is untrusted and pieces are still marked as lost. And of course they are not paid anymore.
So, this is the same as a disqualification but with slow clean up of the unpaid space.

I’m not sure is it a good suggestion ever.

BrightSilence · July 10, 2020, 11:49pm

Do you mean a single failed audit?

You’d have to fail quite a few audits, it isn’t immediate. Notifications about failed audits could be helpful though.

hoarder · July 11, 2020, 1:04am

Right now repairs cost storj around $9/TB stored. From what I see on my 8TB nodes, the held amount will probably be really close to the repair cost between months 10 and 15. Then it will only cover half of the repair cost. But if I exclude the surge payments early on, it would only cover ~70% being at maximum amount and ~35% after I get half of it back.
It’s even worse in case of nodes that started small and then migrated to a big drive.

So I’m really surprised that people get disqualified because of misconfiguration or other equally simple and minor issue and there’s no way to recover the node after a DQ that happened because of the reason above.

jammerdan · July 11, 2020, 7:48am

That’s what I am thinking. A misconfigured node should not even be able to go online.
Why does the node not check that its data directories are valid and reachable and in the correct path?

This could be either a self test, or an external test, that is run before the server is marked as online for the network.

Pac · July 11, 2020, 8:10am

That’s not what I meant: there should be some leeway: the problem should raise some kind of “error-meter” so the node gets suspended after a certain number of errors, but whatever the root cause of the issue, the result should be suspension, no disqualification.

But @Alexey kind of showed me it would be great from an SNO point of view, but not that simple from a satellite point of view, so… I’m not sure anymore what would be the better approach for everyone to be happy.

It’s a complicated matter.

Yes, being notified would be nice, but still: what is a major issue for me, is the fact that in some situations nodes can be disqualified in a few days (sometimes even quicker), letting no time for SNOs to check what’s going wrong, especially if they do not have their node at hand (if they are in holidays for instance).

Besides, if it is acceptable in some cases to suspend a node for a few weeks so we’ve time to investigate what’s going on before disqualification , I don’t see why it could not be the solution for all kinds of problems.

@Alexey’s answer seems to suggest that it would never be a good thing for the network though (and I see why now, the network could be endangered in the meantime), in which case I really don’t see how to go forward on that matter

It really is a complicated matter.

Alexey · July 11, 2020, 8:19am

The repair cost is $10/TB only to Storage Node Operators, but it is not a full cost. Please, read the explanation from engineer in my post.
The held amount is not enough to cover all costs of repair process.
So, if node fail too many audits there is only one way - mark all pieces on the failed node as unhealthy and use all held amount to cover at least part of the costs. As result - the reputation of the node should be reset to the initial state, i.e. - 75% of earnings should held back, the node must going through vetting process (5% of potential traffic for at least a month), the garbage collector will slowly remove repaired pieces.
I do not see any advantages comparing to starting from scratch.

hoarder · July 11, 2020, 9:18am

No, please read my post again
Repairing customer data costs storj $13.5/TB, this translates to $9/TB of data stored which is the amount that has to be covered by held amount in case of DQ.
My example node is 7.2TB, in the event of DQ repairing its content will cost storj $64.8. The held amount on it right now is $50, and if the traffic pattern doesn’t change, it will increase to $63.3 (almost enough to cover repairs) and then half of that amount will be returned to me after month 15.
Another example node with same amount of space has $44 held. Its held amount will get to around $58 at most, which is not enough to cover the repair cost.
It’s even worse in case of a SNO who started with a little 1TB drive and upgraded to a huge drive later.

So this is why I’m surprised DQ happens easily because of issues like a missing mount or a disconnected usb drive.

Alexey · July 11, 2020, 9:53am

Please, read the Storage Node Operator Terms and Conditions in

This is $10/TB, which Storj Labs should pay to only SNO, when download pieces from storage nodes to the repair job service, then recombine the file and upload all files back to the network, which are charged by the cloud provider of the compute service where is repair job running (all cloud providers charge for egress traffic). So, the total costs of repairing of 7.2TB of data will include the payouts to SNOs and charges for egress traffic of cloud provider (1.19€ per TB (Hetzner) and $90 per TB (Google))
For example:

lets assume that 45 nodes are marked as unhealthy and all files need to be repaired, so repair job got triggered (we still have 35 healthy nodes);
the repair job needed 30 pieces to reconstruct the one file, so it will download 7.2TB 30 times, i.e. total costs of payout to healthy nodes is 7.2 * 30 * $10 = $2,160.00
the repair job reconstruct the missed pieces (80-35=45) for each file and upload them to the network, i.e. 45 * 7.2 TB = 324 TB. In case of Google Cloud it will cost $29,160.00; in case of Hetzner 385.56€

The total cost of repair job is $2,160.00 + $29,160.00 = $31,320.00 (Google) or $2,160.00 + 385.56€ * 1.13 $/€ = $2,595.68 (Hetzner)

Even if we assume that unhealthy nodes have an age of 7 months and have $16.20 in held back, this will give you only $16.20 * 45 = $729.00
This is not enough to cover the all costs. But I calculated the worst scenario with 0 egress traffic from nodes to customers. If they will have egress, then the held amount will be higher, but we can’t really count on that.