[MAIN NODE] Your node has been disqualified

ARA · July 11, 2020, 11:47am

My main node got disqualified after downtime (05.07 - 10.07).

Is there a way I can get my node back up again?
I had some issues with my server that took a few days to solve, and now I am disqualified…

Alexey · July 11, 2020, 11:53am

Unfortunately no. The disqualification is permanent.
Please search for the reason for the failed audits (PowerShell):

sls "GET_AUDIT" "C:\Program Files\Storj\Storage Node\storagenode.log" | sls "failed" | sls -NotMatch usedserialdb
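
For anyone not on PowerShell, the same three-stage filter can be sketched in Python. This is only an illustration of what the pipeline above does (keep lines containing GET_AUDIT and "failed", drop lines mentioning usedserialdb); the function names are made up for the example and the log path is whatever your own setup uses.

```python
from typing import Iterable, Iterator


def failed_audit_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines that mention a failed GET_AUDIT, skipping usedserialdb entries."""
    for line in lines:
        # Select-String matches case-insensitively by default, so compare in lower case.
        lowered = line.lower()
        if (
            "get_audit" in lowered
            and "failed" in lowered
            and "usedserialdb" not in lowered
        ):
            yield line.rstrip("\n")


def print_failed_audits(log_path: str) -> None:
    """Print every matching line from the log file at log_path (path supplied by the caller)."""
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for hit in failed_audit_lines(log):
            print(hit)
```

Calling `print_failed_audits(...)` with the path to your storagenode.log should list the same lines as the PowerShell command.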

ARA · July 11, 2020, 11:56am

Okay, sad.
(I know the reason for the failed audits: I had an issue on my server that took a couple of days to fix.)

So the “Total Held Amount” will be lost and I have to start a new node?

Alexey · July 11, 2020, 11:57am

Yes. The Total Held Amount will be used to pay SNOs for repairing the lost data.

ARA · July 11, 2020, 11:58am

puff 9 months gone…

anyway thanks.

Alexey · July 11, 2020, 11:59am

You are welcome!
Not gone, you have received the Total Earned

ARA · July 11, 2020, 12:06pm

Yes, but my goal was to expand with the held amount.

But there is nothing we can do. I'll have an emo day and then think over the lessons learned.

Now that my main node is down (it had the minority of the traffic), will the rest of my nodes get more traffic? I remember reading that it was based on the IP.
As for the disk that still holds data from the disqualified node, should I format it before setting up a new node?

TheMightyGreek · July 11, 2020, 12:14pm

The rest of your nodes should get more ingress. Egress should also increase because they get selected more often, but it also depends on the amount of data you have on them.

Alexey · July 11, 2020, 12:22pm

Yes, it’s better to do that. It could reveal bad blocks too.

anon68609175 · July 11, 2020, 12:29pm

That’s why all my nodes do GE on Stephan’s satellite. The main part ($150-180) of your held amount was from Stephan’s satellite.

ARA · July 11, 2020, 12:35pm

GE?

Krey · July 11, 2020, 12:46pm

GE = Graceful Exit

This is a problem with Storj, not with SNOs.
If a node has not actually lost data, it must be switched to a suspended state before DQ.
The SNO must have some days to fix the node.

As an example, one of my servers lost its SAS adapter. The node is alive but the storage is not, so it fails audits. All the data is intact; all I need to do is replace the adapter.

BrightSilence · July 11, 2020, 1:12pm

What evidence are you basing this on? I don’t see any logs or indication to know what went wrong with this node. If files got lost, that’s not something that can be fixed. Let’s not jump to conclusions without additional information.

kevink · July 11, 2020, 1:13pm

This is not true. For egress the data needs to be on a node. You can’t select a node “more often” if it doesn’t have the requested pieces. A piece is only stored once behind a single IP.
Egress will not increase if you suddenly have one node less.

Alexey · July 11, 2020, 1:19pm

The result will be the same as disqualification, but with a full disk of unpaid data.

Alexey · July 11, 2020, 1:31pm

A simpler fix for configuration errors is to prevent the node from starting (or from continuing to work) if the storage is unavailable (or becomes unavailable during operation).
This has been taken into development, but I do not have details of the implementation.

For me the simplest solution would be to separate setup and run, i.e. the node will not try to create a storage location during a normal run. It should be created only during setup.
If the storage location is empty, the node must just crash.
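
A minimal sketch of that split, not the actual storagenode code: the function names, the exception, and the marker file are all hypothetical. The marker is an extra assumption added so that a freshly set up location is not itself empty.

```python
from pathlib import Path

MARKER_NAME = "setup-marker"  # hypothetical file name, written only by setup


class StorageUnavailableError(RuntimeError):
    """Raised when the storage location is missing or empty at run time."""


def setup_storage(storage_dir: Path) -> None:
    """Setup step: the only place where the storage location may be created."""
    storage_dir.mkdir(parents=True, exist_ok=True)
    (storage_dir / MARKER_NAME).touch()


def check_storage_for_run(storage_dir: Path) -> None:
    """Run step: never create anything, refuse to start on a missing or empty location."""
    if not storage_dir.is_dir():
        raise StorageUnavailableError(f"storage location not found: {storage_dir}")
    if not any(storage_dir.iterdir()):
        raise StorageUnavailableError(f"storage location is empty: {storage_dir}")
```

With this split, an unmounted disk shows up as a missing or empty directory and the run step stops immediately, instead of quietly creating a new empty store and failing audits.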

Krey · July 11, 2020, 1:36pm

My evidence is the DQ without suspension, and the author’s text.

The word “fix” can be interpreted as “the problem is fixed and the data is intact”.

I have written about this problem many times. Storj Labs ignores it, and as a result SNOs think that the Labs just needs a reason not to pay out held amounts.

Checks at startup are not enough. They don’t even solve configuration problems.

Alexey · July 11, 2020, 1:59pm

Who will pay for recovery while the pieces are unavailable?

In the current implementation of suspension, the pieces are still available; only the audit returns an unknown error (otherwise it would return a “file not found” error and be marked as failed).
So, in the current implementation of suspension mode, the node can still be used to retrieve data and is not counted as lost by the checker service; it is only excluded from uploads of any kind.
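
To make the distinction concrete, here is a rough sketch of the rules as described in this post. It is not satellite code, and every name in it is invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class AuditOutcome(Enum):
    SUCCESS = "success"
    FILE_NOT_FOUND = "file not found"
    UNKNOWN_ERROR = "unknown error"


class AuditEffect(Enum):
    NONE = "none"
    COUNTS_AS_FAILED_AUDIT = "counts as failed audit"
    COUNTS_TOWARD_SUSPENSION = "counts toward suspension"


def effect_of(outcome: AuditOutcome) -> AuditEffect:
    """Map an audit outcome to its effect, following the description above."""
    if outcome is AuditOutcome.FILE_NOT_FOUND:
        # The piece is really missing, so the audit is marked as failed.
        return AuditEffect.COUNTS_AS_FAILED_AUDIT
    if outcome is AuditOutcome.UNKNOWN_ERROR:
        # The piece may still be there, so this leads to suspension instead.
        return AuditEffect.COUNTS_TOWARD_SUSPENSION
    return AuditEffect.NONE


@dataclass(frozen=True)
class SuspendedNodeTreatment:
    """How a suspended node is treated in the current implementation, per this post."""

    receives_uploads: bool = False       # excluded from uploads of any kind
    serves_downloads: bool = True        # can still be used to retrieve data
    pieces_counted_as_lost: bool = False  # checker service does not count them as lost
```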

If “file not found” were treated as a reason for suspension, it would put the data in danger of being completely lost, because the pieces on such a node are not really available for downloads.

So it should be a different kind of suspension. It should not allow any GET, and all pieces must be marked as lost.
The checker service must treat them as lost too, so the repair service could be triggered at the threshold. As a result, the costs of repair should be taken from the held amount of such a node.
Since the costs of repair are high, the whole held amount will be used up after a few trigger events. Thus the node will start again from the 75% level of held back.

The exit from suspension would be successful audits. However, since there is no way for the satellite to know whether the problem is fixed or not, the node should start again from vetting.
The node is an untrusted entity from the beginning and can’t be trusted on its word, especially after a failure. Thus pieces marked as “temporarily lost” will remain marked as lost until they are audited.

While the audit is being performed, the repair trigger doesn’t sleep, so the node will continue to lose repaired pieces (they are now on other, healthy nodes). The space on that node will remain used by those temporarily unhealthy pieces until they are audited.
The repaired pieces will be removed by the garbage collector (i.e. slowly).
The audit of the whole node could take months. All that time the node will lose pieces but still keep the unpaid data.

Why is this better than disqualification?

Krey · July 11, 2020, 2:05pm

If the author of the post had detected the problem in time and simply turned off the node before solving the problem, its data would also have been inaccessible, but the node would not have been disqualified.

Did I answer your text? Do you already conduct a full audit of nodes after they come back from downtime? Of course not!

TheMightyGreek · July 11, 2020, 2:06pm

You’re right, my reasoning was wrong. I didn’t think it all the way through…