[MAIN NODE] Your node has been disqualified

Just suspend the node at a 0.7 audit score and give the SNO a chance to fix the problem.
After the node leaves the suspension state, audits resume and the audit score either continues down to 0.6 and DQ, or goes back up and recovers to 1.

They have not failed audits with a “file not found” error, thus there is no reason to mark them as failed.
When offline, those nodes are not selected for downloads or uploads and are treated as unhealthy by the checker. So they lose pieces as well, but at the moment the payment for repair comes out of the satellite operator’s pocket, because the held amount is used only from disqualified nodes.
That will change with disqualification for downtime.

The Labs have no chance of correctly processing storage errors, primarily because they know nothing about these storage backends.
“File not found” means the file is not available at the moment (a SAS/SATA adapter died, a SAS cable was unplugged, a USB enclosure died), just as a non-response from a node means the node is not available at the moment. There is no difference.

In the meantime, everything is so unreliable that many SNOs perform a Graceful Exit once a sensitive amount accumulates in held back.


Which is the right course of action if data is actually lost. There is nothing you can repair if that is the case and suspension would be dangerous for file availability and expensive for the satellite operator.

“Fix” implies the problem is fixed, not that the data is intact. That would really depend on what caused this issue to begin with.

Which problem would that be, as it is still completely unclear what the problem with this node was?

I don’t disagree there. Checks need to happen during runtime as well and the node should be taken offline if the data location becomes unavailable. I even wrote a feature suggestion for that.

But without more information I can’t know whether that was the problem with this node.


Exactly, which is why the satellite has to assume the worst in order to make sure the data is safe. I honestly think that the examples you are mentioning would all be fixed with simple checks on availability of the storage location. Keep in mind though, it would still take your node offline and not fixing that in time would still get your node disqualified. But it would buy you time to fix things.


It’s not enough. That should not be the suspension state, it should be a different one, because it must be almost equivalent to disqualification: all GET and PUT are forbidden except GET_AUDIT, the held amount can be used for repair, and all pieces are marked as unhealthy.

After resuming, vetting should be enabled again (to prevent mass losses if the problem is actually not fixed and the audit service just happened to select pieces from the unbroken part of the disk).
If the held amount is completely used up, the node must start again from the 75% held-back level.

So, it’s much simpler to crash the node if storage is unavailable.

This is why pro SNOs run their nodes with a ton of scripts on top of them. And there is always a place for another script in response to a problem that Labs does not want to solve.
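A minimal sketch of one such script, assuming a Docker-based node and the `storage-dir-verification` file that the node keeps in its storage directory; the mount path and container name are placeholders you would adjust to your own setup:

```shell
#!/bin/sh
# Hypothetical watchdog: stop the node if the storage directory no
# longer looks mounted, so audits fail as "offline" rather than
# "file not found". STORAGE_DIR and the container name are assumptions.
STORAGE_DIR="${STORAGE_DIR:-/mnt/storj/storagenode/storage}"
SENTINEL="$STORAGE_DIR/storage-dir-verification"

if [ ! -f "$SENTINEL" ]; then
    echo "storage at $STORAGE_DIR looks unavailable, stopping node"
    docker stop storagenode
fi
```

Run from cron every minute or so; the check itself is cheap and only the failure path touches Docker.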

I believe they are actually solving this one.

The software is open source, so anyone with the required knowledge could just fork the project and make the changes required for the storagenode to crash if it can’t find the requested data (or after a few consecutive failed audits). You could even make it not respond to audits when the data is missing, so the satellite counts it as an unknown audit error instead of an audit error counting towards DQ.
But at the moment I think Storj Labs is reasonable enough to acknowledge our problems, and they try to find good solutions. So hopefully it’s just a matter of time.

I agree with @kevink.
If someone wants to solve the problem right now, they could use https://docs.docker.com/engine/reference/builder/#healthcheck for example
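For instance, a health check can be attached at `docker run` time rather than in a Dockerfile; the mount path below is an assumption for illustration, and the usual mounts and environment flags for a storagenode are omitted for brevity:

```shell
# Hypothetical example: run the storagenode container with a health
# check that fails when the storage directory becomes unreadable.
# The path inside the container is an assumption; adjust to your setup.
docker run -d --name storagenode \
    --health-cmd "test -f /app/config/storage/storage-dir-verification" \
    --health-interval 1m \
    --health-retries 3 \
    storjlabs/storagenode:latest
```

Note that Docker by itself only marks the container as unhealthy; you would still need something acting on that status (for example an autoheal-style companion container or a cron job) to actually stop or restart the node.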

I have now deleted all my nodes and started the setup process of my new nodes.
My new main node is now up and running, but now I have 3 questions.

  1. First, let’s talk about settings: in the node installer I got no option to choose bandwidth as was possible earlier. The YML file shows 0 B; is 0 B just a default (no max) setting?

# total allocated bandwidth in bytes (deprecated)
storage.allocated-bandwidth: 0 B

  2. I entered 8.19 TB (10% less than the total space) during the installation of the node, but in the YML config I can see it’s only 7.4 TiB (is that because of the automatic 10%?)

# total allocated disk space in bytes
storage.allocated-disk-space: 7.4 TiB

  3. Does STORJ keep any log data on my IP? Now that my nodes got disqualified, will I be punished for that on my new nodes?

What was the reason for the DQ?

As far as I know, there are no IP punishments, only DQ by node ID. You can have more than one node on this IP; the others will be OK, no punishments.

The reason for my DQ was that I was fixing a bug on the server that runs my nodes and had to change out some parts. This took longer than expected, so my node had downtime from 05.07 to 10.07.

To be honest, I did not think about my node getting disqualified, as my mind was all on fixing the bug and getting things up and running again.

As far as I know, there are no IP punishments, only DQ by node ID. You can have more than one node on this IP; the others will be OK, no punishments.

Okay, so it’s 100% dependent on the node ID. Good!

Yes, DQ is only by node ID, so other nodes behind the same IP are not affected. However, you cannot currently be disqualified for downtime, only for failing audits. In fact, all errors other than missing data lead to suspension instead of disqualification. So this means your node must have been either missing data or unable to access the data. If you still have the logs for this node, you can search them for lines containing GET_AUDIT and failed. Those should show the exact errors.
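For a Docker-based node, the failed audits can be filtered out of the log with a simple pipeline; the container name below is an assumption, and for a binary install you would grep the log file instead:

```shell
# List the audit failures recorded in the node's log.
# "storagenode" is the assumed container name; adjust to your setup.
docker logs storagenode 2>&1 | grep GET_AUDIT | grep failed
```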

This setting is deprecated. You can no longer set a bandwidth limit. Limits were never enforced and instead of implementing it they surveyed SNOs to see whether there was demand for this setting and there basically wasn’t. So it’s been removed.

8.19 TB is roughly equal to 7.4 TiB. It’s just the difference between decimal and binary notation. I think you meant to set 8.19 TiB or 9 TB, assuming you have a 10 TB HDD.
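The conversion is easy to check: decimal units use powers of 1000, binary units powers of 1024, so 8.19 TB is 8.19 × 10¹² bytes divided by 1024⁴ bytes per TiB:

```shell
# 8.19 TB (decimal) expressed in TiB (binary): divide the byte count
# by 1024^4.
awk 'BEGIN { printf "%.2f TiB\n", 8.19e12 / (1024 ^ 4) }'
# prints "7.45 TiB"
```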

The check for drive unmounts has been implemented in v1.11.1.


This is great news!!!

@Alexey, quick question.
I’ve been running my new nodes for 3 months now after the disqualification mentioned earlier in this post, and I compared the storage downloaded over those 3 months against the first 3 months of my old node. My new nodes are not even close to the results my old node had in its first 3 months. What can be the reason for this?

After 3 months my old node was at 2 TB; over the same period, my new node only got 369.94 GB.

The space and bandwidth are used by real people, not machines, so you can’t use past usage to predict future usage.
Last year and the beginning of this year saw a lot of surge pricing and tests. Now there is regular usage with a growing number of nodes on the network.

