Disqualified. Could you please help me figure out why?

Thanks @Alexey , I just sent you my node ID. I’ll wait for the time the engineering staff needs to check the satellite logs.

Thanks again
Pietro

1 Like

I have received information about your node

Wow, under 4 hours for a DQ, yet the SNO is otherwise expected to stay online for 30 days so that its online/audit score normalizes. Spot the mistake…

3 Likes

It’s not related to online score.
It’s related to failed audits due to timeouts. For an audit to fail because of a timeout, your node must be online and respond to the audit request, but then fail to return a snippet (a small part of the piece) within 5 minutes, over three attempts (it was 13!).

If your node is offline, it cannot answer the audit request. Such an audit is counted as offline and affects your online score, but not the audit score.

So this is completely unrelated to the requirement to stay online for the next 30 days after downtime; that requirement only exists to recover the online score.
To be disqualified for downtime, your node must be offline for more than 30 days.
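If you want to check your own log for this kind of failure, here is a minimal sketch in Python, assuming the usual storagenode log format where audit transfers are tagged with GET_AUDIT; the log path and the keywords are assumptions you should adjust to your setup:

```python
# Minimal sketch: count audit transfers in a storagenode log and flag the ones
# that look like timeouts or failures. Path and keywords are assumptions.
import sys

LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "node.log"
SUSPICIOUS = ("download failed", "canceled", "timeout")

total = flagged = 0
with open(LOG_PATH, errors="replace") as log:
    for line in log:
        if "GET_AUDIT" not in line:
            continue
        total += 1
        if any(word in line for word in SUSPICIOUS):
            flagged += 1
            print(line.rstrip())

print(f"GET_AUDIT lines: {total}, suspicious: {flagged}")
```

It will not tell you why the node was slow, but a growing number of flagged lines is an early warning that audits are at risk.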

In my opinion, there is no way to react quickly enough after noticing a problem with the node; the time window to avoid the DQ is far too short.

It would be better to allow a little more time (days!) for troubleshooting and to notify the operator several times by mail about a possible upcoming disqualification.

1 Like

It would be helpful to trigger a suspension instead; however, that would mean that all pieces on the node must be marked as unhealthy. As a result, if the repair job gets triggered, the satellite operator would pay SNOs for the repair out of their own pocket.
It also means that these pieces would be removed from the suspended node after repair.
See storj/audit-suspend.md at 6c34ff64adde51c5d819b2800ef1e59f67f75f0e · storj/storj · GitHub

Repair (even if I pay for it myself) vs. starting from scratch!
I would prefer repair, and I still think it’s better than a complete disqualification!

Yes, but then you would pay too much. See

You could make a feature request there: Storage Node feature requests - voting - Storj Community Forum (official)

2 Likes

But it only took 4 hours to get disqualified. That is what counts.

I am with @maxsch here: the disqualification happens too quickly, and the time to avoid it is way too short.

This is neither fair treatment nor respectful of your SNOs. How do you think it feels to go to bed at night with a perfectly running node and wake up in the morning with it disqualified, without any chance to fix the issue or revoke the ban?

I don’t see why the pieces should be marked as unhealthy. The normal SNO reaction when audits are failing would be to take the node offline, fix the issues, and bring it back online. Doing that would not result in all pieces being marked unhealthy, would it? So if the pieces remain healthy when I do that manually, there is no reason they should become unhealthy when the same thing happens automatically. It is the same procedure.

What needs to happen is that the node is put into an offline/suspended state when audits are failing, and the operator is given time to fix the issues, at least 7 days. Once the problems have been fixed, re-audit the pieces that previously failed and the node can come back online.

4 Likes

Because they are unavailable: the node cannot return a part of the piece within 5 minutes.
The health status would be updated once your node got out of the yellow zone.

Returning “file is not available, sorry” to the customer is not an option.

I agree, but not for all types of failure. For example, “file not found” cannot be fixed on the SNO side.

Please, create a feature request there: Storage Node feature requests - voting - Storj Community Forum (official)

That’s exactly my thought, and that’s exactly what happened to me. One could be offline for 30 days without being disqualified, yet a temporary problem lasting a few hours leads to losing the node.

Moreover, there are no critical ERROR messages in my logs related to the disqualification; at least, Storj wasn’t able to tell me which lines of my logs relate to the problem, and I can only see WARN messages. The storage is fine: before restarting the node after the problem, I performed an fsck which reported no errors at all, and every single piece on my CMR Seagate IronWolf is healthy.

So the question is: how could I have noticed the problem in time, not only to avoid the disqualification, but more importantly to fix the problem and provide better service to the Storj network?

It seems that Storj prefers to disqualify nodes instead of giving SNOs the opportunity to cure them.

As I said when I opened the thread, over two years I have invested money and time; I was (and still am) very excited about the project, but I’m now very disappointed by how Storj treats its most loyal operators.

At the very least, in cases similar to mine, they should evaluate the problem and revoke the ban if even Storj staff in my shoes would have been disqualified as well. Put another way: if the SNO’s conduct is in line with Storj’s required quality level and he or she could not reasonably have avoided the problem.

So what should I do now, after losing 4 out of 6 satellites? Start with a fresh identity, which means going through months of held amount, vetting, reputation, etc. again, or shut down the node immediately and reuse the hardware for a good NAS?

3 Likes

We must disqualify the unreliable node, or somehow remove it from the node selection process and deal with it later, so that customers are not affected. All pieces on unreliable nodes must be marked as unhealthy to trigger repair early, before the number of available pieces falls below the threshold.
There is no alternative: losing customers will make the network fail. No customers, no nodes; it’s as simple as that. So a failing node must be isolated ASAP, hence the short time interval.

At the moment, suspension is implemented for unknown errors, i.e. errors that are not “file not found”, not timeouts, and not piece corruption.
This suspension for unknown errors exists to figure out what those errors could be, so their detection can be added to the pre-flight check and the storage monitor function. After detection has been added for all remaining classes of unknown errors, this suspension should not be triggered very often in the future.

The timeout error is still the issue: the hardware or OS becomes unresponsive for some reason (usually out of RAM, out of space on the system drive, a dying HDD, RAM corruption, etc.), so the node responds to the audit request (it’s online) but cannot provide a piece for the audit (because the underlying OS functions are too slow or hang). This shows that the node is not reliable, so it should be disqualified ASAP if too many audits fail, to avoid affecting customers and data. If you let it survive too long without marking its pieces as unhealthy, you can easily end up in a situation where there are not enough pieces left to recover.
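One rough way to catch this condition from the SNO side is a small probe that times a read from the storage location and warns when it is suspiciously slow. This is only a sketch: the path and the threshold are assumptions, and a truly hung filesystem can block the probe itself, so it should be run from cron or a systemd timer with its own timeout:

```python
# Sketch of a storage-responsiveness probe: time a small read from the node's
# blobs directory and warn if it takes too long. STORAGE_DIR and the threshold
# are assumptions; adjust them to your setup.
import os
import random
import time

STORAGE_DIR = "/mnt/storj/storage/blobs"
THRESHOLD_SECONDS = 5.0

def probe_once() -> float:
    # Collect a handful of stored files and time a 4 KiB read of a random one.
    candidates = []
    for root, _dirs, files in os.walk(STORAGE_DIR):
        candidates.extend(os.path.join(root, name) for name in files[:5])
        if len(candidates) >= 50:
            break
    if not candidates:
        raise RuntimeError("no files found under " + STORAGE_DIR)
    start = time.monotonic()
    with open(random.choice(candidates), "rb") as piece:
        piece.read(4096)
    return time.monotonic() - start

elapsed = probe_once()
if elapsed > THRESHOLD_SECONDS:
    print(f"WARNING: read took {elapsed:.1f}s - storage may be unresponsive")
else:
    print(f"OK: read took {elapsed:.3f}s")
```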

Suspension can protect such a node from disqualification and gives a grace period to fix the issue before the actual disqualification. However, the node is also treated as unhealthy by the repair service. So almost all downsides of disqualification are included, except using the held amount to cover recovery if the number of healthy pieces falls below the threshold.

Unknown errors are rare, so the monetary losses for the satellite operator are relatively small. Timeout errors are a much more frequent issue, followed by “file not found” and “corrupted” issues.

Perhaps, to enable suspension for audit failures due to timeouts, we would have to make use of the held amount. The price of recovery is high, and the node’s held amount might not be enough to cover the costs. As a result, your node would be suspended, lose all of its held amount, have its reputation zeroed (forcing it to start collecting held amount again from the 75% level), and have its data slowly deleted, while still consuming valuable space as long as the data is there.
Do you still think that is better than a disqualification, after which you can start from scratch?

If so, please make a feature request so the Team can take it into consideration. Please put yourself into the satellite operator’s and the customers’ shoes.
Or even better, make a pull request on GitHub.

1 Like

Alexey, many thanks for your kind reply. I completely agree with the statement above, but my point is that Storj didn’t give me the opportunity to fix the issue. The node was not suspended, I didn’t receive any email like “Your node has a problem, please fix it or it will be disqualified”, I didn’t notice anything wrong in my log files, and I didn’t see any message in the dashboard until it was too late.

So I’m asking you: what was I supposed to do to avoid the problem? The board was responsive, I could connect via ssh, and my remote monitoring was still able to connect to the service and therefore didn’t send me any alert. Perhaps it was a kernel bug, so not my fault, but again: since Storj knew perfectly well that I was failing audits, why was this information not propagated to me?

Trust me, I’m not being polemical; I just want to understand what I (and other SNOs) should do in the future to avoid similar problems, and to give you suggestions to improve the quality of the service.

Because if Storj had warned me in time, I would have fixed the problem in time, and the network would have benefited from better quality.

EDIT: it could also be a Storj node software bug, who knows? Storj cannot be 100% sure it’s not a software bug, and neither can I be sure that it is; it’s just a hypothesis. But in that case, would I have deserved a disqualification?

4 Likes

I know, but it’s unlikely to change if there is no feature request or pull request on GitHub.

Honestly, I do not know; I have not found a solution so far. The best option would be to implement monitoring, I suppose. We have a whole tag for that: Topics tagged monitoring

However, I do not know how much it would help, even if it detects that something is wrong with your node. Some action would have to follow, like terminating the storagenode process or shutting down the PC, and I have no idea how to implement that.
If the node itself could work normally during that time, it would be possible to detect the problem and shut itself down, as we do when your disk disappears, but that is not the case here. The node cannot even log the audit attempt, let alone do something smart. So it should be something external, for example blocking the node’s network activity on your router when the node hangs.
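For example (just a sketch, not an official tool): a small script on the host or on another machine could poll the node’s local dashboard API and stop the container when it does not answer, so that further audits count as offline rather than failed. The endpoint, container name, and timeout below are assumptions for a typical Docker setup, and a responding dashboard does not prove the storage is healthy, so combining it with a storage read probe would be better:

```python
# External watchdog sketch: poll the node's local dashboard API and stop the
# container if it does not answer in time. URL, container name and timeout are
# assumptions; run it from cron on the host or another machine.
import subprocess
import urllib.error
import urllib.request

DASHBOARD_URL = "http://localhost:14002/api/sno"
CONTAINER_NAME = "storagenode"
TIMEOUT_SECONDS = 30

def node_responds() -> bool:
    try:
        with urllib.request.urlopen(DASHBOARD_URL, timeout=TIMEOUT_SECONDS) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if node_responds():
    print("node is responding")
else:
    print("node did not answer - stopping the container")
    subprocess.run(["docker", "stop", CONTAINER_NAME], check=False)
```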

I submitted a feature request as suggested (please vote!):

I think I will start again with a new identity, but it’s very, very frustrating.

1 Like

If your node is disqualified on only one satellite, you can keep it running; all remaining satellites will still pay for the service.
Starting from scratch is only required when your node is disqualified on all satellites.

In theory, yes; however, with the current utilization of the Storj DCS network, if 75% of node earnings come from only 2 satellites (Saltlake and Europe-North) and you get banned from those, then starting all over is the most likely option.

2 Likes

Yes, exactly. I got disqualified on Europe-North, which was giving me 2/3 of my income. Saltlake is still working, but staying only with it (and us2, which is fairly irrelevant) is not worth it. Maybe I can start a second node in parallel and shut down the first one after a while.

If you keep it running, you can hide the DQ’d satellites from the dashboard (with storage2.trust.exclusions):

1 Like

Done! Many thanks! :slightly_smiling_face: