Disqualified. Could you please help to figure out why?

Hi all! After about 2 years of honorable service, my node just got disqualified on 4 satellites. I can see some errors in the logs, but I can't distinguish between harmful and non-harmful messages.

The storage is in good condition, no problem on the disk and no problem on the file system.

Looking in the Linux system logs, I noticed a kernel Oops this morning:

Aug  5 09:41:55 localhost kernel: [1006026.390386] docker0: port 1(vethf84dbbb) entered forwarding state
Aug  6 04:08:36 localhost kernel: [1072425.039980] 8<--- cut here ---
Aug  6 04:08:36 localhost kernel: [1072425.039992] Unable to handle kernel paging request at virtual address a37ced93
Aug  6 04:08:36 localhost kernel: [1072425.040000] pgd = 67c1b429
Aug  6 04:08:36 localhost kernel: [1072425.040005] [a37ced93] *pgd=00000000
Aug  6 04:08:36 localhost kernel: [1072425.040017] Internal error: Oops: 5 [#1] PREEMPT SMP ARM

Before that, the log is clean. Is it possible to get disqualified in less than 1 day?

Could I send node logs to someone or post here? Maybe you could suggest a grep statement to filter only relevant lines.

At the moment I just want to understand the cause of the disqualification.

Many thanks

Hi @pietro,

Sorry for your loss. If your node is failing audits by returning bad data, or by accepting an audit request but not responding to it fully and in a timely fashion, then it is possible to be disqualified that quickly.

In the log file you can look for any entries with 'WARN', 'FATAL' or 'ERROR'. I don't use Linux, but these might work?
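Something along these lines, perhaps (the sample log below is just mocked up to show the idea; point grep at your actual log file or use `docker logs` for a containerized node):

```shell
# A mocked-up sample log purely for illustration; replace with your real log.
cat > /tmp/node-sample.log <<'EOF'
2021-08-05T22:42:05.394Z  WARN  contact:service  encountered an error
2021-08-05T22:43:10.001Z  INFO  piecestore  download started
2021-08-05T22:44:00.123Z  ERROR piecestore  download failed
EOF

# Case-insensitive filter for warning/error/fatal entries.
grep -iE 'warn|error|fatal' /tmp/node-sample.log
```

On the sample above this prints only the WARN and ERROR lines, skipping the INFO one.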

This is regarding the audits -

Thanks @Stob for the quick reply. A case-insensitive grep for 'fatal' gave no results at all across all my logs going back to 19 July.

I keep old logs (4 compressed files) rotating them with logrotate every 100M.
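For reference, a rotation like that can be expressed in logrotate roughly as follows (the log path and exact options here are illustrative, not my precise config):

```conf
# /etc/logrotate.d/storagenode (illustrative)
/var/log/storagenode.log {
    size 100M
    rotate 4
    compress
    copytruncate
    missingok
}
```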

The GET_AUDIT messages also show no errors.
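For anyone checking the same thing, this is roughly how I tallied audit outcomes (the entries below are a simplified stand-in; real storagenode lines are JSON-formatted):

```shell
# Simplified stand-in entries; real storagenode lines carry more JSON fields.
cat > /tmp/audit-sample.log <<'EOF'
2021-08-05T10:00:01Z INFO piecestore downloaded {"Action": "GET_AUDIT"}
2021-08-05T11:00:02Z INFO piecestore downloaded {"Action": "GET_AUDIT"}
2021-08-05T12:00:03Z ERROR piecestore download failed {"Action": "GET_AUDIT"}
EOF

# Successful vs failed audit downloads.
grep GET_AUDIT /tmp/audit-sample.log | grep -c downloaded   # -> 2
grep GET_AUDIT /tmp/audit-sample.log | grep -c failed       # -> 1
```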

Any other clue?


I found this in the log:

2021-08-05T22:42:05.394Z	WARN	contact:service	Your node is still considered to be online but encountered an error.	{"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Error": "contact: failed to dial storage node (ID: ****) at address ****:28967 using QUIC: rpc: quic: timeout: no recent network activity"}

(node ID and hostname hidden for privacy)

Could this be the cause? The first such message appeared last night; is half a day of problems enough to get disqualified?

No. For that, your node would have to be online and responding to audit requests (otherwise it would hit the online score, not the audit score), but fail to provide a piece within 5 minutes, and do the same two more times for the same piece. The other alternative is to provide a corrupted piece.
In both cases the audit failure will not be recorded in the storagenode's log.

You could, though, figure that out yourself: if your logs show no audit requests for several hours, that is likely an indication of audits failing with timeouts due to unresponsiveness of your node.
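As a sketch of how to spot such gaps (the log format here is a simplified stand-in; adapt the grep and the timestamp width to your real log):

```shell
# Illustrative sample with an audit gap between 11:00 and 15:00.
cat > /tmp/gap-sample.log <<'EOF'
2021-08-05T10:12:01Z INFO piecestore downloaded {"Action": "GET_AUDIT"}
2021-08-05T11:30:44Z INFO piecestore downloaded {"Action": "GET_AUDIT"}
2021-08-05T15:05:09Z INFO piecestore downloaded {"Action": "GET_AUDIT"}
EOF

# List the distinct hours that saw at least one audit; missing hours
# in between (12-14 here) are the suspicious gaps.
grep GET_AUDIT /tmp/gap-sample.log | cut -c1-13 | sort -u
```

On the sample this prints `2021-08-05T10`, `2021-08-05T11` and `2021-08-05T15`, making the multi-hour gap visible at a glance.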
If you provide your NodeID here, in a DM, or at https://support.storj.io, I can ask the on-call engineer to check the satellite logs (it is time consuming, so do not expect a quick response, especially on the eve of the weekend).


Thanks @Alexey, I just sent you my node ID. I'll wait for however long the engineering staff needs to check the satellite logs.

Thanks again


I have received information about your node

Wow, under 4 hours for a DQ, yet the SNO must otherwise stay online for 30 days for the online/audit score to normalize. Spot the mistake…


It's not related to the online score.
It's related to audits failed due to timeouts. For an audit to fail due to a timeout, your node must be online and respond to the audit request, but fail to submit a snippet (a small part of the piece) for the audit within 5 minutes, across three attempts (it was 13!).

If your node is offline, it cannot answer the audit request at all. Such an audit is considered offline and affects your online score, but not the audit score.

So this is completely unrelated to the requirement to be online for the next 30 days after downtime; that requirement exists only to recover the online score.
To be disqualified for downtime, your node must be offline for more than 30 days.

In my opinion, there is no way to react quickly enough after noticing a problem with the node, because the reaction time needed to avoid the DQ is far too short.

It would be better to give a little more time (days!) for troubleshooting, and to notify operators several times by mail about a possible disqualification.


It would be helpful to trigger a suspension instead; however, that would mean all pieces on that node have to be marked as unhealthy. As a result, if the repair job got triggered, the satellite operator would pay SNOs for repair out of their own pocket.
It also means that these pieces would be removed from the suspended node after repair.
See storj/audit-suspend.md at 6c34ff64adde51c5d819b2800ef1e59f67f75f0e · storj/storj · GitHub

Repair (and paying for it yourself) vs. starting from scratch!
I would prefer repair, and I still think it's better than a complete disqualification!

Yes, but you would pay too much. See

You could make a feature request there: Storage Node feature requests - voting - Storj Community Forum (official)


But it only took 4 hours to get disqualified. This is what counts.

I am with @maxsch here: the reaction to disqualify is too quick, and the time given to avoid disqualification is way too short.

This is not fair treatment and shows no respect for your SNOs. How does it feel to go to bed at night with a perfectly running node and wake up in the morning with the node disqualified, without a chance to fix anything or revoke the ban?

I don't see why the pieces should be marked as unhealthy. The normal SNO reaction when audits are failing would be to take the node offline, fix the issues, and put the node back online. Doing that would not result in marking all pieces unhealthy, would it? So when I do that manually and the pieces remain healthy, there is no reason the pieces should become unhealthy when this happens automatically. It is the same procedure.

What needs to be done is to put the node into an offline/suspended state when audits are failing and give the node operator time to fix the issues, at least 7 days. When the problems have been fixed, audit the pieces that previously failed, and the node could be back online.


Because they are unavailable: the node cannot return a part of the piece within 5 minutes.
This health status would be updated once your node left the yellow zone.

Returning “file is not available, sorry” to the customer is not an option.

I agree, but not for all types of failure. For example, "file not found" cannot be fixed on the SNO side.

Please, create a feature request there: Storage Node feature requests - voting - Storj Community Forum (official)

That's exactly my thought, and that's exactly what happened to me. One can be offline for 30 days without being disqualified, but a temporary problem lasting a few hours leads to losing the node.

Moreover, there are no critical ERROR messages in my logs related to the disqualification; at least, Storj wasn't able to tell me which lines of my logs are related to the problem. I can only see WARN messages. The storage is fine: before restarting the node after the problem I ran fsck, which reported no errors at all, and every single piece on my CMR Seagate Ironwolf is healthy.

So the question is: how could I have noticed the problem in time, not only to avoid the disqualification but, more importantly, to fix the problem in order to provide better service to the Storj network?

It seems that Storj prefers to disqualify nodes instead of giving SNOs the opportunity to cure them.

As I said when I opened the thread, in two years I invested money and time. I was (and still am) very excited about the project, but I'm now very disappointed about how Storj treats its most loyal operators.

At the very least, in cases similar to mine, they should evaluate the problem and decide to revoke the ban if even Storj staff in my shoes would have been disqualified as well. Put another way: if the SNO's conduct is in line with Storj's required quality level, and if he or she could not reasonably have avoided the problem.

So what should I do now, after losing 4 out of 6 satellites? Start a fresh identity, which means going through months of held amount, vetting, reputation, etc. again, or shut down the node immediately and reuse the hardware for a good NAS?


We have to disqualify the unreliable node, or somehow remove it from the node selection process to deal with it later, so as not to affect customers. All pieces on unreliable nodes must be marked as unhealthy to trigger repair early, before the number of available pieces falls below the threshold.
There is no alternative: losing customers would lead to the failure of the network. No customers, no nodes; it's as simple as that. So a failing node must be isolated ASAP. Hence the short time interval.

At the moment, suspension is implemented for unknown errors: errors which are not "file not found", not timeouts, and not piece corruption.
This suspension for unknown errors exists to figure out what those errors could be, so their detection can be added to the pre-flight check and the storage monitor function. After adding detection for all remaining classes of unknown errors, this suspension should rarely be triggered in the future.

The timeout error is still the issue: the hardware or OS becomes unresponsive for some reason (usually out of RAM, out of space on the system drive, a dying HDD, RAM corruption, etc.), so the node responds to the audit request (so it's online) but cannot provide a piece for the audit (because the underlying OS functions perform too slowly or hang). This shows that the node is not reliable, and thus it should be disqualified ASAP if too many audits fail, so as not to affect customers and data. If you allow such a node to survive too long without marking its pieces as unhealthy, you can easily end up in a situation where there are not enough pieces left to recover the file.

Suspension can protect such a node from disqualification and gives a grace period to fix the issue before the actual disqualification. The node is also treated as unhealthy by the repair service. So almost all downsides of disqualification are included, except using the held amount to pay for recovery if the number of healthy pieces falls below the threshold.

Unknown errors are rare, so the monetary losses for the satellite operator are relatively small. Timeout errors are a much more frequent issue, followed by "file not found" and "corrupted" issues.

Perhaps, to enable suspension for audit failures due to timeouts, we would have to implement usage of the held amount. The price of recovery is high, and the node's held amount may not be enough to cover the costs. As a result your node would be suspended, lose all of its held amount, have its reputation zeroed (to force it to start collecting held amount again from the 75% level), and have its data slowly deleted, all while it keeps consuming valuable space.
Do you still think that's better than a disqualification, after which you can start from scratch?

If so, please make a feature request to allow the Team to take it into consideration. And please put yourself in the satellite operator's and the customers' shoes.
Or even better, make a pull request on GitHub.


Alexey, many thanks for your kind reply. I completely agree with the statement above, but my point is that Storj didn't give me the opportunity to fix the issue. The node was not suspended, I didn't receive any email like "Your node has a problem, please fix it, otherwise it will be disqualified", I didn't notice anything wrong in my log files, and I didn't see any message in the dashboard until it was too late.

So I'm asking you: what was I supposed to do to avoid the problem? The board was responsive, I could connect via ssh, and my remote monitoring was still able to connect to the service, so it didn't send me any alert. Perhaps it was a kernel bug, so not my fault, but again: since Storj knew perfectly well that I was failing audits, why was this information not propagated to me?

Trust me, I'm not trying to be polemical. I just want to understand what I (and other SNOs) should do in the future to avoid similar problems, and to give you suggestions to increase the quality of the service.

Because if Storj had warned me in time, I would have fixed the problem in time, and the network would have benefited from better quality.

EDIT: it could also be a bug in the Storj node software, who knows? Storj cannot be 100% sure it's not a software bug, and neither can I be sure it is one; it's just a hypothesis. But in that case, would I have deserved a disqualification?


I know, but it's unlikely to change unless there is a feature request or a pull request on GitHub.

Honestly, I do not know; I have not found a solution so far. The best option would be to implement monitoring, I suppose. We have a whole tag for that: Topics tagged monitoring

However, I do not know how it could help, even if it detects that something is wrong with your node. There would need to be some action involved, like terminating the storagenode process or shutting down the PC, and I have no idea how to implement this.
If the node itself could work normally during that time, it would be possible to detect the problem and shut itself down, as we do when your disk disappears, but that's not the case here. It cannot even log the audit attempt, let alone do something smart. So it would have to be something external, like blocking network activity on your router when your node hangs, for example.
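As a very rough sketch of such an external check: the real check and restart commands (shown in the comments) are assumptions about your setup, and they are stubbed out here so only the control flow is demonstrated.

```shell
#!/bin/sh
# Hypothetical external watchdog sketch. In a real setup CHECK_CMD might be
#   curl -fsS --max-time 10 http://localhost:14002/api/sno
# (assuming the default dashboard port) and RESTART_CMD something like
#   docker restart storagenode
# Both are stubbed here so the logic can be shown safely.
CHECK_CMD="false"                          # simulate an unresponsive node
RESTART_CMD="echo restarting storagenode"  # stub for the recovery action

# If the health check fails, trigger the recovery action.
if ! $CHECK_CMD >/dev/null 2>&1; then
    $RESTART_CMD
fi
```

With the stubs above, the script prints `restarting storagenode`; cron or a systemd timer could run such a check every few minutes.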

I submitted a feature request as suggested (please vote!):

I think I will start again with a new identity, but it's very, very frustrating.
