If a disk disconnects while a node is running and audits happen during that time, is this still considered missing pieces, and could it lead to very fast disqualification?

I’m asking because I recently had all my disks disconnected at the same time for roughly 6 to 8 hours before I noticed something was wrong and was able to reboot the whole thing and get back to normal.
During this time, almost all of my nodes running on these disks got suspended, with suspension scores sometimes very low (below 30%), but the “Audit” scores stayed at 100% on all of them.

That was a very pleasant surprise, I must say: considering how fast a score can go down, I thought all my nodes were done for…

I’m just wondering if I got lucky, or if a recent version of the node software behaves differently?
(I’m on 1.9.5)
Because AFAIK, many people got their nodes DQed in the past because of a disk’s sudden unmount/disconnection.


Yes. Your node will return “piece not found”, which immediately fails the audit.
If you don’t notice it fast enough, your node can be disqualified within an hour.

Suspension is applied only for unknown errors:

That’s what I would have expected as well. But this node got suspended, not disqualified. I can’t explain that.

I know your view on this @Alexey and although I understand and do get it’s bad for the network, in my opinion this needs fixing ASAP…

If a node is healthy and pieces cannot be found because they are actually missing, then sure, disqualify the node as fast as possible… I still disagree with this approach, but fine, fair enough.
But when it’s the whole disk that got disconnected for any reason, the SNO needs to get the chance to fix the issue before DQ.

I’m still having a hard time seeing why an offline or misconfigured node gets a chance to be fixed for 7 days, but not a node that merely “appears” to have missing pieces: in both cases, from the satellite’s perspective, the network has unavailable pieces and may need to start repairing… How is it different?

So basically, in my case, what we’re saying is that I got super lucky, as all my nodes should have been disqualified within just an hour?! Man, that’s frightening. StorjLabs can’t keep doing that to home SNOs! Even with an ace monitoring system that would have notified me, I wouldn’t have gotten up at 3 in the morning to fix my nodes, seriously!

My containers did not get terminated, as dashboards were still responding (locally at least), but some of them were showing no graphs at all. Besides, the load average of the Raspberry Pi was around 0.01 when I noticed the problem, so the nodes were basically doing nothing (it usually averages around 0.50).

I’m struggling to decipher the logs, and for now I really don’t know what happened exactly.

The idea that I avoided disqualification by sheer luck, and that it could happen at any time, makes me feel like I’m living with the sword of Damocles hanging above my head :confused:

This is not my view, this is the current state.

However, the feature to prevent a node from being DQed because of an unmounted disk is a work in progress. I have no implementation details or ETA at the moment.

That’s the problem. The node just reports “piece not found” to the satellite, because that is the state the OS returned. So we need to implement some kind of alternative check, and this feature is in progress.

In the case of an unknown error, the chance that the piece is still there is high. In the case of “piece not found” there is no doubt: the piece is definitely lost.

I understand that. However, it’s how it works now. As I said, the feature to prevent DQ because of a missing mount point is in progress.


Well, I think that’s the problem: there are definitely doubts.

Alright. IMHO this should be a high-priority ticket ^^’

I’m sorry if I sounded harsh in my previous post, I was overwhelmed with frustration.
Thanks a lot for all these clarifications @Alexey. Much appreciated.

I think the best you could do now is to find or write a script that detects a lost mount/disk and shuts the node down in that case. Then it is an offline node and not an unhealthy node.
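A minimal sketch of such a watchdog, assuming the node runs in docker and the data disk is mounted at /mnt/storj (the paths and container name are placeholders): a sentinel file created once on the mounted disk distinguishes the real disk from the empty mountpoint directory left behind after a disconnect.

```shell
#!/bin/sh
# Sketch of a mount watchdog (paths and container name are assumptions).
# Create the sentinel once, while the disk is properly mounted:
#   touch /mnt/storj/.mounted

# Succeeds only if the sentinel is visible, i.e. the real disk is mounted;
# an empty mountpoint directory after a disk drop has no such file.
check_mount() {
    [ -f "$1/.mounted" ]
}

# Stop the container when the mount is gone, turning an "unhealthy" node
# that fails audits into a plain offline node.
watchdog() {
    if check_mount "$1"; then
        echo "mount ok"
    else
        echo "mount missing, stopping $2"
        docker stop "$2" >/dev/null 2>&1
    fi
}

# Run from cron, e.g. every minute:
# * * * * * /usr/local/bin/storj-watchdog.sh
# watchdog /mnt/storj storagenode
```

The sentinel-file trick is deliberately simple; `mountpoint -q /mnt/storj` would work too where that utility exists.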

@jammerdan You’re probably right.

But the more tools and monitoring utilities are needed to make sure things do not go tits up, the more being an SNO becomes a high-competency job, not a casual activity home users could do on the side.

If StorjLabs wants home users to be able to participate in providing storage space, I think the node software should be pretty autonomous and forgiving. Otherwise, the network will end up providing storage hosted by professionals only, which may not go in the right direction with regards to the decentralization perspective.

That said, things are improving: the storage node software is already way better than it was a few months ago. It just needs a few more adjustments I guess :wink:

I agree. I like the original “run it and forget it” approach. I think we are way beyond that.

I use this, until they can do some kind of proactive fix. It doesn’t have to be iSCSI

Please have a look at this change in progress: https://review.dev.storj.io/c/storj/storj/+/2272

Once this is implemented, a “piece not found” error will mean that the data is actually lost, because the node would have been taken offline if the storage location wasn’t available. I think this change will solve the issue you described completely.

So this part should no longer be the case when the above change is implemented.

We already have some:

There can be no doubt that it is bad for Storj and bad for SNOs if good nodes are DQed for the wrong reasons, and steps will continually be taken to fix such issues as they are understood and when time can be allotted for fixing them.

So report it if you hit a problem or experience a failure, keep an eye on your node from time to time, and if at all possible set up some kind of alert to warn you of potential problems.

Also make sure your disks are healthy, and if not, consider replacing them or upgrading to a larger drive and starting a new node on the old one…

And most importantly: keep calm and keep storjing.

Yeah well… I’m not sure all SNOs who got suddenly DQed in just a few hours would be happy to just keep Storjing ^^

That is indeed a great improvement and I’m looking forward to it.
I’m wondering what’s going to happen to the docker container though if the node software shuts itself down… Will it keep restarting forever until the error is fixed? Or is the node software going to “not really shut down”, but put itself in an inactive state with a message on the dashboard maybe?

Right, cheers. I should check them out.

kevink and I discussed node states at one time… one of the best ideas, imo, was to simply run a dedicated (or virtual) NIC for the storagenode through the OS, and then ifdown it if a dead-man switch isn’t triggered in time…

That should, in just about all cases, kill the connection completely, no matter the state of the various pieces of software or most of the hardware.
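A rough sketch of that dead-man-switch idea (the heartbeat path, interface name, and timeout are all assumptions): a health check touches a heartbeat file while the node looks sane, and a cron job drops the dedicated NIC once the heartbeat goes stale.

```shell
#!/bin/sh
# Dead-man-switch sketch (file path, NIC name, timeout are assumptions).
# Some separate health check touches the heartbeat file periodically; if
# it goes stale, take the storagenode's dedicated NIC down so the node
# simply drops offline instead of failing audits.

# Succeeds if the file exists and was modified within the last $2 seconds.
fresh() {
    [ -f "$1" ] || return 1
    now=$(date +%s)
    mtime=$(stat -c %Y "$1" 2>/dev/null || stat -f %m "$1")
    [ $(( now - mtime )) -le "$2" ]
}

# Example cron entry (eth1 being the storagenode's dedicated NIC):
# * * * * * fresh /run/storj-heartbeat 120 || ip link set eth1 down
```

The appeal of acting at the network layer is that it works even when the node process itself is wedged on stalled disk I/O and cannot be stopped cleanly.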

I managed to put my Debian/Proxmox box into a state where I could not access or stop the storagenode… though I could view the logs and it was still running. I could not reboot the machine, but I could still run certain commands that don’t really require disk access.

I did this by removing my L2ARC without using the right ZFS command… which basically stalls the entire pool in one big lag from which it never recovers. It takes hours just to get the system to reboot, and my watchdog will not catch it because the OS is essentially still working.

Even trying to shut down the storagenode would take more than 30 minutes… I tried to wait for it, but at one point I didn’t dare wait any longer…

I have no doubt that state would have killed my node outright if I had left it that way… because the node was clearly running and the storage was sort of responding from time to time, even though it took like 10–15 minutes to do a basic ls.

The log was also screaming errors.

The docker container will be restarted by the docker daemon if you specified the --restart unless-stopped flag in the docker run command. However, if you put your data and identity on the disk in their own subfolders, docker will start the container but the storagenode will fail because the path doesn’t exist; since it’s a subfolder, the node can’t start from scratch in the empty mountpoint. So the container will stay in the “restarting” state until you fix the problem.
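For illustration, a docker run sketch along those lines (the host paths are placeholders; the image name and mount destinations follow the standard storagenode setup, and other required flags such as the external address, ports, and wallet are omitted for brevity). Keeping identity and data in subfolders of the disk mount means a vanished disk leaves the paths missing, so the container cycles in “restarting” instead of quietly starting a fresh, empty node:

```shell
# Identity and data live in subfolders of /mnt/storj (example path), so an
# empty mountpoint after a disk drop makes the storagenode fail to start.
docker run -d --restart unless-stopped --name storagenode \
    --mount type=bind,source=/mnt/storj/identity,destination=/app/identity \
    --mount type=bind,source=/mnt/storj/data,destination=/app/config \
    storjlabs/storagenode:latest

# Observe the restart loop after a disk drop:
# docker ps --filter name=storagenode --format '{{.Status}}'
```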