Returned to find my node offline and restarted to no avail

Please, do not mix up database corruption and lost pieces of customers’ data.
The database is replaceable; the customers’ data is not.
You can delete all databases without touching the data, and the storagenode will work as usual. You will just lose statistics for past usage and maybe some unsent orders.

I didn’t lose any pieces of customer data. My hard disk drives closed files correctly. No corruption of any customer data has happened.

Then the storagenode should pass audits.

If pieces are lost or corrupted, your node will fail audits. If it fails too many in a row, it will be disqualified so as not to affect customers.

Unfortunately, there is no other way around it.

It failed audits as the storage location was not available. That’s not the same as corrupted customer files, at all!

This feature is available starting with 1.11.1.

Before this version, only custom checks worked:

For the storagenode there was no difference between lost pieces and a disconnected mountpoint; in both cases the OS returns the same error, “file not found”.
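
A minimal illustration of that (the paths here are only examples, not real node paths):

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

func main() {
	// Case 1: the piece file itself was deleted from a healthy disk.
	_, err1 := os.Open("/mnt/storj/storage/blobs/deleted-piece.sj1")
	// Case 2: the whole mountpoint is gone, so the same kind of path fails as well.
	_, err2 := os.Open("/mnt/missing-mount/storage/blobs/some-piece.sj1")

	// Both errors belong to the same "file does not exist" class,
	// so the node cannot tell the two situations apart from the error alone.
	fmt.Println(errors.Is(err1, fs.ErrNotExist)) // true
	fmt.Println(errors.Is(err2, fs.ErrNotExist)) // true
}
```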

That doesn’t help me at all though, does it? The node is healthy and ready to come back online. This shouldn’t be happening at all. If a node’s data is unavailable, the service should know this and either bring the node offline or tell the audit system that the files are currently unavailable and that the node is in a safety maintenance mode until someone can attend to the problem and fix it. Both of these should be tantamount to the node being offline to the network, and the network should treat it as such, without immediate and indiscriminate disqualification. If the storage then comes back within a reasonable period, i.e. given enough time for someone to attend to the problem, fix it, and bring the node back online within 24 hours, then the node should be allowed to continue as normal, not be disqualified from the network for a minor outage!

I have spent significant time maintaining this node and won’t have some stupid policy dictate to me that because it was unavailable for less than an hour, it will be disqualified; this is unfair and unethical.

What on Earth is happening? I have checked my local node Web UI and now it says that us-central-1 has an audit score of 100% and a suspension score of 100%.

Does the network refuse to update the node statistics once it’s been needlessly suspended by a minor outage ?

So what’s going to happen to my held amount that I have been accruing? Is this going to be needlessly stolen by other Storj node operators even though my node data is completely intact?

You also mention a few posts about watchdog-type scripts and setups. Surely this should be incorporated into the node itself rather than having an external process try to manage the node’s health and uptime? All this seems extremely lazy and inconvenient on the part of the developers and just goes to prove that the fundamentals aren’t in place to prevent needless node banning. I sincerely hope this is revised as a matter of urgency, as node operators should not have to risk having their node banned from the network just for having their storage system offline a few minutes, and they should not be expected to implement home-brew solutions just to prevent their nodes from being banned. This should be part of the node service provided by Storj.

There is no way to prove that the data is intact except a full audit.
The full audit is more expensive than recovery, so unfortunately if your audit score is below 0.6, the held amount will be used for recovery.
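
This is not the satellite’s actual formula, but as a rough illustration of how quickly an exponentially weighted score falls below 0.6 under repeated failures (the smoothing factor here is only an assumption):

```go
package main

import "fmt"

func main() {
	// Illustrative only: an exponentially weighted audit score.
	// The satellite's real reputation formula may differ; lambda is an assumption.
	const lambda = 0.95
	score := 1.0
	for i := 1; i <= 12; i++ {
		score = lambda * score // every iteration is one failed audit
		fmt.Printf("failed audits: %2d  score: %.3f\n", i, score)
	}
	// With these numbers the score crosses 0.6 after about 10 failures in a row.
}
```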

As I said, the disk availability check is already implemented in the current release.

Who do I complain to to get this problem escalated and my node undisqualified? I require a point of contact for escalation of this matter, as it seems I have exhausted the minimal help that’s available on this forum. I need to speak to a supervisor, please.

Undisqualification is impossible if the audit score is less than 0.6.
There is no workaround.
If it were possible, this is what would happen:

  1. Since the data is unhealthy until audited, it is marked as lost.
  2. Since the node failed audits, it lost its reputation, so the trust level is 0. It goes back to vetting with 5% of potential traffic until it passes 100 audits (~1 month).
  3. While it is being audited, the lost data could reach the threshold and trigger repair, so payment for recovery will be taken from the held amount.
  4. Since auditing is a long process (the satellite audits only a few pieces at a time, and there are other nodes in the network), all that time you will be paid only for audited pieces.
  5. Recovered pieces will be removed from your node by the garbage collector and placed into the trash for a week. It will remove pieces from your node slowly.
  6. The held amount will be used up in whole before your node finishes the audit.

So, it is not better than starting from scratch:

  • the held amount is lost anyway;
  • the reputation is reset to zero;
  • the held-back level starts again from 75%;
  • your node still holds pieces which are all marked as lost until audited, and they are not paid for until then. However, they still occupy space.

What are you talking about? I can quite categorically state that the data is 100% intact. It doesn’t need a full audit at all. How do you know there’s no way to prove this? Were you available and physically present at the time the drive was properly shut down and ejected from the system, witnessing the writing and finalising of all files and closing them properly? No, you were not, but I was, and I can categorically tell you that this was done properly, without data loss. I did a proper procedure to finalise writes, and properly unmounted and ejected the wrong drive. I have plenty of them and it’s easy to get confused as to which ones are for Storj.

This doesn’t however negate the fact that I need this problem to be escalated. I need to speak to a supervisor, please.

The audit score is 59.9 so using the rounding-up technique, my node should not be disqualified. :grinning:

The storagenode is an untrusted entity by default. If it fails audits, its trust level is zero.
The only way to be sure that the data is intact is to audit it.
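
The real audit protocol works on erasure-coded stripes rather than whole pieces, but an over-simplified sketch of the idea is just: read the data back and compare it with what was recorded at upload (the names and path below are illustrative only):

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"os"
)

// auditPiece is an over-simplified stand-in for an audit: read the stored
// piece back and compare it against the hash recorded when it was uploaded.
// Only data the node can actually return counts as intact.
func auditPiece(path string, expected [32]byte) bool {
	data, err := os.ReadFile(path)
	if err != nil {
		return false // a missing file and a missing mountpoint fail identically
	}
	sum := sha256.Sum256(data)
	return bytes.Equal(sum[:], expected[:])
}

func main() {
	expected := sha256.Sum256([]byte("piece contents at upload time"))
	fmt.Println(auditPiece("/mnt/storj/storage/blobs/example.sj1", expected))
}
```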

Also, node operators should not have their node disqualified within an hour for properly removing a storage location temporarily. The storage location could become available again, which mine has, without faults; then the node should be able to come back online and operate as normal. However, this is not happening, and your network is needlessly disqualifying nodes that are behaving in a normal manner. Your disqualification system is entirely overzealous and unfair. I want to speak to a supervisor please. Please provide contact details so that I may take this matter further.

You should stop the node before removing the storage location and start it again after the maintenance is complete.

For the storagenode, a lost mountpoint and lost pieces look the same: “file not found”. This is what the storagenode gets from the OS in both cases.

In the current release the storagenode will create a file and write to it every minute to be sure that the disk is still there.
However, it will not help if the mountpoint is still writable.
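
Roughly, it amounts to something like the sketch below (the file name, interval, and path are my assumptions, not the actual implementation):

```go
package main

import (
	"log"
	"os"
	"path/filepath"
	"time"
)

// verifyWritability writes and removes a small marker file under the storage
// directory to confirm the disk is still present and accepting writes.
// Note: if the disk disconnects but leaves an empty, writable mountpoint
// behind, this check still passes, which is the limitation mentioned above.
func verifyWritability(storageDir string) error {
	marker := filepath.Join(storageDir, "write-test")
	if err := os.WriteFile(marker, []byte("ok"), 0o644); err != nil {
		return err
	}
	return os.Remove(marker)
}

func main() {
	storageDir := "/mnt/storj/storage" // hypothetical path
	for range time.Tick(time.Minute) {
		if err := verifyWritability(storageDir); err != nil {
			log.Fatalf("storage directory is not writable, shutting down: %v", err)
		}
	}
}
```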

For that reason we recommend in the documentation to put the data into a subfolder on the disk instead of the root of the disk. In that case, if the disk disappears, the storagenode will not be able to create an empty storage in the mountpoint and will fail (this works even for older versions).
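
A minimal sketch of why the subfolder helps (the paths are hypothetical): if the disk disappears, the mountpoint is just an empty directory without that subfolder, so a simple existence check at startup fails instead of the node carrying on with an empty storage location.

```go
package main

import (
	"log"
	"os"
)

func main() {
	// Point the node at a subfolder, not at the mount root. If the disk
	// disappears, the mountpoint no longer contains this subfolder,
	// so the check below fails and the node refuses to start.
	dataDir := "/mnt/storj/storagenode" // hypothetical path
	if _, err := os.Stat(dataDir); err != nil {
		log.Fatalf("data directory missing, refusing to start: %v", err)
	}
	log.Println("data directory present, starting node")
}
```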


Again I will reiterate: the audit system is totally flawed. It needs to differentiate between corrupted data/missing files and totally offline storage. It’s not difficult for an audit node to differentiate between these conditions. It checks for a file or files; if it finds none, the storage node should be flagged by the audit node as being in temporary safety maintenance mode and then checked periodically, until the storage node reports the file or files the audit node is looking for. It should then be flagged as back online and out of temporary safety maintenance mode once the audit node has finished its checks and the storage node has passed those file-present-and-correct checks.
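
Something along these lines on the node side would be enough to tell the two conditions apart (the paths and wording are just mine, to illustrate the point):

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// Illustrative only: distinguish "this piece is gone" from "the whole
// storage location is gone" before answering an audit.
func pieceStatus(storageRoot, piecePath string) string {
	if _, err := os.Stat(storageRoot); err != nil {
		// The entire storage location is unreachable: report maintenance,
		// do not treat it as lost data.
		return "storage unavailable - enter safety maintenance mode"
	}
	if _, err := os.Stat(piecePath); errors.Is(err, fs.ErrNotExist) {
		// The disk is present but the piece is missing: a genuine audit failure.
		return "piece missing - fail the audit"
	}
	return "piece present - serve the audit"
}

func main() {
	fmt.Println(pieceStatus("/mnt/storj/storage", "/mnt/storj/storage/blobs/example.sj1"))
}
```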

If however the audit node checks and finds the file or files but they are corrupt, or finds some files but not others, then the storage node’s audit score should plummet, eventually earning a disqualification from the network.

It’s totally unfair and unethical, not to mention incorrect, to smack the audit score of a node just because the storage isn’t available temporarily.

Also, my storage node keeps telling me that I need to upgrade to 1.3.3, but I have 1.10.1 running. I have tried to download it again through the documentation links, but every time it tells me that 1.10.1 is installed, even though the downloaded MSI installer is fresh from your installation documentation website.

The workaround for a disk that goes missing at runtime is implemented.