Disqualified after 2 hours of failed audits?

The data is in a subfolder and the node did not shut down with an error, unfortunately.

This will only help if disqualification for downtime is disabled.

There is a bit of a disconnect: the rules pretty much demand datacenter-grade reliability (maybe even higher than some datacenters), but neither the recommendations nor the payments match that.

The recommendation is (or at least was) a Raspberry Pi with a single USB hard drive, so I would expect problems with the USB cable or the controller to be rather common, yet the node cannot handle them.

5 Likes

You are, of course, absolutely right. But that’s why the system was designed with resiliency built into its core.
As @Pentium100 alluded to, it does seem somewhat unreasonable to have all that redundancy built into the design of the network so that it can be run by “amateurs” with cheap, low-spec kit, and then demand a level of availability that would be challenging even with professional-grade equipment. This is especially true when a somewhat trivial problem can quickly be met with the “nuclear” consequence of total, irrevocable disqualification and complete loss of retained earnings.
I’m not saying it’s wrong “per se” and I don’t know what the technical solution would be (although there have been many interesting proposals), but there does seem to be a bit of an incoherence in the strategy.

2 Likes

Availability and reliability are constantly being conflated. I don’t think they are the same at all. Availability has basically not been a factor for many months now. You can be offline for weeks and be fine. And when the new downtime tracking system is implemented, you get plenty of chances to recover from a downtime-related suspension. How is that anywhere close to data center SLAs?
Reliability is a different story though. There aren’t all that many ways to lose a significant number of files on an HDD, and most of them are either manual intervention or a drive starting to fail. Both those cases should lead to disqualification, as it would be way too much of a risk to keep something like that in the network and keep trusting it. You shouldn’t treat your node like your family photos. It’s expendable. If you want, you can run multiple nodes to lower the impact if one fails. Or if you have more HDD space to waste, you can run RAID, which is available on any consumer-grade NAS, to protect against HDD failures. But nobody is going to sue you for breach of contract if your node fails. None of the income you’ve already collected is going to go away, just a small bit of held amount. How is that anywhere close to data center grade?
It’s just the wrong mindset to think of your node as something that is never allowed to fail. The chance of a well-managed node on consumer-grade hardware failing is low enough, and the cost is low enough, that it’s mostly not even worth protecting against.

That leaves us with scenarios that are currently not handled right. One example would be the node being online but the storage isn’t. That SHOULD be downtime, because no data is lost, but it’s currently counted as data loss. That’s why I suggested the node should crash as soon as the data folder is no longer available. That redirects this issue to the more easily recoverable downtime scenario instead of counting it as data loss.

Still, none of this is anywhere close to data center grade. And I’m pretty sure that the vast majority of nodes are not running on data center grade hardware and are actually making a profit. I think some people have just managed data center grade setups in the past and are having trouble switching their mindset to something that doesn’t require what they are used to.

5 Likes

I know of some companies that may use ordinary PC hardware with no RAID for their email server or whatever. It works great and can work for years - until it doesn’t.

The requirements are, if you try to make sure you follow them. They may be even worse than a datacenter’s, because a datacenter usually has backups to recover the data in case the usual precautions fail.

I do not know the numbers for that so I cannot comment about it.

There can be a bug in the software. Also, IIRC, the node writes data files in async mode, so it can report “OK” while the data is not yet on the drive.
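To illustrate that last point, here is a minimal sketch in generic Python (not the actual storagenode code): an ordinary write returns as soon as the OS has buffered the data, so an “OK” can be reported before anything is physically on the drive; durability needs an explicit flush and fsync.

```python
# Minimal illustration of buffered vs. durable writes (not storagenode code).
# A plain write() returns once the OS page cache has the data; if the drive
# drops out before the cache is flushed, the "written" data is gone.
import os

def write_durably(path: str, data: bytes) -> None:
    """Write data and only return once it has actually reached the disk."""
    with open(path, "wb") as f:
        f.write(data)         # data is now in Python/OS buffers, not necessarily on disk
        f.flush()             # push Python's buffer down to the OS
        os.fsync(f.fileno())  # ask the OS to push its cache to the physical drive

write_durably("/tmp/example.piece", b"some piece data")
```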

It takes a year to get back to something resembling the previous situation though.

Well, it was over half of the earnings in my case.

Yes, and they will have lost vital data when it fails. Your node won’t. You can just start over and at most you have some loss of income. You still don’t have the mindset that the node is much more expendable than that.

One of the examples of an upfront cost you don’t have to worry about, because your node’s requirements are nowhere near those of a data center. All of these precautions would be a lot more expensive than just accepting that there is a small chance of your node failing. The cost of that risk is relatively small.

There has at least been some suggestion that you would get a week to recover. While that’s definitely not final, there is not a single data center SLA that would allow such leniency.

When this happened in the past, Storj disabled disqualification and made it right for those affected. I’m not too worried about that. If a bug disqualifies nodes, Storj has a problem as well; it’d be in their best interest to keep the affected nodes in the network.

If you’re running a single large node on a large array, maybe. If you’re instead running several smaller nodes, it’s a matter of months. And with the low chance of the failure happening in the first place, it’s a much cheaper hit to take than spending on prevention.

We’re not going to convince each other. We’ve had this discussion before. In the meantime, I made enough money to pay for the entire setup with enough left to spare. Even if I lose all my nodes now, it’d have been a profit and I’ll just start over. On average it’s the most profitable way to run nodes, especially if, like me, you’re using hardware you already had, with disk space already paid for by Storj income. Sure, my case is a small sample size, but averages can be calculated based on average failure rates, and they’d quickly show it’s just cheaper to take the risk of loss.

That said, this should only go for loss because of HDD failure or mistakes by the node operator. Software bugs causing disqualification should be corrected by Storj Labs (and have been in the past). And there should be some fix for when the storage location isn’t available so nodes that haven’t lost data don’t get disqualified for it.

Early on, that’s probably the case. But that changes quickly as the node gets older. Most of the held amount is collected when the node doesn’t have much data yet, and half of it is returned after month 15. By that time it’s probably closer to 10% of total earnings, though this will depend on how much space you’re sharing.
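To put rough numbers on that, here is a back-of-the-envelope sketch. The held-back rates and the month-15 payout below reflect my understanding of the schedule (verify against the current docs), and the linear earnings ramp is purely a made-up assumption for illustration, so treat the output as indicative only.

```python
# Rough sketch: held amount as a fraction of total earnings over time.
# Assumed schedule (not official numbers): 75% of earnings held in months 1-3,
# 50% in months 4-6, 25% in months 7-9, 0% afterwards, and half of the
# accumulated held amount paid back after month 15.

def earnings_and_held(months, monthly_earnings):
    """Return (total_earned, still_held) after `months` months."""
    held_rate = lambda m: 0.75 if m <= 3 else 0.50 if m <= 6 else 0.25 if m <= 9 else 0.0
    total = sum(monthly_earnings(m) for m in range(1, months + 1))
    held = sum(monthly_earnings(m) * held_rate(m) for m in range(1, months + 1))
    if months >= 15:          # half of the held amount is returned at month 15
        held *= 0.5
    return total, held

# Purely illustrative: earnings grow by $1/month as the node fills up.
earn = lambda m: 1.0 * m
for m in (6, 15, 24):
    total, held = earnings_and_held(m, earn)
    print(f"month {m}: earned ${total:.0f}, still held ${held:.2f} ({held / total:.0%})")
```

With these assumptions the held amount is more than half of total earnings for a young node, but drops below 10% by month 15, which matches the point above.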

I think this sums up quite nicely what needs to be done to fix this particular problem. This situation affecting @Hacker feels like a bug to me, and I think it’d be worth opening a support ticket explaining all that, or linking to this conversation. Maybe they’ll consider fixing this as a priority.

The software keeps getting better little by little, and that’s great :slightly_smiling_face:
But it still feels to me like quite a few SNOs are getting DQed pretty quickly for unfair reasons, or simply because they weren’t notified their nodes were malfunctioning.

When something is wrong with the node, whatever the reason, what about suspending it and notifying the SNO so they get a chance to fix the issue? I don’t see why this approach could not be applied to all issues, actually.

2 Likes

I just want to point out a similar discussion:

1 Like

Okay… Welcome to the club, I’m talking to myself :joy:

Exactly the same fault just happened to me… I got home and noticed that my external drive was awfully silent… Unplugged and replugged it once and the drive was back. Then BAM, your node is disqualified! So now I’m drinking beer and thinking about alternatives. Is this really it? Then I will not return again…

I have never had a disk auto-eject like that before. And I have been on this project nearly from the start.

Yup, exactly, it’s a crappy situation. Similar amounts, but held was higher than earned. Very frustrating and not really motivating to return.

I am also using a USB-connected disk … I do not want to be the next one disqualified because it disconnects during operation … how can I prevent that? Does Storj Labs already have a solution?

1 Like

I’m just glad that it wasn’t my server diskspace… then I would be really angry now :grimacing:

There is one simple old hardware solution that might fix it… it kinda should, but it doesn’t exactly have the required hardware redundancy to do so, at least not with 100% certainty.

But if one ran a RAID on multiple USB drives, then if one disconnected it wouldn’t matter… of course that requires the USB bus to be stable… and since the problem is most likely inherent to the bus being a very hot-swap system, made to manage a ton of different devices at high speeds and without too much configuration trouble, stability kinda took the back seat…

One thing I would take a look at is power management… USB can be turned off for various reasons, and the bus can be rebooted. In recent days I’ve had trouble where, if I reboot one of my computers, the USB drives change drive letters in Windows, which is pretty annoying… but I’m not using those for my node…

There are a few ways to solve it…

  1. Storj should solve this, because it cannot be acceptable that a hardware issue kills a perfectly good, long-term reliable node over a brief period of standard OS USB-bus issues. Not if they want consumers to be SNOs. The solution could be something as simple as suspending a node instead of DQing it.

  2. Simply build a failsafe into the software that keeps track of errors: if errors or failed audits start to rise above a certain point, it takes the node offline.

  3. Homebrew scripts to monitor the storagenode and shut it down (a rough sketch follows this list).

  4. Hardware: use better hardware and test it before you put it into production… but then we are into enterprise levels of commitment.
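For option 3, here is a minimal watchdog sketch in Python. The container name, mount point and sentinel path below are my own assumptions, not anything from this thread, so adjust all of them to your setup.

```python
#!/usr/bin/env python3
# Minimal watchdog sketch: stop the node when its storage path disappears,
# so a USB dropout turns into recoverable downtime instead of failed audits.
# Assumptions (adjust to your setup): the node runs in a Docker container
# named "storagenode", the data is mounted at the hypothetical path below,
# and some path is known to exist inside the data directory while mounted.
import os
import subprocess
import time

DATA_PATH = "/mnt/storj/storage"             # hypothetical mount point
SENTINEL = os.path.join(DATA_PATH, "blobs")  # something that exists only when mounted
CONTAINER = "storagenode"                    # assumed container name
CHECK_INTERVAL = 30                          # seconds between checks

while True:
    # If the mount dropped (e.g. the USB drive disconnected), the path vanishes.
    if not os.path.exists(SENTINEL):
        print("Storage path unavailable, stopping the node container...")
        subprocess.run(["docker", "stop", CONTAINER], check=False)
        break
    time.sleep(CHECK_INTERVAL)
```

Run it in the background alongside the node; going offline this way lands you in the downtime scenario discussed earlier, which is much easier to recover from than failed audits.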

Some way to come back… I mean, my node was maybe down 1-2 hours? If I go online again with all the terabytes of clean data, then maybe the system could make a fast check and take a little “fine” for that offline/failed-audit mess, not just completely throw away all the good terabytes of data!

What should I do now with all the data? Erase it manually, or just leave it and it will delete itself?

You should report it to the Storjlings and see if they won’t reverse the DQ, since it is essentially a programming issue… you can’t really blame USB for acting like USB; that just needs to be accounted for in the code of the storagenode software…

@Alexey

What’s your take on this?

That would multiply the cost immensely and would only be interesting to SNOs who’d like to contribute their resources to Storj at a loss.

1 Like

I’ve asked here - After DQ on a satellite, does the useless data get deleted?