Drive failure, system recovery

lbaker · May 4, 2022, 1:51pm

Just my luck… Been running my system about a year… Just starting to make money… and my drive has a hard failure… I’ve rebuilt with larger mirrored drives, but starting over from scratch is really disappointing.

Is there a way to inform Storj that my old node is down and not coming back?

Seems there need to be better methods to recover from hard failures. The system takes care of Storj, but really penalizes the operators. Now it will take another year to get vetted and start getting income.

Stob · May 4, 2022, 2:17pm

Hi @lbaker
I understand your frustration.

The Storj system will deal with your dead node and you don’t need to do anything.

This has been discussed before and it doesn’t suit Storj’s decentralised model. If anything your node failing due to hardware failure demonstrates exactly why the vetting process works. Your node was not ‘reliable’ enough and lost data, and with it the vetting it had already accumulated. Even paying to ‘restore’ the data on your failed node would not be cost effective for you, as a node operator, as you would need to ‘pay’ for 29 segment downloads just to recreate your missing segment.
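To put the 29-downloads figure above into numbers, here is a minimal sketch. It assumes every lost item needs 29 downloads of the same size to rebuild; the function name and the 1 TB example are hypothetical, and no real prices are modelled.

```python
# Sketch only: rough download volume needed to rebuild lost data,
# using the "29 downloads per rebuilt item" figure quoted in the post.
DOWNLOADS_PER_REBUILD = 29  # figure from the post, not measured here


def repair_download_bytes(lost_bytes: int,
                          downloads_per_rebuild: int = DOWNLOADS_PER_REBUILD) -> int:
    """Bytes that must be downloaded to rebuild `lost_bytes` of node data.

    Assumes each download is the same size as the item being rebuilt.
    """
    return lost_bytes * downloads_per_rebuild


lost = 1_000_000_000_000  # hypothetical: 1 TB of node data lost
total = repair_download_bytes(lost)
print(f"Rebuilding {lost / 1e12:.0f} TB means downloading {total / 1e12:.0f} TB")
```

Under these assumptions the traffic needed is 29 times the data lost, which is why paying for a restore would rarely make sense for an operator.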

I run a RAID on my node and would recommend the same for other operators exactly because of this type of failure, re-vetting process and loss of income.

lbaker · May 4, 2022, 7:50pm

As you say, a node failed… But to vet Storj… I ran a node for an entire year with no issues. My payout was exactly ZERO. In an entire year, it never earned enough to get over the gas fees and the payout threshold. Even then, a year of service, and only $22 sitting in the account. If Storj was really earning me money, the drives would have been upgraded to better hardware. This system expects people to give and give and give, and never seems to get to the point where it gives back.

So, while Storj was vetting my site (for a year), they really didn’t fare too well in the vetting process either. If they expect us to provide 100% reliability, they need to make that worthwhile, and let’s be honest, they don’t. They pay for crap service, that’s it… The ONLY reason this stays online is because I run the server primarily for other reasons. It’s NOT because it will ever pay for itself.

Pac · May 4, 2022, 9:52pm

Out of curiosity @lbaker: How much space were you dedicating to Storj?

But yeah, losing a node is pretty damn frustrating

ZBS · May 5, 2022, 9:31am

I agree, it’s OK to have a penalty… but currently the vetting process takes more than a year, and filling an 8TB drive takes ages.

Most operators keep expanding their existing nodes to avoid wasting time on vetting, but this increases the risk of failure, because having one big node is worse than having, say, 5 smaller ones.

RAID doesn’t protect you, it’s an illusion. If a drive dies permanently, there is no guarantee that you can recover ALL the data, or enough of it to avoid disqualification.
Even if you recover 90%, you can still be disqualified by audits. It’s like playing roulette.

So why use RAID and waste space when you risk losing everything anyway? Run more nodes, and in case of failure you lose only one.
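A quick sketch of the "more nodes" argument, assuming each node sits on its own drive and drives fail independently with the same probability (both are assumptions of mine, and the 5% figure is made up):

```python
# Sketch only: chance of losing ALL your data when it is spread over
# several independent nodes. Failure independence is an assumption.

def p_lose_everything(p_fail: float, n_nodes: int) -> float:
    """Probability that every one of `n_nodes` nodes fails in the period."""
    return p_fail ** n_nodes


p = 0.05  # hypothetical yearly failure probability of a single drive
for n in (1, 5):
    print(f"{n} node(s): chance of losing everything = {p_lose_everything(p, n):.2e}")
```

Note that the expected share of data lost stays the same either way; splitting only makes a total loss much less likely.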

A possible solution is to have some sort of operator reputation (spanning different nodes). If one of your nodes fails permanently, your reputation takes a hit, and as punishment you cannot recover the data; if you want to run the node again, you are forced to start from zero. BUT if your global reputation as a node operator is good, there is no vetting process, as you have already proved to be a good operator.

If your nodes keep going offline, losing data, etc., at some point you get disqualified as an operator.

PROS:

Fewer single points of failure
More free space, because RAID makes no sense (the network already has erasure coding)
An operator can keep raising his reputation as part of the job.
Good operators get rewarded instead of being randomly punished by the current system.

CONS:

Unfortunately, it requires something like KYC, or at least some way to link an operator’s different nodes.

twl · May 5, 2022, 1:46pm

Just a quick heads-up to let you know these $22 are not lost, in case that was unclear

Iigloo · May 6, 2022, 5:32pm

For the Storj network, a failed node or two doesn’t matter. But for the individual it is very important to have some kind of RAID. I use SHR-1, and if one of my drives were to fail, I could just rebuild the array with no downtime. If I put the node on a single ext4 USB drive and it dies, it’s game over.

Alexey · May 8, 2022, 6:45am

You are mixing up the reputation of the operator (human) with the reputation of the node (software + hardware). They are related, but independent.
Satellites check nodes (hardware), not operators (humans), because they want reliable nodes.
So for the satellite the operator’s reputation doesn’t matter; every single node must be vetted to make sure that it will not disappear because of some OS or hardware issue. The operator’s reputation has nothing to do with the reliability of a specific hardware unit.

ZBS · May 8, 2022, 7:39am

Hi @Alexey, thx for your reply.

If I understood correctly, an operator reputation system already exists. Is that right?

If so, how does it work? Do you have a link or wiki page for it?

Thx

Alexey · May 8, 2022, 8:11am

No, it would be of no use in a decentralized cloud storage network with zero trust. Only the node’s reputation matters, and it is calculated independently for each node.
In the Storj network it is also independent per satellite: each satellite calculates reputation scores for each node individually.
The node’s reputation consists of:

audit score - the core part; if it’s below 60%, the node will be disqualified;
suspension score - it depends only on unknown audit errors; if the error is known (e.g. “file does not exist”), the event affects the audit score instead. If the suspension score is below 60%, the node will be suspended (no ingress) until the suspension score grows back above 60%;
online score - it depends on the node’s ability to answer audit requests (regardless of the result; only the fact of answering counts). If it’s lower than 60%, the node will be suspended until the online score grows back above 60%.
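As a rough illustration of how the three scores and the 60% threshold fit together, here is a minimal sketch. It is not the satellite’s actual code: the class, function, and satellite names are hypothetical, and the real score calculation is not modelled.

```python
from dataclasses import dataclass

THRESHOLD = 0.60  # the 60% cutoff quoted for all three scores


@dataclass
class NodeScores:
    """Scores one satellite keeps for one node (hypothetical structure)."""
    audit: float
    suspension: float
    online: float


def node_status(scores: NodeScores) -> str:
    """Return a rough status label for a node on a single satellite."""
    if scores.audit < THRESHOLD:
        return "disqualified"  # permanent, per the description
    if scores.suspension < THRESHOLD or scores.online < THRESHOLD:
        return "suspended"     # no ingress until the score recovers
    return "ok"


# Each satellite evaluates the node on its own, so statuses can differ.
per_satellite = {
    "satellite-a": NodeScores(audit=1.0, suspension=1.0, online=0.98),
    "satellite-b": NodeScores(audit=0.55, suspension=1.0, online=0.99),
    "satellite-c": NodeScores(audit=0.95, suspension=0.40, online=0.99),
}
statuses = {name: node_status(s) for name, s in per_satellite.items()}
print(statuses)
```

The same node can be fine on one satellite and disqualified on another, because nothing is shared between satellites or between nodes of the same operator.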

There is also a vetting status, but since it’s temporary, it isn’t included in the reputation. A vetting node can receive only 5% of the ingress from customers until it gets vetted. To be vetted on a satellite, the node must pass 100 audits from it. This gives plenty of time to check the reliability of the node, because most defects are revealed soon after start. Of course there can be unfortunate cases like @lbaker’s: the node was vetted, but the hardware died afterwards.
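The vetting rule can be sketched like this. Again a simplification with hypothetical names: the 100 audits and 5% come from the description above, while treating a vetted node as eligible for the full share (1.0) is my assumption.

```python
VETTING_AUDITS = 100           # audits to pass per satellite
VETTING_INGRESS_SHARE = 0.05   # share of customer ingress while unvetted


def is_vetted(audits_passed: int) -> bool:
    """A node is vetted on a satellite once it has passed enough audits."""
    return audits_passed >= VETTING_AUDITS


def ingress_share(audits_passed: int) -> float:
    """Share of customer ingress the node is eligible for on one satellite.

    Assumption: a vetted node is eligible for the full share (1.0).
    """
    return 1.0 if is_vetted(audits_passed) else VETTING_INGRESS_SHARE


# Vetting is tracked per satellite, so one node can be in both states.
audits_by_satellite = {"satellite-a": 120, "satellite-b": 37}
shares = {name: ingress_share(n) for name, n in audits_by_satellite.items()}
print(shares)
```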

You can read more here: Reputation Matters When it Comes to Storage Nodes

But from a forum perspective, a registered forum user’s reputation does exist, as a trust level system; see Understanding Discourse Trust Levels
However, not all registered users on the forum are operators and vice versa.