Is there such a thing as an "un-disqualify"?

Nobody gets disqualified for downtime at the moment, so that was not the reason for his disqualification.

I don’t get the OP: why does he want to delete 4 nodes because 1 got disqualified? Try to find out what the reason was and cut your losses. Then move the data of the oldest remaining node to the biggest hard disk and set up a new node on the now spare hard disk.

A good start for troubleshooting is to grep your logs for “GET_AUDIT” and “failed” in the same line.
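If you prefer a tiny script over a grep pipeline, something like this does the same thing (assuming you redirect the node’s output to a log file; the path is just a placeholder for your setup):

```python
# Print every log line where an audit download failed.
LOG_PATH = "/path/to/storagenode.log"   # placeholder, point this at your node's log

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "GET_AUDIT" in line and "failed" in line:
            print(line.rstrip())
```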

4 Likes

The reward for effort is pennies per hour invested (if that). I know my bill rate. It was interesting, but it was never worth my time. All nodes ran on the same aging server, which I otherwise only turn on for a few hours monthly. It is madness to invest more time figuring out the “why” if that effort would not change anything. :slight_smile:

Cheers!

1 Like

This issue really needs to be solved. Sure, there is a near-infinite number of people willing to attempt to share storage space… but word of mouth will eventually kill the whole project if the majority of SNOs don’t have a good experience…

The facts are simple… SNOs need time and warning to be able to correct their storage nodes’ issues; a random bullet in the head isn’t very educational… I suppose that’s why soldiers wear helmets :smiley:

your valor will be remembered…

1 Like

Probably not after understanding that their server needs to be online 24/7, that they won’t get rich from sharing space, and that the rewards during the first months are definitely not worth the effort (if you want to count the time spent).

The majority of SNOs have a good experience. You just read more about the bad experiences, because people don’t come to the forum screaming “STORJ is so great!! Being a SNO is so easy and I set my node up 5 months ago and it still works and I actually don’t even check the forum!!”.
People come here to scream when they have a problem or their node got DQed for whatever fair or unfair reason.

Can’t disagree with that part though.

8 Likes

What makes you say that?

Which issue would that be?

Nobody has any idea what happened with this node, as the owner doesn’t feel like looking at the logs. Which I would argue takes a lot less time than coming back here to post 3 times.

Clearly audits have not been at 100%, otherwise you can’t get disqualified. For all we know, data was removed or the HDD failed. Either way, I don’t see why we should bother debating a situation we literally know nothing about.

If @JohnSmith wants help or a useful debate, he can look at the logs and let us know what’s going on. But if it’s not worth his time, then it’s definitely not worth ours.

3 Likes

Just put it here:

1 Like

I get your view @BrightSilence and I tend to agree with you.

However, as @SGC said:

(even though “random bullet” is a debatable notion here)

In my humble opinion, there is only one situation where disqualification could be fast and without warning: if a node is clearly responding with the wrong data when audited. That would mean it’s trying to cheat and is poisoning the network, so it needs to be killed fast.

All other problems should simply trigger suspension mode and notify the SNO that something’s wrong and that they should check what’s going on within a few weeks (or months… dunno) before getting disqualified for good.

That said, I do agree that @JohnSmith should (or could) have considered having a look at the logs anyway to see what happened, especially because if it proved to be a new problem, it could have helped the team make the software better.

3 Likes

True… but maybe it should then be made clearer what goes wrong, with some sort of system implemented so that the node protects / disconnects itself upon a high number of errors / audit failures… so that people actually have a chance to diagnose the issue…
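Something like this rough sketch is what I mean by self-protection (purely illustrative; the window and threshold are made up, it’s not anything the node actually does today):

```python
import sys

# Illustrative circuit breaker: if too many of the recent audits failed,
# stop serving and let the operator investigate instead of bleeding out the score.
WINDOW = 20                       # look at the last 20 audits (made-up number)
MAX_FAILURES_IN_WINDOW = 5        # made-up threshold
recent_results: list[bool] = []   # True = audit passed, False = audit failed

def record_audit(passed: bool) -> None:
    recent_results.append(passed)
    del recent_results[:-WINDOW]  # keep only the most recent audits
    if recent_results.count(False) >= MAX_FAILURES_IN_WINDOW:
        print("Too many failed audits, shutting down so a human can take a look")
        sys.exit(1)
```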

It’s not nice to know that months and even years of maintenance and work could be lost to a random fluke that wasn’t actually anything serious… just because the software doesn’t know how to preserve its own life at least long enough for humans to respond.

And the random bullet could very well be a well-considered one from a sniper… it just seems random to the guy in the trench.

2 Likes

Well, I think my previous posts show that I think some things can be improved. But I would say that if data is removed, that should also lead to a hard disqualification, provided the storage location is available. The same goes for unreadable data: if the storage location can be read and written to, but files are unreadable, no longer there, or return wrong information, there is nothing you can do to fix it anyway. In those scenarios, disqualification after failing too many audits is just the right thing to do.

@SGC I see you responded as well, but I think this message applies to what you said too. The node needs to be better at knowing whether it’s just that the data location isn’t available or that actual data is lost. Disqualification should only happen in the latter case. If that’s fixed, then there is no question about why the disqualification happened, because it could only be one thing.

For what it’s worth, I’m certain the dashboard will already show an audit score below 60% on the node that is disqualified. But you do have to bother to look.

1 Like

I disagree. The SNO could have tried to migrate their nodes and misconfigured the target path. It’d be better to let them know something is wrong, so they can face-palm themselves and solve the issue.

If the node does return data that proves to be invalid, then it’s different: considering each file is identified by some kind of UUID (if I’m not mistaken), I guess that if a node were to target a folder containing data from another node, it shouldn’t find a single file matching requests coming from satellites. Which means that a node returning wrong data probably is cheating. That’s my take anyway.

As @BrightSilence said, there is indeed room for improvement; some may want to upvote one of his suggestions to make the software more robust: Make the node crash with a fatal error if the storage path becomes unavailable

I would count that as the storage location not being available. The node could place a file in the storage location that it can poll for availability. If it’s not there, the node shouldn’t start, or it should shut down. That would catch misconfigurations as well. Even better would be a file that stores the public identity of the node in the storage location, to test whether the data location matches the identity used. That would even catch issues where node A points to the storage location of node B.

But if files are missing while that test file is there, then disqualifying is still the right thing to do.
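As a rough sketch of what I mean (the file name and checks are made up for illustration, not actual storagenode behaviour):

```python
from pathlib import Path

VERIFICATION_FILE = "node-identity.check"   # hypothetical file name, just for the example

def write_verification_file(storage_dir: str, node_id: str) -> None:
    # Written once, when the storage location is first set up for this node.
    (Path(storage_dir) / VERIFICATION_FILE).write_text(node_id)

def storage_dir_is_usable(storage_dir: str, node_id: str) -> bool:
    # Checked at startup and periodically; the node refuses to run if this fails.
    marker = Path(storage_dir) / VERIFICATION_FILE
    if not marker.exists():
        return False   # path unavailable or misconfigured -> don't start / shut down
    # Catches node A accidentally pointing at node B's storage location.
    return marker.read_text().strip() == node_id
```

With a check like that in place, missing pieces while the marker is present really can only mean lost data.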

1 Like

I would like to recommend reading the whitepaper: https://storj.io/storjv3.pdf , sections “4.8 Structured file storage” and “4.14.1 Piece hashes”.
It’s not a GUID at all. From the storage node’s perspective, it indeed can’t find any audited piece and will answer with “file not found”, not with wrong data. From the satellite’s point of view, the node has lost all the data and must be immediately disqualified, otherwise it will be offered to customers and they could receive the scary “file not found” message too.
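In other words (a simplified sketch to illustrate the point, not the real on-disk layout or API): pieces are looked up by their piece ID and verified against a stored hash, so a node pointed at the wrong folder has nothing under that ID and can only answer “file not found”, and substituted data would not pass the hash check.

```python
import hashlib
from pathlib import Path

def get_piece(storage_dir: str, piece_id: str, expected_hash: str) -> bytes:
    # Simplified: pieces are addressed by their piece ID...
    piece_path = Path(storage_dir) / "blobs" / piece_id
    if not piece_path.exists():
        # ...so a misconfigured node simply has nothing to return.
        raise FileNotFoundError(f"piece {piece_id} not found")
    data = piece_path.read_bytes()
    # ...and the returned bytes are checked against the known piece hash,
    # so corrupted or swapped data is rejected rather than accepted.
    if hashlib.sha256(data).hexdigest() != expected_hash:
        raise ValueError(f"piece {piece_id} does not match its hash")
    return data
```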

Sounds like a pretty decent solution to me! :slightly_smiling_face:

I’m not questioning that, that’s why I assumed that only a cheating node could reply with the wrong data.
If a mechanism like the one @BrightSilence suggested is implemented, then yes, not finding files should lead to disqualification. But currently, a simple misconfiguration could cause that.

If we’re sure the node is cheating or has lost all its files, then yes! Otherwise it could be suspended immediately to avoid sending scary ‘file not found’ messages to customers. Don’t you think?

But I thought that maybe it’d be simpler and easier to pull a node out of the network (by suspending it) when anything goes wrong, anything… And if it gets back to normal within a certain amount of time, un-suspend it. Otherwise, disqualify it.

I’m truly having a hard time seeing why, in some cases, it should be immediately and definitively killed. ^^’
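What I’m picturing is basically this timeline, sketched out very roughly (the one-month grace window is just an example number, not an actual network parameter):

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

GRACE_PERIOD = timedelta(days=30)   # example window, not a real parameter

def review_node(audit_ok: bool, suspended_since: Optional[datetime],
                now: datetime) -> Tuple[str, Optional[datetime]]:
    """Return the node's new state: 'ok', 'suspended' or 'disqualified'."""
    if audit_ok:
        return "ok", None                        # back to normal -> un-suspend
    if suspended_since is None:
        return "suspended", now                  # anything goes wrong -> pull it out of the network
    if now - suspended_since > GRACE_PERIOD:
        return "disqualified", suspended_since   # never recovered in time -> DQ for good
    return "suspended", suspended_since          # still within the grace window
```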

I already said what I think about suspension instead of disqualification:

It should be extended to work as a slow disqualification:

The recovered pieces will be removed from such a node by the garbage collector.

4 Likes

Very educational answer @Alexey, many thanks.

I see why it’s not as simple as I thought.

Well then… I’m not sure what to suggest :sweat_smile:

If I understand what you’re suggesting, SNOs should pay for repairs triggered while their nodes are suspended.

That sounds pretty fair to me. I’d rather receive a poor payout one month because my node got temporarily suspended than get disqualified.

Yes, but the held amount could be wiped very fast, because of

And thus your node would fall back to the level where 75% of earnings go to the held amount very fast too.
Then we would be forced to enable the vetting process again, so - 5% of potential traffic during the next month or two.
And pieces get deleted every time as well, because your node is untrusted and the pieces are still marked as lost. And of course they are not paid for anymore.
So, this is the same as a disqualification, but with a slow cleanup of the unpaid space.

I’m not sure it is a good suggestion at all.

Do you mean a single failed audit?

You’d have to fail quite a few audits; it isn’t immediate. Notifications about failed audits could be helpful though.
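For reference, if I read the reputation model right, the audit score is a beta-style score with a forgetting factor of about 0.95 and a disqualification threshold of 0.6 (treat both numbers as my assumption). A quick back-of-the-envelope calculation shows a node with a perfect history needs on the order of ten consecutive failed audits before it drops below the threshold:

```python
LAMBDA = 0.95       # forgetting factor (assumed)
DQ_THRESHOLD = 0.6  # disqualification threshold (assumed)

# A node with a long history of successful audits:
# alpha converges towards 1 / (1 - LAMBDA) = 20, beta towards 0.
alpha, beta = 1 / (1 - LAMBDA), 0.0

failures = 0
while alpha / (alpha + beta) > DQ_THRESHOLD:
    alpha, beta = LAMBDA * alpha, LAMBDA * beta + 1   # one more failed audit
    failures += 1

print(f"Consecutive failed audits before disqualification: {failures}")  # prints 10
```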

Right now repairs cost Storj around $9/TB stored. From what I see on my 8TB nodes, the held amount will probably be really close to the repair cost between months 10 and 15. After that it will only cover half of the repair cost. But if I exclude the surge payments early on, it would only cover ~70% at the maximum held amount and ~35% after I get half of it back.
It’s even worse for nodes that started small and then migrated to a big drive.
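To put rough numbers on it (the monthly earnings below are hypothetical; the held-back schedule of 75% / 50% / 25% for months 1-3 / 4-6 / 7-9, with half of the held amount returned after month 15, is how I understand the current model):

```python
# Hypothetical gross earnings of a node while it fills up, months 1..9, in USD
# (made-up figures, no surge payouts included).
monthly_earnings = [1, 3, 6, 10, 14, 18, 22, 26, 30]
held_fractions   = [0.75] * 3 + [0.50] * 3 + [0.25] * 3   # held-back schedule

held_total  = sum(e * f for e, f in zip(monthly_earnings, held_fractions))
repair_cost = 8 * 9   # 8 TB node at ~$9/TB repair cost

print(f"held amount (months 10-15):      ${held_total:.2f}")      # $48.00 -> ~70% of repair
print(f"after half is returned:          ${held_total / 2:.2f}")  # $24.00 -> ~35% of repair
print(f"estimated repair cost for 8 TB:  ${repair_cost}")         # $72
```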

So I’m really surprised that people get disqualified because of a misconfiguration or some other equally simple and minor issue, and that there’s no way to recover the node after a DQ that happened for such a reason.