Storagenode had corrupted data in over 400 blocks on the hard-drive storage

Well, my new node is up and running.

2 Likes

I have a 5TB node that has been running for months on all satellites. Due to a combination of an array misconfiguration and a faulty drive, it got 3MB of corrupted sectors in the storj folder, impacting 393MB across 180 blob files; 150 of those belong to one satellite, and the other satellites have between 1 and 10 corrupted files each.

The faulty HDD was promptly ejected from the software JBOD array (DrivePool), but left connected to attempt further repairs of the corrupted files (unsuccessful).

The problem occurred around one week ago.
I was expecting the audits/network/storj-magic to slowly recover the damaged files as they are detected, and somehow "make me pay for it".

I checked today: I see no changes to the damaged files (no new "good" versions of files with the same names), and I still have 100% scores.

Now I saw this thread and I am afraid of having the full 5TB node disqualified due to 1.5MB to 300MB of corrupted but contained data.
What is the expected behaviour of the storj-service in these situations?
Isn't there a command to force a repair of some specific files, or to check the full node, or similar?

Files are never repaired on your node. If a repair is triggered, it will upload newly repaired pieces to a random set of new nodes.

More importantly, audits are done on a random small subset of data. You've seen no drop in scores because audits simply haven't checked the damaged files yet. Since it's only a limited set of files that is affected, your node may survive for a long time in this state. Maybe even forever. If you're unlucky, at some point it may repeatedly audit damaged files and your node could eventually be disqualified. When that happens, the pieces on your node will be marked as lost, and segments whose availability falls below the repair threshold will be recovered to other nodes.
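To get a feel for the odds, here's a back-of-envelope sketch. The piece count and audit rate below are just my assumptions, not Storj figures: if a fraction f of your pieces is damaged and each audit samples a stored piece more or less at random, the chance of hitting at least one damaged piece after N audits is 1 - (1 - f)^N.

```python
# Back-of-envelope estimate; the piece count and audit rate below are
# illustrative assumptions, not Storj internals.
damaged_pieces = 180          # corrupted blob files mentioned above
total_pieces = 3_000_000      # rough guess for ~3TB of small pieces
audits_per_day = 100          # assumed audit rate for one node

f = damaged_pieces / total_pieces
for days in (30, 180, 365):
    n_audits = audits_per_day * days
    p_hit = 1 - (1 - f) ** n_audits
    print(f"{days:3d} days: P(at least one failed audit) ~ {p_hit:.0%}")
```

And even a hit is just a single failed audit, not an instant disqualification.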

So yeah, make sure you take the damaged HDD out of operation (it'll surely cause more damage if you don't). And then hope the damage isn't too much to survive.

Hmmm,
So, correct me if I'm wrong:

  • Any single corrupted (or missing) blob in my node will only be checked once, randomly, in an audit or when it's requested/downloaded (an upload from my node),
  • then be marked as corrupted/lost (and deleted on the node if it's corrupted?),
  • count as a single failure in that audit score,
  • be recovered from other nodes to some other node,
  • and never bother me again?

So, as long as there are only a few corrupted blobs, and I'm lucky enough not to have consecutive audits fail because of them, the node will be fine?

Also, is it better to leave the corrupted blobs there or remove them?

Close, but not entirely. Segments are audited at random; there is no guarantee that a given piece will ever be audited. Downloads don't count towards the score. Downloads are overprovisioned, so if a customer downloads a segment that includes the corrupt piece on your node, there will still be enough good pieces from other nodes to reconstruct the file. So the customer isn't impacted.

As far as I know audits are only used to judge whether the node should be disqualified. I don't think the pieces are marked as lost on a failed audit. This might only happen after the node is disqualified. Disqualification happens when the score drops below 60. At this point the chance that your node doesn't have a piece becomes too high to risk dealing with your node anymore and the satellite disqualifies it.

As far as I know, only after disqualification are the pieces on your node marked as lost. Basically it's all about determining whether your node should statistically be trusted to have a piece. A certain margin is acceptable as there is a lot of redundancy built in, but when that becomes too risky you get disqualified. It doesn't really bother with repairing individual pieces because most pieces are never audited anyway. You could try to repair pieces that were corrupt, but for every corrupt piece that is audited there are likely many more that never will be. So it would be a waste of time to deal with individual pieces, and it's best to just use that stat to determine the reliability of the node as a whole.

I realize that might be a little counterintuitive, but it's actually the best way to deal with it when you don't have the option to audit everything all the time.
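If you're curious what that score actually does under the hood, the blueprints describe a beta-reputation style update. Here's a rough sketch of it; the forgetting factor, weight and starting values are my own placeholders, since the real ones sit in the satellite configuration.

```python
# Sketch of a beta-reputation style audit score as described in the
# Storj reputation blueprint. LAMBDA, WEIGHT and the starting alpha/beta
# are assumed values; the real ones live in the satellite configuration.
LAMBDA = 0.95        # forgetting factor: how quickly old results fade (assumed)
WEIGHT = 1.0         # weight of each new audit result (assumed)
DQ_THRESHOLD = 0.6   # disqualification when the score drops below this

def update(alpha, beta, success):
    """Fold one audit result into the running reputation."""
    v = 1.0 if success else -1.0
    alpha = LAMBDA * alpha + WEIGHT * (1 + v) / 2
    beta = LAMBDA * beta + WEIGHT * (1 - v) / 2
    return alpha, beta

alpha, beta = 1.0, 0.0                               # assumed starting values
history = [True] * 50 + [False] * 5 + [True] * 20    # a short run of failures
for result in history:
    alpha, beta = update(alpha, beta, result)
score = alpha / (alpha + beta)
print(f"score after the run: {score:.3f} (disqualified below {DQ_THRESHOLD})")
```

With these toy numbers, a short run of failed audits dents the score but it climbs back once audits start passing again; it takes a sustained failure rate to push it under the threshold.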

So, the damage from lost/corrupt files will plague my node for a very long time?

I really hope it's simply due to dev priorities (totally agree), and here is why:

Storj uses erasure codes to increase reliability and bandwidth/storage efficiency at the expense of a little complexity during file storage and retrieval.
Some of the 80 pieces/nodes may be missing/down without any problem; only 29 are required.
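To put rough numbers on that (the per-node availability values below are assumptions for illustration, not Storj statistics): a segment survives as long as any 29 of its 80 pieces are reachable, which a simple binomial model captures.

```python
# Availability of a 29-of-80 erasure-coded segment, assuming pieces sit on
# independent nodes that are each reachable with probability p.
# The p values are assumptions for illustration, not Storj statistics.
from math import comb

N, K = 80, 29                              # pieces stored / pieces needed
print(f"expansion factor: {N / K:.2f}x")   # ~2.76x storage overhead

def segment_availability(p, n=N, k=K):
    """P(at least k of the n pieces are retrievable)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

for p in (0.4, 0.5, 0.7, 0.95):
    print(f"node availability {p:.0%} -> segment availability "
          f"{segment_availability(p):.6f}")
```

Even with only half of the piece-holding nodes reachable, the segment is still almost certainly reconstructable, which is part of why per-node redundancy buys the network relatively little.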

However, do you still expect or recommend SNOs to implement reliability/redundancy on their end (the last stage of the storj storage hierarchy)?
(which would make the system an 80*2 / 29)

I studied and dealt with information theory and erasure codes a few years ago in my work, and from a theoretical point of view I think it would be beneficial to have twice the storage/nodes without redundancy rather than half with redundancy, even if it required increasing the 80/29 ratio.

When I joined storj I thought this was a given, and that left me with only 2 choices:

  • Start one node per HDD.
  • Start one node per server, and use LVM/software-Raid in JBOD.

The second option seemed much better, since it allows me to dynamically increase/decrease space, and there's only one service instance per server and internet connection (one process consuming bandwidth, one node to keep track of, etc.).
When an HDD fails (shows signs of failure), 95% of the time 99% of the data is recoverable to another HDD.
I thought the remaining 1% of the data (or 100% of it in the other 5% of cases), i.e. the corrupted data, would be recovered from the network to other nodes.

I agree that an exhaustive audit would be counter-productive given the built-in redundancy of the whole network, but there should be a way to start a self-check/repair, even if triggered by the SNO, or run automatically on the most recent blobs after an abrupt shutdown, etc.
There are multiple events that can cause small file corruptions, even with RAID1 and UPSs in place.
Leaving a missing/corrupt blob in place without fixing/trashing it leads to a monotonically increasing number of faults and inevitably/probabilistically to a lower node score and disqualification, even if the node has been running flawlessly for a long time.

If I have 100 missing/corrupted blobs in my node, I should be able to come forward, force a check, inform the satellite/network of the missing blobs, and pay for the errors due to "I had them and got paid for them, but they were actually corrupted".

If such a command were available, the following decision would be much easier to make and advantageous for the whole Storj network:

  • Have RAID1/RAID5 redundancy on my storj space.
  • Have the same drives in JBOD and provide +100% of the space (vs. RAID1), or +25% with much lower CPU usage (vs. RAID5 with 4+1 drives), and perform a check/repair when an HDD fails, losing only a few $ (much less than the extra space generated).

In a sense… yes. It won't be fixed. But it's probably also small enough that you may never fail an audit at all. And if you do, it's likely a one-off occurrence and your score will recover pretty quickly.

From an architectural standpoint storagenodes are by definition untrusted entities, so this expectation can't be there. How SNOs manage their data is their business; the satellite only judges it based on the results. But Storj Labs has always said redundancy on the node's end is a waste of resources, and I happen to agree. Better to just share the space on HDDs you would otherwise use for redundancy as a separate node. That would actually make you more money in the long run. The 80/29 values are based on the assumption that nodes don't have their own redundancy, in part on node churn rates seen on V2, and I'm sure by now also on churn rates on V3. Read the white paper for more info on this.
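Just to illustrate the raw capacity side of that, without reopening the whole RAID debate: the drive count, size and failure rate below are assumptions, and this ignores vetting, held amounts and traffic distribution.

```python
# Simplified expected-capacity comparison: RAID5 (4+1) vs one node per
# drive. Drive count, size and annual failure rate are assumptions.
drives, size_tb = 5, 5              # e.g. five 5TB drives
annual_failure_rate = 0.05          # assumed per-drive failure probability

raid5_capacity = (drives - 1) * size_tb     # one drive's worth goes to parity
jbod_capacity = drives * size_tb            # every drive is shared

# Expected shareable space over a year if a failed independent node's data
# is simply lost (the network repairs it to other nodes, not back to you).
expected_jbod = jbod_capacity - drives * annual_failure_rate * size_tb

print(f"RAID5 (4+1) shareable:              {raid5_capacity} TB")
print(f"one node per drive, expected value: {expected_jbod:.2f} TB")
```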

I'm going to resist the temptation to go into the separate node vs RAID discussion here, but I suggest you search the forums for existing threads about this topic. There are good arguments for both sides.

The problem with this is that the satellite can't know which pieces are lost without doing a full audit of all data on your node. This would be extremely costly, to the point where it is cheaper to disqualify your node and have repair handle it. Because of this it's impossible to repair damage. And since storagenodes are by definition untrusted entities, satellites can't trust self-reported file failures. Instead the satellites use statistics based on continuous audits of a tiny sample of data. The node can't fake a good result on those audits. These results are then used to predict how likely it is that your node can't respond with the correct piece. If that likelihood gets too high, the node is disqualified. This is the only way to do this without spending a LOT of compute and bandwidth on a full check of all data.
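To give a sense of the scale involved (every number below is an illustrative assumption, not a satellite internal):

```python
# Rough scale of a full audit vs. sampled audits; every number here is an
# illustrative assumption.
node_data_tb = 5                 # data stored on one node
node_count = 10_000              # assumed number of nodes on a satellite
audit_stripe_bytes = 256         # an audit challenges only a small stripe
audits_per_node_per_day = 100    # assumed sampling rate

full_pass_bytes = node_data_tb * 1e12 * node_count
sampled_bytes_per_day = audit_stripe_bytes * audits_per_node_per_day * node_count

print(f"one full audit pass of all nodes: ~{full_pass_bytes / 1e15:.0f} PB")
print(f"sampled audits per day:           ~{sampled_bytes_per_day / 1e9:.2f} GB")
```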

There are, but all of these would impact so little data that your node will never be disqualified for it. I think that is almost certainly the case with the issues you're seeing as well.

I believe there is an existing suggestion on the forum for this already. But here's the problem with it: the satellite can't possibly know whether you are correctly reporting all the issues. If you count these reports towards the audit score, it would be an instant disqualification. If you don't, this could give SNOs a way to play around the disqualification threshold: remove data that is never downloaded, report it as lost and hope to get new data that's more profitable. Or when data loss occurs, report only part of it so it won't lead to disqualification, but you can still get paid for the rest. Giving SNOs these kinds of margins to play in makes the network less reliable, more vulnerable to cheating, and a LOT more complicated when it comes to getting an accurate measure of node reliability and thus also segment reliability. That measure is by far the most important in the entire network. If reliability fails, you might as well close up shop, because that's a hard game over.

So while it may in part be a limitation in dev resources, it's definitely also in large part by design. If a node has lost enough data that they want to report it, then why should the satellite trust that node not to repeat that behavior?

2 Likes

Very good arguments, thank you.
I miscalculated the complexity of allowing full audits/checks/repair.

Wow. With this possible attack, which I was far from imagining, I withdraw my request for this feature.

And I also remembered that the corrupted blobs will be "healed" automatically when the owner deletes the file, deleting all blobs from all nodes, corrupt or not. Right?
This way, it's a race between typical storage duration/churn and the rate at which these corruptions occur, and with the latter being so small, I think we are fine…

Ok, I'm going to trash the problematic HDD and stop worrying about my audit score.
I reckon that I'm far from disqualification with 300MB corrupted out of 3TB used, but is there any study/estimate of how many corrupted blobs (e.g. from an abrupt power loss) per TB shared will statistically lead to a 60% audit score?

There is some information on how the score is calculated in the blueprints on github. But the values for certain variables aren't included in these docs.

The white paper also has information on this. But you may have to dig through the code to find the actual implementation, and even then, the values used are probably in the config somewhere and not included in the code. I wouldn't worry about it too much though. Just have a look at your audit score once in a while. If you ever see it drop at all, I imagine it won't drop far.
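If you want a ballpark anyway, you could simulate it yourself: combine random audits with the beta-style score update from the blueprints. The forgetting factor, weight, audit rate and corruption fractions below are assumptions on my part, so treat the output as an order-of-magnitude toy, not a real threshold.

```python
# Monte-Carlo toy: how often does a node with a fraction f of corrupted
# pieces fall below the 0.6 threshold within a year of random audits?
# LAMBDA, WEIGHT, the audit rate and the f values are assumptions.
import random

LAMBDA, WEIGHT, THRESHOLD = 0.95, 1.0, 0.6
AUDITS_PER_YEAR = 365 * 100      # assumed audit rate

def ever_disqualified(f):
    alpha, beta = 1.0, 0.0
    for _ in range(AUDITS_PER_YEAR):
        success = random.random() >= f      # audit happens to hit a good piece
        v = 1.0 if success else -1.0
        alpha = LAMBDA * alpha + WEIGHT * (1 + v) / 2
        beta = LAMBDA * beta + WEIGHT * (1 - v) / 2
        if alpha / (alpha + beta) < THRESHOLD:
            return True
    return False

runs = 100
for f in (0.0001, 0.001, 0.01, 0.05):
    dq = sum(ever_disqualified(f) for _ in range(runs))
    print(f"corrupt fraction {f:.2%}: disqualified in {dq}/{runs} simulated years")
```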

As for the possible "attacks", it's all theoretical with V3, luckily. V2 was a bit of a different story, but Storj Labs has learned a lot from that, and reading through things like the whitepaper and discussions on the forums it seems pretty clear that excluding any possibility or even the slightest margin for cheating has been a core design principle from day one. So I imagine it will stay that way and anything that would give even the smallest opening won't be allowed into the product.

1 Like