Storagenode had corrupted data in over 400 blocks on the hard-drive storage

Well, my new node is up and running.

2 Likes

I have a 5TB node that has been running for months on all satellites. Due to a combination of an array misconfiguration and a faulty drive, it got 3MB of corrupted sectors in the storj folder, impacting 393MB across 180 blob files; 150 of those belong to one satellite, and the other satellites have between 1 and 10 corrupted files each.

The faulty HDD was promptly ejected from the software JBOD array (DrivePool), but left connected to attempt further repairs of the corrupted files (unsuccessful).

The problem occurred around one week ago.
I was expecting the audits/network/storj-magic to slowly recover the damaged files as they are detected, and somehow "make me pay for it".

I checked today: I see no changes to the damaged files (no new "good" versions of files with the same names), and I still have 100% scores.

Now I saw this thread and I am afraid of having the full 5TB node disqualified due to 1.5MB to 300MB of corrupted but contained data.
What is the expected behaviour of the storj-service in these situations?
Isn't there a command to force a repair of some specific files, or to check the full node, or similar?

Files are never repaired on your node. If a repair is triggered, it will upload newly repaired pieces to a random set of new nodes.

More importantly, audits are done on a random small subset of data. You've seen no drop in scores because audits simply haven't checked the damaged files yet. Since it's only a limited set of files that is affected, your node may survive for a long time in this state. Maybe even forever. If you're unlucky, at some point it may repeatedly audit damaged files and your node could eventually be disqualified. When that happens, the pieces on your node will be marked as lost, and segments whose availability falls below the repair threshold will be recovered to other nodes.
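To get a feel for the odds, here's a back-of-envelope sketch. The piece count and audit rate below are just my assumptions, not Storj figures: if a fraction f of your pieces is damaged and each audit samples a stored piece more or less at random, the chance of hitting at least one damaged piece after N audits is 1 - (1 - f)^N.

```python
# Back-of-envelope estimate; the piece count and audit rate below are
# illustrative assumptions, not Storj internals.
damaged_pieces = 180          # corrupted blob files mentioned above
total_pieces = 3_000_000      # rough guess for ~3TB of small pieces
audits_per_day = 100          # assumed audit rate for one node

f = damaged_pieces / total_pieces
for days in (30, 180, 365):
    n_audits = audits_per_day * days
    p_hit = 1 - (1 - f) ** n_audits
    print(f"{days:3d} days: P(at least one failed audit) ~ {p_hit:.0%}")
```

And even a hit is just a single failed audit, not an instant disqualification.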

So yeah, make sure you take the damaged HDD out of operation (it'll surely cause more damage if you don't). And then hope the damage isn't too much to survive.

Hmmm,
So, correct me if I'm wrong:

  • Any single corrupted (or missing) blob in my node will only be checked once, randomly, in an audit or when it's requested/downloaded (an upload from my node),
  • then be marked as corrupted/lost (and deleted on the node if it's corrupted?),
  • count as a single failure in that audit score,
  • be recovered from other nodes to some other node,
  • and never bother me again?

So, as long as there are only a few corrupted blobs, and I'm lucky enough not to have consecutive audits fail because of them, the node will be fine?

Also, is it better to leave the corrupted blobs there or remove them?

Close, but not entirely. Segments are audited at random; there is no guarantee that a given piece will ever be audited. Downloads don't count towards the score. Downloads are overprovisioned, so if a customer downloads a segment that includes the corrupt piece on your node, there will still be enough good pieces from other nodes to reconstruct the file. So the customer isn't impacted.

As far as I know audits are only used to judge whether the node should be disqualified. I don't think the pieces are marked as lost on a failed audit. This might only happen after the node is disqualified. Disqualification happens when the score drops below 60. At this point the chance that your node doesn't have a piece becomes too high to risk dealing with your node anymore and the satellite disqualifies it.

As far as I know, only after disqualification are the pieces on your node marked as lost. Basically it's all about determining whether your node should statistically be trusted to have a piece. A certain margin is acceptable as there is a lot of redundancy built in, but when that becomes too risky you get disqualified. It doesn't really bother with repairing individual pieces because most pieces are never audited anyway. You could try to repair pieces that were corrupt, but for every corrupt piece that is audited there are likely many more that never will be. So it would be a waste of time to deal with individual pieces, and it's best to just use that stat to determine the reliability of the node as a whole.

I realize that might be a little counterintuitive, but it's actually the best way to deal with it when you don't have the option to audit everything all the time.
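If you're curious what that score actually does under the hood, the blueprints describe a beta-reputation style update. Here's a rough sketch of it; the forgetting factor, weight and starting values are my own placeholders, since the real ones sit in the satellite configuration.

```python
# Sketch of a beta-reputation style audit score as described in the
# Storj reputation blueprint. LAMBDA, WEIGHT and the starting alpha/beta
# are assumed values; the real ones live in the satellite configuration.
LAMBDA = 0.95        # forgetting factor: how quickly old results fade (assumed)
WEIGHT = 1.0         # weight of each new audit result (assumed)
DQ_THRESHOLD = 0.6   # disqualification when the score drops below this

def update(alpha, beta, success):
    """Fold one audit result into the running reputation."""
    v = 1.0 if success else -1.0
    alpha = LAMBDA * alpha + WEIGHT * (1 + v) / 2
    beta = LAMBDA * beta + WEIGHT * (1 - v) / 2
    return alpha, beta

alpha, beta = 1.0, 0.0                               # assumed starting values
history = [True] * 50 + [False] * 5 + [True] * 20    # a short run of failures
for result in history:
    alpha, beta = update(alpha, beta, result)
score = alpha / (alpha + beta)
print(f"score after the run: {score:.3f} (disqualified below {DQ_THRESHOLD})")
```

With these toy numbers, a short run of failed audits dents the score but it climbs back once audits start passing again; it takes a sustained failure rate to push it under the threshold.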

So, the damage from lost/corrupt files will plague my node for a very long time?

I really hope it's simply due to dev priorities (totally agree), and here is why:

Storj uses erasure codes to increase reliability and bandwidth/storage efficiency at the expense of a little complexity during file storage and retrieval.
Some of the 80 pieces/nodes may be missing/down without any problem; only 29 are required.
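To put rough numbers on that (the per-node availability values below are assumptions for illustration, not Storj statistics): a segment survives as long as any 29 of its 80 pieces are reachable, which a simple binomial model captures.

```python
# Availability of a 29-of-80 erasure-coded segment, assuming pieces sit on
# independent nodes that are each reachable with probability p.
# The p values are assumptions for illustration, not Storj statistics.
from math import comb

N, K = 80, 29                              # pieces stored / pieces needed
print(f"expansion factor: {N / K:.2f}x")   # ~2.76x storage overhead

def segment_availability(p, n=N, k=K):
    """P(at least k of the n pieces are retrievable)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

for p in (0.4, 0.5, 0.7, 0.95):
    print(f"node availability {p:.0%} -> segment availability "
          f"{segment_availability(p):.6f}")
```

Even with only half of the piece-holding nodes reachable, the segment is still almost certainly reconstructable, which is part of why per-node redundancy buys the network relatively little.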

However, do you still expect or recommend SNOs to implement reliability/redundancy on their end (the last stage of the storj storage hierarchy)?
(which would make the system an 80*2 / 29)

I studied and dealt with information theory and erasure codes a few years ago in my work, and from a theoretical point of view I think it would be beneficial to have twice the storage/nodes without redundancy rather than half with redundancy, even if it required increasing the 80/29 ratio.

When I joined storj I thought this was a given, and that left me with only 2 choices:

  • Start one node per HDD.
  • Start one node per server, and use LVM/software-Raid in JBOD.

The second option seemed much better, since it allows me to dynamically increase/decrease space, and there's only one service instance per server and internet connection (one process consuming bandwidth, one node to keep track of, etc.).
When an HDD fails (shows signs of failure), 95% of the time 99% of the data is recoverable to another HDD.
I thought the remaining 1% of the data (or 100% of it in the other 5% of cases), i.e. the corrupted data, would be recovered from the network to other nodes.

I agree that an exhaustive audit would be counter-productive given the built-in redundancy of the whole network, but there should be a way to start a self-check/repair, even if triggered by the SNO, or run automatically on the most recent blobs after an abrupt shutdown, etc.
There are multiple events that can cause small file corruptions, even with RAID1 and UPSs in place.
Leaving a missing/corrupt blob in place without fixing/trashing it leads to a monotonically increasing number of faults and inevitably/probabilistically to a lower node score and disqualification, even if the node has been running flawlessly for a long time.

If I have 100 missing/corrupted blobs in my node, I should be able to come forward, force a check, inform the satellite/network of the missing blobs, and pay for the errors due to "I had them and got paid for them, but they were actually corrupted".

If such a command were available, the following decision would be much easier to make and advantageous for the whole Storj network:

  • Have RAID1/RAID5 redundancy on my storj space.
  • Have the same drives in JBOD and provide +100% of the space (vs. RAID1), or +25% with much lower CPU usage (vs. RAID5 with 4+1 drives), and perform a check/repair when an HDD fails, losing only a few $ (much less than the extra space generated).

In a sense… yes. It won't be fixed. But it's probably also small enough that you may never fail an audit at all. And if you do, it's likely a one-off occurrence and your score will recover pretty quickly.

From an architectural standpoint storagenodes are by definition untrusted entities, so this expectation can't be there. How SNOs manage their data is their business; the satellite only judges it based on the results. But Storj Labs has always said redundancy on the node's end is a waste of resources, and I happen to agree. Better to just share the space on HDDs you would otherwise use for redundancy as a separate node. That would actually make you more money in the long run. The 80/29 values are based on the assumption that nodes don't have their own redundancy, in part on node churn rates seen on V2, and I'm sure by now also on churn rates on V3. Read the white paper for more info on this.
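Just to illustrate the raw capacity side of that, without reopening the whole RAID debate: the drive count, size and failure rate below are assumptions, and this ignores vetting, held amounts and traffic distribution.

```python
# Simplified expected-capacity comparison: RAID5 (4+1) vs one node per
# drive. Drive count, size and annual failure rate are assumptions.
drives, size_tb = 5, 5              # e.g. five 5TB drives
annual_failure_rate = 0.05          # assumed per-drive failure probability

raid5_capacity = (drives - 1) * size_tb     # one drive's worth goes to parity
jbod_capacity = drives * size_tb            # every drive is shared

# Expected shareable space over a year if a failed independent node's data
# is simply lost (the network repairs it to other nodes, not back to you).
expected_jbod = jbod_capacity - drives * annual_failure_rate * size_tb

print(f"RAID5 (4+1) shareable:              {raid5_capacity} TB")
print(f"one node per drive, expected value: {expected_jbod:.2f} TB")
```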

I'm going to resist the temptation to go into the separate node vs RAID discussion here, but I suggest you search the forums for existing threads about this topic. There are good arguments for both sides.

The problem with this is that the satellite can't know which pieces are lost without doing a full audit of all data on your node. This would be extremely costly, to the point where it is cheaper to disqualify your node and have repair handle it. Because of this it's impossible to repair damage. And since storagenodes are by definition untrusted entities, satellites can't trust self-reported file failures. Instead the satellites use statistics based on continuous audits of a tiny sample of data. The node can't fake a good result on those audits. These results are then used to predict how likely it is that your node can't respond with the correct piece. If that likelihood gets too high, the node is disqualified. This is the only way to do this without spending a LOT of compute and bandwidth on a full check of all data.
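To give a sense of the scale involved (every number below is an illustrative assumption, not a satellite internal):

```python
# Rough scale of a full audit vs. sampled audits; every number here is an
# illustrative assumption.
node_data_tb = 5                 # data stored on one node
node_count = 10_000              # assumed number of nodes on a satellite
audit_stripe_bytes = 256         # an audit challenges only a small stripe
audits_per_node_per_day = 100    # assumed sampling rate

full_pass_bytes = node_data_tb * 1e12 * node_count
sampled_bytes_per_day = audit_stripe_bytes * audits_per_node_per_day * node_count

print(f"one full audit pass of all nodes: ~{full_pass_bytes / 1e15:.0f} PB")
print(f"sampled audits per day:           ~{sampled_bytes_per_day / 1e9:.2f} GB")
```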

There are, but all of these would impact so little data that your node will never be disqualified for it. I think that is almost certainly the case with the issues you're seeing as well.

I believe there is an existing suggestion on the forum for this already. But here's the problem with it: the satellite can't possibly know whether you are correctly reporting all the issues. If you count these reports towards the audit score, it would be an instant disqualification. If you don't, this could give SNOs a way to play around the disqualification threshold: remove data that is never downloaded, report it as lost and hope to get new data that's more profitable. Or when data loss occurs, report only part of it so it won't lead to disqualification, but you can still get paid for the rest. Giving SNOs these kinds of margins to play in makes the network less reliable, more vulnerable to cheating, and a LOT more complicated when it comes to getting an accurate measure of node reliability and thus also segment reliability. That measure is by far the most important in the entire network. If reliability fails, you might as well close up shop, because that's a hard game over.

So while it may in part be a limitation in dev resources, it's definitely also in large part by design. If a node has lost enough data that they want to report it, then why should the satellite trust that node not to repeat that behavior?

2 Likes

Very good arguments, thank you.
I miscalculated the complexity of allowing full audits/checks/repair.

Wow. With this possible attack, which I was far from imagining, I withdraw my request for this feature.

And I also remembered that the corrupted blobs will be "healed" automatically when the owner deletes the file, deleting all blobs from all nodes, corrupt or not. Right?
This way, it's a race between typical storage duration/churn and the rate at which these corruptions occur, and with the latter being so small, I think we are fine…

Ok, I'm going to trash the problematic HDD and stop worrying about my audit score.
I reckon that I'm far from disqualification with 300MB corrupted out of 3TB used, but is there any study/estimate of how many corrupted blobs (e.g. from an abrupt power loss) per TB shared will statistically lead to a 60% audit score?

There is some information on how the score is calculated in the blueprints on github. But the values for certain variables aren't included in these docs.

The white paper also has information on this. But you may have to dig through the code to find the actual implementation, and even then, the values used are probably in the config somewhere and not included in the code. I wouldn't worry about it too much though. Just have a look at your audit score once in a while. If you ever see it drop at all, I imagine it won't drop far.
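If you want a ballpark anyway, you could simulate it yourself: combine random audits with the beta-style score update from the blueprints. The forgetting factor, weight, audit rate and corruption fractions below are assumptions on my part, so treat the output as an order-of-magnitude toy, not a real threshold.

```python
# Monte-Carlo toy: how often does a node with a fraction f of corrupted
# pieces fall below the 0.6 threshold within a year of random audits?
# LAMBDA, WEIGHT, the audit rate and the f values are assumptions.
import random

LAMBDA, WEIGHT, THRESHOLD = 0.95, 1.0, 0.6
AUDITS_PER_YEAR = 365 * 100      # assumed audit rate

def ever_disqualified(f):
    alpha, beta = 1.0, 0.0
    for _ in range(AUDITS_PER_YEAR):
        success = random.random() >= f      # audit happens to hit a good piece
        v = 1.0 if success else -1.0
        alpha = LAMBDA * alpha + WEIGHT * (1 + v) / 2
        beta = LAMBDA * beta + WEIGHT * (1 - v) / 2
        if alpha / (alpha + beta) < THRESHOLD:
            return True
    return False

runs = 100
for f in (0.0001, 0.001, 0.01, 0.05):
    dq = sum(ever_disqualified(f) for _ in range(runs))
    print(f"corrupt fraction {f:.2%}: disqualified in {dq}/{runs} simulated years")
```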

As for the possible "attacks", it's all theoretical with V3, luckily. V2 was a bit of a different story, but Storj Labs has learned a lot from that, and reading through things like the whitepaper and discussions on the forums it seems pretty clear that excluding any possibility or even the slightest margin for cheating has been a core design principle from day one. So I imagine it will stay that way and anything that would give even the smallest opening won't be allowed into the product.

1 Like