Audit system improvement idea

From what I understand, right now the audit system randomly checks whether a node still has a given file. If the node cannot prove that it has the file, it fails the audit check. Too many failed audits lead to disqualification.

On this forum people have complained that this means that even if you lose 1% of data (due to bad sectors on a disk for example, which happens quite often), your node will likely be disqualified. This leads to possibly many terabytes of data loss. According to posts on this forum by Storjlings, repairs are expensive and the escrow amount does not always cover it, so disqualification should be avoided as much as possible.

I propose a solution: what if, when a node fails an audit, that file is deleted from the node? The node will no longer store that file and will not receive any audits for that file in the future.

  • If people manage to save 95% of data from a failing disk, they will fail some audits, but their node will likely survive.
  • This will prevent nodes from going away completely, which avoids huge repair costs and reduces node churn.
  • People who try to cheat the system (for example by pretending they store more files than they actually do) will still get disqualified.

I don’t know enough about the audit process, maybe you already do this. Anyway, I thought of this, so I figured I’d share in case it helps.

EDIT: Unsurprisingly it turned out not everything I say here is correct, please read the comments below as well, especially those from BrightSilence and littleskunk


It is an interesting idea to prevent nodes from being disqualified because it might result in SNOs turning away from Storj completely.
On the other hand, a node that has lost customer data has proven to be unreliable. So simply deleting the file from a node but keeping it otherwise intact does not resolve the fact that it has proven to be unreliable.
So if the node is to be allowed to keep serving the remaining files or storing new files, there needs to be some kind of re-vetting of this failed node.

This is a good idea, but what if people start to cheat by deleting rarely used files to make room for files that are used more? That should be punished somehow.

Good point, I hadn’t thought about the fact that this allows a malicious SNO to very slowly delete files. I don’t know if this is something that will actually happen though. Generally as a SNO you want to keep as many files as possible, because more files means more egress and more money. I don’t think it is really possible to predict which files will be downloaded more than others. Besides, it may not even increase profits that much to make it worth doing (considering the very small amount of space that you will be able to free).

So simply deleting the file from a node but keeping it otherwise intact does not resolve the fact that it has proven to be unreliable.

The node will still fail audits. These errors don’t just get ignored; the node will get punished if it fails too many audits, just like it would now.

In fact, a new vetting period would discourage people from deleting files like this. But there are always some hot files that are popular, and other files that are rarely used, such as backups. Hot files give the most profit.

The big issue with this is that the vast majority of segments will never be audited. It wouldn’t be possible for the satellites to audit every segment they manage. So the auditing system doesn’t determine exactly which files are still there; its role is to determine the probability that a node can be trusted to store files reliably by randomly testing segments.

You mention 95%, and that may seem like a high percentage. But now consider the immense number of segments stored on Tardigrade and the fact that in the worst case there are 35 pieces, of which 29 are absolutely necessary to restore the segment. The chance that 6 out of those 35 are actually not available would become way too high, and data reliability could no longer be guaranteed. I did some rough calculations: if 5% of pieces could be lost, there is a 0.6% chance for a segment at the repair threshold to be lost completely. This would lead to a file durability of 99.4%, which is nowhere near the 99.9999999% Storj has set as a production goal (and has so far actually been able to achieve).
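A rough sketch of this kind of calculation, assuming every piece is lost independently with the same probability (a simplification; real losses are correlated per node). The function name `prob_at_least_missing` is made up for illustration, and the 35/29/5% figures are the ones from this post. The result depends on whether you count six missing pieces or seven as the point of loss: with 29 of 35 required, a segment is only unrecoverable once seven are gone.

```python
from math import comb


def prob_at_least_missing(total_pieces: int, loss_rate: float, min_missing: int) -> float:
    """Binomial tail: chance that at least `min_missing` of `total_pieces`
    are gone, if each piece is lost independently with `loss_rate`."""
    return sum(
        comb(total_pieces, k) * loss_rate**k * (1 - loss_rate) ** (total_pieces - k)
        for k in range(min_missing, total_pieces + 1)
    )


TOTAL = 35   # pieces left at the repair threshold (figure from the post)
NEEDED = 29  # pieces required to rebuild the segment (figure from the post)
LOSS = 0.05  # assumed fraction of pieces silently lost

spare = TOTAL - NEEDED  # 6 pieces can go missing before the segment is at risk
six_or_more = prob_at_least_missing(TOTAL, LOSS, spare)
seven_or_more = prob_at_least_missing(TOTAL, LOSS, spare + 1)

print(f"P(at least {spare} missing)     = {six_or_more:.4%}")
print(f"P(at least {spare + 1} missing) = {seven_or_more:.4%}")
```

Under these assumptions the six-or-more case lands in the same ballpark as the rough 0.6% figure. The stricter seven-or-more case is smaller, but still many orders of magnitude away from a durability target with nine nines.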

I understand that you would want them to be more lenient, but it’s simply not possible to let nodes lose data and still offer anywhere close to the reliability the network wants to ensure.

Additionally, you mentioned losing 1% of data to bad sectors on a disk.

If you have lost 1% of data on the disk to bad sectors, your disk is gone. I’ve never seen such extensive sector loss on a disk that survived. And chances are that disk is entirely dead before you could even copy the rest to another disk. It’s simply not realistic that you would recover from this to begin with.
If your disk starts corrupting a significant amount of data, your node SHOULD be disqualified ASAP. This is simply not a good example of data loss for which disqualifying the node would be unwise.

There are other examples I can think of that could be treated differently. Most are related to the data being unavailable for some reason. For example, when the node is running with a wrong path. Right now the node starts over as if it were a new node, leading to tons of issues. Or when the HDD for some reason gets unmounted or is inaccessible to the node. If the node software could detect this and simply refuse to start, or stop running, it could be treated as downtime instead of data loss, and the node might survive if the SNO is quick enough to resolve the issue. This would prevent data loss as a result of writing data to the wrong path (or inside the container in the case of docker setups), which in my opinion is the only kind of data loss that could be explained away sufficiently for the node to be trusted again in the future.

So instead of allowing data loss to some extent, the node software should be updated to make sure preventable data loss doesn’t happen, which will catch almost all examples we’ve seen mentioned in the forums. For any other type of data loss, the network should remain very strict about what’s allowed, to ensure reliability.


I have - a few bad sectors in the filesystem metadata area can make a lot of files disappear.

However, I agree that failing 1% of audits means you likely lost at least 1% of data, and the node should not be considered reliable. Still, I think such a node should be given a chance to upload whatever data it still has back to the network in exchange for a portion of the escrow money.

Right now, if I notice my node failing audits, I can do one of three things:

  1. Attempt graceful exit, knowing that it will most likely fail (since GE fails if a single file is missing) and I’ll see none of the escrow money.
  2. Keep the node running hoping that it does not get disqualified.
  3. Delete the node immediately and create a new one.

From my perspective, option 2 is better than option 1 - in the worst case my node gets disqualified anyway, in the best case my node continues running. Option 1 is only good if I want to shut down a node that has perfect reputation.
However, I’d bet that it would be better for the network if I chose option 1. So, maybe it should be made more attractive?

You have to lose more than 1% and people will always complain about getting DQed. I have seen many storage nodes that deleted data hoping that they would get away with that.

The chance to get 2 audits for the same segment is low anyway. It makes no difference if we delete the data or not. Keep in mind the audit system is unable to detect lost pieces. It can only detect bad storage nodes. For that reason it makes no sense to update the segment after failing an audit.

At the moment the audit system is working just fine. For every storage node that gets DQed I can find the failed audits in the satellite logs.

But you have to update it to calculate when a repair is needed right? So, what is the point of issuing audits for the same file multiple times after the first one failed?

Wrong. The repair job can deal even with 10% missing pieces and would still reconstruct the file.

The point is that this is not happening. The only way to get multiple audits for the same segment is to create a new node and store only 3 pieces. In that situation the audit system had better DQ early to set the right incentive.
As soon as you have a few thousand files you will not get 2 audits for the same segment.
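To put a rough number on this, here is a minimal sketch. It assumes every audit picks one of the node's pieces uniformly at random, which is a simplification and not necessarily how the satellite actually selects segments. The function name `prob_reaudited` and the audit count of 100 are made up for illustration.

```python
def prob_reaudited(pieces_on_node: int, later_audits: int) -> float:
    """Chance that one specific piece (e.g. one that already failed an audit)
    is picked again at least once in `later_audits` further audits,
    assuming each audit picks uniformly among the node's pieces."""
    if pieces_on_node <= 0:
        raise ValueError("node must hold at least one piece")
    miss_every_time = (1.0 - 1.0 / pieces_on_node) ** later_audits
    return 1.0 - miss_every_time


# A brand new node with 3 pieces versus nodes with more data.
for pieces in (3, 3_000, 300_000):
    print(f"{pieces:>7} pieces: {prob_reaudited(pieces, 100):.4%}")
```

With only a handful of pieces a repeat audit is almost certain; with thousands of pieces it becomes unlikely, and it keeps shrinking as the node fills up.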

Yeah, you’ve made some good points, maybe the problem is not as bad as I thought.

I would like nodes to be able to corrupt a couple of files without being disqualified. Like you said, there is already a chance that those files will never be audited. Even if they are, the audit score will probably stay high enough.

As long as satellites don’t keep re-auditing failed audits in the hope that they will be successful once, it shouldn’t be a problem. If that happens, nodes will get disqualified quickly if they lose a single file.

There’s a lot that can cause a node to lose a very small bit of data (corruption in memory, single bad sector, etc.). If the current system deals with that sufficiently without disqualifying the entire node, ignore my post.

But you mark the piece as missing after a failed audit, right? Otherwise there may be a situation when too many of the pieces are corrupt or missing.

The audit job is unable to check all segments. -> It makes no sense to mark missing pieces. Mark bad storage nodes instead!

Before that will happen the corresponding storage nodes are getting DQed and the repair service will start doing its job.

Why not give SNOs a way to verify the data they have, report the missing pieces, pay the reconstruction fee and keep the node? Or alternatively force them to do a graceful exit if the amount of data lost is too high.

The SNO benefits from it because keeping the node is obviously more profitable than starting a new one.
The network benefits from it because the earlier it knows about data loss, the earlier it can launch a recovery process and avoid losing data.
A malicious node does not benefit from this, as it only allows one to recover from minor loss; the normal audit process should still disqualify that node.

I see no reason to not implement this, unless network benefits from disqualifying nodes and collecting escrow.

Because this operation is actually more expensive than a repair. So let’s just wait for repair to kick in, and that’s it.

If a node operator was curious, could they somehow compare the number of entries in the garbage collector bloom filter to the number of blobs on their node to see if they were missing any files?

How can verification be more expensive than a repair?
You don’t need the network to verify local data consistency, except to provide a list of pieces and their hashes that should be present on a node. You don’t need it at all if you store this data locally, something a modified node can already do. Then, you don’t pay for GE as it just relocates data to other nodes.
And when you repair, you have to pay to download data from other nodes.

This is not a suggestion to audit all data on a node, it’s a suggestion to let nodes report their data loss.

I would suggest that the storagenode run a data integrity check itself and send a report to the satellite. The report can contain many more piece hashes, and the satellite will check it. This way it can detect faster if something is wrong.

I could get behind that, though there may be some challenges in implementing a system that transfers possibly unreliable data to other nodes. This process could even be started automatically after disqualification.

In these cases the node will likely survive the current system. You’d have to have lost a fairly significant amount to be caught multiple times in relatively quick succession. Honestly, I feel like the current scores may actually be too lenient, as you can also recover this score relatively quickly. But I’m sure some calculation went into the current system, so I’ll have to assume it’s up to par.

That’s not how bloom filters work. Think of it as complex logic that matches all piece hashes you should be holding. However, in order for this to be relatively small it matches a lot more hashes as well. This is also why it won’t delete all pieces that shouldn’t be there in one go. I think right now it cleans up about 90% every time. The next time it will run with a different seed and clean up a different 90% of what remains. You can’t count the number of entries in the bloom filter as it doesn’t have any entries.
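A toy illustration of that behaviour, not the actual storagenode implementation. The class `BloomFilter`, the helper `collect_garbage`, the SHA-256 based hashing and the sizing (tuned so that very roughly 10% of garbage slips through per pass) are all assumptions made for this sketch.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array: no stored entries, only bits set by hashes."""

    def __init__(self, size_bits: int, num_hashes: int, seed: int):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.seed = seed
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, piece_id: str):
        # Derive num_hashes bit positions from the seed and the piece ID.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{self.seed}:{i}:{piece_id}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, piece_id: str) -> None:
        for pos in self._positions(piece_id):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, piece_id: str) -> bool:
        # True for every added ID, and also for some IDs never added.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(piece_id))


def collect_garbage(stored: set, valid: set, seed: int,
                    size_bits: int = 4800, num_hashes: int = 3) -> set:
    """Keep only stored pieces the filter matches; everything else is deleted."""
    bloom = BloomFilter(size_bits, num_hashes, seed)
    for piece_id in valid:
        bloom.add(piece_id)
    return {piece_id for piece_id in stored if bloom.might_contain(piece_id)}


valid = {f"keep-{i}" for i in range(1000)}
garbage = {f"junk-{i}" for i in range(1000)}

after_first = collect_garbage(valid | garbage, valid, seed=1)
after_second = collect_garbage(after_first, valid, seed=2)
print("garbage left after pass 1:", len(after_first - valid))
print("garbage left after pass 2:", len(after_second - valid))
```

Valid pieces always match, so they are never deleted. Some garbage matches by accident and survives a pass, but a different seed produces different accidental matches on the next pass. The filter itself is just bits, so there is nothing in it to count and compare against the number of blobs on disk.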

Because you trigger repair only once on disqualification, but verification like this could be triggered many times over and is usually not needed. Besides, why should the satellite trust a node which has already lost a significant amount of data to never do that again?

The current system detects bad nodes fast enough already. And your suggestion would send a LOT more data and give the satellite much more to check. Either you send everything, which would be huge. Or the node would have the option to send only the data it knows is still good and could fool the satellite. The only reliable way to sample is to have the satellite pick the piece at random and have the node respond.
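A small simulation of why it matters who picks the sample. Everything here is hypothetical (piece IDs as integers, the function names, a node that lost 5% of 10,000 pieces); it only illustrates the sampling argument, not the real audit protocol.

```python
import random


def satellite_chosen_audit(held: set, expected: set, sample_size: int,
                           rng: random.Random) -> int:
    """Satellite picks the pieces to check; returns how many are missing."""
    challenge = rng.sample(sorted(expected), sample_size)
    return sum(1 for piece in challenge if piece not in held)


def node_chosen_report(held: set, expected: set, sample_size: int) -> int:
    """Node picks which pieces to report on. A dishonest node simply lists
    pieces it still holds, so the report never shows anything missing."""
    report = sorted(held & expected)[:sample_size]
    return sum(1 for piece in report if piece not in held)


rng = random.Random(1)
expected = set(range(10_000))                    # pieces the satellite thinks the node has
lost = set(rng.sample(range(10_000), 500))       # node silently lost 5%
held = expected - lost

print("missing in node-chosen report:    ", node_chosen_report(held, expected, 1_000))
print("missing in satellite-chosen audit:", satellite_chosen_audit(held, expected, 1_000, rng))
```

The node-chosen report comes back clean no matter how much was lost, while the satellite-chosen sample turns up missing pieces roughly in proportion to the real loss.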

Right now an SNO has absolutely no reason to keep a disqualified node online. The only rational action is to rm -rf the node and start again ASAP if the SNO is still interested.

What’s the harm if verification is done on the local machine, or if you have to pay to get the data from the satellite?
There’s no reason to trust any of the nodes. A node should be allowed to participate in the network as long as it’s profitable or has enough escrow to cover the repairs. It all boils down to numbers.