Repair node operator

A repair node operator would have an SBC and a broadband connection. The satellite would send a message to the repair node…
“Please repair this block: 1234abcd.”
The repair node would answer: “I’ve downloaded the block 1234abcd from these IPs; add .00001 storj to my account.”

A repair node would have no storage for data.
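Something like this, purely as a sketch with made-up message types (none of this is the actual Storj node protocol):

```go
// Hypothetical message shapes for the exchange described above.
// Illustrative only; not part of the real satellite/node protocol.
package repairnode

// RepairRequest: "Please repair this block: 1234abcd."
type RepairRequest struct {
	BlockID string   // e.g. "1234abcd"
	Sources []string // IPs of nodes holding enough sibling pieces to rebuild it
}

// RepairReceipt: "I've downloaded the block ... add .00001 storj to my account."
type RepairReceipt struct {
	BlockID     string   // the block that was rebuilt
	UsedSources []string // which IPs the pieces were actually fetched from
	PayoutSTORJ float64  // e.g. 0.00001
}
```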

An interesting read.

This idea has been discussed a few times, but it’s quite complex to verify that an untrusted node performs repair correctly.

The node would have to prove that it recreated the correct (encrypted) segment data. And even if that’s done, it would have to prove that it created valid RS pieces for that segment and that those are now securely stored on new nodes.
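To make that concrete, here’s a minimal sketch using github.com/klauspost/reedsolomon as a stand-in RS library (Storj itself uses a different implementation): an RS verify pass only proves the shards are consistent with each other, not that they encode the original segment.

```go
package main

import (
	"fmt"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// 29-of-80 erasure coding, close to the numbers in this thread.
	enc, err := reedsolomon.New(29, 51)
	if err != nil {
		panic(err)
	}

	// Pretend this is the encrypted segment the repairer *claims* to have rebuilt.
	segment := make([]byte, 29*1024)
	shards, err := enc.Split(segment)
	if err != nil {
		panic(err)
	}
	if err := enc.Encode(shards); err != nil {
		panic(err)
	}

	// Verify only checks that the parity shards match the data shards.
	// A malicious repairer can produce a perfectly consistent shard set
	// for the *wrong* segment, which is exactly the problem described above.
	ok, err := enc.Verify(shards)
	fmt.Println("internally consistent:", ok, err)
}
```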

Right now, creating RS pieces and distributing them is always done by a trusted entity with a stake in keeping the data safe: either the uplink or the satellite. Doing this in a way that is independently verifiable simply isn’t built in yet. I tackled distributed audits a while ago: Distribute audits across storagenodes

That was already very complex and ended up being barely viable, if at all.

Repair is even more complex.

Technically you could have a node do a repair and then follow it up by auditing a stripe of the segment, either via the satellite itself or using my suggested distributed method.
If the satellite does this itself, it needs to know which pieces are original and trusted and check the other pieces against those. Otherwise the repair could create valid RS pieces, but not for the actual segment data.
The same goes for the distributed method. Nodes should then also be compensated at least a little for doing that work. And since more work needs to be done to ensure independent validation, the pay for an equal amount of work would have to be significantly less than what Storj pays its repair workers right now.
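Going back to the stripe check itself, here’s a rough sketch (again with klauspost/reedsolomon as a stand-in and a hypothetical function name): the verifier rebuilds the stripe from shares it trusts and compares the repaired share against the rebuilt one.

```go
package audit

import (
	"bytes"

	"github.com/klauspost/reedsolomon"
)

// stripeMatches rebuilds a stripe from trusted shares only and checks the
// repaired share against the rebuilt one. trusted must have nil entries
// for every share the verifier does not trust, including repairedIdx.
// Hypothetical helper, not actual Storj code.
func stripeMatches(enc reedsolomon.Encoder, trusted [][]byte, repairedIdx int, repaired []byte) (bool, error) {
	// Reconstruct fills in all nil shares from the trusted ones, as long
	// as at least the data-shard count (29 here) of shares are present.
	if err := enc.Reconstruct(trusted); err != nil {
		return false, err
	}
	return bytes.Equal(trusted[repairedIdx], repaired), nil
}
```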

The upside of that would be avoiding the larger data exchange that satellite-side repair requires and the related egress bandwidth. But the downside is that at least some of that egress now has to be paid to nodes, plus a lot of added complexity. Additionally, the added audit steps would slow down repairs, especially if audits are distributed as well, which might be necessary given the scale and number of audits this would trigger. And since repair has now been moved to separate, much cheaper repair workers, I don’t know if the savings would still be worth it.

I’d love to hear some other ideas on how to tackle this though.

Nah, you just delay payout until random audits have established reputation.

Not an option; that could take years. Repair happens when the availability of a segment is low, so it needs to be confirmed quickly. You can’t risk finding out a year later that the repaired pieces were all bogus. That would guarantee data loss. The system needs to assume there are people who intend to do harm and fortify against that.

Edit: also, a failed repair should be taken as seriously as a failed audit. Just withholding pay isn’t enough; nodes that do this frequently need to be disqualified quickly.

Aha, but with node repair you could trigger repair at 79/80 instead of satellite repair at 30/80. Integrity would still be assured by satellite repair.
Actively malicious nodes: yes, I see your point. Perhaps have contracts with actual people rather than random sign-ups.

Any repair requires downloading at least 29 pieces. So if you trigger repair at 79/80, you need to download and pay egress for 29 pieces every time a single piece gets lost. That would get really expensive, and it still wouldn’t guarantee the newly added pieces are valid. Over time a segment could still be lost if malicious repairs keep damaging pieces.
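To put rough numbers on that (assuming a 64 MB segment; the exact size only scales the result):

```go
package main

import "fmt"

func main() {
	const k = 29           // pieces needed to reconstruct a segment
	const segmentMB = 64.0 // assumed segment size, for illustration only
	pieceMB := segmentMB / k

	// Trigger at 79/80: a single lost piece already costs k downloads.
	fmt.Printf("79/80 trigger: %.1f MB egress per lost piece\n", k*pieceMB)

	// Trigger at 30/80: the same k downloads are amortized over 50 lost pieces.
	fmt.Printf("30/80 trigger: %.1f MB egress per lost piece\n", k*pieceMB/50)
}
```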

“Over time”: random audits would discredit malicious nodes.
“Cost”: what cost? Just set the payments accordingly.

Nodes are paid for repair egress. That cost.
Audits would discredit the receiving node, not the node that messed up the repair. That’s definitely not a solution. I don’t like the idea of relying on audits to begin with; that was just a quick first thought, not the best way to do it.

I think I’ve thought of a better way now. Satellites already store piece hashes for the pieces on nodes. So, in theory, the nodes receiving repaired pieces could generate a piece hash for the piece they received, which the satellite could match against what it was expecting. If it doesn’t match, that should impact the repair reputation of the node doing the repair.
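A minimal sketch of that check, assuming the piece hash is a plain SHA-256 over the piece bytes (the hash choice and the function name are my assumptions, not Storj’s actual API):

```go
package satellite

import (
	"bytes"
	"crypto/sha256"
)

// VerifyRepairedPiece is what the satellite would run when a node reports
// a piece it received from a repair node: compare the stored expected hash
// against the hash of the bytes that actually arrived.
func VerifyRepairedPiece(expectedHash []byte, receivedPiece []byte) bool {
	got := sha256.Sum256(receivedPiece)
	return bytes.Equal(expectedHash, got[:])
}
```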

However, repair recreates pieces that were lost or that never existed to begin with, which means the satellite currently wouldn’t have a piece hash to compare against. A solution could be that when the uplink creates a segment, it creates 110 pieces, starts the uploads, and stops after 80 are done. Instead of reporting back only on the 80 pieces that are actually stored on nodes, the uplink could send the satellite a complete list of all 110 piece hashes. The satellite could then keep those to compare against future repairs. When a piece is lost, the satellite could mark it as such but still keep the piece hash.
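Sketched with hypothetical types (the real uplink and satellite messages look nothing like this), the commit would carry all 110 hashes, with the ~30 unstored pieces simply lacking a node ID:

```go
package uplink

import "crypto/sha256"

// PieceCommit records one of the 110 generated pieces. Hypothetical type.
type PieceCommit struct {
	Hash   [sha256.Size]byte // piece hash, kept by the satellite for future repair checks
	NodeID string            // empty for the ~30 pieces whose uploads were cancelled
}

// commitSegment hashes every generated piece, stored or not, so the
// satellite can later validate repaired pieces it has never seen.
func commitSegment(pieces [][]byte, storedOn map[int]string) []PieceCommit {
	commits := make([]PieceCommit, len(pieces)) // len(pieces) == 110
	for i, p := range pieces {
		commits[i] = PieceCommit{
			Hash:   sha256.Sum256(p),
			NodeID: storedOn[i], // "" means: never stored, but repairable later
		}
	}
	return commits
}
```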

Any piece hash that doesn’t match, or that doesn’t get reported back to the satellite by a node that received it, can be marked as lost right away and could count against the repair reputation of the node doing the repair. If after a certain timeout the segment is still below the repair threshold, the same repair could be sent to a different node.
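Purely as a sketch of that bookkeeping, with every name invented for illustration:

```go
package satellite

import "crypto/sha256"

// Hypothetical satellite-side state, not actual Storj code.
type segmentState struct {
	expected [][sha256.Size]byte // all 110 piece hashes from the uplink
	healthy  int                 // pieces currently confirmed on nodes
}

const repairThreshold = 30 // the repair trigger used in this thread

// onRepairReport handles hashes reported by nodes that received repaired
// pieces. reported maps piece index -> hash of the received bytes.
func onRepairReport(seg *segmentState, reported map[int][sha256.Size]byte) (badRepair bool) {
	for idx, got := range reported {
		if got != seg.expected[idx] {
			badRepair = true // piece stays marked lost; ding the repairer's reputation
		} else {
			seg.healthy++
		}
	}
	// Caller: if badRepair, penalize the repair node; if seg.healthy is
	// still below repairThreshold after a timeout, reissue the repair job
	// to a different node.
	return badRepair
}
```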

My biggest concern is that there would still be a lot of traffic going to and from the satellite to coordinate all of that. Ideally that would be as little as possible. But it’s a start.
