Repair node operator

A repair node operator would have an SBC and a broadband connection. The satellite would send a message to the repair node…
“Please repair this block: 1234abcd.”
The repair node would answer: “I’ve downloaded the block 1234abcd from these IPs; add .00001 storj to my account.”

A repair node would have no storage for data.
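Something like this, purely as a sketch with made-up message types (none of this is the actual Storj node protocol):

```go
// Hypothetical message shapes for the exchange described above.
// Illustrative only; not part of the real satellite/node protocol.
package repairnode

// RepairRequest: "Please repair this block: 1234abcd."
type RepairRequest struct {
	BlockID string   // e.g. "1234abcd"
	Sources []string // IPs of nodes holding enough sibling pieces to rebuild it
}

// RepairReceipt: "I've downloaded the block ... add .00001 storj to my account."
type RepairReceipt struct {
	BlockID     string   // the block that was rebuilt
	UsedSources []string // which IPs the pieces were actually fetched from
	PayoutSTORJ float64  // e.g. 0.00001
}
```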

An interesting read.

This idea has been discussed a few times, but it’s quite complex to verify that an untrusted node performs repair correctly.

The node would have to prove that it recreated the correct (encrypted) segment data. And even if that’s done, it would have to prove that it created valid RS pieces for that segment and that those are now securely stored on new nodes.
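To make that concrete, here’s a minimal sketch using github.com/klauspost/reedsolomon as a stand-in RS library (Storj itself uses a different implementation): an RS verify pass only proves the shards are consistent with each other, not that they encode the original segment.

```go
package main

import (
	"fmt"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// 29-of-80 erasure coding, close to the numbers in this thread.
	enc, err := reedsolomon.New(29, 51)
	if err != nil {
		panic(err)
	}

	// Pretend this is the encrypted segment the repairer *claims* to have rebuilt.
	segment := make([]byte, 29*1024)
	shards, err := enc.Split(segment)
	if err != nil {
		panic(err)
	}
	if err := enc.Encode(shards); err != nil {
		panic(err)
	}

	// Verify only checks that the parity shards match the data shards.
	// A malicious repairer can produce a perfectly consistent shard set
	// for the *wrong* segment, which is exactly the problem described above.
	ok, err := enc.Verify(shards)
	fmt.Println("internally consistent:", ok, err)
}
```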

Right now, creating RS pieces and distributing them is always done by a trusted entity with a stake in keeping the data safe: either the uplink or the satellite. Doing this in a way that is independently verifiable simply isn’t built in yet. I tackled distributed audits a while ago: Distribute audits across storagenodes

That was already very complex and ended up being barely viable, if at all.

Repair is even more complex.

Technically you could have a node do a repair and then follow it up by auditing a stripe of the segment, either via the satellite itself or using my suggested distributed method.
If the satellite does this itself, it needs to know which pieces are original and trusted and check the other pieces against those. Otherwise the repair could create valid RS pieces, but not for the actual segment data.
The same goes for the distributed method. Nodes should then also be compensated at least a little for doing that work. And since more work needs to be done to ensure independent validation, the pay for an equal amount of work would have to be significantly less than what Storj pays its repair workers right now.
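Going back to the stripe check itself, here’s a rough sketch (again with klauspost/reedsolomon as a stand-in and a hypothetical function name): the verifier rebuilds the stripe from shares it trusts and compares the repaired share against the rebuilt one.

```go
package audit

import (
	"bytes"

	"github.com/klauspost/reedsolomon"
)

// stripeMatches rebuilds a stripe from trusted shares only and checks the
// repaired share against the rebuilt one. trusted must have nil entries
// for every share the verifier does not trust, including repairedIdx.
// Hypothetical helper, not actual Storj code.
func stripeMatches(enc reedsolomon.Encoder, trusted [][]byte, repairedIdx int, repaired []byte) (bool, error) {
	// Reconstruct fills in all nil shares from the trusted ones, as long
	// as at least the data-shard count (29 here) of shares are present.
	if err := enc.Reconstruct(trusted); err != nil {
		return false, err
	}
	return bytes.Equal(trusted[repairedIdx], repaired), nil
}
```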

The upside of that would be avoiding the larger data exchange that satellite-side repair requires and the related egress bandwidth. But the downside is that at least some of that egress now has to be paid to nodes, plus a lot of added complexity. Additionally, the added audit steps would slow down repairs, especially if audits are distributed as well, which might be necessary given the scale and number of audits this would trigger. And since repair has now been moved to separate, much cheaper repair workers, I don’t know if the savings would still be worth it.

I’d love to hear some other ideas on how to tackle this though.

Nah, you just delay payout until random audits have established reputation.

Not an option; that could take years. Repair happens when the availability of a segment is low, so it needs to be confirmed quickly. You can’t risk finding out a year later that the repaired pieces were all bogus. That would guarantee data loss. The system needs to assume there are people who intend to do harm and fortify against that.

Edit: also, a failed repair should be taken as seriously as a failed audit. Just withholding pay isn’t enough; nodes that do this frequently need to be disqualified quickly.

Aha, but with node repair you could trigger repair at 79/80 instead of satellite repair at 30/80. Integrity would still be assured by satellite repair.
Actively malicious nodes: yes, I see your point. Perhaps have contracts with actual people rather than random sign-ups.

Any repair requires downloading at least 29 pieces. So if you trigger repair at 79/80, you need to download and pay egress for 29 pieces every time a single piece gets lost. That would get really expensive, and it still wouldn’t guarantee the newly added pieces are valid. Over time a segment could still be lost if malicious repairs keep damaging pieces.
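To put rough numbers on that (assuming a 64 MB segment; the exact size only scales the result):

```go
package main

import "fmt"

func main() {
	const k = 29           // pieces needed to reconstruct a segment
	const segmentMB = 64.0 // assumed segment size, for illustration only
	pieceMB := segmentMB / k

	// Trigger at 79/80: a single lost piece already costs k downloads.
	fmt.Printf("79/80 trigger: %.1f MB egress per lost piece\n", k*pieceMB)

	// Trigger at 30/80: the same k downloads are amortized over 50 lost pieces.
	fmt.Printf("30/80 trigger: %.1f MB egress per lost piece\n", k*pieceMB/50)
}
```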

“Over time”: random audits would discredit malicious nodes.
“Cost”: what cost? Just set the payments accordingly.

Nodes are paid for repair egress. That cost.
Audits would discredit the receiving node, not the node that messed up the repair. That’s definitely not a solution. I don’t like the idea of relying on audits to begin with; that was just a quick first thought, not the best way to do it.

I think I’ve thought of a better way now. Satellites already store piece hashes for the pieces on nodes. So, in theory, the nodes receiving repaired pieces could generate a piece hash for the piece they received, which the satellite could match against what it was expecting. If it doesn’t match, that should impact the repair reputation of the node doing the repair.
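A minimal sketch of that check, assuming the piece hash is a plain SHA-256 over the piece bytes (the hash choice and the function name are my assumptions, not Storj’s actual API):

```go
package satellite

import (
	"bytes"
	"crypto/sha256"
)

// VerifyRepairedPiece is what the satellite would run when a node reports
// a piece it received from a repair node: compare the stored expected hash
// against the hash of the bytes that actually arrived.
func VerifyRepairedPiece(expectedHash []byte, receivedPiece []byte) bool {
	got := sha256.Sum256(receivedPiece)
	return bytes.Equal(expectedHash, got[:])
}
```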

However, repair recreates pieces that were lost or that never existed to begin with, which means the satellite currently wouldn’t have a piece hash to compare against. A solution could be that when the uplink creates a segment, it creates 110 pieces, starts the uploads, and stops after 80 are done. Instead of reporting back only on the 80 pieces that are actually stored on nodes, the uplink could send the satellite a complete list of all 110 piece hashes. The satellite could then keep those to compare against future repairs. When a piece is lost, the satellite could mark it as such but still keep the piece hash.
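Sketched with hypothetical types (the real uplink and satellite messages look nothing like this), the commit would carry all 110 hashes, with the ~30 unstored pieces simply lacking a node ID:

```go
package uplink

import "crypto/sha256"

// PieceCommit records one of the 110 generated pieces. Hypothetical type.
type PieceCommit struct {
	Hash   [sha256.Size]byte // piece hash, kept by the satellite for future repair checks
	NodeID string            // empty for the ~30 pieces whose uploads were cancelled
}

// commitSegment hashes every generated piece, stored or not, so the
// satellite can later validate repaired pieces it has never seen.
func commitSegment(pieces [][]byte, storedOn map[int]string) []PieceCommit {
	commits := make([]PieceCommit, len(pieces)) // len(pieces) == 110
	for i, p := range pieces {
		commits[i] = PieceCommit{
			Hash:   sha256.Sum256(p),
			NodeID: storedOn[i], // "" means: never stored, but repairable later
		}
	}
	return commits
}
```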

Any piece hash that doesn’t match, or that doesn’t get reported back to the satellite by a node that received it, can be marked as lost right away and could count against the repair reputation of the node doing the repair. If after a certain timeout the segment is still below the repair threshold, the same repair could be sent to a different node.
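Purely as a sketch of that bookkeeping, with every name invented for illustration:

```go
package satellite

import "crypto/sha256"

// Hypothetical satellite-side state, not actual Storj code.
type segmentState struct {
	expected [][sha256.Size]byte // all 110 piece hashes from the uplink
	healthy  int                 // pieces currently confirmed on nodes
}

const repairThreshold = 30 // the repair trigger used in this thread

// onRepairReport handles hashes reported by nodes that received repaired
// pieces. reported maps piece index -> hash of the received bytes.
func onRepairReport(seg *segmentState, reported map[int][sha256.Size]byte) (badRepair bool) {
	for idx, got := range reported {
		if got != seg.expected[idx] {
			badRepair = true // piece stays marked lost; ding the repairer's reputation
		} else {
			seg.healthy++
		}
	}
	// Caller: if badRepair, penalize the repair node; if seg.healthy is
	// still below repairThreshold after a timeout, reissue the repair job
	// to a different node.
	return badRepair
}
```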

My biggest concern is that there would still be a lot of traffic going to and from the satellite to coordinate all of that. Ideally that would be as little as possible. But it’s a start.
