Repair node operator

nerdatwork · June 19, 2021, 10:41am

An interesting read.

storj/storj/blob/889d2eaaeaffa25c04df222f45b1d68db0f7d521/docs/blueprints/trusted-delegated-repair.md

# Delegating Repair Work Outside The Satellite

## Abstract

At the time of this writing, repair work is all performed by the satellite directly. This has some drawbacks. Most notably, (1) repair throughput is limited by the network resources of the satellite, and (2) satellites with high-uptime network connections are using those same connections for repair traffic, which leads to untenably high egress traffic costs. We want to be able to delegate repair work to workers outside the satellite's immediate trust boundary, either to a set of servers managed by the same satellite operations staff in a different datacenter, or to storage nodes who have computing power to spare and want to be paid for the work.

## Background

Performing repair work on the satellite is the simplest configuration in terms of security and code complexity. Workers are simply goroutines in one of the satellite processes, and have access to the satellite's own identity keys. Repair results can be immediately trusted, destination storage nodes can be easily selected, and PUT orders can be trivially signed.

However, this is leading to excessively high costs per byte repaired. With the current configuration on a Tardigrade satellite being run in GCP, we pay between $0.08 and $0.23 in egress costs per gigabyte of repaired pieces (depending on GCP region, monthly usage totals, and geographic destination). A theoretical satellite being run outside of GCP may have a lower bandwidth cost, but might not be able to complete repair work fast enough (even with a large number of repair workers) due to having all repair traffic flow through the bottleneck of the satellite's own limited network drops.

Delegating repair work to servers hosted elsewhere will save on bandwidth costs. These servers do not need a network connection with high uptime SLAs, like the satellite API servers do; repair servers might be able to do their job successfully even if they experience a brief network outage every day. It is a fundamentally different class of network traffic, so satellite operators should not need to pay as much for it as they would for the highly-available API traffic.

Delegating repair work to storage nodes will have similar cost benefits, but will also come with the extra benefit of not requiring the maintenance and administration of the additional servers. Also, this would decentralize repair further, allow storage node operators to earn more, and let storage node operators participate more in the health of the network.

This blueprint contemplates the former approach: delegating repair work to servers hosted elsewhere. We will refer to this as "_trusted delegated repair_" or _TDR_, to distinguish it from delegating repair work to storage nodes ("_foreign repair_" or _FR_). We want to allow for foreign repair at some point, and this blueprint should keep that direction open, but it may be some time before we can implement that.

## Design

This file has been truncated. show original