I’ve been wondering whether some of the tasks a satellite performs can be distributed across nodes. Mainly auditing and repairing. I took some time to outline a possible solution for distributed auditing first since that seemed less complicated. Turns out it was longer than I expected, so I’ll leave the challenge to tackle repair for some other day. I’m sure there are holes in this approach that I overlooked, so don’t hold back, I can take it.
Idea also posted here: https://ideas.storj.io/ideas/V3-I-40
Currently, the satellite performs audits by retrieving the erasure shares for a stripe from all storagenodes that store one. It then runs the Berlekamp-Welch algorithm on these erasure shares to detect which (if any) are corrupt.
Moving this process to storagenodes introduces a few challenges:
- Storagenodes should be considered untrusted environments, so any work they do needs to be verified.
- With the planned removal of Kademlia, storagenodes have no ability to find other nodes themselves.
- Storagenodes need to be able to determine whether audit requests they get from other storagenodes are legitimate.
Selection of nodes and stripes to be audited
A recent design document already outlined a new method of selecting nodes and stripes to be audited. This task should remain in control of the satellite. Next the satellite should exclude nodes that are currently offline (listed in the failed_uptime_checks table with back_online IS NULL).
Distributing audit tasks
Because we can’t assume the node performing the audit can be trusted, we need to verify the results of the audits performed. If we let one node run the Berlekamp-Welch algorithm to determine which erasure shares are corrupt, there is no way for the satellite to verify that result. A node could simply claim all erasure shares are correct and move on.
To prevent that possibility, the remaining nodes should be split up into batches of size n, with n being the minimum number of erasure shares required to recreate the stripe (currently 29). For each batch, the satellite will select a node to perform the audit and send it an audit task containing signed download requests that contain the node IDs, node addresses and erasure shares to be audited.
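To make the batching step a bit more concrete, here is a rough sketch of how the satellite could split the nodes and pick an auditor per batch. Everything in it is hypothetical: the names `AuditBatch` and `make_batches`, the rule that an auditor is not part of the batch it audits, and the choice to return leftover nodes that don't fill a complete batch are my assumptions, not existing satellite code.

```python
import random
from dataclasses import dataclass

MIN_SHARES = 29  # n: minimum erasure shares needed to recreate a stripe


@dataclass(frozen=True)
class AuditBatch:
    auditor: str    # node ID selected to perform the audit (hypothetical)
    targets: tuple  # node IDs whose erasure shares are audited


def make_batches(holders, candidates, n=MIN_SHARES, rng=None):
    """Split online share holders into batches of n and pick an auditor each.

    holders: IDs of online nodes that store an erasure share of the stripe
    candidates: IDs of nodes that may act as auditor
    Returns (batches, leftover); leftover holds nodes that did not fill a batch.
    """
    rng = rng or random.Random()
    holders = list(holders)
    rng.shuffle(holders)
    full = len(holders) - len(holders) % n  # only complete batches are usable
    batches = []
    for i in range(0, full, n):
        targets = tuple(holders[i:i + n])
        # Assumption: a node should not audit a batch it is part of.
        eligible = [c for c in candidates if c not in targets]
        batches.append(AuditBatch(auditor=rng.choice(eligible), targets=targets))
    return batches, holders[full:]
```

With 80 online holders this gives two batches of 29 and 22 leftover nodes. What to do with those leftovers is not covered by the idea above and would need its own answer.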
The node performing the audit sends the signed audit request to each of the n nodes for the specific erasure share. The nodes receiving the audit request check the signature to make sure it is a legitimate audit request initiated by a trusted satellite and return the requested erasure share. The auditing node will then use erasure coding to try to recreate the stripe.
- If successful, it will create a hash of the stripe and send that hash back to the satellite to verify it against a stored hash for that stripe.
- If not successful, it will return an audit failure message to the satellite.
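The auditing node's side of this could look roughly like the sketch below. The erasure decoder is passed in as a hypothetical `decode` callable because I'm not reproducing real erasure coding here, SHA-256 is only my assumption for the hash, and the report format (`status`, `stripe_hash`, `reason`) is made up for illustration.

```python
import hashlib
from typing import Callable, Mapping, Optional


def run_audit(shares: Mapping[str, Optional[bytes]],
              decode: Callable[[list], bytes]) -> dict:
    """Rebuild the stripe from one batch and build the report for the satellite.

    shares: node ID -> downloaded erasure share, or None if the download failed
    decode: hypothetical erasure decoder; raises ValueError if it cannot rebuild
    """
    if any(share is None for share in shares.values()):
        # All n shares of the batch are required, so one missing share fails it.
        return {"status": "failure", "reason": "missing share"}
    try:
        stripe = decode([shares[node] for node in sorted(shares)])
    except ValueError:
        return {"status": "failure", "reason": "decode error"}
    # Assumption: SHA-256; the actual hash function is not decided here.
    return {"status": "ok", "stripe_hash": hashlib.sha256(stripe).hexdigest()}
```

One thing to keep in mind: with exactly n shares there is no redundancy left, so a corrupt share will most likely produce a wrong stripe rather than a decode error. That is why the hash comparison on the satellite does the real detection work.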
If the satellite for this stripe receives an audit failure from one of the auditing nodes or detects a mismatch between the returned stripe hash and the stored stripe hash, it will need to fall back to doing an audit on the entire stripe like before to detect which nodes/erasure shares are corrupt. This audit could optionally exclude nodes in batches that did return the correct stripe hash. Auditing on the satellite would only need to take place if one of the distributed audits fails.
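On the satellite, the fallback decision could be sketched like this. The function name `evaluate_reports` and the report fields are hypothetical; the only logic taken from the description above is "any failure or hash mismatch triggers a full audit, and batches with a matching hash may be skipped".

```python
def evaluate_reports(stored_hash, reports):
    """Decide whether the satellite must fall back to a full audit.

    stored_hash: the stripe hash the satellite has stored
    reports: list of (target_node_ids, report) pairs, where report is a dict
             with a "status" key and, on success, a "stripe_hash" key
    """
    verified, suspect = set(), set()
    for targets, report in reports:
        ok = (report.get("status") == "ok"
              and report.get("stripe_hash") == stored_hash)
        # A batch either verifies as a whole or is suspect as a whole.
        (verified if ok else suspect).update(targets)
    return {
        "needs_full_audit": bool(suspect),
        "can_skip": verified,  # nodes that could optionally be excluded
        "suspect": suspect,    # nodes the satellite still has to audit itself
    }
```

Note that a bad report does not tell the satellite whether one of the audited nodes or the auditor itself is at fault, which is exactly why the satellite-side audit is still needed in that case.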
Advantages
- By moving most of the work for audits to nodes, the resources available to perform audits scale with the number of nodes.
- The satellite only performs audits when a problem has been found.
Disadvantages
- The satellite is still required to perform audits whenever a distributed audit fails.
- The chance of a distributed audit failing is relatively high, since all n erasure shares in a batch are required to recreate the stripe. It is therefore essential that the nodes are actually online. This is only partially mitigated by checking the failed_uptime_checks table and might require uptime checks to happen right before the audit is distributed.
Open questions
- Should nodes be compensated for doing this work and if so, how?
- Should auditor participation by storagenodes be a setting and if so, what should the default be?
- How to make sure a node does not get too many audit tasks at the same time?
- What timeouts should be considered for 1) waiting for erasure shares to be downloaded from nodes being audited and 2) the entire process of auditing and returning the stripe hash to the satellite?
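To put a rough number on the concern that distributed audits will fail often: if every node in a batch responds in time with probability p, and nodes fail independently of each other (a simplifying assumption of mine, real outages are probably correlated), the whole batch only succeeds with probability p^n. A quick back-of-the-envelope sketch, with a made-up function name:

```python
def batch_success_probability(p_online: float, n: int = 29) -> float:
    """Chance that all n nodes of a batch respond, assuming independence."""
    return p_online ** n


if __name__ == "__main__":
    for p in (0.999, 0.99, 0.95):
        print(f"p={p}: batch succeeds with probability "
              f"{batch_success_probability(p):.3f}")
```

Even at 99% per-node availability, only about three out of four batches would complete without needing the satellite, and at 95% it drops to roughly one in four. So the uptime check right before distributing the audit looks more like a requirement than an option.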