Distribute audits across storagenodes

I’ve been wondering whether some of the tasks a satellite performs can be distributed across nodes. Mainly auditing and repairing. I took some time to outline a possible solution for distributed auditing first since that seemed less complicated. Turns out it was longer than I expected, so I’ll leave the challenge to tackle repair for some other day. I’m sure there are holes in this approach that I overlooked, so don’t hold back, I can take it. :smiley:

Idea also posted here: https://ideas.storj.io/ideas/V3-I-40

Currently audits are done by the satellite by retrieving a stripe from all storagenodes that store an erasure share for that stripe. These erasure shares are then audited by using a Berlekamp-Welch algorithm to detect which (if any) may be corrupt.

Moving this process to storagenodes introduces a view challenges

  1. Storagenodes should be considered untrusted environments, so any work they do needs to be verified.
  2. With the planned removal of Kademlia, storagenodes have no ability to find other nodes themselves.
  3. Storagenodes need to be able to determine whether audit requests they get from other storagenodes are legitimate.

Selection of nodes and stripes to be audited
A recent design document already outlined a new method of selecting nodes and stripes to be audited. This task should remain in control of the satellite. Next the satellite should exclude nodes that are currently offline (listed in the failed_uptime_checks table with back_online IS NULL).

Distributing audit tasks
Because we can’t assume the node performing the audit can be trusted, we need to verify the results of the audits performed. If we let one node run the Berlekamp-Welch algorithm to determine which erasure shares are corrupt, there is no way for the satellite to verify that result. A node could simply claim all erasure shares are correct and move on.
To prevent that possibility, the remaining nodes should be split up in batches of size n, with n being the minimum number of erasure shares required to recreate the stripe (currently 29). The satellite will select a node for each batch to perform the audits and and send it an audit task containing signed download requests that contain the node IDs, node addresses and erasure shares to be audited.

Auditing
The node performing the audit sends the signed audit request to each of the n nodes for the specific erasure share. The nodes receiving the audit request check the signature to make sure it is a legitimate audit request initiated by a trusted satellite and return the requested erasure share. The auditing node will then us erasure coding to try to recreate the stripe.

  • If successful, it will create a hash of the stripe and send that hash back to the satellite to verify it against a stored has for that stripe.
  • If not successful, it will return an audit failure message to the satellite.

Fallback
If the satellite for this stripe receives an audit failure from one of the auditing nodes or detects a mismatch between the returned stripe hash and the stored stripe hash, it will need to fall back to doing an audit on the entire stripe like before to detect which nodes/erasure shares are corrupt. This audit could optionally exclude nodes in batches that did return the correct stripe hash. Auditing on the satellite would only need to take place if one of the distributed audits fails.

Upsides

  • By moving most of the work for audits to nodes, the resources available to perform audits scale with the number of nodes
  • The satellite is only performing audits when a problem has been found

Downsides

  • Still requires the satellite to perform audits in case of failure on the distributed audit
  • Chance of failure on the distributed audit is relatively high since all erasure shares are required to recreate the stripe, therefor it is essential that nodes are actually online. This is only partially mitigated by checking the failed_uptime_checks table and might require uptime checks to happen right before the audit is distributed

Remaining considerations:

  • Should nodes be compensated for doing this work and if so how?
  • Should auditor participation by storagenodes be a setting and if so, what should the default be?
  • How to make sure a node does not get too many audit tasks at the same time?
  • What time outs should be considered for 1) waiting for erasure shares to be downloaded from nodes being audited and 2) the entire process of auditing and returning the stripe hash to the satellite
2 Likes

@BrightSilence Thank you for taking the time to propose this idea in such a structured way!

I agree it makes sense to distribute more tasks to storage nodes and offload the satellite. This is our long term plan too.

Your proposal assumes that the satellite keeps hashes of the stripe being audited, but this is not actually true. Currently, the satellite selects a segment to audit, and then a random stripe from this segment. Then it requests all nodes storing a piece from this segment to return the respective erasure shares. Then the satellite uses Berlekamp-Welch algorithm to detect if any of the returned erasure shares is an outlier in the reconstructed polynomial (stripe).

The beauty of the Berlekamp-Welch algorithm is such that it does not require any prior knowledge about the stripe (like a hash). The erasure shares from the storage nodes are all that is required to detect if any of them is wrong data.

Hence, the satellite does not store any stripe hashes. Also, it is not practical for doing this, because the audit is designed in a way that any stripe can be audited at any time. Keeping hashes for all possible stripes is comparable to storing the entire data on the satellite.

I realized this may have been an issue later on. However, it could still work except instead of comparing the hash to a stored hash you compare the hashes you get back from multiple auditing nodes checking on different batches of the same stripe. That way no stripe hash storage on the satellite is required, but you can still verify both auditing nodes recreated the same stripe based on different sets of erasure shares.

This approach does require at least 2n pieces to be available (n being the minimum number of erasure shares required to rebuild the piece). Other pieces could only be audited by the satellite as is.

hrmm, actually, it wouldn’t matter if there is overlap. Even with only n pieces you can have 2 nodes audit the same n erasure shares and compare results.

Say n = 29. There are 35 pieces available. Give 2 nodes each 29 erasure shares to check such that all are covered and 23 overlap. Compare results. Job done.

OK. This makes sense. The satellite may request 2-3 storage nodes to perform the same audit. If the result from all of them is successful and they return the same stripe hash then the satellite can accept this audit as successful. Otherwise, the satellite should fallback to normal centralized audit.

The main question for me is if this is economically viable. The satellite has to send all contact info to the auditing storage node. This is a message with the Node ID and address for all 80 (or less) nodes holding a piece from the segment. Sent to 2 or 3 nodes. This is egress data for the satellite, which is paid to the cloud provider where the satellite is running. Compare this to the current approach where the satellite downloads the erasure shares from the storage nodes, which is ingress data available for free from most cloud providers.

We have to measure the and ingress and egress data in both cases and make the calculations.

Yep, that was the danger with trying to tackle audits first. :slight_smile: There is likely much more to win with repairs.
For audits it might indeed not be economically viable. Either way, I enjoyed the thought exercise. If it helps in any way, even better. But if not, I’m ok with that too.

These were all good thoughts! Thanks for sharing them! We are eager to hear more of your ideas.

1 Like