An additional long-tail optimization for S3 MT

The current long-tail optimization for uploads is to create 110 unique pieces, select 110 nodes, start all 110 uploads concurrently, and stop as soon as 80 report success. I’m assuming the S3 MT Gateway works the same way. (I tried to find the code but got lost pretty quickly in the Storj MinIO repo.)
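Roughly, I picture the current flow like this simplified Go sketch (Node and uploadPiece are made-up stand-ins, not the actual libuplink/gateway code):

```go
package sketch

import (
	"context"
	"fmt"
)

// Node and uploadPiece are hypothetical stand-ins for the real piece-upload
// client; they exist only so the sketch compiles.
type Node struct{ Address string }

func uploadPiece(ctx context.Context, n Node, piece []byte) error {
	// ... dial n.Address and stream the piece ...
	return nil
}

// uploadLongTail starts all n piece uploads concurrently and cancels the
// stragglers as soon as `optimal` of them succeed (80 of 110 today).
func uploadLongTail(ctx context.Context, pieces [][]byte, nodes []Node, optimal int) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancellation stops the ~30 slowest uploads

	results := make(chan error, len(pieces))
	for i := range pieces {
		go func(i int) { results <- uploadPiece(ctx, nodes[i], pieces[i]) }(i)
	}

	successes := 0
	for range pieces {
		if err := <-results; err == nil {
			successes++
			if successes >= optimal {
				return nil
			}
		}
	}
	return fmt.Errorf("only %d of %d pieces stored", successes, len(pieces))
}
```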

For background, a segment requires k=29 pieces, is repaired when there are fewer than m=50 pieces, is initially created with o=80 pieces, and is uploaded in n=110 pieces, with the 30 slowest nodes getting canceled.
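Or, written out as constants (the names are mine, not the satellite’s):

```go
// Redundancy thresholds as I understand them today.
const (
	requiredPieces  = 29  // k: minimum pieces needed to reconstruct a segment
	repairThreshold = 50  // m: repair is queued when the piece count drops below this
	optimalPieces   = 80  // o: the upload waits for this many successes
	totalPieces     = 110 // n: pieces actually attempted; the slowest 30 are canceled
)
```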

That large redundancy of 80-29=51 pieces is there to avoid frequent repairs, given that nodes randomly fail and/or leave the network. I uploaded a 2-segment file 8 days ago that originally had 159 pieces and is now down to 151, so call it half a piece lost per segment per day (8 pieces across 2 segments in 8 days). At that rate a segment would cross the repair threshold after about 60 days, though I’m guessing piece dropout actually follows something more like a logarithmic curve than a linear one.
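Back-of-the-envelope, using the constants from the snippet above:

```go
// daysUntilRepair estimates how long a fresh segment lasts before crossing
// the repair threshold, given the ~0.5 pieces/segment/day loss I observed
// and assuming the loss is linear: (80 - 50) / 0.5 = 60 days.
func daysUntilRepair(piecesLostPerDay float64) float64 {
	return float64(optimalPieces-repairThreshold) / piecesLostPerDay
}
```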

An optimization that might perform well for S3MTG could go something like this: instead of creating and attempting to upload all 110 pieces and waiting for 80 successes before responding “upload successful”, the S3MTG could tell the S3 client the upload worked after only 50 pieces (maybe even fewer) are stored, and then keep waiting in the background until at least 80 are successful. If Something Bad happens during this second phase, it’s not a big deal: it will trigger a segment repair, or a repair can be manually triggered if necessary. The faster repairs happen, the fewer nodes the initial upload needs to hit.
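As a sketch of that two-phase flow, reusing the made-up Node/uploadPiece helpers from the first snippet (earlyAck would be 50 here, optimal 80):

```go
// uploadEarlyAck answers the S3 client once earlyAck pieces are stored, then
// keeps collecting successes in the background until optimal is reached or
// the attempts run out. Falling short in phase two just means the segment
// gets repaired sooner than usual.
func uploadEarlyAck(ctx context.Context, pieces [][]byte, nodes []Node, earlyAck, optimal int) (<-chan int, error) {
	results := make(chan error, len(pieces))
	for i := range pieces {
		go func(i int) { results <- uploadPiece(ctx, nodes[i], pieces[i]) }(i)
	}

	// Phase one: block only until earlyAck pieces are stored, then return so
	// the gateway can tell the S3 client the upload worked.
	successes, consumed := 0, 0
	for consumed < len(pieces) && successes < earlyAck {
		if err := <-results; err == nil {
			successes++
		}
		consumed++
	}
	if successes < earlyAck {
		return nil, fmt.Errorf("only %d of %d pieces stored", successes, len(pieces))
	}

	// Phase two: keep draining results in the background up to the usual
	// optimal count; report the final tally so a repair could be queued early.
	finalCount := make(chan int, 1)
	go func() {
		for consumed < len(pieces) && successes < optimal {
			if err := <-results; err == nil {
				successes++
			}
			consumed++
		}
		finalCount <- successes
	}()
	return finalCount, nil
}
```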

This doesn’t really work for the uplink CLI because there’s no other process running to finish the upload in the background. I guess it would work for the local S3 ST Gateway.

What could work down the road for the uplink CLI is to upload only 50 pieces and then trigger a repair. My understanding is that repairs are expensive today because the satellite has to fetch 29 pieces, create new pieces, and pay egress at 8 cents/GB to send them to storage nodes. But if storage nodes could carry out repairs themselves, maybe with some coordination help from the satellite, it might be an option. You’d only need enough redundancy on the initial upload to last through the repair, which I’d think should be able to complete within a few days.
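Hand-waving a lot, the kind of node-driven repair I’m imagining might look something like this. Nothing here exists today; RepairJob, fetchPiece, reencode, and storePiece are all made up:

```go
// RepairJob is a hypothetical work item a satellite could hand to a healthy
// node: enough information to rebuild a segment's pieces without the
// satellite ever touching (or paying to move) the data itself.
type RepairJob struct {
	SegmentID string
	Holders   []Node // nodes still holding pieces; need at least 29 of them
	Targets   []Node // nodes that should receive the regenerated pieces
}

// Made-up helpers so the sketch compiles.
func fetchPiece(ctx context.Context, n Node, segmentID string) ([]byte, error)      { return nil, nil }
func storePiece(ctx context.Context, n Node, segmentID string, piece []byte) error  { return nil }
func reencode(pieces [][]byte, count int) ([][]byte, error)                         { return make([][]byte, count), nil }

// nodeSideRepair is what a storage node might run when handed a RepairJob:
// fetch 29 pieces, Reed-Solomon decode/re-encode, push the new pieces out,
// and (not shown) report the new piece locations back to the satellite.
func nodeSideRepair(ctx context.Context, job RepairJob) error {
	if len(job.Holders) < requiredPieces {
		return fmt.Errorf("segment %s unrecoverable: only %d holders", job.SegmentID, len(job.Holders))
	}
	fetched := make([][]byte, 0, requiredPieces)
	for _, n := range job.Holders[:requiredPieces] {
		p, err := fetchPiece(ctx, n, job.SegmentID)
		if err != nil {
			return err
		}
		fetched = append(fetched, p)
	}
	newPieces, err := reencode(fetched, len(job.Targets)) // decode + re-encode the missing pieces
	if err != nil {
		return err
	}
	for i, n := range job.Targets {
		if err := storePiece(ctx, n, job.SegmentID, newPieces[i]); err != nil {
			return err
		}
	}
	return nil
}
```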
