Design draft: fixing deletes and server side copy

Hey friends, I just wanted to point you to a new design document that we don’t seem to have discussed here yet:

The goal of this design doc is twofold - speed up object deletion as seen by customers (slow deletes are a common customer complaint), and simplify our codebase dramatically (stop reference counting pieces).

There are three main impacts on storage node operators if this design doc is implemented:

  • When objects are copied, storage node operators will get paid for each copy of relevant pieces their nodes have, even though they only store one copy.
  • Satellites will have far fewer individual operations to perform on storage nodes, and this should reduce connection overhead.
  • However, deleted objects will take just a tad longer to get into the trash. There won’t be more garbage mind you (the amount of stuff that goes through the trash will stay the same as present), but deletes won’t arrive until garbage collection starts processing.

The reduced complexity of deletes and server side copy makes this an extremely interesting change to us, so this design doc is actually already in the implementation phase. We’re hopeful that the lazy filewalker will reduce the pain of GC. We’re looking at more improvements to GC going forward.

Let us know what you think


Would it work like the piece being downloaded and then copied to another node? In that case, the satellite’s “Egress” rate would apply.

Oh, no, sorry, I should explain more.

When objects are “copied” we try to make it an efficient copy. No bytes of the object are actually transferred, but we want to represent that object now in two places as far as the customer can tell. Egress bandwidth payment will remain exactly the same - only paid when data is actually transferred.

So what I meant is, in this scenario, after an object is “copied” in the Satellite, the storage nodes will get paid at rest once for each Satellite copy, even though the storage node has just the one copy.

We’re essentially giving up on having the Satellite try and keep track of how many real copies the storage node has for the sake of simplicity.
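
To make the accounting concrete, here is a minimal sketch of the idea, not the actual Satellite code. The `ObjectRecord` type, the field names, and the piece size are all made up for illustration. The point is only that payment at rest follows the number of Satellite-side records, while disk use follows the number of distinct pieces.

```python
from dataclasses import dataclass


@dataclass
class ObjectRecord:
    """Hypothetical Satellite-side record pointing at a piece on one node."""
    key: str
    piece_id: str
    piece_bytes: int


def paid_bytes_at_rest(records):
    """Bytes the node is paid for: counted once per Satellite-side record."""
    return sum(r.piece_bytes for r in records)


def stored_bytes_on_disk(records):
    """Bytes physically on the node: each distinct piece counted once."""
    unique = {r.piece_id: r.piece_bytes for r in records}
    return sum(unique.values())


# A server-side copy adds a second record for the same piece; no bytes move.
original = ObjectRecord("photos/a.jpg", "piece-1", 2_000_000)
copy = ObjectRecord("backup/a.jpg", "piece-1", 2_000_000)
records = [original, copy]

paid = paid_bytes_at_rest(records)
stored = stored_bytes_on_disk(records)
```

With these made-up numbers the node is paid for twice the bytes it holds for that piece.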


Thank you. This makes a lot more sense now :slight_smile:

So basically that copy (piece) will be paid at the given satellite’s stored data rate ($ per TB).

While I understand the value of simplified code, this sacrifices something that could be a competitive advantage over other S3 providers for stuff like cheap snapshots. Though, if this is something that no customer would actually want, then, I guess, it’s a good idea.

(BTW, some time after that message I got a reply from that guy basically stating that he no longer works at the company that would benefit from atomic snapshots. :shrug: maybe next time…)

Oh no no, we can still do cheap snapshots. We’re not giving up on that. What we’re giving up is knowing exactly when the last reference to one of those snapshots disappears. The garbage collection process we have will clean it up, but not right at the soonest point that it could. That’s the tradeoff.

To make an analogy to programming language design, our change here is like changing from a reference counting or precise garbage collection strategy to a conservative or best effort mark and sweep.
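
For anyone who wants the analogy spelled out, here is a rough sketch of the two strategies. It is only an illustration with invented names (`RefCountedStore`, `sweep`), not how the Satellite or the storage node is written. The sweep here is exact for brevity; a conservative collector would sometimes keep garbage a little longer.

```python
class RefCountedStore:
    """Precise strategy: track every reference, free on the last drop."""

    def __init__(self):
        self.refs = {}

    def add_ref(self, piece):
        self.refs[piece] = self.refs.get(piece, 0) + 1

    def drop_ref(self, piece):
        """Return True if this was the last reference (delete right now)."""
        self.refs[piece] -= 1
        if self.refs[piece] == 0:
            del self.refs[piece]
            return True
        return False


def sweep(pieces_on_node, live_pieces):
    """Periodic strategy: no counters, just compare against the live set."""
    return set(pieces_on_node) - set(live_pieces)


# Reference counting: the piece is freed the moment the second copy goes away.
store = RefCountedStore()
store.add_ref("piece-1")  # original object
store.add_ref("piece-1")  # server-side copy
first_drop = store.drop_ref("piece-1")
second_drop = store.drop_ref("piece-1")

# Sweep: nothing happens at delete time; the next pass finds the leftovers.
to_trash = sweep({"piece-1", "piece-2"}, {"piece-2"})
```

The first version needs bookkeeping on every copy and delete. The second needs none, at the cost of garbage sitting around until the next pass.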


Ok, but given you are going to pay node operators more for these snapshots, I assume you will need to pass this cost to customers as well, right?

Oh I see what you mean - yes, the customer invoicing will treat both copies separately. I misunderstood what you meant by “cheap.” You’re right, that’s another tradeoff.


I noticed this first on the changelog and already responded there. My main concern is that this concentrates all deletes into massive IO spikes on the storage node end. The lazy file walker will help in some cases, but it doesn’t actually work on all IO schedulers. I fear this could cause issues.

Furthermore, GC is infrequent to begin with, causing data to remain on nodes longer. This is made worse by the bloom filters only removing about 90% of garbage on each run. On top of that, the data then stays in trash for another week. All of this adds up to more and more unpaid data on nodes taking up space.
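
As a rough illustration of why a bloom filter leaves some garbage behind, here is a toy sketch. It assumes the filter encodes the pieces the node should keep, and the sizes are picked by hand so the false positive rate comes out near 10%, using the usual estimate (1 - e^(-kn/m))^k with k=3 hashes, n=1000 live pieces and m=4800 bits. The class and function names are invented and this is not the real GC code.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def collect_garbage(pieces_on_node, keep_filter):
    """Pieces to trash: only those the filter definitely does not contain."""
    return [p for p in pieces_on_node if p not in keep_filter]


live = [f"live-{i}" for i in range(1000)]
garbage = [f"garbage-{i}" for i in range(1000)]

keep = BloomFilter(size_bits=4800, num_hashes=3)
for piece in live:
    keep.add(piece)

trashed = collect_garbage(live + garbage, keep)
leftover = len(garbage) - len(trashed)  # garbage that survives this run
```

Live pieces are never trashed, but any garbage piece that happens to collide with the filter survives until a later run.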


You are right that these are concerns, though we’re hopeful that in practice, some of these concerns will be offset by the other tradeoffs (node operators getting paid for each Satellite-side copy, instead of just once, etc).

Ultimately, I think the approach we’re hoping to take is to continue to improve GC (find ways to reduce its load demands and have it run more frequently). If we can do this (we have some ideas), we think it will be a net win overall.


Yeah it would be nice to get rid of all potential IO spikes.


I agree.


Due to high IO my fail rate is terrible.

I understand. I will keep an eye on IO performance and also try to quantify the impact of GC taking care of deletes. My earnings calculator already compares the disk use reported by the satellite with the total local disk use, so I will keep an eye on that too.