Design draft: fixing deletes and server side copy

jtolio · June 22, 2023, 1:24pm

Hey friends, just wanted to point you to a new design document that it looks like we haven’t discussed here yet:

https://review.dev.storj.io/c/storj/storj/+/9978/

The goal of this design doc is twofold: speed up object deletion as seen by customers (slow deletes are a common customer complaint), and simplify our codebase dramatically (stop reference counting pieces).

There are three main impacts to storage node operators. If this design doc is implemented:

When objects are copied, storage node operators will get paid for each copy of relevant pieces their nodes have, even though they only store one copy.
Satellites will have far fewer individual operations to perform on storage nodes, and this should reduce connection overhead.
However, deleted objects will take just a tad longer to get into the trash. There won’t be more garbage mind you (the amount of stuff that goes through the trash will stay the same as present), but deletes won’t arrive until garbage collection starts processing.

The benefits on the delete and server side copy complexity side make this an extremely interesting change to us, so this design doc is actually already in implementation phase. We’re excited that the lazy filewalker should hopefully reduce the pain of GC. We’re looking at more improvements to GC going forward.

Let us know what you think

nerdatwork · June 22, 2023, 2:13pm

Would this work as if the piece were downloaded and then copied to another node? In that case, the satellite’s “Egress” rate would apply.

jtolio · June 22, 2023, 3:33pm

Oh, no, sorry, I should explain more.

When objects are “copied” we try to make it an efficient copy. No bytes of the object are actually transferred, but we want to represent that object now in two places as far as the customer can tell. Egress bandwidth payment will remain exactly the same - only paid when data is actually transferred.

So what I meant is, in this scenario, after an object is “copied” in the Satellite, the storage nodes will get paid at rest once for each Satellite copy, even though the storage node has just the one copy.

We’re essentially giving up on having the Satellite try and keep track of how many real copies the storage node has for the sake of simplicity.
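
To make the arithmetic concrete, here is a minimal Python sketch of how paid-at-rest bytes could diverge from bytes physically on disk. Everything in it is an assumption for illustration: the `Piece` type, the `satellite_copies` field, and the rate are hypothetical, and the Satellite would not actually keep a per-piece counter like this (dropping that bookkeeping is the point of the design).

```python
from dataclasses import dataclass

BYTES_PER_TB = 10**12


@dataclass
class Piece:
    size_bytes: int        # bytes physically stored on the node
    satellite_copies: int  # hypothetical: Satellite-side objects that include this piece


def physical_bytes(pieces):
    """Disk space the node really uses: every piece is stored once."""
    return sum(p.size_bytes for p in pieces)


def billable_bytes(pieces):
    """Bytes the node is paid for: each piece counts once per Satellite-side copy."""
    return sum(p.size_bytes * p.satellite_copies for p in pieces)


def monthly_payout(pieces, usd_per_tb_month):
    """Storage-at-rest payout for one month at an assumed flat rate."""
    return billable_bytes(pieces) / BYTES_PER_TB * usd_per_tb_month
```

For example, a node holding one 2 TB piece referenced once and one 1 TB piece whose object was copied twice (three Satellite-side copies) uses 3 TB on disk but would be paid for 5 TB at rest.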

nerdatwork · June 22, 2023, 3:53pm

Thank you. This makes a lot more sense now

So basically that copy (piece) will be paid at the given satellite’s storage rate ($ per TB).

Toyoo · June 22, 2023, 8:49pm

While I understand the value of simplified code, this sacrifices something that could be a competitive advantage over other S3 providers for stuff like cheap snapshots. Though, if this is something that no customer would actually want, then, I guess, it’s a good idea.

(BTW, some time after that message I got a reply from that guy basically stating that he no longer works at the company that would benefit from atomic snapshots. :shrug: maybe next time…)

jtolio · June 23, 2023, 12:55am

Oh no no, we can still do cheap snapshots. We’re not giving up on that. What we’re giving up is knowing exactly when the last reference to one of those snapshots disappears. The garbage collection process we have will clean it up, but not right at the soonest point that it could. That’s the tradeoff.

To make an analogy to programming language design, our change here is like changing from a reference counting or precise garbage collection strategy to a conservative or best effort mark and sweep.
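
A toy Python sketch of that analogy, for anyone who wants to see the difference in behavior. It is not the Satellite's code: both classes and all their names are made up for illustration, and the sweep uses an exact set of live piece IDs where the real garbage collection is approximate.

```python
class RefCountedStore:
    """Precise strategy: a piece is freed the moment its last reference goes away."""

    def __init__(self):
        self.objects = {}  # object key -> list of piece IDs
        self.refs = {}     # piece ID -> number of objects referencing it

    def put(self, key, piece_ids):
        self.objects[key] = list(piece_ids)
        for pid in piece_ids:
            self.refs[pid] = self.refs.get(pid, 0) + 1

    def copy(self, src, dst):
        # A server-side copy moves no bytes; it only adds references.
        self.put(dst, self.objects[src])

    def delete(self, key):
        # Every delete has to touch the counter of every piece.
        for pid in self.objects.pop(key):
            self.refs[pid] -= 1
            if self.refs[pid] == 0:
                del self.refs[pid]  # freed immediately

    def stored_pieces(self):
        return set(self.refs)


class MarkSweepStore:
    """Lazy strategy: deletes only drop metadata; a later sweep frees pieces."""

    def __init__(self):
        self.objects = {}    # object key -> list of piece IDs
        self.pieces = set()  # piece IDs currently on disk

    def put(self, key, piece_ids):
        self.objects[key] = list(piece_ids)
        self.pieces.update(piece_ids)

    def copy(self, src, dst):
        self.objects[dst] = list(self.objects[src])

    def delete(self, key):
        del self.objects[key]  # cheap: no per-piece bookkeeping

    def collect(self):
        # Mark: everything reachable from a live object. Sweep: drop the rest.
        live = {pid for ids in self.objects.values() for pid in ids}
        garbage = self.pieces - live
        self.pieces &= live
        return garbage

    def stored_pieces(self):
        return set(self.pieces)
```

In both stores a copied object keeps its pieces alive after the original is deleted. The difference shows up on the last delete: the reference-counted store frees the pieces right away, while the mark-and-sweep store keeps them until the next `collect()` call.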

Toyoo · June 23, 2023, 7:42am

Ok, but given you are going to pay node operators more for these snapshots, I assume you will need to pass this cost to customers as well, right?

jtolio · June 23, 2023, 1:41pm

Oh I see what you mean - yes, the customer invoicing will treat both copies separately. I misunderstood what you meant by “cheap.” You’re right, that’s another tradeoff.

BrightSilence · June 23, 2023, 8:01pm

I noticed this first on the changelog and already responded there. My main concern is that this concentrates all deletes into massive IO spikes on the storage node end. The lazy filewalker will help in some cases, but it doesn’t actually work on all IO schedulers. I fear this could cause issues.

Furthermore GC is infrequent to begin with, causing data to remain on nodes longer. This is made worse by the bloom filters only removing about 90% of garbage on each run. Plus then that data stays in trash for another week. All of this adds up to more and more unpaid data on nodes taking up space.
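
A rough back-of-the-envelope sketch of how this adds up, in Python. The 90% figure and the one-week trash period are the numbers from this post; the function names and the GC interval are assumptions for illustration only, and the model ignores new garbage arriving between runs.

```python
def garbage_remaining(initial_bytes, gc_runs, catch_rate=0.9):
    """Garbage still sitting in the blobs folder after a number of GC runs,
    assuming each run catches a fixed fraction of what is left."""
    if gc_runs < 0:
        raise ValueError("gc_runs must be non-negative")
    return initial_bytes * (1.0 - catch_rate) ** gc_runs


def days_until_freed(gc_interval_days, runs_needed=1, trash_days=7):
    """Worst-case days a deleted piece occupies disk space: it waits up to a
    full interval for each GC run it needs, then sits in the trash."""
    return gc_interval_days * runs_needed + trash_days
```

With these assumptions, 1000 GB of garbage shrinks to about 100 GB after one run and about 10 GB after two. If GC ran weekly, a piece caught on the first run could still occupy space for up to 14 days after deletion, and one missed by the first bloom filter for up to 21 days.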

jtolio · June 24, 2023, 1:48am

You are right that these are concerns, though we’re hopeful that in practice, some of these concerns will be offset by the other tradeoffs (node operators getting paid for each Satellite-side copy, instead of just once, etc).

Ultimately, I think the approach we’re hoping to take is to continue to improve GC (find ways to reduce its load demands and have it run more frequently). If we can do this (we have some ideas), we think it will be a net win overall.

jammerdan · June 24, 2023, 3:24am

jammerdan · June 24, 2023, 3:26am

Yeah it would be nice to get rid of all potential IO spikes.

nerdatwork · June 24, 2023, 3:57am

I agree.

Due to high IO my fail rate is terrible.

BrightSilence · June 26, 2023, 8:13am

I understand. I will keep an eye on IO performance and try to quantify the impact of GC taking care of deletes. My earnings calculator already compares the disk use reported by the satellite with total local disk use, so I will keep an eye on that too.