Faster delete performance

I did some poking around. Delete starts off here:

And ends up here:

The threshold is .75, so I think it is waiting until it deletes 75% of the pieces then lets garbage collection get the rest. This is like long-tail cancellation for deletes - I think.

I’m not a Go programmer, but it appears that pieces are deleted while the client is still waiting for a response. My suggestion would be to run deleteObjectPieces in a background thread / goroutine. It might be as easy as adding “go” in front of that call and returning an empty list for deletedObjects. That would probably make all deleted objects be in the “pending” state for a short while (I guess that’s what pending means - not all pieces have been deleted yet). I just tried overwriting a pending object and it worked fine, so doesn’t seem like that would be an issue.

Or, instead of starting a background thread immediately, use a fixed-size pool of workers to handle all piece deletes. This wouldn’t be like garbage collection which happens every week, and it’s not like the current piece delete queuing and jobs (I think) because the client is waiting while those are happening.

I read a post that customers haven’t complained about delete performance. That could be true; they might just leave. My view on performance is that it’s important because it’s easily measured by prospective customers, whereas things like durability, availability, and wide distribution are a lot harder to verify. If a company says “we have x performance, y durability, z availability” and the customer sees much different performance, it might cast doubt on the other things.

I did fast deletion like this:

64 processes did the trick. The only faster way would be to remove the whole bucket with uplink rb --force

What happens if the session running the modified uplink goes away? The simplest way to reproduce it is to use ssh without screen. What happens in that case?

That is intentional. If we take the shortcut we might end up filling the queue faster than the satellite can process it. That would be a battle we can never win. We need to push this back pressure to the customer. That way the satellite can continue operating. The performance might not be the best but that is still better than a satellite in a crash loop.

I appreciate your time investment digging into it. Do you have any other ideas you would like to share?

The threshold is .75, so I think it is waiting until it deletes 75% of the pieces then lets garbage collection get the rest. This is like long-tail cancellation for deletes - I think.

Not quite. It waits until 75% of the deletes have been executed and then returns. The deletes keep continuing in the background.

The threshold is there roughly for two reasons:

  1. To add back-pressure to the client so the satellite is able to keep up with individual requests.
  2. To avoid waiting for some storage node that takes too much time.
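
The "return at 75%, keep going in the background" behavior can be sketched roughly as below. This is only an illustration under assumptions, not the satellite's actual code: `deleteWithThreshold` and the `deletePiece` callback are hypothetical names, and a delete counts as "executed" whether or not it succeeded.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"time"
)

// deleteWithThreshold starts one goroutine per node and returns once the
// given fraction of them has finished. The remaining deletes keep running;
// the returned channel is closed when every delete has finished.
func deleteWithThreshold(nodes []string, threshold float64, deletePiece func(node string)) (int, <-chan struct{}) {
	finished := make(chan struct{}, len(nodes)) // buffered: late finishers never block
	var wg sync.WaitGroup
	for _, node := range nodes {
		wg.Add(1)
		go func(node string) {
			defer wg.Done()
			deletePiece(node)
			finished <- struct{}{}
		}(node)
	}

	allDone := make(chan struct{})
	go func() {
		wg.Wait()
		close(allDone)
	}()

	// Block the caller only for the first `need` completions.
	need := int(math.Ceil(threshold * float64(len(nodes))))
	if need > len(nodes) {
		need = len(nodes)
	}
	for i := 0; i < need; i++ {
		<-finished
	}
	return need, allDone
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c", "node-slow"}
	start := time.Now()
	waited, allDone := deleteWithThreshold(nodes, 0.75, func(node string) {
		if node == "node-slow" {
			time.Sleep(200 * time.Millisecond) // simulated long-tail node
		}
	})
	fmt.Printf("returned after %d of %d deletes (%v)\n",
		waited, len(nodes), time.Since(start).Round(time.Millisecond))
	<-allDone
	fmt.Println("background deletes finished")
}
```

With a threshold of 0 the call returns immediately and everything happens in the background, which is the "0% waiting threshold" variant discussed later in the thread.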

That would probably make all deleted objects be in the “pending” state for a short while (I guess that’s what pending means - not all pieces have been deleted yet).

There’s no pending state for the deleted objects. They are deleted immediately and then the delete requests are sent to the storage nodes. When anything fails during sending the requests we’ll let garbage collection take care of that data.

Or, instead of starting a background thread immediately, use a fixed-size pool of workers to handle all piece deletes.

A pool of workers is usually not a good idea in Go; it’s usually better to spin up a goroutine whenever needed. However, similar logic has been implemented in storj/satellite/metainfo/piecedeletion (storj/storj on GitHub, main branch), with a combiner queue per storage node.
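
To make the "combiner queue per storage node" idea concrete, here is a minimal sketch. It is not the piecedeletion package's API; the `Combiner` type and its methods are made up for illustration and only show how concurrent delete requests for the same node can be merged into one batch.

```go
package main

import (
	"fmt"
	"sync"
)

// Combiner collects piece IDs per storage node so that several concurrent
// delete requests can be sent to a node as one batch. Hypothetical type.
type Combiner struct {
	mu      sync.Mutex
	pending map[string][]string // node ID -> piece IDs waiting to be sent
}

func NewCombiner() *Combiner {
	return &Combiner{pending: make(map[string][]string)}
}

// Enqueue adds pieces to the batch for one node.
func (c *Combiner) Enqueue(nodeID string, pieceIDs ...string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pending[nodeID] = append(c.pending[nodeID], pieceIDs...)
}

// Flush hands back everything queued so far, one batch per node, and
// resets the queue.
func (c *Combiner) Flush() map[string][]string {
	c.mu.Lock()
	defer c.mu.Unlock()
	batches := c.pending
	c.pending = make(map[string][]string)
	return batches
}

func main() {
	c := NewCombiner()

	// Three concurrent "delete object" requests that touch overlapping nodes.
	var wg sync.WaitGroup
	for req := 0; req < 3; req++ {
		wg.Add(1)
		go func(req int) {
			defer wg.Done()
			c.Enqueue("node-1", fmt.Sprintf("piece-%d-a", req))
			c.Enqueue("node-2", fmt.Sprintf("piece-%d-b", req))
		}(req)
	}
	wg.Wait()

	for node, pieces := range c.Flush() {
		fmt.Printf("%s: one request carrying %d pieces\n", node, len(pieces))
	}
}
```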

Because deletes are slow, solutions like Alexey’s are created that use 64 threads to delete objects. My intuition is that this kind of behavior is more likely to overload a satellite than queued deletes. But you guys know the rates of new space allocation vs deletion and I don’t, so if you think it’s likely that queued deletes would never catch up, okay.

There are other advantages of deferring deletes beside faster client performance:

  1. If you ever implement S3 versioning, you will have to implement delete markers anyway. These could likely be used somehow for faster deletes.

  2. If you run the “delete pieces” queue every hour, there’s more opportunity to optimize communication with storage nodes that may have several pieces to delete.

  3. You can change the frequency you run the “delete pieces” queue based on satellite load, availability of free space, etc.

  4. Another option would be to set the object expiration time to “now” and let whatever mechanism handles expiration also handle regular deletes.

Because deletes are slow, solutions like Alexey’s are created that use 64 threads to delete objects.

Usually the batch of deletions happen due to two reasons:

  1. deleting all content from bucket
  2. deleting all content with a specific prefix

Currently we have added deleting a bucket; however, we haven’t yet implemented deletion by prefix. Having a single request for each of those would handle most bulk deletions.

  1. If you ever implement S3 versioning, you will have to implement delete markers anyway. These could likely be used somehow for faster deletes.

Yes, however, I think it would be faster to delete the data from the database immediately rather than delay it.

  1. If you run the “delete pieces” queue every hour, there’s more opportunity to optimize communication with storage nodes that may have several pieces to delete.
  2. You can change the frequency you run the “delete pieces” queue based on satellite load, availability of free space, etc.

Currently the deletions are already combined from multiple concurrent deletion requests.

The issue with adding a deletion queue would require keeping that queue somewhere, which would need to consume memory. Pushing it to disk also seems resource intensive.

For example, at peak we’ve seen 1M segment deletions per hour per server. Each segment would cost ~500B in memory. However, when flushing all that data, the storage nodes wouldn’t be able to keep up with the deletion requests. So, the deletion would need to be throttled to the storage nodes – which kind of defeats the purpose… so it might roughly end up in that same place, except the deletion is delayed 1h.
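
As a back-of-the-envelope check of those figures (1M segments per hour, roughly 500 bytes each, both taken from the post above), the in-memory backlog for one hour works out as follows. The helper name `queueBytes` is just for illustration.

```go
package main

import "fmt"

// queueBytes estimates how much memory a deferred-delete queue would hold
// after the given number of hours without being drained.
func queueBytes(segmentsPerHour, bytesPerSegment, hours int64) int64 {
	return segmentsPerHour * bytesPerSegment * hours
}

func main() {
	b := queueBytes(1_000_000, 500, 1)
	// Prints: 500000000 bytes = 476.8 MiB per server per hour of backlog
	fmt.Printf("%d bytes = %.1f MiB per server per hour of backlog\n", b, float64(b)/(1<<20))
}
```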

  1. Another option would be to set the object expiration time to “now” and let whatever mechanism handles expiration also handle regular deletes.

We’ve discussed a few times internally whether to drop the piece deletion logic altogether in favor of letting garbage collection handle the deletes. That would, however, leave a bunch of data on storage nodes for more time, but deletes would be faster.

We haven’t yet fully convinced ourselves that it’s a good idea, but we also haven’t rejected the idea altogether.

The difference to me as a user is I’m not waiting for it, and when I run performance tests, I see fast deletes. If it takes a week to actually remove the data, it doesn’t matter to me. If it takes 3-6 seconds to delete a file during the day (from the performance testing thread) and I have to wait for that or see it in a performance test, that matters to me as a user.

From the performance tests I did, deletes on Storj are 4x slower than B2 and GCS, and 18x slower than S3 for files of sizes 1K-64M. To me, that gives me a feeling like “Is this really ready?” It’s a subjective thing; even if it isn’t an important performance criterion for backups, people make these kinds of judgements all the time, especially about a new service they are trying out. That’s why it’s important to me, but maybe I’m just a performance freak (I am!)

Here is my completely dumb idea:

This code starts the delete process by deleting objects from the object table and getting a list of segments that match the stream_id from the object.

Instead of doing that, update a (new) deleted_at field in the object record with the current time (defaults to null); this is used for billing. Any existing queries on the object table have to add “and deleted_at is null” to skip deleted objects. Not fun, but at least not too hard. The segments still exist, so that sets up a race condition between the real delete handler and other processes like verifying a random segment, repair, etc. Have to handle that.

To record deletes (the queue), use timestamped text files. Each hour a new file is created. I know files are old fashioned, but they also are extremely fast and system-load friendly. You’d need a mutex when recording deletes, and while locked, can decide whether to start a new queue file or write to the current one based on the time. Each entry is a project id, bucket name, and object key. We’re only doing an SQL update, not a query, so these are the only fields available (or I guess you could do one of those “update … returning” queries; never used those). Write the list of objects to be deleted, unlock, then the delete is finished from the user POV and you’re done with the client request.

Sometime later, a scavenger looks through the deleted object worklists, does the query that was deferred above, deletes objects, segments, and pieces. I didn’t see where/how you know or record exactly when an object is deleted for billing purposes, so that might have to happen before deleting the object from the db, using the deleted_at timestamp.

If you have only one scavenger process, it’s sort of a good thing load-wise, but also eliminates the feature of deleting multiple pieces from a node. I dunno, maybe not, it’s hard for me to follow. If that’s a problem, a piece worklist file could be created with (node_id, piece_id, segment_id) entries. Sort that file on node_id - gnu sort is pretty fast! Then send out multiple piece deletes for each node. Sort it again on segment_id to do segment and pointer deletes in the db.
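
The "sort the worklist on node_id, then send one multi-piece delete per node" step could be done in memory along these lines instead of with an external sort. `PieceEntry` and `batchByNode` are illustrative names, not existing Storj types.

```go
package main

import (
	"fmt"
	"sort"
)

// PieceEntry is one line of the hypothetical piece worklist.
type PieceEntry struct {
	NodeID, PieceID, SegmentID string
}

// batchByNode sorts a copy of the worklist on NodeID and splits it into one
// batch per node, so each node receives a single multi-piece delete.
func batchByNode(entries []PieceEntry) [][]PieceEntry {
	sorted := append([]PieceEntry(nil), entries...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].NodeID < sorted[j].NodeID
	})

	var batches [][]PieceEntry
	for i, e := range sorted {
		if i == 0 || e.NodeID != sorted[i-1].NodeID {
			batches = append(batches, nil) // start a new batch for a new node
		}
		last := len(batches) - 1
		batches[last] = append(batches[last], e)
	}
	return batches
}

func main() {
	worklist := []PieceEntry{
		{"node-2", "piece-a", "seg-1"},
		{"node-1", "piece-b", "seg-1"},
		{"node-2", "piece-c", "seg-2"},
	}
	for _, batch := range batchByNode(worklist) {
		fmt.Printf("%s: delete %d pieces in one request\n", batch[0].NodeID, len(batch))
	}
}
```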

You can hang onto these work files and run them whenever the satellite load is light, when disk space is low, on regular intervals, whatever. They can also be thinned out as you get responses from storage nodes. If things work right, these files become your list of garbage.

I’m sure there are all kinds of flaws here. Feel free to use it if it makes any shred of sense, or have a good laugh about it. :slight_smile:

Edit: already broken for the case of creating an object, deleting it, and creating it again. It would still exist in the object table with deleted_at set. Messy! Maybe work this into versioning somehow, I dunno.

3-6 sec delete is definitely not normal or great, and way above what we should see. There was a week where we had that; however, at that time 75% of the time spent ended up being bottlenecked at the database.

At that point, I suspect it would be more effective to let GC deal with the deletions, rather than worry about attaching persistence to pods and managing the files. Or in other words use 0% waiting threshold and have a lossy queue to not fill up the memory.
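
A "lossy queue" in Go is typically just a bounded channel with a non-blocking send: when the buffer is full, the entry is dropped and garbage collection is relied on to clean it up later. The sketch below shows only that mechanism; the function name `offer` and the tiny capacity are assumptions for the demo.

```go
package main

import "fmt"

// offer tries to enqueue a piece ID without blocking. It reports false when
// the queue is full, in which case the entry is dropped and left for GC.
func offer(queue chan<- string, pieceID string) bool {
	select {
	case queue <- pieceID:
		return true
	default:
		return false
	}
}

func main() {
	queue := make(chan string, 2) // deliberately tiny capacity for the demo
	dropped := 0
	for _, id := range []string{"p1", "p2", "p3", "p4"} {
		if !offer(queue, id) {
			dropped++
		}
	}
	// Prints: queued=2 dropped=2
	fmt.Printf("queued=%d dropped=%d\n", len(queue), dropped)
}
```

Memory use is capped by the channel capacity, and the client never waits on storage nodes.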

The GC works by creating full bloom filters that each storage node should hold and then sends those to the storage nodes.
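
For readers unfamiliar with the approach, here is a toy version of Bloom-filter garbage collection. It is a sketch under assumptions, not the real implementation: the filter size, the FNV-based double hashing, and the function names are all chosen for illustration. The key property is that a piece the satellite wants kept is never reported as garbage, while a small fraction of real garbage may survive due to false positives.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Bloom is a minimal Bloom filter using double hashing over a 64-bit FNV sum.
type Bloom struct {
	bits []bool
	k    int // number of probes per item
}

func NewBloom(m, k int) *Bloom {
	return &Bloom{bits: make([]bool, m), k: k}
}

// indexes returns the k bit positions for an ID.
func (b *Bloom) indexes(id string) []int {
	h := fnv.New64a()
	h.Write([]byte(id))
	sum := h.Sum64()
	h1, h2 := uint64(uint32(sum)), uint64(uint32(sum>>32)|1)
	out := make([]int, b.k)
	for i := range out {
		out[i] = int((h1 + uint64(i)*h2) % uint64(len(b.bits)))
	}
	return out
}

// Add records an ID in the filter.
func (b *Bloom) Add(id string) {
	for _, i := range b.indexes(id) {
		b.bits[i] = true
	}
}

// MayContain is always true for added IDs and occasionally true for others.
func (b *Bloom) MayContain(id string) bool {
	for _, i := range b.indexes(id) {
		if !b.bits[i] {
			return false
		}
	}
	return true
}

// collectGarbage is what a storage node would run: every stored piece that is
// definitely not in the satellite's "keep" filter can be removed.
func collectGarbage(keep *Bloom, stored []string) []string {
	var garbage []string
	for _, id := range stored {
		if !keep.MayContain(id) {
			garbage = append(garbage, id)
		}
	}
	return garbage
}

func main() {
	// Satellite side: build a filter of the pieces the node should hold.
	keep := NewBloom(1024, 3)
	for _, id := range []string{"piece-1", "piece-2"} {
		keep.Add(id)
	}
	// Node side: compare the filter against what is actually on disk.
	stored := []string{"piece-1", "piece-2", "piece-deleted"}
	fmt.Println("garbage:", collectGarbage(keep, stored))
}
```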

I’ll dig into why the delete times have gone up. It could be just the storage node communication overhead or something thereabouts… but it also could be something we missed.

PS: Also thanks for digging and providing suggestions. Definitely great to hear feedback from outside.

I have rerun tests several times the past few days and have not seen these very long delete times again, so maybe it was indeed a db fluke. But to me that is more justification for deferring deletes and doing the work at a time of your choosing vs letting clients dictate when you do the work. :slight_smile:

I re-ran the tests a few minutes ago (see below). You mentioned setting the threshold to 0 to eliminate waiting on storage nodes. This can be observed already in the test results by looking at the performance difference between deleting 1K and 4K files, stored inline, vs deleting larger files that are on storage nodes. Deleting inline files is 3-4x faster.

It seems both db and storage nodes could be bottlenecks. Setting the threshold to 0 eliminates the SN bottleneck, but my understanding is that all of the db work is still being done while the client waits. If deletes are deferred, it eliminates both bottlenecks.

So now that I’ve convinced you to reconsider deferred deletes :rofl: , I had some other ideas:

In the current non-versioned implementation, it would probably be better to delete the object record immediately rather than set a deleted_at field, because deleting a db record is probably as fast as an update unless it causes a tree rotation. Doing the object delete also avoids having to change all the object queries to avoid deferred deleted object records. So rather than queuing project, bucket, key, you’d queue the stream_id and any other fields you need later from the deleted record.

I’m not familiar with GCS, but understand that there isn’t persistent storage by default. So instead of that, you could run a simple service alongside the db that manages the delete log. All it does is take delete requests (stream_id + other data) and write them to a persistent, timestamped file - no need for locking now. Ideally, the delete request handler nodes could make a one-time connection to the delete log service, then for delete requests, delete the object record and then do a non-blocking network write to the delete logger. If this network write fails for some reason, you’d have segment and piece records that aren’t associated with an object record to clean up, so it’s a simpler form of garbage collection. If you want to keep some backpressure, blocking writes might work better.

Speaking of GC, it seems if you keep these delete logs until they are fully processed, you could avoid the probabilistic garbage collection being done with the Bloom filter. When you “run” a delete log, remove any entries for pieces that are acknowledged deleted from a SN and keep the others for later processing. You can also process log files on different schedules as they age, so they get processed more quickly when new, but less frequently as they age and shrink. If a node gets disqualified, or whatever condition makes its data invalid, you delete its pieces from the delete logs too as if it acknowledged the delete. These logs would all be kept sorted on nodeid so it’s easy to merge them too, i.e., as you process an hourly log, you can merge the leftovers into a daily log of failed deletes. If you process the daily log once a day, you merge the leftovers into the weekly log, those leftovers go into a monthly log, etc.

When I mentioned implementing object versioning and delete markers, you said:

With object versioning, you can’t delete the data immediately: all you can do is mark it deleted, because that version is still retrievable. If you ask for a key w/o a version, you’ll get a 404 (Not Found). But you can also ask for a specific version, even if deleted, and you’ll get the data. So the version id becomes part of the object’s unique identifier. All of the object queries also have to get changed to avoid hidden versions by default, sort of like I mentioned before with deleted_at, only now it would be hidden_at I guess.

Here are the results of the most recent test, with shorter, more consistent delete times. The latest version of HB has a lot of options that can be used with these tests to set the number of rounds per test, file sizes tested, and a delay to run continuous testing, sort of like “top”. I guess it would be kinda cool to add JSON output so the results could be graphed.

[root@hbtest ~]# hb dest -c hb test
HashBackup #2576 Copyright 2009-2021 HashBackup, LLC
Using destinations in dest.conf
Warning: destination is disabled: b2
Warning: destination is disabled: gs

2021-10-23 14:22:03 ---------- Testing sjs3 ----------

  1 KiB:
    Round 1: up: 0.4s, 2.606 KiB/s  down: 0.2s, 4.859 KiB/s  del: 0.2s, 5.236 KiB/s
    Round 2: up: 0.2s, 4.101 KiB/s  down: 0.2s, 4.600 KiB/s  del: 0.1s, 7.305 KiB/s
    Round 3: up: 0.2s, 4.006 KiB/s  down: 0.2s, 4.101 KiB/s  del: 0.3s, 3.365 KiB/s
  > Average: up: 0.3s, 3.420 KiB/s  down: 0.2s, 4.498 KiB/s  del: 0.2s, 4.800 KiB/s

  4 KiB:
    Round 1: up: 0.5s, 8.564 KiB/s  down: 0.2s, 21.841 KiB/s  del: 0.3s, 15.958 KiB/s
    Round 2: up: 0.3s, 13.645 KiB/s  down: 0.1s, 28.868 KiB/s  del: 0.1s, 28.971 KiB/s
    Round 3: up: 0.3s, 12.035 KiB/s  down: 0.2s, 22.501 KiB/s  del: 0.2s, 19.036 KiB/s
  > Average: up: 0.4s, 10.983 KiB/s  down: 0.2s, 24.025 KiB/s  del: 0.2s, 20.038 KiB/s

  16 KiB:
    Round 1: up: 0.9s, 16.889 KiB/s  down: 0.6s, 26.580 KiB/s  del: 0.6s, 27.891 KiB/s
    Round 2: up: 1.0s, 15.895 KiB/s  down: 0.7s, 24.477 KiB/s  del: 0.6s, 27.117 KiB/s
    Round 3: up: 1.1s, 14.468 KiB/s  down: 0.6s, 26.585 KiB/s  del: 0.7s, 21.405 KiB/s
  > Average: up: 1.0s, 15.687 KiB/s  down: 0.6s, 25.842 KiB/s  del: 0.6s, 25.115 KiB/s

  256 KiB:
    Round 1: up: 1.0s, 260.367 KiB/s  down: 0.7s, 391.016 KiB/s  del: 0.7s, 351.846 KiB/s
    Round 2: up: 0.9s, 295.807 KiB/s  down: 0.6s, 415.211 KiB/s  del: 0.8s, 316.241 KiB/s
    Round 3: up: 0.9s, 286.181 KiB/s  down: 0.5s, 481.278 KiB/s  del: 0.5s, 471.965 KiB/s
  > Average: up: 0.9s, 279.965 KiB/s  down: 0.6s, 425.915 KiB/s  del: 0.7s, 369.317 KiB/s

  1 MiB:
    Round 1: up: 1.5s, 704.531 KiB/s  down: 0.7s, 1.398548 MiB/s  del: 0.6s, 1.652928 MiB/s
    Round 2: up: 1.3s, 772.627 KiB/s  down: 0.8s, 1.294126 MiB/s  del: 0.6s, 1.711734 MiB/s
    Round 3: up: 1.2s, 856.161 KiB/s  down: 0.7s, 1.410401 MiB/s  del: 0.8s, 1.283051 MiB/s
  > Average: up: 1.3s, 772.862 KiB/s  down: 0.7s, 1.365643 MiB/s  del: 0.7s, 1.523940 MiB/s

  4 MiB:
    Round 1: up: 1.5s, 2.592684 MiB/s  down: 1.1s, 3.742109 MiB/s  del: 0.6s, 6.733881 MiB/s
    Round 2: up: 1.3s, 2.997739 MiB/s  down: 1.2s, 3.300226 MiB/s  del: 0.8s, 4.836759 MiB/s
    Round 3: up: 1.6s, 2.555011 MiB/s  down: 1.1s, 3.481191 MiB/s  del: 0.7s, 5.765874 MiB/s
  > Average: up: 1.5s, 2.701065 MiB/s  down: 1.1s, 3.498556 MiB/s  del: 0.7s, 5.674434 MiB/s

  16 MiB:
    Round 1: up: 1.9s, 8.357356 MiB/s  down: 1.6s, 10.209063 MiB/s  del: 0.5s, 29.879560 MiB/s
    Round 2: up: 1.7s, 9.453829 MiB/s  down: 1.5s, 10.394132 MiB/s  del: 0.6s, 28.540895 MiB/s
    Round 3: up: 2.2s, 7.390860 MiB/s  down: 1.5s, 10.510600 MiB/s  del: 0.5s, 31.748618 MiB/s
  > Average: up: 1.9s, 8.316364 MiB/s  down: 1.5s, 10.369774 MiB/s  del: 0.5s, 29.999228 MiB/s

  64 MiB:
    Round 1: up: 2.8s, 22.645626 MiB/s  down: 3.8s, 16.967962 MiB/s  del: 0.8s, 79.814446 MiB/s
    Round 2: up: 3.4s, 18.717041 MiB/s  down: 2.8s, 22.964971 MiB/s  del: 0.7s, 89.791919 MiB/s
    Round 3: up: 3.1s, 20.706693 MiB/s  down: 3.9s, 16.356410 MiB/s  del: 0.7s, 91.333468 MiB/s
  > Average: up: 3.1s, 20.564925 MiB/s  down: 3.5s, 18.335471 MiB/s  del: 0.7s, 86.668108 MiB/s

Edit: I glossed over some details about the delete logs. There are two different ways they can be used.

The first way, and what I’d suggest as an initial proof-of-concept, is to have hourly object delete logs with the stream-id and any extra info from the already-deleted object record. To process these, you’d do what is being done today (except the object record is already gone). This could be done in a single thread to have a very controlled load, or the delete log could be processed in parallel by splitting it into N sections and running them concurrently. Or, if it turns out to take more than an hour to process a delete log, you could start a process for the previous delete log whenever a new log is started to get “hourly” concurrency. After a delete log is processed, it can be removed because all of the work has been done. There may still be undeleted pieces, like today, because nodes are offline, and the GC would catch these. You could set the threshold to 100 to lower load and concurrency, or set it lower for more concurrency.

The second, more involved way to handle delete logs starts out with the same hourly object delete log files. These would be used to delete the segment records, like above, and to delete the pointer records like above, but instead of talking with the storage nodes, it creates piece delete logs with the nodeid and pieceid. All of the database work has been done at this point. The piece delete logs can be run at any time, like the object delete logs. Piece delete log items would initially be grouped by pieceid, but then sorted on nodeid to group a node’s pieces together for the actual delete. Like the object log, you could run a piece delete in a single process or divide it up into N processes for more concurrency. The big difference between object delete logs and piece delete logs is that piece deletes can fail because of various SN conditions (offline, etc). When processing a piece delete log, you save the failures for re-processing later. The hourly failures can be merged into a daily failure piece delete log. When the daily failures are re-processed, any failures can be merged into a weekly piece delete log, etc. So altogether you’d have one monthly piece delete log, a weekly piece delete log, a daily piece delete log, and one or more hourly piece delete logs. The failed piece delete logs make the current GC processing unnecessary (I think).
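
The "run a piece delete log, keep the failures, merge them into the next-longer log" cycle can be sketched as follows. The names `PieceDelete`, `runLog`, and `mergeByNode` are hypothetical, and the in-memory slices stand in for the log files, which are assumed to be kept sorted on node ID.

```go
package main

import "fmt"

// PieceDelete is one entry of a piece delete log.
type PieceDelete struct {
	NodeID, PieceID string
}

// runLog attempts every delete and returns the entries that failed, keeping
// their original (node-sorted) order.
func runLog(log []PieceDelete, try func(PieceDelete) bool) []PieceDelete {
	var failed []PieceDelete
	for _, d := range log {
		if !try(d) {
			failed = append(failed, d)
		}
	}
	return failed
}

// mergeByNode merges two logs that are each already sorted on NodeID.
func mergeByNode(a, b []PieceDelete) []PieceDelete {
	out := make([]PieceDelete, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		if a[i].NodeID <= b[j].NodeID {
			out = append(out, a[i])
			i++
		} else {
			out = append(out, b[j])
			j++
		}
	}
	out = append(out, a[i:]...)
	return append(out, b[j:]...)
}

func main() {
	offline := map[string]bool{"node-2": true} // simulated unreachable node

	hourly := []PieceDelete{
		{"node-1", "piece-a"},
		{"node-2", "piece-b"},
		{"node-3", "piece-c"},
	}
	daily := []PieceDelete{{"node-2", "piece-old"}} // failures from earlier hours

	failed := runLog(hourly, func(d PieceDelete) bool { return !offline[d.NodeID] })
	daily = mergeByNode(daily, failed)
	fmt.Printf("hourly failures: %d, daily log now holds %d entries\n", len(failed), len(daily))
}
```

The same two steps would repeat when the daily log is processed, with its leftovers merged into the weekly log, and so on.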

It actually has come up several times in our internal discussion. It’s more of – we know it’s faster, but we’re not sure what will be the exact impact on the storage nodes.

CockroachDB internally adds a tombstone to the row and only later actually garbage collects the rows. So, it shouldn’t cause such tree rotations.

This adds several moving systems to the mix. This would imply one distributed service for the queueing and one service for processing the queue. The queue can be non-distributed, but that would mean the processing queue would need to connect to multiple services. If it’s non-distributed, then the upgrades to the deletion queue service wouldn’t be feasible (at least not trivially).

Similarly, people would end up needing to pay for the data that they have deleted in that case, because on the database layer there’s no indication in the segments table that a stream has been deleted. Adding the information to the segments table would be equivalent to deleting them.

Of course, since with such an approach the service won’t be able to track every stream id, we would need to add GC for the missed stream-ids. At that point, we might just as well avoid the whole queueing system and let the “GC” handle all that stuff and find things out automatically. Since GC is faster than finding out which stream-ids are missing an object, it would be more efficient to use GC.

We have something similar with graceful exit for transferring pieces, however, that system is very complex and has several performance issues due to needing to update and create the queue.

Similarly keeping such a precise database would significantly increase the amount of data needed to be stored by the satellite. The preciseness also adds the requirement of the DB being distributed.

I think S3 nomenclature makes things more confusing than necessary in this case. When we talk about deletion, it’s specifically about the form where neither the data nor the object information can be accessed any longer. This is irrespective of versioning or “hidden” objects.

Based on my guesses, maintaining and storing all that data doesn’t seem to be worth the effort compared to letting GC handle it. I suspect we may need to do a questionnaire on what the storage node operators would expect.

The question isn’t whether such a solution can be engineered; it obviously can, as you’ve demonstrated. The question is whether the extra moving systems, which may fail and need to be maintained, and the storage node tradeoffs with regard to trash and unused space are better than the current approach or a GC-only solution.

I was reading some related blueprints last night. Interesting!

Deletion performance
Garbage collection
Scaling garbage collection
Forgotten deletes

The pull request about forgotten deletes suggests a similar idea of delete logs, but keeping them on the storage node.

The delete performance tests I’ve been running have all been on single segment files. It’s important to note that these tests all run on a single connection because of http keepalives / pipelining, so there is no connection overhead included in the timings because of the small file sizes.

Today I ran tests on a multi-segment file created with the S3MTGW. It used partsize 1 (byte) in the dest.conf file, on purpose, to create as many segments as possible.

Here’s the first test, with a 100-byte file. I’m not sure exactly how this works at the satellite. Does it create a single object with 100 inline segments, or 100 sub-objects that are related by a parent object (my guess)? The upload was slow but the delete didn’t seem too bad:

mbp:hbrel jim$ py destcmd.py -c hb test -s 100 200 300
Using destinations in dest.conf

2021-10-27 12:24:40 ---------- Testing sjs3 ----------

  100 bytes:
    Round 1  # for /Users/jim/hbrel/hb/hb-577131.tmp numchunks=100
Up  25.7s    3 bytes/s   Down   8.2s   12 bytes/s   Del 132.4ms 755 bytes/s  
    Round 2  # for /Users/jim/hbrel/hb/hb-577131.tmp numchunks=100
Up  27.7s    3 bytes/s   Down   6.5s   15 bytes/s   Del 344.0ms 290 bytes/s  
    Round 3  # for /Users/jim/hbrel/hb/hb-577131.tmp numchunks=100
Up  26.5s    3 bytes/s   Down   8.7s   11 bytes/s   Del 614.8ms 162 bytes/s  
  > Average  Up  26.6s    3 bytes/s   Down   7.8s   12 bytes/s   Del 363.7ms 274 bytes/s  

Now the same test with 1M, 2M, and 4M files and partsize 10K, to test SN piece deletes rather than inline segments. The 1M file is the same number of segments as the previous test, but pieces are remote.

mbp:hbrel jim$ py destcmd.py -c hb test -s 1m
Using destinations in dest.conf

2021-10-27 12:41:23 ---------- Testing sjs3 ----------

  1 MiB:
    Round 1  # for /Users/jim/hbrel/hb/hb-482842.tmp numchunks=103
Up 143.6s    7.1 KiB/s   Down  51.4s   19.9 KiB/s   Del   6.5s  156.4 KiB/s  
    Round 2  # for /Users/jim/hbrel/hb/hb-482842.tmp numchunks=103
Up 149.9s    6.8 KiB/s   Down  51.6s   19.8 KiB/s   Del   7.3s  139.7 KiB/s  
    Round 3  # for /Users/jim/hbrel/hb/hb-482842.tmp numchunks=103
Up 145.6s    7.0 KiB/s   Down  51.7s   19.8 KiB/s   Del   6.7s  153.9 KiB/s  
  > Average  Up 146.4s    7.0 KiB/s   Down  51.6s   19.9 KiB/s   Del   6.8s  149.6 KiB/s  

Test complete

mbp:hbrel jim$ py destcmd.py -c hb test -s 2m -r 1
Using destinations in dest.conf

2021-10-27 12:54:08 ---------- Testing sjs3 ----------

  2 MiB:
     # for /Users/jim/hbrel/hb/hb-450367.tmp numchunks=205
Up 291.3s    7.0 KiB/s   Down 103.5s   19.8 KiB/s   Del   8.7s  235.4 KiB/s  

Test complete

mbp:hbrel jim$ py destcmd.py -c hb test -s 4m -r 1
Using destinations in dest.conf

2021-10-27 13:03:54 ---------- Testing sjs3 ----------

  4 MiB:
     # for /Users/jim/hbrel/hb/hb-778020.tmp numchunks=410
Up 586.3s    7.0 KiB/s   Down 213.2s   19.2 KiB/s   Del  10.0s  408.0 KiB/s  

Test complete

The good and somewhat surprising news to me is that while upload and download are scaling linearly, delete is doing better than linear, ie, it isn’t doubling when the file size doubles. That could be because of the “combine” function combining multiple requests to the same storage node. When I did some checking on these files with large numbers of segments, a file with 15K pieces was stored on 5K nodes.

Back to the topic at hand, here’s a test with a 1G file using a 5M part size on both Storj and S3. It shows deletes for this file are 86x faster on S3, on average.

[root@hbtest ~]# /usr/bin/time -v hb dest -c hb test -s 1g -t del
HashBackup #2581 Copyright 2009-2021 HashBackup, LLC
Using destinations in dest.conf

2021-10-27 17:29:37 ---------- Testing sjs3 ----------

  1 GiB:
    Round 1  Up  95.1s   10.8 MiB/s   Del   8.4s  122.2 MiB/s  
    Round 2  Up  82.4s   12.4 MiB/s   Del  10.5s   97.4 MiB/s  
    Round 3  Up  84.7s   12.1 MiB/s   Del   8.7s  117.8 MiB/s  
  > Average  Up  87.4s   11.7 MiB/s   Del   9.2s  111.3 MiB/s  

2021-10-27 17:34:39 ---------- Testing s3 ----------

  1 GiB:
    Round 1  Up  18.2s   56.1 MiB/s   Del 103.0ms   9.7 GiB/s  
    Round 2  Up  17.7s   57.9 MiB/s   Del 118.3ms   8.5 GiB/s  
    Round 3  Up  17.9s   57.1 MiB/s   Del  98.3ms  10.2 GiB/s  
  > Average  Up  18.0s   57.0 MiB/s   Del 106.5ms   9.4 GiB/s  

Test complete

	Command being timed: "hb dest -c hb test -s 1g -t del"
	User time (seconds): 70.80
	System time (seconds): 31.49
	Percent of CPU this job got: 27%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 6:09.71
	Maximum resident set size (kbytes): 48440
	Major (requiring I/O) page faults: 95
	Minor (reclaiming a frame) page faults: 21220
	Voluntary context switches: 35011
	Involuntary context switches: 6700
	File system inputs: 12914296
	File system outputs: 4195120
	Page size (bytes): 4096
	Exit status: 0