Design draft: Improve deletion performance

ifraixedes · November 11, 2019, 2:47pm

Hi peeps!

As you may know, deleting large files takes a long time. For this reason we are working on a new design document to improve the performance of delete operations.

You can find it at https://github.com/storj/storj/pull/3520

We would like you to have a look and give us feedback.

Odmin · November 11, 2019, 4:40pm

I have an idea/proposal for an alternative approach:

Draft:

Uplink prepares a “task” to delete many files and sends it to the satellite.
Satellite accepts the “task” and returns a task ID and task status to the uplink.
Satellite distributes the “delete task” to the storage nodes.
Storage nodes accept the “tasks” and return a task ID and task status to the satellite.
Storage nodes process the “delete task” without any interaction with the satellite until the task is finished.
Storage node reports to the satellite that the task (ID) has finished (like sending orders).
Satellite waits until the tasks on all storage nodes have finished and reports the status to the uplink.
Uplink checks the task status; once it is finished, the delete is done.
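
The steps above can be sketched as a small in-memory model. This is only an illustration of the proposed flow, not Storj code: the names Satellite, Submit, NodeFinished and Status, as well as the status values, are assumptions made up for this example.

```go
package main

import "fmt"

// TaskStatus is a hypothetical status the uplink can poll for.
type TaskStatus string

const (
	StatusUnknown  TaskStatus = "unknown"
	StatusPending  TaskStatus = "pending"
	StatusFinished TaskStatus = "finished"
)

// deleteTask remembers which storage nodes have not reported completion yet.
type deleteTask struct {
	paths   []string
	waiting map[string]bool // node ID -> still working on the task
}

// Satellite is an in-memory stand-in for the satellite's task tracking.
type Satellite struct {
	nextID int
	tasks  map[int]*deleteTask
}

// NewSatellite returns a tracker with no tasks.
func NewSatellite() *Satellite {
	return &Satellite{tasks: make(map[int]*deleteTask)}
}

// Submit accepts a batch delete from the uplink, hands it to the given
// nodes (simulated here) and returns the task ID with its initial status.
func (s *Satellite) Submit(paths, nodes []string) (int, TaskStatus) {
	s.nextID++
	t := &deleteTask{paths: paths, waiting: make(map[string]bool)}
	for _, n := range nodes {
		t.waiting[n] = true
	}
	s.tasks[s.nextID] = t
	return s.nextID, s.Status(s.nextID)
}

// NodeFinished records a storage node's report that it finished the task.
func (s *Satellite) NodeFinished(taskID int, node string) {
	if t, ok := s.tasks[taskID]; ok {
		delete(t.waiting, node)
	}
}

// Status is what the uplink polls: finished only when every node reported.
func (s *Satellite) Status(taskID int) TaskStatus {
	t, ok := s.tasks[taskID]
	if !ok {
		return StatusUnknown
	}
	if len(t.waiting) == 0 {
		return StatusFinished
	}
	return StatusPending
}

func main() {
	sat := NewSatellite()
	id, status := sat.Submit([]string{"bucket/a", "bucket/b"}, []string{"node-1", "node-2"})
	fmt.Println("task", id, "is", status)
	sat.NodeFinished(id, "node-1")
	sat.NodeFinished(id, "node-2")
	fmt.Println("task", id, "is", sat.Status(id))
}
```

In this model the uplink never blocks: it gets a task ID back immediately and polls Status whenever it wants to know whether every node has reported.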

570RJ · November 11, 2019, 5:33pm

It would be better for satellites to support batch operations instead of delegating.

BrightSilence · November 12, 2019, 7:35am

I have a few remarks:

In general I like @Odmin’s suggestions. But I wonder whether the uplink needs to be notified about delete completion to begin with. After the delete has been accepted by the satellite, it should now be the problem of the satellite to clean up the no longer needed data. On the other hand, the uplink might want final confirmation especially when it’s about deleting sensitive data.

As for UDP, the design doc mentions maybe needing encryption, but I would argue that authentication is more important to prevent spoofing attacks sending deletes to nodes. I would assume something like DTLS would need to be implemented to do both, which I’m guessing will take away some of the advantages for the latency and throughput optimizations over UDP. Something to consider.

ifraixedes · November 12, 2019, 10:24am

@Odmin, @570RJ and @BrightSilence, many thanks for your feedback. It’s much appreciated!

Making the uplink wait for the confirmation is going to result in the same or even lower speed than the current implementation.

I even have my doubts that not waiting for the confirmation will significantly increase the speed despite not waiting for the storage node to delete the pieces, especially for large files. I know that this problem can be the same on the satellite side as it’s currently described, but we’ll be more flexible about introducing several mechanisms to overcome it.

One of the problems of delegating the deletion of the pieces to the storage node is that a malicious uplink could upload a large amount of data, then delete it without sending the requests to the storage node. That would cause storage nodes to hold a large amount of garbage.

The satellite can track the data to delete (through the task IDs), but that involves storing that reference in a database, apart from having a chore that checks which tasks haven’t been reported by the storage nodes and takes action on them.
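
A rough sketch of what that tracking could look like, using a map in place of the database table. Everything here is hypothetical (the Tracker type, its methods, and the use of plain unix seconds for timestamps); it only shows the bookkeeping such a chore would need.

```go
package main

import (
	"fmt"
	"sort"
)

// pendingDelete is a hypothetical database row: one delete task that was
// sent to a storage node and has not been reported as finished yet.
type pendingDelete struct {
	NodeID string
	SentAt int64 // unix seconds
}

// Tracker stands in for the database table of outstanding delete tasks.
type Tracker struct {
	pending map[string]pendingDelete // keyed by task ID
}

// NewTracker returns an empty tracker.
func NewTracker() *Tracker {
	return &Tracker{pending: make(map[string]pendingDelete)}
}

// Sent stores the reference when the task is handed to a storage node.
func (t *Tracker) Sent(taskID, nodeID string, sentAt int64) {
	t.pending[taskID] = pendingDelete{NodeID: nodeID, SentAt: sentAt}
}

// Reported removes the reference once the storage node reports completion.
func (t *Tracker) Reported(taskID string) {
	delete(t.pending, taskID)
}

// Overdue is what the chore would run periodically: it returns the sorted
// IDs of tasks that have been unreported for more than maxAge seconds.
func (t *Tracker) Overdue(now, maxAge int64) []string {
	var late []string
	for id, p := range t.pending {
		if now-p.SentAt > maxAge {
			late = append(late, id)
		}
	}
	sort.Strings(late)
	return late
}

func main() {
	tr := NewTracker()
	tr.Sent("task-1", "node-1", 100)
	tr.Sent("task-2", "node-2", 150)
	tr.Reported("task-2")
	// The chore runs at t=200 and tolerates 60 seconds without a report.
	for _, id := range tr.Overdue(200, 60) {
		fmt.Println("needs action:", id)
	}
}
```

What "take action" means for an overdue task (retry, fall back to garbage collection, penalize the node) is left open here, as it is in the post.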

On the other hand, the uplink might want final confirmation especially when it’s about deleting sensitive data.

The satellite is a trustee for uplinks and storage nodes. Uplinks trust that the data is deleted, and storage nodes trust that data deleted from the network won’t leave a large amount of trash for a long period.

As for UDP, the design doc mentions maybe needing encryption, but I would argue that authentication is more important to prevent spoofing attacks sending deletes to nodes. I would assume something like DTLS would need to be implemented to do both, which I’m guessing will take away some of the advantages for the latency and throughput optimizations over UDP. Something to consider.

@BrightSilence we have left exploring this for the future, but we, and I personally, appreciate your input.

BrightSilence · November 12, 2019, 10:33am

Yes I meant it more like a confirmation, rather than having the uplink actually wait for it. So more like a “consider it deleted, but confirmation pending” kind of thing.

But I agree that if the satellite is a trustee, it should probably be trusted to handle the deletes as well. And satellites should be better able to load balance those deletes, especially if they can be handled asynchronously from the uplink’s request.

It sounds to me that having the satellite manage the deletion process is the way to go despite the additional costs. It’s probably worth it for the better customer experience and reliability of deletes. But then again, I don’t pay the satellite bills.

Odmin · November 12, 2019, 11:33am

Sorry, I meant that the uplink and satellite will check the “delete task” status with no need to wait; they just check the task status from time to time.

The main reasons for this idea:

Significantly reduces satellite and uplink traffic (batch jobs are sent)
Satellite does not send a delete request for each piece (the TCP confirmation issue is avoided, no need to switch to UDP)
Satellite acts like a judge and does not do any delete jobs itself; it just manages and tracks tasks
Storage node does all the dirty work on its side and reports to the satellite when the batch job is done
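
To illustrate the traffic argument, here is a minimal sketch that groups piece deletions by storage node, so that one batch job goes to each node instead of one request per piece. The Piece type and BatchByNode function are invented for this example and are not part of the Storj codebase.

```go
package main

import "fmt"

// Piece is a hypothetical reference to one stored piece and the node holding it.
type Piece struct {
	ID     string
	NodeID string
}

// BatchByNode groups piece IDs by storage node, so the satellite can send
// a single batch job per node rather than one delete request per piece.
func BatchByNode(pieces []Piece) map[string][]string {
	batches := make(map[string][]string)
	for _, p := range pieces {
		batches[p.NodeID] = append(batches[p.NodeID], p.ID)
	}
	return batches
}

func main() {
	pieces := []Piece{
		{ID: "p1", NodeID: "node-1"},
		{ID: "p2", NodeID: "node-2"},
		{ID: "p3", NodeID: "node-1"},
		{ID: "p4", NodeID: "node-1"},
	}
	batches := BatchByNode(pieces)
	fmt.Printf("%d pieces -> %d batch jobs\n", len(pieces), len(batches))
}
```

The saving grows with the number of pieces per node: the number of requests is bounded by the number of nodes involved, not by the number of pieces deleted.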

I think in any case we should have the “judge” who will track delete tasks. Without a judge, fair transactions between uplinks and storage nodes are not possible.