When we use free space calculation based on a dedicated partition, calculating the size of the deleted files is unnecessary (and painful).
I think I was told some time ago that for every deleted file in the trash a stat needs to be performed to get the file size. That does indeed sound painful, and I never understood why it has to be this way. We are deleting entire folders; if the folder size is known, and after deletion the folder is gone, why should I bother with calculating the sizes of single files?
That’s the culprit: the folder size is unknown. Hence the stat calls and the calculations.
For a dedicated partition it’s not needed, because we do not need to adjust the previously calculated usage by the amount of the deleted pieces. We can always just do an analogue of df --si for the partition and get the used space and the free space in one command. Unfortunately, the trash amount would not be known unless we calculated it the same way as now. Slow and painful. So when you enable dedicated partitions, the dashboard would always be off.
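Something like this would be enough for the partition-level numbers - a rough Go sketch for Linux, not the actual storagenode code (the mount path and helper name are illustrative):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// partitionUsage returns total, available and used bytes for the filesystem
// that contains path - essentially the same numbers `df --si` derives.
// Note: "used" is approximated here as total minus available.
func partitionUsage(path string) (total, avail, used uint64, err error) {
	var st unix.Statfs_t
	if err = unix.Statfs(path, &st); err != nil {
		return 0, 0, 0, err
	}
	total = st.Blocks * uint64(st.Bsize)
	avail = st.Bavail * uint64(st.Bsize)
	used = total - avail
	return total, avail, used, nil
}

func main() {
	total, avail, used, err := partitionUsage("/mnt/storagenode") // illustrative path
	if err != nil {
		panic(err)
	}
	fmt.Printf("total=%d available=%d used=%d\n", total, avail, used)
}
```

One statfs call answers the used/free question, but it says nothing about how much of the used space is trash.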
OK, this I don’t understand. Maybe I have a misconception about how deletion works. I understand it like this: there is a retain process that moves the files that will be deleted into the correct subfolders. So if we already touch every file during retain and move it into a subfolder, why don’t we record the file size at that point and calculate the folder sizes from it? Then, when a folder finally gets removed, we take the pre-calculated folder size as the space that has been released.
We do not store this information anywhere (maybe only in the badger cache, if it’s enabled).
We use this information to update the usage, but we do not store the record itself. So when the piece is removed from the trash, we need to stat it again.
There are reasons why. The database may become corrupted - very often on unstable setups, which can hang or reboot at any random time.
It would also take a lot of space to track these records, something the filesystem should handle just fine, especially if there is enough RAM for the metadata. If it’s a limited setup, well, it will suffer. However, this looks more robust than leaning only on a database cache (remember the Disk usage discrepancy? thread?).
I don’t know much about the Badger cache, for example how long entries are cached or which processes make use of it, either by filling the cache or by reading from it.
I mean, yes, of course: if the data is in the Badger cache, why read it again from the file system? And I’ll make the suggestion again: why read it for every single file in the trash instead of using the sizes of the folders the files have just been moved to?
For the trash I don’t think you would need a huge database, as it is very structured. You have a maximum of 1024 subfolders per date folder. A simple text file would do: it would contain up to 1024 lines (one per subfolder), each with a number representing the size in bytes. I don’t know how to calculate its exact size, but it does not sound like a very large text file.
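Purely as an illustration (the file name and layout are invented, nothing like this exists today), such a file per date folder could look like this:

```
# trash/<satellite>/2024-06-01/sizes.txt  - one line per two-character prefix folder
2a 18273645211
2b 17992034567
...
zz 18011220987
```

Even with all 1024 lines and sizes in the hundreds-of-gigabytes range, each line is under ~20 bytes, so the whole file stays around 20 KB per date folder.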
I have found 2 interesting links/quotes that I want to add here:
The size of the folder is unknown. It consists of all the files in it, so each file will be queried to find out its size, then all the sizes will be summed up and you will get the size of the folder.
It will take the same amount of time or more to save this information to the database, because when moving to the trash several databases would need to be updated - the used space and this newly proposed one - and the same would happen when deleting. I would suggest enabling the badger cache instead - it doesn’t require implementing yet another metadata storage.
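The first quote describes what the folder size calculation boils down to. As a minimal Go sketch (not the storagenode’s actual code), just to make the per-file cost visible:

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// dirSize sums the size of every regular file under root. There is no
// shortcut: each file has to be visited, so the cost grows with the
// number of pieces in the folder.
func dirSize(root string) (int64, error) {
	var total int64
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.Type().IsRegular() {
			info, err := d.Info() // one stat (or cached dirent lookup) per file
			if err != nil {
				return err
			}
			total += info.Size()
		}
		return nil
	})
	return total, err
}

func main() {
	size, err := dirSize(os.Args[1])
	if err != nil {
		panic(err)
	}
	fmt.Println(size, "bytes")
}
```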
The slowest operation there is the unlinking itself. Adding an additional cache would not help there too much, especially when such a cache is already implemented and is called the badger cache.
See
As I understand it, my idea would still reduce the number of additional IO operations, specifically on ext4 file systems, as quoted before:
My understanding is that the current implementation performs a stat call for every piece when moving it to the trash, and again when deleting it. If the metadata is not cached, which I believe is not the case in @arrogantrabbit’s setup, this could result in disk access for every call.
My idea is to utilize the file size retrieved during the retain process to calculate the prefix folder’s size and write it to a file. I assume that the size is retrieved during retain to update the used space and trash information.
For example, if the retain process moves 30k pieces from a single prefix folder in blobs to a prefix folder in trash, the current implementation performs 30k additional reads, and then another 30k additional reads when deleting that prefix folder. My proposal is to perform the inevitable 30k reads during the move, but also use the sizes already at this point to calculate the trash prefix folder’s size by summing them up. Once the move is complete for the prefix folder, the total size of the trash prefix folder can be written to a file. When this folder later gets deleted, only one final read of that file is needed to retrieve its size. This gives the total size that can be used to update the free space accordingly, rather than performing 30k additional reads again.
In my view, this approach would spare 30k additional reads for a single prefix folder.
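A rough sketch of the idea in Go - not the actual storagenode code; the function names and the .size file are made up for illustration:

```go
package trashsize

import (
	"os"
	"path/filepath"
	"strconv"
)

// moveToTrash is a simplified stand-in for the retain move: it stats each
// piece exactly once (the read that already happens today), moves it, and
// sums the sizes so the folder total is known without a second pass.
func moveToTrash(pieces []string, trashPrefixDir string) (total int64, err error) {
	for _, piece := range pieces {
		info, err := os.Stat(piece) // the unavoidable per-piece read
		if err != nil {
			return 0, err
		}
		dst := filepath.Join(trashPrefixDir, filepath.Base(piece))
		if err := os.Rename(piece, dst); err != nil {
			return 0, err
		}
		total += info.Size()
	}
	// Persist the summed size next to the prefix folder (hypothetical file name).
	sizeFile := filepath.Join(trashPrefixDir, ".size")
	return total, os.WriteFile(sizeFile, []byte(strconv.FormatInt(total, 10)), 0o644)
}

// trashPrefixSize reads the pre-computed total back when the folder is
// emptied, replacing the 30k per-piece stat calls with a single read.
func trashPrefixSize(trashPrefixDir string) (int64, error) {
	data, err := os.ReadFile(filepath.Join(trashPrefixDir, ".size"))
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(string(data), 10, 64)
}
```

The per-piece stat during the move is a read that already happens today; the only additions would be the running sum and one small file write per prefix folder.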
Why would we need to implement yet another cache mechanism, if we have already implemented a more generic badger cache?
The badger cache speeds up any metadata requests, including moving to the trash, deleting from the trash, and deleting by the TTL collector.
See my results:
I think the referenced sentence relates to more complicated code, rather than to metadata requests.
As written before, I don’t know much about the Badger cache, how it works, or how it is implemented to speed up metadata calls. I also don’t know whether it stores the metadata of every piece until its deletion or whether there are periodic flushes to clear out the cache.
But even if I assume the metadata of all pieces is stored in there, it would still require some effort to retrieve the size information for 30k pieces one by one. The question is how efficient that is and how many read operations it requires. But I guess that instead of storing the trash prefix folder’s size in a file, it could just as easily be stored in the badger cache too.
On the other hand, relying on the Badger cache means you cannot run lazy mode and you lose all of its advantages. My initial idea with a simple text file would not require the Badger cache (for those who don’t use it) and thus would be compatible with lazy mode, and it could maybe still spare a significant number of additional read operations.
It stores generic metadata for each piece - the creation date and the size. The cache is consulted on each operation; if the piece information is missing there, it is added.
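In other words it behaves like a read-through cache. A simplified sketch of that pattern (the type and interface names are invented for illustration, not the real badger cache API):

```go
package piecemeta

import (
	"os"
	"time"
)

// PieceMeta is the per-piece metadata kept in the cache (illustrative only).
type PieceMeta struct {
	Size    int64
	Created time.Time
}

// Store is a minimal key/value view of a metadata cache.
type Store interface {
	Get(pieceID string) (PieceMeta, bool)
	Put(pieceID string, m PieceMeta)
}

// metaFor returns the piece metadata, hitting the filesystem only on a
// cache miss and back-filling the cache so the next operation is cheap.
func metaFor(cache Store, pieceID, piecePath string) (PieceMeta, error) {
	if m, ok := cache.Get(pieceID); ok {
		return m, nil
	}
	info, err := os.Stat(piecePath)
	if err != nil {
		return PieceMeta{}, err
	}
	// The modification time stands in for the creation date in this sketch.
	m := PieceMeta{Size: info.Size(), Created: info.ModTime()}
	cache.Put(pieceID, m)
	return m, nil
}
```

On a miss the filesystem is hit once and the result is kept, so later operations on the same piece do not need another stat.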
The badger cache implementation is of course useful only for the storagenode use case, because a piece is never changed (unless corrupted), so it makes sense to remove the piece info from the cache only when the piece is finally deleted.
It’s not, because the cache is relatively small and hot, so it would almost always be in RAM. It does use more RAM because of that, by the way.
I provided the results above; you may check them as well.
Yes, it is. This is because the cache can be used only exclusively, and lazy mode spawns multiple storagenode processes.
Yes, but it’s fragile and requires compacting of this file; it’s pretty close to the hash storage, which is evolving right now. That also solves the trash problem very neatly by just updating the expiration date of the piece (somewhat reminiscent of your similar suggestion for automatically deleting data from the trash), so the TTL collector can be reused.
So please do not stop providing suggestions! The team may use them as well.
So every upload/download/retain etc. will read from it and also write to it when the info is missing? Well, maybe writing that data to it on deletion does not make sense.
If it is in RAM, then that is great. How much RAM would it need on a big node with, say, 10 or 20M pieces?
Maybe there is a way. It would be the best of both worlds if the Badger cache could run with lazy mode.
Yes, maybe. But as said, my suggestion could still be implemented within the Badger cache: you would only have to store the summed size per trash prefix folder and read it from there when you delete the folder. But I have no idea how much performance gain that would achieve then.
Yes, it probably could be used for many more things too.
Haha, yes, I remember that idea. Modifying the expiration date and letting the collector do the work still sounds neat.
Yes, I would expect it to work like this. I checked that all long-running operations were faster after the first scan (which fills up the cache). Some operations were significantly faster (the used-space-filewalker by several orders of magnitude), others just up to two times faster (other IO is involved there, like moving to the trash, deleting from the trash, or deleting by the TTL collector).
I do not know. My biggest node has ~20M pieces and uses RAM like this:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
6cf52ff51667 storagenode2 9.93% 1.733GiB / 15.47GiB 11.21% 1.07TB / 313GB 0B / 0B 447
The size of the cache is:
$ du --si -s /mnt/x/storagenode2/storage/filestatcache/
5.8G /mnt/x/storagenode2/storage/filestatcache/