Suggestion: add one more subfolder level to the blobs folder structure

Today I wrote about my problem with slow file deletion on the ext4 filesystem (mostly for SLC folders):
How do you solve slow file deletion on ext4?

Right now Storj uses the folder structure “/ab/filename.sj1” with 1024 subfolders for every Satellite. During intensive tests SLC sends tens of millions of files. I see more than 50 million records in piece_expiration.db, so the number of files in each folder can reach 50 000 or even 100 000. And we hope that one day Satellites working with customers’ data will surpass these numbers :wink:

So I decided to make some simple tests to understand how the number of files in a folder affects file deletion speed.

Test steps (a rough shell sketch of the core measurement is shown right after the list):

  1. Create a folder with N files that have random 50-character Storj-like filenames, random content, and random sizes from 4 KB to 1 MB.
  2. Pick 1000 random files in random order.
  3. Remount the filesystem to reset caches.
  4. Call “rm -f” with the list of picked files and measure how long it takes to finish the job.
  5. Repeat for N = 1000, 1500, 2000, 2500, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000, 50000, 60000, 70000, 80000, 90000, 100000.
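Roughly, steps 2–4 boil down to something like this (just a sketch with placeholder paths and device name; the full PHP script I actually used is further below):

# Step 2: pick 1000 random files from the flat test folder
ls test/ | shuf -n 1000 | sed 's|^|test/|' > /tmp/to_delete.txt

# Step 3: remount the filesystem to drop caches (adjust device to your setup)
umount /dev/sdc1 && mount -a

# Step 4: time a single "rm -f" over the picked files
time xargs rm -f < /tmp/to_delete.txt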

Here are the results (the test is long and I had no time to run it a few times and average, so there are some fluctuations):

N (files in folder)    t (seconds to delete 1000 random files)
1000                   1.58
1500                   2.73
2000                   3.07
2500                   4.6
3000                   4.95
4000                   5.55
5000                   6.13
6000                   7.63
7000                   7.87
8000                   7.07
9000                   9.12
10000                  8.18
15000                  9.79
20000                  9.92
25000                  12.87
30000                  10.32
35000                  12.51
40000                  17.13
45000                  14.6
50000                  13.79
60000                  18.25
70000                  22.06
80000                  24.11
90000                  26.32
100000                 29.75

The results show that the total number of files in a directory significantly affects the speed of deleting each individual file.
I would like to note that the file deletion rate in this synthetic test is several times higher than on real Storj expired pieces. This is probably because the test files were created together within a short interval of time and are located close to each other on disk, unlike real data, which were written into different folders over the course of a month. For this reason, the number of files in the Storj folders likely has an even greater effect than in my test.

So my suggestion is to add one more subfolder level [a-z2-7] to the folder structure and make it “/ab/c/filename.sj1”. This way each of the 1024 folders gets 32 subfolders, and the number of files per folder decreases by about 32 times on average (50 000 to ~1 500, or 100 000 to ~3 000).
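For illustration, a rough sketch of how an existing blob could be mapped to the proposed layout (just my assumption that the filename inside “/ab/” starts with the next base32 character of the piece id; the variable names and example path are made up):

# current layout:   <satellite>/ab/<rest>.sj1       (1024 two-character folders)
# proposed layout:  <satellite>/ab/c/<rest>.sj1     ("c" = first character of <rest>, [a-z2-7], 32 values)
f="ab/ckz6yirmes3dhtd5lcqfmaujiejnemiw7vsxmfqgaaaaaaaa.sj1"   # example path
dir="${f%%/*}"        # "ab"  - the existing 1024-way split
rest="${f#*/}"        # the filename inside that folder
sub="${rest:0:1}"     # "c"   - the proposed extra 32-way split
mkdir -p "$dir/$sub"
mv "$f" "$dir/$sub/$rest"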

This relatively simple change can increase the speed of deleting old files by at least several times.

Here’s the script I used for the tests in case anyone wants to reproduce them quickly. I used the language I am most familiar with. Sorry, it’s PHP :slightly_smiling_face:
But it just measures the time of the system “rm -f” command, so the language doesn’t matter.

test.php
<?php
$folder = '/home/storj2/node/2/';
$dev = '/dev/sdc1';
$alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789';
$testNumbers = array(1000, 1500, 2000, 2500, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000,
    15000, 20000, 25000, 30000, 35000, 40000, 45000, 50000, 60000, 70000, 80000, 90000, 100000);

foreach ($testNumbers as $numFilesInFolder) {
    $numFilesToDelete = 1000;
    print "Test folder with $numFilesInFolder files\n";
    $totalSize = 0;
    $files = array();
    exec('rm -rf ' . $folder . 'test/');
    mkdir($folder . 'test/');

    // Create $numFilesInFolder files with random 50-character names and random sizes (4 KB .. 1 MB)
    $time_start = microtime(true);
    for ($i = 1; $i <= $numFilesInFolder; $i++) {
        $filename = '';
        for ($j = 1; $j <= 50; $j++) {
            $r = rand(0, mb_strlen($alphabet) - 1); // rand() is inclusive on both ends
            $filename .= mb_substr($alphabet, $r, 1);
        }
        $filename .= '.test';
        $files[] = $filename;
        $filesize = rand(1, 256); // size in 4 KB blocks
        $totalSize += $filesize;
        $fl = fopen($folder . 'test/' . $filename, 'w');
        for ($j = 1; $j <= $filesize; $j++) {
            fwrite($fl, openssl_random_pseudo_bytes(4096));
        }
        fclose($fl);
    }
    $time_end = microtime(true);
    print "$numFilesInFolder files with total size " . round($totalSize * 4096 / 1024 / 1024) . " Mb are created in " . ($time_end - $time_start) . " seconds\n";

    // Remount the filesystem to drop caches
    print "Let's give some time for disk cache to finish writes and remount filesystem...\n";
    sleep(15);
    exec('umount ' . $dev);
    sleep(5);
    exec('mount -a');
    sleep(10);

    // Pick $numFilesToDelete random files and delete them with a single "rm -f", measuring the time
    $itemsToDelete = array_rand($files, $numFilesToDelete);
    $filesToDelete = array();
    foreach ($itemsToDelete as $item) {
        $filesToDelete[] = $files[$item];
    }
    $command = 'rm -f ';
    foreach ($filesToDelete as $file) {
        $command .= $folder . 'test/' . $file . ' ';
    }
    $time_start = microtime(true);
    exec($command);
    $time_end = microtime(true);
    print "RM command test: delete $numFilesToDelete random files from folder with $numFilesInFolder files finished in " . ($time_end - $time_start) . " seconds\n";

    exec('rm -rf ' . $folder . 'test/');
}
?>

Informative post. I will want to test this sometime on ZFS and XFS as well.

I think that to strengthen your recommendation you should also repeat the tests using your suggested nested format, i.e. N files spread over 1024 folders with 1000 random deletes, since we don’t know what the cost of the additional layer is. One could also suggest just adding more folders at the top level.
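For example, the existing flat test folder could be spread into subfolders by first character with something like this (a rough sketch; I use 36 subfolders [a-z0-9] because the test filenames use that alphabet, which is close to the proposed 32-way split):

# Create the subfolders of the extra level
for c in {a..z} {0..9}; do mkdir -p "test/$c"; done

# Distribute the already-created flat test files by their first character
for f in test/*.test; do
    name="${f##*/}"
    mv "$f" "test/${name:0:1}/$name"
done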


You want to add 32x more directories to traverse, but your test doesn’t have a single extra directory to traverse? (Or maybe I don’t understand it.)

Wouldn’t the test be a) delete 1000 files-in-one-directory vs 1000 files-across-32-directories, b) delete 10000 files-in-one-directory vs 10000 files-across-32-directories, c) etc…

If you’re deleting the same number of files in the end, but parsing/updating more separate directories, you may not see the gains you expect.

cyclical logic roxors!.. errr rocks! :stuck_out_tongue:

1/10 of a cent

Not a simple change. Requires moving all existing files.

Besides, as measured in this thread, going the opposite way also has benefits.


A later test showed that ext4 on an HDD is slow simply because of the large number of files, and it’s not so important how they are spread.

I made a few more tests which are closer to reality: deleting 1000 random files out of millions spread over 1024 folders (“/ab/filename” structure) and 1000 out of millions spread over 1024*32 folders (“/ab/c/filename”), and didn’t find a reason to change the folder structure. Sometimes “/ab/filename” is faster, sometimes “/ab/c/filename” is faster (depending on how many millions of files there are), but it still takes tens of seconds to delete 1000 old files, and more than 100 seconds when the total number of files is >= 5 million.

After that I made a similar experiment (only for the “/ab/filename” structure) on a ZFS filesystem with a metadata cache on SSD. Deleting 1000 random files out of 10 million took… 0.13 seconds.

So I don’t see any reason to stay with ext4 and try to tune it with small optimizations. It’s just not viable for the large-node use case. A cache of filesystem metadata looks inevitable.
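For reference, this is roughly how a special metadata device is attached to a pool (a sketch only; pool and device names are placeholders, and the special vdev should be mirrored because losing it means losing the pool):

# Add a mirrored special (metadata) vdev to an existing pool
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1

# Optionally also keep small blocks (not just metadata) on the SSDs
zfs set special_small_blocks=4K tank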


I’m hoping Storj’s cache gets good enough that we can start to test it soon. It won’t be better than ZFS’s special-metadata device… but it will work on all filesystems (and hopefully will still be a large improvement) :crossed_fingers:

Unfortunately, no cache on the Storj software side can help delete old files faster when the filesystem itself is slow. It will probably help keep the occupied space shown in the dashboard up to date and closer to the real disk usage, but it will not help with the fact that this space is occupied by expired pieces (which are already “forgotten” by the Satellite in its TBM calculations).


@DisaSoft would it be possible for you to run your test again, slightly modified?
Instead of deleting files one by one with rm, use mv in two variants (a rough sketch of both is shown after the list):

  1. to move all files to be deleted into a trash folder and then rm this trash folder
  2. to move all files into a new folder, skipping the ones to be deleted, and then rm the old folder
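Something like this, roughly (just a sketch; the file lists and folder names are placeholders):

# Variant 1: mv the files to be deleted into a trash folder, then rm that folder in one pass
mkdir trash
time xargs -a files_to_delete.txt mv -t trash/
time rm -rf trash/

# Variant 2: mv the files to KEEP into a new folder, then rm the old folder with the leftovers
mkdir new
time xargs -a files_to_keep.txt mv -t new/
time rm -rf old/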

You just described the current behavior. However, it doesn’t change much: the node has to calculate the size of the deleted pieces to update the databases for the dashboard, so they will still be processed one by one. Or it would first have to do a recursive stat and then remove the folder recursively (so two passes instead of only one).
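In shell terms that two-pass approach would look roughly like this (paths are placeholders):

# Pass 1: sum up the sizes of the pieces about to be removed (for the dashboard accounting)
du -sb trash/

# Pass 2: remove the whole folder recursively
rm -rf trash/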

Too late. I already dropped the ext4 nodes and started new ones with ZFS + a special metadata device on SSD. For now everything works extremely fast, but they are not yet as full as the ext4 nodes were.


I think you made the right choice. If ultimately our issue is that we have millions of files, and HDDs have limited IOPS… then using an SSD layer with 1000x the IOPS seems like a clean and simple solution.
