Suggestion: add one more subfolder level to the blobs folder structure

Today I wrote about my problem with slow file deletion on the ext4 filesystem (mostly for SLC folders):
How do you solve slow file deletion on ext4?

Right now Storj uses the folder structure “/ab/filename.sj1” with 1024 subfolders for every Satellite. During intensive tests SLC sends tens of millions of files. I see more than 50 million records in piece_expiration.db, so the number of files in each folder can reach 50 000 or even 100 000. And we hope that one day Satellites working with customers’ data will surpass these numbers :wink:

So I decided to make some simple tests to understand how the number of files in a folder affects file deletion speed.

Test steps (a rough shell sketch of the core measurement is shown right after the list):

  1. Create a folder with N files that have random 50-character Storj-like filenames, random content, and random sizes from 4 KB to 1 MB.
  2. Pick 1000 random files in random order.
  3. Remount the filesystem to reset caches.
  4. Call “rm -f” with the list of picked files and measure how long it takes to finish the job.
  5. Repeat for N = 1000, 1500, 2000, 2500, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000, 50000, 60000, 70000, 80000, 90000, 100000.
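Roughly, steps 2–4 boil down to something like this (just a sketch with placeholder paths and device name; the full PHP script I actually used is further below):

# Step 2: pick 1000 random files from the flat test folder
ls test/ | shuf -n 1000 | sed 's|^|test/|' > /tmp/to_delete.txt

# Step 3: remount the filesystem to drop caches (adjust device to your setup)
umount /dev/sdc1 && mount -a

# Step 4: time a single "rm -f" over the picked files
time xargs rm -f < /tmp/to_delete.txt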

Here are the results (the test is long and I had no time to run it a few times and average, so there are some fluctuations):

N (files in folder)    t (seconds to delete 1000 random files)
1000                   1.58
1500                   2.73
2000                   3.07
2500                   4.6
3000                   4.95
4000                   5.55
5000                   6.13
6000                   7.63
7000                   7.87
8000                   7.07
9000                   9.12
10000                  8.18
15000                  9.79
20000                  9.92
25000                  12.87
30000                  10.32
35000                  12.51
40000                  17.13
45000                  14.6
50000                  13.79
60000                  18.25
70000                  22.06
80000                  24.11
90000                  26.32
100000                 29.75

The results show that the total number of files in a directory significantly affects the speed of deleting each individual file.
I would like to note that the file deletion rate in this synthetic test is several times higher than on real Storj expired pieces. This is probably because the test files were created together within a short interval of time and are located close to each other on disk, unlike real data, which were written into different folders over the course of a month. For this reason, the number of files in the Storj folders likely has an even greater effect than in my test.

So my suggestion is to add one more subfolder level [a-z2-7] to the folder structure and make it “/ab/c/filename.sj1”. This way each of the 1024 folders gets 32 subfolders, and the number of files per folder decreases by about 32 times on average (50 000 to ~1 500, or 100 000 to ~3 000).
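For illustration, a rough sketch of how an existing blob could be mapped to the proposed layout (just my assumption that the filename inside “/ab/” starts with the next base32 character of the piece id; the variable names and example path are made up):

# current layout:   <satellite>/ab/<rest>.sj1       (1024 two-character folders)
# proposed layout:  <satellite>/ab/c/<rest>.sj1     ("c" = first character of <rest>, [a-z2-7], 32 values)
f="ab/ckz6yirmes3dhtd5lcqfmaujiejnemiw7vsxmfqgaaaaaaaa.sj1"   # example path
dir="${f%%/*}"        # "ab"  - the existing 1024-way split
rest="${f#*/}"        # the filename inside that folder
sub="${rest:0:1}"     # "c"   - the proposed extra 32-way split
mkdir -p "$dir/$sub"
mv "$f" "$dir/$sub/$rest"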

This relatively simple change can increase the speed of deleting old files by at least several times.

Here’s the script I used for the tests in case anyone wants to reproduce them quickly. I used the language I am most familiar with. Sorry, it’s PHP :slightly_smiling_face:
But it just measures the time of the system “rm -f” command, so the language doesn’t matter.

test.php
<?php
$folder = '/home/storj2/node/2/';
$dev = '/dev/sdc1';
$alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789';
$testNumbers = array(1000, 1500, 2000, 2500, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000,
    15000, 20000, 25000, 30000, 35000, 40000, 45000, 50000, 60000, 70000, 80000, 90000, 100000);

foreach ($testNumbers as $numFilesInFolder) {
    $numFilesToDelete = 1000;
    print "Test folder with $numFilesInFolder files\n";
    $totalSize = 0;
    $files = array();
    exec('rm -rf ' . $folder . 'test/');
    mkdir($folder . 'test/');

    // Create $numFilesInFolder files with random 50-character names and random sizes (4 KB .. 1 MB)
    $time_start = microtime(true);
    for ($i = 1; $i <= $numFilesInFolder; $i++) {
        $filename = '';
        for ($j = 1; $j <= 50; $j++) {
            $r = rand(0, mb_strlen($alphabet) - 1); // rand() is inclusive on both ends
            $filename .= mb_substr($alphabet, $r, 1);
        }
        $filename .= '.test';
        $files[] = $filename;
        $filesize = rand(1, 256); // size in 4 KB blocks
        $totalSize += $filesize;
        $fl = fopen($folder . 'test/' . $filename, 'w');
        for ($j = 1; $j <= $filesize; $j++) {
            fwrite($fl, openssl_random_pseudo_bytes(4096));
        }
        fclose($fl);
    }
    $time_end = microtime(true);
    print "$numFilesInFolder files with total size " . round($totalSize * 4096 / 1024 / 1024) . " Mb are created in " . ($time_end - $time_start) . " seconds\n";

    // Remount the filesystem to drop caches
    print "Let's give some time for disk cache to finish writes and remount filesystem...\n";
    sleep(15);
    exec('umount ' . $dev);
    sleep(5);
    exec('mount -a');
    sleep(10);

    // Pick $numFilesToDelete random files and delete them with a single "rm -f", measuring the time
    $itemsToDelete = array_rand($files, $numFilesToDelete);
    $filesToDelete = array();
    foreach ($itemsToDelete as $item) {
        $filesToDelete[] = $files[$item];
    }
    $command = 'rm -f ';
    foreach ($filesToDelete as $file) {
        $command .= $folder . 'test/' . $file . ' ';
    }
    $time_start = microtime(true);
    exec($command);
    $time_end = microtime(true);
    print "RM command test: delete $numFilesToDelete random files from folder with $numFilesInFolder files finished in " . ($time_end - $time_start) . " seconds\n";

    exec('rm -rf ' . $folder . 'test/');
}
?>

Informative post. I will want to test this sometime on ZFS and XFS as well.

I think that to strengthen your recommendation you should also repeat the tests using your suggested nested format, i.e. N files spread over 1024 folders with 1000 random deletes, since we don’t know what the cost of the additional layer is. One could also suggest just adding more folders at the top level.
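For example, the existing flat test folder could be spread into subfolders by first character with something like this (a rough sketch; I use 36 subfolders [a-z0-9] because the test filenames use that alphabet, which is close to the proposed 32-way split):

# Create the subfolders of the extra level
for c in {a..z} {0..9}; do mkdir -p "test/$c"; done

# Distribute the already-created flat test files by their first character
for f in test/*.test; do
    name="${f##*/}"
    mv "$f" "test/${name:0:1}/$name"
done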


You want to add 32x more directories to traverse, but your test doesn’t have a single extra directory to traverse? (Or maybe I don’t understand it.)

Wouldn’t the test be a) delete 1000 files-in-one-directory vs 1000 files-across-32-directories, b) delete 10000 files-in-one-directory vs 10000 files-across-32-directories, c) etc…

If you’re deleting the same number of files in the end, but parsing/updating more separate directories, you may not see the gains you expect.

cyclical logic roxors!.. errr rocks! :stuck_out_tongue:

1/10 of a cent

Not a simple change. Requires moving all existing files.

Besides, as measured in this thread, going the opposite way also has benefits.


A later test showed that ext4 on an HDD is slow simply because of the large number of files, and it’s not so important how they are spread.

I made a few more tests which are closer to reality: deleting 1000 random files out of millions spread over 1024 folders (“/ab/filename” structure) and 1000 out of millions spread over 1024*32 folders (“/ab/c/filename”), and didn’t find a reason to change the folder structure. Sometimes “/ab/filename” is faster, sometimes “/ab/c/filename” is faster (depending on how many millions of files there are), but it still takes tens of seconds to delete 1000 old files, and more than 100 seconds when the total number of files is >= 5 million.

After that I made a similar experiment (only for the “/ab/filename” structure) on a ZFS filesystem with a metadata cache on SSD. Deleting 1000 random files out of 10 million took… 0.13 seconds.

So I don’t see any reason to stay with ext4 and try to tune it with small optimizations. It’s just not viable for the large-node use case. A cache of filesystem metadata looks inevitable.
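For reference, this is roughly how a special metadata device is attached to a pool (a sketch only; pool and device names are placeholders, and the special vdev should be mirrored because losing it means losing the pool):

# Add a mirrored special (metadata) vdev to an existing pool
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1

# Optionally also keep small blocks (not just metadata) on the SSDs
zfs set special_small_blocks=4K tank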


I’m hoping Storj’s cache gets good enough that we can start to test it soon. It won’t be better than ZFS’s special-metadata device… but it will work on all filesystems (and hopefully will still be a large improvement) :crossed_fingers:

Unfortunately, no cache on the Storj software side can help delete old files faster when the filesystem itself is slow. It will probably help keep the occupied space shown in the dashboard up to date and closer to the real disk usage, but it will not help with the fact that this space is occupied by expired pieces (which are already “forgotten” by the Satellite in its TBM calculations).


@DisaSoft would it be possible for you to run your test again, slightly modified?
Instead of deleting files one by one with rm, use mv in two variants (a rough sketch of both is shown after the list):

  1. to move all files to be deleted into a trash folder and then rm this trash folder
  2. to move all files into a new folder, skipping the ones to be deleted, and then rm the old folder
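Something like this, roughly (just a sketch; the file lists and folder names are placeholders):

# Variant 1: mv the files to be deleted into a trash folder, then rm that folder in one pass
mkdir trash
time xargs -a files_to_delete.txt mv -t trash/
time rm -rf trash/

# Variant 2: mv the files to KEEP into a new folder, then rm the old folder with the leftovers
mkdir new
time xargs -a files_to_keep.txt mv -t new/
time rm -rf old/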

You just described the current behavior. However, it doesn’t change much: the node has to calculate the size of the deleted pieces to update the databases for the dashboard, so they will still be processed one by one. Or it would first have to do a recursive stat and then remove the folder recursively (so two passes instead of only one).
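In shell terms that two-pass approach would look roughly like this (paths are placeholders):

# Pass 1: sum up the sizes of the pieces about to be removed (for the dashboard accounting)
du -sb trash/

# Pass 2: remove the whole folder recursively
rm -rf trash/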

Too late. I already dropped the ext4 nodes and started new ones with ZFS + a special metadata device on SSD. For now everything works extremely fast, but they are not yet as full as the ext4 nodes were.


I think you made the right choice. If ultimately our issue is that we have millions of files, and HDDs have limited IOPS… then using an SSD layer with 1000x the IOPS seems like a clean and simple solution.
