How do you solve slow file deletion on ext4?

Hi!

I have a few relatively large nodes (10 TB+) running on Toshiba HDDs (“Enterprise Capacity” series, non-SMR) under Ubuntu 24.04, formatted as ext4 and mounted with “noatime”. The rest of the hardware is no “potato” either: Ryzen 9 7900 with 32 GB RAM. The machine is used only for a few Storj nodes, one node per drive, no RAID.
Last month these drives were filled to 60-80% with SLC test pieces, and now those pieces’ TTLs are expiring.

The problem is that old files are being deleted much more slowly than they were uploaded. And it’s not a slow GC process or anything else related to the node software. The filesystem itself can only delete a few files per second (even with native “rm”), while the satellites can upload (and did upload) tens of files per second.

I even stopped ingress of new pieces by setting the minimum node size, to give more disk IO to the cleanup processes, but it’s still too slow. Mounting the partition with “data=writeback” speeds up deletion maybe 2x, but that is still several times slower than the upload rate. Right now I still have pieces whose TTL expired 10-14 days ago, simply because the collector cannot delete them fast enough.
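
For anyone who wants to reproduce the writeback test, it looks roughly like this (device and mount point are placeholders; the tune2fs step makes writeback the default data mode so it survives remounts):

umount /mnt/node1
tune2fs -o journal_data_writeback /dev/sdj1
mount -o noatime,data=writeback /dev/sdj1 /mnt/node1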

I see a critical problem here, much more critical than the slow filewalker, which can be worked around with DBs, caches, etc. As long as the filesystem cannot delete files at least as fast as new files are uploaded, sooner or later the node will end up 100% full of unpaid expired garbage.

How do you tune ext4 for fast file deletion?

P.S. Maybe Storj should rework the whole architecture of how pieces are stored? It doesn’t look workable long term under current loads: a node can fill up, but it can’t clean up.


They are connected to a Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller (rev 11). Read/write performance is perfect; only file deletion is slow.

I don’t think this is normal.
I recently did rm -rf on the wrong folder by mistake, and within a few seconds 20 directories from the AP1 satellite folder were gone. Also on ext4.

Do you have the databases on SSD? These TTL deletes also modify the SQLite databases, which can be quite IOPS-intensive, and if the drive is starved, then even rm will be slow.

Also, I have 16 TB Toshiba MG08s, and I saw a significant difference in IOPS (about a third) between identical drives in the same system, running the same FS but different firmware versions, such as 0103 vs 4302.
So this might be the drives themselves as well.
I planned to do some firmware updates, but these MG08 drives aren’t really aimed at end consumers, so Toshiba provides pretty much no support for them, and finding the correct firmware and a flashing tool might be problematic.
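
If anyone wants to compare firmware revisions on their own drives, they show up in the identify data (device name is a placeholder):

smartctl -i /dev/sdj | grep -i firmware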


It looks like it depends very much on the number of files per folder. AP1 typically has far fewer files per folder than SLC, and deleting files there is much faster. SLC has tens of millions of files in just 1024 folders.
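
A rough way to see the imbalance, assuming the usual blobs layout (paths are placeholders; ls -f skips sorting, which matters in huge directories):

for d in /mnt/node1/storage/blobs/*/??; do
  echo "$(ls -f "$d" | wc -l) $d"
done | sort -rn | head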

Yes, the databases are on a separate SSD, but deleting from the HDD is slow even with the native OS “rm” command, and even when the node (docker container) is shut down and there are no concurrent reads/writes.

My drives are different models of the MG09 series: 12, 16, 18, and 18 TB. All show the same behavior: perfect read/write performance, but slow deletion from folders with a large number of files.

True.
Maybe @Toyoo will help, he knows his filesystems quite well.

is this the one affected by the EFS-13492 issue?

I haven’t found any information about the EFS-13492 issue, but to be sure I tried another card, an ASMedia Technology Inc. ASM1064 Serial ATA Controller (rev 02), and it’s not faster. Maybe even slower.

I ran some tests of file deletion time depending on the number of files per folder and added my suggestions here: Suggestion: add one more subfolder level to blobs folders structure


this guy chias 🙂

Regarding the slow file deletes: um… fragmentation? I only have one ext4 node, but it’s quite fragmented, and performance for things like listing files has gotten really bad.

Using more RAM for file caching may help some. The other ext4 optimization I can think of is potentially shrinking the inode size? A couple of related knobs are sketched below.
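
As a sketch only (device and mount point are placeholders; the sysctl is safe at runtime, the mkfs line is destructive and only makes sense for a fresh drive):

sysctl -w vm.vfs_cache_pressure=10    # keep inode/dentry caches around longer
e4defrag -c /mnt/node1                # report-only: show the fragmentation score
mkfs.ext4 -I 128 /dev/sdj1            # smaller inodes pack metadata denser; destroys data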

I did a dumber test to attempt to confirm:

Actually, right now, while I have rclone running in the background, if I attempt to “ls” a single folder it takes like… two minutes and 21 seconds. That is, the first time. The second time it’s near-instant, presumably because it’s in cache.

I did another low-key test.

  1. traversed into a previously unscanned storj folder.
  2. used touch to create the files ‘hi’ and ‘hi2’.
  3. creation was instant.
  4. did rm hi?
  5. it took over a minute.
  6. but doing an ls afterwards was faster.

So maybe there is a way Storj can delete the files without an implicit re-listing of the directory? I’m not smart enough about filesystems or programming to know if that’s possible.
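
If the slow part was the glob, that alone would explain it: expanding “hi?” forces the shell to read the entire directory, while an unlink by exact name is a single hashed HTree lookup. A quick way to check, in the same kind of folder:

touch hi hi2
time rm hi?     # shell must list the whole directory to expand the glob
time rm hi      # exact name: no listing needed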

Incidentally, I’ve switched other drives to ZFS. With a lot of RAM for caching, plus experimenting with a metadata cache on SSD, it doesn’t seem to have these particular problems.

Well, not unexpected. An HTree has logarithmic complexity for its operations. Directory defragmentation should help a bit short-term (fsck -D).


Ouch. How’s your audit score holding up on that satellite? Around 98%? In theory it should slowly climb as you ingress more data and customers delete the files you’re missing, right?

I was actually migrating from XFS to ext4. I meant to delete the new folder, thinking I had unmounted it so I could rename the mount point, but it was still mounted.
Since I still had the source, I just re-ran rsync on that AP1 folder and thus lost no data.


It looks like “-D” is not working. With “-D” it does absolutely the same as without:

fsck -D /dev/sdj1
fsck from util-linux 2.39.3
e2fsck 1.47.0 (5-Feb-2023)
/dev/sdj1: clean, 71929124/274661376 files, 2529573826/4394581504 blocks

P.S. I’ll try with “-f -D”.
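
If I read the man page right, the offline run should look something like this (mount point is a placeholder):

umount /dev/sdj1
e2fsck -f -D /dev/sdj1    # -f forces the full pass so -D actually re-packs directories
mount -o noatime /dev/sdj1 /mnt/node1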

You said using writeback speeds it up, so pros and cons aside, turn journaling off completely. Also mount with noatime,nodiratime.
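
For the record, dropping the journal would look something like this (device and mount point are placeholders; without a journal, any crash means a full fsck before the node can start again):

umount /mnt/node1
tune2fs -O ^has_journal /dev/sdj1
e2fsck -f /dev/sdj1
mount -o noatime,nodiratime /dev/sdj1 /mnt/node1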

This off-topic chatter gets annoying…


Focus on “How do you solve slow file deletion on ext4?”. I’m curious… I have the same problems with low-RAM setups.


Are you running your storagenode as root or a user?

I use docker containers, and as far as I can see, the owner of the files created by the node is root.

I had the same problem with high-RAM setups (3 GB/TB), so I don’t think RAM is the problem here. I think the metadata operations are just too slow, which is probably true for almost every filesystem now: performance - Why the amount of files affects much deleting speed? - Stack Overflow
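
That effect is easy to measure with a throwaway benchmark along these lines (needs root for the cache drop; paths are placeholders):

mkdir -p /mnt/node1/deltest && cd /mnt/node1/deltest
for n in 1000 10000 100000; do
  mkdir d$n && (cd d$n && seq $n | xargs touch)
  sync && echo 3 > /proc/sys/vm/drop_caches   # start each run with cold caches
  echo "$n files:" && time rm -rf d$n
done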

We could list a lot of topics on the same subject here:

I’m hoping for the file-stat cache (storagenode/blobstore: blobstore with caching file stat information (… · storj/storj@2fceb6c · GitHub)). Although the filesystem should be handling all metadata smoothly, reading and changing it costs a lot of random IO, which is usually quite slow. So I’m hoping it will reduce the amount of random IO and make the involved file operations considerably faster. Therefore I also hope it will be made compatible with the lazy filewalker.

Essentially I think we’re looking at some growing pains of Storj. For the time being I reverted to ZFS with special vdevs for metadata on SSD.
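
For anyone curious, the pool layout I mean is roughly this (disk names are placeholders; with the default special_small_blocks=0, only metadata lands on the SSD mirror):

zpool create tank mirror /dev/sda /dev/sdb special mirror /dev/nvme0n1 /dev/nvme1n1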


ZFS special metadata vdevs do seem like the way to go, if you have a couple of SSDs and enough space. Do you have a feel for how long your used-space filewalker takes? I don’t have it set up to test yet, but I’m hoping it’s something like 5 min/TB/satellite.

Hours instead of days would be nice 🤞
