On Ext4, fragmentation and the filewalker

Hi SNOs!

I’m writing this message to see if anyone has gone through the same issues as me and found a solution.

I have a disk from a long time ago, formatted with Ext4. When I decided to upgrade my Storj rig and run more than one node, I started spinning up all the nodes on that same disk (I know, not something supported by Storj) and then moved them to new disks as I got them.

My current situation is that the disk that originally held all the nodes (and was never close to full) is much, much slower than the other disks.

I usually run a du on the disk before starting the node, so the filewalker runs as smoothly as possible thanks to the kernel caches.
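
For reference, this is roughly what I mean by priming the cache (the mount point is just an example from my setup):

# walk the whole piece tree once so the dentries/inodes end up in the kernel caches
du -sh /mnt/storj-old/storagenode/storage/blobs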

Well, on that disk the performance of du is something like 100 times worse, so it cannot finish before the kernel starts discarding the cache entries. So when that node starts, iowait begins to grow and the load average of the system grows with it.
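
You can watch it happen with the usual tools (the device name is a placeholder):

# per-disk utilization while the filewalker runs
iostat -dx sdX 5
# or just the global iowait / load summary
vmstat 5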

Investigating a bit, I saw in the e2fsck report that ~50% of the directories on the disk are non-contiguous, which means those directory inodes are fragmented (their blocks are not stored contiguously).
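
In case anyone wants to check their own disk, this is the kind of report I mean (device and mount point are placeholders; e2fsck should only be run on an unmounted filesystem, or read-only with -n):

# read-only check; the summary line shows the non-contiguous percentage
e2fsck -fn /dev/sdX1
# on a mounted ext4, e4defrag can also report a fragmentation score per directory
e4defrag -c /mnt/storj-old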

Any advice on how to deal with it?
The only advice I could find on the internet is to copy all the contents to a different folder and remove the original one to force the filesystem to rewrite the directories, but that is a bit difficult, as deleting the old folder would take a long time precisely because of the fragmentation.

I’m also wondering how other filesystems would behave in this situation; I’m even thinking about moving everything to ZFS with a good SSD metadata cache (I can put ~128 GB of RAM in that machine, which currently runs an i5-10500).
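
If I go the ZFS route, the rough idea would be something like this (a sketch only, device names are placeholders; the special vdev keeps metadata, and optionally small blocks, on SSD):

zpool create tank /dev/sdX special mirror /dev/nvme0n1 /dev/nvme1n1
# optionally send small, metadata-heavy blocks to the special vdev as well
zfs set special_small_blocks=64K tank
zfs set atime=off tank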

Edit

I have also submitted this feature request to be voted on:

Also, I saw good tips in On tuning ext4 for storage nodes, but the issue on my system is far beyond the performance gains from those tunings :frowning:

If you can dedicate so much RAM, why is the kernel not able to keep metadata cached? It should.

It keeps it for some time, but in the end it discards the cache (even with vm.vfs_cache_pressure = 10).
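
For reference, this is how I set it (the value is just what I’m testing with):

# prefer keeping dentry/inode caches over reclaiming them
sysctl -w vm.vfs_cache_pressure=10
# make it persistent across reboots
echo 'vm.vfs_cache_pressure = 10' > /etc/sysctl.d/99-storj.conf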

Also, it does not solve the poor FS performance: if I have to change something in the computer that requires shutting it down, that means ~16-24 h of high load and 100% I/O usage.

Also, I don’t understand how I ended up with ~50% of directories fragmented…

I know this doesn’t exactly solve the current issue, but it should help avoid creating the issue again in the future.

I dunno how you can fix it… I don’t suppose one can defrag the partition like in Windows… but that might take longer than actually copying the data to another disk, because of head thrashing.

It looks like this in the docker run command:


Template
# optional: limit the Docker log size by adding --log-opt max-size=1m
docker run -d --restart unless-stopped --stop-timeout 300 \
  -p 192.168.1.100:28967:28967/tcp -p 192.168.1.100:28967:28967/udp \
  -p 192.168.1.100:14002:14002 \
  -e WALLET="0x111111111111111111111" \
  -e EMAIL="your@email.com" \
  -e ADDRESS="global.ip.inet:28967" \
  -e STORAGE="4TB" \
  --mount type=bind,source="/sn3/id-sn3",destination=/app/identity \
  --mount type=bind,source="/sn3/storj",destination=/app/config \
  --name sn3 storjlabs/storagenode:latest \
  --filestore.write-buffer-size 4096kiB --pieces.write-prealloc-size 4096kiB

Thanks for the response!

I saw that thread this morning, but when I checked I discovered that pieces.write-prealloc-size is 4 MiB by default (and I haven’t modified it), and that filestore.write-buffer-size only affects memory allocated in RAM.

Am I right?

I’m open to moving all the data off, reformatting, and putting the data back, but I want to be sure it does not happen (so heavily) again in the future.

I think the RAM change is also required, or else the file will be written to disk earlier…
With enough room for any file in RAM, an incoming file will be buffered and then written in its entirety.

But yeah, I’m not completely 100% sure of the exact details of how it works…

Like, say, how does this behave if a file is changed afterwards?
I do believe the other parameter defines how much space is preallocated on disk.
But then what happens if the file only ever ends up being 1 MB?
Then one would have a hole left on the disk…
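
Roughly, the preallocate-then-shrink idea looks like this on ext4 (just a generic sketch, not necessarily what storagenode actually does internally):

# reserve 4 MiB up front so the allocator can pick one contiguous extent
fallocate -l 4MiB /tmp/piece.test
# the piece turns out to be only 1 MiB, so shrink it and give the rest back
truncate -s 1M /tmp/piece.test
# inspect the resulting extents
filefrag -v /tmp/piece.test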

So yeah… my understanding of how exactly this works is incomplete.
I think Storj may have changed some of the settings, such as the preallocation, due to people complaining about fragmentation.

But ext4 isn’t really my area; I’ve barely ever used it and don’t know much about it.
I do know storagenodes write an immense number of small files / a lot of IO, and that it can be really rough to manage.

There is the option of moving the databases to SSD, which should help… but I’ve got a lot of caches, so I haven’t tested that myself…
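
If I remember right, that is done with the storage2.database-dir option plus an extra bind mount, something like this (from memory, paths are placeholders, so double-check against the official docs):

# extra bind mount for the databases, added to the docker run command above
--mount type=bind,source="/mnt/ssd/storj-dbs",destination=/app/dbs \
# and the corresponding storagenode flag at the end of the command
--storage2.database-dir=/app/dbs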

Your problem does sort of make me wonder whether this is related to why the filewalker becomes more and more demanding to run as nodes age…
It doesn’t seem to be directly correlated with size.

Not that I can really prove that… it’s just how it seems to me.

Yeah, I wish there was a better option. I’ve got a staged restart script which I use for upgrades, but I don’t think there’s a good solution right now for unclean shutdowns, except maybe some specific file systems or tools like bcache. For clean shutdowns, I stop all the containers explicitly and start them again with the staging script, but this means an hour or two of downtime for some of the nodes. Not a big price for avoiding HDD thrashing.
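
The staging script itself is nothing fancy, roughly this (container names and the delay are just examples):

#!/bin/sh
# start the nodes one at a time so only one filewalker hits the disks at once
for node in sn1 sn2 sn3; do
    docker start "$node"
    sleep 3600   # let the filewalker make progress before the next node starts
done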

I currently do something similar.

That makes sense. I thought that with the preallocation I would no longer have this fragmentation, but your explanation makes a lot of sense.

However, I’m not sure how this affects directory fragmentation.

Yeah, no clue… can that even be avoided? I assume directory fragmentation would mean that files are not sequentially located on the drive based on their directories…
But how possible is that, really, in the long term?

Data keeps coming in and getting deleted… I suppose to some degree files could be kept in sequence, but it’s pretty doubtful that’s completely possible with a workload as complex as a storagenode’s.

But I will be following along to see if you end up unveiling some profound insights during your deep dive into this.


According to this thread, it looks like directory fragmentation is what happens when you create and delete a lot of files and the directory’s index blocks are never compacted, so the directory ends up needing more entries (located in different sectors of the disk) to keep referencing more files.

Looks like a difficult problem to solve on a Storj Ext4 disk, aside from running directory optimizations (e2fsck -D) from time to time.
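
For the record, this is how I would run that optimization (device and mount point are examples; the filesystem has to be unmounted, so the node must be stopped first):

docker stop -t 300 sn3
umount /mnt/storj-old
# -D rebuilds and compacts the directory indexes, -f forces a full check
e2fsck -fD /dev/sdX1
mount /mnt/storj-old
docker start sn3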

Regarding file fragmentation: I’m not sure how much it affects the performance of the filewalker, as it should affect reading a piece, but not reading its metadata (the stat() operation).
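
If anyone wants to check individual pieces, filefrag shows the extent count per file (the path is just an example of my layout; one extent means the piece is not fragmented):

filefrag -v /mnt/storj-old/storagenode/storage/blobs/<satellite>/aa/<piece>.sj1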

I will probably run some experiments with other filesystems (currently looking at XFS and ZFS with a special vdev) to compare against Ext4.

It is understandable that, with time and the creation/deletion of files, this problem will get worse on any filesystem, but I suppose some suffer from it more than others.


I have also submitted a feature request; feel free to vote for it if you think it is helpful!


How do I input these values in the docker run command?
--filestore.write-buffer-size 4MiB or
--filestore.write-buffer-size=4MiB or
--filestore.write-buffer-size="4MiB" ?

I looked at -e STORAGE="14TB" and it makes me think the third one is the right way, but -e is a Docker parameter while --filestore is a storagenode flag. I don’t know if this matters.
Thanks for any clarifications!
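
For what it’s worth, all three forms should be equivalent by the time the binary sees them: the shell strips the quotes, and Go-style flags normally accept both a space and an equals sign between the flag and its value. A quick way to convince yourself about the quotes (generic shell, nothing Storj-specific):

# the shell removes the quotes before docker (or anything else) sees the argument
echo --filestore.write-buffer-size="4MiB"
# prints: --filestore.write-buffer-size=4MiB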

Thanks!
BTW, the kiB notation is incorrect. It must be KiB. :wink:


My node has probably been running for years with kiB vs. KiB :astonished: but it seems to be fine.

Gosh, why did they do it this way‽ :sob:

And there is more… :smiling_imp: As we can see, KiB refers to non-volatile storage (HDD) and KB to volatile memory (RAM). So, if we want to follow the rules :blush:, those 2 config flags should look like:

--filestore.write-buffer-size 4096KB \
--pieces.write-prealloc-size 4096KiB

as one refers to RAM, and the second to HDD.
But I imagine the code of storagenode considers KB or kb or kB or Kb the same as 1000 bytes, and KiB, kib, kiB, KIB, Kib the same as 1024 bytes.
