On tuning ext4 for storage nodes

The ext4 file system is usually the default file system on Linux, as well as on many NAS devices. It has several knobs that can be used to adapt it to specific use cases. I’ve been doing some experiments lately on how to tune them for storage nodes, and found that there are some nice gains to be had from the right kind of tuning.

An inode is the place in a file system that stores information about a file or directory, such as its modification date, access permissions, the location of its data blocks and so on. The default size of a single ext4 inode is 256 bytes. It used to be 128 bytes (back in the ext2/ext3 days; even now the constant defining this number is named EXT2_GOOD_OLD_INODE_SIZE), but more space was needed to support some newer features. However, right now Storj doesn’t really need these new features, so we can change it back to 128 bytes. To do so, pass -I 128 as an additional parameter to mke2fs (a quick way to check what an existing file system uses is shown after the list below). The most important points of this trade-off are:

  • You can no longer store files with modify/access times after the year 2038. Storj does use the modify time to search for trash files to delete, but for the next 15 years this will not be a problem, and drives that exist now will likely not be very profitable in 15 years, if they still work at all. So it’s not a big drawback for today’s nodes.

  • Some of the space within a larger inode can be used to store very small files without allocating additional blocks, which may speed up access to the contents of those small files. Similarly, this space can also be used to store extended attributes. The smallest Storj files are now 1kB, and storage nodes do not use extended attributes, so these features aren’t useful here.

  • Given that inodes are now half the size, more of them fit in a single disk sector. This speeds up directory reads, and from my observations just making inodes smaller reduces the time to run du (roughly the equivalent of the file walker process) by around 30%.
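To check what an existing file system uses before deciding, tune2fs can report the inode size and counts. A minimal check, assuming the file system lives on /dev/sdb1 (adjust the device name to yours):

tune2fs -l /dev/sdb1 | grep -E 'Inode size|Inode count|Free inodes'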

Some data structures within the ext4 file system are pre-allocated during the initial setup. The significant ones here are: the journal, inodes, and superblock backups.

The journal stores metadata operations before they are committed to the other data structures. Its default size depends on the file system size, but unless your file system is smaller than 128GB, the journal gets 262144 blocks, which usually translates to 1GB. In theory, the larger the journal, the faster the file system operates on metadata (and on data, if the data=journal setting is used), but on Storj workloads I don’t see much difference between 1GB and, let’s say, 128MB. The journal size can be set with -J size=128 (the value is in megabytes), and can also be changed later with tune2fs, as sketched below. I suspect it could be tuned down even more, but I haven’t tested that and, as the disk space gain would be very small, it’s probably not worth spending time on.
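For an existing node the journal can also be shrunk without reformatting. A rough sketch, assuming the file system is on /dev/sdb1 and is currently unmounted (device name is just an example); the old journal is removed and a smaller one is created in its place:

e2fsck -f /dev/sdb1
tune2fs -O ^has_journal /dev/sdb1
tune2fs -J size=128 /dev/sdb1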

There has to be one inode for each file or directory. However, ext4 requires the disk space for them to be allocated upfront, so there’s often a lot of disk space set aside only for inodes, never to be used for storing data. The amount of disk space used for inodes is controlled by a bytes-per-inode ratio chosen at file system creation time. This ratio is then maintained over the whole lifetime of the file system (e.g. when you grow the file system, new inodes are also preallocated, and when you shrink it, the amount of disk space reserved for inodes shrinks too), with no ability to change it later. The default is one inode per 16kB. There have to be enough inodes for every file and directory that will ever exist, so we cannot just arbitrarily set this ratio to some very high value. Currently the average size of a Storj file seems to be around 800kB, so even accounting for the risk of suddenly having a large number of small files uploaded (despite the per-segment fee), I think 64kB is still quite safe. You can set this ratio by passing -i 65536 to mke2fs.
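To sanity-check this ratio against a real node, it helps to compare how many inodes are actually in use versus how many were preallocated; the mount point below is just an example:

df -i /mnt/storj

If only a small fraction of the inodes is in use, a higher bytes-per-inode ratio is likely safe for your data.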

A superblock is the most critical piece of information within the file system; without it, you can’t figure out where the files are. As such, ext4 stores multiple copies of the superblock. Originally there was one copy per each 128MB of disk space, which with modern disk sizes would mean thousands of copies (the default sparse_super feature already thins this out considerably). The -O sparse_super2 option goes further and keeps just two backup copies. The trade-off is that you can no longer resize the file system without unmounting it.
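If you are curious how many backup copies a given file system keeps, dumpe2fs will list them (the device name is an example):

dumpe2fs /dev/sdb1 | grep -i 'backup superblock'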

All of the above together freed around 1.5% of disk space for storing data, without disrupting the operation of my nodes. So, to summarize: if you plan to have a separate file system just for a storage node, create it with mke2fs -t ext4 -m 0 -i 65536 -I 128 -J size=128 -O sparse_super2.


Very interesting information, thanks.

Another idea is to mount the disk using the noatime flag (to avoid reading and modifying the access-time metadata).

fstab line:
UUID=[YOUR_UUID_HERE] [YOUR_MOUNT_POINT_HERE] ext4 defaults,noatime 0 0
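The option can also be applied to an already mounted file system without a reboot, for example (mount point is a placeholder):

mount -o remount,noatime /mnt/storj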


I think you’ve gone astray somewhere. Your -m 0 gives you 5% on its own.

-m 0 doesn’t actually give you more space. It just releases existing free space to non-root users. I left it in the summary as it’s a pretty obvious switch for Storj, mentioned here on the forums many times.
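For anyone who already formatted with the defaults, the reserved percentage can be changed at any time on an existing file system, e.g. (device name is an example):

tune2fs -m 0 /dev/sdb1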

Is that just an “idea” or will there be noticeable performance improvements?

Small, but they exist: the system updates the last access time every time a file is read, and writes this information to disk.
So pieces served for downloads and audits will each take one write operation less.

Newly created files write other metadata anyway, so that write happens regardless; and Storj doesn’t modify pieces, they only get created or deleted.

The performance gain is noticeable on a big node, but pretty irrelevant for a small one.

Very interesting, thanks.
I guess I will give this a try soon.

Not that there can’t be amazing performance gains from tuning, but the defaults are the defaults for a reason; sure, those reasons might not apply to this exact use case…

I would caution against changing too much or deviating far from the standards… usually my rule of thumb is to stay within 50% to 200% of the default, because then I can usually be confident that if it does hit me later, it’s more a gentle tap rather than the almighty Thor opening up the heavens and throwing Mjolnir at me.

I’ve had many of my “tuning” attempts end up being detrimental rather than doing what I thought they would.

One good example: I initially settled on using 256K recordsizes with my ZFS, but I later found that this was actually causing my setup to run worse, not better, even though in the short term it looked like the benefits in migration speed outweighed the extra memory and cache usage.

Another thing I initially did was run 512-byte sectors, which worked great… until later, when I found out that the reason we have been moving towards larger sectors is the memory usage for indexing the file system… each sector needs to be recorded, and thus 512B sectors vs 4Kn means 8x the memory usage.

I think I end up setting about 90% of what I change back to defaults… of course once in a while I do find some amazing tuning parameter, but just be wary… if you don’t want to spend time on your setup, be careful about changing stuff.

Whether these recommendations are good or bad, I have no idea, because I haven’t tested them…
Just saying: think twice and be wise.

atime=off and xattr=off are recommended on ZFS, because they save read and write I/O.
I’ve been running those since I started using ZFS.
Supposedly it reduces the I/O from 4 operations to 1, or something like that.
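For reference, on ZFS these are per-dataset properties, so something like the following should do it; the pool/dataset name is just an example:

zfs set atime=off tank/storj
zfs set xattr=off tank/storj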

A saving of 1.5% disk space is somewhat small, and the fact is that nobody has posted a performance (time) measurement showing how the modified ext4 settings compare to default ext4 settings.

It would for example be interesting to know whether there is a major difference between the default ext4 settings and the suggested ext4 settings (-i 65536 -I 128 -J size=128 -O sparse_super2) when running the following command:

# cd /mnt/storj; time find . | wc -l

Many HDD performance issues come down to how many head seeks are required in order to satisfy a read request. How do the suggested ext4 settings reduce the need to perform a seek?

Indeed. There would be better gains with btrfs, actually—around 3% compared to ext4 defaults—almost twice that! But…

…it was slow as hell. I’ve tested several scenarios. Regarding the original post, except for the -I 128 switch reducing the time of du (and hence likely the time to run the filewalker process), I’ve observed no performance changes compared to default ext4.
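For anyone who wants to repeat the du comparison on their own node, the measurement is just a timed recursive scan; the path below is an example of where a node keeps its pieces:

time du -sh /mnt/storj/storage/blobs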

btrfs was more than twice as slow by all metrics, even with the best settings I found (mkfs.btrfs -m single -d single, mount -o nodatacow,noatime,max_inline=0). I suspect that bandwidth.db updates dominate I/O performance there; maybe moving them to flat files, like the orders were moved, would help. But at this point this is only a suspicion, and I’m waiting for some other tests to finish before checking this as well.

Pretty much just that the inodes are closer together (being smaller, and there being fewer of them).

The problem is that any performance comparison would need to be done in a repeatable manner.

For example - if I run the command now, then create a new virtual disk, copy everything to it and run the command then, I will get different results because the files would be written in a different order.

I would rather be able to online resize the file system than squeeze out a small bit of space by using sparse_super2. Using 128B inodes is interesting, not that I would change this on my node now, but something to keep in mind. Smaller inodes would mean more of them can fit in RAM cache.

Defaults work reasonably well for most use cases. If the use case differs from most, tweaking the settings might lead to an improvement. For example, the default cluster size for NTFS is 4K; however, if I am only storing large files, I might use 64K for better performance and less fragmentation.

You write this as if you didn’t believe I tried :wink: Indeed, I’ve dedicated a separate HDD to these tests, reformatting it for each test and running a script that attempts to reproduce storage node operations as faithfully as I can, down to every single ftruncate and fsync that the node’s code seems to perform. I try to do 5-10 tests for each setting, depending on how variable the results are.

For example, ext4’s defaults show a coefficient of variation in my setup of around 0.025 (meaning that for a test usually taking ~1000 seconds, the standard deviation of the time measurements is about 25 seconds). btrfs is much better in this regard, with a coefficient of variation of 0.0027 (almost a tenfold decrease compared to ext4).


Oh, about half of the files that my nodes store are less than 4kB. Not sure whether this is universal though.

I was not talking about Storj in that example :slight_smile:

Storj uses files starting from 512B in size.

I do not have the count of files by size though.


It is something like that in most cases, I believe… but still, the sector size isn’t that huge of a deal… one can still have like 1 million files in 4GB using 4Kn sectors…

So it’s basically not worth the hassle of running 512B, because that uses crazy amounts of RAM at high capacity, and RAM is much more expensive than HDD storage.

But storing 1 million 512B files on a 512B-sector setup would only take 512MB of storage…
so certainly some gains to be had…

However, compare that to files that might be 2MB on their own… such a file takes up around 4000 sectors at 512B, while at 4K it only takes up around 500…

So that one file alone would negate the savings from tens of thousands of 512B files…

Storage nodes add a 512-byte header to each .sj1 file, so the smallest file that is actually written to disk is 1kB. But yeah, I kinda agree with 4kB blocks. 16kB though starts to make it somewhat wasteful. My 12 TB worth of nodes would waste 180 GB with 16kB blocks.


Have you studied the possibility of also moving the journal off the disk and onto faster storage like NVMe? Would it speed things up?

I do not have a way to add an NVMe drive to the testing device. But as far as I understand, this wouldn’t help with the file walker process anyway. The journal is only read during recovery, never in normal operation. It helps speed up writes (e.g. it should help with sqlite writes), but it isn’t used at all when scanning directories, as du does.
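For completeness, ext4 does support putting the journal on a separate device; I haven’t tested it, but the setup would look roughly like this, with device names being just examples:

mke2fs -O journal_dev /dev/nvme0n1p1
mke2fs -t ext4 -J device=/dev/nvme0n1p1 /dev/sdb1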

What could help here is something like bcache, but again, I do not have the means to test it. Besides, I do not believe NVMe should be necessary for a correctly functioning storage node. The storage node doesn’t have any logic that would require NVMe speeds; it should be possible to get rid of the current inefficiencies in a purely software way, without putting additional requirements on the hardware.


One of the issues I’m facing is that after a lot of reads and writes, the inodes of long directories got full and required extra inodes to handle all the information, which makes reading metadata slower.

That can be improved by making inodes bigger instead of smaller (I’m testing with 512). What do you think about it?

Reference


I don’t see why this would work. Directory entries do not store the file size or modification time, so to read the sizes of all files you still need to read the inodes of those files, and with 512-byte inodes that is twice as much data read from disk. Besides, a single directory entry takes 8 bytes plus the name of the file, which for Storj is usually 54 characters, making it 62 bytes. This means you’d fit only around 4 real entries within the free space of the inode, which doesn’t really seem useful when a significant majority of Storj subdirectories (~98% on my nodes) are larger than 4 entries.

Frankly, I don’t understand the argument from your reference link. A large directory/file does not spill its block/extent list into new inodes (it spills into new full blocks), and it does not use the free space in the inode for this purpose either. According to the kernel documentation, this free space is used only for extended attributes (useless for Storj), for the contents of small files (also useless for Storj, because Storj files are at least 1kB), or for directory entries, as discussed above.

So much for theory. In my tests, du got faster after I set the inode size to 128. It got even faster with an inode size of 128 and without the 2-letter directories (instead having all *.sj1 files for a given satellite in a single directory), suggesting that for this kind of operation smaller directories actually hurt performance, and hinting that spillover is not a problem on its own. But indeed I haven’t tested setting it to 512; maybe I’ll try that with the next batch of tests.
