On tuning ext4 for storage nodes

I think you’ve gone astray somewhere. Your -m 0 gives you 5% on its own

-m 0 doesn’t actually give you more space. It just releases the existing reserved space (5% by default) to non-root users. I left it in the summary because it’s a pretty obvious switch for Storj, mentioned on the forums many times.
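For what it’s worth, this can also be applied to an existing filesystem without reformatting (a sketch; /dev/sdX1 is a hypothetical device name, substitute your node’s partition):

```shell
# Set the root-reserved blocks percentage to 0; existing data is untouched
tune2fs -m 0 /dev/sdX1
```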

Is that just an “idea” or will there be noticeable performance improvements?

Small, but they exist: by default the system updates the last access time every time a file is read, and writes this information to the disk.
So all uploaded pieces and audits will cost one write operation less.
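A minimal sketch of how that’s usually done, assuming the node data is mounted at /mnt/storj (device name hypothetical):

```shell
# /etc/fstab entry (hypothetical device): noatime stops reads from writing metadata
# /dev/sdX1  /mnt/storj  ext4  noatime  0  2

# or apply immediately on a mounted filesystem:
mount -o remount,noatime /mnt/storj
```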

Newly created files write other metadata anyway, so the write happens regardless. And Storj doesn’t modify pieces; they only get created or deleted.

The performance difference is noticeable on a big node, but pretty irrelevant for a small one.

Very interesting thanks.
I guess I will give this a try soon.

Not that there can’t be amazing performance gains from tuning.
But the defaults are the defaults for a reason; sure, those reasons might not apply to this exact use case…

I would caution against changing too much or deviating far from the standards… usually my rule of thumb is to stay within 50% to 200% of a default, because then I can usually be confident that if it does hit me later,

it’s more a gentle tap rather than the almighty Thor opening up the heavens and throwing Mjolnir at me.

I’ve had many of my “tuning” attempts end up being detrimental rather than doing what I thought they would.

One good example: I initially settled on using 256K recordsizes with my ZFS, but I later found this was actually making my setup run worse, not better, even though in the short term it looked like the benefits in migration speed outweighed the adverse extra memory and cache usage.

Another thing I initially did was run 512-byte sectors, which worked great… until later, when I found out that the reason we have been moving towards larger sectors is the memory used for indexing the file system: each sector needs to be recorded, so 512B sectors vs 4Kn result in 8x the memory usage.
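That 8x factor is just the ratio of the sector sizes; a quick back-of-envelope sketch (drive capacity is a made-up example):

```python
# Back-of-envelope: how many sectors a drive of a given capacity exposes
# at each sector size. Per-sector bookkeeping scales with the sector count,
# which is where the 8x memory figure comes from.

def sectors(capacity_bytes, sector_size):
    """Number of sectors needed to address the whole device."""
    return capacity_bytes // sector_size

capacity = 10 * 10**12  # hypothetical 10 TB drive
print(sectors(capacity, 512) // sectors(capacity, 4096))  # -> 8
```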

I think 90% of what I change I usually end up setting back to defaults… of course, once in a while I do find some amazing tuning parameter, but just be wary… if you don’t want to spend time on your setup, be careful about changing stuff.

Whether these recommendations are good or bad… I have no idea, because I haven’t tested them…
Just saying: think twice and be wise.

atime=off and xattr=off are recommended on ZFS, because they save read and write I/O.
I’ve been running that since I started using ZFS.
Supposedly it reduced I/O from 4 operations to 1, or something like that.
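For reference, these are set per dataset (pool/storj is a hypothetical dataset name; substitute your own):

```shell
# Disable access-time updates and extended attributes on the node's dataset
zfs set atime=off pool/storj
zfs set xattr=off pool/storj
```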

A saving of 1.5% disk space is somewhat small, and the fact is that nobody has posted a performance (time) measurement showing how the modified ext4 settings compare to the defaults.

It would for example be interesting to know whether there is a major difference between the default ext4 settings and the suggested ext4 settings (-i 65536 -I 128 -J size=128 -O sparse_super2) when running the following command:

# cd /mnt/storj; time find . | wc -l
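For reference, the suggested switches combined into a single format command (a sketch; /dev/sdX1 is a hypothetical device name, and running this destroys any existing data on it):

```shell
# ext4 with the tuned settings discussed in this thread
mkfs.ext4 -m 0 -i 65536 -I 128 -J size=128 -O sparse_super2 /dev/sdX1
```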

Many performance issues related to HDDs are related to how many HDD head seeks are required in order to satisfy a read request. How do the suggested ext4 settings reduce the need to perform a seek?

Indeed. There would be better gains with btrfs, actually—around 3% compared to ext4 defaults—almost twice that! But…

…it was slow as hell. I’ve tested several scenarios. Regarding the original post, except for the -I 128 switch reducing the time of du (and hence, likely, of the file walker process), I’ve observed no performance changes compared to default ext4.

btrfs was more than twice as slow under all metrics, even under the best settings I found (mkfs.btrfs -m single -d single, mount -o nodatacow,noatime,max_inline=0). I suspect that bandwidth.db updates dominate I/O performance there; maybe moving them to flat files, like the orders were moved, would help. But at this point this is only a suspicion, and I’m waiting for some other tests to finish before checking this as well.

Pretty much just that the inodes are closer together (being smaller, and there being fewer of them).
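The packing effect is easy to quantify: with a standard 4 KiB filesystem block, halving the inode size doubles how many inodes each block of the inode table holds, so reading them takes fewer blocks (and fewer seeks). A small sketch:

```python
# How many inodes fit in a single 4 KiB filesystem block at each inode size.
# Denser packing means the inode table spans fewer blocks on disk.
BLOCK = 4096

def inodes_per_block(inode_size):
    return BLOCK // inode_size

print(inodes_per_block(128), inodes_per_block(256))  # -> 32 16
```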

The problem is that any performance comparison would need to be done in a repeatable manner.

For example - if I run the command now, then create a new virtual disk, copy everything to it and run the command then, I will get different results because the files would be written in a different order.

I would rather be able to online resize the file system than squeeze out a small bit of space by using sparse_super2. Using 128B inodes is interesting, not that I would change this on my node now, but something to keep in mind. Smaller inodes would mean more of them can fit in RAM cache.

The defaults work reasonably well for most use cases. If the use case differs from most, tweaking the settings might lead to an improvement. For example, the default cluster size for NTFS is 4K. However, if I am only storing large files, I might use 64K for better performance and less fragmentation.

You write this as if you didn’t believe I tried :wink: Indeed, I’ve dedicated a separate HDD to these tests, reformatting it for each test and running a script that attempts to reproduce storage node operations as faithfully as I can, reproducing every single ftruncate and fsync that the node’s code seems to perform. I try to do 5-10 tests for each setting, depending on how variable the results are.

For example, ext4’s defaults seem to have a coefficient of variation in my setup at around 0.025 (meaning, for a test usually taking ~1000 seconds, standard deviation of time measurements is at 25 seconds). btrfs is much better in this regard, getting the coefficient of variation at 0.0027 (almost tenfold decrease compared to ext4).
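The coefficient of variation is just the standard deviation divided by the mean; a small sketch with made-up timings matching the ext4 figure above:

```python
from statistics import mean, stdev

def coefficient_of_variation(samples):
    # Relative spread: standard deviation divided by the mean.
    return stdev(samples) / mean(samples)

# Hypothetical run times around ~1000 s with ~25 s of spread
times = [975.0, 1025.0, 1000.0, 980.0, 1020.0]
cv = coefficient_of_variation(times)
```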


Oh, about half of the files that my nodes store are less than 4kB. Not sure whether this is universal though.

I was not talking about storj :slight_smile: in that example.

Storj uses files starting from 512B in size.

I do not have the count of files by size though.


It is something like that in most cases, I believe… but still, the sector size isn’t that huge of a deal… one can still have like 1 million files in 4GB using 4Kn sectors…

So it’s basically not worth the hassle to run 512B, because that uses crazy amounts of RAM at high capacity, and RAM is much more expensive than HDD storage.

But storing 1 million 512B files on a 512B-sector setup would only take 512MB of storage…
so there are certainly some gains to be had…

However, compare that to a single file of, say, 20MB: it would take up about 40000 sectors at 512B, while at 4K it takes only about 5000…

so that one file alone would negate the space savings of tens of thousands of 512B files…

Storage nodes add a 512-byte header to each .sj1 file, so the smallest file that is actually written to disk is 1kB. But yeah, I kinda agree with 4kB blocks. 16kB though starts to make it somewhat wasteful. My 12 TB worth of nodes would waste 180 GB with 16kB blocks.
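Under the common assumption that the last block of each file is on average half full, slack scales linearly with file count; working backwards from the figures above gives the implied file count. A rough sketch, not a measurement:

```python
# Rough slack estimate: each file wastes about half a block on average
# in its final, partially-filled block (assuming uniform final-block fill).

def expected_slack_bytes(n_files, block_size):
    return n_files * block_size // 2

# Working backwards from the quoted 180 GB of waste at 16 kB blocks:
implied_files = int(180e9 // (16384 // 2))
print(implied_files)  # -> 21972656, i.e. roughly 22 million files
```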


Have you studied the possibility of also moving the journal off the disk, to faster storage like NVMe? Would it speed things up?

I do not have a way to add an NVMe to the testing device. But again, as far as I understand, this wouldn’t help with the file walker process. The journal is only read on recovery, never in normal operation. It speeds up writes (e.g. it should help with sqlite writes), but it isn’t used at all when scanning directories as du does.
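For completeness, a sketch of what an external journal setup would look like (untested here; device names are hypothetical, and the filesystem must be unmounted):

```shell
# Create a dedicated journal device on the NVMe partition
# (block size should match the filesystem, typically 4096)
mke2fs -b 4096 -O journal_dev /dev/nvme0n1p1
# Drop the internal journal, then attach the external one
tune2fs -O ^has_journal /dev/sdX1
tune2fs -J device=/dev/nvme0n1p1 /dev/sdX1
```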

What could help here is something like bcache, but again, I do not have the means to test it. Besides, I do not believe NVMe should be necessary for a correctly functioning storage node. The storage node doesn’t have any logic that would require NVMe speeds; it should be possible to get rid of the current inefficiencies in a purely software way, without putting additional requirements on hardware.


One of the issues I’m facing is that after a lot of reads and writes, the inodes of long directories get full and require extra inodes to handle all the information, which makes reading metadata slower.

That can be improved by making inodes bigger instead of smaller (I’m testing with 512). What do you think about it?

Reference


I don’t see why this would work. Directory entries store neither file size nor modification time, so to read the sizes of all files you still need to read the inodes of those files, i.e. twice as much data from disk. Besides, a single directory entry takes 8 bytes plus the name of the file, which for Storj is usually 54 characters, making it 62 bytes. This means you’d fit around 4 real entries within the free space of the inode. That doesn’t really seem useful when a significant majority of Storj subdirectories (~98% on my nodes) are larger than 4 entries.
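For those following along, the per-entry size follows from the on-disk layout described in the ext4 kernel documentation; note that the 62 bytes above rounds up once the kernel’s 4-byte alignment is applied. A small sketch:

```python
# Size of an ext4 directory entry for a given file-name length:
# 8-byte fixed header (inode, rec_len, name_len, file_type) plus the name,
# with the record length rounded up to a 4-byte boundary.

def ext4_dirent_size(name_len):
    return (8 + name_len + 3) // 4 * 4

print(ext4_dirent_size(54))  # -> 64 (the 62 bytes above, after alignment)
```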

Frankly, I don’t understand the argument from your reference link. Large directories/files do not spill the block/extent list into new inodes (they spill into new full blocks), and they do not use the free space in the inode for this purpose either. This free space is, per the kernel documentation, used only for extended attributes (useless for Storj), contents of small files (also useless for Storj, because Storj files are at least 1kB), or directory entries—see the discussion above.

That was the theory. In my tests, du got faster after I set the inode size to 128. And it got even faster with inode size 128 and no 2-letter subdirectories (instead having all *.sj1 files for a given satellite in a single directory), suggesting that for this kind of operation smaller directories actually hurt performance, and hinting that spillover is not a problem on its own. But indeed I haven’t tested setting it to 512; maybe I’ll try it with the next batch of tests.


My understanding was that ext4 stores a directory’s entries as references on the directory inode to the inodes of the files, along with their names. And when the directory inode becomes full, it references another inode of references.

If extra directory entries are stored in full blocks, then that reference is wrong, and directory fragmentation can hardly be improved :sweat_smile:.

I’m currently testing with 128 too

Does it really matter?
How much bandwidth does your node have?