The ext4 file system is usually the default file system on Linux, as well as on many NAS devices. It has several knobs that can be used to adapt the file system to specific use cases. I’ve been doing some experiments lately on how to tune them for storage nodes, and found that there are some nice gains to be had from the right kind of tuning.
An inode is a structure on a file system that stores information on a file or directory, such as its size, modification date, access permissions and so on (the file name itself lives in the directory entry that points to the inode). The default size of a single ext4 inode is 256 bytes. It used to be 128 bytes back in the ext2/ext3 days (even now the constant defining this number is named EXT2_GOOD_OLD_INODE_SIZE), but more space was needed to support some new features. However, right now Storj doesn’t really need these new features, so we can change it back to 128 bytes. To do so, pass -I 128 as an additional parameter to mke2fs. The most important points of this trade-off are:
You can no longer store files with modify/access times after the year 2038. Storj does use modify time to search for trash files to delete, but for the next 15 years this will not be a problem, and drives that exist now will likely not be very profitable in 15 years, if they work at all. So it’s not a big drawback for today’s nodes.
Some of the space within the inode can be used to store very small files, without allocating more blocks. This may speed up access to the contents of these small files. Similarly, this space can also be used to store extended metadata. The smallest Storj files are now 1kB, and storage nodes do not use metadata, so these features aren’t useful here.
Given that inodes are now half the size, more of them fit in a single disk sector. This speeds up directory reads, and from my observations just making inodes smaller reduces the time to run du (roughly the equivalent of the file walker process) by around 30%.
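The packing argument can be checked with a little arithmetic. The sketch below assumes 512-byte sectors and 4096-byte blocks (common values, but not stated in the text), and the function name is made up for illustration.

```python
# Assumed sizes, not taken from the text: 512-byte sectors and 4096-byte blocks.
SECTOR_SIZE = 512
BLOCK_SIZE = 4096


def inodes_per_unit(inode_size: int, unit_size: int) -> int:
    """Return how many whole inodes of inode_size bytes fit in unit_size bytes."""
    return unit_size // inode_size


for inode_size in (256, 128):
    print(
        f"{inode_size}-byte inodes: "
        f"{inodes_per_unit(inode_size, SECTOR_SIZE)} per sector, "
        f"{inodes_per_unit(inode_size, BLOCK_SIZE)} per block"
    )
```

Halving the inode size doubles the number of inodes fetched per read, so scanning the same set of files touches half as much inode table. The measured gain is smaller than 2x because a directory walk also has to read the directory entries themselves.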
Some data structures within the ext4 file system are pre-allocated during the initial setup. The significant ones here are: the journal, inodes, and superblock backups.
The journal stores metadata operations before they are committed to other data structures. The default size depends on the file system size, but unless your file system is smaller than 128GB, the journal gets 262144 blocks, which usually translates to 1GB. In theory, the larger the journal is, the faster the file system will operate on metadata (and on data, if the data=journal setting is used), but on Storj workloads I don’t see much difference between 1GB and, let’s say, 128MB. Journal size can be set with -J size=128 (the value is given in megabytes), and can be changed later with tune2fs. I suspect it could be tuned down even more, but I haven’t tested it and, as the disk space gain would be very small, it’s probably not worth spending time doing so.
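To put the journal numbers in perspective, here is a small worked calculation. It assumes 4096-byte blocks (which is what makes 262144 blocks equal 1GB) and a hypothetical 10 TB drive; the helper names are illustrative only.

```python
MIB = 1024 ** 2
BLOCK_SIZE = 4096          # assumed block size in bytes
DEFAULT_JOURNAL_BLOCKS = 262144  # default journal length quoted in the text


def journal_bytes(blocks: int, block_size: int = BLOCK_SIZE) -> int:
    """Size of a journal of the given length, in bytes."""
    return blocks * block_size


def journal_savings(fs_bytes: int, tuned_mib: int = 128) -> tuple[int, float]:
    """Bytes freed by shrinking the default journal, and the share of the disk."""
    saved = journal_bytes(DEFAULT_JOURNAL_BLOCKS) - tuned_mib * MIB
    return saved, saved / fs_bytes


disk = 10 * 1000 ** 4  # hypothetical 10 TB drive
saved, share = journal_savings(disk)
print(f"default journal: {journal_bytes(DEFAULT_JOURNAL_BLOCKS) // MIB} MiB")
print(f"freed by a 128 MiB journal: {saved // MIB} MiB ({share:.4%} of the disk)")
```

On a drive of this size the journal saving is well under a tenth of a percent, which is consistent with the remark that shrinking it further is not worth the effort.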
There has to be one inode for each file or directory. However, ext4 requires the disk space for them to be allocated up front. Hence there’s often a lot of disk space set aside only for inodes, never to be used for storing data. To set the amount of disk space used for inodes, the user decides on the ratio of disk space to inodes at file system creation time. This ratio is then maintained over the whole lifetime of the file system (e.g. when you grow the file system, new inodes are also preallocated, and when you shrink it, the amount of disk space for inodes is also reduced), with no ability to change it. The default is one inode per 16kB. There have to be enough inodes for every file and directory that will ever exist, so we cannot just arbitrarily set this ratio to some very high value. Currently the average size of a Storj file seems to be around 800kB, so to account for some risk of suddenly having a large number of small files being uploaded (despite the per-segment fee), I think 64kB is still quite safe. You can set this ratio by passing -i 65536 to mke2fs.
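The space cost of the inode tables follows directly from the two parameters: every bytes-per-inode of disk space carries one inode of inode-size bytes. The sketch below works this out for the default and the tuned settings. It is an approximation (mke2fs rounds inode counts per block group), and the function names and the 10 TB disk are hypothetical.

```python
def inode_table_fraction(inode_size: int, bytes_per_inode: int) -> float:
    """Approximate share of the disk reserved for inode tables."""
    return inode_size / bytes_per_inode


def inode_count(fs_bytes: int, bytes_per_inode: int) -> int:
    """Approximate number of inodes (max files plus directories)."""
    return fs_bytes // bytes_per_inode


def inode_usage(avg_file_size: int, bytes_per_inode: int) -> float:
    """Share of inodes consumed when the disk is full of average-sized files."""
    return bytes_per_inode / avg_file_size


default = inode_table_fraction(256, 16384)  # -I 256 -i 16384
tuned = inode_table_fraction(128, 65536)    # -I 128 -i 65536
print(f"default inode tables: {default:.3%} of the disk")
print(f"tuned inode tables:   {tuned:.3%} of the disk")
print(f"freed:                {default - tuned:.3%} of the disk")

disk = 10 * 1000 ** 4  # hypothetical 10 TB drive
print(f"inodes available at -i 65536: {inode_count(disk, 65536):,}")
print(f"inodes used at 800kB average: {inode_usage(800_000, 65536):.1%}")
```

Most of the saving comes from this setting: the default layout spends about 1.56% of the disk on inode tables, the tuned one about 0.2%. With an average file size of 800kB, a full disk would use under a tenth of the available inodes, which is the safety margin the 64kB choice relies on.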
A superblock is the most critical piece of information within the file system, as without it, you can’t figure out where the files are. As such, ext4 stores multiple copies of the superblock. It used to be that there was one copy for every 128MB of disk space, but with modern disk sizes this results in thousands of copies. The -O sparse_super2 option changes this to just two backup copies. The trade-off is that you can no longer resize the file system without unmounting it.
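The 128MB figure is the size of one ext4 block group when blocks are 4096 bytes: a group's block bitmap occupies one block, so it can track 8 × 4096 blocks. The sketch below, which assumes that block size and a hypothetical 10 TiB file system, counts the block groups, and therefore the number of superblock copies in the old one-copy-per-group layout described above.

```python
def block_group_bytes(block_size: int = 4096) -> int:
    """Size of one block group: the one-block bitmap tracks 8 * block_size blocks."""
    blocks_per_group = 8 * block_size
    return blocks_per_group * block_size


def block_groups(fs_bytes: int, block_size: int = 4096) -> int:
    """Number of block groups needed to cover fs_bytes (rounded up)."""
    group = block_group_bytes(block_size)
    return -(-fs_bytes // group)  # ceiling division


fs = 10 * 1024 ** 4  # hypothetical 10 TiB file system
print(f"block group size: {block_group_bytes() // 1024 ** 2} MiB")
print(f"block groups:     {block_groups(fs):,}")
```

Each backup also carries a copy of the group descriptors, so the space involved is more than one block per copy; the exact overhead depends on the enabled features and is not computed here.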
All of the above together resulted in having around 1.5% of disk space freed for storing data without disrupting operation of my nodes. So, to summarize: if you plan to have a separate file system just for a storage node, create your file system with mke2fs -t ext4 -m 0 -i 65536 -I 128 -J size=128 -O sparse_super2.