The nice thing about LVM cache is that it’s very low risk: you can lose/remove/resize/re-add the cache at any time. Just be sure to follow On tuning ext4 for storage nodes when formatting your main volume initially – especially using a 128-byte inode size and noatime+nodiratime.
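For reference, a minimal sketch of that initial setup, assuming the data partition is /dev/sdY1 and the mount point is /mnt/storjX (both placeholders; see the linked thread for the full reasoning):
mkfs.ext4 -I 128 /dev/sdY1 # 128-byte inodes keep the inode tables compact enough to cache well
mount -o noatime,nodiratime /dev/sdY1 /mnt/storjX # skip access-time updates to avoid extra metadata writes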
I have a node with 5.0 TB and 28M files on a 14TB disk. I dropped and recreated a 245GB cache, then ran du until reaching a steady state, clearing the RAM cache between each run with sync && echo 3 > /proc/sys/vm/drop_caches (the loop is sketched after the run log below):
time du -h # elapsed: 29 minutes for initial scan
time du -h # elapsed: 13 minutes, 70535 promoted chunks
time du -h # elapsed: 9.4 minutes, 70544 promoted chunks
time du -h # elapsed: 8.3 minutes, 76001 promoted chunks
time du -h # elapsed: 7.7 minutes, 77025 promoted chunks
time du -h # elapsed: 7.4 minutes, 77782 promoted chunks
time du -h # elapsed: 7.3 minutes, 78156 promoted chunks
time du --inodes -h # elapsed: 5.3 minutes (inode count is faster), 78300 promoted chunks
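The loop itself, as a rough script (assuming the node data is mounted at /mnt/storjX, a placeholder; adjust the path and run count as needed):
for i in $(seq 1 8); do
  sync && echo 3 > /proc/sys/vm/drop_caches # flush and drop page/dentry/inode caches
  echo "run $i:"
  time du -sh /mnt/storjX
done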
Alright, enough! Running lvdisplay tells us:
LV Size 12.73 TiB
Cache used blocks 7.83%
Cache metadata blocks 16.40%
Cache dirty blocks 0.00%
Cache read hits/misses 12780848 / 916165
Cache wrt hits/misses 1 / 0
Cache demotions 0
Cache promotions 78300
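The same counters can also be pulled with lvs reporting fields, which is handier for scripting (a sketch; poolX/storjX is a placeholder VG/LV and the field names come from lvs -o help):
lvs -a -o lv_name,cache_used_blocks,cache_dirty_blocks,cache_read_hits,cache_read_misses,cache_write_hits,cache_write_misses poolX/storjX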
So 7.83% of the 244G cache is ~19GB actually used, or 3.8GB per TB of data stored (5.0 TB).
However, unlike a ZFS special device that holds only metadata, during normal operation an LVM cache will also accumulate junk data that has nothing to do with our goal of caching inodes. An oversized cache wastes a bit of space, but if you have the space I think that’s better than having important filesystem information constantly evicted between filewalker runs. And as the drive fills up, fragmentation will also increase cache needs. So I would double that to 7.6GB per TB.
TL;DR: I think 4-8GB/TB is a good range for LVM cache size, as lower and upper bounds. For my 14TB disk, a 112GB cache should be comfortable.
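A quick sanity check of that rule of thumb, using the numbers measured above (DISK_TB is a placeholder for your drive size):
awk 'BEGIN { printf "%.1f GB used, %.1f GB per TB stored\n", 244*0.0783, 244*0.0783/5.0 }' # ~19.1 GB used, ~3.8 GB/TB
DISK_TB=14
echo "cache size: $((DISK_TB*4))-$((DISK_TB*8)) GB" # 56-112 GB for a 14TB disk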
One detail to keep in mind is that LVM cache by default supports a maximum of 1 million chunks, so the maximum cache size is tied to the chunksize parameter – you’ll get an error if the cache is too large for a given chunksize. In general, I would guess more chunks is better for improved cache granularity of inode coverage, while the default limit also keeps the CPU/memory/metadata overhead of the hot-spot algorithm in check. So to get the highest (within default) chunk count for a given cache size, a 128KiB chunksize gives you a max ~122G cache, ~244G with 256KiB chunks, and so on. Personally I’ll stick to those upper bounds when partitioning my cache SSD. Anyone more knowledgeable please chime in.
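The relationship is simply chunk count = cache size / chunk size, so a quick check that both suggested sizes stay just under the 1,000,000-chunk default (remember --chunksize values are in KiB):
echo $(( 244 * 1024 * 1024 / 256 )) # 244GiB cache, 256KiB chunks -> 999424 chunks
echo $(( 122 * 1024 * 1024 / 128 )) # 122GiB cache, 128KiB chunks -> 999424 chunks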
FWIW my personal notes to remove & replace an LVM cache:
LVM REPLACE CACHE
# Stop node and unmount partition
lvconvert --uncache poolX/storjX # VG/LV
vgreduce poolX /dev/sdX # Remove ssd PV from VG. Run pvs to find.
pvremove /dev/sdX # Remove ssd PV
# (If not re-adding cache can remount at this point)
pvcreate /dev/sdX # Add new SSD
vgextend poolX /dev/sdX # Add SSD PV to VG
lvcreate --cache -n ssdX --chunksize 256 -L 244G poolX/storjX /dev/sdX # Add the cache with label ssdX. The max with chunksize 256 is ~244GB (for 1,000,000 chunks).
# Remount
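A check I’d run before restarting the node, to confirm the cache attached as expected (poolX/storjX are the placeholder names from above):
lvs -a -o lv_name,size,pool_lv,devices poolX # the cached LV should show ssdX as its pool
lvdisplay poolX/storjX # cache statistics should be reported again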