LVM + EXT4 metadata pool recommended size

How much cache would be necessary per TB stored for metadata in an LVM cache pool?

LVM cache comes in two flavors:

  1. A hot-spot cache, which is a read cache for the blocks most used by the filesystem. This isn’t filesystem specific. The heuristic is simple: the blocks that are read the most are mirrored in the cache.
  2. A write cache, which simply absorbs the writes and passes them on to the HDD later.

Since this solution is filesystem agnostic, it’s an additional layer of complexity, and it solves only part of your problem: file deletions aren’t optimized by the read cache.
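For concreteness, both flavors can be attached with lvconvert (one or the other, not both at once); a minimal sketch, assuming a VG named pool, a main LV storj, and a fast SSD LV named fast:

# Flavor 1: hot-spot cache (dm-cache) – keeps the most-read blocks on the SSD
lvconvert --type cache --cachevol fast pool/storj
# Flavor 2: write cache (dm-writecache) – absorbs writes and flushes them to the HDD later
lvconvert --type writecache --cachevol fast pool/storj
# Check what is attached: segtype shows "cache" or "writecache"
lvs -a -o name,segtype,cache_policy,devices pool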

So again, look for ZFS or bcachefs if you are going to convert to another setup.

Look also here: Best filesystem for storj

4 Likes

Nice thing with LVM cache is that it’s very low risk: you can lose/remove/resize/re-add the cache at any time. Just be sure to follow On tuning ext4 for storage nodes when formatting your main volume initially – especially using a 128-byte inode size and noatime+nodiratime.
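For reference, a minimal sketch of that initial formatting – device name and mountpoint are just placeholders:

mkfs.ext4 -I 128 /dev/pool/storj                        # 128-byte inodes keep metadata compact
mount -o noatime,nodiratime /dev/pool/storj /mnt/storj  # no metadata writes on reads
# or permanently in /etc/fstab:
# /dev/pool/storj  /mnt/storj  ext4  noatime,nodiratime  0  2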

I have a node storing 5.0 TB across 28M files on a 14TB disk. I dropped and recreated a 245GB cache, then ran du until steady-state, clearing the RAM cache between runs with sync && echo 3 > /proc/sys/vm/drop_caches:

time du -h # elapsed: 29 minutes for initial scan
time du -h # elapsed: 13 minutes, 70535 promoted chunks
time du -h # elapsed: 9.4 minutes, 70544 promoted chunks
time du -h # elapsed: 8.3 minutes, 76001 promoted chunks
time du -h # elapsed: 7.7 minutes, 77025 promoted chunks
time du -h # elapsed: 7.4 minutes, 77782 promoted chunks
time du -h # elapsed: 7.3 minutes, 78156 promoted chunks
time du --inodes -h # elapsed: 5.3 minutes (inode count is faster), 78300 promoted chunks
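(For anyone repeating this, something like the following loop does it – mountpoint and LV name are placeholders:)

for run in 1 2 3 4 5 6 7 8; do
    sync && echo 3 > /proc/sys/vm/drop_caches      # drop RAM page/dentry/inode caches
    time du -sh /mnt/storj                         # walk all inodes again
    lvs -o+cache_used_blocks,cache_read_hits,cache_read_misses pool/storj   # watch promotions settle
done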

Alright, enough! Running lvdisplay tells us:

LV Size                12.73 TiB
Cache used blocks      7.83%
Cache metadata blocks  16.40%
Cache dirty blocks     0.00%
Cache read hits/misses 12780848 / 916165
Cache wrt hits/misses  1 / 0
Cache demotions        0
Cache promotions       78300

So 7.83% of 244G is ~19GB cache usage, or 3.8GB per TB.

However, unlike a ZFS special device that holds only metadata, during normal operation the LVM cache will also pick up junk data that has nothing to do with our goal of caching inodes. An oversized cache wastes a bit of space, but if you have the space, that’s better than having important filesystem information constantly evicted between filewalker runs. And as the drive fills up, fragmentation will also increase the cache needs. So I would double that to 7.6GB per TB.

TL;DR: I think 4-8GB per TB is a good range for LVM cache size, as lower and upper bounds. For my 14TB disk, a 112GB cache should be comfortable.

One detail to keep in mind is that LVM cache by default supports a maximum of 1 million chunks, which ties the maximum cache size to the chunksize parameter – you’ll get an error if the cache is too large for a given chunk size. In general, I would guess more chunks is better for improved cache granularity of inode coverage, while the 1-million cap keeps the CPU/memory/metadata overhead of the hot-spot algorithm bounded. So to get the highest chunk count (within the default limit) for a given cache size: a 128KiB chunk size gives you a max of ~122G of cache, ~244G with 256KiB chunks, and so on. Personally I’ll stick to those upper bounds when partitioning my cache SSD. Anyone more knowledgeable, please chime in.
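The arithmetic behind those bounds (default cap of 1,000,000 chunks, chunk size in KiB):

echo $(( 1000000 * 128 / 1024 / 1024 ))   # 128KiB chunks -> ~122 GiB max cache
echo $(( 1000000 * 256 / 1024 / 1024 ))   # 256KiB chunks -> ~244 GiB max cache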

FWIW, my personal notes to remove & replace an LVM cache:

LVM REPLACE CACHE
# Stop node and unmount partition
lvconvert --uncache poolX/storjX	# VG/LV
vgreduce poolX /dev/sdX			# Remove ssd PV from VG. Run pvs to find it.
pvremove /dev/sdX			# Remove ssd PV
# (If not re-adding cache can remount at this point)
pvcreate /dev/sdX			# Add new SSD
vgextend poolX /dev/sdX			# Add SSD PV to VG
lvcreate --cache -n ssdX --chunksize 256 -L 244G poolX/storjX /dev/sdX 		# Add the cache LV named ssdX. The max with a 256KiB chunksize is ~244GB (for 1,000,000 chunks).
# Remount
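# Optional sanity check after remounting (a rough sketch, same names as above)
lvs -a -o name,segtype,size,devices poolX	# ssdX should show up as the cache for storjX
lvdisplay poolX/storjX | grep -i cache		# the cache block/hit counters should be live again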
5 Likes

file deletions aren’t optimized by the read cache.

Citation needed. Because I think a cache certainly helps with reading what’s there in the first place, and an LVM read cache provides that. Only the writes are not helped, but writes are typically batched by the OS.

Yesterday I deleted a 5TB node backup from a cached disk with rm -rf and while I didn’t measure the time, it was not problematic at all – a dozen or two minutes?

it’s an additional layer of complexity

So far I disagree. Despite the praise it gets for optimal efficiency (as the only metadata-focused solution), I steer away from a ZFS special device specifically due to increased complexity: needing an SSD mirror, needing to spend precious slots on that extra SSD, not being able to survive without the cache present, making all nodes dependent on that mirrored SSD (unless you have a dozen cache SSDs), making it difficult to move a given node to another PC, the general RAM demands of ZFS, etc.

2 Likes

No need for a citation, because you’re answering it already: file deletion involves a lot of metadata writing (unlinking the inode, updating the free-space bitmap, sometimes a reshuffle of the indexes/H-trees in ext4). We have some topics around here where people complain about deletions taking up to seconds per file on ext4 – and that happens during filewalking, so the inode info should already be in the cache. See for example How do you solve slow file deletion on ext4? - #33 by JWvdV

Can you do the same again, with the same disk but without the cache? This doesn’t actually prove that much, because in this forum we’re looking at a wide variety of hardware, so what is true for your case might not be true for any other. What I know and can argue is this: ZFS special devices and bcachefs metadata devices are filesystem-specific optimizations that put the metadata itself on an SSD. This speeds up all metadata operations – listing files, reading metadata (like sizes) and thus also file deletions – without warming up a cache or bookkeeping on which part of the disk is used the most (the hot-spot cache you’re likely talking about).
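For reference, a rough sketch of the ZFS variant (pool layout and device names are placeholders; bcachefs has an analogous metadata_target option):

# Metadata lands on a mirrored SSD special vdev; the HDD holds the data
zpool create tank /dev/sda special mirror /dev/nvme0n1 /dev/nvme1n1
zfs set special_small_blocks=16K tank   # optional: small records also go to the SSD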

Furthermore, we’ve got some comparisons – in which LVM cache sadly hasn’t been included – supporting the recommendation of ZFS or bcachefs with metadata on SSD devices:

The last one is the easiest: it’s a common misconception that you need more RAM with ZFS. Actually, because the system isn’t clogged by data, I have more RAM left and a more responsive system after converting to ZFS. My drives aren’t overloaded anymore either (see iostat -x for that), so less wear and a happier wife, since it’s less noisy in my hobby room now.
Furthermore, ZFS uses about the same amount of memory unless you’re using deduplication: https://blogs.oracle.com/solaris/post/does-zfs-really-use-more-ram
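(If RAM is the worry, the ARC can also be capped explicitly; the 2 GiB value below is just an example:)

# runtime cap on the ZFS ARC, in bytes
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max
# make it persistent across reboots
echo "options zfs zfs_arc_max=2147483648" > /etc/modprobe.d/zfs.conf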

I myself even mirror my boot drive. If I had only one drive attached to my system, I would consider not mirroring the special devs. But since I have 50TB attached to one PC with over 10 drives, I don’t want to accept the risk of a single SSD, holding all the metadata, failing. So I mirror the metadata.

And yes, it’s more complicated to migrate a drive to another PC. But essentially you can just move over one special dev, and then the other machine should already be able to import the pool again – an operation that can even be done with a USB stick.
Within one computer you’ve got zfs send/recv working at the speed of sequential I/O, which can also be used to move between pools of different sizes.
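(Roughly like this; pool/dataset names are made up:)

zfs snapshot pool1/node1@move
zfs send pool1/node1@move | zfs recv pool2/node1        # bulk copy at sequential speed
# short downtime: stop the node, then send only what changed since @move
zfs snapshot pool1/node1@move2
zfs send -i @move pool1/node1@move2 | zfs recv pool2/node1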

And yeah, LVM is a different layer of complexity and filesystem agnostic, while with ZFS and bcachefs it’s included in the filesystem itself. But LVM+ext4 might essentially perform about the same as an L2ARC on SSD, which works in much the same way.

But feel free to stick with what you’re most comfortable with, because I essentially think that’s the biggest factor. About three months ago I would probably have taken the same stance on ZFS as you do now, until I gave it a trial and it turned out to work marvellously: unknown makes unloved.

4 Likes

I’m using only L2ARC (metadata only). You don’t need to mirror it and the data is persistent. I’m losing the write benefits of special devs, if I understand correctly, but I’m speeding up the filewalker a lot compared to basic ext4.
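(The knobs involved, for anyone curious – pool/device names assumed:)

zpool add tank cache /dev/nvme0n1        # single L2ARC device, no mirror needed
zfs set secondarycache=metadata tank     # only metadata is eligible for L2ARC
# persistent L2ARC (OpenZFS >= 2.0) survives reboots:
echo 1 > /sys/module/zfs/parameters/l2arc_rebuild_enabled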

3 Likes

I agree it’s anecdotal, but it reflects my observations so far: all my Storj node problems appear solved – startup scan, GC, trash, TTL – all run smoothly and I haven’t seen heavy continuous HDD I/O since migrating. Rsync is much faster, multi-TB deletions were quick enough, and the thing runs well on little RAM. Unless there’s a problem, why look for a fix?

The main potential issue that remains with LVM cache is the extra writes on the SSD. I’m still monitoring this, trying to measure an average over a few weeks of normal operation without me tinkering. Comparing SSD/RAM behavior with persistent L2ARC is probably the next topic I’d look into.
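(A rough way to watch that – device and LV names assumed:)

# total data written to the cache SSD since boot (field 7 of the block stats = sectors written)
awk '{printf "%.1f GiB written\n", $7 * 512 / 1024^3}' /sys/block/sda/stat
# LVM's own counters for the cache LV
lvs -o+cache_write_hits,cache_write_misses,cache_dirty_blocks pool/storj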

But feel free to stick with what you’re most comfortable with, because I essentially think that’s the biggest factor. About three months ago I would probably have taken the same stance on ZFS as you do now: unknown makes unloved.

Agree - that may be where I sit. A month ago I was struggling with my NTFS node on Windows. Learned a lot about Linux filesystems & caching in these forums, thanks for the input.

2 Likes

The metadata blocks to be modified first need to be identified by reading other metadata blocks. Then, the metadata blocks to be modified need to be read to know what to modify. Plus, the same metadata blocks are likely to be modified when multiple files from the same directory need to be removed. Cache helps with all three, though granted a write-through cache like LVMcache only with the first two.

2 Likes

Sure, but in the case of filewalking this must already be cached to a great extent by the moment the node decides to delete a certain file. So the writing is the biggest part of the deletion problems we see.

Yep, that’s what I’m talking about. These metadata blocks will already be cached by LVMcache, which is where the speed-up comes from.

I mean, they are also in the slab inode/dentry cache, i.e. RAM. So we’re talking about different caches here.
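(Visible with slabtop, or a quick sketch like this – needs root:)

# RAM currently held by the dentry and ext4 inode slab caches
grep -E '^(dentry|ext4_inode_cache) ' /proc/slabinfo | awk '{printf "%s: %.0f MiB\n", $1, $3*$4/1048576}'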

Oh, well, if you have enough RAM for that, then you don’t need an SSD cache. I assumed the scenario where we implement LVMcache exactly because there’s not enough RAM to keep the file metadata.

1 Like

It isn’t about having much RAM; the last-read metadata will always be in cache. It takes some time before it goes stale and gets evicted.

With file walkers last-read metadata is not worth much, because each directory and piece is visited only once.

Within the expiration process (which is not a file walker), pieces are expired in the order returned by the database, so for all we know in random order, and last-read metadata is unlikely to help. Sorting by piece ID would help, but then you’d need resources to do the sorting, and with the recent traffic patterns I wouldn’t be surprised if just the list of pieces to expire crosses into gigabytes.

There is a nice trade-off solution for the expiration process, but it would come at the cost of some code complexity: generating long runs of pieces in order using heapsort with bounded memory usage.

1 Like

How do you mean? That’s what the VFS cache is for. During the filewalk, immediately after reading the file’s metadata (its name), the node decides whether or not to delete it. I can’t imagine that data has already been evicted from the cache by then.

Piece expiration is indeed another cup of tea, in which even ordering probably won’t help (although ordering on piece ID is quite easily possible, probably even cheap with an indexed column). But the piece ID just isn’t related to the last-read cache. At most you could say that pieces from the same satellite with the same first two letters share the same path, so the directory is still in the dentry cache if there are multiple pieces from one directory.

For that you need a minuscule amount of RAM cache, so it’s not a use case for any persistent cache. You would likely never be so starved of RAM as to not cache file metadata for the few seconds in which the node makes this decision.

Reading the metadata from the HDD in the first place is the problem that makes file walkers slow for node operators. That is the problem persistent caches like LVMcache are used to solve.

I’ve checked expiration dates on one of my nodes. With the recent test traffic, one hour of piece expirations would remove somewhere between 1k and 20k pieces. I would expect sorting to speed up the 20k-piece scenario by 10% in the best case. That would be my estimate of the amount of I/O saved by not re-reading the directory metadata. Yet if it turned out that there are enough pieces expiring that you can’t process them in an hour, or your node comes back from a failure and suddenly needs to remove hundreds of thousands, maybe millions of pieces… then the sorting itself might become a bottleneck.

Still, it should be safe to try this as an experiment on my setup. It’s not a big change, just a single line:

diff --git a/storagenode/storagenodedb/pieceexpiration.go b/storagenode/storagenodedb/pieceexpiration.go
--- a/storagenode/storagenodedb/pieceexpiration.go
+++ b/storagenode/storagenodedb/pieceexpiration.go
@@ -45,6 +45,7 @@ func (db *pieceExpirationDB) GetExpired(ctx context.Context, now time.Time, cb f
                SELECT satellite_id, piece_id
                        FROM piece_expirations
                        WHERE piece_expiration < ?
+                       ORDER BY satellite_id, piece_id
        `, now.UTC())
        db.mu.Unlock()

The only thing is that proving it helps is a challenge: the traffic keeps changing, so it’s difficult to get reliable measurements. And my nodes are not memory-starved enough for this to have the full effect.
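At least whether the added ORDER BY triggers a temporary sort can be checked directly against the database (filename assumed to be the node’s piece_expiration.db):

sqlite3 piece_expiration.db "EXPLAIN QUERY PLAN
  SELECT satellite_id, piece_id FROM piece_expirations
  WHERE piece_expiration < '2024-01-01'
  ORDER BY satellite_id, piece_id;"
# 'USE TEMP B-TREE FOR ORDER BY' in the output means the sort happens at query time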

Not if, between collecting the first and second piece from, say, the 8g directory, the node thrashes the cache by filling it up with metadata of hundreds of pieces from other directories. Depends on the amount of RAM and the traffic.

As it’s only the dentry cache that needs to be preserved, and dentries are only ~60 bytes per piece, I wouldn’t expect thrashing to happen unless the node is very memory-starved.

1 Like

So that was exactly the point I made. Therefore, a read cache is unlikely to speed up deletion in the context of filewalking; it only speeds up the filewalking itself. But we have some topics in which people complain about deleting a file taking seconds.

That might very well be. I actually would also add a UNIQUE index or PRIMARY KEY on satellite_id and piece_id. It used to exist in the past, see version 15 in storj/storagenode/storagenodedb/database.go at 6f4de2f66e9e060dd2bb70e3b8108be0efa6f534 · storj/storj · GitHub, but it was recently dropped in version 60.
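(Something like this, hand-rolled against the same database – the index name is just a suggestion, and back up the db first:)

sqlite3 piece_expiration.db \
  "CREATE INDEX IF NOT EXISTS piece_expirations_satellite_piece_idx
     ON piece_expirations (satellite_id, piece_id);"
# making it UNIQUE, as suggested, only works if no duplicate rows crept in after the constraint was dropped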

But in the very end, I’m still very curious to see benchmarks of LVM cache vs. a special vdev in ZFS / bcachefs with metadata on SSD.

You may perform this test with the benchmark tool:

Also, I recently found another benchmark:

exactly for the filewalker

I looked through my logs for some large trash-cleanup “numKeysDeleted” numbers. Found this one with 1 million files & 200GB. The node is approx 5-6TB, lvmcache+ext4.

It took 1 minute 13 seconds.

2024-08-01T21:55:15Z	INFO	pieces:trash	emptying trash started	{"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-08-01T21:55:15Z	INFO	lazyfilewalker.trash-cleanup-filewalker	starting subprocess	{"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-08-01T21:55:15Z	INFO	lazyfilewalker.trash-cleanup-filewalker	subprocess started	{"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-08-01T21:55:15Z	INFO	lazyfilewalker.trash-cleanup-filewalker.subprocess	trash-filewalker started	{"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Process": "storagenode", "dateBefore": "2024-07-25T21:55:15Z"}
2024-08-01T21:55:15Z	INFO	lazyfilewalker.trash-cleanup-filewalker.subprocess	Database started	{"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Process": "storagenode"}
2024-08-01T21:56:28Z	INFO	lazyfilewalker.trash-cleanup-filewalker.subprocess	trash-filewalker completed	{"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "numKeysDeleted": 1072168, "Process": "storagenode", "bytesDeleted": 199530982164}
2024-08-01T21:56:28Z	INFO	lazyfilewalker.trash-cleanup-filewalker	subprocess finished successfully	{"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-08-01T21:56:29Z	INFO	pieces:trash	emptying trash finished	{"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "elapsed": "1m14.670587994s"}

Oh, I see there’s the “elapsed” time key. So I grabbed all of them into a chart: a few weeks of logs over a few nodes. Some outliers up to 20 minutes, but typically under a minute (edit: some were possibly from before I enabled caching / while migrating – sorry for the quick and dirty analysis).

[chart: trash-cleanup “elapsed” times, a few weeks of logs across nodes]
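(Quick and dirty extraction, log path assumed:)

grep 'emptying trash finished' /var/log/storagenode.log | grep -o '"elapsed": "[^"]*"'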

No, a persistent read cache that manages to cache all file system metadata would have an effect roughly equivalent to sorting. Which is not much, but not zero.

This would make both inserts and deletes slower, while not actually ensuring the right order.