ZFS performance, ARC vs L2ARC vs special vdev

I decommissioned my old TrueNAS system
and wanted to compare different ZFS pool setups:

  • pool1: ARC only
  • pool2: persistent L2ARC (metadata only)
  • pool3: special vdev (metadata only)

Disclaimer: this is far, far away from scientific testing!
It is only good for giving you a very, very rough estimate.

I tested how long the lazy filewalker took to run.

The storage node host has 4GB RAM, runs on a fast NVMe SSD, and keeps its DBs locally.
The STORJ data is an NFS share on a TrueNAS SCALE system with 64GB RAM (32GB ARC).
Pool1 was a single Seagate IronWolf 8TB drive.
Pool2 was a Toshiba N300 HDD plus a Samsung PM871 256GB SSD.
Pool3: I forgot to note which HDD I used (I think it was the Seagate IronWolf 8TB); the SSD was a Samsung PM871 256GB.
Other ZFS settings: recordsize 16MiB, sync disabled, atime off, lz4 compression.
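
For reference, those properties map to normal dataset settings like these (the pool/dataset name is just a placeholder; on older OpenZFS releases a recordsize above 1M also needs the zfs_max_recordsize module parameter raised):

zfs set recordsize=16M tank/storj
zfs set sync=disabled tank/storj
zfs set atime=off tank/storj
zfs set compression=lz4 tank/storj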

Here is how long it took to filewalk a 5TB node.

pool           first run   second run
ARC            459 min     417 min
L2ARC          88 min      85 min
special vdev   79 min      78 min

My personal conclusions:

  • With ARC alone, filewalker takes ages.
  • ZFS needs some kind of cache for the STORJ workload.
  • L2ARC and a special vdev both work great. Unlike a special vdev, L2ARC does not need to be mirrored (losing it does not endanger the pool).

My open questions:

  • Why does ZFS report absurdly high compression numbers for all pools that can’t be true? I have discussed this in other posts before: no matter which pool layout, zfs get compressratio shows 5.39x, which simply can’t be true. It should be 1.
  • How would a very big L2ARC behave during boot? CORE still does not make L2ARC persistent across reboots by default because it used to drag out the boot process; on SCALE it is enabled by default.
  • Is the L2ARC solution scalable? Currently it uses 15GB of SSD for a 5.5TB node plus some amount of RAM. I am not sure how to find out how much RAM L2ARC uses (see the commands right after this list). More L2ARC would take up more RAM.
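
A rough way to check the RAM cost of the L2ARC headers on Linux/OpenZFS (treat this as a sketch):

grep l2_hdr_size /proc/spl/kstat/zfs/arcstats   # bytes of ARC memory held by L2ARC headers
arc_summary | grep -i l2                        # or the summarized L2ARC view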

That… probably invalidates the whole test, as NFS introduces fixed latency and SCALE had (still has?) some known performance regressions from Core.

The L2ARC fill rate is limited to a few megabytes per second, so in 85 minutes it would cache at most somewhere under 100GB of data. It also depends on how you configured the primary and secondary caches: all data, or metadata only.
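
For reference, these are per-dataset properties (the dataset name is a placeholder):

zfs get primarycache,secondarycache tank/storj
zfs set secondarycache=metadata tank/storj   # only metadata goes to L2ARC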

Furthermore, any type of walker is likely to mostly bypass caching.

This is highly suspect. How did you test this? Did you add the special device, configure the small-file threshold, then zfs send | zfs receive, and only then run the test?
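
For anyone unfamiliar, that procedure looks roughly like this (a sketch; pool, dataset, and device names are placeholders):

zpool add tank special mirror /dev/sdX /dev/sdY   # the special vdev should be redundant
zfs set special_small_blocks=0 tank/storj         # 0 = metadata only; e.g. 64K would also move small blocks there
# existing data only benefits after being rewritten, hence send/receive:
zfs snapshot tank/storj@migrate
zfs send tank/storj@migrate | zfs receive tank/storj_new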

It can, and it is, but not due to data compression in the conventional sense; rather, it comes from partial-sector space savings.
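
One way to see where the ratio comes from is to compare logical and allocated sizes directly (the dataset name is a placeholder):

zfs get compressratio,logicalused,used,recordsize tank/storj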

It does not. At least not since TrueNAS 12. I had a 2TB L2ARC with persistence enabled and saw zero impact on boot time.

See, 15GB is nothing. It’s either because your workload mostly bypasses the cache, or because of the limit on the cache fill rate, or both.
I recommend reading this: ZFS L2ARC
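
The fill-rate limit mentioned above is a module parameter; on Linux it can be inspected (and cautiously raised) like this (the 64 MiB value is only an example):

cat /sys/module/zfs/parameters/l2arc_write_max     # bytes written to L2ARC per feed interval, default 8 MiB
cat /sys/module/zfs/parameters/l2arc_write_boost   # extra headroom while the ARC is still cold
echo 67108864 > /sys/module/zfs/parameters/l2arc_write_max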

The big difference is that the former accelerates only the second and subsequent accesses, and only if the first one was not sequential (i.e. the “pre-warm the cache with a filewalker” use case is defeated), while a special device accelerates all accesses, including the first and sequential ones.

I would say that storing and quickly accessing a large number of small files, a task not specific to storj, benefits from fast access to metadata, and the best way to accomplish this is to place the metadata on SSD; once this is done it’s a no-brainer to also store small files there. Both of these are accomplished with a special device.


First point: according to LawrenceSystems, these performance regressions are mostly gone.
Second point: this is exactly why I think this test gives some insight.

I am a big believer in “only use unused resources”.

I personally would not run STORJ on top of TrueNAS SCALE, for multiple reasons.
For one, I want STORJ to run on a different VLAN, behind a rate limiter, and with the DBs on an SSD, which I don’t have on my bigtank machines; the list goes on.

So for me personally, running a STORJ Docker container that accesses a dataset via NFS on a data tank is more realistic than running STORJ on TrueNAS SCALE itself.

And even if NFS did add some latency, the difference between the results is still massive.

Yes.
It did even less: “only” 15GB.
The dataset was set to metadata only (secondarycache=metadata).

Why?

That is exactly how I did it :grinning:

But why? I would get it if there were some RAIDZ geometry involved, but this is a single drive.
There is no pool geometry or padding or anything like that.
I highly doubt that without LZ4 the dataset would be 25TB, which is what a 5x compression ratio on 5TB would imply, right?

I am not worried about the 15GB on the SSD :slight_smile:
I am worried about how much ARC it consumes.
Some people used to recommend keeping L2ARC no larger than about 8x the amount of RAM because of this.

Not sure I would agree 100% here.
I would put it this way: both methods can hold all the metadata.
And that is a benefit for an HDD.

True, but that also holds for L2ARC.
Again, I would not bet on L2ARC scaling reasonably for STORJ, but it also performed perfectly fine.

There seems to be a small misunderstanding on your part:
L2ARC did not speed up the process by caching STORJ blobs.
It cached all the metadata.

My theory is that, because L2ARC is basically just evicted ARC, the L2ARC drive was filled during the replication task: metadata was evicted from ARC either by the reads from pool1 or by the writes to pool2. That is why the first run was not worse than the second run; the L2ARC was already hot.


I am building a new server right now and wanted to test TrueNAS SCALE on it. It’s my first time working with ZFS; I have been researching and reading a lot about it (in this forum too), but I don’t have any hands-on experience so far.
Anyway, I wanted to share some tests I did that might help others in the future.

When I installed TrueNAS SCALE, I found the easiest way to run multiple storagenodes would be to create a Debian VM and pass zvols from my storage pool into it, because I cannot run docker commands directly on TrueNAS SCALE and I didn’t want to create all my containers via the Kubernetes GUI that TrueNAS SCALE provides.

I found many discussions in this forum about the best recordsize for ZFS datasets backing a storagenode, but not much about the best block size when using zvols. So I created one zvol for each possible volblocksize, passed them to my Debian VM, formatted the virtual disks as ext4 with this config (mke2fs -t ext4 -m 0 -i 65536 -I 128 -J size=128 -O sparse_super2), and copied a small node from my Unraid server over to each of the virtual disks:

root@debian:/mnt# find . -maxdepth 1 -type d -exec sh -c 'echo -n "{}: "; find "{}" -type f | wc -l' \;
.: 1237938
./storj_8: 206323
./storj_16: 206323
./storj_32: 206323
./storj_4: 206323
./storj_64: 206323
./storj_128: 206323
root@debian:/mnt# du -sh *
389G	storj_128
389G	storj_16
389G	storj_32
389G	storj_4
389G	storj_64
389G	storj_8
root@debian:/mnt# df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/vdb         25T  389G   25T   2% /mnt/storj_16
/dev/vdc        4.0T  389G  3.7T  10% /mnt/storj_4
/dev/vdd        8.0T  389G  7.7T   5% /mnt/storj_8
/dev/vde         20T  389G   20T   2% /mnt/storj_32
/dev/vdf        6.4T  389G  6.1T   6% /mnt/storj_64
/dev/vdg         13T  389G   13T   3% /mnt/storj_128
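
For anyone who wants to reproduce this, zvols with different block sizes can be created roughly like this (the pool name, zvol size and the sparse flag are placeholders, not necessarily what I used):

for bs in 4K 8K 16K 32K 64K 128K; do
    zfs create -s -V 8T -o volblocksize=$bs tank/storj_${bs%K}
done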

So the node has 206323 files and is 389GB in size, which means the average file is about 1.886MB. Debian shows that all of the volumes hold identical data. I then checked the TrueNAS dashboard to see how efficiently each block size stores my node’s data (the storj zvol has a block size of 16KiB):

So as expected, there is a huge difference in terms of storage efficiency with the different block sizes.

Questions:

  1. I think I will go with a 64K block size (volblocksize) as a trade-off between storage efficiency and performance/overhead. Are you aware of any disadvantages?
  2. If you are also using TrueNAS SCALE and don’t want to access your pool via a network protocol (SMB, NFS, iSCSI, …), how are you running your node? Also by passing zvols into a VM? Or directly on the host somehow?
  3. I am currently testing on a 4x18TB RAIDZ2 pool. Once all the remaining drives arrive, I will create a bigger pool. Will that have any impact on storage efficiency at the different block sizes?

Why would you wanna run multiple storagenodes on the same host?

Why would that be easier than creating a STORJ App?

Unlike datasets, zvols have a fixed volblocksize. You will suffer in storage efficiency, performance, and fragmentation.

Because zvols are good for block storage like VM disks, not for file workloads like STORJ or a network share.

Yup. What pool config are you using?

Bad storage efficiency, fragmentation, and I/O amplification for anything smaller than 64K.

On TrueNAS as an app. Directly on the host.

Yup. See the table here:
here
The best you can get with RAIDZ2 at 64K is either 6 drives (66.66%) or 18 drives (88.88%).
And 18 drives is dangerously wide.
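
For a rough worked example (assuming ashift=12, i.e. 4K sectors): a 64K volblock is 16 data sectors. On a 6-wide RAIDZ2 that becomes four stripes of 4 data + 2 parity sectors, i.e. 24 sectors on disk and 16/24 ≈ 66.7% efficiency; on an 18-wide RAIDZ2 it is one stripe of 16 data + 2 parity sectors, i.e. 16/18 ≈ 88.9%.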

So to sum it all up:
Don’t do it like that.


If your pool has raidz vdevs and ashift=12 (4K-sector drives), then you need a larger zvol block size. There is a formula, but 64K works pretty much always.
If the block size is too small, your zvol will take up double the space in your pool.

I don’t use Truenas, but I run my node inside a VM, passing the zvol as a virtual disk.


Thank you for your responses

Because I already have multiple nodes running on single disks inside an external USB enclosure (and their performance is poor because the 8-disk USB-connected enclosure somehow limits IOPS heavily). I want to migrate them to the new TrueNAS pool.

For me it would be easier because I know Docker and have worked with it already, but not with k8s or the TrueNAS app system.

Ah okay, thank you for this hint.

Right now, for testing things out, it’s a RAIDZ2 with 4x18TB disks plus a 2TB NVMe L2ARC for metadata. I plan to add more drives for the production pool.

Can you please give me some more details on how you set this up? Are you using the “official” TrueNAS STORJ app, or did you create your own app config for k8s? And I guess you are only running one big node then?

Thanks for this information.
Did you have any issues with this setup so far?

My pool is made up of three 6-drive RAIDZ2 vdevs (4TB, 6TB, 8TB; I know mixed capacities are not optimal, but I slowly expanded the pool and different drives were the cheapest per TB at different times). The drives are 7200RPM. My node currently has 27.7TB of data.

Recently I had performance problems during filewalker operation. There is a thread about it.

But the TLDR version is that I had two problems:

  1. The filewalker runs inside the VM with lower priority, so it is supposed to “give way” to normal node operations. However, to the host, all IO operations from the VM look the same, so the VM uses all available IO, possibly creating problems for other VMs on the host.
  2. I had passed the zvol with the discard=unmap parameter and mounted the filesystem with the discard option, so that when files are deleted from my node the disk space is freed on the host as well (see the sketch below). This seemed to work OK for a while, until the node started doing a lot of deletes (100GB or so) at once. The discard operations saturated the IO, resulting in problems inside the VM and on the host. Combine that with the filewalker and it’s fun times.
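
For reference, the in-VM side of that setup looks roughly like this (device and mount point are placeholders; discard=unmap itself is configured on the host when attaching the zvol):

lsblk --discard /dev/vdb               # non-zero DISC-GRAN/DISC-MAX means discards reach the host
mount -o discard /dev/vdb /mnt/storj   # continuous discard, the option I later removed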

What I did to solve the problems (or so it looks):

  1. Write a couple of scripts to 1) put the filewalker process in a cgroup and limit its speed and 2) adjust the limit so that the IO load is around 60%
  2. Remove discard from the mount options and instead run a script once in a while that runs fstrim over small ranges (10GB) with the minimum extent length set to 1MB (there is no point in discarding a single 4K block), though I may change that to 64K. If the particular range had some free space that needed trimming, it waits 5x the time fstrim took before moving to the next range (roughly as sketched below).
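
Something like this simplified sketch (mount point and sizes are placeholders; the real script also checks fstrim’s output before deciding whether to wait):

#!/bin/sh
# Trim a filesystem in 10GiB ranges with a 1MiB minimum extent,
# pausing between ranges to keep the IO load down.
FS=/mnt/storj
STEP=$((10 * 1024 * 1024 * 1024))                          # 10 GiB per pass
SIZE=$(df -B1 --output=size "$FS" | tail -n1 | tr -d ' ')
offset=0
while [ "$offset" -lt "$SIZE" ]; do
    start=$(date +%s)
    fstrim --offset "$offset" --length "$STEP" --minimum 1M "$FS"
    took=$(( $(date +%s) - start ))
    sleep $(( took * 5 + 1 ))                              # wait ~5x the time fstrim took
    offset=$(( offset + STEP ))
done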

Ahh that makes total sense. I guess it should be easy to just run multiple instances of the TrueNAS Scale STORJ app?

At first this might seem easier, but as you can see from @Pentium100’s real-life scenario, there will be problems.

I personally would start with the right number of disks. I have not tested pool expansion; it finally seems to be supported, but it still involves resilvering, which can be avoided.

The link I posted to the formula was broken; it is fixed now. No, it does not work pretty much always. It will pretty much always give you worse results than expected, especially for zvols on RAIDZ2.

Sorry, I am currently not running any node at all; I just did some testing. But because I still have not found out what the problem with the compression ratio is, and the payout and ingress are pretty low, I currently can’t be bothered to rent out unused space.

Ah okay, thanks a lot for all your feedback :slight_smile:

In the meantime I think I have figured out how to run Docker images directly on TrueNAS. It seems I have to set them up using the GUI, and I can’t easily clone one and use it as a template… so I will have to configure everything I need manually via the GUI :pensive:
But at least I am able to run them. During my initial research I read that it’s no longer possible to use Docker directly on TrueNAS SCALE at all, but it turns out it’s just not possible to use the Docker CLI.

That makes a lot of sense and I will do that. Right now I am not starting anything for “production”; I am just playing around a bit to get used to TrueNAS and ZFS and to find the best settings while I wait for the rest of my ordered disks… :slight_smile:

I think I will then use ZFS datasets and create Custom Apps directly in the TrueNAS GUI for all the Docker containers I want to migrate to the new server.


Reading the link and the tables, it looks like 64K is pretty good. Not optimal, especially in some cases (like a 9-drive RAIDZ2), but usually close, and better than the default 16K (or 8K in older ZFS versions).

I remember testing this, and IIRC smaller block sizes (say, on a mirror vdev) made it slower to delete the zvol or to delete data from it (probably due to more metadata operations), but I am not completely sure; this was some years ago.
