Best recordsize for ZFS

by the way, this literally looks like the mediasonic 4bay probox:
https://www.amazon.com/Mediasonic-ProBox-HF2-SU3S2-SATA-Enclosure/dp/B003X26VV4

I would venture a guess that if they’re basically rebranded versions of one another, it’ll have the same issues with random disconnects caused by drives being seated/unseated, among other things. Can’t say I’d go with this vs. gutting a 4-bay chassis and finding a way to wire a SAS-to-SATA fan-out cable into it.

huh weird… must be the same then.

I’ll tell you a little about my experience.

I have an unusual situation on my server - relatively a lot of memory but little hard drive space. I decided to experiment with ZFS for my nodes.

At first, by mistake, I set recordsize to 1 KB (I forgot to add the letter K in the recordsize=1024 parameter). This reduced the average IOPS load on my 4-disk RAIDZ1 to 25%; I also use sync=disabled for faster writes.

Suddenly I saw that the 50 GB node had started to occupy about 200 GB on disk, and within a week the storage was almost full. I found the cause (the incorrect recordsize setting), changed the parameter to 1024K, copied the data to a new dataset so that the new recordsize was applied, and reconnected it to Docker.
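
In commands, that fix was roughly the following (dataset names are illustrative, and any copy tool will do - the point is that recordsize only applies to newly written blocks):
zfs create -o recordsize=1024K pool/node-new   # new dataset with the corrected recordsize
rsync -a /pool/node-old/ /pool/node-new/       # rewrite the data so the new recordsize actually takes effect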

With this setting, the on-disk size of the node dropped to the level shown in the dashboard. The overhead became practically zero, but another problem appeared - disk utilization climbed to 95% on reads. I had run into read amplification.

I then ran tests with recordsize=256K, but in this case the on-disk size of the node on ZFS is roughly comparable to the same node on ext4 with default settings, and slightly less than on XFS, though not significantly.

Now I’m looking for a way to balance overhead and read amplification. I have a small empty mirror on NVMe drives onto which I want to move the databases, but I would like to roughly understand what percentage of the average read operations is database access. I generally think the node should read all the Storj databases from its cache in memory, without touching the disk…

the read io i usually get on zfs storj pools is about 1/8 of the writes, and when we consider that writes are roughly twice as hard a workload for HDDs, reads end up being about 1/16th of the full workload when running steady state with plenty of RAM for the ZFS ARC.

i’ve tried many different recordsizes over the years, but eventually settled on 64K due to its improved performance when dealing with fragmentation and slightly better cache/RAM utilization compared to the larger recordsizes… i ran 256K for about 2 years and 512K for 6 months before that.

larger recordsizes do improve migration speeds, however i usually use
zfs send | zfs recv
for that, which makes the transfer sequential and helps a lot with the time a migration takes,
and in that case the recordsize becomes basically a moot point.
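
a minimal sketch of that kind of migration (pool and dataset names are just examples):
zfs snapshot oldpool/storagenode@migrate
zfs send oldpool/storagenode@migrate | zfs recv newpool/storagenode   # sequential read on the source, sequential write on the target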

one also has to keep in mind that ZFS has dynamic record sizes; the recordsize property is really just the maximum block size,
while the ashift (the pool's physical sector size) defines the minimum.
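
both are easy to check on an existing pool (poolname/dataset are placeholders):
zfs get recordsize poolname/dataset   # the maximum block size for that dataset
zdb -C poolname | grep ashift         # the ashift each vdev was created with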

one doesn't run ZFS for the performance, one runs it for the reliability…
however some workloads run great on ZFS while others need special gear and devices to improve zfs performance… L2ARC helps a little but not a lot, and in some cases it just slows the system down, since it requires RAM to keep track of the L2ARC.

the special metadata device is amazing for small files and metadata on ZFS, but in most cases it will also tear through SSD life at a face-melting speed.

the best thing you can do for ZFS is have a lot of RAM; then it can usually handle just about any read task you can throw at it.
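
on Linux you can watch the ARC and, if needed, cap its size via a module parameter; the 16 GiB value here is just an example:
arcstat 1                             # live ARC hit/miss statistics (or: cat /proc/spl/kstat/zfs/arcstats)
# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=17179869184   # cap the ARC at 16 GiB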

on a side note… have you remembered to configure your ZFS pool optimally for high io,
with things such as
zfs set atime=off poolname            # no access-time update on every read
zfs set xattr=off poolname            # disable extended attributes entirely
zfs set logbias=throughput poolname   # bias the ZIL for throughput over latency

on top of those, as the bare minimum i would also recommend
running at least ZSTD-2, if not ZSTD-3, compression; go with ZSTD-2 if your cpu is questionable.
you may be able to go higher, but for storj i don’t think the CPU cost is worth it, as it only compresses the databases and such… but there is still an improvement from that.
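
in command form (poolname is a placeholder):
zfs set compression=zstd-2 poolname   # or zstd-3 if the CPU can spare it
zfs get compressratio poolname        # check what the compression actually buys you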

even though one would think that ZLE compression would make the most sense, more compression means better cache, RAM and storage read/write performance.
before ZSTD one would use LZ4 compression… but now that ZSTD exists, LZ4 is the inferior choice.
at worst they are about even, and in some cases ZSTD gets double the compression for the same cpu time.

but do keep in mind that ZSTD-9 will most likely choke your CPU no matter how powerful it is…
so keep it to 2 or 3. storj data is encrypted, so it won’t compress;
only the db’s and such will benefit, but it does help…

default ZFS recordsize is 128K, which is the default for a reason.

i’m sure i’m forgetting some stuff… but maybe i’ll remember later :smiley:
hope this helps.

I tried different levels of zstd between 1 and 13 and didn’t see any difference in the stored volume. What is the point of increasing the compression level for encrypted data?

for the encrypted data it’s pointless, but other stuff like metadata and databases does compress, and it has a significant performance benefit even for storj.

because stuff like disk reads and writes will take up less io, and thus gives more disk bandwidth for those tasks.

i did run ZLE for a long time because i figured that was the best solution for storj,
due to the fact that compression of encrypted data is basically impossible.

compression is about finding patterns, while encryption is about destroying them, so their fundamental concepts are at direct odds with each other.

ZFS will try to compress all written data, but if it doesn’t compress then it will give up and just write it without compression.

Hello,

I’m a bit conflicted on block size.
On the one hand, storj mostly stores files of around 2.3 MB (I can confirm your observations on my 8 TB node), and that would make a large record size preferable, especially considering compression (I currently have a ratio of 1.2x. Not the best, but it’ll help squeeze one more TB out).

However on a somewhat typical workload, I get the following output:

> zpool iostat -yr 5
storj         sync_read    sync_write    async_read    async_write      scrub         trim    
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0
4K             65      0      0      0      0      0      4      0      0      0      0      0
8K              4      0      0      0      0      0      1      1      0      0      0      0
16K             0      0      0      0      0      0      0      1      0      0      0      0
32K             0      0      0      0      0      0      0      2      0      0      0      0
64K             0      0      0      0      0      0      2      1      0      0      0      0
128K            0      0      0      0      0      0      0      5      0      0      0      0
256K            0      0      0      0      0      0      0      1      0      0      0      0
512K            0      0      0      0      0      0      0      1      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0
----------------------------------------------------------------------------------------------
> sudo iostat -dx -k 3
Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
sde             69.67    400.00     0.00   0.00   21.03     5.74    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.47  97.33

The problem: my drive is fully utilized. We can see that a high record size is appropriate for writing (as expected), but the read requests don’t add up. To accommodate those well, I’d need a record size of roughly 16K. My drive is currently at 100% load, mostly from reads (writes are well cached by ARC and flushed to disk only every 30 seconds - those are non-problematic). To me it seems like smaller record sizes are the way to go (in contrast to the solution given in #8). Am I wrong? Has anyone else made similar observations?

FWIW, I assigned a 100 GB SSD cache to the pool. I will report back as soon as it fills up and see if it lifts the load on my poor hard drive a bit.
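
Assuming that means an L2ARC (cache) vdev, adding one is a one-liner; the device name here is a placeholder:
zpool add storj cache nvme0n1p2   # attach the 100 GB SSD partition as L2ARC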

Just to be very sure: block size or record size?
Because the record size is adjusted to the file size.

I highly doubt it, STORJ data is not really compressible

Depends on the workload and sometimes pool layout. What pool layout do you have? A single disk?

What kind of cache? There are many caches in ZFS. Most don’t work the way users think they do. That is also why most users end up with a drive that is not suitable for the workload.

Good catch, I mean record size here.

I have a single disk dedicated to storj. I’m just using ZFS for consistency with the other file systems on my server, for the compression and the cache, but I’m not set on using ZFS if there are better alternatives.

I’m using a 100 GB partition on a 512 GB NVMe drive, and I’m not observing performance issues related to the cache.

For the storagenode use case, instead of a cache (I assume L2ARC) it would be much more beneficial to add that SSD as a special device to the pool that hosts the blobs, and configure it for small files under 4K as well. Definitely if you don’t have the RAM to fit all the metadata; but even if you do, it is still going to be progressively better overall as the node grows.
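
A sketch of that setup, with placeholder device names (mirrored, since losing the special vdev means losing the pool):
zpool add poolname special mirror ssd1 ssd2   # dedicated vdev for metadata and small blocks
zfs set special_small_blocks=4K poolname      # also place blocks up to 4K on the special vdev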

Once your node grows you’ll be approaching the IOPS limit of HDDs (200-ish IOPS), and a cache won’t help enough, as reuse is somewhat sparse (there are very few files that are reused a lot; most are read only once or twice).

It’s not recommended to use partial drives, btw. Give ZFS the full disk.

Then it does not matter, because it is variable. Leave it at 128k.

L2ARC is the only cache in TrueNAS that can handle “just a random SSD without PLP or redundancy”, so I hope you are using L2ARC. The problem with L2ARC is that it uses ARC. For the very specific use case of the STORJ filewalker it could be helpful. I have not tested it myself, only what others have said here.

+1
I would also use it as a special vdev (in a normal use case as a mirror, because you lose everything if one special vdev drive fails)

After 1 year with ZFS, I find that
zfs set primarycache=metadata xattr=sa atime=off recordsize=1024k sync=disabled poolname
is best.
Using an L2ARC of at least 128 GB is enough to serve 40-50% of read requests from SSD for a 4 TB node.

I’m worried that the impact of the SSD as a normal device would be negligible, as it occupies so little space (100 GB SSD vs. (soon) 13 TB). By using the SSD as a (persistent) cache, there is at least a chance of reuse (albeit this will not be the case for the majority of the data).

You are right about the IOPS - under full load, I observe 100-200 IOPS on the drive. Luckily, there is already some benefit from the cache (which is good), but the read block size is still around 4K (which is suboptimal).

For the use case in storj (huge writes but small reads), that sounds like very reasonable advice!

True - note however that the RAM impact decreases with increasing record size. Due to the huge record size of 1M currently on my dataset, I only use 146.2 MiB of RAM. I guess this figure will increase to ~1 GiB with a record size of 128K. I have some RAM to spare, but that definitely has to be kept in mind.

Thanks!

Cool, thanks for sharing! Why did you go for this record size in the first place? What were/are your concerns with different record sizes? And how is your read and write performance (and write amplification)? Do you also observe the pattern of huge writes and small reads?
40-50% is an amazing number. Looks like I’d have to use way more than 250 GB of cache to get comparable results.
I also noticed that your cache only holds metadata. Does the cache help a lot with drive utilization?

It’s somewhat counterintuitive but very logical. The issue here is not space but random IO. Every read from an HDD involves seek latency (huge) and transfer time (depends on object size). For small objects, seek greatly exceeds transfer time. Moreover, every file access involves a metadata read - another small-object fetch.

If you have a cache - yes, subsequent reads will be fetched from the cache, provided you have enough RAM to fully cache the metadata, but the first read must come from disk. Furthermore, looking at my node, the number of chunks that end up being read repeatedly is very small (but those that do get repeatedly read are read many times).

With a special device you send all metadata and small files (say, under 4K) to the SSD and leave only large files on the HDD. That way small files don’t take the latency hit, metadata does not contribute extra IO to the HDD, and the disks are left doing what they do best - storing large files, where seek time is much smaller compared to transfer time and therefore less impactful.

The rule of thumb is that a special device should be about 0.3% of the pool size. For a 10 TB pool that’s 30 GB. If we also want to store, say, 1M small files under 4K, that’s another sub-4 GB of space. A 50 GB SSD completely covers the needs of a 10 TB pool and completely eliminates the random IO associated with metadata lookups and small files. This also absolves you from needing to cache that, so you have a larger cache left for actual data (although it may not even be needed at this point).

Here is a thread with stats from my node on chunks re-use and other performance related tidbits: Notes on storage node performance optimization on ZFS

Record size, indeed, leave at the default. ZFS with compression enabled will not waste space.

Rule of thumb is special device should be 0.3% of pool size. For 10TB pool it’s 30GB.

i need 70GB/10TB for metadata only
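
if you want to measure rather than estimate, zdb can print actual per-type block statistics for a pool, metadata included (poolname is a placeholder, and the traversal is slow on a big pool):
zdb -bb poolname   # walks the pool and reports LSIZE/PSIZE/ASIZE per block type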

I’m done with ZFS for now because I found it inefficient for Storj. It is a long story, man, really. A recordsize of about 512-1024K gives less storage overhead and less IOPS overhead.

As for the efficiency of ZFS in general - no, it is not high, the array suffers a lot from directory listing calls, and most importantly, an SSD does not help here.

Without repeating the whole long story again, I have to point out that I found the opposite to be true: ZFS is the best filesystem for storj from a performance perspective, by a large margin. If you run it on a 2 GB Raspberry Pi - then yeah, ext4 may be faster under some specific conditions. But if you plan to scale in the future - past a few TB and a few hundred IOPS - ZFS is the only choice.

This is demonstrably false. A special device removes [significantly more than, but at least] 50% of the IOPS from the array, even before caching is considered. Let’s not spread misinformation.

That is the problem with discussing a complex topic with unclear words. We don’t even know if he is talking about a SLOG, L2ARC, or special vdev. He could be right: an SSD for SLOG really does not help for a STORJ workload :grinning:

Except when it does, e.g. by accelerating synchronous writes, which the databases (likely on the same volume) make by default, thereby removing some of the random IO from the disks… or maybe there is something else writing to disk synchronously, so the SLOG will offload that, leaving more IOPS for the storagenode; which just reinforces the point you are making: there are too many moving parts and ways to fry the fish, it’s never black and white, and context is everything.
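
For completeness, that offload is just a log vdev; device names are placeholders, and it only ever helps synchronous writes:
zpool add poolname log mirror nvme0 nvme1   # dedicated SLOG; async writes still go straight to the main vdevs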

Nothing there is worth writing in sync, nor worth keeping atime=on :wink: