Notes on storage node performance optimization on ZFS

It certainly doesn’t need it. However, I was imagining what it would be like to be a large SNO… with 100+ nodes. Knowing the pain of the 9-month holdback, and say an average of 2-3 years to fill 10-20TB HDDs (ignoring recent test ingress)…

…at some point, would you want some nodes (perhaps just the full ones) to have some redundancy? Just to avoid losing a full 20TB node and having to fill it again?

I can imagine deciding that RAIDZ1 for every 8 HDDs (so a capacity of 7) would be worth it to protect full nodes. They’d be lower-IO anyway if they were rejecting ingress most of the time.

I don’t mean to derail the ZFS discussion. I can just see that at some point a SNO may decide to give up some capacity to protect all the time they spent growing their nodes.

That would then be a 100TB+ node, or something else not in line with the ToS. :thinking:

Doesn’t make sense to me; I’ve learned to keep things simple.

I am also using a single disk setup.

The smallest raidz I can think of would be a 2+1 setup, and then maybe a lot of them in parallel. It still means you get just 1/3 of the IOPS you would get from 3 single-drive setups. For a storage node that slowdown should get noticeable in garbage collection runtime and other stuff. And this is the lowest IOPS impact I can think of; if you add more drives to one array it gets even worse.

Plus you don’t have the full capacity available.

Plus, single drives can keep running for months after they start showing read and write errors. In a single-drive setup you can risk keeping them in service until the node gets disqualified.

2 Likes

What do you mean? Many different pools, one for every HDD?

How about a special dev? I know you can attach one special device to one pool. If I have 48 pools, I need 48 special devs… Am I wrong?

I don’t understand the question. Do you mean zpool create SN8 /dev/disk/by-id/... ?

What I also did was add an 8 TB SATA SSD and split it into eight 1 TB partitions. Now each drive has an L2ARC cache on a single SSD. This should also work with smaller SSDs, like 20GB for a 20TB drive. You can configure this cache to hold metadata only. It will boost filewalker performance a lot. You can do the same with RAM, but the RAM cache will take hours to fill again after a restart. The SSD metadata cache will allow you to finish the filewalker in a short time even after a restart.
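
Roughly, the per-pool commands look like this (the pool, dataset, and partition names are just placeholders):

    # add one SSD partition as an L2ARC (cache) device to an existing pool
    zpool add tank1 cache /dev/disk/by-id/nvme-SOME_SSD-part1

    # cache metadata only for the node's dataset
    zfs set secondarycache=metadata tank1/storagenode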

1 Like

Do you mind posting your L2ARC arc_summary?

Mine for example:

L2ARC status:                                                    HEALTHY
        Low memory aborts:                                             0
        Free on write:                                               153
        R/W clashes:                                                   0
        Bad checksums:                                                 0
        I/O errors:                                                    0

L2ARC size (adaptive):                                         237.9 GiB
        Compressed:                                    10.3 %   24.4 GiB
        Header size:                                    0.2 %  438.4 MiB
        MFU allocated size:                            99.5 %   24.3 GiB
        MRU allocated size:                           < 0.1 %  364.0 KiB
        Prefetch allocated size:                        0.5 %  128.3 MiB
        Data (buffer content) allocated size:           0.0 %    0 Bytes
        Metadata (buffer content) allocated size:     100.0 %   24.4 GiB

L2ARC breakdown:                                                    1.6G
        Hit ratio:                                      4.0 %      65.0M
        Miss ratio:                                    96.0 %       1.6G
        Feeds:                                                      2.3M

L2ARC writes:
        Writes sent:                                    100 %       1.1M

L2ARC evicts:
        Lock retries:                                                  0
        Upon reading:                                                  0

Somehow I feel the hit ratio is quite small, but my RAM ARC has a 96% hit rate. All datasets are set to secondarycache=metadata.

Yes. So you have one pool per disk. In a 48+ disk system you need to create 48+ pools with 48+ partitions for L2ARC. Not impossible, but not clean, and I want to use a special dev (after reading @arrogantrabbit’s experience), not L2ARC.
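
A minimal sketch of that, assuming one SSD partition per single-disk pool (device names are hypothetical; note that, unlike a cache device, losing a special vdev loses its whole pool):

    # per-pool special device: one SSD partition added to a single-disk pool
    zpool add SN8 special /dev/disk/by-id/nvme-SSD-part8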

L2ARC size (adaptive):                                           9.3 TiB
        Compressed:                                    76.9 %    7.1 TiB
        Header size:                                    0.1 %    6.1 GiB
        MFU allocated size:                            11.2 %  815.7 GiB
        MRU allocated size:                            88.7 %    6.3 TiB
        Prefetch allocated size:                        0.1 %    6.7 GiB
        Data (buffer content) allocated size:          97.8 %    7.0 TiB
        Metadata (buffer content) allocated size:       2.2 %  162.0 GiB

L2ARC breakdown:                                                   10.4M
        Hit ratio:                                     51.2 %       5.3M
        Miss ratio:                                    48.8 %       5.1M

L2ARC I/O:
        Reads:                                       57.5 GiB       5.3M
        Writes:                                     469.4 GiB     104.4k

The hit ratio might be a bit misleading in my case because I am storing more than just metadata in the cache. Still, 50% means fewer IOPS on the disks, and it is 50% of the ARC misses, so almost no reads hit the disks at all. The big advantage of having some kind of L2ARC is that I can restart my server without the big filewalker penalty I would otherwise have.
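
As a side note, surviving a restart relies on persistent L2ARC (OpenZFS 2.0+); on Linux you can check that it is enabled with something like:

    # 1 means the L2ARC contents are rebuilt from the cache device after a reboot
    cat /sys/module/zfs/parameters/l2arc_rebuild_enabled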

2 Likes

From my limited knowledge that’s common with L2ARC: the RAM ARC has a really high hit ratio and the L2ARC has a dramatically lower one.

A few reasons:

  1. The L2ARC is only populated with data that is about to be EVICTED from the RAM ARC, so by definition it is less used than what stays inside.

  2. The speed at which the L2ARC is written is throttled so as not to hurt performance, so not all of the evicted data may even make it in (see the tunables sketched below).

Here’s a longer and probably smarter writeup:
OpenZFS: All about the cache vdev or L2ARC | Klara Inc (klarasystems.com)
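
For reference, the throttle from point 2 lives in module parameters like these (Linux paths shown; the 64 MiB value is only an example, not a recommendation):

    # bytes the feed thread may write to L2ARC per interval (plus a boost while the ARC is cold)
    cat /sys/module/zfs/parameters/l2arc_write_max
    cat /sys/module/zfs/parameters/l2arc_write_boost

    # example: raise the steady-state limit to 64 MiB per interval
    echo 67108864 > /sys/module/zfs/parameters/l2arc_write_max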

2 Likes

You mean recordsize? Leave it at the default and keep compression on; this results in the most efficient space utilization. On storagenode datasets I’m seeing a 1.18-1.28 compression ratio.
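
For example, you can verify what a node dataset is doing with (dataset name is a placeholder):

    # default recordsize is 128K; compressratio shows the achieved savings
    zfs get recordsize,compression,compressratio tank/storagenode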

It depends on the file size distribution in those 10. This is what I saw empirically:

When I allowed files smaller than 16k to go to the 400GB special device, the special device filled up at about 20TB worth of storagenode data.

I have disabled sending small files to the special device, rebalanced the datasets (zfs send | zfs receive), and now the same 400GB seems to be sufficient to hold metadata for about 30TB worth of node data, with room to spare.
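
The two settings involved, plus the rebalance step, look roughly like this (pool/dataset names are placeholders):

    # send blocks of files <=16K to the special vdev (this is what filled the 400GB device)
    zfs set special_small_blocks=16K tank/storagenode

    # back to metadata-only on the special vdev
    zfs set special_small_blocks=0 tank/storagenode

    # rewrite existing data so it is placed according to the current setting
    zfs snapshot tank/storagenode@rebalance
    zfs send tank/storagenode@rebalance | zfs receive tank/storagenode_new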

I don’t do anything special for storj. I have one pool for everything. The pool contains three raidz1 vdevs, each of 4 drives. It has a special device (a mirror of two SSDs), and recently I added an L2ARC because I just had an unused 500GB SSD.
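
In zpool terms that layout would be roughly this (device names are placeholders):

    zpool create tank \
      raidz1 hdd1 hdd2 hdd3 hdd4 \
      raidz1 hdd5 hdd6 hdd7 hdd8 \
      raidz1 hdd9 hdd10 hdd11 hdd12 \
      special mirror ssd1 ssd2 \
      cache ssd3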

This is what the hit rates look like:


And this is the traffic on the L2ARC device:

1 Like

Do you change the write speed limit?

What do you think about using ZFS with one pool per disk? Not a standard config, but can it work with Storj?

One of my servers runs TrueNAS Core and all the nodes on that server run on single-disk pools. Works perfectly fine; you obviously just don’t have redundancy. They’re all mounted to a single jail, with each node running in a separate tmux session. Just make sure you delete the scrub tasks that are automatically created for each ‘pool’, as there’s no need for them.

Why not rc.d? Why single jail?

Btw, feel free to reuse this one freebsd_storj_installer/overlay/usr/local/etc/rc.d/storagenode at bf16617f1aa5044bc9f8aa26b6d5d86b2f1192f7 · arrogantrabbit/freebsd_storj_installer · GitHub or even use the whole thing, including log rotator and updater.

1 Like

Having an SSD as the special metadata device lets it hold pretty much everything the filewalkers care about (and even the small files)… while leaving the HDD with the medium and large files that it works better with. Plus, if you have a lot of RAM, the ZFS ARC will help with everything. It’s a clever config!

I’d get my systems to 128GB RAM first though: reboots are rare and memory trumps all! :stuck_out_tongue_winking_eye:

1 Like

I have 128, and it seems it’s overkill. 32 was definitely not sufficient. 64 is probably optimum.

And for storing small files on the special device, I totally agree it’s definitely the way to go. However, I ran out of space on the 400GB special device and had to limit it to just metadata.

But now that you’ve mentioned it again in such a concise form, I’m thinking of getting a pair of 2TB P3600s off eBay (they are under $100) and migrating the special device, small files included, there.
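
If the pool layout allows it, that swap can be done in place by attaching the new drives to the existing special mirror and detaching the old ones after resilver (device names are placeholders):

    zpool attach tank old_ssd1 p3600_1
    zpool attach tank old_ssd1 p3600_2
    # wait for the resilver to complete, then remove the 400GB devices
    zpool detach tank old_ssd1
    zpool detach tank old_ssd2
    # let the mirror grow to the new size (or set autoexpand=on beforehand)
    zpool online -e tank p3600_1 p3600_2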

2 Likes

Man… that’s a great price for 2TB, and P3600s had monster endurance, so even used they should have tons of life left!

2 Likes

Personal preference, really. I have a Python program that manages nodes across multiple servers, and that’s just what I chose for simplicity. Create a new tmux session with the storagenode run command… that’s it. No need for more than one jail.

1 Like

It just bothers me that you have two extra (!), completely unnecessary processes hanging around, tmux and a shell, per storagenode.

You can get rid of them by using a (for some reason) little known command: disown.

Something like this:

sp.run(f"/usr/bin/zsh -c {storagenode_cmd_with_arguments} &>{log_file} &; pid=$!; printf '%d' $pid; disown $pid; exit", ...) 

This will

  1. start zsh (bash works too, I think)
  2. tell zsh to run a command to start storagenode with all necessary arguments
  3. redirect output to log_file &>
  4. background the process &
  5. get its pid $!
  6. print it to stdout
  7. disown it
  8. exit.

As a result, only the storagenode process is left running, and you have its pid to send commands to it.

A separate problem is that there is no way to tell storagenode that you want to rotate the logs, or to restart it if it crashes. So you would want to run it under the daemon utility anyway. And if you do that, you might as well do it properly via rc.d.
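
On FreeBSD the daemon(8) route looks roughly like this; the paths and the run command are placeholders, not taken from the linked installer:

    # -r restarts the child when it exits, -P/-p write supervisor/child pidfiles,
    # -o appends the child's stdout/stderr to a file
    daemon -r -P /var/run/storagenode_daemon.pid -p /var/run/storagenode.pid \
      -o /var/log/storagenode.log \
      /usr/local/bin/storagenode run --config-dir /usr/local/etc/storagenode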

1 Like

You may run a node of any size. Just make sure that you do not use the same pool for more than one node (otherwise it will be the same as running multiple nodes on the same disk), and do not use methods to bypass the /24 rule.

Yes… I simply need to survive the test data and the “possible” new pattern arriving, and I can’t buy a 7TB enterprise SSD :slight_smile:

I appreciate all the advice, thanks… I’m studying dRAID now.

1 Like