How to determine the required size for ZFS special VDEV?

A little off-topic:
My test with L2ARC (no special dev), configured to cache only metadata, gives me this usage:

root@zfstest:~# zpool iostat -v
                                          capacity     operations     bandwidth
pool                                    alloc   free   read  write   read  write
--------------------------------------  -----  -----  -----  -----  -----  -----
n001                                    9.40T  6.85T     27     25   839K  2.30M
  scsi-SATA_ST18000NM003D-3D_ZVT7X7PW   9.40T  6.85T     27     25   839K  2.30M
cache                                       -      -      -      -      -      -
  nvme0n1p1                             76.6G   373G      2      1  23.3K   110K
--------------------------------------  -----  -----  -----  -----  -----  -----
n002                                    8.80T  7.45T     27     27   871K  3.60M
  wwn-0x5000c5007446a88d                8.80T  7.45T     27     27   871K  3.60M
cache                                       -      -      -      -      -      -
  nvme0n1p2                             67.9G   382G      3      2  24.9K   139K
--------------------------------------  -----  -----  -----  -----  -----  -----
n003                                    4.85T  11.4T     12     16   369K  1.58M
  scsi-35000c500e7511377                4.85T  11.4T     12     16   369K  1.58M
cache                                       -      -      -      -      -      -
  nvme0n1p3                             32.0G   418G      1      1  9.92K  62.6K
--------------------------------------  -----  -----  -----  -----  -----  -----

n001 has 9.40T of used space with 76.6G of L2ARC used out of 450G.
I think with 100G for every 10TB you will be safe and have the metadata fully cached. Btw, here we are talking about the special dev, sorry for that :slight_smile:
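
For reference, those numbers work out to roughly 6.5-8 GiB of metadata-only L2ARC per TiB of stored data, so 100G per 10TB leaves a bit of headroom. A throwaway sketch to recompute the ratios (values copied from the iostat output above):

awk 'BEGIN {
  # L2ARC used (GiB) divided by pool alloc (TiB), per the zpool iostat output
  printf "n001: %.1f GiB per TiB\n", 76.6 / 9.40
  printf "n002: %.1f GiB per TiB\n", 67.9 / 8.80
  printf "n003: %.1f GiB per TiB\n", 32.0 / 4.85
}'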

I thought (heard) that the only way to improve things is to use a special device and not use L2ARC?

A special device can speed up reads and writes for metadata and (if enabled) small files. L2ARC is just a read cache. However, for filewalkers the outcome might be the same if L2ARC is set to metadata only.
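
For anyone who wants to try the metadata-only L2ARC approach, a minimal sketch (pool and dataset names here are placeholders, not from this thread):

# add an SSD partition as a cache device (L2ARC)
zpool add tank cache /dev/disk/by-id/nvme-example-part1
# keep only metadata from this dataset in L2ARC
zfs set secondarycache=metadata tank/storagenode
# primarycache stays at its default ("all"), so the in-RAM ARC still caches data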

Not sure about reboots atm. The original L2ARC did not survive a reboot, but OpenZFS promised to change this.

If you lose the special device you lose everything, so you need a mirror. It is a single point of failure with great performance advantages. L2ARC is in the middle: you can make it persistent so it survives a reboot.
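
On current OpenZFS (2.0 and later) the persistent L2ARC rebuild is controlled by a module parameter; a hedged sketch for Linux, since the default may differ by distribution:

# check whether L2ARC is rebuilt after a reboot (1 = enabled)
cat /sys/module/zfs/parameters/l2arc_rebuild_enabled
# make it explicit at module load time
echo "options zfs l2arc_rebuild_enabled=1" >> /etc/modprobe.d/zfs.conf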

My plan is to use pairs of 1 HDD + 1 cheap SSD as the special device per node, no mirrors.

A cheap SSD will break soon and you will lose everything.

I will lose one node out of many. So what?

BTW:
I have used a number of cheap SSDs for chia plotting, all are still alive, and the record holder is at almost 10x its rated TBW now.

Yes, unfortunately I would agree with @pangolin. If you lose one node out of several, I suppose it doesn’t matter for the Storj network. Not sure that it wouldn’t matter for you… but well…
I still have long-running nodes without any redundancy (since 2019 at least), and they are working fine and do not produce a lot of IOPS (= wearing), unlike any RAID-like solutions. Windows. NTFS. No SSD. No redundancy. No cache. YMMV. 5 years.
One runs as a Windows service, the other two in Docker Desktop for Windows (I know). WSL2.
Ok, I have a UPS. But it’s not managed. The cheapest APC one that can handle the load for 15 minutes (eh, perhaps not anymore, but this server hasn’t had a blackout yet in this place… there is a nuclear power station near the city).

I am not worried about SSD wear. In my proposed setup we would write the metadata and databases (second partition) of only one node to one SSD. Even a cheap SSD should not be in danger of hitting its TBW. I would expect the HDD to die first in most cases.

Why a second partition? Keep the databases on the same pool, on a separate dataset. Configure special_small_blocks to be equal to the recordsize on that dataset.

This will send all blocks of that dataset to the SSD.
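
A minimal sketch of that setup; the dataset name and the 64K figure are assumptions, the key point is that special_small_blocks matches the recordsize on the databases dataset:

# dedicated dataset for the node databases
zfs create tank/storagenode/databases
zfs set recordsize=64K tank/storagenode/databases
# with special_small_blocks equal to the recordsize, every block of this
# dataset counts as "small" and lands on the special vdev (the SSD)
zfs set special_small_blocks=64K tank/storagenode/databases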

To be honest, I have used ZFS only in QNAP and TrueNAS boxes so far. Both expose only a subset of ZFS features in their GUI. Changing things from the command line can result in a broken GUI; I have learned that the hard way.

But yes, for my new headless Ubuntu server, using only one partition sounds reasonable. :+1:

All of what I described you can do in the TrueNAS GUI. Also, nothing ZFS-related will break the GUI if you do it in the CLI. The GUI is just an interface to ZFS; it does not matter how you change the settings. All of those are pool properties, not OS configuration. You can export the pool and import it on another machine and those settings remain with the pool.
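
A quick way to convince yourself of that (pool and dataset names are placeholders):

zpool export tank        # cleanly detach the pool
zpool import tank        # re-import it, on this machine or another one
zfs get recordsize,special_small_blocks tank/storagenode/databases
# the properties come back exactly as set, because they are stored in the pool itself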

That said, TrueNAS Core’s approach to OS configuration is nothing short of magnificent. The OS is a vanilla OS, and all configuration changes are made on top of it, recorded in a database, and replayed on every boot. Try saving the “system configuration”: the file you end up with contains step-by-step commands for getting from a blank slate to the current state.

That’s why things like rc.conf or sysctl tunables should be configured through the UI so they end up in that replay list; modifying them on the command line is indeed pointless, since the changes won’t survive a middleware restart, let alone a reboot.

This, coupled with FreeBSD boot environments and relentless stability testing, makes TrueNAS unbreakable.

So as long as you understand how configuration is handled, there are zero issues managing pool-specific stuff in the CLI.

The box I broke was the QNAP. :sweat_smile:

As for TrueNAS, I have only used Scale so far. It seems you cannot assign partitions in the GUI, only whole disks. Is this different in Core?

[quote]
I thought (heard) that the only way to improve things is to use a special device and not use L2ARC?
[/quote]

I’ve set up L2ARC on my Storj nodes for the last week of pretty serious filewalkers, and also while migrating two different hard drives with rsync.

L2ARC is configured with 5GB per TB of HDD, and the fullest drive is only 50% full so far. redundant_metadata is set to “some” on the datasets; I don’t know if that affects L2ARC. And of course this is L2ARC usage, not special vdev usage, which might be different.

The L2ARC hit rate is only about 20% across all drives right now.

Also, the write load on the L2ARC cache drive is pretty modest with metadata only: less than 1MB/sec. It’s a used enterprise SSD, so the endurance is measured in petabytes anyway.
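
In case it helps anyone reproduce these numbers: on Linux the L2ARC counters can be read straight from the ARC kstats (a rough sketch; the counters are cumulative since boot):

# raw L2ARC hit/miss counters
grep -E '^l2_(hits|misses) ' /proc/spl/kstat/zfs/arcstats
# rough hit rate since boot
awk '/^l2_hits /{h=$3} /^l2_misses /{m=$3} END {printf "l2arc hit rate: %.1f%%\n", 100*h/(h+m)}' /proc/spl/kstat/zfs/arcstats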

I used both in the past. I’m ok with only L2ARC at the moment… the used-space filewalker on 10TB of data takes 8 hours. You will get a better result with a special dev, of course.

But how would you see a difference?
As far as I understand, L2ARC helps only with reads. That covers downloads by the customers and the multiple filewalkers… However, does this cache keep the data long enough to serve both? What’s your hit rate?

Actually, L2ARC stores all metadata when persistence is enabled. During the used-space filewalker I get a 30-40% hit rate.

Yeah, but if you mirror your metadata as you suggest, that problem doesn’t exist anymore. I just create two partitions sized at 5GB per TB of HDD storage. The mirror improves performance even more and eliminates the single point of failure.

I actually just use two consumer SSDs. Average read and write is actually below 5MB/s, so it will probably take 7 years to even reach the TBW.
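
For completeness, adding such a mirrored special vdev built from two SSD partitions could look roughly like this (pool name and device paths are placeholders); keep in mind a special vdev generally cannot be removed later if the pool contains raidz vdevs:

zpool add tank special mirror \
    /dev/disk/by-id/ata-SSD1-part1 \
    /dev/disk/by-id/ata-SSD2-part1
# metadata for new writes now lands on the mirrored SSD partitions;
# existing metadata migrates only as blocks are rewritten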

My real usage is 3GB per TB of HDD atm. So yes, 5GB/TB seems to be a reasonable number. :+1:
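
That kind of number can be read straight from the per-vdev allocation; a sketch, assuming the pool is called tank:

zpool list -v tank
# the special mirror shows up as its own line with ALLOC/FREE for metadata only;
# dividing its ALLOC by the data vdevs' ALLOC gives the GB-per-TB ratio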

+1. In my limited testing, when I had different flags (default 128K recordsize, no compression)… I was filling the metadata device 3x-4x as fast, so it would have filled up before the HDD. When I used recordsize=512K and compression=lz4, then 5GB/TB worked. There wasn’t a lot of wiggle room (6GB/TB maybe would have been safer), but the HDD %-full was climbing more slowly than… sorry, faster than metadata %-full… which in the end is all you need.

But that was metadata-only: no small files. And for a 4TB node copied from ext4 to zfs.
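
If it’s useful, the settings described above would look roughly like this on the node’s dataset (names are placeholders; a larger recordsize means fewer blocks and therefore less metadata per TB of data):

zfs set recordsize=512K tank/storagenode
zfs set compression=lz4 tank/storagenode
# special_small_blocks left at its default of 0: metadata only, no small files
zfs get recordsize,compression,special_small_blocks tank/storagenode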
