Notes on storage node performance optimization on ZFS

Why? The requirement of no more than one node per HDD is still satisfied.

I run three nodes on one array because when I need space back I’d delete one of them. How else am I supposed to manage space, until a node can shrink on demand, and fast?

Compression is not everything (and I feel compressing smaller blocks will be more efficient than large ones anyway); there are downsides to large default block sizes. For example, memory is allocated for the whole block, and allocating more memory than you need is slower and quite pointless. ARC memory utilization efficiency may also be impacted.

I would not change defaults unless you have a very good reason to, supported by benchmarks.

Can’t argue with you there. When I started I didn’t have nearly as many nodes as I have now, and tmux was more of a shortcut, so I wasn’t really worried about it. Since then I’ve just taken the ‘if it ain’t broke, don’t fix it’ mentality. As for logs, for now they just get wiped each time the node restarts for updates. It still works fine, but it wouldn’t hurt to make some changes at some point. I just don’t really have the time right now to mess with it too much, as it’s obviously just a side project.

They will affect each other. You know that a pool with redundancy (raidz, for example) performs like the slowest disk in the pool, so you get even fewer IOPS per node.
It also does not make sense to run multiple nodes on the same pool behind the same /24 subnet of public IPs - they would share the traffic and work as a kind of network RAID. Double RAID is not needed.

This is absolutely not true. Raidz is a type of vdev, not a pool.

  • Depending on the number of disks, you may still see better IOPS from that vdev compared to a single disk.
  • The pool may consist of multiple vdevs, which all load-balance.
  • Not all IO hits the disks in the first place: caches and special devices in the pool offload a massive amount of it.

Thus the pool’s IOPS capability can drastically exceed that of a single disk.
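
To make that concrete, here is a minimal sketch; the pool name tank and the device paths are placeholders, not anything from this thread:

# Two single-disk vdevs: ZFS load-balances writes across both of them.
zpool create tank /dev/disk/by-id/hdd-1 /dev/disk/by-id/hdd-2
# Optional cache and special vdevs keep a lot of IO off the HDDs entirely.
zpool add tank cache /dev/disk/by-id/ssd-cache
zpool add tank special /dev/disk/by-id/ssd-special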

But this is moot: I’m not building a setup for Storj, I’m letting Storj use my excess storage. So the RAID config is decided without taking Storj into account. At all. Storj just gets to use extra space on the existing array.

From the storj perspective it does not. From my perspective it absolutely does: when I need to reclaim space for myself I’d rather delete a few small nodes, enough for my needs, than one huge one. It’s actually also better for storj — less repair traffic.


If you had to build a system just for Storj and make the most of ZFS, what would you do? One pool per disk? One special device for all disks?

ZFS is not about one disk, and a pool loses performance when it is 80% full. Simply taking 2-5 ordinary 1 TB disks in a pool, you will get a large amount of IOPS for the node. The more disks in the pool, the higher its speed.

Just thinking out loud: a just-for-Storj setup could probably have one system that handled 48 HDDs (maybe a 4U with 24 bays attached to a similar SAS JBOD?). 16c/32t and 128GB RAM to stick with consumer motherboards (though 8c/16t and 32GB would run). Then for ZFS, have a pair of 4TB NVMe drives chopped into 48 75GB partitions… with each partition mirrored and attached as a metadata-only special device to one zpool per HDD?
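
A rough sketch of what one of those 48 per-HDD pools could look like (purely hypothetical; the pool and device names are placeholders):

# One HDD as the data vdev, plus the matching 75GB partition from each NVMe,
# mirrored, as a special device. special_small_blocks stays at its default of 0,
# so only metadata lands on the NVMe mirror. zpool may ask for -f because the
# data vdev and the special vdev have different redundancy levels.
zpool create pool01 /dev/disk/by-id/hdd-01 \
  special mirror /dev/disk/by-id/nvme-a-part01 /dev/disk/by-id/nvme-b-part01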

I have no idea if 75GB of metadata space could service a 20TB HDD filled by Storj: that could be 60-80 million files? It may not be enough. Definitely not enough to also have the metadata devices handle small files.

Fun to think about. I think a SNO would have to copy their fullest node to a reference ZFS setup just to see how much metadata space is really needed per-TB-of-Storj.

For 16 TB, 75 GB of metadata. For 30 TB, 150-200 GB would be good :slight_smile:


No, I would not go with one pool per disk. I would have one pool, consisting of:

  • Any number of single-disk vdevs. Add all the disks that are lying around the house. Better yet, sell them on eBay and buy fewer larger drives. This will be more power efficient, albeit less performant.
  • Find out the amount of metadata the storage node generates from @jolmando’s comment above, or by looking at zdb -U /data/zfs/zpool.cache -PLbbbs pool1 output on one of the existing setups. (I will run it on mine and update here if I don’t forget.)
  • Find out the number of small files from the table below, perhaps up to the 64k band.
  • Buy a larger used enterprise SSD with PLP (power-loss protection) and add it as a special device.
  • Dataset configuration:
    • sync=disabled
    • atime=off
    • special_small_blocks=64k (or whatever band fits into the SSD size minus metadata). Make sure it is smaller than the recordsize, otherwise all data will end up on the SSD. The recordsize default is 128k; you can set it to 1M to be sure. (See the command sketch after this list.)
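
A minimal sketch of the above, assuming an existing pool called tank with a dataset tank/storj (both names, and the device path, are placeholders):

# Add the used enterprise SSD (with PLP) as a non-redundant special device.
zpool add tank special /dev/disk/by-id/ssd-with-plp
# Dataset settings from the list above.
zfs set sync=disabled tank/storj
zfs set atime=off tank/storj
zfs set recordsize=1M tank/storj
zfs set special_small_blocks=64K tank/storj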

Hypothetical objections:

  • Q: But if one vdev fails, the whole pool fails!!!
  • A: Yes, so what? Best case: a few bad sectors and some lost files, no big deal. Worst case: loss of the disk. Still, nobody cares. Keep running it until the node is disqualified, then scrap it and start a new one. There is no point in running RAID for a node’s sake, as @Alexey pointed out above.
  • Q: You are using a special device without redundancy??!?!?! GAAAAA
  • A: Again, see above. Nodes don’t need redundancy. If it fails, it fails; who cares. Just use SSDs with PLP, avoid consumer trash, and you’ll be fine.
Histogram of file sizes on my 15TB node:
Band | Count | Band total (bytes) | Band total (TiB) | Cumulative (bytes) | Cumulative (TiB)
1k | 3658879 | 3746692096 | 0.00 | 3746692096 | 0.00
2k | 16479930 | 33750896640 | 0.03 | 37497588736 | 0.03
4k | 4054790 | 16608419840 | 0.02 | 54106008576 | 0.05
8k | 7106800 | 58218905600 | 0.05 | 112324914176 | 0.10
16k | 6244297 | 102306562048 | 0.09 | 214631476224 | 0.20
32k | 6937242 | 227319545856 | 0.21 | 441951022080 | 0.40
64k | 3210418 | 210397954048 | 0.19 | 652348976128 | 0.59
128k | 18334873 | 2403188473856 | 2.19 | 3055537449984 | 2.78
256k | 3353330 | 879055339520 | 0.80 | 3934592789504 | 3.58
512k | 745567 | 390891831296 | 0.36 | 4325484620800 | 3.93
1M | 487975 | 511678873600 | 0.47 | 4837163494400 | 4.40
2M | 3091878 | 6484138131456 | 5.90 | 11321301625856 | 10.30
4M | 1694 | 7105150976 | 0.01 | 11328406776832 | 10.30
8M | 61 | 511705088 | 0.00 | 11328918481920 | 10.30
16M | 71 | 1191182336 | 0.00 | 11330109664256 | 10.30
32M | 36 | 1207959552 | 0.00 | 11331317623808 | 10.31
64M | 15 | 1006632960 | 0.00 | 11332324256768 | 10.31
Totals | 73,707,856 | 11,332,324,256,768 | 10.31

Histogram obtained by running this in the node folder:

find . -type f -print0 | xargs -0 -P 24 -n 100 stat -f "%z" | awk '{ n=int(log($1)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d\t%d\n", 2^i, size[i]) }' | sort -n
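
Note that stat -f "%z" is BSD/macOS syntax; on Linux with GNU coreutils the equivalent is stat -c "%s", so the same pipeline would look like:

find . -type f -print0 | xargs -0 -P 24 -n 100 stat -c "%s" | awk '{ n=int(log($1)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d\t%d\n", 2^i, size[i]) }' | sort -n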

Ok, I got it. If your pool is without redundancy, you perhaps do not have problems with IOPS. I have just seen a setup with one pool holding more than one node, and the nodes were crawling and failing left and right as if they were on a single SMR disk.

And I cannot suggest a better word. Perhaps for ZFS it’s a vdev; for other similar systems it’s usually the pool, sometimes a volume, sometimes a volume group. What I actually mean is that you should not use an entity which is presented to the OS as a single disk for multiple nodes.

Consumer SSDs will break :slight_smile:

You get a lot of write amplification on a special vdev (2x-10x), depending on load.


Before the tests started I was seeing about 10TB/month of writes on the special device (a consistent 3-4MB/s). And that’s before accounting for possible write amplification.

Consumer SSDs will get destroyed in a year tops.


No, write amplification is an issue all SSDs have :slight_smile:


Not exactly. SSDs write in blocks: if you write something smaller than a block, the whole block still gets written. There may or may not be some batching possible, but write amplification cannot be fully eliminated, because it’s unrealistic to expect that all writes are sector-sized and/or perfectly batched.


Just talking out loud: I wonder if using SSDs as metadata-only special devices (even with no small files)… even if they do hit their rated TBW quickly… is still worth it for all the IO they keep off the HDDs?

Like if you save at least one full-disk-scan filewalker per week: that’s a lot of head movement. Maybe your HDD could last an additional year (or more)?

And rated TBW is really just about the warranty for SSDs anyway: it’s about long-term power-off high-temperature data retention. If your SSD is online 24x7 then TBW doesn’t say much about how long it will last: I have some Samsungs that are at 300% of their rated life and still scrub fine.
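
If you want to see where a drive sits relative to its rated life, smartctl reports it directly; for example (device paths are placeholders):

# NVMe: the "Percentage Used" field goes past 100% once rated endurance is exceeded.
smartctl -A /dev/nvme0
# SATA SSDs expose similar vendor attributes, e.g. Wear_Leveling_Count on Samsung drives.
smartctl -A /dev/sda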


I don’t look at it from a wear standpoint; both SSDs and HDDs are consumables. Using SSDs, however, helps improve responsiveness and user experience. If an SSD perishes as a result, it was worth it: user experience has been improved.

I would prefer that it become read-only rather than a brick.

I think it’s something I’ll have to test for myself. I know Storj has a ton of small files (3-4 million per TB?) so I wouldn’t be surprised if it needed more metadata space. You mention 5GB/10TB, jomando mentioned 75GB/16TB, and elsewhere I’ve seen 120GB/100TB (all presumably for metadata only, not metadata + small files), which works out to roughly 0.5, 4.7 and 1.2 GB of metadata per TB respectively.

My sample scenario is a system with, say, 48 HDDs, 128GB RAM, and 2-4TB of mirrored SSD space for special metadata. I’d need to find out how thinly to chop up that 2-4TB so that each HDD hopefully gets enough for a metadata-only special device.

It’s not hard for me to test personally: I just need to copy my largest node into a properly set-up zpool to measure GB of metadata per TB of raw HDD and get an average number. Then I’d know that if a Storj node held 10TB, it would need about X GB of SSD space for metadata.
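
One way that measurement could work (a sketch only; pool, dataset, and device names are placeholders): build a scratch pool with a metadata-only special vdev, copy the node onto it, and read the special vdev’s allocation.

# Scratch pool: one HDD plus an SSD partition as a metadata-only special device.
zpool create scratch /dev/disk/by-id/test-hdd special /dev/disk/by-id/test-ssd-part1
zfs create scratch/node
# ...copy the node's files onto scratch/node, then:
zpool list -v scratch   # ALLOC on the special vdev line ~= metadata bytes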


I am planning on downsizing my setup and would appreciate some feedback.

Currently I am running 8 nodes on 8 HDDs with 128 GB of RAM and a cheap 8 TB caching SSD (sata).

I want to migrate that over to a Pi5. For fun I would like to try connecting all 8 HDDs to test out the limits but I don’t expect a single Pi5 to have enough resources for that. My expectation is that it can handle 4 drives and will get problems with the 5th drive. If that works out it only means I need a second Pi5 and can move all of my existing nodes to a setup that consumes less power and I can scale this up. If my friends want to run a storage node they get a Pi5 with a 5 bay enclosure and if that ever gets full just get another one. This would be the lowest power consumption I can think of.

So the challenge is to run as many drives as possible on the Pi5 with an acceptable success rate, and best case, garbage collection takes only a few hours per drive and not days. I want to do the stupid thing and test out how ZFS will perform in this situation. I don’t expect a miracle, but sync=disabled and a metadata cache sound like exactly what I need.

Here is what I want to set so far:

zfs set compression=on SN1            # enable compression
zfs set sync=disabled SN1             # treat sync writes as async (in-flight data can be lost on power failure)
zfs set atime=off SN1                 # skip access-time updates on reads
zfs set recordsize=1M SN1             # large records for mostly-large pieces
zfs set mountpoint=/mnt/sn1 SN1
zfs set primarycache=metadata SN1     # ARC caches metadata only, not file data
zfs set secondarycache=metadata SN1   # L2ARC (if present) caches metadata only

Now this would be with a 1TB NVMe SSD in the system. I could allocate 64GB per drive and use the rest for the OS, including the storage node DBs; I am doing that in my current setup with the caching SSD as well. The problem is that adding a drive is complicated because I would have to resize the partitions. How would a special vdev perform in comparison? It would remove the need to resize partitions, right? Any downsides?