Notes on storage node performance optimization on ZFS

Why? The requirement of no more than one node per HDD is still satisfied.

I run three nodes on one array because when I need space back I’d delete one of them. How else am I supposed to manage space, until a node can shrink on demand, and fast?

Compression is not everything (and I feel compressing smaller blocks will be more efficient than large ones anyway); there are downsides to large default block sizes. For example, memory is allocated for the whole block, and allocating more memory than you need is slower and quite pointless. ARC memory utilization efficiency may also be impacted.

I would not change defaults unless you have a very good reason to, supported by benchmarks.

Can’t argue with you there. When I started I didn’t have nearly as many nodes as I have now, and tmux was more of a shortcut, so I wasn’t really worried about it. Since then I’ve just taken the ‘if it ain’t broke, don’t fix it’ mentality. As for logs, for now they just get wiped each time the node restarts for updates. It still works fine, but it wouldn’t hurt to make some changes at some point. I just don’t really have the time right now to mess with it too much, as it’s obviously just a side project.

They will affect each other. You know that a pool with redundancy (raidz, for example) performs like the slowest disk in the pool, so you get even fewer IOPS per node.
It also does not make sense to run multiple nodes on the same pool behind the same /24 subnet of public IPs - they would share the traffic and work as a kind of network RAID. Double RAID is not needed.

This is absolutely not true. Raidz is a type of vdev, not a pool.

  • Depending on the number of disks, you may still see better IOPS from that vdev compared to a single disk.
  • The pool may consist of multiple vdevs, which all load-balance.
  • Not all IO hits the disks in the first place: caches and special devices in the pool offload a massive amount of it.

Thus the pool’s IOPS capability can drastically exceed that of a single disk.
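
To make that concrete, here is a minimal sketch; the pool name tank and the device paths are placeholders, not anything from this thread:

# Two single-disk vdevs: ZFS load-balances writes across both of them.
zpool create tank /dev/disk/by-id/hdd-1 /dev/disk/by-id/hdd-2
# Optional cache and special vdevs keep a lot of IO off the HDDs entirely.
zpool add tank cache /dev/disk/by-id/ssd-cache
zpool add tank special /dev/disk/by-id/ssd-special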

But this is moot: I’m not building a setup for Storj, I’m letting Storj use my excess storage. So the RAID config is decided without taking Storj into account. At all. Storj just gets to use extra space on the existing array.

From the storj perspective it does not. From my perspective it absolutely does: when I need to reclaim space for myself I’d rather delete a few small nodes, enough for my needs, than one huge one. It’s actually also better for storj — less repair traffic.


If you had to build a system just for Storj and make the most of ZFS, what would you do? One pool per disk? One special device for all disks?

ZFS is not about one disk, and a pool loses performance when it is 80% full. Simply taking 2-5 ordinary 1 TB disks in a pool, you will get a large amount of IOPS for the node. The more disks in the pool, the higher its speed.

Just thinking out loud: a just-for-Storj setup could probably have one system that handled 48 HDDs (maybe a 4U with 24 bays attached to a similar SAS JBOD?). 16c/32t and 128GB RAM to stick with consumer motherboards (though 8c/16t and 32GB would run). Then for ZFS, have a pair of 4TB NVMe drives chopped into 48 75GB partitions… with each partition mirrored and attached as a metadata-only special device to one zpool per HDD?
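
A rough sketch of what one of those 48 per-HDD pools could look like (purely hypothetical; the pool and device names are placeholders):

# One HDD as the data vdev, plus the matching 75GB partition from each NVMe,
# mirrored, as a special device. special_small_blocks stays at its default of 0,
# so only metadata lands on the NVMe mirror. zpool may ask for -f because the
# data vdev and the special vdev have different redundancy levels.
zpool create pool01 /dev/disk/by-id/hdd-01 \
  special mirror /dev/disk/by-id/nvme-a-part01 /dev/disk/by-id/nvme-b-part01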

I have no idea if 75GB of metadata space could service a 20TB HDD filled by Storj: that could be 60-80 million files? It may not be enough. Definitely not enough to also have the metadata devices handle small files.

Fun to think about. I think a SNO would have to copy their fullest node to a reference ZFS setup just to see how much metadata space is really needed per-TB-of-Storj.

For 16 TB, 75 GB of metadata. For 30 TB, 150-200 GB would be good :slight_smile:


No, I would not go with one pool per disk. I would have one pool, consisting of:

  • Any number of single-disk vdevs. Add all the disks that are lying around the house. Better yet, sell them on eBay and buy fewer larger drives. This will be more power efficient, albeit less performant.
  • Find out the amount of metadata the storage node generates from @jolmando’s comment above, or by looking at zdb -U /data/zfs/zpool.cache -PLbbbs pool1 output on one of the existing setups. (I will run it on mine and update here if I don’t forget.)
  • Find out the number of small files from the table below, perhaps up to the 64k band.
  • Buy a larger used enterprise SSD with PLP (power-loss protection) and add it as a special device.
  • Dataset configuration:
    • sync=disabled
    • atime=off
    • special_small_blocks=64k (or whatever band fits into the SSD size minus metadata). Make sure it is smaller than the recordsize, otherwise all data will end up on the SSD. The recordsize default is 128k; you can set it to 1M to be sure. (See the command sketch after this list.)
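
A minimal sketch of the above, assuming an existing pool called tank with a dataset tank/storj (both names, and the device path, are placeholders):

# Add the used enterprise SSD (with PLP) as a non-redundant special device.
zpool add tank special /dev/disk/by-id/ssd-with-plp
# Dataset settings from the list above.
zfs set sync=disabled tank/storj
zfs set atime=off tank/storj
zfs set recordsize=1M tank/storj
zfs set special_small_blocks=64K tank/storj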

Hypothetical objections:

  • Q: But if one vdev fails, the whole pool fails!!!
  • A: Yes, so what? Best case: a few bad sectors and some lost files, no big deal. Worst case: loss of the disk. Still, nobody cares. Keep running it until the node is disqualified, then scrap it and start a new one. There is no point in running RAID for a node’s sake, as @Alexey pointed out above.
  • Q: You are using a special device without redundancy??!?!?! GAAAAA
  • A: Again, see above. Nodes don’t need redundancy. If it fails, it fails; who cares. Just use SSDs with PLP, avoid consumer trash, and you’ll be fine.
Histogram of file sizes on my 15TB node:
Band | Count | Band total (bytes) | Band total (TiB) | Cumulative (bytes) | Cumulative (TiB)
1k | 3658879 | 3746692096 | 0.00 | 3746692096 | 0.00
2k | 16479930 | 33750896640 | 0.03 | 37497588736 | 0.03
4k | 4054790 | 16608419840 | 0.02 | 54106008576 | 0.05
8k | 7106800 | 58218905600 | 0.05 | 112324914176 | 0.10
16k | 6244297 | 102306562048 | 0.09 | 214631476224 | 0.20
32k | 6937242 | 227319545856 | 0.21 | 441951022080 | 0.40
64k | 3210418 | 210397954048 | 0.19 | 652348976128 | 0.59
128k | 18334873 | 2403188473856 | 2.19 | 3055537449984 | 2.78
256k | 3353330 | 879055339520 | 0.80 | 3934592789504 | 3.58
512k | 745567 | 390891831296 | 0.36 | 4325484620800 | 3.93
1M | 487975 | 511678873600 | 0.47 | 4837163494400 | 4.40
2M | 3091878 | 6484138131456 | 5.90 | 11321301625856 | 10.30
4M | 1694 | 7105150976 | 0.01 | 11328406776832 | 10.30
8M | 61 | 511705088 | 0.00 | 11328918481920 | 10.30
16M | 71 | 1191182336 | 0.00 | 11330109664256 | 10.30
32M | 36 | 1207959552 | 0.00 | 11331317623808 | 10.31
64M | 15 | 1006632960 | 0.00 | 11332324256768 | 10.31
Totals | 73,707,856 | 11,332,324,256,768 | 10.31

Histogram obtained by running this in the node folder:

find . -type f -print0 | xargs -0 -P 24 -n 100 stat -f "%z" | awk '{ n=int(log($1)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d\t%d\n", 2^i, size[i]) }' | sort -n
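
Note that stat -f "%z" is BSD/macOS syntax; on Linux with GNU coreutils the equivalent is stat -c "%s", so the same pipeline would look like:

find . -type f -print0 | xargs -0 -P 24 -n 100 stat -c "%s" | awk '{ n=int(log($1)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d\t%d\n", 2^i, size[i]) }' | sort -n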

Ok, I got it. If your pool is without redundancy, you perhaps do not have problems with IOPS. I have just seen a setup with one pool holding more than one node, and the nodes were crawling and failing left and right as if they were on a single SMR disk.

And I cannot suggest a better word. Perhaps for ZFS it’s a vdev; for other similar systems it’s usually the pool, sometimes a volume, sometimes a volume group. What I actually mean is that you should not use an entity which is presented to the OS as a single disk for multiple nodes.

Consumer SSDs will break :slight_smile:

You get a lot of write amplification on a special vdev (2x-10x), depending on load.


Before the tests started I was seeing about 10TB/month of writes on the special device (a consistent 3-4MB/s). And that’s before accounting for possible write amplification.

Consumer SSDs will get destroyed in a year tops.


No, write amplification is an issue all SSDs have :slight_smile:


Not exactly. SSDs write in blocks: if you write something smaller than a block, the whole block still gets written. There may or may not be some batching possible, but write amplification cannot be fully eliminated, because it’s unrealistic to expect that all writes are sector-sized and/or perfectly batched.


Just talking out loud: I wonder if using SSDs as metadata-only special devices (even with no small files)… even if they do hit their rated TBW quickly… is still worth it for all the IO they keep off the HDDs?

Like if you save at least one full-disk-scan filewalker per week: that’s a lot of head movement. Maybe your HDD could last an additional year (or more)?

And rated TBW is really just about the warranty for SSDs anyway: it’s about long-term power-off high-temperature data retention. If your SSD is online 24x7 then TBW doesn’t say much about how long it will last: I have some Samsungs that are at 300% of their rated life and still scrub fine.
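
If you want to see where a drive sits relative to its rated life, smartctl reports it directly; for example (device paths are placeholders):

# NVMe: the "Percentage Used" field goes past 100% once rated endurance is exceeded.
smartctl -A /dev/nvme0
# SATA SSDs expose similar vendor attributes, e.g. Wear_Leveling_Count on Samsung drives.
smartctl -A /dev/sda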


I don’t look at it from a wear standpoint; both SSDs and HDDs are consumables. Using SSDs, however, helps improve responsiveness and user experience. If an SSD perishes as a result, it was worth it: user experience has been improved.

I would prefer that it become read-only rather than a brick.

I think it’s something I’ll have to test for myself. I know Storj has a ton of small files (3-4 million per TB?) so I wouldn’t be surprised if it needed more metadata space. You mention 5GB/10TB, jomando mentioned 75GB/16TB, and elsewhere I’ve seen 120GB/100TB (all presumably for metadata only, not metadata + small files), which works out to roughly 0.5, 4.7 and 1.2 GB of metadata per TB respectively.

My sample scenario is a system with, say, 48 HDDs, 128GB RAM, and 2-4TB of mirrored SSD space for special metadata. I’d need to find out how thinly to chop up that 2-4TB so that each HDD hopefully gets enough for a metadata-only special device.

It’s not hard for me to test personally: I just need to copy my largest node into a properly set-up zpool to measure GB of metadata per TB of raw HDD and get an average number. Then I’d know that if a Storj node held 10TB, it would need about X GB of SSD space for metadata.
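
One way that measurement could work (a sketch only; pool, dataset, and device names are placeholders): build a scratch pool with a metadata-only special vdev, copy the node onto it, and read the special vdev’s allocation.

# Scratch pool: one HDD plus an SSD partition as a metadata-only special device.
zpool create scratch /dev/disk/by-id/test-hdd special /dev/disk/by-id/test-ssd-part1
zfs create scratch/node
# ...copy the node's files onto scratch/node, then:
zpool list -v scratch   # ALLOC on the special vdev line ~= metadata bytes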


I am planning on downsizing my setup and would appreciate some feedback.

Currently I am running 8 nodes on 8 HDDs with 128 GB of RAM and a cheap 8 TB caching SSD (sata).

I want to migrate that over to a Pi5. For fun I would like to try connecting all 8 HDDs to test out the limits but I don’t expect a single Pi5 to have enough resources for that. My expectation is that it can handle 4 drives and will get problems with the 5th drive. If that works out it only means I need a second Pi5 and can move all of my existing nodes to a setup that consumes less power and I can scale this up. If my friends want to run a storage node they get a Pi5 with a 5 bay enclosure and if that ever gets full just get another one. This would be the lowest power consumption I can think of.

So the challenge is to run as many drives as possible on the Pi5 with an acceptable success rate, and best case, garbage collection takes only a few hours per drive and not days. I want to do the stupid thing and test out how ZFS will perform in this situation. I don’t expect a miracle, but sync=disabled and a metadata cache sound like exactly what I need.

Here is what I want to set so far:

zfs set compression=on SN1            # enable compression
zfs set sync=disabled SN1             # treat sync writes as async (in-flight data can be lost on power failure)
zfs set atime=off SN1                 # skip access-time updates on reads
zfs set recordsize=1M SN1             # large records for mostly-large pieces
zfs set mountpoint=/mnt/sn1 SN1
zfs set primarycache=metadata SN1     # ARC caches metadata only, not file data
zfs set secondarycache=metadata SN1   # L2ARC (if present) caches metadata only

Now this would be with a 1TB NVMe SSD in the system. I could allocate 64GB per drive and use the rest for the OS, including the storage node DBs; I am doing that in my current setup with the caching SSD as well. The problem is that adding a drive is complicated because I would have to resize the partitions. How would a special vdev perform in comparison? It would remove the need to resize partitions, right? Any downsides?