Notes on storage node performance optimization on ZFS

Like you said, it will vary quite a bit depending on the file sizes. I do have many small files for US1.
At the moment I am at 46GB per 4TB for the metadata alone.


I think if you enabled compression, the file size shouldn’t matter?

I have 18 ZFS 11TB nodes and they are all full now; each one has a special device. Used space on the special device for a full 11TB node is around 55GB on all 18 nodes.
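That figure can be read straight from the pool listing; a minimal check, assuming a hypothetical pool name:

# per-vdev allocation, including the special mirror
zpool list -v storjdata1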


Awesome!

So, for an average node we can expect a ballpark figure of 5GB of metadata per 1TB of data.

This is a cumulative distribution of file sizes on my 10TB node, normalized to a TB:
Band     Count        TB (band)   TB (cumulative)   GB per TB (cumulative)
1k         3658879         0.00         0.00            0.33
2k        16479930         0.03         0.03            3.31
4k         4054790         0.02         0.05            4.77
8k         7106800         0.05         0.10            9.91
16k        6244297         0.09         0.20           18.94
32k        6937242         0.21         0.40           39.00
64k        3210418         0.19         0.59           57.57
128k      18334873         2.19         2.78          269.63
256k       3353330         0.80         3.58          347.20
512k        745567         0.36         3.93          381.69
1M          487975         0.47         4.40          426.85
2M         3091878         5.90        10.30          999.03
4M            1694         0.01        10.30          999.65
8M              61         0.00        10.30          999.70
16M             71         0.00        10.30          999.80
32M             36         0.00        10.31          999.91
64M             15         0.00        10.31         1000.00
Totals    73,707,856                   10.31
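A distribution like this can be generated straight from the blobs directory; a rough sketch, assuming GNU find and awk and a hypothetical blobs path (not necessarily how the table above was produced):

# bucket every piece file into power-of-two size bands and report count and total GB per band
find /storj/node/storage/blobs -type f -printf '%s\n' | \
  awk '{ b = 1024; while ($1 > b) b *= 2; count[b]++; bytes[b] += $1 }
       END { for (b in count) printf "%10d %12d %10.2f\n", b, count[b], bytes[b]/2^30 }' | \
  sort -n
# columns: band upper bound in bytes, file count, total GB in that band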

It follows that for 100TB worth of node data we’ll need about 500GB of space on the special device, and the remaining size can be filled with small files. Assuming a 2TB special device, we can set special_small_blocks=8K and still have about 500GB to spare on the special device.
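A minimal sketch of that setting, assuming a hypothetical pool/dataset name (special_small_blocks is a standard ZFS dataset property; blocks at or below the threshold are stored on the special vdev alongside the metadata):

# store metadata plus all blocks of 8K or smaller on the special vdev
zfs set special_small_blocks=8K storjpool/node1
# confirm it took effect
zfs get special_small_blocks storjpool/node1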

Neat!


I very much appreciate everyone’s examples of GB-of-metadata vs TB-of-HDD. @00riddler’s ratio would mean I’d need a lot more SSD space than I had planned.


Just to make sure I understand:
the storage2.database-dir in config.yaml is the path where all the .db, .db-shm, and .db-wal files reside,
so it’s enough to create a ZFS dataset, set this variable to the dataset mountpoint, and move all those files to the dataset. Correct?
I ask because I see there are a few other files there.
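If that understanding is right, the move would look roughly like this; a sketch assuming the databases currently live in /var/storj and a hypothetical dataset name (stop the node before moving anything):

# dedicated dataset for the node databases (hypothetical names)
zfs create -o atime=off tank/storj-db
# with the node stopped, move the database files over
mv /var/storj/*.db /var/storj/*.db-shm /var/storj/*.db-wal /tank/storj-db/ 2>/dev/null
# then point the node at the new location in config.yaml:
#   storage2.database-dir: /tank/storj-db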

I wonder also about the SQLite database page size that is used. The ZFS recommendation for SQLite is to set the DB page size to 64KB and use a recordsize of 64KB on the ZFS dataset for best performance. The default page size for SQLite is 4KB, and I verified that is the page size used by the Storj databases: ( for i in /var/storj/*.db ; do echo ${i}; sqlite3 ${i} "PRAGMA page_size;" ; done )
So I wonder whether setting the ZFS recordsize to 4KB would lead to an improvement or not (4KB is really small, so I wonder if it would increase ZFS overhead).
Alternatively, changing the page size of the Storj databases to 64KB could be an improvement; however, increasing the page size can lead to increased DB lock contention, since locking is done on a per-page basis.
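Both directions can be tried; a hedged sketch, assuming the databases sit on a dedicated dataset (hypothetical name tank/storj-db) and the node is stopped while any database is rewritten:

# option 1: match the dataset recordsize to SQLite's default 4K page size
zfs set recordsize=4K tank/storj-db

# option 2: rewrite a database with a 64K page size and use a 64K recordsize.
# The page size cannot change while the database is in WAL mode, so switch the
# journal mode around the VACUUM (shown for one database; repeat for the rest).
sqlite3 /tank/storj-db/bandwidth.db "PRAGMA journal_mode=DELETE; PRAGMA page_size=65536; VACUUM; PRAGMA journal_mode=WAL;"
zfs set recordsize=64K tank/storj-db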


Hmm, yes, that is what I said, so you concur?

Not sure about that. I found recommendations (on Oracle’s site and others) that the recordsize should be 16KB for MySQL and 8KB for PostgreSQL,
and, as I verified, SQLite uses 4KB by default.

Have you verified that the db record size is a bottleneck here?

Disable sync and disable access time updates; that should be enough. Storj barely writes to the databases, so penny-pinching on record sizes is unnecessary.
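Both of those are plain ZFS dataset properties; a minimal sketch, assuming a hypothetical dataset name:

# no synchronous writes and no access-time updates for the databases dataset
zfs set sync=disabled tank/storj-db
zfs set atime=off tank/storj-db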


Yes, I guess those DBs are comparatively small.

I only have about 2.5GB per TB. This is my config:

zpool create -o ashift=12 -O compress=lz4 -O atime=off -O primarycache=metadata -O sync=disabled -m /storj/nd6 -O xattr=off -O redundant_metadata=some -O recordsize=512k storjdata6 /dev/sdg

zpool add -f -o ashift=12 storjdata6 special mirror /dev/disk/by-partlabel/STORJ6-METAD /dev/disk/by-partlabel/STORJ6-META

And not even half of it has been used:

root@VM-HOST:~# zpool list -v
NAME                SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
storjdata11         479G  67.8G   411G        -         -     0%    14%  1.00x    ONLINE  -
  sdm               477G  67.6G   408G        -         -     0%  14.2%      -    ONLINE
special                -      -      -        -         -      -      -      -  -
  mirror-1         2.75G   140M  2.61G        -         -     2%  4.96%      -    ONLINE
    STORJ11-METAD     3G      -      -        -         -      -      -      -    ONLINE
    STORJ11-META      3G      -      -        -         -      -      -      -    ONLINE
storjdata16         479G   233G   246G        -         -     0%    48%  1.00x    ONLINE  -
  sdl               477G   232G   244G        -         -     0%  48.8%      -    ONLINE
special                -      -      -        -         -      -      -      -  -
  mirror-1         2.75G   636M  2.13G        -         -     6%  22.6%      -    ONLINE
    STORJ16-METAD     3G      -      -        -         -      -      -      -    ONLINE
    STORJ16-META      3G      -      -        -         -      -      -      -    ONLINE
storjdata6          932G   800G   133G        -         -     0%    85%  1.00x    ONLINE  -
  sdg               932G   798G   130G        -         -     0%  86.0%      -    ONLINE
special                -      -      -        -         -      -      -      -  -
  mirror-1         4.50G  1.77G  2.73G        -         -    55%  39.3%      -    ONLINE
    STORJ6-METAD      5G      -      -        -         -      -      -      -    ONLINE
    STORJ6-META       5G      -      -        -         -      -      -      -    ONLINE

Some of my node DBs are over 2.5GB, so not very small.

Databases on special device / separate pool, whether it is ZFS or any other filesystem?


Coming back to the 48-disk example: am I crazy if I run 48 disks in ONE pool (separate vdevs, no RAID) and let it run with a special dev (in mirror)? 20 or more nodes running, and when one HDD breaks, just replace it. Would losing one disk out of 48 be enough to disqualify the nodes, or would it be fine with some repairs?

Yes, that’s effectively RAID0 with 48 drives.


I’m not sure if this is what you’re describing too, but my plan for 48 HDDs is to have a pair of SSDs… and for each HDD carve off a specific pair of partitions (one on each SSD) to mirror together as a metadata-only special device… for each single disk.

Like if the rule-of-thumb is 5GB-SSD-per-1TB-HDD… then for a 10TB HDD I’d fdisk each SSD to carve off a 50GB partition… and when I build that single-disk pool I’d specify those partitions to mirror each other.
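Per disk, that layout might look roughly like this; a sketch with hypothetical device and partition labels:

# single-HDD pool with a special vdev mirrored across one partition from each SSD
zpool create -o ashift=12 -O atime=off -O recordsize=512k storj01 /dev/disk/by-id/HDD01
zpool add -o ashift=12 storj01 special mirror /dev/disk/by-partlabel/STORJ01-META-A /dev/disk/by-partlabel/STORJ01-META-B
# metadata only: leave special_small_blocks at its default of 0 so no file data lands on the SSDs
zfs set special_small_blocks=0 storj01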

That would mean if you lost both SSDs… you’d lose the data on all 48 HDDs. Either you think that’s unlikely and that you could stay on top of any repairs… or you should consider a 3-way mirror?

Exactly, 48 drives in RAID0. What I’m curious to understand is whether one lost drive would be safe and repaired by the Storj ecosystem: a loss of 20TB out of a total of 960TB.


If you really meant 48-drive-RAID0 then losing one disk means losing all 48 worth of data, doesn’t it? Storj-the-network won’t care: but your earnings sure would!


Every drive is a vdev in the pool… still losing everything?

PS: I’m giving too much importance to the answers ChatGPT gives me :)

Yes, there is no redundancy in a vdev with only one physical disk.
At least run vdevs of 4 or 5 disks with raidz.
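For comparison, a raidz layout along those lines might look like this; a sketch with hypothetical device names (each 5-disk raidz vdev tolerates one disk failure):

# pool built from 5-disk raidz vdevs instead of 48 single-disk vdevs
zpool create -o ashift=12 tank raidz /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde
zpool add tank raidz /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
# ...repeat for the remaining disks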
