Notes on storage node performance optimization on ZFS

Like you said, it will vary quite a bit depending on the file sizes. I do have many small files for US1.
At the moment I am at 46GB per 4TB for the metadata alone.


I think if you enabled compression, the file size shouldn’t matter?

I have 18 ZFS 11TB nodes and they are all full now; each one has a special device. Used space on the special device for a full 11TB node is around 55GB on all 18 nodes.
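That figure can be read straight from the pool listing; a minimal check, assuming a hypothetical pool name:

# per-vdev allocation, including the special mirror
zpool list -v storjdata1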


Awesome!

So, for an average node we can expect a ballpark figure of 5GB of metadata per 1TB of data.

This is a cumulative distribution of file sizes on my 10TB node, normalized to a TB:
Band     Count        TB (band)   TB (cumulative)   GB per TB (cumulative)
1k         3658879         0.00         0.00            0.33
2k        16479930         0.03         0.03            3.31
4k         4054790         0.02         0.05            4.77
8k         7106800         0.05         0.10            9.91
16k        6244297         0.09         0.20           18.94
32k        6937242         0.21         0.40           39.00
64k        3210418         0.19         0.59           57.57
128k      18334873         2.19         2.78          269.63
256k       3353330         0.80         3.58          347.20
512k        745567         0.36         3.93          381.69
1M          487975         0.47         4.40          426.85
2M         3091878         5.90        10.30          999.03
4M            1694         0.01        10.30          999.65
8M              61         0.00        10.30          999.70
16M             71         0.00        10.30          999.80
32M             36         0.00        10.31          999.91
64M             15         0.00        10.31         1000.00
Totals    73,707,856                   10.31
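A distribution like this can be generated straight from the blobs directory; a rough sketch, assuming GNU find and awk and a hypothetical blobs path (not necessarily how the table above was produced):

# bucket every piece file into power-of-two size bands and report count and total GB per band
find /storj/node/storage/blobs -type f -printf '%s\n' | \
  awk '{ b = 1024; while ($1 > b) b *= 2; count[b]++; bytes[b] += $1 }
       END { for (b in count) printf "%10d %12d %10.2f\n", b, count[b], bytes[b]/2^30 }' | \
  sort -n
# columns: band upper bound in bytes, file count, total GB in that band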

It follows that for 100TB worth of node data we’ll need about 500GB of space on the special device, and the remaining size can be filled with small files. Assuming a 2TB special device, we can set special_small_blocks=8K and still have about 500GB to spare on the special device.
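A minimal sketch of that setting, assuming a hypothetical pool/dataset name (special_small_blocks is a standard ZFS dataset property; blocks at or below the threshold are stored on the special vdev alongside the metadata):

# store metadata plus all blocks of 8K or smaller on the special vdev
zfs set special_small_blocks=8K storjpool/node1
# confirm it took effect
zfs get special_small_blocks storjpool/node1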

Neat!


I very much appreciate everyone’s examples of GB-of-metadata vs TB-of-HDD. @00riddler’s ratio would mean I’d need a lot more SSD space than I had planned.


Just to make sure I understand:
the storage2.database-dir in config.yaml is the path where all the .db, .db-shm, and .db-wal files reside,
so it’s enough to create a ZFS dataset, set this variable to the dataset mountpoint, and move all those files to the dataset. Correct?
I ask because I see there are a few other files there.
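If that understanding is right, the move would look roughly like this; a sketch assuming the databases currently live in /var/storj and a hypothetical dataset name (stop the node before moving anything):

# dedicated dataset for the node databases (hypothetical names)
zfs create -o atime=off tank/storj-db
# with the node stopped, move the database files over
mv /var/storj/*.db /var/storj/*.db-shm /var/storj/*.db-wal /tank/storj-db/ 2>/dev/null
# then point the node at the new location in config.yaml:
#   storage2.database-dir: /tank/storj-db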

I wonder also about the SQLite database page size that is used. The ZFS recommendation for SQLite is to set the DB page size to 64KB and use a recordsize of 64KB on the ZFS dataset for best performance. The default page size for SQLite is 4KB, and I verified that is the page size used by the Storj databases: ( for i in /var/storj/*.db ; do echo ${i}; sqlite3 ${i} "PRAGMA page_size;" ; done )
So I wonder whether setting the ZFS recordsize to 4KB would lead to an improvement or not (4KB is really small, so I wonder if it would increase ZFS overhead).
Alternatively, changing the page size of the Storj databases to 64KB could be an improvement; however, increasing the page size can lead to increased DB lock contention, since locking is done on a per-page basis.
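Both directions can be tried; a hedged sketch, assuming the databases sit on a dedicated dataset (hypothetical name tank/storj-db) and the node is stopped while any database is rewritten:

# option 1: match the dataset recordsize to SQLite's default 4K page size
zfs set recordsize=4K tank/storj-db

# option 2: rewrite a database with a 64K page size and use a 64K recordsize.
# The page size cannot change while the database is in WAL mode, so switch the
# journal mode around the VACUUM (shown for one database; repeat for the rest).
sqlite3 /tank/storj-db/bandwidth.db "PRAGMA journal_mode=DELETE; PRAGMA page_size=65536; VACUUM; PRAGMA journal_mode=WAL;"
zfs set recordsize=64K tank/storj-db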


Hmm, yes, that is what I said, so you concur?

Not sure about that. I found recommendations (on Oracle’s site and others) that the recordsize should be 16KB for MySQL and 8KB for PostgreSQL,
and, as I verified, SQLite uses 4KB by default.

Have you verified that the db record size is a bottleneck here?

Disable sync and disable access time updates; that should be enough. Storj barely writes to the databases, so penny-pinching on record sizes is unnecessary.
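Both of those are plain ZFS dataset properties; a minimal sketch, assuming a hypothetical dataset name:

# no synchronous writes and no access-time updates for the databases dataset
zfs set sync=disabled tank/storj-db
zfs set atime=off tank/storj-db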


Yes, I guess those DBs are comparatively small.

I only have about 2.5GB per TB. This is my config:

zpool create -o ashift=12 -O compress=lz4 -O atime=off -O primarycache=metadata -O sync=disabled -m /storj/nd6 -O xattr=off -O redundant_metadata=some -O recordsize=512k storjdata6 /dev/sdg

zpool add -f -o ashift=12 storjdata6 special mirror /dev/disk/by-partlabel/STORJ6-METAD /dev/disk/by-partlabel/STORJ6-META

And not even half of it has been used:

root@VM-HOST:~# zpool list -v
NAME                SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
storjdata11         479G  67.8G   411G        -         -     0%    14%  1.00x    ONLINE  -
  sdm               477G  67.6G   408G        -         -     0%  14.2%      -    ONLINE
special                -      -      -        -         -      -      -      -  -
  mirror-1         2.75G   140M  2.61G        -         -     2%  4.96%      -    ONLINE
    STORJ11-METAD     3G      -      -        -         -      -      -      -    ONLINE
    STORJ11-META      3G      -      -        -         -      -      -      -    ONLINE
storjdata16         479G   233G   246G        -         -     0%    48%  1.00x    ONLINE  -
  sdl               477G   232G   244G        -         -     0%  48.8%      -    ONLINE
special                -      -      -        -         -      -      -      -  -
  mirror-1         2.75G   636M  2.13G        -         -     6%  22.6%      -    ONLINE
    STORJ16-METAD     3G      -      -        -         -      -      -      -    ONLINE
    STORJ16-META      3G      -      -        -         -      -      -      -    ONLINE
storjdata6          932G   800G   133G        -         -     0%    85%  1.00x    ONLINE  -
  sdg               932G   798G   130G        -         -     0%  86.0%      -    ONLINE
special                -      -      -        -         -      -      -      -  -
  mirror-1         4.50G  1.77G  2.73G        -         -    55%  39.3%      -    ONLINE
    STORJ6-METAD      5G      -      -        -         -      -      -      -    ONLINE
    STORJ6-META       5G      -      -        -         -      -      -      -    ONLINE

Some of my node DBs are over 2.5GB, so not very small.

Databases on special device / separate pool, whether it is ZFS or any other filesystem?


Coming back to the 48-disk example: am I crazy if I run 48 disks in ONE pool (separate vdevs, no RAID) and let it run with a special dev (in mirror)? 20 or more nodes running, and when one HDD breaks, just replace it. Would losing one disk out of 48 be enough to disqualify the nodes, or would it be fine with some repairs?

Yes, that’s effectively RAID0 with 48 drives.


I’m not sure if this is what you’re describing too, but my plan for 48 HDDs is to have a pair of SSDs… and for each HDD carve off a specific pair of partitions (one on each SSD) to mirror together as a metadata-only special device… for each single disk.

Like if the rule-of-thumb is 5GB-SSD-per-1TB-HDD… then for a 10TB HDD I’d fdisk each SSD to carve off a 50GB partition… and when I build that single-disk pool I’d specify those partitions to mirror each other.
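Per disk, that layout might look roughly like this; a sketch with hypothetical device and partition labels:

# single-HDD pool with a special vdev mirrored across one partition from each SSD
zpool create -o ashift=12 -O atime=off -O recordsize=512k storj01 /dev/disk/by-id/HDD01
zpool add -o ashift=12 storj01 special mirror /dev/disk/by-partlabel/STORJ01-META-A /dev/disk/by-partlabel/STORJ01-META-B
# metadata only: leave special_small_blocks at its default of 0 so no file data lands on the SSDs
zfs set special_small_blocks=0 storj01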

That would mean if you lost both SSDs… you’d lose the data on all 48 HDDs. Either you think that’s unlikely and that you could stay on top of any repairs… or you should consider a 3-way mirror?

Exactly, 48 drives in RAID0. What I’m curious to understand is whether one lost drive would be safe and repaired by the Storj ecosystem: a loss of 20TB out of a total of 960TB.


If you really meant 48-drive-RAID0 then losing one disk means losing all 48 worth of data, doesn’t it? Storj-the-network won’t care: but your earnings sure would!


Every drive is a vdev in the pool… still losing everything?

PS: I’m giving too much importance to the answers ChatGPT gives me :)

Yes, there is no redundancy in a vdev with only one physical disk.
At least run vdevs of 4 or 5 disks with raidz.
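For comparison, a raidz layout along those lines might look like this; a sketch with hypothetical device names (each 5-disk raidz vdev tolerates one disk failure):

# pool built from 5-disk raidz vdevs instead of 48 single-disk vdevs
zpool create -o ashift=12 tank raidz /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde
zpool add tank raidz /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
# ...repeat for the remaining disks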
