Like you said, it will vary quite a bit depending on the file sizes. I do have many small files for US1.
At the moment I am at 46GB / 4TB for the metadata alone.
I think if you enable compression, the file size shouldn't matter?
I have 18 ZFS 11TB nodes and they are all full now, each one has a special device. Used space on the special device for a full 11TB node is around 55GB on all 18 nodes.
Awesome!
So, from an average node we can expect a ballpark of 5GB of metadata per 1TB of data.
This is a cumulative distribution of file sizes on my 10TB node, normalized to a TB:
Band | Count | TB (band) | TB (cumulative) | GB per TB (cumulative)
---|---|---|---|---
1k | 3658879 | 0.00 | 0.00 | 0.33
2k | 16479930 | 0.03 | 0.03 | 3.31
4k | 4054790 | 0.02 | 0.05 | 4.77
8k | 7106800 | 0.05 | 0.10 | 9.91
16k | 6244297 | 0.09 | 0.20 | 18.94
32k | 6937242 | 0.21 | 0.40 | 39.00
64k | 3210418 | 0.19 | 0.59 | 57.57
128k | 18334873 | 2.19 | 2.78 | 269.63
256k | 3353330 | 0.80 | 3.58 | 347.20
512k | 745567 | 0.36 | 3.93 | 381.69
1M | 487975 | 0.47 | 4.40 | 426.85
2M | 3091878 | 5.90 | 10.30 | 999.03
4M | 1694 | 0.01 | 10.30 | 999.65
8M | 61 | 0.00 | 10.30 | 999.70
16M | 71 | 0.00 | 10.30 | 999.80
32M | 36 | 0.00 | 10.31 | 999.91
64M | 15 | 0.00 | 10.31 | 1000.00
Totals | 73,707,856 | 10.31 | |
It follows that for 100TB worth of node data we'll need 500GB of space on the special device, and the remaining capacity can be filled with small files. Assuming a 2TB special device, we can set special_small_blocks=8K and still have about 500GB to spare on the special device.
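For reference, that threshold is set per dataset and only affects data written afterwards; a minimal sketch, with `tank/storj` as a placeholder pool/dataset name:

```sh
# Route all blocks of 8K or smaller to the special vdev
# (applies only to data written after the property is set)
zfs set special_small_blocks=8K tank/storj
```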
Neat!
I very much appreciate everyone's examples of GB-of-metadata vs TB-of-HDD. @00riddler's ratio would mean I'd need a lot more SSD space than I had planned.
Just to make sure I understand:
the storage2.database-dir in config.yaml is the path where all .db, .db-shm, and .db-wal files reside,
so it's enough to create a ZFS dataset, set this variable to the dataset mountpoint, and move all those files to the dataset. Correct?
I ask because I see there are a few other files there.
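If it helps, the steps described above would look roughly like this; a sketch only, with the pool/dataset name (`tank/storj-db`) and paths as placeholder examples:

```sh
# Create a dedicated dataset for the node databases
zfs create -o mountpoint=/mnt/storj-db tank/storj-db

# Stop the node first, then move the existing database files over
# (*.db* also catches the .db-shm and .db-wal files)
mv /path/to/storage/*.db* /mnt/storj-db/

# Then point the node at the new location in config.yaml:
# storage2.database-dir: /mnt/storj-db
```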
I also wonder about the SQLite database page size that is used. The ZFS recommendation for SQLite is to set the db page size to 64KB and use a recordsize of 64KB on the ZFS dataset for best performance. The default page size for SQLite is 4KB, and I verified that is the page size used by the storj databases: `for i in /var/storj/*.db ; do echo ${i}; sqlite3 ${i} "PRAGMA page_size;" ; done`
So I wonder whether setting the ZFS recordsize to 4KB would lead to an improvement or not. (4KB is really small, so it might just increase ZFS overhead.)
Alternatively, changing the page size of the storj databases to 64KB could be an improvement; however, increasing the page size can lead to more db lock contention, since locking is done on a per-page basis.
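If anyone wants to experiment with that: changing the page size only takes effect on a VACUUM, and since these databases use WAL journaling the mode has to be switched first. A rough sketch for a single db file, with the node stopped (the filename is just an example):

```sh
sqlite3 /var/storj/bandwidth.db <<'EOF'
PRAGMA journal_mode=DELETE;  -- page size cannot be changed while in WAL mode
PRAGMA page_size=65536;      -- takes effect on the next VACUUM
VACUUM;
PRAGMA journal_mode=WAL;     -- switch back to WAL afterwards
EOF
```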
Hmm, yes, that is what I said, so you concur?
Not sure about that. I found recommendations (on Oracle's site and others) that the recordsize for MySQL should be 16KB and for PostgreSQL 8KB.
And as I verified, SQLite uses 4KB by default.
Have you verified that the db record size is a bottleneck here?
Disable sync and disable access-time updates; that should be enough. Storj barely writes to the databases, so penny-pinching on record sizes is unnecessary.
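For the record, on ZFS that is two property changes; a minimal sketch, with `tank/storj` as a placeholder dataset name:

```sh
zfs set sync=disabled tank/storj   # skip synchronous writes (risks losing the last few seconds of writes on power loss)
zfs set atime=off tank/storj       # stop rewriting metadata on every read
```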
Yes, I guess those DBs are comparatively small.
I only have about 2.5GB per TB. This is my config:
```
zpool create -o ashift=12 -O compress=lz4 -O atime=off -O primarycache=metadata -O sync=disabled -m /storj/nd6 -O xattr=off -O redundant_metadata=some -O recordsize=512k storjdata6 /dev/sdg
zpool add storjdata6 -o ashift=12 special mirror /dev/disk/by-partlabel/STORJ6-METAD /dev/disk/by-partlabel/STORJ6-META -f
```
And even half of it hasn't been used:
```
root@VM-HOST:~# zpool list -v
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
storjdata11 479G 67.8G 411G - - 0% 14% 1.00x ONLINE -
sdm 477G 67.6G 408G - - 0% 14.2% - ONLINE
special - - - - - - - - -
mirror-1 2.75G 140M 2.61G - - 2% 4.96% - ONLINE
STORJ11-METAD 3G - - - - - - - ONLINE
STORJ11-META 3G - - - - - - - ONLINE
storjdata16 479G 233G 246G - - 0% 48% 1.00x ONLINE -
sdl 477G 232G 244G - - 0% 48.8% - ONLINE
special - - - - - - - - -
mirror-1 2.75G 636M 2.13G - - 6% 22.6% - ONLINE
STORJ16-METAD 3G - - - - - - - ONLINE
STORJ16-META 3G - - - - - - - ONLINE
storjdata6 932G 800G 133G - - 0% 85% 1.00x ONLINE -
sdg 932G 798G 130G - - 0% 86.0% - ONLINE
special - - - - - - - - -
mirror-1 4.50G 1.77G 2.73G - - 55% 39.3% - ONLINE
STORJ6-METAD 5G - - - - - - - ONLINE
STORJ6-META 5G - - - - - - - ONLINE
```
Some of my node DBs are over 2.5 GB, so not very small.
Databases on special device / separate pool, whether it is ZFS or any other filesystem?
Coming back to the 48-disk example: am I crazy if I run 48 disks in ONE pool (separate vdevs, no raid) and let it run with a special dev (in mirror)? 20 or more nodes running, and when one HDD breaks, just replace it. Would losing one disk out of 48 be enough to disqualify the nodes, or would it be fine with some repairs?
Yes, that's effectively RAID0 with 48 drives.
I'm not sure if this is what you're describing too, but my plan for 48 HDDs is to have a pair of SSDs… and for each HDD carve off a specific pair of partitions (one on each SSD) to mirror together as a metadata-only special device… for each single disk.
Like if the rule of thumb is 5GB-SSD-per-1TB-HDD… then for a 10TB HDD I'd fdisk each SSD to carve off a 50GB partition… and when I build that single-disk pool I'd specify those partitions to mirror each other.
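Something like this per disk, as a sketch only; the device names, partition labels, and pool name are made up for illustration:

```sh
# One pool per HDD, with a mirrored metadata-only special vdev
# built from one 50GB partition on each of the two SSDs
zpool create -o ashift=12 storj01 /dev/sdb
zpool add storj01 special mirror \
    /dev/disk/by-partlabel/STORJ01-META-A \
    /dev/disk/by-partlabel/STORJ01-META-B
```

Repeated for each of the 48 disks, pairing the next partition on each SSD.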
That would mean if you lost both SSDs… you'd lose the data on all 48 HDDs. Either you think that's unlikely and that you could stay on top of any repairs… or you should consider a 3-way mirror?
Exactly, 48 drives in RAID0. What I'm curious to understand is whether losing one drive would be safe and repaired by the Storj ecosystem. A loss of 20TB out of a total of 960.
If you really meant a 48-drive RAID0, then losing one disk means losing all 48 drives' worth of data, doesn't it? Storj-the-network won't care, but your earnings sure would!
Every drive is a vdev in the pool… still losing everything?
PS: I'm giving too much importance to the answers ChatGPT gives me.
Yes, there is no redundancy in a vdev with only one physical disk, and losing any vdev takes the whole pool with it.
At least run vdevs of 4 or 5 disks with raidz.
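For example (a sketch only, with invented device names), the 48 disks could instead be grouped into twelve 4-disk raidz vdevs in one pool:

```sh
# Each raidz vdev survives one disk failure; one disk of capacity per vdev goes to parity
zpool create tank \
    raidz /dev/sda /dev/sdb /dev/sdc /dev/sdd \
    raidz /dev/sde /dev/sdf /dev/sdg /dev/sdh
# ...and so on for the remaining disks
```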