From my measurements on my old nodes (comparing the cumulative size of “zfs plain file” objects to everything else reported by zdb), metadata takes about 2.6% of the node's used space.
With the recent massive traffic ingress, which looks to be mostly large objects, metadata takes about 0.06% (this is from a node I started a few months ago; it has almost no “normal” data, all saltlake).
So for a 20TB node you would want about 500GB for metadata. And if you want to accept small files there as well, you'd want to find one of the histograms posted on the forum to assess the additional size requirements.
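(Roughly, the arithmetic: 20 TB x 2.6% ≈ 520 GB with the old mixed-traffic ratio, versus 20 TB x 0.06% ≈ 12 GB if a node only ever saw the recent large-object traffic.)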
You can find Intel P3600 2TB SSDs on eBay quite cheap; perhaps that overkill would be the way to go.
It also depends on the sector size and on the redundancy of the metadata.
root@T8PLUS-N100:~# zpool list -v
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
storjdata12 2.73T 2.35T 388G - - 0% 86% 1.00x ONLINE -
STORJ12-ZFS 2.73T 2.35T 379G - - 0% 86.4% - ONLINE
special - - - - - - - - -
mirror-1 14.5G 5.65G 8.85G - - 52% 39.0% - ONLINE
STORJ12-METAD 15G - - - - - - - ONLINE
STORJ12-META 15G - - - - - - - ONLINE
storjdata19 4.57T 3.68T 908G - - 8% 80% 1.00x ONLINE -
STORJ19-ZFS 4.55T 3.68T 892G - - 8% 80.8% - ONLINE
special - - - - - - - - -
mirror-1 24.5G 7.67G 16.8G - - 76% 31.3% - ONLINE
STORJ19-META 25G - - - - - - - ONLINE
STORJ19-METAD 25G - - - - - - - ONLINE
storjdata21 2.73T 911G 1.84T - - 0% 32% 1.00x ONLINE -
STORJ21-ZFS 2.73T 910G 1.83T - - 0% 32.7% - ONLINE
special - - - - - - - - -
mirror-1 14.5G 1.86G 12.6G - - 47% 12.8% - ONLINE
STORJ21-METAD 15G - - - - - - - ONLINE
STORJ21-META 15G - - - - - - - ONLINE
storjdata3 4.57T 1.86T 2.71T - - 0% 40% 1.00x ONLINE -
STORJ3-ZFS 4.55T 1.85T 2.69T - - 0% 40.8% - ONLINE
special - - - - - - - - -
mirror-1 24.5G 5.28G 19.2G - - 65% 21.6% - ONLINE
STORJ3-METAD 25G - - - - - - - ONLINE
STORJ3-META 25G - - - - - - - ONLINE
storjdata5 3.64T 2.25T 1.39T - - 8% 61% 1.00x ONLINE -
zfs-1603f406057ea458 3.64T 2.25T 1.38T - - 8% 62.0% - ONLINE
special - - - - - - - - -
mirror-1 19.5G 6.20G 13.3G - - 69% 31.8% - ONLINE
STORJ5-METAD 20G - - - - - - - ONLINE
STORJ5-META 20G - - - - - - - ONLINE
storjdata7 932G 808G 125G - - 32% 86% 1.00x ONLINE -
zfs-15e729ab5c4ed5ed 931G 805G 123G - - 32% 86.7% - ONLINE
special - - - - - - - - -
mirror-1 4.50G 2.77G 1.73G - - 84% 61.5% - ONLINE
STORJ7-METAD 5G - - - - - - - ONLINE
STORJ7-META 5G - - - - - - - ONLINE
storjdata8 4.57T 981G 3.61T - - 0% 20% 1.00x ONLINE -
STORJ8-ZFS 4.55T 978G 3.59T - - 0% 21.0% - ONLINE
special - - - - - - - - -
mirror-1 24.5G 2.38G 22.1G - - 51% 9.72% - ONLINE
STORJ8-META 25G - - - - - - - ONLINE
STORJ8-METAD 25G - - - - - - - ONLINE
This is all with sector size 512kB, and redundant_metadata=some. So a factor of 10 lower than yours.
This is how I create them:
zpool create -o ashift=12 -O compress=lz4 -O atime=off -O primarycache=metadata -O sync=disabled -m /storj/nd12 -O xattr=off -O redundant_metadata=some -O recordsize=512k storjdata12 /dev/sdc2
zpool add storjdata12 -o ashift=12 special mirror /dev/disk/by-partlabel/STORJ12-METAD /dev/disk/by-partlabel/STORJ12-META -f
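For reference, a quick way to check afterwards that metadata actually lands on the special mirror (same pool name as in the example above):
zpool list -v storjdata12      # ALLOC on the special mirror; with special_small_blocks left at its default of 0, this is essentially just metadata
zpool iostat -v storjdata12 5  # watch how much I/O hits the special mirror versus the data disk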
In my case I would allocate 100G, x2 for the mirroring.
Record size is irrelevant; most storagenode files are smaller than the default record size.
Your nodes seem to be quite small, and hence new, and therefore mostly contain test data from saltlake, which means large files; this gets you the classic 0.3% metadata utilization, which is also close to what I described in my second paragraph above.
Actually, I had one with a 128kB sector size, on which the metadata was 3x larger. Aside from that, I don't see any difference compared to the other nodes that were already filled to the brim when the test began:
root@VM-HOST:~# zpool list -v
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
storjdata10 2.73T 2.31T 431G - - 10% 84% 1.00x ONLINE -
zfs-0e9b458b17d1abae 2.73T 2.31T 423G - - 10% 84.8% - ONLINE
special - - - - - - - - -
mirror-1 14.5G 6.76G 7.74G - - 65% 46.6% - ONLINE
STORJ10-METAD 15G - - - - - - - ONLINE
STORJ10-META 15G - - - - - - - ONLINE
storjdata11 479G 402G 76.5G - - 40% 84% 1.00x ONLINE -
zfs-46912d350c39ebb9 477G 401G 75.0G - - 40% 84.2% - ONLINE
special - - - - - - - - -
mirror-1 2.75G 1.23G 1.52G - - 74% 44.9% - ONLINE
STORJ11-METAD 3G - - - - - - - ONLINE
STORJ11-META 3G - - - - - - - ONLINE
storjdata16 479G 399G 79.4G - - 31% 83% 1.00x ONLINE -
zfs-685857c77212e77b 477G 398G 77.8G - - 31% 83.6% - ONLINE
special - - - - - - - - -
mirror-1 2.75G 1.15G 1.60G - - 65% 41.9% - ONLINE
STORJ16-METAD 3G - - - - - - - ONLINE
STORJ16-META 3G - - - - - - - ONLINE
storjdata18 2.73T 2.34T 405G - - 8% 85% 1.00x ONLINE -
zfs-13156e834a813ec9 2.73T 2.33T 396G - - 8% 85.8% - ONLINE
special - - - - - - - - -
mirror-1 14.5G 5.15G 9.35G - - 70% 35.5% - ONLINE
STORJ18-METAD 15G - - - - - - - ONLINE
STORJ18-META 15G - - - - - - - ONLINE
storjdata22 932G 798G 135G - - 8% 85% 1.00x ONLINE -
zfs-7627b31b68853fd3 932G 796G 132G - - 8% 85.8% - ONLINE
special - - - - - - - - -
mirror-1 4.50G 1.59G 2.91G - - 70% 35.4% - ONLINE
STORJ22-METAD 5G - - - - - - - ONLINE
STORJ22-META 5G - - - - - - - ONLINE
storjdata4 2.73T 2.30T 442G - - 25% 84% 1.00x ONLINE -
zfs-07f56700262ef24b 2.73T 2.30T 433G - - 25% 84.4% - ONLINE
special - - - - - - - - -
mirror-1 14.5G 6.03G 8.47G - - 68% 41.6% - ONLINE
STORJ4-METAD 15G - - - - - - - ONLINE
STORJ4-META 15G - - - - - - - ONLINE
storjdata6 1.37T 1.19T 179G - - 3% 87% 1.00x ONLINE -
zfs-1391287a2e5fd9d2 1.36T 1.19T 175G - - 3% 87.4% - ONLINE
special - - - - - - - - -
mirror-1 7.50G 3.60G 3.90G - - 66% 48.0% - ONLINE
STORJ6-METAD 8G - - - - - - - ONLINE
STORJ6-META 8G - - - - - - - ONLINE
storjdata9 1.37T 1.16T 216G - - 30% 84% 1.00x ONLINE -
zfs-ace96bd612c38442 1.36T 1.15T 212G - - 30% 84.7% - ONLINE
special - - - - - - - - -
mirror-1 7.50G 4.32G 3.18G - - 74% 57.6% - ONLINE
STORJ9-METAD 8G - - - - - - - ONLINE
STORJ9-META 8G - - - - - - - ONLINE
This is quite unexpected.
Can you run this on that dataset?
find . -type f -print0 | xargs -0 -P 40 -n 100 stat -f "%z" | awk '{ n=int(log($1)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n
I see that the overwhelming majority of files are under 128k, even on the node that mostly contains saltlake data:
1024 43142
2048 2187575
4096 38914
8192 21722
16384 14521
32768 17796
65536 12215
131072 2199374
262144 52161
524288 2928
1048576 2738
2097152 8058
No, after discovering it I quickly ran a send/receive to a 512kB sector size pool. After that, the metadata ratio normalized to the same ratio as on all other pools.
Other sources also find a relation between sector size and metadata size (there is a Reddit thread on this).
Yes, of course there is a relation. It just shouldn't apply to a node, because most files are already smaller than the record size.
That command prints the distribution of file sizes; you can still run it. If you have a significant number of files over 128k, that would explain what you see. But such a distribution would be unexpected for a node, and that would deserve investigation.
find . -type f -print0 | xargs -0 -P 40 -n 100 stat -f "%z" | awk '{ n=int(log($1)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n
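(For reference on how to read the output: the awk part buckets each file size by its power of two, n = int(log(size)/log(2)) with a minimum bucket of 2^10 = 1024, and then prints “bucket count” pairs sorted numerically, so the first column is the lower bound of each size bucket.)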
Doesn’t work, a lot of stat errors…
Rewrote it to:
for i in /storj/n*; do echo $'\n\n'$i; find $i -type f -print0 | xargs -0 -P 40 -n 100 stat -c "%s" | awk '{ n=int(log($1)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n; done
Is running now.
Output:
root@VM-HOST:/# for i in /storj/n*; do echo $'\n\n'$i; find $i -type f -print0 | xargs -0 -P 40 -n 100 stat -c "%s" | awk '{ n=int(log($1)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n; done
/storj/nd10
1024 919951
2048 1932779
4096 1000146
8192 2089807
16384 1766473
32768 1950788
65536 568374
131072 2214877
262144 572826
524288 114094
1048576 165845
2097152 571579
4194304 15
33554432 1
268435456 1
/storj/nd11
1024 156033
2048 396929
4096 129840
8192 168365
16384 132136
32768 138056
65536 117274
131072 842821
262144 156511
524288 21561
1048576 21046
2097152 59577
4194304 32
8388608 15
67108864 1
/storj/nd16
1024 167019
2048 278879
4096 167042
8192 362033
16384 296683
32768 318188
65536 149449
131072 616591
262144 224099
524288 19564
1048576 19910
2097152 67071
4194304 6
8388608 3
16777216 3
/storj/nd18
1024 742299
2048 2532194
4096 606620
8192 697769
16384 564620
32768 537919
65536 397070
131072 6724158
262144 637321
524288 74297
1048576 79364
2097152 289497
4194304 34
8388608 84
16777216 54
33554432 22
1073741824 1
/storj/nd22
1024 150155
2048 639050
4096 125147
8192 162173
16384 136363
32768 143956
65536 99181
131072 2821874
262144 156492
524288 19069
1048576 24482
2097152 59721
4194304 21
8388608 14
16777216 7
33554432 4
67108864 6
536870912 1
/storj/nd4
1024 629509
2048 1828193
4096 623361
8192 1225669
16384 1171061
32768 1568389
65536 343783
131072 4760152
262144 414969
524288 88707
1048576 101833
2097152 457896
4194304 4806
8388608 26
16777216 6
33554432 23
67108864 1
268435456 1
/storj/nd6
1024 180541
2048 1767209
4096 140123
8192 176124
16384 146127
32768 160803
65536 124515
131072 3444797
262144 186871
524288 20358
1048576 26560
2097152 69026
4194304 92
8388608 18
16777216 19
33554432 3
536870912 1
/storj/nd9
1024 237145
2048 1971952
4096 217281
8192 353479
16384 335829
32768 345361
65536 216953
131072 2201886
262144 325034
524288 45188
1048576 37566
2097152 215577
4194304 229
8388608 3
67108864 1
So, what do you deduce from this?
As far as I understood, ZFS can pack multiple files into a sector. And as far as I can see, there are very few files over 512k, so most files can be packed using one metadata record.
You mean record, but no, it can't. It operates at file granularity. The only optimization it can do is truncate the record to the file size if the file is smaller than the max record size.
Looks very similar to my distribution, with most files under 128k.
I can't explain why you observed what you observed. Could it be that the send/receive forced stale allocation stats to update to the correct values? (The zpool list data is not realtime; you can see up-to-date data with zdb.)
I'll try to do that (send/receive, check metadata usage as reported by zpool list and zdb, then increase the record size and send/receive again, then check again) on one of my nodes and see whether I see a reduction in metadata usage.
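Roughly, the plan as a sketch (the pool and dataset names below are placeholders, not my real ones):
zfs snapshot tank/node@before
zfs send tank/node@before | zfs receive tank/node-copy       # first copy at the current record size
zpool list -v tank                                           # special vdev ALLOC after the copy
zdb -PLbbbs tank                                             # metadata as reported by zdb
zfs set recordsize=512k tank/node-copy
zfs snapshot tank/node-copy@resend
zfs send tank/node-copy@resend | zfs receive tank/node-copy2
zpool list -v tank
zdb -PLbbbs tank
(One caveat: as far as I know a plain send/receive preserves the existing block sizes, so the 512k step may need a file-level copy, e.g. rsync, to actually produce 512k records.)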
As I understood it, ZFS can create records ranging from 2^ashift up to the recordsize. Meaning that a file bigger than the recordsize needs multiple records (and hence more metadata), but a smaller file just gets a smaller record. In my case a 1.3MiB file gets three 512k records (about 200k of loss), but a 62k file gets a single 64k record. From the perspective of metadata size and reading speed, a big recordsize is favored; from the perspective of space efficiency, a small recordsize is favored.
I don't care so much about efficiency, since it's quite small potatoes we're talking about, and it's also compensated by the fact that the data turns out to be compressible by 5-10%.
So that's the reason I chose 512k. The net result is that a 512k recordsize is almost as efficient as ext4.
Yes, this is precisely right. But since most files on the node are under 128k (the default record size), increasing the record size to 512k would only affect a small minority of files, and hence can't cause any significant change, let alone the 3x change you observed.
Well, I'm very interested to see your results. I'm quite new to ZFS, so how does checking metadata with zdb work? So far I've only derived this from the zpool listing.
But given my overview, I don’t see the reason for your advice of 25G/TB (TS explicitly stated not to use small/inline files).
It will take some time; I've set a reminder to come back to this topic and update it with my findings.
Where does it state it?
Run something like zdb -PLbbbs tank
and add up the values in the allocated size (ASIZE) column for everything that is not plain data (i.e., rows with L>0, plus object types other than “plain file” and “zvol”), but be careful not to count the cumulative values there.
A good starting point is a Reddit thread on this.
I shall probably just write a script and share it here too.
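Something along these lines might work. This is an untested sketch: it assumes the usual “Blocks LSIZE PSIZE ASIZE avg comp %Total Type” layout of the zdb block statistics table (with -P giving exact byte values), and the pool name is just an example, so sanity-check the result against the table by hand:
pool=storjdata12
zdb -PLbbbs "$pool" | awk '
  $4 !~ /^[0-9]+$/ { next }                        # keep only table rows with a numeric ASIZE
  { type = $8; for (i = 9; i <= NF; i++) type = type " " $i }
  type !~ /[A-Za-z]/ || type == "Total" { next }   # skip size histograms and the grand total
  type ~ /^L[0-9]/ {                               # per-level breakdown rows
    if (type ~ /plain file|zvol/ && type !~ /^L0/) meta += $4   # L>0 = indirect blocks of data
    next
  }
  type ~ /plain file|zvol/ { next }                # cumulative per-type totals for data: skip
  { meta += $4 }                                   # every other object type counts as metadata
  END { printf "metadata allocated: %.1f GiB\n", meta / 2^30 }
'
Traversing the whole pool this way can take quite a while on a full node.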
Here it’s stated. So no small files.
Ah, Topic Starter, not Terms of Service. Got it.
5G/TB… but it can vary, a lot!
What do you mean, same opinion?