How to determine the required size for ZFS special VDEV?

From my measurements on my old nodes (the ratio of “zfs plain file” objects’ cumulative size to everything else reported by zdb), metadata takes about 2.6% of the node’s used space.

With the recent massive traffic ingress, which looks to be mostly large objects, metadata takes about 0.06% (this is from a node I started a few months ago; it has almost no “normal” data, it’s all saltlake).

So for a 20TB node you’d want about 500GB of metadata space. And if you want to send small files there as well, you’d want to find one of the file-size histograms posted on the forum here to assess the additional size requirements.
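As a back-of-the-envelope sketch of that arithmetic (the 2.6% ratio and the 20TB capacity are just the numbers from above; plug in your own measurements):

# rough special-vdev sizing from a measured metadata ratio
NODE_TB=20      # node capacity in TB
RATIO=0.026     # metadata fraction measured on my old nodes (~2.6%)
echo "special vdev (per mirror side): $(echo "$NODE_TB * 1000 * $RATIO" | bc) GB"
# -> 520 GB, i.e. roughly the 500GB figure above; double the raw SSD space if you mirror it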

You can find Intel P3600 2TB SSDs on eBay quite cheap; perhaps that overkill would be the way to go.

It also depends on the sector size and on the redundancy of the metadata.

root@T8PLUS-N100:~# zpool list -v
NAME                     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
storjdata12             2.73T  2.35T   388G        -         -     0%    86%  1.00x    ONLINE  -
  STORJ12-ZFS           2.73T  2.35T   379G        -         -     0%  86.4%      -    ONLINE
special                     -      -      -        -         -      -      -      -  -
  mirror-1              14.5G  5.65G  8.85G        -         -    52%  39.0%      -    ONLINE
    STORJ12-METAD         15G      -      -        -         -      -      -      -    ONLINE
    STORJ12-META          15G      -      -        -         -      -      -      -    ONLINE
storjdata19             4.57T  3.68T   908G        -         -     8%    80%  1.00x    ONLINE  -
  STORJ19-ZFS           4.55T  3.68T   892G        -         -     8%  80.8%      -    ONLINE
special                     -      -      -        -         -      -      -      -  -
  mirror-1              24.5G  7.67G  16.8G        -         -    76%  31.3%      -    ONLINE
    STORJ19-META          25G      -      -        -         -      -      -      -    ONLINE
    STORJ19-METAD         25G      -      -        -         -      -      -      -    ONLINE
storjdata21             2.73T   911G  1.84T        -         -     0%    32%  1.00x    ONLINE  -
  STORJ21-ZFS           2.73T   910G  1.83T        -         -     0%  32.7%      -    ONLINE
special                     -      -      -        -         -      -      -      -  -
  mirror-1              14.5G  1.86G  12.6G        -         -    47%  12.8%      -    ONLINE
    STORJ21-METAD         15G      -      -        -         -      -      -      -    ONLINE
    STORJ21-META          15G      -      -        -         -      -      -      -    ONLINE
storjdata3              4.57T  1.86T  2.71T        -         -     0%    40%  1.00x    ONLINE  -
  STORJ3-ZFS            4.55T  1.85T  2.69T        -         -     0%  40.8%      -    ONLINE
special                     -      -      -        -         -      -      -      -  -
  mirror-1              24.5G  5.28G  19.2G        -         -    65%  21.6%      -    ONLINE
    STORJ3-METAD          25G      -      -        -         -      -      -      -    ONLINE
    STORJ3-META           25G      -      -        -         -      -      -      -    ONLINE
storjdata5              3.64T  2.25T  1.39T        -         -     8%    61%  1.00x    ONLINE  -
  zfs-1603f406057ea458  3.64T  2.25T  1.38T        -         -     8%  62.0%      -    ONLINE
special                     -      -      -        -         -      -      -      -  -
  mirror-1              19.5G  6.20G  13.3G        -         -    69%  31.8%      -    ONLINE
    STORJ5-METAD          20G      -      -        -         -      -      -      -    ONLINE
    STORJ5-META           20G      -      -        -         -      -      -      -    ONLINE
storjdata7               932G   808G   125G        -         -    32%    86%  1.00x    ONLINE  -
  zfs-15e729ab5c4ed5ed   931G   805G   123G        -         -    32%  86.7%      -    ONLINE
special                     -      -      -        -         -      -      -      -  -
  mirror-1              4.50G  2.77G  1.73G        -         -    84%  61.5%      -    ONLINE
    STORJ7-METAD           5G      -      -        -         -      -      -      -    ONLINE
    STORJ7-META            5G      -      -        -         -      -      -      -    ONLINE
storjdata8              4.57T   981G  3.61T        -         -     0%    20%  1.00x    ONLINE  -
  STORJ8-ZFS            4.55T   978G  3.59T        -         -     0%  21.0%      -    ONLINE
special                     -      -      -        -         -      -      -      -  -
  mirror-1              24.5G  2.38G  22.1G        -         -    51%  9.72%      -    ONLINE
    STORJ8-META           25G      -      -        -         -      -      -      -    ONLINE
    STORJ8-METAD          25G      -      -        -         -      -      -      -    ONLINE

This is all with 512kB sector size, and redundant_metadata=some. So a factor of 10 lower than yours.

This is how I created them:

zpool create -o ashift=12 -O compress=lz4 -O atime=off -O primarycache=metadata -O sync=disabled -m /storj/nd12 -O xattr=off -O redundant_metadata=some -O recordsize=512k storjdata12 /dev/sdc2
zpool add storjdata12 -o ashift=12 special mirror /dev/disk/by-partlabel/STORJ12-METAD /dev/disk/by-partlabel/STORJ12-META -f
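To sanity-check the result afterwards, you can query the relevant properties and the fill level of the special mirror (standard OpenZFS commands; pool/dataset name as above):

zfs get recordsize,redundant_metadata,compression,primarycache,special_small_blocks storjdata12
zpool list -v storjdata12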

In my case I would allocate 100G, x2 for the mirroring.

Record size is irrelevant here; most storagenode files are smaller than the default record size.

Your nodes seem to be quite small, and hence new, so they mostly contain test data from saltlake, which consists of large files. That gets you the classic 0.3% metadata utilization, which is also close to what I described in my second paragraph above.

Actually, I had one with a 128kB sector size, on which the metadata increased 3x. Aside from that, I don’t see any difference compared to other nodes that were already filled to the brim when the test began:

root@VM-HOST:~# zpool list -v
NAME                     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
storjdata10             2.73T  2.31T   431G        -         -    10%    84%  1.00x    ONLINE  -
  zfs-0e9b458b17d1abae  2.73T  2.31T   423G        -         -    10%  84.8%      -    ONLINE
special                     -      -      -        -         -      -      -      -  -
  mirror-1              14.5G  6.76G  7.74G        -         -    65%  46.6%      -    ONLINE
    STORJ10-METAD         15G      -      -        -         -      -      -      -    ONLINE
    STORJ10-META          15G      -      -        -         -      -      -      -    ONLINE
storjdata11              479G   402G  76.5G        -         -    40%    84%  1.00x    ONLINE  -
  zfs-46912d350c39ebb9   477G   401G  75.0G        -         -    40%  84.2%      -    ONLINE
special                     -      -      -        -         -      -      -      -  -
  mirror-1              2.75G  1.23G  1.52G        -         -    74%  44.9%      -    ONLINE
    STORJ11-METAD          3G      -      -        -         -      -      -      -    ONLINE
    STORJ11-META           3G      -      -        -         -      -      -      -    ONLINE
storjdata16              479G   399G  79.4G        -         -    31%    83%  1.00x    ONLINE  -
  zfs-685857c77212e77b   477G   398G  77.8G        -         -    31%  83.6%      -    ONLINE
special                     -      -      -        -         -      -      -      -  -
  mirror-1              2.75G  1.15G  1.60G        -         -    65%  41.9%      -    ONLINE
    STORJ16-METAD          3G      -      -        -         -      -      -      -    ONLINE
    STORJ16-META           3G      -      -        -         -      -      -      -    ONLINE
storjdata18             2.73T  2.34T   405G        -         -     8%    85%  1.00x    ONLINE  -
  zfs-13156e834a813ec9  2.73T  2.33T   396G        -         -     8%  85.8%      -    ONLINE
special                     -      -      -        -         -      -      -      -  -
  mirror-1              14.5G  5.15G  9.35G        -         -    70%  35.5%      -    ONLINE
    STORJ18-METAD         15G      -      -        -         -      -      -      -    ONLINE
    STORJ18-META          15G      -      -        -         -      -      -      -    ONLINE
storjdata22              932G   798G   135G        -         -     8%    85%  1.00x    ONLINE  -
  zfs-7627b31b68853fd3   932G   796G   132G        -         -     8%  85.8%      -    ONLINE
special                     -      -      -        -         -      -      -      -  -
  mirror-1              4.50G  1.59G  2.91G        -         -    70%  35.4%      -    ONLINE
    STORJ22-METAD          5G      -      -        -         -      -      -      -    ONLINE
    STORJ22-META           5G      -      -        -         -      -      -      -    ONLINE
storjdata4              2.73T  2.30T   442G        -         -    25%    84%  1.00x    ONLINE  -
  zfs-07f56700262ef24b  2.73T  2.30T   433G        -         -    25%  84.4%      -    ONLINE
special                     -      -      -        -         -      -      -      -  -
  mirror-1              14.5G  6.03G  8.47G        -         -    68%  41.6%      -    ONLINE
    STORJ4-METAD          15G      -      -        -         -      -      -      -    ONLINE
    STORJ4-META           15G      -      -        -         -      -      -      -    ONLINE
storjdata6              1.37T  1.19T   179G        -         -     3%    87%  1.00x    ONLINE  -
  zfs-1391287a2e5fd9d2  1.36T  1.19T   175G        -         -     3%  87.4%      -    ONLINE
special                     -      -      -        -         -      -      -      -  -
  mirror-1              7.50G  3.60G  3.90G        -         -    66%  48.0%      -    ONLINE
    STORJ6-METAD           8G      -      -        -         -      -      -      -    ONLINE
    STORJ6-META            8G      -      -        -         -      -      -      -    ONLINE
storjdata9              1.37T  1.16T   216G        -         -    30%    84%  1.00x    ONLINE  -
  zfs-ace96bd612c38442  1.36T  1.15T   212G        -         -    30%  84.7%      -    ONLINE
special                     -      -      -        -         -      -      -      -  -
  mirror-1              7.50G  4.32G  3.18G        -         -    74%  57.6%      -    ONLINE
    STORJ9-METAD           8G      -      -        -         -      -      -      -    ONLINE
    STORJ9-META            8G      -      -        -         -      -      -      -    ONLINE

This is quite unexpected.

Can you run this on that dataset?

find . -type f -print0 | xargs -0 -P 40 -n 100 stat -f "%z" | awk '{ n=int(log($1)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n

I see that the overwhelming majority of files are under 128k, even on the node that mostly contains saltlake data:

1024 43142
2048 2187575
4096 38914
8192 21722
16384 14521
32768 17796
65536 12215
131072 2199374
262144 52161
524288 2928
1048576 2738
2097152 8058

No, after discovering it I quickly ran a send/receive to a 512kB sector size pool. After that, the metadata ratio normalized to the same ratio as on all the other pools.

Other sources also find a relation between sector size and metadata size: Reddit - Dive into anything

Yes, there is of course a relation. It just should not apply to a storage node, because most files are already smaller than the record size.

That command prints the distribution of file sizes; you can still run it. If you have a significant number of files over 128k, that would explain what you see. But such a distribution would be unexpected for a node, and that would deserve investigation.

find . -type f -print0 | xargs -0 -P 40 -n 100 stat -f "%z" | awk '{ n=int(log($1)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n

Doesn’t work, a lot of stat errors… (the -f "%z" form is BSD stat syntax; GNU stat wants -c "%s")

Rewrote it to:

for i in /storj/n*; do echo $'\n\n'$i; find $i -type f -print0 | xargs -0 -P 40 -n 100 stat -c "%s" | awk '{ n=int(log($1)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n; done

Is running now.

Output:

root@VM-HOST:/# for i in /storj/n*; do echo $'\n\n'$i; find $i -type f -print0 | xargs -0 -P 40 -n 100 stat -c "%s" | awk '{ n=int(log($1)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n; done

/storj/nd10
1024 919951
2048 1932779
4096 1000146
8192 2089807
16384 1766473
32768 1950788
65536 568374
131072 2214877
262144 572826
524288 114094
1048576 165845
2097152 571579
4194304 15
33554432 1
268435456 1


/storj/nd11
1024 156033
2048 396929
4096 129840
8192 168365
16384 132136
32768 138056
65536 117274
131072 842821
262144 156511
524288 21561
1048576 21046
2097152 59577
4194304 32
8388608 15
67108864 1


/storj/nd16
1024 167019
2048 278879
4096 167042
8192 362033
16384 296683
32768 318188
65536 149449
131072 616591
262144 224099
524288 19564
1048576 19910
2097152 67071
4194304 6
8388608 3
16777216 3


/storj/nd18
1024 742299
2048 2532194
4096 606620
8192 697769
16384 564620
32768 537919
65536 397070
131072 6724158
262144 637321
524288 74297
1048576 79364
2097152 289497
4194304 34
8388608 84
16777216 54
33554432 22
1073741824 1


/storj/nd22
1024 150155
2048 639050
4096 125147
8192 162173
16384 136363
32768 143956
65536 99181
131072 2821874
262144 156492
524288 19069
1048576 24482
2097152 59721
4194304 21
8388608 14
16777216 7
33554432 4
67108864 6
536870912 1


/storj/nd4
1024 629509
2048 1828193
4096 623361
8192 1225669
16384 1171061
32768 1568389
65536 343783
131072 4760152
262144 414969
524288 88707
1048576 101833
2097152 457896
4194304 4806
8388608 26
16777216 6
33554432 23
67108864 1
268435456 1


/storj/nd6
1024 180541
2048 1767209
4096 140123
8192 176124
16384 146127
32768 160803
65536 124515
131072 3444797
262144 186871
524288 20358
1048576 26560
2097152 69026
4194304 92
8388608 18
16777216 19
33554432 3
536870912 1


/storj/nd9
1024 237145
2048 1971952
4096 217281
8192 353479
16384 335829
32768 345361
65536 216953
131072 2201886
262144 325034
524288 45188
1048576 37566
2097152 215577
4194304 229
8388608 3
67108864 1

So, what do you deduce from this?
As far as I understood, ZFS can pack multiple files into one sector. And as far as I can see, there are very few files over 512k, so most files can be packed using one metadata record.

You mean record, but no, it can’t. It operates at file granularity. The only optimization it can do is truncate the record to the file size if the file is smaller than the maximum record size.

Looks very similar to my distribution, with most files under 128k.

I can’t explain why you observed what you observed. Could it be that the send/receive forced stale allocation stats to update to the correct values? (The zpool list data is not real-time; you can see up-to-date data with zdb.)

I’ll try to do that (send/receive, check metadata usage as reported by zpool list and zdb, then increase the record size, send/receive again, and check again) on one of my nodes and see whether there is a reduction in metadata usage.

As I understand it, ZFS can create records ranging from 2^ashift up to the recordsize. That means a file bigger than the recordsize needs multiple records (and thus more metadata to track them), but a smaller file just gets a single, smaller record. In my case a 1.3MiB file gets three 512k records (about 200k of loss), but a 62k file gets one 64k record. From the perspective of metadata size and reading speed, a big recordsize is favored; from the perspective of space efficiency, a small recordsize is favored.
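A quick way to see that allocation behaviour is on a throwaway dataset (sketch only; the pool name tank, the dataset name rstest and the default mountpoint are placeholders, and compression is switched off so the allocated sizes aren’t skewed):

zfs create -o recordsize=512k -o compression=off tank/rstest
dd if=/dev/urandom of=/tank/rstest/big.bin bs=1k count=1331    # ~1.3 MiB -> three 512k records
dd if=/dev/urandom of=/tank/rstest/small.bin bs=1k count=62    # 62k -> one ~64k record
sync; sleep 5                                                  # wait for the txg to hit the disk
du -k /tank/rstest/big.bin /tank/rstest/small.bin              # allocated size on disk
du -k --apparent-size /tank/rstest/big.bin /tank/rstest/small.bin
zfs destroy tank/rstest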

I don’t care so much about efficiency, since it’s quite small potatoes we’re talking about, and it’s also compensated by the fact that the data turns out to be compressible by 5-10%.

So that’s the reason why I chose 512k. The net result is that a 512k recordsize is almost as space-efficient as ext4.

Yes, this is precisely right. But since most files on the node are under 128k (the default record size), increasing the record size to 512k would only affect a small minority of files, and hence can’t cause any significant change, let alone the 3x change you observed.

Well, I’m very interested to see your results. I’m quite new to ZFS, so how does checking metadata usage with zdb work? Up to now I derived these numbers from the zpool listing.

But given my overview, I don’t see the reason for your advice of 25G/TB (the TS explicitly stated not to put small/inline files on it).

It will take some time, I’ve set a reminder to come back to this topic to update with findings.

Where is that stated?

Run something like zdb -PLbbbs tank and add up the values in the allocated-size (ASIZE) column for everything that is not plain data (i.e. levels L>0, plus any type that is not “ZFS plain file” or “zvol”), but be careful not to double-count the cumulative values there.

A good starting point is here: Reddit - Dive into anything

I shall probably just write a script and share it here too.
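A rough first cut could look something like this. It is an untested sketch and assumes the block-statistics table printed by zdb -PLbbbs keeps the usual columns (blocks, LSIZE, PSIZE, ASIZE, avg, comp, %Total, Type), with exact byte counts thanks to -P:

#!/bin/sh
# Untested sketch: estimate the metadata share of a pool as
#   total ASIZE minus the L0 (data) ASIZE of "ZFS plain file" and "zvol object".
POOL="${1:?pool name required}"
zdb -PLbbbs "$POOL" | awk '
  / L0 +ZFS plain file$/              { data += $4 }   # actual file data blocks
  / L0 +zvol object$/                 { data += $4 }   # actual zvol data blocks
  /Total$/ && $0 !~ /L[0-9]+ +Total$/ { total = $4 }   # grand total (skip per-level totals)
  END { printf "total %d, plain data %d, metadata %d (%.2f%%)\n",
               total, data, total - data, 100 * (total - data) / total }'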

Here it’s stated. So no small files.

Ah. Topic Starter, not Terms of Service :slight_smile: got it.

5G/TB… it can vary a lot! :slight_smile:

What do you mean, same opinion?
