Notes on storage node performance optimization on ZFS

My notes from performance optimization efforts on ZFS.

Hardware setup for reference: the node holds around 9TB of data today, running on an old, 2012-ish era FreeBSD server. There is a single zfs pool consisting of two 4-disk RAIDZ1 VDEVs. Out of 32GB of RAM, 9 are used by services and the rest is available for the ARC. A 2TB SSD is used as L2ARC, but it is not really being utilized – I just had it lying around; it is not necessary in any way. To speed up synchronous writes I have a cheap 16GB Optane device attached as a SLOG. It helps a lot with Time Machine and, to some degree, storj – but see below, it is not necessary either.
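
For reference, a pool with this shape could be put together roughly like this (a sketch with placeholder device names, not my actual command history):

    # two 4-disk RAIDZ1 VDEVs in one pool, plus SLOG and L2ARC (placeholder device names)
    zpool create pool1 \
        raidz1 da0 da1 da2 da3 \
        raidz1 da4 da5 da6 da7
    zpool add pool1 log nvd0                    # the small Optane SLOG
    zpool add -o ashift=12 pool1 cache ada0     # the SSD as L2ARC, see item 1 below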

The server is used for other tasks as well: mainly hosting plex/homebridge, serving as a Time Machine and general backup target for about 7 computers, and serving two iSCSI drives to Windows machines on the LAN.

After making the changes described below (in no particular order), I no longer notice storj impacting the performance of other tasks.

  1. General recommendation, often overlooked: ensure your SSDs are configured with their native block size. Some SSDs, especially Samsung, like to pretend they have 512-byte sectors, but that is not the case. When adding such a device to a pool, override the sector size with the ashift parameter. For example, to add most SSDs as an L2ARC device, force a 4096-byte sector size like this (note that 1 << 12 == 4096):

    zpool add -o ashift=12 pool1 cache ada0
    
  2. Disable access time updates on the datasets holding storage node data; this removes the associated IO:

    zfs set atime=off pool1/storj
    
  3. You should have a UPS. In that case, a dramatic performance improvement can be achieved by disabling sync writes on the storj dataset:

    zfs set sync=disabled pool1/storj
    

    If you don’t have a UPS – don’t disable sync writes; instead, add a SLOG: it will offload some of the IO from the main array. I did both: added a SLOG to handle sync writes for other data, and disabled sync for the storj data, because I have a UPS.

  4. Caching: the storage node stores two main classes of data: blobs and databases.

    For blobs you want to cache only metadata: clients access data randomly, so caching the data itself brings little benefit and only churns the cache.

    Databases, on the other hand, benefit strongly from caching: with metadata-only caching on my node the dashboard takes over a minute to refresh. With full caching – about a second.

    As you can see, these are conflicting requirements. Naturally, the best solution is to keep the databases on a separate dataset with a separate caching configuration.

    I’ve created that dataset with a 64k record size, to better match SQLite usage. The dataset for the rest of the storj data is kept at the default 128k: even though the storj recommended chunk size is 64M, I found that the vast majority of files are significantly smaller than that, and 128k records provide a good balance between overhead and space usage. Keep the default ZFS compression on – it helps conserve disk space for partially filled records. The databases dataset is mounted separately into the storj jail, and its path is specified in the config file via the storage2.database-dir: parameter (a consolidated sketch of the zfs commands is at the end of this post):

    # path to store data in
    storage.path: /mnt/storj
    ...
    # directory to store databases. if empty, uses data path
    storage2.database-dir: "/mnt/storj-databases"
    

    With that, caching is configured as follows:

    zfs set primarycache=metadata pool1/storj
    zfs set primarycache=all pool1/storj/databases
    

    You may go a step further and apply this to the secondary cache as well: as discussed above, caching blob data does not help performance with the current access pattern, but it may increase SSD wear. If the node sees significant traffic and the nature of that traffic changes somewhat – i.e. if enough chunks are accessed repeatedly – you may decide to switch the secondary cache back to all.

    zfs set secondarycache=metadata pool1/storj
    zfs set secondarycache=all pool1/storj/databases
    

That’s pretty much all I had.
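
For completeness, the dataset setup from items 2-4 boils down to roughly the following. This is a sketch rather than my exact command history, and the mount of the databases dataset into the jail is handled separately:

    # separate dataset for the databases: 64k records, fully cached
    zfs create -o recordsize=64k -o mountpoint=/mnt/storj-databases \
        -o primarycache=all -o secondarycache=all pool1/storj/databases
    # blobs dataset: default 128k records, metadata-only caching, no atime
    zfs set atime=off pool1/storj
    zfs set sync=disabled pool1/storj            # only because there is a UPS
    zfs set primarycache=metadata pool1/storj
    zfs set secondarycache=metadata pool1/storj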

8 Likes

For the segment. Your node stores 1/80 of the segment. So it should be 819.2 KiB or less.

1 Like

This matches my experiences with setting up nodes on ext4, especially avoiding synchronous writes and configuring storage for databases in a different way than for blobs. Thank you!

2 Likes

Another good setting for a zfs pool is logbias=throughput.
Sure, the pool may become less responsive for some things, but the bandwidth goes up by a few multiples… I forget the exact number, but it’s a lot.

I’m sure there could be cases where logbias=latency is preferable, but the benefit of much higher transfer rates also helps reduce latency… so I always go with throughput.

There are many tricks to make zfs run better.
xattr=off also helps reduce iops per record or file… much like atime.

1 Like

This essentially turns off the SLOG. It may be fine for some databases, in some corner cases, at very high utilization, but not for the bulk of the data. I would leave it at the default, especially since the rate at which storj writes to SQLite is fairly minuscule.

In the vast majority of cases latency is preferable. That’s why it’s the default.

Storage node does not read or write extended attributes, so this change will have no effect (I’m not sure it’s supported on FreeBSD in the first place though).

But aren’t only 29 of those pieces needed to recreate the file? This would imply a max piece size of ~2.2MB, which is what I seem to remember being a pretty common file size in my blob folder.

On a separate note: you can get even less disk usage by moving your databases to RAM (on Linux you can do this by storing them under /dev/shm). They aren’t strictly necessary for node operation, just nice to have.
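
A minimal example of what I mean (the path is just an illustration; whatever statistics live in those databases are lost on reboot, which is the trade-off):

    # keep the node's SQLite databases on RAM-backed tmpfs (Linux)
    mkdir -p /dev/shm/storj-databases
    # and point the node at it in config.yaml:
    #   storage2.database-dir: "/dev/shm/storj-databases"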

Yes, 29 pieces are enough to reconstruct the segment, so you are correct: 64MiB/29 is the maximum piece size with the current Erasure Coding settings.
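
As a quick back-of-the-envelope check (plain shell arithmetic), that works out to the ~2.2MB figure mentioned above:

    # 64 MiB segment split into 29 required pieces, result in KiB
    echo $(( 64 * 1024 / 29 ))      # prints 2260, i.e. roughly 2.2 MiB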

What method did you follow to determine this? How old is the node you analyzed? I’m interested in trying to reproduce your results, as I’ve often wondered the same.

A UPS does not protect against crashes, so there could still be data loss?
But a small data loss is probably meaningless; worst case you lose the last 5 seconds, which could be 625MB on a gigabit link. 625MB will not get you disqualified.

zdb over pool1/storj would be interesting

Maybe the ARC hit ratio.
Curious, how does your ARC hit ratio look with metadata being (only?) on a special vdev? Does the ARC ignore your folder because it is read from the special vdev, or is it an additional cache on top?

Yes, by observing the ARC hit ratio and the rate of L2ARC growth. The L2ARC write speed is heavily throttled by design; before the changes the cache was filling up at that maximum rate, but afterwards it stopped, while the ARC hit rate remained at 98-100%. It is really hard for me to separate storj activity from everything else the server is doing, but the fact that the L2ARC stopped filling up like there is no tomorrow indicates that infrequently accessed data no longer competes with frequently used data for space in the cache – which is the ultimate goal.
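
For reference, the raw numbers behind this come from the ZFS kstat counters and zpool iostat (FreeBSD sysctl names; they may differ slightly between OpenZFS versions):

    # overall ARC hit ratio from the kstat counters
    hits=$(sysctl -n kstat.zfs.misc.arcstats.hits)
    misses=$(sysctl -n kstat.zfs.misc.arcstats.misses)
    echo "ARC hit ratio: $(( hits * 100 / (hits + misses) ))%"

    # watch how fast the L2ARC device fills up (ALLOC column for the cache device)
    zpool iostat -v pool1 10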

It would indeed be interesting to compare observations from a pool that is dedicated to storj, if anyone has one.

I’m not sure what you are asking – can you elaborate? Preemptively, here is a good starting point on how the ARC works: ZFS L2ARC

I haven’t had a crash for a very long time – the key is to pick a very stable and boring OS and stable hardware for your servers – but you’re right, a kernel panic would have the same effect as an abrupt power loss.

I agree that there is no reason to care too much about storj data integrity – there is redundancy built into the system – but then again, this is a shared machine, and I do care about my other data. Hence async writes on the storj dataset, and the default for everything else.

sudo zdb -d -U /data/zfs/zpool.cache pool1/storj
Dataset pool1/storj [ZPL], ID 17585, cr_txg 2486832, 6.62T, 15338628 objects

Let me know if you want more specific data.

That is for the 9-month-old one. There is another, 5-month-old one on the same host.

With sync disabled and caches enabled it gets pretty close – the traffic volume is pretty small. But I agree, these dbs are purely cosmetic.

I’ve put together a small script to draw a histogram of the file size distribution with various bin sizes, and ran it overnight. This is the result for the 9-month-old node at 4k binning:

--- Bin size fixed: 4096 bytes
One star represents about 15841 files. Omitting results with counts smaller than that
      4096 B   |    1900915 | ************************************************************************************************************************
      8192 B   |    1142603 | ************************************************************************
     12288 B   |     739347 | **********************************************
     16384 B   |     689267 | *******************************************
        20 kiB |     719417 | *********************************************
        24 kiB |     241788 | ***************
        28 kiB |     192761 | ************
        32 kiB |     204852 | ************
        36 kiB |     336095 | *********************
        40 kiB |     440324 | ***************************
        44 kiB |      66746 | ****
        48 kiB |      66421 | ****
        52 kiB |      67986 | ****
        56 kiB |      60739 | ***
        60 kiB |      49708 | ***
        64 kiB |      74731 | ****
        68 kiB |      66825 | ****
        72 kiB |      52849 | ***
        76 kiB |     618568 | ***************************************
        80 kiB |      50655 | ***
        84 kiB |      34664 | **
        88 kiB |      47353 | **
        92 kiB |     125894 | *******
        96 kiB |      92397 | *****
       100 kiB |     147334 | *********
       104 kiB |     127450 | ********
       108 kiB |      97814 | ******
       112 kiB |      31869 | **
       116 kiB |      19226 | *
       144 kiB |      61535 | ***
       148 kiB |      19555 | *
       152 kiB |      16882 | *
       156 kiB |      15960 | *
       180 kiB |     823735 | ****************************************************
       268 kiB |      16650 | *
       272 kiB |      36852 | **
       276 kiB |      26593 | *
       280 kiB |      24566 | *
       284 kiB |      16449 | *
       288 kiB |     183892 | ***********
       304 kiB |      18654 | *
       312 kiB |      25229 | *
       316 kiB |      36200 | **
       320 kiB |      28857 | *
       356 kiB |     104541 | ******
       392 kiB |     751961 | ***********************************************
       568 kiB |    1433244 | ******************************************************************************************
       712 kiB |      27343 | *
      1772 kiB |      36849 | **
      2128 kiB |     212407 | *************
      2268 kiB |    1894912 | ***********************************************************************************************************************
--- Bin size fixed: 16384 bytes
One star represents about 37268 files. Omitting results with counts smaller than that
     16384 B   |    4472132 | ***********************************************************************************************************************
        32 kiB |    1358818 | ************************************
        48 kiB |     909586 | ************************
        64 kiB |     253164 | ******
        80 kiB |     788897 | *********************
        96 kiB |     300308 | ********
       112 kiB |     404467 | **********
       128 kiB |      60971 | *
       144 kiB |     101822 | **
       160 kiB |      67515 | *
       176 kiB |      51718 | *
       192 kiB |     846216 | **********************
       272 kiB |      64902 | *
       288 kiB |     251500 | ******
       304 kiB |      50898 | *
       320 kiB |     103932 | **
       368 kiB |     112512 | ***
       400 kiB |     757022 | ********************
       576 kiB |    1441592 | **************************************
      1776 kiB |      39988 | *
      2128 kiB |     212848 | *****
      2272 kiB |    1903272 | ***************************************************

Most files are under 16k, with the majority of those under 4k; somewhat smaller numbers of files cluster around 512 kiB and 2.2 MiB.

I’ve put a script here if you want to play around with your nodes: Create ASCII histogram of the file sizes in the folder hierarchy · GitHub

Full histograms are here:
9-month old node: Histograms for 9-month old node · GitHub
5-month old node: Histograms for 5-month old node · GitHub

2 Likes

Oops, sorry, I somehow assumed you were using the two SSDs as a special vdev rather than one as L2ARC.
With your hit ratio of 98%, special vdevs would not help; you already have the perfect cache for metadata.

Sorry, wrong command, but your script gets pretty close. I meant
zdb -Lbbbs pool1/storj, which should show you the different block sizes.

That way you could find out how much of the STORJ data could be stored on the SSDs if you used a special vdev.

Right. For scenarios other than storj I do see quite a good hit ratio on L2:

I get the same output:

% sudo zdb -Lbbbs -U /data/zfs/zpool.cache pool1/storj
Dataset pool1/storj [ZPL], ID 17585, cr_txg 2486832, 6.63T, 15358313 objects

Sorry, here you go

 find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'

I found the command here: level1techs. It gives you an excellent overview of your data. Here is mine, without the STORJ data:

As you can see, I have the default recordsize of 128k in my mirror setup.

For you this could be interesting because of the padding costs of RAIDZ.
For the HDDs you use ashift=12, i.e. 4k sectors, right?
Parity would ideally cost you 25% of total storage, but for some record sizes there are additional padding costs.

size    parity + padding cost
4k      50%
8k      50%
12k     25%
16k     33%
20k     38%
24k     25%
28k     30%
32k     33%
36k     25%
40k     29%
64k     27%
128k    27%
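
These numbers can be reproduced with a small loop, assuming 1 parity sector per row of up to 3 data sectors and allocations padded to a multiple of (parity + 1) = 2 sectors (my assumption of the model behind the table):

    # sketch: parity + padding overhead on a 4-disk RAIDZ1 with ashift=12
    for kib in 4 8 12 16 20 24 28 32 36 40 64 128; do
        data=$(( kib / 4 ))                         # 4 KiB data sectors per record
        parity=$(( (data + 2) / 3 ))                # ceil(data / 3) parity sectors
        total=$(( (data + parity + 1) / 2 * 2 ))    # pad to a multiple of 2 sectors
        echo "${kib}k: $(( ((total - data) * 1000 / total + 5) / 10 ))%"
    done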

My assumption is that the 4k files on your system are 7.25GB in total, but use 15GB of space. I think RAIDZ in combination with special vdevs could achieve great results.

1 Like

I don’t worry about the padding cost of the small files – the extra size is still insignificant; but speeding up random lookups by offloading metadata to a special vdev sounds enticing. If storj exhibited a repeatable access pattern this would not matter much: metadata would end up in the ARC pretty quickly, and RAM is always faster than SSD. But because it does not, once the metadata size exceeds the space available in the ARC, a special vdev should improve random access response times. I’ll look into this deeper tomorrow. Thank you for the suggestion.

That histogram built with logarithmic binning may be obscuring the outliers, but it’s OK for getting a feel for the distribution of data size magnitudes.

33% or even up to 50% is not insignificant, depending on how many small files you have.
But thankfully, STORJ does not have a huge number of small files:
4k * 1900915 = 7.2GB
8k * 1142603 = 8.7GB
16k * 689267 = 10.5GB
20k * 719417 = 13GB

That is pretty insignificant for your 9TB.

I’m not so sure about the caching. To me it seems that the read pattern is way too random and you would have to cache basically everything, i.e. terabytes of data. On the other hand, I could imagine the file-walker mostly reading metadata and therefore gaining a huge performance boost from a special vdev.

1 Like

It depends on how frequently a piece is accessed; at least on my node, since April 10 some pieces were accessed over 1000 times. Depending on their size, caching them may be useful.

By the way, if you are using zfs to store the node files directly (not a zvol like I do), then you can set recordsize to 1M, because zfs uses variable-size records and the setting specifies the maximum size.
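
For example (dataset name taken from the posts above; record sizes above 128k need the large_blocks pool feature, which is enabled by default on reasonably recent pools):

    # allow records of up to 1M on the dataset holding the node's files
    zfs set recordsize=1M pool1/storj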

2 Likes

That’s what I meant – it has negligible impact on this use case.

These are good points. I was curious how many times chunks get downloaded. On my 9-month-old node I see these statistics:

Last week:

Download was requested for 1411241 total chunks
64.77% of chunks were requested more than 1 times
32.26% of chunks were requested more than 10 times
15.73% of chunks were requested more than 100 times
8.94% of chunks were requested more than 1000 times
7.04% of chunks were requested more than 10000 times

Last 3 months:

Download was requested for 10004066 total chunks
88.62% of chunks were requested more than 1 times
47.89% of chunks were requested more than 10 times
24.36% of chunks were requested more than 100 times
14.24% of chunks were requested more than 1000 times
4.71% of chunks were requested more than 10000 times

The script for those playing along at home: STORJ: repeated downloads · GitHub (bzgrep because I rotate and compress logs daily)
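
If you don’t want to grab the gist, a crude stand-in looks something like this (assuming the default text log format, where completed downloads are logged as "downloaded" lines carrying a "Piece ID" field):

    # count how many times each piece was downloaded, most-downloaded first
    grep -w downloaded /path/to/storagenode.log \
        | grep -o '"Piece ID": "[^"]*"' \
        | sort | uniq -c | sort -rn | head -20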

I need to think about what that means. It appears that adding chunk caching would benefit 30-50% of download requests. What effect this would have on winning races requires further measurement. I’ll see if I can extract this information from the logs (i.e. track the time to download a specific chunk the first time vs. subsequent times, unless cancelled, and track the cancellation rate for repeated vs. one-shot chunks).

Accelerating metadata lookups is a no-brainer: whether via a special vdev or via the primary/secondary ARC, the end result is the same. I’m more inclined towards caching rather than a special vdev though, from a reliability standpoint: no need for redundancy, and I can remove it at any time.

Whether to cache chunks depends on how much space is available, on the total volume of chunks that end up being read repeatedly – it could indeed be a couple of TB – and on whether the extra boost in races won (if any) is worth some degree of SSD churn. Incidentally, I have a 2TB L2ARC, and it does not really fill up that much faster. That may be all that is needed.

As an aside, I have vfs.zfs.l2arc.rebuild_enabled=1 so the L2 cache persists across reboots, and I’ve raised vfs.zfs.l2arc_write_max to 25MB/s from the default 8MB/s, with vfs.zfs.l2arc_write_boost at double that.
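
In sysctl terms that is roughly the following (the byte values are my conversion of 25 MB/s and 50 MB/s; the same lines go into /etc/sysctl.conf to make them persistent):

    sysctl vfs.zfs.l2arc.rebuild_enabled=1       # keep L2ARC contents across reboots
    sysctl vfs.zfs.l2arc_write_max=26214400      # ~25 MB/s, up from the 8 MB/s default
    sysctl vfs.zfs.l2arc_write_boost=52428800    # double of write_max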

1 Like

Are races won by shouting “yeah, I have the part, get it from me”, or by actually sending it to the host? If the latter, I suspect peering has a far higher impact than storage latency.

The second version.

The uplink starts more uploads/downloads than needed to have enough pieces to reconstruct the file (download) or to store the file with enough redundancy (upload). I do not remember the exact numbers, but it’s something like 100 uploads are started at once, but only 80 are needed. 50 downloads are started at once, but only 30 pieces are needed to get the segment.

All transfers happen at the same time; once there are enough completed transfers, the rest are canceled – those nodes “lost the race”.

The satellite already knows that – that’s why the customer is requesting the file from you in the first place.

If I understand correctly, ultimately the total time to receive the full chunk is what matters: if you manage to get the chunk to the customer fast enough to be among the first N completed transfers – you get paid. Otherwise – you don’t.

That time consists of the latency to start transferring the data plus the transfer time itself. Caching is supposed to help with the former by reducing the “time to first byte”. The latter you can’t control (other than by generally improving QoS with SQM), because depending on network conditions you may simply be too far away to compete.