Notes on storage node performance optimization on ZFS

My notes from performance optimization efforts on ZFS.

Hardware setup for reference: the node holds around 9TB of data today, running on an old, 2012-ish era FreeBSD server. There is a single zfs pool consisting of two 4-disk RAIDZ1 VDEVs. Out of 32GB of RAM, 9 are used by services and the rest is available for the ARC. A 2TB SSD is used as L2ARC, but it is not really being utilized – I just had it lying around; it is not necessary in any way. To speed up synchronous writes I have a cheap 16GB Optane device attached as a SLOG. It helps a lot with Time Machine and, to some degree, storj – but see below, it is not necessary either.
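
For reference, a pool with this shape could be put together roughly like this (a sketch with placeholder device names, not my actual command history):

    # two 4-disk RAIDZ1 VDEVs in one pool, plus SLOG and L2ARC (placeholder device names)
    zpool create pool1 \
        raidz1 da0 da1 da2 da3 \
        raidz1 da4 da5 da6 da7
    zpool add pool1 log nvd0                    # the small Optane SLOG
    zpool add -o ashift=12 pool1 cache ada0     # the SSD as L2ARC, see item 1 below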

The server is used for other tasks as well: mainly hosting plex/homebridge, serving as a Time Machine and general backup target for about 7 computers, and serving two iSCSI drives to Windows machines on the LAN.

After making the changes described below (in no particular order), I no longer notice storj impacting the performance of other tasks.

  1. General recommendation, often overlooked: ensure your SSDs are configured with their native block size. Some SSDs, especially Samsung, like to pretend they have 512-byte sectors, but that is not the case. When adding such a device to a pool, override the sector size with the ashift parameter. For example, to add most SSDs as an L2ARC device, force a 4096-byte sector size like this (note that 1 << 12 == 4096):

    zpool add -o ashift=12 pool1 cache ada0
    
  2. Disable access time updates on the datasets holding storage node data; this removes the associated IO:

    zfs set atime=off pool1/storj
    
  3. You should have a UPS. In that case, a dramatic performance improvement can be achieved by disabling sync writes on the storj dataset:

    zfs set sync=disabled pool1/storj
    

    If you don’t have a UPS – don’t disable sync writes; instead, add a SLOG: it will offload some of the IO from the main array. I did both: added a SLOG to handle sync writes for other data, and disabled sync for the storj data, because I have a UPS.

  4. Caching: the storage node stores two main classes of data: blobs and databases.

    For blobs you want to cache only metadata: clients access data randomly, so caching the data itself brings little benefit and only churns the cache.

    Databases, on the other hand, benefit strongly from caching: with metadata-only caching on my node the dashboard takes over a minute to refresh. With full caching – about a second.

    As you can see, these are conflicting requirements. Naturally, the best solution is to keep the databases on a separate dataset with a separate caching configuration.

    I’ve created that dataset with a 64k record size, to better match SQLite usage. The dataset for the rest of the storj data is kept at the default 128k: even though the storj recommended chunk size is 64M, I found that the vast majority of files are significantly smaller than that, and 128k records provide a good balance between overhead and space usage. Keep the default ZFS compression on – it helps conserve disk space for partially filled records. The databases dataset is mounted separately into the storj jail, and its path is specified in the config file via the storage2.database-dir: parameter (a consolidated sketch of the zfs commands is at the end of this post):

    # path to store data in
    storage.path: /mnt/storj
    ...
    # directory to store databases. if empty, uses data path
    storage2.database-dir: "/mnt/storj-databases"
    

    With that, caching is configured as follows:

    zfs set primarycache=metadata pool1/storj
    zfs set primarycache=all pool1/storj/databases
    

    You may go a step further and apply this to the secondary cache as well: as discussed above, caching blob data does not help performance with the current access pattern, but it may increase SSD wear. If the node sees significant traffic and the nature of that traffic changes somewhat – i.e. if enough chunks are accessed repeatedly – you may decide to switch the secondary cache back to all.

    zfs set secondarycache=metadata pool1/storj
    zfs set secondarycache=all pool1/storj/databases
    

That’s pretty much all I had.
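
For completeness, the dataset setup from items 2-4 boils down to roughly the following. This is a sketch rather than my exact command history, and the mount of the databases dataset into the jail is handled separately:

    # separate dataset for the databases: 64k records, fully cached
    zfs create -o recordsize=64k -o mountpoint=/mnt/storj-databases \
        -o primarycache=all -o secondarycache=all pool1/storj/databases
    # blobs dataset: default 128k records, metadata-only caching, no atime
    zfs set atime=off pool1/storj
    zfs set sync=disabled pool1/storj            # only because there is a UPS
    zfs set primarycache=metadata pool1/storj
    zfs set secondarycache=metadata pool1/storj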

8 Likes

For the segment. Your node stores 1/80 of the segment. So it should be 819.2 KiB or less.

1 Like

This matches my experiences with setting up nodes on ext4, especially avoiding synchronous writes and configuring storage for databases in a different way than for blobs. Thank you!

2 Likes

Another good setting for a zfs pool is logbias=throughput.
Sure, the pool may become less responsive for some things, but the bandwidth goes up by a few multiples… I forget the exact number, but it’s a lot.

I’m sure there could be cases where logbias=latency is preferable, but the benefit of much higher transfer rates also helps reduce latency… so I always go with throughput.

There are many tricks to make zfs run better.
xattr=off also helps reduce iops per record or file… much like atime.

1 Like

This essentially turns off the SLOG. It may be fine for some databases, in some corner cases, at very high utilization, but not for the bulk of the data. I would leave it at the default, especially since the rate at which storj writes to SQLite is fairly minuscule.

In the vast majority of cases latency is preferable. That’s why it’s the default.

Storage node does not read or write extended attributes, so this change will have no effect (I’m not sure it’s supported on FreeBSD in the first place though).

But aren’t only 29 of those pieces needed to recreate the file? This would imply a max piece size of ~2.2MB, which is what I seem to remember being a pretty common file size in my blob folder.

On a separate note: you can get even less disk usage by moving your databases to RAM (on Linux you can do this by storing them under /dev/shm). They aren’t strictly necessary for node operation, just nice to have.
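
A minimal example of what I mean (the path is just an illustration; whatever statistics live in those databases are lost on reboot, which is the trade-off):

    # keep the node's SQLite databases on RAM-backed tmpfs (Linux)
    mkdir -p /dev/shm/storj-databases
    # and point the node at it in config.yaml:
    #   storage2.database-dir: "/dev/shm/storj-databases"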

Yes, 29 pieces are enough to reconstruct the segment, so you are correct: 64MiB/29 is the maximum piece size with the current Erasure Coding settings.
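
As a quick back-of-the-envelope check (plain shell arithmetic), that works out to the ~2.2MB figure mentioned above:

    # 64 MiB segment split into 29 required pieces, result in KiB
    echo $(( 64 * 1024 / 29 ))      # prints 2260, i.e. roughly 2.2 MiB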

What method did you follow to determine this? How old is the node you analyzed? I’m interested in trying to reproduce your results, as I’ve often wondered the same.

A UPS does not protect against crashes, so there could still be data loss?
But a small data loss is probably meaningless; worst case you lose the last 5 seconds, which could be 625MB on a gigabit link. 625MB will not get you disqualified.

zdb over pool1/storj would be interesting

Maybe the ARC hit ratio.
Curious, how does your ARC hit ratio look with metadata being (only?) on a special vdev? Does the ARC ignore your folder because it is read from the special vdev, or is it an additional cache on top?

Yes, by observing the ARC hit ratio and the rate of L2ARC growth. The L2ARC write speed is heavily throttled by design; before the changes the cache was filling up at that maximum rate, but afterwards it stopped, while the ARC hit rate remained at 98-100%. It is really hard for me to separate storj activity from everything else the server is doing, but the fact that the L2ARC stopped filling up like there is no tomorrow indicates that infrequently accessed data no longer competes with frequently used data for space in the cache – which is the ultimate goal.
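
For reference, the raw numbers behind this come from the ZFS kstat counters and zpool iostat (FreeBSD sysctl names; they may differ slightly between OpenZFS versions):

    # overall ARC hit ratio from the kstat counters
    hits=$(sysctl -n kstat.zfs.misc.arcstats.hits)
    misses=$(sysctl -n kstat.zfs.misc.arcstats.misses)
    echo "ARC hit ratio: $(( hits * 100 / (hits + misses) ))%"

    # watch how fast the L2ARC device fills up (ALLOC column for the cache device)
    zpool iostat -v pool1 10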

It would indeed be interesting to compare observations from a pool that is dedicated to storj, if anyone has one.

I’m not sure what you are asking – can you elaborate? Preemptively, here is a good starting point on how the ARC works: ZFS L2ARC

I haven’t had a crash for a very long time – the key is to pick a very stable and boring OS and stable hardware for your servers – but you’re right, a kernel panic would have the same effect as an abrupt power loss.

I agree that there is no reason to care too much about storj data integrity – there is redundancy built into the system – but then again, this is a shared machine, and I do care about my other data. Hence async writes on the storj dataset, and the default for everything else.

sudo zdb -d -U /data/zfs/zpool.cache pool1/storj
Dataset pool1/storj [ZPL], ID 17585, cr_txg 2486832, 6.62T, 15338628 objects

Let me know if you want more specific data.

That is for the 9-month-old one. There is another, 5-month-old one on the same host.

With sync disabled and caches enabled it gets pretty close – the traffic volume is pretty small. But I agree, these dbs are purely cosmetic.

I’ve put together a small script to draw a histogram of the file size distribution with various bin sizes, and ran it overnight. This is the result for the 9-month-old node at 4k binning:

--- Bin size fixed: 4096 bytes
One star represents about 15841 files. Omitting results with counts smaller than that
      4096 B   |    1900915 | ************************************************************************************************************************
      8192 B   |    1142603 | ************************************************************************
     12288 B   |     739347 | **********************************************
     16384 B   |     689267 | *******************************************
        20 kiB |     719417 | *********************************************
        24 kiB |     241788 | ***************
        28 kiB |     192761 | ************
        32 kiB |     204852 | ************
        36 kiB |     336095 | *********************
        40 kiB |     440324 | ***************************
        44 kiB |      66746 | ****
        48 kiB |      66421 | ****
        52 kiB |      67986 | ****
        56 kiB |      60739 | ***
        60 kiB |      49708 | ***
        64 kiB |      74731 | ****
        68 kiB |      66825 | ****
        72 kiB |      52849 | ***
        76 kiB |     618568 | ***************************************
        80 kiB |      50655 | ***
        84 kiB |      34664 | **
        88 kiB |      47353 | **
        92 kiB |     125894 | *******
        96 kiB |      92397 | *****
       100 kiB |     147334 | *********
       104 kiB |     127450 | ********
       108 kiB |      97814 | ******
       112 kiB |      31869 | **
       116 kiB |      19226 | *
       144 kiB |      61535 | ***
       148 kiB |      19555 | *
       152 kiB |      16882 | *
       156 kiB |      15960 | *
       180 kiB |     823735 | ****************************************************
       268 kiB |      16650 | *
       272 kiB |      36852 | **
       276 kiB |      26593 | *
       280 kiB |      24566 | *
       284 kiB |      16449 | *
       288 kiB |     183892 | ***********
       304 kiB |      18654 | *
       312 kiB |      25229 | *
       316 kiB |      36200 | **
       320 kiB |      28857 | *
       356 kiB |     104541 | ******
       392 kiB |     751961 | ***********************************************
       568 kiB |    1433244 | ******************************************************************************************
       712 kiB |      27343 | *
      1772 kiB |      36849 | **
      2128 kiB |     212407 | *************
      2268 kiB |    1894912 | ***********************************************************************************************************************
--- Bin size fixed: 16384 bytes
One star represents about 37268 files. Omitting results with counts smaller than that
     16384 B   |    4472132 | ***********************************************************************************************************************
        32 kiB |    1358818 | ************************************
        48 kiB |     909586 | ************************
        64 kiB |     253164 | ******
        80 kiB |     788897 | *********************
        96 kiB |     300308 | ********
       112 kiB |     404467 | **********
       128 kiB |      60971 | *
       144 kiB |     101822 | **
       160 kiB |      67515 | *
       176 kiB |      51718 | *
       192 kiB |     846216 | **********************
       272 kiB |      64902 | *
       288 kiB |     251500 | ******
       304 kiB |      50898 | *
       320 kiB |     103932 | **
       368 kiB |     112512 | ***
       400 kiB |     757022 | ********************
       576 kiB |    1441592 | **************************************
      1776 kiB |      39988 | *
      2128 kiB |     212848 | *****
      2272 kiB |    1903272 | ***************************************************

Most files are under 16k, with the majority of those under 4k; somewhat smaller numbers of files cluster around 512 kiB and 2.2 MiB.

I’ve put a script here if you want to play around with your nodes: Create ASCII histogram of the file sizes in the folder hierarchy · GitHub

Full histograms are here:
9-month old node: Histograms for 9-month old node · GitHub
5-month old node: Histograms for 5-month old node · GitHub

2 Likes

Oops, sorry, I somehow assumed you were using the two SSDs as a special vdev rather than one as L2ARC.
With your hit ratio of 98%, special vdevs would not help; you already have the perfect cache for metadata.

Sorry, wrong command, but your script gets pretty close. I meant
zdb -Lbbbs pool1/storj, which should show you the different block sizes.

That way you could find out how much of the STORJ data could be stored on the SSDs if you used a special vdev.

Right. For scenarios other than storj I do see quite a good hit ratio on L2:

I get the same output:

% sudo zdb -Lbbbs -U /data/zfs/zpool.cache pool1/storj
Dataset pool1/storj [ZPL], ID 17585, cr_txg 2486832, 6.63T, 15358313 objects

Sorry, here you go

 find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'

I found the command here: level1techs. It gives you an excellent overview of your data. Here is mine, without the STORJ data:

As you can see, I have the default recordsize of 128k in my mirror setup.

For you this could be interesting because of the padding costs of RAIDZ.
For the HDDs you use ashift=12, i.e. 4k sectors, right?
Parity would ideally cost you 25% of total storage, but for some record sizes there are additional padding costs.

size    parity + padding cost
4k      50%
8k      50%
12k     25%
16k     33%
20k     38%
24k     25%
28k     30%
32k     33%
36k     25%
40k     29%
64k     27%
128k    27%
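
These numbers can be reproduced with a small loop, assuming 1 parity sector per row of up to 3 data sectors and allocations padded to a multiple of (parity + 1) = 2 sectors (my assumption of the model behind the table):

    # sketch: parity + padding overhead on a 4-disk RAIDZ1 with ashift=12
    for kib in 4 8 12 16 20 24 28 32 36 40 64 128; do
        data=$(( kib / 4 ))                         # 4 KiB data sectors per record
        parity=$(( (data + 2) / 3 ))                # ceil(data / 3) parity sectors
        total=$(( (data + parity + 1) / 2 * 2 ))    # pad to a multiple of 2 sectors
        echo "${kib}k: $(( ((total - data) * 1000 / total + 5) / 10 ))%"
    done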

My assumption is that the 4k files on your system are 7.25GB in total, but use 15GB of space. I think RAIDZ in combination with special vdevs could achieve great results.

1 Like

I don’t worry about the padding cost of the small files – the extra size is still insignificant; but speeding up random lookups by offloading metadata to a special vdev sounds enticing. If storj exhibited a repeatable access pattern this would not matter much: metadata would end up in the ARC pretty quickly, and RAM is always faster than SSD. But because it does not, once the metadata size exceeds the space available in the ARC, a special vdev should improve random access response times. I’ll look into this deeper tomorrow. Thank you for the suggestion.

That histogram built with logarithmic binning may be obscuring the outliers, but it’s OK for getting a feel for the distribution of data size magnitudes.

33% or even up to 50% is not insignificant, depending on how many small files you have.
But thankfully, STORJ does not have a huge number of small files:
4k * 1900915 = 7.2GB
8k * 1142603 = 8.7GB
16k * 689267 = 10.5GB
20k * 719417 = 13GB

That is pretty insignificant for your 9TB.

I’m not so sure about the caching. To me it seems that the read pattern is way too random and you would have to cache basically everything, i.e. terabytes of data. On the other hand, I could imagine the file-walker mostly reading metadata and therefore gaining a huge performance boost from a special vdev.

1 Like

It depends on how frequently a piece is accessed; at least on my node, since April 10 some pieces were accessed over 1000 times. Depending on their size, caching them may be useful.

By the way, if you are using zfs to store the node files directly (not a zvol like I do), then you can set recordsize to 1M, because zfs uses variable-size records and the setting specifies the maximum size.
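
For example (dataset name taken from the posts above; record sizes above 128k need the large_blocks pool feature, which is enabled by default on reasonably recent pools):

    # allow records of up to 1M on the dataset holding the node's files
    zfs set recordsize=1M pool1/storj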

2 Likes

That’s what I meant – it has negligible impact on this use case.

These are good points. I was curious how many times chunks get downloaded. On my 9-month-old node I see these statistics:

Last week:

Download was requested for 1411241 total chunks
64.77% of chunks were requested more than 1 times
32.26% of chunks were requested more than 10 times
15.73% of chunks were requested more than 100 times
8.94% of chunks were requested more than 1000 times
7.04% of chunks were requested more than 10000 times

Last 3 months:

Download was requested for 10004066 total chunks
88.62% of chunks were requested more than 1 times
47.89% of chunks were requested more than 10 times
24.36% of chunks were requested more than 100 times
14.24% of chunks were requested more than 1000 times
4.71% of chunks were requested more than 10000 times

The script for those playing along at home: STORJ: repeated downloads · GitHub (bzgrep because I rotate and compress logs daily)
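
If you don’t want to grab the gist, a crude stand-in looks something like this (assuming the default text log format, where completed downloads are logged as "downloaded" lines carrying a "Piece ID" field):

    # count how many times each piece was downloaded, most-downloaded first
    grep -w downloaded /path/to/storagenode.log \
        | grep -o '"Piece ID": "[^"]*"' \
        | sort | uniq -c | sort -rn | head -20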

I need to think about what that means. It appears that adding chunk caching would benefit 30-50% of download requests. What effect this would have on winning races requires further measurement. I’ll see if I can extract this information from the logs (i.e. track the time to download a specific chunk the first time vs. subsequent times, unless cancelled, and track the cancellation rate for repeated vs. one-shot chunks).

Accelerating metadata lookups is a no-brainer: whether via a special vdev or via the primary/secondary ARC, the end result is the same. I’m more inclined towards caching rather than a special vdev though, from a reliability standpoint: no need for redundancy, and I can remove it at any time.

Whether to cache chunks depends on how much space is available, on the total volume of chunks that end up being read repeatedly – it could indeed be a couple of TB – and on whether the extra boost in races won (if any) is worth some degree of SSD churn. Incidentally, I have a 2TB L2ARC, and it does not really fill up that much faster. That may be all that is needed.

As an aside, I have vfs.zfs.l2arc.rebuild_enabled=1 so the L2 cache persists across reboots, and I’ve raised vfs.zfs.l2arc_write_max to 25MB/s from the default 8MB/s, with vfs.zfs.l2arc_write_boost at double that.
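
In sysctl terms that is roughly the following (the byte values are my conversion of 25 MB/s and 50 MB/s; the same lines go into /etc/sysctl.conf to make them persistent):

    sysctl vfs.zfs.l2arc.rebuild_enabled=1       # keep L2ARC contents across reboots
    sysctl vfs.zfs.l2arc_write_max=26214400      # ~25 MB/s, up from the 8 MB/s default
    sysctl vfs.zfs.l2arc_write_boost=52428800    # double of write_max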

1 Like

Are races won by shouting “yeah, I have the part, get it from me”, or by actually sending it to the host? If the latter, I suspect peering has a far higher impact than storage latency.

The second version.

The uplink starts more uploads/downloads than needed to have enough pieces to reconstruct the file (download) or to store the file with enough redundancy (upload). I do not remember the exact numbers, but it’s something like 100 uploads are started at once, but only 80 are needed. 50 downloads are started at once, but only 30 pieces are needed to get the segment.

All transfers happen at the same time; once there are enough completed transfers, the rest are canceled – those nodes “lost the race”.

The satellite already knows that – that’s why the customer is requesting the file from you in the first place.

If I understand correctly, ultimately the total time to receive the full chunk is what matters: if you manage to get the chunk to the customer fast enough to be among the first N completed transfers – you get paid. Otherwise – you don’t.

That time consists of the latency to start transferring the data plus the transfer time itself. Caching is supposed to help with the former by reducing the “time to first byte”. The latter you can’t control (other than by generally improving QoS with SQM), because depending on network conditions you may simply be too far away to compete.