Best configuration for potato nodes and subpar hardware

JWvdV · June 17, 2024, 8:55pm

I thought let’s think through the optimal configuration for potato nodes and subpar hardware, which many of us probably have because the ‘use what you have’-policy. Also many of us have probably faced some hardships, during recent tests. This topic isn’t meant to lament on the obviously changed requirements for a node. This topic is meant, to think out options to improve your node as much as possible and also making chances as high as possible to survive another test.

The basic principles:

The device you have is weak, so a Raspberry Pi, Intel N95-300/N5xxx/N4xxx, Odroid, …
The device is quite low on RAM, <1GB/TB you’re offering; most probably far below.
You have subpar hardware, but you have at least one SSD with a few GBs free.

Implications:

Due to memory/TB, ZFS isn’t a real option here. BTRFS gave that much trouble, that I avoid it like the plague for storage nodes (although usefull in other use cases). So for file systems EXT4 and XFS are an option. I don’t talk Windows here, because I don’t have profound experience with STORJ and Windows.
I don’t use RAID in this setup, because it also eats additional resources. STORJ is already taking care of the redundancy anyway.
I priorityze preventing data loss below keeping the system running, as long as it is (far) below 2%. Following the recent policy of STORJ (not syncing every file).

For this piece I’m using a N100 with 16GB, offering 45TB of data:

I installed the OS on an 512GB SATA SSD, leaving some space for additional partitions.
In order to run filewalkers, I need as much as possible RAM for filesystem caching. So
I installed zram-tools in order to compress low-use RAM, with config in /etc/default/zramswap:

ALGO=zstd
PERCENT=95

In /etc/sysctl.d/ I created a file 99-zram-params.conf:

vm.swappiness=190
vm.watermark_boost_factor=0
vm.watermark_scale_factor=125
vm.page-cluster=0

Based on content like GitHub - facebook/zstd: Zstandard - Fast real-time compression algorithm and zram - ArchWiki (archlinux.org)

Furthermore, I decreased the dirty page cache by:

vm.dirty_bytes=200000000
vm.dirty_background_bytes=80000000

In order to prevent stalls, I also increased writeback frequency and lowered the time-to-become dirty.

vm.dirty_expire_centisecs=1000
vm.dirty_writeback_centisecs=500

Positive side-effect is that written pages are not lingering all to long in the page cache, but are flushed out to the disk almost immediately lowering the chance of data loss.

I lowered the tendency of clearing inodes and dentry cache by

vm.vfs_cache_pressure=2

I moved all databases from the storagenodes to the SSD.
I enabled fast_commit on all ext4 drives.
I created a external journal for all non-SSD ext4 drives on the remaining space of the OS-drive.
I mount all ext4 drives with journal_async_commit,noatime,lazytime,commit=150 ( latter will problem never come into effect, due to zealous flushing to the disk ).

And now I’m having a quite stable system, without stalls.
See how long it’s running this stable.

And I’m really interested in any ideas from others, how to squeeze as much as possible out the system.

Toyoo · June 17, 2024, 9:41pm

I have a system somewhat weaker than yours, with AMD N36L and 16 GB of RAM, maintaining nodes of a similar size to yours. CPU not strong enough for zram-tools.

As I wrote here, -I 128 reduces metadata size. I expect ~1 GB of RAM to be enough to cache 2 TB worth of piece files with this setting. Requires a fresh file system.

LVMcache is a good thing.

BTW, I suspect after recent changes disabling synchronous writes that btrfs might actually be viable now.

JWvdV · June 18, 2024, 8:24am

Why not? Even in RPi it was worthwhile. That’s a weaker system than yours.

I also implemented that, indeed. XFS on the other hand seems to be quicker, although having bigger meta data.

Different stories, bcache and LVMcache cache the file data. Different stories. But doesn’t help that much if the HDD always is on its limits, like SMR HDDs.
BcacheFS hasn’t matured enough at this moment in my opinion opinion.
External log on SSD, could improve performance by 40%.

ACarneiro · June 18, 2024, 10:23am

I re-read that post (and quite are wow the subsequent ones). I note you didn’t see a huge amount of performance improvement.
I am, however, interested in reducing filewalker times and I got the impression that out of all those bespoke mkfs options, -I 128 was the one that made a difference.
Would that be a fair comment?

I am deploying a few more nodes and might format them with that option if you think it’ll be helpful

Toyoo · June 18, 2024, 10:46am

N100 has 5.5k CPU Mark on Passmark benchmarks, N36L has 487.

And CPU doesn’t have the right instructions to hardware-accelerate compression.

Not true. LVMcache caches block device, which includes both metadata and data. And given metadata is much more often accessed, it has priority.

At the time, no. Now we know that in regime of 0.5 to 1 GB of RAM per TB the speedup is huge exactly because metadata fit it RAM with -I 128, and no longer so without.

But below 0.5 GB per TB again there’s not much speedup.

IIRC other options help with just disk space, no performance changes expected.

ACarneiro · June 18, 2024, 11:02am

Thank you
I will use -I 128 on my most RAM-constrained machine and see how it goes

JWvdV · June 18, 2024, 11:07am

Yeah, and you only want meta data to be cached. So actually we’re looking for an option to expand the VFS cache. From this perspective ZFS is great, and also bcachefs looks promising. But up till now respectively too RAM consuming and immature.

littleskunk · June 18, 2024, 1:00pm

I will challenge you on that. My plan is to run ZFS on a Pi5. I don’t expect it to be as fast as on a machine with enough memory but that doesn’t mean the storage node will suffer. It might still run great. I will most likely still need weeks to finish the planning but one day we can compare our filewalker runtime

Toyoo · June 19, 2024, 12:12am

Nope. I want metadata to be cached. But I don’t care whether there’s anything else cached too. There no “only”.

Alexey · June 19, 2024, 5:59am

I would like to give you the challenge back: run the second on ext4 and compare

agente · June 19, 2024, 7:13am

https://man7.org/linux/man-pages/man7/lvmcache.7.html
You can configure: metadata_only = 0|1
Only metadata is promoted to the cache. This option
improves performance for heavier REQ_META workloads.
Never tried btw…

littleskunk · June 19, 2024, 7:57am

Well if I should compete against myself the outcome is obvious. I have no motivation to do so. If you want to challenge me you will have to run the alternative configuration and we both compare our results.

Alexey · June 19, 2024, 8:10am

yeah, I wish too. Not possible at the moment. So, no fixed IPv4 and most of places are CGNAT - I’m constantly moving. If I would use a VPN, it would not add anything.

Toyoo · June 19, 2024, 12:15pm

I recall there were some hosting companies with RPi “in the cloud”. If Storj would sponsor a few of them “for research”, I’m pretty sure you’d find volunteers on the forum to run tests.

JWvdV · June 19, 2024, 12:43pm

I do, because as soon there’s being cached something else it will be less efficient. For sure in contact of STORJ.

That would mean that LVM should be filesystem agnostic which isn’t the case, or they have a very sophisticated way to differentiate metadata from other random reads.
So, I really wonder how they would do it.

Ow, it’s only about the LVM metadata:

Cache metadata logical volume — the logical volume containing the metadata for the cache pool logical volume, which holds the accounting information that specifies where data blocks are stored (for example, on the origin logical volume or the cache data logical volume).

See: 5.4.7. Creating LVM Cache Logical Volumes Red Hat Enterprise Linux 6 | Red Hat Customer Portal

Toyoo · June 19, 2024, 9:23pm

Yep, LVMcache is filesystem-agnostic.

If you want to dig deeper, you can set up device mapper so that only the block group metadata (descriptor, block/inode bitmaps, inode table) areas on ext4 are cached. This is pretty close to black magic though, and direntries wouldn’t be cached this way either. So indeed if you really, really need to cache only metadata, then ext4+LVMcache is not the right choice.

As for me, I sized my LVMcache so that not much more than metadata can fit it.

I would posit though that they might still be better than other choices. Will wait for @littleskunk’s experiment results, might be quite interesting in this context.

littleskunk · June 28, 2024, 10:37am

6TB node on a 18 TB drive takes about 3 hours for one filewalker run. Success rate while the filewalker is active is about 95%. The moment the filewalker finishes the success rate goes up to 98-99%. So ZFS does work even with less available memory.

Here are some stats:

ZFS Subsystem Report                            Fri Jun 28 12:36:25 2024
Linux 6.6.31+rpt-rpi-2712                           2.2.3-1~bpo12+1~rpt1
Machine: pi5storagenode (aarch64)                   2.2.3-1~bpo12+1~rpt1

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                    99.1 %    3.9 GiB
        Target size (adaptive):                       100.0 %    3.9 GiB
        Min size (hard limit):                          6.2 %  251.6 MiB
        Max size (high water):                           16:1    3.9 GiB
        Anonymous data size:                            4.6 %   68.9 MiB
        Anonymous metadata size:                        1.4 %   21.1 MiB
        MFU data target:                              < 0.1 %  121 Bytes
        MFU data size:                                  0.0 %    0 Bytes
        MFU ghost data size:                                     0 Bytes
        MFU metadata target:                           16.2 %  244.3 MiB
        MFU metadata size:                             11.1 %  167.5 MiB
        MFU ghost metadata size:                               453.6 MiB
        MRU data target:                              < 0.1 %  121 Bytes
        MRU data size:                                  0.0 %    0 Bytes
        MRU ghost data size:                                     0 Bytes
        MRU metadata target:                           83.8 %    1.2 GiB
        MRU metadata size:                             78.6 %    1.2 GiB
        MRU ghost metadata size:                                98.2 MiB
        Uncached data size:                             4.4 %   65.7 MiB
        Uncached metadata size:                         0.0 %    0 Bytes
        Bonus size:                                     0.9 %   34.8 MiB
        Dnode cache target:                            10.0 %  402.6 MiB
        Dnode cache size:                              27.1 %  109.2 MiB
        Dbuf size:                                      1.2 %   46.7 MiB
        Header size:                                    0.8 %   32.1 MiB
        L2 header size:                                33.4 %    1.3 GiB
        ABD chunk waste size:                          23.2 %  926.2 MiB

ARC hash breakdown:
        Elements max:                                              14.7M
        Elements current:                              99.6 %      14.7M
        Collisions:                                               155.0M
        Chain max:                                                    38
        Chains:                                                     1.0M

ARC misc:
        Deleted:                                                   55.0M
        Mutex misses:                                              29.9k
        Eviction skips:                                            16.7M
        Eviction skips due to L2 writes:                               0
        L2 cached evictions:                                   808.7 GiB
        L2 eligible evictions:                                 243.2 GiB
        L2 eligible MFU evictions:                      5.2 %   12.5 GiB
        L2 eligible MRU evictions:                     94.8 %  230.7 GiB
        L2 ineligible evictions:                                 3.8 GiB

ARC total accesses:                                               430.8M
        Total hits:                                    85.6 %     368.9M
        Total I/O hits:                               < 0.1 %      70.2k
        Total misses:                                  14.3 %      61.8M

ARC demand data accesses:                               0.7 %       3.1M
        Demand data hits:                              38.4 %       1.2M
        Demand data I/O hits:                         < 0.1 %        820
        Demand data misses:                            61.5 %       1.9M

ARC demand metadata accesses:                          97.8 %     421.5M
        Demand metadata hits:                          87.2 %     367.6M
        Demand metadata I/O hits:                     < 0.1 %      67.3k
        Demand metadata misses:                        12.8 %      53.8M

ARC prefetch data accesses:                           < 0.1 %      10.3k
        Prefetch data hits:                             5.3 %        547
        Prefetch data I/O hits:                         9.2 %        952
        Prefetch data misses:                          85.5 %       8.8k

ARC prefetch metadata accesses:                         1.5 %       6.3M
        Prefetch metadata hits:                         2.1 %     131.5k
        Prefetch metadata I/O hits:                   < 0.1 %       1.2k
        Prefetch metadata misses:                      97.9 %       6.1M

ARC predictive prefetches:                             99.7 %       6.2M
        Demand hits after predictive:                  95.1 %       5.9M
        Demand I/O hits after predictive:               4.8 %     302.0k
        Never demanded after predictive:                0.1 %       6.0k

ARC prescient prefetches:                               0.3 %      20.8k
        Demand hits after prescient:                   93.5 %      19.5k
        Demand I/O hits after prescient:                6.5 %       1.4k
        Never demanded after prescient:                 0.0 %          0

ARC states hits of all accesses:
        Most frequently used (MFU):                    71.6 %     308.7M
        Most recently used (MRU):                      13.7 %      59.1M
        Most frequently used (MFU) ghost:               0.7 %       2.9M
        Most recently used (MRU) ghost:                 0.5 %       2.2M
        Uncached:                                       0.3 %       1.2M

DMU predictive prefetcher calls:                                  891.6k
        Stream hits:                                    6.0 %      53.7k
        Stream misses:                                 94.0 %     837.9k
        Streams limit reached:                         86.7 %     726.3k
        Prefetches issued                                           9.0k

L2ARC status:                                                    HEALTHY
        Low memory aborts:                                         70.2k
        Free on write:                                              2.5k
        R/W clashes:                                                   0
        Bad checksums:                                                 0
        Read errors:                                                   0
        Write errors:                                                  0

L2ARC size (adaptive):                                         224.7 GiB
        Compressed:                                    15.3 %   34.4 GiB
        Header size:                                    0.6 %    1.3 GiB
        MFU allocated size:                             1.3 %  440.3 MiB
        MRU allocated size:                            98.7 %   33.9 GiB
        Prefetch allocated size:                      < 0.1 %    1.6 MiB
        Data (buffer content) allocated size:           0.0 %    0 Bytes
        Metadata (buffer content) allocated size:     100.0 %   34.4 GiB

L2ARC breakdown:                                                   61.8M
        Hit ratio:                                     82.1 %      50.8M
        Miss ratio:                                    17.9 %      11.0M

L2ARC I/O:
        Reads:                                      142.9 GiB      50.8M
        Writes:                                      15.2 GiB      14.0k

L2ARC evicts:
        L1 cached:                                                     0
        While reading:                                                 0

JWvdV · June 28, 2024, 12:22pm

What’s your setup?
I see L2ARC device of 225G?
Special devs?

littleskunk · June 28, 2024, 2:00pm

It is a Pi5 with 8 GB of RAM and an NVMe drive for operating system and caching.

zpool create SN1 /dev/disk/by-id/...
zpool add SN1 cache /dev/disk/by-id/...
zfs set compression=on SN1
zfs set sync=disabled SN1
zfs set atime=off SN1
zfs set recordsize=1M SN1
zfs set mountpoint=/mnt/sn1 SN1
zfs set primarycache=metadata SN1
zfs set secondarycache=metadata SN1

And than repeat this for each drive. Each drive gets its own SSD partition for caching the metadata. I can’t get a special dev because that would be required to run it. In my setup the caching partition is optional. If it fails I can still buy a new SSD and continue running the node. To my understanding if the special devs are failing its game over and you lose all data.

I am still testing my setup. It is currently just a single HDD. Over the next days I am going to add more and more hard drives to the Pi5 to find out at which point the success rate gets impacted. I am going to write up a summary for others so they can copy my setup.

JWvdV · June 28, 2024, 4:31pm

You can also partition your NVMe and use those partitions as special devs. You need depending on the recordsize 5GB/TB with recordsize 256k.

Therefore I use two SSDs, from which I mirror the partitions for the special devs.