Let’s think through the optimal configuration for potato nodes and subpar hardware, which many of us probably have because of the ‘use what you have’ policy. Many of us have probably also faced some hardships during the recent tests. This topic isn’t meant to lament the obviously changed requirements for a node; it’s meant to think through options to improve your node as much as possible and to make the chances of surviving another test as high as possible.
The basic principles:
The device you have is weak, so a Raspberry Pi, Intel N95-300/N5xxx/N4xxx, Odroid, …
The device is quite low on RAM: less than 1 GB per TB you’re offering, most probably far below that.
You have subpar hardware, but you have at least one SSD with a few GBs free.
Implications:
Due to the memory/TB ratio, ZFS isn’t a real option here. BTRFS gave me so much trouble that I avoid it like the plague for storage nodes (although it’s useful in other use cases). That leaves EXT4 and XFS as file system options. I won’t discuss Windows here, because I don’t have profound experience with STORJ on Windows.
I don’t use RAID in this setup, because it also eats additional resources. STORJ is already taking care of the redundancy anyway.
I prioritize keeping the system running over preventing data loss, as long as the loss stays (far) below 2%, following STORJ’s recent policy of not syncing every file.
For this piece I’m using an N100 with 16 GB of RAM, offering 45 TB of data:
I installed the OS on a 512 GB SATA SSD, leaving some space for additional partitions.
In order to run filewalkers, I need as much RAM as possible for filesystem caching. So:
I installed zram-tools in order to compress low-use RAM, with config in /etc/default/zramswap:
ALGO=zstd
PERCENT=95
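For completeness, a minimal sketch of setting this up on a Debian-based system (package and service names as shipped by zram-tools):
apt install zram-tools
# edit /etc/default/zramswap with the values above, then:
systemctl restart zramswap
zramctl        # verify the compressed swap device exists
swapon --show  # confirm it is active as swap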
In /etc/sysctl.d/ I created a file 99-zram-params.conf:
A positive side effect is that written pages don’t linger too long in the page cache, but are flushed out to disk almost immediately, lowering the chance of data loss.
I lowered the tendency to evict inodes and dentries from the cache with:
vm.vfs_cache_pressure=2
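The post doesn’t reproduce the full contents of 99-zram-params.conf, so as an illustrative sketch only (the swappiness and page-cluster values follow the kernel’s zram documentation, and the dirty_* values are placeholders rather than the author’s actual settings), such a file could look like:
# /etc/sysctl.d/99-zram-params.conf (illustrative values)
vm.swappiness = 180              # prefer zram swap over dropping page cache
vm.page-cluster = 0              # no swap read-ahead; zram is random-access
vm.vfs_cache_pressure = 2        # keep inodes/dentries cached as long as possible
vm.dirty_background_ratio = 1    # start writeback almost immediately
vm.dirty_ratio = 5               # cap dirty pages so writes don’t linger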
I moved all databases from the storagenodes to the SSD.
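For reference, relocating the databases is usually done via the storage2.database-dir option in the node’s config.yaml (the path below is a placeholder; copy the existing *.db files there while the node is stopped):
storage2.database-dir: /mnt/ssd/storagenode1/dbs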
I enabled fast_commit on all ext4 drives.
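For an existing ext4 file system this can be switched on offline with tune2fs (kernel 5.10+ and e2fsprogs 1.46+; device and mount point are placeholders):
umount /mnt/disk1
tune2fs -O fast_commit /dev/sdX1
mount /mnt/disk1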
I created an external journal for all non-SSD ext4 drives on the remaining space of the OS drive.
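Roughly, the procedure looks like this (placeholder devices; the HDD file system must be unmounted, and the journal device’s block size has to match the file system’s):
mke2fs -O journal_dev -b 4096 /dev/sdY5   # SSD partition reserved as external journal
tune2fs -O ^has_journal /dev/sdX1         # drop the HDD's internal journal
tune2fs -J device=/dev/sdY5 /dev/sdX1     # attach the external journal on the SSD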
I mount all ext4 drives with journal_async_commit,noatime,lazytime,commit=150 (the latter will probably never come into effect, due to the zealous flushing to disk).
And now I have a quite stable system, without stalls.
We’ll see how long it keeps running this stable.
And I’m really interested in any ideas from others on how to squeeze as much as possible out of the system.
I have a system somewhat weaker than yours, an AMD N36L with 16 GB of RAM, maintaining nodes of a similar size to yours. The CPU is not strong enough for zram-tools.
As I wrote here, -I 128 reduces metadata size. I expect ~1 GB of RAM to be enough to cache 2 TB worth of piece files with this setting. Requires a fresh file system.
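For clarity, that is the mkfs.ext4 inode-size option (device is a placeholder):
mkfs.ext4 -I 128 /dev/sdX1   # 128-byte inodes instead of the default 256, halving the inode tables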
LVMcache is a good thing.
BTW, I suspect after recent changes disabling synchronous writes that btrfs might actually be viable now.
Why not? Even on an RPi it was worthwhile, and that’s a weaker system than yours.
I did implement that, indeed. XFS, on the other hand, seems to be quicker, although it has bigger metadata.
That’s a different story: bcache and LVMcache cache the file data. But that doesn’t help much if the HDD is always at its limits, like SMR HDDs.
Bcachefs hasn’t matured enough at this moment, in my opinion.
An external log on SSD could improve performance by 40%.
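If this refers to an XFS external log device, a sketch with placeholder devices would be:
mkfs.xfs -l logdev=/dev/sdY6,size=128m /dev/sdX1   # put the XFS log on an SSD partition
mount -o logdev=/dev/sdY6,noatime /dev/sdX1 /mnt/disk2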
I re-read that post (and quite a few of the subsequent ones). I note you didn’t see a huge amount of performance improvement.
I am, however, interested in reducing filewalker times and I got the impression that out of all those bespoke mkfs options, -I 128 was the one that made a difference.
Would that be a fair comment?
I am deploying a few more nodes and might format them with that option if you think it’ll be helpful.
The N100 scores about 5.5k on PassMark’s CPU Mark benchmark; the N36L scores 487.
And the CPU doesn’t have the right instructions to hardware-accelerate compression.
Not true. LVMcache caches the block device, which includes both metadata and data. And since metadata is accessed much more often, it effectively gets priority.
At the time, no. Now we know that in the regime of 0.5 to 1 GB of RAM per TB the speedup is huge, exactly because the metadata fits in RAM with -I 128 and no longer does without it.
But below 0.5 GB per TB again there’s not much speedup.
IIRC the other options only help with disk space; no performance changes expected.
Yeah, and you only want metadata to be cached. So actually we’re looking for an option to expand the VFS cache. From this perspective ZFS is great, and bcachefs also looks promising. But so far they are, respectively, too RAM-hungry and too immature.
I will challenge you on that. My plan is to run ZFS on a Pi5. I don’t expect it to be as fast as on a machine with enough memory, but that doesn’t mean the storage node will suffer; it might still run great. I will most likely still need weeks to finish the planning, but one day we can compare our filewalker runtimes.
https://man7.org/linux/man-pages/man7/lvmcache.7.html
You can configure:
metadata_only = 0|1
    Only metadata is promoted to the cache. This option improves performance for heavier REQ_META workloads.
Never tried btw…
Well, if I should compete against myself, the outcome is obvious; I have no motivation to do so. If you want to challenge me, you will have to run the alternative configuration and then we can both compare our results.
Yeah, I wish I could too. It’s not possible at the moment: I have no fixed IPv4 and most of the places I stay are behind CGNAT, since I’m constantly moving. If I used a VPN, it would not add anything.
I recall there were some hosting companies offering RPis “in the cloud”. If Storj were to sponsor a few of them “for research”, I’m pretty sure you’d find volunteers on the forum to run tests.
I do, because as soon as something else gets cached as well, it will be less efficient. Certainly in the context of STORJ.
For that to work, LVM would have to not be filesystem agnostic (which it is), or they would need a very sophisticated way to differentiate metadata from other random reads.
So, I really wonder how they would do it.
Oh, it’s only about the LVM metadata:
Cache metadata logical volume — the logical volume containing the metadata for the cache pool logical volume, which holds the accounting information that specifies where data blocks are stored (for example, on the origin logical volume or the cache data logical volume).
If you want to dig deeper, you can set up device mapper so that only the block group metadata (descriptor, block/inode bitmaps, inode table) areas on ext4 are cached. This is pretty close to black magic though, and direntries wouldn’t be cached this way either. So indeed if you really, really need to cache only metadata, then ext4+LVMcache is not the right choice.
As for me, I sized my LVMcache so that not much more than metadata can fit it.
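For reference, attaching a deliberately small cache with the cachevol layout from the man page above could look like this (volume group, LV names and sizes are placeholders):
lvcreate -n meta-cache -L 20G vg_node /dev/nvme0n1p4   # small SSD LV, sized roughly for metadata
lvconvert --type cache --cachevol meta-cache --cachemode writethrough vg_node/node-data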
I would posit though that they might still be better than other choices. Will wait for @littleskunk’s experiment results, might be quite interesting in this context.
A 6 TB node on an 18 TB drive takes about 3 hours for one filewalker run. The success rate while the filewalker is active is about 95%. The moment the filewalker finishes, the success rate goes up to 98-99%. So ZFS does work even with less available memory.
It is a Pi5 with 8 GB of RAM and an NVMe drive for operating system and caching.
zpool create SN1 /dev/disk/by-id/...      # one pool per HDD, addressed by stable ID
zpool add SN1 cache /dev/disk/by-id/...   # L2ARC cache on an SSD partition
zfs set compression=on SN1
zfs set sync=disabled SN1                 # no synchronous writes
zfs set atime=off SN1
zfs set recordsize=1M SN1
zfs set mountpoint=/mnt/sn1 SN1
zfs set primarycache=metadata SN1         # ARC (RAM) caches metadata only
zfs set secondarycache=metadata SN1       # L2ARC (SSD) caches metadata only
And then repeat this for each drive. Each drive gets its own SSD partition for caching the metadata. I don’t use a special vdev, because that would be required for the pool to keep running. In my setup the caching partition is optional: if it fails, I can still buy a new SSD and continue running the node. To my understanding, if a special vdev fails, it’s game over and you lose all data.
I am still testing my setup. It is currently just a single HDD. Over the next few days I am going to add more and more hard drives to the Pi5 to find out at which point the success rate gets impacted. I am going to write up a summary for others so they can copy my setup.
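For anyone copying this setup, a quick way to check whether the metadata caching is doing its job could be the standard OpenZFS tools (exact tool name may vary by distribution):
zpool iostat -v SN1 5   # per-vdev I/O, including the cache partition
arc_summary | less      # ARC/L2ARC hit rates and metadata usage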