Help with node optimizations / best practices

I can see where you’re coming from, but honestly: for most of us it took quite a while to get what we’ve got now working. Especially since many of us are just hobbyists. So I actually have some drives formatted as xfs, f2fs, reiserfs and even a combined drive with mergerfs of 2x xfs + 1x ext4 (stupidest thing I ever did), aside from the recommended ext4 option.

Since I want to keep it as dirt cheap as possible (because it’s a hobby and for the fun of it), I have many different drives, even some micro-SDs, as storage.

Apart from the filewalker issue on some of my nodes, it’s been running rock solid for over 3 months, after six months of juggling and trying.

So, yeah, it might be an investment. It’s up to you to decide whether it’s worth your time. But I really would consider simplifying it a bit.

As you can see above, I run 15 nodes with just 16GB of RAM (most of which isn’t being used) and an N100 processor, which really isn’t a beast, and I’m even running some other processes alongside it, like a VPN, a Syncthing backup and such. The processor can easily cope with it all. It’s IO that’s the bottleneck, which is to be expected.

Of course. But since you suggested creating a partition on the same NTFS drive, it would be easier to perform all the moves and merge partitions with LVM, because you can do it on the fly.

  1. shrink the current NTFS partition to 50%,
  2. create an LVM partition in the freed space, create a PV and an LV, and format the LV as ext4,
  3. move the node from NTFS to this new volume with the rsync method,
  4. remove the NTFS partition, create a new LVM partition in its place, create a PV and add it to the existing VG,
  5. pvmove the data from the PV at the end of the disk to the PV at the beginning, then remove the emptied PV from the VG,
  6. extend the first PV to use all available space, extend the LV, and grow the ext4 filesystem.

The downtime would be minimal: only during the final rsync --delete in step 3; everything else can be done on the fly.
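Purely as a hedged sketch of those steps (the device names, VG/LV names and mount points are placeholders, and the NTFS shrink itself plus the partition-table edits with ntfsresize/parted are not shown):

```bash
# Step 2: build PV -> VG -> LV on the new partition created in the freed
# space (/dev/sdX2 is a placeholder) and format it ext4 with 0% reservation.
pvcreate /dev/sdX2
vgcreate storj /dev/sdX2
lvcreate -l 100%FREE -n node1 storj
mkfs.ext4 -m 0 /dev/storj/node1
mount /dev/storj/node1 /mnt/node1

# Step 3: first pass while the node is running, final pass with it stopped.
rsync -a /mnt/ntfs-node1/ /mnt/node1/
rsync -a --delete /mnt/ntfs-node1/ /mnt/node1/   # node stopped for this one

# Steps 4-5: turn the old NTFS partition (/dev/sdX1, placeholder) into a
# second PV, migrate the extents to the start of the disk, drop the old PV.
pvcreate /dev/sdX1
vgextend storj /dev/sdX1
pvmove /dev/sdX2 /dev/sdX1
vgreduce storj /dev/sdX2

# Step 6: after growing the sdX1 partition to the whole disk (parted/fdisk),
# let LVM and ext4 see the extra space; resize2fs can grow ext4 online.
pvresize /dev/sdX1
lvextend -l +100%FREE /dev/storj/node1
resize2fs /dev/storj/node1
```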

Do you recommend LVM for single ext4 drives as well? I’m kinda scared and don’t have LVM enabled, just the full disk formatted with ext4 and 0% reserved blocks.

No. An LVM volume will be slower than plain ext4. But it has some advantages, like the ability to move data from one disk/partition to another or build a parity RAID without downtime… However, it has zero tolerance for bitrot, unlike ZFS or BTRFS (the latter only in the Synology implementation, though).
But…

Are there any practical numbers for when bitrot starts to occur on an HDD? Like, for example, a year, or five? Do you have any practical experience?

Check the Backblaze stats.

The bigger the disk and the lower its quality, the more inevitable bitrot becomes.
In a RAID setup it’s fatal, unless you use ZFS in a pool with redundancy.
Otherwise, the risk is lower with a single-drive setup using plain ext4 on Linux or NTFS on Windows (please, DO NOT MIX the OS and FS).

Could you please tell me which template you are using?


I use the debian-minimal template in a privileged container with nesting, because it’s the most stable distro I know.


Thanks! Debian 11, 12 or 10?

The Debian download template, which is 12.5, I believe.
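In case it helps, a minimal sketch of creating such a container on Proxmox (the container ID, hostname, storage names and template file name are all placeholders; check `pveam list local` for the template you actually downloaded):

```bash
# Hypothetical example: a privileged Debian 12 container with nesting enabled
# (all IDs, names and sizes below are placeholders).
pct create 201 local:vztmpl/debian-12-standard_12.2-1_amd64.tar.zst \
  --hostname storagenode1 \
  --unprivileged 0 \
  --features nesting=1 \
  --cores 2 --memory 2048 \
  --rootfs local-lvm:8 \
  --net0 name=eth0,bridge=vmbr0,ip=dhcp
pct start 201
```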


After a few months of testing things and then migrating all my nodes, the constant 100% activity on all drives is fixed, IO delay in Proxmox is much lower, and the filewalker only takes a few hours for a 7TB node.

After diagnosing that read speeds topped out at about 300 KB/s per drive while running Storj (other workloads seemed OK in the same setup) while write speeds were close to bare metal, I took a ~2TB node as a test subject and controlled the conditions so that the node would receive minimal ingest from Storj.

Filewalker “speed” on a ~2TB node:
~6 days: disk passed through to a Windows VM, formatted as NTFS with a 4K cluster size, default Proxmox cache/aio options with iothread enabled, indexing disabled in Windows.
14 hours: same disk, but passed through via USB. This is what made me realize that the problem had to have emerged or worsened on its own, maybe due to updates or something, and had to lie in the interaction of passthrough with other things, since one way I used to stop nodes from dropping in the past was to move the drives from USB on a bare-metal machine to passthrough on a VM.
36 minutes: everything the same as in the 6-day test, but LVM instead of passthrough. Theory confirmed.
30 minutes: same, but with writeback cache in Proxmox.
27 minutes: same, but with writethrough cache in Proxmox.

Other things I’ve tried that had negligible impact on performance, made it worse, or had other big drawbacks I won’t get into:
ReFS formatting, aio=native with LVM/SCSI, cluster size increases, exFAT formatting, and device caching turned off in Windows.
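For anyone who wants to reproduce the cache change, a hedged sketch of what it looks like on the Proxmox CLI (the VM ID, bus slot and volume name are placeholders; the iothread option only takes effect with the VirtIO SCSI single controller):

```bash
# Hypothetical example: attach an LVM-backed volume with writethrough cache
# and an IO thread (101, scsi1 and the volume name are placeholders).
qm set 101 --scsihw virtio-scsi-single
qm set 101 --scsi1 local-lvm:vm-101-disk-1,cache=writethrough,iothread=1

# Check the resulting configuration.
qm config 101 | grep -E 'scsihw|scsi1'
```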


Thank you for confirming my suspicion that virtualization is the key factor in the performance drop, and a special thanks for figuring out what the culprit is when you’re forced to use a VM on Proxmox!
I would like to see a similar report for ESXi, though.

Multiple HDDs? Are you running a RAID array on top of your disks? I see in your post that you’ve upgraded to more expensive disks, but is each carrying a single node or multiple workloads? I ran 8x 20TB disks at home for some time in RAID5 on ext4, with 4TB of read/write cache in front, and still had huge issues with write amplification. It’s as @Alexey says:

The virtualization usually affecting disk IOPS

Which is true. All virtualization carries a performance hit on almost all fronts, but if configured correctly it should just be a few percent. @Toyoo has a good analogy about all the switches needing to be in the right position before your performance is great. I work as a compute admin for a huge VMware setup, and we don’t have any issues either. I use VMware at home to virtualize my Storj setup as well, and I treat all my Storj VMs as I treat my production database VMs at work, because that’s the workload Storj most resembles. That being said, it’s not the same; Storj is very unique in regards to its performance characteristics, there is not really anything like it: it’s almost completely random reads at all times, the stored files are almost all within the same size threshold, and it’s very sensitive to IO degradation (just like databases). Oh well.


Here are the settings I use:

Storage Array:

  • If your HDDs are local to your VM hosts, just use a single datastore per disk.
  • My Synology runs single-disk volumes on single-disk storage pools. Always formatted ext4, always running iSCSI (not NFS!) on dedicated NICs, and always using thick-provisioned LUNs.
    • I am debating switching to ZFS on a homebrew machine so I can use the special metadata device (see the sketch after this list), but what I have now works and I don’t want to spend any additional time or money right now.
  • I am using the previously mentioned standalone 8x 20TB disks (MG10ACA20TE), and a RAID10 array of 2TB SSDs as read/write cache, allocating 400GB of cache to each disk. This is massively overkill, but the performance is great. Here is a screenshot of three of the disks:
  • And here is a screenshot of the performance of one of the LUNs. It could be better, but it’s at the upper end of what I expect from cached HDDs:
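For what it’s worth, a hedged sketch of what that ZFS idea could look like (pool layout, device names and the small-block threshold are assumptions, not a recommendation for this setup):

```bash
# Hypothetical sketch: a raidz2 pool of HDDs with a mirrored "special" vdev
# on NVMe, so metadata lands on flash instead of the spinning disks.
zpool create tank \
  raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh \
  special mirror /dev/nvme0n1 /dev/nvme1n1

# Optionally also send small blocks (here up to 64K) to the special vdev.
zfs set special_small_blocks=64K tank
```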

Hardware:

  • Identical VM hosts: Intel 12400, 128GB RAM, 2TB of local mirrored NVMe, 4x 1Gb NICs.
  • … But none of that really matters. As long as your hypervisor is not swapping to disk due to memory exhaustion, and as long as you’re alright with the CPU utilization, you can use whatever you want. I’d still recommend local NVMe for the OS drives and dedicated NICs for SAN uplinks.

VMware:

  • Here is where it gets more interesting. I always assign 1GB of RAM per TB of used Storj space for Storj workloads, with a minimum of 2GB, because I run Windows on my guests. I give them 2 vCPUs, because I like them to have the ability to burst when doing updates.
  • All my VMs are regular Win10 VMs, but stripped of most of their unused stuff thanks to Chris Titus’ scripts: Debloat Windows in 2022. This brings idle memory consumption down significantly.
    • I do this on all my Windows machines. It’s a wonderful script.
  • All VMs have their C:\ drive on local NVMe. All VMs have a second 10GB disk, also on local NVMe, which is only used for the Storj databases, and the Storj data directory is built on multiple spanned 1TB disks:
  • I use spanned disks in Windows, not striped, because the limiting factor for performance is going to be my underlying hardware and not Windows’ implementation of striping. Yes, there is a higher chance of data corruption when running it this way, but the disk should only corrupt if your SAN goes out. If this method is good enough for work, it’s good enough for me.
  • Remember to enable round-robin on your SAN uplinks (see the sketch after this list), and try to have enough NICs so your SAN uplinks are not also your VM network.
  • When running larger nodes, it is advised to add additional vSCSI controllers to the VM and balance the disks across them, since each controller gets its own IO queue. It is also advised to put each VM disk on its own datastore, since each datastore also gets its own IO queue. This would be good advice when running massive nodes (30TB+), but I don’t, and even if I did, my underlying storage of single HDDs would be the bottleneck, not the storage setup.
  • I run more nodes than I have disks, so I break the “One Node, One Disk” rule, but while one disk can hold multiple nodes I do make sure that a node only lives on a single disk.
  • I have spare local NVMe. I’ve thought about using it as local host cache for SWAP, but like I said, if you’re swapping, you have other issues.
  • Disks must not be thin provisioned. The write penalty for requesting new blocks when expanding the disks is simply too big. If your disk is full and already thin provisioned, there is no need to inflate it (because all the bytes in its allocated size have already been written), but I would highly suggest inflating all other thin disks. My old nodes are running thick lazy-zeroed disks, but I create all newer disks as thick eager-zeroed. I can’t feel the performance difference yet, but when creating new disks I can pay the future IO cost of zeroing up front, and that’s wonderful.
    • If you want to inflate a thin disk, you can either svMotion it to a different datastore and choose thick provisioned as the target disk format, or navigate to the .vmdk file in the datastore file browser and press “Inflate” there (see the sketch after this list).
  • Don’t use VMware snapshots. Snapshots are not backups, they are terrible for performance, and they grow crazy fast in this use case. If you have to take one (perhaps you’re worried about an update?), please stop your node process first, and delete the snapshot before starting the node again.
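And, for reference, a hedged sketch of the two CLI bits mentioned above, run in the ESXi host shell (the device identifier and the VMDK path are placeholders):

```bash
# Set the round-robin path selection policy on one iSCSI device
# (get the device ID from "esxcli storage nmp device list").
esxcli storage nmp device set --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_RR

# Inflate a thin-provisioned VMDK from the CLI instead of the datastore
# browser (power the VM off first; the path is a placeholder).
vmkfstools --inflatedisk /vmfs/volumes/datastore1/node01/node01_1.vmdk
```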

Aaaaaand I think that’s it, really. It’s a LOT more involved than running a single machine with Docker and assigning a disk to each container. If you have to use virtualization, follow the advice above, and consider running a Linux distro with Docker instead of Windows.
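For comparison, the Docker route per disk looks roughly like this (a sketch based on the standard storagenode setup; the wallet, email, address, paths and size are placeholders):

```bash
# Hypothetical example: one storagenode container per disk, with the identity
# and data living on that disk (all values below are placeholders).
docker run -d --restart unless-stopped --stop-timeout 300 \
  -p 28967:28967/tcp -p 28967:28967/udp -p 127.0.0.1:14002:14002 \
  -e WALLET="0xYourWalletAddress" \
  -e EMAIL="you@example.com" \
  -e ADDRESS="your.ddns.example.com:28967" \
  -e STORAGE="7TB" \
  --mount type=bind,source=/mnt/disk1/identity,destination=/app/identity \
  --mount type=bind,source=/mnt/disk1/storagenode,destination=/app/config \
  --name storagenode1 storjlabs/storagenode:latest
```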

I only do it this way, because it conveniently allows me to test some larger scripts and orchestration I use at work in my spare time.

I’ll update this comment, if I find anything else cool


This is off-topic, but I’m curious: given what you’ve spent on RAM and 20TB HDDs, why are you running 4x 1Gb NICs… when used SFP+ gear has become so cheap? Is it because you already owned it and prefer physical isolation? VLAN’d 10Gbps can be done on homelab budgets these days! :slight_smile:

Not a critique at all: it just didn’t seem to line up with the thoughtful provisioning you’ve done for everything else. I’m still jealous!


Good question! I got some nice discounts on the components. Disks are never cheap, lol

Regarding the NICs, yeah, sure: I’ve got a stack of X520s lying on my desk right now. I used to use them, but they suck a tonne of power, and I don’t really need the extra performance. I mean, it’s nice when doing svMotions, but the whole point of much of the storage is to just stay in one place. Besides, I like not having vCenter nag about uplink redundancy.

The Synology already has 4x NICs out of the box, so I don’t need an add-in card at all for those. My motherboards also have two NICs, but I added a 2x 2.5Gb add-in card with an Intel i226 chipset on a PCIe 1x lane. I just run them at 1Gb for now. I like 10Gb as much as the next guy, and the performance is really nice, but the 2.5Gb cards use almost no power, and I can keep add-in cards completely out of the Synology, which is also nice. Before I had my second VM host, I even bypassed the switch, wiring directly between the VM host and the SAN.

I’m contemplating a third VM host and getting some cheap L2 2.5Gb switch, which would then become a physically isolated storage network. Then again, there’s a 4-port version of the NIC I linked before available as well; I could also just plug that into the Synology.

“Why not just go FC for storage traffic and do it the right way?”

Good question. No need right now :slight_smile:
