Help with node optimizations / best practices

Multiple HDDs? Are you running a RAID array on top of your disks? I see in your post that you’ve upgraded to more expensive disks, but is each of them carrying a single node or multiple workloads? I ran 8x 20TB disks at home for some time in RAID5 on ext4, with 4TB of read/write cache in front, and still had huge issues with write amplification. It’s as @Alexey says:

The virtualization usually affecting disk IOPS

Which is true. All virtualization carries a performance hit on almost every front, but if configured correctly it should only be a few percent. @Toyoo has a good analogy about all the switches needing to be set correctly before your performance is great. I work as a compute admin for a huge VMware setup, and we don’t have any issues either. I use VMware at home to virtualize my StorJ setup as well, and I treat all my StorJ VMs the way I treat my production database VMs at work, because that’s the workload StorJ most resembles. That being said, it’s not the same; StorJ is very unique in regards to its performance characteristics, and there isn't really anything like it - it’s almost completely random reads at all times, the stored files almost all fall within the same size range, and it’s very sensitive to IO degradation (just like databases). Oh well


Here are the settings I use

Storage Array:

  • If your HDDs are local to your VM hosts, just use a single datastore per disk
  • Synology running single-disk storage pools with single-disk volumes on them. Always formatted ext4, always running iSCSI (not NFS!) on dedicated NICs, and always using thick provisioning on the LUNs
    • I am debating switching to ZFS on a homebrew machine so I can use the special metadata device, but what I have now works and I don’t want to spend any additional time or money right now.
  • I am using the aforementioned 8x 20TB disks (MG10ACA20TE) as standalone disks, and a RAID10 array of 2TB SSDs as read/write cache, allocating 400GB to each disk (rough cache-sizing math after this list). This is massively overkill, but the performance is great. Here is a screenshot of three of the disks:
  • And here is a screenshot of one LUN’s performance. It could be better, but it’s at the upper end of what I expect from cached HDDs:
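
As a quick sanity check on the cache math above, here’s a tiny back-of-the-envelope sketch. The SSD count is just an assumption for the example, not my exact layout:

```python
# Back-of-the-envelope cache sizing for 400GB of SSD cache per HDD.
# The SSD count below is illustrative only.
HDD_COUNT = 8            # standalone 20TB data disks
CACHE_PER_HDD_GB = 400   # read/write cache allocated to each disk

SSD_COUNT = 4            # assumption for the example
SSD_SIZE_GB = 2000
RAID10_USABLE_GB = SSD_COUNT * SSD_SIZE_GB / 2  # mirroring halves raw capacity

needed_gb = HDD_COUNT * CACHE_PER_HDD_GB
print(f"Cache needed:  {needed_gb} GB")                        # 3200 GB
print(f"RAID10 usable: {RAID10_USABLE_GB:.0f} GB")             # 4000 GB
print(f"Headroom:      {RAID10_USABLE_GB - needed_gb:.0f} GB") # 800 GB
```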

Hardware:

  • Identical VM hosts: Intel 12400, 128GB RAM, 2TB local mirrored NVMe, 4x 1Gb NICs
  • … But none of that really matters. As long as your hypervisor is not swapping to disk due to memory exhaustion, and as long as you’re alright with the CPU utilization, you can use whatever you want. I’d still recommend having local NVMe for the OS drives and using dedicated NICs for SAN uplinks

VMware:

  • Here is where it gets more interesting. I always assign 1GB of RAM per TB of used StorJ space for StorJ workloads, with a minimum of 2GB, because I run Windows on my guests. I give them 2 vCPUs, because I like them to have the ability to burst up when doing updates (there’s a small sizing sketch after this list).
  • All my VMs are regular Windows 10 VMs, but stripped of most of their unused stuff thanks to Chris Titus’ scripts: Debloat Windows in 2022. This brings down idle memory consumption significantly
    • I do this on all my windows machines. It’s a wonderful script.
  • All VMs have their C:\ drive on local NVMe. Each VM’s second disk is a 10GB drive, also on local NVMe, which is used only for the StorJ databases, and the StorJ data directory is built on multiple spanned 1TB disks:
  • I use spanned disks in Windows, and not striped, because the limiting factor for performance is going to be my underlying hardware and not Windows’ implementation of striping. Yes, there is a higher chance of data corruption when running it this way - but the disk should only corrupt if your SAN goes out. If this method is good enough for work, it’s good enough for me.
  • Remember to enable round-robin on your SAN uplinks, and try to have enough NICs so your SAN uplinks are not also your VM network.
  • When running larger nodes, it is advised to add additional vSCSI controllers to the VM and balance the disks out across them, since each controller gets its own IO queue. It is also advised to put each VM disk on its own datastore, since each datastore also gets its own IO queue. This would be good advice when running massive nodes (30TB+), but I don’t, and even if I did, my underlying storage of single HDDs would be the bottleneck - not the storage setup.
  • I run more nodes than I have disks, so I break the “One Node, One Disk” rule, but while one disk can hold multiple nodes, I do make sure that a node only lives on a single disk.
  • I have spare local NVMe. I’ve thought about using it as local host cache for swap, but like I said, if you’re swapping, you have other issues.
  • None of the disks should be thin provisioned. The write penalty for requesting new blocks when expanding the disks is simply too big. If your disk is full and already thin provisioned, there is no need to inflate it (because all the bytes in its allocated size are already written), but I would highly suggest inflating all other thin disks. My old nodes are running thick provisioned lazy-zeroed disks, but I create all newer disks as thick provisioned eager-zeroed. I can’t feel the performance difference yet, but when creating new disks I can pay the future IO cost of zeroing up front, and that’s wonderful.
    • If you want to inflate a thin disk, you can either svMotion it to a different datastore and choose thick provisioned as your preferred storage type, or navigate to the .vmdk file in the datastore file browser and press “Inflate” there.
  • Don’t use VMware snapshots. Snapshots are not backups; they are terrible for performance and grow crazy fast in this use case. If you have to take one (perhaps you’re worried about an update?), please stop your node process first, and delete the snapshot before starting the node again.
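
To make the sizing rule from the first bullet concrete, here’s a tiny sketch of how I’d calculate it. This is just the 1GB-per-used-TB rule with a 2GB floor and 2 vCPUs; the rounding up is my own reading of it, nothing official:

```python
import math

# Rough VM sizing for a StorJ node on a Windows guest:
# 1 GB of RAM per TB of used StorJ space, minimum 2 GB, and 2 vCPUs
# so the node can burst during updates.
def size_storj_vm(used_space_tb: float) -> dict:
    ram_gb = max(2, math.ceil(used_space_tb))  # 1 GB per used TB, 2 GB floor
    return {"ram_gb": ram_gb, "vcpus": 2}

print(size_storj_vm(14))   # {'ram_gb': 14, 'vcpus': 2}
print(size_storj_vm(0.5))  # {'ram_gb': 2, 'vcpus': 2}
```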

Aaaaaand I think that’s it, really. It’s a LOT more involved than running a single machine with Docker and assigning each disk to its own container. If you have to use virtualization, follow the advice above, and consider running a Linux distro with Docker instead of Windows.
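
For reference, the “simple” alternative I mean is just one storagenode container per disk on a plain Linux box. Here’s a rough, untested Python sketch of what launching that could look like; the wallet, paths, ports, storage sizes and image tag are placeholders, so check the official node setup docs for the real docker run parameters:

```python
# Hypothetical sketch: one storagenode container per disk on a single Linux host.
# All values below are placeholders for illustration, not a working config.
import subprocess

DISKS = {
    "node1": "/mnt/disk1",
    "node2": "/mnt/disk2",
}

for i, (name, mountpoint) in enumerate(DISKS.items()):
    port = 28967 + i  # each node needs its own external port
    cmd = [
        "docker", "run", "-d", "--restart", "unless-stopped",
        "--name", name,
        "-p", f"{port}:28967/tcp", "-p", f"{port}:28967/udp",
        "-e", "WALLET=0x...",                       # placeholder
        "-e", "EMAIL=you@example.com",              # placeholder
        "-e", f"ADDRESS=your.ddns.example:{port}",  # placeholder
        "-e", "STORAGE=18TB",                       # placeholder
        "--mount", f"type=bind,source={mountpoint}/identity,destination=/app/identity",
        "--mount", f"type=bind,source={mountpoint}/storagenode,destination=/app/config",
        "storjlabs/storagenode:latest",
    ]
    subprocess.run(cmd, check=True)
```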

I only do it this way because it conveniently allows me to test, in my spare time, some of the larger scripts and orchestration I use at work.

I’ll update this comment if I find anything else cool
