Hello all. The forum gets a lot of questions about how to virtualize nodes, and I want to shed some light on the topic.
I’ve written a bit about this before, and I’m creating this topic to give it more visibility.
This guide is written for a VMware environment, but many of the concepts are the same even if you’re on XCP-NG, Xen, Virtualbox, Ovirt, KVM, HyperV or Proxmox.
The topic breaks down into three major areas, all connected by networking, which is so tied into all three that I won’t treat it as its own area:
- Storage Array
- Hardware
- Virtualization settings
Just to be clear, here are the concepts I work with
- VMhost = The physical hardware that you’ve installed your hypervisor on. The physical hardware is not the hypervisor, and the hypervisor is just the software implementation on top of your hardware
- VM = The virtualized OS that lives on the VMhost
- Node = A single instance of the StorJ Software.
- Local = Resources directly connected (and thus not networked) to the physical hardware I’m talking about at the time
And here are the words I tend to stay away from
- Server - Because it’s an ambiguous word. Is it the VM? Is it the physical hardware? Is it a VMhost? No one knows, and for everything a server could describe, there is more accurate wording
Let’s get into it.
Storage Array:
- If your HDDs are local to your VMhosts, just use a single datastore per disk
- My Synology runs single-disk volumes on single-disk storage pools. Always formatted ext4, always running iSCSI (not NFS!) on dedicated NICs, and always with thick provisioning on the LUNs
- I am debating switching to ZFS on a homebrew machine so I can use the special metadata device, but what I have now works and I don’t want to spend any additional time or money right now.
- I am using the aforementioned standalone 8x 20TB disks (MG10ACA20TE), and a RAID10 array of 2TB SSDs as read/write cache, allocating 400GB to each disk (a quick back-of-the-envelope for this is sketched after this list). This is massively overkill, but the performance is great. Here is a screenshot of three of the disks:
- And here is a screenshot of the performance of one of the LUNs. It could be better, but it’s at the upper end of what I expect from cached HDDs:
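A quick sanity check for the cache sizing above, as a tiny back-of-the-envelope sketch. The disk count, cache per disk and SSD size are my numbers; the number of SSDs in the RAID10 is just an assumption, so plug in your own:

```
# Back-of-the-envelope for the SSD cache layout described above.
DATA_DISKS = 8              # 8x 20 TB MG10ACA20TE
CACHE_PER_DISK_GB = 400     # read/write cache allocated per data disk
SSD_SIZE_GB = 2000          # 2 TB SSDs
SSDS_IN_RAID10 = 4          # assumption: two mirrored pairs, striped

cache_needed_gb = DATA_DISKS * CACHE_PER_DISK_GB
raid10_usable_gb = SSDS_IN_RAID10 * SSD_SIZE_GB // 2  # RAID10 keeps half the raw capacity

print(f"cache needed:  {cache_needed_gb} GB")
print(f"RAID10 usable: {raid10_usable_gb} GB")
print("fits" if cache_needed_gb <= raid10_usable_gb else "does not fit")
```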
Hardware:
- Identical VMhosts of: Intel 12400, 128GB RAM, 2TB local mirrored NVMe, 4x 1Gb NICs
- … But none of that really matters. As long as your hypervisor is not swapping to disk due to memory exhaustion, and as long as you’re okay with your CPU utilization, you can use whatever you want. I’d still recommend local NVMe for the OS drives and dedicated NICs for the SAN uplinks
- Sidenote: I use 1Gb NICs because I have them and they work fine for me. That being said, older 10Gbit PCIe cards are dirt cheap and will give you a considerable network speed boost.
VMware:
- Here is where it gets more interesting. I always assign 1GB of RAM per TB of used StorJ space, with a minimum of 2GB, because I run Windows on my guests. I give them 2 vCPUs, because I like them to have the ability to burst when doing updates. (There’s a small sizing sketch after this list.)
- All my VMs are regular win10 VMs, but stripped of most of their unused stuff thanks to Chris Titus’ scripts: Debloat Windows in 2022. This brings down idle memory consumption significantly
- I do this on all my windows machines. It’s a wonderful script.
- All VMs have their C:\ drive on local NVMe. All VMs have their second disk as a 10GB drive, also on VMhost-local NVMe, which is used only for the StorJ databases; the StorJ data directory itself is built on multiple spanned 1TB disks (a small placement check is sketched after this list):
- I use spanned disks in Windows, not striped, because the limiting factor for performance is going to be my underlying hardware and not Windows’ implementation of striping. Yes, there is a higher chance of data corruption when running it this way - but the disk should only corrupt if your SAN goes out. If this method is good enough for work, it’s good enough for me.
- Remember to enable round-robin on your SAN uplinks, and try to have enough NICs so your SAN uplinks are not also your VM network.
- When running larger nodes, it is advised to add additional vSCSI controllers to the VM and balance the disks across them, since each controller gets its own IO queue. It is also advised to put each VMdisk on its own datastore, since each datastore also gets its own IO queue. This would be good advice when running massive nodes (30TB+), but I don’t, and even if I did, my underlying storage of single HDDs would be the bottleneck - not the storage setup.
- Sidenote: the original “guide to 100k IOPS” whitepaper is a wonderful read: 100,000 I/O Operations Per Second, One ESX Host - VROOM! Performance Blog. The pictures have been gone since the Broadcom purchase, but you get the point. The one-million and two-million IOPS whitepapers from the following years are great too.
- I run more nodes than I have disks, so I break the “One Node, One Disk” rule, but while one disk can hold multiple nodes I do make sure that a node only lives on a single disk.
- I have spare local NVMe. I’ve thought about using it as local host cache for SWAP, but like I said, if you’re swapping, you have other issues.
- Disks must not be thin provisioned. The write penalty for requesting new blocks as the disk expands is simply too big. If your disk is full and already thin provisioned, there is no need to inflate it (because all the bytes in its allocated size are already written), but I would highly suggest inflating all other thin disks. My old nodes are running thick lazy-zeroed disks, but I create all newer disks as thick eager-zeroed. I can’t feel the performance difference yet, but by creating disks thick eager I pay the future IO cost of zeroing up front, and that’s wonderful. (A small audit script for finding thin disks is sketched after this list.)
- If you want to inflate a thin disk, you can either svMotion it to a different datastore and choose thick provisioned as your preferred storage type, or navigate to the .vmdk file in the datastore file browser and press “Inflate” there.
- Don’t use VMware snapshots. Snapshots are not backups, they are terrible for performance, and they grow crazy fast in this use case. If you have to (perhaps you’re worried about an update?), please stop your node process first, and delete the snapshot before starting the node again. (A small script for finding leftover snapshots is sketched after this list.)
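Here is the RAM/vCPU rule of thumb from above as a tiny sketch. The numbers (1GB per used TB, 2GB floor, 2 vCPUs) are just what I use for Windows guests, not any official requirement:

```
import math

# Rule of thumb: 1 GB of RAM per TB of used StorJ space,
# never less than 2 GB, and 2 vCPUs so updates can burst.
def vm_size(used_storj_tb: float) -> dict:
    ram_gb = max(2, math.ceil(used_storj_tb))  # round up, floor of 2 GB
    return {"ram_gb": ram_gb, "vcpus": 2}

if __name__ == "__main__":
    for used_tb in (0.5, 4, 12):
        print(f"{used_tb} TB used -> {vm_size(used_tb)}")
```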
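And a minimal sketch for the database placement above. The paths are purely hypothetical examples of my layout (small fast disk for the databases, big spanned volume for the data) - adjust them to wherever your node actually lives:

```
from pathlib import Path

DB_DIR = Path(r"D:\storj\databases")      # hypothetical: the 10 GB NVMe-backed disk
STORAGE_DIR = Path(r"E:\storj\storage")   # hypothetical: the spanned 1 TB disks

# List the databases that ended up on the fast disk, and flag any
# that are still sitting next to the blobs on the slow spanned volume.
dbs_on_fast_disk = sorted(p.name for p in DB_DIR.glob("*.db"))
stray_dbs = sorted(p.name for p in STORAGE_DIR.glob("*.db"))

print("databases on the fast disk:", dbs_on_fast_disk or "none found!")
if stray_dbs:
    print("WARNING: databases still sitting on the storage volume:", stray_dbs)
```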
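For the thin provisioning point: a minimal sketch, assuming you have pyVmomi installed (pip install pyvmomi) and replace the placeholder hostname/credentials with your own. It walks all VMs and prints any virtual disks that are still thin provisioned, so you know what to inflate:

```
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab shortcut; use proper certs if you have them
si = SmartConnect(host="vcenter.example.local",      # placeholder
                  user="administrator@vsphere.local",  # placeholder
                  pwd="changeme",                       # placeholder
                  sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        if vm.config is None:  # skip VMs whose config we can't read
            continue
        for dev in vm.config.hardware.device:
            if isinstance(dev, vim.vm.device.VirtualDisk) and \
                    getattr(dev.backing, "thinProvisioned", False):
                print(f"{vm.name}: {dev.deviceInfo.label} is thin "
                      f"({dev.backing.fileName})")
finally:
    Disconnect(si)
```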
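And for the snapshot point, a similar minimal sketch under the same pyVmomi assumptions that lists VMs still carrying snapshots, so forgotten ones don’t sit around growing:

```
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab shortcut; use proper certs if you have them
si = SmartConnect(host="vcenter.example.local",      # placeholder
                  user="administrator@vsphere.local",  # placeholder
                  pwd="changeme",                       # placeholder
                  sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        if vm.snapshot is not None:
            names = [s.name for s in vm.snapshot.rootSnapshotList]
            print(f"{vm.name} still has snapshots: {names}")
finally:
    Disconnect(si)
```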
Aaaaaand I think that’s it really. It’s a LOT more involved than running a single machine with docker and assigning each disk to its own container. If you have to use virtualization, follow the advice above, and consider running a Linux distro with docker instead of Windows.
I only do it this way, because it conveniently allows me to test some larger scripts and orchestration I use at work in my spare time.
I’ll update this post if I find anything else useful.