FAQ: Best practices for virtualized nodes

Hello all. The forum gets a lot of questions about how to virtualize nodes, and I'd like to shed some light on the topic.

I've written a bit about the topic before, and I'm now creating this topic to give it more visibility.


This guide is written for a VMware environment, but many of the concepts are the same even if you're on XCP-ng, Xen, VirtualBox, oVirt, KVM, Hyper-V or Proxmox.

The setup can be broken into three major areas, all connected by networking, which is so tied into all three that I won't treat it as its own area:

  • Storage Array
  • Hardware
  • Virtualization settings
Just to be clear, here are the concepts I work with
  • VMhost = The physical hardware that you’ve installed your hypervisor on. The physical hardware is not the hypervisor, and the hypervisor is just the software implementation on top of your hardware
  • VM = The virtualized OS that lives on the VMhost
  • Node = A single instance of the StorJ Software.
  • Local = Resources directly connected (and thus not networked) to the physical hardware I'm talking about at the time

And here are the words I tend to stay away from

  • Server - Because it's an ambiguous word. Is it the VM? Is it the physical hardware? Is it a VMhost? No one knows, and for all the things a server could describe, there is more accurate wording

Let’s get into it.


Storage Array:

  • If your HDDs are local to your VMhosts, just use a single datastore per disk
  • My Synology runs single-disk volumes on single-disk storage pools. Always formatted ext4, always serving iSCSI (not NFS!) on dedicated NICs, and always thick provisioning the LUNs (see the esxcli sketch after this list for how the uplinks are wired in on the ESXi side)
    • I am debating switching to ZFS on a homebrew machine so I can use the special metadata device, but what I have now works and I don’t want to spend any additional time or money right now.
  • I am using the aforementioned standalone 8x 20TB disks (MG10ACA20TE), and a RAID10 array of 2TB SSDs as read/write cache, allocating 400GB to each disk. This is massively overkill, but the performance is great. Here is a screenshot of three of the disks:
  • And here is a screenshot of the performance of one of the LUNs. It could be better, but it's in the upper end of what I expect from cached HDDs:
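A quick sketch of the iSCSI side mentioned above, since it comes up often: on ESXi, the dedicated SAN NICs get their VMkernel ports bound to the software iSCSI adapter, and the Synology is added as a dynamic discovery target. The adapter name, vmk interfaces and portal IP below are placeholders for whatever your environment uses, and the same thing can be done through the vSphere UI:

    # Bind the VMkernel ports that sit on the dedicated SAN NICs to the software iSCSI adapter
    esxcli iscsi networkportal add --adapter=vmhba64 --nic=vmk1
    esxcli iscsi networkportal add --adapter=vmhba64 --nic=vmk2
    # Point dynamic discovery at the Synology iSCSI portal, then rescan
    esxcli iscsi adapter discovery sendtarget add --adapter=vmhba64 --address=192.168.50.10:3260
    esxcli storage core adapter rescan --adapter=vmhba64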

Hardware:

  • Identical VMhosts of: Intel 12400, 128GB RAM, 2TB local mirrored NVMe, 4x 1Gb NICs
  • … But none of that really matters. As long as your hypervisor is not swapping to disk due to memory exhaustion, and as long as you're alright with the CPU utilization, you can use whatever you want. I'd still recommend local NVMe for the OS drives and dedicated NICs for the SAN uplinks
    • Sidenote: I use 1Gb NICs because I have them and they work fine for me. That being said, older 10Gbit PCIe cards are dirt cheap and will give you a considerable network speed boost.

VMware:

  • Here is where it gets more interesting. I always assign 1GB of RAM per TB of used StorJ space for StorJ workloads, and a minimum of 2GB, because I run Windows on my guests. I give them 2 vCPUs, because I like giving them the ability to burst up when doing updates.
  • All my VMs are regular Win10 VMs, but stripped of most of their unused stuff thanks to Chris Titus' scripts: Debloat Windows in 2022. This brings down idle memory consumption significantly.
    • I do this on all my windows machines. It’s a wonderful script.
  • All VMs have their C:\ drive on local NVMe. All VMs have a second 10GB disk, also on VMhost-local NVMe, which is used only for the StorJ databases. The StorJ data directory is built on multiple spanned 1TB disks:
  • I use spanned disks in Windows, and not striped, because the limiting factor for performance is going to be my underlying hardware and not Windows' implementation of striping. Yes, there is a higher chance of data corruption when running it this way - but the disk should only corrupt if your SAN goes out. If this method is good enough for work, it's good enough for me.
  • Remember to enable round-robin on your SAN uplinks (see the esxcli sketch after this list), and try to have enough NICs so your SAN uplinks are not also your VM network.
  • When running larger nodes, it is advised to add additional vSCSI controllers to the VM and balance the disks out across them, since each controller gets its own IO queue (see the .vmx sketch after this list). It is also advised to put each VMdisk on its own datastore, since each datastore also gets its own IO queue. This would be good advice when running massive nodes (30TB+), but I don't, and even if I did, my underlying storage of single HDDs would be the bottleneck - not the storage setup.
  • I run more nodes than I have disks, so I break the “One Node, One Disk” rule, but while one disk can hold multiple nodes I do make sure that a node only lives on a single disk.
  • I have spare local NVMe. I’ve thought about using it as local host cache for SWAP, but like I said, if you’re swapping, you have other issues.
  • Don't thin provision any disks. The write penalty for requesting new blocks when expanding the disks is simply too big. If your disk is full and already thin provisioned, there is no need to inflate it (because all the bytes in its allocated size are already written), but I would highly suggest inflating all other thin disks (see the vmkfstools sketch after this list). My old nodes are running thick lazy-zeroed disks, but I create all newer disks as thick eager-zeroed. I can't feel the performance difference yet, but when creating new disks I can pay the future IO cost of zeroing up front, and that's wonderful.
    • If you want to inflate a thin disk, you can either svMotion it to a different datastore and choose thick provisioning as your preferred storage type, or navigate to the .vmdk file in the datastore file browser and press "Inflate" there.
  • Don't use VMware snapshots. Snapshots are not backups, they are terrible for performance, and they grow crazy fast in this use case. If you have to take one (perhaps you're worried about an update?), please stop your node process first. Delete the snapshot before starting the node again.
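To make the round-robin bullet above concrete: the path selection policy can be set per LUN from the ESXi shell. A minimal sketch - the naa identifier below is a placeholder you'd look up first with the list command:

    # Find the naa identifier of the Synology LUN
    esxcli storage nmp device list
    # Switch that device's path selection policy to round robin
    esxcli storage nmp device set --device naa.60014051234567890abcdef --psp VMW_PSP_RR

The same setting is also exposed per device in the vSphere UI under the multipathing policies.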
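And to make the extra-vSCSI-controller bullet concrete: you'd normally add the controllers in the VM settings UI, but the resulting .vmx entries look roughly like this (disk file names are placeholders, and whether you pick pvscsi or lsisas1068 depends on your guest):

    scsi0.present = "TRUE"
    scsi0.virtualDev = "lsisas1068"        # boot disk stays on the default controller
    scsi1.present = "TRUE"
    scsi1.virtualDev = "pvscsi"            # second controller, with its own IO queue
    scsi1:0.present = "TRUE"
    scsi1:0.fileName = "node01_data1.vmdk" # data disk hanging off the second controller

Spreading the data disks across scsi0-scsi3 (and across datastores) is all that advice really boils down to.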
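For the thin-provisioning bullet: besides svMotion and the datastore browser's Inflate button, vmkfstools from the ESXi shell does the same conversions, if I remember the flags right. The paths are placeholders, and the VM must be powered off:

    # Inflate a thin disk to thick (all unallocated blocks get written out)
    vmkfstools --inflatedisk /vmfs/volumes/datastore1/node01/node01_data1.vmdk
    # Convert an existing lazy-zeroed thick disk to eager-zeroed
    vmkfstools --eagerzero /vmfs/volumes/datastore1/node01/node01_data1.vmdk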

Aaaaaand I think that's it, really. It's a LOT more involved than running a single machine with docker and assigning each disk to a container. If you have to use virtualization, follow the advice above, and consider running a Linux distro with docker instead of Windows.
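If you do go the Linux-plus-docker route, a node boils down to one container per disk, roughly like the official docs describe - all the values below (wallet, email, address, paths, allocated size) are placeholders:

    docker run -d --restart unless-stopped --stop-timeout 300 \
        -p 28967:28967/tcp -p 28967:28967/udp -p 14002:14002 \
        -e WALLET="0xYOURWALLET" \
        -e EMAIL="you@example.com" \
        -e ADDRESS="your.external.address:28967" \
        -e STORAGE="18TB" \
        --mount type=bind,source=/mnt/disk1/identity,destination=/app/identity \
        --mount type=bind,source=/mnt/disk1/storagenode,destination=/app/config \
        --name storagenode1 storjlabs/storagenode:latest

One container per disk, each with its own identity, port and name, and you get "One Node, One Disk" without any of the VMware plumbing above.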

I only do it this way because it conveniently allows me to test some larger scripts and orchestration I use at work in my spare time.

I'll update this post if I find anything else useful.


Hi Ottetal, great value, many hints I didn't know…
It turned out I don't even have the option to choose a controller in my VM version…
And what about the PVSCSI controller?

Because some Llama 3 70B on Groq told me it's better:
“If you’re looking to squeeze out more performance from your VM, switching to the BusLogic controller (buslogic) might be a good option. However, keep in mind that this change might affect compatibility with your VM’s operating system or disk configuration, so be sure to test carefully before making any changes.”

  • will PVSCSI work in version 12.5.9 of this Workstation?

The PVSCSI controller is a paravirtualized SCSI controller that’s optimized for virtualized environments. However, it’s not supported in VMware Workstation 12.5.9.

The PVSCSI controller was introduced in VMware Workstation 15 and later versions. In Workstation 12.5.9, you won’t be able to use the PVSCSI controller.

If you want to take advantage of the PVSCSI controller, you’ll need to upgrade to a newer version of VMware Workstation, such as version 15 or later.

In your current version, you can still try modifying the SCSI controller type to buslogic or lsilogic, as I mentioned earlier. These controllers are supported in Workstation 12.5.9 and might provide some performance improvements."

  • I opened the .vmx file as you advised, and I see there is a line with: scsi0.virtualDev = “lsisas1068”. Hmm, it's not lsilogic; I wonder what I can change it to for better performance?

Answer:
"You’ve opened the .vmx file and found the line with scsi0.virtualDev = "lsisas1068".

The lsisas1068 value specifies the SCSI controller type, which is an LSI SAS 1068 controller. This is a popular and widely-supported SCSI controller type.

To explore other options, you can try modifying the scsi0.virtualDev value to one of the following:

  • buslogic: This will change the SCSI controller to a BusLogic controller, which is another popular type.
  • lsilogic: This will change the SCSI controller to an LSI Logic controller, which is similar to the current lsisas1068 controller.
  • pvscsi: This will change the SCSI controller to a Paravirtual SCSI (PVSCSI) controller, which is a high-performance controller optimized for virtualized environments.

Keep in mind that changing the SCSI controller type might affect compatibility with your VM’s operating system or disk configuration. Make sure to test the changes carefully and revert if you encounter any issues.

Before making any changes, create a backup of the original .vmx file, and then modify the scsi0.virtualDev line to the desired value. Save the changes and restart the VM instance to test the new configuration.

Remember to monitor the VM’s performance and adjust the settings as needed to achieve the best results.

  • It's version 12.5.9; it seems there are no additional settings for the disk controller in the settings, hmmm

You’re running VMware Workstation 12.5.9, and it seems that the disk controller settings are not exposed in the UI.

In that case, you can try editing the VM’s configuration file (.vmx) to modify the disk controller settings. This will require some manual editing, but it can give you more control over the disk controller configuration.

Here’s an example of how you can modify the .vmx file to change the disk controller type:

  1. Shut down the VM instance.
  2. Locate the .vmx file associated with the VM instance. This file is usually found in the same directory as the VM’s virtual disk files.
  3. Open the .vmx file in a text editor (e.g., Notepad++).
  4. Add the following lines to the file, depending on the disk controller type you want to use:
  • For a SCSI controller: scsi0.present = "TRUE" and scsi0.virtualDev = "lsilogic" (or buslogic for a BusLogic SCSI controller)
  • For a SATA controller: sata0.present = "TRUE" and sata0.controllerKey = "sata" (or ahci for an AHCI SATA controller)
  5. Save the changes to the .vmx file.
  6. Restart the VM instance.

Keep in mind that editing the .vmx file can be risky, and incorrect changes can cause issues with the VM. Make sure to create a backup of the original file before making any changes.

If you’re not comfortable editing the .vmx file, you can also try upgrading to a newer version of VMware Workstation, which might provide more disk controller settings in the UI.

  • I'm interested in number 4 that you mentioned. I added whole disks to the VM, and there was only a SATA or SCSI controller, I guess

You’ve added whole disks to the VM, and you’re limited to SATA or SCSI controllers. In that case, let me explain the difference between these two controllers and provide some additional information to help you optimize disk performance.

SATA Controller:

  • Emulates a SATA disk controller, which is a common type of disk controller found in modern computers.
  • Supports up to 6 devices (disks) per controller.
  • Generally, SATA controllers are suitable for most use cases, and they provide a good balance between performance and compatibility.

SCSI Controller:

  • Emulates a SCSI (Small Computer System Interface) disk controller, which is a more advanced and feature-rich disk controller type.
  • Supports up to 15 devices (disks) per controller.
  • SCSI controllers are generally faster and more efficient than SATA controllers, especially in virtualized environments.
  • However, SCSI controllers might require additional configuration and may not be compatible with all operating systems or disk types.

So many quirks and also hybrids (like a Linux host for the hypervisor and a Windows guest, spanned disks instead of expanding the existing virtual disk, and also Windows to run a node).
If you're forced to use VMware on a Linux host, it's highly advisable to run Linux as a guest too and use docker. The docker setup doesn't limit how many containers you may run.
Honestly, I would skip using a VM as a unit and run docker containers directly on your host, avoiding all the performance and resource-usage impacts.
With the selected approach you will likely always have failed readable/writable checks with the default options, you will also likely have problems with all the filewalkers, high CPU and RAM usage, and corruptions and database locks (unless you move the databases to a separate SSD - I saw the disk for databases in your screenshot, but it's not obvious what's under the hood).

So it looks very expensive and overcomplicated (and thus unreliable and likely slow) in general.


I'm doing various tests inside the VM and outside, on the same computer, same VM, same 16TB SATA HDD (HC550).
And this is just insane, Alex.

1st. In the Random Access Read test in HD Tune Pro 5.75,
I'm getting the same times and speeds inside the VM and outside (no downside for the VM!)
(for small files like 4KB it's around ~0.330KB/s, ~81 IOPS and ~11ms avg. access time)

2nd. Inside the VM, with the node turned off, counting files under Windows 10 -
around ~1 million files totaling ~3TB, but with the data written close to each other - HD Tune Pro shows
I/O reads at ~1000/s, with spikes to 1500/s (it counts that 3TB really fast inside the VM!).
But when I go to some Storj blobs folder it gets no more than ~190/s.
Then I go to the next folder and it gets no more than ~85/s!

What's the matter?
Random access, ladies and gentlemen, plus SMALL files.
The difference between one satellite folder and another is:
one has folders with mostly 2266KB files,
the other has mostly <200KB files, like 110KB, 7KB, 4KB, 2KB, and even 1KB files!

I know it's probably nothing new,
but those small files take ~2.2 times MORE time to finish the used-space filewalkers.

And this 16TB HDD is not 4Kn, but 512e!
If only the files from one satellite could somehow be written next to each other…
otherwise we have 1-4KB files from some .us1 or .eu1 satellite scattered all over the 16TB disk, which is quite some madness, if you ask me.
Therefore, I postulate that we stop denigrating Virtual Machines, or they will UPRISE one day! :smile:

Hmmmm, putting different files from different satellites on different volumes could be a fun endeavor to look at.