Help with node optimizations / best practices

GoldfishIT · February 20, 2024, 2:10am

What are the best practices for making Storj nodes?
More specifically if we have virtualized Windows nodes?
I’m asking this because I feel there must be something important or obvious that I’m missing.

→ Reading this is optional for responding to the question but I’ll just explain the context of why I’m asking these questions.

Firstly because when wanting to start a node I’ve just followed what the website guided me through, thinking it was going to be fine. I was wrong. The process should have important information about hardware and configurations so we don’t f*** up in the long run but I guess nobody cares. I’ve invested over 100h of my time reading this forum and trying different things, even more if you think about the node migrations, and I still feel like my nodes are a failure and I’m close to just GEing.

My first nodes used Windows 10 and USB SMR drives, and after ~2TB they started failing all the time. I shucked the drives, moved them into my Proxmox server and made some Windows Server VMs. The problem was solved, but at a little over 3TB it started happening again and there was a BIG discrepancy between the real space occupied and what the node reported. I bought CMR Exos drives and deleted the databases; the problem still hasn’t reappeared over a year later, but my drives are ALWAYS at 100% activity. When consulting the logs I can see that the lazy filewalker runs take over 20 days and there are a lot of cancelled uploads. Also, when Storj needs to update, my node just dies until I manually restart the process, which didn’t happen back in Windows 10 and 11, so I guess Storj doesn’t work well with Windows Server maybe?

Turning off indexing on the drive took over a month but seemed to help a little bit.

Turning on writethrough caching + iothread gives a significant speed boost but also some unexpected behavior in terms of IO wait and guest CPU usage, and it’s still too slow anyway.

Benchmarks on the guest machine give me IOPS and bandwidth at all queue depths within 15% of what I get on bare metal so I guess the virtualization overhead is not significant.

So what the hell am I missing? Is it because I left the default cluster size of 4KB? Should I make some optimizations in the config file? Maybe defrag the MFT? Is the only choice for running a decent Storj node using Linux and EXT4 or something? I’m giving this one last shot but I’m way too tired to make another mistake. Please help. I would ideally stick to Windows VMs because it works well with the way I manage the server.

Roxor · February 20, 2024, 2:51am

I haven’t needed many rules. One HDD per node (even when virtualization may make it easy to run a few). Avoid SMR when possible. Use the default ‘lazy’ filewalker, so the node always has higher IO priority. And occasionally run one of the success scripts so I’m confident I’m still winning upload and download races in the high 90%s. The used-space filewalker runs on restart, and it’s normal for it to take 3-4h/TB… so with enough data it’s OK to see it pin a disk for a day or more.

I’ll eyeball the GUI sometimes to see how ingress is going… but other than that I ignore things and wait for an email from hello@storj.io to tell me if something is really wrong.

What are you trying to improve? If you’re winning near-100% of uploads and downloads you don’t have a problem.
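
A rough sketch of what such a success script computes, for anyone who wants a quick number from their own log. The log phrases below ("uploaded", "upload canceled", and so on) are assumptions about how the storagenode log words each outcome; check them against a few lines of your own log before trusting the result.

```python
from collections import Counter

# Assumed log phrases (not taken from this thread); adjust to match your log.
OUTCOMES = {
    "upload": {"won": "uploaded", "lost": ("upload canceled", "upload failed")},
    "download": {"won": "downloaded", "lost": ("download canceled", "download failed")},
}


def race_success_rates(lines):
    """Return {"upload": rate, "download": rate}; rate is None if nothing was seen."""
    counts = Counter()
    for line in lines:
        for kind, phrases in OUTCOMES.items():
            if any(p in line for p in phrases["lost"]):
                counts[kind, "lost"] += 1
            elif phrases["won"] in line:
                counts[kind, "won"] += 1
    rates = {}
    for kind in OUTCOMES:
        total = counts[kind, "won"] + counts[kind, "lost"]
        rates[kind] = counts[kind, "won"] / total if total else None
    return rates
```

Feeding it the lines of a log file (for example `race_success_rates(open(path, encoding="utf-8"))`, with `path` pointing at your own log) gives fractions between 0 and 1; values well below the high 0.9s would match the "lots of cancelled uploads" symptom.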

GoldfishIT · February 20, 2024, 3:09am

My HDDs are pinned 100% of the time. Used space filewalker takes over 2 weeks. Used space is ~6TB. I don’t know the % I’m winning but I assure you it’s nowhere near 100% judging by the logs.
Current setup is CMR Exos drives on a dual socket server, passed through to the Windows guest in Proxmox using qm set. In Windows, caching is on, indexing is off, the partition is NTFS, and the cluster size is 4KB.
What’s your setup?
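
For scale, here is the 3-4h/TB rule of thumb quoted above applied to ~6TB and a scan of roughly two weeks. This is only back-of-the-envelope arithmetic; the 14-day figure is a rounding of "over 2 weeks".

```python
def expected_scan_hours(used_tb, hours_per_tb=(3.0, 4.0)):
    """Expected used-space filewalker duration as a (low, high) range in hours."""
    low, high = hours_per_tb
    return used_tb * low, used_tb * high


def slowdown_factor(observed_hours, used_tb, hours_per_tb=(3.0, 4.0)):
    """How many times slower the observed scan is than the slow end of the range."""
    _, high = expected_scan_hours(used_tb, hours_per_tb)
    return observed_hours / high


low, high = expected_scan_hours(6)      # 18 to 24 hours expected for ~6TB
factor = slowdown_factor(14 * 24, 6)    # two weeks is about 14x the slow end
```

So a two-week scan on 6TB is roughly an order of magnitude outside what the rule of thumb predicts, which suggests a real I/O problem rather than normal filewalker load.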

Alexey · February 20, 2024, 5:05am

Hello @GoldfishIT,
Welcome to the forum!

kind of expected because of

Virtualization usually affects disk IOPS; this seems inevitable, judging from other posts. My guess is that it would work much better on bare metal.
However, you may also turn off the lazy mode (the filewalker will then run at normal priority, but the scan should finish much sooner, maybe in only a few days rather than 2/3 of a month).

So, in this configuration it looks like there is nothing to improve except moving to Docker on the host itself, but the disk should be formatted as ext4 in that case. This is expected to work significantly better than under a Windows VM.

Knowledge · February 20, 2024, 5:54am

There is also an improvement coming to optimize the trash folder scan. This should help reduce I/O for all nodes.

github.com/storj/storj

Trashfolder cleanup is wasting too much IOPs and is running with wrong IO priority

opened 01:46PM - 18 Dec 23 UTC

littleskunk

Bug Needs Estimation

**Expected Behavior** The storage node receives a bloom filter and moves a lot of pieces into the trash folder. The pieces should stay in the trash folder for 7 more days. In the end, a cleanup job should delete the pieces that are older than 7 days. That cleanup job should run with low IO priority like garbage collection does.

**Actual Behavior** Garbage collection is already optimized and there is a flag to run it with lower IO priority. The problem is the cleanup job. It checks the entire trash folder every 24 hours and on every storage node restart. 6 days in a row it is wasting IOPs by just checking the trash folder without deleting anything. It is also running with normal IO priority, which will impact ongoing upload and download activities. We should stop checking the trash folder that often and we should do it with low IO priority.

**Possible Solution** How about garbage collection creates a subfolder in the trash folder with the current timestamp? That way the cleanup job could just ignore all subfolders that are not yet 7 days old and wipe the folders that are older than 7 days. No matter how many pieces are in the trash folder, it would be just 1 check instead of checking the timestamp of every piece in the trash folder. Also, run it with low IO priority like garbage collection.
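
The proposed subfolder scheme can be illustrated with a small sketch. This is not the storagenode implementation; the folder naming format (`YYYY-MM-DD`) and the function name are hypothetical, and only the 7-day retention comes from the issue text.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)   # from the issue: pieces stay in trash for 7 days
FOLDER_FORMAT = "%Y-%m-%d"      # hypothetical naming scheme for the subfolders


def expired_trash_folders(folder_names, now, retention=RETENTION):
    """Return the timestamped subfolders that are older than the retention period.

    Only the folder names are inspected, so the cost is one check per
    subfolder rather than one check per piece.
    """
    expired = []
    for name in folder_names:
        try:
            created = datetime.strptime(name, FOLDER_FORMAT).replace(tzinfo=timezone.utc)
        except ValueError:
            continue  # not a timestamped subfolder; leave it alone
        if now - created > retention:
            expired.append(name)
    return sorted(expired)
```

With one subfolder per garbage collection run, the daily cleanup touches a handful of directory names instead of stat-ing every trashed piece, which is where the I/O saving would come from.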

Solu · February 20, 2024, 6:14am

Hi Goldfish
I run a 24TB node (22TB full atm) virtualized on Proxmox, both node and storage (TrueNAS). ZFS filesystem on CMR drives, ext4 in the node VM.

What I can say is that when I set the log level in config.yaml from “info” to “warn”, I saw a very noticeable improvement in disk I/O. I also added a read/write cache later on, which helps a bit more, even if the cost of buying an extra SSD is not worth it. But if you have a spare one, just try it out; mine is 100GB.

Monthly filewalker takes 4-5 days.

daki82 · February 20, 2024, 7:18am

you guessed wrong, its in the making.

then you read this for sure too. + link at the end of the thread.

GoldfishIT · February 20, 2024, 8:45am

My logs are already going to SSD, forgot to say that. Thanks.

Toyoo · February 20, 2024, 11:31am

You are not sharing all details of your storage setup, we can’t help much.

What storage type are you using in Proxmox? What is your (host) software and hardware storage stack, that is, any RAIDs, thin storage, whatever? What disk image format, if applicable? What virtual hard disk bus/controller are you using? Are you using memory ballooning (it’s strange that people are not aware it affects performance in nontrivial ways!)? How much RAM do you have in the host for caching, in the guest? You need to post all relevant configuration details.

The reason raw Linux and ext4 are recommended is not that other stacks are bad. They are recommended because there’s little to misconfigure there, as opposed to the monstrous complexity that you have with any virtualization platform. These 100h? That’s the price, likely not even the full price. There’s no single magical switch “make it work correctly”. All switches have to be exactly right.

Alternatively, you need to debug your problem from first principles. Measure what random I/O is your storage capable of from host, at every software storage layer you have in your setup. Compare these numbers to measurements from inside the VM. Look at caches and buffers, etc. Measure what you can and verify that the numbers make sense.
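
A minimal sketch of that comparison, assuming you have already measured random-I/O IOPS per queue depth on the host and inside the VM with whatever benchmark you trust. The function names are hypothetical and the numbers in the usage note are made up for illustration; the 15% threshold is the gap mentioned earlier in the thread.

```python
def guest_overhead(host_iops, guest_iops):
    """Fraction of host IOPS lost in the guest, per queue depth (or per layer)."""
    return {
        key: 1 - guest_iops[key] / host_iops[key]
        for key in host_iops
        if key in guest_iops
    }


def suspicious(overhead, threshold=0.15):
    """Keys where the guest loses more than the threshold compared to the host."""
    return sorted(key for key, loss in overhead.items() if loss > threshold)
```

For example, `suspicious(guest_overhead({1: 200, 4: 400, 32: 800}, {1: 180, 4: 300, 32: 720}))` flags only queue depth 4. Repeating the measurement at each storage layer shows which layer, if any, eats the performance.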

Back when I was a Proxmox admin for a company setup, I was able to get close to 100% of the raw I/O speed of the enterprise NVMes we had in our clusters, and all of that available to Windows guests. But it took a lot of experimenting to get it right. While I liked the learning factor, frankly I’m very happy I no longer need to do that; Windows is quite finicky as a guest at times. As such, I would also suggest you run your storage nodes outside of VMs, within a docker container on the host. Proxmox is, after all, an extended Debian distribution…

GoldfishIT · February 20, 2024, 10:30pm

Is there any workload I can use to simulate/benchmark/extrapolate my node’s relative performance?
I don’t like testing in production especially when the load conditions aren’t stable enough to produce repeatable results. Common Windows tools give me only a -15% difference between bare metal and VM IOPS and are thus unreliable to give me insight into what the hell is actually happening, which is why I’m asking. Other workloads in other VMs are working reasonably close to bare metal with no fiddling at all, only Storj is giving me headaches.

GoldfishIT · February 20, 2024, 10:43pm

Windows Server guest is on ZFS on datacenter grade SSDs and has no IOPS issue at all.
Storj data is on a single CMR Exos HDD (for each node) that is passed through to the guest using “qm set -virtio1 ”. Guest being Windows is naturally using NTFS. Cluster size and caching on the guest is default. Indexing on the guest is disabled. Caching on the host is default disabled but enabling writethrough helps, just not enough, enabling IOThread seems to very slightly help as well. VM storage controller is Virtio SCSI single. Async IO is default but somehow native seems to make it worse. Passing the drive using “qm set -scsi1 ” also worsens the issue.

JWvdV · February 20, 2024, 10:52pm

This is the foremost important question: what needs to be fixed?

The answer seems to be: improving disk performance concerning the walkers?

You’ve got a quite nasty setup:
ProxMox (Linux) > Windows guest per drive (N=?) > Storagenode software.

Just some questions:

Why not just Linux and all drives in one environment with multiple dockers?
Or even avoiding VMs altogether, and running the whole thing in a privileged LXC container?
Does it need to remain the way it is? Why is Windows convenient for the way you manage your server?

GoldfishIT · February 21, 2024, 12:43am

As I’ve explained, it does not need to remain the way it is. If it’s possible to be fixed then I will fix it as per your suggestions. Migrating everything to Linux will take weeks and I’ll have to buy another drive, so it’s preferable if it’s possible to fix what is already built.

My setup is, however, not as you describe, so maybe the situation is a bit easier. The guest OS drive is in a ZFS and uses Datacenter grade SSDs and is fast, has no performance issues tested or perceived or resource exhaustion at all on that side of the equation. The Storj node data however is in its own dedicated CMR Exos HDD which is passed through in Proxmox to the Windows guest VM. The performance issues happen/become apparent in the HDD side.

Alexey · February 21, 2024, 2:58am

The storage node software is the best test suite in this case, but unfortunately it is not portable (except by connecting this disk to a bare metal Windows).
Synthetic tests likely would not give you the answer, or you would need to test simultaneous random reads and writes with sizes around 2MiB with something like fio (Flexible I/O tester; see the fio 3.36 documentation).
I didn’t find on the forum a profile for storagenode though, except

but it’s likely not representative, because block size on NTFS is actually 4k, not 128k and they also tested only writes, but I assume that you have a problem mostly with reads.
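
A sketch of what such a fio job could look like, generated from Python so the parameters are easy to vary. This is not a validated storagenode profile: the 70% read mix, the size, the runtime and the job name are assumptions, and only the mixed random read/write pattern with blocks around 2MiB comes from the suggestion above.

```python
def storagenode_like_fio_job(target, size="10g", runtime_s=300, read_pct=70):
    """Build the text of a fio job file for mixed random reads and writes.

    `target` must be a scratch file on the disk under test, never the
    node's real data, because fio writes to it.
    """
    options = {
        "filename": target,
        "rw": "randrw",         # simultaneous random reads and writes
        "rwmixread": read_pct,  # assumed read share; tune to your own traffic
        "bs": "2m",             # block size around 2MiB, as suggested above
        "size": size,
        "direct": 1,            # bypass the page cache
        "time_based": 1,
        "runtime": runtime_s,
    }
    lines = ["[storagenode-like]"]
    lines += [f"{key}={value}" for key, value in options.items()]
    return "\n".join(lines) + "\n"
```

Write the returned text to a file and pass that file to `fio`. Running the same job on bare metal and inside the VM gives two comparable numbers, though, as noted above, it may still not be representative of the real node workload.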

JWvdV · February 21, 2024, 5:28am

The complexity I mean is: you’re mixing OS types, and needing multiple VMs just to run Windows guests is a real waste of resources.

In the end it’s all up to you, but I would convert to privileged LXC in which you run multiple docker instances. This would give bare metal speeds.
Additionally the database can be put on the SSD, further decreasing unnecessary IOPS to the drives that can be used for the primary process.

And yeah, I see the consequences and time investment it takes. But it’s also an opportunity to learn ;).

Besides, as long as your nodes are filled up <45% (or at least one is), you can:
1) shrink the NTFS partition to 50% of the drive,
2) create a 2nd ext4 partition on the drive, preferably a little bit smaller than the NTFS partition (but sufficient to contain all data),
3) copy all data from the NTFS to the ext4 partition using cp or rsync,
4) dd the ext4 partition over the NTFS partition,
5) throw away the first created ext4 partition (of course after testing the 2nd created), expand the remaining ext4 partition to the whole size of the drive and grow the filesystem.

So no necessity to buy another drive.
Of course you could also think of hybrid solutions, with multiple drives >50% filled but it will increase complexity quite soon.
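
A quick sanity check for the in-place migration described above, as a sketch only. The 45% and 50% figures come from the steps; the 1GB safety margin, the function name and the 16TB drive size in the usage note are hypothetical.

```python
TB = 10**12


def plan_in_place_migration(disk_bytes, used_bytes, max_fill=0.45, margin_bytes=10**9):
    """Return partition sizes for the NTFS-to-ext4 shuffle, or None if it won't fit.

    The NTFS partition is shrunk to half the drive and the temporary ext4
    partition is made slightly smaller, so it can later be copied over it.
    """
    if used_bytes >= max_fill * disk_bytes:
        return None  # too full; another drive is needed
    ntfs_bytes = disk_bytes // 2
    ext4_bytes = ntfs_bytes - margin_bytes
    if used_bytes >= ext4_bytes:
        return None  # the data would not fit in the temporary ext4 partition
    return {"ntfs_bytes": ntfs_bytes, "ext4_bytes": ext4_bytes}
```

For a 16TB drive holding 6TB, `plan_in_place_migration(16 * TB, 6 * TB)` returns an 8TB NTFS partition and an ext4 partition just under 8TB; with 8TB used it returns `None`.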

JWvdV · February 21, 2024, 5:40am

To give you an idea what lxc-solutions are capable of:

root@STORJ-HOST:/# lxc-ls -f
NAME             STATE   AUTOSTART GROUPS IPV4                                   IPV6                                                     UNPRIVILEGED
(...)
STORJ-BLUEPRINT  STOPPED 0         -      -                                      -                                                        false
STORJ-MULTINODE  RUNNING 1         -      172.17.0.1, 192.168.1.20               fdfd:98a1:5eab:8160:4852:4fff:fe54:5320                  false
STORJ1           RUNNING 1         -      172.17.0.1, 192.168.1.2                fdfd:98a1:5eab:8160:4852:4fff:fe54:5301                  false
STORJ12          RUNNING 1         -      172.17.0.1, 192.168.1.13               fdfd:98a1:5eab:8160:4852:4fff:fe54:5312                  false
STORJ13          RUNNING 1         -      172.17.0.1, 192.168.1.14               fdfd:98a1:5eab:8160:4852:4fff:fe54:5313                  false
STORJ14          RUNNING 1         -      172.17.0.1, 192.168.1.15               fdfd:98a1:5eab:8160:4852:4fff:fe54:5314                  false
STORJ15          RUNNING 1         -      172.17.0.1, 192.168.1.16               fdfd:98a1:5eab:8160:4852:4fff:fe54:5315                  false
STORJ16          RUNNING 1         -      172.17.0.1, 192.168.1.17               fdfd:98a1:5eab:8160:4852:4fff:fe54:5316                  false
STORJ17          RUNNING 1         -      172.17.0.1, 192.168.1.18               fdfd:98a1:5eab:8160:4852:4fff:fe54:5317                  false
STORJ2           RUNNING 1         -      172.17.0.1, 192.168.1.3                fdfd:98a1:5eab:8160:4852:4fff:fe54:5302                  false
STORJ3           RUNNING 1         -      172.17.0.1, 192.168.1.4                fdfd:98a1:5eab:8160:4852:4fff:fe54:5303                  false
STORJ4           RUNNING 1         -      172.17.0.1, 192.168.1.5                fdfd:98a1:5eab:8160:4852:4fff:fe54:5304                  false
STORJ5           RUNNING 1         -      172.17.0.1, 192.168.1.6                fdfd:98a1:5eab:8160:4852:4fff:fe54:5305                  false
STORJ6           RUNNING 1         -      172.17.0.1, 192.168.1.7                fdfd:98a1:5eab:8160:4852:4fff:fe54:5306                  false
STORJ7           RUNNING 1         -      172.17.0.1, 192.168.1.8                fdfd:98a1:5eab:8160:4852:4fff:fe54:5307                  false
STORJ8           RUNNING 1         -      172.17.0.1, 192.168.1.9                fdfd:98a1:5eab:8160:4852:4fff:fe54:5308                  false
STORJ9           RUNNING 1         -      172.17.0.1, 192.168.1.10               fdfd:98a1:5eab:8160:4852:4fff:fe54:5309                  false
VPN-server       RUNNING 1         -      10.66.66.1, 192.168.1.217              fd42:42:42::1, fdfd:98a1:5eab:8160:216:3eff:fe4a:60e3    false

root@STORJ-HOST:~# free -m
               total        used        free      shared  buff/cache   available
Mem:           15772        4164         566         129       12716       11607
Swap:          26789        1751       25037

Altogether 15 drives summing up to about 30TB.

I could have used only docker instances, but since I would have had to do more by hand in that case, I preferred this option, which also gives more options for monitoring and external control (every node has its own LAN IP).

This is all on an N100, also running multiple other services. It is almost always idling while waiting for IO:

root@STORJ-HOST:~# iostat
Linux 6.1.0-18-amd64 (STORJ-HOST)       21-02-24        _x86_64_        (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2,54    0,26    7,48   88,63    0,00    1,09

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
sda             224,59      3585,21       831,45         0,00  210023063   48706365          0
sdb             245,18      3814,54       935,69         0,00  223457147   54813032          0
sdc             365,15     10329,29       477,68         0,00  605093094   27982396          0
sdd               3,80        20,75        42,82         0,00    1215384    2508540          0
(...)

Alexey · February 21, 2024, 7:18am

Actually they can do it a little bit easier, safer and on the fly, if they would use LVM.

JWvdV · February 21, 2024, 7:49am

It doesn’t help that you’re quite vague about your config. But what’s different?
Proxmox on ZFS-pool, with more than one Windows Server OS in order to run storagenodes on NTFS-formatted CMR EXOS drives?

Proxmox BTW has native support for containers, as far as I’m up to date (not using it myself).

Why? I mean, LVM is a mess as soon as there is corruption, as far as I’m aware. Few possibilities to repair, no integrity checks (dm-integrity) by default. Only easier in the sense that you can create logical volumes on the fly, so no need for disk images anymore. Or do you see any further benefits?

GoldfishIT · February 21, 2024, 8:56am

I’ll try a different way to explain.
One NTFS formatted HDD holds Storj data for one node and is passed through to just one node which is a Windows guest on Proxmox. My other nodes use the same setup. If you’re curious as to why this happened instead of just using LVM or docker or linux VMs or whatever then you can check the context part of my question.

JWvdV · February 21, 2024, 9:18am

So, again, I actually get the feeling we’re not talking about different things.

You’re talking about a node as if it includes the OS. That’s not true. You can easily run multiple nodes on one machine, even within the same user space if you want (running the binary more than once, with different params), provided you’ve got enough resources for it and at least one physical drive (or equivalent) per node.

Again:

So, you have a host OS which is ProxMox. On a ZFS pool, probably RAID as you’re talking about SSD’s.
This OS has some virtual machines instances running Windows Server as an OS, all situated on the ZFS pool.
At least some (if not all) of these Windows VM-instances have running STORJ-storagenode software for which also a physical storage drive has been passed through formatted NTFS (one per VM instance).

This is what I worked with, and I draw my conclusions from.

It really helps if you just make it as tangible as possible for your helpers: I have a machine with X RAM and Y CPU, running Z OS having N VM instances running O guest OS which have P RAM assigned, Q physical drives for STORJ…