Updates on Test Data

You are correct.
I did move a few of the nodes from VMs to bare metal as a test, and the CPU usage is night and day. In some cases it is ten times lower on bare metal than on the VM.
Really hard to believe, as I was used to the norm that the overhead is a few percent at most when running workloads on VMs.
I suspected networking was causing this, but that isn't the case, as it was the same when I switched to VFIO.
I suspect all the mitigations the kernel is doing for the guests are the problem, especially on older hardware. These can be turned off, but I haven't had a chance yet as that requires a reboot, and I probably won't bother and will move away from virtualisation altogether.
If Docker causes a similar performance hit, then I guess it would be advisable to move away from Docker as well and ask SNOs to run this on bare metal if possible.
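(For anyone curious about the mitigations mentioned above: on a GRUB-based host they are typically disabled via the kernel command line, roughly as in the sketch below. This is illustrative only; weigh the security trade-off first, and the exact file and update command depend on the distribution.)

```
# /etc/default/grub - illustrative; disables all CPU vulnerability mitigations
GRUB_CMDLINE_LINUX_DEFAULT="quiet mitigations=off"
# apply and reboot, e.g. on Debian/Proxmox: update-grub && reboot
```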

1 Like

Ok turns out we are doing this today. It will get deployed shortly.

6 Likes

This is not a normal workload though. Modern CPUs have built-in instructions specifically designed to accelerate encryption and decryption. These and other extensions might not be available in VMs, might not be supported by the hypervisor, might be emulated in software, etc.
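One quick way to check whether a guest actually exposes those extensions is to read the CPU feature flags from inside the VM; here is a minimal sketch in Go using golang.org/x/sys/cpu (illustrative only, and x86-specific):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

func main() {
	// AES-NI accelerates AES encryption/decryption; AVX2 and SSE4.2 help other
	// hot paths. If these report false inside a VM but true on the host, the
	// hypervisor is presenting a generic CPU model that hides the features.
	fmt.Println("AES-NI:", cpu.X86.HasAES)
	fmt.Println("AVX2:  ", cpu.X86.HasAVX2)
	fmt.Println("SSE4.2:", cpu.X86.HasSSE42)
}
```

On KVM/Proxmox, picking the "host" CPU type for the guest usually passes these flags through instead of a generic model that hides them.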

3 Likes

I hope you’re not on call over the weekend! :smile:

2 Likes

It is a feature flag so we can go back to the behavior that was running stable for hours.

3 Likes

I have 3 nodes on 1 IP and I've been wondering whether that was a good decision at this point. I'm contemplating stopping 2 of them from filling any further if there is a risk of overload from running more than one. Please let us know if there are adverse findings on this. So far I'm seeing similar data per day combined (350-500 GB), and the resource consumption doesn't even seem close to the limit.

I like this strategy of keeping the node size limit no larger than my largest single drive, for a DR scenario.

30% on Linux means 30% of one core. If the node process used all 4 cores it would show 400% utilization.
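A quick worked example, in case the convention is confusing: on a 4-core machine, 30% means 0.3 of one core, which is 0.3 / 4 ≈ 7.5% of total CPU capacity; a process saturating all four cores would read 400%, not 100%.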

1 Like

And what are you doing right now?

Because I noticed that all nodes started to use more and more RAM for storagenode.exe (Windows GUI), while the disk seems to be keeping up as usual.
The network load is ~30%; it was like that and higher all day before, and the RAM was under control, until 19:19 UTC+2 when I noticed the change.

Yeah, I'm observing that RAM for the process is going up by the minute.
5 minutes ago it was 1300 MB, now it's 1700 MB. Similar across all nodes that get ingress.
Not occurring on nodes that are full.

With more CPU % load accompanying it, of course.
I see the HDDs are not working any harder than usual; I see them chilling with writes (if they can afford to show 3-5 MB/s and then some 1.5 MB/s or 748 KB/s in writes, there isn't so much load that they couldn't keep up).

Edit: now 1946 MB of RAM for storagenode.exe, other nodes similarly climbing towards the machine's available limit; will observe and watch what happens.

Edit2: Yup. The memory for the process started to grow rapidly, about 4-5 MB per second, up to 2282 MB from what I saw, then the screen flashed black, storagenode.exe was terminated unexpectedly and all the occupied RAM was freed.
And in Event Viewer I see, prior to that, a warning that the machine is running low on virtual memory.
storagenode.exe was restarted after 2 minutes (because that's what I set in Services for a crash).

In the logs there's nothing, just an abrupt cut after a line of normal file upload work.

Edit3:
Oh, I see the storagenode.exe process is back to 2000 MB of RAM again. So fast.
I thought it would climb slowly after the restart, but here it is at >2000 MB again.
And it's gaining about 0.3 MB per second; it's at 2200 MB again, hmmm.
Looks like all nodes are going in that direction. Oh no, and I was hoping they would finish the filewalkers this week so I could turn it off :frowning: the crashes will cancel all the progress.

Edit4: 20:19 UTC+2, I see the process is finally losing MB, and I see the overall network utilization dropped by half ~120 seconds ago, thank you.

Edit5: yeah, it's very much losing MB; 5 minutes ago it was ~1500 MB, now it's ~1000 MB :partying_face:

Edit6: oh no, 20:36 UTC+2, it's back to what I'll call "bad ingress": only 30% of the network, but with too much CPU and too much growing RAM. I was enjoying such a load earlier today and on the days before, and the problem wasn't occurring, hmmm.

The nodes are on 1.104.5 as it's the current minimum.
Maybe you guys enabled something on the satellites and forgot the nodes can't handle it yet? I don't know; I see there's version 1.105.4 available in testing.

Edit7: I can confirm that at 00:58 (UTC+2) the traffic is back to 30% of my network, the same network load as during the problems, but the observed node is back to stable RAM and CPU usage. I didn't change anything, and all the other nodes are also behaving. Hope that's some useful feedback.

1 Like

my router collapsed…

1 Like

Love that Proxmox thinks I am writing at like 200 GB/s :joy:

1 Like

Can confirm, bare-metal RAM usage is 5x what it used to be, with minimal network and IO.

1 Like

I wonder if this is just the change from sync to async writes? With the old sync behavior a node was limited by what disks could actually handle (and it would lose races asking for more). But with the new async writes everything is going to get funneled straight into RAM (for the OS to write when-it-can)… so the node is always telling the network “I got this - send me more”.

Get some data… stuff it in RAM… get some data… stuff it in RAM… completed writes are a problem for later :wink:
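If it helps to picture the difference, here is a minimal sketch of the two behaviours in Go (purely illustrative, not the actual storagenode code): acknowledging a piece only after it is flushed to disk versus acknowledging as soon as the OS has buffered the write.

```go
package main

import (
	"fmt"
	"os"
)

// writeSync blocks until the data is physically flushed: the node can only
// accept pieces as fast as the drive can fsync, so the disk throttles ingress.
func writeSync(path string, piece []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.Write(piece); err != nil {
		return err
	}
	return f.Sync() // wait for the platter before saying "got it"
}

// writeAsync returns as soon as the kernel has the data in its write-back
// cache; the flush happens "when it can", so under heavy ingress the
// not-yet-flushed pieces pile up in memory.
func writeAsync(path string, piece []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.Write(piece) // no Sync(): acknowledge immediately, flush later
	return err
}

func main() {
	piece := make([]byte, 1<<20) // a dummy 1 MiB "piece"
	fmt.Println(writeSync("piece-sync.bin", piece))
	fmt.Println(writeAsync("piece-async.bin", piece))
}
```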

I personally doubt it; it looked like the HDD was rather bored.
And we had some cases of a memory leak back in the day, 2021 or so, but that problem was solved.
I would probably be fine even with 3-4 GB stuck to the process, but I can't put more RAM sticks into that motherboard to allocate more :frowning:

I’ve seen some more RAM used on all my nodes but still nothing especially worrying.
With the exception of one of the very first tests where CPU usage was very high, they’re all coping very well with the current test loads…

I don't get the "huge number of connections to a node" test. If we are preparing for a big client, or a few clients, I imagine they have a few servers that will upload to nodes. Why do we simulate a huge number of connections if we are preparing for a few servers? Can't they just upload with fewer connections instead of more? It's the same bandwidth: 10x1 Gbps = 1x10 Gbps. Why do we need to suffocate the CPU, router, ISP, etc. with more connections?

2 Likes

Isn’t this by design (1 piece = 1 connection)?

Are you asking why they’re testing the performance of different configurations… during their performance testing? :wink:

1 Like

Because it is a lot of data, from many sources, all hitting at once. The purpose of testing is to know what our best results are under varying high traffic conditions.

2 Likes

Mine too. It would be good to know when the developers plan to finally finish these tests. It's very annoying. Online and audit scores are failing, all thanks to the tests. I don't understand the purpose of these tests. Storj, in its current state, is nowhere near this traffic load or number of customers.

2 Likes

I also think the test is too much.

2 Likes