Today will start a bit slower. Before we can go on with full benchmark tests, we need to deploy the power of 2 node selection to all production satellites first. Any satellite that keeps running the old node selection will get upload errors the moment SLC pushes the network to the limit. We need a few more code changes, so this will take at least a few hours, if not days.
In the meantime we will slow down on the benchmark tests so we don't impact the other satellites. The first test today will be a comparison of upload duration between the old and new node selection. That is a different test focus than throughput. I am sure you will notice the drop in traffic. Don't panic.
Satellites are all running the new choice of 2 node selection. (I used the wrong term "power of 2" earlier. Let's call it choice of 2 from now on.)
Time to find out the maximum throughput we can reach now. So this time there will be a higher load again.
I've added some logging to my nodes and now I see that pieces uploaded with TTL=30 days (I presume: test traffic) have a fixed order size of 2319360 bytes, even if the piece itself is only 2048 bytes. This is worrying, as it would make traffic accounting on 1.104.5 nodes way off. This is exactly the part of my patch that I wanted to investigate before making a pull request.
At least the patch was reverted later, so up-to-date nodes will again have the right numbers.
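To illustrate why a fixed order size is worrying for accounting, here is a minimal sketch (the struct and field names are made up, not the actual storagenode types): if bandwidth were accounted from the order amount instead of the bytes actually written, a 2048-byte piece carried by a 2319360-byte order would be overcounted by a factor of more than a thousand.

```go
package main

import "fmt"

// Hypothetical illustration of the accounting concern; these are not
// the real storagenode types or field names.
type upload struct {
	orderAmount int64 // size allocated in the order limit
	pieceSize   int64 // bytes actually written to disk
}

func main() {
	u := upload{orderAmount: 2319360, pieceSize: 2048}

	fmt.Println("accounted from order amount:", u.orderAmount)
	fmt.Println("accounted from bytes written:", u.pieceSize)
	fmt.Printf("overcount factor: %.0fx\n",
		float64(u.orderAmount)/float64(u.pieceSize))
}
```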
Just wanted to reply to this - this is only kind of true.
It is true that the Satellite is now going to keep track of recent success rates in a system-wide data structure (a speed profile, I guess), but we actually have many of these speed profiles to choose from. We are going to be choosing which speed profile to use based on where the request is coming from.
Right now, this functionality isn't enabled, so yes, your description is true /this week/.
Our next step is to have every region's S3 gateway (where most of our traffic comes from anyway) get its own speed profile (once I can figure out why https://review.dev.storj.io/c/storj/edge/+/13309 isn't passing the build). Once every gateway region is handled, we'll explore what we can do for the remaining native integrations.
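For anyone trying to picture what a per-origin speed profile could look like, here is a minimal sketch in Go (the names and the decay factor are made up for illustration; this is not the satellite's actual success tracker): one exponentially decayed success rate per request origin, with a global fallback profile when the origin is unknown.

```go
package main

import (
	"fmt"
	"sync"
)

// speedProfile tracks a decayed success rate for one traffic origin
// (for example, a regional S3 gateway). All names here are illustrative.
type speedProfile struct {
	mu        sync.Mutex
	successes float64
	attempts  float64
}

func (p *speedProfile) Record(ok bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	const decay = 0.99 // assumed decay factor so recent uploads dominate
	p.successes *= decay
	p.attempts *= decay
	p.attempts++
	if ok {
		p.successes++
	}
}

func (p *speedProfile) Rate() float64 {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.attempts == 0 {
		return 1 // no data yet: don't penalize the node
	}
	return p.successes / p.attempts
}

// tracker keeps one profile per request origin plus a global fallback.
type tracker struct {
	mu       sync.Mutex
	profiles map[string]*speedProfile
	global   speedProfile
}

func (t *tracker) profileFor(origin string) *speedProfile {
	t.mu.Lock()
	defer t.mu.Unlock()
	if p, ok := t.profiles[origin]; ok {
		return p
	}
	return &t.global
}

func main() {
	t := &tracker{profiles: map[string]*speedProfile{"gateway-eu1": {}}}
	t.profileFor("gateway-eu1").Record(true)
	t.profileFor("gateway-eu1").Record(false)
	t.profileFor("unknown-origin").Record(true) // falls back to the global profile
	fmt.Printf("gateway-eu1 rate: %.2f\n", t.profileFor("gateway-eu1").Rate())
	fmt.Printf("global rate:      %.2f\n", t.profileFor("unknown-origin").Rate())
}
```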
So I'm watching this and the storagenode binary, during these tests, is running at a CPU load of over 700%, with IOwait% being pretty much insignificant. This is on an x86 CPU with plenty of RAM, an SSD cache and logs on a ramdisk.
It looks like the process is quite CPU heavy; I assume the smaller the segments get, the heavier it is.
Are there any plans to optimise the storagenode code further, maybe building binaries for different, more advanced CPU feature levels, if such a thing would help?
With such high CPU load and PPS, even the ping to the local gateway is quite erratic and the node is losing a significant number of races.
I can probably throw more cores at it, but that probably isn't an option for someone running this on a low-powered device or for someone running this at scale.
I wonder how Synos and similar devices are coping with this load.
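On the CPU flags question above: the storagenode is a Go binary, so one knob that already exists is the GOAMD64 microarchitecture level (for example, a GOAMD64=v3 build may assume AVX2). Whether that would actually help here is unknown; purely as a sketch, assuming you build from source, you could check what your CPU offers beyond baseline x86-64 with golang.org/x/sys/cpu:

```go
package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/cpu"
)

func main() {
	// Report CPU features relevant to the GOAMD64 levels; this is only a
	// diagnostic sketch, not anything the storagenode does today.
	fmt.Println("GOARCH: ", runtime.GOARCH)
	fmt.Println("SSE4.2: ", cpu.X86.HasSSE42)
	fmt.Println("AVX2:   ", cpu.X86.HasAVX2)
	fmt.Println("AVX512F:", cpu.X86.HasAVX512F)
}
```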
We can control the piece size by changing the RS numbers. The last test, which just ended, should have used twice the piece size and half the number of connections. According to our results, that increased performance. This isn't a solution; it is just an observation for now.
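To make that trade-off concrete, here is a rough sketch with made-up RS numbers (these are not the production values): the piece size is roughly the segment size divided by the number of required shares k, while the number of upload connections tracks the total share count n, so halving both k and n doubles the piece size and halves the connections for the same user data.

```go
package main

import "fmt"

// pieceStats returns the approximate piece size and the number of upload
// connections for a segment, given illustrative RS parameters k (required
// shares) and n (total shares). These are not the production RS numbers.
func pieceStats(segmentSize int64, k, n int) (pieceSize int64, connections int) {
	return segmentSize / int64(k), n
}

func main() {
	const segment = 64 << 20 // assume a 64 MiB segment for the example

	p1, c1 := pieceStats(segment, 32, 80) // hypothetical baseline
	p2, c2 := pieceStats(segment, 16, 40) // half the shares
	fmt.Printf("baseline: piece %d bytes, %d connections\n", p1, c1)
	fmt.Printf("adjusted: piece %d bytes, %d connections\n", p2, c2)
	// The adjusted run uploads pieces twice as large over half as many
	// connections for the same amount of user data.
}
```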
I can confirm. Windows GUI, Win 10 Pro.
2 cores per node, and an hour ago 80-100% CPU usage.
Those are 2 cores of an 8-core/16-thread CPU (AMD Zen 2, AM4 platform) delegated per VM instance. If someone is going to say "oh, it's the VM's fault", I'd say: imagine what would happen if this wasn't in a VM.
A potential CPU usage leak would eat all the CPU of the whole workstation.
With a VM, at least I can limit the cores per node like that (1 node, 1 HDD, 2 CPU cores).
Edit:
Normally one VM Storj instance uses 20-40% CPU load, even with quite high traffic like now (ingress is currently 30-45% of the network link, which is high; in past months normal was more like 3-10%).
That is with 2 cores per VM.
The high CPU % occurs only when the network is pushed to 100% of its capacity, like yesterday, so it is not something that happens often, rather rare, but still.
That's interesting. Mine had a peak of bandwidth use but CPU and IOWait didn't really go very high.
I wonder if my nodes were selected less often with the new choice of 2 selection criteria…
In my setup I do not see big CPU usage on the nodes themselves, but the antivirus eats significantly more than usual, and RAM consumption is higher than the usual 300-500 MB. With this many connections that is OK for me, as the node buffer in my setup is 4 MB.
There are bare metal nodes too, as far as I know. My observation is the same: the Windows service node doesn't consume much CPU (unfortunately on Windows there is no easy way to get CPU usage as a percentage from the CLI), but Task Manager shows ~0.12% (8 CPU cores),
while the docker nodes (they run in a VM because it's Docker Desktop for Windows) show:
```
CONTAINER ID   NAME           CPU %    MEM USAGE / LIMIT      MEM %   NET I/O           BLOCK I/O   PIDS
29ce6d785933   wireguard      0.17%    24.16MiB / 24.81GiB    0.10%   1.98kB / 183B     0B / 0B     21
3d20fef76e67   storagenode5   6.69%    75.09MiB / 24.81GiB    0.30%   2.63MB / 78.5MB   0B / 0B     54
9fb28e5cf48a   storagenode2   12.23%   668.8MiB / 24.81GiB    2.63%   10.3GB / 723MB    0B / 0B     229
```
And Docker itself shows about 30% CPU usage (8 CPU cores).
Test results are a step closer to our target, and this time we can keep the test running for hours without any errors on the other satellites. That is good. Our target is getting more and more within reach.
We are working on a different success tracker. Deploying on a Friday is a bad idea, so it will have to wait for Monday. There is a good description in the PR of what the differences are: https://review.dev.storj.io/c/storj/storj/+/13308
To me it looks like we might have a problem with the number of connections now. If I run a single node, I have an almost 100% success rate. If I add a second node on the same IP, they split the load and still have almost 100% success rates. I can continue like that up to 4 nodes. If I add more nodes, they start to get long tail canceled. My reading of this is that TCP fast open or connection pooling gives me a 100% success rate for a single node, but the more nodes I add, the higher the chance that a new connection needs to be established, and that makes me lose the race. It isn't much, so I could just ignore it, but it is visible on my Grafana dashboard.
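For context on the TCP fast open half of that theory, here is a minimal sketch of enabling it on a Go listener (Linux only, using golang.org/x/sys/unix; the port and queue length are just example values, and this is not the storagenode's actual listener code). The handshake round trip is only saved once the client already holds a fast-open cookie for that node, which is more likely when the same endpoint keeps reconnecting to the same small set of nodes.

```go
package main

import (
	"context"
	"log"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

func main() {
	// Enable TCP Fast Open on the listening socket (Linux only).
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			err := c.Control(func(fd uintptr) {
				// Allow up to 256 pending fast-open connections;
				// the queue length is arbitrary for this sketch.
				sockErr = unix.SetsockoptInt(int(fd),
					unix.IPPROTO_TCP, unix.TCP_FASTOPEN, 256)
			})
			if err != nil {
				return err
			}
			return sockErr
		},
	}

	ln, err := lc.Listen(context.Background(), "tcp", ":28967")
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()
	log.Println("listening with TCP_FASTOPEN on", ln.Addr())
}
```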