Updates on Test Data

Today will start a bit slower. Before we can go on with full benchmark tests, we need to deploy the power of 2 node selection to all production satellites. Any satellite that keeps running the old node selection will get upload errors the moment SLC pushes the network to its limit. We still need a few more code changes, so this will take at least a few hours, if not days.

In the meantime we will slow down on the benchmark tests so as not to impact the other satellites. The first test today will be a comparison of upload duration between the old and the new node selection. That is a different test focus than throughput. I am sure you will notice the drop in traffic. Don’t panic.

19 Likes

Just wanted to chime in and express my appreciation for the open communication (and excited to hear the results of further tests/onboarding grapevine)

16 Likes

Satellites are all running the new choice of 2 node selection. (I used the wrong term “power of 2” earlier. Let’s try to call it choice of 2 from now on.)
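For anyone who hasn’t run into the term before: “choice of 2” generally means picking two random candidate nodes and keeping the one with the better recent track record. Here is a minimal sketch of that idea in Go (not the actual satellite code; the Node type and success scores are made up for illustration):

```go
package main

import (
	"fmt"
	"math/rand"
)

// Node is a stand-in for a storage node candidate; successRate is the
// recently observed upload success ratio tracked by the satellite.
type Node struct {
	ID          string
	successRate float64
}

// chooseOf2 picks two random candidates and returns the one with the
// better recent success rate.
func chooseOf2(candidates []Node) Node {
	a := candidates[rand.Intn(len(candidates))]
	b := candidates[rand.Intn(len(candidates))]
	if a.successRate >= b.successRate {
		return a
	}
	return b
}

func main() {
	nodes := []Node{{"fast", 0.99}, {"slow", 0.60}, {"ok", 0.90}}
	fmt.Println("selected:", chooseOf2(nodes).ID)
}
```

Compared to picking a single node purely at random, this keeps the load spread out while still steering uploads away from consistently slow nodes.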

Time to find out the maximum throughput we can reach now. So this time there will be a higher load again.

10 Likes

I’ve added some logging to my nodes and now I see that pieces uploaded with TTL=30 days (I presume: test traffic) have a fixed order size of 2319360 bytes, even if the piece itself is 2048 bytes. This is worrying, as this would make traffic accounting in 1.104.5 nodes way off. This is exactly the part of my patch that I wanted to investigate before making a pull request.

At least the patch was reverted later, so up-to-date nodes will again have the right numbers.
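To put a rough number on why that accounting gap matters (my own back-of-the-envelope, assuming a node booked the full order size instead of the bytes actually transferred): 2319360 B / 2048 B ≈ 1132, so a small piece like that could be counted at more than a thousand times its real size.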

1 Like

I have a better name for the new node-selection system.

Just wanted to reply to this - this is only kind of true.

It is true that the Satellite is now going to keep track of recent success rates in a system-wide data structure (a speed profile, I guess), but we actually have many of these speed profiles to choose from. We are going to be choosing which speed profile to use based on where the request is coming from.

Right now, this functionality isn’t enabled, so yes, your description is true /this week/.

Our next step is to have every region’s S3 gateway (where most of our traffic comes from anyway) get its own speed profile (once I can figure out why https://review.dev.storj.io/c/storj/edge/+/13309 isn’t passing the build). Then, after every gateway region is handled, we’ll explore what we can do for the remaining native integrations.
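To illustrate the per-region part, here is a rough sketch of how I would picture “one speed profile per traffic origin” (purely illustrative; the origin key, fallback behaviour, and all names here are my assumptions, not the real satellite code):

```go
package main

import "fmt"

// SuccessTracker stands in for one "speed profile": recently observed
// success rates per node, as seen from one traffic origin.
type SuccessTracker struct {
	rates map[string]float64 // node ID -> recent success rate
}

// Selector keeps one tracker per request origin (for example an S3
// gateway region) plus a global fallback for everything else.
type Selector struct {
	byOrigin map[string]*SuccessTracker
	global   *SuccessTracker
}

// trackerFor picks the speed profile matching where the request came
// from, falling back to the global one for unknown origins.
func (s *Selector) trackerFor(origin string) *SuccessTracker {
	if t, ok := s.byOrigin[origin]; ok {
		return t
	}
	return s.global
}

func main() {
	sel := &Selector{
		byOrigin: map[string]*SuccessTracker{
			"gateway-eu1": {rates: map[string]float64{"nodeA": 0.99}},
		},
		global: &SuccessTracker{rates: map[string]float64{"nodeA": 0.95}},
	}
	fmt.Println(sel.trackerFor("gateway-eu1").rates["nodeA"])   // as seen from that gateway region
	fmt.Println(sel.trackerFor("native-uplink").rates["nodeA"]) // falls back to the global profile
}
```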

6 Likes

(…with all the risk of beeing banned…)
That’s what she said! :sunglasses:

3 Likes

So I’m watching this, and during these tests the storagenode binary is running at a CPU load of over 700%, with IOwait% being pretty much insignificant. This is on an x86 CPU with plenty of RAM, an SSD cache, and logs on a ramdisk.
It looks like the process is quite CPU heavy; I assume the smaller the segments get, the more so.
Are there any plans to optimise the storagenode code further, maybe building binaries for different, more advanced CPU flags if such a thing would help?
With such high CPU load and PPS, even the ping to the local gateway is quite erratic and the node is losing a significant number of races.
I can probably throw more cores at it, but that isn’t an option for someone running this on a low-powered device or at scale.
I wonder how Synos and similar devices are coping with this load.

3 Likes

We can control the piece size by changing the RS numbers. The last test, which just ended, should have had twice the piece size and half the number of connections. According to our results that increased performance. This isn’t a solution. It is just an observation for now.
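To make the arithmetic behind that explicit (illustrative numbers only; the exact segment size and RS settings of the test are assumptions on my part): the piece size is roughly segment size / k, and each segment is uploaded over up to n connections. With 64 MiB segments and RS k=29, n=110 that is about 2.3 MB per piece over 110 connections; cutting both numbers roughly in half (k≈15, n≈55) about doubles the piece size to roughly 4.5 MB while halving the connections per segment.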

I can confirm. Windows GUI, Win 10 Pro.
2 cores per node, and an hour ago 80-100% CPU usage.
Those are 2 cores of an 8-core/16-thread CPU (AMD Zen 2, AM4 platform)
delegated per VM instance. If someone is going to say “oh, it’s the VM’s fault”, I’d say: imagine what would happen if this weren’t in a VM.
A potential CPU usage leak would eat all the CPU of my whole workstation.
With a VM, at least I can limit the cores per node like that (1 node, 1 HDD, 2 CPU cores).

Edit:
Normally one VM Storj instance uses 20-40% CPU load, even with quite high traffic like now (ingress is currently 30-45% of the network link, which is high; in past months the normal was more like 3-10%).
That is with 2 cores per VM.
The high CPU % occurs only when the network is being tested up to 100% of its capacity, like yesterday, so it is not something that happens often, rather rarely, but still.

1 Like

That’s interesting. Mine had a peak of bandwidth use but CPU and IOWait didn’t really go very high.
I wonder if my nodes were selected less often with the new choice of 2 selection criteria…

In my setup I do not see big CPU usage on the nodes themselves, but the antivirus eats significantly more than usual, and RAM consumption is above the usual 300-500 MB. With this many connections that is OK for me, as the node buffer in my setup is 4 MB.

Didn’t you add the node drives to the exclusion list?

3 Likes

Thank you, I forgot to add around half of the HDDs to the exclusion list. Now it is more reasonable.

3 Likes

I suspect the VM is the culprit

see

There are bare metal nodes as far as I know. My observation is the same: the Windows service node doesn’t consume much CPU (unfortunately there is no easy way in Windows to get CPU usage as a percentage from the CLI), but Task Manager shows ~0.12% (8 CPU cores),
while the docker nodes (they are running in a VM because it’s Docker Desktop for Windows) show:

CONTAINER ID   NAME           CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O   PIDS
29ce6d785933   wireguard      0.17%     24.16MiB / 24.81GiB   0.10%     1.98kB / 183B     0B / 0B     21
3d20fef76e67   storagenode5   6.69%     75.09MiB / 24.81GiB   0.30%     2.63MB / 78.5MB   0B / 0B     54
9fb28e5cf48a   storagenode2   12.23%    668.8MiB / 24.81GiB   2.63%     10.3GB / 723MB    0B / 0B     229

And Docker itself shows about 30% CPU usage (of the 8 CPU cores).

2 Likes

It seems that my node has been selected. Hard disks are frying like eggs :joy:

The test results are a step closer to our target, and this time we can keep it running for hours without any errors on the other satellites. That is good. Our target is getting more and more within reach.

We are working on a different success tracker. Deployment on a Friday is a bad idea, so it will have to wait for Monday. There is a good description in the PR of what the differences are: https://review.dev.storj.io/c/storj/storj/+/13308

To me it looks like we might have a problem with the number of connections now. If I run a single node I have an almost 100% success rate. If I add a second node on the same IP they split the load and still have an almost 100% success rate. I can continue like that up to 4 nodes. If I add more nodes they start to get long-tail canceled. My reading of this is that TCP fast open or connection pooling gives me a 100% success rate for a single node, but the more nodes I add, the higher the chance that a new connection needs to be established, and that makes me lose the race. Not by much, so I could just ignore it, but it is visible on my Grafana dashboard.
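To illustrate the effect I suspect (this is just my hypothesis; the pooling details of the real gateway/uplink are assumptions here): if connections are reused per node address, then every additional node behind the same IP is another address that may still need a cold dial before its first upload.

```go
package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

// dialPool reuses one connection per node address; the first upload to a
// given node pays the full dial (and handshake) cost, later ones do not.
type dialPool struct {
	mu    sync.Mutex
	conns map[string]net.Conn
}

func (p *dialPool) get(addr string) (net.Conn, bool, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if c, ok := p.conns[addr]; ok {
		return c, true, nil // warm: no new handshake, better chance to win the race
	}
	c, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		return nil, false, err
	}
	if p.conns == nil {
		p.conns = map[string]net.Conn{}
	}
	p.conns[addr] = c
	return c, false, nil
}

func main() {
	p := &dialPool{}
	// With one node there is a single address to keep warm; with many
	// nodes on the same IP (different ports) a larger share of uploads
	// hits a cold dial first.
	for _, addr := range []string{"127.0.0.1:28967", "127.0.0.1:28968"} {
		if _, warm, err := p.get(addr); err == nil {
			fmt.Println(addr, "warm:", warm)
		}
	}
}
```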

12 Likes

My Internet connection is now at its limit. Almost all successful uploads are coming from SLC.

1 Like

I do not see a lot of CPU usage on my node. The node runs in a VM, the host has 2x Xeon X5687 CPUs, the node VM has 4 cores.

It looks like the node process uses about 30% CPU.

1 Like

30% CPU of 4 cores could be 60%+ on 2 cores, so that is rather high for a single storagenode.exe.