Can't split load on multiple nodes. Why?

I am running a Ryzen 5600 with 128 GB of memory and 8 nodes with 1 HDD each. So far so good. While the benchmark tests are running I tried a few things. I get the highest upload rate if I accept all data on a single node. The other 7 nodes then claim that they are full and don't get any uploads. If I add a second node, the two split the incoming uploads and still hit the same total number of uploads. So far everything works as expected.

What I would like to do is signal free space on all 8 nodes, but that doesn't work. I can see that I start losing some races. But why? Is it the higher number of connections? Is there something I can improve on my side, like a limit for TCP Fast Open connections that I need to increase?
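For reference, the TCP Fast Open and backlog settings I am thinking of would be checked like this (a sketch; I am not sure these are actually the bottleneck, and the values below are just illustrative):

```bash
# Check whether TCP Fast Open is enabled (1 = client only, 3 = client and server)
sysctl net.ipv4.tcp_fastopen
# Backlog limits that can cap how many incoming connections get queued
sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog
# Raise them temporarily if they look low (illustrative values, not a recommendation)
sudo sysctl -w net.core.somaxconn=4096
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=4096
```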

Worst case I write a script that just rotates through the 8 nodes in a way that only 1 or 2 are accepting data at the same time. That would solve it and still allow me to fill all drives.

Are you using Realtek NICs in that machine? They have smaller onboard buffers and weaker hardware offloading (compared to Intel enterprise NICs, for example). I think you are hitting the limits of the networking stack (router, NICs, etc.) with regard to concurrent connections.
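If you want to check what your NIC and its buffers look like, something like this should do it (assuming a Linux host and an interface named eth0; adjust to your setup):

```bash
# Driver in use (r8169 is the common Realtek driver, e1000e/igb/ixgbe are Intel)
ethtool -i eth0
# Current vs. maximum RX/TX ring buffer sizes
ethtool -g eth0
# Offload features that are currently enabled
ethtool -k eth0 | grep -E 'segmentation|checksum|scatter'
```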


You can try to use a second network card.


Ok. Is there some command I can run to check?

On linux:

ss -tup | grep <your storagenode's port> | wc -l

Or leave the wc -l part out so that, instead of a count, you see the state of each connection on that storagenode's port.

You could try to correlate the number of connections with the time the node starts failing. If the connection count isn't going up, that means the node can't handle anything more being thrown at it. If it's clean on the IO side, then it's purely network from there on. You could go deeper and analyze CPU context switches per second and buffer usage, but I don't think you need to dig that deep.

If a single node can do 500 concurrent connections, but 4 nodes can only do 100, then it’s not a storage issue.
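A quick way to log that over time and correlate it with the failures (a sketch based on the command above; 28967 is just an example port, replace it with your node's):

```bash
# Append a timestamped connection count for the node's port every 5 seconds
PORT=28967   # example port, replace with your storagenode's port
while true; do
  echo "$(date +%T) $(ss -tup | grep -c ":$PORT")" >> /tmp/node_connections.log
  sleep 5
done
```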


Ok. I will try that out tomorrow. Thank you.

SATA multiplexing? Kernel context switches? Router port forwarding rules overhead? BTW, I saw the same thing on my N36L half a year ago when I accidentally left my horde of nodes all accepting uploads. I didn't bother looking for an answer back then, but I wouldn't expect anything specific to the new node selection algorithms.

I’m having some issues as well.
No matter how many nodes I have running behind one IP, I only seem to be getting about 150 Mbps of throughput. When I deployed a new node on a different machine, throughput on the line (1 Gbps) remained around 150 Mbps and throughput on the existing nodes dropped accordingly.

I am currently downloading a large dataset on my other (500 Mbps) line, so I can't quite tell whether throughput on that one is the same or not.

I have a UniFi Dream Machine Pro router with CPU and RAM utilisation in the 60% range.
Disabling IPS didn’t make a difference.
There is a bit of increased latency on my line when my nodes are all running.

Could it be an issue on the ISP's end? Is it something to do with node selection?

The problem was my limited bandwidth. With a single node the satellite has no issue maxing out my resources. The bitshift success tracker does a great job and keeps my node at a high success rate. If an upload fails, the bitshift success tracker notices that very quickly and scales down the upload rate.

With 2 nodes on the same IP this gets problematic, but it still works. With 4 nodes on the same IP it starts failing to scale the request rate and tries to upload more pieces to my nodes than I have bandwidth available. The result is a much lower success rate. I am not 100% sure why that is happening. It might be bad luck that the upload failures the bitshift success tracker is waiting for are not evenly distributed, making it harder for the success tracker to max out my resources. It could also be that the 100% bandwidth utilization has some side effect on TCP Fast Open or connection pooling. A lot of variables.

Now to the good news. The moment I upgraded my bandwidth the problem was gone. Now I can split the incoming traffic even on 8 nodes with no issues and they are all having a great time with 99% success rate.

Now this could happen again. The bandwidth upgrade means I currently don't hit 100% bandwidth utilization, but it is too close for comfort. So I have written a script that just rotates my nodes. It takes the output from df to find out which drive has the most free space available. The corresponding node gets the entire drive as allocated space. All the other nodes get 500 GB and therefore don't accept any uploads. On the next day I run the script again and it allows a different node to take all the uploads, reducing the current one to a 500 GB allocation.

Running this script once per day is fine, but hopefully one day I will run out of free space and as a consequence have to run it once per hour. There is one problem with that: the transition from one node to another takes 5 minutes, so I would lose 5 minutes of uploads per hour. The solution is that instead of sizing the node down to 500 GB, I first resize it so that it still has about 5 GB of free space. That way there is a smooth transition with no gap. The node informs all the satellites that they should stop uploading to it, but it will still take any file that gets uploaded in the meantime. While this is ongoing, one of the other nodes replaces it in the node selection. Over the next 5 minutes the satellite API pods will either select the soon-to-be-full node or start selecting the now available node, with no gap. On the following run, the script sizes the node down to 500 GB. So in total there are 3 states: one node accepts uploads, one node is in transition with 5 GB of free space, and at the end it gets just 500 GB of allocated space and waits for its next turn. That way I keep my upload rate high and can run the script as frequently as needed.
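For illustration, the rotation could look roughly like this (this is only a sketch of the idea, not my exact script; the node names, mount points, hard-coded sizes and the config.yaml/docker handling are assumptions you would adjust to your own setup):

```bash
#!/bin/bash
# Sketch of the 3-state rotation. Assumptions: one docker container per node,
# one mount point per disk, and the standard storage.allocated-disk-space
# setting in each node's config.yaml. Names, paths and sizes are illustrative.
NODES=(node1 node2 node3 node4 node5 node6 node7 node8)
DISKS=(/mnt/disk1 /mnt/disk2 /mnt/disk3 /mnt/disk4 /mnt/disk5 /mnt/disk6 /mnt/disk7 /mnt/disk8)
STATE=/var/tmp/storj_active_node     # remembers which node was active last run

free_bytes() { df --output=avail -B1 "$1" | tail -1; }
used_bytes() { df --output=used  -B1 "$1" | tail -1; }

set_allocation() {  # $1 = node index, $2 = allocation, e.g. "500 GB"
  sed -i "s|^storage.allocated-disk-space:.*|storage.allocated-disk-space: $2|" \
      "${DISKS[$1]}/storagenode/config.yaml"
  docker restart "${NODES[$1]}"
}

# Pick the disk with the most free space as the next active node
next=0
for i in "${!DISKS[@]}"; do
  [ "$(free_bytes "${DISKS[$i]}")" -gt "$(free_bytes "${DISKS[$next]}")" ] && next=$i
done
prev=$(cat "$STATE" 2>/dev/null || echo "$next")

for i in "${!NODES[@]}"; do
  if [ "$i" -eq "$next" ]; then
    set_allocation "$i" "8 TB"                                # active: whole drive
  elif [ "$i" -eq "$prev" ]; then
    gb=$(( $(used_bytes "${DISKS[$i]}") / 1000000000 + 5 ))   # transition: ~5 GB headroom
    set_allocation "$i" "$gb GB"
  else
    set_allocation "$i" "500 GB"                              # parked until its next turn
  fi
done
echo "$next" > "$STATE"
```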

I also thought about mixing in a used-space calculation on startup from time to time. The node that is about to take the uploads is never allowed to run the used-space calculation: I don't want it to slow down the uploads, and there is no reason for that node to run it. Same for the node that is in transition. But the moment a node transitions from 5 GB of free space to a 500 GB allocation, it can run some additional maintenance, including the used-space calculation. Perfect timing, and it would be just one node per cycle. For now I am not doing it. In a perfect world with no bugs I shouldn't have to run the used-space calculation ever, and not running it is the best way to identify possible bugs in that area. But I am sure there will be other maintenance jobs that I want to run from time to time, and with my script I can make sure they don't impact the 1-2 nodes that accept uploads and only run on the drives that are more or less idle.


Cool. And are you sure it was the bandwidth? With the higher plan, didn't you get a new, better router?

No hardware, firmware or software changes on my side. My ISP just bumped the limit in the system and that was it.


This explanation is still not consistent with the symptoms.

I wouldn’t be surprised though if the higher bandwidth package also came with better ISP-side QoS settings.


Please tell me: it was mentioned earlier that having more than 4 nodes in a /24 subnet (or behind one IP address) causes problems. Do I understand correctly that a larger number of nodes in one subnet should be taken into account?
What is the reasonable limit?

The issue is that with the "one node - one disk" approach, expanding means creating a separate node, and I already have several of them.

This is a very important question for me, since I do not use RAID, but create new nodes on new disks (realizing that I am losing a lot of money on retention), I would like to offer several solutions for consideration:

  1. Allow operators who have nodes in one /24 network to set a priority between their nodes, thereby regulating and redistributing the load during maintenance time.
  2. Allow operators to specify the minimum TTL they would like to accept; this would also relieve the load if a node (or several nodes in the /24 subnet) cannot cope, or during maintenance.

I want to emphasize that this is the WISH of the operator, not a hard and fast rule. The satellite would know about the wish and decide whether to send data or not. If everything is fine on the network, it accepts the node's wish - that would help remove the long tail - and if there is not enough space on the network, the satellite sends all the data anyway, ignoring the WISH of the operator.

Thus, this would allow us to balance the load, while the satellites still decide for themselves, as before, whom to send to and how much.
My proposals are an addition to the concept of a “slow” node, which the satellite now tracks, knowing that the node is losing races.

Please consider my suggestions.

Sounds like the same issue

and likely the reason is the same.

NO


I wrote this in the topic about test data, because that’s where this information was.

PLEASE DO NOT MOVE MY MESSAGES TO OTHER TOPICS!

Ok, I wouldn’t. However, I do not see a relation with the previous topic. Could you please explain to me why I failed to determine the right place?

Since you are referring to my posts, you might have missed the final solution that I posted here: Can't split load on multiple nodes. Why? - #9 by littleskunk


This is already possible in two ways:

  1. Setting your allocation for a given node below the amount of data the node believes it is storing stops ingress.
  2. You can set the maximum number of connections to 1, which, with the recent changes @littleskunk is working on, will make the satellite send a lot less ingress traffic to your node.
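For reference, both correspond to settings in the node's config.yaml (a sketch assuming the standard storagenode configuration; restart the node after changing them):

```yaml
# 1) Allocate less than the node already stores: it reports itself as full and ingress stops
storage.allocated-disk-space: 500.00 GB

# 2) Accept only one concurrent upload; the satellite scales ingress down accordingly
storage2.max-concurrent-requests: 1
```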