Updates on Test Data

I guess our nodes fall into the “preferred” category with the new selection algorithm.

All of my 12 nodes are “preferred”? Seems unlikely, surely…

:open_mouth: these are the right settings!

Help, will need more RAM! :smiley:

1 Like

The new node selection is awesome. More than twice the throughput, and from what I can see it didn't overload the nodes this time. Looks promising. I will write a few words about the node selection later.

8 Likes

Maybe one more piece of advice: if you have a concurrency limit set, now would be the time to remove it. No matter how bad your node is, the new node selection should scale the request rate down automatically and not overload your node anymore. Just try it out.

4 Likes


The system was still responsive. I had my finger on the trigger to kill it, but didn't have to. MORE PLEASE!!!

Sorry, just to be clear:
Is this storage2.max-concurrent-requests: ?

Yeah, that's the one.
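For anyone hunting for it, a minimal sketch of what that line looks like in the storage node's config.yaml (the value 7 here is just a made-up example):

```yaml
# storage node config.yaml
# Comment out or delete this line so the new node selection
# manages the request rate instead of a hard local cap:
# storage2.max-concurrent-requests: 7
```

Restart the node after editing the config; as far as I know, leaving the line out falls back to the default of no limit.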

1 Like

Between the IPv6 thread and this, I’ll pick this to spend my night on. Have to go back and figure out what the load actually was (I’m thinking waiting on IO) since the hosts didn’t feel sluggish at all even with some of them peaking at 1.7K load.

1 Like

OK, while you are removing any concurrency limit you might have set, I will explain how the node selection actually works.

When a segment gets committed to the database, it contains the nodes that were fast enough and is missing the nodes that got long-tail canceled. The satellite calculates a success rate for each node from that.

The node selection takes that success rate. Instead of 110 total nodes, it selects 220 candidates at first, compares them in pairs, and picks the one with the higher success rate from each pair. So it throws away the slow nodes. We call this the power of 2 node selection. The benefit of this method is that the risk of a thundering herd effect is low. That effect means a group of nodes with a high success rate gets selected, but because the node selection overshoots the goal, those nodes get selected too many times and their success rate drops. At the same time another group of nodes recovers and becomes the new victim of the thundering herd. Power of 2 choices is built to minimize that risk.
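A rough sketch of that idea in Go (this is not the actual satellite code; the types, numbers, and function names are made up for illustration):

```go
package main

import (
	"fmt"
	"math/rand"
)

// Node is a simplified stand-in for a candidate storage node.
type Node struct {
	ID          string
	SuccessRate float64 // share of uploads that were not long-tail canceled
}

// selectNodes picks `need` nodes using the power of 2 choices idea:
// draw twice as many candidates, compare them in random pairs, and keep
// the node with the higher success rate from each pair.
func selectNodes(candidates []Node, need int) []Node {
	// Shuffle so the pairing is random on every selection round.
	rand.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})

	selected := make([]Node, 0, need)
	for i := 0; i+1 < len(candidates) && len(selected) < need; i += 2 {
		a, b := candidates[i], candidates[i+1]
		if a.SuccessRate >= b.SuccessRate {
			selected = append(selected, a)
		} else {
			selected = append(selected, b)
		}
	}
	return selected
}

func main() {
	// 220 random candidates reduced to 110 by pairwise comparison.
	candidates := make([]Node, 220)
	for i := range candidates {
		candidates[i] = Node{
			ID:          fmt.Sprintf("node-%d", i),
			SuccessRate: rand.Float64(),
		}
	}
	fmt.Println(len(selectNodes(candidates, 110)), "nodes selected")
}
```

Because the pairing is random each round, a slow node only loses when it happens to be paired with a faster one, so it still gets picked sometimes; it just gets picked less often instead of being cut off completely.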

10 Likes

They count possible ingress now but still call it ingress. So this is not real. :sweat_smile:

Well, my success rate is still above 98% so most of it does count :slight_smile:

You are at the wrong party. There wasn't a single storage node dashboard screenshot. I was watching my nodes with Grafana and Netdata and it was looking really good.

2 Likes

How does the new node selection deal with /24? Does it still count them all as one?

My combined ingress from all nodes was over 1 TB yesterday while the router reported only 600 GB. And there is a lot of other traffic. So I guess the new fake ingress is 2x to 3x real ingress.

You still don't get a benefit from running multiple nodes on the same subnet, but it will scale the traffic up and down per node, not per subnet. So let's say we have 2 nodes sharing the same subnet and one has a better success rate. It will get selected more often by the power of 2 node selection, but it would still only reach half the request rate it could get without the second node in the subnet.
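One way to picture the /24 part (again just a sketch under my own assumptions, not the real satellite code): the subnet still only gets one slot in the candidate pool, and which member of the subnet fills that slot is decided before the power of 2 comparison runs.

```go
package main

import (
	"fmt"
	"math/rand"
)

// candidate is a hypothetical storage node with its /24 subnet attached.
type candidate struct {
	ID          string
	Subnet      string // e.g. "203.0.113.0/24"
	SuccessRate float64
}

// onePerSubnet keeps a single random member per /24 subnet, so several
// nodes behind the same subnet still share one slot in the candidate pool.
// The per-node success rate only matters later, in the pairwise comparison.
func onePerSubnet(nodes []candidate) []candidate {
	bySubnet := map[string][]candidate{}
	for _, n := range nodes {
		bySubnet[n.Subnet] = append(bySubnet[n.Subnet], n)
	}
	out := make([]candidate, 0, len(bySubnet))
	for _, members := range bySubnet {
		out = append(out, members[rand.Intn(len(members))])
	}
	return out
}

func main() {
	nodes := []candidate{
		{"node-a", "203.0.113.0/24", 0.99},
		{"node-b", "203.0.113.0/24", 0.80}, // shares node-a's subnet and its slot
		{"node-c", "198.51.100.0/24", 0.95},
	}
	fmt.Println(len(onePerSubnet(nodes)), "subnet slots") // always 2
}
```

Under that reading, two nodes in a subnet each take the slot roughly half of the time, and the one with the better success rate then survives more of the pairwise comparisons, which matches "selected more often, but still only about half the request rate".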

3 Likes

I don't know what you are talking about. You are still at the wrong party. My node has a 99% success rate. There is no fake traffic if you win all the races. Also, how do you expect Netdata to show any fake traffic? There is no code change we could do that would make Netdata show wrong numbers.

1 Like

What is the timeframe to recover?

I was talking to @ACarneiro. Could it be you are at the wrong party? :wink: