Updates on Test Data

Same here. The only node I have with a higher IOWait at the moment is one of my Pi5s with spinning rust and I suspect it’s running a GC. All others are taking this in their stride.

High CPU usage on this one (clocks boosting way up) + higher load.

bestofn n=4 could be a bit too much. (I still owe you a description of what exactly it does.)

In terms of throughput, that one was a new record for the bandwidth-optimized RS number we are currently running. In combination with one of the even faster RS numbers, this could hit the target.

1 Like

Probably because other nodes started GC and started losing races.

Yes, it trebled (at peak even quadrupled) the baseline I’ve been seeing for the test.

Up next: bestofn n=2, but this time we remove the IP subnet filter. Don’t get too excited. This is kind of impossible for production, but since I get that question in just about every meeting we might as well test it :smiley:
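
For context: the filter in question deduplicates the candidate list so that at most one node per /24 subnet gets picked. Roughly like this simplified sketch (made-up field names, not the actual satellite code):

```python
import random
from collections import defaultdict

def filter_one_per_subnet(nodes):
    """Keep at most one candidate per /24 subnet (simplified sketch)."""
    by_subnet = defaultdict(list)
    for node in nodes:
        # node["ip"] like "203.0.113.42" -> subnet key "203.0.113"
        subnet = ".".join(node["ip"].split(".")[:3])
        by_subnet[subnet].append(node)
    # keep one random representative per subnet
    return [random.choice(group) for group in by_subnet.values()]
```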

5 Likes

Certainly seemed to push a lot more data. :slight_smile:

IO wait has been an issue since I restarted most of my nodes earlier today, so filewalkers are running simultaneously with some GC going on as well.

This is mostly unrelated to the new tests, though I did see the issue get worse during peaks. Nothing to be too concerned about, as the nodes seemed to perform just fine for the most part.

Disabling the IP subnet filter had almost no effect. Up next we will try bestofn n=3 to see whether the results are closer to n=2 or to n=4.

4 Likes

While the test is running I can explain a bit what bestofn does. The currently active RS settings are 16/20/30/38. With n=2 it selects 38*2=76 candidate nodes, orders them by success rate, and picks only the 38 nodes with the highest success rate.

The downside is that with n=4 it kind of stops selecting nodes with a low success rate. It still selects them from time to time, but less frequently than the choiceofn selection would. On my own storage node I noticed a decrease in usage compared to n=2. That indicates n=4 is now missing out on the resources that the slow to reasonably fast nodes still have to offer. On the other side, the total throughput was impressive. So as a storage node operator myself I would vote for a more balanced approach, but I would also understand if my manager ignores that wish and picks the higher total throughput.

That’s why we are testing n=3, to get a sense of how it performs on these two aspects.
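
In pseudocode, bestofn is roughly this (a simplified sketch with made-up field names, not the actual satellite implementation):

```python
import random

def best_of_n(nodes, needed, n):
    """Simplified sketch of bestofn: oversample the candidates by a factor
    of n, then keep only the ones with the highest success rate."""
    # e.g. needed=38 and n=2 -> 76 random candidates
    candidates = random.sample(nodes, min(len(nodes), int(n * needed)))
    # order by historical upload success rate, best first
    candidates.sort(key=lambda node: node["success_rate"], reverse=True)
    # keep only the `needed` best candidates
    return candidates[:needed]
```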

4 Likes

n=3 was very close to the n=4 results. So most of the gain is somewhere between n=2 and n=3, and increasing it further doesn’t help as much. It was also using my storage node again.

Up next we will keep n=3 but combine it with the fastest RS number from previous tests, 16/20/30/60. Technically 3*38 = 114 selected nodes is not the same as 3*60 = 180, so this isn’t a fair comparison. Later I might want to try n=1.9, since 1.9*60 = 114 would match the earlier pool size.

3 Likes

Not sure why this wouldn’t make sense. I said this already: Storj will work on a small number of well-connected data center nodes with lots of storage; there’s no need for small operators to exist. It would actually be worrisome if your manager didn’t know that.

We have that. It is called the Storj Select network, and it is currently performing worse than the public network. Decentralization and the sheer number of nodes just outperform it.

1 Like

Yet your results for n=4 suggest you could get rid of probably half of the operators on the public network. My impression is that if you performed an experiment like: pick the k nodes with the best success rate, and run the old node selection algorithm just on them, for k=500, 1000, 2000, then you’d outperform even the n=4 experiment for at least one of those k values.
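
Roughly something like this (just a sketch; I’m treating the old selection as a uniform random pick over the remaining pool, which is of course a simplification):

```python
import random

def top_k_then_old_selection(nodes, needed, k):
    """Sketch of the proposed experiment: restrict the pool to the k nodes
    with the best success rate, then run the old selection on just them."""
    ranked = sorted(nodes, key=lambda node: node["success_rate"], reverse=True)
    pool = ranked[:k]                    # e.g. k = 500, 1000, 2000
    return random.sample(pool, needed)   # old selection, modelled here as uniform random
```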

I read the results totally differently. The test results show that we have a decent number of fast nodes and bestofn can make use of that, but it is missing out on the resources that the rest of the nodes still offer. This is the wrong node selection for the job. Even the slowest node out there will still bring additional resources to the party that an ideal node selection can utilize.

5 Likes

You probably know better which test corresponds to which peak. But the last one definitely pushed the most data to my node. Though with the higher number of initiated transfers that doesn’t necessarily mean I get to keep more data, as I’m sure more got long-tail cancelled as well.

It’s now easier to see the CPU impact as well, since it seems most file walkers have finished. Though I still have one GC running.


It seems like that last test spiked normal CPU usage more than IO wait, which I wasn’t expecting.

This would lead to a much larger number of nodes getting significantly less data. It’s clear from the traffic I saw that I’m not in that lower part, but it does make me worry a little more about distribution. Is that an aspect you are looking into for these tests? I’d say especially if you push these numbers too far (like maybe n=4), the chances of pieces ending up on more coordinated systems/locations go up, possibly impacting durability risk.

2 Likes

So I assume it’s useless to attempt to challenge you to perform this experiment? :stuck_out_tongue:

Now testing RS number 16/20/30/60 with bestofn n=1.9

2 Likes

There is a reason we tested everything else with the other RS setting, which has a much shorter long tail. In fact, it has the shortest long tail I could come up with without increasing the risk of upload failures too much.

The point of this test is to see how it compares to earlier tests. It shows us whether the node selection itself is able to utilize the available resources without the old long tail trick. Ideally we find a node selection that gets us maximum throughput with minimal resource overhead. So from time to time we have to test the excessive RS number to verify how good or bad the current node selection settings are.
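
For anyone unfamiliar with the long tail trick: the uplink starts more uploads than it needs and cancels the slowest ones once enough pieces have landed, so with 16/20/30/38 at most 8 uploads get cancelled, while with 16/20/30/60 it can be up to 30. A rough sketch of the idea (not the real uplink code):

```python
import asyncio
import random

async def upload_piece(node):
    """Placeholder for a single piece upload; the sleep stands in for node speed."""
    await asyncio.sleep(random.uniform(0.05, 1.0))
    return node

async def upload_with_long_tail(nodes, success_threshold):
    """Start uploads to all selected nodes, keep the first `success_threshold`
    that finish and cancel the rest (the long tail)."""
    tasks = [asyncio.create_task(upload_piece(node)) for node in nodes]
    finished = []
    for next_done in asyncio.as_completed(tasks):
        finished.append(await next_done)
        if len(finished) >= success_threshold:
            break
    for task in tasks:
        task.cancel()  # cancel whatever is still running
    return finished

# e.g. asyncio.run(upload_with_long_tail(selected_nodes, 30)) with 38 or 60 selected nodes
```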

3 Likes

How about automating these tests with some form of black box optimization? I did this ages ago with Postgres and Apache Druid using hyperopt; it was quite helpful not to spend engineering time manually testing different hypotheses.
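
For example, with hyperopt the loop could look roughly like this (just a sketch; `run_test_and_measure_throughput` is a placeholder for whatever would actually apply the settings, run a test window and report the result):

```python
import random
from hyperopt import fmin, tpe, hp

def run_test_and_measure_throughput(params):
    """Placeholder: in reality this would apply the settings, run a test
    window and return the measured throughput. Here it just returns a dummy."""
    return random.uniform(0, 100)

# hypothetical search space over the knobs discussed in this thread
space = {
    "n": hp.quniform("n", 1, 4, 0.1),           # bestofn oversampling factor
    "total": hp.quniform("total", 38, 60, 1),   # total pieces per segment
}

def objective(params):
    # hyperopt minimizes, so negate the throughput
    return -run_test_and_measure_throughput(params)

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)
```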

So… today slower nodes still get selected for the opportunity to win upload races (even if they may fail often). But the recent highest-performance configs don’t even give them a chance? If they never get a chance, they’ll leave.

I see what @Toyoo is saying: a more winner-takes-all node selection could benefit paying clients. But it would be to the detriment of the health of the SNO community. Because yes, speed is a priority: but so is a strong diversity of nodes to mitigate risk.

Like the fastest performance may be “send all client requests to @Th3Van”. But if the network atrophies to 500 fast nodes then you’re screwed when he’s offline.

This is all interesting stuff! I get that Storj wants to tweak all the dials to see what’s possible. Then they’re going to have to make some tough business decisions about what’s safe and realistic. Glad I’m not you! :wink: