Node overloaded - traffic flooding

You can read more about it here: Updates on Test Data

Windows: Node Version: v1.102.3

Meanwhile my Raspberry Pi on the same v1.102.3 was running so smoothly I couldn't take my eyes off it. Reminded me of the early beta days when new data was pushed.


OK, that sounds like the new version should fix it. There was a free-space syscall that was very expensive on Windows.


My nodes are on version 1.102 on Docker.

I wonder how much of the network is on 1.104… I think around 20% now? This would be a great test of its new write caching features.

Oh, and I was wondering why my nodes crashed. But mine are at v1.102 at the moment? And the lazy filewalker was running at that moment. Can I disable it somehow?
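Answering my own question after some digging: the usual way is via the node's config.yaml. Treat the key names below as something to verify against your own config for your storagenode version rather than as official advice:

# run the filewalkers with normal instead of low I/O priority
pieces.enable-lazy-filewalker: false
# or skip the used-space scan on startup entirely
storage2.piece-scan-on-startup: false

For Docker nodes the same options can be appended as run flags, e.g. --pieces.enable-lazy-filewalker=false. Either way the node needs a restart to pick up the change.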

Just want to say all my nodes are operational.
Currently taking about 25% of the network's capacity.
Two minutes ago I noticed a spike to 50% of the network's capacity, which is ~500 Mbps of ingress (downloads to my nodes), and had no problem. Reporting: I'm ready for more!

Edit: it's 13 nodes (13 HDDs for a total of 150 TB)
@Roxor

@jammerdan
Not much; yesterday I checked them all and found only two.
The majority is on some 1.102.x release.


How many nodes is that across? I can't imagine holding that rate across just two or three! (although I'm jealous if that's what you're getting :star_struck: )

How many on v1.104 yet?

What I would suggest is that we go ahead with the rollout for 1.104.5 before running any more benchmarks.

The reason is simple: a lot of SNOs would be running the used-space filewalker when upgrading, since that fixes the trash problem. On top of that, we've had a lot of deletes, so they would still be running the trash filewalker as well (trash cleanup is still running for 2024-05-03 on some of my nodes, for example). Pile GC runs on top of that.

Right >now< is not the best time to benchmark the network, IMHO. Let the network go through the update (we have already established that it helps a lot), let the SNOs finish their used-space scans, and let the GC/trash cleanup finish for the heavy deletes. Then benchmark to your heart's desire.

So far I haven't had any node crash. All running 1.104.x (still waiting on used-space to finish on some of them since 1.104.1).


We are not running these tests for fun. You can read the full story here: Updates on Test Data

Aren't the results of those benchmarks skewed because nodes on fsync are not as fast as they should be?

If the target were to measure storage node performance, then yes. But as you know, that isn't the only component that needs to keep up with the load…


I know, which is why I suggested that a node that crashed because it choked on fsync writes isn't going to provide any meaningful data. I'm not saying pause the benchmarks for the next 5 years, I'm saying pause them for the next week.


Sounds like you didn't understand me. The target is to test the performance of the satellite. For that test, storage node performance is irrelevant, and even crashing a few nodes still gives us more than enough data to keep improving satellite performance.


I would like to add that node crashes could provide valuable information on how the satellite manages situations where many nodes suddenly disappear, whether due to an inability to handle the load or network issues.


Three nodes here. One restarted around the time of the test, but it appears to have updated, not crashed. The other two are on 1.102, and all of them appear to have survived the test.

I saw a spike of around 500 Mbps too, toward what appears to be the end of that test run. A lot of nodes around me must have dropped out.

My nodes are running on pretty stout hardware, definitely not the economy stuff that's recommended, but not enterprise grade either.

Same here, my nodes crashed. Server load went up to 1000 and the whole server almost froze (SSH couldn't work). What the hell happened?

My nodes are at 1.102.3.

Same here: two nodes on a QNAP ARM box, 8 TB, 1.5 GB of RAM, went offline.
Version 1.102.3.
The other node on Windows, an N100 with 16 GB of RAM, 2 TB, also on 1.102.3, runs smoothly.

Exactly why I keep "storage2.max-concurrent-requests: 10" uncommented.
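For anyone who wants to try the same thing, it is a single line in the node's config.yaml. This is just how mine is set; 0 means unlimited, and the right value depends on your hardware, so treat 10 as an example rather than a recommendation:

# refuse new uploads once 10 are already in flight
storage2.max-concurrent-requests: 10

The node has to be restarted for the change to take effect. Rejected uploads simply go to other nodes, so the cost is some lost ingress rather than failed audits.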