Hi guys,
I have had five nodes running for years, and sometimes, like today, 16th May @ 17:00-17:30 CEST (Italy summer time), I experience a flood of network traffic on some of them.
It could be a good thing, but the flooding is too much; even checking the logs with the tail command, they scroll by impossibly fast.
The CPU load reaches more than 500 and everything gets stuck: all the upload requests get canceled and the node crashes within a few minutes.
For now I have worked around it by disabling uploads, reducing the allocated space so the node has no free space to offer.
Has anyone else faced the same issue?
Is there a way to tune the number of parallel uploads, or something else to avoid this flood of requests from the network?
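For reference, this is roughly what my workaround looks like and the kind of knob I am hoping exists — just a sketch assuming the usual storagenode config.yaml options (storage.allocated-disk-space and storage2.max-concurrent-requests); please correct me if these are not the right settings:

```
# config.yaml (storagenode) — example values only, adjust for your node
# Workaround: advertise less space than is already used, so the node
# reports itself as full and stops accepting new uploads.
storage.allocated-disk-space: 4.0 TB

# What I am hoping for: a cap on simultaneous uploads.
# 0 is supposed to mean unlimited; a small value would shed load during spikes.
storage2.max-concurrent-requests: 10
```

I assume the node has to be restarted for changes in config.yaml to take effect.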
Is it a problem with your hardware/software, or is your internet connection not fast enough? How many of your nodes are running the new storage node release?
I had the same: half of my nodes crashed within an hour. I think this was some kind of benchmark? But yeah, half of my nodes crashed during that time. Thanks
Yes it was a benchmark test. I don’t think we even hit max load yet. So all the nodes that crashed will crash again. The question is what exactly was causing problems.
In my case my nodes were under heavy load because they were all deleting data from the new bloom filters, and they're quite large (10TB+), so they're not finished yet, especially when every few days a new node version comes out and restarts them. So filewalkers were running, deletes were happening, and then the benchmark hit. I guess it was just too much.
I think it's because you ran the benchmark during heavy deletes and filewalking, plus ingress/egress from customers. I think you shouldn't do benchmarks when the disks are already under heavy load…
Hello,
4 of 5, including the one that crashed, are on 1.102.3. One is running 1.104.5.
No hardware issue, but maybe the hardware is too slow for this kind of load.
It makes no sense to overload all the nodes when only a few will win the race and get the data.
Please find a better way to spread the data and the load, or at least provide instructions on how to limit incoming traffic: if nodes crash, they could stay down for hours.
I am trying to find out if the new version will help you or if there is something else wrong with your setup. What is going to happen is that this kind of load will occur a few more times, and we are planning to run it even for days without a break (after the storage node rollout is finished).
Of course not, except that the benchmark you did was the biggest upload I've seen since I became an SNO; my bandwidth was reaching ~500 Mbit/s ingress… I have never seen this kind of ingress before.
My Windows node just paused for a whole 10 minutes. There was nearly no activity on the disk (~2-3%). No log lines were created. The 10 minutes was deduced by comparing the timestamp of the last log line with the current time. I killed my docker container and restarted the node, and after that it was just cancelled uploads from Saltlake. The upload success rate was 4%.