Node overloaded - traffic flooding

Hi guys,
I have been running five nodes for years, and sometimes, like now (16 May, 17:00-17:30 CEST, Italian summer time), some of them get flooded with network traffic.
More traffic would be a nice thing, but this flood is too much; when I watch the logs with tail, they scroll by at a crazy speed.
CPU load goes above 500 and everything gets stuck, all upload requests get canceled, and within a few minutes the node crashes.
For now I have worked around it by reducing the free space, which disables uploads.

Has anyone else faced the same issue?

Is there a way to tune the number of parallel uploads, or something else to avoid the crazy number of requests from the network?

Thanks
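
To illustrate the question about limiting parallel uploads: a minimal sketch of the general idea, assuming a node that rejects new uploads once a fixed number are already in flight. This is not the storage node's actual code; `UploadLimiter`, `try_begin` and `end` are hypothetical names made up for this example.

```python
import threading


class UploadLimiter:
    """Hypothetical limiter: refuses new uploads when all slots are busy."""

    def __init__(self, max_concurrent):
        # One semaphore slot per upload that may run at the same time.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_begin(self):
        # Non-blocking: returns False right away if no slot is free,
        # so the caller can reject the upload instead of queueing it.
        return self._slots.acquire(blocking=False)

    def end(self):
        # Give the slot back once the upload finished or was canceled.
        self._slots.release()


limiter = UploadLimiter(max_concurrent=2)
results = [limiter.try_begin() for _ in range(3)]  # third request is refused
limiter.end()                                      # one upload finishes
after_release = limiter.try_begin()                # a slot is free again
```

Rejecting early costs the node that one upload, but it keeps the load bounded instead of letting every request pile up on the disk.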

Is it a problem with your hardware/software, or is your internet connection not sufficient? How many of your nodes are running the new storage node release?

Had the same, half of my nodes crashed within an hour. I think this was some kind of benchmark? Thanks :smiley:

[image: graph]

Upload benchmark test?

Th3Van.dk

Yes, it was a benchmark test. I don’t think we have even hit max load yet, so all the nodes that crashed will crash again. The question is what exactly was causing the problems.

In my case my nodes were under heavy load because they were all deleting data after the new bloom filters. The nodes are quite large (10TB+), so the deletes haven’t finished yet, especially when every few days a new node version arrives and restarts the node. So filewalkers were running and deletes were happening when the benchmark hit. I guess it was just too much.

The bad news is that it will happen again. How can we fix it? Is your node already running the new version?

It still shouldn’t have crashed the system, though… If you have very high IOWait, the worst that should happen is that you lose races…

Maybe the nodes can become smarter:

And furthermore, maybe don’t run things like filewalkers and trash collectors when pressure from customers is high.
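
The idea of postponing background work under pressure could look roughly like this. It is only a sketch under stated assumptions: the function names and the threshold of 1.0 load per CPU are made up for illustration, and this is not how the storage node actually schedules its filewalkers.

```python
import os


def should_run_background_job(load_1min, cpu_count, max_load_per_cpu=1.0):
    """Return True only if the 1-minute load per CPU is at or below the threshold."""
    return load_1min / cpu_count <= max_load_per_cpu


def background_job_allowed_now():
    # os.getloadavg() is only available on Unix-like systems.
    load_1min = os.getloadavg()[0]
    return should_run_background_job(load_1min, os.cpu_count() or 1)


# A load of 500 on a 4-core machine would postpone the job,
# while a load of 2 would let it run.
busy = should_run_background_job(500.0, 4)
calm = should_run_background_job(2.0, 4)
```

A filewalker or trash collector could check something like this before each batch and sleep for a while when it returns False.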

I don’t get it. My node has no issues. Why are your nodes crashing?

I don’t know if my nodes crashed.

Can we keep this thread nice and organized please? Only reports of nodes that had problems with the short loadtest please.

I think it’s because you ran the benchmark during heavy deletes and filewalking, plus ingress/egress from customers. I think you shouldn’t run benchmarks when disks are already under heavy load…

Hello,
4 of 5, including the one that crashed, are on 1.102.3. One is running 1.104.5.

No hardware issue, but maybe the hardware is slow for this kind of load.

It makes no sense to overload all nodes when only a few will win the race and get the data.
Please find a better way to spread the data and the load, or at least provide instructions on how to limit incoming traffic: if nodes crash, they could stay down for hours.

Thank you

I think if you keep benchmarking and nodes keep crashing, the filewalkers may never finish their jobs… especially when you have big nodes with a lot of data.

The customer will not wait with their uploads until it is convenient. The network has to sustain this kind of load, even over a longer time period.

I am trying to find out if the new version will help you or if there is something else wrong with your setup. This kind of load is going to happen a few more times, and we are planning to run it for days with no break (after the storage node rollout is finished).

Of course not, but the benchmark you ran was the biggest upload I’ve seen since I became an SNO. My bandwidth was reaching ~500 Mbit/s ingress… I have never seen this kind of ingress.

My Windows node just paused for a whole 10 minutes. There was nearly no disk activity (~2-3%), and no log lines were written. I deduced the 10 minutes by comparing the timestamp of the last log line with the current time. I killed Docker and restarted the node, and after that it was just canceled uploads from Saltlake. The upload success rate was 4%.
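
The "last log line versus current time" check can be sketched as below. The log line format here (a UTC timestamp as the first field) is an assumption for illustration, not necessarily what the storage node writes, and `seconds_since_last_log` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def seconds_since_last_log(last_line, now):
    """Seconds between the timestamp at the start of last_line and now."""
    stamp = last_line.split()[0]  # assumed: first field is the timestamp
    last = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    last = last.replace(tzinfo=timezone.utc)
    return (now - last).total_seconds()


# Made-up example line and "current" time, ten minutes apart.
line = "2024-05-16T15:20:00Z INFO example log line"
now = datetime(2024, 5, 16, 15, 30, 0, tzinfo=timezone.utc)
gap = seconds_since_last_log(line, now)
```

Running this periodically against the last line of the log would flag a stalled node once the gap grows beyond a few minutes.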

Old storage node version or the new storage node version?