Hi guys,
I have had five nodes running for years, and sometimes, like today, 16th May @ 17:00-17:30 CEST (Italy summer time), I experience a flood of network traffic on some of them.
It could be a good thing, but the flooding is too much; even checking the logs with the tail command, they scroll by impossibly fast.
The CPU load reaches more than 500 and everything gets stuck: all the upload requests get canceled and the node crashes within a few minutes.
For now I have worked around it by disabling uploads, reducing the allocated space so the node has no free space to offer.
Has anyone else faced the same issue?
Is there a way to tune the number of parallel uploads, or something else to avoid this flood of requests from the network?
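For reference, this is roughly what my workaround looks like and the kind of knob I am hoping exists — just a sketch assuming the usual storagenode config.yaml options (storage.allocated-disk-space and storage2.max-concurrent-requests); please correct me if these are not the right settings:

```
# config.yaml (storagenode) — example values only, adjust for your node
# Workaround: advertise less space than is already used, so the node
# reports itself as full and stops accepting new uploads.
storage.allocated-disk-space: 4.0 TB

# What I am hoping for: a cap on simultaneous uploads.
# 0 is supposed to mean unlimited; a small value would shed load during spikes.
storage2.max-concurrent-requests: 10
```

I assume the node has to be restarted for changes in config.yaml to take effect.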
Is it a problem with your hardware/software, or is your internet connection not fast enough? How many of your nodes are running the new storage node release?
I had the same: half of my nodes crashed within an hour. I think this was some kind of benchmark? But yeah, half of my nodes crashed during that time. Thanks
Yes it was a benchmark test. I don’t think we even hit max load yet. So all the nodes that crashed will crash again. The question is what exactly was causing problems.
In my case my nodes were under heavy load because they were all deleting data from the new bloom filters, and they're quite large (10TB+), so they're not finished yet, especially when every few days a new node version comes out and restarts them. So filewalkers were running, deletes were happening, and then the benchmark hit. I guess it was just too much.
I think it's because you ran the benchmark during heavy deletes and filewalking, plus ingress/egress from customers. I think you shouldn't do benchmarks when the disks are already under heavy load…
Hello,
4 of 5, including the one that crashed, are on 1.102.3. One is running 1.104.5.
No hardware issue, but maybe the hardware is too slow for this kind of load.
It makes no sense to overload all the nodes when only a few will win the race and get the data.
Please find a better way to spread the data and the load, or at least provide instructions on how to limit incoming traffic: if nodes crash, they could stay down for hours.
I am trying to find out if the new version will help you or if there is something else wrong with your setup. What is going to happen is that this kind of load will occur a few more times, and we are planning to run it even for days without a break (after the storage node rollout is finished).
Of course not, except that the benchmark you did was the biggest upload I've seen since I became an SNO; my bandwidth was reaching ~500 Mbit/s ingress… I have never seen this kind of ingress before.
My Windows node just paused for a whole 10 minutes. There was nearly no activity on the disk (~2-3%). No log lines were created. The 10 minutes was deduced by comparing the timestamp of the last log line with the current time. I killed my docker container and restarted the node, and after that it was just cancelled uploads from Saltlake. The upload success rate was 4%.