How to reduce network traffic

but why bother, when you can just adjust max concurrent in the config.yaml and get a smooth-running solution that you can even game on without seeing any ill effects… the node can even be using 98% of the bandwidth while you are gaming, and because it’s using only a few connections you barely even notice it’s there…

anyways, adjusting connections is my go-to solution for making the internet run smooth and adjust dynamically… ofc QoS and managed switches can do it better, but there should rarely be a need for that on a home-type network.
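for reference, the knob i mean is the max concurrent requests option in the node’s config.yaml… a minimal sketch, assuming a docker node with the config at /mnt/storagenode/config.yaml (the path and the value 14 are just examples from later in this thread, not recommendations):

```
# check whether the option is already set (path is an assumption, adjust to your setup)
grep -n "max-concurrent-requests" /mnt/storagenode/config.yaml

# then edit the file and set something like:
#   storage2.max-concurrent-requests: 14
# and restart the node so it picks up the change, e.g. on a docker setup:
docker restart storagenode
```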

try it out and i bet you won’t want to bother looking at other solutions.

Do both at the same time…

I don’t think the default network behavior is equal division of bandwidth per TCP connection. Doesn’t each connection’s bandwidth usage depend on the amount of data transferred during the connection’s lifetime?

If there are 4 parallel TCP connections and one is transferring a lot of data, then that connection may consume more than 1/4th of the available bandwidth. So, the parallel connection limitation won’t necessarily limit ethernet bandwidth usage…

Unless my understanding is faulty… which, of course, never happens :slight_smile:

EDIT: After looking at some multi-connection vs. single-connection documentation… it seems that multiple connections utilize the interface bandwidth more effectively. It’s kind of like curve fitting… A single TCP connection ramps up until it saturates the interface, then backs off, then adds bandwidth again until it hits the maximum once more. This sawtooth lowers the effective throughput of the interface.

However, with multiple simultaneous connections, each connection “fills-in” the interface bandwidth more effectively… like sand in a jar versus marbles.
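If you want to see the effect yourself, iperf3 can compare a single stream against several parallel ones. A rough sketch, assuming you have an iperf3 server to test against (the hostname is a placeholder, and the size of the gap depends heavily on latency and loss along the path):

```
# one TCP stream for 30 seconds, then four parallel streams to the same server
iperf3 -c iperf.example.net -t 30
iperf3 -c iperf.example.net -t 30 -P 4
# on a lossy or high-latency path the -P 4 run usually lands closer to line rate,
# which matches the "sand in a jar versus marbles" picture above
```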

So, in reality, probably the best way to limit bandwidth usage is to use per-interface bandwidth shaping and increase the number of simultaneous connections in order to utilize the smaller interface pipe more efficiently.
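For the shaping part, a hedged example on Linux would be a simple token bucket filter on the interface (the interface name and rate are placeholders; a real QoS setup would more likely use something like HTB or fq_codel):

```
# cap egress on eth0 to 200mbit with a token bucket filter
tc qdisc add dev eth0 root tbf rate 200mbit burst 256kb latency 400ms
# inspect it, and remove it again when done
tc -s qdisc show dev eth0
tc qdisc del dev eth0 root
```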

ofc.

well, i’m sure it’s more complex than i care to know, but basically i look at it like this… each open connection has a fraction of the bandwidth… doesn’t mean it has to use it, but if all four in your example try to max out, then they each get 25%.
but if the other 3 are just idle connections, the remaining one can use 100% of the bandwidth.
at least that’s my understanding of it.

it’s partly what makes the network run so brilliantly with few connections: there is always high bandwidth for newly opened connections… at least compared to what most applications require, and the latency is low because the applications aren’t fighting as much over the limited network resources.

in the past i’ve used it with torrents, which can basically kill most networks, or could in the past and i’m sure still can today. if you limit the connections to like 10-20 or even 40 it will run a bit slower at first, because it takes longer to talk to all the different peers online, but often you can reach the same upload and download bandwidth, it just takes a bit longer to get to that stage.
like if we compare 10 connections to 800: with 800 it will immediately talk to 800 different peers when going online, instead of starting with 10 and then searching around for the ones that can give more bandwidth.

the other very annoying thing about people running 800 connections is that when you do get a transfer from them, you might only get 1/800th of their bandwidth… which leads to people upping their own connection counts to get more bandwidth faster from more peers…

so really it becomes this detrimental fight over who can run the most connections, with basically no throughput per connection… and then if somebody else is trying to use the internet, they might also run into the 1/800th-of-the-bandwidth issue… ofc not always. this is also why most schools, businesses and such tried very hard to block that stuff on their local networks, until they figured out they just needed to limit how many connections each network user was allowed to open…

managing connections is a great, easy way to make a network run brilliantly… without needing fancy gear. and essentially everybody can still use the full bandwidth, if it’s free…

the overhead of many connections might also have an effect… not sure tho… there has to be some sort of added overhead by talking to so many different peers.


now if only i could get my damn windows server VM to run on my server without impacting my storagenode stats… my success rate drops when the windows VM is running :confused: even if it’s just… idle


This is likely if both VMs are utilizing the same HDD for storage.


shouldn’t be much of a problem, it’s 5 drives in raid, with a 400gb or so ssd for L2ARC/SLOG, and then the machine has 48gb of ram to keep the most recently accessed stuff in its ARC / ZIL.
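for context, the cache setup is just the usual zfs vdevs… roughly like the sketch below, where the pool name and device paths are made up and the partition sizes depend on your ssd:

```
# hypothetical layout: one ssd split into a small SLOG partition and a larger L2ARC partition
zpool add tank log   /dev/disk/by-id/ata-SOME_SSD-part1    # ~10-20gb is plenty for a SLOG
zpool add tank cache /dev/disk/by-id/ata-SOME_SSD-part2    # rest of the drive as L2ARC
zpool status tank                                          # verify the log and cache vdevs show up
```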

my latest theory was that it might have maxed out some cores at brief intervals, creating bottlenecks, so now i’ve limited the VM so it’s only allowed to utilize 50% cpu of each thread while being limited to 8 threads, and it’s running NUMA to make it even more streamlined.
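(on proxmox, which is just a guess at the hypervisor here, that kind of limit would look roughly like this… the vmid and numbers are examples:)

```
# give the VM 8 vcpus, cap it at the equivalent of 4 full cores (~50% per thread),
# and enable NUMA so the guest's memory stays local to the socket its vcpus run on
qm set 101 --cores 8 --cpulimit 4 --numa 1
```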

maybe it’s just a waste of time to try and make it better… but i would hate to have to turn it off and on when i need it…


The speed of the array is the limiting factor. I’m running on similar hardware… but I have very fast SAS drives. If you are running a SATA array, it’s likely that the drives are too slow for multiple applications to be running as well as an SN… Thus dropping your success rates.

well, the hdd’s are barely used, less than 3% utilization last i checked, even in spikes… and their backlog is like 10-20ms. my ssd i’ve finally gotten down to 10-20ms peak backlog, and 95+% of the time it’s near 1ms or so… but my OS is still on the ssd, on its own partition (or whatever it’s called in linux), so i think that is what creates the 10-20ms backlog spikes. i got it down from 100ms by setting my max concurrent to 14, which put me at 30ms backlog on the ssd, and then i moved the windows vm’s virtual drive from the ssd to the zpool, which got me down to the current 10-20ms spikes… which i believe come from netdata running on the OS. the OS itself should barely have any disk activity, since it loads into ram, but netdata keeps using the drive, taking up io, and because the ssd is shared between the L2ARC/SLOG and the OS partition, when netdata writes (which it seems to do often) it disrupts the ssd’s latency. i was pondering switching to a samsung 970 evo pro nvme, but from what i can gather it should be possible to get close to the same numbers with the current drive if i don’t overtax it…
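(the backlog numbers come from netdata, but you can sanity-check them with iostat from the sysstat package… a rough sketch, and the column names vary a bit between versions:)

```
# extended per-device stats every 2 seconds; the await columns are roughly the
# per-request latency in ms, and aqu-sz / avgqu-sz is the average queue depth
iostat -x 2
```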

so the plan is to get that moved over, so the ARC/ZIL and L2ARC/SLOG can figure out what has to go where, and keep netdata from disrupting the system running smoothly.

also put some extra load on the windows VM by running Folding@Home, just to see how smooth i can get it to run while working… looking quite good… pretty sure the last bit is a netdata issue… and yeah, ofc there might be some loss of efficiency when having a VM running… but i might as well fine-tune it the best i can, if it will give me some 5-10% better success rates… i got so close to 90%, but i’ll settle for 85% or so with the windows VM running. a drop of 10% is unacceptable tho…
and then maybe i should try turning the cache on my harddrives back on… also can’t hurt, now that i’m actually running a filesystem with CoW. i disabled them because i didn’t have a battery for my raid controller, and since i’m still using that controller i never actually turned the drives’ internal cache back on… xD not like they are important anyways, right lol
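(turning the drive cache back on would be something like the sketch below… the device name is a placeholder, and whether the setting sticks depends on the raid controller sitting in between:)

```
# query the current write-cache state, then enable the drive's volatile write cache
hdparm -W /dev/sdX
hdparm -W1 /dev/sdX
```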

I seem to remember reading around here that the algorithmic baseline is something like 75 to 80 % … so, if your node is higher than that, it’s much better than network average. If so, my guess is that when you start running the Windows VM, you lose the speed advantage and slide down to the algorithmic average… which isn’t bad at all.

In other words, I don’t think there’s a problem to fix.

80 out of 110 succeed for uploads. So 73%.
29 out of 39 succeed for downloads. So 74%.

Though for some reason most people see higher download success rates than that. Not sure why. I’m currently seeing upload success rates around 62%. But I doubt there is anything I can do about that from my end.


well, my machine is on fiber and uses a low number of connections, meaning each connection gets a minimum of 1/14 of 400mbit in both up and down… maybe a bit less if you’re outside europe.
and my server is streamlined to not have any delays… i just checked the recent averages: from when the network packet hits until the upload is sent out from the hdd’s, it can on average be done in less than 10ms. the worst possible latency is something like 21ms.

not much, but if the avg file size is 2MB and i often see upload spikes at 120mbit, then that is 15 MB/s, so 1.5MB per 100ms.
so if your harddrives are busy, they can easily give you a 100ms delay on top of the internet latency.

even my SSD, a Crucial BX300, actually makes it onto the leaderboard on storagereview.com (granted, in the sata consumer category), and if you look at the avg 4k latency it’s really the same as a 970 evo pro, until you overtax it and its latency goes up…
which in my case goes up to about 100-150ms depending on stress… my hdd array drives, when i was resilvering a drive yesterday, were up at 1.5sec of backlog… meaning if somebody else is running at something like my speeds, with a single hdd per node, they might have a 1-1.5sec backlog… meaning close to two seconds of latency from when the initial request is sent from the satellite until the data stream hits the client.

so i could finish a 20MB file before they might even be able to start… a file is split into 80 pieces and bloated by about 2.3… so let’s just say it’s effectively split into 20 pieces, because the size expansion takes the rest… so a 20MB upload from my node corresponds to a client-side file of 400MB or so, which i can finish transferring before most people’s hdd heads have had time to get through their backlog and actually start reading the data.

been testing lately; pretty sure i can bring it back up over 85% on uploads, and ofc my download is basically at the theoretical max of 99.5%, and the error rate of ipv4 is like 0.3%.
uploads look like this for the last… while… 4 days. very low right now tho, because i have had the windows vm running for the last 12 hours while it works on folding@home at high priority.
Successful: 425699
Success Rate: 77.866%
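(those numbers are just grep counts from the node log… roughly like the sketch below, assuming a docker deployment and that the usual log wording — "uploaded" / "upload canceled" / "upload failed" — hasn’t changed in your version:)

```
# count successful, canceled and failed uploads in the storagenode log
docker logs storagenode 2>&1 | grep -c "uploaded"
docker logs storagenode 2>&1 | grep -c "upload canceled"
docker logs storagenode 2>&1 | grep -c "upload failed"
```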

the graphs look good tho… i can barely see a hint that it’s been running… but the success rate still drops, it’s quite a sensitive thing…

@BrightSilence i think most of the success-rate cancels thing is latency; ofc it only works so long as there are people slower than yourself… like avoiding being eaten, you don’t have to run faster than everybody else, just faster than the slowest in the group… in this case faster than the average. but still, if most people have a 1sec or so backlog on their hdd… then that’s really the highest latency in the entire system.


Yes, but my SSD cache hovers between below 10ms and 40ms of backlog, with only very rare occasional jumps to just over 100ms.
My node is also seeing a lot of download right now; as a result I have a success rate of around 92% there. But that download hits the same SSD cache. I also have a 1000/500mbit connection, so there’s not much I can do on that front. I’m not worried about the numbers I’m seeing.

well, i would almost guarantee that it’s your ssd that’s the issue… my numbers were about the same with about the same ssd backlog, and when i reduce the backlog my success rate goes up.
i would love to say getting a samsung 970 evo pro 1tb nvme drive is the solution… and it is… but in many cases you might be able to get close with the ssd you already got; if it’s not overloaded, its backlog should go down into the range of a couple of ms.
tho that’s basically what the 970 is at while serving 15000 sql clients, lol, according to storagereview.

This is about what my stats are… on SAS drives.

The last time I looked at my failure stats, my 10% level was on the order of microseconds, not milliseconds. What that appears to mean is that my node is just very slightly faster than other nodes being offered the same data.

So, my guess is that it’s the SATA interface making your node so … err… bad ?

You don’t get paid for upload, only for download. It is true that a node needs to have data in order to be able to supply that data… However, if your node is filling up with data at a reasonable pace, there really is nothing to fix.

we weren’t talking about failed uploads, but cancelled ones… my failed uploads are at 0.005%.
tho for some reason my failed downloads are at 0.461%, which isn’t unreasonable, but it’s getting a bit on the high end of what i would expect from base ipv4 failure rates… perfectly possible that that’s just it, tho…

any spinning drive is in the millisecond (ms) range; the microsecond (μs) range is reserved for idle ssd or nvme, and even most non-nvme ssd’s have working latencies in the millisecond range. it doesn’t really matter if you are running sata or sas… the rpm of the hdd’s is a major factor in response time, and going to 10-15k rpm takes you from an optimal of around 4 down to 2 or so milliseconds… ofc add just a tiny bit of workload and that number goes up real fast.

even the latest optane and samsung 970 evo pro 1tb drives move out into or near the millisecond range under heavy loads.

but yeah, i do run SATA rather than SAS gear, tho from what i can tell, if the gear isn’t under load the latency isn’t any worse than on SAS… ofc SAS is better… in basically every regard… but what i really care about is my latency, and running SAS or SATA is basically irrelevant compared to what kind of storage technology i’m using, such as HDD RPM, SSD chip technology and controller.

and say you could squeeze it into the microsecond range… at some point you just end up with the disk latency being 1% and the network or other stuff being the 99%, and then it’s basically irrelevant how low you can go… law of diminishing returns.

a 6-week-old node filling at 250gb a day and still accelerating more or less every day… yeah, i’d say it’s doing quite okay.


so ToC is not about how much is stored, did you even read it?