Bandwidth utilization comparison thread

@Krystof well your reply kinda gave me an idea… the numbers were deviating a good deal… so adding all the ingress into one total might be the way to go…
so i tried that and got

July 3rd

@dragonhogan = total ingress 17.75 GB
@Mark = total ingress 18.17 GB
@striker43 = incomplete dataset
@SGC = total ingress 17.40 GB

thats less than 10% deviation, and i had stability issues… mark and hogan are within like 2.5% of each other

July 4th
Mark = 57.97 GB
SGC = 55.32 GB (still with stability issues)
@kevink = 48.05 GB (seems slightly off, but has been changing/tinkering with the number of nodes)
striker43 = 57.75 GB (i will assume mark and striker are the two accurate numbers here)
dragonhogan = 57.03 GB
so accurate… (if we ofc remove the deviants xD, me and kevink)

July 5th
striker43 - total 105.81 GB
SGC - total 104.20 GB
Kevink - total 106.07 GB
Krystof - total 106.15 GB
dragonhogan - total 101.62 GB
the mighty geek - total 105.45 GB
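
here's a quick sketch of one way to put a number on the spread, if anyone wants to check my math… it just uses the July 5th figures posted above:

```python
# Spread of the reported July 5th ingress totals (GB), figures as posted above.
totals = {
    "striker43": 105.81,
    "SGC": 104.20,
    "Kevink": 106.07,
    "Krystof": 106.15,
    "dragonhogan": 101.62,
    "the mighty geek": 105.45,
}

mean = sum(totals.values()) / len(totals)
spread = max(totals.values()) - min(totals.values())
print(f"mean: {mean:.2f} GB, spread: {spread / mean:.1%}")
# -> mean: 104.88 GB, spread: 4.3% (under 2% if the lowest node is excluded)
```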

yeah i think it looks pretty damn near perfectly spot on…
sorry about my numbers being slightly inaccurate… i also f’ed them up for today…
was testing a node emergency shutdown script… ended up giving myself 1hr and 40min of downtime… tsk tsk…

but it will be fun to see how that looks tomorrow… now that i know almost exactly how long i was down for, i might be able to guess how far off i’ll be from the daily total…

i think we can pretty safely conclude that multiple nodes vs single nodes, and success rates, have basically zero impact… at least on the numbers that the web dashboard gives us…

ofc we can keep monitoring until we grow bored… i’ll keep posting my daily screenshots for a while at least… maybe we will learn something unexpected even tho i kinda doubt it…

however i think we can clearly say that the satellites don’t seem to distribute the data randomly, they must have some sort of structure to their decision making, else the numbers would never get so damn close…


Didn’t realize people were posting updates on a daily basis. For participation and additional data points, here are my screenshots for the past couple of days.

Node 1:



Node 2:



@dragonhogan
added them to my post of totals… you had downtime on the 5th?
your number seems slightly lower than what i’d expect… i guess we may need to also monitor internet connections…

may also just be random deviations… dunno yet… we don’t have a ton of data points yet… so
nothing too weird about it… timezones and such things might also show up as deviations in the numbers… or weekly / monthly totals may give a more accurate overall number…
because they would minimize stuff like timezones…

will be interesting to see… thanks for the update.


Nope, no downtime on 05Jul on either node. I have a 200 up / 20 down connection. Although, who knows why mine saw lower numbers…let’s blame it on my wife streaming Netflix all day, although I’m sure that’s not it…


oh that’s interesting tho… did you know it takes about 10% in overhead… so if you were to download at 200mbit you would need about 20mbit of sustained upload just for overhead…
i think that’s about right anyway… i’m sure somebody can correct me on that… but it’s better to be roughly right than precisely wrong… :smiley:

anyways…

so if we call your egress from the 5th 35 GB… ergo 35,000 MB / 2 MB per sec,
so 17,500 sec to upload that, and 3,600 sec in an hour, so about 5 hours of full upload capacity…

then we need ~10% overhead upload for your ingress… which was 101 GB, so that’s roughly another hour and a half… call it a bit over 6 hours out of 24… so basically your upload is around 25% utilized on avg over the 5th…
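
here's a minimal sketch of that back-of-the-envelope math, for anyone who wants to plug in their own numbers… the 2 MB/s effective upload speed and the ~10% overhead factor are just the assumptions from above, not measured values:

```python
# Rough upload-utilization estimate; all inputs are assumptions from the thread.
UPLOAD_MB_PER_SEC = 2.0   # ~20 Mbit/s uplink, taken conservatively
OVERHEAD_FACTOR = 0.10    # assumed upstream overhead on ingress (downloads)

egress_gb = 35.0          # data served to customers on the 5th (rough guess)
ingress_gb = 101.0        # data received on the 5th

egress_seconds = egress_gb * 1000 / UPLOAD_MB_PER_SEC
overhead_seconds = ingress_gb * 1000 * OVERHEAD_FACTOR / UPLOAD_MB_PER_SEC

busy_hours = (egress_seconds + overhead_seconds) / 3600
print(f"busy: {busy_hours:.1f} h, utilization: {busy_hours / 24:.0%}")
# -> busy: 6.3 h, utilization: 26%
```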

you may… and i have to stress the may, we will need to keep an eye on it…
be seeing the limits of your upload capacity affecting / choking your ingress…

should become clearer if the trend continues and we see increases in egress and ingress


Hi, no problem.
Location is Boulogne-sur-Mer in the north of France.
10 TB shared on a RAID 5 NAS.

here’s a screenshot with full-day stats:



While I think you’re all enjoying this, you won’t see any differentiation on a sufficiently fast connection. Since node selection is random, everyone will see similar numbers unless their node was unhealthy for a time or the connection is slow enough to have an impact. Furthermore, this doesn’t really show that the nodes actually got the same amount of data. If a piece is 90% transferred before being cancelled, that 90% bandwidth still counts. Since most are cancelled near the end of the transfer, the differentiation between actual cancellations and badly logged ones will be hard to detect in these numbers.

For what it’s worth, the logging issue will be fixed soon. So success rates will become useful again.
https://review.dev.storj.io/c/storj/storj/+/2234

Ps. I don’t mean to discourage you from sharing these numbers, merely providing context of why you’re seeing them.


yeah i’m at about the same conclusion… but maybe we might learn something about what connection speeds and upload/download amounts/ratios it takes before lower connection bandwidths become a limitation…

but yeah the data is so consistent it’s almost clockwork, and that might enable us to do other stuff with it…

like say you want to know if your connection is fast enough… you just check whether you reached the highest peak over a long enough period, or at least see how much of a deviation connections might cause…

it would also sort of be a measure of upload success rates… if everybody gets the same ingress and egress, then we can establish with near certainty that they are irrelevant, and it might be better simply to log them all as successful… (fixed soon… that would be nice… judging from this data there is basically no difference in the bandwidth used by nodes, and if everyone’s bandwidth is the same, then the upload success rate… accurate or not… isn’t really useful for anything, most likely because it’s within a few % for everybody…)

it’s a common question that creates constant traffic on the forum…
good science is not about what you know you can learn… it’s about discovering the unexpected… will we discover something interesting in this… i dunno…

but i’m willing to put an hour or so into it, over the next month or so and see what comes out of it…

maybe we will see some trends that give us hints… we might be able to see global geographical bandwidth limitations… there are always many factors in complex systems.

but yeah, most science is just boring and unrewarding xD
and yet it usually teaches us a little bit… if nothing else

Trust me, I’ve done a lot of the boring scientific work. :slight_smile: But I think it’s important to know the context. You’re not testing an unknown black box here. There is a ton of open source code that can answer your questions and without that context you may draw wrong conclusions. For example, your data at some point might suggest slow nodes get less data. But if you then conclude that slower nodes are selected less often, you would absolutely be wrong.

Node selection is random, the effect you’re measuring happens after node selection. It’s also important to realize a transfer is not all or nothing when accounting for bandwidth impact. But it is when it comes to the data ending up on your node. So someone with 99% of the bandwidth of someone else could still end up with 80% of the amount of data on their node. This context is important to not arrive at the wrong conclusions. And this is also why I think the success rates of transfers will eventually give a better indication again after the fix is implemented.
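
Here’s a toy sketch of that effect (not actual Storj accounting code, just an illustration; the piece size, cancellation point, and success rates below are made up):

```python
import random

# Toy model: cancelled uploads still count most of their bandwidth,
# but only completed pieces add data to the node. All numbers are made up.
PIECE_MB = 2.2        # assumed average piece size
CANCEL_AT = 0.9       # assume cancellations happen ~90% into the transfer
SELECTIONS = 10_000   # times the node was selected

def totals(success_rate: float, seed: int = 1):
    rng = random.Random(seed)
    bandwidth = stored = 0.0
    for _ in range(SELECTIONS):
        if rng.random() < success_rate:
            bandwidth += PIECE_MB              # full transfer counted
            stored += PIECE_MB                 # piece actually lands on disk
        else:
            bandwidth += PIECE_MB * CANCEL_AT  # partial transfer still counted
    return bandwidth, stored

bw_fast, data_fast = totals(success_rate=0.95)
bw_slow, data_slow = totals(success_rate=0.75)
print(f"bandwidth ratio:   {bw_slow / bw_fast:.2f}")      # ~0.98
print(f"stored data ratio: {data_slow / data_fast:.2f}")  # ~0.79
```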

I’d still be interested to see at what point we’ll see the connection speed impact. But that obviously also depends on activity on the network. I’ve previously estimated that anything over 50mbit basically doesn’t matter. But it could also be 100mbit. So let’s verify that part, as that is not a known conclusion we can derive from the code alone. And I used very limited anecdotal data and my gut to come to that number, so I could be horribly wrong.

yeah guesstimates can be difficult, and sure the data won’t give us a 100% clear image… but it will tell us something, which may be useful for future investigations into storagenode behavior… or it will just be yet another data point for those interested in trying to make sense of it all…

and yeah, code can sometimes be the simple way to estimate something, but it’s rare that the code alone defines something… hell, we might even see the effects of SMR drives’ limited write speeds… if we get some insane ingress at some point while comparing bandwidth…

we would also need to know how the numbers in the dashboard are defined… i mean if there were cancelled uploads… is their bandwidth also added, or is the bandwidth usage defined by what actually hit the disk… as always there will be hundreds of related questions that one would also need to answer to really make good use of the data… but without the data, we are just a little more blind…

so it will be interesting at least… always a gamble whether we learn anything or not… and even if we don’t, somebody else might find it useful or reassuring that most nodes seem to get the same data ingress…

i was very focused on my upload success rates until i realized just how futile it was lol… but i like to understand, rather than blindly follow, which i’m sure you are well aware of by now :smiley:
@BrightSilence
slightly unrelated :smiley:
also been thinking quite a bit about this bandwidth deviation i was seeing after extended downtime…
this kinda shows something similar… maybe that’s the issue… the traffic comes in spikes, and thus the avg over long periods will be different after a node shutdown…

so it could be a type of rounding/averaging deviation in the proxmox bandwidth avg…

Just migrated from a 1 TB HDD to 8 TB on the 5th of July.
The node is on a 30/10 Mbit connection, and even with these pretty low speeds I never saturate the connection when streaming Netflix for a couple of hours.


i will assume you mean 30mbit/10mbit

i added you to the list i made… you are smack in the middle of the top, less than 1% deviation…
maybe we should also try an egress-to-ingress ratio… might also be interesting.

You’ve got that one flipped. Code defines the exact rules for the environment. You can give definitive answers by looking at the code. Those aren’t estimates, they are exact. If code says nodes are selected as a random set of healthy nodes, that means over time each healthy node gets selected the same number of times.
Measurements never get you more than an estimate.
Now of course there are second order effects of the code and the environment a node is running on. But if conclusions based on measurements of those secondary effects contradict the code, you’ll have to find your explanation elsewhere. The code doesn’t lie.

Egress isn’t related to ingress. A full node would have no ingress, but if it stores a lot of data there would be a lot of egress. It’d be more useful to try and relate egress traffic to the amount of data stored instead. You’ll see more deviations here because the types of data on each node differ based on when they first came online and when they filled up.

so the real world is a secondary effect on the code… i think that’s just slightly offset… code is only stable / viable so long as the rest of reality allows it… and it doesn’t take much to disrupt it… sure, code can be very fixed… if it’s simple code… but it only takes a few variables and the possible paths quickly scale towards infinite… and that’s really the problem with reality… it’s infinite, and thus there will always be paths that cannot be considered… even tho we can get very close to 100%
then a stray cosmic ray hitting the hardware in the wrong way could bring down some systems… and corrupt the code…

sure in a perfect world, one can rely on perfection… but in the real world, everything is basically chaos… even if chaos is merely a system so complex that we have yet to define it… i have no doubt that the universe is mathematical in nature and that one day we might be infinitely close to understanding it completely, but for now it still has us puzzled…

you may trust code… but reality has a tendency to put anything one trusts on its ass… xD

exactly, i wrote ingress by mistake, it should have been the above… :smiley:

Corrected one letter there, but yes. The effect you see is all caused by code and inputs to that code. Computers are by definition deterministic machines. Given the code I can tell you that the number of times your subnet will be selected in a day is roughly the following.

(Number of segments uploaded by customers * number of RS pieces per segment) / number of subnets with at least one healthy node with free space

That is, given that your node is healthy and has space available. If that’s not the case, it won’t get selected at all.

Next step, how much bandwidth does that use? Sum the amounts of data transferred for all those times it was selected to know the total ingress bandwidth. Or sum the piece size of all the piece transfers that finished to know how much data ended up on your node.

Relevant inputs to calculate how much data ends up on your node: number of segments uploaded, size of those segments, number of healthy subnets, and rate of success on your node.
If you have those inputs you can calculate how much data ends up on your node without having to collect data from a lot of other sources.

Yes, code paths can be complex, yes there can be many scenarios. But the question you’re trying to answer isn’t that complex and has clearly defined steps in code that should not be ignored.

Example
number of segments uploaded: 100000
RS ratio selected/min: 110/29
avg size of those segments: 64MB
number of healthy subnets: 3000
and rate of success on your node: 50%

((100000*(110/29)*64MB)/3000)*50%=4046MB give or take
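
The same calculation as a small sketch, using only the illustrative inputs above (the variable names are mine):

```python
# All inputs are the illustrative example numbers above, not live network stats.
segments_uploaded = 100_000    # segments uploaded by customers
rs_selected, rs_min = 110, 29  # RS pieces selected per segment / minimum needed
avg_segment_mb = 64
healthy_subnets = 3_000
success_rate = 0.50            # share of selected transfers your node completes

# How often a given subnet gets selected:
selections = segments_uploaded * rs_selected / healthy_subnets
# Each piece is the segment size divided by the minimum pieces needed to rebuild it:
piece_mb = avg_segment_mb / rs_min

expected_ingress_mb = selections * piece_mb * success_rate
print(f"selections: {selections:.0f}")                    # ~3667
print(f"expected ingress: {expected_ingress_mb:.0f} MB")  # ~4046 MB, give or take
```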

So what part of those inputs depends on your connection or setup? Only the success rate. So that’s what you want to measure to see how well you are doing.

Edit: Bonus points to anyone who points out the two things I thought of, but left out to simplify this calculation. And extra bonus points to someone who points out something the code allows for but I didn’t anticipate!

and then there’s stuff like disk read/write speed, disk health, placement of data on the platters, latency, power outages, network issues, various bandwidth limitations… the code might set some fixed limits, but the real world can have just as much impact.

we are just trying to collect some real world numbers… to see if something interesting pops up…

Will only result in either a higher or lower success rate.

Will not get your node selected at all.

I’m just trying to isolate what is related to what, so people don’t start suggesting that satellites select nodes that were offline more often after they return. Or similar conclusions that are just not possible.

Here’s yesterday’s:

Node 1:

Node 2:

Here is my total ingress for the 6th:

Normal: 102.87 GB
Repair: 9.96 GB
Total: 112.83 GB