Sustained 200 Mbps ingress

Mine sounds and looks (going by the dashboard graphs) like it is on fire. But I am happy to know that my 6 GB VM and firewall can sustain these speeds without any packet drops.

I had a quick look at the changes around node selection that came with the cache implementation. They seemed rather substantial. This result seems to suggest the subnet filtering may have been removed or circumvented by those changes.
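For anyone skimming, here is a minimal sketch of what the per-/24 filtering is expected to guarantee. The types and field names are made up for illustration, not taken from the actual satellite code: at most one node per last_net gets picked, so several nodes behind the same /24 share a single selection slot.

package main

import (
	"fmt"
	"math/rand"
)

// Node is a stand-in for the satellite's view of a storage node.
// The field names are illustrative, not the real schema.
type Node struct {
	ID      string
	LastNet string // the /24 network, e.g. "203.0.113.0"
}

// selectDistinctSubnets returns at most count nodes, never more than one
// per /24, which is what the DistinctIP behaviour is meant to enforce.
func selectDistinctSubnets(nodes []Node, count int) []Node {
	bySubnet := make(map[string][]Node)
	for _, n := range nodes {
		bySubnet[n.LastNet] = append(bySubnet[n.LastNet], n)
	}

	// one random candidate per subnet
	candidates := make([]Node, 0, len(bySubnet))
	for _, group := range bySubnet {
		candidates = append(candidates, group[rand.Intn(len(group))])
	}

	// then pick count subnets at random
	rand.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})
	if count > len(candidates) {
		count = len(candidates)
	}
	return candidates[:count]
}

func main() {
	nodes := []Node{
		{"a", "203.0.113.0"}, {"b", "203.0.113.0"}, {"c", "203.0.113.0"},
		{"d", "198.51.100.0"}, {"e", "192.0.2.0"},
	}
	// never returns two nodes from 203.0.113.0
	fmt.Println(selectDistinctSubnets(nodes, 2))
}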

That is interesting.

@570RJ how many nodes do you have in the same /24?

The same on my node

from europe-north-1.


Just to make sure: who is seeing more traffic and is running multiple nodes on the same IP? Who still has the same load and is running only one node? Any other combinations?

The node selection cache should not cause higher throughput. It looks like the node selection is currently not working as expected. We will disable the node selection cache and fix that issue.

DistinctIP       bool          `help:"require distinct IPs when choosing nodes for upload" releaseDefault:"true" devDefault:"false"`

devDefault is false? Is it running as dev?
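For readers not familiar with those struct tags, here is a rough sketch of the idea behind releaseDefault/devDefault. This is a simplification, not the real config loader: the same option resolves to a different default depending on whether the process runs as a release or a dev setup, so DistinctIP would only be off by default in dev.

package main

import "fmt"

// defaultDistinctIP mimics the idea behind the releaseDefault/devDefault
// tags shown above: the effective default depends on how the process is run.
// This is illustrative only, not the actual config loading code.
func defaultDistinctIP(isRelease bool) bool {
	if isRelease {
		return true // releaseDefault:"true" -> distinct IPs enforced
	}
	return false // devDefault:"false" -> relaxed for local dev setups
}

func main() {
	fmt.Println("release default:", defaultDistinctIP(true))  // true
	fmt.Println("dev default:    ", defaultDistinctIP(false)) // false
}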


More traffic and running only one node.


That should all be correct. The difference might be between randomly selecting and then grouping vs. grouping first and then selecting. That makes a difference in the number of chances you have to get selected.
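To make the "number of chances" point concrete, here is a back-of-the-envelope sketch with made-up numbers (10,000 nodes, 9,000 distinct /24s, 3 nodes in one subnet):

package main

import "fmt"

func main() {
	const (
		totalNodes   = 10000.0 // hypothetical network size
		totalSubnets = 9000.0  // hypothetical number of distinct /24s
		nodesInMine  = 3.0     // nodes one operator runs in a single /24
	)

	// "Group first, then select": each /24 is one candidate,
	// so a subnet's chance per selected slot is 1 / totalSubnets.
	groupFirst := 1.0 / totalSubnets

	// "Select randomly, then group/dedupe": every node is a candidate,
	// so a subnet with 3 nodes gets roughly 3 chances per slot.
	selectFirst := nodesInMine / totalNodes

	fmt.Printf("group-first  per-slot odds: %.5f\n", groupFirst)  // ~0.00011
	fmt.Printf("select-first per-slot odds: %.5f\n", selectFirst) // ~0.00030
}

With select-first, a subnet holding three nodes gets roughly three times the chances of a single-node subnet; with group-first, every subnet gets the same odds regardless of how many nodes it contains.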

I am running a single node; my traffic looks like this:

Total: [graph]

europe-north: [graph]

Other sats less than a megabit each.

Yes, it should definitely be group first and then select.
Worth looking into how SELECT DISTINCT ON is implemented in combination with a limit. I would have thought it groups first, then filters, but it could be that it just iterates over the results and skips duplicates on last_net until it reaches the limit.
Additionally, I think a reputable node and a new node could be selected on the same subnet. That wouldn't explain this increase though. << EDIT: Ignore this, it's caught elsewhere in the code.

EDIT: Other theory: the ORDER BY on last_net could be an issue if the ORDER BY is applied before the LIMIT, which I think it is. I'll stop now, it's indeed getting late. :slight_smile:
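To make the DISTINCT ON / ORDER BY / LIMIT concern concrete, here is the rough shape of the query being discussed, as illustrative strings only and not the satellite's actual SQL. In Postgres, DISTINCT ON (last_net) keeps the first row per last_net according to the ORDER BY, and the LIMIT is applied afterwards, so the ordering decides both which node within a subnet wins and whether the choice of subnets is still random.

package main

import "fmt"

// Because the result of DISTINCT ON is still ordered by last_net, this LIMIT
// would tend to return the lowest-sorting subnets rather than a random sample.
const biasedSelection = `
SELECT DISTINCT ON (last_net) id, last_net
FROM nodes
ORDER BY last_net, random()
LIMIT 10
`

// Wrapping the de-duplication in a subquery and randomizing the outer ordering
// keeps "one node per /24" while making the choice of subnets random.
const randomizedSelection = `
SELECT id, last_net FROM (
    SELECT DISTINCT ON (last_net) id, last_net
    FROM nodes
    ORDER BY last_net, random()
) AS one_per_subnet
ORDER BY random()
LIMIT 10
`

func main() {
	fmt.Println(biasedSelection)
	fmt.Println(randomizedSelection)
}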

btw, I run a single node and saw a slight increase, but nothing close to what others are reporting.

Good that we have a feature flag for this one. We will disable it, add some more code to collect additional monkit data about how often each node gets selected, and next release we can try again and gather more data.

Plan B: keep it enabled but reduce the number of uploads.

It is getting late. One of these solutions will win short term and then we can figure out how to fix it.

Thanks for the quick notice.


Not sure if related, but in this topic it was mentioned that 2 nodes on the same IP saw really different amounts of ingress.

It stopped

Same here, all nodes are at <1Mbps ingress. Down from 100Mbps total.

To clarify my numbers, I am running 6 nodes (4 in one location, 2 in another) and I'm seeing only a very minor increase in the amount of traffic in one location (~50% increase) and no noticeable difference in the other.

EDIT: None of my nodes are vetted on this satellite. If we assume I'm getting 5% odds at uploads compared to a vetted node, my increase in traffic would be 1000% if the nodes were vetted.
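Spelling out that estimate (taking the 5% figure above as an assumption rather than a confirmed number): a ~50% increase at 5% of a vetted node's upload odds scales to roughly a 1000% increase for a vetted node.

package main

import "fmt"

func main() {
	observedIncrease := 0.50 // ~50% more traffic, as reported above
	unvettedOdds := 0.05     // assumed share of upload chances vs a vetted node

	// If the node had full (vetted) odds, the same effect would scale up:
	vettedEquivalent := observedIncrease / unvettedOdds // = 10, i.e. ~1000%
	fmt.Printf("~%.0f%% increase for a vetted node\n", vettedEquivalent*100)
}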

I'm seeing nothing, my node is dead like a tomb…

Or the activity level is… still getting a download here and an upload there… been like that.

Kinda looks like it was at the stroke of midnight…

Why would you complain about 200 Mbps ingress… xD

@littleskunk probably stopped the tests

Yeah, I would assume so; reconfiguration is rarely a quick job on large stuff.
But hey, then I've got time to scrub my pool again… the first scrub looked kinda horrible… I might also consider trying to clean my backplane… had yet another drive throw a few read errors, which so far always seems to be related to corrosion.

I'm getting better at locating my drives physically in the server… it does require a bit of an eye for it… but at least I ordered the HBA connections so it sort of makes sense when looking at the backplane.

Funny to see how much my success rates tanked now that the network is almost idle… down to like 71-72%, but I suppose the less activity there is, the further away the storagenodes fighting for connections are geographically, maybe… still, I suppose I should be happy being in the top 1/3 on my old gear… some of the images in the "post pictures of your storagenode" thread are really impressive.

Also got to thinking that maybe I disregarded one of my 3 TB drives due to the backplane, meaning I could have had 5 drives instead of 4… though I wouldn't have had a bay to put it in anyway, not before I successfully clean it…

I wouldn't call it complaining, but it turns out it was actually useful feedback. :+1:

I do miss the traffic now, but I'm sure they're working on a more solid solution in the meantime. Besides, it gives all the SMR nodes some time for housekeeping. :wink:

I wouldn't mind a couple of months of 200 Mbit/s so I could fill up my node…
But now I'm also doing some housekeeping, getting a second scrub in because I didn't like the result of the first one… is 25,000 incorrect checksums bad? xD
The 2nd scrub is 63% done now, with only 5 this time, so hopefully that trend holds.