Distribution algorithm for node selection

There is already a list of eligible nodes, and the code builds a map of selected nodes, so I’d just generate random numbers between 0 and n and add the corresponding node to the map of selected nodes.
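Roughly what I have in mind, as a minimal sketch in Go (hypothetical names, not the actual satellite code): draw random indexes into the eligible list and let the map deduplicate the picks.

```go
package main

import (
	"fmt"
	"math/rand"
)

// selectRandomNodes is a hypothetical sketch of the idea above: draw random
// indexes into the eligible list and track picks in a map so each node can
// only be selected once.
func selectRandomNodes(eligible []string, count int) map[string]bool {
	selected := make(map[string]bool, count)
	for len(selected) < count && len(selected) < len(eligible) {
		candidate := eligible[rand.Intn(len(eligible))]
		selected[candidate] = true // re-picking an already selected node is a no-op
	}
	return selected
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c", "node-d", "node-e"}
	fmt.Println(selectRandomNodes(nodes, 3))
}
```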

This brings up another interesting point. I’ve read discussions about how subnet filtering means that if you have n nodes on one subnet, each node will only get 1/nth of the traffic it normally would. Is that really true? After looking at the selection process, I don’t think it is. I think each unique node increases your chance of being picked.

In fact, I don’t think the subnet tests are even necessary. If you have 12K nodes and are randomly picking 120, the odds are extremely slim you’ll pick 2 nodes on the same subnet. I guess someone could try to game it, and maybe they did in the network’s early days when there were fewer nodes, but it seems it would be difficult to game with 12K nodes.
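For what it’s worth, here is a rough back-of-the-envelope check of that claim, under the simplifying assumption that only a single pair of the 12K nodes shares a subnet and 120 nodes are drawn uniformly without replacement:

```go
package main

import "fmt"

func main() {
	// Simplifying assumption: exactly two of the 12,000 nodes share a subnet,
	// and 120 nodes are drawn uniformly at random without replacement.
	N, k := 12000.0, 120.0
	// Probability that both of those two nodes end up in the selection:
	// C(N-2, k-2) / C(N, k) = k*(k-1) / (N*(N-1))
	pBoth := (k / N) * ((k - 1) / (N - 1))
	fmt.Printf("P(both picked) ≈ %.6f\n", pBoth) // roughly 1 in 10,000
}
```

The odds grow quickly, of course, if an operator concentrates many nodes on one subnet, which is the case the filtering is meant to address.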

Edit: I just looked again and realized there are two distinct selection methods in that code: 1) select nodes with equal probability, and 2) select by subnet. I missed the select-by-subnet one.

So first it goes through all nodes and creates a map keyed by subnet, where each entry is the list of nodes in that subnet. It then turns that into a list of structures, each holding a subnet and its nodes, randomly picks a subnet, and then randomly picks a node within it.
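Something like this sketch, with hypothetical types (not the actual cache code), is how I read it; I’ve used rand.Perm over the subnet list so the same subnet cannot come up twice:

```go
package main

import (
	"fmt"
	"math/rand"
)

// Hypothetical stand-ins for the real node records.
type node struct {
	ID     string
	Subnet string // the /24 "last_net" value
}

type subnetGroup struct {
	Subnet string
	Nodes  []node
}

// selectBySubnet groups nodes by subnet, walks the subnets in a random
// permutation (so no subnet repeats), and picks one random node per subnet.
func selectBySubnet(nodes []node, count int) []node {
	bySubnet := make(map[string][]node)
	for _, n := range nodes {
		bySubnet[n.Subnet] = append(bySubnet[n.Subnet], n)
	}
	groups := make([]subnetGroup, 0, len(bySubnet))
	for subnet, members := range bySubnet {
		groups = append(groups, subnetGroup{Subnet: subnet, Nodes: members})
	}

	selected := make([]node, 0, count)
	for _, i := range rand.Perm(len(groups)) { // each subnet index appears exactly once
		if len(selected) == count {
			break
		}
		group := groups[i]
		selected = append(selected, group.Nodes[rand.Intn(len(group.Nodes))])
	}
	return selected
}

func main() {
	nodes := []node{
		{"a", "192.168.1.0"}, {"b", "192.168.1.0"},
		{"c", "10.0.0.0"}, {"d", "172.16.0.0"},
	}
	fmt.Println(selectBySubnet(nodes, 2))
}
```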

I think you might be mistaken about how mathrand.Perm works. It doesn’t permute an existing list; it returns a permuted slice of indexes. So if you give it 5, it will return something like [3 1 4 0 2].
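A quick way to see it (output will vary from run to run once seeded):

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	// rand.Perm(5) does not shuffle an existing slice; it allocates and
	// returns a new slice containing a random permutation of 0..4.
	fmt.Println(rand.Perm(5)) // e.g. [3 1 4 0 2]
}
```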

This would no longer prevent selecting the same node more than once per segment. Hence my slightly more complex solution.

That is not correct. The line I highlighted only selects a unique subnet; two lines below is where it selects a random node within that subnet. Since mathrand.Perm returned a permutation of unique indexes, that same subnet will not be selected again for that segment. Hence you get the same amount of traffic per subnet, no matter how many nodes you have. You can also look around for traffic comparison topics where this has been confirmed in practice many times.

This is btw also true for the old implementation. It instead did things in SQL by selecting a random node per subnet and then a random selection of subnets from the resulting list. Both the subquery and the main query were ordered by random(), so the result is a list of random subnets with a random node for each.
I see that there is now also a SELECT DISTINCT ON (last_net) query in the mix. This would still ensure selecting a subnet only once, but the result is indeed iterated through in case nodes need to be excluded afterwards. However, that function still looks like it checks for duplicate subnets afterwards. I’m not the biggest fan of that implementation, though, as it feels like just trying until you get enough nodes. The cache-based implementation is a lot more elegant. If you’re going to do it in SQL, you should have all the conditions in the SQL query so you don’t have to exclude nodes afterwards and run the query again.

Ah, right, it would need a check of the selected list. They’re already doing that with a list of excluded nodes and a map of excluded subnets.

That code was selecting some vetted nodes and some non-vetted nodes, then unioning them. It might have been hard to filter distinct subnets between the two queries, or that’s what it looked like. But I agree, the cache method is much better.

In this SQL code, it is selecting a distinct subnet but a specific nodeid. So even though it does an order by random() to randomize the results, the nodeid has already been chosen, right?

I’m actually getting a bit lost going through the code. But I think when it is turned into a query here: storj/satellite/satellitedb/nodeselection.go at 821a077f7c9dce5f99077b9ed28bdfe96b161301 · storj/storj · GitHub, it actually generates a subquery that is ordered by random(). The order of operations with a SELECT DISTINCT ON clause is not entirely clear to me, but I assume based on this implementation that the ORDER BY is applied before the DISTINCT. This should still result in unique subnets and random selection of nodes within each subnet.

I don’t understand what you are trying to tell me.
The count on the linksharing map for this file is 160, so that is the current live count of pieces.
The linkshare page states that

119 Storj nodes are storing a piece of this file.

So my understanding is that there must be nodes that store more than one piece? Is that not correct?

So if one node goes down, it could affect multiple segments but not multiple pieces of a single segment?

No, because a node never has more than one piece of the same segment. It could drop availability by 1 for all segments, but that’s not a problem because for each segment where availability drops below the threshold, repair will kick in to prevent issues.

Yeah, I hear that. It’s amazing how much more confusing SQL is when it’s pieced together instead of looking at the whole query. There’s a question of what the actual query is, plus when looking at the individual pieces it’s easier to get confused about the order of operations. And combining DISTINCT and ORDER BY is obviously confusing.

My statement about selecting lowest subnets first is likely wrong because ORDER BY happens before LIMIT.

If there is only 1 node in a subnet, which is the usual case, then both the db query and cache selection are okay (I think). If there is competition within a subnet, the SNO with the most nodes has an advantage with the cache method. With the db method it’s not clear to me either which node gets selected.
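To make that advantage concrete, here’s a toy calculation (my own numbers, not measured data): with subnet-first selection, the subnet’s traffic is split evenly among the nodes inside it, so an operator running more nodes in a shared /24 takes a larger share of that fixed amount.

```go
package main

import "fmt"

func main() {
	// Toy illustration: operator A runs m nodes and operator B runs 1 node
	// in the same /24. The subnet is picked first, then a node uniformly at
	// random within it, so A's expected share of that subnet's selections
	// is m / (m + 1).
	for _, m := range []int{1, 2, 4, 9} {
		shareA := float64(m) / float64(m+1)
		fmt.Printf("A: %d node(s), B: 1 node -> A gets %.0f%% of the subnet's traffic\n", m, shareA*100)
	}
}
```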

The other interesting thing I want to check is vetted vs unvetted aka “reputable nodes” vs “new nodes” in the code. This is all throughout the db query code. For the cache mechanism, NewState refreshes the cache. The Select function underneath it is also interesting.

You need to seed it to get different values on every run. By default you get the same sequence of random numbers. This aids testing. Some languages like Python automatically seed the random number generator by default, so you have to explicitly seed it with a fixed value for consistent testing. Go apparently defaults to Seed(1) rather than Seed(time).
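A small example of what that means in practice, assuming the pre-Go 1.20 math/rand behavior being discussed here (newer Go versions seed the global source randomly and deprecate Seed):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	rand.Seed(1) // the old implicit default: the same sequence on every run, handy for tests
	fmt.Println(rand.Intn(100), rand.Intn(100))

	rand.Seed(time.Now().UnixNano()) // seed from the clock to get a different sequence per run
	fmt.Println(rand.Intn(100), rand.Intn(100))
}
```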


If you are talking about different SNOs in the same subnet, then yes, this is technically true. But also highly unlikely to happen unless you are hosting your node in data centers that are used by other operators as well. This behavior is the same with the SQL query btw.

Well, with a normal DISTINCT you can only order by fields you select, I believe. So that kind of evades the problem. I’m not that familiar with the DISTINCT ON (field) feature being used here, though; I don’t think it’s ANSI SQL. But I looked it up now… it seems to be Postgres-specific and returns the first row for each “field” in the order defined by the ORDER BY. So unlike a normal DISTINCT, it seems to apply the ORDER BY first and then generate rows with distinct fields based on that order. So yeah, it works as intended, since that ORDER BY clause contains random().
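A minimal illustration of that behavior against a made-up nodes table (not the real satellite schema or query), in the style of the Go code that assembles these queries:

```go
package main

import "fmt"

func main() {
	// Postgres requires the ORDER BY to start with the DISTINCT ON expression;
	// the trailing random() then shuffles the rows within each last_net group,
	// so the single surviving row per subnet is a random node from that subnet.
	const query = `
SELECT DISTINCT ON (last_net) id, last_net
FROM nodes
ORDER BY last_net, random();`
	fmt.Println(query)
}
```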

I usually start my R scripts with set.seed(42). I don’t like “1”, its random numbers are too bright. “42” is pleasantly calm.


You always need ORDER BY if you use DISTINCT; some engines may take much longer to execute the query if you skip the ORDER BY, and some will throw an error. Think of DISTINCT as a GROUP BY and everything falls into place.
The difference here is that the ORDER BY is executed before the DISTINCT.


Thanks. What I mainly missed, as @BrightSilence mentioned, is that for the distinct subnet case, there is a subquery and random() gets used twice.

That’s misinformation. DISTINCT doesn’t require the results to be sorted. On the contrary, database engines can decide to use a DISTINCT implementation based on hashing. Hashing will be faster than sorting data.

What you might have thought of is the TOP(n)/LIMIT n clause, which might become non-deterministic without ordering.


As I said, it’s engine-dependent.
Please check the execution plan with and without ORDER BY.

Getting back to the original discussion: 4 days ago I uploaded some 70 GB files. The first one created 159 pieces. Today the share link says that file has 153 pieces. To a new user who doesn’t understand how things work, it might be a little disconcerting that 6 pieces have gone missing in 4 days.

It might be a good idea to add more information than just the piece count, for example, instead of:

153 Storj nodes are storing a piece of this file.

maybe:

153 Storj nodes are storing pieces of this file.
58 pieces are required to retrieve this file.
Pieces are cloned if fewer than 90 nodes store pieces.

Pieces will be recovered if fewer than 39 (the threshold is currently even higher: fewer than 50) nodes store pieces of one segment.
Pieces are unique; they will not be cloned.
The number of nodes can vary.

Thanks for the clarification. In the whitepaper on p71 Table 7.2 it looked like 45 was the magic number to trigger recovery for k=30, n=80, but I wasn’t sure since I think k is actually 29.

For a multi-segment file, it’s hard to explain the technical nuances briefly. If the pieces that drop off are all in the same segment (unlikely), recovery will happen sooner than if they are distributed across segments.

I do think that, since a customer can see pieces going away, it’s important to convey in the same place that there is a compensating mechanism. Otherwise a new customer not familiar with the details might think “I had 159 pieces, I’m losing 1.5 per day, so in 100 days my file would be almost gone”. I just checked again, and my file that had 153 pieces yesterday has 152 today. So to be more technically accurate, maybe:

Pieces are regenerated if fewer than 100 nodes store pieces.

or

Pieces are replaced if fewer than 100 nodes store pieces.

I don’t think it’s important for a customer to understand the concept of segments on this map, or that if 80 nodes are storing segment 0 and only 49 nodes are storing segment 1, repair would also generate new pieces even though more than 100 nodes are storing pieces.
