Distribution algorithm for node selection

jammerdan · January 16, 2021, 6:07am

I had uploaded 2 bigger files: File1 and File2 (Map creation takes a while)

Both show that the distribution of pieces is mainly in the Europe and North America region and I am wondering if this is a desired outcome.
In terms of redundancy and fast worldwide access, I would have expected a broader distribution. Both files seem to have not a single pieces stored in Australia very few in the Asia/Pacific region anyway. Almost none in Africa and very few again in South America.
What is the reason for that? I assume it is that the fastest nodes get selected, which might naturally be closer to where the uploader is. I wonder if such an outcome is really a good one or if the pieces should be broader distributed over all geographical regions for the reasons above.

tylkomat · January 16, 2021, 9:46am

On: https://storjnet.info/ you can see that there a basically no nodes in Afrika, while most of the nodes are in Germany.

TheMightyGreek · January 16, 2021, 10:00am

Is it normal that I can’t see the background map ?
storj1

kevink · January 16, 2021, 11:29am

It’s a known bug iirc

jammerdan · January 16, 2021, 1:29pm

Yes I know that. But there are quite a few in the Asian-Pacific region.
I am exaggerating a bit, but imagine Germany or Europe goes blackout, then a good distribution is the key for data to remain accessible.
But it is not only for redundancy, I am playing media professional again, maybe distribution of some work to Asia or even Australia or collaborating with an Australian production company. For such a scenario a broad distribution over all regions would also be beneficial. At least I assume it would.

dragonhogan · January 16, 2021, 6:47pm

Unrelated question about the “share” links you posted that show the distribution “map”. There’s a download link on that page. I assume if I click that, then I’d download the file? Do you want people to be able to download that file? Also, if people do download, who pays for that?

jammerdan · January 16, 2021, 6:54pm

Maybe unrelated but a good catch. I wish there was a way to only display the map without download option.
I think downloading does no longer work as the quota is exceeded. Generally the account owner pays for the download. In this case it is a test account.

dragonhogan · January 16, 2021, 7:12pm

ah, okay, fair enough. was a little concerned #1 that you might have accidentally shared a personal file for anyone to download and #2 wanted to make sure that if someone did download, you wouldn’t be responsible for that bandwidth usage.

mike · January 16, 2021, 10:18pm

Just a note on this - I’ve experienced LONG responsetimes on almost any link. This is very annoying, perhaps a rework to improve page load time is in order?

jammerdan · January 17, 2021, 2:07am

Alexey · January 17, 2021, 8:17am

The fix is expected in the next release. The map is loading too long (third party)

jammerdan · January 18, 2021, 10:46am

Here is another one: Link
It shows again that distribution is mainly the regions Europe and North America. Asia/Pacific, South America and Africa is only very few.

BrightSilence · January 24, 2021, 2:43pm

jammerdan · March 23, 2021, 6:30am

@Dominick
This error is still present. It is very annoying.

jammerdan · October 3, 2021, 1:33pm

With the new linkshare page we get to see additional information about the storage distribution of a file.
If I take the Storj video as example there is following information:
https://link.dcs1.storjshare.io/s/jxqy2wdbf5qeaxo4a2eq6yw3gthq/videos/Storj%20Final.mp4?wrap=1

115 Storj nodes are storing a piece of this file.

And from the map that there are 160 pieces in total.

So math says we have some nodes that store more than one piece and I am wondering why is that. With more than 12k nodes why is there not a better distribution? Wouldn’t it be preferable to store not more than 1 piece per node if possible?

Of course there is no question that with large files it does easily happen that nodes store more than one piece, but with such a tiny file?
And a second question would be, if there is a limit of how many pieces of one single file a node is allowed to store?

stefanbenten · October 4, 2021, 8:52am

The count shown on the linksharing map is the current live count of pieces on the network. Meaning it will always be less or at max equal the total count that is possible with your math .
In theory you are right that a node could possibly get two pieces out of that file, but according to our math that was a fairly small chance, even with the current network size.

There is no limit of how many pieces a node of a given file can store and if fact it should not matter. We treat the objects on the segment level. As long as an individual subnet does not get more than 2 pieces of a given segment, the safety of the file should still be given.

hashbackup · October 4, 2021, 7:40pm

I’ve been curious myself about how node selection is performed. Today I uploaded 3 70GB files, so 2 segments and 80 pieces each. When I viewed the share link, these files were stored on 159, 157, and 158 nodes.

I went poking around and here’s some code I looked at. I don’t know Go, so take everything here with an extra large grain of salt.

Satellite db tables and indexes
Overlay upload selection
Upload node selection
Random upload node selection
Satellitedb nodeselsection

It’s not quite clear to me yet how these are related. There seem to be 2 different mechanisms:

maintain a cache of all eligible nodes; create a random index list; pick nodes out of the cache based on a random index. This is implemented in “random…” above and seems very unlikely to pick duplicate nodes.
a second mechanism does SQL select statements on the node table in the database, mainly in the last link above. It does up to 3 rounds of queries to create a list of new and vetted nodes. This may be an old mechanism that is no longer used, I dunno.

Is that 2nd mechanism used anymore? I can imagine some problems with it. It does this query:

SELECT last_net, id, address, last_ip_port, false FROM nodes ...

where last_net is the a.b.c.0 /24 address of the node. Or it may do a distinct(last_net). This will grab a batch of nodes but the nodes preferred is up to the database software. There is an index on last_net, so if the database scans that index, it will probably prefer nodes with a lower IP address. And if there are 2 nodes on the same /24, the node that was created first will probably be preferred, though that’s really up to how the database organizes the records.

Is that triple loop db query still used?

moby · October 4, 2021, 8:49pm

Hey @hashbackup,

Your investigation seems to have been fruitful - you have almost all the relevant files already. So I will focus on clarifying a few details:

Firstly, we have different queries for selecting nodes in different circumstances. Many of the functions in the satellite/satellitedb/nodeselection.go file actually have nothing to do with selecting nodes for normal uploads. EDIT: we may have used some of these functions at one point for normal uploads, but it doesn’t look like anything in nodeselection.go is used right now with the current cached node selection for uploads.

Here is the precise path that is followed for a new segment upload (links include the line numbers):

The uplink makes a request to the satellite to create a Segment. The endpoint is handled inside the satellite/metainfo package (side note - “metainfo” refers to the data about where a file is stored, like which pieces are on which storage nodes, as well as encrypted segment-encryption keys and the encrypted path)
Inside the BeginSegment function in the metainfo package, overlay.FindStorageNodesForUpload() is called: storj/metainfo.go at 4b79f5ea862d4dcf5998bf9ae63ec52cd151e26c · storj/storj · GitHub (“overlay” is the service we use to interact with the nodes table in the database; it includes a caching layer for functionality like node selection for segment uploads)
FindStorageNodesForUpload basically attempts to get the storage nodes from the upload cache: storj/service.go at 4b79f5ea862d4dcf5998bf9ae63ec52cd151e26c · storj/storj · GitHub
If the upload cache has been updated recently (within 3 minutes), the cache selects the nodes from in-memory: storj/uploadselection.go at 4b79f5ea862d4dcf5998bf9ae63ec52cd151e26c · storj/storj · GitHub - I am not as familiar with how the cache is implemented, but it looks like it essentially takes in the same configuration as the database for node selection (I will get to that in the next point)
If the upload cache has not been updated in the last 3 minutes, the cache is refreshed with the database - 1. storj/uploadselection.go at 4b79f5ea862d4dcf5998bf9ae63ec52cd151e26c · storj/storj · GitHub 2. storj/uploadselection.go at 4b79f5ea862d4dcf5998bf9ae63ec52cd151e26c · storj/storj · GitHub. The function that is called on the database is defined here: storj/overlaycache.go at 4b79f5ea862d4dcf5998bf9ae63ec52cd151e26c · storj/storj · GitHub, and the upload config that is used is defined here: storj/config.go at 4b79f5ea862d4dcf5998bf9ae63ec52cd151e26c · storj/storj · GitHub
After the cache is synced with the db, we would select the nodes from the cache the same way as if the cache did not need to be refreshed

Node selection is something that we have modified and reworked a lot over time, especially as we add new features (e.g. a separate node selection mechanism for repair, graceful exit). Furthermore, we have services like audit, which perform audits on a per-node basis, but select segments to audit without even looking at the nodes table or using the overlay service.

Anyway, I hope I was able to help clarify something, and I’m happy to answer any followup questions, as I may not have fully addressed your comment.

BrightSilence · October 4, 2021, 11:13pm

github.com

storj/storj/blob/38ce0e154cd665b43c6975d0543541d8b312f778/satellite/nodeselection/uploadselection/selector.go#L88


      
          // Count returns the number of maximum number of nodes that it can return.
          func (subnets SelectBySubnet) Count() int { return len(subnets) }
          
          // Select selects upto n nodes.
          func (subnets SelectBySubnet) Select(n int, excludedIDs []storj.NodeID, excludedNets map[string]struct{}) []*Node {
          	if n <= 0 {
          		return nil
          	}
          
          	selected := []*Node{}
          	for _, idx := range mathrand.Perm(len(subnets)) {
          		subnet := subnets[idx]
          		node := subnet.Nodes[mathrand.Intn(len(subnet.Nodes))]
          
          		if ContainsID(excludedIDs, node.ID) {
          			continue
          		}
          		if excludedNets != nil {
          			if _, excluded := excludedNets[node.LastNet]; excluded {
          				continue
          			}

This should ensure no duplicates at all. It generates a random permutation of IDs for all nodes. This ensures unique nodes per segment, but only per segment. A node could still get pieces for multiple segments in a file. That’s not a problem though as health is determined per segment as well as repair.

This is not really the most efficient way though. If I understand the code correctly Mathrand.Perm generates a permutation of IDs for all nodes in the node selection cache even though the function only needs n random nodes. Since this permutation is generated by looping through a all IDs and creating random replacements, its processing time scales with number of nodes in cache instead of number of nodes to select.

It would be better to create an array r with range length(subnets), then iterate through it by selecting a random element between 0 and n-i from that array and replacing it with element n-i-1. Then break once you have enough nodes. That way you don’t do more swaps than needed. Not really an issue with the current amount of nodes, but once you have millions, doing the full permutation for each node selection could add up.

hashbackup · October 4, 2021, 11:33pm

This is very helpful, thanks!

I saw this comment:

// FindStorageNodesForUpload searches the overlay network for nodes that meet the provided requirements for upload.
//
// When enabled it uses the cache to select nodes.
// When the node selection from the cache fails, it falls back to the old implementation.

So I guess before the cache implementation it used the db selects, and still falls back to this method. If @jammerdan 's file was uploaded with the direct db implementation, I think it would be easy to get 45 nodes storing both segments because of the way the query is constructed, right?

Conceptually it is looking through an ordered list of eligible nodes, picking some while ignoring previously-picked nodes & networks, then randomly ordering the sublist. (This process loops up to 3x; a loop may not produce enough nodes because of filtering).

It seems that this will tend to select the same nodes for each segment as long as they have enough space. So if this db query is used for picking new nodes to host pieces during segment repair, the pieces won’t be distributed randomly. Or so I’m guessing.