Geographical spread of the data

Hello,
I understand that when an SNO uploads data to the network, it is spread across multiple nodes which do not share the same /24 subnet.
However, is there a mechanism that ensures that the data is sent close to the locations where it is most likely to be downloaded again?
The closer you are to the download points, the quicker you retrieve the pieces, wouldn't you? Moreover, I would find it rather disturbing to send a download request to 20 (or more) very remote places: this favours large backbones over the efficiency of a decentralised network.
Basically I'm thinking of a system like Netflix, which places its servers at the ISP level and hosts only the content most likely to be watched by users in a specific location: this saves infrastructure costs (less need for large backbones and huge centralised data centers) and also relieves the internet of huge data flows (which ISPs like very much).
In an ideal world, I could imagine the data staying as much as possible within the ISP's perimeter and, in return, the ISP offering to host SNOs for free in its existing data centers or to run nodes on internet boxes (e.g. in France, ADSL/cable/fiber subscriptions usually come with a TV set-top box which could easily run a Storj node).
I hope these thoughts somehow make sense :slight_smile:
Thanks in advance for your answers!

The downloading client initiates more connections than necessary, so that the long tail of slower connections will just be dropped as soon as the faster connections finish collecting enough chunks to reconstruct data. In this way, Storj doesn’t need to know or predict the locations of downloads; by spreading chunks all over the place there’s high probability that enough chunks will be close anyway—and nodes that are slower/further away function mostly as a redundancy mechanism.
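To make that concrete, here is a minimal Go sketch of the race-and-cancel pattern. It is not the actual uplink code: `fetchPiece` is a made-up stand-in for a piece download, and the 39/29 figures match the download numbers given in the reply below.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// fetchPiece stands in for downloading one piece from a node; the real client
// performs a network request that is aborted when ctx is canceled.
func fetchPiece(ctx context.Context, node int) (int, error) {
	select {
	case <-time.After(time.Duration(rand.Intn(500)) * time.Millisecond):
		return node, nil
	case <-ctx.Done():
		return 0, ctx.Err()
	}
}

func main() {
	const requested = 39 // nodes asked for a piece
	const needed = 29    // pieces required to reconstruct the segment

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	results := make(chan int, requested)
	for n := 0; n < requested; n++ {
		go func(node int) {
			if piece, err := fetchPiece(ctx, node); err == nil {
				results <- piece
			}
		}(n)
	}

	// Take the first `needed` pieces that arrive, then cancel the long tail:
	// slow or distant nodes simply lose the race.
	pieces := make([]int, 0, needed)
	for len(pieces) < needed {
		pieces = append(pieces, <-results)
	}
	cancel()
	fmt.Printf("reconstructing from the %d fastest of %d nodes\n", len(pieces), requested)
}
```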


I think you mean the customer. An SNO is a Storage Node Operator, i.e. a service provider who owns the storage nodes.

This is implemented by over-allocating nodes, as @Toyoo said.
For uploads the customer selects 110 nodes and starts uploading in parallel; the upload stops when the first 80 finish, and all the others get canceled. The same goes for downloads - the customer requests pieces from 39 nodes and stops when the first 29 finish, with the rest canceled.
Thus, statistically, these will be the nodes closest to the customer's location, thanks to their lower latency and higher speed. So the real limit is only the customer's bandwidth.
This approach solves two issues - the long tail (slow nodes in your list) and the varying connection characteristics of nodes: obviously, nodes with fast connections close to the customer's location will get uploads and downloads more often than nodes with slow connections or located far from the customer.
To protect the customer's data from loss, we try not to place all pieces of a segment in one location:

  • the node selection is random - so the customer will get nodes from around the globe;
  • the nodes are filtered by the /24 subnet of their public IPs (only one node is selected from each /24 subnet per segment, so we do not store all pieces of one segment in one physical location or with a single ISP).
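As a rough sketch of the /24 rule in the last bullet (the `Node` type and helper functions are made up for illustration, not the satellite's real selection code), the idea is to shuffle the candidates and keep at most one node per /24 subnet:

```go
package main

import (
	"fmt"
	"math/rand"
	"net"
)

// Node is a simplified stand-in for a storage node record.
type Node struct {
	ID string
	IP string
}

// subnet24 returns the /24 network of an IPv4 address, e.g. "203.0.113.0/24".
func subnet24(ip string) string {
	parsed := net.ParseIP(ip).To4()
	if parsed == nil {
		return ip // not IPv4; treat the whole address as its own group
	}
	return parsed.Mask(net.CIDRMask(24, 32)).String() + "/24"
}

// selectOnePerSubnet shuffles the candidates and keeps at most one node
// from each /24 subnet, so pieces of a segment never share a subnet.
func selectOnePerSubnet(candidates []Node, want int) []Node {
	rand.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})
	seen := make(map[string]bool)
	selected := make([]Node, 0, want)
	for _, n := range candidates {
		key := subnet24(n.IP)
		if seen[key] {
			continue
		}
		seen[key] = true
		selected = append(selected, n)
		if len(selected) == want {
			break
		}
	}
	return selected
}

func main() {
	candidates := []Node{
		{"a", "203.0.113.10"}, {"b", "203.0.113.20"}, // same /24: only one can win
		{"c", "198.51.100.5"}, {"d", "192.0.2.77"},
	}
	for _, n := range selectOnePerSubnet(candidates, 3) {
		fmt.Println(n.ID, subnet24(n.IP))
	}
}
```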

This confirms what I am trying to figure out here:

So distribution would favor the uploader's location and place the pieces close to it.

However, for redundancy or for collaboration this might not be favorable. A media professional in the USA sharing terabytes of data with a production company in Australia might be better off if the data gets placed close to the location where it is downloaded.
For redundancy, an EU customer might be better off if his data also gets uploaded far away from the EU, in case of a blackout.
And third, since it also concerns the algorithm: a German user from the public sector, on the other hand, might be required to make sure that the data he uploads is only stored on nodes in the EU region.

So how does the algorithm cater for these situations?

@jammerdan From what I understand, the way pieces get uploaded to Tardigrade (randomly) covers both scenarios:

  • Pieces get scattered all over the globe, which is good for redundancy and resilience. The example of a blackout in Europe would be covered by the fact that many pieces are also outside Europe.
  • If a customer uploads data at place A for heavy downloads at place B (far away), well, this works too, as there are pieces everywhere…

So… it’s kind of great both ways. Right? :smile:

You make a very good point here. And it's the same for other countries too. I must say it sounds like a required feature for them :+1:


I already described the whole algorithm. As you can see, there is no geo-lock of any kind.
The current implementation will give you nodes from around the world. The fastest ones will get most of the pieces, though not all. You can even see a map for the file - it's distributed around the globe.
For example, I'm now in Asia and uploaded a file to Tardigrade, and some pieces landed in Australia too.

However, I thought they would be mostly in China, since it is so close to me… Maybe their firewall adds a lot of latency…

But that is what I am saying. If the algorithm only favors the fastest nodes, without considering additional requirements, then maybe the algorithm needs reshaping.
Here are additional ideas for what the algorithm should consider:

  1. Storage only in a specific region (like GDPR requires)
  2. Storage across n redundancy regions (example: if Europe goes down, have enough pieces in the Asia/Pacific region to be able to rebuild the file)
  3. Location of uploader <> location of downloader. For high-speed collaboration on files, an even geographical spread is important, so a media professional in Hollywood can upload fast while his production subcontractor in Australia can download fast as well.

That’s planned, but not implemented yet.
For the media use case we have an option to increase the distribution if such demand arises. This function is mentioned in the whitepaper, but it is not implemented yet either.

Our team is working on the implementation of Gateway-MT with multipart upload functionality at the moment (by the way, it could help you upload big files if your upstream bandwidth is low - it uses only one stream to upload data):

This will allow any S3-compatible software to use Tardigrade without the need to host a Tardigrade S3 Gateway yourself.
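For example, an existing S3 client library only needs to be pointed at the gateway endpoint; here is a sketch using the AWS SDK for Go, where the endpoint URL, bucket name and credentials are placeholders:

```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// All of these values are placeholders; substitute the real gateway
	// endpoint and the access/secret key pair issued for your project.
	sess, err := session.NewSession(&aws.Config{
		Endpoint:         aws.String("https://gateway.example.com"),
		Region:           aws.String("us-east-1"), // required by the SDK, not meaningful for the gateway
		Credentials:      credentials.NewStaticCredentials("ACCESS_KEY", "SECRET_KEY", ""),
		S3ForcePathStyle: aws.Bool(true),
	})
	if err != nil {
		log.Fatal(err)
	}

	// Upload a small object exactly as you would to any S3 endpoint.
	client := s3.New(sess)
	_, err = client.PutObject(&s3.PutObjectInput{
		Bucket: aws.String("my-bucket"),
		Key:    aws.String("hello.txt"),
		Body:   strings.NewReader("hello from an S3-compatible client"),
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("uploaded hello.txt")
}
```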

SNOs are free to travel with their nodes, so that might be tricky to enforce.


The satellite can just remove all pieces that you are not allowed to have at your current IP address. So no problem there.


This is what should be done.
However, it does not help when VPN tunnels report false locations to the satellites.

I have noted that one. Sounds interesting. But does it still require uploading 2.7x the original amount of data? That would still make it hard for the intended use with low upload bandwidth.

No, 1x. The 2.7x expansion will be done on the Gateway-MT side (the factor comes from the erasure coding described above: 80 pieces uploaded for every 29 needed, roughly 2.76x). The current drawback is server-side encryption with your access key: the data will not be available until you provide your access key, and then the metadata can be decrypted with it.
However, we are working on client-side encryption too.

That is really interesting news.

But this I still don't quite understand. Isn't there something like private and public keys in place, so data could be encrypted with the public key and decrypted with the private key only?

That’s not how the law usually works. The data processor usually must ensure that data never crosses borders in the first place. Acting after the fact is not enough.

The data is encrypted anyway, but the key is stored in encrypted form on the Gateway-MT side. To unlock it (decrypt the key) you need your AWS S3-like access key and secret key. When you access Gateway-MT with them, your Tardigrade keys get decrypted and are then used to decrypt your data, so you can access it.
So it's still safe, but not as much as client-side encryption. That's why we are working on implementing client-side encryption too.
You can read more there.
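Conceptually this resembles envelope encryption: the gateway persists the access grant sealed under a key derived from your secret key, so it can only open it while you present that secret. The sketch below only illustrates the general idea under assumed details (a SHA-256-derived key and AES-GCM); it is not Gateway-MT's actual scheme.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// kekFromSecret derives a key-encryption key from the S3-style secret key.
// (A real system would use a proper KDF with a salt; SHA-256 keeps the sketch short.)
func kekFromSecret(secretKey string) []byte {
	sum := sha256.Sum256([]byte(secretKey))
	return sum[:]
}

// seal encrypts plaintext with AES-GCM under kek, prepending the random nonce.
func seal(kek, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(kek)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal; it only succeeds with the same kek.
func open(kek, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(kek)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce, ciphertext := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	secretKey := "example-s3-secret"                  // what the customer presents to the gateway
	accessGrant := []byte("example-tardigrade-grant") // what the gateway needs to reach the data

	kek := kekFromSecret(secretKey)
	stored, err := seal(kek, accessGrant) // this sealed blob is all the gateway persists
	if err != nil {
		panic(err)
	}

	// Only a request carrying the right secret key can recover the grant.
	recovered, err := open(kek, stored)
	if err != nil {
		panic(err)
	}
	fmt.Printf("recovered grant: %s\n", recovered)
}
```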

Hello!

Is it possible to be compliant with the EU GDPR requirements while using Storj?

I am referring to this case:

Thanks!

In regard to geographically stored data - yes, see Geofencing and advanced placement-constraint support #4227.
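Conceptually, a placement constraint is just one more filter in the node-selection step; here is a minimal sketch assuming a hypothetical per-node country code (which would come from a GeoIP lookup on the satellite side), not the real design tracked in that issue:

```go
package main

import "fmt"

// Node is a simplified stand-in for a storage node record;
// Country would come from a GeoIP lookup on the satellite side.
type Node struct {
	ID      string
	Country string
}

// euCountries is a (truncated) allow-list for an "EU only" placement.
var euCountries = map[string]bool{"DE": true, "FR": true, "NL": true, "SE": true}

// filterByPlacement keeps only nodes allowed by the bucket's placement rule.
func filterByPlacement(candidates []Node, allowed map[string]bool) []Node {
	var out []Node
	for _, n := range candidates {
		if allowed[n.Country] {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	candidates := []Node{{"a", "DE"}, {"b", "US"}, {"c", "FR"}, {"d", "AU"}}
	for _, n := range filterByPlacement(candidates, euCountries) {
		fmt.Println(n.ID, n.Country) // only EU nodes remain eligible for this bucket
	}
}
```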
However, there is much more to the GDPR than that.
See
