Yeah I’m happy to see that latency is important. Even with a US Satellite handing out global nodes… the ones in the EU win races against nodes across the Atlantic or in Asia (for EU customers).
Data needs to be “spread out” across node owners: not necessarily “spread out evenly over the world”. If the erasure-encoding is providing the durability… then customers should be given performance.
Then your expectations are wrong. Data needs to be spread across many nodes (preferably not belonging to the same SNO), as close to the customer as possible, which improves performance from the client’s perspective.
Correct, the fastest nodes relative to the customer are preferred.
How does the satellite know the latency between a node and a customer, to make that decision? Hint: it does not. And note, the satellite in this topic is US1 to begin with.
Storj stores data redundantly. These requirements are not contradictory. You can have most pieces close by, and some pieces far away, for durability. This is also part of the value proposition: CDN-like performance. This pretty much requires pieces to be distributed globally.
Just think about it: if what you say were true, nodes in the US would not get any traffic from the EU and vice versa. That's obviously not the case, just look at your node dashboard.
So no, having all pieces in the EU and none in the US would be a bug, either in the reporting or something worse.
It doesn’t, that’s the whole point. The satellite provides a list of nodes to the client, the client says “I’ve uploaded to nodes 1, 2 and 3, the rest didn’t get any files”. Nodes 1, 2 and 3 are the fastest nodes (since they successfully received, stored, and acknowledged the data).
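To picture that race, here's a minimal sketch of the idea in Go (not the real uplink code; the node names, timings and the uploadPiece helper are all made up):

```go
// Sketch of the long-tail race: the client starts uploads to every node the
// satellite offered, keeps the first `needed` successes, and cancels the rest.
// uploadPiece is a made-up stand-in for the real piece upload.
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

func uploadPiece(ctx context.Context, node string) error {
	select {
	case <-time.After(time.Duration(rand.Intn(500)) * time.Millisecond): // simulated latency
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	offered := []string{"node-1", "node-2", "node-3", "node-4", "node-5"}
	needed := 3

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	results := make(chan string, len(offered))
	for _, n := range offered {
		go func(n string) {
			if uploadPiece(ctx, n) == nil {
				results <- n
			}
		}(n)
	}

	var winners []string
	for len(winners) < needed {
		winners = append(winners, <-results)
	}
	cancel() // long-tail cancellation: the slow nodes never finish

	fmt.Println("pieces ended up on:", winners) // this is what gets reported back
}
```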
I didn’t say it was contradictory, I said that the fastest nodes relative to the customer are chosen, as they should. If something fails, and data needs to be repaired, then other nodes will be chosen. If I have a fast enough node, then I’ll get the data, even if I’m further away (geographically) from the client. The client in this case would be the satellite, since that’s who is responsible for repair.
In normal client > SNO traffic, I could be halfway across the world and still get all the traffic from a customer that is 3 houses down from you, since my nodes have lower latency because your nodes are on a DSL line, passed through 7 hops internally to your ISP, and your ISP doesn’t have any peering agreement with the customer’s ISP, which means another 3 hops outside your ISP’s network to get to the customer’s ISP, and 6 hops from there to the customer. My dedicated internet line (plumbed directly into my ISP’s core routers) has 9 hops to the customer, so I’m faster.
Let me correct you right there: everything I say is true.
Let's clear this up: the fastest nodes are not necessarily the geographically closest ones. See the previous paragraph.
A redundant file does not need to be spread all across the world to survive. If enough pieces are available for it to be rebuilt, it can even be distributed within a single country, and that will not affect the durability in any way.
What nodes make the list? How many nodes are in the list? Are only nodes from a single location in the list?
I'll repeat again: how will the satellite know which nodes are the fastest from the customer's perspective? Only the customer knows that. The satellite offers a global selection of nodes; the customer will end up uploading mostly to the fastest ones, but others will also receive some traffic. Not zero. Otherwise (see any previous comment) your US node wouldn't have any data from the EU.
This is demonstrably false. An entire country can experience a blackout, especially a small one. So can an entire region.
Per whitepaper:
Appendix C.1 Security, page 79
In a typical scenario (with a 20/40 Reed-Solomon setup), each file is distributed across 40 different disk drives in a global network of over 100,000 independently operated nodes.
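To put numbers on that 20/40 setup: any 20 of the 40 pieces rebuild the segment, so up to 20 pieces can vanish (an entire continent's worth, if it comes to that) without losing data. A quick sketch using a generic Reed-Solomon library, not Storj's actual erasure code path:

```go
// Demonstrates the k-of-n property of a 20/40 Reed-Solomon setup:
// 20 data shards + 20 parity shards, any 20 shards reconstruct the data.
package main

import (
	"fmt"

	"github.com/klauspost/reedsolomon"
)

func main() {
	enc, err := reedsolomon.New(20, 20) // 20 data, 20 parity
	if err != nil {
		panic(err)
	}

	data := make([]byte, 20*1024) // dummy 20 KiB "segment"
	shards, err := enc.Split(data)
	if err != nil {
		panic(err)
	}
	if err := enc.Encode(shards); err != nil {
		panic(err)
	}

	// Lose half the shards, e.g. every node on one continent goes away.
	for i := 0; i < 20; i++ {
		shards[i] = nil
	}

	if err := enc.Reconstruct(shards); err != nil {
		panic(err)
	}
	ok, err := enc.Verify(shards)
	fmt.Println("recovered:", ok, err) // recovered: true <nil>
}
```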
Now, there is a caveat: Storj allows you to configure geo-locality per bucket:
Section 3.5 Metadata, page 24:
We allow users to choose storage based on geographic location, performance characteristics, available space, and other features.
So if the user configured the bucket to limit nodes to the EU, then even the US1 satellite will only offer nodes in the EU. Otherwise, nodes will be offered from the global pool.
And in the latter case, you will see data globally distributed.
Because, I’ll repeat again, Storj won’t be able to claim “cdn-like performance” otherwise.
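For what it's worth, my understanding is that this per-bucket geofencing is set at bucket-creation time through the S3-compatible gateway via the usual LocationConstraint field. Something like the sketch below; the endpoint, the accepted region codes and whether your account has it enabled are assumptions on my part, so check the docs first:

```go
// Sketch: creating a geofenced bucket through Storj's S3-compatible gateway
// by passing a LocationConstraint. The region code and the availability of
// this feature for your account are assumptions; verify against the docs.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{
		Endpoint:         aws.String("https://gateway.storjshare.io"),
		Region:           aws.String("us-1"), // placeholder; not an AWS region
		Credentials:      credentials.NewStaticCredentials("ACCESS_KEY", "SECRET_KEY", ""),
		S3ForcePathStyle: aws.Bool(true),
	}))

	_, err := s3.New(sess).CreateBucket(&s3.CreateBucketInput{
		Bucket: aws.String("my-eu-only-bucket"),
		CreateBucketConfiguration: &s3.CreateBucketConfiguration{
			LocationConstraint: aws.String("EU"), // restrict pieces to EU nodes (assumed region code)
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("bucket created with EU geofencing")
}
```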
The satellite picks a random set of nodes from different /24 subnets to send to the client (I've seen the number 100 mentioned somewhere).
And I’ll repeat again: if you give me a list of 10 places to store data, and I give you back a list of 3 of those places that I’ve stored data, now you know which places I prefer to store my data at.
There are some threads that explain the “new node selection”. Of course some nodes will still get data, as long as they are on the satellite’s “favorite” list (see first sentence in this reply, we are in the “random /24” part). As soon as a customer replies that “hey, I haven’t used that particular node” (the reply from the customer > satellite, that I’ve mentioned twice already), that node gets removed from the “favorite list” (if you look, you’ll find we are in the “the node will be chosen again after some time” part of the node selection, now).
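The /24 part is simple enough to sketch: group the candidate nodes by /24 subnet, keep at most one node per subnet, then hand the client a random subset. The 100 is only a number I've seen quoted, and the function below is an illustration, not the satellite's real selection code:

```go
// Sketch of the "one node per /24" selection: group candidates by /24 subnet,
// pick one per subnet, then offer a random subset to the client.
package main

import (
	"fmt"
	"math/rand"
	"net"
)

func selectNodes(candidates []string, want int) []string {
	perSubnet := map[string]string{}
	for _, addr := range candidates {
		ip := net.ParseIP(addr)
		if ip == nil || ip.To4() == nil {
			continue
		}
		subnet := ip.Mask(net.CIDRMask(24, 32)).String() // /24 key
		if _, seen := perSubnet[subnet]; !seen {
			perSubnet[subnet] = addr // at most one node per /24
		}
	}

	picked := make([]string, 0, len(perSubnet))
	for _, addr := range perSubnet {
		picked = append(picked, addr)
	}
	rand.Shuffle(len(picked), func(i, j int) { picked[i], picked[j] = picked[j], picked[i] })
	if len(picked) > want {
		picked = picked[:want]
	}
	return picked
}

func main() {
	nodes := []string{"203.0.113.10", "203.0.113.77", "198.51.100.5", "192.0.2.9"}
	fmt.Println(selectNodes(nodes, 100)) // 100 is hearsay from the forum, not a confirmed constant
}
```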
Not really, no. The key phrase is:
In simpler English: that doesn’t mean that every time a customer wants to upload a file, he gets sent the entire 25,000 node list. The client could use any of the 25,000 active nodes, which are not in a single country, hence “global network”.
No arguments there: if a customer sets up EU geo-restriction, then yes, only nodes in the EU get sent to the client. Did you take me to be saying anything different in relation to that? Because I haven't mentioned geo-restriction in any of my previous replies, so I'm curious where you saw that I meant something different.
Debatable. If I set up EU geofencing, and all my users are in the EU, it will be "cdn-like performance" to my users. (my users = people downloading my file from Storj)
Aren't most SNOs in the EU? I wouldn't be surprised if all the fastest nodes for an EU customer end up being from the same area. And by area… alpharabbit's image is around 5000 km across and touches many countries. The files are still spread across more locations than with any other S3 provider.
And CDN-like performance isn't solely about latency (needing local nodes): you also get great performance from massive parallelism. The time-to-first-byte may be longer, but the entire transfer can still complete quicker.
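Rough numbers to illustrate why (all figures invented, the whitepaper's 20/40 setup assumed, and MiB vs MB hand-waved):

```go
// Back-of-the-envelope: a 64 MiB segment from one node vs. 20 pieces in parallel.
// All numbers are illustrative, not measured.
package main

import "fmt"

func main() {
	const segmentMiB = 64.0   // size of one segment
	const linkMbps = 20.0     // assumed throughput of a single node connection
	const neededPieces = 20.0 // 20/40 Reed-Solomon: any 20 pieces rebuild the segment

	single := segmentMiB * 8 / linkMbps                    // one stream from one node
	parallel := (segmentMiB / neededPieces) * 8 / linkMbps // 20 piece downloads at once

	fmt.Printf("single stream: ~%.0f s, 20 parallel pieces: ~%.1f s (plus latency)\n", single, parallel)
}
```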
I think we're all discussing slightly different things, each correct under different assumed configs. Storj is fast because of parallelism and node-racing, and durable because erasure-encoding makes sure somebody out there has enough pieces to retrieve a file (even if they're all on the same continent).
I see there are different situations. If the uploader and downloader are the same or close to each other, then I think this is correct.
In contrast, there is the other situation where the uploader and downloader are far apart. Think of a movie shoot in Australia (upload) whose dailies are to be reviewed or processed by a producer in Los Angeles. Maybe in such cases the pieces should be stored closer to the downloader?
So I am a bit with @arrogantrabbit here. Upload should always be fast, meaning the closest nodes are probably the fastest solution. But when it comes to downloads, that is not always the case. If my downloaders are spread worldwide, maybe the pieces should be spread worldwide too to get the CDN-like performance. If I know my downloaders are in a specific region, maybe the pieces should be placed close to that region.
Maybe customers should be able to make such a selection when they create a bucket, to determine how the pieces are spread.
But I do not know if there would be any gains from that in performance or resilience.
The ICMP ping is not used to discover speed. The satellite uses the DRPC protocol to contact nodes; the upload/download success is reported by the uplink client to the satellite, and the satellite uses that to calculate a success rate for the node.
So when you see a log message mentioning "ping", it actually refers to a Ping message in the DRPC protocol, which is a more complex packet than an ICMP ping.
The speed is not measured anywhere, because it wouldn't make sense: it will be different for each uplink client, since the data flows directly between nodes and clients, not through the satellite. The satellite here is an address book, auditor, repairer, online checker and billing processor.
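Very roughly, that bookkeeping amounts to something like this; a pure sketch, the type and field names are invented and the real satellite accounting is more involved:

```go
// Sketch of per-node success-rate bookkeeping: the uplink reports which of
// the offered nodes it actually used, the satellite tallies it.
package main

import "fmt"

type nodeStats struct {
	offered   int // times the satellite handed this node to an uplink
	succeeded int // times the uplink reported a finished transfer to it
}

func (s nodeStats) successRate() float64 {
	if s.offered == 0 {
		return 0
	}
	return float64(s.succeeded) / float64(s.offered)
}

func main() {
	stats := map[string]*nodeStats{}
	record := func(node string, used bool) {
		st, ok := stats[node]
		if !ok {
			st = &nodeStats{}
			stats[node] = st
		}
		st.offered++
		if used {
			st.succeeded++
		}
	}

	// The uplink was offered three nodes and finished uploads on two of them.
	record("node-A", true)
	record("node-B", true)
	record("node-C", false) // lost the race, long tail cancelled

	for n, st := range stats {
		fmt.Printf("%s: %.0f%% success\n", n, st.successRate()*100)
	}
}
```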
There isn't any side to be on; @arrogantrabbit and I are saying the same thing. I don't understand why the pitchforks come out in every thread on here:
There are three scenarios when @jammerdan uploads a file:
Default: @jammerdan doesn't care where his files are stored. The only requirement is to upload them to the network as fast as possible. Since @jammerdan didn't ask the network to do anything specific with @jammerdan's files, the network does what the network should: provide nodes that are the fastest to @jammerdan (fastest = report "OK, I've stored it, next"). One of those nodes may be @jammerdan's neighbor, one might be me halfway across the world (since I'm running my own IX where every ISP on the planet is peered with me directly), or it might even be @arrogantrabbit, since he spent hundreds of thousands of $$$ optimizing his ZFS array to store data as fast as possible, even if his network is running on a DSL line.
Geo-restricted: @jammerdan wants a file uploaded to the network to be available to his video editors, who are all in Germany. Now @jammerdan selects the EU, or even Germany alone, as the place for his files (this bucket has geo-restriction on). Every file uploaded by @jammerdan will end up on either EU or Germany nodes (depending on what was selected), since the only users will be in that region. @jammerdan doesn't care what the performance will be relative to himself, so it doesn't matter if it takes a day to finish uploading halfway across the world, since @jammerdan is located in the USA.
World-wide CDN: @jammerdan wants to connect his Cloudflare page to Storj, through an S3 gateway that @jammerdan runs on 5 continents. Since requests might come in from any of those 5 continents, and @jammerdan did the right thing by running his gateway on an anycast IP (which "points" to a different multi-availability-zone cluster depending on where the request for that IP is coming from), @jammerdan needs the pieces to be spread as geographically distributed as possible. Since @jammerdan didn't signal any storage requirements to the network, @jammerdan's files will end up being distributed just like in the default scenario: worldwide, according to the performance of the nodes.
There isn't any side to pick; the network is operating as it should: the fastest nodes to report OK to the uploader store the files, unless the uploader expressly asks for specific geo-restrictions.
I'll give you a bonus scenario: @arrogantrabbit's array blew up, since he ignored my recommendations to run multiple (mirrored) metadata devices, and the used server-grade SSD he bought on eBay for $10 finally gave up, taking the entire array down (since there isn't any other place the metadata is stored, the array now just contains random bytes, without any context as to where or what those bytes belong to). Since @arrogantrabbit's array was holding the last piece of @jammerdan's file (the loss that triggers a rebuild of the data, i.e. it fell below the minimum available pieces), a repair will now be triggered by the satellite. The downloader is now the satellite, downloading the pieces from all the remaining nodes. The satellite reconstructs the missing pieces and uploads just those missing pieces to new nodes. The uploader is now the satellite, which means the fastest nodes are now counted from the satellite's perspective.
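The repair trigger in that bonus scenario boils down to a threshold check. A rough sketch with made-up numbers (the real thresholds live in the satellite configuration; I'm just reusing the whitepaper's 20/40 figures):

```go
// Sketch of the repair decision: when the count of healthy pieces for a
// segment drops to the repair threshold, the satellite downloads enough
// remaining pieces, re-derives the missing ones and uploads them to new nodes.
// Thresholds are illustrative, not the real satellite config.
package main

import "fmt"

const (
	minimumPieces   = 20 // pieces needed to reconstruct (whitepaper's 20/40 example)
	repairThreshold = 25 // start repairing well before reaching the minimum
	targetPieces    = 40 // repair back up to this many
)

func needsRepair(healthy int) bool {
	return healthy <= repairThreshold && healthy >= minimumPieces
}

func main() {
	healthy := 25 // one more node just took a piece with it
	if needsRepair(healthy) {
		fmt.Printf("repairing: download %d pieces, recreate and upload %d new ones\n",
			minimumPieces, targetPieces-healthy)
	}
}
```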
This is a great description of how it happens. With some deviations though: some of my segments were uploaded not only to the closest nodes, but also to random nodes across the world, because I was "lucky" enough to hit full or overloaded nodes that really are closest to my location… Well, now I have segments in Australia, New Zealand and even South America, while I'm in Asia.
Of course, many segments are in the EU (I honestly do not know why), some even in Russia and the USA (how?!). I'm especially wondering about the ones in South America.
For those in Australia I have no questions; I'm geographically close to them.
There is no pitchfork. I am with @arrogantrabbit because he stated:
This would be my expectation as well, as I tried to make clear in an older thread that I posted.
I see that you are solving the issue by bringing geofencing into play. Yes, it is handy that it is available for the EU. As far as I know it is not available for a single country like Germany, and not for many regions either (only the US and EU, as far as I know, but maybe that has changed). So if you move the example to other regions, your example number 2 would not work: for a shoot in Los Angeles with post-production in New Zealand, I don't know if there is a similar geofencing option. But I agree: if I can select geofencing for a bucket and that region is close to the downloader, then the downloader might get a really fast download, as the customer intended.
And the default scenario was the one in which @arrogantrabbit was wondering why the files are not geo-distributed all over the world, wasn't it?
At least in the past there was no long tail cancelation for repair jobs:
No, the default scenario was the one where I explicitly explained why nodes from all over the world (= the fastest nodes to signal "OK, I've stored it") would get a piece. Whether those nodes are in the same city, country, continent, or planet is irrelevant. And yes, Alexey is right. Since I'm running the fastest network (latency wise) on this planet, I'm getting all the traffic. Every single file uploaded to the network ends up at least partly on my nodes, then other nodes get some leftovers (to build up the number of available pieces). That filled up my nodes, so now even DSL nodes can get some bytes, until my nodes clear up some space. Those DSL nodes are all in Africa, so now that you've uploaded your file, your file is stored on my nodes, which are spread across the US and EU, plus other nodes in Africa, hence geo-distributed around the world.
Those who live in the USA do not see any problem; they think about fast or slow nodes purely within their own networks.
I am from Ukraine, and I can tell you with confidence that even with a node on NVMe, a lot of RAM and excellent, redundant Internet, all I see is data flowing out.
Before the cleanup there were 400 TB, after the cleanup 150 TB (and frankly, it keeps gradually decreasing). I am not complaining, but since the distribution change the system clearly works by geo-zones and connection response, where of course the closer the better. What was previously divided by /24, with all the talk about equality and fair distribution, no longer works, and alas it will stay that way until some critical event or cataclysm occurs.
Since the node selection algorithm changed, people constantly raise the issue of fair distribution on the forum, but the answer is always: this is how it should be.
P.S. This is just my opinion, as an operator who will lose 3/5 of the stored data (the dead data in the network) while having invested in a UPS, fibre Internet and enterprise storage devices.
Can we have someone located in Australia or New Zealand upload a distribution map of one of their files of similar size? I would love to see how this comes out.
I would happily, but for the life of me I cannot find the globe icon @alexey described in the Storj web console. I see a globe icon on the 'Browse Buckets' page, but it's not a link. This is on us1.storj.io.