Make downloads faster by randomizing stripe order when downloading

Just a shower thought.

Per the whitepaper, erasure coding happens at the level of stripes, and a single segment consists of many stripes. If I understand the code correctly, right now each piece contains the data for these stripes in the same order, and that’s also the order in which they are then sent over the network:

Assume RS k=3.

-------------------------------time---------------------------------------------->
Node 1:  «stripe 1»       «stripe 2»    «stripe 3»       «stripe 4»
Node 2:           «stripe 1»   «stripe 2»       «stripe 3»        | «stripe 4»
Node 3:     «stripe 1» «stripe 2» «stripe 3» «stripe 4»           |
Node 4:       «stripe 1»  «stripe 2»  «stripe 3»  «stripe 4»      |
Node 5:    «stripe 1»      «stripe 2»     «stripe 3»         «stripe 4»
-------------------------------time-------------------------------|-------------->
                                                       download finishes here

So, the download ends when the last stripe has data from 3 pieces.

But we could try to (deterministically) randomize the order of stripes received. Here I’m not changing the delays, just the order of stripes:

-------------------------------time---------------------------------------------->
Node 1:  «stripe 4»       «stripe 3»    «stripe 2»       «stripe 1»
Node 2:           «stripe 1»   «stripe 3»       «stripe 2»          «stripe 4»
Node 3:     «stripe 2» «stripe 1» «stripe 4» «stripe 3»            
Node 4:       «stripe 4»  «stripe 1»  «stripe 3»  «stripe 2»       
Node 5:    «stripe 1»      «stripe 2»     «stripe 4»         «stripe 3»
-------------------------------time----------------|----------------------------->
                                         download finishes here

Here, the download can end earlier, because we can leverage stripe 1 data from nodes 2, 3, and 4; stripe 2 from nodes 1, 2, and 5; stripe 3 from nodes 1, 2, and 4; and stripe 4 from nodes 1, 3, and 5.
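
To make the completion rule explicit, here’s a tiny sketch (not actual uplink code) of when a download finishes, given hypothetical per-share arrival times; the same rule applies to both diagrams, the shuffle only changes which stripe each received chunk belongs to:

# Tiny sketch, not actual uplink code: a download finishes once every stripe
# has shares from k distinct nodes, whatever order the nodes sent them in.
def completion_time(arrivals, k):
    # arrivals[node][stripe] = hypothetical arrival time of that node's share of that stripe
    num_stripes = len(next(iter(arrivals.values())))
    per_stripe = []
    for stripe in range(num_stripes):
        times = sorted(node_times[stripe] for node_times in arrivals.values())
        per_stripe.append(times[k - 1])  # the k-th earliest share completes the stripe
    return max(per_stripe)  # the slowest stripe determines the total time

Shuffling doesn’t change when data arrives, only which stripe each arrival counts towards, which is what pulls the slowest stripe’s completion earlier in the second diagram.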

This way we would get faster downloads, and possibly less bandwidth overhead from not wasting data received from slower nodes, at the cost of, uh, a bit of complexity between the node and uplink, plus teensy-weensy more complex accounting.

Also, time-to-first-byte would be worse, so this shuffle could be an opt-in.

What do you think?

Given that nodes will have different speeds (that will vary over time, and that you don’t know in advance): wouldn’t randomization be just as likely to slow things down? You’ve for-sure added complexity but don’t for-sure receive any benefit?

No. Currently you wait for the first 29 nodes that deliver all stripes. Consider that if the order of the stripes in those 29 nodes were randomized, that is exactly the worst case for the proposed approach. And any time one of the other 6 nodes delivers some stripe data faster, it only means we don’t need to wait for those 29 nodes to deliver that stripe, so we can only improve on the worst case.

The current download process is implemented in such a way that it starts more than 29 downloads at a time, so I do not think that we need to randomize them at all - we are already doing so by starting, for example, 39 downloads when only 29 are needed. These numbers also depend on the RS scheme used; it could be more or less.

I’m not sure you have understood the proposal here. I’m talking about stripes, not pieces.

Done some simulations. Writing verbosely here to make clear what is being compared.

In the best case where each node has exactly the same speed, with the usual N=35, K=29 I see an average speedup of 4–5%. That is, compared to opening 35 connections and waiting for the fastest 29 nodes to complete, opening 35 connections and waiting for 29 stripes to complete across all nodes sending them in random order, is 4–5% faster, depending on the segment size.

The difference is bigger with bigger Ns though. For example, compared to opening 40 connections and waiting for the fastest 29 nodes to complete, opening 40 connections and waiting for 29 stripes to complete across all nodes sending them in random order, is 9–12% faster, depending on the segment size.

And compared to opening 46 connections and waiting for the fastest 29 nodes to complete, opening 46 connections and waiting for 29 stripes to complete across all nodes sending them in random order, is 16–21% faster, depending on the segment size.

I’d conclude that while N=35 probably sees diminishing returns with the current approach, this proposed approach can leverage way more connections and still get truly meaningful speedups. Effects will be smaller with nodes differing in performance; I don’t have data to simulate that though.
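
For anyone who wants to reproduce or extend this, here’s a rough sketch of how such a comparison can be set up (the i.i.d. exponential per-share delays are just a placeholder model, not a claim about real node behaviour):

import random

# Rough sketch: compare "wait for the k fastest nodes" against "each node sends
# stripes in its own random order, wait until every stripe has k shares".
def simulate(n_nodes, k, n_stripes, trials=1000):
    current_total, shuffled_total = 0.0, 0.0
    for _ in range(trials):
        # delays[node][i] = time between consecutive shares sent by that node
        delays = [[random.expovariate(1.0) for _ in range(n_stripes)]
                  for _ in range(n_nodes)]
        # Current scheme: done when the k fastest nodes have delivered everything.
        current_total += sorted(sum(d) for d in delays)[k - 1]
        # Proposed scheme: done when every stripe has shares from k distinct nodes.
        stripe_arrivals = [[] for _ in range(n_stripes)]
        for d in delays:
            order = random.sample(range(n_stripes), n_stripes)  # per-node permutation
            t = 0.0
            for stripe, dt in zip(order, d):
                t += dt
                stripe_arrivals[stripe].append(t)
        shuffled_total += max(sorted(ts)[k - 1] for ts in stripe_arrivals)
    return current_total / trials, shuffled_total / trials

print(simulate(n_nodes=35, k=29, n_stripes=64))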

Fun experiment.

I’m not sure if I understand the topology correctly, but could this be mitigated by allocating a portion of nodes to always serve these stripes in linear order? If done on the node side, perhaps this could be randomized at startup.

This suggestion is for implementation in uplink, so you can use an unmodified uplink to download as usual (for streaming, for example). No need to change the node selection on the satellite.

I’m not sure that erasure shares from different pieces can be reconstructed into a stripe before you have the erasure shares of that stripe from all 29 pieces. Only when you have all stripes can you reconstruct the segment.

Makes sense. This would effectively turn it from an on/off switch into a continuously tunable knob.

That’s how I understand libuplink’s code. We would effectively need to go from:

from collections import defaultdict

segment = []  # reconstructed stripes, in order
erasure_shares = defaultdict(list)  # stripe index -> [(node_id, share), ...]
count_of_so_far_received_erasure_shares_per_node = defaultdict(int)
for node_id, erasure_share in get_next_received_erasure_share():
    # Every node sends its shares in the same stripe order, so the n-th share
    # received from a node belongs to stripe n.
    erasure_share_id = count_of_so_far_received_erasure_shares_per_node[node_id]
    count_of_so_far_received_erasure_shares_per_node[node_id] += 1
    erasure_shares[erasure_share_id].append((node_id, erasure_share))
    if len(erasure_shares[erasure_share_id]) == 29:
        segment.append(reconstruct_stripe(erasure_shares[erasure_share_id]))

to (assuming we would choose to generate a permutation deterministically from node id, just like we do with RS codes):

from collections import defaultdict

segment = []  # reconstructed stripes, in order
erasure_shares = defaultdict(list)  # stripe index -> [(node_id, share), ...]
permutation_per_node_id = {}
for node_id in nodes_being_downloaded_from:
    # Deterministic per-node stripe order, derived from the node id.
    permutation_per_node_id[node_id] = compute_permutation(node_id)
for node_id, erasure_share in get_next_received_erasure_share():
    # The n-th share received from a node belongs to the n-th stripe in that
    # node's permutation.
    erasure_share_id = permutation_per_node_id[node_id].pop(0)
    erasure_shares[erasure_share_id].append((node_id, erasure_share))
    if len(erasure_shares[erasure_share_id]) == 29:
        segment.append(reconstruct_stripe(erasure_shares[erasure_share_id]))
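
For the compute_permutation bit, one hypothetical way to derive a stable stripe order from the node id (nothing like this exists in libuplink; I’m passing the stripe count as an explicit argument here):

import hashlib
import random

# Hypothetical sketch: seed a PRNG from the node id so the node and the uplink
# can independently compute the same stripe order without extra coordination.
def compute_permutation(node_id, num_stripes):
    seed = int.from_bytes(hashlib.sha256(str(node_id).encode()).digest()[:8], "big")
    order = list(range(num_stripes))
    random.Random(seed).shuffle(order)
    return order

In practice you’d probably also want to mix the segment or piece id into the seed, so the same node doesn’t always use the same order.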

There’s also the bit where the code can currently assume that it’s possible to reconstruct the whole segment as soon as you get enough bytes from 29 nodes, so it doesn’t bother checking earlier. Now the code would have to do some checks after each received drpc response.
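
A minimal sketch of that check, reusing the erasure_shares map from above (the stripe count and RS k are assumed to be known from the segment metadata):

# Minimal sketch: the whole segment becomes reconstructible once every stripe
# has shares from at least k distinct nodes.
def segment_reconstructible(erasure_shares, num_stripes, k=29):
    return all(len(erasure_shares[stripe]) >= k for stripe in range(num_stripes))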

The libuplink code expects that all shares are collected for the same stripe.
So with randomly ordered shares it will wait for the correct share to arrive before it can append it and build a stripe. Thus I think the resulting time will be the same or longer - you would just shift the waiting time from the downloads to the segment reconstruction.

29 shares per stripe, yes. But we are getting them from 35 nodes. It doesn’t matter which of these 35 nodes will serve us these 29 shares, that’s the point of RS codes.

Is 35 the low-repair threshold now? I don’t know where to look for active RS numbers: maybe 29/46/54/70? So your example would optimistically be looking for 29-from-54?

No, 35 is the number of connections opened for downloads by regular (non-repair) uplink clients. It’s not part of the set of numbers quoted by littleskunk, which are, if I understand correctly: RS k / repair threshold / target long-term availability / upload connections.

Ah, I thought (using those numbers) that the satellites give upload clients a list of 70 possible nodes… and the clients are done when they’ve uploaded to 54 of them (presumably the fastest). So when the clients come back later asking to download… there’s up to 54 nodes they could try to grab from.

Yeah, same thing actually!

Indeed the network could send the full list of 54 nodes to uplinks; with the current approach this would likely lead to more bandwidth wasted though. Having only a subset of nodes for downloads also helps spread the load.

The satellite should be sending all 54, shouldn’t it? The client wants speed: and the cost of opening a few more connections is essentially zero. Nodes race no matter what: failed/lost-race uploads+downloads aren’t a client concern - getting the data as-soon-as-possible is. And it’s not a waste: it’s part of what they paid for.

Like in your example… if the satellite only told the client about 35 places to get data from… it needed 29… and only found 25 online… what then? Does it go back to the satellite and say “the first node list you sent me didn’t work, can you please really really really tell me all the places I paid for my data to be stored”?

The satellite and client having two conversations about one set of data… would be expensive. Compared to the satellite handing over the full node list immediately… and the client connecting to all of them in parallel.

I’m not arguing with you - I mean if it doesn’t work that way I’m confused. So much of Storj marketing pushes performance: and so much of their architecture depends on massive parallelism. So connect to every node that may have the data, every time, no?

The network must optimize for good performance for all participants and longevity of the network, not just for performance of a single download.

As such:

It is not when you’re downloading many segments concurrently. My home router gives up with a few thousand concurrent connections. We’ve seen threads here showing clients failing with too many concurrent connections, e.g. Uplink: failed to upload enough pieces (needed at least 80 but got 78)

Thankfully, in this case you’re still working with hundreds of connections, so concurrency works; just not at the scale of a single segment.

Another factor is that each connection also costs I/O and bandwidth on the node side, even if the node loses the race. If you double the number of connections, an average node will see double the I/O and double the bandwidth. I/O has historically already been a problem, and I bet node operators will get annoyed if their bandwidth doubles with no extra rewards because they lose many more races.

Yet another factor is total bandwidth. In the worst case, when nodes have exactly the same speed, you’re effectively doubling the amount of bandwidth consumed to download a segment: even if you discard data from half of the nodes, you have already received that data before you drop the connection. This might lead to slower nodes dropping out of the network altogether.

And, with the current scheme, beyond some number of connections I suspect there are diminishing returns in terms of single-segment download speed.

Tough luck :person_shrugging: The 6 additional nodes should be covering this case well enough. If not, yes, you just download it again from the satellite. BTW, even AWS S3 requires clients to be prepared to resend the request, because it does happen they’ll just throw a 50x randomly. So you can consider this part of the S3 protocol.

Rare enough that it’s not a problem. The 6 additional entries cover this well enough. At some point it’s just a balance of probability…

Just trade-offs. And it’s not like other providers don’t do them. Storj does have more flexibility in tuning them though, so by being able to do this tuning better they win on performance.

BTW, what Storj could implement is some sort of connection selection client-side. So, for example, send the full list of nodes to uplink, and the uplink uses some sort of local database to track the relative performance of nodes, choosing the nodes that are most likely to deliver pieces quickly. At the current scale this database wouldn’t be that big. If the scale grows, you can still do some sort of averaged-out statistics, e.g. per /24 bucket.
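
Purely as an illustration of the idea (nothing like this exists in uplink today), the tracking could be as simple as an exponentially weighted moving average of observed download times per node:

from collections import defaultdict

# Hypothetical sketch: remember how fast each node has been recently and prefer
# the nodes with the lowest estimate when opening connections.
class NodeStats:
    def __init__(self, alpha=0.2):
        self.alpha = alpha
        self.avg_seconds = defaultdict(lambda: None)  # node_id -> EWMA of download time

    def record(self, node_id, seconds):
        prev = self.avg_seconds[node_id]
        self.avg_seconds[node_id] = seconds if prev is None else (
            self.alpha * seconds + (1 - self.alpha) * prev)

    def pick(self, candidates, count):
        # Unknown nodes sort last; you'd probably still give them a chance
        # occasionally so the statistics don't go stale.
        return sorted(
            candidates,
            key=lambda n: (self.avg_seconds[n] is None, self.avg_seconds[n] or 0.0),
        )[:count]

Aggregating per /24 bucket would just mean keying the same structure on the subnet instead of the node id.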