Uplink library upload process and recommended file fragmentation size

ndragun · December 22, 2020, 9:12pm

Other developers looking to use the uplink library will probably have this question too, so I figured posting it here would be helpful.

I’m looking to get some feedback on:

1. Understanding how large files are ingested into Storj
2. Suggestions/recommendations on file size splitting

#1
All the current usage examples show simple cases where our full byte array is easily loaded into memory and then shuffled up to UploadObject before Commit. This is helpful for small files that can fit in memory, but leaves a big grey area for how developers should treat large files.

Let’s say I have a 1GB file. Let’s also say that I choose to create a 128K buffer for reading the file pieces into memory. Does “storj” split each 128K buffer slice sent to it across the network? Does it buffer every 128K slice until Commit is called and send the whole 1GB file at once? Clearly we can’t keep an entire file in memory after a certain size threshold, so you must have a process in mind.

Would the flow look something like this?

...
UploadObject(<bucket,key,etc>)
<build buffer>
Write(<send buffer>)
<build buffer>
Write(<send buffer>)
<loop build/write until complete>
Commit()
...

#2
If it does indeed divide up the file pieces into storj network commit chunks based off the write size, there is probably a “sweet spot” for these chunk sizes for both optimal cpu and network throughput. Are there recommended size/chunk ratios you all have observed as optimal?

I guess I’m also looking for clarity as to how storj knows how to split a file into X chunks of Y size when it doesn’t have all the data beforehand.

jtolio · December 22, 2020, 9:52pm

Hello! This is a great question!

It turns out there are actually multiple levels at which Storj chunks incoming data into different sections. At the highest level, an incoming object stream is broken into 64 MB ranges of bytes we call “segments”. This is why our broken progress bar (currently) pauses every 64 MB. Most of our internal accounting really handles segments and not objects.
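As a rough illustration of the segment arithmetic, the sketch below computes how many segments an object of a given size would span and how large the final segment would be. It assumes “64 MB” means 64 MiB (the post does not say which), and the function names are made up for this example.

```go
package main

import "fmt"

// maxSegmentSize assumes "64 MB" means 64 MiB; this is an assumption.
const maxSegmentSize int64 = 64 << 20

// segmentCount returns how many segments an object of size bytes would span.
// Sizes of zero or less are reported as zero segments in this sketch.
func segmentCount(size int64) int64 {
	if size <= 0 {
		return 0
	}
	return (size + maxSegmentSize - 1) / maxSegmentSize
}

// lastSegmentSize returns the size of the final, possibly shorter, segment.
func lastSegmentSize(size int64) int64 {
	if size <= 0 {
		return 0
	}
	if r := size % maxSegmentSize; r != 0 {
		return r
	}
	return maxSegmentSize
}

func main() {
	for _, size := range []int64{10 << 20, 64 << 20, 100 << 20, 1 << 30} {
		fmt.Printf("%d bytes: %d segment(s), last segment %d bytes\n",
			size, segmentCount(size), lastSegmentSize(size))
	}
}
```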

Within a segment, the encrypted data is broken into “stripes”, which are just a few KB. Each stripe is erasure encoded into “erasure shares” and added to outgoing streams to individual storage nodes called “pieces”. We don’t really know how large a piece will be when we start, but we know it won’t be larger than the erasure encoded output of the max segment size. If the incoming data stream stops before we get to the max segment size, the pieces are just a little smaller. We basically shovel the data through the pipeline as it comes in, so by the time we receive the last stripe of a segment, the first erasure encoded stripe should already be out to storage nodes.

We go into this with more detail in section 4.8 of https://storj.io/storjv3.pdf.

Because storage nodes all upload at different rates, we buffer within a segment to try and allow fast storage nodes to run ahead a little bit from slower storage nodes. I believe this runahead amount is configurable, and while we haven’t documented it well, you can extend your calls to the Uplink library to buffer on disk or in memory.

For developer advice, we definitely encourage developers to upload “large” objects, but there is currently no benefit either from a cost or performance perspective to objects larger than 64MB, because internally, that is the max segment size we deal with. There is a cost to lots of smaller objects - it takes more overhead on the Satellite, and so we do charge more (there’s a per object cost).

One current downside of objects larger than 64MB is that you have a greater joint probability of failure for the upload. Our current API is a single shot and tries to upload your whole stream in one go. If your network drops, you have to start over from the beginning. We’re adding a new feature early next year (soon!) to add native support for AWS S3-style multipart uploads, which will allow you to upload larger objects in separate stages, so you don’t have to start over if your network drops midway.

Suffice it to say, the current sweet spot for object sizes, if you’re developing an application on Storj, is a little under or up to 64 MB.

Keep in mind that if you have lots of smaller objects, a great strategy is to pack them together in such a way that you keep track of their offsets for reads. When you download from Storj, you can start your download of an object at an arbitrary offset; the library automatically calculates which stripes are needed and thus which ranges of which pieces are needed. If you have many tiny files, but have an index data structure that keeps track of which object they are in and what offset and length they have within that object, you can efficiently download just the small file you need through our Uplink library without downloading the full object. If you want sample code, my toy backup app https://github.com/jtolio/jam is designed this way: https://github.com/jtolio/jam/blob/4ce1c592202da556df1c408964730beebfbc6d76/backends/storj/storj.go

ndragun · December 22, 2020, 10:25pm

This is awesome, really appreciate the info. So in a nutshell, at the moment, application developers do not need to worry about the chunking size and striping operations.

Great news on multipart uploads; I did not realize this was not currently implemented.

As I’d like to make sure we’re keeping mindful of code design for future capabilities, do you see this upcoming feature set increasing the 64MB chunk cap? It’d be great to be able to modify the default ‘cap’ under certain circumstances, but I have no idea how far reaching those implications are under the hood!

jtolio · December 23, 2020, 12:08am

That’s right!

Well, so, there really isn’t a cap. You can upload a 4GB or 4TB file if you like. Internally we map it to these 64MB segments for easier handling of repair and distribution across the network, and we have discussed changing this segment size either higher or lower depending on network dynamics. But developers shouldn’t need to know or care about the segment size. Ultimately, it’s a bug if the 64MB segment size is causing issues for you. We do have an outstanding problem where uploading a large (>64MB) file stalls every 64 MB in our upload pipeline, but we’re working to eliminate those stalls. That effort is independent of the segment size decision.

To be honest, as long as you aren’t storing lots of small (~KB) objects and are mostly storing large (>20MB) objects of any size, you’re hitting the sweet spot for us. Since we don’t (yet) have support for multipart upload, I suggested sticking around 64 MB not because you can’t go higher, but because it makes the need for multipart upload in the face of upload failures less pressing.

ndragun · December 30, 2020, 7:47pm

I had to spend some time mulling this over a bit as something didn’t seem very sound from a performance aspect. I’ve come to the following understanding, so bear with me through this thought process and maybe we’ll come out ahead.

I don’t really see the current method of handling upload segments (trying to make sure to use the same terminology you guys use) scaling well at all. The performance scaling issue points back to having a fixed segment size, with serial vs. parallel upload being a secondary issue. So let me try and explain what I see at a deeper level and provide a possible solution.

Generalizations
For the sake of this conversation I’m going to focus only on the network transport components. Additionally I also assume that each segment (fixed 64M max size) splits into 80 pieces and that the expansion factor is a fixed 1:2.75 ratio (or 275%). Each piece is then sent off to a different storage node. Lastly, let’s assume for the sake of uniformity it only takes 20ms total for a client to request an upload storage node (SN) and establish a connection with it (this seems quite generous).

Problem
With a fixed maximum segment size we run into unnecessary resource usage overhead and underutilized storage node throughput capacity as the upload file size increases. The easiest way to picture this is to look at:

Client / storage node connection build up and tear down cpu time/usage
RTT connection request latency (Ex: “Hey sat, give me storage nodes.” “ok, here you go” … “hey storage node, let me send you a file” … etc etc)
Fixed maximum total upload size per storage node connection

Since the argument I’m making is based off variable file size I’ll try and extrapolate how this comes together in a table:

The “Piece size in Mbps” column means that it would only take X Mbps to transfer the whole piece in 1 second. (Ex: So a whole 2.2MB piece can be transferred in 1 second at 17.6Mbps)

My math may be a little rough here, but you can see that using a fixed segment size we quickly increase the total number of connections necessary for uploading and each of the upload streams is heavily underutilized. I don’t really know a good way to calculate how much cpu/ram overhead building 8000 storage node connections uses on a client so I didn’t add that to this table.

Solution
While being able to parallelize segment uploading will help with the burp/stall in requesting more storage nodes for the next segment, it does not address the need to reduce the total number of uploaded pieces (reduce latency) and increase individual bandwidth usage.

By dynamically scaling the segment size it is possible to find an optimal balance that takes advantage of available network resources, reduces total upload time, and even reduces satellite load.

I propose that the following two additional optional functionalities would solve this problem:

Allow the client to provide a total file size during creation - use this file size parameter to “intelligently” calculate an optimal segment size
Allow the client to specify a segment size during creation - no different than the current behavior except that it can be overridden with a different value

Alexey · December 30, 2020, 11:41pm

What would you do with streaming? You have neither the size of the file nor the size of a segment.

ndragun · January 6, 2021, 10:13pm

Sorry Alexey, I’m having a little difficulty translating your meaning.

Alexey · January 7, 2021, 8:42pm

If you want to stream video to Tardigrade, for example, you would have neither the size of the segment nor the size of the file; they are unknown.
Streaming would not work with either of your suggestions.