Uplink library upload process and recommended file fragmentation size

Hello! This is a great question!

It turns out there are actually multiple levels at which Storj chunks incoming data. At the highest level, an incoming object stream is broken into 64 MB ranges of bytes we call “segments”. This is why our (currently broken) progress bar pauses every 64 MB. Most of our internal accounting handles segments rather than objects.
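To make the segment boundaries concrete, here is a minimal sketch of how an object's bytes map onto segments. It is not the Uplink implementation: `segmentRange` and `splitIntoSegments` are hypothetical names, the sketch assumes "64 MB" means 64 MiB, and it assumes the object size is known up front, whereas the real pipeline cuts segments as the stream arrives.

```go
package main

import "fmt"

// maxSegmentSize is the 64 MB figure from the post, taken here as 64 MiB.
// The exact byte count is an assumption.
const maxSegmentSize int64 = 64 << 20

// segmentRange is a hypothetical type describing one segment's byte range
// within the object.
type segmentRange struct {
	Offset, Length int64
}

// splitIntoSegments returns the byte ranges an object of the given size
// would be divided into, each at most maxSegmentSize long.
func splitIntoSegments(objectSize int64) []segmentRange {
	var ranges []segmentRange
	for off := int64(0); off < objectSize; off += maxSegmentSize {
		length := objectSize - off
		if length > maxSegmentSize {
			length = maxSegmentSize
		}
		ranges = append(ranges, segmentRange{Offset: off, Length: length})
	}
	return ranges
}

func main() {
	// A 150 MiB object spans two full segments and one shorter final segment.
	for i, r := range splitIntoSegments(150 << 20) {
		fmt.Printf("segment %d: offset=%d length=%d\n", i, r.Offset, r.Length)
	}
}
```

Only the last segment can be shorter than the maximum, which is why an object at or just under 64 MB occupies exactly one segment.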

Within a segment, the encrypted data is broken into “stripes”, which are just a few KB each. Each stripe is erasure encoded into “erasure shares”, and each share is appended to an outgoing stream bound for an individual storage node. We call these per-node streams “pieces”. We don’t really know how large a piece will be when we start, but we know it won’t be larger than the erasure encoded output of the max segment size. If the incoming data stream stops before we get to the max segment size, the pieces are just a little smaller. We basically shovel the data through the pipeline as it comes in, so by the time we receive the last stripe of a segment, the first erasure encoded stripe should already be out to storage nodes.
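The upper bound on piece size can be worked out with a little arithmetic. The sketch below assumes a k-of-n erasure code in which each stripe is split into k equal shares and every piece receives one share per stripe. The function name `pieceSize` is hypothetical, the values k = 29 and a 256-byte share are illustrative assumptions rather than figures from this post, and encryption overhead is ignored.

```go
package main

import "fmt"

// pieceSize estimates how many bytes a single storage node receives for a
// segment of segmentBytes. It assumes each stripe holds k shares of
// shareSize bytes and that each piece gets one share per stripe.
func pieceSize(segmentBytes, shareSize, k int64) int64 {
	stripeSize := shareSize * k
	// Round up: a partially filled final stripe is assumed to be padded.
	stripes := (segmentBytes + stripeSize - 1) / stripeSize
	return stripes * shareSize
}

func main() {
	const (
		k         = 29       // assumed number of shares needed to rebuild a stripe
		shareSize = 256      // assumed erasure share size in bytes
		maxSeg    = 64 << 20 // max segment size, taken as 64 MiB
	)
	fmt.Println("stripe size in bytes:", k*shareSize)
	fmt.Println("max piece size in bytes:", pieceSize(maxSeg, shareSize, k))
	// A stream that ends early simply produces fewer stripes and a smaller piece.
	fmt.Println("piece size for 10 MiB:", pieceSize(10<<20, shareSize, k))
}
```

Under these assumptions a piece is roughly 1/k of the segment, so the bound is known before any data arrives even though the final size is not.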

We go into this with more detail in section 4.8 of https://storj.io/storjv3.pdf.

Because storage nodes all upload at different rates, we buffer within a segment to allow fast storage nodes to run ahead a little of slower ones. I believe this runahead amount is configurable, and while we haven’t documented it well, you can extend your calls to the Uplink library to buffer on disk or in memory.

For developer advice, we definitely encourage developers to upload “large” objects, but there is currently no benefit, from either a cost or a performance perspective, to objects larger than 64 MB, because internally that is the max segment size we deal with. There is a cost to lots of smaller objects - they take more overhead on the Satellite, and so we do charge more (there’s a per object cost).

One current downside of objects larger than 64 MB is that you have a greater joint probability of failure for the upload. Our current API is single-shot and tries to upload your whole stream in one go. If your network drops, you have to start over from the beginning. We’re adding a new feature early next year (soon!) to add native support for AWS S3-style multipart uploads, which will allow you to upload larger objects in separate stages, so you don’t have to start over if your network drops midway.
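To illustrate the joint probability point, here is a rough sketch. It assumes each segment's upload succeeds independently with the same probability, which is a simplification made for this example and not a measured property of the network. The function name `wholeUploadSuccess` and the 0.99 figure are made up.

```go
package main

import (
	"fmt"
	"math"
)

// wholeUploadSuccess returns the chance that a single-shot upload of the
// given number of segments completes, assuming each segment independently
// succeeds with probability perSegment.
func wholeUploadSuccess(perSegment float64, segments int) float64 {
	return math.Pow(perSegment, float64(segments))
}

func main() {
	// Hypothetical per-segment success rate, for illustration only.
	const perSegment = 0.99
	for _, n := range []int{1, 10, 100} {
		fmt.Printf("%3d segments: %.3f chance of finishing in one go\n",
			n, wholeUploadSuccess(perSegment, n))
	}
}
```

The more segments a single-shot upload has to get through, the more likely it is that at least one interruption forces a restart from the beginning.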

Suffice it to say, the current sweet spot for object sizes, if you’re developing an application on Storj, is a little under or up to 64 MB.

Keep in mind that if you have lots of smaller objects, a great strategy is to pack them together and keep track of their offsets for reads. When you download from Storj, you can start your download of an object at an arbitrary offset, and the library automatically calculates which stripes are needed and thus which ranges of which pieces are needed. If you have many tiny files, but have an index data structure that keeps track of which object they are in and what offset and length they have within that object, you can efficiently download just the small file you need through our Uplink library without downloading the full object. If you want sample code, my toy backup app https://github.com/jtolio/jam is designed this way: https://github.com/jtolio/jam/blob/4ce1c592202da556df1c408964730beebfbc6d76/backends/storj/storj.go
