File expansion + overhead much higher than expected

From reading up on StorJ, an expansion factor of 2.75 is given as the amount of extra data one has to upload to cover the redundancy that comes with the erasure coding. I would consider this fact a little hidden, as you really have to dig through the documentation to find it. However, even if you do find it, it’s far from the whole story of how much overhead comes from the distributed nature of StorJ.

In my testing, on upload, I’m seeing between 4 and 5x more data being sent than the size of the files themselves (as in, uploading a 100MB file will result in 400 - 500MB uploaded). From searching this forum, I understand that this is because StorJ initiates more uploads to storage nodes than are necessary, and then kills the extra uploads as soon as the target upload count is reached. So in the worst case, all uploads finish at nearly the same time, and a killed connection has already transferred the majority of its chunk.

I think I read that this is done to mitigate “long tail” latency, or something of the sort? It prevents the situation where I need to upload 80 chunks, I’m uploading to 80 nodes, but a single node has an issue and my upload is left hanging. By “over uploading”, a problematic node doesn’t matter, because I have other nodes already on the go.
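
If I understand that right, the pattern is basically “start more transfers than you need, keep the first N that finish, cancel the rest.” Here’s a minimal sketch of that idea in Python - the 110-started / 80-kept numbers are ones I’ve seen mentioned on this forum, not something I’ve confirmed, and the timing model is obviously made up:

import asyncio
import random

PIECES_STARTED = 110  # uploads kicked off (number I've seen quoted on this forum)
PIECES_NEEDED = 80    # once this many finish, the stragglers are cancelled

async def upload_piece(node_id: int) -> int:
    # Stand-in for a real piece upload; the sleep models node speed.
    await asyncio.sleep(random.uniform(0.1, 2.0))
    return node_id

async def upload_segment() -> list[int]:
    tasks = [asyncio.create_task(upload_piece(i)) for i in range(PIECES_STARTED)]
    finished: list[int] = []
    # Take results in completion order and stop once we have enough pieces.
    for fut in asyncio.as_completed(tasks):
        finished.append(await fut)
        if len(finished) >= PIECES_NEEDED:
            break
    # Cancel whatever is still running -- these are the "killed" connections
    # that may have already transferred most of their chunk.
    for t in tasks:
        t.cancel()
    return finished

print(len(asyncio.run(upload_segment())), "pieces kept")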

This same technique is used for downloading, where more chunks than required are downloaded from storage nodes, and as soon as I get the required 29, any not-yet-complete downloads are just killed. This also leads to more data being transferred than is necessary - 1.3x in my test.

Ok, so, one of the benefits I believed StorJ offered was S3-competitive download speeds, in part due to the globally distributed nature of the storage nodes, but that hasn’t been my experience at all. For instance, downloading a 1.4GB file from Google Storage (Standard, regional) I average 63.2MB/s. That same file downloaded from StorJ averaged 22.7MB/s. I thought I might be CPU limited, as I saw rclone (used for both tests, v1.55.1) spiking to 65% CPU used (another drawback to StorJ). The Google Cloud region was Oregon, so I used a VM in Amsterdam thinking this would give StorJ an edge, but the results were nearly identical.

Looking at StorJ’s “Common Use Cases”, pretty much anywhere performance is mentioned I have to wonder how accurate StorJ is being. For instance, for “Large File Transfer” it says “High-throughput bandwidth takes advantage of parallelism for rapid transit”. But, really though? Sure, it’s very parallelized, but you need to send 4 - 5x more data than if you used S3 or similar.

I’m curious what other people think about this. I feel this overhead is quite hidden, and could lead to a lot of disappointment for new users. I also wonder whether StorJ will have to consider allowing users to specify their own level of durability and performance. I’m convinced that files stored on StorJ are incredibly safe (satellites aside). Maybe that’s worth a 4-5x overhead to one customer. Whereas another might really want to take advantage of large file transfer speed, and could be allowed to configure their client to upload only the bare minimum number of chunks, understanding the risk they’re taking.

Thanks for reading. I think what’s happening here is really cool, and for whatever it’s worth, I’ve moved all my online backups to StorJ - so I’m not just here to be a hater.


There have been a number of people performing tests that I’m familiar with, and yes, when one connects directly to the network there is extra overhead. This doesn’t really limit your connection though; your connection limits the transfer… usually. Of course it can be difficult to test, as most people don’t have the hardware to actually run proper tests on the max speeds of Storj DCS.

On a 10Gbit uplink people have had speeds of 300+ MB/s, which, multiplied by a rough data expansion factor of 3x, comes out to 900MB/s, or about 72% of what a 10Gbit connection can handle.

However, at those levels your local SSD / CPU could easily become the bottleneck.

To solve this issue for the average customer, Storj Labs has created their gateway service, which handles communication with the network if you are unable to do so yourself.

However, if you compare customer speeds on the Storj DCS network to other services, it’s just not possible for non-CDN services (or others without geographical proximity to the customer) to allocate the same kind of bandwidth.

Because Storj DCS sends data from multiple scattered geographical points to a central location (the customer), there is a fundamental physical / geometrical advantage: it uses many connections coming in from all sides.

So, long story short:

With Storj DCS there is no throttling of data; you are limited by your hardware, the speed of the internet, and maybe the number of storage nodes near your geographical location.

With central datacenters your connection will be throttled, because if they have, let’s say, 400Gbit for the service in the datacenter near you, they cannot allow you to take 10Gbit or 100Gbit. They will have a preset or dynamically adjusted max speed, based on their traffic, or what they consider good enough that more speed isn’t really worth it, or because they don’t want to pay more for the service…
Bandwidth can be expensive.

More about the gateway here.

And a bit more on why Storj DCS is faster than you think.

Hello @chris233,
Welcome to the forum!

I think your numbers are about right: the expansion factor for uploads is 80/29 * 110/80, giving a 2.76 - 3.79 range, and 39/29 for downloads, giving a 1.0 - 1.34 range, in the case of the native connector. If you use a Storj-hosted S3 Compatible Gateway, the expansion factor on your connection would be 1.0 up and down, but at the Storj network level it would still be in the same ranges.

The 110/80 for uploads and 39/29 for downloads are designed to cut the long tail, as you clearly described.
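
For anyone who wants to check the arithmetic, a quick sketch (the parameter names are mine; the numbers are the ones above):

K = 29             # pieces needed to reconstruct a segment
N = 80             # pieces actually stored long-term
UP_STARTED = 110   # uploads started before the long tail is cut
DOWN_STARTED = 39  # downloads started before the long tail is cut

up_min = N / K               # extra uploads cancelled instantly: 2.76
up_max = UP_STARTED / K      # every started upload completes: 3.79
down_min = K / K             # only the needed pieces fetched: 1.0
down_max = DOWN_STARTED / K  # every started download completes: 1.34

print(f"upload expansion:   {up_min:.2f} - {up_max:.2f}")
print(f"download expansion: {down_min:.2f} - {down_max:.2f}")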

The parallel uploads and downloads usually have a benefit when you have wide bandwidth; the native connector is able to saturate it completely.
If you have an asymmetric channel (upstream lower than downstream, for example), the native connector may not play well, especially for uploads, if upstream is lower than 40Mbps.
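
As a rough illustration of why a thin upstream hurts the native connector (assuming the worst-case expansion from above):

EXPANSION_UP = 110 / 29  # worst-case native upload expansion (~3.79)

for upstream_mbps in (1000, 100, 40, 20):
    effective = upstream_mbps / EXPANSION_UP  # file data actually delivered
    print(f"{upstream_mbps:>4} Mbps up -> ~{effective:5.1f} Mbps of payload")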

So for these edge cases, and for S3 compatibility, we host an S3 Compatible Gateway. You can also use a Self-hosted S3 Compatible Gateway, but it should be on a server in your local network with good Internet bandwidth, because then the clients won’t have to deal with the 110 uploads and 39 downloads; the gateway will.


I dunno, I’ve been unimpressed by speed both on DCS and the S3 connector.
I’m on a 1Gbps leased line, and whilst I can saturate the upstream uploading to DCS, the downloads are around 60-100Mbps.
On the S3 connector I’m usually around the 100Mbit mark.

Could be something on my end, of course, but I’ve not noticed issues with any other connections.

SOMEthing still isn’t quite right.


Thank you @SGC and @Alexey for the replies. It seems the conclusion is that the connection or hardware on my end isn’t fast enough. I find that curious, and wonder who the customer is that StorJ is targeting.

I spun up a Google Compute VM in Ohio, 2 vCPUs, with an SSD-backed disk. Exact same result:

chris@storj:~/rclone-v1.55.1-linux-amd64$ ./rclone -P copy StorJ:/test/P3200922.MP4 ~
Transferred:   	    1.388G / 1.388 GBytes, 100%, 22.364 MBytes/s, ETA 0s
Transferred:            1 / 1, 100%
Elapsed time:       1m4.2s

So, I don’t know what more can be expected of a customer using StorJ. Where do I find a faster internet connection? Google says ingress bandwidth is limited to 20Gbps or 1,800,000 pps.

I also tried uplink on the same VM, which had a slower result:

chris@storj:~$ time ./uplink cp sj:/test/P3200922.MP4 P3200922.MP4
1.39 GiB / 1.39 GiB
[-----------------------------------------------] 100.00% 13.48 MiB p/s
Downloaded sj:/test/P3200922.MP4 to P3200922.MP4

real	1m45.887s
user	1m27.131s
sys	0m18.197s

I could come at this differently - what do you recommend I do to get faster performance? I’m willing to try it. My first thought is that these VM CPUs might be underpowered when it comes to reassembling and decrypting the incoming data. As for the pps limit of the Google VM, I don’t really know if that’s a high limit or not. But, what’s left to try?

I would try with more CPU, yes… I’ve been told that one can max out basically any CPU you can buy today.

As most home connections can download data much faster than upload, what you’re seeing may indicate that the issue is the speed at which individual storage nodes can send data.

Doing some really crude math, say a node has 5 Mbps with which to send you a 64 megabyte segment (5 Mbps being the minimum upstream bandwidth required of a node).

64 megabytes = 512 megabits, so 512/5 ≈ 102 - or about 102 seconds to upload.

Ok, so your beefy connection is downloading segments left and right, but, you can’t complete your download without the slowest of the top 29 nodes completing its transfer. If that 29th fastest node (the slowest of the top 29) is uploading at 5Mbps, then you’re waiting at least 102 seconds.

I guess StorJ tries to get around that by downloading from more than 29 nodes (39, Alexey said), and killing the connections once the first 29 pieces make it to you. But you’re still going to have a slowest node in those 29, and that is going to be the limiting factor in how fast you can complete your file transfer.
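
To put that same reasoning in code - a toy model where the piece size is my (apparently wrong, see below) 64MB assumption and the node speed distribution is a pure guess:

import random

K = 29                # pieces needed to finish the download
STARTED = 39          # downloads kicked off (the number Alexey gave)
PIECE_MEGABITS = 512  # assuming a 64MB piece, as in my math above

# Made-up spread of node upstream speeds in Mbps.
speeds = [random.uniform(5, 100) for _ in range(STARTED)]

# Time for each node to deliver its piece, fastest first.
times = sorted(PIECE_MEGABITS / s for s in speeds)

# The transfer completes when the 29th piece arrives, so the slowest
# of the 29 fastest nodes sets the overall time.
print(f"limited by the 29th-fastest node: {times[K - 1]:.1f}s")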

Erasure coding doesn’t exactly work like that, and I believe it’s a lot more pieces… more like 80, and you need 29 of them to restore the data.

So granted, there will be a slowest node, but with extra downloads running it’s effectively the best of several… Of course, maybe people just have crappy internet connections, and that is slowing down the network…

I never see my bandwidth fully utilized… I can do like 800Mbit both up and down, so either the bandwidth peaks and ends the transfer so fast that my netdata doesn’t display it correctly,
or there is some other bandwidth limitation somewhere…

Also, if you are in Europe, many of us are on fiber… so I doubt there will be the same download limitations… though I could imagine that in the States one might have difficulty getting enough bandwidth in some locations, because the infrastructure Storj lives on there is limited / antique.

When I get some time in a couple of weeks I’ll dig into this topic and get some proper tests done.
Not sure I’ve got enough CPU though… but we will see.


Most pieces on a storage node are smaller than 2.3MB (a max-size 64MB segment split into 29 required pieces). Definitely not 64 megabytes.
So your segment is downloaded from 29 nodes; if each erasure piece is 2.5MB and each node can upload at 5MB/s (many nodes have >50Mbps upstream), then every download from a node takes 0.5s, at an aggregate spike of ~145MB/s.
The bottleneck here should theoretically be your CPU. Even if the nodes are all only half as fast, you still get ~72MB/s.
If nodes upload at 5Mbps as you calculated, you’d indeed only get something like 20MB/s, which is rather slow.

If they are slower, you might eventually encounter a bottleneck caused by queued downloads instead of the parallel download of different pieces.
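
Running those same numbers as a quick sketch (piece size and per-node speeds as assumed above):

K = 29          # pieces fetched in parallel per segment
PIECE_MB = 2.5  # approximate erasure-piece size

for node_mbps in (50, 40, 5):          # per-node upstream in Mbps
    node_mb_s = node_mbps / 8          # Mbps -> MB/s
    aggregate = K * node_mb_s          # all 29 pieces stream at once
    piece_time = PIECE_MB / node_mb_s  # seconds per piece
    print(f"{node_mbps:>3} Mbps/node -> {aggregate:6.1f} MB/s aggregate, "
          f"{piece_time:.1f}s per piece")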

Developers, mostly. In many cases their concern would be speed from other cloud platforms or high-bandwidth servers, or the total speed of many connections from many endpoints. Those scenarios scale without limits due to the distributed nature.


It looks like this explains a lot of what I was seeing. Going with an “e2-highcpu-2” Google Compute VM, I saw 52.8 MB/s with the CPU peaking at 400% (hyperthreading?).

chris@storj:~/rclone-v1.55.1-linux-amd64$ ./rclone -P copy StorJ:/test/P3200922.MP4 .
Transferred:           1.388G / 1.388 GBytes, 100%, 52.816 MBytes/s, ETA 0s
Transferred:            1 / 1, 100%
Elapsed time:        27.2s

I had planned to investigate more, but it seems I’ve reached my bandwidth quota, so I’ll have to wait for that to be raised.

On the one hand this does feel like “So it is fast!”, but on the other I wonder what this overhead might imply. Do two StorJ downloads at 25MB/s require as much CPU as one download at 50MB/s? How is StorJ financially supporting the hardware for an S3 connector they don’t charge for, given the massive bandwidth and CPU it appears it will require?

To my original point, however, here’s more overhead that one wouldn’t really be aware of until they got elbow-deep in testing StorJ.

It should be similar.

Probably the same way they pay SNOs more than they charge customers at the moment.

True… The client requirements for CPU are kind of insane… I wouldn’t even run a gateway on my NAS; I’d just use the Storj-hosted one. Too much CPU usage…

It might not be all CPUs that have the same problems computing the encryption, or whatever it’s doing…

As for the gateway… well, it’s new… the first issue is getting it working rather than figuring out how to pay for it, so long as the cost isn’t unreasonable, of course…

And for direct customer access to the network, the customer is the one doing the encrypt/decrypt compute, so that’s not an issue Storj needs to deal with.

Also, there are specially designed CPUs or NICs with the ability to process these kinds of workloads at breakneck pace… much like how 28-core Intel Xeon Scalable chips are still 500% faster than a 64-core EPYC when dealing with data that’s AVX-512 compatible.

I don’t know enough about dedicated encrypt/decrypt CPU architecture to really speak much to it; I’m sure that’s a science unto itself…
But don’t assume that just because many CPUs have difficulty dealing with the data, there isn’t dedicated or specialized hardware that will just breeze through it like it was nothing.

I’ve got a couple of 11-year-old Xeons and was running Theta edge compute, where it does video encoding; additionally, I was doing edge compute on a 4-core 4th-gen Intel consumer CPU,
which is like 5 or 6 generations later than my Xeon 5630L’s.

The consumer CPU took like 15 minutes to process a task, and the Xeons did it in like a minute flat. I think the VM it was running on was also restricted to 4 cores on the Xeons, but it’s not always easy to keep the VMs from borrowing compute from other cores…

Assuming that a task is difficult for all hardware would be a mistake; feature sets or different architecture / cache can make a world of difference.

It would be interesting to know what type of hardware is best suited to handling the Storj network interaction.
