Errors and low upload speed via Uplink CLI

Hello everyone,

I’ve been experimenting with upload and download configurations for Storj in recent days. I’ve read the post on transfer optimization (Hotrodding Decentralized Storage) and followed most of the recommendations. However, I’ve run into some strange behavior that I’d like to ask about.

I’ve been running some benchmarks uploading different combinations of files from my PC to Storj. These include some general cases: many small files (200x1 MB), a few medium-sized files (10x1 GB), a single large file (1x10 GB) and a mixture of file sizes. My PC has 10 CPUs, 64 GB of memory, and an upload bandwidth of 40 Mbit/s.

I would like to ask about two specific phenomena I’ve observed:

1. Slow upload speeds

For the small file test I’ve configured both Rclone and Uplink in the same way, allowing 10 parallel transfers and a segment concurrency of 1 (since the files are all smaller than 64 MB). I know that uploads via Uplink are subject to the 2.68x expansion factor and I expected them to be significantly slower than uploads via Rclone through the S3 gateway.

But in practice, uploading 200 small files took not just 2.68x or even 5x but almost 20x longer! All the while, CPU use in Uplink was hovering around 20-30% with low memory use since the files are so small. The upload bandwidth usage was fluctuating a lot, going from full use (close to 40 Mbit/s) down to maybe 5-10 Mbit/s, then up again etc. I’d estimate the average bandwidth used was maybe ~15-20 Mbit/s – if that. I also noticed that performance was varying quite a bit between re-runs with the upload process sometimes taking around 12 minutes, other times over 15 minutes. I was not running any other uploads in the background that would affect the speed in such a significant way.

I cannot see any obvious bottleneck here, so I'm really at a loss as to why Uplink is so extremely inefficient. When uploading larger files, the difference in speed was not nearly as dramatic. Rclone effectively maxed out my upload connection, while Uplink was only ~30% slower, indicating that upload bandwidth was the main bottleneck there. I cannot verify whether the performance gap would be larger on a faster upload connection, but I suspect so.

2. Errors when uploading

I was also running into plenty of errors and issues when performing uploads via Uplink, especially those that took a long time and/or contained many files. There were warnings that parts could not be uploaded, that the necessary minimum number of segments was not reached, etc. It was so problematic that sometimes as much as 15% of the files were not uploaded correctly. This has never happened with Rclone: no matter what I upload via Rclone, as long as the process isn't crashing entirely for whatever reason, it never misses even a single file.

What is the reason for this and how can I alleviate it?

Thank you in advance!


Please share some sample upload commands for review. Please also note the uplink version in use (you can check it with uplink version).

Thanks for the suggestion. I’ve checked and it seems I’ve been using an old version of Uplink. So I tried again with the latest one I could find (v1.96.6). It didn’t have a significant impact. The upload bandwidth still fluctuates a lot and it took about the same time to upload the 200x1 MB files. One good thing is that the newer version didn’t miss any files this time, but I’m not sure if that’s a general conclusion since previously those errors also didn’t occur every time.

Now, I’m aware that uploading many small files isn’t efficient, especially using Uplink. But it is baffling to me that it takes Uplink more than 12 minutes where Rclone is done within 45 seconds.

As described above, I used transfers=10 and parallelism=1. So the command would look something like this:

uplink cp source_folder target_folder --recursive --transfers 10 --parallelism 1

Does increasing --transfers help? Could you share your rclone command as well? Is the rclone storj native backend fast too or only the s3 backend?

Very likely your network equipment (modem?) is getting overwhelmed by the number of connections.

Try enabling SQM on your gateway to mitigate this somewhat; otherwise you'll have to use an S3 gateway hosted somewhere with a low-latency, high-performance connection to the internet, e.g. on a VPS in the cloud. Or use Storj's hosted S3 gateways.
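
For example, an rclone remote against the hosted gateway could look roughly like this (credentials are placeholders; on older rclone versions provider = Other works as well):

[storj-s3]
type = s3
provider = Storj
access_key_id = PLACEHOLDER_ACCESS_KEY
secret_access_key = PLACEHOLDER_SECRET_KEY
endpoint = gateway.storjshare.io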


In all cases where a single object is smaller than 64 MiB (or at least 5 MiB), the upload is expected to be slow, because in these cases you will likely use inline segments, i.e. they are stored alongside the related metadata directly on the satellite (it's distributed too, but you won't have thousands of nodes as in the case of remote segments). In such cases it's better to use backup software instead. For example, with restic you may use the same rclone remote, but copying will be much faster due to bigger chunks.
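
A minimal restic-over-rclone sketch, assuming an existing rclone remote (remote, bucket and path names are placeholders):

restic -r rclone:storj-remote:backup-bucket init
restic -r rclone:storj-remote:backup-bucket backup ./small_files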

That’s the easiest one - rclone has retries, uplink doesn’t.

However, for a low-upstream connection it's better to use an S3 integration. It relies on server-side encryption, though, so a backup tool with built-in encryption or the crypt backend in rclone is highly recommended.
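
A crypt remote layered on top of an S3 remote (such as the one sketched above) might look roughly like this; names and the bucket path are placeholders:

[storj-crypt]
type = crypt
remote = storj-s3:my-bucket/backups
# store the obscured form, e.g. the output of: rclone obscure "your-crypt-passphrase"
password = OBSCURED_PASSPHRASE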


By encryption backend, do you mean uploading locally encrypted files or using rclone crypt? In that case, wouldn't those files be inaccessible to any tool other than rclone configured with the right encryption? That's of course more secure, but it seems to drastically reduce reusability across machines and development environments, especially if you need to handle yet another set of passphrases that has to be either strictly linked to a bucket or consistent across all buckets (like the Storj encryption passphrase). It seems tricky to manage and like a disaster waiting to happen: with many people uploading encrypted data, all it takes is for someone to accidentally use a different encryption password and all the data they ever uploaded becomes unusable.

On that note, do I understand correctly that the server-side encryption of the S3 gateway gives basically the same kind of connection/transfer that you would have using an AWS S3 bucket, Azure blob container, etc.? Isn't that already very secure? I mean, there are thousands of companies that store a lot of their data on S3, and I'm sure not many of them consistently use client-side encryption.

I understand the client-side encryption security aspect is one of the USPs of Storj, though.

restic, Duplicacy, Duplicati, Kopia, etc. - and that's not even the full list of backup tools with integrated encryption.
Or yes, rclone Crypt.
But I would suggest using a backup tool instead, especially if you have a lot of small files: most backup tools use encryption, packing and compression, and you can also back up only the difference since the last backup.

Yes, if you use encryption, you need to use the same tool and the same encryption passphrase on every machine, just as with Uplink.
For backups from different machines I would suggest Duplicacy; it's designed for exactly this case.
Different encryption passphrases are easily handled with separate access grants/S3 credentials for each case (and if we are talking about rclone, you may configure different remotes using the same API key and satellite URL but different encryption passphrases; the same goes for S3 credentials). I use this feature myself: in one bucket I store different sets of objects, each under its own encryption passphrase, e.g.

$ rclone lsd us:test
          -1 2000-01-01 07:00:00        -1 rclone
          -1 2000-01-01 07:00:00        -1 test

$ rclone lsd us1-gw-mt:test
           0 2000-01-01 07:00:00        -1 folder
           0 2000-01-01 07:00:00        -1 folder2
           0 2000-01-01 07:00:00        -1 share
           0 2000-01-01 07:00:00        -1 uplinks
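
For reference, such native remotes could be defined roughly like this in rclone.conf (values are placeholders; you could also use a single access_grant instead of the separate fields, and the satellite address depends on your account):

[us-set1]
type = storj
satellite_address = us1.storj.io
api_key = PLACEHOLDER_API_KEY
passphrase = passphrase-for-set-one

[us-set2]
type = storj
satellite_address = us1.storj.io
api_key = PLACEHOLDER_API_KEY
passphrase = passphrase-for-set-two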

If you use tools rather than the web UI (by the way, rclone has a GUI), then this will likely never happen - the access grant/S3 credentials are stored locally on the source device. If you mean the integrated encryption of such tools, then you simply won't be able to access the encrypted content without the encryption passphrase; the command will fail. However, you may use different encryption passphrases here too, again with a separate crypt remote for each passphrase.

Of course, you may also just use the native integration with its protocol-level encryption, and/or the S3 gateway without any external encryption, because it's secure enough - without your access key nobody can decrypt your access grant to get access to your content on Storj. It's just that with the native integration your access grant isn't stored on our servers at all, not even in encrypted form, unlike with S3.


Thanks, Alexey! I thought the access key is not stored on the S3 gateway at all? I also remember something about the API key being kept on the server but not the encryption keys, or something like it?
As thorough as the documentation is, it’s quite difficult for someone without prior knowledge of encryption concepts to keep track of what is being encrypted how and stored in what way. :laughing:

My concern about the encryption passphrases was less about infrequent backups from a local PC and more about a development environment with many people working on Storj buckets through various systems (incl. cloud VMs, Kubernetes, etc.). In that case, adding another layer of complexity on top by requiring another set of passphrases for file encryption (e.g., set up in rclone crypt) sounds like a solution that's tough to maintain and quite error-prone. Those additional encryption keys would probably again need to be user-/project-/bucket-specific, which seems impractical to me.

Storj access keys are at least hierarchically connected. So you can, at the very least, ensure that only keys derived from a certain passphrase (or a small number of high-level access keys) can be used on a specific bucket, for example. But if you have a completely separate system like rclone crypt in place and someone somewhere accidentally uses the wrong key for the local encryption, you can still guarantee access to the objects themselves, but their contents won't be recoverable.

That brings me to an idea:
Is there an easy (and reliable) way to generate hierarchical, or at least logically reproducible, encryption keys for local encryption like rclone crypt? The idea would be to allow local encryption, but instead of having to manually provide very specific passphrases, one could automatically generate an encryption key that is logically linked to the passphrase in such a way that it can be reproduced automatically.
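
Just to illustrate the idea, something along these lines (purely a sketch, not an existing Storj or rclone mechanism; names and values are made up):

# derive a reproducible per-bucket crypt password from one master passphrase (illustrative only)
MASTER="my-master-passphrase"
BUCKET="benchmark-bucket"
DERIVED=$(printf '%s' "$BUCKET" | openssl dgst -sha256 -hmac "$MASTER" | awk '{print $NF}')
# then use $DERIVED as the rclone crypt password for that bucket's remote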

The access grant is stored encrypted on the S3 gateway; this is explained in the linked article: Understanding Server-Side Encryption - Storj Docs
It's still not available to anyone unless they provide a decryption key - in this case, the Access Key from your S3 credentials. But this key is available only to you; we do not store it. So if you shared it with anyone, they could theoretically decrypt your access grant with it and get READONLY access to your shared data - unless you generated those S3 credentials with FULL access… And here we are. Please share only limited S3 credentials and/or access grants, and never give out your root access grant (with admin rights); use derived access grants instead, see share - Storj Docs.
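
For example, something roughly like this creates a limited, time-bounded share instead of exposing the root access grant (bucket, prefix and duration are just examples):

uplink share --readonly --not-after +720h sj://my-bucket/reports/
# add --register to also obtain S3 credentials for the hosted gateway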

Yes, it is. However, you know - security is a primary goal if you want to succeed (so that nobody can steal your work, you know…).
However, if you use the native integration it's pretty safe (as long as you track where the access goes in your company…). Using the S3 integration is usually simpler, but (there is always a "but", yeah?) it's less safe, because you transfer (register) your access grant to our servers. It's still stored encrypted, and you have the encrypt/decrypt key while we don't, but… Even though we comply with all the requirements, it's less safe than not transferring your access grant anywhere at all.