Got my Tardigrade invite. Decided to run a couple of speed tests

I got my Tardigrade invite today and decided to run some speed tests on the europe-west satellite. I used the amd64 uplink binary.

First I tried uploading a 913MB file from my file server. It took 2m31s, so about 48mbps, but then I saw that the CPU of that server was pegged at 100% on all cores. I guess this is the encryption, and since the CPU of the file server is quite old, this test is invalid. For comparison, the same server can copy the file over SSH at ~265mbps and over the internet at ~383mbps.
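For reference, the rough arithmetic behind that figure (assuming decimal megabytes; it shifts a little if you count in MiB):

# 913 MB uploaded in 2m31s (151 s), converted to megabits per second
awk 'BEGIN { printf "%.1f Mbps\n", 913 * 8 / 151 }'
# prints ~48.4 Mbps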

It took 24 seconds to delete the file though.

I then used another server with a much better CPU. This one can get 480mbps on SSH upload (limited by my internet connection) and 600mbps on download (most likely limited by the server on the other end).
It took 2m21s to upload the same file to Storj (~52mbps), 1m23s to download it back (~88mbps), and 22 seconds to delete it.
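For what it's worth, the timings above are just the wall-clock time of the uplink commands, roughly like this (the bucket and file names here are only examples):

time ./uplink_linux_amd64 cp ./bigfile.bin sj://testas/bigfile.bin    # upload
time ./uplink_linux_amd64 cp sj://testas/bigfile.bin ./bigfile.copy   # download
time ./uplink_linux_amd64 ls sj://testas                              # list
time ./uplink_linux_amd64 rm sj://testas/bigfile.bin                  # delete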

On upload the traffic was very “spiky”


(there was a “base load” of about 60mbps from other traffic)
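For anyone who wants to watch the same spiky pattern without a graphing tool, a crude sketch that samples the kernel's interface counters once per second (replace eth0 with your interface name):

IFACE=eth0
while true; do
   t1=$(cat /sys/class/net/$IFACE/statistics/tx_bytes)
   sleep 1
   t2=$(cat /sys/class/net/$IFACE/statistics/tx_bytes)
   echo "$(( (t2 - t1) * 8 / 1000000 )) Mbps"   # approximate TX throughput
done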
As I understand it, the file is split into pieces and those pieces are then uploaded to the nodes. There are probably a few fast nodes that receive their part quickly, and then I am left with the slower nodes, so the speed drops until enough nodes have their pieces.
If all pieces of the same file were uploaded at the same time instead of one after the other, it would probably be faster.
The download graph looked similar, with bigger gaps between pieces, but faster overall:

The 22 seconds to delete the file seemed too long. I figured it was probably because the file was large and split into many pieces, all of which needed to be deleted, so I uploaded a smaller file (the uplink binary in fact, 16MB). Upload and download speeds were similar, but it took 3 minutes and 16 seconds to delete it. That's far too long and would mean the network is not as usable for storing many small files.

I tried deleting the 16MB file after running the ls command, and this time it took 4 seconds. Something got cached?

It took 7 minutes to run the ls command. I ran it twice - once with an empty bucket, once with one file. Both times it took a bit over 7 minutes. That's long. Why would it take so long?

TL;DR: upload/download speed could be better, but that should improve with more fast nodes and parallel transfers of all pieces. Deleting a file sometimes takes too long, and listing the contents of a bucket takes far too long.

Still, it’s quite good.


Interesting results. We should repeat this once the new version is out :slight_smile:


Very interesting results indeed @Pentium100, thx for these details.
Would love to hear about other test results.

Sounds like there's still room for improvement, especially with regard to listing and deleting times. Does anyone have any idea why it would take so much time?

I know that it will be faster with the new release :slight_smile:


@littleskunk alright, good news ^^

However, I was thinking: let's imagine for a minute that a colossal client (like Steam/Valve, for instance) were to decide to put their data on the Tardigrade network. Currently they provide incredible bandwidth (they easily max out my 300mbps download connection).

Could the STORJ network compete with such speeds one day? Is this foreseen?

That is a good question for the town hall meeting.

This is a bit nitpicky, but everywhere in your post where you mention pieces, you're actually talking about segments. The terminology gets a bit tricky, but in the Storj network "pieces" means something very specific and different from what you are describing. This gif always helps me. (Ignore the term farmer, it's an old gif.)

Love the stats you're sharing though. And I agree that dealing with multiple segments at the same time could significantly increase transfer speeds, as long as the encryption and erasure coding aren't bottlenecked by your own system resources. Since each segment is stored on a different set of nodes, doing multiple segments in parallel would also increase the number of nodes you're downloading from simultaneously, and therefore wouldn't lead to a bottleneck on the other end either.
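To put rough numbers on it: if I recall correctly, the default maximum segment size is 64MB (treat that as my assumption, not an official figure), so a file like the 913MB one in these tests would be cut into roughly 15 segments, each erasure coded into pieces for its own set of nodes:

# rough segment count for a 913MB file, assuming a 64MB max segment size
awk 'BEGIN { printf "%d segments\n", int((913 + 63) / 64) }'
# prints 15 segments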


@BrightSilence thanks for the gif. Yeah, I mixed up the names of things.
@Pac - yep, Steam is fast, I have seen sustained download speeds of 60MB/s or more.
@littleskunk - I’ll be sure to run these tests on the new version.


@littleskunk: Good point, I’ll add it to the list :slight_smile:

The same test on the new version would be great :thinking:

Does the new release perform with 100% of network performance?

Depends on what 100% means. We could say the current performance is 100% and the new version will be faster. We could also say 100% is the possible maximum and ask how far away we are from that. The answer to the second question is that we are clearly not there yet. We are working on additional performance improvements that are not in the v0.30.5 release.


@Pentium100 can you please run your test again with v0.30.5 and see how fast it is now?


I ran the same tests with the new version of the uplink binary.

Upload speed for the big file is essentially the same as before, finishing the upload in 2 minutes 29 seconds, with the traffic graph looking the same.
Same with download of the big file.
I guess this wasn’t the change made in the new version.

ls now takes ~1 second (as opposed to 7 minutes). Now it’s usable.
rm of the 16MB file took 7 seconds, but sometimes it takes ~2 seconds.
rm of the big file took 25 seconds; the second time it took 14 seconds.

Next I tried this with a bunch of small files (142 files, a few MB each, total size 372MB), primarily to find out how long it would take to list them all. I uploaded all those files in parallel (roughly as sketched below), so the speed was more even and at one point I even maxed out my connection. The upload took ~40 seconds, so ~66mbps average speed. It felt like it took longer to actually start the transfers compared to uploading one file at a time.
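Roughly what I mean by "in parallel": just start one uplink process per file in the background and wait for all of them (a sketch, not the exact commands I ran):

for f in *.ts; do
   ./uplink_linux_amd64 cp "$f" sj://testas &
done
wait   # block until every background upload has finished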

ls still took about 1 second. Nice. Deleting those files took about 30 seconds, so I guess the delete requests are processed one by one, even if they are submitted at the same time.

Then I tried with 100 very tiny files (a few bytes each). It took 16 seconds to upload, 1 second to ls and 1 second to delete.
Then with 500 of those same tiny files. 71 seconds to upload, 1.1s to ls and 1 second to delete (well, one process stayed for a few more seconds).
One such file gets uploaded in 1.7 seconds and deleted in 1 second.
The upload time may be down to my server (since it does not have 500 cores), so it does not matter as much; I was mostly interested in the list and delete times.
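In case anyone wants to reproduce the tiny-file test, files like that can be generated with something along these lines (sizes and names are arbitrary):

# create 500 files of 16 random bytes each
for i in $(seq 1 500); do
   head -c 16 /dev/urandom > tiny_$i.bin
done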

So, deleting lots of small files is faster than deleting lots of bigger files.

However, I got an error I have not seen before:

2020-01-25T01:05:53.230+0200    FATAL   Unrecoverable error     {"error": "metainfo error: metainfo error: value changed: \"cce7d073-3a6c-4b00-b250-b5b4ba7066b2/s0/testas/\\x02]|\\xea\\xf1<\\xfbk!5壍jD\\xd4M\\xc4Y-\\xd0U{\\x1d\\x8bUw\\xefSBئ\\x84+\\x8e\\xf8\\xef\\x1aY[7O\\xa4\\x8aְ\"", "errorVerbose": "metainfo error: metainfo error: value changed: \"cce7d073-3a6c-4b00-b250-b5b4ba7066b2/s0/testas/\\x02]|\\xea\\xf1<\\xfbk!5壍jD\\xd4M\\xc4Y-\\xd0U{\\x1d\\x8bUw\\xefSBئ\\x84+\\x8e\\xf8\\xef\\x1aY[7O\\xa4\\x8aְ\"\n\tstorj.io/uplink/metainfo.(*Client).Batch:1118\n\tstorj.io/uplink/storage/streams.(*streamStore).upload:337\n\tstorj.io/uplink/storage/streams.(*streamStore).Put:92\n\tstorj.io/uplink/storage/streams.(*shimStore).Put:49\n\tstorj.io/uplink/stream.NewUpload.func1:53\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

I tried uploading the same file several times and got the same error. Interestingly, all those files are essentially the same - they are small segments of a video (the way you would have for HLS) and have very similar names (rec_1471110351.ts, with only the last 3 digits differing).

I renamed that file and it was uploaded successfully. Strange.

I tried using the put command and the upload still failed. Apparently, the uplink binary or the satellite really hates that file name, since I can upload that content under a different name, but cannot upload another file with this name.

OK, this could have happened because I interrupted the upload of that file with Ctrl+C (I am not sure if I did though) and it left the bucket in some inconsistent state, where the file does not show up with "ls" or "rm" but is counted as existing when trying to upload it. I tried to reproduce it with another file a couple of times but could not - if I interrupt the transfer, the file gets uploaded successfully the next time.
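For the curious, this is roughly how I tried to reproduce it (the file name here is just a placeholder):

# start an upload, interrupt it partway through, then retry the same name
./uplink_linux_amd64 cp ./rec_test.ts sj://testas/rec_test.ts &
UPLOAD_PID=$!
sleep 2                    # let the transfer get going
kill -INT "$UPLOAD_PID"    # same signal as pressing Ctrl+C
wait
./uplink_linux_amd64 cp ./rec_test.ts sj://testas/rec_test.ts   # retry; this worked every time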

TL;DR - the new version has improved ls and rm to the point of being usable. Delete time seems to depend on the size of the files, even if the file should fit into one segment and the delete operations are done in parallel. Deleting lots of very tiny files in parallel does not take longer than deleting one such file, but deleting lots of 2-4MB files does take longer than deleting one such file, though not as long as deleting them one by one.


Thank you very much. I promise I will write a few more details later. Good to see that the ls improvement worked. Not so good that delete is slower than I would expect. I am currently working on the next release and haven't started on the changelog. Let me finish that and then come back to you.


Thanks for the quick reply. I am not under any time pressure or anything, just playing really, out of curiosity.

Also I am interested in that error - it’s weird that the uplink or satellite really hates one particular file name :slight_smile:

I only tested with a single 3.4GB file and no other files in my bucket on satellite europe.

Deleting a 3.4GB file took 40s in my case.
ls with only one file was basically instant; I didn't time it.

I reported my other findings here already: Production readyness?

The download speed maxed out at 25MB/s but was consistent over time according to netdata, even though the CLI at times showed it dropping even below 1MB/s.

The upload was limited to 5MB/s by my internet connection but was also consistent over time according to netdata, although the CLI at times dropped below 100kB/s.

The sad part was that my first upload failed because a piece could only be uploaded to 75 nodes instead of at least 80… that should never happen.

Yes, it's super weird. It's a bug we call zombie segments. It's been fixed and will be rolling out to production in the next few days.

We are aware the upload traffic can be spiky. I’ll flag this to the performance team and make sure there is a ticket for it.


Since there was a new version released I decided to run the tests again to see if anything changed.

Empty bucket:
ls: 1 second
deleting a non-existent file: 1 second (well, I get an error as I should)

Single 17MB file:
Upload: 2.4 seconds
ls: 1 second
delete: 1.49 seconds

Single 913MB file:
Upload: 1 minute 33 seconds
ls: 1 second
Download: 57 seconds
delete: 3.2 seconds

It seems that the upload and download speeds have improved, and so has the delete speed.

142 files (a few megabytes each) at once:
I hit the connection limit, so I inserted a "sleep 0.05" between uploads to reduce the rate.
Hmm, the limit appears not only to be 30 requests per second (it is actually enforced over an interval shorter than a second, so I cannot open 29 connections at the same time); there also seems to be a similar limit on active requests.
I chose to insert the "sleep" between the starts of the uplink processes and increased the interval until I got no more errors about too many connections.

For reference, this is the script I used:

#!/bin/bash
# Upload every .ts file in the current directory in the background,
# pausing between starts to stay under the connection rate limit.
for f in *.ts; do
   ./uplink_linux_amd64 cp "$f" sj://testas &
   sleep 0.2
done
wait   # let the background uploads finish before the script exits

changing the command and the sleep value as needed.

OK, reducing the file count to 29 (76MB total):
After increasing the interval between new connections to 0.1 seconds, all 29 files were uploaded successfully. The interval only needs to be 0.05s for all deletes to succeed.
Upload: 8 seconds (76mbps)
Download: 4 seconds
ls: 1 second
Delete: 4 seconds.

Back to the 142 files:
I had to increase the sleep to 0.2s and managed to get 137 of the 142 files uploaded in 30 seconds.
Deleting them (with a 0.15s sleep) took 40 seconds. That's strange.

OK, it looks like the simultaneous connection limit is enforced differently for uploads and deletions. I had 45 delete processes running at the same time without getting a "too many requests" error, while uploads didn't get that high and still produced a few errors.

However, deletion still seems to do something in series, or there may be some kind of rate limit, since deleting the files feels faster at first (I did not measure this) and then slows down. Maybe the satellite just rejects uploads when there are too many connections, while keeping the delete requests waiting in line?

Looks like I need a better script to test this further - one that can limit both simultaneous connections and the rate of new connections. I'll try to put something together over the weekend.
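Something like this is what I have in mind - limit both how often new uplink processes start and how many run at once (a rough sketch; MAX_JOBS and the sleep values would need tuning against the actual limits):

#!/bin/bash
MAX_JOBS=25      # maximum simultaneous uplink processes
RATE_SLEEP=0.05  # pause between starting new processes
for f in *.ts; do
   # wait until we are below the concurrency cap
   while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do
      sleep 0.1
   done
   ./uplink_linux_amd64 cp "$f" sj://testas &
   sleep "$RATE_SLEEP"
done
wait   # let the remaining uploads finish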


the rate limiting is actually a bit annoying…