Abysmal Upload Success Rate

Yeah, that is a fair point. I was doing a bit of math on whether the regular average drive latency could have something to do with it, in some cases…

With a regular 7200 rpm drive the average rotational latency is 4.17 ms, so basically near zero, or up to ~9.4 ms worst case, which is roughly 1/100th of a second, and thus 1/100th of a second of bandwidth lost per access.

And if we assume my latency were nanoseconds rather than milliseconds, then during that ~1/100th of a second my 400 Mbit connection could move about 480 kB. Each file is split into 91 pieces… but for ease of math let's call it an even 100.
So for any file under 48 MB, the piece would be fully uploaded before the other guy's drive could even access it… :smiley:
In theory… I think I might need a better L2ARC xD

Of course, on average the gap would be only half that size, so ~24 MB, using the average 7200 rpm access time instead of the worst case.
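To make the back-of-the-envelope numbers explicit, here is the same math as a quick snippet (just a sketch using the rough assumptions above: ~10 ms worst-case access time, a 400 Mbit/s link, and the 91-piece split rounded to 100):

# rough numbers only: ~10 ms worst-case HDD access, 400 Mbit/s uplink,
# file split into ~100 pieces (really 91, rounded for ease of math)
awk 'BEGIN {
  bytes_per_sec = 400 * 10^6 / 8        # 400 Mbit/s uplink in bytes/s
  access_s      = 0.010                 # ~10 ms HDD access time
  pieces        = 100

  kib_per_gap = bytes_per_sec * access_s / 1024    # data movable during one seek
  file_mib    = kib_per_gap * pieces / 1024        # file size where each piece fits in that gap

  printf "data movable during one seek : ~%.0f KiB\n", kib_per_gap
  printf "break-even file size         : ~%.0f MiB\n", file_mib
}'

Which lands on roughly the same ~480 KiB per piece and ~48 MB worst-case break-even as above (half that for the average-latency case).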

Yes, the current node is running on the 2 TB Intel NVMe.

I did, however, find out that by using a different ISP with lower ping (ms) to the uploading server, the upload success rate went up a bit.

ISP #1 :

root@server030:/disk103/# ping 95.217.161.205 -c 5
PING 95.217.161.205 (95.217.161.205) 56(84) bytes of data.
64 bytes from 95.217.161.205: icmp_seq=1 ttl=50 time=46.3 ms
64 bytes from 95.217.161.205: icmp_seq=2 ttl=50 time=46.4 ms
64 bytes from 95.217.161.205: icmp_seq=3 ttl=50 time=54.9 ms
64 bytes from 95.217.161.205: icmp_seq=4 ttl=50 time=47.2 ms
64 bytes from 95.217.161.205: icmp_seq=5 ttl=50 time=46.2 ms

--- 95.217.161.205 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 46.287/48.250/54.959/3.380 ms

Using an ISP with ~20 ms lower ping to the uploading server:

ISP #2 :

root@server030:/disk103/# ping 95.217.161.205 -c 5                             
PING 95.217.161.205 (95.217.161.205) 56(84) bytes of data.
64 bytes from 95.217.161.205: icmp_seq=1 ttl=51 time=30.7 ms
64 bytes from 95.217.161.205: icmp_seq=2 ttl=51 time=28.4 ms
64 bytes from 95.217.161.205: icmp_seq=3 ttl=51 time=28.1 ms
64 bytes from 95.217.161.205: icmp_seq=4 ttl=51 time=28.1 ms
64 bytes from 95.217.161.205: icmp_seq=5 ttl=51 time=28.2 ms

--- 95.217.161.205 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 28.116/28.744/30.783/1.031 ms


The upload success rate for 25-03-2020 went from ~34% to ~41% just by shaving ~20 ms off the ping.

//Th3van


l2arc only helps with downloads :wink:
for uploads you’d need a ZIL :smiley:
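Attaching either one is just a couple of zpool commands — sketch only, "tank" and the device paths are placeholders for your own pool and spare SSD partitions:

# L2ARC (read cache) - only helps reads/downloads
zpool add tank cache /dev/disk/by-id/your-ssd-part1    # placeholder device

# SLOG, i.e. a dedicated ZIL device - only helps synchronous writes
zpool add tank log /dev/disk/by-id/your-ssd-part2      # placeholder device

# confirm the cache and log vdevs show up
zpool status tank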

But yeah… one of my nodes is on my main ZFS drive and there was not much difference despite the SSD cache for read/write. It has been full for 3 months now though, so I can't compare anymore.

My upload success rate is 38% and my ping times are worse than yours:

kevin@droidserver:~$ ping 95.217.161.205 -c 5
PING 95.217.161.205 (95.217.161.205) 56(84) bytes of data.
64 bytes from 95.217.161.205: icmp_seq=1 ttl=50 time=111 ms
64 bytes from 95.217.161.205: icmp_seq=2 ttl=50 time=55.2 ms
64 bytes from 95.217.161.205: icmp_seq=3 ttl=50 time=70.1 ms
64 bytes from 95.217.161.205: icmp_seq=4 ttl=50 time=98.6 ms
64 bytes from 95.217.161.205: icmp_seq=5 ttl=50 time=337 ms

Your gear should win every damn time…
now even your ping is better than mine XD

PING 95.217.161.205 (95.217.161.205) 56(84) bytes of data.
64 bytes from 95.217.161.205: icmp_seq=1 ttl=46 time=34.4 ms
64 bytes from 95.217.161.205: icmp_seq=2 ttl=46 time=33.10 ms
64 bytes from 95.217.161.205: icmp_seq=3 ttl=46 time=33.7 ms
64 bytes from 95.217.161.205: icmp_seq=4 ttl=46 time=37.4 ms
64 bytes from 95.217.161.205: icmp_seq=5 ttl=46 time=34.3 ms

I do have a ZIL; I don't think ZFS will let me run an L2ARC without one…
Or maybe I just set it up wrong at first and it didn't work, but I dunno… not very well versed in ZFS.
A full cache is a good thing… it means your system is ready to deliver the most recently accessed data…

What are the success rate stats for the ZFS node?

My current stats since March 22:

========== AUDIT ==============
Critically failed:     0
Critical Fail Rate:    0.000%
Recoverable failed:    0
Recoverable Fail Rate: 0.000%
Successful:            3823
Success Rate:          100.000%
========== DOWNLOAD ===========
Failed:                462
Fail Rate:             0.687%
Canceled:              438
Cancel Rate:           0.652%
Successful:            66305
Success Rate:          98.661%
========== UPLOAD =============
Rejected:              0
Acceptance Rate:       100.000%
---------- accepted -----------
Failed:                19
Fail Rate:             0.015%
Canceled:              37449
Cancel Rate:           28.979%
Successful:            91761
Success Rate:          71.006%
========== REPAIR DOWNLOAD ====
Failed:                0
Fail Rate:             0.000%
Canceled:              1
Cancel Rate:           0.006%
Successful:            18118
Success Rate:          99.995%
========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              642
Cancel Rate:           9.513%
Successful:            6107
Success Rate:          90.488%
========== DELETE =============
Failed:                0
Fail Rate:             0.000%
Successful:            28041
Success Rate:          100.000%

PING 95.217.161.205 (95.217.161.205) 56(84) bytes of data.
64 bytes from 95.217.161.205: icmp_seq=1 ttl=45 time=128 ms
64 bytes from 95.217.161.205: icmp_seq=2 ttl=45 time=121 ms
64 bytes from 95.217.161.205: icmp_seq=3 ttl=45 time=122 ms

it does allow it.
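You can see whether the cache device actually gets any hits with something like this (sketch; "tank" is a placeholder for your pool name, and the counters assume ZFS-on-Linux):

# per-vdev activity, including the cache (L2ARC) and log (ZIL/SLOG) devices, every 5 s
zpool iostat -v tank 5

# raw L2ARC hit/miss counters exposed by ZFS-on-Linux
grep -E '^l2_(hits|misses)' /proc/spl/kstat/zfs/arcstats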

that node is full with (almost) no traffic… so no stats…

Super weird calculation there: you take disk read latency into account but completely skip over the internet latency between the uplink and the storage node. The ~4 ms you mentioned is pretty insignificant compared to the latency introduced by the internet connection.
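Putting the thread's own numbers side by side (the ~48 ms average RTT from the ISP #1 ping above vs the 4.17 ms rotational latency):

# network RTT vs disk rotational latency, using the numbers already posted in this thread
awk 'BEGIN { printf "network RTT is ~%.0fx the disk latency\n", 48.250 / 4.17 }'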

I really hope I can expand my current node when it gets to that point…
Shouldn't people still be getting data from it and deleting?

Yeah, theory doesn't always work out in the real world, heh. Though everybody has internet latency… even if it's different from place to place and ISP to ISP… Many factors to take into account; I was just doing a bit of back-of-the-envelope math to see if the numbers were really relevant or not, in regard to upload speed versus user file sizes…

That latency and success rate does kinda make me wonder even more, lol. Think I'm just going to give up for tonight, lol.

I'm running 12 nodes (12x 8 TB HDD) on an Ubuntu VM. All of them are just ext4-partitioned drives, with some connected over USB 3 (external drives). The internal drives have much better performance, but it makes no difference for Storj. I do have some system-level optimizations that I hope are helping with the success rate.

I suppose the next step would be an evaluation based on latency to the different satellites; one comparison could be a node with bad latency to many of the satellites against a node located with low latency to many sats.

The battle for next time, lol… anyone got a link to a nice script for measuring latency?
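Worst case, a quick-and-dirty loop like this would do (just a sketch; 95.217.161.205 is only the test IP from earlier in the thread — swap in whatever satellite or node addresses you want to compare):

#!/bin/bash
# quick latency check against a list of hosts
HOSTS="95.217.161.205"    # add more hostnames/IPs, space separated

for host in $HOSTS; do
    # 5 quiet pings, then pull the average out of the rtt summary line
    avg=$(ping -c 5 -q "$host" | awk -F'/' '/^rtt/ {print $5}')
    printf "%-20s avg %s ms\n" "$host" "${avg:-unreachable}"
done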

The only latency that matters is the one from the customer to the nodes. Latency to the satellite is not important at all. But since customers are likely to choose a satellite closer to them, there will be some correlation between satellite latency and success rate (just because customers are choosing satellites closer to themselves).