Download Timeouts - what can I do?

Hi there,

For about a week I have been experiencing download timeouts from several satellites (once a day, no specific satellite), and my download success rate is only around 93%.

I have the feeling my provider has changed something. Everything else seems to be fine, no other errors or warnings in the logs. Repairs 100%, audits 100%.

Any idea where to look or what to do? I assume permanently losing 7% of downloads directly means 7% lower download payouts. That’s not a small amount.

Cheers

How old is your node and what kind of setup do you have?

Rpi4, 1.1 TB / 10 TB, 7 months or so.
50 Mbit DSL

on the 25th of dec my download success rates suffered a bit… not sure why.
probably either storage latency or network bandwidth limitations.

i would suspect you to be affected by the same…

week avg graph in proxmox for my 16TB node

my success rates have improved to 99%+ so it’s most likely down to storage latency. i did a rather extensive upgrade on the 23rd, so my drop is most likely partly related to that, along with the peak in egress at the time.

is your disk CMR or SMR?
and does your drop in success rates persist over longer periods?

and yes drops in egress would mean less earnings… not exactly 1 to 1 tho…
the avg earnings for a new node per TB stored per month are about $4.
$1.5 of that is the base earnings for the stored TB, the remaining $2.5 is the earnings from egress, of which you would be down 7%.

so with 1.1TB stored you would earn about $4 to $4.4, and the 7% drop in egress would be $0.15 to $0.19 less, so a drop of about 4% to 5%
roughly…
if it persists it is ofc something to worry about, but really depends on if it persists…
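
The estimate above can be written out as a small calculation. This is only a sketch using the rough per-TB figures from this post; the function name is made up for illustration, and the assumption that egress earnings shrink in direct proportion to failed downloads is a simplification, not Storj's actual payout formula.

```python
def estimate_payout_drop(stored_tb, storage_usd_per_tb=1.5,
                         egress_usd_per_tb=2.5, egress_loss=0.07):
    """Rough monthly payout impact of losing a share of egress.

    The default rates are the rule-of-thumb averages quoted in the thread,
    not official prices. Assumes egress earnings scale linearly with the
    download success rate.
    """
    storage = stored_tb * storage_usd_per_tb   # paid for data at rest
    egress = stored_tb * egress_usd_per_tb     # paid for downloads
    total = storage + egress
    lost = egress * egress_loss                # only the egress part shrinks
    return total, lost, lost / total


total, lost, share = estimate_payout_drop(1.1)
print(f"expected ${total:.2f}/month, lost ${lost:.2f} ({share:.1%})")
```

With 1.1 TB stored this lands inside the 4% to 5% range estimated above, because only the egress part of the payout is affected.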

CMR. Since Dec 23rd or so.



2021-12-27T23:59:04.231Z        ERROR   contact:service ping satellite failed   {"Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "attempts": 1, "error": "ping satellite: rpc: dial tcp: lookup saltlake.tardigrade.io on 192.168.178.1:53: read udp 172.17.0.4:56941->192.168.178.1:53: i/o timeout", "errorVerbose": "ping satellite: rpc: dial tcp: lookup saltlake.tardigrade.io on 192.168.178.1:53: read udp 172.17.0.4:56941->192.168.178.1:53: i/o timeout
\tstorj.io/common/rpc.TCPConnector.DialContextUnencrypted:114
\tstorj.io/common/rpc.TCPConnector.DialContext:78
\tstorj.io/common/rpc.Dialer.dialEncryptedConn:220
\tstorj.io/common/rpc.Dialer.DialNodeURL.func1:110
\tstorj.io/common/rpc/rpcpool.(*Pool).get:105
\tstorj.io/common/rpc/rpcpool.(*Pool).Get:128
\tstorj.io/common/rpc.Dialer.dialPool:186
\tstorj.io/common/rpc.Dialer.DialNodeURL:109
\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:124
\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:95
\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87
\tstorj.io/common/sync2.(*Cycle).Run:152
\tstorj.io/common/sync2.(*Cycle).Start.func1:71
\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2021-12-28T03:59:04.248Z        ERROR   contact:service ping satellite failed   {"Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "attempts": 1, "error": "ping satellite: rpc: dial tcp: lookup saltlake.tardigrade.io on 192.168.178.1:53: read udp 172.17.0.4:39889->192.168.178.1:53: i/o timeout", "errorVerbose": "ping satellite: rpc: dial tcp: lookup saltlake.tardigrade.io on 192.168.178.1:53: read udp 172.17.0.4:39889->192.168.178.1:53: i/o timeout
\tstorj.io/common/rpc.TCPConnector.DialContextUnencrypted:114
\tstorj.io/common/rpc.TCPConnector.DialContext:78
\tstorj.io/common/rpc.Dialer.dialEncryptedConn:220
\tstorj.io/common/rpc.Dialer.DialNodeURL.func1:110
\tstorj.io/common/rpc/rpcpool.(*Pool).get:105
\tstorj.io/common/rpc/rpcpool.(*Pool).Get:128
\tstorj.io/common/rpc.Dialer.dialPool:186
\tstorj.io/common/rpc.Dialer.DialNodeURL:109
\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:124
\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:95
\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87
\tstorj.io/common/sync2.(*Cycle).Run:152
\tstorj.io/common/sync2.(*Cycle).Start.func1:71
\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

The ports look strange, don’t they? 53, 39889, 56941…
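
For reference, port 53 is the standard DNS port, and 39889 / 56941 are just ephemeral source ports picked for each query. Read that way, the error text says the name lookup for the satellite sent to the resolver at 192.168.178.1 timed out. A minimal sketch for pulling such DNS timeouts out of a log and counting them per day might look like this; the regular expression is written against the two log lines above only, and the `sample` line is a shortened copy of the first one.

```python
import re
from collections import Counter

# Matches the resolver error text as it appears in the log lines above.
DNS_TIMEOUT = re.compile(
    r"lookup (?P<host>\S+) on (?P<resolver>[\d.]+):(?P<port>\d+): "
    r"read udp \S+->\S+: i/o timeout"
)


def dns_timeouts(lines):
    """Yield (date, host, resolver) for each line with a DNS lookup timeout."""
    for line in lines:
        match = DNS_TIMEOUT.search(line)
        if match:
            date = line.split("T", 1)[0]  # "2021-12-27T23:59..." -> "2021-12-27"
            yield date, match["host"], f'{match["resolver"]}:{match["port"]}'


# Shortened copy of the first log line (stack trace and satellite ID omitted).
sample = [
    '2021-12-27T23:59:04.231Z ERROR contact:service ping satellite failed '
    '{"error": "ping satellite: rpc: dial tcp: lookup saltlake.tardigrade.io '
    'on 192.168.178.1:53: read udp 172.17.0.4:56941->192.168.178.1:53: i/o timeout"}',
]
per_day = Counter(date for date, _, _ in dns_timeouts(sample))
print(per_day)
```

If most failures turn out to be lookups against the same resolver, that would point at DNS on the router rather than at the satellites, though two log lines are too few to say for sure.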

yeah you shouldn’t get those “ping satellite failed” errors

Meaning the network pings themselves are not correct, or the connectivity is in trouble?

maybe regional blocks / firewall settings… then your online score should also be dropping…
checked my logs i don’t have any failed satellite pings for 27th and 28th.

seems weird, maybe @Alexey has some insights.
i’m not really that adept at node troubleshooting.

Hmm. 99.85% +/-. Had small OS update downtimes. And a single satellite ping timeout per day might not have a huge effect on the numbers. Don’t know. I think it has something to do with network latency, at least on my regional network.

a half decent easy network test is to just run a continuous ping against something like microsoft.com or storj.io or whatever.

when you cancel it with ctrl+c it will show you the number of packets lost and their times, which can be useful to tell if there is any indication of network/internet issues.

from windows you do search or run cmd
and then in the command prompt you run ping -t www.microsoft.com
and double ctrl +c to end it.

let it run for many hours or a day

from linux cli it’s just ping www.microsoft.com
and again ctrl+c to end it

if the network is stable you should have pings of less than 60ms or 100ms, but could be as low as just a few ms…
it’s not too unusual that a packet is lost, but you should have less than 1% losses
most likely more like 0.1% or less
depending on how long you run it and if the network / internet works more or less correctly

you could also try the particular satellites that seem to throw errors, but first we really want to know if the internet seems to work correctly. if that is fine, then try pinging the particular satellite or satellites for an extended period.

there also exists a ton of different software for such things, but i just like the default ones because i rarely use it on the same computers, saves me the trouble of installing stuff every place i go test.

and usually ping is more than enough to make a basic evaluation.

First input. 0.856% seems to be a lot. What do you think?

--- www.microsoft.com ping statistics ---
1519 packets transmitted, 1506 received, 0.855826% packet loss, time 1981ms
rtt min/avg/max/mdev = 13.242/18.913/82.478/11.021 ms
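
The loss figure in a summary like this is simply (transmitted - received) / transmitted. A small sketch for recomputing it from saved ping output, assuming the Linux summary format shown above (Windows prints a differently worded summary, which this does not handle); the function name is just illustrative:

```python
import re

SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) received")


def packet_loss_percent(ping_output):
    """Return packet loss in percent from a Linux ping summary, or None."""
    match = SUMMARY.search(ping_output)
    if not match:
        return None
    sent, received = map(int, match.groups())
    if sent == 0:
        return None
    return 100.0 * (sent - received) / sent


output = """--- www.microsoft.com ping statistics ---
1519 packets transmitted, 1506 received, 0.855826% packet loss, time 1981ms"""
loss = packet_loss_percent(output)
print(f"{loss:.3f}% loss")
```

Here that is 13 lost packets out of 1519, which matches the percentage ping reports.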

yeah that is a bit higher than what would be expected in most cases.
certainly seems like you have a network issue.
next you could try to ping your router / gateway to see if it’s local network issue or router / internet related.

it’s usually something like 192.168.1.1 or 192.168.0.1
for most home networks, but it can be different in many cases also.

--- www.microsoft.com ping statistics ---
1111 packets transmitted, 1098 received, 1.17012% packet loss, time 1789ms
rtt min/avg/max/mdev = 7.646/9.920/27.285/2.810 ms

hmmm my packet loss doesn’t look much better actually… so i guess it’s fine…
maybe it’s microsoft’s server

--- www.google.com ping statistics ---
760 packets transmitted, 754 received, 0.789474% packet loss, time 763ms
rtt min/avg/max/mdev = 30.437/33.055/53.051/3.155 ms

seems slightly better… but like i said… some loss should be expected, it’s rare for it to be flawless…

my success rates for today are 99%+ for both download and upload…
also a packet lost doesn’t mean the data stream is interrupted, it will just be retransmitted.

to the router no loss:

407 packets transmitted, 407 received, 0% packet loss, time 1004ms
rtt min/avg/max/mdev = 1.319/3.011/21.738/2.082 ms

I’ve updated a router setting to prioritize the node’s IP.

I can also adjust the DSL settings from “max performance” to (more) “stability”.
Not sure if that helps.

Just because you’re seeing similar issues doesn’t mean that’s ok. Packet loss normally shouldn’t happen at all. I did a test on my end and these are my results.

--- www.google.com ping statistics ---
1073 packets transmitted, 1073 received, 0% packet loss, time 1073221ms
rtt min/avg/max/mdev = 3.965/4.329/12.704/0.744 ms

During this I figured I’d give it a bit of a hard time, and aside from my 4 nodes, running chia and ethereum mining, I added a video stream from the same device to outside my network, ran a few speed tests and downloaded some large files. In the meantime I was constantly watching youtube videos. And yet…

The large downloads and speed test did make the ping times go up to 2x-3x as high as average, but that’s it.

Now I feel like I really have nothing to complain about with my ISP considering the average ping times of around 4ms. Higher ping times are absolutely acceptable… packet loss to that extent… not so much.

well i’m testing my network now, and will be replacing my isp soon because they are shit at their jobs.
it certainly wouldn’t surprise me if my isp is to blame.
not only have they had downtime about once a month for the last year,
they also crashed for 9 hours, and recently when i was retesting my internet speeds i was getting numbers that were worse than my old 400mbit/400mbit connection, even tho i’m supposedly on 1Gbit sync fiber

still 1% packet loss shouldn’t affect the network much… at worst it might slow it down a bit.
but yeah higher than i like to see…

router priority for the node did not help:

330 packets transmitted, 324 received, 1.81818% packet loss, time 901ms
rtt min/avg/max/mdev = 13.594/43.614/387.909/70.766 ms

not sure if that helps for analysis:

Ip:
    Forwarding: 1
    408717978 total packets received
    1 with invalid addresses
    264330010 forwarded
    0 incoming packets discarded
    144387907 incoming packets delivered
    355849030 requests sent out
    75 dropped because of missing route
    38 reassemblies required
    19 packets reassembled ok
    19 fragments received ok
    38 fragments created
Icmp:
    23294 ICMP messages received
    11373 input ICMP message failed
    ICMP input histogram:
        destination unreachable: 22887
        echo replies: 407
    23631 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 22817
        echo requests: 814
IcmpMsg:
        InType0: 407
        InType3: 22887
        OutType3: 22817
        OutType8: 814
Tcp:
    94580 active connection openings
    11108 passive connection openings
    11494 failed connection attempts
    15 connection resets received
    5 connections established
    161356294 segments received
    206585883 segments sent out
    168274 segments retransmitted
    12 bad segments received
    10720 resets sent
    InCsumErrors: 2
Udp:
    369614 packets received
    0 packets to unknown port received
    0 packet receive errors
    29828 packets sent
    0 receive buffer errors
    0 send buffer errors
    IgnoredMulti: 27876
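
One number that can be read out of counters like these is the share of TCP segments that had to be retransmitted. The sketch below assumes the text comes from `netstat -s` or a tool with the same wording (the post does not say which command produced it), and the function name is only illustrative. The counters are cumulative since boot, so the result is a long-term average and not a measurement of the current problem.

```python
import re


def tcp_retransmit_percent(stats_text):
    """Share of sent TCP segments that were retransmitted, in percent."""
    sent = re.search(r"(\d+) segments sent out", stats_text)
    retrans = re.search(r"(\d+) segments retransmitted", stats_text)
    if not sent or not retrans or int(sent.group(1)) == 0:
        return None
    return 100.0 * int(retrans.group(1)) / int(sent.group(1))


# The two relevant lines copied from the Tcp section above.
stats = """Tcp:
    206585883 segments sent out
    168274 segments retransmitted"""
rate = tcp_retransmit_percent(stats)
print(f"{rate:.3f}% of TCP segments retransmitted")
```

For the figures posted here that comes out well below 1%, which on its own does not explain a 7% drop in download success.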

I have noticed that Storj is more sensitive to this than most other things. I actually got briefly disqualified in the early days (back when disqualification was happening at 90% and could luckily still be recovered from). That turned out to be my router’s fault. I never really found out what it was that caused it, but factory resetting the router fixed the issues in the end.

But I should add, I never noticed any issues with anything other than Storj. So while the issue was definitely on my end, only Storj seemed to really have a problem with it. The connection was still fast and other stuff unaffected.

If you’re referring to QoS, I generally advise keeping that off. It’s not all that great and can cause more issues than it fixes.

been testing out my LAN, and it does seem like my pfsense has some sort of minor buffer issue. might be related to my packet losses, will be looking further into that…
the pfsense documentation, tho very good, doesn’t have any easy answers on how to deal with the issues. all the recommendations seem to either be defaults or were already implemented when i initially configured it.
it does only seem to account for 0.1% of the dropped packets internally, but perhaps the external interface is more… stressed…
ran an external test from a vps and got 0.8% loss again…
so for me at least, the fault most likely is with my pfsense… but i must admit i do put a rather big demand on it, and i’m sure it can be rectified with some sort of advanced configuration… if this is truly the case.