Relatively high download fail rate?

Using the successrate script, I found out that I have a download failure rate of about 6%. On checking the logs further, I saw that the downloads fail exclusively for the us1 satellite, with the error message “remote closed the stream”. I assume this is not a problem and is simply because my node is slower than other nodes, since I live in the EU and have high latency to the us1 satellite?

If it's not failing 100% of the time, then it is for sure latency and response time, yes. But nothing to worry about, from what I understand.

I advise having a script run regularly to notify you when something is not working well, so you can be proactive rather than reactive when errors occur.
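
For example, a minimal cron-friendly sketch might look like this (the container name, log window, threshold and mail command are all assumptions, adjust them to your setup):

#!/bin/bash
# count recent download failures in the storagenode log and send an alert if there are too many
# assumes a docker container named "storagenode" and a working local "mail" command
THRESHOLD=50
FAILS=$(docker logs --since 1h storagenode 2>&1 | grep -c "download failed")
if [ "$FAILS" -gt "$THRESHOLD" ]; then
    echo "storagenode: $FAILS download failures in the last hour" | mail -s "storj node alert" you@example.com
fi

Run something like that from cron every hour and you find out about problems instead of stumbling over them in the logs weeks later.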

Not to the satellite. The satellite doesn't transfer data. The customers of that satellite simply have faster nodes closer to them.


That's true if they are using Storj natively, but what if they are using the S3 compatibility thing? What is that called? I thought that was a satellite as well, or is it a different entity?

It's a different entity, and even in different locations. It consists of GatewayMT and an auth service.
See Storj-hosted S3 Compatible Gateway | Storj Docs

successrates for my 17 TB node (located in the EU) on the 3rd of April.
if your successrates aren't near 100% it's your storage, system or networking.
usually it's the storage.
download is usually the first rate to drop, because it can't be cached and is fairly time critical.

uploaded files can go into system RAM or storage caches…
this is also why you will see an increase in RAM usage as the storage performance degrades or even if your CPU is unable to keep up…

reduced download and good upload means your storage can keep up, but it has increased latency due to the workload, meaning it's working somewhere in the upper 50% of its ability.
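
if you want to confirm that, iostat from the sysstat package is a quick way to watch disk latency while the node is busy (column names vary a bit between versions):

# extended per-device statistics every 5 seconds; watch the await and %util
# columns for the drive that holds the storage folder
iostat -x 5

a drive that is constantly near 100% util with high await times is exactly that "working in the upper 50% of its ability" situation.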

it's pretty normal, especially now that the network is seeing a lot more traffic due to the Ukraine war.

I am using a USB 3 drive with an Intel NUC, so it's possible this is due to latency if the disk spins down. However, this only seems to happen on US1, which would suggest network latency as the culprit.

it's certainly possible that it's network related.
otherwise one would expect it to be evenly distributed between the satellites…

however that is a tricky thing also… since the amount of existing data might also need to be taken into account…

you could have 10 transfers for one satellite and 1000 for another.

the first might get zero failures even though there is a 6% failure rate…
the next day there is about a 50% chance it has another zero-failure day with its 10 transfers.

the other one has 1000 transfers, so it will see about 60 failures each day, and that would be very consistent.
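
to put numbers on that: with a 6% failure rate, the chance of a zero-failure day with only 10 transfers is 0.94^10, which is roughly 54%, while 1000 transfers land near 60 failures pretty much every day. quick sanity check:

awk 'BEGIN { printf "chance of zero failures in 10 transfers: %.2f\n", 0.94^10 }'
# prints roughly 0.54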

data can be difficult to analyze. I cannot imagine why you would see a network issue like that, so my money is on disk latency, because that is the usual suspect…

doesn't mean I'm right or even close… it's just what I'm used to seeing in these types of issues.

US1 and EU1 are the only satellites I'm getting noteworthy traffic from. EU1 has had 0 failures since my node started and US1 has 6%. Egress traffic is about 500 MB/day for EU1 and about 1 GB/day for US1, so they should be somewhat comparable. I think it's network latency.

Can you check the latency to the two satellites with nmap or something similar? It would be interesting to see whether it's really latency or some filter / rerouting on your ISP's side (or something else).
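
For example, something like this would show both the latency and whether a particular hop on the route is dropping packets (us1.storj.io and eu1.storj.io are the satellite hostnames):

# 100 pings to each satellite endpoint
ping -c 100 us1.storj.io
ping -c 100 eu1.storj.io
# per-hop report; a lossy or rerouted hop at the ISP would show up here
mtr -rwc 100 us1.storj.io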

I am not using a VPN. Here are some ping statistics from the NUC: eu1 is around 30 ms, us1 around 100 ms.

I don't see how it could be the network interface, cables, etc. if it only affects us1.

(ping screenshots for us1 and eu1)

EDIT: It seems to have been a temporary network issue. When I look at only the last 50,000 lines of my logfile the error rate is 0 for all satellites.

My provider is Vodafone Germany; unfortunately, packet loss is often through the roof on our connection.


Out of curiosity, which log message would I grep for to find the download failures? Just grep “download failed”?

If I do that, I see only US1 failing with “drpc: remote closed the stream” and sometimes “use of closed network connection”.

Yes.

Still the advice: use a monitoring tool and/or an alerting script.

Great to hear that it was temporary.

You could additionally set your DNS manually to one of those IPs. This might speed up US lookups, too. Give it a try.

Ping to the satellite doesn't matter for uploads or downloads. It could be used as a measure of network latency only for GET_AUDIT and maybe GET_REPAIR traffic (though repair requests will likely come from repair workers in other locations). It doesn't matter for GET and PUT either, because your storagenode communicates directly with the customer's uplink, so you would need to measure ping to those customers.

OK, then it might just be that my provider offers a very bad connection to 6% of us1 customers (who might be Canadian or even from South America for all I know). @CutieePie unfortunately I was mistaken and the issue wasn't temporary; I see frequent failures in the logs. The script just somehow did not work when I created a log using cat logs.txt | tail -n 50000 > newlog.txt.
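
For anyone who wants to double-check the per-satellite numbers without the script, something along these lines works (the "Satellite ID" field name and the "downloaded" success message are assumptions about the log format, so check one of your own log lines first):

# download failures per satellite
grep "download failed" logs.txt | grep -o '"Satellite ID": "[^"]*"' | sort | uniq -c
# successful downloads per satellite, for the denominator
grep "downloaded" logs.txt | grep -o '"Satellite ID": "[^"]*"' | sort | uniq -c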

so I ran some tests to give you something to compare against lol…


--- us1.storj.io ping statistics ---
158 packets transmitted, 147 received, 6.96203% packet loss, time 157267ms
rtt min/avg/max/mdev = 116.597/119.641/150.249/3.267 ms

I wouldn't call my numbers pretty, certainly something I should dig into some day in the near future… but dropped packets aren't a massive issue…
most systems with heavy traffic will drop packets to keep up.
even having 0% packet loss can be an issue, because it might mean the network buffers are filling up, which creates a bigger time lag between the incoming packets and the data processing.

--- eu1.storj.io ping statistics ---
159 packets transmitted, 148 received, 6.91824% packet loss, time 158319ms
rtt min/avg/max/mdev = 22.791/26.466/78.049/5.326 ms

but yeah, I really need to get around to looking at my network buffer issues.
last I checked it was at 2% and I sort of just ignored it… this seems excessive, but I know the cause, I just haven't found an exact solution for it…

I need to expand my buffer size, because my virtual NICs are using 1/8th the buffer size of my NIC hardware, which I think is causing the issue… along with very high IO load.
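
for reference, checking and raising the ring buffers is usually just a matter of (eth0 is only an example interface name):

# show the current vs. maximum RX/TX ring buffer sizes
ethtool -g eth0
# raise the RX ring to the maximum the hardware reports, e.g. 4096
ethtool -G eth0 rx 4096

for virtual NICs the equivalent knob depends on the hypervisor, so treat this as a sketch rather than a fix.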

anyways, I doubt your network issues are related to the successrates… it will just retransmit.
what it will do is eat up extra bandwidth, because it will basically send the dropped % of packets again.

storj isn't really very network dependent; disk latency and IOPS will become a limitation much sooner.
you can send data to the us1 sat about as fast as your disk can find it on the platter, on average.

your avg latency does seem to be about double mine… so maybe you are on copper while the majority of storagenodes are switching to fiber, and because your latency is now higher, your download % drops, since the combination of disk activity and network latency makes customer downloads finish just a bit too slowly.

that would be my guess anyways… but networking isn’t my day job… so

@Alexey how do we know whether the connection between our node and a customer's uplink is good? How do we measure it by ping?

No way to measure anything with ping except detecting network problems.
