Started getting download errors

CutieePie · January 6, 2021, 12:14pm

Hi All, wasn’t sure this was worth reporting but it hasn’t gone away so here goes.

#Edit - I’m failing to make uploaded images bigger, just trying to work out how. Moving them somewhere else.

My ELK has been picking up one client that has started causing download errors. It started on the 24th of December and has been slowly increasing. Logs show that the client IP has never successfully connected to my node, although I guess IPs can change.

I’m ignoring it, as I assume it’s timing out or something, but thought I would log it as I’ve probably done something wrong.

My node:

started in late Nov 2020 and still isn’t fully vetted on all satellites.
all counters are at 100% in dashboard.
It’s a Docker instance, running on an Alpine VM (3.11.6) with a non-redundant SMR / SSD tier.

The below shows connections from the IP that is failing to download, and the increase in errors over time, starting on 24th Dec 2020.

All the errors are one of two types.

Download failed: use of closed network connection
Download failed: write: broken pipe
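
For anyone who wants to tally these two error types from a saved log, here is a minimal Python sketch. The log line layout and the sample lines are assumptions made up for illustration, not real storagenode output; only the two error phrases come from the messages above.

```python
from collections import Counter
from typing import Iterable, Optional

# Substrings based on the two error messages quoted above.
# Matching is done on lowercase text, so case differences do not matter.
ERROR_KINDS = {
    "closed_connection": "closed network connection",
    "broken_pipe": "broken pipe",
}


def classify(line: str) -> Optional[str]:
    """Return the error kind for a failed-download line, or None otherwise."""
    lowered = line.lower()
    if "download failed" not in lowered:
        return None
    for kind, needle in ERROR_KINDS.items():
        if needle in lowered:
            return kind
    return "other"


def count_errors(lines: Iterable[str]) -> Counter:
    """Count failed downloads per error kind."""
    kinds = (classify(line) for line in lines)
    return Counter(kind for kind in kinds if kind is not None)


# Hypothetical sample lines, not copied from a real log.
sample = [
    "ERROR download failed: write: broken pipe",
    "ERROR download failed: use of closed network connection",
    "INFO downloaded",
    "ERROR download failed: write: broken pipe",
]
print(count_errors(sample))
```

Feeding it the lines of a real log file (for example `count_errors(open(path))`, where `path` is wherever your log is saved) would give a per-type total you can compare over time.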

I uploaded all the satellite IDs and piece IDs along with timings to pastebin.com

Thanks for your help, I am loving this forum ! I am learning so much.

LinuxNet · January 6, 2021, 1:01pm

The error message appears if another node was faster or the download was canceled by the Tardigrade user.

The second error message should mean the same thing. (I would like to be corrected if I am wrong)
I had that too some time ago.

LinuxNet · January 6, 2021, 1:21pm

You only have to worry if audits fail.

If you save your log outside of the Docker.

SGC · January 6, 2021, 1:46pm

from my testing it seems latency dependent, under optimal conditions these errors barely show up…
but that’s with the storagenode having access to 98% more resources than what it uses.
so 2% hdd utilization avg
2% internet bandwidth utilization and 2% server load, on 500mbit fiber, 16 threads on dual CPUs, and IOPS equal to 2-3 x a CMR HDD, from dual or triple raidz1 arrays, with ssd caching… so on and so forth.

Overkill x Overkill then i get maybe 1 error or less in a day… sometimes all the way down to 1 a week.
basically flawless run on the storagenode…

doesn’t take much before it will start going up, it seems it’s mostly latency dependent, i use a write cache with sync always, meaning all data goes to the write cache and is then written in bursts which helps make the iops less random writes and more sequential.

something similar can be done by increasing your write buffer from 128k to 512k

or by moving the database to an SSD drive… just keep in mind that even tho this improves performance, it also moves the storagenode on to two devices rather than one, and if either of them fails it will spell trouble.
but it does help a lot it seems.

really the errors don’t matter much, they just look annoying… so long as your success rates are pretty good then i wouldn’t worry about it…
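
as a rough way to put a number on "pretty good", here is a small Python sketch of a download success rate. The counts in the example are hypothetical, and treating canceled downloads as part of the total is my assumption for illustration, not necessarily how any official script counts them.

```python
def success_rate(succeeded: int, failed: int, canceled: int = 0) -> float:
    """Percentage of download attempts that completed successfully."""
    total = succeeded + failed + canceled
    if total == 0:
        return 0.0  # no attempts yet, so there is nothing to rate
    return 100.0 * succeeded / total


# Hypothetical counts, e.g. tallied from a day's worth of log lines.
rate = success_rate(succeeded=950, failed=10, canceled=40)
print(f"download success rate: {rate:.1f}%")
```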

if you can it might be a good idea to set your ssd (which i assume is a cache for your SMR HDD)
to sync preferred / always or something like that… the basic idea being all writes are routed to the cache and then written in bigger chunks, making the writes sequential instead of random and thus greatly increasing the write speeds.

also helps against fragmentation, ofc if your success rates are like 95%+ then why bother… but long term it might really pay off…

i should set up a test for that… might be interesting to know… because it’s so much easier to run on the default writes, the sync = always methodology is pretty demanding and will require the SSD to soak up everything + max out at the amount of iops it can handle.

but with a single HDD i doubt that’s an issue, but my old sata ssd sure wasn’t up for keeping up with my zfs arrays

SGC · January 6, 2021, 4:45pm

if i were to hazard a guess, then it’s the writes being slow on the SMR because it handles random writes poorly and rewrites even worse.

one option that is fairly easy to manage and used by many, is to simply add more nodes on other hdds
then because the ingress is split evenly between the nodes, you will by adding just one extra node reduce the writes on the SMR by 50%.
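
the arithmetic behind that 50% can be sketched in a few lines of Python. This only assumes what is said above, that ingress is split evenly between the nodes; the function names are made up for the example.

```python
def per_node_share(node_count: int) -> float:
    """Fraction of total ingress each node receives under an even split."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return 1.0 / node_count


def write_reduction(old_nodes: int, new_nodes: int) -> float:
    """Fractional drop in per-node ingress when the node count changes."""
    return 1.0 - per_node_share(new_nodes) / per_node_share(old_nodes)


# Going from one node to two halves the ingress hitting the SMR drive.
print(f"{write_reduction(1, 2):.0%} less ingress per node")
```

a third node would cut the original drive's share to one third, so each extra node helps a bit less than the one before it.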

dedupe on zfs is a nightmare, the idea is good, but the machine required to manage it is … ofc that doesn’t mean it’s the same for ReFS.
but i would suspect it will be, the problem with deduplication for hdd or storage in general is that the system needs to keep a record of all the files/data blocks that exist and actively compare them as data is saved, as the stored data set grows this ends up being more and more demanding.
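
to illustrate why that record keeps growing, here is a toy block-level dedupe index in Python. It is only a sketch of the general idea and not how zfs or ReFS actually implement it; the class name, the 4096 byte block size and the use of random bytes as a stand-in for piece data that rarely repeats are all assumptions for the example.

```python
import hashlib
import os


class DedupeIndex:
    """Toy dedupe table: one entry per unique block that has been written."""

    def __init__(self, block_size: int = 4096) -> None:
        self.block_size = block_size
        self.blocks = {}        # digest -> block bytes (stands in for disk)
        self.logical_bytes = 0  # bytes the caller asked to store

    def write(self, data: bytes) -> None:
        """Split data into blocks and store only blocks not seen before."""
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).digest()
            self.logical_bytes += len(block)
            # Every write needs a lookup in the table, which is the cost
            # that grows as the stored data set grows.
            self.blocks.setdefault(digest, block)

    def stored_bytes(self) -> int:
        """Bytes actually kept after deduplication."""
        return sum(len(block) for block in self.blocks.values())


index = DedupeIndex()
index.write(bytes(4 * 4096))       # four identical all-zero blocks
print(len(index.blocks), "entry after repetitive data")
index.write(os.urandom(4 * 4096))  # random bytes as a stand-in for pieces
print(len(index.blocks), "entries after non-repeating data")
print(index.logical_bytes, "logical bytes vs", index.stored_bytes(), "stored")
```

repetitive data collapses to a single entry, while data that never repeats adds one table entry per block and saves nothing, so the table just gets bigger with no payoff.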

did try and check if i could find a clear cut quick answer on dedupe with ReFS, but doesn’t seem like it…
but i do know the storagenode will not benefit from it, nor will it save space, and in case there is built in redundancy in the software the deduplication will basically merge the duplicates… meaning if one gets damaged and the software tries to recover from its backup, it will be unable to…

so i would turn deduplication off more or less immediately, at least for the storagenode.

there are other memory based deduplication things that can be quite good to run… but that’s memory related and shouldn’t relate to the filesystem i would think…

but i’m very unfamiliar with ReFS and ReFS dedupe, so it might be fine… but i doubt it…

your latency for storj will be the 8ms + hdd seek time + hdd read time + 8ms back
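
that sum can be written out as a tiny Python sketch. The 8ms is the figure from above, while the seek and read times in the example are hypothetical placeholders, so measure your own drive rather than trusting them.

```python
def request_latency_ms(network_one_way_ms: float,
                       hdd_seek_ms: float,
                       hdd_read_ms: float) -> float:
    """Network in + disk seek + disk read + network back, in milliseconds."""
    return network_one_way_ms + hdd_seek_ms + hdd_read_ms + network_one_way_ms


# 8 ms each way as mentioned above; seek and read values are made up.
total = request_latency_ms(network_one_way_ms=8.0, hdd_seek_ms=12.0, hdd_read_ms=4.0)
print(f"estimated latency: {total} ms")
```

the network part is fixed by where you are, so the disk terms are the only ones you can really influence.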

really if your node hasn’t vetted yet, you shouldn’t be seeing any significant load, nothing that should affect it… my bet will be solely on the dedupe for causing this…