HDD slowing down for some reason

apparently when somebody moves a thread, the replies being written at the time get directed at the mover of the thread @Alexey …
and i cannot figure out how to change it.
the forum acts a bit odd from time to time…

this was actually directed at @littleskunk in the changelog 1.4.2 thread
i watched something quite interesting about optimization a while back… amazing stuff… i'll see if i can find it, because you should really see it…
mind blowing.

sounds good that you guys are looking into that, it's a real issue… i got 9 drives… trying to keep up with 1 node… and performance dropping if i look at it the wrong way… hold my byte… OH SO HEAVY

well, doubling the piece size basically halves the io… but thats not always a good thing… like in our zfs discussions… i'm no expert in the disk storage area… but it seems to me like something is utterly wrong when i cannot squeeze through more pieces than i currently am.

must be something with how it's ordered and written out to disk… to be honest it does feel a bit like there is no caching or structure to it… ofc thats not really what one starts thinking about when one begins building stuff like this…
maybe it just throws data around way too randomly… i mean… egress we don't see a ton of… so thats not really killing the system, and i got a slog / write cache
so why can't i handle more than 5mb a second of ingress…
pretty sure i could saturate my 1gbit network connection with data transfers for weeks without it dropping much… even if i used the server for a ton of stuff… which is what is so damn weird…

i got 9 hdd's in 2 raidz1 vdevs, which load balances between them… i got a dedicated OS SSD / SLOG (write cache), another dedicated 750gb ssd for l2arc which takes care of any repeating jobs so they get 1-3ms latency, 600gb allocated to l2arc, 48gb ddr3 ram, dual 4 core / 8 thread xeon cpu's… i optimized my recordsizes, which granted seems to help a bit (rough example of that below)… i optimized my zfs the best i can.
im running docker on bare metal, my storage is on bare metal on the same system…
granted i'm new to linux but i've been working with related stuff for decades…
my vm's run real nice now… never had a faster system lol, but i've been trying to optimize and get all of this working for two months.
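for reference, the recordsize bit was roughly like this… pool/dataset names are just placeholders for my layout, and 1M is just what i landed on, not gospel:

zfs get recordsize tank/storj        # check the current value, default is 128K
zfs set recordsize=1M tank/storj     # larger records for the big storj piece files
# note: only newly written files get the new recordsize, existing ones keep theirs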

and don't get me wrong, it works pretty well… but if this is what it takes to just barely keep up… xD
in my opinion it has to be how the data is handled somewhere… something that makes hdd's slow down… which is basically random io… not sure what i can do at Q1T1, but it's not much above 10-20mb sustained, if that… so really thats what i would assume it is… not enough sequential write operations, so the program just lets the host system figure out how to write and read stuff, or something like that… i duno… my guess…
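for anyone wanting to reproduce that Q1T1 number, something like this fio run is roughly what i base it on… the directory, size and block size are just examples:

fio --name=q1t1 --directory=/tank/storj --size=2G --rw=randwrite --bs=128k --iodepth=1 --numjobs=1 --runtime=60 --time_based --fsync=1
# queue depth 1, one job = Q1T1; --fsync=1 forces the writes to actually hit the disks instead of just the ARC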

how difficult can it be to download 5mb/s and write it sequentially… or maybe the database refuses to live in memory… but that should just move to my l2arc… with enough time… then again the system did run fairly smooth the last time i got up to 3 days of run time… but i've had a lot of trouble lately… i think thats over now though… finally… maybe i can get some sleep then lol

your turn to have sleepless nights i guess lol

Because you have a long array, which does not work fast with small storj files. The longer the array, the more IOs are wasted on small files and the slower it works. It is already very good if, with a proper recordsize, a 4+1 array works a few percent better than a single drive. But an 8+1 or 8+2 array serving a single node is guaranteed to work worse than a single drive.

ZFS prefetcher tuning for 8+MB files.
I continue to argue that l2arc and slog are, a little less than completely, useless for storj.
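On Linux the prefetcher knobs are module parameters, for example (paths are for OpenZFS on Linux, the value is only an illustration, not a recommendation):

cat /sys/module/zfs/parameters/zfs_prefetch_disable    # 0 = prefetch enabled
cat /sys/module/zfs/parameters/zfetch_max_distance     # how many bytes are prefetched ahead per stream
echo 67108864 > /sys/module/zfs/parameters/zfetch_max_distance    # example: raise to 64M for large sequential files (as root)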


I can confirm it and can easily prove it:


actually my array is one of the best raid setups one can do for IO performance, which is why i set it up like this…
and i do get something like 4000 IOPS out of my array

my L2ARC, which at the moment is serving only storj, has an over 60% avg hit rate, so thats just plain wrong,
and my SLOG drastically improves my latency for whatever reason. it was recommended by people who, if memory serves, have actually contributed code to zfs and done extensive research on its behavior; they recommended the SLOG and running sync=always to turn some types of IO into more sequential workloads, for improved latency…
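for reference this is the setting in question… dataset name is just an example, and it's easy to flip back if it hurts more than it helps:

zfs get sync tank/storj           # default is standard, only sync writes hit the slog
zfs set sync=always tank/storj    # route all writes through the zil / slog
zfs set sync=standard tank/storj  # revert if it doesn't help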

and it works brilliantly… so i duno why you would say that… ofc my L2ARC is 10% of the size of the entire storj dataset it assists… so that may be why it's getting such good numbers…
i would suspect that would drop when i get closer to filling the 24TB node.

the zfs prefetch will basically turn itself off or go idle if it's not getting good results
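on linux you can actually see whether the prefetcher is earning its keep, if i read the kstats right:

cat /proc/spl/kstat/zfs/zfetchstats    # prefetch hits vs misses since boot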

@0dmin what program is that?
and my l2arc is also dead at the moment, but thats because for some reason there seems to be no test data flowing… i'll check up on my numbers in the middle of the coming week when the l2arc is full and warmed up; currently it's not even 24% full because i've been having some unscheduled reboots…

on top of that i don't expect to use this pool only for storj, but also for vm's and a lot of other stuff, so the l2arc is there to help mitigate that. from what i saw a few days ago, after having the system online for like 3 days straight with the l2arc close to filled, it was looking pretty good with a 60% hit rate, and i can feel how much faster everything is when i run webhosts, stream video and such off the pool.

so if it aids the storagenode then thats just a bonus… tho my l2arc is only 600GB, which is very small for a 30TB pool; ofc that depends more on how big the datasets one works with are, but in the storagenode case i think the recommended 5-10% of dataset size in L2ARC is a good place to be, so really when my storagenode allocation fills up i should add another 600gb of L2ARC.
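adding more l2arc later is luckily trivial, something along these lines… the device path is obviously a placeholder:

zpool add tank cache /dev/disk/by-id/ata-EXAMPLE-SSD-SERIAL    # attach another cache device to the pool
zpool iostat -v tank                                           # the cache device then shows up in its own section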

It is telegraf + influxdb + grafana, I have centralized monitoring for everything… but for you, I have a command that you can use on your storage with zfs to get raw numbers:
Linux: arc_summary -a
FreeBSD: zfs-stats -a
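If you want live numbers instead of a summary, arcstat also ships with OpenZFS (on some distros it is called arcstat.py):

arcstat 5    # ARC reads, hit%, miss% and size, printed every 5 seconds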

But not for Storj? Otherwise why does this topic start with words about not being able to handle more than 5mb of ingress…


i didn't start this thread… the thread i was writing in got split while i was writing the post, and the forum acted up… alexey titled my comment and made it a thread of its own… i was initially responding to littleskunk in the changelog thread… before alexey came rumbling through, cutting and pasting stuff around the forum… it's so confusing sometimes when he does that…

well, i look at the overall performance of my storagenode… i can see that increased activity on my drives will sink my successrates, and when i tried turning up my performance i did get better numbers… it's just an insane hill to climb.

so i was making the argument that if my system cannot keep up, a system that in basically all aspects should perform 4-5 times faster than a single drive and has a slog and l2arc on top, then how would people with less gear have a chance to get decent performance…

many of the smaller nodes are basically dying trying to keep up with the network, and the network just hammers them to death… also, i duno if you noticed, but if you are offline for extended periods the activity will not be less when you get back… it gets magnified until the network seems to think you have caught up… then the flow slows down again…

I can imagine that l2arc doesn’t see much use, but I would expect SLOG could significantly speed up writes. Are you not seeing any advantages of that? And if not, how so?

I appreciate that he does. Some discussions are just completely different from the original topic and could drown it out. If they get their own place, both discussions get the room to continue without interrupting each other.
Btw, you can turn off the notifications for a topic if you want.
(screenshot: topic notification settings)

What exactly should the SLOG accelerate? The writes to the database? That is very low traffic.
Normally, with sync=standard, the SLOG only processes sync writes. In our case, that is the database.

When someone changes to sync=always, all write operations, instead of immediately reporting success to the application after placing the block in memory in the txg pool, will first be written to the SLOG and only then report success. Since memory is faster than an SSD, such a setting will only slow the pool down.
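You can watch what actually hits the SLOG with something like this (pool name is only an example):

zpool iostat -v tank 5    # per-vdev statistics every 5 seconds, the log device gets its own row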

Me and other guys with a lot of zfs experience tried a slog with storj some months ago. Nobody saw a performance gain from it. Maybe something has changed since then, but I can't see a reason for it. And with the new version it is not relevant anyway, because we all move the databases onto ssd mirrors.

the SLOG significantly lowered my read latency from the HDD part of the pool, i think i went from 60ms to 9ms. if the hdds can keep up with the writes without getting backlogged there wouldn't be a gain from a SLOG, only a decrease in overall system throughput.

but i have to work with the limited hardware i got, in the best configuration i can manage, and verify that my solution works for what i need it to…

the lower latency seemed to make a substantial difference to my egress numbers, but thats, as you have pointed out many times, fairly subjective… still, it's difficult to argue with an 80 or so % reduction in latency…
ofc my write latency compared to async would go up, but if i go from 200 picoseconds, is it… i forget, micro maybe… to 1ms, that doesn't really bother me… it should be fine for what i need…
however, 60ms for storagenode reads on the other hand is a long, long time IMHO.
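btw newer zfs versions can show latency per pool and vdev, which is roughly how i check those numbers… pool name is a placeholder:

zpool iostat -l tank 5    # adds total / disk / sync queue / async queue wait columns
zpool iostat -w tank      # one-shot latency histograms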

also, i will be using my pool for many other things, and thus its configuration and hardware relate to optimizing for all my use cases, not just for the storagenode, even tho the storagenode has been the main focus until now.

Yes, that was my thinking. But I guess it doesn’t matter anymore with the new feature.

I am running my node in a VM with a zvol attached as the virtual disk. Since the zvol is used in O_DIRECT mode, SLOG helps me.

OTOH if you store files directly on zfs then yeah, SLOG would not be much help unless you used sync=always for better reliability.
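For reference, the zvol side looks roughly like this (names, size and volblocksize are just what I happen to use, not a recommendation):

zfs create -s -V 200G -o volblocksize=16k tank/vm/storagenode   # sparse zvol used as the VM's virtual disk
# attached in libvirt with <driver name='qemu' type='raw' cache='none'/> so guest writes hit the zvol O_DIRECT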

Regarding L2ARC - with the current loads it is not useful for me, most hits come from L1 anyway. Then again, my server has 100GB RAM (16GB given to the node VM, currently L1 ARC size is about 60GB). I have actually disabled L2ARC for now.

L2ARC is indeed not very helpful for a node (unless you have crazy big ones) because only the DB gets accessed frequently and will therefore stay in ARC anyway.
The SLOG helps to smooth the IO load on the HDD because the database writes are kind of cached (not written to the ZIL on the HDD thus reducing the load).
The SLOG however is not helpful for file writing itself as those are async. If you use sync=always then everything goes through the SLOG and oddly enough that does seem to increase the upload successrate a bit (went from 39% to 41% for me) but it’s not relevant and not worth the additional write cycles on my SSDs.

So with the new feature of configuring the DB path, neither SLOG nor ZIL will be of much help for the storagenode if you store the DBs on an SSD mirror directly.
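For anyone wanting to try it, the relevant line in config.yaml is (if I remember the option name correctly; the path is just an example, and the existing *.db files need to be copied there while the node is stopped):

storage2.database-dir: /mnt/ssd/storagenode/dbs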

one thing i've sort of noticed, which kinda seems to match my current understanding of zfs:
because the l2arc works not on whole files but on blocks, it can actually cache the often used bits of an hdd image and leave the rest on the hdd.
so cool, i was kinda worried the vm images would eat up a ton of l2arc or arc
but it really feels like the vm runs from ssd and then gets stuff from the hdd when it's not commonly used… which is perfect for my use case; then it's just a matter of a bit of cache warm-up and one cannot tell it isn't on an ssd
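zfs also lets you steer that per dataset if one ever needs to, something like this… dataset name is a placeholder:

zfs set primarycache=all tank/vms      # all | metadata | none, controls what goes into the ARC
zfs set secondarycache=all tank/vms    # same options for the L2ARC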

whats the point in mirroring a slog? i mean, the slog is itself a redundancy mechanism… you can pull it live and nothing happens… and what are the odds that the slog ssd breaks exactly when you get a power outage…

i also tried pulling the power on my system while it was live… didn't do anything to it either…
i pulled a 4th drive after i had pulled the redundant drive in my raidz1… nothing happened… ofc zfs was kinda upset, but it took it like a champ and never lost a byte…
i've run on a backplane that continually threw write or read errors at me when i did scrubs…
shuffled drives around, resilvered maybe 15 times if not more in less than 2 months… i've been so mean to this pool that it's borderline ridiculous, and yet zfs just takes it…

so a redundant drive for a redundant drive… O.o maybe a bit overkill, but ofc it depends on the importance of the data on it.