Best Record size for zfs

still sort of testing, and lower is better i believe…
clearly rpool, which is the OS drive, could use a bit of a refresh and should be set to sync=always like bitlake and qin.
opool isn’t really used, rpool and opool are the oldest pools tho.

NAME      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
bitlake  43.7T  18.2T  25.5T        -         -     4%    41%  1.00x    ONLINE  -
opool    5.45T  3.98T  1.47T        -         -    16%    72%  1.00x    ONLINE  -
qin      5.44T  1.08T  4.36T        -         -     6%    19%  1.00x    ONLINE  -
rpool     139G   107G  31.7G        -         -    47%    77%  1.00x    ONLINE  -

and bitlake is the only one that has been run sync=always with a slog for most of its life, aside from the week where the slog was down.
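
for reference, setting that is just a property change, a minimal sketch using the pool names from the zpool list above:

zfs set sync=always rpool     # every write must hit the ZIL / SLOG before it is acknowledged
zfs get sync rpool            # verify the property took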

@kalloritis
yes, that would be an argument for it, but why does that make any sense? any data block can or will essentially be split by recordsize, so the capacity required to avoid fragmentation would be variable in relation to that and to the size of the pool then… wouldn't it?

and it’s not like that explains how to get to the 80%, how did they arrive at that number.
i mean i could imagine data scientists arriving at that number, by long detailed study… which would explain why it’s so difficult to figure out how they arrived at that number.

i don't disagree with your explanation, i just don't feel like i'm any closer to understanding why 80%. sure, one can test it over years of usage, but i prefer a simple piece of math or such to show why it has to be like that.

for example, if we imagine a pool the size of 1MB with a 128K recordsize, that would hold 8 records, and thus it would need, let's say, 2/8ths free to easily have space to write one contiguous record.
but it would essentially be the same for a 2TB pool or a 100TB pool, still requiring like 256K of contiguous free space to write a record.
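
just to spell the toy example out with numbers (purely illustrative, not a real pool):

pool_bytes=$((1 * 1024 * 1024))          # imaginary 1MB pool
record_bytes=$((128 * 1024))             # 128K recordsize
records=$((pool_bytes / record_bytes))
echo "records that fit: $records"                                # 8
echo "headroom for 2 free records: $((2 * record_bytes)) bytes"  # 256K, i.e. 2/8ths of the pool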

what's GC?

yeah the snapshot thing is pretty self-evident, but nothing is good with the storj data blob :smiley:
too much data basically / too many files

Have you checked out Michael W. Lucas (shown in the center here: https://www.youtube.com/watch?v=i1GjP1Q_SG0)? He writes on ZFS as well. He's played with ZFS in some of the dumbest (read: most creative), silliest (read: most interesting) ways to find ways to break it and to learn why it works the way it works. He's got some interesting reading material because his satire is somewhat baked in and makes it less dry.

Now I know he talks about FreeBSD ZFS and not openZFS/ZFSoL, but for the most part… ZFS is ZFS.

i think they are aiming towards merging ZFS into one project again, which makes good sense, it's not really something where people need the 3 or 4 different versions, or however many there are… ofc the oracle one will never be merged… at least not in the near future.

i think i've seen a talk with him about some ZFS development work, nope, that was one of the other michaels. the stuff i check most about zfs are the release dates on new features… which, since i've only been using it for 10 months, has been pretty much pointless lol

and then i watch talks or lectures from zfs developers about their features.
there is just so much about zfs, it’s like learning linux… the rabbithole is deeper than you care for.

i did manage to make zfs throw errors on a good drive… because i took its mirror out and reattached it multiple times to try and see if the connection was bad / okay after giving it contact cleaner…
then the good drive told me a checksum was lost… but i suppose that doesn’t matter since it was just a checksum

oh wait, i have seen some of michael w. lucas's talks, the one about jails

that was actually kinda interesting, but most of them usually are.
zfs is just so complicated, i just feel like giving up lol
i do enjoy autotrim, not sure i would have turned that on without having seen the talk from the developer about the creation of it and what it does.
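
for anyone curious, turning it on is a one-liner per pool (the pool name here is just a placeholder):

zpool set autotrim=on tank    # tank = whatever your pool is called
zpool get autotrim tank       # confirm the property stuck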

Since my eSATA controller under load occasionally reconnects a HDD, I actually have the occasional corrupted file where file and checksum don't match, which the scrub detects. It has only been like 10 files in the last 6 months but still… it would be worse with other filesystems though…
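
For context, this is roughly how I find them (the pool name is just an example):

zpool scrub tank         # re-read every block and verify it against its checksum
zpool status -v tank     # -v lists the affected file paths once the scrub has flagged them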

was that also a problem when running with redundancy… ?
i suppose it’s not really a huge problem anyways, but still kinda curious about zfs behavior
the weird part for me with the checksum that went bad, was that the checksum was on the good hdd, while it was the bad hdd in a mirror pool i was disconnecting and reconnecting 3-5 times over a few seconds

best guess was that the checksum was split between the hdds and the hdd cache forgot the writes that were made…

actually… why don't you just put the slog on an onboard sata controller or m.2, pcie or whatever? then it shouldn't matter for file consistency that the storage part disconnects and reconnects once in a while, because written data will go to the slog and it only needs to be written out when it actually has access to the drive…

not sure if you have to go sync always for that tho, but don’t think so
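
adding a slog after the fact is non destructive, roughly like this (the device path is a placeholder, use /dev/disk/by-id names in practice):

zpool add tank log /dev/disk/by-id/ata-SOME_SSD    # dedicated log (SLOG) vdev
zpool status tank                                  # the ssd should now show up under "logs"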

It was better with the raidz1 but only because the HDDs got reconnected less often. I did have the occasional corrupted file back then too.

SLOG is already on internal SATA and is rock-solid but would only be used for pieces with sync=always.
But that wouldn’t solve the problem because either way the pieces are kept in RAM. If zfs had realized the write error when writing the file (and not just during the scrub), it would have corrected it and it wouldn’t matter if the data was written to the SLOG or only kept in RAM.

with sync always i can pull drives without getting corrupt files… ofc i haven't done that on a regular basis… tho i have done so at least 3-4 times without causing corruption… usually by mistake
sync always is quite good, if the ssd can keep up.

i believe it's one of the reasons they use it for databases, because avoiding corruption in such data is very critical, but it's been almost a year since i was digging through all the stuff… so not 100% sure, but i think so

yes there is no doubt that sync=always is superior in terms of data safety. Maybe it would prevent the data corruption I’m seeing. I guess I could try. However, it’s a general timeout issue as I’m also seeing lots of write/read errors when the drive reconnects because those operations time out, but zfs tells me that no applications have been affected and that the errors have been corrected :man_shrugging:

yeah it won't solve the read timeouts, but with a decent few GB of SLOG and sync always, every write will require an acknowledgement that it has been written, so it cannot be corrupted, because the SLOG simply soaks up the write data while the disk is inaccessible.

if you are lucky the reduced IO might also make the esata link more stable… it’s a bit weird that it’s so unstable tho… but haven’t really used esata so i don’t know if that’s a common thing.
also doubt it's due to inactivity, but you could try and go in with… hdparm
i think it's -B 255 to turn off power management.
that could, even tho it shouldn’t, cause issues with the disk turning off and then being slow to respond again… also you might want to try to go into the pcie power management in the bios and turn that off.
the issue may very well be related to some sort of power management interfering.
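
something like this, if the drive honours it (sdX is a placeholder, and not all drives support these):

hdparm -B 255 /dev/sdX    # 255 = disable APM entirely, if supported
hdparm -S 0 /dev/sdX      # 0 = disable the standby / spin-down timer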

i've seen that kinda stuff pretty often in bluetooth / wireless headphones, where they will give a weird blurp sound when it drops and reconnects, because of some power management issue.
it happens quite a lot in different things and most enterprise setups just turn it all off for the same reason…

even my pcie ssd / app accelerator recommends turning off the C-states on the cpu, because they can cause latency that may slow down the drive :smiley:

So Jails are actually a FreeBSD thing, not really a ZFS thing.

This is not to say that ZFS didn’t really become a bit synonymous with FreeBSD for a while though. Basically if you wanted to use ZFS, well, you ran it on a FreeBSD box- mind you before ZFSoL became closer to fully stable like it is now.

What we use in Proxmox is actually ZFS on Linux (since we’re technically using Debian), and while very similar does have its minute but important differences from ZFS on FreeBSD (mostly related to feature sets) and is not directly portable back and forth between the two.

So this is two fold- Yes to protect the data in the table, and second to make sure that the transaction log of what’s happened in that database is also current/correct to what can be loaded from the database table files. If the transaction log does not match the stored data, things like replication start to fail horribly.

I have to ask- what hardware is it running on? Models and the more specifics are helpful.

SLOG would help sink the writes, but an L2ARC, and a massive one (>128GB), would be needed to answer the reads… maybe.
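
If you wanted to try that, a cache device is non-destructive to add and can be removed again (device path is just an example):

zpool add tank cache /dev/disk/by-id/nvme-SOME_SSD    # add an L2ARC (cache) vdev
zpool remove tank /dev/disk/by-id/nvme-SOME_SSD       # cache vdevs can be pulled back out at any time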

eSATA is kinda sketchy… no heavy use of it kinda caused it to not have the robustness of SAS. As such, I've never tried to use it for anything other than a backup drive for speed (before USB3). If I need SATA outside of a single chassis, I go to using SAS controllers (LSI 92xx or LSI 93xx personally preferred) with SFF-8088 (SAS2) or SFF-8644 (SAS3) cables, then connector cards at the target chassis, and then SFF-8088/8644 to SATA fan-out cables. Yeah, it seems like a lot of cabling, but I know it'll work.

That’s a first. I mean it has -some- merit but will also cause the system to never let cores go into lower power states and will cause extra heat generation because of that.

Not sure that's the problem. Even with a SLOG the writes actually go from RAM to the HDD, unless zfs detects the write errors, but if it did, it would detect them in async too. Additionally, writes are queued for up to 10 seconds before being flushed to disk, so if a full flush failed, more than just a single file would be affected, but it is always just a single file that my scrub finds, even though I may have about 20 reconnects within a month.
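
For reference, the flush interval and the dirty-data ceiling are module parameters that can at least be inspected (these paths are for ZFS on Linux):

cat /sys/module/zfs/parameters/zfs_txg_timeout      # max seconds between transaction group flushes
cat /sys/module/zfs/parameters/zfs_dirty_data_max   # dirty-data limit that can force an earlier flush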

The mainboard is an ASRock AB350M-HDV, the eSATA card is a StarTech.com 2 Port eSATA 6 Gbit/s PCI Express card with SATA port multiplier support, and the external case is a FANTEC QB-35US3-6G with 4 bays, of which 3 are used for HDDs.

A while back when I used raidz1 the HDDs maxed out at 66% usage no matter the load and I had (almost) no problems back then. Then I switched to using them separately and every time at least one HDD is at 100% usage, there is a chance any of the HDDs might get reconnected.

That would be quite strange, but my CPU actually doesn't seem to support any C-states :smiley: I don't know why, or if I just can't see them, but netdata always reports C0 while other Ryzen CPUs are actually switching C-states… idk. It surprised - and bothered - me too. My Ryzen 2400G setup pulls ~65W with 3 HDDs, I thought I could get it lower but without C-states…

Anyway, due to this weirdness I actually have to update and start my nodes one after the other (waiting until the filewalker is done as it uses the HDD 100%) in order to reduce the risk of HDDs disconnecting. There was never real damage to my personal files, just a few storj files, so I was fine with it for now :smiley:

Yeah i just thought i had seen a talk with that michael guy about ZFS, but it was about Jails. i think ZoL and BSD / OpenZFS just got merged into one, being OpenZFS 2.0,
which sports the long awaited… for me… persistent L2ARC. sadly haven't gotten to try it yet… was kinda hoping to be able to hack a proxmox solution together… but seems like there might be some tricky stuff with booting over it, which is what i do now… so i'll wait a bit.
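
afaik on ZoL the persistent part is controlled by a module parameter, something like this (worth double checking against the OpenZFS 2.0 docs):

cat /sys/module/zfs/parameters/l2arc_rebuild_enabled   # 1 = L2ARC contents survive a reboot / pool import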

yeah the portability from linux to freebsd, tho nice, is still not a big issue imo… it slows down the transfer a bit, even tho it's possible zfs send might work, i duno… really don't have the need, i'm stuck with proxmox for now.

yeah one runs into that voting problem… if you've got 1 replicated source, which is basically 2 sets of data… now if you have an error in one… how do you know which one the error is in…
it's why they say clusters should always be made to consist of at least 3 machines.

basically what i wanted to do with it was to avoid data fragmentation on the disk, and convert random write IO into sequential write IO, and then i wanted to make sure that if i lost power
(because i have very stable power, so a UPS doesn't make sense, currently)
then my SLOG SSD would have PLP, and microsecond response times, so that basically no matter what happens, what was acknowledged as written will be in the SLOG and thus cannot be lost.
ofc the CoW sort of solves that problem, but this gives me an additional layer of redundancy in regard to that and also the other advantages.

additionally the SSD got internal raid5 and ECC on its cells / ram, so it should be very safe there, and on top it has its own internal firmware / os which will ensure the integrity of the raid5 / ecc is maintained by disabling chips if needed.
enterprise gear is lovely lol. it's overprovisioned 10% for the redundancy features, and i then took an additional 40% off the top to keep it functioning at optimized performance levels.

well the reads are less important, the incoming writes that go wrong are what causes corruption. sure an L2ARC would also help, and it certainly does make a difference when i pull my drives while the system is operating… but the storagenode reads are so random, it's almost a waste running the l2arc on them.
and IOPS can become kinda tricky… i know SSDs supposedly can do lots of IOPS, but still the QLC and MLC commercial ssds i used before would run into issues just serving my 1 x 14tb node at times.
granted they were running both SLOG and L2ARC, and the OS :smiley:
mainly because the SLOG soaks up random write IOPS and thus the performance ends up looking like Q1T1 numbers,
which most SSDs don't do too well at.

so yeah, long story short, i would rather save the IOPS for the SLOG to capture data and avoid corrupted writes, rather than spend them serving reads for the storagenodes, which there is basically no reward in aside from payment, and being slow there doesn't punish the node… while corrupted writes can potentially cause damage that might one day kill the node, if one was unlucky enough or the corruptions happened often enough.

true, no it wouldn't detect errors on async, because async doesn't require an acknowledgement to be sent back before the write is accepted as completed…
thus with async writes, where the disk powers off and loses its cache, your slog or ram would already have discarded the async data once it was sent to the disk.

while with sync, the slog holds it (the reason for the slog is mainly power loss protection): everything goes there, when the time is up it goes to the drives, and after each bit is written an ACK comes back before the slog is allowed to purge the data.
thus if you lose contact with the hdd, the slog will simply keep soaking up data as long as it can, until it's full or the drive comes back… when the drive comes back it restarts the writes and purges as the ACKs come in.

it's why it's basically failsafe, it's near impossible to make it fail, and as kalloritis says, in certain use cases even a single bit flip can ruin your week or month…

my SSD is a bit of a snowflake, a popular snowflake, but nevertheless…
the earlier version could pull near 200 watts and required like 2x molex cables on it.
mine will only do 38 watts, but my server can only provide it with 25 watts because it's pcie 2.0.
the earlier version wasn't very popular because many of the storage engineers that decided to go with them ended up basically cooking their servers… WHOOPS

haven't really dug into all that tho… but i did try to disable it, maybe i should follow the manual at one point and see if that fixes my weird netdata outputs.

also we are pretty far from Best recordsizes now… maybe this is better suited for the discord.

In general you are right, but that's more the application side. Async writes are a success for an application as soon as the RAM accepts the "file"/write instructions. If you then lose power, that data is lost. But if zfs can't flush those ops to disk because the disk briefly disconnects, it probably doesn't just delete those ops. But even if it did, my scrub wouldn't find a corrupted file because it would never have gotten onto the drive. Therefore I think my problem is a little bit different and more complex.

But yes, this is far from the best recordsize now :smiley:

if copy on write works, then in theory it should never give you a corrupted file, only an old file.
the only way to really get a corrupted file is an intermittent connection, and really, with the ACK on sync writes it wouldn't be ACKed before it's done, and thus already checksummed.

the async would send the command and then consider it done… i guess there should be a checksum done on the write, ofc that would slow down the process… i duno… async writes are usually considered non-critical and thus expendable.
like i said, running sync always is difficult for the hardware, async is basically done to make everything easier… i duno exactly why it works, i can only assume async writes are allowed to make errors… even in zfs, because zfs was designed for spinning hdds, and today with ssds we can actually manage to do all writes as sync if we need to, and it seems to be flawless…

i duno if it will work for you, but that’s also partly why i want you to try it.
aside from that, it's one of those things that costs nothing to test, and is basically just flipping a switch.
been running it here for about as long as i have been running zfs, and it's not like i'm moving away from it. as a matter of fact, i did in the beginning run without sync always on my OS pool… but ended up with issues when i rebooted randomly for a while…
so now i run sync always on everything… and i really like it, it's been amazing, haven't had one app acting weird or anything… everything just works as it is supposed to…

at least afaik, even netdata has been behaving, it seems, ever since an update which i'm not sure i did on purpose, but i had to reinstall it i think, so maybe i got a more stable version… but it could also be correlated to sync always on the OS drive, or maybe how i use it…
i do have the "when to refresh charts" setting on "on focus" rather than on "always" these days, because i suspect that was what seemed to break netdata…

whatever fixed it… something did… still, on "always" netdata gets kinda wonky after a week or so.
but it's not easy to verify… i am planning on setting up another proxmox computer, so maybe i'll leave that a bit more stock and see what happens.

i must admit tho… i kinda hate the thought of even running it when not using sync always on everything lol but maybe i’m just imagining it… mostly anyways… but i don’t think so.
so you make the perfect test animal, i mean test subject…

just lean back and try not to struggle too much, so the straps don't dig in lol
this won’t hurt… much.

not if there was no old file as all storagenode uploads are new files…

then you would get a failure in the storagenode, not in zfs because zfs would literally not be able to tell something was missing.
and i don't think the network minds if you say you don't have the file, ofc at some point if files get lost then one of them will be audited… which i suppose isn't good because that would be a failed audit…

but how would you get to the point of having lost a file… if it goes to an ssd in nanoseconds or even a few milliseconds? let's see… if we say a 1 to 10ms range… which seems a reasonable worst case write latency, and storj is low data transfer, just 1MB/s or whatever… 100KB/s
most ssds, even sata-only ssds, can do like 600MB/s, so that's 600KB/ms
so it takes about 4ms to transfer an avg file (a couple of MB) to the ssd, then it's safe and it cannot go back if sync always is on.

thus within those 4ms we have to get the error, and not only do we have to get the error, it also has to be in RAM, because i assume it will end its write to disk with a checksum. i suppose if it doesn't check against the SLOG checksum, then having non-ECC RAM might do it… but still, it should run a checksum when it's done, and if it doesn't match what's in the RAM then it can get the checksum from the SLOG.

so a memory failure can or has to happen within that 4ms range while writing an avg storagenode file.
then we might get an error…

that's really the only way i can see it actually creating an error, but i'm not 100% sure on how it works. still, normally without sync always… i think it just fires it off and then assumes it's fine to move on to other stuff, because that's what async writes are… they can survive taking damage, it's supposed to be built into them, or their overall program, to consider stuff written that way unsafe…

just like you wouldn't mail your credit card and codes by post because you consider it unsafe, so i'm not sure one can really blame zfs for being able to run in faster gears too…

the sync=disabled setting is also kinda cool… tho that will seriously make your system be weird…
it will work for something in limited time periods… like say during migration it might halve the migration time, and doesn't "SEEM" to make any errors…
but doing that on your OS drive… man, it will get so wobbly that you will think it's smoking reefers out back.
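
if anyone wants to try that for a one-off migration, it's just a dataset property and can be flipped back afterwards (the dataset name is a placeholder):

zfs set sync=disabled tank/migration    # only for the duration of the copy
zfs inherit sync tank/migration         # afterwards, fall back to the inherited / default value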

This speculation is not going to help us…
The fact is, some new files get corrupted onto my HDD. The reason behind it is unknown. Whether using sync=always will help is unknown.

The storagenode won’t care about 20 missing/corrupted files from the millions I have.

the argument i'm trying to make is that with an ssd slog on the pools and sync always, it's basically impossible to get errors, afaik.

also to get back to finishing the other point i was trying to make for the SLOG SSD.

if we say we have an ingress of 300 KB/s and that takes ½ ms… i think i counted milliseconds as microseconds… so at 600MB/s for a SATA SSD
oh right… that just makes it faster… so 0.6MB/ms, i.e. 600KB/ms

anyways, that means less than 1ms to save 98% of all files, and checking the avg ingress over the last months it's way less than 300KB/s, most likely around 150KB/s,
but let's keep it simple at say 600KB/s,
which takes 1ms to transfer to the slog… so the fraction of each second where that second's worth of ingress is essentially unsafe is 1/1000.
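
same numbers as a quick sanity check (all assumed, nothing measured):

ingress_kb_per_s=600      # assumed worst-case ingress
ssd_kb_per_ms=600         # 600MB/s SATA SSD = 0.6MB per ms = 600KB per ms
ms_in_flight=$((ingress_kb_per_s / ssd_kb_per_ms))
echo "per second, ingress sits unacknowledged for ~${ms_in_flight}ms"   # 1ms, i.e. 1/1000 of each second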

so your system will need to fail at that exact time, and ofc a PLP SSD removes that completely, since in reality the response time might be even less before the SSD can get it into DRAM; ofc that won't happen on SATA because of the obvious limitations.

ofc it won't protect against some issues, my initial thought with it was to protect myself against power outages without having a UPS.

no it won't, but it might the day you actually cause damage to the node for whatever reason; then the continual damage may be enough to put it over the edge… but really it's not the blobs folders that worry me… it's the programs, the OS and such using async writes to make stuff move at a proper speed, but then eventually failing because of it.
weeding out that kind of issue can take a long time… so i just really like knowing that if something goes wrong it's programming, not the hardware recording something wrong because it was rushed.

Sorry but you are making too many points about too many subjects and it’s getting more and more off-topic (and too much for me to read…).