ZFS discussions

so people say, but the original DARPA spec says CRC would be used over LAN… while not over the internet… ofc that was their 40-year-old design.

alas not really relevant to anything i’m doing xD anyhow…

i’ve been continuing to dig into zfs, and wanted to hear if i got this right… from reading a good deal of iXsystems posts i sort of got that if i run with a SLOG i can run with synchronous writes disabled.
so far i’ve had it running for over 24 hours with sync=disabled and there don’t seem to be any clear detriments for me… yet
been trying to copy my storagenode, AGAIN lol so i really don’t have a choice; it goes way too slow on standard or always sync. it does seem to affect some webpages i run on the server a bit…
logging me out over time and whatnot… but this way i can get through 2+ TB a day
ofc that’s an avg of 160MB/s, because it’s both reads and writes… +4MB/s from the storagenode
hmmm that number is suspiciously close to the SATA300 port i got my SSD moved to… i should try to move it back now that i can…
see if it goes much higher…
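for reference, this toggle is just a dataset property… a minimal sketch, assuming a pool called tank with the node on tank/storagenode (both names made up):

# acknowledge writes from RAM and skip the ZIL entirely
zfs set sync=disabled tank/storagenode
# check the current value
zfs get sync tank/storagenode
# revert to the default behavior later
zfs set sync=standard tank/storagenode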

ohh ffs i just cannot catch a break with these old drives… either it’s damage to my backplane, which i sort of figured i had ruled out, but even with different drives it wants to give me errors on the drives…
tho it may just be the drives… i guess i should just stop again, set up a 2nd zfs pool on another computer, move the entire node out, and refurbish the server.

lovely, this $10 i’m getting added to my wallet and escrow might just be the hardest money i’ve ever earned lol…

the irony being that had i been running raidz2 then i essentially wouldn’t have had to do a damn thing about it… yet

maybe it’s a sign i should just get that cluster made already… got the hardware for it anyways…

Where did you get that from? The SLOG is there to cache sync writes. So if you disable sync writes, the SLOG is not being used at all because all writes will be going through the RAM like the async writes do.

It will work perfectly… until a power failure, kernel panic, or some other problem turns off or reboots the server. Then you’ll have corrupted databases etc.
also, sync=disabled will not use the ZIL anyway.

The ZIL is usually written on the array/vdev/pool disks, for easy recovery in case of a power loss; that way zfs will know where it left off and which files might have been worked on before the power went out.
Getting a SLOG for your ZIL moves the Intent Log to a Separate Log device, preferably an SSD with persistent-memory-like behavior, like a Samsung 970 EVO/Pro, which tops out at about 1ms latency at high workloads. i generally see that as a better buy than the Intel Optane, even tho Optane has lower latency at lower workloads… the 970 is the best performing imho

so after a power loss the data might not have been written to the main disks yet, but it still hit the Intent Log, and thus the usual penalty of synchronous writes becomes nearly superfluous…
however running a SLOG isn’t free either… in some situations you will run out of bandwidth, because you are basically using double bandwidth when stuff needs to hit the SLOG before going to the array…
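for anyone wanting to try it, adding a SLOG is a single command… a minimal sketch, assuming a pool named tank and a hypothetical NVMe device name:

# attach a separate log (SLOG) device to the pool; only sync writes land on it
zpool add tank log /dev/disk/by-id/nvme-EXAMPLE_SSD
# the device shows up under a separate "logs" section in the pool layout
zpool status tank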

@kevink while that article made a few interesting points, it didn’t touch on the question of whether using a SLOG makes it pretty much safe to disable synchronous writes.
the stuff about PLP was kinda interesting tho, and how latency gets added from going to the PCIe bus and then back to the HBA again… i had forgotten about PLP, which is certainly a feature i would want for my SLOG, and it might be wise if i can avoid going out of the HBA, tho sadly i doubt i can… i’ll have to check up on what the actual numbers are for the latency caused… but i’m pretty sure it explains my CPU IO delay after i put my SSD on the older MOBO SATA300 xD
oh yeah i read this about ZIL/SLOG

i read this article about SLOG / ZIL and sort of got that out of it… even tho they ofc don’t straight up say it… which is okay, because there would ofc be some risk involved, as there usually is in high performance stuff.

@Pentium100
sadly i think you might be right… the performance gain of running with sync writes disabled is excellent tho, heh. it took my system like two days to copy my node… and i think i got like half that speed on standard, and even lower on always…

still, if we imagine a power failure… with it on disabled, and if we imagine the data is going to the SLOG,
then the drive would be like 10ms on avg behind at present… so if i’m getting 4MB/s, then when the connection is dropped i’m like 10ms behind (which PLP might catch), so 10/1000 of 4MB, so 40KB lost.
and zfs doesn’t update the… what are they called… the block pointers / metadata it adds to files… until after it’s done writing, so essentially it would just look like it was never written at all…
ofc if that data were to be audited then it would be missing… but really, 40KB… i don’t think it really should be a concern… if the SLOG actually works when sync is disabled, or if i can get it to work somehow if it doesn’t… xD

anyways i know i want it to work like that… i know it might be a bit risky… but the performance numbers i’m seeing are just too good to ignore…

SLOG is there only for sync writes.

Even 40KB in a database can be the difference between a working database and a corrupt one. But it may be much more than 40KB. Databases have their own logs, but for them to work properly sync() cannot lie.

Basically, the sync=disabled setting for zfs would be useful for a temporary database or something else that, if it crashes, does not really matter.
Besides, Storj uses async writes for the pieces themselves (something that can result in a missing piece after a power failure).

after lots of performance testing i ended up running SLOG only (maybe if i had a better ssd this would be the case) and the sync setting on always, so that it always utilizes my ssd for incoming writes and thus can ack at minimal latency. it seemed that running async actually affected the data flow from the “satellites” / network…

this seems to have reduced my latency even further… slowly clawing my way up to near a 90% success rate on uploads… xD

 ========== AUDIT ==============
Critically failed:     0
Critical Fail Rate:    0.000%
Recoverable failed:    0
Recoverable Fail Rate: 0.000%
Successful:            138
Success Rate:          100.000%
========== DOWNLOAD ===========
Failed:                13
Fail Rate:             0.394%
Canceled:              5
Cancel Rate:           0.151%
Successful:            3283
Success Rate:          99.455%
========== UPLOAD =============
Rejected:              1
Acceptance Rate:       99.996%
---------- accepted -----------
Failed:                0
Fail Rate:             0.000%
Canceled:              3605
Cancel Rate:           14.742%
Successful:            20849
Success Rate:          85.258%
========== REPAIR DOWNLOAD ====
Failed:                0
Fail Rate:             0.000%
Canceled:              0
Cancel Rate:           0.000%
Successful:            3
Success Rate:          100.000%
========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              70
Cancel Rate:           14.199%
Successful:            423
Success Rate:          85.801%
========== DELETE =============
Failed:                0
Fail Rate:             0.000%
Successful:            543
Success Rate:          100.000%

short log, had to rerun my run command this morning as my rsync of the storagenode finally finished, only took since the 28th… lol but it didn’t seem to affect the storagenode much while it was running, so that was good… tho my success rates were slightly lower…

so i should finally have my entire storagenode on zle (zero-length encoding) compression with 256k recordsizes, and a compression factor of 1.01x. the system seems to run better on zle rather than lz4, dunno why… i’m guessing lz4’s endless attempts at compressing encrypted data are not worth it.
so basically i sacrifice a slight bit of disk space (lz4 needed like 2% less), but i gain lower latency which pushes up my success rates, or so i think… xD
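for context, that migration boils down to two dataset properties; they only apply to newly written data, which is why the node had to be rsynced into a fresh dataset… a sketch with made-up names:

# zle only compresses runs of zeros, so it wastes little cpu on encrypted pieces
zfs set compression=zle tank/storagenode
zfs set recordsize=256K tank/storagenode
# the achieved ratio (the 1.01x above) can be read back afterwards
zfs get compressratio tank/storagenode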

while i’ve been waiting for the damn rsync to finish i made an epic cron storagenode log setup…

might seem rather simple to veteran linux users, but i had been looking for something like this for docker, so i just made it myself… now i can kill my node without ever dropping a line of my log… xD and i can still run docker logs storagenode --follow
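the core of a setup like that is just a cron entry draining docker’s log into a dated file… a minimal sketch, not necessarily the exact script above, with made-up paths:

# crontab entry: every hour, append the last hour of container output to a per-day file
# (% must be escaped as \% inside crontab; the 1h window is approximate at the boundaries)
0 * * * * docker logs --since 1h storagenode >> /tank/logs/storagenode-$(date +\%F).log 2>&1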

so all is right with the world… then i created a logs dataset and tried dedup (utterly disappointing), and ended up using gzip-9 on the log folder for a 9.8x compression ratio using a 1M recordsize… the 1M recordsize didn’t give much… but i’m so close to 10x lol, if i could squeeze more out i would…
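a dataset like that can be created with the heavy compression baked in from the start… again a sketch with hypothetical names:

# gzip-9 is slow to write, but logs are small and extremely compressible
zfs create -o compression=gzip-9 -o recordsize=1M tank/logs
# the ~9.8x figure above corresponds to this property
zfs get compressratio tank/logs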

expand it for details on how it works, it’s well documented.
and ofc critique is always welcome.

oh yeah and i got my hdd fixed; it turned out to be a problem mainly with its buffer/cache, so i turned that off, and now it’s been running for days with only 2 read errors… xD and i did pull my OS drive, which i had forgotten to put on the UUID-type zfs integration, so i had to turn off the machine while it was running… zfs didn’t care tho… i was running sync=standard at the time… tho… so there is that caveat
running async didn’t seem to cause any damage; the only thing it does is make my web GUIs disconnect over extended periods.
so one could do that for performance reasons… and i did, for moving the files faster… it would have taken over a week if i hadn’t… copying between two datasets on the same array of drives is rough.

That doesn’t sound right tbh… sync=always would force async writes to go through the SLOG instead of through the RAM so would make async writes a lot slower.
I don’t see how this would improve performance for you.

I’m as much of a perfectionist as anyone could be… However, there’s something to be said for fighting with a system which will yield significant diminishing returns with each attempt at creating the most “successful” node.

The upload success metric is not very important as long as the rate is near the network average of around 75% or so… I can never remember exactly what the algorithmically set rate is…

The download success metric is far more important, as that’s how a node accumulates most of its revenue.

I agree. Even a slow node on a Pi that might only have a 20% success rate will get full eventually. Most revenue is due to egress, so download is the only important rate, and that one is >90% for most people.

exactly… that was another thing they touched on in one of the iXsystems posts or videos i’ve seen.
basically if your data has certain parameters, such as high io for various reasons,
then you want to do sequential writes to the drive; the ram is volatile and thus needs to be flushed more regularly, and then it goes to the harddrive…
if you send it to the ZIL / SLOG instead, then you write the data in no time at all, with up to 100 times the IO, and then flush it every 5 seconds in one large sequential write, which is what hdds do best.
that can affect some database loads on a ZFS system greatly…
and essentially, even tho it works with 2MB sized data chunks, the storagenode is essentially just a massive database.
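the 5 second figure is the default transaction group timeout in OpenZFS on Linux, visible as a module parameter:

# seconds between flushes of dirty data from RAM to the pool (default 5)
cat /sys/module/zfs/parameters/zfs_txg_timeout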

i wasn’t sure it would work, but it seems to… and yeah, download success rate is mostly at the theoretical limit, but i have noticed a good deal of odd behavior on the bandwidth side of the downloads…
like, it will keep the 98% success rate, but the bandwidth used might halve, depending on settings and activity… which is why i wanted to reduce my read latency on the drives.

also, you say a 20% success rate is fine… yeah, i guess, so long as the system is automatically balancing the load, but do you think the high bandwidth customers are in those 20% of uploads you win…

i’m optimizing because later, when my node gets into a decent capacity range, i cannot really rework everything or allow myself to take the same chances i can on a 7-week-old node.

and i needed to move it into its own dataset anyway, so i could change its properties; getting it running on ZLE was just something i wanted to see if it would work well…

so long as i’m paid for storage, i would rather have optimized a bit… it might not seem a great increase going from 75% to 85%, but still, that’s like 40% of the files i was losing the race on before that i now win…
might not matter… but then again… it might

if nothing else, then i know my system is running well, so i don’t run into performance issues sooner than i need to…

kinda working on building a small data center here… so
but yeah, @anon27637763 and yourself are most likely right, it might not matter at all…

i’m more happy with my copy being done and my new log method… xD
and btw, you do know some systems actually go 99.8% or so in both ingress and egress, right…
so i still have at least 10% more to gain… of useless numbers…

oh and fun fact, ever since i got to 85% success rate… my ingress has been lower… hehe but i think that’s just the network atm

Found this, pretty great, even tho the speaker isn’t too used to presenting…
trying real hard to dumb it down enough for those watching to actually understand it, but not get offended xD

I guess it could reduce the HDD IO load.
A program using async writes, however, will have to wait longer, because writes to the SLOG are still slower than writes to RAM. But this might not matter in many applications, especially with a fast SSD.

from how i understand it, and how the iostat looks, it still touches memory; the only difference is that it won’t go directly to the HDD but to the SSD instead, until it’s all flushed in a big sequential write.

so any application sending async writes will not feel any difference… really, i also think that putting my OS, L2ARC and ZIL on the same SATA ssd created a whole lot of moving stuff around between partitions, slowing everything else down…

and since i’m not well versed enough to have just moved my OS to the pool yet… the simple solution was to get rid of the L2ARC, because it was by far the biggest demand on the drive.
and i already expanded my RAM to 48GB, so that should be enough…
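removing a cache device is non-destructive and works on a live pool… a sketch with a hypothetical partition name:

# drop the L2ARC (cache) device from the pool; only cached copies are lost
zpool remove tank /dev/disk/by-id/ata-EXAMPLE_SSD-part3
# the cache section should then be gone from the layout
zpool status tank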

and apparently it is… however i kinda broke my VMs’ network or something broke… so maybe when i get those back i might miss the l2arc, but at present it only seems to have slowed down my overall performance; sure, some minor things might have been better, but my performance now seems generally a lot better…
but i also had like 4 days of nearly constant high load on the drives, plus fighting with a bad drive and having to scrub the entire pool all the time… so i really got some load testing in…

i may go back to having an L2ARC, but i kinda want to get a PCIe NVMe one.
tho if i feel i need it when i get back to the VMs, then maybe a much lower feed rate on the L2ARC could be one good solution… also, adding TRIM changed stuff too, and now that the storagenode is on ZLE instead of LZ4, knowing exactly what had the most impact might be difficult to figure out.
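on Linux the L2ARC feed rate is tunable at runtime, so a lower feed rate is easy to test… values below are only illustrative:

# max bytes written to the L2ARC per feed interval (default 8388608, i.e. 8MiB)
cat /sys/module/zfs/parameters/l2arc_write_max
# halve it to reduce SSD wear; not persistent across reboots
echo 4194304 > /sys/module/zfs/parameters/l2arc_write_max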

also, with sync on always i am slightly safer if i lose power, at least until i get a UPS for emergency shutdown; that it seems to produce better latency for my node is another bonus.

but as you might already have caught on to, i change my opinions a lot! xD
have to work with the information one has and what can be tested…

also there is the whole SSD wear issue… have you seen how much data it can go through when you are copying data and it cannot figure out what to do… so again, another factor in why i figured i would turn it off… for now… i will most likely turn it back on and off a good deal more times over the next few months to really gauge the performance difference under different load conditions.

and certainly, if i can get that persistent L2ARC feature… then why wouldn’t i want it… but a very old L2ARC would also be full of very strong data, and with a limited feed rate even a multi-day backup or such shouldn’t evict the useful data much…

first time today i tried to run my rsync without the storagenode running… my copying speeds were 4 to 8 times higher, so without a doubt the storagenode puts some serious pressure on the disks…

but i knew that back from my V2 testing, which is partly why i got the setup i do…

I recommend these posts on Reddit:

All writes go to RAM but with sync=always even the async writes have to wait until the data is in the SLOG/ZIL which takes a long time. Writes are generally flushed from RAM to disk every 5 seconds, doesn’t matter if async or sync writes. Data from SLOG/ZIL is never read, unless your server crashes and zfs has to recover.
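an easy way to see that behavior is to watch per-vdev traffic; during normal operation the log device should show writes but essentially no reads (pool name assumed):

# print per-vdev ops and bandwidth every 5 seconds; the SLOG sits under "logs"
zpool iostat -v tank 5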

yeah a l2arc probably has no benefit with that amount of RAM, unless you always run out of RAM xD

well, i read it was supposed to work, i think from iXsystems, and i tried it and i saw a significant performance boost… i dunno exactly why…
it was not like the zil wasn’t always there, nor had the l2arc just been removed, because i tried it before as well and it didn’t work as great as it does now…

and if you read into the thread you posted, there are a few people telling us what can’t work and flaunting their knowledge of how the basic concepts work, but when it comes down to it and somebody actually tests it, they turn out to be wrong…

I got no clue why it works… ZFS is very advanced; i doubt all of its creators could tell you why this would work… tho i’m sure a few could. the fact is that testing shows that it does, particularly for high io loads…
obviously it wouldn’t improve your throughput, especially on larger arrays, because of the “limited” ssd bandwidth.

if i was to hazard a guess, setting sync to always works in my case because when one sets it on… the ram and the zil/slog flush in sync, and thus it’s one bigger, lower-io write instead of two separate ones… (but again… it’s like me attempting to explain general relativity using a bowling ball, an orange and a trampoline) the base concept might be right, but it’s just such an oversimplification that it cannot really be used for much, nor can anything be derived from it, at least in perfectly useful terms…

tho it does kinda make sense when explained, it requires gravity to explain gravity… which makes it circular logic… so basically i’m saying i dunno why… but when i push this button it goes faster… at least for my current setup and workloads.

running out of RAM at this level, without running a few too many fixed-memory-allocation VMs… well, you don’t feel it… all you see is that the ram will drop from 85%-90% used to maybe 70 or 60%, depending on the load… like if i set the large rsync copy running it drops to like 60%… that again one could change by running a lower recordsize, because that apparently decides how much RAM the ARC will drop to be ready to receive and work on data…

basically they say that zfs runs best at 90% RAM usage, maybe even higher, because dropping cached data takes basically no time at all and the data is already stored on disk… so with ZFS, when it’s running without load, you actually run on next to no free memory, but that’s a good thing… it means the system is running at its best, fastest, lowest latency possible, because it has a ton of stuff ready in the arc…
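the ARC’s current footprint and ceiling can be checked directly on Linux… a minimal sketch:

# current ARC size and configured maximum, in bytes
grep -E '^(size|c_max)' /proc/spl/kstat/zfs/arcstats
# or the friendlier summary tool shipped with OpenZFS
arc_summary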

So i’m stuck with a zfs issue…
i currently have a 5 drive raidz1 with a failed drive.
so to migrate the datasets to a new pool i put in 4 new drives; however, because of my f’ed backplane i had to put one of the four new drives in the formerly used bay, because i could only get 8 of them to work.

and thus comes the issue: now zfs or the zpool command refuses to create a new pool on the 4 new drives, because zfs still regards the offline drive as hooked up over /dev/sdh
even tho i used the ata… uuid or whatever it’s called… i tried a reboot and the -f parameter
zpool create zroot …/dev/disk/by-id/ 4 long drive names

if anyone got a good idea, i’m open to suggestions… but hopefully i’ll figure it out soon…

cannot detach /dev/disk/by-id/wwn-0x5000cca2556d51f4: only applicable to mirror and replacing vdevs

zpool create -f zroot raidz1 /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_Z252JW8AS /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_99QJHASCS /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_99PGNAYCS /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_531RH5DGS
cannot create 'zroot': I/O error

the proxmox zfs tool told me it was /dev/sdh, which was the linux name of the old drive… can’t i just rename it… hmmm

If you’re absolutely sure that these are the drives you want to use:

zpool labelclear /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_Z252JW8AS-part1
zpool labelclear ...

Followed by

wipefs -a /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_Z252JW8AS
wipefs -a ...

Then try creating your pool again.


now we are cooking with gasoline… xD

no luck… i even switched around two drives… :sweat_smile: even one of the last active ones while the system was running… i did stop the storagenode first tho lol… the system wasn’t happy but seemed to survive it just fine lol
i figured if i placed an active drive from my pool into the failed drive’s bay and then put the one for the new pool in the just-moved drive’s bay… it would get confused and just ignore that there had been a drive there… i guess i’ll open it up tomorrow and attach directly to an onboard sata port, and then i can easily bypass this failsafe feature

it’s like zfs just sets the bay on standby until the old pool is restored

/dev/disk/by-id/ata-TOSHIBA_DT01ACA300_99PGNAYCS: calling ioctl to re-read partition table: Device or resource busy

Interesting. What kind of enclosure do you have?