ZFS discussions

The ARC is a read cache… there isn't really a dedicated write cache as such, ZFS just buffers async writes in RAM as part of a transaction group before flushing them to disk… then there is the ZIL and the SLOG, which are closely related but not the same thing: the ZIL is the intent log for synchronous writes, and by default it lives on the pool itself, while a SLOG is a separate (usually SSD) log device you can add so the ZIL doesn't have to hit the spinning disks… it isn't a metadata / checksum table for finding stuff, it only holds sync writes briefly until they are committed to the pool.
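For what it's worth, both can be bolted onto an existing pool at any time; a minimal sketch, assuming a pool called tank and two spare SSD partitions (the device paths are placeholders):

add a dedicated SLOG device (holds the ZIL for sync writes)
$ zpool add tank log /dev/disk/by-id/ssd-A-part1

add an L2ARC device (read cache that spills over from the ARC)
$ zpool add tank cache /dev/disk/by-id/ssd-B-part1

check how they show up
$ zpool status tank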

Sorry if I called the ARC a write cache…

Well, the ARC and L2ARC are good for other stuff… like keeping your VMs running smoothly without disrupting the storagenode, and if your OS is on the same vdev, the ARC and L2ARC will learn to keep around whatever it needs from time to time… or if somebody wanted to do performance testing, they would have to use a large dataset so the ARC and L2ARC don't just figure it out and start serving everything with millisecond turnaround from incoming request to outgoing data stream…
Lots of things I find amazing about the ARC… though I must say it could use some of that deep learning added on top… that might just make it mind-blowingly brilliant.
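If you want to see how well the ARC is actually doing, a quick check on Linux with OpenZFS (a sketch, assuming the bundled arcstat tool is installed):

print ARC size, hit rate and misses once per second
$ arcstat 1

or read the raw counters directly
$ grep -E '^(hits|misses|size|c_max)' /proc/spl/kstat/zfs/arcstats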

I don't expect too much egress yet… but as my node fills and more drives get added, I'm sure I will be very fond of it… I'm already very fond of the ARC and L2ARC for how they make my VMs run.
Best of both worlds… I get near-infinite capacity and SSD speeds for the stuff that's used most often.

Yeah, my egress and ingress are about the same… though it does look a bit like I have been getting slightly better numbers with a 1M recordsize… but that may just be the current average having gone up.

Absolutely! I was only referring to running a storagenode.

ZFS caching is great for a lot of things, and even the storagenode DBs might benefit from it greatly.
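One way to lean on that is to give the node's databases their own small-recordsize dataset, so the ARC/L2ARC caches them at a sensible granularity. A sketch only; the dataset name and the 16K value are assumptions, not tested recommendations:

a dedicated dataset for the node databases, small records for small random IO
$ zfs create -o recordsize=16K -o atime=off tank/storagenode-db

then point the node's database location at that mountpoint (how exactly depends on your setup).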

I am, and I have stepped on this a few times. I am running VMs on zvols and actually noticed the problem on my servers, then went looking for a reason.

This can bite you even if you are using mirror vdevs or 512n drives but then, for example, back up the data with a zfs send to a backup server that's 512e and raidz :slight_smile:

That also sort of makes sense… because writes end up as stripes across the drives, minus one drive for parity…
I also get 16K in my case… 5 drives in raidz1, so 4 data drives × 4K sectors = a 16K stripe.
That's interesting… though I'll have to read through that thread tomorrow… so maybe 64K… I kinda hate to go that high… but that's easy to say when I have little real idea what I'm talking about lol…
I suppose if the drives are 4K then the minimum would be 16K, and if you go over you will use another 16K (one full stripe), and if you went with an 8K volblocksize then half of the stripe would always be empty…

Not sure why it needs to go to 64K, but I suppose I'll learn that tomorrow xD

I expect to set all my drives to 4K physical and logical sectors after the next reconfiguration… with my new HBA I should be able to do that, and that gives me 7% less overhead. \o/
Or rather, the sectors are already 4K, the drives are just emulating 512 for my old controller's sake.
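In case it helps anyone later, roughly what that looks like in practice. A sketch only; pool name, devices and sizes are made up, and note that neither ashift nor volblocksize can be changed after creation:

force 4K sectors when creating the pool, even if the drives still report 512e
$ zpool create -o ashift=12 tank raidz1 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde

create a zvol with a 64K volblocksize for a VM
$ zfs create -V 100G -o volblocksize=64K tank/vm-disk1

verify what was actually used
$ zdb | grep ashift
$ zfs get volblocksize tank/vm-disk1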

Another way to explain my slow transfer between two datasets on the same vdev.

Geez, this got turned into a real rats’ nest. So the conclusion about compression ratio is that it does not show what the compression is actually doing?

It might show what it is doing, but not what we expect. For example: with recordsize 1M, a 2.23MB file needs 3 records, which is 3M logically. There is a lot of empty space at the end of the last record because the file is quite a bit smaller, and compression squeezes that padding away, so the compressratio comes out around 3M/2.23M ≈ 1.35.
This also means that if the average file is 2.23M (but many are actually smaller), it might have been better to use a recordsize of 768K, because 2.23M fits into three 768K records almost exactly and should then give a very small compressratio. I might try that when I'm back home in a few weeks.
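A quick way to see this on a real dataset (the property names are standard ZFS, the dataset name is just an example): compare the logical size with what is actually allocated and look at the resulting ratio.

$ zfs get logicalused,used,compressratio,recordsize tank/storagenode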

So ZFS might show it correctly, but my assumptions were wrong, because in this case it is not a real space saving compared to the actual file size. For log files you get a real space saving, but that is of minor importance to a storagenode 😁

I'm currently in the process of testing compression on btrfs, because I noticed the overhead being quite significant on my XFS filesystem, which itself should be optimized for DBs and many small files out of the box. I'll keep you posted on whether that changes anything once the full node is synced over to the new btrfs filesystem, or whether the overhead is just inevitable.
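For comparison, something like this should work on the btrfs side (a sketch, assuming a reasonably recent kernel; the device, mountpoint and blobs path are placeholders):

mount with transparent zstd compression
$ mount -o compress=zstd:3 /dev/sdX /mnt/node

check what compression actually achieves per directory (needs the compsize tool)
$ compsize /mnt/node/storage/blobs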

So what did we learn, in short:

Recordsize
Increase your recordsize if you have HDD IO-wait issues / low throughput when copying large files.
Decrease your recordsize if your system nearly stalls or is randomly slow for searches, databases, antivirus scans.
Optimization: setting recordsize to 256K roughly halves the IO required for the same work (compared to the 128K default).

Compression
ZFS runs best with lz4 compression, which is why it's the default.
Optimization option: to save CPU time on a storagenode, one can switch from lz4 to zle compression.
Note: zle works best on the blobs folder, lz4 works best on the rest…

Other Optimization Options
atime=off should roughly halve the IO used on disk tasks.

Advanced Optimization
4Kn drives should run zvols with a 64K volblocksize on an ashift=12 pool, though sadly neither setting can be changed on an existing vdev or zvol… (see the command sketch just below for how these settings are applied)
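A rough sketch of how the above translates into commands (the dataset names are examples; recordsize and compression changes only affect newly written data):

$ zfs set recordsize=256K tank/storagenode
$ zfs set atime=off tank/storagenode
$ zfs set compression=lz4 tank/storagenode

if the blobs folder is its own dataset, it can get zle separately
$ zfs set compression=zle tank/storagenode/blobs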

Reasons and Ramblings Below…
If one didn't run compression, it might take up 1.33x more space on disk… from what I understand… though I'm not sure whether ZFS without compression would avoid writing that padding, but the compression ratio seems to indicate that it might not.

I returned to a factor of 2 on my recordsize, so 256K up from 128K; I don't want to mess my system up too much trying to optimize for Storj.

It runs much more stably than with a 1M recordsize, even if 1M did seem to use less IO; ofc I got the HBAs installed since then, so now everything is running differently.
But I had spikes of 1.2 s delays on 1M recordsizes, and now I get a max of 100 ms every time I reconnect to the server… otherwise it's basically just a couple of ms of backlog… even the regular HDDs are running much better… but that may be at least in part due to the HBAs.

However, one interesting thing I noted was that my storagenode did have higher ingress when I was running 1M recordsizes… but it could just be random chance… I will give it a test at a later date.
I might set up a test node at some point… or record some Storj data input and then replay it to test things like the correct recordsize…

A Storj block is 2319872 bytes max, it seems.

I do think a 256K recordsize might make the most sense, because that max size is about 8.85 × 256K, so it needs 9 records,
making it less than a factor of 1.02 off in padding, and it would basically halve your IO requirements compared to 128K, if we assume the database is in cache/RAM.

It will decrease your IO count, but say you need to access a database entry or read the first 25 bytes of a file: ZFS will load the full record, read the first 25 bytes of 256K and discard the rest… making some operations immensely slow. Basically your computer would be getting useful work done at a rate of about 0.01% in that particular case while working at full tilt; thus it will limit your potential bandwidth in some situations.

Use zle compression at least on the blobs folder; I would use lz4 or gzip on logs.

And then you should set atime=off.
Supposedly atime can double disk IO, and it's basically an old artifact of features that are rarely used anymore… it updates the last access time on every read, but better ways exist to get the same information.

Anyway, you might not save any space by running different recordsizes, but you do save space versus running without compression if you use a recordsize that doesn't fit well.
And to save CPU time you should run zle compression.

You cannot compress the data itself because it's encrypted, and part of modern encryption is making the information look like static (old-TV style); there is no known way to compress that… and it's most likely impossible, because if you could compress it you would essentially be breaking the encryption, since you would be finding patterns…

Which, for AES-256, is commonly said to require more energy to brute-force than the sun releases from now until it burns out…

So yeah… don't waste your time even trying…

That should not happen if the recordsize only applies to the Storj dataset. Changing the system's recordsize was not part of the Storj optimizations and is indeed a bad idea; you describe well why. The system consists of a lot of small files.

Well, I'm pretty new to ZFS, this is still the first pool I ever made… so being blissfully ignorant I mounted the zpool, and after a week's worth of testing and attempting to crash it by pulling drives out while it was running… I figured I would break it before I started using it… but it held out just fine… and so I created my storagenode on it… but it became a plain folder in the zpool "zroot", and thus the dataset commands (zfs set and such) can't target it separately… so to change any settings for the storagenode I have to change the whole pool, and the node is now large enough that it takes up more of the zpool than my own stuff does…

That was actually why I tried to copy it a few days ago… I was trying to get it into its own dataset… but it's too large now… the ingress on the vdev outran my capacity to copy it… xD

So yeah, I apply all the changes to the entire zpool, which also has some VMs, and I figured since I'm mainly using larger files anyway… going from 128K to 256K won't hurt me… and Storj will run better…
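For the record, the usual way out of that corner is to create a proper dataset and copy the node into it in two passes, so ingress can't outrun the copy. A sketch only; names and paths are made up, and the node has to be stopped before the final pass:

$ zfs create -o recordsize=256K -o atime=off zroot/storagenode-ds

first pass while the node is still running
$ rsync -a /zroot/storagenode/ /zroot/storagenode-ds/

stop the node, do a final catch-up pass, then point the node at the new path
$ rsync -a --delete /zroot/storagenode/ /zroot/storagenode-ds/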

I don't think you really get that much advantage from running a 1M recordsize… unless you really need 8 times fewer IOs than the default, and remember, if you want to do some processing on the entire dataset, like reading all the checksums or whatever… then for each file you touch you will load a 1M record and throw away the rest… so with me already having like 2 million files or more… reading them all would mean loading 2 million MB of data… so 2 TB…
Not sure how big the ZFS checksums are, but with a 256K recordsize… I would only have to read 500 GB, a much more manageable size.

A 1M recordsize is for files like video and such… making it perfectly sized to seek around in quickly without needing to be terribly precise, and it can allow throughput upwards of 600 MB/s on just 5 drives in a raidz1.

But keep running 1M recordsizes and we will see long term what happens… I'll stay at 256K, I think that's a good place… the only reason I could see for 1M or 768K would be something like an SMR drive… ofc then another thing comes into consideration, which is how big the zones the SMR drive writes in are… because if one lines that up with the filesystem / partitioning, one might actually greatly improve the performance of the SMR drive for certain applications.

I've got 5 drives, so halving the IO per byte means they in a sense process IO like 10 drives running 128K… very rough estimate XD and a primitive perspective… but the easiest way to explain it.

Oh and btw… remember to scrub…
I just started running one today xD; the estimate is 9 hours of total work time.
Not sure if that would be faster or slower with higher or lower recordsizes though… it may just be the same, since it really reads everything and checks it… but I dunno.
I know the guy who recommended that one could run the OS and L2ARC on the same SSD clearly just used his ZFS for low-IO use cases…

[[[ Verifying Data Integrity ]]]

The commands are pretty self-explanatory.
A scrub will take a while, since it basically reads and verifies all data in the pool.
It runs at a reduced priority, so it may have some performance impact,
but it shouldn't be a major disruption to normal operation…

Start a scrub:
$ zpool scrub poolname

Check progress:
$ zpool status poolname

Stop a running scrub:
$ zpool scrub -s poolname
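If you tend to forget it, scrubs are easy to schedule; a sketch of a weekly cron entry for root's crontab (pool name and time are placeholders, adjust the zpool path for your system, and some distros already ship a scrub timer):

run every Sunday at 03:00
0 3 * * 0 /sbin/zpool scrub tank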

I don't think ZFS stores checksums in data records; they live in the block pointers / metadata. A checksum is far smaller than the default 128K recordsize, so giving each one its own record would waste a lot of resources.

I’m not sure how it would perform on an smr but it should save some iops.

Yeah, never run the OS and L2ARC on the same SSD.
I actually decided to run my OS from 2 WD Reds in raidz and use 2 SSDs as cache for reads and writes. I preferred that setup over using the SSDs in a raidz, especially because SSDs typically have a cache too, which is just some RAM. So if power goes out, you still lose some data… which also means they are not perfect as write caches and there is a risk, but I figured this is the best setup for a small home server without ECC RAM or a UPS…

I scrub every week, always takes more than 24 hours on 3TB…


I also have no idea where checksums are stored, though it would kinda make sense to keep them as part of the records… but I'm sure there are very good reasons for however it's done…

I'm not that worried about data loss when running CoW, since it doesn't really corrupt anything… and power here is stable as a mountain… I've had computers running for years without reboots…

And most of the data I work with is non-critical; I just like it not to rot in long-term storage, which I never thought about in the past, but after getting more and more into data storage, and decades of using computers, I've started to notice how my older data tends to degrade, which can be a pain.

Yeah, never run the OS and L2ARC on the same SSD. I really assumed that the OS would just end up being loaded into memory for the most part… I'm going to move it over to the ZFS pool when I get around to it and then boot from a USB stick, just to keep the HBA from slowing down my boot times.

I tried moving my SSD to my onboard SATA connectors, mostly just as a fun experiment and because I had some non-functional bays in the server and thus couldn't install all the new drives for my 2nd vdev.
It looks a bit like that caused some additional CPU IO delay, maybe because I max out the IO, or because the max bandwidth is much lower, or simply because SATA 1.5 Gb/s (which I think it is) runs at a slower link speed, making the CPU wait longer for it to respond…
Since I think it is the latter, I will be very interested to see if my CPU IO delay goes down when I move the drive back onto the HBA, which is SATA 6 Gb/s I think.
The system seems to run fine though, almost better than ever, but I did also manage to get my RAM up from 800 MHz to 1066 MHz, and I got energy saving and turbo boost enabled, which gives me a major kick in additional speed and nearly halved my power consumption…

So it's not easy to tell what the true cause of the IO delay is… but I'll move it back at some point soon to check, because if the CPU IO delay is directly affected by the SATA link to the OS drive… then that would be a strong argument for maybe getting a better SSD… or relying on the ARC.
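One way to check what link speed a drive actually negotiated, rather than guessing (a sketch; the device name is a placeholder and the first command needs smartmontools installed):

$ smartctl -i /dev/sda | grep -i sata

on Linux the kernel also exposes the negotiated speed directly
$ cat /sys/class/ata_link/link*/sata_spd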

I assume you mean raidz1 and not raidz0, since doing your setup without redundancy would be kinda pointless… at least in terms of protecting against disk failures… :smiley:

I totally neglected to scrub… but mostly because this is only the end of my 6th week of ever using ZFS.
It didn't seem that important at first… until I found out at the start of this week or so that I had a disk get faulted by ZFS and it had never been fixed…

50% done now after 5 hours of scrubbing 7.75T of actual data; ofc it's only 5 drives, so… it looks like it's going to take about 10 hours…
I suppose there must be many factors that affect scrubbing time, but shouldn't you be faster than me if you have two drives in raidz1?
I mean, my data is essentially spread over 4 data drives, so I'm reading 5 drives for 4 drives' worth of data, i.e. at 125% of what a single drive could scrub on its own… but in a mirror setup you should be reading at 200%.

Ofc the number of files and whatnot might have an effect on that… but still, you should scrub far faster than my setup…

Yeah, raidz1 of course, a mirror. It is scrubbing at the full speed of the HDDs, but it does not matter how many drives are mirrored, because every disk is read and they all contain the same data. So it takes at least the amount of used space divided by the read speed of a single drive.
Small files do affect the speed, however, and so does other disk load, because scrubbing is a low-priority task and quickly slows down if there is other activity.

Storj having hundreds of thousands of files didn’t help the scrubbing performance :grin: certainly not with recordsize 128k.

BTW, if your OS SSD usage is low, it might indeed work well to have an L2ARC cache on it.
So if you only use your OS partition for running docker containers and all the workload then happens on your HDDs, there should be almost no SSD usage and most OS files should be in the ARC anyway.

That does not make sense.
raidz1 is the equivalent of RAID5 (a bunch of drives with single parity)

You probably meant raid1, which zfs calls just mirror.

Ah yes, I got confused there… I’m running a mirror, which would be equivalent to raid1 (just better :grin:)


I suppose since raidz1 is like RAID5, it cannot really run on just two drives the way RAID1 can, which is why you need another name for it… I'm just so green I'm still kinda thinking of the number as the redundant disk… xD but I suppose that isn't always correct, because math is hard xD

I actually hadn't thought of that… raidz1 and RAID1 look so alike, but parity isn't mirroring xD doh

I think I just answered my own question in regards to speed: I was thinking of bandwidth multiplication but not actual data integrity checks… ofc one has to read every drive in full, which means a scrub takes about the time it takes to read one drive, and so whether there are 1, 2 or 5 drives in a mirror, or a raidz2 with 10 drives, as long as there is enough bandwidth it takes about the same time to read them fully.

Meaning you have 8TB drives or so…

Excuse my stupidity.


Generally I just think I want to throw everything onto ZFS, aside from maybe some sort of backup bootable partition, in case something goes wrong and I have to fix it… then it's nice to still have an OS without too much trouble and be able to troubleshoot the primary OS… I kinda want to get an NVMe Samsung 970 Evo or Pro at 1TB… they just have such lovely performance… but I think I'll tune performance with the slow Crucial BX300 750GB I've got before I go all in on IOPS.
And I want to see if this CPU delay is actually related to the OS drive, and how it would perform when running on the zpool with the SSD as L2ARC… to know how best to hook it up…