ZFS discussions

The SLOG is added in the right place and sync=standard :slight_smile: There are no manipulations or mistakes here.

No, I used very accurate graphs.

No, it’s wrong; you can see on my screenshots that I also used NVMe for L2ARC, and my graphs display very accurate results.

here is the result:


These numbers do not change over time.

I also wonder why I see SLOG activity on your graphs, so I tried to reproduce your situation with a SLOG at my other location:
zpool iostat -v 1

This is ZFS on Linux with default settings running the storage node.

So, the default settings on ZFS are good but not optimized for the storage node. All my previous results were from ZFS on FreeBSD that I had optimized for the storage node. I also have a ZFS on Linux system that was optimized, and that system also does not use the SLOG. To reproduce your situation, I brought up a new system with Debian 10 + ZFS with everything at defaults, and it used the SLOG during writes.


that’s odd, so only Debian utilizes the SLOG… seems like a bug then maybe… the ZoL code has been used for adding some new features to the FreeBSD version of ZFS, maybe something got broken…
from my understanding, OpenZFS 2.0 will merge the FreeBSD and ZoL codebases into one branch, or something like that, if I understood it correctly.

even if the SLOG feature was broken, who would really notice right away? it’s basically a backup feature for a filesystem that in theory shouldn’t need it… and in most cases the data lost without it is in the range of a few MB or KB… ofc in a 10 Gbit network system at full tilt that could be upwards of 6 GB in the 5 seconds it takes to flush by default… but people running at those levels would have reduced the flush time, so maybe 1 GB… and then there is the whole question of whether very modern hardware has PLP; if so, it might actually save the data without the SLOG…

so in my uneducated opinion the SLOG could be broken in the BSD versions, even though it does sound kinda unlikely… or it simply has a way to keep track of whether the data reached the drives fast enough to not bother with the SLOG… since it’s basically only there to protect against cache / RAM data loss and such.

i could see that as a minute optimization the Linux side might have glossed over in favor of having a stable system instead.

so yeah, that might be where I would put my money… this time xD the BSD version of ZFS is simply more advanced and can tell that your drives have gotten the data fast enough to not bother with the SLOG… ofc that then makes the SLOG a bit more redundant than it already was

i’m also on Debian, also pretty much default here… aside from sync=always and a 32K recordsize atm… xD and ashift=12
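for anyone following along, a minimal sketch of how those settings get applied (the pool and dataset names "tank" and "tank/storagenode" are just placeholders, not from my setup):

zfs set sync=always tank/storagenode
zfs set recordsize=32K tank/storagenode
# ashift cannot be changed later; it has to be given at pool/vdev creation time, e.g.:
zpool create -o ashift=12 tank /dev/sdX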

So what were your optimizations that made your normal system not use the SLOG for the storagenode?
I’m also using ZFS on Linux with Ubuntu 19.04, with defaults on everything except recordsize.

Seems strange that there is a configuration that prevents the usage of a SLOG for sync database writes…

You are right about your graphs, I didn’t look at the other ones, sorry.

10 operations per second is not a lot. It is near zero.

yeah yeah… my point is, it is used consistently and visibly changes the write pattern of the HDD, even if the SLOG gets only 10-50 operations/s. The DBs apparently don’t need more.

ZFS on FreeBSD works significantly faster than ZoL.

This is not enough. I recommend paying attention to these dataset parameters:

-O compression=on \ (or zle)
-O atime=off \
-O xattr=off \
-O exec=off \
-O acltype=off \
-O relatime=off

echo ā€œ30ā€ > /sys/module/zfs/parameters/zfs_txg_timeout

yeah, atime I’ve turned off… exec just controls execution… but I suppose on the storagenode blobs folder or such it might make sense…
xattr I haven’t checked… but yeah, I’ve read it is a resource hog… extended attributes

what do acltype and relatime do? I’m already on zle; I copied my entire storagenode over to get that applied throughout…

@Krey
have you set autotrim=on? if you use SSDs, that should give you a 30% performance increase

This is dangerous with consumer-grade SSDs. I do it manually and only under observation.
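In case it helps, this is roughly what the two approaches look like (the pool name "tank" is just a placeholder):

zpool set autotrim=on tank     # continuous/automatic trim, the setting discussed above
zpool set autotrim=off tank    # back to manual control
zpool trim tank                # one-off manual trim, run when you can watch it
zpool status -t tank           # -t shows per-vdev trim state and progress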

well, the SSDs only hold my SLOG and my L2ARC… so no real loss there if they wear out; it’s quite safe…

i also left 25% additional unused space on them so as to limit their wear

and they are old so meh… thus far it’s been running for a month with no ill effects.

O.O
are you insane? you do know that a special device for metadata kills the pool if it gets corrupted, right… and you cannot remove them…
so if that is your current pool, you should seriously think about getting those special devices under control.

Ah yes, I use compression=on and atime=off

not the rest though… might take a look at those later. thanks.

These errors appear only when trimming. SMART is clean and contains 0 errors; it looks like a timeout issue.
That’s why I turned autotrim off.
I am thinking of adding a third SSD of another brand to this mirror.

O.o

i can see why you would want to run with autotrim off… lol
are you sure the devices support trim? i’ve seen a couple of people strongly recommend that the devices need to support trim…

i think mine trims stuff like… well, my iostat -l says the trim wait is 42 ms… so i guess they wait that long xD

 zpool iostat -l 180
              capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim
pool        alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
rpool       1.99G   137G      0     25  20.6K   326K  731us    1ms  549us  414us   56us    2us    1ms    1ms  524us   41ms
zPool       13.6T  24.6T     23    234   175K  4.19M   33ms    7ms   20ms    1ms  865us    2us   28ms    6ms   12ms    4us
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----

Trim is the special device command BLKDISCARD. Without support for it you will never see "trimming" in zpool status.
These SSDs are not bad. They have DRAM and an SLC cache. I have more than 10 pcs of them and never had trouble with them until this case.

yeah, you shouldn’t see it trimming, but I think the developer said it could happen, even though he tried to avoid it… and it does seem like something fairly recently added…
but I dunno… I think he mentions it in this… I found it quite interesting, but I’m kinda new to the whole open-source thing, so maybe it’s not as new as I think.

one thing about ZFS I really don’t understand… why one can remove mirrors as top-level vdevs but cannot remove a drive added as a top-level vdev… nor a raidz

what makes mirrors different in such a way… that only they can be removed.

seems like something I might actually dig into ZFS for, just to try to solve or at least figure out some of the reasons why it is like that… it doesn’t make any sense in my brain lol
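for anyone curious, this is roughly how device removal behaves (pool and vdev names are placeholders); as I understand it, single-disk and mirror top-level vdevs can be evacuated, raidz vdevs cannot:

zpool remove tank mirror-1   # evacuates the mirror's data onto the remaining vdevs
zpool remove tank raidz1-0   # refused: device removal does not support raidz top-level vdevs
zpool status tank            # shows the evacuation/remapping progress while it runs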

in regard to the dataset parameters above…

i assume this is in relation to the storagenode dataset…
or are there good reasons to turn it all off on all datasets… the ACLs sound pretty useful? and I dunno if xattr is used for anything in Debian… :smiley:

oh, and are you running special_small_blocks for your special devices… dunno how well this works with metadata though… but I assume metadata might run best with small blocks… hence giving a reason for the feature. any thoughts on that?

i think you can remove it

yes, only for storj

Metadata is always written to the special device when one is added to a pool, regardless of special_small_blocks.
The special_small_blocks option sets a block size threshold: data blocks of that size or smaller go to the special device instead of the main pool devices.
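A small sketch of how that threshold is set per dataset (the pool/dataset names and the 32K value are just examples):

zfs set special_small_blocks=32K tank/storagenode   # data blocks <= 32K also land on the special vdev
zfs get special_small_blocks tank/storagenode       # verify what is currently in effect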

yeah, i was just wondering if you ran a 512 block size on the special devices and then… maybe 4K on the pool…
depending on what the special device supports ofc… i guess it might not matter much for modern SSDs… not much anyway… but still there would be extra IO to be gained there, which, since you use special devices, must be pretty important for searching or whatever indexing stuff with heavy utilization of metadata you’ve got.

acltype and relatime are off by default in Proxmox, it seems.

This is not about sector size; it is about recordsize/volblocksize.