Zfs discussions

SGC · May 19, 2020, 8:05pm

Okay i guess i misunderstood something somewhere about those special devices, no matter… for now i consider them a bit to dangerous for my use case … why do you use them?
i mean you must have a pretty good reason to risk the pool for having them…

what was this for exactly…?

i have been pondering changing my flush time for the slog flush txg… tho not sure if that is a good idea… i was thinking of trying it on maybe 10sec instead of the default 5
command looks slightly related…

i was tinkering a bit with the small block stuff because i wanted to see if i could get my slog to run 512… but turned out to be related to other issues also…mine is set to 512, but since it only applies to small block devices it should be irrelevant i assume
apparently its only applied to special devices, sadly

Krey · May 19, 2020, 8:18pm

it speedup a storj must more significally when slog or either l2arc. Just get info about files 100 times faster if i remember correctly. It offload main storage pool devices more significally than slog and more importantly on readings.

increase time when transactions scoped in memory transaction log. All transactions not only sync writes.

I don’t understand what problem you are trying to solve by messing with the ashift.

SGC · May 19, 2020, 8:58pm

i was trying to figure out why i couldn’t copy empty files faster than 2k files a second, ofc i can make it do that if i changed from sync=always
but sync=always decreases my pool hdd read latency, from what i understand it due to the fact that sync=always uses the slog to synchronize random writes into sequential writes on the hdd’s

anyways sync=always seems to be good for my pool, have had some hdd latency issues i was dealing with… they have turned out to be related to ashift of my pool, because the hardware is a mix of 4kn and 512 drives, so i was attempting to find out what the IO bottleneck was exactly…
obviously the SLOG but why… turned out that my writes and maybe reads are Q1T1.
maybe because zfs want my drives to be designated in more separate LUNS, currently all drives occupied the same LUN which runs into a default set max of zfs to max 10 concurrent IO pr LUN
thus my slog ssd ends up with Q1T1 random writes something which it’s very poorly suited for.

the ssd’s i use can handle 56k and 89k iops, but on Q1T1 they max out at 4k iops and 7k iops.
so it turned out not to be really ashift related, it’s just the ashift that causes my hdd’s added latency, because my native 512 sector drives ends up with an IO amplification of 16x because zfs writes per default in 8k blocks to the devices, and thus each 8k block represents IO 16 x 512byte blocks

but now i’m looking into either setting up my LUN’s more correctly or maybe as an easy fix just increasing the zfs_vdev_max_pending from 10 to something much higher… but i don’t like that solution.

it’s just performance tuning / configuration adjustments trying to fix mistakes i made because this is my first zfs pool xD

so yeah basically… i’m trying to make it go faster… when doing simulated loads that has basically no real world practical usage… hehe but the ability to perform high IO jobs well is never a bad thing…

ofc real considerations of if its worthwhile comes into play eventually…
i still have much higher IOwaits than i like, but thats what i get for running disks that doesn’t suppose each other… missed that when i purchased them…

Krey · May 19, 2020, 9:12pm

Where is no write amp if you pool has ahift 4k (12).
And i cant understand why you mix zvol and zfs only for zfs. It make sence for something like extfs on zvol.

Krey · May 19, 2020, 9:18pm

Dont mess with terms. Block on zvol say 8k write spread to all devices. You have small block and long array. Thats why you may sence that you get amp. But i doubt where is no amp just not all devices in you raid writing one block, most of is just idle. Some of this write paddings.
Pentium100 many times point you to this thing (yes i read all this topic) . You need 64k block at least with 8 hdds on pool.

In your case when you use pool not only for storj we can get right settings and stats.

Can you post “zfs get all” output.

SGC · May 19, 2020, 9:45pm

zvol are so far as i can tell virtual volumes on a pool they have nothing to do with the underlying blocksizes of zfs, which i didn’t know at the time…

zfs on ashift 12 is 8k blocks because it needs to do that to allow for its compression, however this comes at a cost of a io amplification of 2x ofc… unless if the compression is over 50% effective… which is why its only recommended to run a compressed zfs pool if you get above a x2 compression ratio, else it actually has a detrimental effect on over all performance… aside from the advantages it might give for storing meta data in ram and such.

well i’m still learning and i don’t like just blindly following guides…
64k blocks … you talking about recordsize… well those are variable anyways… the reason which you say 64k recordsizes is because 8x8 is 64 so thats a “stripe” across the drives… but that also requires the data to be there… else zfs will use smaller records down to 4k which is what the ashift limits it to…

but yeah it’s a bit of a damn science to make sense of it all…
i’m still trying only 9 weeks into linux and zfs here
ill figure it out, and learn some hard lessons along the way so people can say i told you so…
i might not listen now… but i will when i understood the reasons why…
i understand what you are saying, and it sort of makes sense, but my hardware is 512 so i need to move down to that first… so i don’t get io amp from zfs communicating with the hardware itself because i failed configuring it correct.

and if i can do 89000 iops or 4000 iops might not make much of a practical difference for storj, but i plan to use the pool for other things and at the moment i can really see there is a serious issue with how its running… its affecting my latency… nearly idle hdd latency of 60ms is pretty high… and that ssd can get 200ms latency is also not normal… i’ve tinkered my way to a system that is operational with good latency but to do that i’ve sacrificed a ton of throughput

Krey · May 19, 2020, 9:59pm

Volblocksize is exacly max size which zfs write zvol block directly on pool.

Ahift not related to records, compressions etc. Actually it is physical sector size of underlying devices.

If you write 8k record with compression on and it fit 4k it will write to one 4k sector of one device. Otherwise it will be writen to two 4k sectors on two devices.

Storj data cant be compressed. Zfs compress ratio more related to boundaries of blocks not for actual compression. It is devided logicalused/used.

Krey · May 19, 2020, 10:02pm

The modern suggestion is use compress everytime and everywhere

Krey · May 19, 2020, 10:06pm

Only if all devices on pool have 512 sector. If one is 4k you must stay on 4k ashift.
More over if someone create pool on all 512 devices but it keep in mind migration to modern hdds it must use 4k ashift.
All that you loose on 12 ashift is some space. Not speed looses there. Physically is no more than sectors alignment whats this call shift. Just all records in pool starts with numbers divided say 4k without rest.

SGC · May 19, 2020, 10:16pm

volblocksize is related to virtual volumes on a pool, to supply comparability for mounting them into other stuff… like say vm’s, thus they are irrelevant.

ashift is most certain related to compression, and dealing with compressible data creates a 2x io amplification, because it will write two 4k sectors instead of 1 x 8k one if the hardware supported that…

i’ve been knee deep in this stuff for over a week doing little else than reading about it and try to fix my bad configuration on the hardware i got in the system now.

i got a pretty good grasp of just how the whole ashift, sector, zfs thing works… sure when moving upwards the 64k record stuff may make perfect sense, but my issues are far more fundamental then recordsizes… besides recordsizes i can adjust later thus for what i’m doing right now they are practically irrelevant.

thats kinda my point… ofc for the storj workload my system is fine…
but its not fine for high io and i do get upwards of 20% cpu utilization just because of io wait even with all my mitigation to reduce latency and workload of the disks…

think about it this way… if storj wants to correct a db record on my pool… it might need to write a few little details… to do this zfs will utilize minimum 4k sector sizes / logical block or whatever voodoo name we want to give them, however since my drives are 512byte that means one db write is 8 blocks of 512, instead of something that might just have been 1 … this pulls io form other tasks and because it happened in the middle of something else, it causes IO wait because the disk head is working much longer for tasks that should be quick.

if you are on 4k drives then they are designed for dealing with 4k…

Krey · May 19, 2020, 10:34pm

No it is related to size of continous block that write to pool. You can use zvols as raw devices without filesystem on them. Like for databases with its own disk structures. And yes it ususally set by calculation physical secor size * device count, next in underlying fs set cluster size to this calculated value

This is bullshit, sorry. Compression is property that you set one time per second per dataset. But ashift you can set only once when create a pool. because it is physical characteristic of you pool.

Krey · May 19, 2020, 10:44pm

I’m afraid you got this impression because you don’t know how to test it correctly. I don’t understand much myself after years, and you just started.
But i have ability to ask directly more experienced comrades who simple make changes locally in it code rather to search properly option it its hundreds module parameters.

Krey · May 19, 2020, 11:54pm

4k device write 8k at the same speed as it write 2x4k or 8x512 if soft write at least 8k at once. And physical boundaries of sectord (ashift) properly alignment.

It never write 8k at device with 4k or 512. On 8 device array it write 2x4k on two devices with ashift 12 or 8x512 on 8 devices with ashift 9. Plus parity sectors one 4k on first case and two 512 on second. In this two cases second case give more performance due to more physical device loaded on work to write 8k data piece.
But calculations changed when you set 64k block size.
And write with 64k pieces at least.
And you can understand from this calcutions why power of two rocks when you design you datastore.
And i want say special thank to stojlabs developers about new version not only for db path but for ability to specify buffer sizes.

SGC · May 20, 2020, 7:55am

haven’t see the new version yet… you keep talking about recordsizes like zfs is regular raid, zfs uses variable blocksizes which is why it’s so difficult to take disks out or put them back into raidz.
you keep talking about recordsizes but they are much later than what i talk about… i want my disks to be able to write 512 bytes and thus carry out 1 IOPS… if they cannot write 1x 512 bytes then i will have io amplification, because if i write 1byte it will require in the case of 4k (ashift 12) it would be a minimum of 8 IOPS and with my current zfs configuration even my SSD that can do 89000 IOPS is limited to 7000 IOPS, meaning i can create situations where my SSD
is writing say 20kb/s and be maxed out at 7000 IOPS, and it has nothing to do with recordsize… multiple disks or anything… this is just the hardlimit of max IOPS on my SSD electronics which i use for my SLOG, and initally i thought it was due to the blocksizes, but like i’ve explained a number of times by now, it turned out to be related to Query depth.
most likely because of my LUN configuration when seen by ZFS
no point in making it more complex than it needs to be

Well zfs is pretty advanced stuff… but i’m pretty good with advanced stuff. and i might only be 9 weeks in to using zfs but i’ve been learning about it for a good deal longer.
i wasn’t quite prepared on just how much it sucked compared to raid in some aspects, and all that just to get some compression that i basically can’t use because it’s only movies or storj data on my server thus far… Geee i’m really happy Sun went that direction, but i guess it makes sense for the stuff they did with it…

raidz isn’t regular raid you seem to confuse the two

Krey · May 20, 2020, 8:31am

i talk about zfs raidz. recordsize is variable it is true but I doubt about a block in zvol. I asked this question and am waiting for an answer now.

SGC · May 20, 2020, 8:37am

zvol are not related to what i’m working on i only touched upon them briefly and have only what little knowledge i’ve learned from a test and gleaned from the oracle zfs manual, it seems zvol’s are virtual volumes upon a pool… thus they have no relation to ashift or raidz

this is why you get a data amplification when writing say 4k blocks to a zvol, because zfs seems to work in 8k blocks you usually end up with a factor of two, or thats my standing theory…

Krey · May 20, 2020, 8:38am

IOPS is just IO operation per second. IOPS with 512 bytes, 8k or 64K a differ in time.all you meanings and calculations of this theme out of my mind.

is writing say 20kb/s and be maxed out at 7000 IOPS

especially this.
this technically uncconnectible between traffic and store iops.
somewhere between network and traffic will be software that record data on disk.

SGC · May 20, 2020, 8:52am

the hell they are… it also defines what the minimum IO.
say i wanted to copy 1million files in a second… then i needed atleast 1million iops for that, at 64k with 4k hardware would need 16 million iops at 512 at 64k you would need 128million iops
also for stuff like caching and such you may get throughput amplification because, if say its a search then you might read 10 bytes in each block, and then the rest is cached… and thus for every 64k record you will end up with 63k+ taking up room in the cache thus practically choking a cache with useless data.

this blocksize iops thing is very fundamental and can cause a lot of trouble in certain workloads.
its most likely why you want to run with a metadata device, because if your system was correctly tuned you shouldn’t need one.

well poor configuration doesn’t make your system faster… its a matter what you tune it for… and what you need.

i want to be able to copy 1 million empty files with sync=always… and i want to do that in about 10sec like my machine should be able to, instead of 20k files in 10sec…
its a pretty radical difference in peformance that i don’t want to miss out on because the system is incorrectly configured because it will make stuff like searching and db loads directly onto the pool insufferable

Krey · May 20, 2020, 8:53am

you said about amplification and I dissuaded you.
I suggest you stop and learn the basics. and get rid of volumes, in favor of ordinary datasets.

you still did not answer on question about you store. I think that you don’t know where and why you have volumes.

SGC · May 20, 2020, 9:02am

what are you talking about i don’t have any zvol
i got a regular pool with no zvol’s on it… i tried making a zvol because i thought zvol was below pool but while figuring out how to do that i found out that zvol are virtual volumes!!! even tho i went on a did a performance test on it to check if it would write with the IOPS my ssd can manage but again i only got the 4k and 7k which turned out to be a Q1T1 issue

i’m done trying to explain this… what you keep saying doesn’t make any sense