ZFS discussions

My initial goal was to prove to you that on a real storage node workload, SLOG and L2ARC do not make sense.
I did that and provided the collected information.

Waiting for the next challenge :slight_smile:

Is there a good command to test write performance, similar to that tar command? I would especially like to test the commit time.

I can recommend fio; please wait a few minutes while I prepare an example.

@littleskunk Here are a few examples; you can play with blocksize and iodepth (for storage node workload simulation, iodepth should be in the range of 4 to 16). Please replace filename=/dev/sdX with your disk or file. Also, pay special attention: on ZFS, to get correct results, we need to reboot the host before launching each next type of test.

fio [options] [job options] <job file(s)>

Job file examples
Read test (read.job):

[readtest]
blocksize=4k
filename=/dev/sdX
rw=randread
direct=1
buffered=0
ioengine=libaio
iodepth=4

Write test (write.job):

[writetest]
blocksize=4k
filename=/dev/sdX
rw=randwrite
direct=1
buffered=0
ioengine=libaio
iodepth=4

Hybrid test, read/write (hybrid.job):

[readtest]
blocksize=4k
filename=/dev/sdX
rw=randread
direct=1
buffered=0
ioengine=libaio
iodepth=4
[writetest]
blocksize=4k
filename=/dev/sdX
rw=randwrite
direct=1
buffered=0
ioengine=libaio
iodepth=4
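
To run a test, point fio at the corresponding job file. A minimal example, assuming the job files above were saved as read.job, write.job and hybrid.job:

# run each test separately, and on ZFS reboot the host between runs as noted above
fio read.job
fio write.job
fio hybrid.job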

We should analyze the output and pay attention to the parameters that I marked:
During the test:
Jobs: 2 (f=2): [rw] [2.8% done] [13312K/11001K /s] [3250/2686 iops] [eta 05m:12s]

Output:
read: (groupid=0, jobs=1): err= 0: pid=11048
read : io=126480KB, bw=14107KB/s, iops=3526, runt= 8966msec
slat (usec): min=3, max=432, avg= 6.19, stdev= 6.72
clat (usec): min=387, max=208677, avg=9063.18, stdev=22736.45
bw (KB/s) : min=10416, max=18176, per=98.74%, avg=13928.29, stdev=2414.65
cpu : usr=1.56%, sys=3.17%, ctx=15636, majf=0, minf=57
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued r/w: total=31620/0, short=0/0
lat (usec): 500=0.07%, 750=0.99%, 1000=2.76%
lat (msec): 2=16.55%, 4=35.21%, 10=35.47%, 20=3.68%, 50=0.76%
lat (msec): 100=0.08%, 250=4.43%
write: (groupid=0, jobs=1): err= 0: pid=11050
write: io=95280KB, bw=10630KB/s, iops=2657, runt= 8963msec
slat (usec): min=3, max=907, avg= 7.60, stdev=11.68
clat (usec): min=589, max=162693, avg=12028.23, stdev=25166.31
bw (KB/s) : min= 6666, max=14304, per=100.47%, avg=10679.50, stdev=2141.46
cpu : usr=0.49%, sys=3.57%, ctx=12075, majf=0, minf=25
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued r/w: total=0/23820, short=0/0
lat (usec): 750=0.03%, 1000=0.37%
lat (msec): 2=9.04%, 4=24.53%, 10=49.72%, 20=9.56%, 50=0.82%
lat (msec): 100=0.07%, 250=5.87%

There is no need to speed up writes; they are already great on ZFS.
As for reads, low latency can be obtained by moving metadata to an SSD special device.
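
For anyone who wants to try that, a rough sketch follows; the pool and device names (pool, /dev/sdY, /dev/sdZ) and the 16K threshold are just placeholders, and the special vdev should be mirrored because losing it means losing the pool:

# add a mirrored special vdev that will hold the pool's metadata (names are examples)
zpool add pool special mirror /dev/sdY /dev/sdZ
# optionally also route small data blocks (here <= 16K) to the special vdev
zfs set special_small_blocks=16K pool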

yeah was reading up on that, because i thought i could get my slog and l2arc to run with the special_small_blocks parameter…
alas only for the special metadata device… and those cannot be removed once added and if they fail the pool fails… so i decided to stay far away from that presently…

neat feature tho
well my 8k zvol blocksize is killing my ssd speed in IO because the ssd is 512 sector size, and thus i get a 16x IO amplification reducing my slog IO to 4k… :confused:
so not sure i think that is brilliant, but not really zfs's fault that i let a stupid OS configure my pool… so there is that
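
For reference, a quick way to double-check what sector sizes a drive actually reports (the device name is just an example):

# logical and physical sector sizes for all block devices
lsblk -o NAME,LOG-SEC,PHY-SEC
# or for a single disk, replace /dev/sdX
blockdev --getss --getpbsz /dev/sdX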

bonnie++ is my go-to for storage benchmarking.

so now that i've sort of accepted i cannot figure out a good way to remake my pool, i was attempting different solutions to get past my 8k block issue.

it seems that i can create a zvol inside a pool, and even tho the pool utilizes 8k blocks, i can define that the zvol should use 512-byte blocks… does this mean i could essentially migrate everything into a zvol and sort of just leave the rest, or somehow export the zvol or something similar, to make it utilize the entire pool eventually…
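
for reference, creating a zvol with an explicit block size looks roughly like this (the volume name and size are made up, and whether 512-byte blocks actually behave well on an ashift=12 pool is exactly what gets questioned below):

# sparse 100G zvol with 512-byte volume blocks (name and size are examples)
zfs create -s -V 100G -o volblocksize=512 zPool/test-512b
# confirm the property took effect
zfs get volblocksize zPool/test-512b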

anyone got any sense if this is a terrible idea… for whatever reason…
i really need to get out of that 8k bind because my io is basically running at 1/16th capacity because of it… i got a few drives that will do 4k so they might only feel 2x io amplification… this might also explain my previous issues with excessive latency

from how i have understood it… i can use 512 byte blocks on the vol and then do, i dunno, somewhere between 128k and 512k recordsizes, leaning towards 128k, which should give me a throughput of about 500MB/s, which is fine by me if i save a good deal of extra memory and bandwidth at times…
so zfs should start at 512 bytes and then simply expand it until it fills the recordsize, and if i work with raw data i can achieve peak io if, say, i wanted to move empty files around quickly… or do massive database rw

so all sounds good… what am i missing?

If you are using raidz with 4K drives you have to use a volblocksize of at least 64K. With a small volblocksize, every block carries its own parity and padding sectors, so the zvol ends up taking roughly twice as much space as there is data in it.

I have found this out firsthand. Hey look, I can get more performance with volblocksize=4K, awesome. Hmm, why is my 500GB zvol taking up 1TB?


well there are two drives that are 4kn and right now the volblock is 8k and with 8k i'm not seeing any data amplification… maybe that's why it made it like that… but running a 64k volblock would put your io amplification at 16x also when using 4k hardware… i might have to get rid of those 4k drives, then maybe get a couple of regular 512 so everything fits, and then maybe run 1k if i see any data amplification when running 512…
i guess i got tests to run…

you sure you don't mean recordsize?

Your pool has either ashift=9 or ashift=12.
If it is ashift=9, then the data is misaligned on the 4K drives and you get reduced performance.
If it is ashift=12, then you get higher performance, but a small-blocksize zvol will take up more space.
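
A quick way to check which one you actually have (the pool name is an example; the zpool property may read 0 if ashift was auto-detected at creation, so zdb is the more reliable check):

# per-vdev ashift straight from the pool configuration
zdb -C zPool | grep ashift
# pool-level ashift property (0 means it was auto-detected)
zpool get ashift zPool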

A zvol with 64K blocksize does not really result in 16x write amplification most of the time because the files usually are bigger than 4K.

However, I could run NTFS with cluster size 64K (or some other filesystem - ext4 does not support large block sizes) and possibly waste some space when storing tiny files, but get higher performance.

i'm on ashift 12 and yeah most files are bigger than 4k, but that doesn't change that stuff like database io would in general be pretty small i suppose…
i wonder if i can get away with forcing the slog to run 512 bytes, been sort of trying to do that when i found that i could create new volumes with different blocksizes inside my pool… which is interesting to say the least…
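
if you want to experiment with that, the ashift of a log vdev can apparently be set when the device is added; a rough sketch, where the pool and device names are examples and whether ashift=9 actually helps depends on the ssd:

# add a log device forced to 512-byte sectors (ashift=9); names are examples
zpool add -o ashift=9 zPool log /dev/sdY
# check what the log vdev actually ended up with
zdb -C zPool | grep ashift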

ofc even if the slog runs 512… then stuff going into the slog would be fast… but going out would still be at a 16x io amplification, at least when it hits the hdds.

and if i did get it to be 64k… that's 128x io amplification on my slog devices… that would put me back at twice the io of what a hdd can manage, and that's on my ssd

zfs create -s -V 1G pool/test-zvol
dd if=/dev/urandom of=/dev/zvol/pool/test-zvol bs=1M

And tell me how much the zvol takes. It should be more than 1GB.

ctrl + c won't kill it…
dd: error writing '/dev/zvol/zPool/test-zvol': No space left on device

okay got it… had to figure out how to kill it with top and kill -17 to get the darn thing to die…
useful thing to know how to do that tho…

OMG it's still going… ffs

Sorry, I forgot oflag=direct, which would allow you to stop it faster.
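
That is, something like this:

# oflag=direct bypasses the page cache, so the write is not buffered in memory and Ctrl+C can stop it sooner
dd if=/dev/urandom of=/dev/zvol/pool/test-zvol bs=1M oflag=direct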

Essentially, since this was an async write, the 1GB is cached in memory and will be written to disk.

I figured that writing 1GB would not take a long time.

Here's how it looks on my pool (6-drive raidz2):

zfs list -o name,volblocksize,volsize,used | grep stor/test              
stor/test4K                   4K       1G  2.02G
stor/test64K                 64K       1G  1.00G
stor/test8K                   8K       1G  2.01G

stop it faster… i can't kill it…

Did you use a very large zvol? Or is your pool very slow? I think the only way to stop it would be to restart the server. However, it should not take long to write 1GB. My server wrote it in a few seconds.

it only took a brief moment before it said it was out of space… but the terminal remained active and i couldn't do anything…
aside from writing

it was only on the 1gb vol… but i did use my primary pool to create the vol on… i don't really have a lot of pools at the moment… didn't look to be anything that would cause trouble…

on the plus side it doesn't seem to be writing data to my pool… xD
it's just like it keeps writing on the same 1gb vol again and again…

i just cannot believe i cannot kill it… i've tried like 3 different guides for killing stuff and it just won't die
it's like there is something that starts it back up… i did manage to find it using ps -aux, something that was clearly it… but nothing i try will stop it… weirdly enough

Basically, without oflag=direct, dd writes in async mode, so it wrote the 1GB to memory very quickly, said it was out of space, and left the kernel to write the data from memory to the disk.

reboot it is then…
not sure it can write async since i got sync=always on :smiley:
which actually could have been part of the problem… because i was testing and kinda removed my slogs and my l2arc… so maybe not totally your fault xD
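
for anyone following along, checking what the pool is actually set to is quick (pool name is an example):

# show the sync policy for the pool (standard / always / disabled)
zfs get sync zPool
# and whether any log device is currently attached
zpool status zPool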

i did think of adding like a progress thing to it… so i could see what it was doing… then i bet i could have caught it with that…
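
for reference, GNU dd has a flag for exactly that; the path is just the test zvol from earlier:

# status=progress prints live transfer stats; oflag=direct keeps the write killable
dd if=/dev/urandom of=/dev/zvol/zPool/test-zvol bs=1M oflag=direct status=progress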