ZFS discussions

My initial goal was to prove to you that on a real storage node workload, SLOG and L2ARC do not make sense.
I did that and provided the collected information.

Waiting for the next challenge :slight_smile:

Is there a good command to test write performance, similar to that tar command? I would especially like to test the commit time.

I can recommend fio; please wait a few minutes while I prepare an example.

@littleskunk Here are a few examples; you can play with blocksize and iodepth (for storage node workload simulation, iodepth should be in the range of 4 to 16). Please replace filename=/dev/sdX with your disk or file. Also, pay special attention: on ZFS, to get correct results, we need to reboot the host before launching each next type of test.

fio [options] [job options] <job file(s)>

Job file examples
Read test (read.job):

[readtest]
blocksize=4k
filename=/dev/sdX
rw=randread
direct=1
buffered=0
ioengine=libaio
iodepth=4

Write test (write.job):

[writetest]
blocksize=4k
filename=/dev/sdX
rw=randwrite
direct=1
buffered=0
ioengine=libaio
iodepth=4

Hybrid test, read/write (hybrid.job):

[readtest]
blocksize=4k
filename=/dev/sdX
rw=randread
direct=1
buffered=0
ioengine=libaio
iodepth=4
[writetest]
blocksize=4k
filename=/dev/sdX
rw=randwrite
direct=1
buffered=0
ioengine=libaio
iodepth=4
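
To run a test, point fio at the corresponding job file. A minimal example, assuming the job files above were saved as read.job, write.job and hybrid.job:

# run each test separately, and on ZFS reboot the host between runs as noted above
fio read.job
fio write.job
fio hybrid.job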

We should analyze the output and pay attention to the parameters that I marked:
During the test:
Jobs: 2 (f=2): [rw] [2.8% done] [13312K/11001K /s] [3250/2686 iops] [eta 05m:12s]

Output:
read: (groupid=0, jobs=1): err= 0: pid=11048
read : io=126480KB, bw=14107KB/s, iops=3526, runt= 8966msec
slat (usec): min=3, max=432, avg= 6.19, stdev= 6.72
clat (usec): min=387, max=208677, avg=9063.18, stdev=22736.45
bw (KB/s) : min=10416, max=18176, per=98.74%, avg=13928.29, stdev=2414.65
cpu : usr=1.56%, sys=3.17%, ctx=15636, majf=0, minf=57
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued r/w: total=31620/0, short=0/0
lat (usec): 500=0.07%, 750=0.99%, 1000=2.76%
lat (msec): 2=16.55%, 4=35.21%, 10=35.47%, 20=3.68%, 50=0.76%
lat (msec): 100=0.08%, 250=4.43%
write: (groupid=0, jobs=1): err= 0: pid=11050
write: io=95280KB, bw=10630KB/s, iops=2657, runt= 8963msec
slat (usec): min=3, max=907, avg= 7.60, stdev=11.68
clat (usec): min=589, max=162693, avg=12028.23, stdev=25166.31
bw (KB/s) : min= 6666, max=14304, per=100.47%, avg=10679.50, stdev=2141.46
cpu : usr=0.49%, sys=3.57%, ctx=12075, majf=0, minf=25
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued r/w: total=0/23820, short=0/0
lat (usec): 750=0.03%, 1000=0.37%
lat (msec): 2=9.04%, 4=24.53%, 10=49.72%, 20=9.56%, 50=0.82%
lat (msec): 100=0.07%, 250=5.87%

There is no need to speed up writes; they are already great on ZFS.
As for reads, low latency can be obtained by moving metadata to an SSD special device.
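
For anyone who wants to try that, a rough sketch follows; the pool and device names (pool, /dev/sdY, /dev/sdZ) and the 16K threshold are just placeholders, and the special vdev should be mirrored because losing it means losing the pool:

# add a mirrored special vdev that will hold the pool's metadata (names are examples)
zpool add pool special mirror /dev/sdY /dev/sdZ
# optionally also route small data blocks (here <= 16K) to the special vdev
zfs set special_small_blocks=16K pool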

yeah was reading up on that, because i thought i could get my slog and l2arc to run with the special_small_blocks parameter…
alas only for the special metadata device… and those cannot be removed once added and if they fail the pool fails… so i decided to stay far away from that presently…

neat feature tho
well my 8k zvol blocksize is killing my ssd speed in IO because the ssd is 512 sector size, and thus i get a 16x IO amplification reducing my slog IO to 4k… :confused:
so not sure i think that is brilliant, but not really zfs's fault that i let a stupid OS configure my pool… so there is that
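
For reference, a quick way to double-check what sector sizes a drive actually reports (the device name is just an example):

# logical and physical sector sizes for all block devices
lsblk -o NAME,LOG-SEC,PHY-SEC
# or for a single disk, replace /dev/sdX
blockdev --getss --getpbsz /dev/sdX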

bonnie++ is my go-to for storage benchmarking.

so now that i've sort of accepted i cannot figure out a good way to remake my pool, i was attempting different solutions to get past my 8k block issue.

it seems that i can create a zvol inside a pool, and even tho the pool utilizes 8k blocks, i can define that the zvol should use 512-byte blocks… does this mean i could essentially migrate everything into a zvol and sort of just leave the rest, or somehow export the zvol or something similar, to make it utilize the entire pool eventually…
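
for reference, creating a zvol with an explicit block size looks roughly like this (the volume name and size are made up, and whether 512-byte blocks actually behave well on an ashift=12 pool is exactly what gets questioned below):

# sparse 100G zvol with 512-byte volume blocks (name and size are examples)
zfs create -s -V 100G -o volblocksize=512 zPool/test-512b
# confirm the property took effect
zfs get volblocksize zPool/test-512b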

anyone got any sense if this is a terrible idea… for whatever reason…
i really need to get out of that 8k bind because my io is basically running at 1/16th capacity because of it… i got a few drives that will do 4k so they might only feel 2x io amplification… this might also explain my previous issues with excessive latency

from how i have understood it… i can use 512 byte blocks on the vol and then do, i dunno, somewhere between 128k and 512k recordsizes, leaning towards 128k, which should give me a throughput of about 500MB/s, which is fine by me if i save a good deal of extra memory and bandwidth at times…
so zfs should start at 512 bytes and then simply expand it until it fills the recordsize, and if i work with raw data i can achieve peak io if, say, i wanted to move empty files around quickly… or do massive database rw

so all sounds good… what am i missing?

If you are using raidz with 4K drives you have to use a volblocksize of at least 64K. With a small volblocksize, every block carries its own parity and padding sectors, so the zvol ends up taking roughly twice as much space as there is data in it.

I have found this out firsthand. Hey look, I can get more performance with volblocksize=4K, awesome. Hmm, why is my 500GB zvol taking up 1TB?


well there are two drives that are 4kn and right now the volblock is 8k and with 8k i'm not seeing any data amplification… maybe that's why it made it like that… but running a 64k volblock would put your io amplification at 16x also when using 4k hardware… i might have to get rid of those 4k drives, then maybe get a couple of regular 512 so everything fits, and then maybe run 1k if i see any data amplification when running 512…
i guess i got tests to run…

you sure you don't mean recordsize?

Your pool has either ashift=9 or ashift=12.
If it is ashift=9, then the data is misaligned on the 4K drives and you get reduced performance.
If it is ashift=12, then you get higher performance, but a small-blocksize zvol will take up more space.
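
A quick way to check which one you actually have (the pool name is an example; the zpool property may read 0 if ashift was auto-detected at creation, so zdb is the more reliable check):

# per-vdev ashift straight from the pool configuration
zdb -C zPool | grep ashift
# pool-level ashift property (0 means it was auto-detected)
zpool get ashift zPool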

A zvol with 64K blocksize does not really result in 16x write amplification most of the time because the files usually are bigger than 4K.

However, I could run NTFS with cluster size 64K (or some other filesystem - ext4 does not support large block sizes) and possibly waste some space when storing tiny files, but get higher performance.

i'm on ashift 12 and yeah most files are bigger than 4k, but that doesn't change that stuff like database io would in general be pretty small i suppose…
i wonder if i can get away with forcing the slog to run 512 bytes, been sort of trying to do that when i found that i could create new volumes with different blocksizes inside my pool… which is interesting to say the least…
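
if you want to experiment with that, the ashift of a log vdev can apparently be set when the device is added; a rough sketch, where the pool and device names are examples and whether ashift=9 actually helps depends on the ssd:

# add a log device forced to 512-byte sectors (ashift=9); names are examples
zpool add -o ashift=9 zPool log /dev/sdY
# check what the log vdev actually ended up with
zdb -C zPool | grep ashift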

ofc even if the slog runs 512… then stuff going into the slog would be fast… but going out would still be at a 16x io amplification, at least when it hits the hdds.

and if i did get it to be 64k… that's 128x io amplification on my slog devices… that would put me back at twice the io of what a hdd can manage, and that's on my ssd

zfs create -s -V 1G pool/test-zvol
dd if=/dev/urandom of=/dev/zvol/pool/test-zvol bs=1M

And tell me how much the zvol takes. It should be more than 1GB.

ctrl + c won't kill it…
dd: error writing '/dev/zvol/zPool/test-zvol': No space left on device

okay got it… had to figure out how to kill it with top and kill -17 to get the darn thing to die…
useful thing to know how to do that tho…

OMG it's still going… ffs

Sorry, I forgot oflag=direct, which would allow you to stop it faster.
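
That is, something like this:

# oflag=direct bypasses the page cache, so the write is not buffered in memory and Ctrl+C can stop it sooner
dd if=/dev/urandom of=/dev/zvol/pool/test-zvol bs=1M oflag=direct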

Essentially, since this was an async write, the 1GB is cached in memory and will be written to disk.

I figured that writing 1GB would not take a long time.

Here's how it looks on my pool (6-drive raidz2):

zfs list -o name,volblocksize,volsize,used | grep stor/test              
stor/test4K                   4K       1G  2.02G
stor/test64K                 64K       1G  1.00G
stor/test8K                   8K       1G  2.01G

stop it faster… i can't kill it…

Did you use a very large zvol? Or is your pool very slow? I think the only way to stop it would be to restart the server. However, it should not take long to write 1GB. My server wrote it in a few seconds.

it only took a brief moment before it said it was out of space… but the terminal remained active and i couldn't do anything…
aside from writing

it was only on the 1gb vol… but i did use my primary pool to create the vol on… i don't really have a lot of pools at the moment… didn't look to be anything that would cause trouble…

on the plus side it doesn't seem to be writing data to my pool… xD
it's just like it keeps writing on the same 1gb vol again and again…

i just cannot believe i cannot kill it… i've tried like 3 different guides for killing stuff and it just won't die
it's like there is something that starts it back up… i did manage to find it using ps -aux, something that was clearly it… but nothing i try will stop it… weirdly enough

Basically, without oflag=direct, dd writes in async mode, so it wrote the 1GB to memory very quickly, said it was out of space, and left the kernel to write the data from memory to the disk.

reboot it is then…
not sure it can write async since i got sync=always on :smiley:
which actually could have been part of the problem… because i was testing and kinda removed my slogs and my l2arc… so maybe not totally your fault xD
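
for anyone following along, checking what the pool is actually set to is quick (pool name is an example):

# show the sync policy for the pool (standard / always / disabled)
zfs get sync zPool
# and whether any log device is currently attached
zpool status zPool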

i did think of adding like a progress thing to it… so i could see what it was doing… then i bet i could have caught it with that…
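
for reference, GNU dd has a flag for exactly that; the path is just the test zvol from earlier:

# status=progress prints live transfer stats; oflag=direct keeps the write killable
dd if=/dev/urandom of=/dev/zvol/zPool/test-zvol bs=1M oflag=direct status=progress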