I stopped looking at this a long time ago; it is a useless activity, almost independent of your local settings.
Moreover, it is not suitable for comparison: the locations are different, the number of nodes on the host is different, and the system environment is different. Network load and concurrency differ every minute.
tar -c /mnt/storj/node09/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/ | pv -brap > /dev/null
I totally missed this thread. Thank you for inviting me to the party.
I have chosen the opposite direction. My recordsize is 8K because, out of my 5TB of storj data, most pieces are somewhere between 4K and 2MB. Compression is off. So far performance is nice. My bottleneck would be HDD IOPS, but I can compensate for that by just spinning up multiple nodes on different drives. For each drive I set up one ZFS pool and give it a 2G write cache. Is there a command to find out how much of the write cache is used?
zpool iostat SN2 2 -v
is showing me the current "load", but how about the maximum ever seen? How do I get that?
8K recordsize? That will make a 2MB file need 256 IOPS, compared to only 3 with a 1MB recordsize (for a 2.23MB file)… that is a huge difference, but with a decent drive it might not be the bottleneck in HDD performance, at least not with only <40Mbps ingress.
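For reference, those record counts are simply the piece size divided by the recordsize, rounded up; a quick shell sanity check (the piece sizes are the ones quoted above):
echo $(( 2 * 1024 * 1024 / (8 * 1024) ))                 # 2 MiB at an 8K recordsize -> 256 records
echo $(( (2230000 + 1024 * 1024 - 1) / (1024 * 1024) ))  # ~2.23 MB at a 1M recordsize -> 3 records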
The majority of files are 2.23MB.
It is probably the database that needs many more IOPS.
I assigned a 1 GB write cache to my STORJ pool and it never went over 30MB usage, but I have no idea how to find out the maximum ever used… It did, however, smooth out the write load on my HDD significantly.
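There does not seem to be a built-in high-water-mark counter for a log device, so one option is to poll it yourself. A rough sketch, not from this thread: the pool name SN2 and the awk pattern matching a log row named mirror* are assumptions, so adjust them to whatever your own zpool iostat -v shows:
while sleep 1; do
    # -H: script-friendly output, -p: exact byte values, -v: one row per vdev
    printf '%s %s\n' "$(date +%T)" "$(zpool iostat -Hpv SN2 | awk '$1 ~ /^mirror/ { print $2; exit }')"
done | tee slog-alloc.log
# afterwards, the highest allocation seen:
sort -k2 -n slog-alloc.log | tail -n 1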
For reference, this is how I created my top 10 file size list (the final sort -n puts the most common sizes at the bottom):
find ./ -type f | xargs -d '\n' du -s -B1 | cut -f 1 | sort | uniq -c | sort -n
I have used the europe satellite and ignored salt lake. I don't want to optimize my node for test data; I want to optimize it for customer data. 2MB pieces are mostly test data.
You should plan for $write_throughput * $zfs_txg_timeout of SLOG size. The txg timeout is 5 seconds by default, but you can change it via the module parameter.
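On Linux that parameter can be read and changed at runtime (the value 10 here is only an example):
cat /sys/module/zfs/parameters/zfs_txg_timeout        # default: 5 (seconds)
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout  # change it on the fly (as root)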
I use ext4 on top of a zvol with 64K volblocksize.
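In case someone wants to reproduce that layout, a minimal sketch (the pool name tank, the zvol name storj and the 500G size are placeholders):
zfs create -V 500G -o volblocksize=64K tank/storj   # volblocksize is fixed at creation time
mkfs.ext4 /dev/zvol/tank/storj                      # put ext4 on top of the zvol
mount /dev/zvol/tank/storj /mnt/storj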
If you are asking about the ZIL, try these commands (on Linux):
cat /proc/spl/kstat/zfs/zil
zpool list -v
What would happen if I optimize it the other way around? SLOG size / write throughput = timeout? Any advantage?
~# cat /proc/spl/kstat/zfs/zil
15 1 0x01 13 3536 11875199444 38479856715053
name type data
zil_commit_count 4 2289740
zil_commit_writer_count 4 2289710
zil_itx_count 4 603158
zil_itx_indirect_count 4 0
zil_itx_indirect_bytes 4 0
zil_itx_copied_count 4 0
zil_itx_copied_bytes 4 0
zil_itx_needcopy_count 4 594580
zil_itx_needcopy_bytes 4 3388841576
zil_itx_metaslab_normal_count 4 0
zil_itx_metaslab_normal_bytes 4 0
zil_itx_metaslab_slog_count 4 212869
zil_itx_metaslab_slog_bytes 4 3543190264
~# zpool list -v
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
SN1 5,45T 17,4G 5,44T - - 0% 0% 1.00x ONLINE -
ata-WDC_WD60EFRX-68MYMN1_WD-WX51DA4762J0 5,45T 17,4G 5,44T - - 0% 0,31% - ONLINE
logs - - - - - - - - -
mirror 1,88G 128K 1,87G - - 0% 0,00% - ONLINE
nvme0n1p1 - - - - - - - - ONLINE
nvme1n1p1 - - - - - - - - ONLINE
SN2 5,45T 5,25T 208G - - 37% 96% 1.00x ONLINE -
ata-WDC_WD60EFRX-68MYMN1_WD-WX21D1526TL0 5,45T 5,25T 208G - - 37% 96,3% - ONLINE
logs - - - - - - - - -
mirror 1,88G 216K 1,87G - - 0% 0,01% - ONLINE
nvme0n1p2 - - - - - - - - ONLINE
nvme1n1p2 - - - - - - - - ONLINE
SN3 1,81T 363G 1,46T - - 0% 19% 1.00x ONLINE -
ata-WDC_WD2003FYYS-70W0B0_WD-WCAY00217591 1,81T 363G 1,46T - - 0% 19,5% - ONLINE
logs - - - - - - - - -
mirror 1,88G 36K 1,87G - - 0% 0,00% - ONLINE
nvme0n1p3 - - - - - - - - ONLINE
nvme1n1p3 - - - - - - - - ONLINE
What does that tell me?
To see how much of my SLOG is used, I use zpool iostat -v
$ zpool iostat -v
capacity operations bandwidth
pool alloc free read write read write
----------------------------------------------- ----- ----- ----- ----- ----- -----
rpool 6.86G 152G 0 24 7.64K 283K
ata-OCZ-AGILITY3_OCZ-B8LCS0WQ7Z7Q89B6-part3 6.86G 152G 0 24 7.64K 283K
----------------------------------------------- ----- ----- ----- ----- ----- -----
zPool 13.4T 24.8T 78 369 7.03M 9.94M
raidz1 12.7T 14.6T 76 121 7.00M 1.84M
wwn-0x5000cca2556e97a8 - - 30 33 1.75M 469K
wwn-0x5000cca2556d51f4 - - 15 33 1.75M 469K
ata-HGST_HUS726060ALA640_AR11021EH21JAB - - 15 27 1.75M 472K
12046184624242064708 - - 0 0 0 0
wwn-0x5000cca232cedb71 - - 15 27 1.75M 472K
raidz1 777G 10.1T 2 149 38.6K 4.47M
ata-TOSHIBA_DT01ACA300_531RH5DGS - - 0 36 9.89K 1.13M
ata-TOSHIBA_DT01ACA300_Z252JW8AS - - 0 36 9.51K 1.11M
ata-TOSHIBA_DT01ACA300_99QJHASCS - - 0 39 9.69K 1.13M
ata-TOSHIBA_DT01ACA300_99PGNAYCS - - 0 37 9.54K 1.11M
logs - - - - - -
ata-OCZ-AGILITY3_OCZ-B8LCS0WQ7Z7Q89B6-part5 80.3M 4.42G 0 98 843 3.63M
cache - - - - - -
ata-Crucial_CT750MX300SSD1_161613125282-part1 189G 411G 2 32 73.4K 3.68M
----------------------------------------------- ----- ----- ----- ----- ----- -----
excuse the mess of random disk identifiers used to add the drives, but it was what i could get working at the time, still getting the hang of this zfs thing.
rpool is the OS
zPool is storage… really should get that renamed… the capital P is kinda driving me crazy, if i wasn't to begin with… jury is still out xD
and yes i'm a drive down… working on that
iostat is pretty great… slow because it's averaged over a long period, but great for getting a good sense of how the pool is performing.
thus far i've found -v, -l and -w useful
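Worth noting: the since-boot averaging only applies to the first report; giving zpool iostat an interval makes every following report a fresh sample over that interval. For example, with the flags mentioned above and a 5-second interval:
zpool iostat -v zPool 5    # per-vdev throughput, refreshed every 5 seconds
zpool iostat -l zPool 5    # adds per-vdev latency columns
zpool iostat -w zPool      # wait/latency histograms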
IIRC long timeouts make the response spiky (if the timeout were 1 hour, it would not write to disk for an hour and would then write all the changes at once, and I think new writes would block during that time), but overall performance would be better because the writes can be arranged almost sequentially.
I have set mine at 30 seconds IIRC.
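To make such a change stick across reboots, the usual Linux approach is a module option (a sketch; the file name is just the conventional one and the 30 matches the value mentioned above):
# /etc/modprobe.d/zfs.conf
options zfs zfs_txg_timeout=30
If ZFS is loaded from the initramfs (e.g. root on ZFS), regenerate it afterwards (update-initramfs -u on Debian/Ubuntu).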
There is a rather long discussion about it here:
What's your
zfs get used [dataset]
size compared to
du -hd0 [/data]
?
~# zfs get used SN2
NAME PROPERTY VALUE SOURCE
SN2 used 5,25T -
~# du -hd0 /mnt/sn2/
5,3T /mnt/sn2/
du is rounding so here are the exact numbers:
~# du -hd0 -B1 /mnt/sn2/
5773421608448 /mnt/sn2/
but running an 8k recordsize, don't you end up getting checksum bloat / slowdown from all the extra data?
you sure you don't mean the blocksize of zfs?
proxmox set my blocksize to 8k and recordsize was 128k by default… i assume 8k is the minimal size of a record and 128k is the max then… the 8k one cannot be changed on a zfs pool after it's created… the recordsize we can change as much as we like… if one disables the safeguards one should even be able to go up to 16M recordsizes… if @kevink is feeling frisky
i think i'll stay at my 64k recordsize… seems to work nicely
sorry, this is the same.
You need du -bd0.
And can you run tar (see the example some posts up) for a simple speed test…
blocksize (volblocksize) - parameter for volumes (zvols)
recordsize - for datasets
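A quick way to see the difference (dataset and zvol names below are placeholders):
zfs get recordsize tank/dataset        # per dataset, can be changed at any time
zfs set recordsize=64K tank/dataset    # only affects newly written blocks
zfs get volblocksize tank/zvol         # per zvol, fixed when the zvol is created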
~# du -bd0 /mnt/sn2
5690147768514 /mnt/sn2
So about 77.5 GiB (roughly 83 GB) of difference
~# tar -c /mnt/sn2/storagenode/storagenode/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/a2/a2si54scflkurix7hclsov3liqbjfgbuit4fpmqmwioe4i3yva.sj1 | pv -brap > /dev/null
tar: Removing leading '/' from member names
2,22MiB [ 203MiB/s] [ 203MiB/s] [ <=> ]
~# tar -c /mnt/sn2/storagenode/storagenode/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/a2/* | pv -brap > /dev/null
tar: Removing leading '/' from member names
tar: Removing leading '/' from hard link targets
907MiB [40,5MiB/s] [40,5MiB/s] [ <=>
Let me execute that tar test on the artificial test data as well. The output above should be from the europe satellite, without any test data.
~# tar -c /mnt/sn2/storagenode/storagenode/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/a2/* | pv -brap > /dev/null
tar: Removing leading '/' from member names
tar: Removing leading '/' from hard link targets
3,85GiB [45,4MiB/s] [45,4MiB/s] [ <=>
why not just use zfs get all SN2?
$ zfs get all zPool
NAME PROPERTY VALUE SOURCE
zPool type filesystem -
zPool creation Fri Feb 28 21:53 2020 -
zPool used 10.7T -
zPool available 18.1T -
zPool referenced 1.97G -
zPool compressratio 1.01x -
zPool mounted yes -
zPool quota none default
zPool reservation none default
zPool recordsize 64K local
zPool mountpoint /zPool default
zPool sharenfs rw=@192.168.0.10/24 local
zPool checksum on default
zPool compression lz4 local
zPool atime off local
zPool devices on default
zPool exec on default
zPool setuid on default
zPool readonly off default
zPool zoned off default
zPool snapdir hidden default
zPool aclinherit restricted default
zPool createtxg 1 -
zPool canmount on default
zPool xattr on default
zPool copies 1 default
zPool version 5 -
zPool utf8only off -
zPool normalization none -
zPool casesensitivity sensitive -
zPool vscan off default
zPool nbmand off default
zPool sharesmb off default
zPool refquota none default
zPool refreservation none default
zPool guid 7412244713760912023 -
zPool primarycache all default
zPool secondarycache all default
zPool usedbysnapshots 0B -
zPool usedbydataset 1.97G -
zPool usedbychildren 10.7T -
zPool usedbyrefreservation 0B -
zPool logbias latency local
zPool objsetid 54 -
zPool dedup off default
zPool mlslabel none default
zPool sync always local
zPool dnodesize legacy default
zPool refcompressratio 1.91x -
zPool written 1.97G -
zPool logicalused 10.8T -
zPool logicalreferenced 3.70G -
zPool volmode default default
zPool filesystem_limit none default
zPool snapshot_limit none default
zPool filesystem_count none default
zPool snapshot_count none default
zPool snapdev hidden default
zPool acltype off default
zPool context none default
zPool fscontext none default
zPool defcontext none default
zPool rootcontext none default
zPool relatime off default
zPool redundant_metadata all default
zPool overlay off default
zPool encryption off default
zPool keylocation none default
zPool keyformat none default
zPool pbkdf2iters 0 default
zPool special_small_blocks 0 default
or whatever dataset you want to use it on, of course… that will tell you exactly what's written…
okay, bad example: it doesn't count the child datasets in "written", so here is the child dataset itself (see also the recursive variant after this listing):
NAME PROPERTY VALUE SOURCE
zPool/storagenodes/storj type filesystem -
zPool/storagenodes/storj creation Tue Apr 28 9:39 2020 -
zPool/storagenodes/storj used 6.99T -
zPool/storagenodes/storj available 18.1T -
zPool/storagenodes/storj referenced 6.99T -
zPool/storagenodes/storj compressratio 1.01x -
zPool/storagenodes/storj mounted yes -
zPool/storagenodes/storj quota none default
zPool/storagenodes/storj reservation none default
zPool/storagenodes/storj recordsize 16K inherited from zPool/storagenodes
zPool/storagenodes/storj mountpoint /zPool/storagenodes/storj default
zPool/storagenodes/storj sharenfs rw=@192.168.0.10/24 inherited from zPool
zPool/storagenodes/storj checksum on default
zPool/storagenodes/storj compression zle local
zPool/storagenodes/storj atime off inherited from zPool
zPool/storagenodes/storj devices on default
zPool/storagenodes/storj exec on default
zPool/storagenodes/storj setuid on default
zPool/storagenodes/storj readonly off default
zPool/storagenodes/storj zoned off default
zPool/storagenodes/storj snapdir hidden default
zPool/storagenodes/storj aclinherit restricted default
zPool/storagenodes/storj createtxg 964311 -
zPool/storagenodes/storj canmount on default
zPool/storagenodes/storj xattr on default
zPool/storagenodes/storj copies 1 default
zPool/storagenodes/storj version 5 -
zPool/storagenodes/storj utf8only off -
zPool/storagenodes/storj normalization none -
zPool/storagenodes/storj casesensitivity sensitive -
zPool/storagenodes/storj vscan off default
zPool/storagenodes/storj nbmand off default
zPool/storagenodes/storj sharesmb off default
zPool/storagenodes/storj refquota none default
zPool/storagenodes/storj refreservation none default
zPool/storagenodes/storj guid 11797936557373032860 -
zPool/storagenodes/storj primarycache all default
zPool/storagenodes/storj secondarycache all default
zPool/storagenodes/storj usedbysnapshots 0B -
zPool/storagenodes/storj usedbydataset 6.99T -
zPool/storagenodes/storj usedbychildren 0B -
zPool/storagenodes/storj usedbyrefreservation 0B -
zPool/storagenodes/storj logbias latency inherited from zPool
zPool/storagenodes/storj objsetid 1084 -
zPool/storagenodes/storj dedup off default
zPool/storagenodes/storj mlslabel none default
zPool/storagenodes/storj sync always inherited from zPool
zPool/storagenodes/storj dnodesize legacy default
zPool/storagenodes/storj refcompressratio 1.01x -
zPool/storagenodes/storj written 6.99T -
zPool/storagenodes/storj logicalused 7.01T -
zPool/storagenodes/storj logicalreferenced 7.01T -
zPool/storagenodes/storj volmode default default
zPool/storagenodes/storj filesystem_limit none default
zPool/storagenodes/storj snapshot_limit none default
zPool/storagenodes/storj filesystem_count none default
zPool/storagenodes/storj snapshot_count none default
zPool/storagenodes/storj snapdev hidden default
zPool/storagenodes/storj acltype off default
zPool/storagenodes/storj context none default
zPool/storagenodes/storj fscontext none default
zPool/storagenodes/storj defcontext none default
zPool/storagenodes/storj rootcontext none default
zPool/storagenodes/storj relatime off default
zPool/storagenodes/storj redundant_metadata all default
zPool/storagenodes/storj overlay off default
zPool/storagenodes/storj encryption off default
zPool/storagenodes/storj keylocation none default
zPool/storagenodes/storj keyformat none default
zPool/storagenodes/storj pbkdf2iters 0 default
zPool/storagenodes/storj special_small_blocks 0 default
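To get totals that include all children in one command, a recursive listing works (a sketch; adjust the property list to taste):
zfs list -r -o name,used,logicalused,recordsize,compressratio zPool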
you can get slightly better numbers using lz4 compression… i only get a 1.01x compression ratio
but i figured i would rather waste 2-3% disk capacity than have my cpu constantly trying to compress stuff that isn't compressible… so i run lz4 on the zPool but zle on the storagenodes dataset…
i could have sworn i had changed that recordsize to 64k
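For reference, both are plain dataset properties that can be changed at any time (the names below match the listing above; a new recordsize only applies to data written after the change):
zfs set compression=lz4 zPool
zfs set compression=zle zPool/storagenodes
zfs set recordsize=64K zPool/storagenodes
zfs get -r compression,recordsize zPool   # verify what each dataset actually uses/inherits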
and the power of ZFS read cache:
~# tar -c /mnt/sn2/storagenode/storagenode/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/a2/* | pv -brap > /dev/null
tar: Removing leading '/' from member names
tar: Removing leading '/' from hard link targets
907MiB [ 762MiB/s] [ 762MiB/s] [ <=>
this is saltlake, full of test data.
aren't you just copying data from one place to another? or how exactly does this tar thing work…
sorry, i'm pretty linux code daft. xD
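For anyone else wondering: nothing is copied anywhere. tar -c only reads the files and packs them into a single stream on stdout, pv measures that stream, and > /dev/null discards it, so the pipeline is effectively a read-throughput test of the pool (and its caches). The same command from above, annotated:
# tar -c: read the pieces and pack them into one stream (nothing is written back to disk)
# pv -brap: show bytes transferred, current rate, average rate and a progress bar
# > /dev/null: throw the stream away, so only the read side is measured
tar -c /mnt/sn2/storagenode/storagenode/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/a2/* | pv -brap > /dev/null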