ZFS discussions

I stopped looking at this long ago as a useless activity; the result is almost independent of your local settings.
Moreover, it is not suitable for comparison: the locations are different, the number of nodes on the host is different, the system environment is different, and network load and concurrency are different every minute.

tar -c /mnt/storj/node09/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/ | pv -brap > /dev/null

I totally missed this thread. Thank you for inviting me to the party.

I have chosen the opposite direction. My record size is 8K because out of my 5TB of Storj data most of the pieces are somewhere between 4K and 2MB. Compression is off. So far performance is nice. My bottleneck would be HDD IOPS but I can compensate for that by just spinning up multiple nodes on different drives. For each drive I set up one ZFS pool and give it a 2G write cache. Is there a command to find out how much of the write cache is used? zpool iostat SN2 2 -v is showing me the current "load" but how about the maximum ever seen? How do I get that?

1 Like
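
One crude way to answer the "maximum ever seen" question above (a sketch, not a polished tool; the pool name SN2 and the mirrored log vdev are taken from the outputs later in this thread): sample the log-vdev line of zpool list -v periodically and read the peak ALLOC off the log afterwards.

# log the SLOG line of "zpool list -v" once a minute; grep -A1 grabs the
# "logs" header plus the mirror line that carries the ALLOC column
while sleep 60; do
  date
  zpool list -v SN2 | grep -A1 '^logs'
done | tee -a slog-usage.log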

8k recordsize? That will make a 2MB file need 256 IOPS in comparison to a 1MB recordsize needing only 3 (for a 2.23MB file)… that is a huge difference, but with a decent drive that might not be the bottleneck in HDD performance, at least not with only <40Mbps ingress.
The majority of files is 2.23MB.
It is probably the database that needs many more IOPS.
I assigned a 1 GB write cache to my STORJ pool and it never went over 30MB usage, but I have no idea how to find out the maximum ever used… It did however smooth out the write load on my HDD significantly.

For reference, this is how I created my top 10 file size list: find ./ -type f | xargs -d '\n' du -s -B1 | cut -f 1 | sort | uniq -c | sort -n
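
A slightly shorter variant (assuming GNU find; untested here) that prints only the ten most common piece sizes:

find . -type f -printf '%s\n' | sort -n | uniq -c | sort -n | tail -10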

I have used the europe satellite and ignored salt lake. I don’t want to optimize my node for test data. I want to optimize it for customer data. 2MB pieces are mostly test data.

2 Likes

You should plan for $write_throughput * $zfs_txg_timeout of SLOG size. The txg timeout is 5 seconds by default, but you can change it via the module parameter.
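
For illustration, a sketch on Linux/OpenZFS (the sysfs path is the usual module-parameter location; the 100 MB/s figure is made up):

# check the current txg timeout; the default is 5 seconds
cat /sys/module/zfs/parameters/zfs_txg_timeout
# example sizing: 100 MB/s of sync writes * 5 s txg timeout -> plan for at least ~500 MB of SLOG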

I use ext4 on top of a zvol with 64K volblocksize.
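
A minimal sketch of that layout (pool name, zvol name, size and mountpoint are placeholders); note that volblocksize can only be set when the zvol is created:

zfs create -V 2T -o volblocksize=64K tank/storj-vol   # create the 64K-block zvol
mkfs.ext4 /dev/zvol/tank/storj-vol                    # put ext4 on top of it
mount /dev/zvol/tank/storj-vol /mnt/storagenode       # mount it for the node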

If you are asking about the ZIL, try these commands (on Linux):

cat /proc/spl/kstat/zfs/zil
zpool list -v

What would happen if I optimize it the other way around? SLOG size / write throughput = timeout? Any advantage?

~# cat /proc/spl/kstat/zfs/zil
15 1 0x01 13 3536 11875199444 38479856715053
name                            type data
zil_commit_count                4    2289740
zil_commit_writer_count         4    2289710
zil_itx_count                   4    603158
zil_itx_indirect_count          4    0
zil_itx_indirect_bytes          4    0
zil_itx_copied_count            4    0
zil_itx_copied_bytes            4    0
zil_itx_needcopy_count          4    594580
zil_itx_needcopy_bytes          4    3388841576
zil_itx_metaslab_normal_count   4    0
zil_itx_metaslab_normal_bytes   4    0
zil_itx_metaslab_slog_count     4    212869
zil_itx_metaslab_slog_bytes     4    3543190264
~# zpool list -v
NAME                                          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
SN1                                          5,45T  17,4G  5,44T        -         -     0%     0%  1.00x    ONLINE  -
  ata-WDC_WD60EFRX-68MYMN1_WD-WX51DA4762J0   5,45T  17,4G  5,44T        -         -     0%  0,31%      -  ONLINE
logs                                             -      -      -        -         -      -      -      -  -
  mirror                                     1,88G   128K  1,87G        -         -     0%  0,00%      -  ONLINE
    nvme0n1p1                                    -      -      -        -         -      -      -      -  ONLINE
    nvme1n1p1                                    -      -      -        -         -      -      -      -  ONLINE
SN2                                          5,45T  5,25T   208G        -         -    37%    96%  1.00x    ONLINE  -
  ata-WDC_WD60EFRX-68MYMN1_WD-WX21D1526TL0   5,45T  5,25T   208G        -         -    37%  96,3%      -  ONLINE
logs                                             -      -      -        -         -      -      -      -  -
  mirror                                     1,88G   216K  1,87G        -         -     0%  0,01%      -  ONLINE
    nvme0n1p2                                    -      -      -        -         -      -      -      -  ONLINE
    nvme1n1p2                                    -      -      -        -         -      -      -      -  ONLINE
SN3                                          1,81T   363G  1,46T        -         -     0%    19%  1.00x    ONLINE  -
  ata-WDC_WD2003FYYS-70W0B0_WD-WCAY00217591  1,81T   363G  1,46T        -         -     0%  19,5%      -  ONLINE
logs                                             -      -      -        -         -      -      -      -  -
  mirror                                     1,88G    36K  1,87G        -         -     0%  0,00%      -  ONLINE
    nvme0n1p3                                    -      -      -        -         -      -      -      -  ONLINE
    nvme1n1p3                                    -      -      -        -         -      -      -      -  ONLINE

What does that tell me?

for seeing how much of my SLOG is used i use zpool iostat -v

$ zpool iostat -v
                                                   capacity     operations     bandwidth
pool                                             alloc   free   read  write   read  write
-----------------------------------------------  -----  -----  -----  -----  -----  -----
rpool                                            6.86G   152G      0     24  7.64K   283K
  ata-OCZ-AGILITY3_OCZ-B8LCS0WQ7Z7Q89B6-part3    6.86G   152G      0     24  7.64K   283K
-----------------------------------------------  -----  -----  -----  -----  -----  -----
zPool                                            13.4T  24.8T     78    369  7.03M  9.94M
  raidz1                                         12.7T  14.6T     76    121  7.00M  1.84M
    wwn-0x5000cca2556e97a8                           -      -     30     33  1.75M   469K
    wwn-0x5000cca2556d51f4                           -      -     15     33  1.75M   469K
    ata-HGST_HUS726060ALA640_AR11021EH21JAB          -      -     15     27  1.75M   472K
    12046184624242064708                             -      -      0      0      0      0
    wwn-0x5000cca232cedb71                           -      -     15     27  1.75M   472K
  raidz1                                          777G  10.1T      2    149  38.6K  4.47M
    ata-TOSHIBA_DT01ACA300_531RH5DGS                 -      -      0     36  9.89K  1.13M
    ata-TOSHIBA_DT01ACA300_Z252JW8AS                 -      -      0     36  9.51K  1.11M
    ata-TOSHIBA_DT01ACA300_99QJHASCS                 -      -      0     39  9.69K  1.13M
    ata-TOSHIBA_DT01ACA300_99PGNAYCS                 -      -      0     37  9.54K  1.11M
logs                                                 -      -      -      -      -      -
  ata-OCZ-AGILITY3_OCZ-B8LCS0WQ7Z7Q89B6-part5    80.3M  4.42G      0     98    843  3.63M
cache                                                -      -      -      -      -      -
  ata-Crucial_CT750MX300SSD1_161613125282-part1   189G   411G      2     32  73.4K  3.68M
-----------------------------------------------  -----  -----  -----  -----  -----  -----

excuse the mess of random disk identifiers used to add the drives, but it was what i could get working at the time, still getting the hang of this zfs thing.
rpool is the OS
zPool is storage… really should get that renamed… the capital P is kinda driving me crazy, if i wasn’t to begin with… jury is still out xD
and yes i’m a drive down… working on that

iostat is pretty great… slow because it's averaged over a long period, but pretty great for getting a good sense of how the pool is performing.
thus far i've found -v, -l and -w useful
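
roughly what those look like in practice (the 5-second interval is just an example):

zpool iostat -v 5      # per-vdev throughput, sampled every 5 seconds
zpool iostat -v -l 5   # same, with average latency columns added
zpool iostat -w        # latency histograms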

IIRC long timeouts will make the response spiky (if the timeout were 1 hour, then for one hour it would not write to disk and then would write all the changes at once, and I think new writes would block during that time), but they give better overall performance because the writes are arranged to be almost sequential.
I have set mine to 30 seconds IIRC.
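
If anyone wants to try the same, a sketch of how that setting is usually applied on Linux/OpenZFS (30 is just the value mentioned above, not a recommendation):

echo 30 > /sys/module/zfs/parameters/zfs_txg_timeout                # change it on the running system
echo "options zfs zfs_txg_timeout=30" >> /etc/modprobe.d/zfs.conf   # make it survive a reboot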

There is a rather long discussion about it here:

2 Likes

What is your
zfs get used [dataset]
size compared to
du -hd0 [/data]
?

~# zfs get used SN2
NAME  PROPERTY  VALUE  SOURCE
SN2   used      5,25T  -
~# du -hd0 /mnt/sn2/
5,3T    /mnt/sn2/

du is rounding so here are the exact numbers:

~# du -hd0 -B1 /mnt/sn2/
5773421608448   /mnt/sn2/

but running 8k record sizes, don't you end up getting checksum bloat / slowdown from all the extra data?

you sure you don’t mean the blocksize of zfs
proxmox set my blocksize to 8k and recordsize was 128k by default… i assume 8k is the minimal size of the record and 128k is max then… the 8k one cannot be changed on a zfs pool after its created… the recordsize we can change as much as we like… if one disables the safeguards one should even be able to go up to 16M recordsizes… if @kevink is feeling frisky
i think ill stay at my 64k recordsize… seems to work nice
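
for clarity, a sketch of how the two knobs are inspected and changed (the dataset name is taken from the zfs get output further down; the zvol name is a made-up Proxmox-style example):

zfs get recordsize zPool/storagenodes/storj     # per-dataset, changeable any time (affects new writes only)
zfs set recordsize=64K zPool/storagenodes/storj
zfs get volblocksize rpool/data/vm-100-disk-0   # per-zvol, fixed when the zvol is created
# recordsizes above 1M need the module limit raised first, e.g.:
# echo 16777216 > /sys/module/zfs/parameters/zfs_max_recordsize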

Sorry, this is the same.
You need du -bd0.

And can you run tar (example some posts up) for a simple speed test?

blocksize is a parameter for volumes (zvols);
recordsize is for datasets.

~# du -bd0 /mnt/sn2
5690147768514   /mnt/sn2

So about a 77.5 GiB (83.3 GB) difference.

~# tar -c /mnt/sn2/storagenode/storagenode/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/a2/a2si54scflkurix7hclsov3liqbjfgbuit4fpmqmwioe4i3yva.sj1 | pv -brap > /dev/null
tar: Removing leading `/' from member names
2,22MiB [ 203MiB/s] [ 203MiB/s] [     <=>                                                                                                                                                                                                                                                                                   ]
~# tar -c /mnt/sn2/storagenode/storagenode/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/a2/* | pv -brap > /dev/null
tar: Removing leading `/' from member names
tar: Removing leading `/' from hard link targets
 907MiB [40,5MiB/s] [40,5MiB/s] [                                                                                                                                <=>

Let me execute that tar test on the artificial test data as well. This output should have been from europe without any test data.

~# tar -c /mnt/sn2/storagenode/storagenode/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/a2/* | pv -brap > /dev/null
tar: Removing leading `/' from member names
tar: Removing leading `/' from hard link targets
3,85GiB [45,4MiB/s] [45,4MiB/s] [                                                                        <=>

why not just use zfs get all SN2

$ zfs get all zPool
    NAME   PROPERTY              VALUE                  SOURCE
    zPool  type                  filesystem             -
    zPool  creation              Fri Feb 28 21:53 2020  -
    zPool  used                  10.7T                  -
    zPool  available             18.1T                  -
    zPool  referenced            1.97G                  -
    zPool  compressratio         1.01x                  -
    zPool  mounted               yes                    -
    zPool  quota                 none                   default
    zPool  reservation           none                   default
    zPool  recordsize            64K                    local
    zPool  mountpoint            /zPool                 default
    zPool  sharenfs              rw=@192.168.0.10/24    local
    zPool  checksum              on                     default
    zPool  compression           lz4                    local
    zPool  atime                 off                    local
    zPool  devices               on                     default
    zPool  exec                  on                     default
    zPool  setuid                on                     default
    zPool  readonly              off                    default
    zPool  zoned                 off                    default
    zPool  snapdir               hidden                 default
    zPool  aclinherit            restricted             default
    zPool  createtxg             1                      -
    zPool  canmount              on                     default
    zPool  xattr                 on                     default
    zPool  copies                1                      default
    zPool  version               5                      -
    zPool  utf8only              off                    -
    zPool  normalization         none                   -
    zPool  casesensitivity       sensitive              -
    zPool  vscan                 off                    default
    zPool  nbmand                off                    default
    zPool  sharesmb              off                    default
    zPool  refquota              none                   default
    zPool  refreservation        none                   default
    zPool  guid                  7412244713760912023    -
    zPool  primarycache          all                    default
    zPool  secondarycache        all                    default
    zPool  usedbysnapshots       0B                     -
    zPool  usedbydataset         1.97G                  -
    zPool  usedbychildren        10.7T                  -
    zPool  usedbyrefreservation  0B                     -
    zPool  logbias               latency                local
    zPool  objsetid              54                     -
    zPool  dedup                 off                    default
    zPool  mlslabel              none                   default
    zPool  sync                  always                 local
    zPool  dnodesize             legacy                 default
    zPool  refcompressratio      1.91x                  -
    zPool  written               1.97G                  -
    zPool  logicalused           10.8T                  -
    zPool  logicalreferenced     3.70G                  -
    zPool  volmode               default                default
    zPool  filesystem_limit      none                   default
    zPool  snapshot_limit        none                   default
    zPool  filesystem_count      none                   default
    zPool  snapshot_count        none                   default
    zPool  snapdev               hidden                 default
    zPool  acltype               off                    default
    zPool  context               none                   default
    zPool  fscontext             none                   default
    zPool  defcontext            none                   default
    zPool  rootcontext           none                   default
    zPool  relatime              off                    default
    zPool  redundant_metadata    all                    default
    zPool  overlay               off                    default
    zPool  encryption            off                    default
    zPool  keylocation           none                   default
    zPool  keyformat             none                   default
    zPool  pbkdf2iters           0                      default
    zPool  special_small_blocks  0                      default

or whatever dataset you want to use it on ofc… that will tell you exactly what's written…
okay, bad example: it doesn't count the child datasets in written.

NAME                      PROPERTY              VALUE                      SOURCE
zPool/storagenodes/storj  type                  filesystem                 -
zPool/storagenodes/storj  creation              Tue Apr 28  9:39 2020      -
zPool/storagenodes/storj  used                  6.99T                      -
zPool/storagenodes/storj  available             18.1T                      -
zPool/storagenodes/storj  referenced            6.99T                      -
zPool/storagenodes/storj  compressratio         1.01x                      -
zPool/storagenodes/storj  mounted               yes                        -
zPool/storagenodes/storj  quota                 none                       default
zPool/storagenodes/storj  reservation           none                       default
zPool/storagenodes/storj  recordsize            16K                        inherited from zPool/storagenodes
zPool/storagenodes/storj  mountpoint            /zPool/storagenodes/storj  default
zPool/storagenodes/storj  sharenfs              rw=@192.168.0.10/24        inherited from zPool
zPool/storagenodes/storj  checksum              on                         default
zPool/storagenodes/storj  compression           zle                        local
zPool/storagenodes/storj  atime                 off                        inherited from zPool
zPool/storagenodes/storj  devices               on                         default
zPool/storagenodes/storj  exec                  on                         default
zPool/storagenodes/storj  setuid                on                         default
zPool/storagenodes/storj  readonly              off                        default
zPool/storagenodes/storj  zoned                 off                        default
zPool/storagenodes/storj  snapdir               hidden                     default
zPool/storagenodes/storj  aclinherit            restricted                 default
zPool/storagenodes/storj  createtxg             964311                     -
zPool/storagenodes/storj  canmount              on                         default
zPool/storagenodes/storj  xattr                 on                         default
zPool/storagenodes/storj  copies                1                          default
zPool/storagenodes/storj  version               5                          -
zPool/storagenodes/storj  utf8only              off                        -
zPool/storagenodes/storj  normalization         none                       -
zPool/storagenodes/storj  casesensitivity       sensitive                  -
zPool/storagenodes/storj  vscan                 off                        default
zPool/storagenodes/storj  nbmand                off                        default
zPool/storagenodes/storj  sharesmb              off                        default
zPool/storagenodes/storj  refquota              none                       default
zPool/storagenodes/storj  refreservation        none                       default
zPool/storagenodes/storj  guid                  11797936557373032860       -
zPool/storagenodes/storj  primarycache          all                        default
zPool/storagenodes/storj  secondarycache        all                        default
zPool/storagenodes/storj  usedbysnapshots       0B                         -
zPool/storagenodes/storj  usedbydataset         6.99T                      -
zPool/storagenodes/storj  usedbychildren        0B                         -
zPool/storagenodes/storj  usedbyrefreservation  0B                         -
zPool/storagenodes/storj  logbias               latency                    inherited from zPool
zPool/storagenodes/storj  objsetid              1084                       -
zPool/storagenodes/storj  dedup                 off                        default
zPool/storagenodes/storj  mlslabel              none                       default
zPool/storagenodes/storj  sync                  always                     inherited from zPool
zPool/storagenodes/storj  dnodesize             legacy                     default
zPool/storagenodes/storj  refcompressratio      1.01x                      -
zPool/storagenodes/storj  written               6.99T                      -
zPool/storagenodes/storj  logicalused           7.01T                      -
zPool/storagenodes/storj  logicalreferenced     7.01T                      -
zPool/storagenodes/storj  volmode               default                    default
zPool/storagenodes/storj  filesystem_limit      none                       default
zPool/storagenodes/storj  snapshot_limit        none                       default
zPool/storagenodes/storj  filesystem_count      none                       default
zPool/storagenodes/storj  snapshot_count        none                       default
zPool/storagenodes/storj  snapdev               hidden                     default
zPool/storagenodes/storj  acltype               off                        default
zPool/storagenodes/storj  context               none                       default
zPool/storagenodes/storj  fscontext             none                       default
zPool/storagenodes/storj  defcontext            none                       default
zPool/storagenodes/storj  rootcontext           none                       default
zPool/storagenodes/storj  relatime              off                        default
zPool/storagenodes/storj  redundant_metadata    all                        default
zPool/storagenodes/storj  overlay               off                        default
zPool/storagenodes/storj  encryption            off                        default
zPool/storagenodes/storj  keylocation           none                       default
zPool/storagenodes/storj  keyformat             none                       default
zPool/storagenodes/storj  pbkdf2iters           0                          default
zPool/storagenodes/storj  special_small_blocks  0                          default

you can get slightly better numbers using lz4 compression… i only get a 1.01x compression ratio,
but i figured i would rather waste 2-3% disk capacity than have my cpu constantly trying to compress stuff that isn't compressible… so i run lz4 on the zPool but zle on the storagenodes dataset…

i could have sworn i had changed that recordsize to 64k
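
roughly the commands behind that setup, in case anyone wants to copy it (dataset names taken from the output above):

zfs set compression=lz4 zPool                    # cheap compression on the parent
zfs set compression=zle zPool/storagenodes/storj # zle (zero-run-length only) on the node data
zfs get -r compression zPool                     # verify what each dataset uses or inherits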

and the power of ZFS read cache:

~# tar -c /mnt/sn2/storagenode/storagenode/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/a2/* | pv -brap > /dev/null
tar: Removing leading `/' from member names
tar: Removing leading `/' from hard link targets
 907MiB [ 762MiB/s] [ 762MiB/s] [           <=>
1 Like

this is saltlake, full of test data.

aren’t you just copying data from one place to another? or how exactly does this tar thing work…
sorry i’m pretty linux code daft. xD
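
Roughly, for anyone wondering (my reading of the command from the first post): nothing gets copied anywhere, it is just a sequential read benchmark.

# tar -c reads every file under the path and packs it into one archive stream on stdout,
# pv measures that stream (-b bytes, -r rate, -a average rate, -p progress bar),
# and > /dev/null throws the stream away, so only the disk reads matter
tar -c /path/to/blobs | pv -brap > /dev/null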