ZFS discussions

I can’t parse this.

Earlier you said:

I was talking about striping across all top-level vdevs.
It looks like you mean something of your own by the word “stripe”.
A stripe is RAID0; the other word, I think, is alternation.
How can you alternate across just one thing?

single drive (160MiB/s at start), 4MB recordsize, data from v4…

root@gwd:~# tar -c /mnt/test/ | pv -brap > /dev/null
tar: Removing leading `/’ from member names
553MiB [42.2MiB/s] [50.2MiB/s]

raidz 4+1, same disk models, 4MB recordsize, data from v4…
root@gwd:~# tar -c /mnt/test2/ | pv -brap > /dev/null
tar: Removing leading `/’ from member names
177GiB [79.7MiB/s] [79.7MiB/s] [

same as above but 8K recordsize
root@gwd:~# tar -c /mnt/test2/ | pv -brap > /dev/null
tar: Removing leading `/’ from member names
177GiB [54.8MiB/s] [54.8MiB/s] [


granted i may be using confusing lingo here… rookie here…
to my knowledge a stripe is usually a data block, sort of like an hdd sector, that is written across multiple drives, forming a raid-level sector size which i believe is called a stripe… i’m unsure if that includes the parity parts of it or not…
but i suppose it wouldn’t, since in a regular raid, which i’m more familiar with, the parity is located on one drive, thus enabling the expansion and reduction of raid volumes, which is a feature i really miss in zfs actually…

how do i do this tar thing… last time i tried it, it just seemed to run in the background… or are you using some other method of measuring it…

No! BTRFS IS NOT FOR STORJ! Sorry for the caps.
Don’t use btrfs or you will lose your node. BTRFS is slow for our workload and buggy.


Maybe pv is not installed on your system. Try running pv; if it fails, install it with your package manager.
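For example, on a Debian/Ubuntu-style system it would be something like this (the path is just an example, point it at your own dataset):

apt install pv
tar -c /mnt/test/ | pv -brap > /dev/null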

When I called your pool a “stripe of raidz”, I just meant that the data is distributed across both raidz vdevs and read from both of them. That’s all.
A stripe in the traditional-array sense, as a fixed-size data block, does not exist in ZFS in that exact form. Instead there is a variable-size record.
You can see infographics at https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz

The article is difficult to understand and, by and large, the title contradicts its content and the data in its tables. But the RAID-Z block layout diagram and its explanation (especially regarding where the padding comes from) are very useful.
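If you want to check or change the record size on your own dataset, it is a per-dataset property; the dataset name below is just a placeholder, and the setting only applies to newly written data:

zfs get recordsize tank/storj
zfs set recordsize=1M tank/storj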


yeah i got you, i’m sorry, i’m just used to the regular raid lingo… i really hate that every damn thing has its own damn language…

it’s insanity lol, yeah got it working now with pv, thx…
need a bit before i throw out some numbers… got my second scrub of the last 24 hours running… but it’s looking quite respectable this time around…
just a few slight, expected misunderstandings from me being mean to the pool for a while…

 pool: zPool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub in progress since Fri May 15 10:25:24 2020
        13.0T scanned at 399M/s, 12.5T issued at 385M/s, 13.5T total
        20K repaired, 92.79% done, 0 days 00:44:11 to go
config:

        NAME                                             STATE     READ WRITE CKSUM
        zPool                                            ONLINE       0     0     0
          raidz1-0                                       ONLINE       0     0     0
            wwn-0x5000cca2556e97a8                       ONLINE       0     0     0
            wwn-0x5000cca2556d51f4                       ONLINE       0     0     0
            ata-HGST_HUS726060ALA640_AR11021EH21JAB      ONLINE       0     0     0
            ata-HGST_HUS726060ALA640_AR11021EH2JDXB      ONLINE       0     0     0
            wwn-0x5000cca232cedb71                       ONLINE       0     0     0
          raidz1-3                                       ONLINE       0     0     0
            ata-TOSHIBA_DT01ACA300_531RH5DGS             ONLINE       0     0     0
            ata-TOSHIBA_DT01ACA300_Z252JW8AS             ONLINE       0     0     0
            ata-TOSHIBA_DT01ACA300_99QJHASCS             ONLINE       0     0     5  (repairing)
            ata-TOSHIBA_DT01ACA300_99PGNAYCS             ONLINE       0     0     0
        logs
          ata-OCZ-AGILITY3_OCZ-B8LCS0WQ7Z7Q89B6-part5    ONLINE       0     0     0
        cache
          ata-Crucial_CT750MX300SSD1_161613125282-part1  ONLINE       0     0     0

errors: No known data errors

first test gave me a whopping 3MB/s tho… xD so that’s nice
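for reference, once the scrub finishes and the drive behaves, the checksum counter from the status above can be reset with the command the message suggests:

zpool clear zPool ata-TOSHIBA_DT01ACA300_99QJHASCS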

No, parity is not all located on one drive unless you are using RAID4, which is discouraged. With RAID4, since the parity information is all on one drive, the parity drive is written to on every write. This throttles writes to the speed of one drive and wears out the parity drive faster than the others.

RAID5 round-robin assigns the parity block to a different drive for different stripes.

RAID4:

Disk 1   Disk 2   Disk 3
0        1        p
2        3        p
4        5        p
6        7        p
8        9        p
10       11       p

RAID5:

Disk 1   Disk 2   Disk 3
0        1        p
2        p        3
p        4        5
6        7        p
8        p        9
p        10       11

Note that even RAID5 can be expanded with additional drives while active.
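A rough sketch of that round-robin placement for the 3-disk layout above (one possible rotation; real implementations differ in the exact order):

for stripe in 0 1 2 3 4 5; do
  echo "stripe $stripe: parity on disk $(( 3 - stripe % 3 ))"
done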


ah right, forgot that was raid4 only… pretty sad that one cannot add drives to or remove them from a zfs raidz tho… and the last lecture i saw about it didn’t look very promising; even though he attempted to make it work, it seemed like a bit of a mess… to be honest.


I created another dashboard with detailed operations stats, and I can show everyone that the SLOG device is never used for storj-only operations.

Description:
nvd0, nvd1 - mirror, L2ARC
nvd2, nvd3 - mirror, SLOG
daX - HDD drives

Statistics collected over the last two days (when activity was high):

(dashboard screenshot)
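If you want to check this on your own pool without a dashboard, per-vdev statistics also show whether the log and cache devices get any traffic (pool name is a placeholder):

zpool iostat -v tank 5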

you are using zabbix for that stuff?
well with the current usage of the network my l2arc won’t even warm up… at this speed it will take 14 days to warm up… lol and i keep doing stuff that ends up with me rebooting… which isn’t really helping…

and for the time i’ve had the pool set up like it is now, i think the l2arc has only gotten warm once… but hopefully i’ll start having better stability now… i swear it’s a feature not a bug…

right now my l2arc is basically dead, but i’m not doing anything…
you’ve got 4x NVMe ssds hooked up to your pool? please tell me they are not samsung evos/pros
how big is your l2arc? ofc if it never fills then what’s the point lol

no matter whether the l2arc works for storj or not… i need it for my vms and other stuff i want to run on the pool… if it helps my storj then that’s good… from what i can see it seemed to do that…
but yeah i haven’t seen any amazing io out of it… though it has had a good hit rate of up to 66-70% when the pool was only serving the storagenode.
i don’t really have any records of that stuff yet, and i have barely had it warmed up since i increased it to 600gb. from what i’ve heard, when dealing with large datasets one cannot get too much l2arc nor ram… ofc ram might be better, but that’s also expensive… i could go to 288gb with this server… but for now i’m going to have to make do with 48gb of ram, of which most is arc or metadata

l2arc might not be an economical solution or a good return on investment, but it will help… even if just a tiny bit xD and it sure does help my vms over extended periods… if one is turned off for a week and i power it up again, it springs to life like it was living on an ssd… ofc when i do stuff i don’t usually do while the pool is busy, i can feel that it’s stored on hdds… but the l2arc really takes the worst of that out… and when it’s trained it keeps me happy… also it seems to deal with my docker log export script that appends to logfiles every minute xD
which is a nice bonus lol, not the most efficient solution in the world… that’s for sure
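for what it’s worth, this is how i peek at the l2arc counters on linux (standard ZoL locations, adjust if your distro differs):

grep ^l2_ /proc/spl/kstat/zfs/arcstats
arc_summary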

was looking at my throughput with sync=always and the slog…
which is kinda terrible… actually it’s beyond terrible… lol
but latency is good… hehe and in the worst case it seems like i can deal with 2k files a second and about 40MB/s

which i know is kinda horrible… but i’m trying to work around that issue… i think the slog is writing 128k record/block sizes and thus its bandwidth might be taken up by a very limited number of file reads and writes.

if i wanted raw performance i would set sync to disabled for the storagenode… the storagenode doesn’t seem to care… or set it to standard…
besides, it’s not like the storagenode really needs the throughput, so if i can go from 60ms read latency to 10ms, that seems very advantageous for my egress.
but it has been a bit of a configuration headache.
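switching it is just a per-dataset property anyway, something like this (dataset name is just an example):

zfs set sync=disabled tank/storagenode
zfs get sync tank/storagenode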

Not sure how you set it up, but I can prove the opposite. The SLOG is definitely being used, and it also changes the write pattern on the HDD… Will provide a picture once my scrub is done… (Well, may need to wait for more ingress though…)

Maybe you can also post the output of zpool status?

the same as I mentioned before :slight_smile:

Yes, I have 4 NVMe drives; the vendor doesn’t matter. As I showed before, L2ARC and SLOG don’t bring any benefit for storj-only operations. I think @littleskunk would like to see POC results :wink: and I showed them.
I personally prefer pure analytics numbers from a real workload (the storage node) instead of writing a lot of words without any proof.
I will disconnect the L2ARC and SLOG from my pool because they make no sense and bring no benefit.

If you share your array with other stuff like VMs it can bring benefits, but that is your unique use case; we are talking about the storage node workload, and nobody cares about your VMs :smiley:
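For reference, cache and log vdevs can be removed from a live pool, roughly like this (pool and vdev names are placeholders, take the real ones from zpool status):

zpool remove tank nvd1          # remove a cache (L2ARC) device
zpool remove tank mirror-1      # remove a mirrored log vdev by its name from zpool status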

Set sync=always and it will be used. The performance impact is a different question.


i did find my bottleneck now…
i ended up trying to transfer a million 10-byte files, which then ofc maxed out my slog ssd in IO,
because it has 512-byte sectors while my proxmox-created zfs pool runs 8192-byte blocks, so every IO from the zfs pool to my slog is multiplied by 16, and this old sata ssd caps out at around 60k IOPS.

it’s the main bottleneck for now when i want to run sync=always to improve my hdd read latency.
ofc if you’ve got enough hdds then read latency would be low because they can keep up with the IO, but mine sure haven’t been able to until now.

hence with sync=always i go from 60ms to 10ms read latency as recorded by zpool iostat -l
but many factors are involved in this kind of stuff… one solution isn’t right for another setup.
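this is how i checked the sector sizes and the pool’s ashift, in case anyone wants to compare (pool name from the status output above, standard linux/zfs tools):

lsblk -o NAME,LOG-SEC,PHY-SEC
zdb -C zPool | grep ashift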

Yes, it’s true, but why would you enable it and degrade the performance of the pool? That remains an open question…

I have a joke for comparison :slight_smile:
(meme image)
This man can do it, but does he need to do this…

I used: Standard

Standard uses the sync settings that have been requested by the client software,
Always waits for data writes to complete,
Disabled never waits for writes to complete.
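For reference, the setting is per dataset and can be checked or changed like this (pool/dataset names are placeholders):

zfs get -r sync tank
zfs set sync=standard tank/storagenode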

I am using a single disk per pool. The storage node writes a lot of small files, which means my HDD is running at max IOPS. In this situation I can’t really degrade the performance.


seems like i might need to migrate my pool to a 512-byte block size to utilize my hardware correctly… i’ve got nothing that can utilize an 8k block size, so at best it halves my IO and at worst it gives me 16x io amplification… granted, if it was 8k or purely 4k-sector-based hardware it wouldn’t be a huge issue…
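as far as i understand, ashift is fixed per vdev at creation time, so going to 512-byte sectors would mean rebuilding the pool, roughly like this (disk names are placeholders, and ashift=9 only makes sense if the drives really are 512-byte-sector devices):

zpool create -o ashift=9 newpool raidz1 disk1 disk2 disk3 disk4 disk5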

Can I ask: what problem are we trying to solve in this thread, or what goal do we need to achieve?
Because I can’t continue without this information…