ZFS discussions

I’m wondering what kind of workload you are expecting.
My home server doesn’t see much load even though I have lots of stuff running, from 3 nodes to Nextcloud.

If I need a backup OS I use Ubuntu on a USB stick.

How are you doing backups on your zfs pool? I’m still looking for a good solution for storing zfs snapshots. I could send them as files to Backblaze, but there are no tools for managing zfs snapshots kept as files… So for the time being I send my snapshots to a 2nd PC and sync my pictures using Duplicati…
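For reference, the send-to-the-2nd-PC part looks roughly like this for me (pool, dataset and host names are just placeholders for my setup):

# take a snapshot and send it over ssh to a pool on the other machine
$ zfs snapshot tank/pictures@2020-04-25
$ zfs send tank/pictures@2020-04-25 | ssh backup-pc zfs receive backup/pictures
# later snapshots can be sent incrementally, so only the changes travel
$ zfs snapshot tank/pictures@2020-05-01
$ zfs send -i @2020-04-25 tank/pictures@2020-05-01 | ssh backup-pc zfs receive backup/pictures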

So let’s talk about optimization…

Did you guys see that persistent L2ARC support has just been pushed to ZoL?
And the new ZSTD compression has also been committed, or whatever it’s called when it’s released…

That is amazing… I really need a persistent L2ARC; it bothers me that it takes the better part of a week to re-warm my L2ARC.

So I’m going to keep reading up on this and try to figure it out…

Basically I suppose I need to update my ZoL (if that can be done safely) and then figure out the commands from the zfs man page…

Anyways, I know what I will be researching the next few days when I have free time… xD
ZSTD might not be totally useful for storj-related zfs… but the persistent L2ARC will make reboots so much smoother…
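from what i’ve read so far, once a ZoL release with both features lands it should be something like this… the property value and the module parameter name are taken from the OpenZFS work, so treat them as assumptions until the release notes confirm them:

# turn on zstd for a dataset (new value for the existing compression property)
$ zfs set compression=zstd tank/somedataset
# persistent L2ARC is reportedly a module parameter, so the cache survives reboots
$ echo 1 > /sys/module/zfs/parameters/l2arc_rebuild_enabled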

That’s nice and all, but it’ll probably take years until it is available on Ubuntu lol…
And since I’m stuck on 19.04 until I upgrade my whole system, my zfs version is actually ancient.

While I generally agree with the “encrypted data can’t be compressed” part, I also see lower disk usage: my rsync transfer shows 504GB while my disk uses 474GB.

I’m not sure how exactly storj files are composed, but there might just be some part that can be compressed, or btrfs itself handles these files better (different recordsize or whatever).

don’t waste your time even trying…

I don’t waste any of my time, just a little CPU for the compression calculation.

I’ll see how the total disk consumption looks once everything is moved over; so far it looks promising.
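I’ll probably check the real numbers with compsize instead of just comparing df output; a rough sketch (the path is a placeholder for wherever the node data ends up, and compsize usually needs installing first, e.g. from the btrfs-compsize package):

# show compressed vs. uncompressed usage per algorithm for the node data
$ compsize /mnt/btrfs/storagenode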

6% is surely something, but it might just be a deviation; you need to account for a great many factors. But let’s assume you are on the same drive, so it isn’t some sort of weird sector-size artifact… I tried lz4 and gzip-9 on zfs with little or no compression gain.

Are you sure it’s not the same thing we saw on zfs? … If you are compressing data coming from an uncompressed volume, some area of some of the blocks / records will be empty, and thus you get what seems to be compression, but it’s really just the empty parts of the blocks / records being reduced to near zero.

Like, say a 128k block / recordsize divided into the 2265kb files of storj max piece size gives you 17.7, so the file takes 18 blocks and about 3/10 of the last block is empty.
Each block is 100% divided by 18 parts = 5.5%, and we take 3/10 of that… so about 1.66%.
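a quick back-of-the-envelope check of those numbers (assuming the 2265kb piece size from above and ignoring metadata overhead):

# slack in the last record as a percentage of the 18 allocated 128k records
$ echo "scale=2; (18*128 - 2265) * 100 / (18*128)" | bc
1.69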

Hmmm, that doesn’t quite add up… well, my guess would be that this is it, and you maybe didn’t use a very big dataset.

I do get 5% running 256k recordsizes using lz4… so that’s my guess: you are on an uncompressed volume with 256k recordsizes… but that’s mostly artificial, because of the recordsize compared to the filesize compared to how much space it would take on an uncompressed partition / volume.

But yeah, it does look very motivating at first; ZLE will do the same job for less CPU… it won’t give you more room in your ARC though.

6% still means you save room on disk… but it most likely just means that space was empty to begin with…

When using zfs you save the optimal amount of room to begin with… nothing seems to be able to change how much space the storj files take… only if you turn off compression do recordsizes become relevant for anything other than IO.

If possible I try to avoid using zfs, so the choice was limited to btrfs for compression testing.

Generally every reduction of used space on disk over what I got from xfs is favoured, so I’d be fine even with only a marginal reduction, however that is achieved.

I also have no idea which recordsize (or equivalent) btrfs uses by default.

I used zstd compression, as that seemed like the most advanced of the ones available.

Yeah, zstd is very new, even if it has been in development for a great many years by now.
Well, if you are looking to save space, any filesystem that can run compression without too much penalty on the system should do more or less the same job…
You could try to compare lz4 or gzip to zstd… if zstd doesn’t give you a better result, then I’m pretty sure you are just compressing zeros into basically mathematical shorthand, i.e. 10e9 or 10^9 and such;
then for performance reasons you will most likely be better off with something like ZLE.
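on zfs the comparison itself is quick to do with the algorithms that are already there, something along these lines (pool and dataset names are just examples):

# two test datasets with different compression, fed the same sample data
$ zfs create -o compression=lz4 tank/test_lz4
$ zfs create -o compression=gzip-9 tank/test_gzip9
$ cp -a /some/sample/data/. /tank/test_lz4/
$ cp -a /some/sample/data/. /tank/test_gzip9/
# then compare what each algorithm actually achieved
$ zfs get compressratio tank/test_lz4 tank/test_gzip9

if both end up at basically the same ratio, it’s most likely just the empty record tails being squashed.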

Of course, if you use the filesystem for other stuff, then it might make it worthwhile.

Recordsize is basically blocksize, I guess. I did consider btrfs… but I decided it was a bit too new for me to trust it; I hear good things… but also that it can be a bit temperamental in its RAID-type solutions… but I sure wouldn’t mind that add-a-disk / remove-a-disk feature it’s going to… have… has… or whatever :smiley:

Well, depending on what you are trying to do, any basic filesystem will most likely fill the need, though of course some configuration might be required to truly optimize every bit on a drive…

That’s what compression in a filesystem gives you… and though you may feel btrfs is a good choice because it’s newer and has a few cooler features… keep in mind zfs is the most expensive filesystem ever developed…

So really, if you want the best of the best in data reliability and system reliability, then well-established systems are the way to go… of course btrfs won’t ever get any better without testers, user feedback and interest, and if they can do even half of what they promise then it will be quite spectacular.

any filesystem that can run compression without too much penalty on the system should do more or less the same job…

Indeed, that might be true; I’ll test that with the next node and the fastest compression available for btrfs.
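On btrfs the algorithm is picked per mount, so it should be easy to try; roughly (device and mount point are placeholders, and the zstd level syntax needs a reasonably recent kernel, so treat that part as an assumption):

# lzo is generally the cheapest, zstd:1 is the lowest/fastest zstd level
$ mount -o compress=lzo /dev/sdX /mnt/node
$ mount -o compress=zstd:1 /dev/sdX /mnt/node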

What methods do you guys use for attaching disks to your pools?

I was replacing a damaged drive and ended up using:

$ ls -la /dev/disk/by-id

to find the drives, and

$ zpool replace poolname /dev/sdX /dev/disk/by-id/<id-of-the-replacement-disk>

I tried to follow a ton of different crazy long convoluted suggestions, but didn’t have much luck with that… mostly because I’m bad at command-line linux, I guess.

Not sure if this works with an empty drive though; I think it should… however, I used the proxmox webgui to create a zfs partition on it…

The reason I use the UUID of the drive is that I can then shuffle the drives around, which imo is a critical feature of a RAID setup… of course it ain’t critical until you pull out the drives and have to replace them in something else… xD

Anyways… I was wondering if I made any big mistakes, or if anyone has suggestions for better ways to do this… supposedly easy task… it did manage to take me the entire morning to figure it all out and get it resilvering.

Suggestions and criticism welcome…

This is my first zfs setup and I didn’t have to replace any disks yet… So I can’t say anything about that.

Well, the drive had 55000+ hours of spinning time… so I can’t really blame it for having issues… it just barely reached a decade. Almost makes me want to buy more of them… but they’re kinda on the small side at 3tb.
And I suspect it might be a circuit-board issue, of which I have a spare… so maybe I’ll get it to work again lol. But hopefully the node can start earning some proper profits for some replacement drives…

This should work. What error were you getting when you tried the above command? Was /dev/sdX already listed in the zpool when you did zpool status?

The drive kept throwing errors; I dismissed it the first time because I thought it was from when I first tested the zpool redundancy by pulling a drive and later shuffling them around… just to see whether I could do that with how I set it up…

So I figured I had forgotten to type the command to put it back in the pool… so I did a… deep long silence…

$ zpool clear poolname

which then did a resilvering, and I thought everything was fine… then a few days later it popped up again… then I decided I might as well push up my switch from my raid card to my dual HBAs and hope zfs with direct access would solve whatever was wrong… or maybe it was a cable… so I shuffled the drives… and the issue came back after I cleared it again… it took a few days before it noticed the read errors…

So I started to worry a bit more and learned about the whole scrub thing… seemed relevant lol.
That put the faulted drive into repair mode, which didn’t help either…
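for anyone who hasn’t run one before, it’s basically just this (pool name is a placeholder):

# kick off a scrub, then keep an eye on progress and per-device errors
$ zpool scrub poolname
$ zpool status -v poolname
# a running scrub can also be stopped again if it gets in the way
$ zpool scrub -s poolname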

So while I was at it, I added a couple of 6tb drives… I was initially going to add 5, but my backplane had some corrosion from being in a bit too damp a room… kinda saw that one coming… but I had hoped to get it into a more controlled climate.

So now instead of adding 10 drives as 2 vdevs of 5 drives, it will just be 5 drives until I get it fixed… I might move the node to another computer for a week, take down the server and try to fix the backplane… it didn’t look too bad… but I digress.

So to move the pool data, I started putting in the 6tb drives and replacing the 3tb drives currently occupying 5 of the 7 working front bays,
and I started, of course, with the one that I couldn’t get to behave…
It would resilver with no errors, then the status would be without errors for a couple of days, or until shortly after I started a scrub; then it would get faulted with too many read errors…

I then figured it could repair it… seemed like a bad sector or something… but when I checked the SMART data it looked more like it drops the connection… and I switched the cables and the SAS controller and shuffled the drives… so only the drive’s circuit board or the magnetic recording remain as things that could cause an issue… the drive is also the exact same model as the rest… so no incompatibility issues either.
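the SMART check was just smartmontools, roughly like this (the device name is whatever the drive currently shows up as):

# dump health, attributes and the error log for the suspect drive
$ smartctl -a /dev/sdX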

I did read that zfs is made for 4 drives in raidz1… but meh, I doubt that really matters… many other people seem to run 5 drives with no issues…

Anyways, the node is growing, and fairly fast… so it’s a matter of getting room freed for it to live on…
The plan is to use the same approach for the functioning disks as well… afaik that should work fine.

Yeah, dunno why it listed them using those annoying nondescript identifiers… they are basically useless and move around all the damn time lol… proxmox made my raidz… so maybe that’s why…
Yes, the sdX was in the pool already; I could easily have used an sdY, sdN… sdX, you know what I mean… to replace it, but then that would be the way zfs had it attached if the pool was moved to a different computer or whatever… by using the uuid, I can just do a zfs mount -a

And really… usually I don’t even need to do that… it just seems to spring to life… as long as I don’t boot the system without it having access to the drives, because then it starts creating my zpool mountpoints as temp folders on my OS drive… and then it gets mad when it tries to mount…

It’s also about keeping it simple, because I know I’ll be the one fixing it later, and I don’t want to create trouble for myself… I’d rather spend 10 minutes extra doing it right now… than 4 hours of headache when it crashes hard…
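and from what i understand, a pool that was built with the sdX names can be switched over to the by-id names without rebuilding anything… roughly like this (pool name is a placeholder, and the pool should be idle while you do it):

# re-import the pool so every vdev is referenced by its /dev/disk/by-id path
$ zpool export poolname
$ zpool import -d /dev/disk/by-id poolname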

So yes, it was already listed in the zpool xD
Proxmox used a 3rd way of identifying the drives though… it creates a partition, then adds a boot partition in case you boot from the drive, and then gives the zfs vdev partition a name like zfs-blablarandomnumbersandlonglettername…
and uses that partition name to mount the drives correctly… I guess that’s also a way of doing it… but then I ran into the nice and simple world of linux partitioning tools…

slams head into table

I was so mad after going through all of this stuff that should have been essentially 5 minutes’ worth of work, if there had been some sort of reasoning and logic behind the structure of how to do it…

If this wasn’t the last drop, then it was the second to last… next time I get the chance I think I will give freebsd another chance… or maybe one of the related types / versions / distributions / forks… I always just assumed it was me being unaware of how linux worked… but from what I understand from the BSD side of things, it’s just because linux is basically built by democrazy… and that makes its userland INSANE…

So yeah… sorry linux, I’m gonna have to put you down…
It’s kinda sad, because I do kinda like some of the features, and how I can customize everything if I dig deep enough… and the big, big user base compared to the competition…

I just don’t think I can do linux command line and keep my sanity at the same time…
Though some would say that was gone a long time ago lol, or that I cannot do command-line linux… and they would sort of be right… so let’s just call it a metaphor instead…

        resilver in progress since Sat Apr 25 11:03:55 2020
        9.12T scanned at 297M/s, 6.64T issued at 217M/s, 9.71T total
        1.33T resilvered, 68.39% done, 0 days 04:07:42 to go

I sort of assumed issued meant transfer speed to the disk… but 217MB/s?
I mean, this is a SAS 4k enterprise drive… I guess that could explain the kinda ridiculous speed.
I don’t think I’ve ever seen a 7200rpm drive reach those speeds… and usually write is the slowest…

This is literally the longest story I have read today… :joy:

And all I can say is: don’t give up on yourself, there is always hope for those who struggle with Linux :wink:


Well, it was what I could get to run… lol, so I won’t exclude that I’ll end up on linux anyway… but I kinda really like to run BSD for a host of reasons: 1. I can repurpose the OS and software because it’s open to any use case without anyone attacking with legalese over copyrights and patents,
2. it has great documentation, 3. it has a user-friendly userland.

Windows has the last two… linux lacks all of the above. GPL isn’t open source, but people need to get paid… it’s not for no reason that windows, iOS and many more contain fundamental parts of BSD…

The only bad thing about freebsd is their speed at fixing new security flaws…
But generally I could see myself needing a software base to use for building something else in a few years… so it would make good sense to have that option and understand the OS well…

Really, you read the whole thing… stop it, nobody has time for that… go watch youtube instead…

I got used to Debian and Ubuntu so I stuck with them. Don’t want to learn the problems of other distros too :grin:
But zfs support is not the best in those… At least in the older versions.

My connection was too bad for youtube… :wink:

I don’t get it, fdisk seems to work fine and is not that difficult to learn. You can also use parted or even gparted (GUI).

It’s on the whole pool. So, if you have 10 drives, each would be doing 21MB/s in this case etc.

The “issued” part is new in ZoL 0.8, I think. Before, it used to scan all the data in an order that was logical on the filesystem, but that resulted in a lot of random IO on the drives. The new feature is that zfs consolidates the IO to make it more sequential for the drives, and it also results in slightly less total IO.


So scanned is like read, and issued is like write close to the disk IO stream; sort of makes sense why one would change the names instead of just calling it something that makes sense xD
But I’ll blame that on zfs…

Well, I did try parted… keep in mind the last time I used a terminal much was back in DOS 6.22…
so though my keyboard kung fu is higher than most users’, it’s not really something I get into.
parted was fun though… I started it and then had to press i for ignore 15 times for all the partitions it wanted to re-fix to squeeze out an extra block, then I started trying to make a partition,
but first I ran into it complaining that my label on the partition was wrong… which turned out not to be about me creating a partition, but about designating a partition table type, like GPT in the case of ZFS.
Then I finally got started on making the partition,
and I ended up finally being asked what it was named, because that was actually all I wanted to do to begin with… put the disk serial in the partition name and move on, so zfs could identify it and rebuild the partition however it liked… so really just the name was fine… but oh no, no, no…
I then had to designate filesystems and start figuring out exactly where one starts to write on a drive… if memory serves, one cannot start from byte 1… or it’s unwise…
Of course, if that’s the case, why not make it an advanced-user feature… and let new users just have a commonly used default, or at least some suggestions on what to use…
Then I just figured, what the hell, we’ll do 1 and then just 600000000000000000000 because it’s a 6tb drive, and at this point I had been reading suggestions online for like 45 minutes and trying to create a partition for like 30…
Then it says… cannot start the partition after it ends… and ends up with a weird -1200000000000000000 start and 1200000000000000 end.
Then I tried just using 0 for auto… to no avail…
At this point I wasn’t sure if zfs would destroy the partition and remake it or use the one I made… so I figured, what the hell, I’ll see if I can’t find a way where I know it will be a bit more optimized for zfs,
since I was sure zfs would mostly take care of it, if I only knew how to ask it to identify the disk
rather than using the sdX identifiers.
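for what it’s worth, i think the short version of what i was fumbling for is something like this (device name is a placeholder, and as noted below zfs will happily partition a whole disk by itself if you just hand zpool the /dev/disk/by-id path):

# fresh GPT label, then one partition spanning the disk,
# letting parted handle alignment instead of guessing byte offsets
$ parted /dev/sdX mklabel gpt
$ parted -a optimal /dev/sdX mkpart primary 0% 100%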

Before parted I tried gpart, but that turned out to be something old and didn’t go well with GPT, and it wrote lots of red warnings when I listed my disks / partitions… most likely the same stuff parted forced me to ignore or fix on startup…

It’s not that the tools don’t work… they just make me feel like I’m using a helicopter to drive in a nail…
It’s not that it isn’t a great tool… but I’m trying to drive in a nail… I just needed to name a random partition. In the end I just pressed the button in my proxmox webgui… create disk… literally two clicks and I was already further than 30 minutes in the command line trying 3-4 different partitioning tools, running into various warnings, errors or naming conventions I’m not used to dealing with; I’ve been using linux for 7 weeks now…

And in the end, I didn’t even partition the second disk, because, like I always thought, zfs takes care of it; I just had to find the UUID and feed it to zfs in the right way so it would do its thing and replace one drive with another, in a way that lets me shuffle the drives later because it knows them by Universally Unique IDentifier… which I also like because then even if the disk dies totally… the system can always identify it…

Neither of you would happen to know how I make the LEDs in my hotswap bays blink, so I can tell which disks to remove?

But yeah, it took me all of yesterday morning to get from “I want to replace my faulted drive in zfs using a permanent identifier” to actually doing it… I also rewatched zfs for newbies on youtube… but it wasn’t much help, since the command he was using didn’t seem to work for me… tried like 15 out of 20 suggestions on how to make it happen…

So really I cannot blame the partitioning tools; they were just not helping me drive the damn nail in at the speed I would have liked…

But in the end I basically wrote my own command-line solution instead of using one of the 20 I had read… I’m sure one was very close; for some reason the identifier it was using looked very different…
I think the other guy may have put the serial on a partition as its name and then used the partition name for zfs to recognize the disk.

Almost done resilvering the 2nd drive… now if I can identify the two correct drives to remove, then I’ll try to resilver two at a time… then I might actually be done before my storagenode runs out of space.
And then I only need to change my storagenode size once… even though that should also be a pretty basic thing… I’ve seen quite a few mentions of it going poorly on the forums… so I figured I would just change it once if possible…

I’ve got about 48 hours of space left… should be able to resilver the remaining 3 drives in maybe 30.
Might even be able to make it doing them one by one… but then I won’t have to make the hotswap indicator lights blink xD

But yeah, the entire battle ended up being just a one-command-line issue… which was kinda what I suspected to begin with… I just had to figure out how to get there…

Of course, I needed to know the sdX of the drive I had to replace, which I can get with zpool status, and then the UUID of the drive I want to use to replace it…

Which usually ends up with me having to figure out which sdX drive is which UUID.
But if I keep using linux, I’m sure I’ll find much more optimized solutions… it just gets advanced for somebody not used to running linux… even for the most basic of things…
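the mapping part at least turns out to be a one-liner, roughly like this (sdf is just an example device):

# show which /dev/disk/by-id names point at a given sdX device
$ ls -l /dev/disk/by-id/ | grep sdf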

Need to replace the wheel on your car? Please start by removing your engine… O.o

Learning command line can be tough but I found it to be very rewarding.
But for formatting drives I always use gparted. It worked fine for me, however not for zfs. For zfs I only used the command line.

I never had any problems with changing my storagenode size. In fact I was doing it all the time :joy: worked just fine.

My most recent and most annoying issue is that after I finally figured out my network interfaces configuration in proxmox required netmask lines to be added…
which made proxmox spring back to life, but then the VMs seemed to be offline… for no obvious reason…
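for context, the netmask bit lives in /etc/network/interfaces and ended up looking roughly like this… the addresses and the physical NIC name are placeholders for my actual values:

# the bridge proxmox uses, as a static config with an explicit netmask line
auto vmbr0
iface vmbr0 inet static
        address 192.168.1.10
        netmask 255.255.255.0
        gateway 192.168.1.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0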

Then the other day I noticed that netdata was showing a 2nd vmbr0 called vmbr0v1, and now, running lshw, another vmbr0v2 was showing up… but that’s what I get for not following the recommendations on how to do stuff… lol
Pain and suffering.

But I might just crash and burn this proxmox install in the near future if I don’t find a solution to the issue… of course, the last time I nearly did that it was due to the netmask configuration issue… lol
Then for some odd reason it has added 2 or 3 of these (there isn’t a real consensus between the tools…), while proxmox itself doesn’t show any new vmbr0v1 or vmbr0v2.

@kevink
Well, for zfs it seems you don’t really need to use anything most of the time… because it will simply grab the drive and just resilver it.
I would be surprised if one cannot just add new drives, without any partition at all, straight in as a vdev for your pool, and when one can use the disk UUID there isn’t really much point in using anything else, aside maybe from other identifiers that are more commonly used in linux…
Like, say, if I check either

$ ls -la /dev/disk/by-id
or
$ zpool status
then I’m using what I think is the UUID, but there seems to be another kind of identifier:

ata-HGST_HUS726060ALA640_AR11021HH2JDXB

The identifiers for the others I used went something like this…

wwn-0x5000cca2486d50f3

Maybe because the other drives are SAS they don’t show up with the other SATA devices;
I mean, they are listed in the one list of all the drives, with the wwn-0x500… numbers,

but the SATA drives seem to have a list with much more useful names; the ata-… name is one you can use to actually see which drive type it is…

Like, say, this example:
ata-Crucial_CT750MX300SSD1_161615325264
wwn-0x500a075113125282

Both of these are the same drive, so I should see if I can use the other one, because it’s more practical to read…

the SAS drives do show up under scsi as
scsi-35000cca2656a71f4

But that’s nowhere near as useful as the ata- names; but yeah, most likely just my ignorance of how it all works.
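one thing that helps match those up is lsblk’s extra columns, which print model and serial next to the sdX name (column names as below should work on a reasonably current util-linux, but treat that as an assumption for older releases):

# list whole disks with model, serial and WWN so they can be matched to by-id entries
$ lsblk -d -o NAME,MODEL,SERIAL,WWN,SIZE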

I didn’t want to resize my node, just in case… the software is new to me, so I have no idea what works and what doesn’t… so I try not to change anything I’m not forced to… I did just change it to 20tb, to give it some breathing room… I doubt it will hit the wall before I’m done expanding so it actually has the space… also my ingress dropped… so I figured it was time to update the node version…

But it’s still on v1.1.1 even though it says update to 1.3.3, and has said that for days now lol.
Or maybe I should start using the latest tag instead of beta, but last I checked that only gave people trouble.
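For anyone who needs it, changing the allocation is basically just re-running the container with a new STORAGE value; a sketch roughly following the storj docs (all the other flags, mounts and wallet/address settings are omitted here and have to stay exactly as in the original run command):

# stop and remove the container; identity and data stay on disk
$ docker stop -t 300 storagenode
$ docker rm storagenode
# pull whichever tag you run (beta here) and start again with the new size
$ docker pull storjlabs/storagenode:beta
$ docker run -d --name storagenode -e STORAGE="20TB" ... storjlabs/storagenode:beta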