ZFS discussions

I just plugged my new drive in, looked it up in /dev/disk/by-id/ using its ata- name. Then I told zfs to create a pool on it. After that I mounted it for storj. That was easy.
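
For reference, that boils down to something roughly like this (pool name, mountpoint and the ata- id are placeholders here; ashift=12 just forces 4K alignment, which is commonly recommended):

$ zpool create -o ashift=12 -m /mnt/storj storjpool /dev/disk/by-id/ata-<model>_<serial>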

As for storagenode updates: use the search and don’t mess around with it…


yeah might as well make zfs do the work.

for updating the storagenode i just follow the manual update guide from the documentation.storj.io
or i think i do… xD

or almost… i added a few extra steps.

https://documentation.storj.io/setup/cli/software-updates

i added a step after step 1 (shutting down the node)… just to make sure it actually shut down… docker stop will force-kill it after 300 sec, but it usually shuts down basically immediately…
0. export my logs
$ docker logs storagenode >& /zPool/storj_etc/storj_logs/2020-04-26_storagenode.log

  1. shut down node with
    $ docker stop -t 300 storagenode

  2. verify the node is shut down
    $ docker ps -a

  3. removing the node
    $ docker rm storagenode

  4. update the image
    $ docker pull storjlabs/storagenode:beta

  5. apply the changes i wanted to make to the config.yaml
    $ mv config.yaml config.yaml.old
    $ mv config.yaml.new config.yaml

and then ofc the docker run command
from the most recent version of it… i keep a few of them so i know exactly which one i ran before i switched to a new one… but sometimes there is something i want to change, and then it might take 14 days before i reboot the node… so it’s nice that everything i changed in the meantime is just ready to go when i run my update script.
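
for reference, the run command itself is the one from the documentation linked above, roughly like this, with wallet, address, paths and size as placeholders:

$ docker run -d --restart unless-stopped --stop-timeout 300 \
    -p 28967:28967 -p 127.0.0.1:14002:14002 \
    -e WALLET="0x..." -e EMAIL="you@example.com" \
    -e ADDRESS="your.ddns.hostname:28967" -e STORAGE="2TB" \
    --mount type=bind,source="<identity-dir>",destination=/app/identity \
    --mount type=bind,source="<storage-dir>",destination=/app/config \
    --name storagenode storjlabs/storagenode:beta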

lol i take it that they got the word processing block in the forum ported from somewhere else… what kind of savage doesn’t count from 0…!!!

what did you mean about the search tho… i totally don’t follow, is that a watch tower thing?

Yeah, for example. There are lots of posts about how to update and about “updates not working”. I won’t write anything about that in this thread.

WWN is a number unique to the drive; it is printed on SAS drives and it’s kind of like a MAC address for a network card.
IDE and SCSI drives do not have them. I think that older SATA drives do not have them either.

No. Basically what zfs does now is “scan” the metadata, so “scanned 1TB” means it has scanned the metadata for 1TB of files. “Issued” means actual reads from the drives.
https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSScrubScannedVsIssued

No, the traditional way was to start the first partition at sector 63; however, this does not work well with 4Kn drives (because the start of the partition falls in the middle of a physical sector), so the new standard is to start the first partition at sector 2048, which works for anything with blocks of 1MB or less.

OK I usually use fdisk and not parted, but as I remember it’s not that difficult.

(parted) mklabel gpt 
(parted) mkpart                                                           
Partition name?  []? test                                                 
File system type?  [ext2]? zfs                                            
Start? 0%                                                                 
End? 100%                                                                 
(parted) print                                                            
Model: Unknown (unknown)
Disk /dev/zd640: 107GB
Sector size (logical/physical): 512B/8192B
Partition Table: gpt
Disk Flags: 

Number  Start   End    Size   File system  Name  Flags
 1      1049kB  107GB  107GB  zfs          test

(parted) quit                                                 

With fdisk (I am more used to it) it’s a bit different:

Command (m for help): g
Created a new GPT disklabel (GUID: ...).

Command (m for help): n
Partition number (1-128, default 1): 
First sector (2048-209715166, default 2048): 
Last sector, +/-sectors or +/-size{K,M,G,T,P} (2048-209715166, default 209715166): 

Created a new partition 1 of type 'Linux filesystem' and of size 100 GiB.

Command (m for help): p
Disk /dev/zvol/storage/ftest: 100 GiB, 107374182400 bytes, 209715200 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 8192 bytes
I/O size (minimum/optimal): 8192 bytes / 8192 bytes
Disklabel type: gpt
Disk identifier: CAD33D49-8418-0345-8B5B-AF87DC5784CE

Device                   Start       End   Sectors  Size Type
/dev/zvol/storage/ftest1  2048 209715166 209713119  100G Linux filesystem

Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.

The filesystem type does not really matter that much, as long as it’s not swap or something. Windows is very picky about this, but Linux seems to work just fine.

Also, you can just give the whole drive to zfs.

zpool replace <pool name> <failed drive ID> <new drive ID>
You can take the failed drive ID from the zpool status output.

I usually use the drive IDs from /dev/disk/by-id. Either the ata-xxxx format if available or the wwn if not.
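
For example, assuming the failed drive shows up under its wwn- name in zpool status and the new drive is addressed by its ata- name (the IDs here are placeholders):

zpool replace <pool name> wwn-0x5000ccaXXXXXXXXX /dev/disk/by-id/ata-<model>_<serial>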


didn’t know i could write % to define the partition size…
not like that’s defined anywhere… it just says start and end, while showing sector and logical information in bytes and the total size in TB

can i write start and end as fractions too, how about [ insert obscure mathematical knowledge ]… and if you set it to start at 1 or 0%… i mean one can start at 0%, but one cannot start on sector 0… unless they count from zero, which is kinda confusing sometimes, because it should really be consistent.
but i suppose the common folk don’t need to be confused by that, but maybe they should… zero is a pretty important invention.

i like how it suggests ext2, really… isn’t that like antique, why would anyone want to use that by choice…
it’s like handing out clay tablets at the library along with a cuneiform stylus… nothing wrong with that…

well i was forced to try and use them… and mainly i just wanted to get onward so i could hand the setup tasks over to zfs… made me feel like i had to dress up like a pirate and play pin the tail on the donkey to start my tesla.

so in your little example here… which one is correct… because space allocation in fdisk looks vastly different from allocation in parted… i mean you start at 1049kB in the parted one and at sector 2048 in fdisk
i suppose fdisk is right because it gives you the default setting to use.

so parted actually allocates it outside the default… not like that could go wrong… also isn’t it wise to leave free space at the end of the drive to allow slight expansion of the partition in case it develops bad sectors… i mean sure one allocates some sectors for the life of the disk… but if it’s filled completely and then develops bad sectors… can it then actually move them to good sectors… i think not…

they all had a WWN, which i think is their UUID, but i’m not totally sure.

great, well this looks promising, goddamn read errors on the drives i literally just put in… hopefully it is nothing serious… lol i really should get some sort of limiter so that my disks cannot run full tilt on transfer speeds… hopefully i can find a way to make zfs do that… like say max 80MB/s written or read at any one time to a drive… or something like that…

kinda want to see if that makes it better… ofc these drives had never been tested before i just threw them into the server… and they were used… and mismatched, and now they have been running pretty much full tilt for days.
also i hotswapped them… might need to stop doing that… doesn’t really seem like it’s a good idea… either for the other drives or the drive i pull… ofc i don’t bother setting them offline either, i just pull the drive lol

deal with it lol, at least i get my redundancy tested…

do you know what the cksum errors mean, and why they show up on replacing-3 when neither of the drives under it has any… and why isn’t the one with the read errors the one with the checksum problems.

looks a bit weird to me, but i was running without redundancy for a brief 10 hours… i knew i shouldn’t have done that… but i decided to pull 2 drives instead of one when i was replacing drives in the system… seemed to go fine until i pulled the wrong one…

 pool: zPool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr 27 03:34:49 2020
        3.26T scanned at 188M/s, 2.83T issued at 164M/s, 10.1T total
        584G resilvered, 28.12% done, 0 days 12:52:17 to go
config:

        NAME                                           STATE     READ WRITE CKSUM
        zPool                                          ONLINE       0     0     0
          raidz1-0                                     ONLINE       0     0     0
            wwn-0x5000cca2556e97a8                     ONLINE       0     0     5
            wwn-0x5000cca2556d51f4                     ONLINE       7     0     0  (resilvering)
            sdf                                        ONLINE       0     0     5
            replacing-3                                ONLINE       0     0     5
              sdi                                      ONLINE       0     0     0
              ata-HGST_HUS726060ALA640_AR11021EH2JDXB  ONLINE       0     0     0  (resilvering)
            wwn-0x5000cca232cedb71                     ONLINE       0     0     0
        logs
          sdd4                                         ONLINE       0     0     0
        cache
          sdd5                                         ONLINE       0     0     0

i mean it should have been fine, the pool was online… and i pulled an already-replaced drive…
then i pulled a good drive, put it back, resilvered it, tried to pull another one, again without any luck… i was looking at the disk tray LEDs and it seemed like zfs was trying to tell me something… but i guess it wasn’t
anyways, so i just threw a new drive into the old bay and took the rest with me…
which meant the pool was degraded… but essentially it should still have had all the data…
then i set it to replace the first drive…

it got much slower now that it didn’t have all the drives… which was kinda expected, but i wanted to see for myself, and my storagenode wasn’t too happy about it either… but that wasn’t too bad yet…

then i started the 2nd resilvering because i wanted to see if it would go faster… if i ran two of them… the logic being that the speed seemed limited by the disk writes… and that might have been the case if i had had a full set of drives… maybe 10 drives or so instead of 5 minus 1, while resilvering onto two replacement drives
basically making it 4 vs 2… and ofc writes will be slower than reads… but then on top come all the parity calculations, since i removed my redundant drive…

so yeah, maybe i caused this, but i would rather crash it sooner than later… right now it’s just a tiny 7 week old storagenode that’s on it… so…

damn, my storagenode tho… i’ve had a 50-60% drop in ingress since i started the double resilver… and it had already dropped like 20%… i should try to enable asynchronous writes… did that once already during testing, but got some weird results…

I also didn’t, google told me 🙂

I don’t know, you can always read the documentation.

ext4 is just ext2 with additional features. The “type” is probably the same for both.

parted is in bytes, fdisk is in sectors (512B each).
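
They end up at the same offset, by the way: 2048 sectors × 512B = 1,048,576B = 1MiB, which parted simply displays as 1049kB in decimal units. Both are multiples of 4096B, so the first partition is aligned for 4Kn drives either way.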

No, it is not done like that. The drive has spare sectors that cannot be accessed at all and uses those to remap the bad sectors. zfs by default leaves a bit of empty space at the end of the drive so that if you have to replace the drive and the new one is 1MB smaller it would still work.

Leaving empty space at the end of an SSD can be done to improve write performance without TRIM; enterprise SSDs do it out of the box (a 460GB SSD is really a 512GB SSD with a lot of overprovisioned space), but the same can be done manually. Just run blkdiscard on the SSD before partitioning it.
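
A rough sketch of doing that by hand (this wipes the SSD; the device name and the 85% figure are just examples):

# discard everything so the controller knows all blocks are free, then only partition part of the drive
blkdiscard /dev/sdX
parted -s /dev/sdX mklabel gpt
parted -s /dev/sdX mkpart primary 0% 85%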

WWN is called a “World-wide name” and designates a specific device. You can also have a partition UUID.

SATA and SAS support hot swapping; I do not remember having any problems with it.

The drive did not report a read error, but the data was corrupt.
However, it seems that zfs was able to recover the data. If not, you get a list of corrupted files at the end of the zpool status output. Otherwise you get “errors: No known data errors”, which means that the data is most likely OK.
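
If in doubt, zpool status -v prints the affected files when recovery failed, otherwise just the “No known data errors” line at the bottom (pool name here is the one from this thread):

zpool status -v zPool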

yeah my 750gb ssd is provisioned down to something like 512gb, just to give it some breathing room.
didn’t bother checking if there was anything provisioned by default…
leaving space at the end to make a replacement fit is a pretty neat idea… i’ll have to keep that in mind for future partitioning.

well hotswapping should be okay, but my server case isn’t really high end… freaking chenbro, which means the damn hdd bays can be a bit of a pain to put in… which could cause undesirable vibrations and bumps when i try to hammer them in… tho after i got proper screws for the trays it did help a lot xD
but as much as i like my server, i really kinda hate my server chassis.
it’s okay but it does have that cheap feel to it.

i doubt that was the issue… but i cannot put it out of my mind either… wouldn’t be too bad either to simply get used to turning the system off, or at least offlining the drive in zfs, before i pull it xD

and one of the drives did report a read error, ofc that was one of those that i added… didn’t catch that…
it’s also kinda weird, because i scrubbed like 24 hours ago… and never had a checksum issue before… and now, after i hotpulled a few drives, i suddenly get checksum errors.

and the sdf drive hasn’t been out, or maybe it has… i should go check which one i pulled and put back, because i wrote it down…

would be nice to know if that’s a hotswap thing… also 7 read errors from a newly attached drive is kind of a worry…

run smartctl on that drive and see if there are any pending or reallocated sectors.
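
something like this, using the wwn of the drive that showed read errors in the status output above (attribute names can differ between drives, and SAS output looks different):

smartctl -A /dev/disk/by-id/wwn-0x5000cca2556d51f4 | grep -i -E 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'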

As you wish, I usually hotplug stuff, so I do not have to restart the server etc.

same, but just saying maybe a combination of my bays being kinda cheap, the server not being in a proper rack mount, or me not offlining the drives when pulling them… might cause some sort of issue… because it’s far from the first time i notice stuff like this when i shuffle drives…

but then on the other hand, it could just be that i actually pay more attention to it, and when i am installing or removing drives i do move the full dataset and thus find the errors that have accumulated over time…

it’s like how running perfectly similar drives in arrays or vdevs doesn’t hurt your performance or wear on the drives… while running mixed drives of the same size can cause performance loss or excessive drive wear.

duno how bad this is on zfs either, but on regular raid it’s a fairly big issue if one wants it to run great.
but i would assume it’s more or less the same on zfs, since it’s really a mechanical issue of the drives not operating in sync, which makes the slowest drive drag down the performance of the rest, and also makes it the drive that might be running at full tilt more of the time.

don’t get me wrong, i think hotswap is great, but i also hate seeing problems with my disks, and the recommendation is to at the very least offline the drive you are hotswapping… according to at least the Oracle ZFS manual and most likely also the FreeBSD one.
the linux zfs man page seems a bit… sketchy IMHO
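
for reference, offlining before a pull is basically just this (pool/drive names are the ones from my status output above, purely as an example):

zpool offline zPool wwn-0x5000cca2556e97a8
# pull/swap the drive, then either bring it back or replace it
zpool online zPool wwn-0x5000cca2556e97a8
# or: zpool replace zPool wwn-0x5000cca2556e97a8 ata-<new drive id>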

performance loss I can understand, if one drive is slower, it will slow the array down. But increased drive wear? I don’t see how that would happen. By the way, my node is running on a 6 drive raidz2, 4 drives being 7200RPM and two are 5400RPM and SMR on top of that. It runs OK, except the slower drives do slow the array down.

Maybe because the slower drive would be running at higher load, that would cause more wear on it, but if the array was made only from the slower drives, they would wear the same.

Hardware RAID probably is more sensitive to this, I know some controllers cannot have SAS and SATA drives in the same array for example. But I also have used arrays made from different size drives (after replacing a failed small drive with a bigger one) and have not noticed any problems.

i saw some stuff from this one guy who worked on hyperscale systems and had done testing ranging in the millions of runtime hours on various setups.

he explained it something like this… tho it was a couple of years ago since i saw it, so i might be slightly wrong in explaining it… i’ll see if i can find his youtube channel.
and it was only in relation to regular raid…

if we imagine 5 disks in a raid, all the same… their heads move across the platter writing or reading, and because they are all the same, this is a very consistent motion, basically sequential writes.

but if we imagine it being different drives, then some of the drives will have to move their heads out of sync with the rest… because of different seek times and whatnot, and because all the data is striped across all the drives, you need that drive to finish to have the data.
this makes the drive work harder, because it might be constantly moving the head around to keep up.

ofc stuff like this is mitigated by cache, buffers, smart raid controllers…
but the essential problem exists no matter how you think of it…
you basically cannot have 4 drives of one type and then just add a different one without running into the fundamental issue of the mechanics of them not being exactly the same…

they will have different times for moving the head, different seek times, maybe even different sector sizes, different controllers on the disk itself, sure it might not cause drive wear and if it does it might be fully mitigated by various caches, buffers, controllers and such… but sometimes it won’t go that well
and at best you are not getting the performance of the rest of the drives…

at least with raid, which is fairly sensitive to this stuff… i had a mirror array that couldn’t run because one of the drives was “broken” and ran at 100kb/s on random reads/writes… so it just forced the mirror drives down to its level and the entire system just stalled until i got rid of the bad drive…

ofc in that case it would not damage the good drive, but a case like that might make the bad drive fail faster…

tho i must admit my zfs is currently running on a big mix, but when i’m done replacing the last drive today, hopefully, i will be running very similar 6tb drives… all 6tb HGST HUS7260, just split into two types… 2x 6tb SAS and 3x 6tb SATA He6 drives
but for now on zfs i wouldn’t disregard testing different drives in one vdev… tho to my understanding it should be near impossible for it to run anywhere near what perfectly similar drives will…

sadly i didn’t run performance benchmarks on my old zfs pool setup… but then again one mostly runs into the ARC or L2ARC
i did get 10-15mb/s random reads/writes when using crystal disk 4KiB Q1T1
i think it was something like 1800-2k mb/s in the first test, maybe 600 or 150 in the 2nd, like 60-100mb/s in the 3rd and 10-15 in the last one, over multiple tests… pretty sure of the numbers… just cannot remember them completely… might have saved them tho.

that was running 5x 7200rpm SATA in raidz1 with l2arc, tho the l2arc didn’t change the crystal disk results… so duno if it disables it for testing or whatever… also it was run from inside a VM on the system… but i believe it was written directly to the pool…

anyways, it could be interesting to try and compare what kind of results people get from their different zfs setups… to get a sense of what works and what doesn’t…
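
something like fio could approximate the crystal disk 4KiB Q1T1 run for comparing pools (directory, size and runtime are arbitrary here, and the ARC will still skew results a lot):

fio --name=4k-q1t1 --directory=/zPool/benchmark --size=4G \
    --rw=randrw --rwmixread=70 --bs=4k --iodepth=1 --numjobs=1 \
    --ioengine=psync --runtime=60 --time_based --group_reporting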

So been continuing my deep dive into ZFS and found a few things you might not be using but really should…

this will do trim, which is off by default in ZoL. trim basically means it will remove deleted data from the storage blocks / pages in SSDs, which gives you greatly improved performance on stuff like your L2ARC SSD, so long as it supports trim… from what i understood in the video i saw about it, this comes at a very limited performance cost vs a great reduction in write latency on SSDs
ofc if you run this it will also make it much more difficult to recover accidentally deleted data.
but for an L2ARC that is kinda pointless anyway, and it will auto detect the SSDs and generally only run trim on them, but if you have critical data, then make sure it only runs on the L2ARC SSD… duno how tho

zpool set autotrim=on <pool name>

when i turned it on, my SSD latency dropped like 30-50% during heavy load.
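
for completeness, the command needs the pool name, and you can check it and kick off a manual trim too (zPool being my pool):

zpool set autotrim=on zPool
zpool get autotrim zPool
zpool trim zPool        # one-off manual trim
zpool status -t zPool   # per-device trim state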

anyone know if i can mv a folder from my main dataset into a child dataset faster than if i copy it…
or is that a terrible idea…

started trying to move my storagenode into its own dataset with lz4, and then zle on the blobs folder, which is its own child dataset of a child dataset…
but it just takes forever to copy 5tb between datasets… so i started pondering if i would be better off just shutting down the storagenode and doing a mv command, and whether that would be near instant… or take the same time rsync seems to take…
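
roughly what that layout looks like as commands, names being illustrative:

zfs create -o compression=lz4 zPool/storagenode
zfs create -o compression=zle zPool/storagenode/blobs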

I don’t even have that option… Maybe my zfs is too old or it is different on Ubuntu.

If you read my first post, you’ll realize that you can’t use different datasets for the folders of the storagenode, because you can’t move files between them. So any upload that goes to tmp first will fail, because it can’t be moved to blobs.
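
A quick way to see why, assuming a layout like the one above (paths illustrative): each dataset is a separate filesystem, so the storagenode’s rename from tmp to blobs crosses filesystems and fails, and mv only “works” between them by copying and deleting.

df /zPool/storagenode/tmp /zPool/storagenode/blobs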

Yes, this feature was introduced in zfs version 0.8

Great, I have 0.7x… Really need to upgrade one day.

really… that sucks, because i already started copying into my blobs dataset… i guess that’s a redo…

do any of you know if i can create two pools on one vdev?

i kinda want to remake my pool… before i run out of space for it again… finally finished replacing my 3tb drives with 6tb ones, so now i got 15tb free.
but it still takes forever to copy stuff between datasets, because it’s still the old pool running small volblock sizes

i finally found the guy i was taking advice from when i was starting out with researching and planning my server setup for storj + i kinda wanted my own server, and was getting fed up with bitrot.

This guy i think really knows what he is talking about.

He kinda reminds me of wendel from Level1Tech, anyways you asked why i wanted matching drives…

i can highly recommend his great videos, tho if memory serves there isn’t much in the way of illustration… but it’s been a few years since i watched them… maybe it’s time to go through at least the raid ones again, i remember he shared many very interesting things in his videos which i never thought about, on a wide range of subjects regarding IT.

and he can actually explain it well xD some of it i still barely understand lol

enjoy

So, what I have been saying - if you use different drives, you get the performance of the slowest one.
But not that using one faster drive in an array will make all drives wear out faster or something like that. The SMR drives in my array do bring down the performance, but that is to be expected.

No, not without doing something stupid, like creating a zvol and another pool on top of that zvol.
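
Just to illustrate the “stupid” version (don’t actually do this; names are made up):

zfs create -V 500G zPool/nestedvol
zpool create nestedpool /dev/zvol/zPool/nestedvol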

no, that wasn’t exactly what he said, he said that sometimes you can end up with some drive types writing across two cylinders, and thus they end up moving the heads around twice as many times…
tho this doesn’t matter short term… but it will affect the lifetime reliability of the drive.
ofc this will not happen in all cases, you may have cases where it doesn’t matter and you just lose performance… hell, you might even have cases where having mixed disks could extend the lifetime of the array… you just don’t know, and you will at the very least take a big hit on performance.

ofc raid does get kinda ridiculously fast… so it can be fine if you don’t mind unpredictable wear, or don’t plan on running the array for the lifetime of the disks.

Depends on your point of reference. 5 fast drives and 1 slow drive will run slower than 6 fast drives. The slow drive may wear out faster as well. However, if you compared it to an array with 6 slow drives, then the performance and life would be the same as with the 5 fast / 1 slow array. Replacing one slow drive with a fast one would not make the array fail faster.

As for reliability - there are a lot of unknowns. For example, let’s say I got 6 brand new identical drives with sequential serial numbers. They would work great, but if there is some manufacturing defect, it would likely be present in all of them. Since access patterns of drives in an array are almost identical, chances are that more than one drive may fail at the same time or shortly after each other.
If I used drives from different manufacturers, or at least different batches from the same manufacturer, maybe some brand new drives and some older, this would reduce the chance of multiple drives failing at the same time.