ZFS discussions

errr that’s a good question… basically it’s trays that the disks are mounted to and then loaded into a SAS backplane, which is split into 3 sections of 4 slots, attached to 2 LSI 2308-chipped HBAs via 3 SAS cables

pretty sure it’s zfs that is trying to keep me from making a mess… basically it’s reserving a bay for the disk that was removed… so you cannot put in a drive and make a new pool or vdev without repairing the old one… or at least leaving room for its full set of drives.

makes good sense, in most common use cases and i’m sure there should be ways to work around it…

it’s just zfs trying to make me be nice to it, so i don’t end up with a ruined pool… which is actually quite smart… make it annoying to work with when you are working outside the usual recommended ways of doing stuff.

and on top of that it makes sure there is room for me to restore it… or attach my failed drive if i were to try and reconstruct data on the pool.

that kind of stuff really tells you that a system is well made… i’m also quite surprised i haven’t managed to permanently damage it yet… i’ve run it for 2 months now… and it never died lol

Wait a minute… You’re running ZFS through a RAID controller?

I was under the impression that ZFS doesn’t like that.


it can do raid, but it just runs straight through… sure you can run on raid cards… at least lsi… and you don’t even need to reflash them to IT mode.

tho for the bigger 92xx+ series you need to set up each drive you attach in the megaraid as a raid 0 solo drive… which is kinda weird… but then you get essentially direct access to a drive…
even move it between different machines without formatting it and still read the data outside the raid controller in stuff like external usb trays… tho ofc there are some disk IO technology generational things that can make such stuff very annoying… but if they are from about the same tech generation it works basically every time.
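for illustration, the per-drive raid 0 trick looks roughly like this with storcli… the controller number, enclosure and slot are just placeholders for the example:

# list controllers and the physical drives they see
storcli show
storcli /c0 show
# create a single-drive RAID 0 virtual drive for the disk in enclosure 252, slot 3
storcli /c0 add vd type=raid0 drives=252:3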

and checking smart data through a raid card just takes even more convoluted commands… lol but yeah didn’t really get anything out of switching away from my 9260-16i card
aside from that now i’m running low-profile pcie cards, which gives me access to all 5 or 6 pcie slots on my mobo instead of the two full-height slots that were split
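the convoluted smart command in question is roughly this, assuming smartmontools and a megaraid-based card… the megaraid device number is whatever the controller assigned, so you may have to try a few:

# SMART data for the disk at MegaRAID device id 4 sitting behind /dev/sda
smartctl -a -d megaraid,4 /dev/sda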

and now i’ve got 12i and 4e ports, so going to hook up a disk shelf in the near future… when i get my server room finished so my hardware will stop corroding lol…
sucks to have 12 bays and only 8 that want to work…

lsi raid may not be as reliable as zfs but it sure is practical… i barely ever had to do anything to it… i just plugged in drives and it would just run and run… never had a problem with it… before i learned about zfs, then figured… yeah, that’s more problems i need to deal with… lol

think i’ll try to offline the old pool tomorrow and then see if that will allow me to utilize the bay of the failed drive…
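my rough guess at what that comes down to in practice, with the old pool name just a placeholder:

# cleanly detach the old pool from the system (it can be re-imported later)
zpool export oldpool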

there isn’t a way to make a vdev of two drives replace the 1 lost drive?

no wonder the large setups like to run mirror setups… so much easier to work with.

So i got a step further, i finally got around to moving the disk onto an onboard sata port.
and after failing a few times, trying to clear the disks completely a few times with the command you gave, and fdisk when those didn’t want to work… eventually i’ve found that 3 out of the 4 drives will listen when using their by-id name, but the last one will only listen on the /dev/sdd name instead of /dev/disk/by-id/ata…
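for anyone chasing the same thing, the by-id names can be cross-checked against the plain /dev/sdX names like this:

# list the persistent names and see which /dev/sdX each one points to
ls -l /dev/disk/by-id/
# or just show the names that exist for that one stubborn drive
ls -l /dev/disk/by-id/ | grep sdd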

hmmmm think i’ll go put the disk i took out of the zpool back… then maybe zfs will stop getting in my way… it has to be zfs f’ing with me

loud explosion
[followed by the deep rumbling sound of armored vehicles rolling in…]

[Me : closes the door to the bunker and nukes the surface]

did you know one can get crypto malware attacks on your machine from running wget commands without really looking at what they are…
and ofc i didn’t run zfs on my boot drives, so didn’t have a snapshot to kill it with…
i do now tho, had been thinking about getting that reinstall done for a while… much better zfs setup this time…

Slog and L2ARC on the same big 500GB SSD: a bit for the slog (accidentally made it 30GB, but was too lazy to change it), and the rest left free for reallocating worn blocks and to keep the disk from getting overfilled, which kills SSDs
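for illustration, hanging both on partitions of one SSD looks roughly like this… pool name and partition names are placeholders, and ideally the SLOG would be its own (mirrored) device:

# first partition (~30GB here) as the SLOG, second as L2ARC, rest left unpartitioned
zpool add tank log /dev/disk/by-id/ata-SomeSSD-part1
zpool add tank cache /dev/disk/by-id/ata-SomeSSD-part2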

OS on a separate SSD with its own zfs partition so it can boot even if i crash the storage pool.
(now i will need to boot my PC from a PXE server in the near future, which should be interesting to try…)
then i should also maybe be able to tell if there are issues with the storage server as i use it, which i figured was a smart little feature.

and essentially isn’t that what people want in the future… there is little point in storing data half assed on tons of different devices… if not for backup…

instead it can be more enduring and lower latency in a pool. and zfs takes care of all the mess.
also then i guess i essentially can run zfs on windows, if i’m on network storage xD

went great with the zfs pool tho… it was a breeze to import onto the new system… just had to connect all the drives… tsk tsk… so demanding, zfs is a bit of a diva…
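for reference this was basically just… pool name is a placeholder:

# scan for importable pools, then import by name using the stable by-id names
zpool import
zpool import -d /dev/disk/by-id tank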

oh yeah so i was trying to move the drives out, and had tons of minor issues, hba cards not connecting right because of my ghetto setup… i really need to get my server case’s metal bracket for low-profile pcie… kinda went from full profile to low profile and just took the metal bracket off…
don’t do that, it’s annoying; it can work, but it’s highly unstable if moved and can often sort of work… which makes it even more painful to find errors.

moving a drive out of the server and putting back the pool drives didn’t seem to help, now i’m not sure if one of the drives is actually dying or if it’s because it’s on yet another controller from a different generation, i did manage to figure out how to fdisk it… but then i used the /dev/sdX name
but just on that one drive… it doesn’t work if i use other identifiers on it… even tho it is the same drive…

got my latency reduced by a factor of 10 on my writes… xD
down to a 4ms peak and an avg of about 0.5ms, tho with high activity it seems it can go higher, or maybe zfs was doing something, but there was a brief period of 30-50ms
but it’s getting close to the levels where getting an nvme drive wouldn’t even do anything measurable.
with a 3 year or so old consumer grade SSD using MLC tech

anyways, so i wanted to figure out which disk was acting up… which for me, at least in linux, is very difficult… while in windows it’s just basically open megaraid, right-click it and ask it to blink…

so i wanted that feature… commandline would be fine… just have the damn backplane / lsi controller blink my bay led so i can find the drive… and then i ran into the damn landmine of a wget attack.
didn’t know what it was before i tried to run the second command… and the system just locked up…
and i was, like, mentally waking up to what i actually was doing… thinking this shouldn’t take that long… so i cancelled the command, looked back at the site and realized what it could be…
so i jumped on my netdata… and it was reading 500mb/s on the drive and using 60% cpu
just enough that the system wouldn’t die but enough that it would finish in decent time…
so i asked it to shut down.
a short search confirmed the very likely option of it being an attack, and so reinstall yay

goddamn linux tho in this case i can only fault myself, maybe month 3 on linux will be the month where i figure out how to make my bays in my backplane blink.

supposedly i should be able to use something called sas2ircu to send a locate command which the controller will act upon… apparently that’s a stupid question, stupid enough that it led me pretty quickly into a bit of a bear trap… good thing i wasn’t a bear lol
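for the record, the sas2ircu locate command i was after looks roughly like this… controller index and enclosure:bay numbers are placeholders:

# list controllers and their index numbers
sas2ircu list
# show the enclosure/bay layout for controller 0
sas2ircu 0 display
# blink the locate LED on enclosure 2, bay 5, then turn it off again
sas2ircu 0 locate 2:5 ON
sas2ircu 0 locate 2:5 OFF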
[sometime later]
finally

  pool: zroot
 state: ONLINE
  scan: none requested
config:

        NAME                                STATE     READ WRITE CKSUM
        zroot                               ONLINE       0     0     0
          ata-TOSHIBA_DT01ACA300_531RH5DGS  ONLINE       0     0     0
          ata-TOSHIBA_DT01ACA300_Z252JW8AS  ONLINE       0     0     0
          ata-TOSHIBA_DT01ACA300_99QJHASCS  ONLINE       0     0     0
          ata-TOSHIBA_DT01ACA300_99PGNAYCS  ONLINE       0     0     0

errors: No known data errors

turned out to be my freaking corroded backplane f’ing up the connection to one drive…
maybe now i can actually manage to get back to having some redundancy… was seriously considering creating a virtual drive of 2x 3tb with my “useless raid function on my hba’s” and mounting that as my 5th drive to get my redundancy back… tho for now the bad 6tb drive seems to be running flawlessly also… so might just leave it at that for now.

About the slog, regarding the messages in the changelog discussion.

I checked zpool iostat with the slog and without it, and I can’t confirm its effectiveness.


Have you attached it to the right pool?

As soon as I attached it to my storagenode pool, I saw a difference in the writes to my drive. Instead of a constant write of 5MB/s I now get a very nice and cached pattern, while my SLOG gets the constant write:

SLOG:

Storagenode HDD:

2.5 MiB/s, very low throughput.
Now, however, this is all not important, because finally you can move the entire db to ssd

Regarding other tunings, the visible improvements I get are with a 4MB recordsize with compression, and a special device for metadata and small pieces.
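For illustration, a special vdev for metadata and small blocks would be added roughly like this (pool, device names and the small-block cutoff are only examples):

# add a mirrored special vdev (it must be redundant, losing it loses the pool)
zpool add tank special mirror /dev/disk/by-id/nvme-ssd1 /dev/disk/by-id/nvme-ssd2
# also send blocks up to 64K to the special vdev instead of the spinning disks
zfs set special_small_blocks=64K tank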


There are two types of writes - sync and async.

Async writes work essentially like this:
app-> storage: “Hey, when you are able, write this data to disk”
storage → app “OK”
… some time later…
the data is written to disk.

The important part here is that the application is told “OK” even though the data is not on the disk yet. If the server crashes before the data is written to disk, the data is lost, but the application still thinks that the data is there.
This works for most files, but would corrupt a database.

Sync writes work like this:
app → storage: “Hey, write this data to disk and tell me when it is definitely on the disk”
data gets written to disk
storage → app: “Done”

This makes sure that the application knows whether the data was written to disk - if the server crashes after step 1, then the application was never told that the data was written, so this is consistent.

Of course, sync writes are slower than async ones, because the OS can aggregate a bunch of async writes into one big write to disk, reducing the number of IO operations.

Also, most file systems have a journal (so that the filesystem is not corrupted when power fails), zfs calls that “ZIL”.

Here’s how async write happens with zfs:

  1. Data is placed in memory
  2. The application is told “OK”
  3. A bunch of writes are aggregated and written to disk.
    zil is not used here.

However, zfs uses zil for sync writes:

  1. Data is placed in memory
  2. Data is written to zil
  3. The application is told “OK”.
  4. A bunch of writes are aggregated and written from memory to disk.
  5. The data is deleted from zil.

Zil may look useless here (it is written to, but not read), but its purpose is to cope with power failures, reboots etc. When the server boots, the zil is read and the data is written to its proper place.

However, this means that for sync write, the disk is written twice (once for zil and second time for “real”). This results in lower performance, especially because zil writes are small and random.
Moving the zil to a separate device (SLOG) improves performance. If the SLOG is on a faster device than the main disks, it improves performance a lot.

However, if you use SLOG, you should consider using two devices in mirror configuration. If the SLOG device fails under normal circumstances, nothing really bad would happen, however, if the SLOG device fails during boot, you could lose data.
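For illustration, a mirrored SLOG is added like this (pool and device names are placeholders):

# add two SSDs (or SSD partitions) as a mirrored log device
zpool add tank log mirror /dev/disk/by-id/ssd-A /dev/disk/by-id/ssd-B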

I hope I explained it clearly enough, however, if something is not clear to you - ask and I will answer.


I have used zfs for more than 10 years. I have known all of this for a very long time.
Regarding storj, the slog is not so effective because the biggest bottlenecks are concentrated in other places, not in database writes.

What throughput did you expect with <40Mbps ingress?
All the small random writes however stress the drive a lot, especially SMRs. The slog does help to smooth the write pattern.

4mb recordsize is only achievable through hacks (?), the maximum i could set was 1MB. And it is an illusion that these big recordsizes and compression help, because most files are ~2.23MB, so with a 4mb recordsize and compression you only see the compression removing the overhead between the filesize and the recordsize. We discussed that intensely (probably somewhere around the first 40-100 messages) because I thought I was getting a compressratio of 1.33 according to zfs.


Yeah, I tested that. The “compression” was meaningless.

before you can set recordsize >1MB you need to set the appropriate module parameter. This is not a hack :slight_smile:
echo 4194304 > /sys/module/zfs/parameters/zfs_max_recordsize
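After that the dataset can be set to the bigger recordsize, for example (dataset name is only an example):

# allow 4M records on the dataset (requires the zfs_max_recordsize change above)
zfs set recordsize=4M tank/storagenode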

Why an illusion? Effectiveness is simply checked with a test over tar.
A 4MB recordsize results in a 2.23MB file being read with one zfs request. This is zfs mechanics - first read recordsize bytes, then each following read doubles until it reaches the max prefetch size.
With a 1MB recordsize you get 2 read requests, first for 1MB, next for the remaining 1.23MB.

indeed. Without compression the data occupied more space than needed.

Yes it saves a few operations, that’s why I increased it to 1MB and might even increase it to 4MB. But the benefit is still small. Also, what happens if you want to write more data? With a 4MB recordsize there is 1.7MB of free space left. So if zfs wants to fill that, it’ll have to pack half a file into that record, so it will have to read 1 record and then write 2 records again. But that of course happens with every recordsize, the record will just be smaller.

Also it was a bit of a problem with all the databases in the same dataset, because they don’t like 4MB recordsizes without a cache…
Now our cries have finally been heard and we can move the databases to a different dataset with recordsize 16K (?) on an ssd, and a 4MB recordsize makes sense again. But the IO from file writes probably isn’t what makes the drive load too heavy.
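As a sketch, such a database dataset could look like this (pool, dataset name and mountpoint are made up for the example):

# small-recordsize dataset on an SSD pool for the storagenode databases
zfs create -o recordsize=16K -o mountpoint=/mnt/storj-db ssdpool/storj-db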

my tests



Did you know that zfs has a variable recordsize? A 2.23MB file occupies 2.23MB (with correction for the logical sector size). There was some trouble with the last record, which is exactly why we need to turn on compression. Not for the compression itself but to pack the records properly.


regarding the SLOG and sync=always
i found out i had a bad drive, which may have screwed with my numbers, also it’s not the bandwidth performance i was looking at with the SLOG but the overall latency i got when using zpool iostat -l and zpool iostat -w
mainly looking at writes tho… as the reads are usually much lower; decreases in write latency seemed to increase my ingress.
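for anyone following along, those latency views are… pool name is a placeholder:

# per-vdev average latencies, refreshed every 5 seconds
zpool iostat -lv tank 5
# latency histograms instead of averages
zpool iostat -w tank 5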

i’m running 64k recordsizes now… the 256k experiment slowed down my system a bit and meant that zfs couldn’t use my ram up above 80%; from what i understand recordsize can greatly affect how much ram zfs clears to make room for stuff…

also, even tho zfs has variable recordsizes, it’s not always applicable, because the variable recordsize is a result of zfs’s default compression; however one can’t always work with compressed data, only when moving it around and sending it elsewhere…

Your testing looks extensive, but i have difficulty seeing how i would apply the numbers…
tho i’m running a 4+1 raidz, which in your numbers seems to give 2.5x better performance on randomized or semi-randomized I/O, which is sort of what i would expect and what i expect storj to act like.

i don’t however think that the storj workload compares to copying files around; i’ve only done testing on the storagenode, because i knew pretty well what i wanted to run: for performance reasons i would go with raidz1 and then not too many drives, so i could resilver and scrub in reasonable times.

tested 16k recordsizes for a short time, but my newly added vdev of 3+1 raidz1 was using a bit more space than i had hoped, tho i sort of expected that much… so i’ve now set it to 64k, which i expect to keep for months at least, to get a proper idea about how it affects my overall performance…
my ram usage is up tho, which is great, hate seeing it free… and the system seems to run smoother, but having had a failing drive for god knows how long (most likely since i bumped into the server rather forcefully), it’s difficult to say what has had what effect…

i’ve done a lot of tests on my live node, haven’t really kept any real notes on it or anything, i know i should but i always kinda feel that stuff slows me down…

the fact is i can, with many repeated tests, see an increase in both ingress and egress as my zpool latency goes down, and using an SSD SLOG seems to do that…
if i turn sync=always on, it goes up; if i turn it off, it goes down… i’ve tried over weeks, days and hours… tho take into account your SLOG cannot have high latency, that defeats the point of using it…

it will not make your throughput better, because everything incoming will be passing through the SLOG.
it will however improve your write and your read latency, i believe because with the SLOG the data gets flushed to the pool in one large sequential write rather than as lots of small sync writes.
so a 2 for 1 deal; sequential is easier for the HDDs, which frees them up for more random read IO, again decreasing latency…
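for reference, the toggle in question… dataset name is just an example:

# force every write on the dataset to be sync (so it goes through the ZIL/SLOG)
zfs set sync=always tank/storagenode
# back to the default, where the application decides per write
zfs set sync=standard tank/storagenode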

ofc it can be difficult to test on a live system, because the usage changes from time to time… i haven’t found a good way to simulate storj storagenode workloads on the system… tho i recently became aware of a piece of software called disks which is in ubuntu… it can be set to custom data sizes and whether they should be random across the partition / drive or whatever…
i expect to test that out, but for now i’m running a raidz0 because i blew out a drive… yay
so i don’t want to stress my system at this moment for testing…
the avg file size of a storagenode piece is 2.2 or 2.17… MB

i must applaud you for being so thorough in your testing, i only wish they were truly applicable numbers in relation to the storagenode load…

but i’m only 2 months into this storagenode thing… i’m sure i’ll find a way soon to simulate the storagenode usage… else i’ll have to record it and basically copy the performance of the live system so i can, like you, do some proper well-informed performance testing.

You should not use egress and ingress to test zfs parameters because you don’t control storj traffic. Use tar for that.
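For example, a rough read-throughput test over the node data with tar could be (path is just a placeholder):

# read the whole blobs directory through tar and discard it, timing the read path
# (piping through wc avoids GNU tar's shortcut that skips reading data when the archive is /dev/null)
time tar cf - /tank/storagenode/storage/blobs | wc -c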

That is not true. Recordsize is always variable.

you make all writes synchronous; that can only slow down your pool.

maybe with the recordsize, zfs is really complicated.

sync=always improves my latency and it also seems to improve my storagenode success rates; to account for changes over time i did exactly that… changed nothing for a long while to get a sense of how data was flowing; this being test data also seems to make it fairly predictable.

then i turned it off, kept an eye on what happened… if it worked well i would turn it back on again… see what happened, then turn it on and off a host of times over different periods to verify that it was working…

tar… that’s like something to do with packages or compression; i was a windows user until very recently… so to me tar is just a compression format i open in winrar xD

when i get a chance to do simulated workloads then i will test… but until then i’ve got a perfectly good test platform in my live node… giving me the best real-world results i can get…

what kind of success rates do you get?
my node can get up to 85%, but my windows vm perversion on a debian hypervisor seems to affect node performance a lot, even tho my vm is basically idle…
initially my theory was that it wrote to the disk, but it doesn’t really seem like it, and if it did then my L2ARC should catch up on it and keep its dirty windows hands off my pool…
alas it didn’t help… still goes to 75% when it’s running… thought it was cpu I/O related so i moved my entire storagenode onto a ZLE compression dataset instead… alas no luck

what can i do with Tar that simulates a live node?