ZFS discussions

Running raidz1 on 5x 3TB semi-nice Toshiba drives, still SATA but quite nice… just checked and they are 4K physical / 512e. I guess I should really switch the logical sector size over when I get my last HBA, which will make all 12 bays in the server support 4Kn; that's like 7% capacity saved, and then the same on the new pool… so yeah, I just checked: the drives are 4K/512e and my ashift is 12, which the OS picked at install… I might reinstall when I get the new HBAs put in… running on an LSI 9260-16i RAID controller which I basically just run in IT mode.
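For anyone wanting to double-check this kind of thing on their own box, roughly these commands should show the drive sector sizes and what ashift the pool actually got (the pool name "tank" is just a placeholder):

  lsblk -o NAME,PHY-SEC,LOG-SEC     # physical vs logical sector size per disk
  zpool get ashift tank             # pool-wide ashift property on recent OpenZFS (0 = auto-detected)
  zdb -C tank | grep ashift         # ashift actually recorded per vdev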

But I digress… my setup:
Dual 5630e (I forget if the e is in the front or the back…), 48GB RAM currently at 800MHz; it should be able to run at 1066MHz, but I got the placement wrong on the install…
4x 1Gbit NICs, 1 dedicated for Storj on a 400Mbit/400Mbit full-duplex fiber connection.
12-bay 2U rack mount.
5x 3TB Toshiba, which I selected for their cache size and because the Backblaze annual failure report suggested they would be below 0.5%, and so far no complaints; they also had 64MB cache, which was nice for their time… and then I turned it off… lol

The SSD that runs the OS is a Crucial BX300, 700-720GB; it doesn't run great when stressed, but it's still one of the winning consumer drives on StorageReview. Its review and test performance is actually good enough that I decided against switching to a 1TB 970 Evo Pro, just because I would rather spend my money on something I can actually use lol… I might end up switching, but not for now.
Not running OS/SLOG/L2ARC on the same SSD seems like it might solve the issue… so that's the plan.
Maybe buy a second one to do a mirror setup.

Then the upgrade I'm putting in is 2x LSI 2308-chip HBAs (9217):
1x 4i4e and 1x 8i, which gives me 12 for my backplane and 1 for a DAS… I would like a better solution here, but I didn't have the patience to figure out how to make my backplane run off one 8087 cable.
Pretty sure it can, but it requires custom backplane cables to go between the sections… so it takes up 3x 8087… so annoying that I cannot daisy-chain it.

And then I got 5x 6TB drives; a couple of them are not like the others… but I'm hoping it won't put too much strain on either kind… besides, the 2 odd drives are the enterprise ones, so in theory they should be able to take more punishment… not sure how happy ZFS will be when I try to put 2 SAS and 3 SATA drives in one pool… most likely a terrible idea… but we will see… if it works at all.

E5-2690… those sound rather fast. Oh yeah, I've got like 16 threads at 2100MHz; L5630, it turns out, is what they are named… xD 40 watts a piece, which is nice and low. Sadly I haven't managed to get my BIOS power management to cooperate, so the server keeps sucking down juice at like 400 watts…
I figured I might do some VMs, so I knew I wanted an easy way to expand my processing power and ample RAM… and I can get to 288GB and 24 threads at 3.3GHz, and of course more with Turbo.

When I picked it, I couldn't understand why the X5600 series was basically faster than the E series you got… found out it was the goddamn power management… I've regretted making that choice many times… 1st-gen power saving tech… so nothing works on default… and of course I started messing around in the damn BIOS… might have made it a lot worse… lol

Since I'm running some stuff in VMs, utilizing NUMA correctly gives major performance benefits; turns out the VMs should be configured to also utilize NUMA xD
It sure seems to have helped a lot… I had issues with my VMs in the past when the server was under load…

Such as them stalling completely, or the streamed video output being all weird, audio drift and whatnot.

Well, you know how HDDs are; it could be an IO issue due to the high number of files… thus far I've moved way over 1 million files, more like 1.5M now, which is double duty, so 3M file operations due to it having to read and write on the same drive. I'm sure that's part of it… and of course my RAID controller caches are turned off, and the caches on the drives too… surely not helping… then I've got 5 drives, but I believe they may need to run in sync, so even in RAID, random IO isn't much faster than 1 drive… and then.

Wow, that is a lot of drives… I'm also thinking of getting one of those 36-bay DAS units, with the server then mainly being a controller for the drives with a bit of processing power and RAM… for whatever… but were I buying today, I would have bought a newer generation…

I like the command. I found rsync -aHAX suggested on a forum somewhere, so I'm using that; I don't think I really need the other stuff either… but I'll note it down, seems like a pretty evolved command line.

The --delete parameter sounds like a very nice feature; not much of a sync without that xD
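For reference, the full invocation being talked about would look roughly like this (paths are placeholders); the -n dry run first is just habit:

  rsync -aHAXn --delete /oldpool/storagenode/ /newpool/storagenode/   # dry run, only shows what would change
  rsync -aHAX --delete /oldpool/storagenode/ /newpool/storagenode/    # the real copy + delete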

This is completely expected. As was mentioned before, pieces stored on your node are encrypted. And encrypted data looks like random noise. Such high entropy data simply can’t be compressed. The difference you’re seeing is likely the result of compressing the databases, logs and piece metadata that is stored with the pieces. I have no clue what the 1.33 compression ratio is supposed to mean, but there is simply no way you’re going to actually be able to reduce the size anywhere close to that much.

I decided to run a test. Normally the files of my node are stored on ext4 (which is on a vdev). I created a snapshot, mounted it, and chose one folder in blobs.

On ext4 du shows me that the folder uses 342GB.
I copied the files to another server, onto ZFS: one dataset with recordsize=128K, another with 1M, both with compression=on.
du on recordsize=128K shows total size as 349GB
on recordsize=1M it’s 348GB

zfs list shows 349G on 128K and 347G on 1M
It does show compression ratio as 1.03 and 1.3 respectively.

So, the compression just compresses ZFS's own inefficiencies.
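For anyone who wants to repeat the test, it can be reproduced with roughly these commands (pool/dataset names and paths are placeholders):

  zfs create -o recordsize=128K -o compression=on tank/test128k
  zfs create -o recordsize=1M -o compression=on tank/test1m
  rsync -a /mnt/ext4-snapshot/blobs/<one-blob-folder>/ /tank/test1m/
  zfs list -o name,used,logicalused,compressratio -r tank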

Okay, I think I finally got this…
When setting the recordsize to 1M, files take up excess space on the drive, apparently on average about 1/3 more when dealing with Storj data… then ZFS compresses the unused space of the records, saving that roughly 25% of disk space back, which shows up as the 1.33x compression ratio.

It's not like ZFS cannot compress stuff, it's just the damn mostly-empty big records that get compressed…
One of my VM disks is compressed by 1.77x and takes up like 52GB but only shows as 32GB when using du; but my 3.8TB storagenode goes from taking up 3.8TB to 5TB as written on disk, and then it's compressed back down to 3.8TB xD
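An easy way to see those numbers side by side on any dataset, if anyone wants to check their own (dataset name is just an example):

  zfs get used,logicalused,compressratio,recordsize tank/storagenode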

Though all this stress did make my failing hard drive or cable issue show up again…
All this ZFS stuff is interesting… sadly it wasn't as exciting as I had hoped lol.
I was also finding it kinda odd, since I had already tried to compress stuff… though it does allow for larger record sizes without them being physically larger, because the padding simply compresses away… of course then come RAM and cache purge penalties, because when working on the data I would assume the records are not compressed… not many things can work on compressed data… but if memory serves ZFS is not too bad at that… though I'm sure there must be some sort of penalty for running large record sizes.
Else why do most people run with 64K or 128K (the ZFS default) when using compression?

Oh, and can I get a SOLVED?

That is why I was so surprised. There shouldn't be any significant compression gain. And apparently there isn't… Not sure why ZFS reports a compression ratio of 1.33x.
Normally that would mean files are compressed to roughly 75% of their original size (I verified compression ratios on other datasets with different data and they are correct). However this seems different for the Storj dataset.

It can't be because the small files are more compressible, because it is a dataset-wide value and would still mean that the whole dataset on average saves around 25%, which it clearly doesn't.

Oh well… another weird thing. I still prefer my ZFS setup with a big recordsize because it can make use of the ZFS caching mechanisms and increases data integrity, especially because I'm using external HDDs and you never know if a USB port or power supply or enclosure is going to quit on you. I already had one HDD get a corrupt superblock, but I was able to copy all the files.

At least I made the new ZFS folder in its own dataset, so I could destroy it easily…
Pretty annoying though that the compression didn't work… and that destroy command gives me the creeps… I've begun writing the location backwards, so I start with the child dataset name at the end and then add the parent dataset, just so a slip of a finger won't accidentally destroy my entire pool…

I should really see if there isn't a parameter you can add that makes you verify the destruction. xD
I guess the data is still there, as destroying it only takes a few seconds… it can't be that thoroughly wiped, but still…
The H-bomb of ZFS commands.
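Turns out there is a dry-run flag for exactly this: -n makes zfs destroy only print what it would do, and -v lists it, so something like this is a safe first step (names are made up):

  zfs destroy -rnv tank/storagenode/oldstuff   # -n = dry run, -v = show what would be destroyed, -r = include children
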
Got my final HBA, so it looks like I'm going to be expanding my pool today… might also do a full reinstall; it's kinda annoying how my 2nd install of Proxmox ended up a freaking mess… and I've got no realistic idea of how to move it to my pool.
Decided to move my boot to the internal USB so that I can get around the whole issue of making the system boot from the ZFS pool. It does seem like it should work, but last time I tried booting from the vdev and shuffled the drives around in the bays, I lost the ability to boot, even though it seems like each of the ZFS drives has boot partitions, so one should be able to boot off any of them…
Not sure if it's a great solution, but it does kinda explain what that internal USB connector is for…
Can't see any logical reason for using it for anything else, aside from maybe some sort of encryption key or whatnot…

Why is this the case?
From what I understand, the checksums are the same, and thus with larger record sizes you actually have less data integrity, due to the fact that if a record is corrupt, the entire record is considered corrupted.
If you have a 1-bit error in a 1MB record, then that whole record is corrupt, while if you have a 1-bit error in a 64K record, then only 64KB is corrupt…

Aside from that, from what I am told, with a 1MB recordsize, if the cache receives 1000 files it will dump 1000MB worth of data to make sure it can contain them…

Sure, you will save on overhead for the checksums and such… which might be pretty nice on something like raidz2… anyway, explain your logic.

If a dataset has child datasets, then the zfs destroy command will show this:

cannot destroy 'pool/ds': filesystem has children
use '-r' to destroy the following datasets:
pool/ds/ds1
pool/ds/ds2
pool/ds/ds2@snapshot

That's nice to know xD

Exactly, which is what I said in the sentence right before the one you quoted, as well as in my earlier comment in this thread.

I think a few of us got excited about the 1.33x compression ratio experienced by changing the recordsize. Even though it didn’t make sense given that the data is encrypted, I haven’t experienced an inaccurate ZFS compression ratio so it was worth investigating fully. I’m still at a loss for what the 1.33x is referring to because it should be referring to the entire dataset, not just the databases, metadata, and logs.

I now see an issue on the ZFS GitHub page entitled "Reported compressratio is incorrect", so I guess that's that.

Hot tip: when preparing to destroy datasets, set up the entire command you want using zfs list -r, and actually complete the command to list out all the child datasets. Only once you’re sure you have what you need, bring that previous command back with the up arrow key, go back to the beginning of the line and change list to destroy. I never type out a zfs destroy command from scratch.
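In other words, something along these lines (dataset names are placeholders):

  zfs list -r tank/old-node        # review every child dataset and make sure this is really the right tree
  # then recall the line with the up arrow, change 'list' to 'destroy', add -n first if paranoid:
  zfs destroy -r tank/old-node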

Oh I was actually comparing zfs to ext4, not different recordsizes.

Well, it's the recordsize that takes up more space with empty padding, which then gets compressed; hence the 1.33 compression ratio, it's just not really data being compressed, it's the bigger record sizes…
However, it does give us an interesting number… because we all get exactly the same ratio…
This could indicate that it has something to do with the Storj piece sizes: if Storj's sizes match the record sizes, we should get optimal efficiency.

I guess… anyway… this investigation is far from over xD

I guess that could be possible. On a dataset with the standard 128K recordsize I had a compression ratio of 1.06.

The vast majority of files is over 1MB. But even storing the smaller files should be fine. Storj doesn't modify data, it only either deletes files or writes new ones. Therefore a big recordsize should be fine and result in fewer read and write ops. Can't really confirm that yet though. In netdata the usage of my ext4 and ZFS node looked pretty similar. They are both not very impressed lol

Well, if we theorize a bit, going from what little I've read about ZFS in recent days:
If the recordsize is too small, a file simply gets written across multiple records, thus creating slightly more overhead. Say I wrote across 5 records, I would at least have to know the start and the end of the records I wrote, compared to if it could fit in one record, where I would only need to know the exact record I'm looking for, which is less overhead (IO).

I was having my 3.8TB storagenode transfer running on my second monitor, and one thing I noted was that the storagenode files are usually 2.23MB, or kinda small like 60-70KB, but there were certain patterns to it… if we take it that most files we will deal with are 2.23 or whatever MB, then that is the number our recordsize should divide evenly into… a 1MB recordsize would make 1 file take up at least 3 records, if not more; not sure what the overhead on records is and what exactly is contained in them, checksums and such…

Also, 2.23MB and 1.33 aren't too far from each other… if one factors some smaller 70K files in, increasing the average compression ratio… like if I say it's 2.01MB per file, then it would take 3 records and be pretty much exactly 1.33 in compression, because the 0.99MB of the last record is empty.

So if we got a 1.06 compression ratio, that means we are about 6% off from the average file size being an exact multiple of the recordsize.

Oh wait, I forgot: 1.33 means 25% saved, because adding 1/3 on top (3/3 + 1/3 = 4/3) means that going back down you subtract 1/4.
So 3 records is 3MB, and each file was 2.23MB (I should really recheck that); if we multiply 3MB by 0.75 we have subtracted a 1/4 and we are there… damn near spot on:
2.25MB.
So that's exactly what is happening…

So if we want to verify that result, then the 128K case should be about 1/20th off, because that is equal to 5%,
meaning it writes around 19-20 records and the last one is mostly empty.
And 128K x 19 is about 2.4MB; not as spot on, but I didn't bother finding / calculating / using the exact numbers… so it would be a little off.
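Putting the back-of-the-envelope numbers in one place (all of this assumes the ~2.25MB average is right):

  1M records:   a 2.25MB file needs 3 records = 3MB logical; padding compresses away, so ~2.25MB on disk, and 3 / 2.25 ≈ 1.33x
  128K records: the same file needs ~18 records ≈ 2.25MB logical, so there is almost no padding to compress and the ratio stays near 1.0x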

So Storj is tough on the disk IO, so we want as large a recordsize as possible, and maybe we want to run with compression, but ZFS will then attempt to compress each file, which we know is impossible aside from the empty padding… I think there was an option for ZFS to just "compress" zeros, which is basically like using shorthand in math, like writing a billion as 10^9, and then it wouldn't waste resources trying to compress the encrypted data… If we then make a child dataset for the blobs folder with a recordsize of 256K, which should be a divisor of the roughly 2.25MB files, the small files like 70K would just "compress", and this would limit the IO and not make the cache or ARC drop loads of additional space in advance when dealing with many small files… say there were periods with people uploading tons and tons of 70K files: if we ran 1MB records, then 1000 files would make the ARC free up 1GB of space when it only needed 70MB (70K x 1000); in the 256K case it would only free up 256MB.

But there are like 2 million files or so in my blobs folder, and since the compression ratio was 1.33, I will assume most were around 2.23MB or whatever… else we should have gotten a different number than 1.33.

So then we will mostly deal with larger files, at least eventually, though new uploads would start smaller and then expand…
Not sure if we can put multiple datasets inside each other… but if we can, then we could set the storagenode dataset to whatever recordsize is database-optimized and the child blobs dataset to 256K, as sketched below.
This should minimize our IO and optimize our RAM utilization… of course with low RAM, smaller recordsizes like 128K might be better… I need to check the numbers on the different file sizes to evaluate whether the 70K or other sizes will have an effect… of course, depending on how right I am (and what I was told) about how the caching works in ZFS, a 1M recordsize could still be a contender to minimize IO… like, say, for people using stuff like SMR drives.
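Nesting does work, and children can override recordsize, so the layout described above could look roughly like this (pool/dataset names are placeholders):

  zfs create -o recordsize=128K tank/storagenode          # databases, orders, logs
  zfs create -o recordsize=256K tank/storagenode/blobs    # the piece files themselves
  zfs get -r recordsize tank/storagenode                  # confirm the child overrides as expected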

If all goes well tomorrow, I will be migrating to new drives; then I might be able to run some testing…
I could really use more nodes on my machine lol… it's not easy to compare them… though on the plus side, for now I'm seeing some decent numbers on my drives running 1M records… though it just spikes randomly at times… my SSD has had a 1.2-second backlog today with only the storagenode running… that is most likely due to something with 1M record sizes… I'm going to switch to 256K and see if that runs smoother…

ZLE (Zero Length Encoding)

  • A very simple algorithm that only compresses zeroes.

is the one that compresses zeros only… might be great on the blobs folder.
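Setting that on just the blobs dataset would be something like (same placeholder names as before):

  zfs set compression=zle tank/storagenode/blobs
  zfs get compression,compressratio tank/storagenode/blobs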

These are some fine observations and calculations. Could be correct.
Guess you'll find out more definitively with your new node.

I checked file sizes before deciding the recordsize and over 2/3 of the files were >1MB and almost none >3MB so I figured 1M might be a good value. If most files are indeed 2.23MB, then lowering the recordsize to 768K would make the file fit perfectly into 3 records without the need for compression. However, 512k might still be good too.
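One rough way to check that size distribution yourself (the blobs path is a placeholder) is to bucket the file sizes per MB:

  find /tank/storagenode/blobs -type f -printf '%s\n' \
    | awk '{ mb=int($1/1048576); c[mb]++ } END { for (m in c) printf "%d-%dMB: %d files\n", m, m+1, c[m] }' \
    | sort -n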

The RAM and cache considerations are interesting though. I haven't seen any problems with the RAM cache yet, but I haven't looked too closely or read up on it a lot. During normal operations the load on the disk and the cache is actually quite low on my system with 40Mbit/s ingress. But the ingress doesn't go through the cache, so ZFS doesn't need to free anything from the ARC for ingress. It will of course need RAM to cache the writes to the disk, as it is probably an async write operation, but I don't think it reserves RAM in units of records; the file will just be stored in RAM as it is. So bigger recordsizes save write IOPS, especially compared to the standard 128K recordsize. Well, at least for the bigger files, which are maybe 2/3.
For a read that might be different, but that is only relevant for egress. Here we might indeed have the problem that downloading small files from the nodes would unnecessarily consume huge amounts of ARC, if your assumption about the usage is correct.
Currently the egress is very tiny, so it doesn't really make a difference.

So I guess overall it is fun to play around with ZFS and recordsize, but it might not make much of a real difference :joy:
The IOPS saving on the normal big 2.23MB files is 18 records (for 128K) vs 3 (for 768K), but the question is how many parallel uploads and downloads you need for it to make a difference on your drive. If it is old and slow or connected by USB 2 then it will probably make a difference, but my new and fast drive connected by USB 3 is just laughing at the current workload anyway. It hardly gets 10% iowait, and the backlog is <50ms most of the time. And because there is hardly any egress, my SSD cache shows almost no activity either.

This is one of the issues with large record or block sizes… I think this is why my 3.8TB rsync took like 50 hours… I also think I'll set my volblocksize to 32K in the new pool I make… (not the recordsize; volblocksize cannot be changed after creation). I'm generally dealing with larger files, and I've got a good deal of RAM + an SSD L2ARC that will only get much bigger now that my OS gets moved off it… not sure if the L2ARC runs the same blocksize, but I don't think so; if memory serves I did format it to some other filesystem upon creating it, as per recommendation…

Anyway, try reading that and you will see why larger recordsizes can be a major issue at times… of course not always… tuning can make a system ridiculously fast at some tasks and just as ridiculously slow at others… many things to consider.

This also still seems to be a thing… which is why I kinda want to change…

And here somebody recommends writing in large blocks when copying, which seemingly works like a charm…

ryao commented on Apr 23, 2012

Try using dd with bs=16384 as a parameter. The throughput should improve. As for waiting 30 minutes, that is a separate issue. I imagine that you would want to discuss that with @dajhorn.

mikhmv commented on Apr 23, 2012

Hi Ryao,
the performance is between 124 and 420 MB/s. It is really great.

The question is how to make so great performance for standard linux command like “cp” as it has speed around 11MB/s?
How can I say system to make all write operations with block 16384?

Seems familiar, right…
Oops, sorry: the post is here, if you want to read it.

Got a damn list with like 20 points of sequenced tasks I would like to get done tomorrow when I start my rebuild, but I'll be happy to get through 10 and then just get back online in a few hours… but we will see how it goes… hopefully 32K vol blocks on the new drives tomorrow, then maybe I can wait at least 3-6 months before I need more drives…

Next up for tonight: figure out if I want to separate my L2ARC and my SLOG.
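If the split happens, it's basically two one-liners, with the second SSD making the SLOG a mirror (pool name and device paths are placeholders):

  zpool add tank log mirror /dev/disk/by-id/ssd-A-part1 /dev/disk/by-id/ssd-B-part1   # SLOG for sync writes
  zpool add tank cache /dev/disk/by-id/ssd-A-part2                                    # L2ARC read cache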

if you use raidz with 4KB drives (ashift=12), then you should use volblocksize=64K at least. If you use a smaller volblocksize then the zvol will use up more space on the pool than there is data.

If you have vdevs with a lot of drives, then you may have to use an even bigger volblocksize. Basically 4KB*(number_of_drives_in_vdev - raidz_level) or 64KB, whichever is bigger.

I am using 6 drives in raidz2, so for me it’s 4KB*(6-2)=16KB, so 64KB is bigger and I have to use 64KB.

It does not help that by default volblocksize is 8KB
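Applying that rule to the 5-drive raidz1 earlier in this thread with ashift=12: 4KB x (5 - 1) = 16KB, so the 64KB floor wins, and a zvol would be created with something like (names are placeholders):

  zfs create -V 100G -o volblocksize=64K tank/vm-disk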

Hmm this is interesting information.
I’m not using any zvols though.