ZFS fragmentation

Anyone here running a ZFS node for over a year? I wonder how well ZFS holds up when it comes to fragmentation.

My node is on ext4 which is inside a zvol. I would not expect rsync of all the files to be very fast. Then again, the node accesses files randomly anyway.

ext4 inside a zvol? Are you using a virtual ext4 disk that is stored over iSCSI on a ZFS pool?

No. I am running the node inside a VM. The host has zfs and all VMs use zvols as their disks.

Just some notes:

  • I was using ext4 on one node as a test, but I switched to btrfs while I was resizing the node, because resizing/moving/extending an ext4 filesystem results in much higher downtime for the node. btrfs removes the need to use rsync.

  • I don’t recommend operating a Storj node on btrfs (nor on ext4) without a bcache located on an SSD (approximately 32 GB of SSD cache per 1 TB of HDD storage); a rough setup sketch follows below.
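
(A rough sketch of what attaching a bcache SSD cache in front of an HDD can look like; the device names /dev/sda and /dev/nvme0n1p1 and the cset.uuid placeholder are hypothetical, not my actual layout.)

make-bcache -B /dev/sda                               # register the HDD as the backing device
make-bcache -C /dev/nvme0n1p1                         # register the SSD partition as the cache device
bcache-super-show /dev/nvme0n1p1 | grep cset.uuid     # note the cache set uuid
echo <cset.uuid> > /sys/block/bcache0/bcache/attach   # attach the cache set to the backing device
mkfs.btrfs /dev/bcache0                               # then create the filesystem on the cached device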

Extending ext4 does not take that long, at least for ~20TB sizes, and it can be done online. I thought about using zfs inside the VM, but I suspected that zfs inside a zvol would be a bit slow.
Shrinking ext4 is another matter, which is why I expand the node gradually, a couple of TB at a time.

The host has 100GB of RAM and most of that is used for cache (inside the VM or on the host). I have two SSDs for SLOG, but no L2ARC. The performance seems good enough, at least for the current traffic level; I get a 99% or so average success rate on both uploads and downloads.
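
(For anyone wanting to inspect a similar setup, this is roughly how to see the SLOG devices and ARC usage; the pool name tank is hypothetical and arc_summary ships with recent OpenZFS.)

zpool status tank          # the SLOG shows up under the "logs" section
zpool iostat -v tank 5     # per-vdev throughput, including the log devices
arc_summary | head -n 40   # ARC size, target and hit rate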

I prefer zfs over btrfs.

The claim “Extending ext4 does not take that long, at least for ~20TB sizes and it can be done online” isn’t universally true; it depends on a variety of factors that many ext4 users cannot take into account in advance.

This depends on the settings used. With sparse_super2 (which saves a bit of disk space) you can only resize offline.

With the default settings for ext4 (at least on CentOS 6-8 and Debian 5-11) it can be expanded mostly online: shut down the VM, expand the virtual disk and partition, start the VM, and run resize2fs. But OK, not in all cases.
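
(Roughly that sequence as a sketch; /dev/vda1 and the mount point are hypothetical, and growpart comes from the cloud-utils/cloud-guest-utils package.)

# on the host: shut down the VM, grow the zvol/virtual disk, start the VM again
growpart /dev/vda 1        # grow partition 1 to fill the enlarged disk
resize2fs /dev/vda1        # grow the mounted ext4 filesystem online
df -h /mnt/storagenode     # confirm the new size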


afaik fragmentation is mostly related to how close you run your pool to its max capacity… anything less than like 97% seems to work… i forget why…
some report issues with fragmentation down to 70% i suppose that might be down to record sizes or avg file sizes.

i really haven’t had any major issues with zfs… or i should say most of my issues with zfs has turned out to be hardware related…
bad disks, slow disks, sporadic disks… not enough disks, not enough cables, not enough planning, not enough foresight…

zfs is like learning a new skill… it has its own language and takes a lot of time and usage to get really proficient.

for a small hardware setup i wouldn’t use zfs for storj… zfs is significantly slower than ext4.
at least when stuff isn’t already in the ARC or other caches…

been using zfs for a couple of years now… and fragmentation hasn’t been an issue at all…
but i’m sure one could make it an issue if running a pool above 80% storage capacity for extended periods of time…

or maybe thats 97% … 97% capacity usage seems to be the cutoff…
heard it from a pro and did test it… after that the pool just chokes and dies pretty much…
ofc doesn’t literally die… the performance just tanks… and i mean tank

also don’t use xfs…
i know some might recommend using that because it’s faster… which is sort of true… but only part of the story… xfs is old, really old… so don’t use it… you don’t really need to know more than that… it’s a deprecated solution, which is inferior to ext4…

but it is faster in some ways… but so is jumping off a cliff rather than taking the brand new stairs

My understanding is that fragmentation in ZFS refers to the free space, not the data. I have been using ZFS for something other than Storj (about 250 TiB) for several years without fragmentation problems. You do need to run a scrub regularly (I recommend weekly for SATA drives, monthly for SAS drives).

Setting up ZFS on large systems requires quite a bit of prep, such as determining the number of drives per VDEV and tuning parameters such as compression, deduplication and atime.
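
(As an illustration of that prep, a minimal sketch of creating a pool and setting those properties; the pool name, the two 6-disk raidz2 VDEVs and the disk names are all hypothetical, not a recommendation.)

zpool create tank raidz2 sda sdb sdc sdd sde sdf raidz2 sdg sdh sdi sdj sdk sdl
zfs set compression=lz4 tank   # cheap on CPU and usually a net win
zfs set atime=off tank         # avoids a metadata write on every read
zfs set dedup=off tank         # dedup needs a lot of RAM; off is the default anyway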


yeah that is exactly right, the zfs fragmentation is a measure of how … well this seems to explain it best,
https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSZpoolFragmentationMeaning

do take into account that this is for illumos and quite old, but the theory i believe is more or less the same… illumos was just a slightly different branch of zfs, which i believe has mostly been rolled into what is called OpenZFS 2.0 today, which is basically all ZFS except the old IBM one (totally unrelated software) and the Oracle branch, a paywalled enterprise version that OpenZFS split off from after the collapse of Sun Microsystems… but i digress.

in short fragmentation is a measure of the avg size of the chunks of free space on a pool.
thus it can be used to advise when one should add more storage space to avoid fragmentation of new data… which seems like a smart way to approach the problem.
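
(the number being discussed is the FRAG column of zpool list… assuming a hypothetical pool named tank:)

zpool list -o name,size,capacity,fragmentation tank   # FRAG = avg free-space segment size, not data fragmentation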

most things in zfs i find are created to ensure the most uptime and the least time spent on maintenance, but these features only really kick in once one has multiple top level vdevs.
then zfs will load balance across the vdevs depending on io sizes, vdev remaining capacity and activity.

combined with deletions over time this balances the pool data, so the system can make the most use of the underlying hardware… might not work for all workloads… but sure should work well for most.

and yes you should run scrubs regularly, tho it doesn’t affect fragmentation, the scrub is like an ECC memory scrub, it verifies the data against its existing checksums to ensure data integrity.
which is really what zfs is all about… data integrity, that is the alpha omega of what zfs does.

i really like ZFS, but it is not a graceful dancer… its more like a main battle tank

I also use ZFS for other things than Storj and I have around 7% fragmentation.
That is why I wonder if someone here really uses it and did not run into write performance problems.

There are a lot of comments but nobody really answered my question :smile:

My theory is that ZFS is especially bad for Storj. But some people here seem to use it for Storj. My guess is that after you reach something like 80% capacity on RAID-Z2 and you have very bad fragmentation because of how Storj works, your write performance will tank. But that is just my theory and I’d love to be proven wrong. And if you have avoided these problems, I wonder what settings you used.

like it’s stated in the link for illumos, it isn’t an exact measure… and the performance drop would only happen when the incoming data is in larger chunks than the avg free space chunks on the pool.

then most of them will be split, causing future problems for a pool…
however because zfs uses variable recordsizes, it would also depend on the recordsize which is the max allowed size of a chunk of data on the pool…

so at 80% fragmentation the avg chunk of contiguous free space on the pool would be 16K.
thus writing anything larger than that would become fragmented when written which would rapidly start decreasing pool performance…

so if you are using the standard configuration of 128k recordsizes, then at 80% fragmentation storage should have been added a long time ago.
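
(for reference, checking or changing the recordsize on a dataset looks like this… tank/storagenode is just a hypothetical dataset name, and the change only applies to newly written blocks)

zfs get recordsize tank/storagenode        # 128K is the default
zfs set recordsize=256K tank/storagenode   # only affects blocks written after the change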

i’m currently at 38% fragmentation, which sort of hits exactly the range one would expect, since my pool has recently been 98% full and i was running 256k recordsizes, which would put my expected fragmentation at 40% after long term usage.

if you are very worried about fragmentation, people like oracle fix it or mitigate it by running sync=always with super super fast slog devices, which pretty much makes it so that all data is written to the SLOG device and then is flushed from RAM to the disk every 5 seconds in a sequential write, thus basically removing most random io write issues and all causes of fragmentation…
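
(adding a dedicated SLOG to an existing pool is a one-liner… the pool name and nvme partitions below are made up, and mirroring the log is optional)

zpool add tank log mirror /dev/nvme0n1p1 /dev/nvme1n1p1   # attach a mirrored SLOG
zpool iostat -v tank 5                                    # the log devices now absorb the sync writes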

been wanting to test that out… have tried to run that for like 6 months or so… or longer… but it’s really tough to have good enough hardware to actually push all iops through the SLOG device.
it always comes at a major performance loss and the extra wear and prices of devices are just next level.

been wanting to get an optane drive to test it… since it might be one of the few types of hardware that is actually fast enough…

i’ve been running storj on zfs for like 3 years now… runs fine, but i got basically all assistive devices you can add to the pool… and 228GB RAM
i was worried about fragmentation when i started using ZFS most likely because i was coming from windows, but haven’t seen any issues yet…
did remake my first few pools, 5 or so… because of rookie mistakes, bad disks and growth; the current one is from april last year.

i will most likely do more tests on it in the future, but it’s difficult to test against problems that don’t manifest.

Awesome, thanks for the info.

I edited my previous post for clarification. But the fact that you could fill up your pool to 98% alone is enough proof for me that there is no write problem (at least with your settings).

I am not sure about SLOG. As far as I understand, SLOG only caches synchronous writes. But are synchronous writes even needed? If I don’t use virtual HDDs or iSCSI, could I get away with async writes? Like a Debian host that mounts the NFS share? Or a TrueNAS Scale with a Docker image on the host?

On the other hand, SLOG devices seem to have become a lot cheaper…

All requests (upload, download, delete) perform some INSERTs or UPDATEs to sqlite databases. These are fsynced. Additionally, each new piece is fsynced after the upload. SLOG will help with all of these.

Ahh sorry, forgot to add that the DB is on a different SSD anyway.

I mean worst case a 2 MB file gets corrupted and repaired without forced sync, right?

the trick is that you run
zfs set sync=always poolname
this makes 90+% of writes sync (not sure why it isn’t 100%, but it just isn’t…)
then because all the writes are now sync, they will hit the slog and thus they are stored on disk and don’t need to be evicted from RAM early, and thus it’s all flushed in one big sequential write every 5 seconds (the default transaction group flush interval)
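
(the 5 seconds is the openzfs transaction group timeout, which on linux is a module parameter… assuming a reasonably recent openzfs)

cat /sys/module/zfs/parameters/zfs_txg_timeout   # 5 seconds by default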

this makes everything faster, helps against fragmentation… i’ve tried so hard to make it work… but my ssd’s have always been too slow or too easy to wear out…
so for the time being i’ve given up on that…

so i just run sync=standard currently, which does make my system run better… but my fragmentation has also gone up… it used to be much lower… i might try and get an optane drive and see if i could push the fragmentation back down… which would be interesting…

i’m still learning a lot about zfs, there is so much to learn…

when you write to a disk there are two types of writes.
sync and async
sync is time critical… like database writes and such… if they go bad, one can get into trouble real quick… so they are forced in front of most other writes, to avoid breaking stuff.

async is less time critical, these writes will often be written in a few seconds depending on how busy the system is, but they can sit around in memory for minutes… usually just a couple tho…

but that is a LONG time in the digital world… lots of things that can go wrong… however these things are async and thus not really critical stuff… it might be stuff that can be requested again from a server on the network… or recomputed… or some image or some such thing…

so developers, configuration and such choose which category stuff falls into…
async is fast and sync is slow…
sync is write through and async is basically allowed to be cached.
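
(a crude way to feel the difference on a dataset mounted at a hypothetical /tank/test… oflag=dsync makes dd wait for every block to hit stable storage, the plain run just lands in cache and returns)

dd if=/dev/urandom of=/tank/test/async.bin bs=128k count=2048               # async: returns almost immediately
dd if=/dev/urandom of=/tank/test/sync.bin bs=128k count=2048 oflag=dsync    # sync: every write waits for the ZIL/SLOG
rm /tank/test/async.bin /tank/test/sync.bin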

ZFS will always have a ZIL (ZFS Intent Log).
the SLOG or Separate Log is a dedicated ZIL device; it saves your main storage media from doing double writes, because without a SLOG ZFS will allocate an area of your storage media for the ZIL.

so for optimal performance, a SLOG is the minimal upgrade for a pool.
SLOG size is about 5x the maximum you can write per second, and actual usage will usually be less…

like say you got 10Gbit networking… so ingress is max ~1200MB/s, thus 1200 x 5 = 6000MB, so your SLOG should be about 6GB, meaning you can get away with a tiny SLOG device.

some like to mirror them, but i think it’s mostly a waste; the SLOG is basically a redundant system which will only be used if your power fails or such… the data is really stored in RAM, and only if the system forgets what it was doing will it go and look at the SLOG or ZIL to ensure important data is stored correctly.

storage nodes run fine at sync=disabled
but that makes all sync writes into async… makes everything lightning fast… but not everything likes that… because the system is basically just lying about sync writes being written, and this can cause issues… but the storagenodes seem 100% okay with that.

however some software will expect sync writes to take time… and thus might not be ready when your system says it’s done, because it’s too quick, which can cause hiccups.

but sync=disabled is so FAST… so so FAST… but i try not to use it… but at times it can be difficult not to lol.
the minimum gain in performance with that is like 25-50%, and it can be multiples even…

but don’t turn off sync writes… there is good reason they are sync…

The ZFS default setting is that you let the OS or software decide if it needs sync or not. With sync=always you force everything to be sync.
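
(In command form, for a hypothetical dataset tank/storagenode:)

zfs get sync tank/storagenode            # standard by default: the application decides
zfs set sync=always tank/storagenode     # force every write through the ZIL/SLOG
zfs set sync=disabled tank/storagenode   # treat all writes as async (fast, but lies to the application)
zfs set sync=standard tank/storagenode   # back to the default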

For VMs and databases I can really understand the need for sync. If a VM or a DB gets corrupted, that would be really bad. But a 2 MB Storj file? Nobody would use sync for an SMB share, because even if something goes wrong, the client knows that the transfer failed because there was no ack. I am not 100% sure if this is the same for an fstab mount, but my guess is that it is. And even if this is not the case, should we care? Does it matter if a single or even 10 Storj files get corrupted?

Sure, a SLOG device would always be better, but the Storj rewards are in my opinion too small to justify buying Intel Optane.