ZFS fragmentation

yeah thus far i haven’t seen any detrimental effects of running with sync=standard.
my guess would be that fstab has no say in the sync matter… a zfs dataset can be mounted via fstab and that doesn’t affect the sync property at all…

the bad thing about data getting damaged is system / software instability.
i think that is why oracle and such companies used the whole sync=always trick… it also limits fragmentation and makes it near impossible for data to be damaged, which is nice if something runs for like a decade.
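for reference, this is the property being discussed; a minimal sketch, with "tank/storj" as a placeholder pool/dataset name:

```sh
# show the current sync setting (standard, always or disabled)
zfs get sync tank/storj

# force every write to be committed synchronously (the "oracle trick")
zfs set sync=always tank/storj

# back to the default: honour whatever sync the application asks for
zfs set sync=standard tank/storj
```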

yes a lost file or two isn’t a big deal… in fact when i have had major issues, it seems that because the uploads were never acknowledged i don’t fail audits…

had my zfs pool run for multiple hours without contact to the storage because of issues.
which is weird… i would have thought stuff like that would mess up storagenodes…

but apparently either the nodes or zfs makes sure everything goes well… no idea why…

yeah i would like optane, but it’s a big expense and mostly just because i want to tinker lol

Fragmentation depends on what is done with the data. As an extreme example - take an empty pool and gradually fill it with data without deleting anything. Fragmentation will be low. On the other hand, if you constantly create and delete files at random, you will have high fragmentation.

As for performance, so far the traffic is so low that I would not notice anything. I mean, ingress is 2-4 Mbps.

Fragmentation of my pool is 24% and usage 63%.
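Both numbers come straight from zpool list; for example (the pool name "tank" is just a placeholder):

```sh
# fragmentation (FRAG) and usage (CAP) for all pools
zpool list -o name,size,alloc,free,frag,cap

# or just for one pool
zpool list -o name,frag,cap tank
```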

SLOG is for sync writes. It is very similar to a journal on ext4 and other filesystems. The SLOG is only ever read after a crash, to complete the pending transactions.
The idea of putting ZIL on a separate (faster) device is that you return from sync writes faster and avoid writing to the hard drives twice.
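A minimal sketch of what that looks like in practice; the pool name and device paths are assumptions:

```sh
# add a fast SSD/NVMe device as a dedicated log (SLOG) to the pool "tank"
zpool add tank log /dev/nvme0n1

# alternatively, mirror the SLOG so a dead SSD cannot take in-flight sync writes with it
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

# a log vdev can be removed again later
zpool remove tank /dev/nvme0n1
```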

Sync vs async is not about time, it is about consistency. Some software (databases etc.) really needs to know that what it just wrote will still be there after a crash, so it waits for the data to be committed to the drives instead of just cached. For software using async writes, it is acceptable to lose some recent data after a crash.
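Not zfs-specific, but a rough way to feel the difference (the path and sizes are placeholders): with oflag=dsync every block has to reach stable storage before dd continues, which is essentially what a database does with fsync(); without it the writes just land in the cache and get flushed later.

```sh
# async-ish: dd returns as soon as the data sits in the cache / next transaction group
dd if=/dev/zero of=/tank/test/file bs=128k count=1000

# sync: each 128k block must be acknowledged by stable storage before the next one
dd if=/dev/zero of=/tank/test/file bs=128k count=1000 oflag=dsync
```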

If your server loses power and the SLOG SSD dies, so does your pool.

AFAIK, the Storj node uses async writes for the files, so, if your server crashes at the wrong moment, the satellite will think that you have a file that you actually don’t. This may impact your audit score.

By the way, since I am using a VM, all that zfs sees are sync writes, even though the writes themselves are async.

i don’t believe so… then one would simply fall back on the last consistent copy-on-write state…
however it is possible that some sync writes could be lost…

and even then normal systems lose data all the time… zfs is very stable even in very limited configurations… i’m a bit surprised i haven’t managed to kill a zfs pool yet…
i can usually destroy most things lol

I have never had a SLOG fail on me after a power failure; maybe it would “only” screw up the last writes (which would probably kill the filesystems inside zvols), but that is too high a risk to try.

that’s what VMs with virtual disk zfs pools are for :smiley:
but yeah i must admit i haven’t exactly tested that…

i do find simulating zfs stuff rather practical when i’m doing something that is difficult to find documentation on.

I can think of multiple different setups that would match this sentence. Which one do you mean?

basically simulated virtualized zfs pools, for testing…
i find them very useful
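a throwaway pool like that does not even need a VM; plain files work as vdevs. a quick sketch (names and sizes are arbitrary):

```sh
# create a few sparse files to act as "disks"
truncate -s 1G /tmp/disk1 /tmp/disk2 /tmp/disk3

# build a raidz1 test pool out of them
zpool create testpool raidz1 /tmp/disk1 /tmp/disk2 /tmp/disk3

# play with it, then throw it away
zpool destroy testpool
```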

That would be extremely pricey and pointless.

Unless the text “1 GB” is replaced by “1 TB”, which you already knew about.

With the traffic there is right now, an L2ARC or other type of SSD read cache is pointless, unless your drives are very slow.
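If the traffic ever justifies it, a cache device is trivial to add and remove later; a sketch, with the pool name and device path as placeholders:

```sh
# add an SSD partition as L2ARC to the pool "tank"
zpool add tank cache /dev/sdX1

# L2ARC holds only copies of data, so it can be removed without risk to the pool
zpool remove tank /dev/sdX1
```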

How long does it take for your Storj node(s) to complete initial scanning after a reboot of a machine?

I do not remember exactly as I never checked, but I think it’s a few hours. Last time I rebooted the VM, I expanded the virtual disk and waited for the initial scanning to complete before running resize2fs.
Uptime of the host is 511 days, I really do not remember how long it took after a complete reboot.
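For context, the expand-then-resize step looks roughly like this; a sketch assuming a libvirt guest with ext4 directly on a virtio disk, where the domain name, image path, new size and device are all placeholders:

```sh
# on the host: grow the virtual disk while the VM is running
virsh blockresize storj-vm /var/lib/libvirt/images/storj.qcow2 2T

# inside the guest: grow the ext4 filesystem online to fill the new space
resize2fs /dev/vdb
```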

Then again, the host has 108GB of RAM and 16GB is given to the VM, so it is probably not a standard setup. I get about a 93% hit rate with just the RAM cache (L1), but the statistics are for all pools; I cannot isolate one VM or zvol.
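Those hit-rate figures come from the ARC counters, which on Linux/OpenZFS can be read like this (and they are indeed global, not per pool or per zvol):

```sh
# rolling view: hits, misses and hit percentage once per second
arcstat 1

# one-off summary including overall hit ratio and ARC/L2ARC sizes
arc_summary

# the raw counters, if you want to compute the ratio yourself
grep -E "^(hits|misses) " /proc/spl/kstat/zfs/arcstats
```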

However, even if the initial scan could be made faster after a reboot using bcache (I have had a bad experience with it and found that zfs L2ARC works much better, but that could have been just me), it does not really matter unless you reboot the VM or the host often. Now, if egress went up to, say, 30-60 Mbps and I saw the hit rate go down, I would add an L2ARC.
