ZFS fragmentation (mostly dbs)

New nodes are incredibly fast at first, but I’ve noticed things start to slow down after a while. Recently I tried copying some database files to a different location, and the transfer of piece_expiration.db (1.5-2 GB) was painfully slow, around 5-10 MB/s. The issue? Fragmentation.

My pools are currently 50-60% fragmented, and piece_expiration.db is a prime example. After moving the file (which rewrites it contiguously and removes the fragmentation), transfer speeds were back to 150-200 MB/s.

How are you dealing with fragmentation issues, especially on large nodes?

My ZFS setup (equivalent commands are sketched below the list):

Special device on NVMe
Compression: on
Recordsize: 1M for storage, 64K for DBs
Atime: off
Sync: off
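For reference, a hedged sketch of the commands that would produce this configuration; the pool/dataset names (tank, tank/storage, tank/databases) and device paths are placeholders, not from the post:

zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1   # special (metadata) vdev on NVMe
zfs set compression=on tank
zfs set atime=off tank
zfs set sync=disabled tank              # "Sync: off" is the sync=disabled property
zfs set recordsize=1M tank/storage      # piece storage
zfs set recordsize=64K tank/databases   # SQLite databases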

Your statements make no sense to me… ZFS is CoW (copy-on-write), so a database would always be fragmented?
Maybe the WAL saves it somehow.

Specifically for that DB file… isn’t it going away soon in favor of expirations being tracked in regular flat files?

If you only notice degraded speeds during bulk maintenance tasks, but the node itself is speedy (as your special metadata device makes all filewalker/GC/trash operations fast), I wouldn’t worry about it.

If it bugs you… a periodic zfs send/receive into a fresh dataset (followed by a rename) will rewrite the data and restore performance… but it doesn’t seem to be worth the time.
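A minimal sketch of that cycle, assuming the node is stopped and tank/storagenode is a placeholder dataset name:

zfs snapshot tank/storagenode@defrag
zfs send tank/storagenode@defrag | zfs receive tank/storagenode_new
zfs rename tank/storagenode tank/storagenode_old
zfs rename tank/storagenode_new tank/storagenode
# once the new copy is verified, reclaim the space:
zfs destroy -r tank/storagenode_old

The receive writes every block out fresh, which is what removes the fragmentation, but it needs enough free space for a second full copy while both datasets exist.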


I’m not, because fragmentation is not an issue.

  • Copying databases is not a use case that needs optimizing; databases are not accessed sequentially.
  • Databases on an HDD are very slow regardless of fragmentation level.
  • To make databases faster you need to get rid of the HDD in the data path: one option is sufficient caching, another is forcing the databases onto an SSD (see the sketch after this list).
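On the node side, a hedged way to do the latter is the storagenode’s database directory option; the path is a placeholder, and you should confirm the option name against your own config.yaml:

storage2.database-dir: /mnt/ssd/storagenode-databases

Stop the node and move the existing *.db files to the new directory before restarting with this set.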

Missed this. So the solution for you is:

zfs set special_small_blocks=64K pool/storagenode/databases
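Since that dataset uses a 64K recordsize, special_small_blocks=64K routes every block of the database files to the NVMe special vdev instead of the HDDs. A quick check, using the dataset name from the command above:

zfs get recordsize,special_small_blocks pool/storagenode/databases

Note the setting only affects newly written blocks; existing files migrate to the special vdev only as they are rewritten (e.g. by a vacuum or a copy).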

My databases are on SSD. The size needed isn’t that large, so it isn’t a hardship.

How do you know?
The only fragmentation score I know of is about free space, which is quite different from fragmentation of the used space.

Everything started when I noticed slowdowns in trash deletion in one of my ZFS pools (almost as slow as ext4). The 10TB node had 60% fragmentation (zpool get fragmentation). At that point I tried moving all the databases onto the SSD, and I noticed that moving the 2GB piece_expiration file was going at 5-10MB/s (rsync). That’s it. I assumed fragmentation was causing that slowdown; I moved the file to eliminate it, and the speed went back to normal. Normally I don’t have problems with the databases on the same HDD (in a separate dataset with a 64K recordsize). I will investigate, and maybe I’ll move everything off the HDD (I want to keep my setups ultra simple).
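For anyone following along, that score is read like this (pool name tank is a placeholder):

zpool get fragmentation tank
zpool list -o name,size,allocated,free,fragmentation,capacity tank

As pointed out above, the number describes how fragmented the remaining free space is, not the on-disk layout of files already written.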

But if you haven’t actually measured where the bulk of the time is spent, how could you decide it had anything to do with databases or fragmentation, or that this separate use case of copying a database file is in any way representative?
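For example, watching per-vdev latency while reproducing the slow copy would show whether the HDDs or the special device are the bottleneck; a sketch for OpenZFS, with tank as a placeholder:

zpool iostat -v -l tank 5   # -l adds average latency columns per vdev
iostat -x 5                 # per-disk utilization and await times (sysstat)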


Translation of my first post:
“Hey guys! I noticed this thing. The transfer of a simple file was really slow… It seems that it’s really heavily fragmented. By moving it, I solved the problem. Could it be that the general slowness is due to excessive fragmentation of the whole pool? Have you had the same experience?”

General answers: No…

Me: OK! I’m going to spend time on this… it seems to be a problem specific to my setup.

I’m going to agree with @arrogantrabbit that fragmentation of databases should not matter much.

That said, I still have in place an old script that on each node restart runs a vacuum operation on each database. This operation rewrites the whole database file from scratch, reducing fragmentation and size. You can search for “vacuum” on the forum to see different approaches to do so. I initially put it in place exactly to reduce fragmentation of the old orders database file, IIRC, but this is a solved problem for a long time now. The operation itself doesn’t have any negative impact, and I kept it because it automatically reduced the size of piece expiration and bandwidth databases to reasonable values after large deletions, so it still had some positive effect.
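A minimal sketch of such a script, assuming the sqlite3 CLI is installed, the node is stopped, and the database path is a placeholder:

#!/bin/sh
# VACUUM rewrites each SQLite database file from scratch, compacting it
# and leaving it far less fragmented on disk.
for db in /mnt/storagenode/storage/*.db; do
    sqlite3 "$db" "VACUUM;"
done

Per the SQLite docs, VACUUM can need free space of up to twice the size of the database while it runs.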

If you really want the fragmentation number on database files to go down, that would be my recommendation.


But I’m still interested in feedback on why the slowdown happened, especially since you use an SSD cache/special device.

(It partially confirms my tests that ZFS is not great with lots of small files; ext4 and NTFS struggle too, but ZFS and BTRFS usually suffer from this much more than expected.)

I realize that I wrote down my thinking wrong.
Summing up again… I noticed a node deleting trash slowly (like my ext4 nodes do). I transferred a 2 GB file at 5-10 MB/s, so I thought: can fragmentation be a problem in ZFS? Not only for the databases (maybe there too), but for the entire data directory (/node/storage)?

Actually I have nodes working great on ZFS (mostly), but none yet as big as the ext4 ones (the ext4 nodes are more than 2 years old), and I don’t have the technical capabilities for a deep analysis.
I didn’t know what a JBOD was before using Storj.
I am guided by ChatGPT.

PS: from the slow node with the special device; note the timestamps, each trash prefix directory takes minutes to delete:
drwx------ 2 root root 2 Sep 21 06:46 wa
drwx------ 2 root root 2 Sep 21 06:55 wb
drwx------ 2 root root 2 Sep 21 07:04 wc
drwx------ 2 root root 2 Sep 21 07:13 wd
drwx------ 2 root root 2 Sep 21 07:22 we
drwx------ 2 root root 2 Sep 21 07:30 wf
drwx------ 2 root root 2 Sep 21 07:39 wg
drwx------ 2 root root 2 Sep 21 07:54 wh
drwx------ 2 root root 2 Sep 21 08:50 wi

Still deleting the 08-26 trash dir.

God help you! You should avoid that like the plague.