New nodes are incredibly fast at first, but I’ve noticed things start to slow down after a while. Recently, I tried copying some database files to a different location, and the transfer of piece_expiration.db (1.5-2 GB) was painfully slow, around 5-10 MB/s. The issue? Fragmentation.
My pools are currently 50-60% fragmented, and piece_expiration.db is a prime example. After moving the file (which removed the fragmentation), I was able to restore transfer speeds to 150-200 MB/s.
How are you dealing with fragmentation issues, especially on large nodes?
My ZFS setup:
Special device on NVMe
Compression: on
Recordsize: 1M for storage, 64K for DBs
Atime: off
Sync: off
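For anyone who wants to replicate this setup, here is a minimal sketch of how those settings could be applied. The pool name `tank`, the dataset layout, and the device names are placeholders, not my actual configuration:

```
# Placeholder pool/dataset/device names; adjust to your own layout.
# Dataset for piece storage: large records, compression, no atime, async writes.
zfs create -o recordsize=1M -o compression=on -o atime=off -o sync=disabled tank/storagenode

# Separate dataset for the databases with a smaller recordsize.
zfs create -o recordsize=64K tank/storagenode/dbs

# Mirrored NVMe special vdev so metadata (and small blocks) land on flash.
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1
```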
Specifically for that DB file… isn’t it going away soon in favor of expirations being tracked in regular flat files?
If you only notice degraded speeds during bulk maintenance tasks… but the node is speedy (as your special metadata device makes all filewalker/GC/trash operations fast), I wouldn’t worry about it.
If it bugs you… periodically rewriting the dataset with zfs send/receive and a rename will restore performance… but it doesn’t seem to be worth the time.
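For reference, a sketch of what that rewrite could look like. The dataset names are placeholders, and the node has to be stopped before the final swap:

```
# Snapshot and rewrite the dataset; receiving lays all blocks out fresh,
# which removes fragmentation. Names are placeholders.
zfs snapshot tank/storagenode@defrag
zfs send tank/storagenode@defrag | zfs receive tank/storagenode_new

# Stop the node, then swap the datasets and clean up the old one later.
zfs rename tank/storagenode tank/storagenode_old
zfs rename tank/storagenode_new tank/storagenode
```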
Copying databases is not a use case that needs optimizing; databases are not accessed sequentially.
Databases on HDD are very slow regardless of fragmentation level.
To make databases faster you need to get rid of the HDD in the data path. One option is to have sufficient caching; another is to force the databases onto an SSD.
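If it helps, a sketch of the SSD move under some assumptions: the paths are examples, and I’m assuming the `storage2.database-dir` option in config.yaml, which tells the node to look for its databases in a custom location:

```
# Stop the node so the databases are quiescent (WAL gets checkpointed).
sudo systemctl stop storagenode

# Copy the databases to the SSD; paths are examples.
mkdir -p /mnt/ssd/storagenode-dbs
rsync -a /mnt/hdd/storagenode/storage/*.db /mnt/ssd/storagenode-dbs/

# Then point the node at the new location in config.yaml:
#   storage2.database-dir: /mnt/ssd/storagenode-dbs
sudo systemctl start storagenode
```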
Everything started when I noticed slow trash deletion on one of my ZFS pools (almost as slow as ext4). The 10 TB node had 60% fragmentation (zpool get fragmentation). At that point I tried moving all the databases onto the SSD, and I noticed that moving the 2 GB piece_expiration file was going at 5-10 MB/s (rsync). That’s it. I thought fragmentation was causing that slowdown; I moved the file to eliminate the fragmentation and the speed went back to normal. Normally I don’t have problems with the databases on the same HDD (in a separate dataset with recordsize=64K). I will investigate and maybe move everything off the HDD (I want to keep my setups ultra simple).
But if you haven’t actually measured where the bulk of the time is spent, how could you decide it had anything to do with databases or fragmentation, or whether this separate use case of copying a database file is in any way representative?
Translation of my first post:
“Hey guys! I noticed this thing. The transfer of a simple file was really slow… It seems that it’s really heavily fragmented. By moving it, I solved the problem. Could it be that the general slowness is due to excessive fragmentation of the whole pool? Have you had the same experience?”
General answers: No…
Me: Ok! I’m going to spend time on this… it seems to be a specific problem.
I’m going to agree with @arrogantrabbit that fragmentation of databases should not matter much.
That said, I still have in place an old script that on each node restart runs a vacuum operation on each database. This operation rewrites the whole database file from scratch, reducing fragmentation and size. You can search for “vacuum” on the forum to see different approaches to do so. I initially put it in place exactly to reduce fragmentation of the old orders database file, IIRC, but this is a solved problem for a long time now. The operation itself doesn’t have any negative impact, and I kept it because it automatically reduced the size of piece expiration and bandwidth databases to reasonable values after large deletions, so it still had some positive effect.
If you really want the fragmentation number on database files to go down, that would be my recommendation.
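For anyone who wants to try it, a minimal sketch of such a vacuum pass, assuming the node is stopped first and the sqlite3 CLI is installed; the path is a placeholder:

```
#!/bin/sh
# Run only while the storagenode is stopped.
for db in /mnt/storagenode/storage/*.db; do
    before=$(du -h "$db" | cut -f1)
    sqlite3 "$db" "VACUUM;"   # rewrites the whole file from scratch, compacting it
    after=$(du -h "$db" | cut -f1)
    echo "$db: $before -> $after"
done
```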
But I’m still interested in feedback on why the slowdown happened, especially since you use an SSD cache/special device.
(It partially confirms my tests that ZFS is not great with lots of small files; ext4 and NTFS struggle too, but ZFS and BTRFS usually suffer from this much more than expected.)
I realize that I expressed my thinking poorly.
Summing up again… I noticed a node deleting trash slowly (like my ext4 nodes do). I transferred a 2 GB file at 5-10 MB/s, so I thought: could fragmentation be a problem in ZFS? Not only for the databases (maybe there too) but for the entire data directory (/node/storage).
Actually, I have nodes working great on ZFS (mostly), but none yet as big as my ext4 ones (those are more than two years old), and I don’t have the technical capabilities for a deep analysis.
I didn’t know what a JBOD was before using Storj.
I am guided by ChatGPT.