New nodes are incredibly fast at first, but I’ve noticed things start to slow down after a while. Recently, I tried copying some database files to a different location, and the transfer of piece_expiration.db (1.5-2GB) was painfully slow, around 5-10 MB/s. The issue? Fragmentation.
My pools are currently 50-60% fragmented, and piece_expiration.db is a prime example. After moving the file, which removed the fragmentation, I was able to restore transfer speeds to 150-200 MB/s.
How are you dealing with fragmentation issues, especially on large nodes?
My ZFS setup:
Special device on NVMe
Compression: on
Recordsize: 1M for storage, 64K for DBs
Atime: off
Sync: off
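For reference, that setup roughly corresponds to the following commands. This is only a sketch: the pool name, disk names, and dataset layout are all hypothetical, not taken from my actual system.

```shell
# Hypothetical pool with a special (metadata) vdev mirrored on NVMe:
zpool create tank raidz2 sda sdb sdc sdd special mirror nvme0n1 nvme1n1

# Piece storage: large records, compression on, no atime, async writes:
zfs create -o recordsize=1M -o compression=on -o atime=off -o sync=disabled tank/storj

# Databases on a separate dataset with a smaller recordsize:
zfs create -o recordsize=64K -o compression=on -o atime=off -o sync=disabled tank/storj-db
```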
Specifically for that DB file… isn’t it going away soon in favor of expirations being tracked in regular flat files?
If you only notice degraded speeds during bulk maintenance tasks… but the node is otherwise speedy (as your special metadata device makes all filewalker/GC/trash operations fast), I wouldn’t worry about it.
If it bugs you… a periodic zfs export/import/rename will restore performance… but doesn’t seem to be worth the time.
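If you do go down that road, one way to rewrite a dataset’s blocks (which is what actually sheds fragmentation, rather than the export/import itself) is a local send/receive followed by a rename. A sketch only, with hypothetical dataset and snapshot names, to be run while the node is stopped:

```shell
# Snapshot the dataset, replicate it locally (every block gets rewritten
# into fresh, contiguous space), then swap the names.
zfs snapshot tank/storj-db@defrag
zfs send tank/storj-db@defrag | zfs receive tank/storj-db-new
zfs rename tank/storj-db tank/storj-db-old
zfs rename tank/storj-db-new tank/storj-db

# Only after verifying the new dataset is good:
# zfs destroy -r tank/storj-db-old
```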
Copying databases is not a use case that needs optimizing; databases are not accessed sequentially.
Databases on an HDD are very slow regardless of fragmentation level.
To make databases faster you need to get rid of the HDD in the data path. One option is to have sufficient caching; another is to force the databases onto an SSD.
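For the second option, the storagenode software can point its sqlite databases at a separate directory. A sketch of the procedure, assuming the storage2.database-dir key (verify against your own config.yaml) and hypothetical paths:

```shell
# Stop the node before touching the databases.
systemctl stop storagenode

# Copy the *.db files to the SSD (paths are hypothetical).
rsync -a /mnt/hdd/storagenode/storage/*.db /mnt/ssd/storj-db/

# Then, in config.yaml, point the node at the new location:
#   storage2.database-dir: /mnt/ssd/storj-db

systemctl start storagenode
```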
Everything started when I noticed slowdowns in trash deletion on one of my ZFS pools (almost as slow as ext4). The 10TB node had 60% fragmentation (zpool get fragmentation). At that point I tried moving all the databases onto the SSD, and I noticed that moving the 2GB piece_expiration file with rsync was going at 5-10 MB/s. That’s it. I thought fragmentation was causing that slowdown; I moved the file to eliminate the fragmentation, and the speed went back to normal. Normally I don’t have problems with the DBs on the same HDD (a separate dataset with 64K recordsize). I will investigate and maybe move everything off the pool (I want to keep my setups ultra simple).
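If you want to keep an eye on this across pools, a small helper can flag anything above a fragmentation threshold. Only the parsing logic matters here; the sample input below stands in for real zpool get fragmentation output, with hypothetical pool names and values:

```shell
# frag_over: read `zpool get fragmentation` output on stdin and print
# pools whose fragmentation exceeds the given percentage threshold.
frag_over() {
  threshold=$1
  awk -v t="$threshold" 'NR > 1 { pct = $3; sub(/%/, "", pct); if (pct + 0 > t) print $1, $3 }'
}

# Sample output in the shape `zpool get fragmentation` prints
# (pool names and values are made up for illustration):
sample='NAME   PROPERTY       VALUE  SOURCE
tank   fragmentation  60%    -
fast   fragmentation  12%    -'

printf '%s\n' "$sample" | frag_over 50   # prints: tank 60%
```

On a live system you would pipe the real command instead: zpool get fragmentation | frag_over 50.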
But if you haven’t actually measured where the bulk of the time is spent, how could you decide it had anything to do with databases or fragmentation, or that this separate use case of copying a database file is in any way representative?
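A hedged sketch of how one might actually measure that, using standard tools while the slow operation is running (pool name hypothetical):

```shell
# Per-vdev throughput and latency inside the pool, refreshed every 5s:
zpool iostat -v tank 5

# OS-level device utilization and average wait times:
iostat -x 5
```

If the HDD shows high utilization and long waits during trash deletion but the special NVMe vdev is mostly idle, that points at the data disks rather than the databases.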
Translation of my first post:
“Hey guys! I noticed this thing. The transfer of a simple file was really slow… It seems that it’s really heavily fragmented. By moving it, I solved the problem. Could it be that the general slowness is due to excessive fragmentation of the whole pool? Have you had the same experience?”
General answers: No…
Me: OK! I’m going to spend time on this myself… it seems to be a problem specific to my setup.