Is the vacuum necessary?

I cannot help but wonder whether vacuuming the database is actually a good or a bad thing… people on here have been talking about it a lot lately, and most of the reasoning seems to be: I ran this thing, the database is smaller afterwards, and on hard drives smaller is better… so vacuuming must be better…

I don't really know a lot about database management; I've worked with a few here and there… pretty simple stuff at first… you know, rows and columns of information that you can basically search…

Now, if I was making a database and wanted it to be fast on disk, I would develop some sort of system to split the file into chunks… so that I know the first 1000 rows are in the first 1000 sectors on the HDD. If I wanted to load, say, row 1500, I would go to the first sector, skip ahead 1498 sectors, and then read from row 1500 onwards for a few rows, depending on what I'm actually looking for…
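What that amounts to is fixed-size records with computed offsets. A rough sketch of how such a lookup would work (the file name, row size, and one-row-per-sector layout are all just assumptions for illustration):

```python
# Sketch: fixed-size rows let you compute where row N lives instead of scanning.
ROW_SIZE = 4096  # hypothetical: one row stored per 4 KiB sector

def read_rows(f, first_row, count=1):
    """Seek straight to the slot holding `first_row` and read `count` rows."""
    f.seek(first_row * ROW_SIZE)        # one seek, no scanning
    return f.read(count * ROW_SIZE)

with open("table.dat", "rb") as f:      # hypothetical data file
    data = read_rows(f, 1500, count=3)  # e.g. rows 1500..1502
```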

That means you would have sectors in the file with maybe vast amounts of free space, to optimize how the database operates… and then you vacuum it and pack everything into a neat little bunch without any gaps…

Meaning the next time something is changed, it will not be written inside the database file; it will be written behind the sequentially written file instead, which is… data fragmentation… which is exactly what you are trying to avoid…

So please show me how this vacuuming works and why I would even trust it to change my database, which most likely was designed to make optimal use of the disk for performance.

Besides, if you can copy a 400 MB database off an HDD in 4 seconds, it can't really be that fragmented to begin with… I think it was 0dim who was copying his fragmented database from a spindle drive at 4 MB/s, showing just what fragmentation of files can do.

Granted, I know I won't benefit from any of your vacuuming stuff, because I run my system in a special way to avoid such issues, or at least minimize them to the point that my 24 TB node's DB shouldn't run into trouble before the regular 1 TB HDD nodes do… so problems should get solved before they affect me, really.

I would just like to see some proof of why this vacuuming is supposed to work. Like direktorn, I'm skeptical at best… and with good reason, because making a database smaller doesn't sound like a solid plan for avoiding fragmentation of the files…

(image: "Ignorance is bliss, until reality sets in…")

You are perfectly right. It will happen again, and it can be worse than before the vacuum. But it can temporarily fix the problem (or, more precisely, shift it into the future).

This is why @littleskunk mentioned

It’s generally a good thing except under very specific workloads. Database maintenance is important.

In fact, RDBMSes generally have some kind of auto-vacuuming mechanism in place specifically to avoid excessive fragmentation.
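For instance, SQLite (which the storagenode uses) has an auto_vacuum pragma, and PostgreSQL runs an autovacuum daemon. A minimal sketch of enabling incremental auto-vacuum on an SQLite file (the file name is just an example; the setting only takes effect on a new database or after a full VACUUM):

```python
import sqlite3

# isolation_level=None keeps the connection in autocommit mode,
# so VACUUM isn't blocked by an implicit open transaction.
con = sqlite3.connect("example.db", isolation_level=None)
con.execute("PRAGMA auto_vacuum = INCREMENTAL")
con.execute("VACUUM")                     # rebuild so the new setting applies
con.execute("PRAGMA incremental_vacuum")  # reclaim free pages on demand later
con.close()
```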

This is not always true, of course. The database would be a lot smaller without indexes, but then lookups would be slow. It’s not about the size of the data; it’s more about fragmentation – both filesystem fragmentation as well as record fragmentation, though the latter is really only important when reading many records in a single query.

Fragmentation is also much less significant when the database is kept on an SSD since there is no seek time (though SSDs are still optimized for sequential reads; the performance gap between sequential and random access is just not as pronounced on SSDs).

What you're describing is a combination of indexing (lookups) and clustering (sequential ordering of the data). However, clustering requires constant work: as soon as you insert a new record that would fall between two others, the data is no longer clustered. This is why databases that support clustering have a command you need to run to perform it, and it is a point-in-time operation that has to be repeated routinely.

However, clustering is really only useful if you are reading large chunks of records that fall within a contiguous range of values for a specific column – then you’d cluster on that column. I don’t think the storagenode access patterns would benefit from clustering. The indexes are sufficient; they are used to look up which database page contains a record with given column values that are covered by the index so that the engine can seek directly to it. There is just an additional level of indirection through the index.
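You can actually watch that indirection with EXPLAIN QUERY PLAN. A small sketch (the table and column names are made up, not the storagenode schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pieces (piece_id TEXT, size INTEGER)")
con.execute("CREATE INDEX idx_pieces_id ON pieces (piece_id)")

# With the index, SQLite reports a SEARCH using idx_pieces_id, i.e. it can
# jump to the right page; without it, this would be a full-table SCAN.
query = "SELECT size FROM pieces WHERE piece_id = ?"
for row in con.execute("EXPLAIN QUERY PLAN " + query, ("abc",)):
    print(row)
con.close()
```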

My understanding is that indexes are pretty aggressively cached, and if the database engine doesn’t cache them then the system’s I/O cache likely will.

The SQLite documentation explains the process and its benefits.
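In practice it is a single statement. A minimal sketch of vacuuming one database file and comparing its size before and after (the path is just an example; the node should be stopped while you do this):

```python
import os
import sqlite3

db_path = "bandwidth.db"  # example path; point it at the actual database file

before = os.path.getsize(db_path)
con = sqlite3.connect(db_path, isolation_level=None)  # autocommit, so VACUUM can run
con.execute("VACUUM")     # rewrites the whole file, dropping free/fragmented pages
con.close()
after = os.path.getsize(db_path)
print(f"{before} -> {after} bytes")
```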

Internal fragmentation of the database structures can be just as bad as filesystem fragmentation of the database file.

One could say the same about not vacuuming your databases. :slight_smile:

I was schooling myself a bit on what vacuum actually is… because I really had no clue…

It sounds a lot like TRIM on SSDs, but it also sounds like it could do more harm than good depending on a lot of factors… I think I'll just leave the storjlings to deal with database issues.

And make sure my system can cope… I've been running with max concurrent set to unlimited lately and haven't seen the "database is locked" issue for many days. It was usually something I saw after server reboots and storagenode restarts, but storagenode restarts don't show it now…
even though I'm still on the old version, 1.3.3.

So I'm going to run a few more tests, to see if it only happens when I limit max concurrent…
or if it's from when my server has been rebooted and is kind of slow while the ARC and L2ARC are loading…

96 hours of server uptime and my L2ARC is just about warmed up… 600 GB of former RAM data ready to go from a dedicated SSD.

Ok, then… without trying to sound hostile… why are you commenting on this topic, which is about performance of vacuuming in a ramdisk vs vacuuming in-place? Feel free to start another thread on whether vacuuming is necessary. :slight_smile:
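For anyone landing here from the split: "vacuuming in a ramdisk" means roughly the following. This is only a sketch, assuming SQLite 3.27+ for VACUUM INTO; the paths are examples, and the node must be stopped before swapping the file back:

```python
import shutil
import sqlite3

src = "/storage/node/orders.db"          # example: a storagenode database file
tmp = "/mnt/ramdisk/orders.vacuumed.db"  # example: a path on a tmpfs/ramdisk mount

con = sqlite3.connect(src)
con.execute("VACUUM INTO ?", (tmp,))     # write the compacted copy into the ramdisk
con.close()

shutil.move(tmp, src)                    # replace the original (node stopped, no -wal/-shm files left behind)
```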

@alexey I suppose this one should also go in the "Is the vacuum necessary?" thread.
Just because I don't understand why something is needed doesn't mean I don't need it…
I've got an open mind, which is why I actually bothered to start researching the topic of vacuum, since people keep talking about it in relation to the "database is locked" issues… which I've also been seeing a lot at times.

Maybe I need to vacuum, maybe I don't… if nothing else it's an interesting concept I'd like to understand a bit more about… ignorance is a dangerous thing, but then again so is a little knowledge, so yeah…

I dunno, sorry. I'm pretty sure that if I did vacuum, ZFS would do most of it in RAM anyway; maybe not 100%, but enough that making a ramdisk for it wouldn't make sense.

If your vacuum is slow, don't you think you might be running too big a block size and hitting an actual caching issue? At 500 MB it shouldn't be anything a modern computer couldn't handle with ease, as long as it's configured to use regular-sized blocks on the drives.

In this example, the workload (clamscan, as the quote below explains) runs into a read amplification of roughly 32x because of the record size; when caching isn't working correctly, it ends up doing tons of extra disk reads…

example from the link below

The sum of the read bandwidth column is (almost) exactly the physical size of the file on disk (du output) for the dataset with primarycache=all: 2.44 GB.
For the other dataset, with primarycache=metadata, the sum of the read bandwidth column is… wait for it… 77.95 GB.
A FreeBSD user has posted an interesting explanation of this puzzling behavior on the FreeBSD forums:

clamscan reads a file, gets 4k (pagesize?) of data and processes it, then it reads the next 4k, etc.

ZFS, however, cannot read just 4k. It reads 128k (recordsize) by default. Since there is no cache (you’ve turned it off) the rest of the data is thrown away.

128k / 4k = 32
32 x 2.44GB = 78.08GB
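The same arithmetic, generalized, in case anyone wants to plug in their own record size and read size (the numbers below are the ones from the quote):

```python
# Read amplification when nothing is cached: every small application read
# costs one full ZFS record read from disk.
recordsize_kib = 128   # ZFS default recordsize
read_kib = 4           # clamscan reads ~4 KiB at a time
logical_gb = 2.44      # size of the data actually scanned

amplification = recordsize_kib / read_kib      # 32x
physical_gb = amplification * logical_gb       # ~78 GB read from disk
print(f"{amplification:.0f}x amplification, ~{physical_gb:.1f} GB physically read")
```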

We get it: ZFS is perfect and solves all problems people have with everything. I guess I’ll try to take days to migrate my entire server to ZFS instead of trying to make things work better on my awful ext4 volumes. :roll_eyes:

I’m getting kind of tired of ZFS being brought up as the panacea for any and all performance issues.

First: it’s not perfect either. Second: it’s not feasible for everyone to switch to it.

And it is not necessarily something for non-advanced users.

I'll say this… make sure your ashift is correct for your hardware…
Getting that wrong can really mess up your day. I'm running with 2x 4Kn HDDs plus some 512e and 3x 512-byte-sector drives,
which basically means I'm stuck no matter what I do until I replace my 4Kn drives with either 512e or 512-sector HDDs.

So no, ZFS won't fix everything, and it can be kind of advanced…
If you do choose to switch, ashift=12 means 4K blocks and ashift=9 means 512-byte blocks.
ZFS also won't improve everything… some I/O overhead can be a bit higher with compression on; as I understand it, to benefit from compression it will write 8K blocks and it assumes people only use compression if they get better than 2x compression, so you end up with slightly lower IOPS unless you disable compression… but compression also saves a bit of space, depending on record sizes…

But really, ashift is the only critical one I'm aware of, because it cannot be changed once a pool has been created; most other settings can be changed on the fly.
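For reference, ashift is just the power-of-two exponent of the sector size the pool assumes; a quick sketch of the mapping (nothing ZFS-specific here, just the arithmetic):

```python
import math

# ashift = log2(sector size): 2^9 = 512 bytes, 2^12 = 4096 bytes.
def ashift_for(sector_bytes: int) -> int:
    return int(math.log2(sector_bytes))

print(ashift_for(512))   # 9  -> 512-byte sectors (512n, or 512e logical)
print(ashift_for(4096))  # 12 -> 4 KiB native (4Kn) sectors
```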

I really don't understand why you started this discussion over again.

Look at this post with the explanation, this post with the investigation, and this post with the solution.

I didn't; Alexey decided to split this off from another topic.