Disk fragmentation is inevitable... Do we need to prepare?

On my 18.04 systems I had to compile the latest version of sqlite3 from source. It was pretty straightforward: download the source code from the sqlite3 website, extract it, change into the directory, “./configure”, “make”, “make install”, then good to go.

To run it you will have to update your environment paths or just run it via /usr/local/bin/sqlite3. You could also move it into /usr/bin or another binary directory that is in your environment path.
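For reference, a minimal sketch of those steps; the version number and URL below are just an example, check sqlite.org/download.html for the current tarball:

    wget https://www.sqlite.org/2024/sqlite-autoconf-3450100.tar.gz   # example version, check the download page
    tar xzf sqlite-autoconf-3450100.tar.gz
    cd sqlite-autoconf-3450100
    ./configure && make
    sudo make install                     # installs to /usr/local/bin/sqlite3 by default
    /usr/local/bin/sqlite3 --version      # confirm which binary you are actually running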

You can always use Docker :slight_smile:

That’s just too easy haha. I actually forgot about that option.

Tried on another system: my database integrity is OK, it’s just that my old SQLite version can’t run an integrity_check or a VACUUM.

Now I will try to update sqlite, but my Linux skills are so… poor.

I really suggest you use the Docker method from the article.
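Roughly, that looks something like this; the mount path and database file name here are just examples, so point the volume at wherever your node keeps its databases, and stop the node first:

    docker run --rm -it -v /mnt/storagenode/storage:/data alpine sh
    # then, inside the container:
    apk add sqlite
    sqlite3 /data/bandwidth.db "PRAGMA integrity_check;"   # repeat for the other .db files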

1 Like

Now that the network tests are down, or whatever is going on… did anyone else notice just how much disk activity the storagenode creates while idle?

I only really noticed because my server wasn’t doing much else, and I saw just how high it was idling…

storagenode with 50 KB out and 100-200 KB in…

utilizing 5 HDDs, 1 SSD SLOG and my ARC on top, all at something like 10-20% utilization, which dropped immediately when I shut down the node…

that’s about the same IO a single HDD can keep up with, at idle O.o
anyways, just pointing out something kinda odd…

Will try to do a test without a network connection. Not saying anything is wrong… just that, wow, the storagenode runs pretty rich at idle…
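If anyone wants to compare on their own setup, a rough sketch of how I’m watching it (plain iostat for most setups, zpool iostat since I’m on ZFS; the container name is just the usual example):

    iostat -x 5                        # per-disk utilization every 5 seconds (sysstat package)
    zpool iostat -v 5                  # or per-vdev stats on a ZFS pool
    docker stop -t 300 storagenode     # stop the node, then watch the same numbers again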

Hey, I was actually quite interested in Diskeeper in the past. I got Diskeeper 12, then I looked for documentation and a live demo of Diskeeper 18 Pro, and found an engineer doing a live demo on their own Condusiv YouTube channel. So, if anybody wishes to learn a bit more about it and has questions, this experienced engineer, Spencer Allingham, is totally open to answering them. Despite the commercial side and the need to sell their technology, they seem quite honest in my opinion in the way they work and explain things, and the live demo is quite convincing. They also specialize in SQL Server query & I/O performance, so I guess Diskeeper is not a bad piece of software at all for an SNO, but it’s also not especially necessary.

That said, I think the built-in defrag is a bit too intrusive with its weekly schedule, because it doesn’t just start once a week, it actually keeps working for several days whenever the system (mouse) is idle, and in my opinion it does more work than necessary, since Windows is already consolidating free space constantly in the background. I know the people behind this software try to defragment only frequently used data (though in our SNO case that is less predictable, since any little piece could be requested); that is their “Efficient mode”. Their “Intelliwrite” feature prevents fragmentation before it occurs (based on what I understood, they redirect write commands to contiguous free space to favour sequential writes, but that engineer would be more qualified to explain exactly what it does), and together with being unintrusive, those are the features most interesting for an SNO.

The software is actually designed to be as unintrusive as possible and to use as little mechanical movement as necessary to defragment, and that’s what I like about it. The real goal of Diskeeper is I/O reduction, by optimizing I/O closely with the Windows “Write driver”, hopefully improving drive lifetime and overall performance, especially under heavy I/O traffic. The other main competitor is PerfectDisk, which uses smart placement (moving the most used data to the beginning of the disk; from Condusiv’s point of view this wastes more resources than it gains, and in our case, apart from possibly helping the most accessed pieces, reads are mostly random pieces over large random areas with unpredictable access) and has no DRAM caching, if you wish to compare both with their trials.

Otherwise, for the built-in Windows defrag/optimization/space consolidation, I guess it’s possible to disable the defrag service or change the schedule to make it a bit less intrusive (I could hear my noisy disks spinning constantly all night, for days, whenever I woke up in the bedroom where the nodes are, haha). The only way to stop it was to launch a defragmentation manually, to tell Windows: hey buddy, I think you can let my hard drives work in peace for now.

As for the negative side of this software, I’ve only had one bad experience: Samsung SATA SSDs use their own RAM cache via the Magician software (RAPID mode). When I ran the same I/O tests as Spencer Allingham (he used IOMeter) with Diskeeper 18 Pro, the SSD performed better with the built-in RAPID cache than with Diskeeper’s DRAM caching. For hard drives, though, it works smoother/faster as expected. As you said, it’s also possible to disable the “Intellimemory” feature that does the DRAM caching, and it’s only used for reads, not writes (so I think it’s quite safe to use). And finally, one cool feature: they managed to make the software install without a reboot (so normally no interruption for the nodes). That said, I haven’t tested it again with the nodes; I think I’ll give it a try now that I have a massive number of pieces.

About editions: Home is limited to 2 GB of RAM caching, Pro is unlimited, and that’s the only difference. If you prefer to avoid the DRAM caching, the Home edition (licensed for 3 workstations) is enough (V-locity, which is the server-only version, is way too expensive for an SNO).

That said, I’m not affiliated with the company at all and this is not an advertisement; I was only a user, but it is trustworthy software (they’re a Microsoft Gold Partner), with experienced engineers as open to answering questions as Storj Labs. It would be nice if they would agree to test a node and give feedback, haha (maybe I should try to convince them, why not).

2 Likes

I would disable the write cache if you do not have a managed UPS and software calibrated to gracefully shut down your PC during a long power outage.

1 Like

Which one do you advise? I tried to find information, but depending on where you look they say different things.

I’m currently using ext4 with fast_commit enabled, and I run inode/directory optimizations from time to time.
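In case it’s useful, this is roughly how I check and enable it; fast_commit needs kernel 5.10+ and e2fsprogs 1.46+, the device name is a placeholder, and tune2fs should only be run on an unmounted filesystem:

    dumpe2fs -h /dev/sdX1 | grep -i features    # check whether fast_commit is already on
    tune2fs -O fast_commit /dev/sdX1            # enable on an existing (unmounted!) fs, if your e2fsprogs supports it
    mkfs.ext4 -O fast_commit /dev/sdX1          # or set it at creation time instead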

I’ve tested ext4 and btrfs, and of these two ext4 is much faster. And for ext4, reducing the inode size helps quite a lot. It also helps to have enough RAM so that inodes stay in cache; I’ve certainly noticed a difference after adding RAM to my NAS.
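For reference, the inode size can only be set at format time, so this means recreating the filesystem (device name is a placeholder):

    mkfs.ext4 -I 128 /dev/sdX1    # 128-byte inodes instead of the 256-byte default; more of them fit in cache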

AFAIK fast_commit should not affect the file walker process. What inode directory optimizations do you have in mind?

Do you have any performance measurement on this?

Changing it on my disks would require me to move all the data off, reformat, and move it back :sweat_smile:

With fsck.ext4 you can use -D to optimize directories. It looks like when you create and then delete a lot of files (which is what ends up happening with Storj), running that optimization helps.
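If anyone wants to try it, it goes roughly like this; the filesystem has to be unmounted, so stop the node first (mount point and device are placeholders):

    umount /mnt/storagenode
    fsck.ext4 -f -D /dev/sdX1     # -f forces a full check, -D optimizes/compacts directories
    mount /mnt/storagenode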

It feels like doing a du on all directories runs a bit faster afterwards, but I’m not sure if it is real.

Around 30% faster. On my test instance du took ~292 seconds on ext4 defaults with standard deviation of ~11 seconds, and ~223 seconds with sd of ~3 seconds in the scenario described in that post.
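For context, the timing is the usual kind of loop, something like this (the path is from my test setup; dropping the page cache between runs keeps them comparable, and needs root):

    for i in 1 2 3; do
        sync && echo 3 > /proc/sys/vm/drop_caches      # flush and clear the page/dentry/inode caches
        /usr/bin/time -f "%e s" du -s /mnt/storagenode/storage/blobs
    done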

Interesting, will have to try!

Ok, to follow up on this topic: I cannot reproduce the fragmentation rate you are observing. My tests from another thread only result in fragmentation of /dev/sda1: 952521/1908736 files (0.3% non-contiguous), and e4defrag -c also returns a score of zero, the best possible. I have yet to test du -s after an explicit fsck -D though; adding it to my test runner now. After all, synthetic scores don’t always reflect reality.
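For anyone who wants to check their own disks, both of these are read-only and safe to run on a mounted filesystem (paths are placeholders):

    e4defrag -c /mnt/storagenode                        # fragmentation score for the whole mount, 0 is best
    filefrag -v /mnt/storagenode/storage/blobs/FILE     # extent layout of a single piece file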

I think the fragmentation starts to appear after some months of operation and lots of new files / deletions.

I’m currently moving my data to newly formatted drives and the fragmentation is gone, at least until it slowly starts to appear again with daily operation. If I sort the disks by fragmentation level, they are in the exact same order as the node ages.

Sounds reasonable! I can’t even test it myself; I’ve recently moved all my nodes because of the -I 128 post…

Having worked on large projects involving lots of files being added to and deleted from the file system, fragmentation is inevitable, but it can be mitigated in a number of different ways:

  • If the filesystem allows it, always tell it the final size of the file at creation time; never under- or over-allocate

  • Try to standardize file sizes (increment by a fixed factor and keep the maximum size < minimum
    size * 10); if that is not possible, use plan B

  • Plan B: use “bucket” files for small files. For example, allocate huge files (1 GB) and store standardized file sizes inside them, aligned with the cluster size of the file system (see the sketch after this list).
    For example, if the file system has 4k clusters, create:

    • bucket_4k_01 (stores 1-bit to 4k files)
    • bucket_64k_01 (stores 4.1k to 64k files)
    • bucket_512k_01 (stores 64.1k to 512k files)
    • etc. (create as many bucket types as you need)
      You will lose a little bit of space at first (creating empty buckets, and cluster alignment), but you will avoid most of the fragmentation and minimize IO and filesystem overhead (at the cost of having to keep an index to locate the data inside the buckets)
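A minimal sketch of the preallocation part on Linux, just to illustrate the idea; the bucket name, sizes and offsets are made up, and the index that maps pieces to offsets is left out:

    fallocate -l 1G bucket_4k_01      # reserve 1 GiB up front, ideally as one contiguous extent
    filefrag bucket_4k_01             # verify: should report very few extents
    dd if=piece.bin of=bucket_4k_01 bs=4096 seek=42 conv=notrunc   # write a small file into slot 42 (offset 42*4k) without truncating the bucket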
1 Like

Plan B sounds like recreating a file system within a file :thinking:
Isn’t that like reinventing the wheel?

Yes and no…
It has to do with block sizes… inside a file there aren’t really any block sizes…
and because the storage is preallocated, it will never become fragmented… so one can use it as a sort of cache…

but as stated, preallocating storage means that if you write only 1 byte into it, it still takes up all the space… say 1 GB.

To preallocate or not to preallocate, that is the question.
Either has unique advantages and disadvantages,
and there really isn’t a best choice for most use cases.

With preallocation you will basically never have fragmentation, in theoretically optimal conditions at least :smiley: of course that is almost impossible in real life.

1 Like

And what about the fragmentation inside of these buckets?

1 Like

Think of the concept as a notebook:
if you leave room for additions, then the sorting won’t get fragmented.

Of course there is always the matter of scale…

This is one of the reasons why more advanced storage uses fairly large block sizes:
smaller writes can be written into these larger blocks, and anything added later to a small write can fit inside the same block.

However, the problem with larger block sizes is that each block is read in one sweep…
so if you have to change, read or write even 1 byte, the system will read and cache the full block.

This is then mitigated by storing writes in main memory or caches for extended periods;
an async write can sit in main memory for multiple minutes…

and thus, if more stuff is added, it will all be written to disk later in one sweep / as a full block.

Something like the fairly new draid option for ZFS will run as a raid for the most part… but for small files, instead of writing a full default block size of, say, 128 KB in a stripe across the raid,
it will write them to two disks in a mirror instead… so the minimum write size goes from 128 KB down to usually 4K, because that is the sector size of most HDDs.
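To make the block-size point concrete on ZFS, the knob is the recordsize property; a quick sketch, assuming a pool/dataset named tank/storagenode (the draid layout itself is fixed at zpool create time and can’t be changed afterwards):

    zfs get recordsize tank/storagenode
    zfs set recordsize=1M tank/storagenode    # bigger records: fewer, larger IOs per file, but more read amplification for tiny changes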

Storage gets really complex really fast as one starts to dig into the details of it.
But long story short, caches are amazing… that’s why there already exist so many caches in a computer.

You have the CPU caches L1, L2, L3, the storage / HDD cache, and RAM is also a sort of cache.
All of these caches will often be used for storagenode operations; adding an extra SSD cache cheaply extends main memory in a sense and decreases how often the underlying storage is accessed, and these kinds of things help limit fragmentation,

because fragmentation is a fundamental issue in all information storage.

The best example is a notebook, and a cache is sort of a scratch page, which one uses to sketch out the data before it’s put into the notebook in a more sorted manner.
Then, to take it one step further, one could imagine the notebook being a cache for actually writing a book.

The problem is that the pages in the notebook are fixed… which is comparable to how HDDs have fixed data locations and limited ability to read the data, because the write head can only be in a single location on the platter.

SSDs are able to read blocks from any location without mechanical delay, which is why their IOPS are so much better, making fragmentation a more limited issue.

Q1D1 is still hell though… in most cases…

2 Likes