Disk fragmentation is inevitable... Do we need to prepare?

I would disable the write cache if you do not have a managed UPS and calibrated software to gracefully shut down your PC during a long power outage.


Which one do you advise? I tried to find information, but depending on where you look, they say different things.

I’m currently using ext4 + fast_commit enabled and running inode directory optimizations from time to time.

I’ve tested ext4 and btrfs, and of these two ext4 is much faster. For ext4, reducing the inode size helps quite a lot. It also helps to have enough RAM so that inodes stay in cache; I’ve certainly noticed a difference after adding RAM to my NAS.

AFAIK fast_commit should not affect the file walker process. What inode directory optimizations do you have in mind?

Do you have any performance measurement on this?

Changing it on my disks would require me to move all the data off, reformat, and move it back :sweat_smile:

With fsck.ext4 you can use -D to optimize directories. It looks like when you create and then delete a lot of files (something that eventually happens a lot on a Storj node), optimizing the directories helps.

It feels like running du on all directories is a bit faster afterwards, but I’m not sure if the difference is real.

Around 30% faster. On my test instance du took ~292 seconds on ext4 defaults with standard deviation of ~11 seconds, and ~223 seconds with sd of ~3 seconds in the scenario described in that post.

Interesting, will have to try!

OK, to follow up on this topic: I cannot reproduce the fragmentation rate you are observing. My tests from another thread only result in fragmentation of /dev/sda1: 952521/1908736 files (0.3% non-contiguous), and e4defrag -c also returns a score of zero, the best possible. I have yet to test du -s after an explicit fsck -D, though; I’m adding it to my test runner now. After all, synthetic scores don’t always reflect reality.

I think the fragmentation starts to appear after some months of operation with lots of new files and deletions.

I’m currently moving my data to newly formatted drives and the fragmentation is gone, at least until it slowly starts to creep back in during daily operation. If I sort the disks by fragmentation level, they end up in the exact same order as the node ages.

Sounds reasonable! Can’t even test it myself, I’ve recently moved all my nodes because of the -I 128 post…

Having worked on large projects involving lots of files being added to and deleted from the file system, I can say fragmentation is inevitable, but it can be mitigated in a number of different ways:

  • If the filesystem allows it, always tell it the final size of the file at creation time; never under- or over-allocate

  • Try to standardize the file sizes (increase them by a fixed factor and keep the maximum size below minimum size * 10); if that is not possible, use plan B

  • Plan B: use “bucket” files for small files (see the rough sketch after this list). For example, allocate huge files (1GB) and store standardized file sizes inside them (aligned with the cluster size of the file system).
    For example, if the file system has 4k clusters, create:

    • bucket_4k_01 (stores 1-bit to 4k files)
    • bucket_64k_01 (stores 4.1k to 64k files)
    • bucket_512k_01 (stores 64.1k to 512k files)
    • etc. (create as many bucket types as you need)
      You will lose a little bit of space at first (creating empty buckets, plus cluster alignment), but you will avoid most of the fragmentation and minimize IO and filesystem overhead (at the cost of having to keep an index to find the data inside the buckets)
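
To make the idea a bit more concrete, here is a rough Go sketch of such a bucket file. It is purely illustrative: the names (bucket, put, del) are made up, the index that maps a piece to its bucket and slot is left out, and a real implementation would use fallocate(2) rather than Truncate to actually reserve the blocks.

    package main

    import (
        "fmt"
        "os"
    )

    // bucket is one preallocated file divided into fixed-size slots.
    // Deleted slots are reused, so the on-disk layout never fragments.
    type bucket struct {
        f        *os.File
        slotSize int64
        next     int64   // next never-used slot
        free     []int64 // slots freed by deletions, reused first
    }

    // newBucket reserves the whole bucket file up front (e.g. 1GB).
    // Note: Truncate only sets the logical size (sparse on most filesystems);
    // real code would use fallocate(2) to actually reserve the blocks.
    func newBucket(path string, slotSize, totalSize int64) (*bucket, error) {
        f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
        if err != nil {
            return nil, err
        }
        if err := f.Truncate(totalSize); err != nil {
            f.Close()
            return nil, err
        }
        return &bucket{f: f, slotSize: slotSize}, nil
    }

    // put stores data that fits in one slot and returns the slot index.
    func (b *bucket) put(data []byte) (int64, error) {
        if int64(len(data)) > b.slotSize {
            return 0, fmt.Errorf("data larger than slot size")
        }
        var slot int64
        if n := len(b.free); n > 0 {
            // reuse a freed slot: same blocks, no new fragments
            slot, b.free = b.free[n-1], b.free[:n-1]
        } else {
            slot = b.next
            b.next++
        }
        _, err := b.f.WriteAt(data, slot*b.slotSize)
        return slot, err
    }

    // del marks a slot as reusable; nothing moves on disk.
    func (b *bucket) del(slot int64) { b.free = append(b.free, slot) }

    func main() {
        // a hypothetical 1GB bucket for files up to 4k
        b, err := newBucket("bucket_4k_01", 4<<10, 1<<30)
        if err != nil {
            panic(err)
        }
        slot, _ := b.put([]byte("small piece"))
        b.del(slot) // the next put will land on the same blocks
        fmt.Println("stored in slot", slot)
    }

The space overhead is the difference between the slot size and the actual file size, which is the “lose a little bit of space” trade-off mentioned above.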

Plan B sounds like recreating a file system within a file :thinking:
Isn’t it like reinventing the wheel?

yes and no…
has to do with block sizes… inside a file there aren’t really any block sizes…
and because the storage is preallocated, that will never become fragmented… thus one can use it as a sort of cache…

but like stated, preallocating storage means if you write only 1 byte in it, then it still takes up all the space… like say 1GB

to preallocate or not to preallocate, that is the question.
either has unique advantages and disadvantages.
and there really isn’t a best for most use cases.

with preallocated storage you will basically never have fragmentation, in theoretically optimal conditions… at least :smiley: ofc that is almost impossible in real life.


And what about the fragmentation inside of these buckets?


think of the concept as a notebook.
if you leave room for additions then the sorting won’t get fragmented.

ofc there is always the matter of scale…

this is one of the reasons why more advanced storage systems use fairly large block sizes.
smaller writes can then be written into these larger blocks, and anything added later to a small write can fit inside the same block.

however the problem with larger block sizes is that each block is read in one sweep…
so basically if you have to change, read or write 1 byte, the system will read and cache the full block.

this is then mitigated by storing writes in main memory or caches for extended periods.
an async write can be sitting in main memory for multiple minutes…

and thus if more stuff is added, it will be written later to disk in one sweep / full block.
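
just to illustrate the coalescing idea with a toy Go example (this is only the general principle, nothing to do with how ZFS actually batches its writes, and the file name is arbitrary): a bunch of tiny writes pile up in a memory buffer and reach the disk as one larger sequential write.

    package main

    import (
        "bufio"
        "os"
    )

    func main() {
        f, err := os.Create("coalesced.bin")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        // 128KB in-memory buffer: small writes accumulate here...
        w := bufio.NewWriterSize(f, 128<<10)
        for i := 0; i < 1000; i++ {
            w.Write(make([]byte, 100)) // ...a thousand 100-byte writes...
        }
        // ...and hit the disk as one ~100KB sequential write on Flush.
        if err := w.Flush(); err != nil {
            panic(err)
        }
    }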

something like the fairly new draid option for ZFS will run as a raid for the most part… but for small files, instead of writing a full default block size of like 128KB in a stripe across the raid,
it will write them to two disks in a mirror instead… thus the minimum write size goes from 128KB to usually 4k, since that is the sector size of most HDDs.

storage gets really complex really fast, as one starts to dig into the details of it.
but long story short, caches are amazing… it’s why there already exist so many caches in a computer.

you have CPU cache L1, L2, L3, Storage / HDD Cache, and RAM is also a sort of Cache.
all of these caches will often be used for storagenode operations. adding an extra SSD cache cheaply extends main memory in a sense and decreases how often the underlying storage is accessed; these kinds of things help limit fragmentation.

because fragmentation is a fundamental issue in all informational storage.

best example is a notebook, and a cache is sort of a note page, which one uses to sketch out the data before its put into the notebook in a more sorted manner.
then to take it one step further one could imagine the notebook being a cache for actually writing a book.

the problem becomes that the pages in the notebook are often fixed… which is comparable to how HDDs have fixed data locations and limited ability to read the data, because the write head can only be in a single location on the platter.

SSDs are able to read blocks from any location without mechanical delay, which is why their IOPS are so much better, making fragmentation a more limited issue.

Q1D1 is still hell tho… in most cases…


This suggests using a virtual disk file on the normal disk. It is pretty much the same thing.
You can do that now, then check how much better it is than a normal disk.


Yes and no; most file systems are very inefficient at handling millions of small files. That’s what databases are for. But I agree the buckets have diminishing returns past a certain point (when the files get large enough or if you have a reasonable number of files).

That’s the neat part: zero fragmentation, as your bucket contains (bucket size / max single file size) slots. Any deleted file is replaced by a new one on the same blocks, so no fragmentation can occur (at the cost of some empty space).

I see it more as “storing the large files on the filesystem and the small files in dedicated databases”. My storage node is large and fragmentation is very hard to contain as small files get mixed with large ones.
But I know that the Storj philosophy is more toward lots of small/medium nodes, and I’m OK with that. It’s more a thought and experience feedback than a feature request :slight_smile:


I just created a fresh new node 1 month ago, all default parameters (Windows, NTFS 4k Clusters) and I now have 53GB stored.

This is what the drive’s clusters look like:

The fact that Storj creates a 4MB empty file, writes what it needs inside, then trims it to the correct size is very bad for data allocation. It should create a file the size of the final data, not an arbitrary number.

This would only work if the entire size is pre-allocated. But yeah, that would be quite nice.

Oof, yeah. I’m pretty sure this used to be smaller (I think something around 256KB). But 4MiB makes no sense, as pieces can never be larger than 64 MiB / 29 ≈ 2.2 MiB. This setting should be at most 2.5MiB if you want to keep a small margin. You can change it in the config.yaml. Smaller might not be ideal either, as that might fragment files as they come in. It would be really great if the uplink could just communicate the actual piece size before sending, so we don’t have to use one universal setting to begin with.
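
For illustration, here is roughly what that preallocate-then-trim pattern looks like in Go. This is only a sketch of the behaviour described above, not the actual storagenode code; the 4MiB and 64/29 numbers are just taken from this discussion, and the piece.sj1 file name is illustrative.

    package main

    import "os"

    const (
        preallocSize = 4 << 20         // the 4MiB default discussed above
        maxPieceSize = (64 << 20) / 29 // ≈ 2.2MiB: a 64MiB segment split into 29 pieces
    )

    // writePiece reserves space up front, writes the piece, then trims the
    // file down to the bytes actually received.
    func writePiece(path string, piece []byte) error {
        f, err := os.Create(path)
        if err != nil {
            return err
        }
        defer f.Close()

        // Reserve preallocSize so the filesystem can pick a contiguous region.
        // (Truncate only sets the logical size; a real implementation would
        // use fallocate(2) to actually reserve the blocks.)
        if err := f.Truncate(preallocSize); err != nil {
            return err
        }
        if _, err := f.Write(piece); err != nil {
            return err
        }
        // Trim to the real size: everything between len(piece) and
        // preallocSize is handed back to the filesystem as a gap.
        return f.Truncate(int64(len(piece)))
    }

    func main() {
        // a hypothetical 1MiB piece, well below both limits above
        if err := writePiece("piece.sj1", make([]byte, 1<<20)); err != nil {
            panic(err)
        }
    }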


setting preallocate too small might also leave you with near-useless capacity between the larger blocks.

i’ve been running on the 4MiB preallocate setting with ZFS for like 1½ years now; my pool fragmentation is currently at 53%… which isn’t great, but so far as i understand it’s fine…

don’t remember how we ended up with the 4MiB number back then…
ofc ZFS being Copy on Write might also make it less of a factor.

also keep in mind HDDs do a lot of optimization with how they put down blocks of data and how best to retrieve them. i’m unsure if the “block map” you showed, @JDA, really indicates any issues.

it is a cool way to visualize it and might be very useful… but it also takes a lot of training for a doctor to evaluate an x-ray.

plus it’s a very new node; it would be interesting to see similar data from much older windows nodes running the same or similar setups for comparison.

but yeah … maybe 4MiB is a bad setting… i’ve been running it and maybe even promoting it for lack of better options…
any optimization would be great, but at present i dunno why i would switch, since it has been working fine thus far…

i’m pretty sure one wants to leave a certain sized gap for other smaller blocks to fit in-between the largest storj blocks, how much… no real clue…

doubt i would have started using it at 4MiB without a semi-good reason or argument for why it was a good choice, but i can’t remember, so it’s a moot point really… lol

maybe somebody else wiser on this could enlighten us.


OK I have a slightly older node (10 months) with 1.45TB of data and after 4h I managed to get a defrag report.
Note that I defragmented this volume a couple of times around 3 months ago.

Stats:

Unfragmented:  4 377 037 items
Fragmented:      322 718 items
Gaps:            186 392 gaps

“Big picture”

Zoom:

PS: I can’t do the same on my big node, it’s over 17TB of data and the reports will take weeks to generate.


I decided to check the ap1 folder for my oldest node for fragmentation. This is kind of a worst case scenario on ext4, as it’s a multiuse array with several nodes and other things that is usually around 90% full (and has been over 95% full at times).

Total/best extents                             3925124/3626416
 Average size per extent                        384 KB
 Fragmentation score                            1
 [0-30 no problem: 31-55 a little bit fragmented: 56- needs defrag]
 This directory (redacted) does not need defragmentation.

So doesn’t seem to be an issue on ext4 at least.