Disk fragmentation is inevitable... Do we need to prepare?

Having worked on large projects with a lot of files being added to and deleted from the file system, I can say fragmentation is inevitable, but it can be mitigated in a number of different ways:

  • If the filesystem allows it, always tell it the final size of the file at creation time; never under- or over-allocate

  • Try to standardize file sizes (increment by a fixed factor and keep the maximum size < minimum size * 10); if that is not possible, use plan B

  • Plan B: Use “bucket” files for small files. For example, allocate huge files (1GB) and store standardized file sizes inside (aligned with the cluster size of the file system).
    For example, if the file system has 4k clusters, create:

    • bucket_4k_01 (stores 1-bit to 4k files)
    • bucket_64k_01 (stores 4.1k to 64k files)
    • bucket_512k_01 (stores 64.1k to 512k files)
    • etc. (create as many bucket types as you need)
      You will lose a little bit of space at first (creating empty buckets, and cluster alignment) but you will avoid most of the fragmentation and minimize IO and filesystem overhead, at the cost of having to keep an index to find the data inside the buckets (see the sketch below).
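Here is a rough sketch of the idea (hypothetical names, not actual Storj code): pick the smallest bucket class that fits a file, and the external index only needs to remember the bucket and the slot number.

```go
package main

import "fmt"

// Hypothetical bucket classes, aligned to a 4 KiB cluster size.
// Each bucket file is preallocated up front and divided into
// fixed-size slots, so a stored file never ends up scattered.
var slotSizes = []int64{
	4 << 10,   // bucket_4k:   files up to 4 KiB
	64 << 10,  // bucket_64k:  files up to 64 KiB
	512 << 10, // bucket_512k: files up to 512 KiB
}

// pickBucket returns the smallest slot size that fits the file,
// or -1 if the file is big enough to live on the filesystem directly.
func pickBucket(fileSize int64) int64 {
	for _, s := range slotSizes {
		if fileSize <= s {
			return s
		}
	}
	return -1
}

// slotOffset computes where slot n starts inside a bucket file,
// which is what the external index stores for each small file.
func slotOffset(slotSize, n int64) int64 {
	return n * slotSize
}

func main() {
	fmt.Println(pickBucket(3000))      // 4096 -> bucket_4k
	fmt.Println(pickBucket(100 << 10)) // 524288 -> bucket_512k
	fmt.Println(slotOffset(4096, 10))  // 40960 -> slot 10 in a 4k bucket
}
```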

Plan B sounds like recreating a file system within a file :thinking:
Isn’t that like reinventing the wheel?

yes and no…
it has to do with block sizes… inside a file there aren’t really any block sizes…
and because the storage is preallocated, it will never become fragmented… thus one can use it as a sort of cache…

but like stated, preallocating storage means if you write only 1 byte in it, then it still takes up all the space… like say 1GB

to preallocate or not to preallocate, that is the question.
either has unique advantages and disadvantages.
and there really isn’t a best for most use cases.

with preallocation you will basically never have fragmentation, in theoretically optimal conditions… at least :smiley: ofc that is almost impossible in real life.
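just to make it concrete, this is roughly what preallocating looks like on Linux (a minimal sketch using golang.org/x/sys/unix, not how the storagenode actually does it):

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	const size = 1 << 30 // reserve a full 1 GiB up front

	f, err := os.Create("preallocated.dat")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// fallocate(2) reserves the space on disk right away, giving the
	// filesystem the best chance to keep the file contiguous, so later
	// writes into it won't fragment. The flip side: even if we only
	// ever write 1 byte, the full 1 GiB is gone from free space.
	if err := unix.Fallocate(int(f.Fd()), 0, 0, size); err != nil {
		log.Fatal(err)
	}

	// f.Truncate(size) instead would only set the logical size (a sparse
	// file): no space wasted, but also no protection against fragmentation.
	if _, err := f.WriteAt([]byte{0x42}, 0); err != nil {
		log.Fatal(err)
	}
}
```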


And what about the fragmentation inside of these buckets?


think of the concept as a notebook.
if you leave room for additions then the sorting won’t get fragmented.

ofc there is always the matter of scale…

this is one of the reasons why the more advanced storage systems use fairly large block sizes.
smaller writes can then be written into these larger blocks + anything added later to a small write can fit inside the block.

however the problem with larger block sizes is that each block is read in one sweep…
so basically if you have to change, read or write even 1 byte, the system will read and cache the full block.

this is then mitigated by storing writes in main memory or caches for extended periods.
an async write can be sitting in main memory for multiple minutes…

and thus if more stuff is added, it will be written later to disk in one sweep / full block.
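as a user-space analogy of what the page cache does (just a sketch, not real node code): buffer the tiny writes in RAM and only touch the disk in full, block-sized sweeps.

```go
package main

import (
	"bufio"
	"log"
	"os"
)

const blockSize = 128 << 10 // pretend 128 KiB is our "large block"

func main() {
	f, err := os.Create("coalesced.dat")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// The buffered writer plays the role of the cache: thousands of tiny
	// writes land in memory and only go to disk once a full 128 KiB block
	// has accumulated (or on the final Flush).
	w := bufio.NewWriterSize(f, blockSize)

	for i := 0; i < 100000; i++ {
		if _, err := w.Write([]byte("tiny write ")); err != nil {
			log.Fatal(err)
		}
	}

	// one last sweep for whatever is left in the buffer
	if err := w.Flush(); err != nil {
		log.Fatal(err)
	}
}
```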

something like the fairly new draid option for ZFS will run as a raid for the most part… but for small files, instead of writing a full default block size of like 128KB in a stripe across the raid,
it will write it to two disks in a mirror instead… thus the minimum write size goes from 128KB to usually 4k, because that would be the sector size of most HDDs

storage gets really complex really fast, as one starts to dig into the details of it.
but long story short, caches are amazing… it’s why there already exist so many caches in a computer.

you have the CPU caches L1, L2, L3, the storage / HDD cache, and RAM is also a sort of cache.
all of these caches will often be used for storagenode operations; adding an extra SSD cache helps cheaply extend main memory in a sense, and decreases how often the underlying storage is accessed. these kinds of things help limit fragmentation.

because fragmentation is a fundamental issue in all informational storage.

the best example is a notebook, and a cache is sort of a note page, which one uses to sketch out the data before it’s put into the notebook in a more sorted manner.
then to take it one step further, one could imagine the notebook being a cache for actually writing a book.

the problem becomes that the pages in the notebook are often fixed… which is comparable to how HDDs have fixed data locations, and limited ability to read the data, because the write head can only be in a single location on the platter.

SSDs are able to read blocks from any location without mechanical delay, which is why their IOPS are so much better, making fragmentation a more limited issue.

Q1D1 is still hell tho… in most cases…


This suggests to me using a virtual disk file on a normal disk. It’s pretty much the same thing.
You can do this now, then check how much better it is than a normal disk.


Yes and no, most file systems are very inefficient at handling millions of small files. That’s what databases are for. But I agree the buckets have diminishing returns past a certain point (when the files get large enough or if you have a reasonable number of files).

That’s the neat part, 0 fragmentation: as your bucket contains (bucket size / max single file size) slots, any deleted file is replaced by a new one on the same blocks, so no fragmentation can occur (at the loss of some empty bits).
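To illustrate (a toy sketch, hypothetical names): each deleted slot goes on a free list and the next file of that size class reuses exactly the same blocks.

```go
package main

import "fmt"

// bucketAllocator hands out fixed-size slots inside one preallocated
// bucket file. Deleted slots are recycled first, so new data always
// lands on blocks that are already reserved: no new fragmentation.
type bucketAllocator struct {
	next int64   // next never-used slot
	free []int64 // slots returned by deletions
}

func (b *bucketAllocator) alloc() int64 {
	if n := len(b.free); n > 0 {
		slot := b.free[n-1]
		b.free = b.free[:n-1]
		return slot // reuse the exact blocks of a deleted file
	}
	slot := b.next
	b.next++
	return slot
}

func (b *bucketAllocator) release(slot int64) {
	b.free = append(b.free, slot)
}

func main() {
	var b bucketAllocator
	first := b.alloc() // slot 0
	_ = b.alloc()      // slot 1
	b.release(first)   // file in slot 0 gets deleted
	fmt.Println(b.alloc()) // 0: the new file lands on the same blocks
}
```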

I see it more as “storing the large files on the filesystem and the small files in dedicated databases”. My storage node is large and fragmentation is very hard to contain as small files get mixed with large ones.
But I know that the Storj philosophy is more toward lots of small/medium nodes and I’m OK with that. It’s more a thought and experience feedback than a feature request :slight_smile:


I just created a fresh new node 1 month ago, all default parameters (Windows, NTFS 4k Clusters) and I now have 53GB stored.

This is what the drive clusters look like:

The fact that Storj creates a 4MB empty file, writes what it needs inside, then trims it to the correct size is very bad for data allocation. It should create a file the size of the final data, not an arbitrary number.

This would only work if the entire size is pre-allocated. But yeah, that would be quite nice.

Oof, yeah. I’m pretty sure this used to be smaller (I think something around 256KB). But 4MiB makes no sense as pieces can never be more than 64/29 ~= 2.2MiB. This setting should at most be 2.5MiB if you want to keep a small margin. You can change it in the config.yaml. Smaller might not be ideal either as that might fragment files as they come in. It would be really great if the uplink could just communicate the actual piece size before sending, so we don’t have to use one universal setting to begin with.
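If I remember the key correctly it’s `pieces.write-prealloc-size` (please double-check the exact name in your own config.yaml), so something like `pieces.write-prealloc-size: 2.5 MiB` followed by a node restart should do it.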


setting preallocate too small might also leave you with near-useless capacity between the larger blocks.

i’ve been running on the 4MiB preallocate setting with ZFS for like 1½ years now, my pool fragmentation is at 53% currently… which isn’t great, but as far as i understand, it’s fine…

don’t remember how we ended up with the 4MiB number back then…
ofc ZFS being Copy on Write might also make it less of a factor.

also keep in mind HDDs do a lot of optimization with how they put down blocks of data and how to best retrieve them; i’m unsure if the “block map” you showed, @JDA, really indicates any issues.

it is a cool way to visualize it and might be very useful… but it also takes a lot of training for a doctor to evaluate an x-ray.

plus it’s a very new node, it would be interesting to see similar data from much older windows nodes running the same or similar setups for comparison.

but yeah … maybe 4MiB is a bad setting… i’ve been running it and maybe even promoting it for lack of better options…
any optimization would be great, but at present i dunno why i would switch, since it has been working fine thus far…

i’m pretty sure one wants to leave a certain sized gap for other smaller blocks to fit in-between the largest storj blocks, how much… no real clue…

doubt i would have started using it at 4MiB without a semi-good reason or argument for why it was a good choice, but can’t remember, so it’s a moot point really… lol

maybe somebody else wiser on this could enlighten us.


OK, I have a slightly older node (10 months) with 1.45TB of data and after 4h I managed to get a defrag report.
Note that I defragmented this volume a couple of times around 3 months ago.

Stats:

Unfragmented:  4 377 037 items
Fragmented:      322 718 items
Gaps:            186 392 gaps

“Big picture”

Zoom:

PS: I can’t do the same on my big node, it’s over 17TB of data and the reports will take weeks to generate.


I decided to check the ap1 folder for my oldest node for fragmentation. This is kind of a worst case scenario on ext4, as it’s a multiuse array with several nodes and other things that is usually around 90% full (and has been over 95% full at times).

Total/best extents                             3925124/3626416
 Average size per extent                        384 KB
 Fragmentation score                            1
 [0-30 no problem: 31-55 a little bit fragmented: 56- needs defrag]
 This directory (redacted) does not need defragmentation.

So doesn’t seem to be an issue on ext4 at least.
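(For anyone who wants to run the same check: the report above is the output of `e4defrag -c <directory>` from e2fsprogs, if I recall correctly; the -c flag only analyzes, it doesn’t move anything.)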

As a (kinda former) (a bit drunk at this specific moment) data scientist I’d warn against trusting averages. The average might be decent, but where it actually matters (i.e. the directory entries), fragmentation might be huge. After all, if you have a set of a million elements, one of them at 1M and all the others at 1, the average is still “just” 2.

The worst files were listed and had 16 extents… So I’m not too worried about that.


my zfs is also doing fine, it’s been hosting storj for 1½ years.
has been running around 50-95% full, with changes as more storage was added.
did push it to 98% but that was too much for zfs; then i had to stop ingress and delete other stuff to free up enough space that it would work correctly again.

fragmentation is fine… i’m at 50% currently, which means 50% of the free-space holes don’t have enough space for a full record, or that’s how i remember it… the zfs fragmentation number is a bit of a complex thing when one digs into the details.

works fine for now, but i do have my eye on it…

fragmentation is also very much a result of limited capacity… just don’t go too close too often…
caches and such can also help.

been wanting to run zfs with sync=always, but that is very demanding on the hardware.
sync=always with a proper dedicated RAM-based or fast enterprise SSD SLOG is the recommended way to avoid or limit fragmentation on zfs.

basically the idea is that the data is written to the SLOG and then every 5 sec is written to raid / hdd’s in one sweep.
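for reference the knobs involved are just ordinary zfs ones… `zfs set sync=always pool/dataset` to force synchronous writes, `zpool add pool log <device>` to attach a dedicated slog, and `zpool get fragmentation pool` shows the fragmentation number i mentioned above.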

I’m still running Storj + Chia, so I’m always close to 90% and remove Chia plots when it goes over. Though I don’t really know why I still bother with Chia, it practically earns me $0. But yeah… my setup seems to be a worst case scenario because of this… and yet it doesn’t seem to be a problem.

Of course it’s very file system dependent. And my setup is ext4 with a 1TB read/write cache, which probably also helps.


That’s for sure, fragmentation is extremely related to the filesystem.

NTFS is really old… maybe using the much newer ReFS would be a better option for windows based SNO’s.

do keep in mind one cannot boot off ReFS storage, but it is made for data storage.

while NTFS was more the replacement filesystem from when microsoft’s customers went from msdos to windows… i really hope NTFS has gotten updated since then lol…

from 2012 so only 10 years old.

lol 29 years old…

My sysadmin friend said he’d never let ReFS near his production systems. This thing breaks constantly.


i guess i should have written might be… lol because i never actually used it myself.
