Disk fragmentation is inevitable... Do we need to prepare?

This suggests using a virtual disk file on the normal disk. It’s pretty much the same thing.
You can do this now and then check how much better it is than a normal disk.

1 Like

Yes and no. Most file systems are very inefficient at handling millions of small files; that’s what databases are for. But I agree the bucket approach has diminishing returns past a certain point (when the files get large enough or if you have a reasonable number of files).

That’s the neat part: zero fragmentation. Since your bucket contains (bucket size / max single file size) fixed-size slots, any deleted file is replaced by a new one on the same blocks, so no fragmentation can occur (at the cost of some empty bits).
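
To illustrate the idea (my own sketch, not anything from Storj’s code, and the 4 MiB slot size is just an assumption for the example): if every file occupies one fixed-size slot, a deleted file’s slot is simply reused in place, so the on-disk layout never fragments.

    SLOT_SIZE = 4 * 1024 * 1024  # assumed max single file size (4 MiB), for illustration only

    class Bucket:
        def __init__(self, bucket_size):
            self.slot_count = bucket_size // SLOT_SIZE
            self.slots = [None] * self.slot_count      # slot index -> piece id
            self.free = list(range(self.slot_count))   # every slot starts free

        def store(self, piece_id):
            # place the piece into any free slot; every slot has the same
            # fixed size, so reuse never fragments the layout
            slot = self.free.pop()
            self.slots[slot] = piece_id
            return slot

        def delete(self, slot):
            # freeing a slot just marks those blocks reusable in place
            self.slots[slot] = None
            self.free.append(slot)

    bucket = Bucket(1024 * 1024 * 1024)        # 1 GiB bucket -> 256 slots
    s = bucket.store("piece-a")
    bucket.delete(s)
    assert bucket.store("piece-b") == s        # the new piece reuses the same blocks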

I see it more as “Storing the large files on the filesystem and the small files in dedicated databases”. My storage node is large and fragmentation is very hard to contain as small files get mixed with large ones.
But I know that the Storj philosophy is more toward lots of small/medium nodes and I’m OK with that. It’s more a thought and experience feedback than a feature request :slight_smile:

1 Like

I just created a fresh node 1 month ago with all default parameters (Windows, NTFS 4K clusters), and I now have 53GB stored.

This is what the drive clusters look like:

The fact that Storj creates a 4MB empty file, writes what it needs inside, then trims it to the correct size is very bad for data allocation. It should create a file the size of the final data, not an arbitrary number.
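
For illustration, the pattern described above looks roughly like this (my own sketch of the described behaviour, not Storj’s actual implementation; the real code may preallocate differently):

    PREALLOC = 4 * 1024 * 1024          # the 4 MB preallocation discussed above

    def write_piece(path, data):
        with open(path, "wb") as f:
            f.truncate(PREALLOC)        # 1. create an empty 4 MB file up front
            f.seek(0)
            f.write(data)               # 2. write the actual piece data
            f.truncate(len(data))       # 3. trim back down to the real piece size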

1 Like

This would only work if the entire size is pre-allocated. But yeah, that would be quite nice.

Oof, yeah. I’m pretty sure this used to be smaller (I think something around 256KB). But 4MiB makes no sense as pieces can never be more than 64/29 ~= 2.2MiB. This setting should at most be 2.5MiB if you want to keep a small margin. You can change it in the config.yaml. Smaller might not be ideal either as that might fragment files as they come in. It would be really great if the uplink could just communicate the actual piece size before sending, so we don’t have to use one universal setting to begin with.
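
For reference, the ~2.2MiB figure is just the 64MiB segment size divided by the 29 pieces needed to reconstruct it, as quoted above:

    SEGMENT_SIZE = 64 * 1024 * 1024     # bytes, from the 64/29 figure above
    REQUIRED_PIECES = 29

    max_piece = SEGMENT_SIZE / REQUIRED_PIECES
    print(max_piece / (1024 * 1024))    # ~2.21 MiB, so 2.5 MiB keeps a small margin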

3 Likes

setting the preallocate size too small might also leave you with near-useless capacity between the larger blocks.

i’ve been running the 4MiB preallocate setting with ZFS for like 1½ years now, my pool fragmentation is at 53% currently… which isn’t great but as far as i understand, it’s fine…

don’t remember how we ended up with the 4MiB number back then…
of course, ZFS being Copy on Write might also make it less of a factor.

also keep in mind HDDs do a lot of optimization with how they put down blocks of data and how to best retrieve them. i’m unsure if the “block map” you showed, @JDA, really indicates any issues.

it is a cool way to visualize it and might be very useful… but it also takes a lot of training for a doctor to evaluate an x-ray.

plus it’s a very new node; it would be interesting to see similar data from much older windows nodes running the same or similar setups for comparison.

but yeah … maybe 4MiB is a bad setting… i’ve been running it and maybe even promoting it for lack of better options…
any optimization would be great, but at present i dunno why i would switch, since it has been working fine thus far…

i’m pretty sure one wants to leave a certain-sized gap for other smaller blocks to fit in between the largest storj blocks, but how much… no real clue…

doubt i would have started using it at 4MiB without a semi-good reason or argument for why it was a good choice, but i can’t remember, so it’s a moot point really… lol

maybe somebody else wiser on this could enlighten us.

1 Like

OK, I have a slightly older node (10 months) with 1.45TB of data, and after 4h I managed to get a defrag report.
Note that I defragmented this volume a couple of times around 3 months ago.

Stats:

Unfragmented:  4 377 037 items
Fragmented:      322 718 items
Gaps:            186 392 gaps

“Big picture”

Zoom:

PS: I can’t do the same on my big node, it’s over 17TB of data and the report would take weeks to generate.

3 Likes

I decided to check the ap1 folder for my oldest node for fragmentation. This is kind of a worst case scenario on ext4, as it’s a multiuse array with several nodes and other things that is usually around 90% full (and has been over 95% full at times).

Total/best extents                             3925124/3626416
 Average size per extent                        384 KB
 Fragmentation score                            1
 [0-30 no problem: 31-55 a little bit fragmented: 56- needs defrag]
 This directory (redacted) does not need defragmentation.

So doesn’t seem to be an issue on ext4 at least.

As a (kinda former) (a bit drunk at this specific moment) data scientist, I’d warn against trusting averages. The average might be decent, but where it actually matters (i.e. the directory entries), fragmentation might be huge. After all, if you have a set of a million elements, one of them at 1M and all the others at 1, the average is still “just” 2.
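
A quick illustration of that point about averages:

    # one element at 1,000,000, the other 999,999 elements at 1
    values = [1_000_000] + [1] * 999_999
    print(sum(values) / len(values))    # ~2.0, the outlier barely moves the mean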

1 Like

The worst files were listed and had 16 extents… So I’m not too worried about that.

1 Like

my zfs is also doing fine, it’s been hosting storj for 1½ years.
has been running around 50-95% full, with changes as more storage was added.
did push it to 98%, but that was too much for zfs; then i had to stop ingress and delete other stuff to free up enough space that it would work correctly again.

fragmentation is fine… i’m at 50% currently, which means 50% of the holes don’t have enough space for a full record, or that’s how i remember it… the zfs fragmentation number is a bit of a complex metric when one digs into the details.

works fine for now, but i do have my eye on it…

fragmentation is also very much a result of limited capacity… just don’t go too close too often…
caches and such can also help.

been wanting to run zfs with sync always, but that is very demanding on the hardware.
sync always with a proper dedicated ram based or fast enterprise ssd slog, is the recommended way to avoid or limit fragmentation on zfs.

basically the idea is that the data is written to the SLOG and then, every 5 sec or so, it is written out to the raid / hdds in one sweep.

I’m still running Storj + Chia, so I’m always close to 90% and remove Chia plots when it goes over. Though I don’t really know why I still bother with Chia, it practically earns me $0. But yeah… my setup seems to be a worst case scenario because of this… and yet it doesn’t seem to be a problem.

Of course it’s very file system dependent. And my setup is ext4 with a 1TB read/write cache, which probably also helps.

1 Like

That’s for sure, fragmentation is extremely dependent on the filesystem.

NTFS is really old… maybe using the much newer ReFS would be a better option for windows-based SNOs.

do keep in mind one cannot boot off a ReFS volume, but it is made for data storage.

while NTFS was more the replacement filesystem from when microsoft’s customers went from msdos to windows… i really hope NTFS has gotten updated since then lol…

from 2012 so only 10 years old.

lol 29 years old…

My sysadmin friend said he’d never let ReFS near his production systems. This thing breaks constantly.

4 Likes

i guess i should have written “might be”… lol, because i never actually used it myself.

1 Like