Has anyone experimented with NTFS cluster size on Windows for Storj data?

Hi,
Considering the average blob file size is around 2MB, the default NTFS cluster size (4k) seems inappropriate. A 64k cluster would reduce MFT size and fragmentation, but depending on Storj's IO size it might increase IOPS…

Has anyone done some tests, or does anyone have more information?

Yes, saw that… but that doesn't answer my question. On a 10TB drive with more than 4 million files of around 2MB each, NTFS with 4k clusters is very suboptimal…
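For reference, this is roughly what I'd use to check the current cluster size and to reformat with a bigger one. It's only a sketch: D: is a placeholder for the data drive, and Format-Volume of course wipes the volume, so the node data has to be moved off first.

fsutil fsinfo ntfsinfo D:                                                  # shows "Bytes Per Cluster" for an existing NTFS volume
Get-Volume -DriveLetter D | Select-Object FileSystem, AllocationUnitSize  # same info via PowerShell
Format-Volume -DriveLetter D -FileSystem NTFS -AllocationUnitSize 64KB    # reformat with 64k clusters (destroys all data on the volume!)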

well, it's also quite a large range… 4k to 64k is a 16x increase,
which would reduce your IO by roughly that same factor.

ZFS runs a 128k recordsize on top of 8k actually written to disk, at least in Proxmox… tho I'm told that writing anything smaller than 64k to 4k-sector drives decreases their performance.

most likely due to the modern trend toward larger files rather than many small ones

ofc the space wasted can get bigger the larger the blocks you write at a time. I'm currently running a 256k recordsize in ZFS on 8k; 256k is one of the block sizes that fits best into the max Storj data file, wasting only a few %

2319872 bytes per file / 1024 = 2265.5k, meaning Storj picked a weird size.
But if you imagine it written out in blocks and then look at how full the last block is… that's how much waste you get.

The bigger the blocks, the more waste can be created, and the more bandwidth is used for writing, say, 1kb to the drive if every block must be 64k or 256k.

And then there are the previously mentioned IO considerations… I wouldn't stay on 4k blocks… if memory serves 32k is also pretty good, but you should really check… maybe I should make a list…
That max size shows up on almost every file, so much so that you barely need to check how much space you've used; just count the number of files and multiply by that size.

The larger the blocks get, the less IO advantage you gain, because you will always be required to use a minimum amount per file. So just going up to maybe 16k could be a great idea… tho then we have to make sure that 16k isn't a terrible fit for the file size.

2265.5kb file / 16k = 141.59

The fraction left over after the whole number is what's interesting… first off, each block would be less than 1% of the file, because there are 141 of them, so you would never get a deviation bigger than 1% in added space usage.

The .59 is what is written in the last block… meaning about 41% of that block is empty.
And you would have 142 parts of the file that the hardware has to figure out how to fetch, which can put a strain on a bad system, but most are pretty smart about it today; if the system knows you want all those pieces, it might just fetch them all at once if they are written sequentially on disk.
Still, smaller blocks can and will in some cases cause additional IO, while larger blocks waste space…

If you have a block size of 512k, then the whole file is written in 4.42 blocks… so 4 are full and the 5th, i.e. 20% of the total space used, is less than half filled, meaning something like 10% of your disk space would be wasted. But you would be down near or at the minimum possible IO needed to move the file, no matter the block size.
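To put numbers on it, here's a little sketch (PowerShell, same as the scripts later in the thread) that redoes this arithmetic for the 2319872-byte max piece size across a few cluster sizes; the 141.59 blocks at 16k and the 4.42 blocks at 512k above fall straight out of it:

$pieceSize = 2319872   # max Storj piece size in bytes, as used above
foreach ($cluster in 4KB, 16KB, 32KB, 64KB, 256KB, 512KB) {
    $clusters  = [math]::Ceiling($pieceSize / $cluster)   # clusters actually allocated per piece
    $allocated = $clusters * $cluster                     # bytes taken on disk
    $wastePct  = ($allocated - $pieceSize) / $allocated * 100
    '{0,4:N0}k clusters: {1,4} per piece, {2,4:N1}% of the allocation wasted' -f ($cluster / 1KB), $clusters, $wastePct
}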

Ofc in this case, if your system wants to change one bit of your database on the drive… it will read the block into memory and (I think) then rewrite the entire block, meaning that 512k vs 4k could in some cases require 128 times the memory bandwidth; it would also allocate 128x the memory for such tasks even if most of it is empty, and the drive would spend 128x longer writing zeroes.

So it very much becomes a trade-off… stick around the recommended defaults for whatever system you are working with, within sort of big-database spec, but adjust the block sizes depending on whether you need more IO (working with a single disk) or want basically minimal IO at other costs.

16k looks pretty good, less than 1% wasted space… that's pretty neat… 32k is also pretty solid.
Above 64k I wouldn't recommend it; the costs most likely outweigh the benefits… tho ZFS runs a default recordsize of 128k, but I don't think that's the regular block size, because it also has a volblocksize, which for me is 8k and limits my throughput…
So yeah, I would recommend

16k or 32k blocksize

Tho I wouldn't call myself an expert on the topic. I sure would like to get rid of the 8k volblocksize on my ZFS pool… people say I should go 64k for throughput… but I'm pretty sure I'll go with 16k or 32k so I can also run regular things on the pool…


You forgot about the SQLite databases. If you created them on a 4k block size, they will use a matching page size. If you then move to a different block size, you can see unexpected behavior from the databases, because they will still use the previous page size.
I would not touch the default block size unless you have experience with all components of the system.
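If someone does want to see what page size their node databases actually use, SQLite will tell you. This is only a sketch; it assumes the sqlite3 command-line tool is installed, uses bandwidth.db as an example, and the node should be stopped first:

# Print the page size the database was created with
sqlite3 .\bandwidth.db "PRAGMA page_size;"
# Changing it afterwards would need "PRAGMA page_size = <new>; VACUUM;", and that no longer
# works once the database is in WAL mode, which is one more reason to leave the defaults alone.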


I agree about the databases… and it's the main reason I'm on the team of those who think we should be able to relocate the .db files to another drive. If that were the case, I could keep the DBs on an SSD. :confused:

I just checked: 6-month node, 10TB of data, a 700 MB database in more than 4000 fragments… that hurts too. I have a pretty fast disk, but that can't be good.
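(In case anyone wants to check their own: one way, assuming Sysinternals Contig is downloaded and the path is adjusted to wherever your node keeps its databases, is its analyze-only switch.)

contig.exe -a "D:\storagenode\storage\bandwidth.db"   # -a = analyze only, reports the number of fragments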

The fact is that most filesystems run their own block sizes depending on the size of the system and what it's doing. And why is 4000 fragments a bad thing… I mean, if it uses the fragments to locate parts of the database…

On top of that, a 700MB database that is heavily used should be mostly in memory and only periodically written to disk for backup purposes, which is most likely why it's so fragmented; it would be wasteful to overwrite data that remains the same.

I don't know a ton about the database subject, but I would bet it's very well optimized for exactly what it's doing, and people tinkering with it are most likely going to cause worse performance issues than the ones they are trying to reduce.

If I were to hazard a guess, the biggest change to a database from using a different block size would be how much allocated space it takes on disk.

Maybe the block size could also affect its memory usage, tho… if the blocks are transferred directly, which I believe is often the case.

I'm currently on an 8k block size, but that runs like crap on my raidz1 array, because it's striped across all the drives, which maxes out their IO fairly easily at rather low throughput.
Which is why I'm forced to move to 16k or 32k block sizes; tho for ZFS people claim one should run 64k, that just seems very high to be honest.

On top of that, ZFS will compress the blocks if possible… but at that level stuff gets so advanced I've got no clue… lol. Certainly, using a small block size is mostly a single-disk thing; ofc there are also some database recommendations for setting this stuff up, but mostly that's when dealing with huge databases that won't fit easily in RAM.

I have no idea how they manage the database, but looking at the RAM consumption it is clearly not fully loaded into memory.

Are you looking at the storagenode in Docker? Mine usually starts at around 100MB and then slowly grows in memory… but its disk usage is completely flatlined, so I have to assume it reads and writes directly to host storage, which means it might be the host that picks up on the repeated use of the database and decides to keep the DB in host RAM.

I know stuff like my ZFS ARC will do that, and if the storagenode then also kept it in RAM, you would end up with double allocation…

I would bet the Docker container will not even attempt to keep the database in RAM because of that; it's considered a host-system task, and that's one of the advantages gained from writing the data directly to the physical disk instead of to an emulated drive inside the container.

And on the subject of NTFS cluster size, look at what the Microsoft documentation says about modest database use and file sizes around 2MB.

It basically just says… stick to 4k in 99.9% of all cases, at least for most consumers. Their issue with larger clusters is that one needs to read the entire cluster into memory and then write it back, like I think I mentioned earlier, thus slowing your throughput… this doesn't apply to ZFS because it runs with active compression.

No, native Windows Service.

Might be the same thing… the OS might load it into RAM outside the service… not sure tho… but in any case there are many levels of caches of differing speeds to help the system mitigate constant demands for reads and writes on disk; hell, some stuff related to the database is most likely also stored in the CPU's caches, at least at some points in time.

Just like your HDD will use its cache to make reads and writes more sequential if possible / allowed.
How it keeps track of not double-allocating is beyond me, but I'm sure the system tries… lol

Our computers these days are so damn advanced I doubt anybody really understands it all… the best we can do is run tests and see what works and what doesn't, or refer to tests performed by others.

Some big projects work with databases that can barely fit inside the RAM of modern servers, and that's in the terabyte range… even my decade-old server can go up to 288GB of RAM.
From what I understand, RAM is where you want your database if it's heavily used;
otherwise you can wave goodbye to any kind of good performance.

I know this is an old thread, but I accidentally ended up doing my own measurements…!

I'd got a spare 6TB disk lying around and didn't think to format it before using it. It turns out that because I was previously using it for 4K video files, I'd set it up with a 2MB cluster size (on exFAT). This means that every single file stored takes 2MB (or 4MB) of space!

As part of my investigations I've found a lot of tiny files (as small as 1KB), so I'd very much recommend sticking with a 4KB cluster size in general.

Stats

  • I've got about 1.7 million files in my Storj folder, totalling 600GB of data but taking up 3.73TB of disk space
  • 282,594 (16%) of the files are less than 4KB in size
  • 1,016,940 (59%) of the files are between 4KB and 64KB in size
  • 130,234 (8%) of the files are between 64KB and 1MB in size
  • 262,817 (15%) are greater than 1MB in size
  • According to some very rough maths, I'd only be "wasting" 4GB of disk space if I were working with a 4KB cluster size, as opposed to the 3+TB I'm wasting now!
  • A 512-byte cluster size (which NTFS supports) would only waste 650MB, but I don't know how much more space the metadata would take for a cluster size that small

Now to work out where to put 600GB of data while I reformat this disk…!
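If anyone wants to redo the rough maths on their own node, this is the kind of slack calculation I mean (a sketch only): run it from the Storj data folder, adjust $clusterSize, and note it ignores filesystem metadata overhead.

$clusterSize = 4KB   # cluster size to simulate
$files = Get-ChildItem -Recurse -File
$slack = ($files | ForEach-Object {
    # space the last, partially filled cluster of each file would waste
    [math]::Ceiling($_.Length / $clusterSize) * $clusterSize - $_.Length
} | Measure-Object -Sum).Sum
'{0:N0} files, {1:N2} GB of slack at a {2}k cluster size' -f $files.Count, ($slack / 1GB), ($clusterSize / 1KB)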


I never played with anything past 64k clusters, and I would not recommend 2MB for STORJ :slight_smile: Thanks for the stats!

Hello @MatthewSteeples,
Welcome to the forum!

Please consider replacing it with NTFS (or any other native filesystem for your OS) ASAP; you will have problems sooner or later, and that's not counting the inefficient use of space: Search results for 'exfat' - Storj Community Forum (official)


Hello! Do you know how to do this kind of analysis on Windows?
I tried to find tools for it, but couldn't.

Something like the following:

$4k = 4 * 1024
$64k = 64 * 1024
$1mb = 1024 * 1024

Get-ChildItem -Recurse | Where-Object {$_.length -le $4k } | Measure-Object -sum Length -Average -Maximum -Minimum
Get-ChildItem -Recurse | Where-Object {$_.length -gt $4k } | Where-Object { $_.length -le $64k } | Measure-Object -sum Length -Average  -Maximum -Minimum
Get-ChildItem -Recurse | Where-Object {$_.length -gt $64k } | Where-Object { $_.length -le $1mb } | Measure-Object -sum Length -Average  -Maximum -Minimum
Get-ChildItem -Recurse | Where-Object {$_.length -gt $1mb } | Measure-Object -sum Length -Average  -Maximum -Minimum

You can use 4KB, 64KB, 1MB instead, i.e.

Get-ChildItem -Recurse | Where-Object {$_.length -le 4KB } | Measure-Object -sum Length -Average -Maximum -Minimum
Get-ChildItem -Recurse | Where-Object {$_.length -gt 4KB && $_.length -le 64KB } | Measure-Object -sum Length -Average  -Maximum -Minimum
Get-ChildItem -Recurse | Where-Object {$_.length -gt 64KB && $_.length -le 1MB }| Measure-Object -sum Length -Average  -Maximum -Minimum
Get-ChildItem -Recurse | Where-Object {$_.length -gt 1MB } | Measure-Object -sum Length -Average  -Maximum -Minimum

Thanks @MatthewSteeples @Alexey. I figured it out.
The PowerShell commands from @Alexey worked for me, but I had to update PowerShell because the && operator did not work.

It should also work if you replace && with -and; I just forgot about old PowerShell versions…
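For reference, the same breakdown with -and, which filters both conditions inside a single Where-Object and runs on old Windows PowerShell as well as PowerShell 7:

Get-ChildItem -Recurse | Where-Object { $_.Length -le 4KB } | Measure-Object -Sum Length -Average -Maximum -Minimum
Get-ChildItem -Recurse | Where-Object { $_.Length -gt 4KB -and $_.Length -le 64KB } | Measure-Object -Sum Length -Average -Maximum -Minimum
Get-ChildItem -Recurse | Where-Object { $_.Length -gt 64KB -and $_.Length -le 1MB } | Measure-Object -Sum Length -Average -Maximum -Minimum
Get-ChildItem -Recurse | Where-Object { $_.Length -gt 1MB } | Measure-Object -Sum Length -Average -Maximum -Minimum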
