Pre-allocate .partial files to final size to avoid file system fragmentation

JDA · September 7, 2021, 9:34am

Currently the “.partial” temporary files are written in a way that usually results in file fragmentation (blocks larger than 2MB end up in 2 to 10 fragments), and on big nodes with millions of files created and deleted it adds up.

On Windows, for example (sorry, it's the only platform I know well enough), you can preallocate a file with zeros on creation. This forces the file system to try to allocate the file in contiguous space and minimises fragmentation.

API call used:


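HANDLE handle = CreateFile(fileName, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
if (handle != INVALID_HANDLE_VALUE) {
        // Preallocate the file. 2048 * 0x10000 bytes is 128 MiB; use 2048 * 0x100000 for 2 GiB.
        LARGE_INTEGER size;
        size.QuadPart = 2048 * 0x10000;
        ::SetFilePointerEx(handle, size, NULL, FILE_BEGIN);  // move the file pointer to the target size
        ::SetEndOfFile(handle);                              // extend the file up to the file pointer
        ::SetFilePointer(handle, 0, NULL, FILE_BEGIN);       // rewind before writing the data
}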

SetFileValidData() can also be used, not sure if better.

With the system tools:
fsutil file createnew filename filesize

Ideally the client should preallocate the file to its final size and not to an arbitrary one (4MB default).
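For comparison, here is a minimal sketch of the same idea on Linux/POSIX, using posix_fallocate() to reserve the blocks up front. The helper name create_preallocated and the file name in the usage note are made up for illustration; this is not taken from the storagenode code, and whether the file system hands out contiguous space is still up to the file system.

```cpp
#include <cassert>
#include <fcntl.h>      // open, posix_fallocate
#include <sys/stat.h>   // fstat, struct stat
#include <sys/types.h>  // off_t
#include <unistd.h>     // close, unlink

// Hypothetical helper (not Storj code): create `path` and reserve `size`
// bytes for it up front. Returns an open descriptor, or -1 on failure.
int create_preallocated(const char* path, off_t size) {
    assert(size > 0);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    // posix_fallocate returns 0 on success or an error number (errno is not set).
    if (posix_fallocate(fd, 0, size) != 0) {
        close(fd);
        unlink(path);  // do not leave a half-created file behind
        return -1;
    }
    return fd;  // the file now reports `size` bytes; the caller writes from offset 0
}
```

Calling create_preallocated("piece.partial", 4 * 1024 * 1024) would reserve 4 MiB before any data is written; the caller still has to close the descriptor and, if the final piece is smaller, truncate the file to its real size.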

SGC · September 7, 2021, 9:37am

@Alexey @littleskunk i think this should be given priority, if true…
i must admit i haven’t been tracking my fragmentation of the data, so can’t really say, but i could easily imagine it as a coding oversight.

since the issues it creates will only get worse over time, and it will take many months if not years to mitigate the effects on the stored data without forcing a total rewrite of all stored data.

BrightSilence · September 7, 2021, 10:05am

@JDA if you want people to be able to vote you should move it to this subcategory.

littleskunk · September 7, 2021, 10:49am

There are a few config options available. Can we verify if one of them would change the behavior already?

filestore.write-buffer-size: 3MB
pieces.write-prealloc-size: 3MB

I never checked what these options do. I just noticed that 2.something MB is the biggest piece on disk. Let's allocate 3MB everywhere. I have the RAM for it, so even if these options are useless I just set them on my node.

JDA · September 7, 2021, 11:20am

In my config.yaml I have:

filestore.write-buffer-size: 128.0 KiB

Which seems to be the default value, and I can’t find pieces.write-prealloc-size. Is it new? Is there a way to get the latest version of the config.yaml without reinstalling?

I changed both values to:

filestore.write-buffer-size: 4MiB
pieces.write-prealloc-size: 4MiB

JDA · September 7, 2021, 11:21am

Thank you, I have done that, I’m not a forum expert

SGC · September 7, 2021, 12:04pm

not sure i would increase the write buffer, but certainly the pieces.write-prealloc-size.

wouldn’t the write buffer eat a ton of memory in case of massive numbers of small files, if set that high?

another question with pieces.write-prealloc-size is whether it will force even a 1KB file to take up the 3MB as suggested.
that may lead to a capacity problem, i dunno…

the optimal solution would ofc be if the storagenode knew what filesize it will end up having…
and could then allocate the space accordingly

@littleskunk
can a 1KB file grow to a 2.4MB file or whatever they are, or is it static and will never change?

ofc without a detailed understanding of how it actually operates i’m just making a lot of assumptions.

littleskunk · September 7, 2021, 12:20pm

64MB / 29 pieces = 2.2 MB. We want to change the Reed-Solomon settings and also the segment size to make this whole calculation better for storage node operators, but that will take time. Nothing that I would expect to see tomorrow.
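The arithmetic above can be spelled out as a tiny sketch. The function name piece_size_mb is made up for illustration; the 64 MB segment size and the 29 pieces are the numbers from this post.

```cpp
#include <cassert>

// Back-of-the-envelope piece size: one segment split into `pieces` equal parts.
// Illustration only; the inputs (64 MB, 29 pieces) come from the post above.
double piece_size_mb(double segment_mb, int pieces) {
    assert(pieces > 0);
    return segment_mb / pieces;
}
```

piece_size_mb(64.0, 29) comes out at roughly 2.2, which is why a 3MB preallocation covers the largest piece with some room to spare.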

littleskunk · September 7, 2021, 12:24pm

There is a nice trick. write-prealloc-size = WritePreallocSize in the code. This way you can search for any value in the code and end up here: storj/storagenode/pieces/store.go at 10372afbe423909e9993588a2df710e6e9cf1c6d · storj/storj · GitHub

Another nice trick is to run storagenode run --help and find it there.

JDA · September 7, 2021, 12:26pm

If I understand correctly, the pieces.write-prealloc-size setting only applies during the “upload”: once the piece is there, the file is resized to the correct size and moved. (For information, the default is 4MB.)

The implication is that it should not overuse space, but for reducing fragmentation it’s sub-optimal because, for example:

You ask the OS to reserve, let’s say, 4 files of 4MB each, then you resize the finished uploads to 1MB, 2MB, 3MB and 4MB.
The result on the disk side is then:

1MB DATA, 3MB FREE
2MB DATA, 2MB FREE
3MB DATA, 1MB FREE
4MB DATA

This will lead to further fragmentation.
The best practice is to set the preallocated size to the final file size if it’s known.
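The example above can be written down as a toy model. It assumes, purely for illustration, that every reservation is placed back to back and that shrinking a file frees the tail of its reservation as a hole; real file systems are free to allocate differently. The function names are made up.

```cpp
#include <cassert>
#include <vector>

// Toy model (an assumption, not how any real file system allocates):
// each upload reserves `reserved_mb`, and truncating to the final size
// frees the tail of that reservation as a hole between files.
std::vector<int> holes_after_truncate(int reserved_mb, const std::vector<int>& final_mb) {
    std::vector<int> holes;
    for (int size : final_mb) {
        assert(size >= 0 && size <= reserved_mb);
        holes.push_back(reserved_mb - size);
    }
    return holes;
}

// Total free space left scattered between the files, in MB.
int total_hole_mb(const std::vector<int>& holes) {
    int total = 0;
    for (int h : holes) total += h;
    return total;
}
```

With a 4MB reservation and final sizes of 1, 2, 3 and 4MB, the model leaves holes of 3, 2, 1 and 0MB: 6MB of free space split into three separate gaps. Reserving the final size instead would leave no holes in this model.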

SGC · September 7, 2021, 4:38pm

ofc holes in the data are much preferable to one file being in multiple different places, because those holes can be filled with other smaller blocks of incoming data.

ofc if it always allocates 3 or 4mb then that might not be possible
also not quite sure how this would work with zfs because zfs compresses data with lz4 so even if something tries to allocate space by say filling the remaining empty file with zeros then that would just get compressed…

you may be able to increase the time of the write caching, which then might put more partial files down as a full file rather than .partial

Doom4535 · September 7, 2021, 6:22pm

Another benefit of this is that it should further reduce seeking on SMR drives. Maybe the setting could keep x number of preallocated fragments and then, when a piece matches, use a preallocated segment (this would mean the system now has to track unallocated fragments, though). The number of preallocated fragments could be reduced as the drive fills up, until it is zero at max fill.

Pac · September 7, 2021, 8:50pm

The other day I deleted 4TB of data scattered among only 40 files (chia plots of 100GB each), on an 8TB SMR disk (ext4) storing 2TB of Storj files.

That took close to a whole minute to do!
Could it have been because of such a high fragmentation?

I didn’t run any check on the disk yet as the node shows no sign of audit failure, but maybe I should… Not sure what kind of checks can be run while the partition is mounted and being used though, except for badblocks.

JDA · September 7, 2021, 8:53pm

It’s impossible to answer that. But I advise you not to mix different data types on the same partition.
Chia plots should be on a drive with the largest cluster size supported by your file system, to minimise the file system database size. Storj… well, it really depends on how much data you have.

Doom4535 · September 8, 2021, 1:50pm

If you’re going to reuse the SMR drives, and they don’t support TRIM (supposedly some of the newer ones do), you may want to check out ‘re-zeroing’/‘refreshing’ them:

Alexey · September 9, 2021, 8:08pm

8 posts were split to a new topic: Pretty sure trim is mainly an ssd thing