Is it possible to move the "storage\temp\blob-*.partial" files to a different folder?

Hi,

Is there a supported way to move the “storage\temp\blob-*.partial” files to a specific folder not below the “<Storage2.Database-Dir>”?

I personally would like to move those to a temporary SSD to avoid excessive fragmentation on the data drive, as my storage drive is huge and it is becoming impractical to run maintenance on it.

3 Likes

it wouldn’t be worthwhile, because when they are complete they would need to be transferred to the hdd, so you would double the iops required for the partial file while the io on the hdd stays the same.

while on the hdd it’s already in the file table, so when a piece completes just the name changes, which is about the most minimal io you could get.
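
roughly this write-then-rename pattern, as a minimal sketch (paths and the final piece name are just illustrative, not the actual storagenode code):

    package main

    import (
        "log"
        "os"
    )

    func main() {
        // the upload lands in a temp file first...
        tmp, err := os.CreateTemp("storage/temp", "blob-*.partial")
        if err != nil {
            log.Fatal(err)
        }
        if _, err := tmp.Write([]byte("piece data...")); err != nil {
            log.Fatal(err)
        }
        if err := tmp.Close(); err != nil {
            log.Fatal(err)
        }
        // ...and is committed with a rename. on the same filesystem this
        // is a metadata-only update in the file table; no data blocks move.
        // putting temp on another drive would turn this into a full copy.
        if err := os.Rename(tmp.Name(), "storage/blobs/example-piece.sj1"); err != nil {
            log.Fatal(err)
        }
    }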

to avoid fragmentation you shouldn’t fill storage solutions above 80%; it shouldn’t be too much of a problem then.

i like your idea tho… maybe it’s possible for storj to create a write cache option in the software, sort of an expansion of what we can already do with the database.

This is the job of the underlying OS, not the application. The same is true for integrating RAID.
Please use tools that can do this job better.

I know that it will cost double IOPs, but:

  • I have Write-Intensive SSD space to spare
  • Small random IOPs (typical of the *.partial files) are bad for a magnetic drive, so I would rather have the small random IOPs on the “cache drive” and then a few large sequential IOs to move the data to the HDD. It is much more efficient and will prevent most of the fragmentation.

I know I am an edge case, but I wanted to know if a supported way exists before I start testing alternatives :slight_smile:

1 Like

I agree the cache option is not ideal and you have to take care of too many scenarios (crashes, shutdowns, etc.)

you could do so using zfs, setting up a special device for small blocks while using something like a 1-2MB recordsize

ofc it would also handle all small blocks / records below a certain size for long term storage; been thinking about doing something like that myself
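
roughly like this, as a sketch (assumes a pool named tank with the node data in tank/storagenode and a pair of ssds for the special vdev; sizes are just examples):

    # add a mirrored special vdev on the ssds for metadata + small blocks
    zpool add tank special mirror /dev/ssd1 /dev/ssd2

    # blocks <= 128K land on the ssds; keep this below the recordsize,
    # otherwise every data block ends up on the special vdev
    zfs set special_small_blocks=128K tank/storagenode

    # large recordsize so full pieces stay in big contiguous records on the hdds
    zfs set recordsize=1M tank/storagenode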

I’m using Windows currently. I don’t have issues with the RAID part, just trying to optimise where the IOPs go. Moving the DB was the first step; now the next obvious part is the “Temp” folder.

If there is no supported way to do it I’ll try directory junctions (at my own risk, obviously).
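
For reference, the junction I have in mind would be something like this (paths are only examples; the node would be stopped first):

    rem move the existing temp folder to the SSD, then junction it back
    robocopy "D:\storj\storage\temp" "S:\storj-temp" /MOVE /E
    rmdir "D:\storj\storage\temp"
    mklink /J "D:\storj\storage\temp" "S:\storj-temp"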

can’t you add a write cache for your setup rather than jerry-rig your own solution? i think it’s a bit like trying to build one’s own database or host a mail server… it is simply too risky, too complicated and too much work.

i would certainly stay away from non-verified solutions.

There are 2 issues:

  • Write-Cache (this one in my case is already in place, at the disk and RAID level)
  • Avoiding fragmentation, mostly caused by creating an empty file and filling it in multiple passes

well that is a question of the software allocating the storage space required for the full size of the file upon its initial creation.

if the storagenode doesn’t do this, it really should…
i know stuff like torrent clients will do this to minimize the io and fragmentation.

and afaik, it doesn’t come with much overhead since the disk head basically passes over the area anyway.
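
something like this, conceptually (just a sketch; Truncate is the portable stand-in here, a real implementation would use fallocate on linux or an equivalent windows call):

    package main

    import (
        "log"
        "os"
    )

    // grow the file to its final size before writing any data, so the
    // filesystem gets a chance to reserve one contiguous extent instead
    // of allocating blocks on every flush.
    func createPreallocated(path string, size int64) (*os.File, error) {
        f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY, 0o644)
        if err != nil {
            return nil, err
        }
        // caveat: on many filesystems Truncate only produces a sparse
        // file; fallocate(2) / SetEndOfFile actually reserve the blocks.
        if err := f.Truncate(size); err != nil {
            f.Close()
            return nil, err
        }
        return f, nil
    }

    func main() {
        f, err := createPreallocated("blob-example.partial", 2<<20) // 2 MiB
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        // ...write the piece data into the reserved space...
    }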

When *.partial files are finished, they are moved to the storage location. So, the data would be transferred to the SSD and then moved to the HDD every time; it would not be cached at all. Just pure double IO and longer time, adding latency to the HDD operations (slowing down read operations, for example) and a new point of failure.

If you want to use an SSD as a cache, you can take a look at

Haha, I didn’t want to point to this example, but you are correct :slight_smile:

From the statistics I gathered on my node (20TB drive, 14TB used by Storj data), most of the blob files > 2MB have between 2 and 10 fragments. It does not look like a lot, but with millions of files it adds up and makes the file system table much larger than it should be. The more data is deleted/added, the worse it gets.

The thing is that if you want to do this correctly it is very OS-dependent. On Windows, for example, it’s a very specific call to create a file of a certain size and ask the system for this file to be reserved on a contiguous space if possible, even while still empty. The default API call creates an empty virtual file, and blocks are reserved every time you flush data to it.

The default API call is faster because the OS doesn’t look for a contiguous space.
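
To be concrete, the reserving variant on Windows looks roughly like this (a sketch in Go using golang.org/x/sys/windows; the helper name is illustrative only):

    package prealloc

    import (
        "golang.org/x/sys/windows"
    )

    // Sketch: extending a normal (non-sparse) file with SetEndOfFile makes
    // NTFS allocate the clusters immediately, so it can try to pick a
    // contiguous run instead of growing the file flush by flush.
    func createReserved(path string, size int32) (windows.Handle, error) {
        p, err := windows.UTF16PtrFromString(path)
        if err != nil {
            return windows.InvalidHandle, err
        }
        h, err := windows.CreateFile(p,
            windows.GENERIC_WRITE, 0, nil,
            windows.CREATE_NEW, windows.FILE_ATTRIBUTE_NORMAL, 0)
        if err != nil {
            return windows.InvalidHandle, err
        }
        // Move the file pointer to the final size (32 bits is enough for
        // piece-sized files) and set end-of-file there: the clusters are
        // reserved even though nothing has been written yet.
        if _, err := windows.SetFilePointer(h, size, nil, windows.FILE_BEGIN); err != nil {
            windows.CloseHandle(h)
            return windows.InvalidHandle, err
        }
        if err := windows.SetEndOfFile(h); err != nil {
            windows.CloseHandle(h)
            return windows.InvalidHandle, err
        }
        return h, nil
    }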

Again, the cache is not an issue. Fragmentation is.

Forcing Storj to move data between the “temp” folder and the destination costs double IOPs, I know, but it forces the OS to create the file in a single flush (without fragments, if possible).

1 Like

i think you need to take this argument to the suggestions, so we could get this changed…
fragmentation by design should be avoided, it’s most likely a simple oversight… storage can get really complicated.

make the suggestion for getting it fixed so we can avoid further fragmentation… i sure would like the reduction in storagenode IO…

been fighting to get my nodes 100% operational for a week lol

Fixing fragmentation will not reduce IOPs. It can reduce overall drive latency, though.

reducing seek thrashing… will give more sequential reads / writes and thus more iops throughput.
so yeah it might not directly in theory give less iops, but it will increase the number of iops a hdd can perform since it will not be seeking in the middle of file blocks.

I don’t know if @Alexey can tell us whether they are willing to work on optimizing file creation to avoid fragmentation?

there is a suggestion category, where we vote in new features… this certainly sounds like something everyone would want fixed.

we all loathe the killer workloads of the storagenodes.

Done:

1 Like

As far as I understand, you should already be able to reduce fragmentation by using the --filestore.write-buffer-size switch (or the equivalent configuration file option). When I was still using btrfs, it significantly reduced the amount of IO needed.

Technically, this option would use RAM, as opposed to different storage, for partial data. From my observations, the amount of RAM used increased only slightly when setting it to 4096 KiB. The situation where a node receives a lot of large uploads at the same time is very, very rare, and even then it would just spill into swap, which could as well be set up on your SSD.
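
In config.yaml that is a single line, e.g. (4.0 MiB being the value from my experiment above):

    # buffer this much of each upload in RAM before flushing to the temp file
    filestore.write-buffer-size: 4.0 MiB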

2 Likes

@Toyoo Thank you, the discussion will continue on the other thread. This setting will help, but will not solve the fragmentation in the long term.