Big Windows Node file system latency and 100% activity (NTFS vs ReFS)

JDA · September 7, 2020, 8:55am

Hi,

A little background first: my node runs on Windows Server 2019 as the native Windows service. It has been growing for the past year and is now around 11TB, with 7.8 million files averaging 1.4MB each.

I have noticed huge latency on the volume hosting the data ever since it reached around 4-5TB, and it kept getting worse as the volume grew. Moving the database files to an SSD reduced the problem but did not solve it completely.

The file system used for the data drive was NTFS with 4KB clusters
The drive activity was constantly around 80-100%
The average latency was way above 300ms even when the activity was very low

Listing the files on the volume was very slow, and I started suspecting fragmentation or file count problems. I tried defragmenting the volume, but it was taking ages and did not seem to solve anything.

I then thought of two options:
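
For anyone who wants to put a number on "listing is slow" before and after a migration, here is a minimal Python sketch that walks a folder and times it. It is only an illustration: the folder path is passed on the command line, the function names are made up for this example, and the result depends heavily on whether the file system metadata is already cached.

```python
import os
import sys
import time


def count_files(root):
    """Walk root iteratively and return (file_count, total_bytes)."""
    files = 0
    total = 0
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        files += 1
                        total += entry.stat(follow_symlinks=False).st_size
        except OSError:
            # Skip folders that cannot be read (permissions, files removed
            # by the node while the scan is running, missing path).
            continue
    return files, total


def timed_listing(root):
    """Return (file_count, total_bytes, elapsed_seconds) for one full walk."""
    start = time.perf_counter()
    files, total = count_files(root)
    elapsed = time.perf_counter() - start
    return files, total, elapsed


if __name__ == "__main__":
    if len(sys.argv) > 1:
        files, total, elapsed = timed_listing(sys.argv[1])
        print(f"{files} files, {total / 1e12:.2f} TB, listed in {elapsed:.1f} s")
    else:
        print("usage: python list_timing.py <data folder>")
```

Run it once right after a reboot and once again later; the first run is the one that resembles what the node sees at startup.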

Migrate the volume to NTFS with 64KB clusters to shrink the MFT and limit fragmentation. However, I would lose a little space, since the average file size is not very large.
Migrate the volume to ReFS with 4KB clusters. ReFS has its pros and cons, but one thing it does much better than NTFS is handling huge numbers of files. NTFS usually keeps getting slower the more files you add (talking about millions of files here).

I chose to test ReFS. I migrated my 11TB volume to a new ReFS 4KB partition (the copy took some time, but the node downtime was only 20 minutes) and the difference is night and day:
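
The space cost of larger clusters can be roughly estimated. The sketch below uses a common rule of thumb (an assumption, not a measurement): on average the last cluster of each file is half empty, so the slack is about file count times half a cluster. It ignores very small files and any file system specifics.

```python
def estimated_slack_bytes(file_count, cluster_bytes):
    """Rule-of-thumb slack: each file wastes half a cluster on average."""
    return file_count * cluster_bytes // 2


FILE_COUNT = 7_800_000  # file count quoted in the first post

for label, cluster_bytes in (("4KB", 4 * 1024), ("64KB", 64 * 1024)):
    slack_gb = estimated_slack_bytes(FILE_COUNT, cluster_bytes) / 1e9
    print(f"{label} clusters: about {slack_gb:.0f} GB of slack")
```

With these numbers the estimate is about 16 GB for 4KB clusters and about 256 GB for 64KB clusters, so the larger cluster size would cost somewhere around 2% of an 11TB volume.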

Average latency under 1ms
Average activity is around 5-10%
Node startup keeps the data volume at 60% activity for 3-4 minutes, compared to 100% for more than 30 minutes before
Listing the files on the volume is quite fast now

PS: This is just my experience; I am not saying that ReFS is better than NTFS for every Windows node. But if you are facing huge latency spikes on a > 10TB volume on Windows Server, it might be the solution for you.

PartMent · August 17, 2021, 5:09am

Tried ReFS for a week.
Storj startup time (listing files) is definitely faster than on NTFS.
I used this repo to format the partition on Windows 10 Pro 21H1.
But for now ReFS doesn’t support growing or shrinking the volume.

Alexey · August 17, 2021, 8:06am

Please check that the number of started uploads equals the number of finished ones.
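
One way to run that check is to count lines in the node log. The Python sketch below is only an illustration: it assumes the log marks these events with the phrases "upload started" and "uploaded" and does plain substring matching, so it makes no assumption about the rest of the line. The log path is passed on the command line.

```python
import sys


def count_uploads(lines):
    """Return (started, finished) counts from an iterable of log lines."""
    started = 0
    finished = 0
    for line in lines:
        if "upload started" in line:
            started += 1
        elif "uploaded" in line:
            finished += 1
    return started, finished


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # errors="replace" keeps the scan going if the log has odd bytes.
        with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
            started, finished = count_uploads(log)
        print(f"started: {started}, finished: {finished}, "
              f"difference: {started - finished}")
    else:
        print("usage: python count_uploads.py <path to node log>")
```

A small difference is expected, since some uploads are still in progress or were cut off at either end of the log. A large and growing gap would suggest the disk is not keeping up.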

PartMent · August 17, 2021, 12:43pm

Everything in the log looks fine.
I see “upload started” and “uploaded”.
No critical error or warning.

JDA · September 7, 2021, 7:02am

As much as I suggest using ReFS on a huge node (> 10TB), I strongly advise against doing it on a client OS. This should only be done on a Server OS.
ReFS is not really supported on Windows 10; you are asking for trouble :x

serger001 · October 12, 2021, 5:16pm

I used ReFS on Windows 2019 and lost my 1TB node after a reboot. After Windows booted, the data disk showed up as RAW. There are many, many questions on Google for the keywords: refs raw disk recovery
and no Windows solution… it’s sad

JDA · October 12, 2021, 5:41pm

That’s super strange; the last time I heard about ReFS issues was back in 2012 R2. Did you look at the Event Log? It should tell you why the volume was put into this state.
Did the computer crash during the reboot? Or did the disk subsystem have an issue?

jammerdan · October 12, 2021, 5:48pm

ReFS is a typical Microsoft product. They started it with lots of enthusiasm but didn’t think it through, and ended up with a useless product.

It happened to me too: it was running like a charm at the beginning, and I could not believe the warnings I had read on the internet. Then one day, BANG, the HDD went into RAW mode shortly after I had consolidated terabytes onto it and retired the old drives.
While there are ways to recover a RAW disk with NTFS, with ReFS there was no easy solution. Microsoft only offered a tiny CLI tool, and third-party tools like TestDisk did not exist for it. So I had to follow Microsoft’s idea: the only solution was to buy a new drive, at least the size of the one I needed to recover, perform a full scan, and copy all the data found over to the new one. Not only did it cost me a new 8 TB drive, as I had no spare of that size, the full scan also took about a week. In the end it worked and I got my data back, but this was not exactly the experience I had expected from a freshly built, state-of-the-art file system that was advertised as resilient and especially suitable for large amounts of data.

I happily went back to NTFS. There will be no ReFS for me for the next few years, until this product is really reliable.

serger001 · October 12, 2021, 5:54pm

The electricity went out. The other (NTFS) disks booted normally; the ReFS one died. I have now formatted it as NTFS and all is well… except for my 8 months of data :(((

JDA · October 12, 2021, 6:16pm

I see. ReFS doesn’t like power outages at all :x
Using it in a Storage Spaces configuration with the option IsPowerProtected set to false solves this issue, but it requires additional configuration.

That being said, for small volumes (anything below 10TB) there is really no benefit to using ReFS anyway.

jammerdan · May 2, 2022, 6:34am

ReFS seems to continue to be a horror story:

I would love to use it, but it still seems to be less reliable than it should be.