A little bit of background first: my node runs on Windows Server 2019 using the native Windows service version. It has been growing over the past year and is now around 11TB, with 7.8 million files averaging 1.4MB each.
I noticed huge latency on the volume hosting the data once it reached around 4-5TB, and it kept getting worse as the volume grew. Moving the database files to an SSD reduced the problem but did not solve it completely.
- The file system used for the data drive was NTFS with 4KB clusters
- The drive activity was constantly around 80-100%
- The average latency was way above 300ms even when the activity was very low
Listing the files on the volume was very slow, and I started suspecting fragmentation or file-count problems. I tried defragmenting the volume, but it was taking ages and did not seem to solve anything.
I then thought of two options:
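If you want to put a number on "listing is slow" for your own volume, something like the following rough sketch could help. It is not what I used at the time; `scan_tree` is just a name picked for illustration, and it only relies on Python's standard `os.scandir`. Keep in mind that a second run will usually be faster because of OS caching, so the first run after a reboot is the interesting one.

```python
import os
import time


def scan_tree(root):
    """Walk `root` iteratively and return (file_count, total_bytes, seconds)."""
    start = time.perf_counter()
    file_count = 0
    total_bytes = 0
    pending = [root]  # directories still to visit
    while pending:
        current = pending.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        pending.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        file_count += 1
                        total_bytes += entry.stat(follow_symlinks=False).st_size
        except OSError:
            # Skip directories that cannot be read (permissions, files
            # deleted while the scan is running, and so on).
            continue
    return file_count, total_bytes, time.perf_counter() - start
```

Calling `scan_tree` on the data folder of the node returns the file count, the total size in bytes and the elapsed time in seconds, so dividing the count by the time gives a files-per-second figure you can compare before and after a migration.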
- Migrate the volume to NTFS with 64KB clusters to try to reduce the MFT size and limit fragmentation. However, I would lose a little bit of space, as the average file size is not very large.
- Migrate the volume to ReFS with 4KB clusters. ReFS has its pros and cons, but if there is one thing it does much better than NTFS, it is dealing with huge numbers of files. NTFS usually keeps getting slower the more files you add (talking about millions of files here).
I chose to test ReFS. I migrated my 11TB volume to a new ReFS 4KB partition (the copy took some time, but the node downtime was only 20 minutes) and the difference is night and day:
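To give an idea of the "little bit of space" lost with bigger clusters, here is a back-of-the-envelope sketch. It assumes the usual rule of thumb that each file wastes about half a cluster on average, which is only an approximation (it ignores tiny files stored directly in the MFT, for example). The file count and average size are the ones from my node; the function names are made up for illustration.

```python
FILE_COUNT = 7_800_000        # number of files on the node (from this post)
AVG_FILE_BYTES = 1_400_000    # 1.4 MB average file size (from this post)


def expected_slack_bytes(file_count, cluster_bytes):
    """Rule of thumb (assumption): each file wastes half a cluster on average."""
    return file_count * cluster_bytes // 2


def slack_for_file(size_bytes, cluster_bytes):
    """Exact slack for one file: space allocated minus space actually used."""
    clusters = -(-size_bytes // cluster_bytes)  # ceiling division
    return clusters * cluster_bytes - size_bytes


for cluster_kib in (4, 64):
    cluster = cluster_kib * 1024
    slack = expected_slack_bytes(FILE_COUNT, cluster)
    share = slack / (FILE_COUNT * AVG_FILE_BYTES)
    print(f"{cluster_kib:>2} KiB clusters: ~{slack / 1e9:.0f} GB slack "
          f"({share:.1%} of the data)")
```

With these numbers the estimate comes out at roughly 16 GB of slack for 4KB clusters against roughly 256 GB for 64KB clusters, so a bit more than 2% of the data size in the 64KB case. `slack_for_file` is there if you want to sum the exact waste over a real list of file sizes instead of relying on the rule of thumb.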
- Average latency under 1ms
- Average activity is around 5-10%
- At node startup, the data volume sits at 60% activity for 3-4 minutes, compared to 100% for more than 30 minutes before
- Listing the files on the volume is quite fast now
PS: This is just my experience; I am not saying that ReFS is better than NTFS for every Windows node. But if you are facing huge latency spikes on a > 10TB volume on Windows Server, it might be the solution for you.