Hi all,
My storage node wasn’t performing as it used to, so I started looking into it. I set it up 13 months ago on a single 3TB 7200rpm drive, to get into the platform. Lately it was simply not keeping up; loading the dashboard took 10-15 seconds.
I had 3.6 million files on 2.8 TiB of storage, which shouldn’t be an issue. However… it took 15 seconds to ‘ls’ in a blob folder. It took 3 hours to run ncdu on the volume… It took almost half a day for the runner to traverse the drive after a restart.
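For anyone who wants to compare numbers on their own node, here is a minimal sketch of how a full-volume scan could be timed. It is not what ncdu or the storage node does internally; the function name `scan_tree` is made up for illustration, and it only counts files and sums their sizes.

```python
import os
import time


def scan_tree(root):
    """Walk `root` and return (file_count, total_bytes, elapsed_seconds).

    Hypothetical helper for illustration: it touches the metadata of every
    entry, which is the part that was slow on my volume.
    """
    start = time.perf_counter()
    file_count = 0
    total_bytes = 0
    pending = [root]  # directories still to visit
    while pending:
        current = pending.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    pending.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    file_count += 1
                    total_bytes += entry.stat(follow_symlinks=False).st_size
    return file_count, total_bytes, time.perf_counter() - start
```

Call it with the path of your blob folder, for example `scan_tree("/path/to/blobs")` (a placeholder path). Run it after a reboot or after dropping caches if you want cold numbers; a second run mostly measures the page cache.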
It seems that CoW file systems (btrfs in my case) are problematic for the recurring deletion and insertion of ~2 MB files and their associated metadata.
I’ve tried ‘btrfs filesystem defrag’ and ‘-o autodefrag’, but the metadata seems to be the problem, and neither of those defragments it. So… I kinda advise against running storage nodes on CoW filesystems - use traditional RAID + ext4 or another more traditional filesystem instead. I haven’t had the chance to see how XFS or ReiserFS (if anyone still uses that) holds up to this type of workload.
I spent 4 days rsync’ing the files to a 4TB 5900rpm drive - on ext4 - and another 3 hours rsync’ing again while the storage node was down, to catch up on changes. Based on iostat, it took approx. 140 IOPS to read from the source drive and 16 IOPS to write the same data to the destination drive, at around 7.5 MB/s.
I switched the mount points, and…
Now all the slowness of the dashboard is gone, it takes 0.0x seconds to ls in a folder, and it takes 80 seconds to scan the entire volume. The node starts tremendously faster, and the disk isn’t constantly grinding - and that on a disk with a slower rotational speed. I probably lost a lot of uploads on account of the old setup.
Use the above information as input to your own decisions.
@storjteam - do you have a load simulator that creates files and deletes them like the network would do, in order to test the storagenode filesystem workload?
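To make the question concrete, this is roughly the kind of thing I mean: a rough sketch that keeps writing ~2 MB files and deleting random older ones. Everything here is an assumption on my part, not the real network pattern: the function name `simulate_churn`, the two-character subfolder layout, the file size, and the delete ratio are all made up for illustration.

```python
import os
import random
import time


def simulate_churn(root, rounds=100, file_size=2 * 1024 * 1024,
                   max_files=50, delete_ratio=0.5, seed=0):
    """Create `rounds` files under `root`, deleting random older ones as we go.

    All parameters are guesses for illustration, not measured node behaviour.
    Returns a dict with counts and the elapsed time in seconds.
    """
    rng = random.Random(seed)
    os.makedirs(root, exist_ok=True)
    payload = os.urandom(file_size)  # same random payload reused for every file
    live = []  # paths of files currently on disk
    created = 0
    deleted = 0
    start = time.perf_counter()
    for i in range(rounds):
        # Random prefix spreads files over subfolders; the counter keeps names unique.
        name = f"{rng.getrandbits(32):08x}{i:08x}"
        subdir = os.path.join(root, name[:2])
        os.makedirs(subdir, exist_ok=True)
        path = os.path.join(subdir, name[2:])
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the write to disk, not just to the cache
        live.append(path)
        created += 1
        # Delete one random file when over the cap, or with probability delete_ratio.
        if len(live) > max_files or rng.random() < delete_ratio:
            victim = live.pop(rng.randrange(len(live)))
            os.remove(victim)
            deleted += 1
    return {
        "created": created,
        "deleted": deleted,
        "remaining": len(live),
        "elapsed": time.perf_counter() - start,
    }
```

Running this with a large `rounds` and `max_files` on a scratch filesystem, and timing a directory scan before and after, might show whether metadata performance degrades over time. A real test would need the actual size distribution and deletion pattern of the network, which is why I am asking.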
Kind regards,
Martin