This is a very simplistic view of what a node does. Over the summer I managed to run some experiments trying to find the bottlenecks in node performance. I took a log from a new node, roughly 3 weeks' worth of operation, and wrote a Python script that would faithfully reproduce all I/O related to ingress, egress, database writes and reads, some maintenance operations like the file walker, etc. I wrote some posts with the results of these experiments: [1], [2]. One relevant number here: reproducing the whole run took around 25 hours.
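To give an idea of what that script does, here is a heavily stripped-down sketch of the replay loop. The log format, file names, and handlers below are made up for illustration; the real script also replays the database reads/writes and the maintenance operations, which I omit here.

```python
# Minimal sketch of the log-replay idea (not the real benchmark script):
# each log entry names an operation, a piece id, and a size, and the replayer
# performs the corresponding I/O against a one-file-per-piece layout.
import os

STORAGE_DIR = "pieces"  # one file per piece, like the current node layout

def replay(log_path):
    os.makedirs(STORAGE_DIR, exist_ok=True)
    with open(log_path) as log:
        for line in log:
            op, piece_id, size = line.split()
            path = os.path.join(STORAGE_DIR, piece_id)
            if op == "ingress":              # upload: write a whole piece
                with open(path, "wb") as f:
                    f.write(os.urandom(int(size)))
                    f.flush()
                    os.fsync(f.fileno())     # sync to make sure data is on disk
            elif op == "egress":             # download: read the piece back
                with open(path, "rb") as f:
                    f.read()
            elif op == "delete":             # garbage collection / trash
                os.remove(path)

if __name__ == "__main__":
    replay("node_io.log")  # hypothetical log file name
```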
I also wrote a simplified path where the node just dumps all ingress into a single file and stores the offset+length of each piece in a simple index file. I reproduced reading from that file as a seek plus a single read call, and basic piece deletion in the form of punching a hole in the file. It turned out that the same run now took under 3 hours.
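For comparison, here is a similarly stripped-down sketch of that simplified path. In the real experiment the offsets and lengths were persisted to a simple file; here I keep them in an in-memory dict to keep the example short. pwrite/pread and hole punching via fallocate(2) are Linux-specific.

```python
# Sketch of the single-file layout: sequential appends, an offset+length
# index per piece, a positional read for egress, and hole punching for
# deletion. Class and method names are illustrative only.
import ctypes, ctypes.util, os

FALLOC_FL_KEEP_SIZE = 0x01   # keep the file size unchanged
FALLOC_FL_PUNCH_HOLE = 0x02  # free the underlying blocks

_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_int64, ctypes.c_int64]

class SingleFileStore:
    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        self.end = os.lseek(self.fd, 0, os.SEEK_END)
        self.index = {}                       # piece_id -> (offset, length)

    def put(self, piece_id, data):
        offset = self.end
        os.pwrite(self.fd, data, offset)      # purely sequential append
        self.end += len(data)
        self.index[piece_id] = (offset, len(data))

    def get(self, piece_id):
        offset, length = self.index[piece_id]
        return os.pread(self.fd, length, offset)   # seek + a single read

    def delete(self, piece_id):
        offset, length = self.index.pop(piece_id)
        # Punch a hole: the blocks are released, the file size stays the same.
        if _libc.fallocate(self.fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                           offset, length) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
```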
So what is the node doing that takes those other 22 hours? Database writes and reads, maintenance operations like the file walker, some sync() calls to make sure data is actually on disk, etc. It turns out that, the way a storage node is implemented now, the bottleneck is not in sequential writes/reads, but in random I/O. This basically means the 400 kBps figure is not very meaningful.
Now, the above experiments were performed on ext4. I made another run on btrfs. This was terrible! I had to disable CoW just to get results in a manageable time, and even then, instead of 25 hours, I was now getting 52 hours. It turns out that btrfs deals with random I/O very badly.
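For completeness, since the CoW detail keeps coming up: the usual way to disable CoW on btrfs is to set the NOCOW attribute (chattr +C) on an empty directory before writing any data, so that files created inside inherit it. Something along these lines (the path is just an example):

```python
# Disable CoW on btrfs: the +C (NOCOW) attribute only takes effect on files
# with no data yet, so set it on an empty directory and let new files inherit it.
import subprocess

def disable_cow(empty_dir: str) -> None:
    subprocess.run(["chattr", "+C", empty_dir], check=True)

disable_cow("/mnt/btrfs/pieces")  # example mount point
```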
ZFS is also a CoW file system, but it has the advantage of features that merge writes (e.g. batching them into transaction groups), so it would hopefully perform better than btrfs.
Now, is doing 3 weeks' worth of I/O in 52 hours terrible performance? Well, the log was sourced from a period with less ingress than now; we probably see the same amount of I/O in a week or 10 days today. Also, 52 hours is a synthetic number measured under the best possible conditions: I dedicated a full machine to these tests alone, so no background tasks and no other services. The storage stack was also simplified: btrfs wrote directly to /dev/sda, with no parity RAID, no LVM, no thin provisioning. And, from just a log file, I couldn't replicate traffic peaks where multiple connections compete for disk I/O, making the hard disk's heads work even harder. In the past we have already seen situations in which btrfs was simply too slow during traffic peaks and could not keep up with the traffic.
I have basically zero experience with ZFS, so I won't claim caches and such are necessary. But given the results I got, I find such claims made by other people plausible, especially if they want to take advantage of nice ZFS features like its parity schemes.