As I wrote before, I have a chunk of code that reproduces, as faithfully as I could manage, the disk I/O performed by storage nodes as part of usual operation. While at that time I focused mostly on the parameters of ext4 itself, over the last three weeks I’ve run some experiments profiling this code, looking for improvements to the storage node code itself that would benefit the network. I now believe I’ve found something that might be worth discussing.
Firstly, to establish the test conditions. I took the machine I used for my first, long-forgotten node: a Thinkpad X61 with an internal 1TB CMR drive. The OS runs from a USB flash drive, so I can dedicate all I/O of the internal drive to the tests. I’ve also reduced the amount of usable RAM to 1GB. The machine runs a stripped-down Debian Buster with kernel 4.19.0 (not fresh enough to benefit from the fast commit patches). This setup should roughly correspond to the official minimal requirements for running a node.
I took the logs of one of my younger nodes, taking advantage of the fact that upload sizes are now logged. I replay all upload, `download started` and `delete piece sent to trash` events. At the time I started the benchmarks, there were around 1,750,000 of them. I divided them into batches of 10k events and collected various statistics on each batch. To show some specific numbers:
- The first batch contains 9095 upload events, 342 download events and 563 trash events, and is normal for a node in the early vetting stage.
- The last batch contains 6642 upload events, 2840 download events and 518 trash events, and is probably the closest in my dataset to a steady state of a storage node that is not yet filled.
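The batching and per-batch counting described above can be sketched roughly as follows. This is only an illustration, not the benchmark code itself: the function names `batches` and `summarize` and the plain-string event kinds are assumptions, and parsing of the real log lines is left out.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List

BATCH_SIZE = 10_000  # batch size mentioned in the post


def batches(events: Iterable[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` events, preserving log order."""
    it = iter(events)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def summarize(batch: List[str]) -> Counter:
    """Count how many events of each kind a batch contains."""
    return Counter(batch)


# Hypothetical event stream that mirrors the composition of the first batch.
demo = ["upload"] * 9095 + ["download"] * 342 + ["trash"] * 563
for index, batch in enumerate(batches(demo)):
    print(index, dict(summarize(batch)))
```

Keeping the events in log order matters here, since the point is to replay the node's real access pattern rather than a shuffled one.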
I also simulate the hourly chores of bandwidth rollups and reading the orders files, as well as the periodic chore of purging trash. Before each batch I run `sysctl vm.drop_caches=3` to simulate an occasional machine restart or some external process briefly competing for resources.
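For a benchmark driven from Python, the same cache drop can be done without shelling out, since writing the value to `/proc/sys/vm/drop_caches` is what the sysctl call does underneath. A minimal sketch, assuming the script runs as root on Linux; the helper name `drop_caches` and its `path` parameter are made up for illustration:

```python
DROP_CACHES_PATH = "/proc/sys/vm/drop_caches"


def drop_caches(mode: int = 3, path: str = DROP_CACHES_PATH) -> None:
    """Ask the kernel to drop clean caches (1 = page cache, 2 = dentries/inodes, 3 = both).

    Only clean pages are dropped; dirty data stays in memory until written back.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    with open(path, "w") as f:
        f.write(f"{mode}\n")
```

The `path` parameter exists only so the helper can be exercised against an ordinary file without root privileges.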
The test finishes with about 240GB of data stored on disk. I run all tests 10 times, each time on a freshly prepared file system whose size is only slightly larger (250GB), in an attempt to see how a file system filled to the brim affects performance. All tests in this article have been performed on a file system created with `mke2fs -t ext4 -m 0 -E lazy_itable_init=0,lazy_journal_init=0 -i 131072 -I 128 -J size=128 -O sparse_super2,^uninit_bg,^resize_inode` and mounted with `-o lazytime,data=writeback,delalloc` (except for the last set of tests, as explained later). After all batches finish, I also measure the time it takes to run `du -s`, though I haven’t observed any significant change in this metric over all tests performed here.
First, let’s establish the baseline:
- First batch: mean=499 seconds, sd=35 seconds.
- Last batch: mean=603 seconds, sd=27 seconds.
- Total: mean=90021 seconds, sd=1078 seconds.
In general I’ve observed that the operation slows down with time. I suspect this is due to directories growing bigger and free space being split into smaller chunks, though I haven’t investigated it further. Note also that a single run takes about 25 hours, so I need to be conservative in running tests like that.
Then I profiled my code with scalene, on the first few batches only to save some time. I had expected the bottleneck to be writes to the bandwidth.db SQL database, but it turned out that almost 80% of the time is attributed to the fsync corresponding to this line in the storage node code. Curious, I ran the tests with this fsync commented out:
- First batch: mean=39.2 seconds, sd=2.8 seconds.
- Last batch: mean=266 seconds, sd=8.8 seconds.
- Total: mean=34981 seconds, sd=875 seconds.
An order of magnitude improvement for the first batch, and still more than 2× as fast for the steady state. I started wondering what this fsync achieves. Without it, in case of an unclean shutdown, the file might be only partially written to disk, meaning that although we have told the uplink that we accepted the whole file, we will lose it. Yet ext4 forces a commit every 5 seconds by default anyway. So on ext4, by omitting this fsync we would only risk a few seconds of unwritten data if an unclean shutdown happens.
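The speedups implied by the two sets of means reported above can be checked with a few lines of arithmetic. The numbers are copied from the results; the dictionary and function names are just for illustration.

```python
# Mean times in seconds, copied from the reported results.
baseline = {"first batch": 499, "last batch": 603, "total": 90021}
no_fsync = {"first batch": 39.2, "last batch": 266, "total": 34981}


def speedups(before: dict, after: dict) -> dict:
    """Return how many times faster `after` is than `before`, per phase."""
    return {phase: before[phase] / after[phase] for phase in before}


for phase, ratio in speedups(baseline, no_fsync).items():
    print(f"{phase}: {ratio:.1f}x faster")
# first batch: 12.7x faster
# last batch: 2.3x faster
# total: 2.6x faster
```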
I’ve also tried raising the commit interval from its default, setting it with `-o …,commit=99999`. The numbers:
- First batch: mean=36.8 seconds, sd=4.1 seconds.
- Last batch: mean=242 seconds, sd=34 seconds.
- Total: mean=28760 seconds, sd=1552 seconds.
Still some slight improvement, but not as impressive. This is likely because the test machine doesn’t have much RAM, so the data must be committed pretty soon anyway. I’ve also started observing higher variability of latency in the system: the progress bar I implemented in the code sometimes stalled for a few seconds, something I hadn’t seen with the default commit value.
Why would getting rid of fsync help? I suspect it is because, by delaying direct writes, we allow the OS to delay allocation of blocks on disk, and so let the system find a more performant on-disk layout for the data being written. Maybe this means less fragmentation, or maybe it is just bundling the writes so that they are performed in an order that better follows physical tracks and sectors.
These tests and all the reasoning above refer only to ext4 running on a Linux host. I do not have enough knowledge to theorize on what would happen with btrfs, ntfs or other file systems likely to be used with storage nodes, under different operating systems or under storage virtualization. I cannot rule out that this fsync is actually vital in some of these circumstances. Hence I think it would be nice to have a configurable option to skip it, with the storage node operator explicitly opting in when they believe the trade-off is worth making in their case.
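To make the proposal concrete, here is a minimal sketch of what such an opt-in switch could look like, written in Python like my benchmark rather than in the storage node's Go. It is not the storage node's actual code: the function name `store_piece` and the `sync` flag are hypothetical, and a real option would be read from the node's configuration.

```python
import os
import tempfile


def store_piece(path: str, data: bytes, sync: bool = True) -> None:
    """Write `data` to `path`, forcing it to stable storage only when `sync` is set.

    `sync=True` mirrors the current behaviour (fsync before reporting success);
    `sync=False` is the hypothetical opt-in that leaves flushing to the kernel.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()  # move Python's userspace buffer into the OS page cache
        if sync:
            os.fsync(f.fileno())  # block until the kernel has written the file out


# Usage: the operator opts out of fsync explicitly; the default stays safe.
with tempfile.TemporaryDirectory() as directory:
    piece = os.path.join(directory, "piece.bin")
    store_piece(piece, b"\0" * 4096, sync=False)
    print(os.path.getsize(piece))
```

Keeping the default at `sync=True` means nothing changes for operators who don't touch the option, which matches the idea of an explicit opt-in.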
We have long established that SMR drives are not fast enough for standard setups and can be made to perform only with careful configuration. While I do not own an SMR drive that I can use for running these tests, I find it likely that this change would make operating SMR drives quite a lot easier.