On the impact of fsync in the storage node upload code path on ext4

As I wrote before, I have a chunk of code that reproduces, as faithfully as I could manage, the disk I/O performed by storage nodes during normal operation. While back then I focused mostly on the parameters of ext4 itself, over the last three weeks I've been profiling this code and experimenting with it, looking for improvements to the storage node code itself that would benefit the network. I now believe I've found something worth discussing.

First, to establish the test conditions: I took the machine I used for my first, long-forgotten node, a Thinkpad X61 with an internal 1TB CMR drive. The OS runs from a USB flash drive, so I can dedicate all I/O of the internal drive to tests. I've also reduced the amount of usable RAM to 1GB. The machine runs a stripped-down Debian Buster with kernel 4.19.0 (not fresh enough to benefit from the fast commit patches). This setup should roughly correspond to the official minimum requirements for running a node.

I’ve took the logs of one of my younger nodes, taking advantage of the fact that upload sizes are now logged. I replay all uploaded, download started and delete piece sent to trash events. At the time I started the benchmarks, there were around 1,750,000 of them. I divided them into batches of 10k events and collected various statistics on each batch. To show some specific numbers:

  • The first batch contains 9095 upload events, 342 download events and 563 trash events, and is typical for a node in the early vetting stage.
  • The last batch contains 6642 upload events, 2840 download events and 518 trash events, and is probably the closest in my dataset to a steady state of a storage node that is not yet filled.

I also simulate the hourly chores of bandwidth rollups and reading the orders files, as well as the periodic chore of purging trash. Before each batch I perform sysctl vm.drop_caches=3 to simulate an occasional machine restart or some external process briefly competing for resources.
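
For reference, the cache drop between batches boils down to something like the following (a minimal Go sketch of the idea, not my actual harness; Linux only, needs root):

    package main

    import (
        "os"
        "syscall"
    )

    // dropCaches approximates a cold-cache restart: flush dirty data first,
    // then ask the kernel to drop the page cache, dentries and inodes.
    // Equivalent to running sync followed by sysctl vm.drop_caches=3.
    func dropCaches() error {
        syscall.Sync() // flush dirty pages so dropping caches loses nothing
        return os.WriteFile("/proc/sys/vm/drop_caches", []byte("3\n"), 0o200)
    }

    func main() {
        if err := dropCaches(); err != nil {
            panic(err)
        }
    }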

The test finishes with about 240GB of data stored on disk. I run all tests 10 times, each time on a freshly prepared file system whose size is only slightly larger (250GB), in an attempt to see the effects on performance of a file system filled to the brim. All tests in this article have been performed on a file system created with mke2fs -t ext4 -m 0 -E lazy_itable_init=0,lazy_journal_init=0 -i 131072 -I 128 -J size=128 -O sparse_super2,^uninit_bg,^resize_inode and mounted with -o lazytime,data=writeback,delalloc (except for the last set of tests, as explained later). After all batches finish, I also measure the time it takes to du -s, though I haven't observed any significant change in this metric across the tests performed here.

First, let’s establish the baseline:

  • First batch: mean=499 seconds, sd=35 seconds.
  • Last batch: mean=603 seconds, sd=27 seconds.
  • Total: mean=90021 seconds, sd=1078 seconds.

In general I’ve observed that the operation slows down with time. I suspect it might be due to directories becoming bigger and free space being in smaller chunks. Haven’t investigated this more though. Note also that a single run takes almost 25 hours, so I need to be conservative in running tests like that.

Then I’ve profiled my code with scalene on the first few batches, to save some time. I’ve previously expected that the bottleneck will be writes to the bandwidth.db sql database, but it turned out that almost 80% of the time is attributed to the fsync corresponding to this line in the storage node code. Curious, I’ve run the tests on a piece of code with this fsync commented out:

  • First batch: mean=39.2 seconds, sd=2.8 seconds.
  • Last batch: mean=266 seconds, sd=8.8 seconds.
  • Total: mean=34981 seconds, sd=875 seconds.

An order of magnitude improvement for the first batch, and still more than twice as fast for the steady state. I started thinking about what this fsync actually achieves. Without it, in case of an unclean shutdown the file might be only partially written to disk, meaning that while we have responded to the uplink that we accepted the whole piece, we will lose it. Yet ext4 forces a journal commit every 5 seconds by default anyway. So on ext4, by omitting this fsync we would only risk the last few seconds of unwritten data if an unclean shutdown happens.
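
For context, the code path boils down to the classic write-then-fsync pattern; in Go, fsync is os.File.Sync. A minimal sketch of that pattern (not the actual storagenode source, the names are made up for illustration):

    package main

    import "os"

    // writePiece sketches storing one uploaded piece. The Sync call is the
    // fsync under discussion: it blocks until the data reaches stable storage.
    // Without it, ext4 still commits dirty data on its journal commit interval
    // (5 seconds by default), just not synchronously with the upload.
    func writePiece(path string, data []byte) error {
        f, err := os.Create(path)
        if err != nil {
            return err
        }
        defer f.Close()

        if _, err := f.Write(data); err != nil {
            return err
        }

        // Removing the next line is what the "fsync commented out" runs did.
        return f.Sync()
    }

    func main() {
        if err := writePiece("piece.bin", []byte("example data")); err != nil {
            panic(err)
        }
    }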

I’ve also tried bumping up the default commit rate, setting it with -o …,commit=99999. The numbers:

  • First batch: mean=36.8 seconds, sd=4.1 seconds.
  • Last batch: mean=242 seconds, sd=34 seconds.
  • Total: mean=28760 seconds, sd=1552 seconds.

Still some further improvement, but not as impressive. Likely because the test machine doesn't have much RAM, so the data must be committed fairly soon anyway. I also started observing higher latency variability: the progress bar I've implemented in the code sometimes stalled for a few seconds, something I hadn't seen with the default commit value.

Why would getting rid of fsync help? I suspect that by not forcing writes to happen immediately, we allow the OS to delay allocation of blocks on disk, and so let it find a more performant on-disk layout for the data being written: maybe less fragmentation, maybe just batching the writes in an order that better follows physical tracks and sectors.
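
If anyone wants a feel for the per-file cost on their own hardware, a crude comparison like the one below already shows the effect (a hedged Go sketch, not my benchmark code; the file count and size are arbitrary):

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
        "time"
    )

    // writeBatch writes count files of size bytes into dir and returns the
    // elapsed time; with doFsync it fsyncs every file before closing it.
    func writeBatch(dir string, count, size int, doFsync bool) time.Duration {
        if err := os.MkdirAll(dir, 0o755); err != nil {
            panic(err)
        }
        data := make([]byte, size)
        start := time.Now()
        for i := 0; i < count; i++ {
            f, err := os.Create(filepath.Join(dir, fmt.Sprintf("piece-%d.bin", i)))
            if err != nil {
                panic(err)
            }
            if _, err := f.Write(data); err != nil {
                panic(err)
            }
            if doFsync {
                if err := f.Sync(); err != nil {
                    panic(err)
                }
            }
            f.Close()
        }
        return time.Since(start)
    }

    func main() {
        fmt.Println("with fsync:   ", writeBatch("bench-fsync", 1000, 64*1024, true))
        fmt.Println("without fsync:", writeBatch("bench-nofsync", 1000, 64*1024, false))
    }

Keep in mind that the no-fsync variant measures buffered writes, so part of the data may still sit in the page cache when the timer stops; that is exactly the trade-off being discussed.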

These tests and all the reasoning above refer only to ext4 running on host Linux. I do not have enough knowledge to theorize about what would happen with btrfs, NTFS or other file systems likely to be used with storage nodes, under different operating systems, or under storage virtualization. I cannot rule out that this fsync might actually be vital in some of these circumstances. Hence I think it would be nice to have a configurable option to skip it, with the storage node operator explicitly opting in when they believe the trade-off is worth making in their case.
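
To be concrete about what "configurable" could mean: a hypothetical flag along these lines, defaulting to today's behavior (a sketch only; the flag name and surrounding types are invented, this is not a patch against the real storagenode code):

    package main

    import "os"

    // Config is a made-up slice of the storage node configuration.
    type Config struct {
        // SyncPieceWrites: fsync each uploaded piece before acknowledging the
        // upload. true preserves current behavior; an operator on ext4 who
        // accepts losing the last few seconds of uploads after an unclean
        // shutdown could explicitly opt in to false.
        SyncPieceWrites bool
    }

    // finishPieceWrite optionally flushes and then closes a freshly written piece.
    func finishPieceWrite(cfg Config, f *os.File) error {
        if cfg.SyncPieceWrites {
            if err := f.Sync(); err != nil {
                return err
            }
        }
        return f.Close()
    }

    func main() {
        f, err := os.Create("piece.bin")
        if err != nil {
            panic(err)
        }
        if _, err := f.Write([]byte("example")); err != nil {
            panic(err)
        }
        if err := finishPieceWrite(Config{SyncPieceWrites: true}, f); err != nil {
            panic(err)
        }
    }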

We have long established that SMR drives are not fast enough for standard setups, and only with careful configuration can they be made to perform. While I do not own an SMR drive that I could use for these tests, I find it likely that this change would make operating SMR drives quite a lot easier.

I absolutely love this extensive investigation. The differences are big enough that this warrants a good look from Storj Labs.

I would love to hear a little more about how you set up this test. But either way, you have my vote!

3 Likes

This is amazing work indeed, it's long been a puzzle why storagenodes nearly cannot run on some HDDs. this fsync thing sort of looks like something it's not supposed to do unless there is an error or something…

at least from how i would read it… but that would mean it runs rarely, which seems inconsistent with your numbers… unless it massively slows down the process when it is run…

would be interesting to know how often this fsync command is run…
this is very interesting… apparently fsync flushes all the cache each time it's called…
no wonder it slows stuff down… also seems like other software has had poor performance due to improper use of fsync…
https://dzone.com/articles/difference-between-fsync-and

big vote from me, this could be amazing for storagenode performance.
and honestly… my system is basically immune to data loss on a hardware level, so flushing the cache is pointless…

i guess that is why i get so much better performance when i make zfs lie and just say everything is synced all the time. i've run this for days at a time without any issues.

i saw some lectures on zfs optimization where developers were digging into the code to figure out why their usage wasn't working correctly…

one of the ways they did that was by using something they called flame graphs, giving them an overview of how much system / process time was spent on each task… like say fsync.

was very interesting, in some cases they found issues where 50% or more of the time was just the system waiting for something like fsync to complete / respond.

It could also be due to the disk geometry: as an HDD fills, the writes move in towards the center, which, due to the fixed rotation speed of the platter, means less of it passes under the head in a fixed amount of time…

i think it's about a 50% loss in speed for a full HDD, which would make your 250GB on a 1TB disk about 25% of that, which seems to fit pretty well.

Keep up the great work.

1 Like

Impressive work. You have my vote.

1 Like

Outstanding @Toyoo, Kudos. You have my vote too.

1 Like

My voice here also. :+1:

1 Like

Some notes:

  • The last batch is slower most likely because it has a 2840/342 ≈ 8.3 times higher number of download events than the 1st batch. With only 1 GB of memory (thus less than 900 MB of ext4 data cached in memory), the download events most likely resemble random HDD head seeks, and those random seeks slow down the last batch compared to the 1st batch.

  • The test (~300 GB in 25 hours and 100% HDD utilization) might not be representative of real-world Storj access patterns (~300 GB in 1 month and 3% HDD utilization). In particular, it is highly questionable whether removing the fsync() call from the source code would affect a real-world long-running Storj node the same way it affected the test.

  • As far as I know, ext4 isn’t performing real-time tracking of the position(s) of HDD’s head(s) and doesn’t take the physical topology of the HDD into account when performing fsync().

  • Some NAS HDD’s feature power loss protection (the HDD attempts to save the HDD’s cache to the HDD’s platters in case of a power failure). I wonder whether such a drive can perform more fsync() operations per second compared to an HDD without power loss protection.

  • “We have long established that SMR drives are not fast enough for standard setups” - SMR drives are OK, except for certain data write patterns, and the fact is that a Storj node is usually writing at most just a few megabytes per second to the HDD.

    • SMR drives support fstrim(). The OS should use this feature properly.
    • An issue with SMR drives is that their behavior (physical topology) isn't properly documented by manufacturers, and thus Linux filesystem implementations don't know for sure how to avoid pathological SMR data access patterns without resorting to ineffective trial-and-error.
3 Likes

I agree with most of your points.

Indeed, I can’t say that the way the OS/file system operates without pauses is similar to how it operates in my test. I think though that if the system has any decent I/O load in addition to the storage node—which may happen when reusing free resources on existing hardware, as recommended by Storj—my test should offer reasonable approximation. Besides, at this point I can also infer from my test that this hardware can maybe cope with traffic around 30× the current amount (seems unlikely short-term, but who knows what would happen when Storj becomes more popular?). My experiment shows that we could make this hardware work with ~80× the current traffic with a small software change.

The kernel can influence the order in which writes and reads are performed. This is a big factor here: the more blocks are pending for write, the bigger the chance that there are clusters of nearby blocks that can be written without excessive seeking. Real-time tracking is not necessary, just knowledge of which parts of the disk need to be visited at some point in the future.

I wonder myself! Given that even the cheapest consumer external drives are often basically NAS models with minor firmware tweaks, this would be very useful to know.

We still observe problems like this one or this one, showing that unless the operator moves databases to a different drive or uses some kind of a cache, decent performance is not guaranteed.

Not all of them, my Seagate Backup doesn’t. I wish it did.

3 Likes

filesystems, i.e. ext4, are afaik completely unaware of the storage hardware, as a filesystem is just that and nothing else.

you should read the article i linked, it's very interesting and quite in-depth on the subject.

it also says that some hardware will ignore the fsync requests completely if it's not advantageous, though it doesn't state how common that is…

fsync forces writing of the entire HDD cache to disk… so no matter when, how or what calls it, it should greatly affect disk activity.

so far as i understand anyways… i'm very familiar with some of this, but the whole fsync thing is new to me.

my newest Toshiba Enterprise HDDs have PLP, which is pretty nice i guess… as yet another fallback in case my other stuff fails.

i should really look into the whole fstrim thing… the ssd trim i get, but i don't get how that can be applied to HDDs. ofc most likely it's a different feature with similar effects that has adopted the name of the SSD trim feature.

1 Like

@Toyoo I guess the entire question/discussion comes down to the following:

Do you want to risk losing data to gain more performance? With the potential impact of it DQ’ing you?

At the end of the day, if you run a single node on a single drive (granted i can only speak for non-SMR), i never had any issues with it keeping up or slowing down so much that the success rate decreased.
If the drive does more than just the storagenode operation, i could certainly see it being a problem.
Making this configurable should be straightforward and i would even encourage you to propose this change if you need it! :pray:
Further you'll also need to see it from the other side (meaning the satellite and customer). Making that tradeoff can have a serious impact on SLA and thus potentially change payout rates, etc. If nodes get marked as unreliable, DQ'ed or the like, there is more repair and churn, which drives the cost up.

In my mind having the flag is great, but it definitely should default to keeping fsync in place and be very clearly labelled as risky and not recommended to change to “off”. :slight_smile:

if nothing else a minor investigation into the subject matter seems justified…
i dunno how much time storagenodes spend on fsync commands… but it might be interesting to know…
is it 5%, 10% or 90%? also verifying that the code is actually running as it's supposed to… seems maybe worth the effort for all involved.

i fully agree with you that having this kind of stuff as a toggle is risky at best… and not something i would put into the arsenal of the plebs.

but that being said… it seems like something that should only happen on errors… or when the command or variable was marked as err… maybe it runs too often… who knows…
sure seems like way too big a performance uplift to simply ignore, especially when other software has had similar issues which were corrected with "minor" code changes.

none of us knows what the right way forward is here… so more investigation is justified.

I agree, it's a trade-off which each node operator should consider on their own. Maybe their hardware is fast enough to operate their nodes as they are now; maybe instead they use an SMR drive, or the same hardware also hosts other I/O-intensive software and they'd like the storage node not to impact it too much. I'm just pointing out what is probably the lowest-hanging fruit to those who would like to work on it. I'm not comfortable enough writing Go to do it myself.

Also note that it is against the Node Operator T&C to modify the storage node software (point 4.1.3), so unless this modification is allowed by Storj Inc., whoever would like to work on it would be breaking the T&C. Which is why I've put it up for discussion here.

1 Like

At most, disabling fsync should be an option, not the default. The proper way to do this (I have not analyzed the code to see if this is how the node works) is:

  1. Get data from the customer.
  2. Write it to disk.
  3. Fsync.
  4. Report to the customer that the upload is complete.

This ordering avoids the situation where the customer and satellite think the data exists when it actually doesn't (leading to failed audits). If power fails before step 4, the customer sees it as a failed upload and uploads to some other (slower, but still powered-on) node.
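
In code, the ordering above looks roughly like this (a Go sketch under the assumption that the node controls when the acknowledgement is sent; the ack callback is a stand-in, not a real API):

    package main

    import (
        "fmt"
        "os"
    )

    // storeAndAck writes the uploaded data, forces it to stable storage, and
    // only then acknowledges the upload. If power fails before the ack, the
    // uplink treats the upload as failed and retries on another node, so the
    // network never believes in data that was not durable.
    func storeAndAck(path string, data []byte, ack func() error) error {
        f, err := os.Create(path)
        if err != nil {
            return err
        }
        if _, err := f.Write(data); err != nil {
            f.Close()
            return err
        }
        if err := f.Sync(); err != nil { // step 3: fsync before acknowledging
            f.Close()
            return err
        }
        if err := f.Close(); err != nil {
            return err
        }
        return ack() // step 4: report success only after the data is durable
    }

    func main() {
        err := storeAndAck("piece.bin", []byte("example"), func() error {
            fmt.Println("upload acknowledged")
            return nil
        })
        if err != nil {
            panic(err)
        }
    }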

I use an SSD SLOG, so sync writes are not a problem for me, though with the current upload rate my node would be fast enough even without the SLOG; after all, the current upload rate is 4 Mbps or less.

2 Likes

Or stable enough for this risk to be really small. I haven't had an unsafe shutdown in years. I would also add that even regular HDD maintenance like resyncing RAID and running extended SMART tests can be really slow with nodes running. And unfortunately with these processes running as well as the nodes, my NAS becomes really slow. I don't use it for a lot of other things, but I do feel the impact. So I might try this setting if it were there.

I get that. I feel the same way. But you can always post an issue on the github, linking back here. The developers don’t always read the forums. But they do monitor github issues.

This doesn’t really apply if you create a pull request on github. Storj Labs would then review it and merge it into the code if they agree with the changes. You’re not modifying your own node, but suggesting a modification to the node software to Storj Labs. Yay for open source!

3 Likes

Yeah, but I wouldn’t be able to test the code in some real conditions without setting up a test satellite, which sounds like a lot of work.

1 Like

You could use this: Please join our public test network

1 Like

i wonder if this fsync thing actually explains why some SMR drives run pretty okay while others basically cannot run storagenodes.

if fsync forces a write of the entire HDD cache, and SMR drives often have massive caches + bad write times…
that would mean the SMR HDDs unable to run storagenodes should be able to run them with a SLOG, if the SLOG makes ZFS ignore the fsync request.

or simply configure the pool on the SMR HDD with
zfs set sync=disabled poolname

anyone got an SMR drive to test this with… one that they know doesn't work for storagenodes.

1 Like

Indeed. So, just to make sure, @littleskunk, is it ok to use the test network for purposes like that?

1 Like

The linked topic even invites people to try and cheat the system on that satellite. It would certainly then be ok to test a new feature.

1 Like

Ah, yes it does! Thank you, I missed it somehow. I only remembered the issue of different auto-updates and thought that network was for testing pre-releases.