On the impact of fsync in the storage node upload code path on ext4

This is amazing work indeed. It’s long been a puzzle why storagenodes can barely run on some HDDs; this fsync thing sort of looks like something it’s not supposed to do unless there is an error or something…

At least that’s how I would read it… but that should mean it runs rarely, which seems inconsistent with your numbers… unless it massively slows down the process when it does run…

It would be interesting to know how often this fsync call is actually made…
This is very interesting… apparently fsync flushes the whole cache each time it’s called…
No wonder it slows stuff down… it also seems like other software has had poor performance due to improper use of fsync…
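For anyone curious, in Go (which the node is written in) the fsync is just a Sync() call on the file handle, and its cost is easy to feel with a tiny standalone test like the rough sketch below (my own toy code, nothing to do with the actual storagenode source): it writes a batch of small files with and without the sync and times both.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// writeBatch writes `count` small files, optionally calling fsync (f.Sync())
// on each one, and returns the elapsed time. Toy sketch, not storagenode code.
func writeBatch(dir string, count int, doSync bool) (time.Duration, error) {
	data := make([]byte, 64*1024) // pretend this is a 64 KiB piece
	start := time.Now()
	for i := 0; i < count; i++ {
		f, err := os.Create(filepath.Join(dir, fmt.Sprintf("piece-%d", i)))
		if err != nil {
			return 0, err
		}
		if _, err := f.Write(data); err != nil {
			f.Close()
			return 0, err
		}
		if doSync {
			if err := f.Sync(); err != nil { // fsync(2): wait for stable storage
				f.Close()
				return 0, err
			}
		}
		if err := f.Close(); err != nil {
			return 0, err
		}
	}
	return time.Since(start), nil
}

func main() {
	dir, _ := os.MkdirTemp("", "fsync-test")
	defer os.RemoveAll(dir)
	noSync, _ := writeBatch(dir, 200, false)
	withSync, _ := writeBatch(dir, 200, true)
	fmt.Printf("200 writes without fsync: %v, with fsync: %v\n", noSync, withSync)
}
```

On a plain HDD I would expect the second number to be dramatically worse, since every Sync() has to wait for the platters instead of just the page cache.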

Big vote from me; this could be amazing for storagenode performance.
And honestly… my system is basically immune to data loss on a hardware level, so flushing the cache is pointless…

I guess that is why I get so much better performance when I make ZFS lie and just say everything is synced all the time. I’ve run it like this for days at a time without any issues.

I saw some lectures on ZFS optimization where developers were digging into the code to figure out why their usage pattern wasn’t performing correctly…

One of the ways they did that was with something called flame graphs, which gave them an overview of how much system/process time was spent on each task… like, say, fsync.

It was very interesting; in some cases they found issues where 50% or more of the time was just the system waiting for something like fsync to complete or respond.

It could also be due to the disk geometry: as an HDD fills up, writes move in towards the center of the platter, and since the platter spins at a fixed rate, less of it passes under the head in a given amount of time…

I think it’s about a 50% loss in speed for a completely full HDD, and your 250 GB is 25% of a 1 TB disk, which seems to fit pretty well.
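Just to put rough numbers on that intuition, here’s a toy model (pure assumption on my part, not measured): sequential throughput drops roughly linearly from the outer edge of the disk down to about half of that at the innermost tracks.

```go
package main

import "fmt"

// Toy model (an assumption, not a measurement): sequential throughput drops
// roughly linearly from 100% at the outer edge (start of the disk) to ~50%
// at the innermost tracks.
func relativeSpeed(fillFraction float64) float64 {
	const innerVsOuter = 0.5 // inner tracks at roughly half the outer speed
	return 1 - (1-innerVsOuter)*fillFraction
}

func main() {
	for _, fill := range []float64{0.25, 0.5, 1.0} {
		fmt.Printf("disk %3.0f%% full: ~%.0f%% of max sequential speed\n",
			fill*100, relativeSpeed(fill)*100)
	}
}
```

With that model a 25%-full disk would still be writing at roughly 85–90% of its best speed, while a completely full one is down to ~50%.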

Keep up the great work.

1 Like

Impressive work. You have my vote.

1 Like

Outstanding @Toyoo, Kudos. You have my vote too.

1 Like

My voice here also. :+1:

1 Like

Some notes:

  • The last batch is slower most likely because it has 2840/342 ≈ 8.3 times as many download events as the 1st batch. With only 1 GB of memory (and thus less than 900 MB available for ext4’s page cache), the download events most likely turn into random HDD head seeks, and those random seeks slow the last batch down compared to the 1st batch.

  • The test (~300 GB in 25 hours and 100% HDD utilization) might not be representative of real-world Storj access patterns (~300 GB in 1 month and 3% HDD utilization). In particular, it is highly questionable whether removing the fsync() call from the source code would affect a real-world long-running Storj node the same way it affected the test.

  • As far as I know, ext4 isn’t performing real-time tracking of the position(s) of HDD’s head(s) and doesn’t take the physical topology of the HDD into account when performing fsync().

  • Some NAS HDDs feature power loss protection (the HDD attempts to save its cache to the platters in case of a power failure). I wonder whether such a drive can perform more fsync() operations per second than an HDD without power loss protection (see the sketch after this list).

  • “We have long established that SMR drives are not fast enough for standard setups” - SMR drives are OK, except for certain data write patterns, and the fact is that a Storj node is usually writing at most just a few megabytes per second to the HDD.

    • SMR drives support fstrim(). The OS should use this feature properly.
    • An issue with SMR drives is that their behavior (physical topology) isn’t properly documented by manufacturers, and thus Linux filesystem implementations don’t know for sure how to avoid pathological SMR data access patterns without resorting to inefficient trial and error.
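One way to check the power-loss-protection question would be a trivial fsync-rate loop like the rough sketch below (my own toy code, unrelated to the node): append a small block and fsync, over and over, with the test file placed on the drive in question, and compare the resulting fsyncs per second between drives with and without power loss protection.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// Rough sketch: measure how many write+fsync rounds per second a drive
// manages. Place the test file on the drive you want to check.
func main() {
	const name = "fsync-rate-test.dat"
	f, err := os.Create(name)
	if err != nil {
		panic(err)
	}
	defer os.Remove(name)
	defer f.Close()

	block := make([]byte, 4096)
	const rounds = 500
	start := time.Now()
	for i := 0; i < rounds; i++ {
		if _, err := f.Write(block); err != nil {
			panic(err)
		}
		if err := f.Sync(); err != nil { // each Sync() is one fsync(2)
			panic(err)
		}
	}
	elapsed := time.Since(start)
	fmt.Printf("%d fsyncs in %v (%.0f fsyncs/s)\n",
		rounds, elapsed, float64(rounds)/elapsed.Seconds())
}
```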
3 Likes

I agree with most of your points.

Indeed, I can’t say that the way the OS/file system behaves under my test’s sustained load is the same as how it behaves with the pauses of regular node traffic. I think though that if the system has any decent I/O load in addition to the storage node (which may happen when reusing free resources on existing hardware, as recommended by Storj), my test should offer a reasonable approximation. Besides, at this point I can also infer from my test that this hardware can maybe cope with traffic around 30× the current amount: the test pushed roughly a month’s worth of data in ~25 hours, and a month is ~720 hours. That seems unlikely short-term, but who knows what will happen when Storj becomes more popular? My experiment shows that with a small software change we could make this hardware work with ~80× the current traffic.

The kernel can influence the order in which writes and reads are performed. This matters a lot here: the more blocks are pending for write, the bigger the chance that there are clusters of nearby blocks that can be written without excessive seeking. Real-time tracking is not necessary; it’s enough to know which parts of the disk need to be visited at some point in the future.

I wonder myself! Given that even the cheapest consumer external drives are often basically NAS models with minor firmware tweaks, this would be very useful to know.

We still observe problems like this one or this one, showing that unless the operator moves the databases to a different drive or uses some kind of cache, decent performance is not guaranteed.

Not all of them, my Seagate Backup doesn’t. I wish it did.

3 Likes

Filesystems like ext4 are, AFAIK, completely unaware of the storage hardware; a filesystem is just that and nothing else.

You should read the article I linked; it’s very interesting and goes quite in-depth on the subject.

It also says that some hardware will ignore fsync requests completely if honoring them isn’t advantageous, though it doesn’t state how common that is…

fsync forces the entire HDD cache to be written to disk… so no matter when, how, or what calls it, it should greatly affect disk activity.

That’s as far as I understand it, anyway… I’m very familiar with some of this, but the whole fsync thing is new to me.

My newest Toshiba enterprise HDDs have PLP, which is pretty nice I guess… yet another fallback in case my other stuff fails.

I should really look into the whole fstrim thing… SSD trim I get, but I don’t get how that can apply to HDDs; most likely it’s a different feature with similar effects that has adopted the name of the SSD trim feature.

1 Like

@Toyoo I guess the entire question/discussion comes down to the following:

Do you want to risk losing data to gain more performance, with the potential impact of it DQ’ing you?

At the end of the day, if you run a single node on a single drive (granted, I can only speak for non-SMR), I’ve never had any issues with it not keeping up, or slowing down so much that the success rate decreased.
If the drive does more than just the storagenode operation, I could certainly see it being a problem.
Making this configurable should be straightforward, and I would even encourage you to propose this change if you need it! :pray:
Further, you’ll also need to see it from the other side (meaning the satellite and the customer). Making that trade-off can have a serious impact on the SLA and thus potentially change payout rates, etc. If nodes get marked as unreliable, DQ’ed or the like, there is more repair and churn, which drives costs up.

In my mind having the flag is great, but it definitely should default to keeping fsync in place and be very clearly labelled as risky and not recommended to change to “off”. :slight_smile:
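Something along these lines is what I have in mind; everything in the sketch is hypothetical (the option name, the package, the function are made up), it’s just meant to show the shape of the change with the safe default:

```go
// Hypothetical sketch only; the option name, package and function are made up.
package piecestore

import "os"

// Config would grow one new option, defaulting to the current (safe) behavior,
// e.g. something like `storage2.commit-sync: true` in config.yaml.
type Config struct {
	CommitSync bool `default:"true" help:"fsync each piece before acknowledging the upload (disable at your own risk)"`
}

// commitPiece finishes writing an uploaded piece.
func (c Config) commitPiece(f *os.File) error {
	if c.CommitSync {
		return f.Sync() // current behavior: wait for stable storage
	}
	return nil // risky: rely on kernel writeback; a crash may lose acknowledged pieces
}
```

The point being that anyone who flips it off is explicitly accepting the risk of acknowledged-but-lost pieces after a crash or power failure.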

If nothing else, a minor investigation into the subject seems justified…
I dunno how much time storagenodes spend on fsync calls… but it might be interesting to know…
Is it 5%, 10% or 90%? Also, verifying that the code actually runs as it’s supposed to seems worth the effort for all involved.

I fully agree with you that having this kind of stuff as a toggle is risky at best… and not something I would put into the arsenal of the plebs.

But that being said… it seems like something that should happen on errors, or when the command or variable is marked as err… maybe it just runs too often… who knows…
It sure seems like way too big a performance uplift to simply ignore, especially when other software has had similar issues which were corrected with “minor” code changes.

None of us knows what the right way forward is here… so more investigation is justified.

I agree, it’s a trade-off which each node operator should consider on their own. Maybe their hardware is fast enough to operate their nodes as they are now; maybe instead they use some SMR drive, or the same hardware also hosts other I/O-intensive software and they’d like the storage node not to impact it too much. I’m just pointing out what is probably the lowest-hanging fruit to those who would like to work on it. I’m not comfortable enough with writing golang to do it myself.

Also note that it is against the Node Operator T&C to modify the storage node software (point 4.1.3), so unless this modification is allowed by Storj Inc., whoever would like to work on it would be breaking the T&C. Which is why I’ve put it up for discussion here.

1 Like

At most, disabling fsync should be an option, not the default. The proper way to do this (I have not analyzed the code to see if this is how the node works) is roughly this (a code sketch follows below):

  1. Get data from customer
  2. Write it to disk.
  3. Fsync
  4. Report to customer that the upload is complete.

This avoids the situation where the customer and satellite think the data exists when it actually doesn’t (leading to failed audits). If power fails before step 4, the customer sees it as a failed upload and uploads to some other (slower, but still powered-on) node.
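In code, that ordering is roughly the following (a sketch of the idea only, not the actual node or protocol code; the names are made up):

```go
// Hypothetical names; just a sketch of the ordering, not the actual node code.
package upload

import "os"

// storePiece only calls ack() after Sync() has returned, so a crash before
// step 4 looks like a failed upload to the customer instead of a piece the
// network wrongly believes is stored.
func storePiece(f *os.File, data []byte, ack func() error) error {
	if _, err := f.Write(data); err != nil { // step 2: write (into the page cache)
		return err
	}
	if err := f.Sync(); err != nil { // step 3: fsync to stable storage
		return err
	}
	return ack() // step 4: report to the customer that the upload is complete
}
```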

I use an SSD SLOG, so sync writes are not a problem for me, though at the current upload rate my node would be fast enough even without the SLOG; after all, the current upload rate is 4 Mbps or less.

2 Likes

Or stable enough for this risk to be really small. I haven’t had an unsafe shutdown in years. I would also add that even regular HDD maintenance like resyncing RAID and running extended SMART tests can be really slow with nodes running. And unfortunately, with these processes running alongside the nodes, my NAS becomes really slow. I don’t use it for a lot of other things, but I do feel the impact. So I might try this setting if it were there.

I get that. I feel the same way. But you can always post an issue on GitHub, linking back here. The developers don’t always read the forums, but they do monitor GitHub issues.

This doesn’t really apply if you create a pull request on GitHub. Storj Labs would then review it and merge it into the code if they agree with the changes. You’re not modifying your own node, but suggesting a modification to the node software to Storj Labs. Yay for open source!

2 Likes

Yeah, but I wouldn’t be able to test the code under realistic conditions without setting up a test satellite, which sounds like a lot of work.

1 Like

You could use this: Please join our public test network

1 Like

I wonder if this fsync thing actually explains why some SMR drives run pretty okay while others basically cannot run storagenodes.

If fsync forces a write of the entire HDD cache, and SMR drives often have massive caches plus bad write times…
that would mean the SMR HDDs that are unable to run storagenodes should be able to run them if there is a SLOG, assuming the SLOG makes ZFS ignore the fsync request.

Or simply set the pool on the SMR HDD to:
zfs set sync=disabled poolname

Anyone got an SMR drive they know doesn’t work for storagenodes to test this with?

1 Like

Indeed. So, just to make sure, @littleskunk, is it ok to use the test network for purposes like that?

1 Like

The linked topic even invites people to try and cheat the system on that satellite. It would certainly then be ok to test a new feature.

1 Like

Ah, yes it does! Thank you, I missed it somehow. I only remembered the issue of different autoupdates and thought that network was for testing pre-releases.

I had similar tests running for btrfs, waiting for analysis… and I finally found some time. So, yes, the effect for btrfs is even more dramatic.

For reference, I created the file system with -d single -m single and mounted it with -o nodatacow,max_inline=0,noatime. I found these settings to be the fastest in prior experiments. While they disable a lot of btrfs features, they make the comparison fair to ext4 feature-wise. Given that ext4 is believed to be good enough for a storage node, btrfs with these settings should be as well.

Showing just the totals, as all batches exhibit the same properties here:

  • Without the suggested change: 188209 seconds. No sd here; I ran this case only once, as the performance was so pathetic I didn’t bother doing more tests. This is more than twice as slow as ext4 without the change, and so far the most basic reason for not recommending btrfs for storage nodes.
  • With the suggested change: 34480 seconds, sd=195 seconds. Pretty much the same as ext4 with the change! Curiously, the variance is also lower than on ext4.

I’ll repeat that I don’t know what kind of failure modes this change would introduce on btrfs (or on any other file system except ext4, which I studied somewhat carefully at some point). btrfs also has the commit mount option, so maybe it would be safe enough? Anyway, this change would make nodes on btrfs quite workable (du is still much slower, though).

2 Likes

So far as I know, btrfs is still a bit of a mess, so not really recommended for production.
Aside from that, running it without copy-on-write kind of defeats one of the major advantages of the filesystem.

Sure, it won’t be as fast, but it’s also unlikely to mess up your entire data store after an unexpected power outage, which otherwise has to be mitigated in hardware instead.

That is, if one doesn’t want to roll the dice.