On the impact of fsync in the storage node upload code path on ext4

I had similar tests performed for btrfs that were waiting for analysis… and I finally found some time. So, yes, the effect for btrfs is even more dramatic.

For reference, I created the file system with -d single -m single and mounted it with -o nodatacow,max_inline=0,noatime. I found these settings to be the fastest in prior experiments. While they disable a lot of btrfs features, they make the comparison with ext4 fair feature-wise. Given that ext4 is believed to be good enough for a storage node, btrfs with these settings should be as well.

Showing just the totals, as all batches exhibit the same properties here:

  • Without the suggested change: 188209 seconds. No sd here; I ran this case only once, as the performance was so pathetic I didn’t bother doing more tests. This is more than twice as slow as ext4 without the change and, so far, the most basic reason for not recommending btrfs for storage nodes.
  • With the suggested change: 34480 seconds, sd=195 seconds (roughly 5.5× faster than without the change). Pretty much the same as ext4 with the change! Curiously, the variance is also lower than on ext4.

I’ll repeat that I don’t know what kind of failure modes this change would introduce on btrfs (or on any other file system except ext4, which I studied somewhat carefully at some point). btrfs also has the commit mount option, so maybe it would be safe enough? In any case, this change would make nodes on btrfs quite workable (du is still much slower, though).

3 Likes

So far as I know, BTRFS is still a bit of a mess, so it is not really recommended for production. Aside from that, running it without copy-on-write kind of defeats one of the major advantages of the filesystem.

Sure, with CoW it won’t be as fast, but it is also unlikely to mess up your entire data store due to an unexpected power outage, a risk that would otherwise need to be mitigated in hardware instead, if one doesn’t want to roll the dice.

Sorry for being a bit late to the party, but I have been thinking about some of this recently…

I think the ext4 barrier=0 mount option will disable the forced synchronization incurred by the call to fsync(), without having to change any of the underlying code.

Data validity and redundancy are already enforced by the storj software, so why should the node operator worry about incorrectly storing some amount of data? Of course failing audits is a problem, but presumably there is a balance between data integrity and performance such that the node operator is able to use their drive more efficiently if they are confident about their node’s reliability.

On a related note, I’ve been wondering: if the satellite audits a piece, and the node fails the audit because the data is incorrect, is the piece removed by the node and no longer audited by the satellite?

If the node gives a customer a bad piece, is the satellite made aware of this and similarly removes the corrupted piece?

Happy to see my post still inspires new thoughts!

From man ext4:

barrier=0 / barrier=1

This disables / enables the use of write barriers in the jbd code. barrier=0 disables, barrier=1 enables (default). This also requires an IO stack which can support barriers, and if jbd gets an error on a barrier write, it will disable barriers again with a warning. Write barriers enforce proper on-disk ordering of journal commits, making volatile disk write caches safe to use, at some performance penalty. If your disks are battery-backed in one way or another, disabling barriers may safely improve performance.

The barrier option only covers writes of journal entries, which are only one part of the I/O necessary to securely store a file. fsync is much stronger: it guarantees that file metadata (like the inode and directory entries) and file data (the actual contents written to the file) have been stored.
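To make that distinction concrete, here is a minimal Go sketch of the kind of work fsync has to cover when persisting an uploaded piece: the file itself is synced so its data and inode reach the disk, and the parent directory is synced so the new directory entry survives a crash. This is only an illustration under POSIX-style assumptions, not the storage node’s actual upload code, and the file name is a placeholder.

```go
// Minimal sketch of a durable write; not the storage node's actual code.
package main

import (
	"log"
	"os"
	"path/filepath"
)

// writeDurably writes data to path and returns only after both the file
// contents and the directory entry referring to it should be on disk.
func writeDurably(path string, data []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	// fsync the file: flushes the file data and the file's own metadata (inode).
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	// fsync the parent directory: persists the directory entry, so the file
	// is still reachable by name after a crash.
	dir, err := os.Open(filepath.Dir(path))
	if err != nil {
		return err
	}
	defer dir.Close()
	return dir.Sync()
}

func main() {
	// "example-piece.bin" is a placeholder path for illustration.
	if err := writeDurably("example-piece.bin", []byte("piece data")); err != nil {
		log.Fatal(err)
	}
}
```

With barrier=0 the journal writes triggered by these syncs lose their ordering guarantees on a volatile write cache, but the syncs themselves still happen; that is why the mount option is not a substitute for changing the upload code path.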

Storj Inc. needs to carefully set the parameters of the network-wide redundancy techniques (like the Reed-Solomon parameters) to balance between safety and costs. If node software can guarantee even a slightly higher level of safety, then the costs of network-wide redundancy may go down. My personal belief is that the change suggested in this thread should have a negligible effect on this trade-off specifically on ext4, but I don’t know about other setups. Besides, it is Storj Inc. that has to make this decision, likely also based on facts that I have no access to.
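For intuition about that trade-off: with a (k, n) Reed-Solomon erasure code, a segment can be rebuilt from any k of its n pieces, so the network pays an n/k storage expansion in exchange for tolerating up to n - k lost pieces, and the more reliable individual nodes are, the less headroom that margin needs to provide. The numbers in the sketch below are illustrative placeholders, not Storj’s actual parameters.

```go
// Illustrative arithmetic only; k and n are placeholder values,
// not Storj's actual Reed-Solomon settings.
package main

import "fmt"

func main() {
	k := 29 // pieces needed to reconstruct a segment (placeholder)
	n := 80 // total pieces stored across nodes (placeholder)

	expansion := float64(n) / float64(k) // storage overhead paid for redundancy
	tolerated := n - k                   // pieces that may be lost before the segment is unrecoverable

	fmt.Printf("expansion factor: %.2fx, tolerated piece losses: %d\n", expansion, tolerated)
}
```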

The amount of audited data is negligible; see e.g. BrightSilence’s post. Though I don’t know the exact answer to your question, sorry, nor to the next one.

Good thing your link to my post pinged me, because I do. @aad Audits don’t result in action at the piece level. They are just used to determine whether a node is reliably storing data as a whole. Since so little data is ever audited, repairing or removing those pieces would be a waste of time, as there could be many more missing or corrupt pieces on that node that are never audited. Instead, it just looks at a threshold. If your node crosses that threshold, it’s bye-bye node. For nodes that have minor loss and never cross that threshold, there is plenty of redundancy on the network to cover for it.

4 Likes