How I Learned to Stop Worrying and Love fsync disabled

There seems to be a lot of confusion when it comes to fsync and the new default.
Let us take a look at what fsync is and why you don’t have to worry about it.

Modern SSDs and operating systems have tons of caches: read cache, write cache, cache in RAM, cache on SSDs, pseudo-SLC cache, you name it. There are even SSDs that use RAM provided by the OS as a write cache.

These caches are all tricks to speed up your system!

Some of them are even volatile. Take your RAM as an example. It is fast as hell, way faster than your SSD, but as soon as your power goes out, all data is gone.

When we talk about fsync, we are talking about writes only.
For writes on your PC, there are two options for how a write can happen, sync or async.
If an application thinks its write is really important (like a DB, which should always be atomic), it can ask for the write to be done synchronously. That is what fsync is for. Only when the data (and metadata) is completely written to the disk, so it is safe even in case of a crash, will the OS report that the write is done. The good thing about fsync = enabled is that you can crash and not lose data. The negative thing is that it is very slow. A normal consumer 2.5" SSD can easily do async writes at 500MB/s but will sync write at 30MB/s.
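To make the difference concrete, here is a minimal sketch of the two kinds of writes in Python, assuming a POSIX-like OS (the file names are made up):

```python
import os

# Async write: the OS accepts the data into its page cache and reports
# success long before it necessarily reaches the flash cells.
with open("piece_async.bin", "wb") as f:
    f.write(b"some piece data")
    # no fsync here: the OS flushes this out whenever it feels like it

# Sync write: fsync blocks until the kernel has pushed the data (and the
# needed metadata) all the way to the device.
with open("piece_sync.bin", "wb") as f:
    f.write(b"some piece data")
    f.flush()              # move the data from Python's buffer to the OS
    os.fsync(f.fileno())   # block until the device reports it as durable
```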

For async writes, there are tons of caching mechanisms, and the OS will say “Yeah, data is transferred” even though it may still just be sitting in RAM at that moment.

There are enterprise SSDs that come with capacitors. The neat thing is, thanks to these capacitors, the SSD can lie to the OS and say “Sure bud, all done” while the data is still in flight. When the power is cut, the capacitors provide enough power for the SSD to write all the cached data down to flash. Currently, enterprise SSDs with capacitors (and Intel Optane) are the only option for fast sync writes! Everything else will be dead slow!

Now for STORJ data, we know that we can lose 4% of data before we get disqualified.
Let’s assume you have 1TB of clean data already. Now your node has 100Mbit/s ingress (would be nice, wouldn’t it :slight_smile: ). Your PC crashes. Your last 5 seconds of writes are gone (technically only true for ZFS, but still a good number). 100Mbit/s is 12.5MB/s. Multiplied by 5 seconds = roughly 63MB of loss.
Your 1TB of data = 1,048,576MB. So now about 0.006% of your data is corrupt.
That is still far away from 4%.

4% would be 41,943MB. With a loss of 63MB per crash, you would need about 665 crashes to reach that 4%.
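If you want to redo that arithmetic yourself, here is a quick sketch using the same assumptions as above (100 Mbit/s ingress, 5 seconds lost per crash, 1 TB of clean data):

```python
ingress_mbit_s = 100
loss_window_s = 5
clean_data_mb = 1 * 1024 * 1024            # 1 TB as used above: 1,048,576 MB

loss_per_crash_mb = ingress_mbit_s / 8 * loss_window_s   # 62.5 MB (~63 MB)
loss_percent = loss_per_crash_mb / clean_data_mb * 100   # ~0.006 %

dq_threshold_mb = clean_data_mb * 0.04                   # ~41,943 MB
crashes_to_dq = dq_threshold_mb / loss_per_crash_mb      # ~671 (rounding up to 63 MB per crash gives ~665)

print(f"{loss_per_crash_mb:.1f} MB per crash = {loss_percent:.4f} % of stored data")
print(f"crashes needed to lose 4 %: {crashes_to_dq:.0f}")
```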

Because it comes up a lot: NO, data loss does not add up! Well, it does add up, but so does your new, fresh, good data. So you don’t have to worry that your node “accumulates errors” over time.

But let’s say you start fresh, without that 1TB of clean data. One day has 24h = 86,400 seconds. Assuming you lose 5 seconds per crash, 86,400 seconds is 17,280 five-second intervals. 4% of these intervals would be 691.2.

So assuming you get a constant ingress, no matter how high that number is, and assuming you lose 5 seconds of in-flight data at every crash,
as long as you stay under 691.2 crashes per day, you are fine!
If you are even remotely close to that number, you have other problems than fsync :smile:
I would even argue that if you have more than one crash a day, something is extremely wrong with your system!
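The same back-of-the-envelope check for the fresh-node case, again assuming 5 seconds of in-flight data lost per crash and the 4% limit:

```python
seconds_per_day = 24 * 60 * 60          # 86,400 seconds
loss_window_s = 5                       # assumed in-flight data lost per crash
max_lost_fraction = 0.04                # the 4 % loss limit used above

intervals_per_day = seconds_per_day / loss_window_s        # 17,280 intervals
max_crashes_per_day = intervals_per_day * max_lost_fraction
print(max_crashes_per_day)              # 691.2
```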

TLDR: Don’t worry!

Update:

  • BrightSilence is right: to make it statistically impossible to get disqualified, you should stay under 2%, not 4%. Don’t go over 345.6 crashes a day :smile:
  • ZFS even profits more, because now there is less fragmentation caused by STORJ. Writes stay in RAM in transaction groups (instead of going through the ZIL) and are only written to the disk(s) once.
21 Likes

Well written, sums it up quite nicely :slight_smile:

1 Like

Great article!

Why not go one step further and store all uploads in memory, acknowledge the upload, and only write to disk once we know we won the race?

This way we avoid thrashing when the success rates are low. I have seen my node losing 80% of the races, hardly committing any new pieces, and yet at 100% IOPS utilization (writing incoming pieces to disk and then deleting them).

Brgds, Marc

What you saw were sync writes in action. While your disk was still trying to write the incoming data, a faster disk somewhere out there had already written the data and told the client “all done”.

Would have felt really really good to just type “Source?” :rofl:

The default for EXT4 is also 5 seconds.

I agree with the rest of the reply, sync isn’t something you need to worry about if there is only the occasional abrupt shutdown.

1 Like

This is nitpicky, but it’s not really 4%. Only less than 2% guarantees no DQ, and more than 4% guarantees DQ. Anything in between is: pray to your chosen higher power and hope the node survives until you have enough new data to bring the loss below 2% before a DQ happens.

Does nothing to discount your point though. You’re absolutely right and I think the default setting will be perfectly fine for anyone.

4 Likes

Explain? Auditing is a random process, so only if you have a very small (young) node could the sample be skewed. So, in order to reach <96%, you must have lost at least 4%, I would say.
Besides, by the time you reach that point, some of the lost data has probably already been deleted. Especially since deletes are skewed towards younger data (daily backups, for example), you can probably even lose a bit more, depending on the time frame over which the audits are calculated.

I hope for everyone?

Besides, I also think it will relieve the problems with SMR drives and probably also with file systems like BTRFS a bit. Because the data is being batched, metadata writes are also batched and data can probably be written more sequentially, which also lowers fragmentation. So, I only see us all benefiting from this.
Except… those not respecting “use what you have” who bought some SSDs, or those who spent a lot of time creating a setup with an SSD write cache, which now has probably lost much of its added value.

1 Like

I really have no good way to explain or summarize the whole thread I’m quoting below, where I helped tune the audit scoring system. But the short answer is that this was the intended goal and the system was tuned to it. Here is a test I did, and Storj eventually decided to use the parameters from that test.

1 Like

But this all depends on the underlying node size. So if you have an infinite node size and perfectly random audits, you will be DQed at 4%. But if your node has just started, you can be DQed with even less than 1% data loss. It’s all in the binomial statistics.

This is incorrect. Feel free to read the linked topic if you want to know more. But the short version is that only the chance that each audit will fail influences the chance of being disqualified. And this chance is equal to the percentage of lost data. Node size is not a factor.

In fact, kind of the opposite is true: as bigger nodes get more audits, they get more chances to drop below the threshold.

1 Like

If you are only storing 1 piece and you get an audit for that piece and it fails, that’s a 100% failure. It’s percentages; it doesn’t need any complex maths to solve this.

The more data you have, the more data could be damaged before you get DQ.

2 Likes

Just checking: is it still true that the change which went in for the repair worker also impacts disqualification for a node, in addition to the audit worker?

It’s my understanding that repair traffic will also disqualify a node if it fails to return the piece within a 5-minute retry window?

CP

Well, it all depends on how the calculation is done. If the calculation is done over the audits of the last X days, then a bigger node might save you from DQ by giving a more stable percentage.

If it’s about the last Y audits, then it doesn’t matter what the node size is.

The quickest DQ I had was one where it took only three audits to reach that point, just because the node was only one month old and had only 50 audits before the problems started.

But perhaps you could explain in your own words how the audit process actually works: does it work with time windows, with just the last so many audits, or are all audits taken into account?

Is this 4% audit failure in the last 30 days? Or is it calculated over the entire life of the node, i.e. do failed audits keep adding up constantly?

You are right, so only 300 crashes a day :slight_smile:

For everyone under 300 crashes a day, yes.

1 Like

Correct, GET_REPAIR also gives you a failed audit if no data comes back: New audit scoring is live

That topic also seems to imply that the last 1000 audits are taken into account. That takes quite some time to reach, so before that, how is it calculated?

Actually the math boils down to:
For a sequence of 0s (failures) and 1s (successes), with a chance of Y (the data loss rate) of getting a 0: what is the chance of getting at least 40 0s in a sequence of 1000?

Even with 2% data loss, this means that in the next 1000 audits you have a chance of 4/100000 to get a DQ, accumulating over time (although, due to deletions and growth of the node in the meantime, the actual data loss rate may decrease; or increase in case of new events).
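For what it’s worth, the simplified model described here (at least 40 failed audits out of the last 1000, each failing with probability equal to the data loss rate) boils down to a binomial tail. This is only a sketch of that simplified model, not of the real scoring formula, which works differently (see the linked topic):

```python
from math import comb

p_fail = 0.02       # assumed per-audit failure chance (= data loss rate)
n_audits = 1000     # audit window assumed in this simplified model
threshold = 40      # failed audits needed for DQ in this simplified model

# P(X >= threshold) for X ~ Binomial(n_audits, p_fail)
p_dq = sum(comb(n_audits, k) * p_fail**k * (1 - p_fail)**(n_audits - k)
           for k in range(threshold, n_audits + 1))
print(p_dq)
```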

Can anyone help me remember what the Windows “we cache SSD stuff in RAM” feature is called again? And is there an equivalent in Linux?

I googled it and couldn’t really find much. Are you sure that is “a thing”?

All I could find were references to Primocache…

1 Like

No. You failed probability calculations class :slight_smile:
If the chance of an occurrence is 2%, and you need that occurrence 40 times in a row, the chance is not 4/100000.

Read the thread @BrightSilence has posted. It explains it well.

3 Likes