How I Learned to Stop Worrying and Love fsync disabled

I am pretty sure I read about that in the context of some cheap QLC drives, not as a software add-on like PrimoCache but as a native feature. Maybe I am misremembering and it was some pseudo-SLC cache bs.

Ohh, ohh! Am I allowed to say bs here? :smile:

Edit: Yeah, I was mixing it up with pseudo-SLC cache, DRAM cache, maybe also with Direct Storage that does not need data in RAM. Either way, it is not a thing :wink:

Bovine biology seems to be acceptable.

No, you need to have 40 in the last 1000 audits (not necessarily sequential).

So, for any moment 1000 audits after the event of data loss this means the chance of being DQ-ed is: 1-binomcdf(0.02,1000,39) (it’s even a bit higher, … because those having 40 audit failures before then have already been filtered out).

See: binomial distribution calculator - Wolfram|Alpha
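
For anyone who would rather run the numbers locally than in Wolfram|Alpha, here is a minimal sketch of that calculation. It reads the 0.02 as a per-audit failure probability and treats audits as independent, both of which are assumptions of this simple model and not a description of how the reputation score is actually computed. The function names are made up for illustration.

```python
from math import comb


def binom_cdf(p: float, n: int, k: int) -> float:
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def chance_of_40_failures(p_fail: float = 0.02, audits: int = 1000) -> float:
    """Chance of at least 40 failed audits among `audits` independent audits.

    Assumption: every audit fails independently with probability p_fail.
    """
    return 1 - binom_cdf(p_fail, audits, 39)


if __name__ == "__main__":
    print(f"{chance_of_40_failures():.2e}")
```

Changing `p_fail` shows how quickly the tail probability grows with the share of lost data.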

That’s incorrect too. 40 sequential failures get you from a perfect score to DQ. If there are successful audits in between, it’ll take more than 40. This is because it’s not just an average of the last 1000 scores, but rather a weighted score adjustment with a bias toward more recent audits. Just go read the original topic I linked; all the formulas and lots of scripts and sheets are there to play with this stuff yourself. Then you don’t have to make wrong assumptions.

I see, quite complicated to create a direct formula :see_no_evil:

This is what a sync-less upload will probably already do now on Linux. The node will just wait for sort-of acknowledgement from the OS that the OS will take care of the data. Like, closing a file does not force a disk write of the kernel buffers related to that file.

As such, we’re in the microseconds range (how much does a kernel context switch take?) and there’s no point in optimizing it further.
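
To make the two write paths concrete, here is a rough sketch in Python. It is not the node's actual code; `write_piece` and its `sync` flag are hypothetical names. The only real calls are `file.flush()` and `os.fsync()`, which push data from the process to the kernel and ask the kernel to flush it to the storage device, respectively.

```python
import os
import tempfile


def write_piece(path: str, data: bytes, sync: bool) -> None:
    """Write data to path; only wait for the storage device when sync=True."""
    with open(path, "wb") as f:
        f.write(data)
        if sync:
            f.flush()             # userspace buffer -> kernel page cache
            os.fsync(f.fileno())  # block until the kernel has flushed the file
    # With sync=False, close() returns once the kernel holds the data.
    # The kernel writes it out later on its own schedule.


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        write_piece(os.path.join(d, "fast.bin"), b"payload", sync=False)
        write_piece(os.path.join(d, "safe.bin"), b"payload", sync=True)
```

In both cases the file is readable right after the call returns; the difference is only whether the data is guaranteed to survive a power loss at that moment.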

Note that this 5 seconds is just for journal commits in ext4, and it’s not a hard limit—this is just the time the OS waits for more writes before starting the journal update. On an I/O-starved system the journal commit can still take tens of seconds or more.

(*) shows defaults

data=ordered	(*)	All data are forced directly out to the main file
			system prior to its metadata being committed to the
			journal.
commit=nrsec	(*)	Ext4 can be told to sync all its data and metadata
			every 'nrsec' seconds. The default value is 5 seconds.
			This means that if you lose your power, you will lose
			as much as the latest 5 seconds of work (your
			filesystem will not be damaged though, thanks to the
			journaling).  This default value (or any low value)
			will hurt performance, but it's good for data-safety.
			Setting it to 0 will have the same effect as leaving
			it at the default (5 seconds).
			Setting it to very large values will improve
			performance.
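
To check what a given mount is using, one option is to look at its line in `/proc/mounts`. The sketch below parses such a line and falls back to the defaults quoted above. It assumes that an option missing from the line means the default applies, and the sample device and mount point are made up.

```python
def ext4_journal_settings(mounts_line: str) -> dict:
    """Extract data mode and commit interval from one /proc/mounts-style line."""
    _device, mountpoint, fstype, options = mounts_line.split()[:4]
    if fstype != "ext4":
        raise ValueError(f"not an ext4 mount: {fstype}")
    # "rw,noatime,commit=60" -> {"rw": "", "noatime": "", "commit": "60"}
    opts = dict(o.partition("=")[::2] for o in options.split(","))
    # Per the quoted documentation, commit=0 behaves like the default of 5 seconds.
    commit = int(opts.get("commit", 0)) or 5
    return {
        "mountpoint": mountpoint,
        "data": opts.get("data", "ordered"),
        "commit_seconds": commit,
    }


if __name__ == "__main__":
    sample = "/dev/sda1 /mnt/storj ext4 rw,noatime,commit=60 0 0"  # hypothetical
    print(ext4_journal_settings(sample))
```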

You are right, so data as well. Forgot that the default is data=ordered.

Still, it may take more than 5 seconds for all these writes to physically hit the drive.

I believe this is called Host Memory Buffer, which was introduced in the NVMe 1.2 spec. It’s definitely a thing!

Thank you for that, I’ll have a read :slight_smile:

I was thinking about the acronym HBM, but then thought, naah that comes to my mind because of HBM2 GPUs.

Holy cow, it is HMB!

Debian also supports it and uses 5% by default. It is part of the NVMe protocol, so SATA drives are not supported.

Writing uploads to disk without fsync is not the same as completely buffering in memory. In the first case, a lost race would cause a file to be written to disk (cache) and then deleted. Even if this happens really quickly there will be IOPS. If the upload is buffered in memory and simply freed up again in case the race was lost, we have zero IOPS.
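
A minimal sketch of the in-memory variant, to show where the zero IOPS come from. `BufferedUpload` is a hypothetical helper, not something from the node's code base: the upload lives in RAM until the race is decided, and only a won race ever touches the filesystem.

```python
import io
import os
import tempfile


class BufferedUpload:
    """Hypothetical helper: hold an upload in RAM until the race is decided."""

    def __init__(self) -> None:
        self._buf = io.BytesIO()

    def write(self, chunk: bytes) -> None:
        self._buf.write(chunk)  # memory only, no disk I/O

    def commit(self, path: str) -> None:
        """Race won: this is the first and only time the disk is touched."""
        with open(path, "wb") as f:
            f.write(self._buf.getvalue())
        self._buf.close()

    def cancel(self) -> None:
        """Race lost: free the memory, no file was ever created."""
        self._buf.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        lost = BufferedUpload()
        lost.write(b"piece data")
        lost.cancel()
        print(os.listdir(d))
```

The trade-off is memory use: every in-flight upload occupies RAM until it is either committed or cancelled.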

There is an algorithm for how the reputation is calculated. It’s not easy to explain the model in simple words, sorry.

Only by using an SSD in tiered storage:
https://answers.microsoft.com/en-us/windows/forum/all/windows-11-and-storage-spaces-tiered-storage/ebd694ad-28ff-4d1b-89db-a9752a177571