How I Learned to Stop Worrying and Love fsync disabled

IsThisOn · May 14, 2024, 11:13am

I am pretty sure I have read about that in the context of some cheap QLC drives. Not as a software add-on like PrimoCache, but native. Maybe I am misremembering and it was some pseudo-SLC cache bs.

Ohh, ohh! Am I allowed to say bs here?

Edit: Yeah, I was mixing it up with pseudo-SLC cache, DRAM cache, and maybe also DirectStorage, which does not need the data in RAM. Either way, it is not a thing.

LrrrAc · May 14, 2024, 3:26pm

Bovine biology seems to be acceptable.

JWvdV · May 14, 2024, 4:08pm

No, you need to have 40 failures in the last 1000 audits (not necessarily sequential).

So, for any moment 1000 audits after the event of data loss this means the chance of being DQ-ed is: 1-binomcdf(0.02,1000,39) (it’s even a bit higher, … because those having 40 audit failures before then have already been filtered out).

See: binomial distribution calculator - Wolfram|Alpha
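For anyone who wants to evaluate that expression without Wolfram|Alpha, here is a minimal Python sketch. It assumes, as the estimate above does, that every audit fails independently with probability 0.02 and that 40 failures within a window of 1000 audits is the trigger. The function name `prob_at_least` is made up for illustration.

```python
from math import comb

def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), i.e. 1 - binomcdf(p, n, k - 1)."""
    # Sum the lower tail P(X <= k - 1) and take the complement.
    lower_tail = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - lower_tail

if __name__ == "__main__":
    # Chance of seeing 40 or more failures in 1000 audits at a 2% failure rate.
    print(prob_at_least(40, 1000, 0.02))
```

This only evaluates the simple "count failures in the last 1000 audits" model, nothing more.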

BrightSilence · May 14, 2024, 4:30pm

That’s incorrect too. 40 sequential failures get you from a perfect score to DQ. If there are successful ones in between, it’ll take more than 40. This is because it’s not just an average of the last 1000 scores, but rather a weighted score adjustment with a bias toward more recent audits. Just go read the original topic I linked; all the formulas and lots of scripts and sheets are there to play with this stuff yourself. Then you don’t have to make wrong assumptions.
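To get a feel for why interleaved successes change the count, here is a rough Python sketch of a recency-weighted score. It is not the satellite's actual code: the alpha/beta update rule, the forgetting factor `LAMBDA = 0.999`, and the threshold `DQ_THRESHOLD = 0.96` are assumptions picked for illustration, and `failures_until_dq` is a made-up helper name. The linked topic has the real formulas.

```python
LAMBDA = 0.999        # assumed forgetting factor (how much history is kept)
DQ_THRESHOLD = 0.96   # assumed disqualification threshold

def failures_until_dq(successes_between: int, max_audits: int = 1_000_000):
    """Count the failed audits needed to push the score below the threshold.

    Starts from a perfect score and runs `successes_between` successful
    audits after every failed one. Returns None if the score never drops
    below the threshold within roughly `max_audits` audits.
    """
    alpha = 1.0 / (1.0 - LAMBDA)  # steady state after a long run of successes
    beta = 0.0
    failures = 0
    audits = 0
    while audits < max_audits:
        # One failed audit: old history decays, the failure is added to beta.
        alpha = LAMBDA * alpha
        beta = LAMBDA * beta + 1.0
        failures += 1
        audits += 1
        if alpha / (alpha + beta) < DQ_THRESHOLD:
            return failures
        # Successful audits in between: history decays, success added to alpha.
        for _ in range(successes_between):
            alpha = LAMBDA * alpha + 1.0
            beta = LAMBDA * beta
            audits += 1
    return None

if __name__ == "__main__":
    print("consecutive failures:", failures_until_dq(0))
    print("one success after each failure:", failures_until_dq(1))
```

With these assumed parameters, consecutive failures land close to the 40 mentioned above, and any successes in between push the number up. If failures are rare enough, the score never reaches the threshold at all.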

JWvdV · May 14, 2024, 5:15pm

I see, so it’s quite complicated to create a direct formula.

Toyoo · May 14, 2024, 10:57pm

This is what a sync-less upload will probably already do now on Linux. The node will just wait for sort-of acknowledgement from the OS that the OS will take care of the data. Like, closing a file does not force a disk write of the kernel buffers related to that file.
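A small Python sketch of the difference, for illustration only (the storage node itself is not written in Python, and `write_piece` is a made-up helper):

```python
import os

def write_piece(path: str, data: bytes, sync: bool) -> None:
    """Write `data` to `path`, optionally forcing it to stable storage."""
    with open(path, "wb") as f:
        f.write(data)
        if sync:
            f.flush()             # push Python's userspace buffer to the kernel
            os.fsync(f.fileno())  # block until the kernel reports the data written
    # Without sync, close() returns as soon as the kernel has the data in
    # its page cache; the physical write happens some time later.
```

Both variants leave the same file contents behind. The only difference is whether the call waits for the drive or just for the kernel.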

As such, we’re in the microsecond range (how long does a kernel context switch take?) and there’s no point in optimizing it further.

Note that this 5 seconds is just for journal commits in ext4, and it’s not a hard limit—this is just the time the OS waits for more writes before starting the journal update. On an I/O-starved system the journal commit can still take tens of seconds or more.

Mitsos · May 14, 2024, 11:14pm

(*) shows defaults

data=ordered	(*)	All data are forced directly out to the main file
			system prior to its metadata being committed to the
			journal.

commit=nrsec	(*)	Ext4 can be told to sync all its data and metadata
			every 'nrsec' seconds. The default value is 5 seconds.
			This means that if you lose your power, you will lose
			as much as the latest 5 seconds of work (your
			filesystem will not be damaged though, thanks to the
			journaling).  This default value (or any low value)
			will hurt performance, but it's good for data-safety.
			Setting it to 0 will have the same effect as leaving
			it at the default (5 seconds).
			Setting it to very large values will improve
			performance.
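To make the `commit=` rule concrete, here is a small Python sketch that works out the effective interval from an options string of the kind you would pass to `mount -o`. The helper name `ext4_commit_interval` is made up; the only behavior it encodes is what the excerpt states (default of 5 seconds, and 0 meaning the default).

```python
DEFAULT_COMMIT_SECONDS = 5  # default according to the excerpt above

def ext4_commit_interval(mount_options: str) -> int:
    """Effective commit interval for options like "rw,relatime,commit=60"."""
    for option in mount_options.split(","):
        key, _, value = option.partition("=")
        if key.strip() == "commit" and value.strip().isdigit():
            seconds = int(value)
            # commit=0 behaves the same as leaving the default in place.
            return seconds if seconds > 0 else DEFAULT_COMMIT_SECONDS
    # No commit= option present: the default applies.
    return DEFAULT_COMMIT_SECONDS

if __name__ == "__main__":
    print(ext4_commit_interval("rw,relatime,commit=60"))
```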

Toyoo · May 15, 2024, 7:46am

You are right, so data as well. Forgot that the default is data=ordered.

Still it may take more than 5 seconds for all these writes to physically hit the drive.

pietjebell · May 15, 2024, 8:20am

I believe this is called Host Memory Buffer, which was introduced in the NVMe 1.2 spec. It’s definitely a thing!

ACarneiro · May 15, 2024, 9:02am

Thank you for that, I’ll have a read

IsThisOn · May 15, 2024, 9:23am

I was thinking about the acronym HBM, but then thought, naah that comes to my mind because of HBM2 GPUs.

Holy cow, it is HMB!

Debian also supports it and uses 5% by default. It is part of the NVMe protocol, so SATA drives are not supported.

MarcVanWijk · May 16, 2024, 2:22pm

Writing uploads to disk without fsync is not the same as completely buffering in memory. In the first case, a lost race would cause a file to be written to disk (cache) and then deleted. Even if this happens really quickly there will be IOPS. If the upload is buffered in memory and simply freed up again in case the race was lost, we have zero IOPS.
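Roughly what I mean, as a Python sketch. The class name `BufferedUpload` and its methods are made up for illustration; this is not how the storage node is implemented today.

```python
import io

class BufferedUpload:
    """Hold an incoming piece in RAM until the upload race is decided."""

    def __init__(self, final_path: str) -> None:
        self.final_path = final_path
        self._buffer = io.BytesIO()

    def write(self, chunk: bytes) -> None:
        # Incoming data only ever touches memory here.
        self._buffer.write(chunk)

    def commit(self) -> None:
        # Race won: this is the first and only time the disk is involved.
        with open(self.final_path, "wb") as f:
            f.write(self._buffer.getvalue())
        self._buffer.close()

    def cancel(self) -> None:
        # Race lost: free the memory; no file was ever created or deleted.
        self._buffer.close()
```

A lost race ends in `cancel()` and costs zero IOPS; only a won race reaches `commit()`.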

Alexey · May 17, 2024, 3:03pm

https://github.com/storj/design-docs/blob/ed8bfe8d4c66a587f5322237f06208f74e301b7a/20190909-reputation-and-node-selection.md


# Reputation and Node Selection

## Abstract

Node selection is the process wherein the set of all possible storage nodes is reduced by the satellite for uploading segments.  Node selection applies to new file uploads via an uplink, as well as repair traffic from a satellite.  The node selection process endeavors to fairly distribute upload traffic among storage nodes.  Node selection takes into consideration how new a node is, the overall performance characteristic of a storage node as characterized by its reputation score, and the IP address of each node.

## Background

The white paper section 4.15 describes a 'preferences' system used in node selection, based on reputation:

> After disqualified storage nodes have been filtered out, remaining statistics collected during audits will be used to establish a preference for better storage nodes during uploads. These statistics include performance characteristics such as throughput and latency, history of reliability and uptime, geographic location, and other desirable qualities. They will be combined into a load-balancing selection process, such that all uploads are sent to qualified nodes, with a higher likelihood of uploads to preferred nodes, but with a non-zero chance for any qualified node.  Initially, we’ll be load balancing with these preferences via a randomized scheme, such as the Power of Two Choices, which selects two options entirely at random and then chooses the more qualified between those two.
>
> On the Storj network, preferential storage node reputation is only used to select where new data will be stored, both during repair and during the upload of new files, unlike disqualifying events.  If a storage node’s preferential reputation decreases, its file pieces will not be moved or repaired to other nodes.
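The Power of Two Choices scheme mentioned in the quote can be sketched in a few lines of Python. This is only an illustration, not the satellite's code: the `scores` mapping, the node names, and the choice to draw the two candidates with replacement (so that even the lowest-scored node keeps a small chance) are assumptions.

```python
import random

def pick_node(scores: dict[str, float], rng: random.Random) -> str:
    """Draw two nodes at random and return the one with the better score."""
    nodes = list(scores)
    first = rng.choice(nodes)
    second = rng.choice(nodes)  # may be the same node as `first`
    return first if scores[first] >= scores[second] else second

if __name__ == "__main__":
    rng = random.Random()
    example_scores = {"node-a": 0.99, "node-b": 0.95, "node-c": 0.80}
    print(pick_node(example_scores, rng))
```

Better-scored nodes win more often, but no comparison against the whole node list is needed for each upload.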

The existing reputation-like system uses uptime and audit responses.  It does not currently consider geographic location, throughput, or latency.  In addition to factors which affect reputation, there are other factors in node selection.  These considerations currently include IP address, advertised available bandwidth, advertised available disk space, software version compatibility, and whether the node appeared to be online in the latest communication with the satellite.

One final factor involved in node selection is node 'vetting.'  During upload

This file has been truncated.

There is an algorithm for how the reputation is calculated. It’s not easy to explain the model in simple words, sorry.

Alexey · May 17, 2024, 3:09pm

Only by using an SSD in tiered storage:
https://answers.microsoft.com/en-us/windows/forum/all/windows-11-and-storage-spaces-tiered-storage/ebd694ad-28ff-4d1b-89db-a9752a177571

IsThisOn · May 20, 2024, 5:30pm

I think you are once again missing the point.
Or maybe not reading properly.
My question was already answered correctly.