Copying node data is abysmally slow

The losses accumulate. In any case, this should either be a setting or sync should always be enabled. It looks weird to me that it would be acceptable to lose data to save a bit of latency. Maybe it is needed for the single-drive nodes (especially SMR), I don’t know. I know that I would not want sync to be disabled on my node. If I have a problem with performance at some point, maybe I will upgrade to a PCIe SLOG instead of turning off sync.

I have had one single unexpected shutdown of my TrueNAS in the last 3 years, because my UPS went belly up and I don’t use a redundant PSU. So yeah, my 25MB losses would accumulate to 50MB losses after 6 years. New, fresh data to bring up my score also accumulates.

To me it looks weird to have something enabled that makes sense for VMs and DBs but not for this kind of data. That is like having sync enabled for my SMB share. I just don’t gain anything by enabling sync for a workload that does not need sync.

SMR is out of the question for ZFS.

You won’t run into performance problems. Even an old 500GB 2.5" laptop drive as SLOG would be sufficient for STORJ :slight_smile:

The only thing you do is wear out your SLOG device and add latency. Not worth it in my opinion, with next to nothing (25MB) to gain.

This is not a factor in node selection. Of course if you lose races, chances are that you will lose races again. But it’s not a compounding thing.

OK, so I read that wrong or remember it wrong about some kind of cache of successful nodes.

Depends on the data though. The way Storj works, it looks to me more like a distributed database than just a file server.

The difference (at least for me) is that if I am copying a file to an SMB share and the server crashes, after it restarts I just check if the file is there and, if not, I copy it again. There is no equivalent for a node (“hey, satellite, could you check if the pieces uploaded in the last 10 minutes are still there? Just don’t count them as audit failures”).

Also, don’t programs (like MS Word) sync the file when saving? I have always assumed so, but never checked.

Though I would like it if it was possible to check whether my node has all the files the satellite thinks it has (just for my own peace of mind; those results would be inadmissible as audits). Kind of like a “scrub” for a node.

Either having sync on affects the performance or it doesn’t. If it doesn’t, there is no reason to turn it off, because it is a (slightly) safer option. If it does affect the performance, then it makes sense for some node operators to disable it, but if it does not affect the performance on my node, then I might as well keep it on.

My 50TB array gets hit pretty hard, by the time I get to 120TB the fancy bits of ZFS will be a necessity.

How many nodes (with different /24 subnets) are on that array?

8 /24 IPs for 15 nodes (some of them are vetting on existing IPs so I can easily deploy more nodes later).

You don’t even have to check it; the transfer will show as failed on your client, because it did not get an ack.

No, because as explained above, these applications can handle async. They will just wait for the sync ack or assume it failed. Imagine you copy a file over to an SMB share with Windows. Explorer will wait until it gets the response from TrueNAS. If TrueNAS crashes in the meantime, Explorer will show an error. Word will probably cache the save to AppData or something. If the save to TrueNAS fails, it will probably show you the “save as” dialog to select a new target destination. Of course without losing any data.

I agree that STORJ can’t handle it. It will just get you a failed audit. But again, who cares if you lose 25MB of STORJ data?

Turning on sync will always have a negative impact on performance. Even 4 Intel Optane drives in RAID 0 as SLOG will be slower than plain async. For everyone who understands how async, sync, SLOG and the ZIL work, it is clear why sync can never perform as well as async.
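
If you want to see the gap for yourself, here is a rough Python sketch (nothing to do with the actual node code, just an illustration of what an fsync after every write costs). The test path and sizes are made up; point it at the pool you care about:

```python
# Crude illustration of the latency cost of sync writes:
# time N small writes, once without and once with an fsync() after each write.
import os
import time

TEST_FILE = "/tmp/sync_test.bin"   # hypothetical path, change to your pool
CHUNK = b"\0" * 64 * 1024          # 64 KiB per write
ROUNDS = 200

def timed_writes(do_fsync: bool) -> float:
    fd = os.open(TEST_FILE, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    start = time.perf_counter()
    for _ in range(ROUNDS):
        os.write(fd, CHUNK)
        if do_fsync:
            os.fsync(fd)           # wait until the data is on stable storage
    elapsed = time.perf_counter() - start
    os.close(fd)
    os.unlink(TEST_FILE)
    return elapsed

print(f"buffered writes:   {timed_writes(False) * 1000:.1f} ms")
print(f"fsync each write:  {timed_writes(True) * 1000:.1f} ms")
```

The second number will be worse, SLOG or not; the only question is by how much.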

That sounds like a sync write to me (making sure that the data is actually written before doing something else). Async, as I understand it, would be writing and doing something else without waiting for the ack. Maybe I am wrong, I have never analyzed this in extensive detail.

I think we are misunderstanding each other.

The way Storj works now (as I understand it) is that my node gets the piece, writes it to the drive, syncs it to get a confirmation that it is actually written and reports it to the customer and/or satellite.

Turning off sync for Storj would mean that it writes the data, does not wait for anything and reports success. There is no way for the node to report 5 or however many seconds later (after doing async write) that now the data is actually written. I don’t think Storj would change the protocol enough to allow for this late reporting.
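
In pseudo-Python, the difference I mean is roughly this (the function names are made up, this is not the actual storagenode code):

```python
import os

def ack_upload():
    """Hypothetical stand-in for reporting success to the uplink/satellite."""
    pass

def store_piece_sync(fd: int, piece: bytes) -> None:
    os.write(fd, piece)
    os.fsync(fd)         # data is on stable storage before we answer
    ack_upload()         # "I really have it"

def store_piece_async(fd: int, piece: bytes) -> None:
    os.write(fd, piece)  # data may still only be sitting in RAM
    ack_upload()         # "I have it" - a lie if the box crashes right now
```

There is no third step where the node could come back later and say “now it is really written”.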

The question is - does it affect the performance of the node in a perceptible way? An increase of latency of 0.5ms is still an increase, but would it make me lose many more races?

Me neither to be honest, that is why I like these thought experiments :slight_smile:
To be fair, there is a small chance of losing data over SMB. And SMB can even ask for sync.
So Windows will wait until TrueNAS sends the ack. But TrueNAS will send that ack when the data is only in RAM and not yet on disk. So if the system crashes in exactly that timeframe, there could be data loss. But there are ways to prevent this. With a UPS, TrueNAS will be able to write it to disk without any data loss, if there is just a power or network cut and not a system panic.
While this may sound risky, a lot of systems nowadays use some kind of async. Often this is called a write cache. QNAP, Synology, Windows, sometimes even SSD controllers; some SSDs even add a RAM cache on top, like the Samsung RAPID mode. For Windows it is even enabled by default. This is nothing to worry about in most workloads.

But if you wanna be 100% sure you have consistent data, you need ECC, sync, and a power-loss-protected SSD as SLOG. And even PLP can have different implementations.

Turning off sync for Storj would mean that it writes the data into RAM, does not wait for anything and reports success (aka lies). If there is a crash, there is no way for the node to report exactly which 5 seconds of data it has lost. There is no way, because even TrueNAS does not know.
So even if STORJ implemented late reporting, this would not be possible.

So what exactly happens when I lie to the node?
Let’s assume we get very high 40Mbit/s ingress. That is 5MB/s.
5 s × 5 MB/s is 25MB of data loss. So if we have a crash, we lose 25MB. And we lied to STORJ to get this. If we wanna survive the audits, our pool should hold at least 1.3GB so that we stay below the very safe 2% corrupt-data threshold. And we better get another 1.3GB of ingress before the next crash.
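
Just to write the arithmetic down (plain Python, numbers straight from above):

```python
ingress_mbit_s = 40                        # assumed worst-case ingress
ingress_mb_s = ingress_mbit_s / 8          # 40 Mbit/s = 5 MB/s
async_window_s = 5                         # roughly 5 s of writes are at risk
data_at_risk_mb = ingress_mb_s * async_window_s   # 5 MB/s * 5 s = 25 MB

audit_threshold = 0.02                     # the "very safe" 2% corruption budget
min_stored_mb = data_at_risk_mb / audit_threshold # 1250 MB ~= 1.3 GB

print(data_at_risk_mb, min_stored_mb)      # 25.0 1250.0
```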

My guess is it depends, but probably not. The better question would be: what do you expect to gain from it? You wanna wear out an expensive SSD to prevent a 25MB data loss?

I think NFS (not SMB) by default not only waits for a sync after closing a file, but for large files also syncs after some amount of data has been written. It is possible to disable this behavior.
Even if TrueNAS lies about having written the data, it probably is not a problem, because if it crashes right after I save the document (and get a fake ack) I can just check it after the file server reboots.

A write cache is not exactly the same as async. Normally, the system would write to the drive, then issue a flush command which tells the drive to actually write the data. Some drives lie about having written it; that’s why RAID controllers have the option to disable the drive write cache, though I guess some drives could lie about disabling it as well.

While the OS knows the difference between data and metadata, the drive does not, and a drive that lies about having actually written the data can result in a broken filesystem after a crash.

Since Storj mostly writes new files and does not do a lot of modifications/replacements, by the time the SSD wears out, I’ll have most of the TBW rating as stored data :slight_smile:

But mostly it may be because of purism on my part. I think that it is incorrect for the node to report success without having actually written the data. This introduces “acceptable data loss”, and if my node fails an audit, I will be left wondering whether this is because of a crash plus async (and therefore normal) or an indication of a real problem. I would search the log files, find that the file was uploaded and then just disappeared.

This is one of the reasons why I would like to be able to do an “internal audit” - to check if my node has all the files it should have (locally, not informing the satellite about the result).

Async is good because it increases performance and most of the time the file can be rewritten if there was a crash.

NFS seems to be sync by default. Sync waits until everything is written to disk before sending the ack.

No, but it behaves similarly and follows a similar logic. Both have the problem that we act like the data is on disk while it is not (yet) on the disk.

I don’t believe this is true.

RAID controllers have the option to disable the drive write cache because you lose all that data when the controller goes down. That is why some RAID controllers have batteries; that way they can hold the data in the cache for up to 72 hours. It is like RAM for the RAID controller, and you keep the volatile RAM data alive by adding a battery. If you get back online during that timeframe, the data can be written and will not be lost. PLP for SSDs works similarly: if data is in the cache while the power goes down, the capacitors provide enough power to empty the cache into the drive’s flash.

And that is why there are no lying drives, because that could seriously put a pool at risk and would be a way bigger backlash than the WD SMR debacle :slight_smile: That is btw also the reason why you apparently risk losing the pool if you don’t have PLP on the SLOG.

What drive(s) do you use for SLOG?

This one I totally get!

OK, apparently this was a bit of a broken telephone game with that information - somebody told me about it and it made sense, but now when I was looking for it I found this:
https://queue.acm.org/detail.cfm?id=2367378

Luckily, SATA (serial ATA) has a new definition called NCQ (Native Command Queueing) that has a bit in the write command that tells the drive if it should report completion when media has been written or when cache has been hit. If the driver correctly sets this bit, then the disk will display the correct behavior.

In the real world, many of the drives targeted to the desktop market do not implement the NCQ specification. To ensure reliability, the system must either disable the write cache on the disk or issue a cache-flush request after every metadata update, log update (for journaling file systems), or fsync system call.

So, as I understand now, it used to be that the drive reported “write complete” after actually writing the data to disk. Then write cache came along and drives started reporting “write complete” after writing to the cache. Later, TCQ (for SCSI) solved that problem and NCQ for SATA was supposed to do the same - introduce “cached write” and “uncached write”, but some drives would treat both the same and require a separate “cache flush” request to actually write the data.
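
From the application side, by the way, you don’t have to care whether the drive honours that bit or needs a full cache flush; you just ask for durability and let the kernel and the drive sort it out. A minimal Python sketch of the two usual ways to ask (hypothetical file name, Linux-style flags):

```python
import os

PATH = "/tmp/important.dat"   # hypothetical file
DATA = b"some data"

# Option 1: write, then fsync - don't continue until the data is durable
fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, DATA)
os.fsync(fd)                  # kernel issues a FUA write or a cache flush as needed
os.close(fd)

# Option 2: open with O_DSYNC so every write is synchronous
fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
os.write(fd, DATA)            # returns only once the data is durable
os.close(fd)
```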

Intel SSDSC2KB960G8. I used to use a pair of Kingston desktop drives, but one of them failed (I used them for L2ARC too, a bit of a bad idea) and I had the Intel drives from when I was mining Chia, so I just use them now. That’s why I use a 960GB SSD for a 4GB SLOG partition.
Because it is possible to remove a SLOG at any time (unlike data drives), if I need those SSDs somewhere else, I can just remove them.

As I have mounted those SSDs inside the server, if I wanted to use a non-PLP drive I could always build a mini-UPS for it :slight_smile:
Power loss is not my biggest concern though (the server has dual power supplies, connected to separate UPSs, which are connected to a much bigger UPS). My bigger concern is something like a kernel panic (inside the VM or in the host) or a reboot, which would lose the data in the OS cache; a non-PLP SSD would not lose its own cache in that case, since it never loses power.

That article is from 2012. According to my sources, there haven’t been any “lying HDDs” since 2014. Which makes sense, otherwise we would have a lot of failed pools (corrupt ZIL) :slight_smile:

SSDs on the other hand are only safe to use as SLOG if they have PLP (or don’t use any cache at all; apparently some old Intel Optane drives wrote directly to flash without a controller cache).

Not sure what you mean by that. A UPS will not prevent you from losing the pool if the system crashes and the SLOG does not have PLP. That is why your Intel SSD DC S4510 is a perfect drive for SLOG but the “Kingston desktop drives” are a bad idea.

It really is a shame that Intel sold their Optane business :frowning: Hopefully Solidigm SK hynix will still offer great drives for reasonable prices.

mini-UPS - to keep the drive (and only the drive) powered on in case the server loses power. If the OS crashes or reboots, the SSD does not lose power and does not need PLP; it only needs that for power loss. If I can keep power on the SSD for some time, it should write the data to flash even if the rest of the server is off.

It’s probably true. I was told about the lying drives some time ago; it made sense (just like some Samsung SSDs say they support queued TRIM, but in practice fail when told to do it) and I did not really check on it. Just disable the drive write cache and be done with it (the server usually has way more cache than the drive anyway).

Apparently some SSDs lie about this; there would be no need for PLP if the SSD actually wrote the data to flash after being issued a cache flush command.

Haha nice idea. Is that some kind of 12V SATA-cable UPS?

That was the post I had in the back of my brain for the 2014 claim :slight_smile:

Most desktop SSDs only need 5V. So basically just put something like a USB power bank between the server and the SSD (just make sure the power bank outputs a stable 5V), or use a 12V->13.8V DC/DC converter, a small 12V Pb battery and a 12V->5V converter. Hopefully the SSD gets around to writing the data before the battery discharges.

So yeah, the problem is that devices lie about actually writing data. If a device does not lie about this, it does not need PLP to be safe for ZFS, as ZFS waits for the confirmation.

You know what would be cool for SLOG? A battery-backed RAM drive, like the Gigabyte i-RAM; it does not need to be large and DRAM does not wear out.

This is basically what the Radian RMS-200 is, but it is flash-backed with supercaps instead of a battery.

Haha nice! That is the most MacGyver thing I have heard in a long time :slight_smile: But I would not trust those power banks.

Probably even a Dell RAID controller with a BBU could be very cool as a SLOG. Unfortunately I have not heard of a lot of people trying that out yet.

Jesus, this thing is fast :slight_smile: Are the batteries replaceable? Does it somehow report when they go bad?

Aren’t (decent) SSDs already equipped with a capacitor that would store enough energy to finish the writes? I recall that it was a thing a long time ago already, e.g. as in this article.