Copying node data is abysmally slow

The losses accumulate. In any case, this should either be a setting or sync should always be enabled. It looks weird to me that it would be acceptable to lose data to save a bit of latency. Maybe it is needed for the single-drive nodes (especially SMR), I don’t know. I know that I would not want sync to be disabled on my node. If I have a problem with performance at some point, maybe I will upgrade to a PCIe SLOG instead of turning off sync.

I have had one single unexpected shutdown of my TrueNAS in the last 3 years, because my UPS went belly up and I don’t use a redundant PSU. So yeah, my 25MB losses would accumulate to 50MB losses after 6 years. New, fresh data to bring up my score also accumulates.

To me it looks weird to have something enabled that makes sense for VMs and DBs but not for this kind of data. That is like having sync enabled for my SMB share. I just don’t gain anything by enabling sync for a workload that does not need sync.

SMR is out of the question for ZFS.

You won’t run into performance problems. Even an old 500GB 2.5" laptop drive as SLOG would be sufficient for STORJ :slight_smile:

The only thing you do is wear out your SLOG device and add latency. Not worth it in my opinion, with next to nothing (25MB) to gain.

This is not a factor in node selection. Of course if you lose races, chances are that you will lose races again. But it’s not a compounding thing.

OK, so I read that wrong or remember it wrong about some kind of cache of successful nodes.

Depends on the data though. The way Storj works, it looks to me more like a distributed database than just a file server.

The difference (at least for me) is that if I am copying a file to an SMB share and the server crashes, after it restarts I just check if the file is there and, if not, I copy it again. There is no equivalent for a node (“hey, satellite, could you check if the pieces uploaded in the last 10 minutes are still there? Just don’t count them as audit failures”).

Also, don’t programs (like MS Word) sync the file when saving? I have always assumed so, but never checked.

Though I would like it if it was possible to check whether my node has all the files the satellite thinks it has (just for my own peace of mind; those results would be inadmissible as audits). Kind of like a “scrub” for a node.

Either having sync on affects the performance or it doesn’t. If it doesn’t, there is no reason to turn it off, because it is a (slightly) safer option. If it does affect the performance, then it makes sense for some node operators to disable it, but if it does not affect the performance on my node, then I might as well keep it on.

My 50TB array gets hit pretty hard, by the time I get to 120TB the fancy bits of ZFS will be a necessity.

How many nodes (with different /24 subnets) are on that array?

8 /24 IPs for 15 nodes (some of them are vetting on existing IPs so I can easily deploy more nodes later).

You don’t even have to check it; the transfer will show as failed on your client, because it did not get an ack.

No, because as explained above, these applications can handle async. They will just wait for the sync ack or assume it failed. Imagine you copy a file over to an SMB share with Windows. Explorer will wait until it gets the response from TrueNAS. If TrueNAS crashes in the meantime, Explorer will show an error. Word will probably cache the save to AppData or something. If the save to TrueNAS fails, it will probably show you the “save as” dialog to select a new target destination. Of course without losing any data.

I agree that STORJ can’t handle it. It will just get you a failed audit. But again, who cares if you lose 25MB of STORJ data?

Turning on sync will always have a negative impact on performance. Even 4 Intel Optane drives in RAID 0 as SLOG will be slower than plain async. For everyone who understands how async, sync, SLOG and the ZIL work, it is clear why sync can never perform as well as async.
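
If you want to see the gap for yourself, here is a rough Python sketch (nothing to do with the actual node code, just an illustration of what an fsync after every write costs). The test path and sizes are made up; point it at the pool you care about:

```python
# Crude illustration of the latency cost of sync writes:
# time N small writes, once without and once with an fsync() after each write.
import os
import time

TEST_FILE = "/tmp/sync_test.bin"   # hypothetical path, change to your pool
CHUNK = b"\0" * 64 * 1024          # 64 KiB per write
ROUNDS = 200

def timed_writes(do_fsync: bool) -> float:
    fd = os.open(TEST_FILE, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    start = time.perf_counter()
    for _ in range(ROUNDS):
        os.write(fd, CHUNK)
        if do_fsync:
            os.fsync(fd)           # wait until the data is on stable storage
    elapsed = time.perf_counter() - start
    os.close(fd)
    os.unlink(TEST_FILE)
    return elapsed

print(f"buffered writes:   {timed_writes(False) * 1000:.1f} ms")
print(f"fsync each write:  {timed_writes(True) * 1000:.1f} ms")
```

The second number will be worse, SLOG or not; the only question is by how much.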

That sounds like a sync write to me (making sure that the data is actually written before doing something else). Async, as I understand it, would be writing and doing something else without waiting for the ack. Maybe I am wrong, I have never analyzed this in extensive detail.

I think we are misunderstanding each other.

The way Storj works now (as I understand it) is that my node gets the piece, writes it to the drive, syncs it to get a confirmation that it is actually written and reports it to the customer and/or satellite.

Turning off sync for Storj would mean that it writes the data, does not wait for anything and reports success. There is no way for the node to report 5 or however many seconds later (after doing async write) that now the data is actually written. I don’t think Storj would change the protocol enough to allow for this late reporting.
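
In pseudo-Python, the difference I mean is roughly this (the function names are made up, this is not the actual storagenode code):

```python
import os

def ack_upload():
    """Hypothetical stand-in for reporting success to the uplink/satellite."""
    pass

def store_piece_sync(fd: int, piece: bytes) -> None:
    os.write(fd, piece)
    os.fsync(fd)         # data is on stable storage before we answer
    ack_upload()         # "I really have it"

def store_piece_async(fd: int, piece: bytes) -> None:
    os.write(fd, piece)  # data may still only be sitting in RAM
    ack_upload()         # "I have it" - a lie if the box crashes right now
```

There is no third step where the node could come back later and say “now it is really written”.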

The question is - does it affect the performance of the node in a perceptible way? An increase of latency of 0.5ms is still an increase, but would it make me lose many more races?

Me neither to be honest, that is why I like these thought experiments :slight_smile:
To be fair, there is a small chance of losing data over SMB. And SMB can even ask for sync.
So Windows will wait until TrueNAS sends the ack. But TrueNAS will send that ack when the data is only in RAM and not yet on disk. So if the system crashes in exactly that timeframe, there could be data loss. But there are ways to prevent this. With a UPS, TrueNAS will be able to write it to disk without any data loss, if there is just a power or network cut and not a system panic.
While this may sound risky, a lot of systems nowadays use some kind of async. Often this is called a write cache. QNAP, Synology, Windows, sometimes even SSD controllers; some SSDs even add a RAM cache on top, like the Samsung RAPID mode. For Windows it is even enabled by default. This is nothing to worry about in most workloads.

But if you wanna be 100% sure you have consistent data, you need ECC, sync, and a power-loss-protected SSD as SLOG. And even PLP can have different implementations.

Turning off sync for Storj would mean that it writes the data into RAM, does not wait for anything and reports success (aka lies). If there is a crash, there is no way for the node to report exactly which 5 seconds of data it has lost. There is no way, because even TrueNAS does not know.
So even if STORJ implemented late reporting, this would not be possible.

So what exactly happens when I lie to the node?
Let’s assume we get very high 40Mbit/s ingress. That is 5MB/s.
5 s × 5 MB/s is 25MB of data loss. So if we have a crash, we lose 25MB. And we lied to STORJ to get this. If we wanna survive the audits, our pool should hold at least 1.3GB so that we stay below the very safe 2% corrupt-data threshold. And we better get another 1.3GB of ingress before the next crash.
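
Just to write the arithmetic down (plain Python, numbers straight from above):

```python
ingress_mbit_s = 40                        # assumed worst-case ingress
ingress_mb_s = ingress_mbit_s / 8          # 40 Mbit/s = 5 MB/s
async_window_s = 5                         # roughly 5 s of writes are at risk
data_at_risk_mb = ingress_mb_s * async_window_s   # 5 MB/s * 5 s = 25 MB

audit_threshold = 0.02                     # the "very safe" 2% corruption budget
min_stored_mb = data_at_risk_mb / audit_threshold # 1250 MB ~= 1.3 GB

print(data_at_risk_mb, min_stored_mb)      # 25.0 1250.0
```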

My guess is it depends, but probably not. The better question would be: what do you expect to gain from it? You wanna wear out an expensive SSD to prevent a 25MB data loss?

I think NFS (not SMB) by default not only waits for a sync after closing a file, but for large files also syncs after some amount of data has been written. It is possible to disable this behavior.
Even if TrueNAS lies about having written the data, it probably is not a problem, because if it crashes right after I save the document (and get a fake ack) I can just check it after the file server reboots.

A write cache is not exactly the same as async. Normally, the system would write to the drive, then issue a flush command which tells the drive to actually write the data. Some drives lie about having written it; that’s why RAID controllers have the option to disable the drive write cache, though I guess some drives could lie about disabling it as well.

While the OS knows the difference between data and metadata, the drive does not, and a drive that lies about having actually written the data can result in a broken filesystem after a crash.

Since Storj mostly writes new files and does not do a lot of modifications/replacements, by the time the SSD wears out, I’ll have most of the TBW rating as stored data :slight_smile:

But mostly it may be because of purism on my part. I think that it is incorrect for the node to report success without having actually written the data. This introduces “acceptable data loss”, and if my node fails an audit, I will be left wondering whether this is because of a crash plus async (and therefore normal) or an indication of a real problem. I would search the log files, find that the file was uploaded and then just disappeared.

This is one of the reasons why I would like to be able to do an “internal audit” - to check if my node has all the files it should have (locally, not informing the satellite about the result).

Async is good because it increases performance and most of the time the file can be rewritten if there was a crash.

NFS seems to be sync by default. Sync waits until everything is written to disk before sending the ack.

No, but it behaves similarly and follows a similar logic. Both have the problem that we act like the data is on disk while it is not (yet) on the disk.

I don’t believe this is true.

RAID controllers have the option to disable the drive write cache because you lose all that data when the controller goes down. That is why some RAID controllers have batteries; that way they can hold the data in the cache for up to 72 hours. It is like RAM for the RAID controller, and you keep the volatile RAM data alive by adding a battery. If you get back online during that timeframe, the data can be written and will not be lost. PLP for SSDs works similarly: if data is in the cache while the power goes down, the capacitors provide enough power to empty the cache into the drive’s flash.

And that is why there are no lying drives, because that could seriously put a pool at risk and would be a way bigger backlash than the WD SMR debacle :slight_smile: That is btw also the reason why you apparently risk losing the pool if you don’t have PLP on the SLOG.

What drive(s) do you use for SLOG?

This one I totally get!

OK, apparently this was a bit of a broken telephone game with that information - somebody told me about it and it made sense, but now when I was looking for it I found this:
https://queue.acm.org/detail.cfm?id=2367378

Luckily, SATA (serial ATA) has a new definition called NCQ (Native Command Queueing) that has a bit in the write command that tells the drive if it should report completion when media has been written or when cache has been hit. If the driver correctly sets this bit, then the disk will display the correct behavior.

In the real world, many of the drives targeted to the desktop market do not implement the NCQ specification. To ensure reliability, the system must either disable the write cache on the disk or issue a cache-flush request after every metadata update, log update (for journaling file systems), or fsync system call.

So, as I understand now, it used to be that the drive reported “write complete” after actually writing the data to disk. Then write cache came along and drives started reporting “write complete” after writing to the cache. Later, TCQ (for SCSI) solved that problem and NCQ for SATA was supposed to do the same - introduce “cached write” and “uncached write”, but some drives would treat both the same and require a separate “cache flush” request to actually write the data.
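
From the application side, by the way, you don’t have to care whether the drive honours that bit or needs a full cache flush; you just ask for durability and let the kernel and the drive sort it out. A minimal Python sketch of the two usual ways to ask (hypothetical file name, Linux-style flags):

```python
import os

PATH = "/tmp/important.dat"   # hypothetical file
DATA = b"some data"

# Option 1: write, then fsync - don't continue until the data is durable
fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, DATA)
os.fsync(fd)                  # kernel issues a FUA write or a cache flush as needed
os.close(fd)

# Option 2: open with O_DSYNC so every write is synchronous
fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
os.write(fd, DATA)            # returns only once the data is durable
os.close(fd)
```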

Intel SSDSC2KB960G8. I used to use a pair of Kingston desktop drives, but one of them failed (I used them for L2ARC too, a bit of a bad idea) and I had the Intel drives from when I was mining Chia, so I just use them now. That’s why I use a 960GB SSD for a 4GB SLOG partition.
Because it is possible to remove a SLOG at any time (unlike data drives), if I need those SSDs somewhere else, I can just remove them.

As I have mounted those SSDs inside the server, if I wanted to use a non-PLP drive I could always build a mini-UPS for it :slight_smile:
Power loss is not my biggest concern though (the server has dual power supplies, connected to separate UPSs, which are connected to a much bigger UPS). My bigger concern is something like a kernel panic (inside the VM or in the host) or a reboot, which would lose the data in the OS cache; a non-PLP SSD would not lose its own cache in that case, since it never loses power.

That article is from 2012. According to my sources, there haven’t been any “lying HDDs” since 2014. Which makes sense, otherwise we would have a lot of failed pools (corrupt ZIL) :slight_smile:

SSDs on the other hand are only safe to use as SLOG if they have PLP (or don’t use any cache at all; apparently some old Intel Optane drives wrote directly to flash without a controller cache).

Not sure what you mean by that. A UPS will not prevent you from losing the pool if the system crashes and the SLOG does not have PLP. That is why your Intel SSD DC S4510 is a perfect drive for SLOG but the “Kingston desktop drives” are a bad idea.

It really is a shame that Intel sold their Optane business :frowning: Hopefully Solidigm SK hynix will still offer great drives for reasonable prices.

mini-UPS - to keep the drive (and only the drive) powered on in case the server loses power. If the OS crashes or reboots, the SSD does not lose power and does not need PLP; it only needs that for power loss. If I can keep power on the SSD for some time, it should write the data to flash even if the rest of the server is off.

It’s probably true. I was told about the lying drives some time ago; it made sense (just like some Samsung SSDs say they support queued TRIM, but in practice fail when told to do it) and I did not really check on it. Just disable the drive write cache and be done with it (the server usually has way more cache than the drive anyway).

Apparently some SSDs lie about this; there would be no need for PLP if the SSD actually wrote the data to flash after being issued a cache flush command.

Haha nice idea. Is that some kind of 12V SATA-cable UPS?

That was the post I had in the back of my brain for the 2014 claim :slight_smile:

Most desktop SSDs only need 5V. So basically just put something like a USB power bank between the server and the SSD (just make sure the power bank outputs a stable 5V), or use a 12V->13.8V DC/DC converter, a small 12V Pb battery and a 12V->5V converter. Hopefully the SSD gets around to writing the data before the battery discharges.

So yeah, the problem is that devices lie about actually writing data. If a device does not lie about this, it does not need PLP to be safe for ZFS, as ZFS waits for the confirmation.

You know what would be cool for SLOG? A battery-backed RAM drive, like the Gigabyte i-RAM; it does not need to be large and DRAM does not wear out.

This is basically what the Radian RMS-200 is, but it is flash-backed with supercaps instead of a battery.

Haha nice! That is the most MacGyver thing I have heard in a long time :slight_smile: But I would not trust those power banks.

Probably even a Dell RAID controller with a BBU could be very cool as a SLOG. Unfortunately I have not heard of a lot of people trying that out yet.

Jesus, this thing is fast :slight_smile: Are the batteries replaceable? Does it somehow report when they go bad?

Aren’t (decent) SSDs already equipped with a capacitor that would store enough energy to finish the writes? I recall that it was a thing a long time ago already, e.g. as in this article.