Using ZFS ARC cache on non-ECC RAM

anon27637763 · March 3, 2020, 3:44pm

ARC is a very fast cache located in the server’s memory (RAM). The amount of ARC available in a server is usually all of the memory except for 1GB.

This may work, but beware that a system crash may result in some unknown problems with data pieces being partially downloaded or audits failing due to RAM errors…

I have no idea if anything I wrote is accurate… I’m just thinking out loud. Please tell me if I’m wrong.

kevink · March 3, 2020, 3:53pm

This may work, but beware that a system crash may result in some unknown problems with data pieces being partially downloaded or audits failing due to RAM errors…

Well this is kinda always the case. If your system crashes then the current download will of course fail. And if your RAM has errors then your system will sooner or later cause serious problems anyways.
In contrast to arc the L2ARC is a fast cache on your SSD.

However no crash will result in data loss because the arc/l2arc is only a read cache.

anon27637763 · March 3, 2020, 4:25pm

I haven’t been able to locate much information yet on ECC memory vs. non-ECC memory and ZFS Arc… but it would seem logical that ECC memory would be required for reliable ZFS Arc operation within the context of a SN, since an SN is a component of a larger system which is assembling the downloaded data pieces into a file at the far end.

EDIT…

Found this

If multiple bits are corrupted within a single word, the CPU will detect the errors, but will not be able to correct them. When the CPU notices that there are uncorrectable bit errors in memory, it will generate an MCE that will be handled by the operating system. In most cases, this will result in a halt of the system.

This behaviour will lead to a system crash, but it prevents data corruption. It prevents the bad bits from being processed by the operating system and/or applications where it may wreak havoc.

ECC memory is standard on all server hardware sold by all major vendors like HP, Dell, IBM, Supermicro and so on. This is for good reason, because memory errors are the norm, not the exception.

The question is really why not all computers, including desktop and laptops, use ECC memory instead of non-ECC memory. The most important reason seems to be ‘cost’.

It is more expensive to use ECC memory than non-ECC memory. This is not only because ECC memory itself is more expensive. ECC memory requires a motherboard with support for ECC memory, and these motherboards tend to be more expensive as well.

non-ECC Memory is reliable enough that you won’t have an issue most of the time. And when it does go wrong, you just blame Microsoft or Apple. For desktops, the impact of a memory failure is less of an issue than on servers. But remember, your NAS is your own (home) server. There is some evidence that memory errors are abundant on desktop systems.

kevink · March 3, 2020, 4:28pm

Well you can always go over the top, but why have ECC while using a single, consumer-grade HDD with an ext4 filesystem? Or on windows even ntfs… (neither of which can detect and repair corrupted data the way zfs does, so their data integrity is worse)
There’s nothing less secure about using arc with non-ECC memory. How often does the RAM produce errors? And even if it does, it likely only affects a single piece that gets downloaded, while the original piece is still completely intact on the HDD (assuming the HDD doesn’t get a damaged sector either).

anon27637763 · March 3, 2020, 4:31pm

Let’s say that a ZFS system on non-ECC RAM has a stuck bit… and the cached data piece is being audited… the audit will fail silently (as far as the host OS is concerned)… or crash the host system.

kevink · March 3, 2020, 4:35pm

Yes but this argument is pointless because it is the same with using ext4 or ntfs. Using zfs with non-ECC RAM is just as good/bad as using ext4/ntfs with non-ECC RAM.
The only advantage we were talking about is the automatic file caching when you use arc/l2arc, which uses RAM/SSD to cache files.

anon27637763 · March 3, 2020, 4:40pm

I don’t know enough about the comparison to have a definitive opinion one way or the other… but every anecdote and recommendation I’ve come across in the last few minutes of my reading has indicated in big bold letters:

“Use Only ECC memory with ZFS”

DDG Search

kevink · March 3, 2020, 4:45pm

Yes, because if you care enough about data integrity to set up a zfs raid system, then you need ECC memory too; otherwise the RAM might screw with your data integrity.
However if you don’t care about data integrity more than you would with ext4 or ntfs, then it makes no difference.

anon27637763 · March 3, 2020, 4:50pm

It does make a difference.

In ZFS, one is relying on the integrity of the RAM. While in ext4 or ntfs, one is only relying on the HDD.

So, using ZFS without ECC RAM is like using HDDs with an error rate of 1 bit in 10^13 … while using ZFS with ECC is like using HDDs with an error rate of 1 bit in 10^16.
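
To put the two figures in that analogy side by side, here is a rough back-of-the-envelope sketch. It assumes the figures mean one bit error per 10^13 or 10^16 bits read on average, which is how drive datasheets usually quote them. The function name and the 10 TB workload are made up for illustration, and mapping RAM reliability onto these rates is only the analogy from this post, not a measurement.

```python
def expected_bit_errors(terabytes_read, bits_per_error):
    """Expected number of bit errors when reading `terabytes_read` TB,
    assuming one error per `bits_per_error` bits on average."""
    bits_read = terabytes_read * 10**12 * 8  # decimal TB -> bytes -> bits
    return bits_read / bits_per_error


# Hypothetical workload: 10 TB read in total.
for rate in (10**13, 10**16):
    errors = expected_bit_errors(10, rate)
    print(f"1 error per {rate:.0e} bits, 10 TB read: {errors:g} expected errors")
```

The two rates differ by a factor of 1000, so the expected error counts differ by the same factor for any workload.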

Since we are discussing data failure avoidance, using ZFS without ECC RAM seems like a bad idea if one is looking to run a SN reliably for a long time.

kevink · March 3, 2020, 4:52pm

Well if you think so…
I’m not going to keep arguing about that. Every filesystem relies on the RAM to not have errors.
ZFS relies on that a bit more than other systems, sure, it’s more complex. But ultimately RAM errors can screw with every filesystem, not least because just about every OS uses RAM caches for asynchronous file access.

anon27637763 · March 3, 2020, 4:54pm

I’m not necessarily arguing… I’m learning… reading as I go. I don’t know much about ZFS… but I’m fascinated.

I share my motto with The Royal Society

Nullius in verba

– take nobody’s word for it

BrightSilence · March 3, 2020, 6:44pm

Pretty much every disk read is read into RAM first before being used by something else. It doesn’t matter whether that data was kept in RAM or read into RAM when it was needed. It passes through it either way. The only difference with ARC is that the data is already in RAM and kept there. I guess in theory because it’s in RAM longer it has a slightly higher chance to get corrupted. But bad RAM is bad for every system.

anon27637763 · March 3, 2020, 7:28pm

After reading a little bit, it seems that ZFS trusts RAM information blindly and explicitly. So, if there is an error in the RAM data image, that error may be propagated to the disk stored data… but not the other way around.

Again, I haven’t read too much yet on this topic, but at first blush it seems that running ZFS with non-ECC memory is a setup for avoidable node disqualification with the only real benefit being a possible mild increase in caching some data pieces.

kevink · March 3, 2020, 8:01pm

Every filesystem trusts the RAM blindly, the OS trusts the RAM blindly… nobody has checksums on things in RAM. If files are written asynchronously, they are stored in RAM until written to disk. With every filesystem this file can get corrupted if the RAM flips a bit.
zfs just has more features that depend on RAM and is designed for a server environment with data integrity as top priority. That’s why every article puts more emphasis on ECC RAM: you will need it for proper data integrity, as it doesn’t make sense to make the filesystem “rock-solid” if it can easily be corrupted by another component that isn’t as safe as it can be.
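
A minimal sketch of that asynchronous-write point, not tied to any real filesystem: data sits in a RAM buffer until it is flushed, and a bit that flips before the flush is written out as if it were valid. The `flush` helper, the dict standing in for the disk, and the use of SHA-256 as the checksum are all assumptions made for illustration.

```python
import hashlib


def flush(buffer, disk, block_id):
    """Write the buffered data to the 'disk' dict together with a checksum of it."""
    data = bytes(buffer)
    disk[block_id] = (data, hashlib.sha256(data).digest())


disk = {}
original = b"storj piece"
buffer = bytearray(original)  # data waiting in RAM for an asynchronous write
buffer[0] ^= 0x01             # simulated bit flip before the flush
flush(buffer, disk, "b1")

data, stored_checksum = disk["b1"]
corrupted_on_disk = data != original
checksum_matches = hashlib.sha256(data).digest() == stored_checksum
print(corrupted_on_disk, checksum_matches)
```

In this model the checksum is computed over the already corrupted bytes, so it cannot flag the error later. A filesystem without checksums would simply store the same corrupted bytes.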

The way you explain it makes zfs look like a horrible choice for SNOs, only because you are expecting your RAM to be damaged so badly it will destroy the whole filesystem. If your RAM is damaged this badly, it might as well corrupt your ext4 or ntfs filesystem.

I use zfs in my homeserver with non-ECC RAM and I trust it to work well enough, at least better than ext4, btrfs or ntfs. Ultimately if my RAM should ever fail then yes, my zfs filesystem might get destroyed, probably just like an ext4 one would. That’s what backups are for.
But generally, the chance of stored data (backups, archival data, storj data) getting silently corrupted is very low, because once written, it doesn’t get loaded to RAM, modified and written back to disk. So unless the whole filesystem gets corrupted, any such data stored is safe (safer than on ext4 and ntfs, because zfs has checksums and can repair itself if the HDD has a bad sector).
But if you have an additional $200 for ECC RAM to build a “rock-solid” storj node, then by all means…
If not, then it’s ridiculous to make zfs out to be worse than ext4 or ntfs for SNOs, because the chances of visible data corruption are about the same. Especially since we are comparing consumer-grade hardware for all components.
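
The checksum-and-repair behaviour mentioned above can be sketched roughly as follows. This is a toy model, not ZFS code: it assumes a second copy of the data exists (as with a mirror), uses SHA-256 as a stand-in checksum, and `read_with_repair` is a hypothetical helper name.

```python
import hashlib


def checksum(data):
    return hashlib.sha256(data).digest()


def read_with_repair(copies, expected):
    """Return the first copy whose checksum matches `expected`,
    overwriting any damaged copies with the good one."""
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("all copies are corrupted")
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good  # repair the damaged copy from the good one
    return good


piece = b"storj piece"
expected = checksum(piece)
mirror = [b"storj pi\x00ce", piece]  # first copy has a simulated bad sector
result = read_with_repair(mirror, expected)
```

The repair step depends on a good redundant copy being available. With a single copy, the same check can only detect the damage, not fix it.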

Alexey · March 3, 2020, 8:03pm

Sorry, I split the ZFS topic off from the Ideas thread; it deserves its own thread.

kevink · March 3, 2020, 8:03pm

Thank you alexey, it evolved unexpectedly.

anon27637763 · March 3, 2020, 8:20pm

At first, the target block is read from disk to memory. For read, there are two scenarios, as shown in the left half of Figure 3. On first read of a target block not in the page cache, it is read from the disk and immediately verified against the checksum stored in the block pointer in the parental block. Then the target block is returned to the user. On a subsequent read of a block already in the page cache, the read request gets the cached block from the page cache directly, without verifying the checksum.
…
After some time, the block is updated. The write timeline is illustrated in the right half of Figure 3. All updates are first done in the page cache and then flushed to disk.
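
The read path described in that excerpt can be sketched as a toy model: the checksum is verified when a block first comes off the disk, while a later read served from the cache is returned without verification. The `CachedStore` class, its dict-based disk and cache, and SHA-256 as the checksum are assumptions for illustration, not how any real filesystem is implemented.

```python
import hashlib


class CachedStore:
    def __init__(self):
        self.disk = {}   # block id -> (data, checksum)
        self.cache = {}  # block id -> bytearray held in RAM

    def write(self, block_id, data):
        self.disk[block_id] = (bytes(data), hashlib.sha256(data).digest())

    def read(self, block_id):
        if block_id in self.cache:
            # Cache hit: returned as-is, no checksum verification.
            return bytes(self.cache[block_id])
        data, stored = self.disk[block_id]
        if hashlib.sha256(data).digest() != stored:
            raise IOError("checksum mismatch on disk read")
        self.cache[block_id] = bytearray(data)
        return data


store = CachedStore()
store.write("b1", b"piece")
first = store.read("b1")      # verified against the checksum
store.cache["b1"][0] ^= 0x01  # simulated bit flip in the cached copy
second = store.read("b1")     # served from cache, corruption goes unnoticed
```

In this model the copy on disk is untouched by the flip; only reads served from the cache see the corrupted bytes.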

If I’m reading the study analysis correctly, they introduced errors in the memory copy of the Read data which ZFS correctly recorded was different than the disk data… and then stored the error-induced data in memory back to disk.

If this is what happened in their study, and how ZFS currently functions, then ZFS with non-ECC memory is much more dangerous than EXT4 or NTFS.

kevink · March 3, 2020, 8:38pm

Well you should read the document a little bit further:

In summary, so far we have studied two extremes: ZFS, a complex filesystem with many techniques to maintain on-disk data integrity, and ext2, a simpler filesystem with few mechanisms to provide extra reliability. Both are vulnerable to memory corruptions

It shows 2 important points:

1. It is extremely outdated; ext2 is ages old. It mentions ext3 somewhere in there, but ext4 was released in 2008!
2. Other filesystems are vulnerable to memory corruptions as well.

Therefore zfs isn’t much more dangerous than EXT4 or NTFS. In every filesystem you also have asynchronous writes which are most of the write operations. These cache the files in RAM until flushed to disk, basically just like zfs does. You’ll get a corruption there with any filesystem. Just to name one example.

If you read articles and studies, please make sure they are reasonably up to date and that you apply the problem to all filesystems (rather than finding a problem in zfs and pointing it out without comparing it to other filesystems).

kevink · March 3, 2020, 8:46pm

I can only assume you found the study in this thread: https://www.ixsystems.com/community/threads/freenas-without-ecc-ram.73669/
However, they answer it quite well there and explain that any filesystem has issues with memory corruption.

anon27637763 · March 3, 2020, 8:50pm

No. This is not accurate.

An EXT4 read consists of reading data from the disk into memory, there is no checksum checking of what’s been read into memory. So, an EXT4 read looks something like this:

Disk → Memory → Memory Error

While ZFS looks like this:

Disk → Memory → Memory Error → Disk

Let’s go through an example of a memory error in an audit piece.

1. Data piece is read into memory.
2. Data piece is audited.
3. A bit flips in non-ECC memory.
4. Data piece in memory with error fails audit.
5. Data piece with error gets written back to disk since ZFS trusts memory.
6. Corrupted data piece continues to fail audit.
7. Node DQ-ed.

Steps 5 and 6 do not happen with EXT4. Therefore, the next audit of the data piece may get loaded into non-corrupt memory and pass.

Therefore…ZFS with non-ECC memory is worse than EXT4.