Data corruption bug in OpenZFS

OpenZFS is investigating a data corruption bug: "some copied files are corrupted (chunks replaced by zeros)", issue #15526 in openzfs/zfs on GitHub. Many people can reproduce the problem in diverse environments, including Fedora, Debian, TrueNAS Core, and Proxmox, and it has been reproduced with versions as old as the 2.1.x series (run zfs --version to check).
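If you want to script the version check across several hosts, here is a minimal sketch. It assumes zfs --version prints a first line like zfs-2.1.13-1; the function names and the 2.1.0 lower bound are my own illustration, based only on the "as old as 2.1.x" reports above. No fixed release is known yet, so there is no upper bound.

```python
import re
import subprocess

# Assumption: reports mention 2.1.x and later, so 2.1.0 is used as the lower bound.
FIRST_REPORTED = (2, 1, 0)

def parse_zfs_version(text):
    """Extract the first X.Y.Z version from `zfs --version` style output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.groups())

def in_reported_range(version):
    """True if the version is at or above the oldest series reported as affected."""
    return version >= FIRST_REPORTED

if __name__ == "__main__":
    try:
        output = subprocess.run(
            ["zfs", "--version"], capture_output=True, text=True, check=True
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("could not run zfs --version on this host")
    else:
        version = parse_zfs_version(output)
        print(version, "in reported range:", in_reported_range(version))
```

A True result only means the version falls in the range people have reported, not that your data is corrupted.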

Setting zfs_dmu_offset_next_sync=0 and zfs_bclone_enabled=0 seems to reduce the frequency of the problem, though there might be a performance impact.
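For anyone who wants to check or change these two tunables from a script, here is a rough sketch. It assumes Linux with ZFS loaded as a kernel module, where parameters show up under /sys/module/zfs/parameters; the helper names are made up for illustration. A parameter can be missing on a given version, which the code reports as None.

```python
from pathlib import Path

# Assumption: Linux with the zfs kernel module loaded.
PARAM_DIR = Path("/sys/module/zfs/parameters")
TUNABLES = ("zfs_dmu_offset_next_sync", "zfs_bclone_enabled")

def read_tunables(param_dir=PARAM_DIR, names=TUNABLES):
    """Return {name: value}; value is None if the parameter file does not exist."""
    values = {}
    for name in names:
        try:
            values[name] = (Path(param_dir) / name).read_text().strip()
        except FileNotFoundError:
            values[name] = None
    return values

def set_tunable(name, value, param_dir=PARAM_DIR):
    """Write a new value. Needs root on a real system and is lost on reboot."""
    (Path(param_dir) / name).write_text(f"{value}\n")

if __name__ == "__main__":
    for name, value in read_tunables().items():
        print(f"{name} = {value if value is not None else 'not present'}")
```

Values written this way do not survive a reboot. To keep them, the usual route is a line such as "options zfs zfs_dmu_offset_next_sync=0" in a file under /etc/modprobe.d/.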

The root cause has probably been found, and patches are being prepared. As far as I understand, storage nodes in regular operation should not be affected, as they do not make the system calls that seem to trigger the bug (copy_file_range in a massively concurrent workload). On the other hand, if you were migrating a node, tools as simple as cp could trigger it. Software build scripts, torrents, and thin-provisioned virtual machines seem to be the most affected use cases.
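If you did copy files recently (for example during a node migration) and still have the originals, a quick sanity check is to look for chunks that are all zeros in the copy but not in the source. This is only a sketch: the function name and the 4096-byte chunk size are arbitrary choices of mine, not something taken from the GitHub issue.

```python
def find_zeroed_chunks(src_path, dst_path, chunk_size=4096):
    """Return byte offsets where the copy holds only zeros but the source differs."""
    suspicious = []
    offset = 0
    with open(src_path, "rb") as src, open(dst_path, "rb") as dst:
        while True:
            original = src.read(chunk_size)
            copied = dst.read(chunk_size)
            if not original and not copied:
                break  # both files exhausted
            # Flag only chunks that are entirely zero in the copy and differ from the source.
            if copied and copied != original and not any(copied):
                suspicious.append(offset)
            offset += chunk_size
    return suspicious
```

An empty list means no zero-filled chunk was found at this granularity. It does not replace a full checksum comparison, and it ignores differences that are not zero-filled.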

Stay safe and upgrade your OpenZFS version as soon as an official bug fix shows up!


From skimming the super long thread there, it looks like (feel free to correct me if I am wrong):

  1. The issue starts with 2.1.x, but gets much worse with 2.2.x (good thing Debian did not upgrade immediately)
  2. Setting zfs_dmu_offset_next_sync to zero on 2.1.x (it was zero by default up to 2.1.5) avoids the problem
  3. zvols do not seem to be affected.

I recently upgraded the host that runs the node VM to Debian 12, which ships zfs 2.1.13, but I only use zvols, so it looks like I’m safe. I still set that parameter to zero.

My file server (which stores files on zfs and not just zvols) runs Debian 10 and zfs 2.0.3, so it should be safe. Good thing I did not upgrade it.
