Node crashing upon realizing data storage isn't perfect

That’s true only for a single (simple) pool, where you do not have redundancy.

How could it do so if there is no RAID (mirror/parity)? Where would it take the correct data from?
My tests didn’t show that ZFS in a single simple pool can survive bitrot. The same goes for LVM and BTRFS.

However, with RAID it made it, but only for ZFS and BTRFS. But BTRFS RAID is a mess…

Ah, the million dollar question and the million dollar answer: Do not use ZFS in single disk scenarios.

There are checksums that need to match when data is being read. Single disk means that there isn’t something to verify that the checksum is indeed correct. ZFS does what ZFS is supposed to do: re-reads the corrupted data, calculates a new checksum for it (since it does not have something to compare that checksum to) and re-writes the now-bad checksum. You are basically screwed at this point (pardon my French).
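
For illustration, this is roughly how it surfaces on a single-disk pool (the pool name and file path below are made up): a scrub can detect the bad checksums, but with no redundant copy there is nothing to repair them from.

```
# hypothetical single-disk pool named "tank"
zpool scrub tank          # re-reads all data and verifies checksums
zpool status -v tank      # after the scrub, lists the unrecoverable files, e.g.:
#
#   NAME        STATE     READ WRITE CKSUM
#   tank        ONLINE       0     0     4
#     sda       ONLINE       0     0     4
#
# errors: Permanent errors have been detected in the following files:
#         /tank/storagenode/storage/blobs/<satellite>/<prefix>/<piece>.sj1
```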

As far as I understand, most SNOs use ZFS exactly in a single-disk scenario… Because it’s convenient and you can use SSD caches (of any kind) out of the box without too much trouble. It also uses RAM (if you have a lot of it) more efficiently.

Exactly.
And here we are. The ZFS disk is corrupted, hard, and the node cannot tolerate it.
Usually you need to mark these bad blocks with whatever the FS provides and move on. In the worst case, just rename the broken piece so it always returns “file not found” (to keep this bad block occupied) and move on.
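
Something along these lines (the path is only an example; take the real ones from the zpool status -v error list). A rename only touches metadata, so it generally works even when the file’s data blocks return I/O errors:

```
# keep the corrupted piece in place under another name, so the node gets a
# plain "file not found" instead of an I/O error, and the bad area stays occupied
mv /pool/storagenode/storage/blobs/<satellite>/<prefix>/<piece>.sj1 \
   /pool/storagenode/storage/blobs/<satellite>/<prefix>/<piece>.sj1.corrupt
```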

We are going offtopic, but since the ZFS crusaders will eventually flock to defend that use case, I have to be very clear: SNOs use what they think works best based on a number-go-up mentality. I’ve warned people plenty of times that if something is working perfectly today, that doesn’t mean that it’s the best tool for the job down the road, or in other simpler words: a 1TB ZFS node cannot be extrapolated to a 20TB node. ZFS’s use case is just a single one: rarely changed data that needs to be available no matter what. Constantly changing data (= all ZFS SNOs will notice 70% fragmentation two years down the line from today) + disabling safety features (=using it in single disk scenarios) means that they’ll eventually regret their decision.

Some will realize it faster than others. Some will defend its use to the end.


How right you are!

But it wouldn’t stop newbies from making the same mistake over and over again.

Thus, I wouldn’t change the recommendations from the current ones - NTFS for Windows and ext4 for Linux.
Whoever wants to experiment - go ahead, but treat it as an experiment, nothing more.

Of course, I understand that some setups limit which FS you can use by default (BTRFS for Synology, XFS for unRAID and ZFS for TrueNAS), but you need to think ahead.


OP knows they could just delete the problem files to eliminate the OS-level errors… and the node would run fine (and maybe fail some audits, but it’s statistically unlikely) but they simply don’t want to. Instead they insist on feeding a filesystem that the OS insists is b0rked to the node. It’s like pouring garbage into your gas tank and insisting your car just figure it out :wink:

Single-disk ZFS can only do so much. They know they had hardware problems. They know they lost data. They know how to fix the problem (they’re being told the exact files to delete). But… they don’t want to run a few commands to fix the filesystem… they want to leave it broken and then cry when applications can’t run properly on it :stuck_out_tongue_winking_eye:

OP needs to find where the corruption came from and since the node is now lost, start over by first fixing the underlying corruption issue, then installing the node on a filesystem recommended for the node’s use case.


No, they shouldn’t delete them, they need to rename them, because these files will still be locked to the same bad block(s). If they deleted them, a new piece could be placed there, increasing the risk of disqualification. However, in this case I believe the garbage collector will remove it eventually… I do not know how to mark bad blocks with ZFS.

Of course there was an issue; I said in the first post that I lost 0.5% of data. When one has time once a week on average to fix stuff, you can see how it can easily turn into an ordeal. The corruption happened because the drive was failing; it has since been fixed. ZFS couldn’t repair a lot of it because the ZFS checksum isn’t ECC, it’s just an error-detection checksum, so nothing gets fixed. I hadn’t known that before deploying this system.

It can. If there is corruption, ZFS will throw I/O errors or something similar; it will absolutely refuse to read that data. Actually, it is impossible to get to this data at all, it is basically destroyed, because they never made a way to access it. There is supposed to be a switch that lets you access it anyway, but it doesn’t seem to work. Furthermore, the node is badly written: if it encounters an I/O error, it exits after a while. So one bad file = catastrophic node failure, not very sensible.

Yeah, that is where it’s most likely to manifest; however, the same goes for other pools once you lose a drive or two.

Exactly, ZFS is completely useless there, I wasn’t aware of that before.

It doesn’t; if it did that, there would be no problem. What it does is read the data, see a bad checksum and refuse to serve the data; it does nothing with it. Not only does it do that, it also disallows you from changing this data altogether: you need to delete it first, and better hope it’s not a directory, because that doesn’t always work from what I’ve seen.

ZFS is meant to be used with changing data, it is perfect for it. Fragmentation is not an issue with ZFS; data is always fragmented because it’s copy-on-write, and that’s not a problem.

OP knows they could just delete the problem files to eliminate the OS-level errors… and the node would run fine (and maybe fail some audits, but it’s statistically unlikely) but they simply don’t want to
You are wrong, I never said I don’t want to; there is no way to. There is no facility to do so, ZFS isn’t designed for data corruption. I tried to delete the files from the “zpool status -v” output: it deleted most of the data, but not all, and then on top of that it refused to delete some folders because they weren’t empty. And they weren’t empty because it didn’t allow me to delete all of their contents. Again, ZFS isn’t designed well enough to handle corruption in all cases. I don’t insist on feeding bad data to the node, I’m trying to do whatever I can. Nothing works, it’s not fixable. First ZFS isn’t designed to handle corruption, then it isn’t designed to delete broken data. And then the node throws its hands in the air and refuses to work because it encounters an error when accessing a file.
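
For reference, a sketch of the cleanup sequence that should in principle be needed (the pool name is made up); as far as I understand, the permanent-error list does not refresh right away once the files are gone, it usually takes a zpool clear and/or another scrub or two before the stale entries drop off:

```
# delete or rename every file listed by "zpool status -v" first, then:
zpool clear tank       # reset the error counters
zpool scrub tank       # stale entries may only disappear after the next scrub(s)
zpool status -v tank
```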

All bad designs. But generally you have a hard time explaining these things to the developers, because their answer will be (as is yours) “well, it’s not expected you will do that, of course it doesn’t work”. This is just unrobust design.

The good thing about ZFS is that there are no bad blocks to mark: ZFS is copy-on-write, so each time you write something, it goes to some free location, never to the same place, even if you make small changes to a file.

What fragmentation? As far as I know a ZFS block can not be fragmented, so ZFS should just fail to write new data if no unfragmented free space exists.


They’re referring to the fact that copy-on-write always causes fragmentation. Suppose there is a 1GB file written on a fresh pool. This might get allocated in a contiguous (on-disk) 1GB space. Now you write other data. However, now you edit a part of the 1GB file: the data is read, modified, then written to a new LBA on the disk, hence the name copy-on-write. Now you have an almost-1GB file in one piece, and small pieces in another place. Over time this becomes significant fragmentation. However, it is not a problem, because normally blocks are 128kB, so that is already 32 4kB sectors per ZFS block, meaning contiguous file reads will not translate to 4k random disk reads, they will translate to 128k random reads, which is heaps better than 4k performance on HDDs. The data is technically still fragmented to all hell, but you don’t care because it doesn’t eat into your performance much.
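
(For what it’s worth, the number ZFS itself reports, e.g. with the commands below, is the fragmentation of the remaining free space, not of the files already written, but that is the metric people usually quote; the pool name is just an example.)

```
zpool list -o name,size,allocated,free,fragmentation tank
# or simply:
zpool get fragmentation tank
```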

Smaller files get smaller blocks, hence worse performance, but you have the same seeks on regular file systems, so no difference there. If you set a smaller max block size there might be a performance impact; larger ones are even better.

What does it mean for the Storj use case? I have set the block size larger than the maximum possible piece size, and pieces are never changed. Would ZFS fragment new files anyway, or would it just say the disk is full?
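
For what it’s worth, the relevant knob is the dataset’s recordsize; something like this (dataset name and value are only examples, and the setting only applies to files written after the change):

```
zfs get recordsize tank/storagenode        # check the current maximum block size
zfs set recordsize=1M tank/storagenode     # example: cap blocks at 1M for new files
```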

Hm, this is new to me. I expected it to just return a usual OS error, something like “the file is not readable”, but not to block any I/O operations.

:thinking:
I suppose no FS is designed for that…

yeah, but the pieces are never changing…