Improve storage node dir verification to prevent audit failures

Hey, fellow storage node operators. I am creating this topic for two reasons:

  1. I want to show you a bit more of what a storage node is doing internally. Hopefully, that knowledge will allow us to discuss how we can improve it.
  2. I want to collect information about different failures you might have seen in production. I hope that together we can write a few unit tests to reproduce and later fix these failures. Don’t be scared: I have an example test prepared, and you will see it is not as much magic as you might think.

Let’s start with the first one. Let me give you a few details about the dir verification. I will drop a few links to the source code, but I don’t expect everyone to read it, so please treat these links as optional.
During setup, the storage node writes its nodeID into a dir verification file. Every minute the storage node tries to read that file, and every 5 minutes it tries to write that file (source code link). This dir verification makes sure the file system is still available for read and write operations. It also catches a few user errors, like mixing up storage node identities. If the dir verification detects a failure, it shuts down the storage node so that it counts as offline instead of failing audits (source code link).
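To make this a bit more concrete, here is a minimal sketch of such a verification loop. This is not the actual storagenode code: the file name, function names, and error handling are simplified stand-ins, and the write check shown here uses a separate temporary file (which, as clarified further down in this thread, is also what the real write check does).

```go
// Minimal sketch of a dir verification loop (illustrative only; names,
// intervals, and error handling are simplified, not the real storagenode code).
package main

import (
	"bytes"
	"log"
	"os"
	"path/filepath"
	"time"
)

const verificationFile = "storage-dir-verification" // written once during setup

// verifyDirReadable checks that the verification file still exists and contains
// the expected node ID (catches unmounted drives and mixed-up identities).
func verifyDirReadable(dir string, nodeID []byte) error {
	content, err := os.ReadFile(filepath.Join(dir, verificationFile))
	if err != nil {
		return err
	}
	if !bytes.Equal(content, nodeID) {
		return os.ErrInvalid // wrong identity for this storage directory
	}
	return nil
}

// verifyDirWritable creates and removes a temporary file to prove the
// file system still accepts writes.
func verifyDirWritable(dir string) error {
	f, err := os.CreateTemp(dir, "write-test")
	if err != nil {
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	return os.Remove(f.Name())
}

func main() {
	dir, nodeID := "/mnt/storagenode/storage", []byte("NODE_ID")

	readTicker := time.NewTicker(1 * time.Minute)
	writeTicker := time.NewTicker(5 * time.Minute)
	defer readTicker.Stop()
	defer writeTicker.Stop()

	for {
		select {
		case <-readTicker.C:
			if err := verifyDirReadable(dir, nodeID); err != nil {
				// Shut down so the node counts as offline instead of failing audits.
				log.Fatalf("storage directory not readable: %v", err)
			}
		case <-writeTicker.C:
			if err := verifyDirWritable(dir); err != nil {
				log.Fatalf("storage directory not writable: %v", err)
			}
		}
	}
}
```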

Now that we hopefully all know how the dir verification works, it is time to talk about audit failures that manage to bypass it. I encountered one issue myself: one of my drives had a bad sector. For a few minutes everything was running fine, but at some point the storage node tried to read data from that bad sector. From that moment the drive stopped returning any data because it kept retrying the bad sector over and over again. Unfortunately, the dir verification has no timeout, so it would just wait forever, and meanwhile audits kept timing out, which would eventually have led to disqualification. Luckily my operating system was configured to send me an email alert, so I was able to fix it in time.

Long story short: the dir verification needs to fail if reading the file takes too long. I promised you a unit test to reproduce and later fix this issue. The test is really simple (source code link). It simulates a slow drive that takes 10 seconds to return data. We expect the dir verification to notice this and shut down the storage node earlier. All we have to do is measure how long the unit test takes: if it takes 10 seconds or more, the dir verification failed; if it finishes in less than 10 seconds, that is a success. The test is currently failing. The next step is to add a fix so that this test passes.
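To give a feel for how such a test could look, here is a hedged sketch. It is not the real test from the repository: slowRead stands in for a drive that hangs on a bad sector, and the 5-second deadline is just a placeholder for whatever timeout the fix ends up using.

```go
// Illustrative sketch of the slow-drive test (not the real test in the repo).
package monitor_test

import (
	"context"
	"errors"
	"testing"
	"time"
)

// slowRead simulates a drive with a bad sector: the read blocks for 10
// seconds unless the caller gives up earlier via the context.
func slowRead(ctx context.Context) error {
	select {
	case <-time.After(10 * time.Second):
		return nil // the read finally "succeeded"
	case <-ctx.Done():
		return ctx.Err() // the verification gave up in time
	}
}

func TestVerifyDirReadableTimesOutOnSlowDrive(t *testing.T) {
	start := time.Now()

	// The fix under discussion: give the verification a deadline so it does
	// not wait forever on a hanging read.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	if err := slowRead(ctx); !errors.Is(err, context.DeadlineExceeded) {
		t.Fatalf("expected a timeout error, got %v", err)
	}

	// If the whole test took 10 seconds or more, the verification waited for
	// the slow drive instead of shutting the node down early.
	if time.Since(start) >= 10*time.Second {
		t.Fatal("dir verification did not time out on the slow drive")
	}
}
```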

Have you seen other audit failures that we might be able to reproduce in a similar way? Does someone want to add the missing timeout so that the test above will pass? We appreciate any contribution.

7 Likes

If I am not mistaken, all we need to do is add a timeout context here: source code link

The audit job has such a timeout that we could just copy: source code link

Does anyone want to accept this challenge? It should be just 3 lines of code.
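For anyone picking this up, here is a hedged sketch of what those few lines could look like. The function and its signature below are placeholders of my own, not the actual code in the repository:

```go
// Illustrative only: bound the readability check with a deadline, similar to
// how the audit job limits how long it waits. The one-minute value and the
// verify callback are placeholders.
package monitor

import (
	"context"
	"time"
)

func checkReadableWithTimeout(ctx context.Context, verify func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, 1*time.Minute)
	defer cancel()
	return verify(ctx)
}
```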

In a final version, please use more than just 10 seconds. I have an eSATA enclosure that sometimes reconnects drives under heavy IO load, and that process makes them unavailable for 10-15 seconds. I wouldn’t want my nodes to shut down over a relatively short and temporary error.
(However, I’m not completely sure how ZFS handles this case. It’s possible the dir verification file stays in the ARC and won’t get read from the HDD itself ever again after the initial node start. Even when writing the file, it could be kind of cached in a SLOG, obviously defeating the purpose of the write attempt… but that’s how it is with caches, I guess…)

2 Likes

I think that the node should shut down if it gets multiple “file not found” errors in a row during audits. This normally should never happen, so it’s an indication of a problem, like the USB cable falling out.

1 Like

If the drive is disconnected, the dir verification will shut down the node.
The error would be something like “i/o error”.

It seems to me that if the drive was auto-mounted and then gets disconnected, the system would just unmount it, and when the node tries to write the test file, a new file would be created on the OS drive instead.

Still, multiple “file not found” or “i/o” errors in a row should never happen and indicate a serious problem. It is better to shut the node down, alert the operator, and let them try to fix the problem.

I could be wrong, but I think @littleskunk misstated that it writes the same file. It writes a different file to test writability. So in that scenario the read test would still fail, as it can’t find the file with the node ID.

1 Like

This one is for you: 2 lines of code

What: Add test steps to make sure VerifyDirWriteable doesn’t create the file for VerifyDirReadable. Even if the write test gets executed first, the read test should still throw an error.

Why: There was a question in the community: what happens if the storage node drive gets unmounted and the storage node continues with an empty directory instead? Wouldn’t the VerifyDirWriteable check create the file and VerifyDirReadable pass after that? The answer is no: they use different files. This unit test will double-check that.
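A rough sketch of the idea behind those test steps, with simplified stand-ins for the real checks (the file name and the temporary-file write are assumptions on my side, not the repository’s exact code):

```go
// Illustrative sketch only; the real test should call the storagenode's own
// VerifyDirWriteable and VerifyDirReadable instead of these stand-ins.
package monitor_test

import (
	"os"
	"path/filepath"
	"testing"
)

func TestWriteCheckDoesNotSatisfyReadCheck(t *testing.T) {
	dir := t.TempDir() // an empty directory, as if a fresh mount point

	// Stand-in for the write check: it creates and removes its own temporary file.
	f, err := os.CreateTemp(dir, "write-test")
	if err != nil {
		t.Fatal(err)
	}
	if err := f.Close(); err != nil {
		t.Fatal(err)
	}
	if err := os.Remove(f.Name()); err != nil {
		t.Fatal(err)
	}

	// Stand-in for the read check: it must still fail, because the file with
	// the node ID was never written into this empty directory.
	if _, err := os.ReadFile(filepath.Join(dir, "storage-dir-verification")); !os.IsNotExist(err) {
		t.Fatalf("expected the read check to fail on an empty directory, got %v", err)
	}
}
```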

3 Likes

That’s awesome, but for what it’s worth, the CheckWritability function writes a temporary file that is subsequently deleted.

Can’t hurt to have a test though.

1 Like

Yeah- in my experience with server farms, even well-provisioned machines with SSDs can sometimes experience surprising delays of 60s or more under the right load conditions. It doesn’t happen often, of course; maybe once per node per year, but that’s still enough that it will cause problems for some people.

I’d suggest a timeout somewhere between 90 and 300 seconds.

5 Likes

Other than added complexity, is there any reason that this timeout value couldn’t be added as a config file option? The default could be whatever is deemed reasonable for most people, but then others with more complex setups could adjust as they see fit. (I say this being unable to write any of the code that would be involved.)

4 Likes

No, no reason. We would definitely make it a config option; my concern is only about what to use as the default (because few people are likely to change it).
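For illustration, such a setting could look something like the sketch below. The field name, struct tags, and default value are assumptions on my side, not the actual storagenode config:

```go
// Hypothetical sketch of a configurable verification timeout; the field name,
// tags, and default value are placeholders, not the real storagenode config.
package monitor

import "time"

// Config holds the directory-verification settings.
type Config struct {
	// VerifyDirTimeout bounds how long a single readability or writability
	// check may take before the node shuts itself down.
	VerifyDirTimeout time.Duration `help:"how long to wait for a dir verification check before giving up" default:"1m0s"`
}
```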

1 Like

Do you have statistics on audit response times? If so, you could use those stats to determine what a reasonable response time would be. My guess is that the bulk respond within 10, maybe 30 seconds, and then almost nothing until the timeout of 5 minutes. Based on that, you could probably find a reasonable default.

1 Like