Hey, lovely storage node fellows. I am creating this topic for two reasons:
- I want to show you a bit more of what a storage node is doing internally. Hopefully, that knowledge will allow us to discuss how we can improve it.
- I want to collect information about the different failures you might have seen in production. I hope that together we can write a few unit tests to reproduce and later fix these failures. Don’t be scared. I have an example test prepared, and you will see it is less magic than you might think.
Let’s start with the first one. Let me give you a few details about the dir verification. I will drop a few links to the source code. I don’t expect everyone to be able to read the source code, so please treat these links as optional.
During setup, the storage node writes its nodeID into a dir verification file. Every minute the storage node tries to read that file, and every 5 minutes it tries to write that file (source code link). The dir verification makes sure the file system is still available for read and write operations. It also catches a few user errors, like mixing up storage node identities. If the dir verification detects a failure, it shuts down the storage node so that the node counts as offline rather than failing audits (source code link).
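For those who prefer code over prose, here is a minimal sketch of the idea described above. It is not the storage node's actual implementation: the file name and every function name (`setupVerification`, `verifyReadable`, `verifyWritable`, `runChecks`, `demo`) are made up for illustration, and I am assuming the write check simply rewrites the same file.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// verificationFile is a hypothetical name, not the one the real node uses.
const verificationFile = "dir-verification"

var errWrongNode = errors.New("verification file belongs to a different node")

// setupVerification writes the node ID into the verification file once, during setup.
func setupVerification(dir string, nodeID []byte) error {
	return os.WriteFile(filepath.Join(dir, verificationFile), nodeID, 0o600)
}

// verifyReadable reads the file back and compares it with the node's own ID.
func verifyReadable(dir string, nodeID []byte) error {
	got, err := os.ReadFile(filepath.Join(dir, verificationFile))
	if err != nil {
		return fmt.Errorf("storage dir not readable: %w", err)
	}
	if !bytes.Equal(got, nodeID) {
		return errWrongNode
	}
	return nil
}

// verifyWritable rewrites the file (assumption: the real check may write differently).
func verifyWritable(dir string, nodeID []byte) error {
	if err := setupVerification(dir, nodeID); err != nil {
		return fmt.Errorf("storage dir not writable: %w", err)
	}
	return nil
}

// runChecks runs both checks periodically and returns the first failure.
// The caller would treat a returned error as the signal to shut the node down.
func runChecks(ctx context.Context, dir string, nodeID []byte, readEvery, writeEvery time.Duration) error {
	readTicker := time.NewTicker(readEvery)
	defer readTicker.Stop()
	writeTicker := time.NewTicker(writeEvery)
	defer writeTicker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-readTicker.C:
			if err := verifyReadable(dir, nodeID); err != nil {
				return err
			}
		case <-writeTicker.C:
			if err := verifyWritable(dir, nodeID); err != nil {
				return err
			}
		}
	}
}

// demo sets up a temporary dir and verifies it with a matching and a mismatched node ID.
func demo() (matching, mismatched error) {
	dir, err := os.MkdirTemp("", "dir-verification-demo")
	if err != nil {
		return err, err
	}
	defer os.RemoveAll(dir)
	if err := setupVerification(dir, []byte("node-A")); err != nil {
		return err, err
	}
	return verifyReadable(dir, []byte("node-A")), verifyReadable(dir, []byte("node-B"))
}

func main() {
	matching, mismatched := demo()
	fmt.Println("matching identity:", matching)
	fmt.Println("mismatched identity:", mismatched)

	// A directory that does not exist makes the first read check fail,
	// so the loop stops almost immediately (intervals are scaled down here).
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	missing := filepath.Join(os.TempDir(), "dir-verification-missing-demo")
	fmt.Println("loop stopped with:", runChecks(ctx, missing, []byte("node-A"), 10*time.Millisecond, 50*time.Millisecond))
}
```

In the real node the intervals would be 1 minute for reads and 5 minutes for writes. Note that nothing in this sketch bounds how long a single read or write may take.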
Now that we hopefully all know how the dir verification works, it is time to talk about audit failures that manage to bypass it. I encountered one such issue myself. One of my drives had a bad sector. For a few minutes everything ran fine, but at some point the storage node tried to read data from that bad sector. From that moment on, the drive returned no data at all because it kept retrying the bad sector over and over again. Unfortunately, the dir verification has no timeout, so it would just wait forever while audits timed out, until eventually I would have been disqualified. Luckily, my operating system was configured to send me an email alert, so I was able to fix it in time.
Long story short, the dir verification needs to fail if reading the file takes too long. I promised you a unit test to reproduce and later fix this issue. The test is really simple (source code link). It simulates a slow drive that takes 10 seconds to return data. We expect the dir verification to notice this and shut down the storage node before the read completes. All we have to do is measure how long the unit test takes. If it takes 10 seconds or more, the dir verification simply waited for the drive, and the test fails. If it finishes in less than 10 seconds, the dir verification gave up early, and the test passes. The test is currently failing. The next step would be to add a fix so that this test passes.
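To make the idea concrete, here is a rough sketch of one way such a timeout and such a test could look. It is not the linked test and not a proposed patch: `verifyWithTimeout`, `slowDrive`, `checkDrive`, and `errVerifyTimeout` are hypothetical names, and the delays are scaled down from 10 seconds so the example runs quickly.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errVerifyTimeout = errors.New("dir verification timed out")

// readFunc stands in for reading the dir verification file.
type readFunc func() ([]byte, error)

// verifyWithTimeout runs the read in a goroutine and gives up after timeout.
// The goroutine itself keeps waiting on the drive; only the caller stops waiting.
func verifyWithTimeout(read readFunc, timeout time.Duration) error {
	done := make(chan error, 1) // buffered so the goroutine can finish after a timeout
	go func() {
		_, err := read()
		done <- err
	}()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return errVerifyTimeout
	}
}

// slowDrive simulates a drive that needs delay to return data.
func slowDrive(delay time.Duration) readFunc {
	return func() ([]byte, error) {
		time.Sleep(delay)
		return []byte("node-id"), nil
	}
}

// checkDrive measures how long the verification takes against a simulated drive.
func checkDrive(delay, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	err := verifyWithTimeout(slowDrive(delay), timeout)
	return time.Since(start), err
}

func main() {
	// Scaled-down version of the scenario: the drive needs 1s, the check allows 100ms.
	delay := time.Second
	elapsed, err := checkDrive(delay, 100*time.Millisecond)
	fmt.Println("error:", err)
	fmt.Println("finished before the drive answered:", elapsed < delay)
}
```

The pass condition is the same as in the post: the check has to return before the simulated drive does. One design caveat with this approach is that a read blocked on a truly hung drive never returns, so the goroutine stays around. That is acceptable if the node shuts down right after the timeout anyway.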
Have you seen other audit failures that we might be able to reproduce in a similar way? Does someone want to add the missing timeout so that the test above will pass? We appreciate any contribution.