Prevent broken mount points from disqualifying nodes

For logs written to a file

tail -f /volume1/storj/v3/data/node.log | awk '/(ERROR|canceled|failed).*GET_AUDIT/ {system ("docker stop -t 300 storagenode")}'

For logs in docker

docker logs -f --tail 20 storagenode | awk '/(ERROR|canceled|failed).*GET_AUDIT/ {system ("docker stop -t 300 storagenode")}'

Please be aware that I haven’t tested the docker version, but it should work.

This stops your node if it encounters a single audit failure, which might be overly aggressive, as it would also stop your node for recoverable audit failures. This suggestion is just for educational purposes. Don’t use this (unless you know what you’re doing).


I suppose one could set it so that if it sees two failed audits within the one-minute log window, it shuts the node down and triggers an alert, and if need be increase the window to two or five minutes to get a broader view.

Of course that comes at the cost of slower reaction time.

Thanks, I’ll give it a try sometime in the near future when I have time to implement it.
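
A minimal sketch of that counting idea, for the file-based logs from above (untested; it uses GNU awk’s systime(), and the 60-second window and two-failure threshold are just example values):

tail -f /volume1/storj/v3/data/node.log | gawk '
/(ERROR|canceled|failed).*GET_AUDIT/ {
        now = systime()
        # reset the counter if the previous hit was outside the window
        if (now - last > 60) count = 0
        count++
        last = now
        if (count >= 2) {
                system("docker stop -t 300 storagenode")
                exit
        }
}'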

Not bad, but I opened this suggestion so we won’t need solutions like that in the future :smiley:


Oh absolutely. It’s not a solution and it can never be as good as the node software doing it itself. And besides, only a small fraction of node operators would ever read something like this on a forum. Your suggestion is still very valid.

Well, one simple option might be to extend the already growing docker run command… xD
That might simply mean updating the Storj documentation website once we have a workable command, which buys Storj time to bake in a solution.

6 posts were split to a new topic: Find a way to stop the node on audit errors

6 posts were split to a new topic: Make the node crash with a fatal error if the storage path becomes unavailable

We currently have work in progress to add a new preflight check to storagenodes that will prevent them from starting if they are not pointed at the correct storage directory. We discussed a variety of ways to implement it, but the most practical approach for now seems to be the following (a rough sketch follows the list):

  • On running for the first time, place a special file in the storage directory to indicate that pieces should be stored there
  • When the node is restarted in the future, it will check the storage directory for this new file. If the file does not exist, it will assume the mount point for the storage directory was incorrectly configured and fail to start.
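
As a rough external illustration of that check only (this is not the actual implementation, which will live inside the storagenode binary; the marker file name, paths, and run invocation here are made up):

# hypothetical wrapper: refuse to start the node if the marker file
# is missing, which usually means the storage mount is not present
STORAGE_DIR=/volume1/storj/v3/data/storage
MARKER="$STORAGE_DIR/storage-dir-marker"

if [ ! -f "$MARKER" ]; then
        echo "storage directory marker not found - refusing to start" >&2
        exit 1
fi
./storagenode run --config-dir config --identity-dir identity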

Also make the node check for the existence of that file every minute or so. If it disappears, crash.
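
Until something like that lands in the node itself, the same idea can be approximated from outside with a small watchdog loop (untested; container name and marker path as in the sketch above):

# stop the container if the marker file disappears, checking once a minute
MARKER=/volume1/storj/v3/data/storage/storage-dir-marker
while sleep 60; do
        if [ ! -f "$MARKER" ]; then
                docker stop -t 300 storagenode
                break
        fi
done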


How would it know that it isn’t a new node?

We haven’t determined the best way to do it yet, but we’ve discussed a couple options:

  • If there are 0 pieces stored when the node is starting up, add the file
  • Run a special command when a new mount point is used that will add the file. After the first time, a different command should be used

I guess the operator could also manually insert the file themselves, but that wouldn’t be ideal.

Thanks for the suggestion. I added this point to the ticket.

That looks rather complicated. Would it be an idea to store that file with the identity files? Because if those are unavailable, the node wouldn’t start anyway, but if they are available, so will the file be, even if the storage location is unavailable.
For Pentium100’s idea, that would then need a second file to be checked continually during uptime.

This option doesn’t sound like it will work, because if the mount point is wrong, there will be 0 pieces stored and the node would just create the file again. So I’m not sure how this would be different from the current situation.

Since the node already won’t start correctly if the identity directory is incorrectly configured, we are not storing the new file there. We are storing it in the directory where pieces are stored.

We should be able to determine the number of pieces stored without looking in the directory - I think there is some relevant information in the storagenode database. That said, we still need a way to create the file for the first time even if a node is storing more than 0 pieces. For example, when this update goes out, we want existing storagenodes to create the file even if they have more than 0 pieces. But after that we want them to fail to start if the file does not exist. So there are some details that still need to be figured out.

But if the storage location is unavailable, there won’t be a database.

I’m sorry but I don’t understand the logic behind this. My whole point was to store the file there because without a correctly configured identity directory, the node won’t start.
Storing the file in the pieces directory will make it unavailable every time the pieces are unavailable, and the node has no way of knowing whether the file should exist, because that information would need to be stored somewhere else (somewhere that is accessible even if the storage directory isn’t).
That’s why my initial suggestion was to ask the satellite if the node was supposed to be empty or not but it could of course be solved with a special file, if it is accessible when the storage directory is not.


Those are good points, and I will make sure we address them in whatever implementation we decide on in the end. Thank you for pointing out some of the flaws I didn’t notice. We want to avoid contacting the satellite if possible, but if it’s necessary to implement this check properly, we might have to.


So, wouldn’t the simplest solution on startup simply be to remove this part from the entrypoint of the docker container?

if [[ ! -f "config/config.yaml" ]]; then
        ./storagenode setup --config-dir config --identity-dir identity
fi

That would mean the setup command only needs to be run manually once, during the first setup of the node. After that, the node would simply fail on startup because of the missing config.
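
For illustration, the trimmed entrypoint could then look roughly like this (the run invocation is only a guess at what the rest of the entrypoint does):

# fail fast instead of silently re-running setup when the mount is missing
if [[ ! -f "config/config.yaml" ]]; then
        echo "config/config.yaml not found - storage mount missing?" >&2
        exit 1
fi
./storagenode run --config-dir config --identity-dir identity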

Then, within the node software itself, you can check for the existence of the config file from time to time.

Though I’m not entirely sure how it works on Windows. A different implementation may be needed there.


I thought the simplest way is to not start the node if the config.yaml file isn’t present. Am I missing something here? And when creating a new node, there should be a switch to make the SN create all the initial files; it shouldn’t be automatic, since this is only ever done once.

Also, permission-denied errors from the filesystem should trigger this too. I had a problem recently where some permission errors started appearing by themselves for some reason, which I presume would have led to disqualification.
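
For the interim log-watching workaround from earlier in the thread, that could be as simple as widening the pattern, e.g. (untested; the exact wording of permission errors in the node log is an assumption):

docker logs -f --tail 20 storagenode | awk '/((ERROR|canceled|failed).*GET_AUDIT)|permission denied/ {system("docker stop -t 300 storagenode")}'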

It would probably be the simplest solution for development and still very easy for the users. I agree that it would be best if the storagenode simply didn’t start if the files (e.g. config.yaml) are not available. Running another command for setting up a storagenode for the first time which generates the config.yaml should be easy enough for everyone.

https://review.dev.storj.io/c/storj/storj/+/2272
