While similar to the suggestion posted here, the implementation of this one should require no changes to satellite code and focuses on taking the node offline, rather than failing audits, when the storage location becomes unavailable. After discussing with @kevink, we decided this is different enough to post as a separate suggestion.
We seem to have two issues that we want to prevent:
- The node starts over from zero if the mount point isn’t available at start.
- The node keeps running if the storage location somehow becomes unavailable during runtime.
Let’s start with point 1. This is only really an issue on Linux when the data location is the root of an HDD: if the drive fails to mount, the empty mount-point directory still exists, so the node happily runs setup there and starts from scratch on the wrong disk. There are several ways to prevent this, like using a subdirectory on the HDD or placing your identity on the node’s HDD as well; in either case the node simply won’t start and the problem is solved. I suggest changing the installation instructions to store the identity on the same HDD as the storage location, along with a note to use a subfolder for data rather than the root of an HDD. Additionally, for the Docker implementation, the entrypoint could be altered to not automatically run setup if the config file is missing. That way the node would stop with an error about the missing config file.
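To make the entrypoint idea concrete, here is a minimal sketch in Go of what such a startup guard could look like. The config path `/app/config/config.yaml` and the error message are assumptions for illustration, not the actual storagenode or entrypoint code:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// requireExistingConfig refuses to fall back to automatic setup when the
// config file is missing. Today, an unmounted drive leaves an empty
// directory behind, and setup runs there as if it were a new node.
func requireExistingConfig(configDir string) error {
	configPath := filepath.Join(configDir, "config.yaml") // assumed location
	if _, err := os.Stat(configPath); os.IsNotExist(err) {
		return fmt.Errorf("config file %s not found: storage location may not be mounted; refusing to run setup", configPath)
	} else if err != nil {
		return err
	}
	return nil
}

func main() {
	if err := requireExistingConfig("/app/config"); err != nil {
		fmt.Fprintln(os.Stderr, "FATAL:", err)
		os.Exit(1)
	}
	// ... continue with normal node startup ...
}
```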
Moving on to point 2. I think this can be fixed by simply having the node check whether the storage path is available from time to time, or whenever any transfer fails with a “file not found” or similar error. If the path is unavailable, the node should crash with a FATAL error of “storage location not available”.
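A rough sketch of what the periodic check could look like, assuming a simple write-then-read probe against the storage directory. The function names, probe file, interval, and storage path are all placeholders, not actual storagenode internals:

```go
package main

import (
	"log"
	"os"
	"path/filepath"
	"time"
)

// checkWritable writes and reads back a small probe file to confirm the
// storage path is still mounted and usable.
func checkWritable(storageDir string) error {
	probe := filepath.Join(storageDir, ".availability-probe") // hypothetical file
	if err := os.WriteFile(probe, []byte("ok"), 0o644); err != nil {
		return err
	}
	if _, err := os.ReadFile(probe); err != nil {
		return err
	}
	return os.Remove(probe)
}

// monitorStorage runs the check on a timer and crashes the node with a
// FATAL error the moment the location becomes unavailable.
func monitorStorage(storageDir string, interval time.Duration) {
	for range time.Tick(interval) {
		if err := checkWritable(storageDir); err != nil {
			log.Fatalf("FATAL: storage location not available: %v", err)
		}
	}
}

func main() {
	go monitorStorage("/app/config/storage", time.Minute)
	select {} // stand-in for the node's normal run loop
}
```

The same `checkWritable` call could also be triggered from the transfer error path, so a “file not found” during upload or download immediately confirms whether the whole location is gone rather than just one piece.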
Edit: This check should not only make sure that the location can be read from and written to, but should also double-check that it is a valid storage path for this node. Ideally this could be done by storing a file there with the node ID in it, so the node can verify that the data it points to belongs to the node accessing it. This would also avoid mistakes of pointing to the wrong storage location on multi-node systems.
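A sketch of that identity-marker idea, again in Go: a small file in the storage directory holding the node ID, written once at setup and verified on startup and during the periodic check. The marker file name and the string form of the node ID are assumptions for illustration:

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
)

const markerName = "storage-dir-verification" // hypothetical file name

// writeMarker stores the node ID in the storage directory once, at setup time.
func writeMarker(storageDir, nodeID string) error {
	return os.WriteFile(filepath.Join(storageDir, markerName), []byte(nodeID), 0o644)
}

// verifyMarker confirms the directory belongs to this node, catching both
// unmounted paths and paths pointed at another node's data.
func verifyMarker(storageDir, nodeID string) error {
	data, err := os.ReadFile(filepath.Join(storageDir, markerName))
	if err != nil {
		return fmt.Errorf("storage location not available or not initialized: %w", err)
	}
	if !bytes.Equal(bytes.TrimSpace(data), []byte(nodeID)) {
		return fmt.Errorf("storage location belongs to node %s, not %s", bytes.TrimSpace(data), nodeID)
	}
	return nil
}

func main() {
	// Example usage with a made-up node ID string.
	if err := verifyMarker("/app/config/storage", "exampleNodeID"); err != nil {
		fmt.Fprintln(os.Stderr, "FATAL:", err)
		os.Exit(1)
	}
}
```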
The combination of these two things would also ensure the node doesn’t automatically restart, since it won’t start if the storage location is unavailable either. As for notifications, work is already being done to email SNOs when their node goes offline, and in the meantime the node being offline can be detected with Uptime Robot, which is already used by a large part of the community. It wouldn’t mention the specific problem, but a quick look at the node logs would instantly show what the issue is.
The upside of this approach is simplicity. It fixes a very specific problem, the storage location not being available, without touching the way normal audits work. It doesn’t give SNOs a way to avoid responding to audits, and it doesn’t require changes on the satellite end. Hopefully this relatively simple solution will be considered a quick win by the Storj team.
I noticed @moby responded to the other suggestion related to the same thing, so I’m just pinging you on this one as well. Though something similar may already be in the works.