Prevent broken mount points from DQing nodes

We have read multiple reports of mount points breaking, or being set up incorrectly after moving a node or replacing a hard drive. This often results in the nodes being DQed for missing pieces.

This is my proposal to solve this problem:

When the node starts, it checks in with the satellite. Without any files, the node doesn’t know whether it is new or old. So let the satellite answer the check-in with some information about the node (e.g. used space as seen by the satellite, or 20 random piece IDs, or just the node’s age and whether data is stored on it).
The storagenode then realizes that it doesn’t have any of those files (or no used space for Storj) and stays offline, because the most likely cause is a wrong mount point. The satellite could send the SNO an email and put the node into suspension mode.

This way the SNO gets a warning about the misconfiguration and can easily fix it. If he doesn’t fix it in time, the suspension becomes a DQ, just like it was proposed in the downtime DQ mechanism.

That would prevent a lot of unneeded DQs and frustration among SNOs, as well as lowering the risk and fear of moving a storagenode.

This way the SNO can correct his problem in suspension mode but we’re not even touching the area of recovering from missing files or getting new files that might need to be merged or similar.

I would recommend calling this Misconfiguration Mode, since the node is configured incorrectly. Suspension mode is different and already implemented. The email should state this explicitly.

Your node (id goes here) seems to be configured incorrectly and unable to get online. Please click here to follow a checklist in getting your node back online.

The point is to not implement another mode (for simplicity of implementation). Suspension mode is for unknown errors; misconfigured mount points could be one of those.

But I agree that it should be stated in the email if possible.
However, even if not, it will be printed in the logs and can be shown on the dashboard too, so the SNO would notice immediately that the node is misconfigured, even without a specific reason in the email.

I encountered a similar situation and a simple solution for me was to put the identity files on the drive with the storagenode data. This way if the mount point is wrong, the identity files are missing and the node won’t start.
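The trick above can be wrapped in a pre-start check. This is a sketch under assumptions: that the identity lives in an `identity` directory on the data mount, and that it contains `identity.cert` and `identity.key` (adjust the paths to wherever you actually keep your identity):

```shell
# check_identity: hypothetical pre-start check. Verifies that the identity
# files are present under the data mount before the node is started.
# The layout ($1/identity with identity.cert/identity.key) is an assumption.
check_identity() {
    dir="$1/identity"
    for f in identity.cert identity.key; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing $dir/$f - mount point likely broken, not starting node" >&2
            return 1
        fi
    done
    echo "identity found under $dir"
}
```

If the mount failed, the identity directory is missing, the check fails, and a wrapper script can simply skip the `docker run` in that case.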


This is just stupid. The node should not even start if the mount point is incorrect.
But a SNO should know what they are doing; that’s the first mistake here, and 99.99999% of all threads are about that.

I’m sure the local DB knows, or at least should know, which satellites it has talked to, and could just confirm that those directories exist in the root of the configured path. If not, it won’t start. This is just a simple check and should be enough to catch misconfiguration.
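For a node that has already stored data, that check could be as simple as the following sketch. It assumes piece data lives under `<storage-root>/storage/blobs` with one subdirectory per satellite; a brand-new node without data would also fail it, so it only helps existing nodes:

```shell
# check_blobs: sketch of the startup check described above. Assumes the node
# keeps piece data under $1/storage/blobs, one subdirectory per satellite.
# Only meaningful for a node that has already stored data.
check_blobs() {
    blobs="$1/storage/blobs"
    if [ ! -d "$blobs" ] || [ -z "$(ls -A "$blobs" 2>/dev/null)" ]; then
        echo "no satellite directories under $blobs - refusing to start" >&2
        return 1
    fi
    echo "found satellite data under $blobs"
}
```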

How would it know if there is no local db because of a misconfigured mountpoint? :smiley: It could be a new node without any data yet. The storagenode container doesn’t know.

Doh! For some reason I still believed the data and the application would not live in the same path, but here it’s true, and that in itself is just really bad design.

welcome to docker… application and data are always separate and not on the same path. I don’t think it’s possible to have the docker container on the data path of the storagenode, and it certainly isn’t typical for a windows setup either.

You understand we are talking about a database here right? This has nothing to do with Windows, it’s just stupid

The data and application don’t live in the same path. The data is in the storage location and the application is either in the docker container or in the program files folder. What are you getting at? I’m not following.

So SQLite in your narrow world has nothing to do with the application? Interesting

I was asking a question… Please change your tone.

Databases that contain application state and configuration usually don’t live in the application location, as that should only contain parts of the application that don’t change. If you disagree, that’s fine, but please be respectful in your response.


i moved my identity files onto the same mount as the storagenode folders… that way it won’t have an identity if the mount fails… or shouldn’t, i suppose; it might not use them after the first launch… i suppose i should test that…

think i got the idea from @Pentium100 but not sure

if anyone knows a reason why that shouldn’t keep my node from booting without proper connection to my mount, feel free to elaborate on why i am wrong lol

It will always use them. There’s no information stored inside the docker container. So your workaround should be alright.

That will work for issues on startup of the node. I did the same thing for the same reason, but issues could also arise after startup. I still think some form of suspension where the node could only get out of it if it successfully responds to the audits it failed before could save the node from a lot of setup errors.

why would you want to fix it remotely and not locally… seems pretty trivial to keep track of whether the mount is working when doing it locally.

That’s fair, but then the node should take itself offline. If it doesn’t the satellite needs to suspend it so it doesn’t select it for uploads anymore. Either is fine by me.

i’m not saying that a satellite based consideration wouldn’t be a good thing… just that taking care of the issue and stopping the node would be very easily implemented locally, whereas the satellite would most likely require additional code to account for stuff like this…

on the other hand, having it set up from the satellite would ofc mean it would be the same for everyone, with no configuration failures to take into account…

but then again the satellites would most likely be the utmost of streamlined programming…

ill just stop now lol

My suggestion wasn’t a satellite based solution. The main solution was in the storagenode taking itself offline after realizing the data is missing. This needs some cooperation from the satellite to work but isn’t satellite based.
The addition of putting the node in suspension mode and sending an email is basically for convenience to the SNO.

i’m pondering writing a little script to run with my logging… since i append my logs out of docker every minute, i could just make it take a look and see if there is something bad going on…

going to look into that… not sure if i can come up with a good way of easily making sense of the logs tho… but ill think about it for a bit… i’m sure there is a good easy way to do that.
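One starting point for such a script could be a simple pattern scan over the tail of the appended log. The error strings below are guesses at what a broken mount or missing data would produce, not an exhaustive list — tune them to what your node actually logs:

```shell
# scan_log: hypothetical check for mount-related trouble in a node log.
# The grep patterns are assumed symptoms of a broken mount (missing pieces,
# unreadable databases) - adjust them to match your own logs.
scan_log() {
    logfile="$1"
    if tail -n 1000 "$logfile" | grep -Eiq 'file does not exist|no such file or directory|database disk image is malformed'; then
        echo "suspicious errors found in $logfile" >&2
        return 1
    fi
    echo "no suspicious errors in $logfile"
}
```

A cron job could call this every minute and stop the container when the check fails.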