v1.63.1 is rolled out and it contains a very useful change to monitor your storagenodes:
The shared public port now accepts plain HTTP calls (in addition to the Storj protocol).
When you call it from any HTTP client, it returns HTTP 200 if, for all satellites:
- the node is not suspended
- the online score is > 0.9
In case of any error it returns HTTP 503 (ServiceUnavailable).
If you would like to be alerted quickly in case of any error, you can set up an alert with any uptime monitoring service (you can find a list here: Free for developers).
Use the same public domain/IP that is set for the storagenode (usually via the ADDRESS environment variable in the docker container).
Again: this port reports not only the availability of the storagenode, but also its state as reported back by the satellites. In case of problems (such as suspension) the HTTP call will return an error.
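The check above can be sketched with a minimal client, in case you want to script your own monitoring instead of using an external service. The URL you pass in is an assumption: use your own node's public address and port.

```python
# Minimal poller for the storagenode healthcheck endpoint (v1.63.1+).
# Maps HTTP 200 -> healthy, HTTP 503 (or unreachable) -> unhealthy.
import urllib.request
import urllib.error


def check_node(url: str, timeout: float = 5.0) -> bool:
    """Return True on HTTP 200 (healthy); False on HTTP 503 or any
    connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        # 503 ServiceUnavailable: node is reachable, but a satellite
        # reports a problem (e.g. suspension or low online score)
        return False
    except OSError:
        # node is not reachable at all
        return False
```

Call it with your node's public endpoint, e.g. `check_node("http://my-node.example.com:28967/")` (hypothetical address; substitute your own ADDRESS and port).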
The behavior of the endpoint can be changed with the following configuration options (= environment variables):
STORJ_HEALTHCHECK_DETAILS → set it to true to get a detailed response (including satellite IDs and online scores). If false (the default), only a generic true/false response is returned, so the reason may not be clear.
STORJ_HEALTHCHECK_ENABLED → true by default; set it to false to turn off this new function if you don’t like it.
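For docker nodes, these can be passed as `-e` flags; a rough sketch (the container name is an assumption, and the usual mounts, ports, and other variables are elided here):

```shell
docker run -d --name storagenode \
  -e STORJ_HEALTHCHECK_ENABLED=true \
  -e STORJ_HEALTHCHECK_DETAILS=true \
  ... \
  storjlabs/storagenode:latest
```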
If this feature is useful for you, we can further improve the conditions under which it reports failure.
I’ve posted this elsewhere as well, but nodes which worked with the now-deprecated stefan-benten satellite show as not healthy. All my new nodes look fine; all the ones which worked with the stefan-benten satellite report not healthy.
That is very useful. I had been checking the dashboard from docker for online status and free disk space, then passing the information to MQTT so that Home Assistant can run automations based on the MQTT status information.
Just reading the code. Based on my understanding, you can just move satellites.db to a safe place and restart the storagenode. I didn’t test it, but based on the code the db should be re-created with the default satellites (if you help the QA satellites, they would need to be configured again to be added…)
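The suggested workaround could look something like the sketch below, with the same caveat as the post: untested, and the paths are assumptions. Only run it while the storagenode is stopped, so the node can re-create satellites.db with the defaults on the next start.

```python
# Hypothetical sketch: park satellites.db outside the storage directory
# so the storagenode re-creates it with the default trusted satellites.
# Stop the node first; storage_dir and backup_dir are placeholders.
import pathlib
import shutil


def park_satellites_db(storage_dir: str, backup_dir: str) -> pathlib.Path:
    """Move satellites.db from storage_dir into backup_dir and return
    the backup path, so the original can be restored if needed."""
    src = pathlib.Path(storage_dir) / "satellites.db"
    dest = pathlib.Path(backup_dir) / "satellites.db.bak"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))
    return dest
```

Keeping the backup (rather than deleting the file) means the change is reversible if the re-created db misbehaves.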
I could try that, of course, but this can’t be the solution for all operators of old nodes. You can’t expect everyone to manually mess with the databases (and you probably don’t want to encourage that to begin with).
We also still get an error in the logs every time the dashboard is refreshed, saying this satellite isn’t trusted. It seems shutting down satellites just isn’t correctly supported by the node software at this point. Why not add a field in satellites.db to mark certain satellites as shut down, so these scenarios can actually be handled?
I also wonder how this deals with satellites that were exited gracefully… will those be ignored for this health status?
I agree, this is just a workaround. But decentralization requires independent handling of satellites. Storj Labs shouldn’t blacklist any satellites. It is the responsibility of the storagenode operators to add / remove additional satellites.
But we can definitely implement better tooling to make it easy (= possible without db magic).
The easiest approach might be a new config / environment variable (requires restart). A CLI command is also possible but requires more endpoints, I guess (I will check the related code; we may need some cache invalidation…).
I’d argue the satellite operator should send a shutdown signal to all nodes before finally shutting down, so this is automated. Just as there is an exit procedure for storagenodes, there should be one for satellites as well. This is going to be hard for the stefan-benten satellite, since it no longer exists, so perhaps a one-time intervention from Storj Labs, in a software update, is warranted for that one.
That said, manual tools will still be needed once community satellites become a thing, especially since someone could easily shut down a satellite without following the appropriate exit procedure, so there needs to be a way out for node operators. But that could be a longer-term thing to solve.