Easy monitoring for Storagenodes

v1.63.1 has been rolled out, and it contains a very useful change for monitoring your storagenodes:

The shared public port now accepts plain HTTP calls (in addition to the Storj protocol).

When you call it from any HTTP client, it returns HTTP 200 if, for all satellites:

  • the node is not suspended
  • the node is not disqualified
  • the online score is > 0.9

In case of any error it returns HTTP 503 (Service Unavailable).

If you would like to be alerted quickly in case of any error, you can set up an alert with any uptime monitoring service (you can find a list here: Free for developers).

Use the same public domain/IP that is set for the storagenode (usually via the ADDRESS environment variable in the docker container).

Again: this port reports not only the availability of the storagenode but also its state as reported back by the satellites. In case of problems (like suspension), the HTTP call will return an error.
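The check above is easy to script. Here is a minimal sketch: the classification logic (200 = healthy, anything else = a reported problem) runs as-is, while the live curl call is shown commented out with a placeholder address (NODE_ADDR is an assumption, not from this post):

```shell
# Interpret the healthcheck status code: 200 means healthy on all
# satellites; anything else (typically 503) means a reported problem.
classify() {
    if [ "$1" = "200" ]; then
        echo "healthy"
    else
        echo "unhealthy (HTTP $1)"
    fi
}

# Live usage (uncomment and replace the placeholder address):
# NODE_ADDR="node.example.com:28967"
# code=$(curl -s -o /dev/null -w '%{http_code}' "http://$NODE_ADDR/")
# classify "$code"

classify 200
classify 503
```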

The behavior of the endpoint can be changed with the following configuration options (= environment variables):

STORJ_HEALTHCHECK_DETAILS → set it to true to get a detailed response (includes satellite IDs and online scores). If false (the default), only a generic true/false response is returned, so the reason might not be clear.

STORJ_HEALTHCHECK_ENABLED → this is true by default, but you can use it to turn off this new function if you don’t like it.
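For docker setups, these would be passed like any other environment variable. A sketch only: the container name, address, and image tag below are placeholders, and the identity/storage mounts are omitted:

```shell
# Placeholder names; identity/storage mounts and other usual flags omitted.
docker run -d --name storagenode \
  -p 28967:28967 \
  -e ADDRESS="node.example.com:28967" \
  -e STORJ_HEALTHCHECK_ENABLED=true \
  -e STORJ_HEALTHCHECK_DETAILS=true \
  storjlabs/storagenode:latest
```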

In case this feature is useful for you, we can further improve the condition when it reports failure.


I’ve posted this elsewhere as well, but nodes which have worked with the now-deprecated stefan-benten satellite show the status as not healthy. All my new nodes look fine; all the ones which worked with the stefan-benten satellite report not healthy.


How do I do that?
In the browser, e.g. Chrome, enter this in the address bar:
http://195.52.14.158:28967
???


Does this mean that it doesn’t return an error before it gets disqualified in case of audit failures?


I get

{
  "Statuses": null,
  "Help": "To access Storagenode services, please use DRPC protocol!",
  "AllHealthy": true
}

when I hit the URL in my browser.

Thanks, this is very useful. Perhaps it would also be possible to include whether the node is full or not?

Here is a very simple script that tests a bunch of nodes:

#!/usr/bin/perl
use strict;
use warnings;

## node list
my @nodes = qw{
    1.2.3.4
    5.6.7.8
    9.10.11.12
};

foreach my $node (@nodes) {
    my $test = `wget http://$node:28967 -O - 2>/dev/null`;
    if ($test =~ /"AllHealthy": true/) {
        print "$node passed\n";
    } else {
        print "$node failed\n";
    }
}

That is very useful. I had been checking the dashboard from docker for the online status and free disk space. I then passed the information to MQTT so that Home Assistant can run automations based on the MQTT status information.

#!/bin/bash

touch ./statusfile

HOST=$(hostname)

timeout 60 docker exec -t storagenode0001 /app/dashboard.sh > ./statusfile
STATUS=$(grep "ONLINE" ./statusfile)
echo "$STATUS"
if [[ "$STATUS" == *"Status ONLINE"* ]]; then
    echo "Storj Online"
    #
    # $HOST is used as userid
    #
    mosquitto_pub -h 192.168.99.999 -p 1883 -u "$HOST" -P "Password123" -t "storj/$HOST/status" -m "Online"
else
    echo "Storj offline"
    mosquitto_pub -h 192.168.1.33 -p 1883 -u "$HOST" -P "Password123" -t "storj/$HOST/status" -m "Offline"
    docker restart storagenode0001 -t 300
fi
#
# Obtain free disk space on machine
#
DISKSTRING=$(grep "Disk" ./statusfile | head -1)
DISKFREE=$(echo "$DISKSTRING" | sed 's@^[^0-9]*\([0-9]\+\).*@\1@')
mosquitto_pub -h 192.168.99.999 -p 1883 -u "$HOST" -P "Password123" -t "storj/$HOST/diskfree" -m "$DISKFREE"

rm ./statusfile

Yes, it does, but that can easily be added. I am not sure what a good threshold is with the new audit system.

In most cases disqualification starts with suspension; that’s why I started with it, but I can add other conditions if you have a good suggestion for the threshold…

Absolutely, it’s a good idea.

Just reading the code: based on my understanding, you can just move satellites.db to a safe place and restart the storagenode. I didn’t test it, but based on the code the db should be re-created with the default satellites (if you help the QA satellites, it should be configured again to be added…)

Even if that worked as a workaround, my take is that nodes should take care of that automatically, ideally :slight_smile:

I could try that of course, but this can’t be the solution for all old node operators. You can’t expect everyone to manually mess with the databases (and probably don’t want to encourage that to begin with).

We also still get an error in the logs every time the dashboard is refreshed saying this satellite isn’t trusted. It seems shutting down satellites just isn’t correctly supported by the node software at this point. Why not add a field in the satellites.db to mark certain satellites as shut down, so you can actually handle these scenarios?

I also wonder how this deals with satellites that were exited gracefully… will those be ignored for this health status?


Considering the new audit system disqualifies nodes when the score drops below 96%, I’d say it should start alerting when the audit score goes below 99.5%, or something similar?


I agree, this is just a workaround. But decentralization requires independent handling of satellites. Storj Labs shouldn’t blacklist any satellites. It is the responsibility of the storagenode operators to add / remove additional satellites.

But we can definitely implement better tooling to make it easy (=possible without db magic)

The easiest approach might be a new config / environment variable (requires a restart). A CLI command is also possible, but it requires more endpoints, I guess (I will check the related code; we may require some cache invalidation…).

I’d argue the satellite operator should send a shut down signal to all nodes before finally shutting down, so this is automated. Just like there is an exit procedure for storagenodes, there should be one for satellites as well. This is going to be hard for the stefan-benten satellite, since it no longer exists. So perhaps a one time intervention from Storj Labs is warranted for that one. In a software update.

That said, manual tools will still be needed when community satellites become a thing. Especially since someone could easily shut down a satellite without calling the appropriate exit procedure, so there needs to be a way out for node operators. But that could be a longer-term thing to solve.
