Easy monitoring for Storagenodes

v1.63.1 has been rolled out, and it contains a very useful change for monitoring your storagenodes:

The shared public port now accepts plain HTTP calls (in addition to the Storj protocol).

When you call it from any HTTP client, it returns HTTP 200 if, for all satellites:

  • the node is not suspended
  • the node is not disqualified
  • the online score is > 0.9

In case of any error it returns HTTP 503 (Service Unavailable).

If you would like to be alerted quickly in case of any error, you can set up an alert with any uptime monitoring service (you can find a list here: Free for developers).

Use the same public domain/IP that is set for the storagenode (usually via the ADDRESS environment variable in the docker container).

Again: this port reports not only the availability of the storagenode but also its state as reported back by the satellites. In case of problems (like suspension), the HTTP call will return an error.
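The check above is easy to script. Here is a minimal sketch: the classification logic (200 = healthy, anything else = a reported problem) runs as-is, while the live curl call is shown commented out with a placeholder address (NODE_ADDR is an assumption, not from this post):

```shell
# Interpret the healthcheck status code: 200 means healthy on all
# satellites; anything else (typically 503) means a reported problem.
classify() {
    if [ "$1" = "200" ]; then
        echo "healthy"
    else
        echo "unhealthy (HTTP $1)"
    fi
}

# Live usage (uncomment and replace the placeholder address):
# NODE_ADDR="node.example.com:28967"
# code=$(curl -s -o /dev/null -w '%{http_code}' "http://$NODE_ADDR/")
# classify "$code"

classify 200
classify 503
```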

The behavior of the endpoint can be changed with the following configuration options (= environment variables):

STORJ_HEALTHCHECK_DETAILS → set it to true to get a detailed response (includes satellite IDs and online scores). If false (the default), only a generic true/false response is returned, so the reason might not be clear.

STORJ_HEALTHCHECK_ENABLED → this is true by default, but you can use it to turn off this new function if you don’t like it.
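For docker setups, these would be passed like any other environment variable. A sketch only: the container name, address, and image tag below are placeholders, and the identity/storage mounts are omitted:

```shell
# Placeholder names; identity/storage mounts and other usual flags omitted.
docker run -d --name storagenode \
  -p 28967:28967 \
  -e ADDRESS="node.example.com:28967" \
  -e STORJ_HEALTHCHECK_ENABLED=true \
  -e STORJ_HEALTHCHECK_DETAILS=true \
  storjlabs/storagenode:latest
```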

In case this feature is useful for you, we can further improve the condition when it reports failure.


I’ve posted this elsewhere as well, but nodes which have worked with the now-deprecated stefan-benten satellite show the status as not healthy. All my new nodes look fine; all the ones which worked with the stefan-benten satellite report not healthy.


How do I do that?
In the browser, e.g. Chrome, enter this in the address bar:
http://195.52.14.158:28967
???


Does this mean that it doesn’t return an error before it gets disqualified in case of audit failures?


I get

{
  "Statuses": null,
  "Help": "To access Storagenode services, please use DRPC protocol!",
  "AllHealthy": true
}

when I hit the URL in my browser.

Thanks, this is very useful. Perhaps it would also be possible to include whether the node is full or not?

Here is a very simple script that tests a bunch of nodes:

#!/usr/bin/perl
use strict;
use warnings;

## node list
my @nodes = qw{
    1.2.3.4
    5.6.7.8
    9.10.11.12
};

foreach my $node (@nodes) {
    my $test = `wget http://$node:28967 -O - 2>/dev/null`;
    if ($test =~ /"AllHealthy": true/) {
        print "$node passed\n";
    } else {
        print "$node failed\n";
    }
}

That is very useful. I had been checking the dashboard from docker for the online status and free disk space. I then passed the information to MQTT so that Home Assistant can run automations based on the MQTT status information.

#!/bin/bash

touch ./statusfile

HOST=$(hostname)

timeout 60 docker exec -t storagenode0001 /app/dashboard.sh > ./statusfile
STATUS=$(grep "ONLINE" ./statusfile)
echo "$STATUS"
if [[ "$STATUS" == *"Status ONLINE"* ]]; then
    echo "Storj Online"
    #
    # $HOST is used as userid
    #
    mosquitto_pub -h 192.168.99.999 -p 1883 -u "$HOST" -P "Password123" -t "storj/$HOST/status" -m "Online"
else
    echo "Storj offline"
    mosquitto_pub -h 192.168.1.33 -p 1883 -u "$HOST" -P "Password123" -t "storj/$HOST/status" -m "Offline"
    docker restart storagenode0001 -t 300
fi
#
# Obtain free disk space on machine
#
DISKSTRING=$(grep "Disk" ./statusfile | head -1)
DISKFREE=$(echo "$DISKSTRING" | sed 's@^[^0-9]*\([0-9]\+\).*@\1@')
mosquitto_pub -h 192.168.99.999 -p 1883 -u "$HOST" -P "Password123" -t "storj/$HOST/diskfree" -m "$DISKFREE"

rm ./statusfile

Yes, it does, but that can easily be added. I am not sure what a good threshold is with the new audit system.

In most cases disqualification starts with suspension; that’s why I started with it, but I can add other conditions if you have a good suggestion for the threshold…

Absolutely, it’s a good idea.

Just reading the code: based on my understanding, you can just move satellites.db to a safe place and restart the storagenode. I didn’t test it, but based on the code the db should be re-created with the default satellites (if you help the QA satellites, it should be configured again to be added…)

Even if that worked as a workaround, my take is that nodes should take care of that automatically, ideally :slight_smile:

I could try that of course, but this can’t be the solution for all old node operators. You can’t expect everyone to manually mess with the databases (and probably don’t want to encourage that to begin with).

We also still get an error in the logs every time the dashboard is refreshed saying this satellite isn’t trusted. It seems shutting down satellites just isn’t correctly supported by the node software at this point. Why not add a field in the satellites.db to mark certain satellites as shut down, so you can actually handle these scenarios?

I also wonder how this deals with satellites that were exited gracefully… will those be ignored for this health status?


Considering the new audit system disqualifies nodes when the score drops below 96%, I’d say it should start alerting when the audit score goes below 99.5%, or something similar?


I agree, this is just a workaround. But decentralization requires independent handling of satellites. Storj Labs shouldn’t blacklist any satellites. It is the responsibility of the storagenode operators to add / remove additional satellites.

But we can definitely implement better tooling to make it easy (=possible without db magic)

The easiest approach might be a new config / environment variable (requires a restart). A CLI command is also possible, but it requires more endpoints, I guess (I will check the related code; we may require some cache invalidation…).

I’d argue the satellite operator should send a shut down signal to all nodes before finally shutting down, so this is automated. Just like there is an exit procedure for storagenodes, there should be one for satellites as well. This is going to be hard for the stefan-benten satellite, since it no longer exists. So perhaps a one time intervention from Storj Labs is warranted for that one. In a software update.

That said, manual tools will still be needed when community satellites become a thing. Especially since someone could easily shut down a satellite without calling the appropriate exit procedure, so there needs to be a way out for node operators. But that could be a longer-term thing to solve.
