Howto: storage node health check > discord + email alerting script

Bivvo · December 15, 2021, 10:55pm

Hi there,

based on my ideas and findings from one of my previous posts, I have published my “health check script (linux + macOS)” on GitHub.

Please feel free to support, subscribe or comment. I’ll release updates soon to make it more self-descriptive and easier to use day to day.

Target group: docker and linux / macOS users.

about this script

this linux shell script checks whether a storj node runs into errors and alerts the operator via discord push messages and/or emails. it requires at least one storj node running with docker on linux.

features

multinode support
optionally discord (as quick notifications) and/or mail (with error details) alerts
alerts, in case:
- audit, suspension and/or online scores are below a threshold (storj node disqualification risk)
- audit timeouts are recognized (pending audits; disqualification risk)
- audit time lags: download started vs. download finished is larger than 3 mins (storj node disqualification risk)
- a threshold of repair gets/puts and downloads/uploads is reached (storj node disqualification risk)
- there was no get/put at all in the last hour (storj node disqualification risk)
- any other fatal error occurs, incl. issues with docker stability
- storj node version is outdated
- the node is offline (docker container not started)
reports:
- disk usage
- success rates audits, downloads, uploads, repair up-/downloads
- estimated payouts for today and current month
- today's upload and download statistics
optimized for crontab and command line usage
supports redirected logs to a file
only requires curl, jq, bc and (optionally) swaks to run
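
The score alerts in the list above come down to comparing decimal numbers against a threshold, which plain shell integer arithmetic cannot do. A minimal sketch of such a comparison follows. The helper name `below_threshold` and the example values are assumptions for illustration and are not taken from the script; awk is used here only to keep the example dependency-free, whereas the script itself lists bc as its calculator.

```shell
# below_threshold SCORE THRESHOLD
# Returns 0 (true) when SCORE is strictly below THRESHOLD, 1 otherwise.
# Hypothetical helper for illustration; not part of storj-system-health.sh.
below_threshold() {
    awk -v s="$1" -v t="$2" 'BEGIN { exit !(s + 0 < t + 0) }'
}

# Example values only; choose thresholds that match your own risk tolerance.
if below_threshold 0.953 0.98; then
    echo "audit score below threshold"
fi
```

The same pattern works for suspension and online scores, since all three are decimal fractions.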

optimized / tested for

debian bullseye
macos monterey (jq + swaks installed with brew)

dependencies

storj node up and running within a docker container
curl (http requests)
jq 1.6 (JSON parsing)
bc (arbitrary precision calculator)
swaks (mail sending, smtp)
discord.sh (discord pushes)

setting up storj system health

optional: setup a webhook in the desired discord text channel
optional: grab your smtp email authentication data
download (or clone) a copy of discord.sh *
download (or clone) a copy of storj-system-health.sh and storj-system-health.credo **
optional: setup discord and mail variables in storj-system-health.credo
Go nuts

* wget https://raw.githubusercontent.com/ChaoticWeg/discord.sh/master/discord.sh

** wget https://raw.githubusercontent.com/dusselmann/storj-system-health.sh/main/storj-system-health.sh && wget https://raw.githubusercontent.com/dusselmann/storj-system-health.sh/main/storj-system-health.credo

setting up variables in *.credo

you will need to modify these variables in *.credo for your specific node and smtp mail server configuration. the *.credo file itself must not include comments or blank lines; the comments below are for explanation only:

## discord settings
DISCORDON=true          # enables (true) or disables (false) discord pushes
DISCORDURL=https://discord.com/api/webhooks/...
                        # your discord webhook url

## mail settings
MAILON=true             # enables (true) or disables (false) email messages
MAILFROM=""             # your "from:" mail address
MAILTO=""               # your "to:" mail address
MAILSERVER=""           # your smtp server address
MAILUSER=""             # your user name from smtp server
MAILPASS=""             # your password from smtp server

## alerting settings
SATPINGFREQ=3600        # in case satellite scores are below threshold, 
                        # value in seconds, when next alert will be sent earliest
                        
## storj node docker names and urls
NODES=storagenode       # storage node names, multiple: separated with comma, 
                        # e.g. storagenode,storagenode-a,storagenode-b
NODEURLS=localhost:14002
                        # storage node dashboard urls, multiple: separated with comma, 
                        # e.g. localhost:14002,192.168.171.5:14002

## node data mount points
MOUNTPOINTS=/mnt/node   # your storage node mount point, multiple: separated with comma
                        # e.g. /mnt/node,/mnt/node-a,/mnt/node-b
                        # enter 'source' from the docker run command here

## specify redirected logs per node
NODELOGPATHS=/          # put your relative path + log file name here,
                        # in case you've redirected your docker logs with
                        # e.g. config.yaml: 'log.output: "/app/config/node.log"'
                        #  /                       -> for non-redirected logs
                        #  /node.log               -> for single node redirect
                        #  /,/                     -> for 2 node with non-redirected logs
                        #  /node1.log,/node2.log   -> for 2 nodes with redirects
                        #  /node.log,/             -> only 1st is redirected
                        #  /mnt/hdd1/node.log      -> full path possible, too

## log selection specifics - in alignment with cronjob settings
LOGMIN=60               # latest log horizon to have a detailed view on, in minutes
                        # -> change this, if your cronjob runs more often than 60m
LOGMAX=720              # larger log horizon for overall statistics, in minutes
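
Because the *.credo file must not contain comments or blank lines, the annotated listing above cannot be saved as it is. One way to handle this is to keep an annotated copy and generate the real file from it. This is a sketch under assumptions: the function name `strip_credo` and the file name `annotated.credo` are made up for this example.

```shell
# strip_credo FILE
# Prints FILE with "#" comments and blank lines removed.
# Caveat: a value that itself contains "#" (for example a password)
# would be truncated, so check the result before using it.
strip_credo() {
    sed -e 's/[[:space:]]*#.*$//' -e '/^[[:space:]]*$/d' "$1"
}

# Usage (hypothetical file names):
#   strip_credo annotated.credo > storj-system-health.credo
```

The first sed expression removes everything from the first "#" to the end of the line, including the whitespace in front of it; the second drops lines that are empty afterwards.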

make sure your scripts are executable by running the following commands. add ‘sudo’ at the beginning if admin privileges are required.

chmod u+x storj-system-health.sh  # or:
sudo chmod u+x storj-system-health.sh

chmod u+x discord.sh  # or:
sudo chmod u+x discord.sh

usage

you can run the script in debug mode to force a push message to your discord channel (if enabled) although no error was found - or without the debug flag to run it in silent mode via crontab (see automation chapter).

./storj-system-health.sh -d   # for a regular discord push message or:
./storj-system-health.sh      # for silent mode

optionally you can pass another path to *.credo, in case it has another name or source:

./storj-system-health.sh -c /home/pi/anothername.credo

in order to use the estimated payout information, which looks like so:

message:  [sn1] : hdd 38.62% > OK 0.25$ / 11.77$

… you should schedule your cronjob to run around 23:55 UTC. You need to adjust the timing if you have several nodes and/or huge log files to be analysed: the script needs to finish before the next full hour, ideally by 23:59:59 UTC at the latest.

it also supports a help command for further details:

./storj-system-health.sh -h

automation with crontab

to let the health check run automatically, here’s a crontab example for linux, which runs the script three times per hour (at minutes 15, 35 and 55).

15,35,55  * * * *   pi      /home/pi/storj-system-health.sh -d  > /dev/null

for macos please be aware of the following specifics:

use crontab -e and crontab -l, although it is deprecated (for now it works)
you do not have to specify a user name; the job is executed as the current user
use full paths to your script and credo file
find out your standard path with echo $PATH and set it in crontab

SHELL=/bin/sh
PATH="/opt/homebrew/opt/sqlite/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin"
# UNIX:
58 *    * * *   pi      cd /home/pi/scripts/ && ./storj-system-health.sh -ev
10 7,19 * * *   pi      cd /home/pi/scripts/ && ./storj-system-health.sh -ed
# MACOS
# 58    *     *  *  *  /Users/me/storj-system-health.sh -ev >> /Users/me/Desktop/checks.txt 2>&1
# 10    7,19  *  *  *  /Users/me/storj-system-health.sh -ed -c /Users/me/my.credo >> /Users/me/Desktop/checks.txt 2>&1

contributing

issues and pull requests are welcome. for major changes, please open an issue first to discuss what you would like to change.

if you want to contact me directly, feel free to do so via discord: Discord

license

GPL-3.0

Bivvo · December 29, 2021, 8:14pm

With versions v1.3.3 + v1.3.4 the script scales better for everyday usage:

There was a huge performance improvement recently and I managed to switch off the disturbing, not to say spamming, email alerts in case an error was found.

On top of that, the script now warns in case of several thresholds missed, like repair up- / downloads (risk of getting disqualified), normal data up- / downloads and if there was no upload / download activity within the last hour.

Here’s a more or less complete list of the current features implemented:

emails containing an excerpt of the relevant error log message
in debug mode, the disk usage of the data storage mount point is included
alerts in case a threshold of repair gets/puts and downloads/uploads is reached
alerts if there was no get/put at all during the last hour
alerts in case the node is offline (docker container not started)
optimized for crontab and command line usage (to be continued)

Bivvo · December 30, 2021, 8:38pm

Version v1.4.0 adds:

multinode support
mail and/or discord on/off flags

“disk usage per mount point” (in case of multinodes) needs to be extended, but does not harm the main functionality of the script (alerting in case of errors/issues).
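
For readers wondering how comma-separated settings such as NODES and MOUNTPOINTS can be walked in parallel, here is a minimal sketch. It assumes that the n-th node belongs to the n-th mount point; the helper names `count_fields`, `nth_field` and `disk_usage_pct` are invented for this example and are not the script's own functions.

```shell
# Example values in the style of the *.credo file (illustrative only).
NODES="storagenode,storagenode-a"
MOUNTPOINTS="/mnt/node,/mnt/node-a"

# count_fields LIST: prints the number of comma-separated elements.
count_fields() { printf '%s\n' "$1" | awk -F',' '{ print NF }'; }

# nth_field LIST N: prints the N-th (1-based) element of LIST.
nth_field() { printf '%s\n' "$1" | cut -d',' -f"$2"; }

# disk_usage_pct PATH: prints the use% column that df reports for PATH,
# or nothing when PATH does not exist.
disk_usage_pct() {
    df -P "$1" 2>/dev/null | awk 'NR == 2 { print $5 }'
}

i=1
total=$(count_fields "$NODES")
while [ "$i" -le "$total" ]; do
    node=$(nth_field "$NODES" "$i")
    mount=$(nth_field "$MOUNTPOINTS" "$i")
    usage=$(disk_usage_pct "$mount")
    echo "[$node] $mount: ${usage:-mount point not found}"
    i=$((i + 1))
done
```

Pairing by position keeps the config file flat, at the price that both lists must have the same length and order.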

Alexey · January 4, 2022, 7:33am

I changed the topic to a wiki, now you can update it any time.

Bivvo · January 4, 2022, 1:23pm

Versions v1.5.0 till v1.5.2 add:

added satellite audit+suspension+online score threshold checks and alerts
added disk usage for multinodes
put the configuration part into a separate config file named *.credo > completely outsourced variables from the script functions
added script arguments for debugging, verbose mode, individual config file name/path and a more self-describing help message / script usage

Full Changelog: comparing v1.4.0…v1.5.2

Bivvo · January 12, 2022, 10:53am

Version v1.5.3 changes:

macOS / docker support: added + tested
jq version check to minimum 1.6 added / solved issue #14
multinodes: node names added to push and email alerts / solved issue #17
success, failure, cancellation ratios into command line outputs in verbose mode added

Bivvo · January 16, 2022, 9:37pm

Latest changes up to version 1.5.5 :

added settings file (automatic creation), to store last ping of satellite notification, in order to just ping once a day; solving #16: limit satellite threshold warnings to once a day
added environment variables path option, e.g. for crontab usage
added settings path option, if settings file is situated somewhere else
optimized error filter to ignore “satellite service pings” and “emptying trash warnings”
optimized verbose output for debugging
checks, if discord.sh exists and is executable
fixed: in case the script is called with an absolute path from another folder, discord.sh is (still) expected to be in the same folder, but will be correctly executed
shortened threshold messages a bit, to make it better readable
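
The idea of limiting satellite warnings by remembering the last ping can be sketched as follows. This is only an illustration under assumptions: the helper name `should_alert` and the state file path are hypothetical, and the script's real settings file uses its own format, which is not reproduced here.

```shell
# should_alert LAST NOW FREQ
# Returns 0 (true) when at least FREQ seconds lie between LAST and NOW.
# LAST and NOW are Unix timestamps, FREQ is a duration in seconds.
should_alert() {
    [ $(( $2 - $1 )) -ge "$3" ]
}

SATPINGFREQ=3600    # seconds, as in the *.credo example
state_file="${TMPDIR:-/tmp}/last-satellite-ping.example"   # hypothetical path

now=$(date +%s)
last=$(cat "$state_file" 2>/dev/null || echo 0)
last=${last:-0}     # treat an empty state file like "never alerted"

if should_alert "$last" "$now" "$SATPINGFREQ"; then
    echo "sending satellite score alert"
    echo "$now" > "$state_file"
else
    echo "alert suppressed, last one was $(( now - last )) seconds ago"
fi
```

Setting SATPINGFREQ to 86400 would give the once-a-day behaviour described above.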

Full changelog: comparing v1.5.3 with v1.5.5

Bivvo · January 29, 2022, 12:24pm

Version v1.6.0:

added storj node version check
added support for redirected logs
young nodes: in case no repair download has started yet, no warning will be thrown

chain7 · March 11, 2022, 11:44pm

Is there windows version for this one?

Bivvo · March 12, 2022, 6:19am

Unfortunately not. I am not sure, but it might work if you run it within a Linux VM and

share your dashboard to your local network
export your logs to a file, where you have access from the VM

These two settings should be all that is needed to make it work.

Bivvo · March 30, 2022, 9:38am

A lot of things have changed since v1.6.0 > latest version is v1.6.8:

ignoring “timeout: no recent network activity” error, linked to QUIC library issues (see here), but added it to the “timeout” hints, so that you are aware of it
implemented #30 : added space information (net, without trash and gross, including trash) from storage node data and added a warning, if space is getting overused
minor formatting changes to discord push message
added 10 day delay support for new storj version, see #26
fixed logical issues with notifications, where boolean values were not handled properly
added configurable timeframes for log selection in minutes
small tweaks in selection of error and severe error entries
minor fixes for multinode usage
minor general improvements
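
A version check with a grace period, like the 10 day delay mentioned above, can be sketched as follows. The function names, the `GRACE_DAYS` variable and the way the release date is passed in (as a Unix timestamp) are assumptions for this example; how the script itself obtains and compares versions is not shown in this thread.

```shell
GRACE_DAYS=10   # hypothetical setting name

# version_lt A B: true when dotted version A is older than B.
# Fields are compared numerically, so 1.9.0 counts as older than 1.10.0.
# Pass plain numbers without a leading "v".
version_lt() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            if (x[i] + 0 < y[i] + 0) exit 0
            if (x[i] + 0 > y[i] + 0) exit 1
        }
        exit 1
    }'
}

# needs_update_alert INSTALLED LATEST RELEASE_EPOCH NOW_EPOCH
# True when INSTALLED is older than LATEST and LATEST has been
# published for at least GRACE_DAYS days.
needs_update_alert() {
    version_lt "$1" "$2" && [ $(( ($4 - $3) / 86400 )) -ge "$GRACE_DAYS" ]
}
```

The grace period avoids alerts during the days in which a new release is still being rolled out to nodes.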

Bivvo · April 1, 2022, 9:07am

latest improvement releases v1.6.9-v1.6.11:

proper error handling and alerting of the QUIC issue mentioned here: storj/storj#4688 and here
added logmin and logmax configuration entries
added option “-l” (L) to specify larger or shorter log timeframes to be selected and analyzed; e.g.: ./checks.sh -l 30 -v
minor: optimized i/o timeout notes in verbose mode
minor tweaks in readme

Bivvo · April 3, 2022, 9:19pm

v1.7.0-v1.7.1 - bugfix releases

removed the grouping of audit and repair errors; this will also properly handle the current QUIC issues of storj for repair traffic
added alerts for repair failures; meaning they will be reported separately from audit issues
limited the selection of repair alerts to LOGMIN instead of LOGMAX

example output from the verbose mode:

$ ./checks.sh -l 600 -vd
2685416 (process ID) old priority 0, new priority 19
===
 *** timestamp [03.04.2022 23:12]
 *** discord debug mode on
 *** config file loaded
 *** settings: logs from the last 600 minutes will be selected
 *** settings: satellite pings will be sent: false
===
running the script for node "sn1" (/mnt/WD1001) ..
 *** node is running        : 1
 *** disk usage             : 20.79% (incl. trash: 21.49%)
 *** satellite scores url   : localhost:1234/api/sno/satellites (OK)
 *** storj node api url     : localhost: 1234/api/sno (OK)
 *** storj version current  : installed 1.50.4
 *** storj version latest   : github 1.50.4 [2022-03-18]
 *** docker log 1440m selected : #242332
 *** docker log 600m selected : #103129
 *** info count             : #98101
 *** audit error count      : #0
 *** repair failures count  : #0
 *** fatal error count      : #0
 *** severe count           : #0
 *** other error count      : #0
 *** i/o timouts count      : #0
 *** audits                 : warn: 0.00%, crit: 0.00%, s: 100%
 *** downloads              : c: 0.76%, f: 0.66%, s: 99%
 *** uploads                : c: 0.08%, f: 0.03%, s: 100%
 *** repair downloads       : c: 0.00%, f: 0.03%, s: 100%
 *** repair uploads         : c: 0.10%, f: 0.00%, s: 100%
 *** 600 m activity : up: 12890 / down: 40962 > OK
 *** i/o timouts ignored    : false
===
 message:  [sn1] : hdd 21.49% > OK 
===
 *** discord summary push sent.
 *** discord success rates push sent.
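
The c / f / s figures in the output above are the canceled, failed and successful shares of all finished transfers. As a sketch of the arithmetic only, assuming the three counts have already been extracted from the log (the function name `rates` and the sample counts are made up):

```shell
# rates SUCCESS FAILED CANCELED
# Prints the canceled and failed shares with two decimals and the
# success share rounded to a whole number.
rates() {
    awk -v s="$1" -v f="$2" -v c="$3" 'BEGIN {
        total = s + f + c
        if (total == 0) { print "no data"; exit }
        printf "c: %.2f%%, f: %.2f%%, s: %.0f%%\n",
               100 * c / total, 100 * f / total, 100 * s / total
    }'
}

rates 9858 66 76    # prints: c: 0.76%, f: 0.66%, s: 99%
```

The guard against a total of zero matters for young nodes that have not seen a certain traffic type yet.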

Bivvo · July 24, 2022, 3:07pm

until version v1.8.0 :

added support for estimated payout information for UNIX + MacOS; see README for details

message:  [sn1] : hdd 38.62% > OK 0.25$ / 11.77$

# 38.62% = hdd space usage incl. trash
# 0.25$ = today's payout until now
# 11.77$ = this month's payout until now

fixed an issue with awk and date selection from the externally stored log file
added support for externally stored log files on a device other than the storagenode itself
added: fire just one error in case “other” and “severe” are equal
added: pending audits: run script again automatically and fire alert with the second run
minor optimisation in error selection for “rpc client” issues
fixed the external log file selection and filtering for MacOS and cleaned the code in combination with linux handling
added several files to be in line with GitHub community guidelines

Bivvo · July 28, 2022, 3:36pm

until v1.9.2 - feature and improvement releases:

most important new features:

added a check of the LOGMIN-selected part of the log for time lags larger than 3 mins between a started and a finished GET_AUDIT download; the script will alert you if such time lags occur (they will quickly disqualify your node if you fail to act!). the selection is done per satellite, so a count per satellite is sent by email to provide more details for further analysis. see the related topic in the forum.
when the script runs between 23:50-23:59 UTC, a push msg will be sent as a daily summary, in case discord is configured
optimised command line output and added extra param “-q” in order to distinguish verbose “-v” outputs from “quality checks”
success rates are only shown when scores are below 98%; they can be sent in detail with the new param “-o”
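
The audit time lag check can be pictured with the following sketch. It rests on assumptions: each log line is assumed to start with an ISO timestamp and to contain a "Piece ID" field plus the text "download started" or "downloaded", and the function name `audit_lags` is invented. It prints piece IDs rather than the per-satellite counts the script sends, and it is not the script's own implementation.

```shell
# audit_lags [THRESHOLD_SECONDS]  (reads log lines from stdin)
# Prints "PIECE_ID LAG_SECONDS" for every GET_AUDIT download whose
# finish lies more than THRESHOLD_SECONDS (default 180) after its start.
audit_lags() {
    awk -v limit="${1:-180}" '
        # Convert the HH:MM:SS part of an ISO timestamp to seconds since midnight.
        function secs(ts,    t) {
            split(substr(ts, 12, 8), t, ":")
            return t[1] * 3600 + t[2] * 60 + t[3]
        }
        # Extract the value of the "Piece ID" field from a log line.
        function piece(line) {
            if (!match(line, /"Piece ID": "[^"]+"/)) return ""
            return substr(line, RSTART + 13, RLENGTH - 14)
        }
        /GET_AUDIT/ && /download started/ { started[piece($0)] = secs($1) }
        /GET_AUDIT/ && /downloaded/ {
            id = piece($0)
            if (id in started) {
                lag = secs($1) - started[id]
                if (lag < 0) lag += 86400    # download crossed midnight
                if (lag > limit) print id, lag
                delete started[id]
            }
        }
    '
}

# Usage (hypothetical log path):
#   audit_lags 180 < /mnt/node/node.log
```

Started downloads that never finish are not reported by this sketch; they correspond to the pending audits the script handles separately.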

Other things optimised:

removed PUT_REPAIR from monitoring, as it should have no effect on the suspension and/or audit score(s)
removed an if-clause which prevented the estimated payouts calculation from being done at any time; before, it was only done when param “-e” was provided
fixed the if statement that enables the EOD push message, which led to the push not being sent at all
fixed a typo in the mail templates
uploaded newer example screenshots
optimised OK case push message format
fixed date issues with leading zeros

cc thank you @Alexey

Bivvo · March 19, 2023, 9:49am

until v1.10.0:

fixing noiseconn issue, falsifying success rate stats (Connection reset by peer errors - #6 by Bivvo)
fix issue #42 and properly show warning in case a newer version of the storj node software is available
added “node rate limited by id” to timeout counts
added “tcp connector failed” to i/o timeout count, which in my case were temporary hiccups during reconnects of the internet connection with the provider
fixed handling on payout estimations for new nodes added to the settings
fixed a minor logical error, which caused a command line warning in verbose mode
fixed an issue with the satellite scores notification
added option -E to skip log file analysis and quickly select / push current earnings

Bivvo · March 22, 2023, 10:49am

v1.10.1 improvement release:

changed KPI handling to “new normal”: downloads failed are “OK” till 80%; alert will be sent when below (before it was 90%) Connection reset by peer errors - #2 by jtolio

Full Changelog: Comparing v1.10.0...v1.10.1 · bjoerrrn/storj-system-health.sh · GitHub

Bivvo · April 7, 2023, 9:03am

v1.10.2 improvement release:

changed KPI handling to “new normal”: downloads failed are “OK” till 60%; alert will be sent when below (before it was 80%) Connection reset by peer errors - #2 by jtolio

Full Changelog: Comparing v1.10.1...v1.10.2 · bjoerrrn/storj-system-health.sh · GitHub

agente · May 9, 2023, 11:05am

First of all thanks for this script. I’m trying to use it and I have a few questions (btw the discordapp link on github isn’t working).
I have more than one node and it seems very slow in my case. I have a raid5 configuration and one node takes 20/25 minutes to complete the scan. I paste the results:

20194 (process ID) old priority 0, new priority 19

*** timestamp [09.05.2023 10:35]
*** config file loaded: ./storj-system-health.credo
*** settings file path: .storj-system-health

running the script for node “storagenode201” (/node201) …
*** node is running : 1
*** disk usage : 52.80% (incl. trash: 53.40%)
*** satellite scores url : localhost:14001/api/sno/satellites (OK)
*** settings: satellite pings will be sent: false
*** storj node api url : localhost:14001/api/sno (OK)
*** storj version current : installed 1.78.2
*** storj version latest : github 1.76.2 [2023-04-03]
*** docker log 720m selected : #268010
*** docker log 60m selected : #19289
*** info count : #19279
*** audit error count : #0
*** repair failures count : #0
*** fatal error count : #0
*** severe count : #0
*** other error count : #0
*** i/o timouts count : #0
*** audits : w: 0.00%, c: 0.00%, s: 100%
*** downloads : c: 2.39%, f: 0.04%, s: 98%
*** uploads : c: 0.09%, f: 1.50%, s: 98%
*** repair downloads : c: 0.00%, f: 0.00%, s: 100%
*** repair uploads : c: 0.02%, f: 0.79%, s: 99%
*** 60 m activity : up: 7484 / down: 8047 > OK
*** i/o timouts ignored : false
*** settings: added key ‘storagenode201_payTimestamp’, because it was not found.
*** settings: added key ‘storagenode201_payValue’, because it was not found.

it seems stuck on the 720m logs. My nodes run in docker and I don’t redirect my logs (I left “/” in NODELOGPATHS).

Is that ok? Can I make this check run faster for multiple nodes?

Bivvo · May 9, 2023, 6:52pm

happy to see it’s used, thank you for your feedback. let me try to help:

strange - can you DM me what is shown (an error message?)?

the script gets the lowest priority when it is run:

in case your overall load of the system is very high, it just will need some time to finish.

if you want, you can comment out the following line in the script to let it run with normal priority. but be aware that it might have an impact on the performance of your system, as i have the feeling it is already overloaded (?):

# renice 19 $$

does the script stop after these 2 lines? no additional outputs?

*** settings: added key ‘storagenode201_payTimestamp’, because it was not found.
*** settings: added key ‘storagenode201_payValue’, because it was not found.

if so, can you please run the script in debug mode and share the result?

./storj-system-health.sh -vq