Howto: storage node health check > discord + email alerting script

Hi there,

Based on the ideas and findings from one of my previous posts, I have published my “health check script (linux + macOS)” on GitHub. :partying_face:

Please feel free to support, subscribe or comment. I’ll release updates soon to make the script more self-explanatory and easier to use day to day.

:bulb: Target group: docker and linux / macOS users.

Changelog

v1.5.5 - feature release

  • added settings file (automatic creation), to store last ping of satellite notification, in order to just ping once a day; solving #16: limit satellite threshold warnings to once a day
  • added environment variables path option, e.g. for crontab usage
  • added settings path option, if settings file is situated somewhere else
  • extended error filter to ignore satellite service pings and emptying trash errors
  • optimized verbose output for debugging
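
The once-a-day limit can be pictured roughly like this. This is only a minimal sketch: the function names, the `LAST_SATPING` key and the file layout are assumptions for illustration, not the actual format of the script's settings file.

```shell
#!/bin/sh
# Hypothetical helpers; key name and file format are assumptions.

# Succeeds if no notification was recorded within the last 24 hours.
# $1 = settings file, $2 = current time in epoch seconds
should_notify() {
    settings_file=$1
    now=$2
    last=0
    if [ -r "$settings_file" ]; then
        # Expects a line such as: LAST_SATPING=1700000000
        last=$(sed -n 's/^LAST_SATPING=\([0-9][0-9]*\)$/\1/p' "$settings_file" | tail -n 1)
        [ -n "$last" ] || last=0
    fi
    [ $((now - last)) -ge 86400 ]
}

# Stores the time of the notification, creating the file if needed.
# $1 = settings file, $2 = time in epoch seconds
record_notification() {
    printf 'LAST_SATPING=%s\n' "$2" > "$1"
}

# Usage (the path is a placeholder):
#   now=$(date +%s)
#   if should_notify "$HOME/.storj-health.settings" "$now"; then
#       # ... send the satellite warning here ...
#       record_notification "$HOME/.storj-health.settings" "$now"
#   fi
```

Keeping the timestamp in a file instead of a variable lets the limit survive between separate crontab runs.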

v1.5.4 - improvement release

  • checks whether discord.sh exists and is executable
  • fixed: in case the script is called with an absolute path, discord.sh is still expected to be in the same folder, but will be correctly executed
  • shortened threshold messages a bit
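
A check like this could look as follows. This is a sketch under assumptions: `script_dir` and `helper_ready` are made-up names, and the real script may resolve the path differently.

```shell
#!/bin/sh
# Hypothetical helpers for locating a companion script such as discord.sh.

# Prints the absolute directory of the given script path, so that the
# helper is found even when the script is called from another folder.
script_dir() {
    (cd "$(dirname "$1")" && pwd)
}

# Succeeds if helper $2 exists in directory $1 and is executable.
helper_ready() {
    [ -f "$1/$2" ] && [ -x "$1/$2" ]
}

# Usage inside a script:
#   dir=$(script_dir "$0")
#   helper_ready "$dir" discord.sh || {
#       echo "discord.sh is missing or not executable in $dir" >&2
#       exit 2
#   }
#   "$dir/discord.sh" ...
```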

v1.5.3 - improvement release

  • verified and assured macOS support
  • added jq version check to minimum 1.6 / solved #14
  • added node names in the case of multinodes to push and email alerts / solved #17
  • added success, failure, cancellation ratios into command line outputs in verbose mode
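
For illustration, a minimum-version check for jq might be written like this. It is a sketch, not the script's own code: the function names are hypothetical, and it assumes that `jq --version` prints a string such as `jq-1.6`.

```shell
#!/bin/sh
# Hypothetical helpers for a minimum-version check.

# Strips the "jq-" prefix from the string printed by `jq --version`.
jq_version_number() {
    printf '%s\n' "${1#jq-}"
}

# Succeeds if dotted version $1 is greater than or equal to $2.
# Compares up to three numeric components; missing ones count as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        for (i = 1; i <= 3; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

# Usage:
#   current=$(jq_version_number "$(jq --version)")
#   version_ge "$current" 1.6 || { echo "jq 1.6 or newer is required" >&2; exit 2; }
```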

v1.5.1-v1.5.2 - improvement releases

  • added satellite audit+suspension+online score threshold checks and alerts
  • added verbose option: enables console output during execution
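
A score threshold check could be sketched as below. The input format (one line per satellite with name, audit, suspension and online score), the function names and the threshold of 0.98 are assumptions for illustration only; the script's real thresholds and data source may differ.

```shell
#!/bin/sh
# Hypothetical threshold check on satellite scores.

# Succeeds if score $1 is strictly below threshold $2 (decimal numbers).
below_threshold() {
    awk -v s="$1" -v t="$2" 'BEGIN { exit !(s + 0 < t + 0) }'
}

# Reads "name audit suspension online" lines on stdin and prints one
# warning per score that is below the threshold given as $1.
check_scores() {
    threshold=$1
    while read -r name audit suspension online; do
        for pair in "audit:$audit" "suspension:$suspension" "online:$online"; do
            if below_threshold "${pair#*:}" "$threshold"; then
                printf '%s: %s score %s below %s\n' \
                    "$name" "${pair%%:*}" "${pair#*:}" "$threshold"
            fi
        done
    done
}

# Usage:
#   printf 'sat-a 1 1 0.97\n' | check_scores 0.98
```

awk does the comparison because plain sh arithmetic only handles integers.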

v1.5.0 - feature release

  • moved the configuration into a separate config file named *.credo, so that the variables are completely separated from the script functions
  • added disk usage for multinodes
  • added script arguments for debugging, individual config file name/path and a more self-describing help message / script usage

v1.4.0 - feature release

  • added multinode support (running on the same machine)
  • added mail + discord on/off flags to enable or disable discord pushes and mail alerts
  • general script restructuring: you only need to set up your individual settings at the beginning of the script

v1.3.4 - bugfix release

  • fixed conditional statements for push string concatenation and “i/o timeout” ignore logic

v1.3.3 - improvement release

  • improved performance: just 2 docker log selections instead of 20+
  • improved mail notifications: limited error msg selection to last 1h
  • … both covering issue #11: Prevent mail alerts from spamming the mail account
  • recommendation: run the script only once per hour, in line with the 1h error msg selection and notification
  • added echo outputs for command line usage as a feedback on what the script is working on
  • improved discord alert formatting in case repair or get/put stats are out of threshold
  • added “(io)” hint in “no errors (io)” message to let the user know that i/o timeouts were ignored (but sent by mail in debug mode anyway)

v1.3.2 - improvement release

  • added curl, swaks, jq library check
  • added first version of a help text
  • several code optimisations
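
A dependency check of this kind can be done with `command -v`. The sketch below uses a made-up function name and is not taken from the script itself.

```shell
#!/bin/sh
# Hypothetical dependency check.

# Prints the names of the given commands that are not found in PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Usage:
#   missing=$(missing_commands curl jq swaks)
#   if [ -n "$missing" ]; then
#       echo "please install:" $missing >&2
#       exit 2
#   fi
```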

v1.3.1 - bugfix release

  • fixed an issue with the new “ignore io timeouts, if no other error occurs” feature.

v1.3.0 - added several alerts + bug fixes + optimisations:

  • fix audit log details selection from (missing) LOG variable
  • ignore i/o timeouts (satellite service pings + single satellite connects), if audit success rate is 100% and there are no other errors
  • performance: only select audit error details in case there are audit issues
  • added renice so that the script runs at low priority and does not block the system
  • recognizes “no download/upload activity” within the last hour
  • general optimization of the script’s styling
  • solved #7: Add alert: on low uploads or downloads
  • solved #8: Alert on a) low thresholds for success rates + b) storage node status

v1.2.0 - added audit critical + recoverable stats + log extract

v1.1.0 - added debug modes + echo outputs for command line usage

  • added mail/discord debug mode
  • created constants section to manage your individual params
  • optimised echo output for shell command line usage

v1.0.1 - v1.0.3 - several bug fixes

  • exclude ‘pieces error: filestore error: context canceled’ from the alerts
  • fixes an issue with false positive (fatal) error alerts + logs excerpt not sent by email in case of an alert
  • docker logs tail param did not work and caused the email body not to have the error messages included; removed it
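
An error filter along these lines might look like the sketch below. It assumes that relevant log lines contain `ERROR` or `FATAL`, and it always drops i/o timeouts, which is a simplification of the conditional logic described above. The function name is made up.

```shell
#!/bin/sh
# Hypothetical log filter; the matched patterns are assumptions except
# for the "context canceled" message quoted in the changelog.

# Reads log lines on stdin and prints ERROR/FATAL lines, dropping noise.
filter_errors() {
    grep -E 'ERROR|FATAL' |
        grep -v -F \
            -e 'pieces error: filestore error: context canceled' \
            -e 'i/o timeout'
}

# Usage (limits the selection to the last hour):
#   docker logs --since 1h storagenode 2>&1 | filter_errors
```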

about this shell script

this linux shell script checks whether a storj node (a storage node from the storj project) runs into errors and alerts the operator via discord push messages as well as emails. it requires at least one storj node running with docker on linux.

features

  • multinode support :earth_africa:
  • optionally discord (as quick notifications) and/or mail (with error details) alerts :inbox_tray: :bell:
  • alerts when audit, suspension and/or online scores are below a threshold (storj node disqualification risk) :warning:
  • alerts in case a threshold of repair gets/puts and downloads/uploads is reached (storj node disqualification risk) :warning:
  • alerts if there was no get/put at all in the last hour (storj node disqualification risk) :warning:
  • alerts in case the node is offline (docker container not started) :warning:
  • optimized for crontab and command line usage :computer:
  • only requires curl, jq and swaks to run :fire:
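
as an illustration of the offline alert, a container check could be sketched like this. the function names are hypothetical; only `docker inspect -f '{{.State.Running}}'` is a real docker command, and the script itself may detect offline nodes differently.

```shell
#!/bin/sh
# Hypothetical offline check for one or more node containers.

# Succeeds if the named container exists and is running.
container_running() {
    [ "$(docker inspect -f '{{.State.Running}}' "$1" 2>/dev/null)" = "true" ]
}

# Prints the containers from a comma-separated list that are not running.
offline_nodes() {
    old_ifs=$IFS
    IFS=,
    for node in $1; do
        IFS=$old_ifs
        container_running "$node" || printf '%s\n' "$node"
    done
    IFS=$old_ifs
}

# Usage:
#   offline_nodes "storagenode,storagenode-a"
```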

optimized / tested for

  • debian bullseye :penguin:
  • macos monterey :apple: (jq + swaks installed with brew)

dependencies

  • curl
  • jq (version 1.6 or newer)
  • swaks
  • discord.sh

setting up storj system health

  1. optional: set up a webhook in the desired discord text channel
  2. optional: grab your smtp email authentication data
  3. download (or clone) a copy of discord.sh
  4. download (or clone) a copy of storj-system-health.sh and storj-system-health.credo
  5. optional: set up discord and mail variables in storj-system-health.credo
  6. Go nuts.

setting up variables in *.credo

you will need to modify these variables in *.credo for your specific node and smtp mail server configuration. the *.credo file itself must not contain comments or blank lines; the comments below are for explanation only:

## discord settings
DISCORDON=true          # enables (true) or disables (false) discord pushes
DISCORDURL=https://discord.com/api/webhooks/...
                        # your discord webhook url

## mail settings
MAILON=true             # enables (true) or disables (false) email messages
MAILFROM=""             # your "from:" mail address
MAILTO=""               # your "to:" mail address
MAILSERVER=""           # your smtp server address
MAILUSER=""             # your user name from smtp server
MAILPASS=""             # your password from smtp server

## node data mount points
MOUNTPOINTS=/mnt/node   # your storage node mount point, multiple: separated with comma
                        # e.g. /mnt/node,/mnt/node-a,/mnt/node-b

## storj node docker names
NODES=storagenode       # storage node names, multiple: separated with comma, 
                        # e.g. storagenode,storagenode-a,storagenode-b
NODEURLS=localhost:14002
                        # storage node dashboard urls, multiple: separated with comma, 
                        # e.g. localhost:14002,192.168.171.5:14002
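
for multinode setups, the comma-separated lists have to line up by position. the sketch below shows one way to pair each entry of NODES with its entry in NODEURLS; the function names are hypothetical and the script may handle the lists differently.

```shell
#!/bin/sh
# Hypothetical helpers for comma-separated config values.

# Prints the Nth (1-based) item of a comma-separated list.
nth_item() {
    printf '%s\n' "$1" | cut -d, -f"$2"
}

# Prints the number of items in a comma-separated list.
count_items() {
    printf '%s\n' "$1" | awk -F, '{ print NF }'
}

# Prints "node url" pairs, one per line; fails if the lists differ in length.
pair_nodes() {
    n=$(count_items "$1")
    if [ "$n" -ne "$(count_items "$2")" ]; then
        echo "NODES and NODEURLS must have the same number of entries" >&2
        return 1
    fi
    i=1
    while [ "$i" -le "$n" ]; do
        printf '%s %s\n' "$(nth_item "$1" "$i")" "$(nth_item "$2" "$i")"
        i=$((i + 1))
    done
}

# Usage:
#   pair_nodes "$NODES" "$NODEURLS"
```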

make sure your scripts are executable by running the following commands. add ‘sudo’ at the beginning if admin privileges are required.

chmod u+x storj-system-health.sh  # or:
sudo chmod u+x storj-system-health.sh

chmod u+x discord.sh  # or:
sudo chmod u+x discord.sh

usage

you can run the script in debug mode to force a push message to your discord channel (if enabled) even if no error was found, or without the debug flag to run it in silent mode via crontab (see the automation section).

./storj-system-health.sh -d   # for a regular discord push message or:
./storj-system-health.sh      # for silent mode

optionally you can pass another path to *.credo, in case it has a different name or location:

./storj-system-health.sh -c /home/pi/anothername.credo

it also supports a help command for further details:

./storj-system-health.sh -h

automation with crontab

to let the health check run automatically, here’s a crontab example for linux that runs the script every hour. the user column (pi) is only used in system-wide crontab files such as /etc/crontab.

0  *    * * *   pi      /home/pi/storj-checks.sh

for macos please be aware of the following specifics:

  • use crontab -e and crontab -l, although it is deprecated (for now it works)
  • you do not have to specify the user name; the job runs as the current user
  • use full paths to your script and credo file
  • find out your standard path with echo $PATH and set it in crontab
SHELL=/bin/sh
PATH="/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin"
40    *  *  *  * /Users/me/storj-checks.sh -d -c /Users/me/my.credo >> /Users/me/err.txt 2>&1

contributing

pull requests are welcome. for major changes, please open an issue first to discuss what you would like to change.

license

GPL-3.0

With versions v1.3.3 + v1.3.4 the script scales better for everyday usage:

There was a huge performance improvement recently, and I managed to switch off the disturbing, not to say spamming, email alerts that were sent whenever an error was found.

On top of that, the script now warns when certain thresholds are missed, such as repair uploads/downloads (risk of getting disqualified) and normal data uploads/downloads, and when there was no upload/download activity within the last hour.

Here’s a more or less complete list of the current features implemented:

  • emails containing an excerpt of the relevant error log message
  • in debug mode, the disk usage of the data storage mount point is included
  • alerts in case a threshold of repair gets/puts and downloads/uploads is reached
  • alerts if there was no get/put at all during the last hour
  • alerts in case the node is offline (docker container not started)
  • optimized for crontab and command line usage (to be continued)

Version v1.4.0 adds:

  • multinode support
  • mail and/or discord on/off flags

“disk usage per mount point” (in case of multinodes) still needs to be extended, but this does not affect the main functionality of the script (alerting in case of errors/issues).
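
Disk usage per mount point could be collected roughly as follows. This is a sketch, not the script's code: the function name is made up, and it assumes the POSIX output format of `df -P`, where the fifth column of the second line holds the capacity.

```shell
#!/bin/sh
# Hypothetical disk usage report for a comma-separated MOUNTPOINTS value.

# Prints "mountpoint percent" per line, e.g. "/mnt/node 42%".
# Prints "unknown" if df cannot report on the path.
disk_usage() {
    old_ifs=$IFS
    IFS=,
    for mp in $1; do
        IFS=$old_ifs
        pct=$(df -P "$mp" 2>/dev/null | awk 'NR == 2 { print $5 }')
        printf '%s %s\n' "$mp" "${pct:-unknown}"
    done
    IFS=$old_ifs
}

# Usage:
#   disk_usage "/mnt/node,/mnt/node-a,/mnt/node-b"
```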

I changed the topic to a wiki, so now you can update it any time.

Changes in versions v1.5.0 to v1.5.2:

  • added satellite audit+suspension+online score threshold checks and alerts
  • added disk usage for multinodes
  • moved the configuration into a separate config file named *.credo, so that the variables are completely separated from the script functions
  • added script arguments for debugging, verbose mode, individual config file name/path and a more self-describing help message / script usage

Full Changelog: comparing v1.4.0…v1.5.2

Version v1.5.3 changes:

  • macOS / docker support: added + tested
  • jq version check to minimum 1.6 added / solved issue #14
  • multinodes: node names added to push and email alerts / solved issue #17
  • verbose mode: success, failure and cancellation ratios added to command line output

Latest changes up to version 1.5.5:

  • added settings file (automatic creation), to store last ping of satellite notification, in order to just ping once a day; solving #16: limit satellite threshold warnings to once a day
  • added environment variables path option, e.g. for crontab usage
  • added settings path option, if settings file is situated somewhere else
  • optimized error filter to ignore “satellite service pings” and “emptying trash warnings”
  • optimized verbose output for debugging
  • checks, if discord.sh exists and is executable
  • fixed: in case the script is called with an absolute path from another folder, discord.sh is (still) expected to be in the same folder, but will be correctly executed
  • shortened threshold messages a bit, to make them easier to read

Full changelog: comparing v1.5.3 with v1.5.5