Hi there,
based on my ideas and findings from one of my previous posts, I have published my “health check script (linux + macOS)” on GitHub.
Please feel free to support, subscribe or comment. I’ll release updates soon to make it more self-explanatory and easier to use day to day.
Target group: docker and linux / macOS users.
Changelog
until v1.9.2 - feature and improvement releases
- added a check on the LOGMIN-selected part of the log for time lags larger than 3 mins between a started and a finished download of GET_AUDIT; the script alerts you if such lags occur. the check is done per satellite, so a count per satellite is sent by email to give more detail for further analysis. see the related topic in the forum.
- when the script runs between 23:50-23:59 UTC, a push msg will be sent as a daily summary, in case discord is configured
- optimised command line output and added extra param “-q” in order to distinguish verbose “-v” outputs from “quality checks”
- removed PUT_REPAIR from monitoring, as it should have no effect on the suspension and/or audit score(s)
- removed an if-clause that prevented the estimated payouts calculation from running at any time, i.e. not only when param “-e” was provided
- fixed the if statement that enables the end-of-day push message; it had led to the push not being sent at all
- fixed a typo in the mail templates
- uploaded newer example screenshots
- optimised OK case push message format
- success rates are only shown when scores are below 98%; they can be sent in detail with the new param “-o”
- fixed date issues with leading zeros
until v1.8.0 - feature and bugfix releases
- added support for estimated payout information for UNIX + MacOS; see README for details
- fixed an issue with awk and date selection from the external stored log file
- added support for external stored log files on another device than the storagenode itself
- added: fire just one error in case “other” and “severe” are equal
- added: pending audits: run script again automatically and fire alert with the second run
- minor optimisation in error selection for “rpc client” issues
- fixed the external log file selection and filtering for MacOS and cleaned the code in combination with linux handling
- added several files to be in line with GitHub community guidelines
v1.7.0-v1.7.1 - bugfix releases
- removed the grouping of audit and repair errors; this will also properly handle the current QUIC issues of STORJ for repair traffic
- added alerts for repair failures; meaning they will be reported separately from audit issues
- limited the selection of repair alerts to LOGMIN instead of LOGMAX
v1.6.9-v1.6.11 - improvement releases
- proper error handling and alerting of the QUIC issue mentioned here: storj/storj#4688
- added logmin and logmax configuration entries
- added option “-l” (L) to specify larger or shorter log timeframes to be selected and analyzed; e.g.:
./checks.sh -l 30 -v
- minor: optimized i/o timeout notes in verbose mode
- minor tweaks in readme
v1.6.1-v1.6.8 - improvement releases
- ignoring “timeout: no recent network activity” error, linked to QUIC library issues (see here), but added it to the “timeout” hints, so that you are aware of it
- implemented #30 : added space information (net, without trash and gross, including trash) from storage node data and added a warning, if space is getting overused
- minor formatting changes to discord push message
- added 10 day delay support for new storj version, see #26
- fixed logical issues with notifications, where boolean values were not handled properly
- added configurable timeframes for log selection in minutes
- small tweaks in selection of error and severe error entries
- minor fixes for multinode usage
- minor general improvements
v1.6.0 - feature release
- added storj node version check
- added support for redirected logs
- young nodes: in case no repair download has started yet, no warning will be thrown
v1.5.6-1.5.7 - several improvements
- fixed an issue with satellite’s score handling
- added a “severe” category, e.g. alerting for docker issues or read failures
- changed satping frequency from static 24h to standard 1h, configurable via SATPINGFREQ in .credo settings file. the initial value will be created automatically, you do not have to update your .credo file manually!
- added #15 : success rates “info stats” pushed to discord, when flag -d was provided
- added pending audits warning, solving #4 issue
- fine-tuned command-line verbose output
- several minor optimisations
v1.5.5 - feature release
- added settings file (automatic creation), to store last ping of satellite notification, in order to just ping once a day; solving #16: limit satellite threshold warnings to once a day
- added environment variables path option, e.g. for crontab usage
- added settings path option, if settings file is situated somewhere else
- extended error filter to ignore satellite service pings and emptying trash errors
- optimized verbose output for debugging
v1.5.4 - improvement release
- checks, if discord.sh exists and is executable
- fixed: in case the script is called with an absolute path, discord.sh is still expected to be in the same folder, but will be correctly executed
- shortened threshold messages a bit
v1.5.3 - improvement release
- verified and assured macOS support
- added jq version check to minimum 1.6 / solved #14
- added node names in the case of multinodes to push and email alerts / solved #17
- added success, failure, cancellation ratios into command line outputs in verbose mode
v1.5.1-v1.5.2 - improvement releases
- added satellite audit+suspension+online score threshold checks and alerts
- verbose option added: enable console output while execution
v1.5.0 - feature release
- put the configuration part into a separate config file named *.credo > completely outsourced variables from the script functions
- added disk usage for multinodes
- added script arguments for debugging, individual config file name/path and a more self-describing help message / script usage
v1.4.0 - feature release
- added multinode support (running on the same machine)
- added mail + discord on/off flags to enable or disable discord pushes and mail alerts
- general script restructures: you only need to setup your individual settings at the beginning of the script
v1.3.4 - bugfix release
- fixed conditional statements for push string concatenation and “i/o timeout” ignore logic
v1.3.3 - improvement release
- improved performance: just 2 docker log selections instead of 20+
- improved mail notifications: limited error msg selection to last 1h
- … both covering issue #11: Prevent mail alerts from spamming the mail account
- advice: let the script run just once per hour, so that it is in line with the error msg selection and notification
- added echo outputs for command line usage as a feedback on what the script is working on
- improved discord alert formatting in case repair or get/put stats are out of threshold
- added “(io)” hint in “no errors (io)” message to let the user know, that i/o timeouts were ignored (but sent by mail in debug mode anyway)
v1.3.2 - improvement release
- added curl, swaks, jq library check
- added first version of a help text
- several code optimisations
v1.3.1 - bugfix release
- fixed an issue with the new “ignore io timeouts, if no other error occurs” feature.
v1.3.0 - added several alerts + bug fixes + optimisations:
- fix audit log details selection from (missing) LOG variable
- ignore i/o timeouts (satellite service pings + single satellite connects), if audit success rate is 100% and there are no other errors
- performance: only select audit error details in case there are audit issues
- added renice to let the script run with low priority, so that it does not block the system
- recognizes “no download/upload activity” within the last hour
- general optimization of the script’s styling
- solved #7: Add alert: on low uploads or downloads
- solved #8: Alert on a) low thresholds for success rates + b) storage node status
v1.2.0 - added audit critical + recoverable stats + log extract
- solution for issue #1: Add critical + recoverable audit error details, added statistical info into discord push message as well as log extract into mail body
v1.1.0 - added debug modes + echo outputs for command line usage
- added mail/discord debug mode
- created constants section to manage your individual params
- optimised echo output for shell command line usage
v1.0.1 - v1.0.3 - several bug fixes
- exclude ‘pieces error: filestore error: context canceled’ from the alerts
- fixes an issue with false positive (fatal) error alerts + logs excerpt not sent by email in case of an alert
- docker logs tail param did not work and caused the email body not to have the error messages included; removed it
about this script
this linux shell script checks whether a storj node runs into errors and alerts the operator by discord push messages as well as emails. requires at least one storj node running with docker on linux.
features
- multinode support
- optionally discord (as quick notifications) and/or mail (with error details) alerts
- alerts, in case:
- audit, suspension and/or online scores are below a threshold (storj node disqualification risk)
- audit timeouts are recognized (pending audits; disqualification risk)
- audit time lags: download started vs. download finished is larger than 3 mins (storj node disqualification risk)
- a threshold of repair gets/puts or downloads/uploads is reached (storj node disqualification risk)
- there was no get/put at all in the last hour (storj node disqualification risk)
- any other fatal error occurs, incl. issues with docker stability
- storj node version is outdated
- the node is offline (docker container not started)
- reports:
- disk usage
- success rates audits, downloads, uploads, repair up-/downloads
- estimated payouts for today and current month
- today's upload and download statistics
- optimized for crontab and command line usage
- supports redirected logs to a file
- only requires curl, jq, bc and (optionally) swaks to run
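the audit time lag alert from the feature list can be sketched roughly as follows. this is a minimal sketch and not the script's actual code: the function name audit_lags is hypothetical, and the log line layout shown in the comments (a "download started" line and a later "downloaded" line carrying "Piece ID", "Satellite ID" and "Action") is an assumption about the node log format.

```shell
#!/bin/sh
# Hypothetical helper: count GET_AUDIT downloads per satellite whose
# start-to-finish lag exceeds a limit in seconds (default: 180).
# Assumed log line layout (an assumption, not taken from the script):
#   <ISO timestamp> INFO piecestore download started {"Piece ID": "P1", "Satellite ID": "S1", "Action": "GET_AUDIT"}
#   <ISO timestamp> INFO piecestore downloaded {"Piece ID": "P1", "Satellite ID": "S1", "Action": "GET_AUDIT"}
audit_lags() {
  awk -v limit="${1:-180}" '
    # return the quoted value that follows a JSON key on the current line
    function field(key,   s, i) {
      i = index($0, "\"" key "\": \"")
      if (i == 0) return ""
      s = substr($0, i + length(key) + 5)
      return substr(s, 1, index(s, "\"") - 1)
    }
    # seconds since midnight, taken from an ISO timestamp
    function secs(ts,   t) {
      split(substr(ts, 12, 8), t, ":")
      return t[1] * 3600 + t[2] * 60 + t[3]
    }
    /GET_AUDIT/ && /download started/ { start[field("Piece ID")] = secs($1); next }
    /GET_AUDIT/ && /downloaded/ {
      id = field("Piece ID")
      if (!(id in start)) next
      lag = secs($1) - start[id]
      if (lag < 0) lag += 86400   # the download crossed midnight
      if (lag > limit) count[field("Satellite ID")]++
      delete start[id]
    }
    END { for (sat in count) print sat, count[sat] }
  '
}
```

it reads log lines from stdin and prints one "satellite count" pair per affected satellite, e.g. when fed with the LOGMIN part of a redirected log file: audit_lags 180 < node.log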
optimized / tested for
dependencies
- storj node up and running, within a docker container
- curl (http requests)
- jq 1.6 (JSON parsing)
- bc (arbitrary precision calculator)
- swaks (mail sending, smtp)
- discord.sh (discord pushes)
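before the first run, the dependencies listed above can be verified by hand. the following is only a sketch of such a check, not the script's own implementation; the function names check_deps and version_at_least are hypothetical, and the version comparison only looks at the major.minor part.

```shell
#!/bin/sh
# Sketch: report command line tools that are not available on the PATH.
# Returns non-zero if at least one of the given tools is missing.
check_deps() {
  missing=""
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing" >&2
    return 1
  fi
  return 0
}

# Sketch: version_at_least MIN ACTUAL, comparing major.minor only.
version_at_least() {
  min_major=${1%%.*}; rest=${1#*.}; min_minor=${rest%%[!0-9]*}
  act_major=${2%%.*}; rest=${2#*.}; act_minor=${rest%%[!0-9]*}
  [ "$act_major" -gt "$min_major" ] && return 0
  [ "$act_major" -eq "$min_major" ] && [ "${act_minor:-0}" -ge "${min_minor:-0}" ]
}

# Example usage (jq prints its version as "jq-1.6"):
#   check_deps curl jq bc || exit 1
#   version_at_least 1.6 "$(jq --version | sed 's/^jq-//')" || echo "jq too old" >&2
```

swaks is left out of the example call on purpose, as it is only needed when mail alerts are enabled.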
setting up storj system health
- optional: set up a webhook in the desired discord text channel
- optional: grab your smtp email authentication data
- download (or clone) a copy of discord.sh *
- download (or clone) a copy of storj-system-health.sh and storj-system-health.credo **
- optional: set up discord and mail variables in storj-system-health.credo
- Go nuts
* wget https://raw.githubusercontent.com/ChaoticWeg/discord.sh/master/discord.sh
** wget https://raw.githubusercontent.com/dusselmann/storj-system-health.sh/main/storj-system-health.sh && wget https://raw.githubusercontent.com/dusselmann/storj-system-health.sh/main/storj-system-health.credo
setting up variables in *.credo
you will need to modify these variables in *.credo for your specific node and smtp mail server configuration. the *.credo file must not include comments or blank lines; the comments below are only there to explain each setting:
## discord settings
DISCORDON=true # enables (true) or disables (false) discord pushes
DISCORDURL=https://discord.com/api/webhooks/... # your discord webhook url
## mail settings
MAILON=true # enables (true) or disables (false) email messages
MAILFROM="" # your "from:" mail address
MAILTO="" # your "to:" mail address
MAILSERVER="" # your smtp server address
MAILUSER="" # your user name from smtp server
MAILPASS="" # your password from smtp server
## alerting settings
SATPINGFREQ=3600 # minimum time in seconds between two alerts,
# in case satellite scores are below threshold
## storj node docker names and urls
NODES=storagenode # storage node names, multiple: separated with comma,
# e.g. storagenode,storagenode-a,storagenode-b
NODEURLS=localhost:14002
# storage node dashboard urls, multiple: separated with comma,
# e.g. localhost:14002,192.168.171.5:14002
## node data mount points
MOUNTPOINTS=/mnt/node # your storage node mount point, multiple: separated with comma
# e.g. /mnt/node,/mnt/node-a,/mnt/node-b
# enter 'source' from the docker run command here
## specify redirected logs per node
NODELOGPATHS=/ # put your relative path + log file name here,
# in case you've redirected your docker logs with
# e.g. config.yaml: 'log.output: "/app/config/node.log"'
# / -> for non-redirected logs
# /node.log -> for single node redirect
# /,/ -> for 2 nodes with non-redirected logs
# /node1.log,/node2.log -> for 2 nodes with redirects
# /node.log,/ -> only 1st is redirected
# /mnt/hdd1/node.log -> full path possible, too
## log selection specifics - in alignment with cronjob settings
LOGMIN=60 # most recent log horizon to look at in detail, in minutes
# -> change this, if your cronjob runs more often than 60m
LOGMAX=720 # larger log horizon for overall statistics, in minutes
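since the real *.credo file must not contain comments or blank lines, an annotated copy like the one above has to be cleaned before use. here is a minimal sketch of how that could be done, together with a way to walk the comma-separated NODES and NODEURLS lists in parallel. both function names are hypothetical, and the cleanup assumes that no value (e.g. a password) contains a "#" character.

```shell
#!/bin/sh
# Sketch: print an annotated settings file without comments and blank lines.
# Assumption: no value contains a "#" character, otherwise it gets cut off.
clean_credo() {
  sed -e 's/[[:space:]]*#.*$//' -e '/^[[:space:]]*$/d' "$1"
}

# Sketch: walk two comma-separated lists (e.g. NODES and NODEURLS) in parallel
# and print one "name -> url" line per node.
list_nodes() {
  names=$1
  urls=$2
  while [ -n "$names" ]; do
    name=${names%%,*}
    url=${urls%%,*}
    echo "$name -> $url"
    # drop the first entry, or empty the list if it was the last one
    case $names in *,*) names=${names#*,} ;; *) names="" ;; esac
    case $urls in *,*) urls=${urls#*,} ;; *) urls="" ;; esac
  done
}

# Example usage (file names are placeholders):
#   clean_credo annotated.credo > storj-system-health.credo
#   list_nodes "storagenode,storagenode-a" "localhost:14002,192.168.171.5:14002"
```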
make sure your scripts are executable by running the following commands. add ‘sudo’ at the beginning, if admin privileges are required.
chmod u+x storj-system-health.sh # or:
sudo chmod u+x storj-system-health.sh
chmod u+x discord.sh # or:
sudo chmod u+x discord.sh
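to confirm that the chmod step worked, a check along these lines can help. it is a sketch only: the function name find_helper is hypothetical, and it simply assumes that discord.sh sits in the same folder as the main script, as described in the changelog for v1.5.4.

```shell
#!/bin/sh
# Sketch: find_helper SCRIPT_PATH [HELPER_NAME]
# Looks for the helper (default: discord.sh) in the folder of the given script,
# also when the script is referenced by an absolute path, and prints the
# helper's full path if it exists and is executable.
find_helper() {
  dir=$(cd "$(dirname "$1")" && pwd) || return 1
  helper="$dir/${2:-discord.sh}"
  if [ ! -f "$helper" ]; then
    echo "$helper not found" >&2
    return 1
  elif [ ! -x "$helper" ]; then
    echo "$helper is not executable, run: chmod u+x $helper" >&2
    return 1
  fi
  echo "$helper"
}

# Example usage:
#   find_helper ./storj-system-health.sh || exit 1
```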
usage
you can run the script in debug mode to force a push message to your discord channel (if enabled) although no error was found - or without the debug flag to run it in silent mode via crontab (see automation chapter).
./storj-system-health.sh -d # for a regular discord push message or:
./storj-system-health.sh # for silent mode
optionally you can pass another path to *.credo, in case it has another name or source:
./storj-system-health.sh -c /home/pi/anothername.credo
in order to use the estimated payout information, which looks like so:
message: [sn1] : hdd 38.62% > OK 0.25$ / 11.77$
… you should set your crontab to run the script around 23:55 UTC. You need to adjust the timing if you have a couple of nodes and/or huge log files to be analysed: the script needs to finish before the next full hour, ideally by 23:59:59 UTC at the latest.
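whether a run falls into the end-of-day window (23:50-23:59 UTC, see the changelog) can be tested with a few lines of shell. this is a sketch under that assumption, with a hypothetical function name; it is not copied from the script.

```shell
#!/bin/sh
# Sketch: succeeds if the given UTC time (HHMM, default: now) lies
# within the 23:50-23:59 window used for the daily summary.
in_summary_window() {
  now=${1:-$(date -u +%H%M)}
  # a leading "1" is prepended so that values like 0005 or 0850
  # are compared as plain decimal numbers despite their leading zeros
  [ "1$now" -ge 12350 ] && [ "1$now" -le 12359 ]
}

# Example usage:
#   if in_summary_window; then echo "send daily summary"; fi
```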
it also supports a help command for further details:
./storj-system-health.sh -h
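for readers who want to wrap or extend the script: flags such as -c, -d and -h are commonly parsed with the shell builtin getopts. the sketch below shows the general pattern only; the function name, the variable names and the default config path are assumptions and not taken from the script.

```shell
#!/bin/sh
# Hypothetical option parser for a subset of the flags shown above.
# CONFIG and DEBUG as well as the default path are illustrative names.
parse_args() {
  CONFIG="./storj-system-health.credo"
  DEBUG=false
  OPTIND=1
  while getopts "c:dh" opt; do
    case $opt in
      c) CONFIG=$OPTARG ;;   # -c FILE: alternative settings file
      d) DEBUG=true ;;       # -d: force a discord push
      h) echo "usage: $0 [-c FILE] [-d] [-h]"; return 2 ;;
      *) return 1 ;;         # unknown flag or missing argument
    esac
  done
  return 0
}

# Example usage:
#   parse_args "$@" || exit $?
```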
automation with crontab
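to let the health check run automatically, here’s a crontab example for linux, which runs the script at minutes 15, 35 and 55 of every hour. the user column (pi) only applies to system crontabs such as /etc/crontab; leave it out in a user crontab edited with crontab -e.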
15,35,55 * * * * pi /home/pi/storj-checks.sh -d > /dev/null
for macos please be aware of the following specifics:
- use crontab -e and crontab -l, although it is deprecated (for now it works)
- you do not have to use the user name, it’s to be executed with the current user
- use full paths to your script and credo file
- find out your standard path with echo $PATH and set it in crontab
SHELL=/bin/sh
PATH="/opt/homebrew/opt/sqlite/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin"
# UNIX:
58 * * * * pi cd /home/pi/scripts/ && ./checks.sh -ev
10 7,19 * * * pi cd /home/pi/scripts/ && ./checks.sh -ed
# MACOS
# 58 * * * * /Users/me/checks.sh -ev >> /Users/me/Desktop/checks.txt 2>&1
# 10 7,19 * * * /Users/me/checks.sh -ed -c /Users/me/my.credo >> /Users/me/Desktop/checks.txt 2>&1
contributing
issues and pull requests are welcome. for major changes, please open an issue first to discuss what you would like to change.
if you want to contact me directly, feel free to do so via discord.