Howto: storage node health check > discord + email alerting script

About speed: my system is not overloaded, with 20% HDD busy and less than 10% CPU.
I tried commenting out the renice call, but I get the same speed (about 25 min per node).
It seems stuck for 15 min between "storj versions: current larger" and "docker log 720m selected : #235101".

This is the result of debug mode:
./storj-system-health.sh -vq

*** timestamp [09.05.2023 21:30]
*** config file loaded: ./storj-system-health.credo
*** settings file path: .storj-system-health

running the script for node "storagenode201" (/node201) …
*** node is running : 1
*** disk usage : 52.80% (incl. trash: 53.48%)
*** satellite scores url : localhost:14001/api/sno/satellites (OK)
… satellite scores:
… satping difference: 39300 (1683667812 - 1683628512) / freq: 3600
*** settings: satellite pings will be sent: false
*** storj node api url : localhost:14001/api/sno (OK)
*** storj version current : installed 1.78.2
*** storj version latest : github 1.76.2 [2023-04-03]
… storj versions unequal
… storj versions: current larger
*** docker log 720m selected : #235101
*** docker log 60m selected : #18322
*** info count : #18313
*** audit error count : #0
*** repair failures count : #0
*** fatal error count : #0
*** severe count : #0
*** other error count : #0
*** i/o timouts count : #0
*** audits : w: 0.00%, c: 0.00%, s: 100%
*** downloads : c: 2.16%, f: 0.02%, s: 98%
*** uploads : c: 0.04%, f: 0.27%, s: 100%
*** repair downloads : c: 0.00%, f: 0.00%, s: 100%
*** repair uploads : c: 0.00%, f: 0.08%, s: 100%
*** 60 m activity : up: 9772 / down: 5495 > OK
*** i/o timouts ignored : false
… audit time lags selection:
… settings read: (declare -A settings=([storagenode218_payTimestamp]="1683586027" [storagenode219_payTimestamp]="1683586938" [storagenode218_payValue]="0" [storagenode201_payValue]="0" [satping]="1683628512" [storagenode219_payValue]="0" [storagenode201_payTimestamp]="1683629746" )).
… settings : tmp_todayDay=9
… settings : tmp_todayHour=21
… settings : tmp_todayMinutes=45
… settings : storagenode201_payTimestamp found.
… settings : storagenode201_payValue found.
… settings : storagenode201_payTimestamp=1683629746
… settings : storagenode201_payValue=0
… settings : tmp_payTimestamp=1683629746
… settings : tmp_payDateDay=9
… settings : tmp_payDateHour=10
… settings : tmp_payDateMinutes=55
… settings : tmp_egressBandwidthPayout=121.62
… settings : tmp_egressRepairAuditPayout=21.29
… settings : tmp_diskSpacePayout=207.15
… settings : tmp_currentMonthExpectations=910
… settings : tmp_estimatedPayoutTotal calculated: 350.06
… settings : tmp_payDiff=350.06
… push message sending: sendpush: false, discordon: true, hour: 21, minutes: 45, details: false
*** no discord success rates to be sent.
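
(By the way, tmp_estimatedPayoutTotal in the dump above is apparently just the sum of the three payout components: 121.62 + 21.29 + 207.15 = 350.06.)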

It gets stuck exactly at the point where the logs are read.

For comparison: I have 70k log entries within 360 minutes, and a run for 2 nodes takes about 1 minute.

So somehow, the log selection on your system looks really slow. Hmm.

v1.10.3-5 improvement & bugfix releases:

  • fixed "command not found" issue when there are pending audits
  • minor tweaks in the README file, linked to the crontab examples mentioned there
  • minor tweaks to onlineScore and download/upload warnings

That script is impressive. I hope it’s not too slow or load-intensive, but I’m keen to get into it. It looks like I could use it on my Debian servers. Cheers.

Thank you! It has helped me a lot during the last months and alerts quickly in case of issues, which allows me to react very fast.

Some features can be skipped in order to speed it up. Just check -h and the README on GitHub.

If you have any questions or issues, please simply PM me.

v1.10.6 to v1.10.7 improvement releases:

  • fixed the disk-overused warning not being shown
  • skipped the no-upload warning in case the node disk is full or overused
  • fixed wording in a verbose message: download and upload counts were mixed up
  • other minor push message optimisations

v1.10.8 bugfix release:

  • fixed warning messages in case of full HDDs (the script warned permanently when the disk was full and, as a result, no new files were being uploaded)

How’s the resource usage of this script? Say my node runs on an RPi 4 8 GB or even a Pi 3B+ 2 GB: will it slow the node’s capabilities in any way?

It runs with low priority when started automatically, and on my RPi 4B with 2 full nodes at 10 TB it runs once per hour for less than a minute. Sure, it requires resources, but I have not noticed an effect on the nodes so far.
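
The low priority is plain renice handling, by the way; roughly, it has the effect of this sketch (not the script’s literal code):

# drop the script's own CPU priority to the minimum (nice 19);
# this is what produces the "old priority 0, new priority 19" message on start:
renice -n 19 -p $$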

Quite the opposite: I am so thankful to have it. I get alerted very fast about any issue and can react before it’s too late (especially with regard to suspension and disqualification).

How long does it actually take on average? I just ran the forced Discord push via the debug command, and all that has been output so far is the following:
372605 (process ID) old priority 0, new priority 19

When I changed the settings file, I set my node folder to /mnt/drive1 (inside drive1 is the data folder for the node). I also just left the log path as /, as I don’t know where the log file would even be.

Really busy currently. Will come back to you.

No worries, I had the command formatted wrong.

No rush, but I am wondering what the letters represent when using the -o option, as in this image here. Also, what are the rep up and rep down values?

That means “repair uploads” and “repair downloads”.

Valid point; I’ve added this explanation to the README.md on GitHub.

explanation:

(repair) downloads / (repair) uploads:
c = cancellation rate
f = failure rate
s = success rate

audits : 
r = recoverable audit rate
c = critical audit fail rate
s = audit success rate
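
For example, the line downloads : c: 2.16%, f: 0.02%, s: 98% in the debug output above means that 2.16% of downloads were cancelled, 0.02% failed and 98% succeeded.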

You can share your anonymised docker run command, so that we can check together.

Have you used the verbose (-v) option? If not, you’ll see no output on the command line.

I got it running well this time; I just had to change my cron command. It takes about 18 minutes to run.

Here’s the run command for finding the log files:

sudo docker run -d --restart unless-stopped --stop-timeout 300 \
  -p 28967:28967/tcp \
  -p 28967:28967/udp \
  -p 14002:14002 \
  -e WALLET="0x—" \
  -e EMAIL="email@example.com" \
  -e ADDRESS="domain.ddns.net:28967" \
  -e STORAGE="14TB" \
  --user $(id -u):$(id -g) \
  --mount type=bind,source="/mnt/drive1/identity",destination=/app/identity \
  --mount type=bind,source="/mnt/drive1/data",destination=/app/config \
  --name storagenode storjlabs/storagenode:latest --operator.wallet-features=zksync-era,zksync

So you should see your logs with the following command, right?

docker logs storagenode --since 60m

If so, you should have the following setting in the credo file (standard setting), right?

NODELOGPATHS=/ 
LOGMIN=60 
LOGMAX=720 

I have a different setting in my case for LOGMAX:

LOGMIN=60
LOGMAX=360

Meaning: if you select a smaller log window, the script will be much faster. Also, if you redirect your logs to a local log file, reading them will be faster than the docker logs command.
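
To get a feeling for how much log volume each window pulls in (and therefore how much the script has to parse), you can count the lines yourself, for example:

# docker logs writes to stdout and stderr, hence the redirect:
docker logs storagenode --since 720m 2>&1 | wc -l
docker logs storagenode --since 360m 2>&1 | wc -l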

In that case, consider letting the script run more often via crontab. For your information, my personal setting here is:

30 *    * * *   pi      cd /home/pi/scripts/ && ./checks.sh 
59 23   * * *   pi      cd /home/pi/scripts/ && ./checks.sh -Ed
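
In other words: the first entry runs the health check every hour at minute 30, and the second one runs it once per day at 23:59 with the -Ed options. Note the user column (pi): lines in this format belong in /etc/crontab or a file under /etc/cron.d, not in a per-user crontab.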

Interesting. Can you share some information on how full your 14 TB disk is?

In my case:

HDD1 at 10 TB, completely full, no upload traffic anymore → the script runs in 0:38 minutes.
HDD2 at 9.9 TB, almost full, normal upload traffic → the script runs in 2:10 minutes.

Meaning: I really expect that outsourcing your log files will increase the script’s speed.
On top of that, please try to limit the selected log amount by changing LOGMAX to 360.
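
For reference, outsourcing the log on a Docker node usually means setting the log.output option in the node’s config.yaml inside the mounted data folder. A sketch with the paths from your run command above (adjust them to your setup, and check the README for the matching NODELOGPATHS syntax):

# stop the node gracefully, then set this line in /mnt/drive1/data/config.yaml:
#   log.output: "/app/config/node.log"
docker stop -t 300 storagenode
docker start storagenode
# the log then lives on the host at /mnt/drive1/data/node.log:
tail -n 5 /mnt/drive1/data/node.log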