Howto: storage node health check > discord + email alerting script

About speed: my system is not overloaded, with 20% HDD busy and less than 10% CPU.
I tried commenting out the renice call, but I get the same speed (about 25 min per node).
It seems stuck for 15 min between "storj versions: current larger" and "docker log 720m selected : #235101".

This is the result of debug mode:
./storj-system-health.sh -vq

*** timestamp [09.05.2023 21:30]
*** config file loaded: ./storj-system-health.credo
*** settings file path: .storj-system-health

running the script for node "storagenode201" (/node201) …
*** node is running : 1
*** disk usage : 52.80% (incl. trash: 53.48%)
*** satellite scores url : localhost:14001/api/sno/satellites (OK)
… satellite scores:
… satping difference: 39300 (1683667812 - 1683628512) / freq: 3600
*** settings: satellite pings will be sent: false
*** storj node api url : localhost:14001/api/sno (OK)
*** storj version current : installed 1.78.2
*** storj version latest : github 1.76.2 [2023-04-03]
… storj versions unequal
… storj versions: current larger
*** docker log 720m selected : #235101
*** docker log 60m selected : #18322
*** info count : #18313
*** audit error count : #0
*** repair failures count : #0
*** fatal error count : #0
*** severe count : #0
*** other error count : #0
*** i/o timouts count : #0
*** audits : w: 0.00%, c: 0.00%, s: 100%
*** downloads : c: 2.16%, f: 0.02%, s: 98%
*** uploads : c: 0.04%, f: 0.27%, s: 100%
*** repair downloads : c: 0.00%, f: 0.00%, s: 100%
*** repair uploads : c: 0.00%, f: 0.08%, s: 100%
*** 60 m activity : up: 9772 / down: 5495 > OK
*** i/o timouts ignored : false
… audit time lags selection:
… settings read: (declare -A settings=([storagenode218_payTimestamp]="1683586027" [storagenode219_payTimestamp]="1683586938" [storagenode218_payValue]="0" [storagenode201_payValue]="0" [satping]="1683628512" [storagenode219_payValue]="0" [storagenode201_payTimestamp]="1683629746" )).
… settings : tmp_todayDay=9
… settings : tmp_todayHour=21
… settings : tmp_todayMinutes=45
… settings : storagenode201_payTimestamp found.
… settings : storagenode201_payValue found.
… settings : storagenode201_payTimestamp=1683629746
… settings : storagenode201_payValue=0
… settings : tmp_payTimestamp=1683629746
… settings : tmp_payDateDay=9
… settings : tmp_payDateHour=10
… settings : tmp_payDateMinutes=55
… settings : tmp_egressBandwidthPayout=121.62
… settings : tmp_egressRepairAuditPayout=21.29
… settings : tmp_diskSpacePayout=207.15
… settings : tmp_currentMonthExpectations=910
… settings : tmp_estimatedPayoutTotal calculated: 350.06
… settings : tmp_payDiff=350.06
… push message sending: sendpush: false, discordon: true, hour: 21, minutes: 45, details: false
*** no discord success rates to be sent.
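
(By the way, tmp_estimatedPayoutTotal in the dump above is apparently just the sum of the three payout components: 121.62 + 21.29 + 207.15 = 350.06.)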

It gets stuck exactly at the point where the logs are read.

For comparison: I have 70k log entries within 360 minutes, and a run for 2 nodes takes about 1 minute.

So somehow, the log selection on your system looks really slow. Hmm.

v1.10.3-5 improvement & bugfix releases:

  • fixed "command not found" issue when there are pending audits
  • minor tweaks in the README file, linked to the crontab examples mentioned there
  • minor tweaks to onlineScore and download/upload warnings

That script is impressive. I hope it’s not too slow or load-intensive, but I’m keen to get into it. It looks like I could use it on my Debian servers. Cheers.

Thank you! It has helped me a lot during the last months and alerts quickly in case of issues, which allows me to react very fast.

Some features can be skipped in order to speed it up. Just check -h and the README on GitHub.

If you have any questions or issues, please simply PM me.

v1.10.6 to v1.10.7 improvement releases:

  • fixed the disk-overused warning not being shown
  • skipped the no-upload warning in case the node disk is full or overused
  • fixed wording in a verbose message: download and upload counts were mixed up
  • other minor push message optimisations

v1.10.8 bugfix release:

  • fixed warning messages in case of full HDDs (the script warned permanently when the disk was full and, as a result, no new files were being uploaded)

How’s the resource usage of this script? Say my node runs on an RPi 4 8 GB or even a Pi 3B+ 2 GB: will it slow the node’s capabilities in any way?

It runs with low priority when started automatically, and on my RPi 4B with 2 full nodes at 10 TB it runs once per hour for less than a minute. Sure, it requires resources, but I have not noticed an effect on the nodes so far.
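
The low priority is plain renice handling, by the way; roughly, it has the effect of this sketch (not the script’s literal code):

# drop the script's own CPU priority to the minimum (nice 19);
# this is what produces the "old priority 0, new priority 19" message on start:
renice -n 19 -p $$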

Quite the opposite: I am so thankful to have it. I get alerted very fast about any issue and can react before it’s too late (especially with regard to suspension and disqualification).

How long does it actually take on average? I just ran the forced Discord push via the debug command, and all that has been output so far is the following:
372605 (process ID) old priority 0, new priority 19

When I changed the settings file, I set my node folder to /mnt/drive1 (inside drive1 is the data folder for the node). I also just left the log path as /, as I don’t know where the log file would even be.

Really busy currently. Will come back to you.

No worries, I had the command formatted wrong.

No rush, but I am wondering what the letters represent when using the -o option, as in this image here. Also, what are the rep up and rep down values?

That means “repair uploads” and “repair downloads”.

Valid point; I’ve added this explanation to the README.md on GitHub.

explanation:

(repair) downloads / (repair) uploads:
c = cancellation rate
f = failure rate
s = success rate

audits : 
r = recoverable audit rate
c = critical audit fail rate
s = audit success rate
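
For example, the line downloads : c: 2.16%, f: 0.02%, s: 98% in the debug output above means that 2.16% of downloads were cancelled, 0.02% failed and 98% succeeded.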

You can share your anonymised docker run command, so that we can check together.

Have you used the verbose (-v) option? If not, you’ll see no output on the command line.

I got it running well this time; I just had to change my cron command. It takes about 18 minutes to run.

Here’s the run command for finding the log files:

sudo docker run -d --restart unless-stopped --stop-timeout 300 \
  -p 28967:28967/tcp \
  -p 28967:28967/udp \
  -p 14002:14002 \
  -e WALLET="0x—" \
  -e EMAIL="email@example.com" \
  -e ADDRESS="domain.ddns.net:28967" \
  -e STORAGE="14TB" \
  --user $(id -u):$(id -g) \
  --mount type=bind,source="/mnt/drive1/identity",destination=/app/identity \
  --mount type=bind,source="/mnt/drive1/data",destination=/app/config \
  --name storagenode storjlabs/storagenode:latest --operator.wallet-features=zksync-era,zksync

So you should see your logs with the following command, right?

docker logs storagenode --since 60m

If so, you should have the following setting in the credo file (standard setting), right?

NODELOGPATHS=/ 
LOGMIN=60 
LOGMAX=720 

I have a different setting in my case for LOGMAX:

LOGMIN=60
LOGMAX=360

Meaning: if you select a smaller log window, the script will be much faster. Also, if you redirect your logs to a local log file, reading them will be faster than the docker logs command.
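
To get a feeling for how much log volume each window pulls in (and therefore how much the script has to parse), you can count the lines yourself, for example:

# docker logs writes to stdout and stderr, hence the redirect:
docker logs storagenode --since 720m 2>&1 | wc -l
docker logs storagenode --since 360m 2>&1 | wc -l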

In that case, consider letting the script run more often via crontab. For your information, my personal setting here is:

30 *    * * *   pi      cd /home/pi/scripts/ && ./checks.sh 
59 23   * * *   pi      cd /home/pi/scripts/ && ./checks.sh -Ed
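
In other words: the first entry runs the health check every hour at minute 30, and the second one runs it once per day at 23:59 with the -Ed options. Note the user column (pi): lines in this format belong in /etc/crontab or a file under /etc/cron.d, not in a per-user crontab.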

Interesting. Can you share some information on how full your 14 TB disk is?

In my case:

HDD1 at 10 TB, completely full, no upload traffic anymore → the script runs in 0:38 minutes.
HDD2 at 9.9 TB, almost full, normal upload traffic → the script runs in 2:10 minutes.

Meaning: I really expect that outsourcing your log files will increase the script’s speed.
On top of that, please try to limit the selected log amount by changing LOGMAX to 360.
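
For reference, outsourcing the log on a Docker node usually means setting the log.output option in the node’s config.yaml inside the mounted data folder. A sketch with the paths from your run command above (adjust them to your setup, and check the README for the matching NODELOGPATHS syntax):

# stop the node gracefully, then set this line in /mnt/drive1/data/config.yaml:
#   log.output: "/app/config/node.log"
docker stop -t 300 storagenode
docker start storagenode
# the log then lives on the host at /mnt/drive1/data/node.log:
tail -n 5 /mnt/drive1/data/node.log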