Logs: how to send relevant log messages to a Discord web hook

i’ll post my first draft of it here, hopefully sooner rather than later. i’ve had something like this in mind for a long time but was never really sure where or how to pipe the data.

i plan to make it very straightforward to set up, at least the parts i’ve been working on. i’ve already performance-tested and optimized the solution, even if the last couple of bits don’t quite work yet.
so far it’s pretty sleek, but i don’t want it to be too advanced either, because it will be running all the time.

i’ll throw myself at it later today.

Very cool - I am not much of an expert in bash coding, but I will support with fine-tuning and handling wherever I can. Thanks in advance.

oh i’m not great at it either lol, just sort of stumbling ahead one byte at a time lol…
and maybe i’m a bit picky about how to make it, because i want to consider the performance impact.

my script seems to be failing again, and not for the first time… i’m not sure my solution is even viable, since logs aren’t very useful if they don’t work reliably…
i will most likely be switching from my custom logging script to this.


I have taken this bash script:

… and reused it in the success rates script from this post:

E.g. this looks like this:

echo_downloads=$(printf '%.0f\n' $(echo -e "$dl_success $dl_failed $dl_canceled" | awk '{print ( $1 / ( $1 + $2 + $3 )) * 100 }'))
# alert when the download success rate drops below 98%, or always in debug mode
if [[ $echo_downloads -lt 98 ]] || [[ $DEB -eq 1 ]]
then
	./discord.sh --webhook-url="$url" --text "$([[ $DEB -eq 1 ]] && echo 'INFO' || echo 'WARNING') downloads: $dl_ratio"
fi

This will ping me via Discord in case the download rate drops below 98%, and periodically in debug mode just for my (daily) information. With the success rate script prepared, it is called via crontab with and without parameters.

I’ll extend that with other scripts, like scanning the logs for FATAL or ERROR messages and/or SMART tests of the HDD(s), and push the results on a regular basis as well.
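
The threshold check can also be split into two small functions, which makes it testable without touching Discord. This is only a sketch: the function names success_percent and alert_level are made up for illustration, and unlike the snippet above the label here depends on the rate alone, not on the debug flag. The 98% default comes from the snippet above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative, not from the original script.

# Print the success percentage, rounded to a whole number.
# Arguments: successful, failed and canceled counts.
success_percent() {
    awk -v ok="$1" -v failed="$2" -v canceled="$3" 'BEGIN {
        total = ok + failed + canceled
        if (total == 0) { print 0; exit }   # avoid division by zero on an empty log
        printf "%.0f\n", ok / total * 100
    }'
}

# Print WARNING when the rate is below the threshold (default 98), INFO otherwise.
alert_level() {
    percent=$1
    threshold=${2:-98}
    if [ "$percent" -lt "$threshold" ]; then echo "WARNING"; else echo "INFO"; fi
}

# Demo with made-up counts; the resulting text could be passed to discord.sh --text.
percent=$(success_percent 970 20 10)
echo "$(alert_level "$percent") downloads: ${percent}%"
```

Guarding against a zero total matters right after a container restart, when the log may not contain any downloads yet.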

i was told one shouldn’t create one’s own logging scripts… i guess it’s a bit like the issue with creating one’s own mail server: it’s just a bad idea, because even though it’s supposedly a fairly simple task, it ends up becoming much more advanced and labor intensive as it progresses.

my failure, i think, was in trying to process the logs while they were being exported, to avoid them being read multiple times; i want to cut down on iops where i can.
it seemed to work at first, but for some reason the script that would work fine for one container might not execute correctly for others, maybe due to some inherent limitations of my server, or other software issues.

i hope the grafana / prometheus / loki solution can process the logs live without issue, but i’m sure they can do that just fine.

my attempts at processing my logs with custom scripts sure weren’t a good idea, and now about 1 year in i can just throw it all out lol

I’ve understood it is not ready for everyday usage - at least I’ve not understood how to implement it. So far, I can live with my custom pings. :wink:


Is there anything I should take care of additionally?

Currently I have a regular “view” on the following log counts:

tmp_disk_usage=$(df -H | grep storage) # for disk usage
tmp_fatal_errors=$(docker logs storagenode 2>&1 | grep FATAL -c)
tmp_rest_of_errors=$(docker logs storagenode 2>&1 | grep ERROR | grep -v -e "collector" -e "piecestore" -c)

And just to let you know what it looks like (almost satisfied):

one of the issues with
docker logs storagenode 2>&1
is that it reads the entire storagenode log file since the last update, so the resource demand keeps increasing. ofc it doesn’t matter too much if you don’t run it too often, but then that defeats much of the point of live tracking for alerts.

one could set a max size for the storagenode log file in docker or use a daily log file and run the script on that…

this parameter sets the max size of the docker log files in megabytes. on older nodes the log usually grows by something like 20-45 MB a day, i think…

so each time your script runs, say on day 18 since the last docker storagenode release was pushed, your docker log file will be somewhere around 400MB to 2GB, which you seem to be processing twice… and then however many times you decide to do that per day.

#docker log size max parameter set to 1MB
--log-opt max-size=1m \
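
Independent of the log size, the counts can be collected in a single pass so the log is only read once per run. A rough sketch, assuming the log lines contain the literal strings FATAL and ERROR as in the grep calls earlier in the thread; the function name count_log_levels and the sample lines are made up for illustration.

```shell
#!/usr/bin/env bash
# Count FATAL lines and "other" ERROR lines in one pass over stdin.
# ERROR lines mentioning collector or piecestore are skipped, mirroring the
# grep -v call earlier in the thread. The function name is hypothetical.
count_log_levels() {
    awk '
        /FATAL/ { fatal++ }
        /ERROR/ && !/collector/ && !/piecestore/ { errors++ }
        END { printf "fatal=%d errors=%d\n", fatal, errors }
    '
}

# Intended use (not run here):
#   docker logs --since 24h storagenode 2>&1 | count_log_levels
# Demo with made-up log lines:
printf '%s\n' \
    'INFO piecestore download started' \
    'ERROR piecestore download failed' \
    'ERROR contact:service ping satellite failed' \
    'FATAL Unrecoverable error' |
    count_log_levels
```

The output is a single line of key=value pairs, which is easy to split again in the calling script.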

You can use a --since option

docker logs --help

Usage:  docker logs [OPTIONS] CONTAINER

Fetch the logs of a container

      --details        Show extra details provided to logs
  -f, --follow         Follow log output
      --since string   Show logs since timestamp (e.g.
                       2013-01-02T13:23:37Z) or relative (e.g. 42m for 42
                       minutes)
  -n, --tail string    Number of lines to show from the end of the logs
                       (default "all")
  -t, --timestamps     Show timestamps
      --until string   Show logs before a timestamp (e.g.
                       2013-01-02T13:23:37Z) or relative (e.g. 42m for 42
                       minutes)

I’ve done both already: limited the docker log selection with “since”:

LOG="docker logs --since $(date -d "$date -1 day" +"%Y-%m-%dT%H:%M") $DOCKER_NODE_NAME"

… and limited the logs as well:

docker run -d --restart unless-stopped --stop-timeout 300 \
    --log-opt max-size=100m \
    --log-opt max-file=5 \

i didn’t have much luck with the multiple-files option --log-opt max-file=, but i didn’t really give it much of a chance as i didn’t need it for anything… maybe i just did something wrong.

seems to work for me, I guess:

-rw-r----- 1 root root  53011692 Nov 27 22:47 abc-json.log
-rw-r----- 1 root root 100000841 Nov 26 23:13 abc-json.log.1
-rw-r----- 1 root root 100000057 Nov 24 20:07 abc-json.log.2

I intend to optimise logging either by logging to RAM or by logging to another, external HDD, in order to reduce write operations on the RPi’s SD card. Will keep you posted.

weird how your logs seem larger than mine… i don’t think my zfs compression shows up when just using ls -l,
because only zfs can see that the data is compressed…
node 1 is 16tb and node 2 is like 4tb, so the node size seems almost irrelevant to the log size…
and node 3 is even smaller…
these are all 24 hour logs.
i guess my scripts might also be failing, in which case your logs are the correct size and mine are lacking data. i haven’t gotten around to replacing my custom scripts yet

-rw-r--r-- 1 100000 100000  27269607 Nov 27 22:54 2021-11-27-sn0001.log
-rw-r--r-- 1 100000 100000  30906674 Nov 27 22:54 2021-11-27-sn0002.log
-rw-r--r-- 1 100000 100000  24092844 Nov 27 22:54 2021-11-27-sn0003.log

30906674 bytes, for example, is around 30 MB, right?

yep, seems to be about that for each 24 hour period for the last week at least.
i can’t rule out that my scripts are failing, but i haven’t had problems with these particular ones. i haven’t fully checked them though… it just seemed weird
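
For reference, the conversion of the byte counts shown by ls -l can be checked with a one-liner. A small sketch; the function name bytes_to_mb is made up for illustration.

```shell
#!/usr/bin/env bash
# Convert a byte count to decimal megabytes (MB) and binary mebibytes (MiB).
# The function name is hypothetical.
bytes_to_mb() {
    awk -v b="$1" 'BEGIN { printf "%.2f MB / %.2f MiB\n", b / 1000000, b / 1048576 }'
}

bytes_to_mb 30906674   # prints: 30.91 MB / 29.47 MiB
```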

maybe it is just the zfs compression

NAME               PROPERTY              VALUE                  SOURCE
bitpool/storjlogs  type                  filesystem             -
bitpool/storjlogs  creation              Thu Jun  3 15:46 2021  -
bitpool/storjlogs  used                  3.72G                  -
bitpool/storjlogs  available             5.88T                  -
bitpool/storjlogs  referenced            3.72G                  -
bitpool/storjlogs  compressratio         4.67x                  -
bitpool/storjlogs  mounted               yes                    -

seems like a no: the zfs compression ratio is much higher than just 2x,
which leads me back to… weird.

scanned through my logs and i can’t see any gaps over a 24 hour period

Why does anyone need so much log data in general? That’s weird to me. :wink:


Currently I am using these 3 calls to count the audit, error and fatal cases, and alerting via Discord. That’s working amazingly well; I’ll share what I’ve done some day later.

tmp_fatal_errors=$(docker logs storagenode --since "24h" 2>&1 | grep FATAL -c)
tmp_audits_failed=$(docker logs storagenode --since "24h" 2>&1 | grep "GET_AUDIT" | grep "failed" -c)
tmp_rest_of_errors=$(docker logs storagenode --since "24h" 2>&1 | grep ERROR | grep -v -e "collector" -e "piecestore" -c)

Technical question meanwhile: I want to group and count the underlying causes of the error messages in order to include them in the notification.

I want to see immediately what’s going wrong and how fast I need to respond, e.g. “ERROR: satellite ping timeout (#1)”. Is there an easy way to do so? (Linux bash + docker; # = count)
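
One possible approach is to let awk count the ERROR lines per cause and sort the result by count. This is a sketch under a loud assumption: it expects tab-separated log fields in the order timestamp, level, component, message, details, so the field numbers need adjusting if the real log format differs. The function name group_errors and the sample lines are made up for illustration.

```shell
#!/usr/bin/env bash
# Group ERROR lines by cause and print "ERROR: <cause> (#<count>)", most frequent first.
# ASSUMPTION: tab-separated fields: timestamp, level, component, message, details.
# The function name is hypothetical.
group_errors() {
    awk -F '\t' '
        $2 == "ERROR" { count[$3 ": " $4]++ }
        END { for (cause in count) printf "%d\t%s\n", count[cause], cause }
    ' |
        sort -k1,1nr |
        awk -F '\t' '{ printf "ERROR: %s (#%d)\n", $2, $1 }'
}

# Intended use (not run here):
#   docker logs --since 24h storagenode 2>&1 | group_errors
# Demo with made-up log lines (printf reuses the format for every 4 arguments):
printf '%s\t%s\t%s\t%s\n' \
    2021-11-27T22:47:01Z ERROR contact:service 'ping satellite failed' \
    2021-11-27T22:48:01Z ERROR contact:service 'ping satellite failed' \
    2021-11-27T22:49:01Z INFO piecestore 'downloaded' \
    2021-11-27T22:50:01Z ERROR piecestore 'download failed' |
    group_errors
```

Grouping on component plus message keeps the per-request details (IDs, addresses) out of the key, so identical causes end up in the same bucket.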

I would recommend updating to grep -E "GET_AUDIT|GET_REPAIR", because they both affect the audit score.
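
Applied to the audit counter from the earlier post, that could look like the sketch below. It assumes, as the existing grep calls do, that a failed transfer has the word "failed" on the same line as the action name; the function name count_failed_audits and the sample lines are made up for illustration.

```shell
#!/usr/bin/env bash
# Count failed GET_AUDIT and GET_REPAIR lines read from stdin.
# The function name is hypothetical.
count_failed_audits() {
    grep -E 'GET_AUDIT|GET_REPAIR' | grep -c 'failed'
}

# Intended use (not run here):
#   docker logs --since 24h storagenode 2>&1 | count_failed_audits
# Demo with made-up log lines:
printf '%s\n' \
    'ERROR piecestore download failed {"Action": "GET_AUDIT"}' \
    'INFO piecestore downloaded {"Action": "GET_REPAIR"}' \
    'ERROR piecestore download failed {"Action": "GET_REPAIR"}' \
    'ERROR piecestore download failed {"Action": "GET"}' |
    count_failed_audits
```

Note that grep -c exits with status 1 when the count is 0, which matters if the calling script runs with set -e.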


published my shell script on GitHub, details in a separate post:

cc @SGC
