[Tech Preview] Email alerts with Grafana and Prometheus

I would like to set up email alerts for my storage nodes. It turns out the storage node already has a Prometheus endpoint: it is served by the debug endpoint under /metrics. If you haven’t set up the debug endpoint yet, please visit Guide to debug my storage node, uplink, s3 gateway, satellite.
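For reference, this is roughly what the relevant line in my storagenode config.yaml looks like. Treat it as a sketch: if I recall correctly the option is called debug.addr as described in that guide, and the port is just an example you can pick yourself.

# excerpt from the storagenode config.yaml (a sketch)
# bind the debug endpoint to a fixed port so Prometheus can scrape <host>:<port>/metrics
debug.addr: "127.0.0.1:13019"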

I already have a Prometheus instance running, plus Grafana with email alerts. Here are my configs:

docker-compose.yaml

version: "3.3"
services:
  prometheus:
    restart: unless-stopped
    user: 993:991
    ports:
      - 9090:9090/tcp
    image: prom/prometheus
    volumes:
      - /mnt/ssd/eth/prometheus:/prometheus
    command:
      - --storage.tsdb.retention.time=31d
      - --config.file=/prometheus/prometheus.yaml
  grafana:
    restart: unless-stopped
    user: 993:991
    ports:
      - 3000:3000/tcp
    image: grafana/grafana
    volumes:
      - /mnt/ssd/eth/grafana:/var/lib/grafana
    command:
      - -config=/var/lib/grafana/grafana.ini
  prometheus-exporter:
    restart: unless-stopped
    user: 993:991
    ports:
      - 9100:9100/tcp
    image: quay.io/prometheus/node-exporter
    volumes:
      - /:/host:ro,rslave
    command:
      - --path.rootfs=/host

prometheus.yaml

global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  scrape_timeout: 10s

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
   - job_name: 'prometheus-exporter'
     static_configs:
     - targets: ['prometheus-exporter:9100']
   - job_name: 'storagenodetest'
     metrics_path: /metrics
     static_configs:
     - targets: ['localhost:12019']
   - job_name: 'storagenode1'
     metrics_path: /metrics
     static_configs:
     - targets: ['localhost:13019']
   - job_name: 'storagenode2'
     metrics_path: /metrics
     static_configs:
     - targets: ['localhost:13029']

grafana.ini

[smtp]
enabled = true
host = smtp.gmail.com:587
user = <my gmail address>
password = <create new google app passwords>
;cert_file =
;key_file =
skip_verify = true
from_address = <my gmail address>
from_name = Grafana
# EHLO identity in SMTP dialog (defaults to instance_name)
;ehlo_identity = dashboard.example.com
[server]
root_url = http://localhost:3000
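One piece that is not in the config files above is the Prometheus data source in Grafana. You can add it in the Grafana UI, or provision it with a small file like this sketch (the file name and the extra volume mount into /etc/grafana/provisioning/datasources are assumptions you would have to add to the docker-compose file yourself):

# e.g. mounted into the grafana container as
# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true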

Up next we need a Grafana dashboard with email alerts. Let’s start with a first MVP. I am running multiple storage nodes and I want the email alert to tell me which node I need to fix. The trigger for the email alert doesn’t matter for this first MVP.

In a second step I would like to brainstorm with all of you about which email alerts we need. For each email alert we also have to specify which data we need.

  • More than 5% audit failures (GET_AUDIT failed vs success)
  • More than 2 pending audits (GET_AUDIT started vs success + failed)
  • Audit score lower than 1
  • Suspension score lower than 1
  • Node process not running

Please write down which email alerts you want to see.
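To make the data requirement a bit more concrete, here is a rough sketch of what the first alert could look like as a Prometheus alerting rule. The metric names audit_failure_count and audit_success_count are placeholders, the data is not exposed yet, and you would need an Alertmanager (or the same expression in a Grafana panel alert) to actually get the email:

groups:
  - name: storagenode-audits
    rules:
      - alert: HighAuditFailureRate
        # placeholder metric names -- more than 5% of GET_AUDIT requests failing
        expr: |
          rate(audit_failure_count[1h])
            / (rate(audit_success_count[1h]) + rate(audit_failure_count[1h])) > 0.05
        for: 5m
        annotations:
          summary: "{{ $labels.job }} has more than 5% audit failures"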

The third step will be to expose the data we need for these email alerts. To be honest, I will not have the time to make all the code changes myself. Instead my intention is to demonstrate how to do it, and hopefully it is easy enough that you can keep the ball rolling. Ideally we end up with a living Grafana dashboard that the community improves over time.

6 Likes

I have configured Zabbix to send me SMS alerts for these reasons; maybe they would also be useful for you:

  • audit score lower than 1
  • suspension score lower than 1
  • node process is not running
  • node port is not accessible
  • new version is available
  • dmesg contains the words “blocked for more”

Also, it sends me notifications if CPU iowait % is too high, disk space is running out and similar (more or less the default Linux template for Zabbix).

This is interesting, I’ll have to find a way to include this in my system.

4 Likes

Great ideas. I have added them to the list.

These might be a bit harder. I haven’t added them to the list yet. I think a good way to check open ports would be to watch the hourly check-in result.
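Since the node-exporter is already part of my docker-compose file above, the host-level alerts (disk space, iowait) don’t need any storage node changes at all. Here is a sketch of a low-disk-space rule using the standard node-exporter metrics; the threshold and filesystem filter are just examples, and you would need an Alertmanager (or the same expression in a Grafana panel alert) for the email:

groups:
  - name: host
    rules:
      - alert: DiskSpaceLow
        # fires when a real filesystem has had less than 10% free space for 15 minutes
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
        for: 15m
        annotations:
          summary: "{{ $labels.instance }} is low on space on {{ $labels.mountpoint }}"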

I have the first MVP ready. It is not perfect but it sends me an email if my storage node is down or in a crash loop.

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "target": {
          "limit": 100,
          "matchAny": false,
          "tags": [],
          "type": "dashboard"
        },
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "gnetId": null,
  "graphTooltip": 0,
  "id": 11,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "alert": {
        "alertRuleTags": {},
        "conditions": [
          {
            "evaluator": {
              "params": [
                0.9
              ],
              "type": "lt"
            },
            "operator": {
              "type": "and"
            },
            "query": {
              "params": [
                "Storagenode1",
                "1m",
                "now"
              ]
            },
            "reducer": {
              "params": [],
              "type": "avg"
            },
            "type": "query"
          },
          {
            "evaluator": {
              "params": [
                0.9
              ],
              "type": "lt"
            },
            "operator": {
              "type": "or"
            },
            "query": {
              "params": [
                "Storagenode2",
                "1m",
                "now"
              ]
            },
            "reducer": {
              "params": [],
              "type": "avg"
            },
            "type": "query"
          },
          {
            "evaluator": {
              "params": [
                0.9
              ],
              "type": "lt"
            },
            "operator": {
              "type": "or"
            },
            "query": {
              "params": [
                "Storagenodetest",
                "1m",
                "now"
              ]
            },
            "reducer": {
              "params": [],
              "type": "avg"
            },
            "type": "query"
          }
        ],
        "executionErrorState": "alerting",
        "for": "0m",
        "frequency": "1m",
        "handler": 1,
        "message": "${job} Offline",
        "name": "${job} Offline",
        "noDataState": "alerting",
        "notifications": [
          {
            "uid": "vddsvqi7k"
          }
        ]
      },
      "datasource": null,
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "id": 4,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "single"
        }
      },
      "targets": [
        {
          "exemplar": true,
          "expr": "up{job=\"storagenode1\"}",
          "instant": false,
          "interval": "",
          "legendFormat": "Storagenode1",
          "refId": "Storagenode1"
        },
        {
          "exemplar": true,
          "expr": "up{job=\"storagenode2\"}",
          "hide": false,
          "interval": "",
          "legendFormat": "Storagenode2",
          "refId": "Storagenode2"
        },
        {
          "exemplar": true,
          "expr": "up{job=\"storagenodetest\"}",
          "hide": false,
          "interval": "",
          "legendFormat": "storagenodetest",
          "refId": "Storagenodetest"
        }
      ],
      "thresholds": [
        {
          "colorMode": "critical",
          "op": "lt",
          "value": 0.9,
          "visible": true
        }
      ],
      "title": "Up",
      "type": "timeseries"
    }
  ],
  "refresh": "",
  "schemaVersion": 32,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-5m",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "Storagenodes",
  "uid": "AA6sH8c7z",
  "version": 9
}
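For reference, the alert in this panel is based on nothing more than the Prometheus up metric per scrape job, with a classic Grafana alert that fires when the 1-minute average drops below 0.9. If you would rather keep the alerting in Prometheus itself, a roughly equivalent rule (a sketch, not what I am running; it would also need an Alertmanager for the emails) would be:

groups:
  - name: storagenode-up
    rules:
      - alert: StoragenodeOffline
        # the job name regex is an assumption -- adjust it to your own job names
        expr: 'avg_over_time(up{job=~"storagenode.*"}[1m]) < 0.9'
        for: 0m
        annotations:
          summary: "{{ $labels.job }} Offline"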
2 Likes

Today I would like to show you that it is very easy to expose new values for new alerts. Here is my PR: storagenode/piecestore: upload and download metrics for Grafana alerts by littleskunk · Pull Request #4280 · storj/storj · GitHub

This PR is not perfect. It mixes up audit failures with all other download failures. It would be possible to get disqualified (DQed) for failing audits while still having a high customer download success rate. I hope someone else can follow up and improve the metric. Grouping by limit.Action would be awesome.

2 Likes

A new day and more good news. I have managed to set up an email alert for a low download success rate, defined as download success / download started. This alert will catch audit failures and also audit timeouts. I once had an issue that almost got my node disqualified: my node was opening a channel for an audit request (download started) but would never finish the audit in time. I am confident that the alert would catch that.

I managed to keep the dashboard and the alerts generic. You can throw any number of storage nodes at it, and Grafana will send an email telling you which of the storage nodes has an issue. However, if you don’t resolve the issue, Grafana might drop that storage node after 48 hours and stop watching it.

And last but not least, @clement has improved my PR. This will allow me to break down my download success rate into audit, repair and customer downloads. Today I want to improve my alert so it fires if any of these has a low success rate.
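As a sketch of where I want to end up, the per-action success rate could look roughly like the rule below. The metric and label names (download_success_count, download_started_count, action) are placeholders; check your node’s /metrics output for the real names once the PR has landed:

groups:
  - name: storagenode-success-rate
    rules:
      - alert: LowDownloadSuccessRate
        # placeholder names -- success / started per action over the last hour
        expr: |
          sum by (job, action) (rate(download_success_count[1h]))
            / sum by (job, action) (rate(download_started_count[1h])) < 0.95
        for: 10m
        annotations:
          summary: "{{ $labels.job }} has a low {{ $labels.action }} download success rate"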

2 Likes

Final dashboard for this MVP:

It contains 7-8 alerts.

  • Storagenode not running
  • Storagenode unable to check in, including pingback errors. This should cover any kind of communication issue in both directions, such as an unsigned identity, a wrong external address or broken port forwarding. It would also fire if one of the satellites is unreachable.
  • Storagenode is unable to submit orders back to any satellite. Technically the storage node will retry every hour. An alternative alert rule would be to wait a few hours and see if the retry works before sending an email alert.
  • Low audit success rate (<95%). This and all following alerts will also catch timeouts and fire early, while the operator can still stop the storage node and avoid a disqualification.
  • Low repair success rate (<95%). This can get you disqualified as well.
  • Low customer download success rate (<90%). No disqualification risk, but I want to keep the success rate high. From time to time I test different tools on my system, and one of them might reduce my download success rate, which would result in a reduced payout. With email alerts I can find out early what is causing the lower success rate. Keep in mind this threshold will not work for everyone because of long tail cancelation.
  • Low upload success rate (<90%). Same deal, no disqualification risk.
  • Optional 8th alert (currently not implemented but possible with the same data): no upload or download activity for quite some time. I had that alert for a moment, but I noticed that my testnet node fired it a bit too frequently; in testnet, uploads and downloads don’t happen as often. If you want that alert, upload and download started should do the trick. By combining the upload alert with the available space you can make sure a full node doesn’t fire the alert. Keep in mind that if the storage node has more than 500MB of space available, the next check-in might be an hour away.

The best part about all of these alerts: the number of storage nodes doesn’t matter. The Grafana dashboard will apply these email alerts to any number of storage nodes. As soon as one of the given storage nodes has a low audit success rate, it will fire an email including the name of that storage node.

4 Likes

Great deal!

Would you mind connecting Discord as an alternative to the e-mail notification (config option), @littleskunk?

That way we would have the right expertise in terms of content plus another, much faster option to notify the SNO.

Personally, I would not have to reinvent the wheel either:

Feel free to replace the email alert with whatever you like. Grafana can push alerts to a lot of applications.
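For example, a Discord notification channel can be provisioned with a small file instead of clicking it together in the UI. A sketch assuming the legacy (pre-unified) Grafana alerting and a Discord webhook URL; the file path is an assumption and needs an extra volume mount:

# e.g. /etc/grafana/provisioning/notifiers/discord.yaml
notifiers:
  - name: Discord
    type: discord
    uid: discord-alerts
    org_id: 1
    is_default: false
    settings:
      url: <your Discord webhook URL>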

1 Like

I have uploaded the JSON file for my dashboard here: Grafana Dashboard for Storagenodes · GitHub

The latest version is now watching the delete queue, GC execution time and GC deleted pieces. I wanted to set up email alerts for these metrics as well, but I am not sure about the thresholds yet.