Grafana Metrics

Grafana is awesome. It would allow the storage node community to set up a Grafana dashboard, including email alerts. The email alerts could fire within 5 minutes of a problem being detected, instead of hours later when the satellite might notice it.

Currently, there is already a Grafana dashboard, but it requires running a log scraper process. It would be nice if the storage node could send these metrics directly. The storage node already contains a metrics endpoint, so we just need to make sure the data the log scraper collects is also available on that endpoint. Maybe we can add a bit of documentation around it so the community can extend the metrics endpoint whenever they want to add something to the Grafana dashboard.
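
For illustration, here is a minimal sketch of what registering such a metric could look like, assuming the node's endpoint is backed by the monkit telemetry library; the package and metric names here are made up:

package piecestore // hypothetical package name, for illustration only

import "github.com/spacemonkeygo/monkit/v3"

// Each package gets its own metric scope; everything registered
// on it is served by the node's metrics endpoint.
var mon = monkit.Package()

func recordUpload(success bool) {
    if success {
        mon.Counter("upload_success").Inc(1) // hypothetical metric name
    } else {
        mon.Counter("upload_failure").Inc(1)
    }
}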

Oh well, I would certainly appreciate an endpoint for all the information I currently scrape from the logfiles.
Even better would be native support for a prometheus endpoint :smiley: Then we wouldn’t need to develop any exporters ourselves :wink:

I played around a bit with grafana email notifications and it works well, even though the docker installation doesn’t support sending pictures of the graph that triggered the notification… I have to play around with it a little more to see what else I can use for alerts; it seems like the Stat widgets don’t support alerts, only the graphs do?

I am at the beginner level and have only a little knowledge of how a native integration would work. I just see how ETH2 is handling this: prysm-docker-compose/config/prometheus.yaml at add38e2e7c3f6d33f3b72ca219a0676925f665d4 · stefa2k/prysm-docker-compose · GitHub

As far as I understand, this should also be possible with all of our binaries. We have that endpoint as well; we are currently just missing the information you would need for your dashboard.

That won’t work with your binaries because they don’t expose prometheus-compatible output. That’s why we need the exporter from @greener that transforms the API output into the prometheus format.

I don’t have any idea about how to make a prometheus endpoint in Go, and the exporter uses Python, so… :man_shrugging:

I really :heartpulse: this idea! I used Grafana + telegraf for collecting metrics via the API and it really was a pain to parse in telegraf. I would be more than happy (really) if the storagenode could send metrics directly to influxdb or via the graphite protocol (influxdb has a graphite listener too).
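
For reference, influxdb’s graphite listener just needs to be enabled in influxdb.conf; a minimal sketch (the database name is my own choice):

[[graphite]]
    enabled = true
    bind-address = ":2003"
    protocol = "tcp"
    database = "storagenode"  # any database name works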

Also, if you need any help with this, please count me in. I would be more than happy to help with it.

@littleskunk here are some good examples of how metrics can be sent directly in other systems:

TrueNAS

ProxMox

Grafana also supports sending alerts to Telegram. I use it and it is much better than email for critical alerts (though I use email alerts too).

@littleskunk there is a go library for prometheus:

With an example implementation: client_golang/main.go at b7799362e0ac323f658fb8d52c2d6df001cf272c · prometheus/client_golang · GitHub
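
To give a rough idea, a minimal prometheus endpoint using that library could look something like this (the metric name is made up, not something the storagenode actually exposes today):

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical counter, registered on the default registry.
var uploadsTotal = promauto.NewCounter(prometheus.CounterOpts{
    Name: "storagenode_uploads_total",
    Help: "Total number of upload requests handled.",
})

func main() {
    uploadsTotal.Inc() // in real code this would be called per upload

    // Serves all registered metrics in the prometheus text format.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9651", nil))
}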

I know this probably won’t be considered anytime soon since it requires engineering effort, but maybe some day in the future?
For now we have a good exporter in python that transforms the API output into a prometheus endpoint and can be easily extended (sometimes the API output is just a bit weird and unexpected :smiley: )

There are multiple community solutions that help visualise storagenode state in Grafana:

As @kevink rightly pointed out, the current storj binaries don’t expose anything that could be used with grafana. Some metrics are available in the node API, and this exporter translates those metrics into the prometheus-compatible format, which is pretty specific.

To get an idea of the format difference, you can try spinning up the exporter yourself with

    docker run -d --link=storagenode --name=storj-exporter -p 9651:9651 anclrii/storj-exporter:latest

and then

    curl -s storj-exporter:9651/metrics

will give you the output that storj binaries would need to expose.

If this is implemented, Prometheus would be able to scrape storagenodes directly rather than through my exporter, and that would be a much better solution: metrics could be updated along with the rest of the app, performance would be better, etc. Though even if storj binaries exposed prometheus-compatible metrics, one would still need a prometheus server between the node and grafana to scrape and store historic metrics, plus a grafana instance for dashboards; a minimal scrape config for that setup is sketched below.
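
For example, a prometheus.yml sketch, assuming the metrics are served on storj-exporter:9651 as in the docker example above (a native endpoint would just use a different target):

scrape_configs:
  - job_name: storagenode
    scrape_interval: 30s
    static_configs:
      - targets: ['storj-exporter:9651']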

I think it would be good if storj binaries exposed total/successful uploads/downloads, and a generic error count might be useful too. Currently we need to parse logs to get these metrics, and it’s hard. It would be good to add these to the node API for a start so that I can translate them to prometheus; potentially exposing all metrics in a prometheus-compatible format would be awesome, but I guess that would need more time to implement.

I’d release a new version of the dashboard with this info. Early on, logs were all we had, but I’d definitely be excited to standardize this better.

telegraf’s json input should let you easily query the required fields from the JSON returned by the REST API.
You will just need a bunch of HTTP GETs to load the different parts and store them as separate metrics.
Probably not everything is available in the API, but that is a separate problem.

Nice joke, man! :grinning: :+1:
I just encourage you to parse all the data from the storj API with the telegraf json input, and then you will feel all my pain :grinning:

Which part is difficult to do?
I have not parsed 100% of the data, but it is enough to track ingress/egress stats and disk usage per satellite.

So, try to parse everything, and then scale it across more than a few nodes. Parsing just a few parameters is easy and not enough for me.

Here it is. You will also need to copy-paste the audit block once per satellite. If you run many nodes, there will be a bunch of copy-pastes for each one; at that point it is easier to write a script in any scripting language that parses everything and emits lines in a format telegraf accepts.

# Disk totals, version and update status from the node API.
[[inputs.http]]
    urls = ["http://192.168.55.77:14002/api/sno/"]
    method = "GET"
    data_format = "json"
    tagexclude = ["url"]

    name_override = "storj_disk_total"
    fieldpass = ["diskSpace*", "version", "upToDate"]
    json_string_fields = ["version", "upToDate"]

# Per-satellite disk usage; the satellite url becomes a tag.
[[inputs.http]]
    urls = ["http://192.168.55.77:14002/api/sno/"]
    method = "GET"
    data_format = "json"
    tag_keys = ["url"]

    name_override = "storj_disk_satellites"
    json_string_fields = ["url"]
    json_query = "satellites"

# Audit scores, tagged by satellite name.
[[inputs.http]]
    urls = ["http://192.168.55.77:14002/api/sno/satellites"]
    method = "GET"
    data_format = "json"
    tagexclude = ["url"]
    tag_keys = ["satelliteName"]

    name_override = "storj_audit_satellites"
    json_string_fields = ["satelliteName"]
    json_query = "audits"

# Bandwidth summary fields.
[[inputs.http]]
    urls = ["http://192.168.55.77:14002/api/sno/satellites"]
    method = "GET"
    data_format = "json"
    tagexclude = ["url"]

    name_override = "storj_bandwidth"
    fieldpass = ["*Summary"]

# Daily bandwidth; the GJSON query "@reverse|0" picks the latest entry.
[[inputs.http]]
    urls = ["http://192.168.55.77:14002/api/sno/satellites"]
    method = "GET"
    data_format = "json"
    tagexclude = ["url"]

    name_override = "storj_bandwidth_daily"
    json_query = "bandwidthDaily.@reverse|0"

# Audit history score for one satellite; copy this block for each satellite.
[[inputs.http]]
    interval = "1h"
    urls = ["http://192.168.55.77:14002/api/sno/satellite/12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"]
    method = "GET"
    data_format = "json"
    tagexclude = ["url"]

    fieldpass = ["auditHistory_score"]
    name_override = "storj_audit_history"

    [inputs.http.tags]
        satellite = "us-central-1.tardigrade.io"

# Per-window audit details for the same satellite.
[[inputs.http]]
    interval = "1h"
    urls = ["http://192.168.55.77:14002/api/sno/satellite/12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"]
    method = "GET"
    data_format = "json"
    tagexclude = ["url"]

    json_time_key = "windowStart"
    # Go reference layout; "15" (not "03") parses 24-hour timestamps.
    json_time_format = "2006-01-02T15:04:05Z"

    name_override = "storj_audit_details"
    json_query = "auditHistory.windows"

    [inputs.http.tags]
        satellite = "us-central-1.tardigrade.io"
Sorry if I’m not the most advanced with these logging systems, but why couldn’t we just use a docker gelf driver to export the logs to Grafana or logstash or whatever, to avoid having to run anything on the storage node or change the code?

This is not about exporting logs to Grafana. There is a great REST API that returns a lot of ready-to-use numbers about the SNO status. You just need to grab that data from the API and send it to telegraf, which persists it in influxdb. Grafana just provides the dashboards that retrieve the numbers from the db and draw the charts.

For log parsing there are other options, like logstash with kibana, or splunk.

BTW, Splunk parses the logs perfectly. You only need to switch the log format to JSON and set up an input. It’s really easy to set up, but I’m not sure how many resources splunk would consume in case of heavy traffic on the SNO.

I now use loki and promtail for log storage and analysis. Works great. You could even use the loki docker plugin; then you don’t need to have logfiles lying around.
Combine this with the storj-exporter, which converts the API output into a prometheus format, and you have all the information you need to build great grafana dashboards.
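
A rough sketch of the docker plugin route, assuming a loki instance is reachable at localhost:3100 (add your usual storagenode options to the run command):

# Install the loki log driver plugin once:
docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions

# Then run the node container with logs shipped straight to loki:
docker run -d --name storagenode \
    --log-driver=loki \
    --log-opt loki-url="http://localhost:3100/loki/api/v1/push" \
    storjlabs/storagenode:latest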

Man, you seriously recommend this as a solution?

I voted here because I didn’t like this solution. I would like to see something like a dedicated influx instance that the node stores all metrics in; then I would use the metrics I need in Grafana.

I see, you would like a ready-to-use solution with a one-click setup. You might wait ages for that, and I can’t help with this.
Personally, I am going to build a very simple python script that takes a list of hosts as input, grabs all the data I need from the REST API on each host, and sends it to Grafana.