Grafana Metrics

Grafana is awesome. It would allow the storage node community to set up a Grafana dashboard including email alerts. The email alerts could fire within 5 minutes of a problem being detected instead of hours later when the satellite might notice it.

Currently, there is already a Grafana dashboard, but it requires running a log scraper process. It would be nice if the storage node could send these metrics directly. The storage node already contains a metrics endpoint, so we just need to make sure the data that the log scraper collects is also available there. Maybe we can add a bit of documentation around it so that the community is able to extend the metrics endpoint every time they wish to add something to the Grafana dashboard.

Oh well, I would certainly appreciate an endpoint for all the information I currently scrape from the logfiles.
Even better would be native support for a prometheus endpoint :smiley: Then we wouldn’t need to develop any exporters ourselves :wink:

I played around a bit with grafana email notifications and they work well, even though the docker installation doesn’t support sending pictures of the graph that triggers the notification… I have to play around with it a little more to see what else I can use for alerts. It seems like the Stat widgets don’t support sending an alert, only the graphs do?

I am at beginner level and have only a little knowledge of how a native integration would work. I just see how ETH2 is handling this: https://github.com/stefa2k/prysm-docker-compose/blob/add38e2e7c3f6d33f3b72ca219a0676925f665d4/config/prometheus.yaml#L18-L26

As far as I understand this should also be possible with all of our binaries. We have that endpoint as well. We are currently missing the information you would need for your dashboard.

That won’t work with your binaries because they don’t expose a prometheus compatible output. That’s why we need the exporter from @greener that transforms the API output into a prometheus format.

I don’t have any idea about how to make a prometheus endpoint in go and the exporter uses python so… :man_shrugging:

I really :heartpulse: this idea! I used Grafana + telegraf for collecting metrics via the API and it really was a pain to parse it in telegraf. I would be more than happy (really) if storagenode could send it directly to influxdb or via the graphite protocol (influxdb has a graphite listener too).
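To make the "send it directly" idea a bit more concrete, here is a minimal sketch of what pushing metrics over the graphite plaintext protocol looks like (one "path value timestamp" line per metric over TCP). The metric names, the prefix and the sample values are made up for illustration, they are not anything storagenode exposes today.

```python
import socket
import time


def graphite_lines(prefix, metrics, timestamp=None):
    """Render a dict of numeric metrics as graphite plaintext protocol lines."""
    ts = int(time.time() if timestamp is None else timestamp)
    # One "<path> <value> <timestamp>" line per metric, newline-terminated.
    return "".join(
        f"{prefix}.{name} {value} {ts}\n" for name, value in sorted(metrics.items())
    )


def send_to_graphite(host, port, payload):
    """Send pre-rendered lines to a graphite-compatible TCP listener."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload.encode("ascii"))


if __name__ == "__main__":
    # Hypothetical values; a real node would fill these from its own counters.
    sample = {"disk.used": 1500000000, "bandwidth.egress": 42000}
    print(graphite_lines("storj.node1", sample, timestamp=1600000000), end="")
```

Calling send_to_graphite with the host and port of a graphite listener (2003 is the usual plaintext port) would ship the same lines over the wire.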

Also, if you need any help with this, please involve me. I will be more than happy to help with it.

@littleskunk here are good examples of how it can be sent directly on other systems:

TrueNAS

ProxMox

It also supports sending alerts to Telegram. I use it and it is much better than email for critical alerts (but I am using email alerts too).

@littleskunk there is a go library for prometheus:

With an example implementation: https://github.com/prometheus/client_golang/blob/b7799362e0ac323f658fb8d52c2d6df001cf272c/examples/random/main.go

I know this probably won’t be considered anytime soon since it requires engineering efforts but maybe some day in the future?
For now we have a good exporter in python to transform the API output into a prometheus endpoint that can be easily extended (sometimes the API output is just a bit weird and unexpected :smiley: )

There are multiple community solutions that help visualise storagenode state in Grafana:

As @kevink rightly pointed out current storj binaries don’t expose anything that could be used with grafana. Some metrics are available in the node API and this exporter translates these metrics to prometheus compatible format which is pretty specific.

To get an idea of the format difference, you can try spinning up the exporter yourself with docker run -d --link=storagenode --name=storj-exporter -p 9651:9651 anclrii/storj-exporter:latest and then curl -s storj-exporter:9651/metrics will give you the kind of output that storj binaries would need to expose.

If this is implemented, Prometheus would be able to scrape storagenodes directly rather than through my exporter, and this would be a much better solution: metrics could be updated along with the rest of the app, better performance, etc. Though even if storj binaries exposed prometheus compatible metrics, one would still need a prometheus server between node and grafana to scrape and store historic metrics, and also a grafana instance for dashboards.

I think it would be good if storj binaries exposed total/successful uploads/downloads, and a generic error count might also be useful. Currently we need to parse logs to get these metrics and it’s hard. It would be good to add these to the node API for a start so that I can translate them to prometheus. Potentially exposing all metrics in a prometheus compatible format would be awesome, but would need more time to implement, I guess.
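To show why the log parsing is awkward, this is roughly the kind of counting that has to happen today. It is only a sketch: the message markers ("uploaded", "upload failed", "upload canceled" and the download equivalents) and the sample lines are assumptions for illustration, so check them against your own node logs before relying on it.

```python
import re
from collections import Counter

# Assumed message markers; the real log wording may differ between versions.
PATTERN = re.compile(r"\b(upload|download)(ed| failed| canceled)\b")
OUTCOMES = {"ed": "success", " failed": "failed", " canceled": "canceled"}


def count_transfers(lines):
    """Count finished uploads/downloads per outcome from an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # e.g. "upload started" or unrelated messages
        kind = match.group(1)
        counts[f"{kind}_total"] += 1
        counts[f"{kind}_{OUTCOMES[match.group(2)]}"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log lines, shortened for readability.
    sample = [
        "2020-10-01T10:00:00Z INFO piecestore upload started",
        "2020-10-01T10:00:01Z INFO piecestore uploaded",
        "2020-10-01T10:00:02Z ERROR piecestore download failed",
    ]
    for name, value in sorted(count_transfers(sample).items()):
        print(name, value)
```

If the node kept counters like these itself and put them in the API, none of this would be needed.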

I’d release a new version of the board with this info. Early on logs were all we had, but def would be excited to standardize it better.

The telegraf json input should let you easily query the required fields from the JSON returned by the REST api.
You will just need a bunch of http GETs to load the different parts and store them in separate metrics.
Probably not everything is available in the API, but that is another problem.

Nice joke, man! :grinning: :+1:
I just encourage you to parse all data from storj API with telegraf json input and you will feel all my pain :grinning:

What part is difficult to do?
I have not parsed 100% of data but it is enough to track ingress/egress stats and disk usage per satellite.

So, try to parse everything and scale it to a few nodes, because parsing just a few parameters is nothing for me and easy to do.

Here it is. You will also need to copy and paste many times for the audit data of each satellite. If you run many nodes there will be a bunch of copy-pastes for each one, and as an outcome it is easier to write a script in any scripting language to parse everything and store it in a text line format that telegraf accepts.

[[inputs.http]]
	urls = ["http://192.168.55.77:14002/api/sno/"]
	method = "GET"
	data_format = "json"
	tagexclude = ["url"]

	name_override = "storj_disk_total"
	fieldpass = ["diskSpace*", "version", "upToDate"]
	json_string_fields = [ "version", "upToDate" ]

[[inputs.http]]
	urls = ["http://192.168.55.77:14002/api/sno/"]
	method = "GET"
	data_format = "json"
	tag_keys = [ "url" ]

	name_override = "storj_disk_satellites"
	json_string_fields = [ "url" ]
	json_query = "satellites"

[[inputs.http]]
	urls = ["http://192.168.55.77:14002/api/sno/satellites"]
	method = "GET"
	data_format = "json"
	tagexclude = ["url"]
	tag_keys = [ "satelliteName" ]

	name_override = "storj_audit_satellites"
	json_string_fields = [ "satelliteName" ]
	json_query = "audits"

[[inputs.http]]
	urls = ["http://192.168.55.77:14002/api/sno/satellites"]
	method = "GET"
	data_format = "json"
	tagexclude = ["url"]

	name_override = "storj_bandwidth"
	fieldpass = ["*Summary"]

[[inputs.http]]
	urls = ["http://192.168.55.77:14002/api/sno/satellites"]
	method = "GET"
	data_format = "json"
	tagexclude = ["url"]

	name_override = "storj_bandwidth_daily"
	json_query = "bandwidthDaily.@reverse|0"


[[inputs.http]]
    interval = "1h"
    urls = ["http://192.168.55.77:14002/api/sno/satellite/12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"]
    method = "GET"
    data_format = "json"
    tagexclude = ["url"]

    fieldpass = ["auditHistory_score"]
    name_override = "storj_audit_history"

    [inputs.http.tags]
        satellite = "us-central-1.tardigrade.io"

[[inputs.http]]
    interval = "1h"
    urls = ["http://192.168.55.77:14002/api/sno/satellite/12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"]
    method = "GET"
    data_format = "json"
    tagexclude = ["url"]

    json_time_key = "windowStart"
    json_time_format = "2006-01-02T15:04:05Z"

    name_override = "storj_audit_details"
    json_query = "auditHistory.windows"

    [inputs.http.tags]
        satellite = "us-central-1.tardigrade.io"
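One way to avoid the copy-pasting is to generate the repeated per-satellite blocks instead of writing them by hand. The sketch below reuses the settings from the audit history block above and loops over nodes and satellites. The second node address in the example and the extra "node" tag are my own additions for illustration.

```python
# Template mirroring the per-satellite audit block from the config above.
BLOCK = '''[[inputs.http]]
    interval = "1h"
    urls = ["http://{host}/api/sno/satellite/{satellite_id}"]
    method = "GET"
    data_format = "json"
    tagexclude = ["url"]

    fieldpass = ["auditHistory_score"]
    name_override = "storj_audit_history"

    [inputs.http.tags]
        satellite = "{satellite_name}"
        node = "{host}"
'''


def render_blocks(hosts, satellites):
    """Return one [[inputs.http]] block per (node, satellite) pair."""
    return "\n".join(
        BLOCK.format(host=host, satellite_id=sat_id, satellite_name=name)
        for host in hosts
        for sat_id, name in satellites.items()
    )


if __name__ == "__main__":
    satellites = {
        "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S": "us-central-1.tardigrade.io",
    }
    # The second address is a hypothetical extra node.
    print(render_blocks(["192.168.55.77:14002", "192.168.55.78:14002"], satellites))
```

The output can be redirected into a file in the telegraf config directory, so adding a node or a satellite is a one line change in the script.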

Sorry if I’m not the most advanced in these logging systems. But why couldn’t we just use a docker gelf driver to export the logs to Grafana or logstash or whatever, to avoid having to run anything on the storage node or change the code?

This is not about exporting logs to Grafana. There is a great rest api that returns a lot of ready-to-use numbers about the sno status. It is only necessary to grab that data from the api and send it to telegraf, which persists it in influxdb. Grafana just has dashboards that retrieve the numbers from the db and draw the charts.

For log parsing there are other options like logstash with kibana or splunk.

BTW, Splunk parses the logs perfectly. It is only required to switch the log format to json and set up an input. It’s really easy to set up, but I am not sure how many resources splunk will consume in case of heavy traffic on the sno.

I now use loki and promtail for log storage and analysis. Works great. You could even use the loki docker plugin, then you don’t need to have logfiles lying around.
Add to this the storj-exporter that converts the API into a prometheus format and you have all the information you need to make great grafana dashboards.

Man, you seriously recommend this as a solution?

I voted here because I didn’t like this solution. I would like to see something where I just specify an influx instance and all metrics get stored there. Then I would use the metrics I need in Grafana.

I see. You would like to get a ready-to-use solution with one-click setup. You might wait for it for ages and I can’t help with this.
Personally, I am going to build a very simple python script that takes a list of hosts as input, grabs all the data that I need from the rest api on each host and sends it to Grafana.
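Something along these lines is what I have in mind. It is a rough sketch, not a finished tool: it polls /api/sno/ on each host, flattens the numeric fields the same way the telegraf json parser does (nested keys joined with "_") and prints one influx line protocol record per node, which telegraf or influxdb can ingest. The measurement name "storj_node" and the sample payload shape are assumptions.

```python
import json
import urllib.request


def flatten(obj, prefix=""):
    """Flatten nested JSON objects into {"a_b": value}, keeping only numbers."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[name] = value
    return flat


def to_line_protocol(measurement, host, fields):
    """Render fields as one influx line protocol record tagged with the host."""
    body = ",".join(f"{key}={value}" for key, value in sorted(fields.items()))
    return f"{measurement},host={host} {body}"


def poll(hosts):
    """Yield one line protocol record per reachable host."""
    for host in hosts:
        with urllib.request.urlopen(f"http://{host}/api/sno/", timeout=10) as resp:
            fields = flatten(json.load(resp))
        if fields:  # a record without fields would be invalid
            yield to_line_protocol("storj_node", host, fields)


if __name__ == "__main__":
    # Assumed payload shape, used here so the example runs without a node.
    sample = {"diskSpace": {"used": 5, "available": 10}, "version": "1.0"}
    print(to_line_protocol("storj_node", "192.168.55.77:14002", flatten(sample)))
```

Running poll over the list of hosts from cron, or from a telegraf exec input, would cover the multi-node case without a config block per node.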