Grafana Metrics

I am at the beginner level and have only a little knowledge of how a native integration would work. I just see how ETH2 is handling this: https://github.com/stefa2k/prysm-docker-compose/blob/add38e2e7c3f6d33f3b72ca219a0676925f665d4/config/prometheus.yaml#L18-L26

As far as I understand, this should also be possible with all of our binaries. We have that endpoint as well. We are currently missing the information you would need for your dashboard.

That won’t work with your binaries because they don’t expose a Prometheus-compatible output. That’s why we need the exporter from @greener that transforms the API output into a Prometheus format.

I don’t have any idea how to make a Prometheus endpoint in Go, and the exporter uses Python, so… :man_shrugging:

I really :heartpulse: this idea! I used Grafana + telegraf for collecting metrics via the API and it really was a pain to parse it in telegraf. I will be more than happy (really) if the storagenode can send metrics directly to InfluxDB or via the Graphite protocol (InfluxDB has a Graphite listener too).
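For reference, the Graphite plaintext protocol is simple enough to sketch: one line per metric in the form "path value unix-timestamp", sent over TCP (the Graphite listener usually sits on port 2003). Below is a minimal Go sketch of such a push; the host, port and metric path are made-up placeholders, not anything the storagenode actually uses:

package main

import (
	"fmt"
	"net"
	"time"
)

// sendGraphite pushes a single metric in the Graphite plaintext format:
// "<metric.path> <value> <unix timestamp>\n" over a TCP connection.
func sendGraphite(addr, path string, value float64) error {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = fmt.Fprintf(conn, "%s %f %d\n", path, value, time.Now().Unix())
	return err
}

func main() {
	// "influxdb-host:2003" and the metric path are placeholders for illustration only.
	if err := sendGraphite("influxdb-host:2003", "storagenode.disk.used_bytes", 123456789); err != nil {
		fmt.Println("graphite push failed:", err)
	}
}

InfluxDB's Graphite input accepts this same line format, which is why no extra parsing step would be needed in between.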

Also, if you need any help with this, please involve me in the process. I will be more than happy to help with it.

@littleskunk here are good examples of how it can be sent directly in other systems:

TrueNAS

ProxMox

It also supports sending alerts to Telegram. I use it and it is much better than email for critical alerts (but I use email alerts too).

@littleskunk there is a Go library for Prometheus:

With an example implementation: https://github.com/prometheus/client_golang/blob/b7799362e0ac323f658fb8d52c2d6df001cf272c/examples/random/main.go
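To give an idea of what a native endpoint would involve: with client_golang it is mostly a matter of registering metrics and mounting the promhttp handler. A minimal sketch, assuming a hypothetical metric name and port (not what storagenode actually uses):

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical counter; real storagenode metrics would look different.
var piecesUploaded = promauto.NewCounter(prometheus.CounterOpts{
	Name: "storagenode_pieces_uploaded_total",
	Help: "Total number of pieces uploaded to this node.",
})

func main() {
	piecesUploaded.Inc() // would be incremented wherever an upload completes

	// Serve all registered metrics in the Prometheus text format on /metrics.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9651", nil))
}

Prometheus could then scrape that /metrics path directly, just like the ETH2 scrape config linked above does for Prysm.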

I know this probably won’t be considered anytime soon since it requires engineering effort, but maybe some day in the future?
For now we have a good exporter in Python that transforms the API output into a Prometheus endpoint and can be easily extended (sometimes the API output is just a bit weird and unexpected :smiley: )

1 Like

There are multiple community solutions that help visualise storagenode state in Grafana:

As @kevink rightly pointed out, the current storj binaries don’t expose anything that could be used with Grafana. Some metrics are available in the node API, and this exporter translates these metrics to the Prometheus-compatible format, which is pretty specific.

To get an idea of the format difference, you can try spinning up the exporter yourself with docker run -d --link=storagenode --name=storj-exporter -p 9651:9651 anclrii/storj-exporter:latest, and then curl -s storj-exporter:9651/metrics will give you the output that the storj binaries would need to expose.

If this is implemented, Prometheus would be able to scrape storagenodes directly rather than through my exporter, and this would be a much better solution: metrics could be updated along with the rest of the app, better performance, etc. Though even if the storj binaries exposed Prometheus-compatible metrics, one would still need a Prometheus server between the node and Grafana to scrape and store historic metrics, and also a Grafana instance for dashboards.

I think it would be good if the storj binaries exposed total/successful upload/download counts, and a generic error count might also be useful. Currently we need to parse logs to get these metrics, and that’s hard. It would be good to add these to the node API for a start so that I can translate them to Prometheus. Potentially exposing all metrics in a Prometheus-compatible format would be awesome, but I guess that would need more time to implement.
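To make that concrete, such counters could be exposed as a single labelled metric rather than many separate ones. A sketch with made-up names, assuming the same client_golang library mentioned earlier:

package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical metric: one counter per direction (upload/download) and result (success/failure).
var transfers = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "storagenode_transfers_total",
	Help: "Piece transfers by direction and result.",
}, []string{"direction", "result"})

func main() {
	// These would be incremented from the transfer code paths instead of being parsed out of logs.
	transfers.WithLabelValues("upload", "success").Inc()
	transfers.WithLabelValues("download", "failure").Inc()
}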

4 Likes

I’d release a new version of the board with this info. Early on, logs were all we had, but I’d definitely be excited to standardize it better.

2 Likes

The telegraf JSON input should let you easily query the required fields from the JSON returned by the REST API.
You will just need a bunch of HTTP GETs to load the different parts and store them as separate metrics.
Probably not everything is available in the API, but that is a separate problem.

Nice joke, man! :grinning: :+1:
I just encourage you to parse all the data from the storj API with the telegraf JSON input, and you will feel all my pain :grinning:

What part is difficult to do?
I have not parsed 100% of the data, but it is enough to track ingress/egress stats and disk usage per satellite.

So, try to parse all of it and scale that to several nodes, because parsing just a few parameters is easy and not enough for me.

Here it is. You will also need to copy-paste the audit block once per satellite. If you use many nodes there will be a bunch of copy-pastes for each one, so in the end it is easier to write a script in any scripting language that parses everything and stores it in a text line format that telegraf accepts.
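
# Each [[inputs.http]] block below polls a different part of the storagenode
# dashboard API (port 14002) and writes it as its own measurement in InfluxDB.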

[[inputs.http]]
	urls = ["http://192.168.55.77:14002/api/sno/"]
	method = "GET"
	data_format = "json"
	tagexclude = ["url"]

	name_override = "storj_disk_total"
	fieldpass = ["diskSpace*", "version", "upToDate"]
	json_string_fields = [ "version", "upToDate" ]

[[inputs.http]]
	urls = ["http://192.168.55.77:14002/api/sno/"]
	method = "GET"
	data_format = "json"
	tag_keys = [ "url" ]

	name_override = "storj_disk_satellites"
	json_string_fields = [ "url" ]
	json_query = "satellites"

[[inputs.http]]
	urls = ["http://192.168.55.77:14002/api/sno/satellites"]
	method = "GET"
	data_format = "json"
	tagexclude = ["url"]
	tag_keys = [ "satelliteName" ]

	name_override = "storj_audit_satellites"
	json_string_fields = [ "satelliteName" ]
	json_query = "audits"

[[inputs.http]]
	urls = ["http://192.168.55.77:14002/api/sno/satellites"]
	method = "GET"
	data_format = "json"
	tagexclude = ["url"]

	name_override = "storj_bandwidth"
	fieldpass = ["*Summary"]

[[inputs.http]]
	urls = ["http://192.168.55.77:14002/api/sno/satellites"]
	method = "GET"
	data_format = "json"
	tagexclude = ["url"]

	name_override = "storj_bandwidth_daily"
	json_query = "bandwidthDaily.@reverse|0"


[[inputs.http]]
    interval = "1h"
    urls = ["http://192.168.55.77:14002/api/sno/satellite/12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"]
    method = "GET"
    data_format = "json"
    tagexclude = ["url"]

    fieldpass = ["auditHistory_score"]
    name_override = "storj_audit_history"

    [inputs.http.tags]
        satellite = "us-central-1.tardigrade.io"

[[inputs.http]]
    interval = "1h"
    urls = ["http://192.168.55.77:14002/api/sno/satellite/12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"]
    method = "GET"
    data_format = "json"
    tagexclude = ["url"]

    json_time_key = "windowStart"
    json_time_format = "2006-01-02T15:04:05Z" # 24-hour Go reference time for the RFC3339 timestamps

    name_override = "storj_audit_details"
    json_query = "auditHistory.windows"

    [inputs.http.tags]
        satellite = "us-central-1.tardigrade.io"

1 Like

Sorry if I’m not the most advanced in these logging systems, but why couldn’t we just use a Docker GELF driver to export the logs to Grafana or Logstash or whatever, to avoid having to run anything on the storage node or change the code?

This is not about exporting logs to Grafana. There is a great REST API that returns a lot of ready-to-use numbers about the SNO status. You just need to grab that data from the API and send it to telegraf, which persists it in InfluxDB. Grafana just has dashboards that retrieve the numbers from the DB and draw the charts.

For log parsing there are other options like Logstash with Kibana, or Splunk.

BTW, Splunk parses the logs perfectly. You only need to switch the log format to JSON and set up an input. It’s really easy to set up, but I am not sure how many resources Splunk will consume in case of heavy traffic on the SNO.

1 Like

I now use Loki and Promtail for log storage and analysis. Works great. You could even use the Loki Docker plugin, then you don’t need to have log files lying around.
Add this to the storj-exporter that converts the API output into a Prometheus format and you have all the information you need to make great Grafana dashboards.

1 Like

Man, you seriously recommend this as a solution?

I voted here because I didn’t like this solution. I would like to see something where the node just sends all metrics to a specified InfluxDB instance and stores them there. Then I can use the metrics I need in Grafana.

I see, you would like to get a ready-to-use solution with a one-click setup. You might wait ages for that, and I can’t help with it.
Personally, I am going to build a very simple Python script that takes a list of hosts as input, grabs all the data I need from the REST API on each host, and sends it to Grafana.

You misunderstood, let me explain:

This is a voting thread I saw:

And I realized that this feature is much better than the telegraf parsing solution that I already have, so I voted YES!
This would be an awesome feature: any SNO could set up their own dashboard (or use an existing shared one). It would offload the dev team from creating the multinode dashboard, since the community could easily create it.
I think you now understand the reason for this thread. If you would like to stay with your solution, go on, but if you would like to simplify things and do them in a more effective way, please vote YES.

1 Like

Well, I got it. Thanks for the clarification.
And I totally agree that the lack of API documentation is terrible. I see a bunch of weird values, especially in the payment API.
It would probably be nice to have a separate feature request for API documentation; that is a somewhat independent feature.

1 Like