Prometheus Storj-Exporter

I’ve created a PR for the dashboards that should cope with the 1.21.1 API changes as best it can, as well as the earlier API (you will have to make sure you’re running the exporter that matches your storage node’s API version: at present the dev branch for 1.21.1, master for older versions). From the API update post it’s unclear whether successCount will make a return at some point, so I’ve left that alone. Once the API settles down we may need to remove any remaining references to the removed storj_sat_uptime.
https://github.com/anclrii/Storj-Exporter-dashboard/pull/15. It will require reviewing as I’m new to PromQL.

On a side note I just tried to update my other storagenode to 1.21.1 and it’s picked up 1.21.2 whatever that is! At least I can’t see any more API changes with it.


My storj-exporters seem to have blown up this morning:

File "/home/smg/git/Storj-Exporter/storj-exporter.py", line 109, in add_payout_metrics
    data = self.node_data.get('payout', {}).get('currentMonth', None)

It seems my storagenodes are returning None for the payout.
Adding an if statement keeps the exporter running, albeit with missing payout data.

--- a/storj-exporter.py
+++ b/storj-exporter.py
@@ -105,13 +105,14 @@ class StorjCollector(object):
   def add_payout_metrics(self):
     if 'payout' in self.storj_collectors:
       self.get_node_payout_data()
-      metric_name     = 'storj_payout_currentMonth'
-      data            = self.node_data.get('payout', {}).get('currentMonth', None)
-      documentation   = 'Storj estimated payouts for current month'
-      keys            = ['egressBandwidth', 'egressBandwidthPayout', 'egressRepairAudit', 'egressRepairAuditPayout', 'diskSpace', 'diskSpacePayout', 'heldRate', 'payout', 'held']
-      labels          = ['type']
-      metric_family   = GaugeMetricFamily
-      yield from self.dict_to_metric(data, metric_name, documentation, metric_family, keys, labels)
+      if self.node_data.get('payout', {}):
+        metric_name     = 'storj_payout_currentMonth'
+        data            = self.node_data.get('payout', {}).get('currentMonth', None)
+        documentation   = 'Storj estimated payouts for current month'
+        keys            = ['egressBandwidth', 'egressBandwidthPayout', 'egressRepairAudit', 'egressRepairAuditPayout', 'diskSpace', 'diskSpacePayout', 'heldRate', 'payout', 'held']
+        labels          = ['type']
+        metric_family   = GaugeMetricFamily
+        yield from self.dict_to_metric(data, metric_name, documentation, metric_family, keys, labels)

   def add_sat_metrics(self):

Looks like the payout data can go AWOL; I wonder if there’s some satellite maintenance going on. We should probably make this change anyway so the exporter keeps gathering the I/O data etc. whilst the payout data is missing.
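As a minimal sketch of the guard in the patch above (the helper name is hypothetical, not the exporter’s actual code): the crash happens because `.get('payout', {})` still returns None when the key is *present* with a None value, so the fallback has to handle that case too.

```python
def payout_current_month(node_data):
    """Return the 'currentMonth' payout dict, or None when the node API
    returned no payout data (e.g. at the start of the month).

    node_data.get('payout', {}) returns None when the key exists with the
    value None, so the 'or {}' fallback is needed as well.
    """
    payout = node_data.get('payout') or {}
    return payout.get('currentMonth')

# Missing or null payout data now yields None instead of an AttributeError:
assert payout_current_month({}) is None
assert payout_current_month({'payout': None}) is None
assert payout_current_month({'payout': {'currentMonth': {'held': 2}}}) == {'held': 2}
```

With this, add_payout_metrics can simply skip emitting the payout gauges when the helper returns None, leaving all the other metrics intact.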

Raised https://github.com/anclrii/Storj-Exporter/issues/57 to add the check. I guess it returns None at the start of the month when there are no payouts yet.

Just spotted this in my storagenode logs:

2021-02-01T02:08:32.055-0800 ERROR console:endpoint failed to encode json response {"error": "payouts console web error: json: unsupported value: +Inf", "errorVerbose": "payouts console web error: json: unsupported value: +Inf\n\tstorj.io/storj/storagenode/console/consoleapi.(*StorageNode).EstimatedPayout:132\n\tnet/http.HandlerFunc.ServeHTTP:2042\n\tgithub.com/gorilla/mux.(*Router).ServeHTTP:210\n\tnet/http.serverHandler.ServeHTTP:2843\n\tnet/http.(*conn).serve:1925"}

So it looks like a storagenode bug to me, but the exporter should deal with it.
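One defensive option on the exporter side (a sketch under my own assumptions, not the exporter’s actual code): when the storagenode fails mid-encode like this, the response body is typically an error message or truncated output rather than valid JSON, so the parse step can fall back to None and let the missing-payout guard take over.

```python
import json

def parse_payout_response(body):
    """Parse the payout API response body, returning None when the node
    failed to produce valid JSON (as with the '+Inf' encoding bug)."""
    try:
        return json.loads(body)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return None

# A valid body parses normally; the node's error text is treated as missing data:
assert parse_payout_response('{"currentMonth": {"held": 2}}') == {'currentMonth': {'held': 2}}
assert parse_payout_response('payouts console web error: json: unsupported value') is None
```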

I’ve also just realised the BSD jails running my storagenodes have inherited PST from my TrueNAS server (set that way to work around an as-yet-unfixed reporting issue on it). My issues started at 08:00 UTC, which was midnight PST. I’m not going to change them to local UK time yet though, as it should work regardless. The payments do seem to have rolled over correctly at midnight UTC, and were up to $0.07 with $0.02 held back before it exploded.

I changed my jails to use GMT and restarted my storagenodes earlier, and the payouts API call started working correctly again shortly after midnight.
I had a quick look at the storj golang code and it’s full of a mishmash of calls that get UTC time and local time. I’m pretty sure it should be using UTC throughout (the satellite code seemed a bit better in this respect).
I think I’m going to permanently set my jails to UTC to avoid any unexpected behaviour in future. I don’t know whether this is already done in the docker images.
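To illustrate why mixed local/UTC calls bite here (illustrative Python, not the actual Go code): a month boundary computed from naive local time moves with the host’s timezone, so a jail on PST rolls over eight hours after one on UTC, whereas anchoring everything to UTC gives the same boundary everywhere.

```python
from datetime import datetime, timezone

# Naive local time depends on the host's TZ setting (PST in the jails
# here), so day/month rollovers happen at different wall-clock moments.
local_now = datetime.now()             # varies with the TZ environment
utc_now = datetime.now(timezone.utc)   # the same on every host

# Computing the month start in UTC avoids the mismatch:
month_start = utc_now.replace(day=1, hour=0, minute=0,
                              second=0, microsecond=0)
assert month_start.day == 1
assert month_start.tzinfo is timezone.utc
```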

Just to confirm: all the info about uptime %, vetting/audit counts, audit success %, and suspension % has been removed from the API, and that’s why those panels are showing as undefined in grafana?

Use the updated dashboard. The uptime and audit scores have changed metric names, and the audit/vetting count has been removed (but will likely be added back one day).

You also need to use the dev build of the storj-exporter until @greener confirms the changes to the API and merges the dev branch into master.


Yep, I knew about the updated dashboard from the github page, although I didn’t think to look on docker hub for a more recent storj-exporter package. I tried the :dev package, but that appears to be released only for the amd64 architecture. So I just went ahead and pulled the most recent release, which appears to be :1.0.9-beta.3, and that’s working. I’ll just keep an eye out for when it’s officially published/finalized as the :latest tag.


That’s the dev branch merged and new latest multi-arch images pushed. Watchtower will pick up the update, or
docker run -v /var/run/docker.sock:/var/run/docker.sock storjlabs/watchtower --cleanup --run-once storj-exporter
should also work.

Once updated, you will also need to update the dashboard to the latest version to make the audit/uptime values work with the recent storagenode update.

@waistcoat, @kevink thanks for your help sorting this out!


Having issues with the new images on my Raspberry Pis (single node on each machine). I ran
sudo docker run -d --link=storagenode --name=storj-exporter -p 9651:9651 anclrii/storj-exporter:latest
and
sudo docker run -d --link=storagenode --name=storj-exporter -p 9651:9651 -e STORJ_HOST_ADDRESS=storagenode anclrii/storj-exporter:latest
and
sudo docker run -d --link=storagenode --name=storj-exporter -p 9651:9651 -e STORJ_HOST_ADDRESS="storagenode" anclrii/storj-exporter:latest

and even tried all three of those with anclrii/storj-exporter:1.0.9 at the end, and still each one creates a container which immediately exits. Also, just for clarification, between starting each container I did stop and rm the previous one…

Any ideas?

Edit: just tried the :1.0.9-beta.3 tag again and that seems to work, so I guess I’ll sit tight with that until I hear that the :latest tag works on the RPi.

Did you docker pull anclrii/storj-exporter:latest before docker run...? Try pulling the new image again first, or use the watchtower command above.

Reading again, 1.0.9 should be the same as 1.0.9-beta.3 and latest… weird. Can you try without the -d and share the output?

The vetting column has disappeared for new nodes; how can I check them now?

You’ll have to wait until they get added again:


Vetting info (total audits) has been removed in the latest version, I’m not sure why; see the “A couple changes to SNO API” post for more details.

I did not pull the new image with that docker pull command beforehand; I typically just run the docker start command and let it take care of downloading the new image in the process. So here’s the output from trying all of those scenarios:

Still no dice. But as you can see, the :1.0.9-beta.3 tag still sticks…

It worked like a charm :slight_smile:

Did you manage to get it to work on latest?

Can you also share docker image inspect anclrii/storj-exporter:latest | egrep "Created|Architecture" please. Looking at the error, it’s complaining about permissions, which is weird. It could be a broken build for ARM where the base image has a bug or similar. Unfortunately I don’t have anywhere to test this manually and don’t have any meaningful tests in the build. I’ll see what I can do, but this might take a while.

In the meantime, if anyone else is running on a Pi, can you please confirm whether you see a similar issue with the latest image.

here’s the output you asked for:

For comparison I ran the same query against the :1.0.9-beta.3 image that is working on the RPis, but unfortunately that seems to give the same results…