Missing debug information?

jammerdan · May 2, 2024, 8:39am

When retrieving debug information from the debug port there is the option to retrieve the data in json-format.
However when I compare it with the non-json format, there is some data that seems not present.

For example when I retrieve the retain service [8908487690175027216,2710472465353091244] storj.io/storj/storagenode/retain.(*Service).Run() (elapsed: 43h11m17.622264964s, orphaned) it gives me the elapsed time while with json {"id":8908487690175027216,"parent_id":7441024798193623745,"func":{"package":"storj.io/storj/storagenode/retain","name":"(*Service).Run"},"trace":{"id":2710472465353091244},"start":1714482721649738423,"orphaned":true,"args":[],"annotations":[]} there is no such thing.

1st question: Why is that?
2nd question: How do I get the very same and complete information as json output?

elek · May 3, 2024, 7:48am

Can you please share the URLs what did you use?

It seems to be the goroutine stack fragment, the orphaned part is strange, anyway.

BTW, are you looking for the last execution time? I think the elapsed time is different, that’s the goroutine uptime…

Egon · May 3, 2024, 7:50am

The elapsed time is calculated from start time. i.e. current time - start. The start time is represented as nanoseconds since unix epoch.

jammerdan · May 3, 2024, 8:05am

It is just curl localhost:port/mon/ps/json.

The idea was to find the processes currently running together with the duration.

That means I would have to calculate it by myself? Couldn’t that be a json field value?

elek · May 4, 2024, 6:27am

This is not what you are looking for, I assume.

Retain service is running all the time, started with the storagenode process. It listens on a queue, and when it receives data, it will start the real retain work.

It’s slightly better to check the monkit metrics, there are lot of for the garbage_collection process including the start time and duration. Unfortunately it’s harder to see if it’s actually running…

Egon · May 4, 2024, 2:30pm

That means I would have to calculate it by myself? Couldn’t that be a json field value?

Yes. Generally the programmatic outputs usually try to avoid duplicating data to reduce the overhead of fetching the debug information. Similarly there isn’t a fully qualified method name or the method name in addition.

jammerdan · May 5, 2024, 4:36am

The way I understand this, it does not sound convincing, sorry.

When I do curl localhost:port/mon/ps for example I get this:
[1173031424553271801,8799459025918700292] storj.io/storj/storagenode/blobstore/filestore.(*Dir).iterateStorageFormatVersions() (elapsed: 12m27.981508219s)
I interpret it as whatever it is what is running there, it is currently running and has been running for 8m3.686644014s.

Now when I do curl localhost:port/mon/ps/dot
I get f1173031424553271801 [label="storj.io/storj/storagenode/blobstore/filestore.(*Dir).iterateStorageFormatVersions()\lelapsed: 12m50.776322027s\l"]; which is the same running process.

And only when I do curl localhost:port/mon/ps/json I get {"id":1173031424553271801,"parent_id":8196538704928372536,"func":{"package":"storj.io/storj/storagenode/blobstore/filestore","name":"(*Dir).iterateStorageFormatVersions"},"trace":{"id":8799459025918700292},"start":1714882527299034584,"orphaned":false,"args":[],"annotations":[]},

So the first two methods show the elapsed time while the json method does not, it shows the start time instead, which the other methods do not show.
This is not consistent. All methods should present the same set of information data and only differ in format. At least this is what I think how it should be.
I don’t know why you chose to show the start time in json format but not the elapsed time.

Egon · May 6, 2024, 12:08pm

I don’t know why you chose to show the start time in json format but not the elapsed time.

Honestly, I don’t think there was a huge amount of thought other than “this is what’s needed usually from those endpoints, hence they expose that information”.

i.e. ps is consumed directly by developers hence, reading elapsed is more useful. Similarly with dot it ends up being piped directly to graphviz and so no possibility for processing.

When getting json, you are probably going to be using some data processing thing i.e. either some data consumption pipeline that can add that information via computation. There might have been some need for correlating output across multiple programs, but I don’t have any information on that.

All methods should present the same set of information data and only differ in format.

The main reason for different endpoints is to display information for different needs. Hence the output will also differ, sometimes just the formatting and other times the details or the amount of information.

I’m not against including the elapsed time per se., but it would be nice to understand where the need comes from. It could even use some sort of query flag to avoid including, unless requested – e.g. ps/json?details=elapsed or something.

jammerdan · May 7, 2024, 5:19am

Look, I am not yet really familiar with the stuff that comes from the debug port.
It just looks like it could be a solution to improve my monitory of the nodes and that there might be a lot of interesting and useful information in there.
Sadly information that I would like to see in API or the dashboard, does not get provided. So I have to see if I can get this information from somewhere else.

But I cannot tell already 100% what it is what I am looking for.

In general there are time when you see the load spikes for example but you do not really know what the node is doing because the information is missing.

For example I had this

where it was interesting to see that filewalker is running, might have restarted and for how long it is running.
Number of current uploads and downloads might be interesting also. I am sure there is other stuff.

I have no idea yet how to get this into my monitoring. But it reads json the best and it would be most convenient to have all available data ready to pull instead of calculating them on my own for every function that might be interesting.

Egon · June 4, 2024, 11:00am

Some upcoming release will have go.mod: bump monkit · storj/storj@3ea6167 · GitHub included, which contains the change to include elapsed time for unfinished spans.