Storjnode metrics documentation (proc_stat)

Hello everyone,

I’m building Grafana dashboards using Prometheus metrics from my Storj nodes. Is there any documentation (or a reference list) that explains what metrics are available and what they mean?

Specifically, I’m trying to display and alert on node connectivity and QUIC status (e.g., alert if a node disconnects). I found several metrics under proc_stat, but I’m not sure how to interpret them.

Examples (subset):

proc_stat{field="Pid", instance="X.X.X.X:9501", job="NODEXX", scope="github_com_spacemonkeygo_monkit_v3_environment"}
proc_stat{field="State", instance="X.X.X.X:9501", job="NODEXX", scope="github_com_spacemonkeygo_monkit_v3_environment"}
proc_stat{field="Ppid", instance="X.X.X.X:9501", job="NODEXX", scope="github_com_spacemonkeygo_monkit_v3_environment"}
proc_stat{field="Pgrp", instance="X.X.X.X:9501", job="NODEXX", scope="github_com_spacemonkeygo_monkit_v3_environment"}
proc_stat{field="Session", instance="X.X.X.X:9501", job="NODEXX", scope="github_com_spacemonkeygo_monkit_v3_environment"}
proc_stat{field="TtyNr", instance="X.X.X.X:9501", job="NODEXX", scope="github_com_spacemonkeygo_monkit_v3_environment"}
proc_stat{field="Tpgid", instance="X.X.X.X:9501", job="NODEXX", scope="github_com_spacemonkeygo_monkit_v3_environment"}
proc_stat{field="Flags", instance="X.X.X.X:9501", job="NODEXX", scope="github_com_spacemonkeygo_monkit_v3_environment"}
proc_stat{field="Minflt", instance="X.X.X.X:9501", job="NODEXX", scope="github_com_spacemonkeygo_monkit_v3_environment"}
proc_stat{field="Cminflt", instance="X.X.X.X:9501", job="NODEXX", scope="github_com_spacemonkeygo_monkit_v3_environment"}
proc_stat{field="Majflt", instance="X.X.X.X:9501", job="NODEXX", scope="github_com_spacemonkeygo_monkit_v3_environment"}
proc_stat{field="Cmajflt", instance="X.X.X.X:9501", job="NODEXX", scope="github_com_spacemonkeygo_monkit_v3_environment"}
proc_stat{field="Utime", instance="X.X.X.X:9501", job="NODEXX", scope="github_com_spacemonkeygo_monkit_v3_environment"}
proc_stat{field="Stime", instance="X.X.X.X:9501", job="NODEXX", scope="github_com_spacemonkeygo_monkit_v3_environment"}
proc_stat{field="Cutime", instance="X.X.X.X:9501", job="NODEXX", scope="github_com_spacemonkeygo_monkit_v3_environment"}
proc_stat{field="Cstime", instance="X.X.X.X:9501", job="NODEXX", scope="github_com_spacemonkeygo_monkit_v3_environment"}
proc_stat{field="Priority", instance="X.X.X.X:9501", job="NODEXX", scope="github_com_spacemonkeygo_monkit_v3_environment"}
proc_stat{field="Nice", instance="X.X.X.X:9501", job="NODEXX", scope="github_com_spacemonkeygo_monkit_v3_environment"}
proc_stat{field="NumThreads", instance="X.X.X.X:9501", job="NODEXX", scope="github_com_spacemonkeygo_monkit_v3_environment"}
proc_stat{field="Itrealvalue", instance="X.X.X.X:9501", job="NODEXX", scope="github_com_spacemonkeygo_monkit_v3_environment"}

Thanks in advance!

I would just parse the source code and extract the API definitions from there.

You may also use this link as an example:

I tried to figure it out by reviewing Monkit, but it doesn’t provide useful explanations since the metrics are developer-defined and Monkit abstracts them. I’ll check the source code instead and try to map the metrics. Thanks!

@Alexey,

I looked for that one too…

I checked a bunch of Storj Grafana dashboards (some are really good), but I still haven’t found a metric that clearly shows TCP/UDP connection status.

Again, I’m replying to my own question in case it helps anyone else who runs into the same thing (I may have too much free time right now :grinning_face_with_smiling_eyes:).

The proc_stat metrics are the Linux per-process CPU/memory stats (the fields of /proc/<pid>/stat), or in ChatGPT’s words: “proc_stat_* = raw Linux process stats for the storagenode process.”

If anyone’s curious, here’s the (slightly lengthy) ChatGPT explanation for these metrics:

proc_stat_pid: process ID.
proc_stat_ppid: parent process ID.
proc_stat_pgrp: process group ID.
proc_stat_session: session ID.
proc_stat_tty_nr: controlling terminal device number.
proc_stat_tpgid: foreground process group of the terminal.
proc_stat_flags: kernel task flags bitmask.
proc_stat_minflt: minor page faults for this process.
proc_stat_cminflt: minor page faults by waited-for children.
proc_stat_majflt: major page faults for this process.
proc_stat_cmajflt: major page faults by waited-for children.
proc_stat_utime: CPU time in user mode, in clock ticks.
proc_stat_stime: CPU time in kernel mode, in clock ticks.
proc_stat_cutime: children’s user CPU time, in clock ticks.
proc_stat_cstime: children’s kernel CPU time, in clock ticks.
proc_stat_priority: kernel scheduling priority value.
proc_stat_nice: nice level.
proc_stat_num_threads: number of threads in the storagenode process.
proc_stat_itrealvalue: obsolete field, usually zero on modern Linux.
proc_stat_starttime: process start time, in clock ticks since system boot.
proc_stat_vsize: virtual memory size in bytes.
proc_stat_rss: resident set size in pages, not bytes.
proc_stat_rsslim: soft RSS limit in bytes.
proc_stat_startcode / endcode: text segment address range.
proc_stat_startstack: start address of the main stack.
proc_stat_kstkesp / kstkeip: current stack pointer (ESP) and instruction pointer (EIP) of the process; mostly legacy/low-value today.
proc_stat_signal: bitmap of pending signals.
proc_stat_blocked: bitmap of blocked signals.
proc_stat_sigignore: bitmap of ignored signals.
proc_stat_sigcatch: bitmap of caught signals.
proc_stat_wchan: wait channel / kernel wait location.
proc_stat_nswap / cnswap: obsolete placeholders on modern kernels.
proc_stat_exit_signal: signal sent to parent on exit.
proc_stat_processor: CPU number last executed on.
proc_stat_rt_priority: realtime priority.
proc_stat_policy: scheduler policy.
proc_stat_delay_acct_blkio_ticks: time spent waiting on block I/O, in clock ticks.
proc_stat_guest_time: guest VM CPU time in clock ticks.
proc_stat_cguest_time: children’s guest VM CPU time in clock ticks.
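
In case it helps to see where these names come from, here is a minimal Python sketch that parses one line of /proc/<pid>/stat into named fields, in the order given by the proc(5) man page. The function name parse_proc_stat and the FIELDS_AFTER_COMM list are just names I made up for illustration; this is not how the storagenode or Monkit reads the values.

```python
# Field names that follow "pid (comm)" in /proc/<pid>/stat, in proc(5) order.
FIELDS_AFTER_COMM = [
    "state", "ppid", "pgrp", "session", "tty_nr", "tpgid", "flags",
    "minflt", "cminflt", "majflt", "cmajflt",
    "utime", "stime", "cutime", "cstime",
    "priority", "nice", "num_threads", "itrealvalue", "starttime",
    "vsize", "rss", "rsslim", "startcode", "endcode", "startstack",
    "kstkesp", "kstkeip", "signal", "blocked", "sigignore", "sigcatch",
    "wchan", "nswap", "cnswap", "exit_signal", "processor",
    "rt_priority", "policy", "delay_acct_blkio_ticks",
    "guest_time", "cguest_time",
]


def parse_proc_stat(line):
    """Parse one /proc/<pid>/stat line into a dict (hypothetical helper)."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split on the first "(" and the last ")".
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    fields = {
        "pid": int(line[:open_paren]),
        "comm": line[open_paren + 1:close_paren],
    }
    values = line[close_paren + 1:].split()
    # zip() stops at the shorter list, so older kernels with fewer
    # fields (or newer ones with extra trailing fields) still parse.
    for name, raw in zip(FIELDS_AFTER_COMM, values):
        fields[name] = raw if name == "state" else int(raw)
    return fields
```

On a Linux box you could try it with parse_proc_stat(open("/proc/self/stat").read()) and compare the keys with the list above.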

A few of these are easy to misread:

rss is in memory pages, not bytes. To convert, multiply by system page size, typically 4096 bytes. The Linux docs call out that RSS-related proc values may be approximate.
utime, stime, cutime, cstime, and delay_acct_blkio_ticks are in clock ticks, not seconds. Convert by dividing by sysconf(_SC_CLK_TCK); on many Linux systems that is 100 ticks per second.
processor is not CPU usage. It is the CPU core the task was scheduled on last.
flags, signal, blocked, sigignore, and sigcatch are bitmasks, so they are rarely useful directly in Grafana except as debugging clues.
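
The two unit conversions above can be sketched in Python like this. The helper names ticks_to_seconds and rss_pages_to_bytes are my own, and the sysconf fallback only makes sense if you run it on the same host as the node; on a separate dashboard machine, pass the node's values explicitly.

```python
import os


def ticks_to_seconds(ticks, clk_tck=None):
    """Convert clock ticks (utime, stime, ...) to seconds."""
    if clk_tck is None:
        # Ticks per second on THIS machine; often 100 on Linux.
        clk_tck = os.sysconf("SC_CLK_TCK")
    return ticks / clk_tck


def rss_pages_to_bytes(pages, page_size=None):
    """Convert an rss value in memory pages to bytes."""
    if page_size is None:
        # Page size on THIS machine; typically 4096 bytes.
        page_size = os.sysconf("SC_PAGE_SIZE")
    return pages * page_size
```

For example, with the typical values, ticks_to_seconds(250, 100) is 2.5 seconds and rss_pages_to_bytes(2048, 4096) is 8 MiB.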

For Storj operations, the ones that are actually useful are usually these:

proc_stat_num_threads: sudden growth can hint at a stuck or overloaded process.
proc_stat_utime and proc_stat_stime: apply rate() to get ticks per second, then divide by the clock tick rate (often 100) to get CPU cores used.
proc_stat_rss: track real memory footprint.
proc_stat_vsize: track address-space growth, useful for leak suspicion.
proc_stat_majflt: major page faults can hint at memory pressure or disk-backed page-ins.
proc_stat_delay_acct_blkio_ticks: can hint at I/O wait pressure on busy storage.
proc_stat_processor: mostly diagnostic, not capacity planning.
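
To make the CPU usage idea concrete, here is a rough Python sketch of the arithmetic behind it, using two samples of utime and stime taken some seconds apart. The function name cpu_cores_used and the default of 100 ticks per second are assumptions for illustration; it mimics the idea of a rate over counters, not Prometheus's exact rate() implementation.

```python
def cpu_cores_used(prev, curr, clk_tck=100):
    """Average CPU cores used between two samples.

    prev and curr are (timestamp_seconds, utime_ticks, stime_ticks).
    clk_tck is the node host's ticks per second (assumed 100 here).
    """
    t0, u0, s0 = prev
    t1, u1, s1 = curr
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta_ticks = (u1 + s1) - (u0 + s0)
    if delta_ticks < 0:
        # Counters went backwards, so the process probably restarted;
        # treat the current totals as the increase since the restart.
        delta_ticks = u1 + s1
    return delta_ticks / clk_tck / elapsed
```

A result of 1.0 means one full core busy over the interval; multiply by 100 if you want a percentage of one core in Grafana.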
