So I’ve had an issue a few times where UptimeRobot thinks my nodes are still up: I get no Down notifications from the service, but I find out later that my nodes were actually unresponsive. Normally I caught it quickly because I was home, and when I was using the computer the nodes run on, I noticed it was unresponsive.
But the other day, I was away from home for a few days, and I started getting emails saying my Nodes were disqualified! Again, NO notification from UptimeRobot at all.
Aaaaaand, now all 3 of my nodes are fully disqualified and I’ve lost a bunch of money that was being held.
So I’m rebuilding my nodes now, starting over, and I’d like to try to set up some kind of monitoring that will alert me to this scenario before I’m disqualified again… Anyone have any tips or solutions? Thanks
I don’t know of a currently available option for better downtime monitoring, but I am right now developing an app for node monitoring, and downtime notifications will be one feature. Unfortunately it is not yet finished and will take some more time until it can be released.
Sounds like DQ for failed audits after timeouts. Your nodes were so unresponsive that they were unable to provide pieces for audit.
I would suggest fixing this problem first. Maybe it’s hardware related (disks, controller, or a weak power supply; maybe the thermal grease has dried out and the CPU hangs).
It could be software too.
I use UptimeRobot, but also a Grafana dashboard to monitor many metrics. If they go out of spec, it alerts me immediately via Discord / Telegram.
You should use a hardware watchdog; most servers will have something like that by default.
It will basically just reboot / power-cycle the system if it doesn’t respond to various conditions such as a kernel panic. But one can also use scripts to interact with it, so that it would, say, create / delete / touch data on the storage to verify that the storage is functional; if that doesn’t complete, the hardware watchdog will trigger.
I have been tinkering with my watchdog, but can’t say it’s saved me… if anything I think it managed to ruin my OS once lol, so yeah, be careful how you use the power-cycle feature.
But my OS SSD didn’t have CoW or PLP at the time, so that’s kind of my own fault.
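The create / touch / read-back idea above can be sketched as a small standalone check. This is a hedged sketch, not the watchdog’s own interface: the path `/mnt/storagenode` is a placeholder for your node’s data directory, and wiring the exit code into an actual hardware watchdog (or cron + notifier) is left to you.

```python
import os
import tempfile
import threading

def storage_responsive(path, timeout=10.0):
    """Create, read back, and delete a small test file under `path`.

    Returns True if the round-trip completes within `timeout` seconds,
    False if it fails or hangs. A hung disk will block the worker thread,
    so we only wait on it with a timeout instead of joining forever.
    """
    result = {"ok": False}

    def worker():
        try:
            fd, name = tempfile.mkstemp(prefix=".watchdog-", dir=path)
            try:
                os.write(fd, b"ping")
                os.fsync(fd)          # force the write to actually hit the disk
                os.close(fd)
                with open(name, "rb") as f:
                    result["ok"] = f.read() == b"ping"
            finally:
                os.unlink(name)
        except OSError:
            result["ok"] = False

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)                   # give up if the filesystem hangs
    return result["ok"]

if __name__ == "__main__":
    import sys
    # placeholder path -- point this at your node's storage mount
    sys.exit(0 if storage_responsive("/mnt/storagenode") else 1)
```

A nonzero exit code is the trigger signal: a watchdog daemon (or a simple cron job) can power-cycle or alert when the check fails or times out.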
These errors are a direct indicator that something is wrong. But if your node hangs, it will not log anything, so you need a better indicator, like the external monitoring with Prometheus/Telegraf + InfluxDB + Grafana mentioned by @KernelPanick.
I have a little VM in the cloud (I’m paying less than €3/month) which continuously checks whether the Storj port is open. If the port is not open, it sends me a message on my mobile/PC through Keybase.
Nothing complicated, just a script running on the Internet. It also monitors my Internet connection.
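The core of such a script is just a TCP connect with a timeout. A minimal sketch, assuming the default storagenode port 28967 and a hypothetical hostname `my.node.example`; the alert line is where you would hook in Keybase, Telegram, or whatever notifier you use:

```python
import socket

def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # 28967 is the default storagenode port; "my.node.example" is a placeholder
    if not port_open("my.node.example", 28967):
        print("ALERT: Storj port closed")  # replace with your notifier of choice
```

Run it from cron every minute or two and you have roughly what UptimeRobot does, but under your own control.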
UptimeRobot seems to be working fine in terms of monitoring the port from the internet.
It’s local system issues that I need to figure out how to monitor better.
As a start, I’m funneling my logs to a file like @Alexey recommended above, and I’m going to keep my eye on that. Then I’ll see if I can set up some kind of alerting locally based on those keywords.
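Keyword-based alerting on the log file can be as simple as a pattern scan. A sketch under stated assumptions: the patterns below are a starting point (exact log wording varies between storagenode versions), and `node.log` is a placeholder for wherever you redirected the logs:

```python
import re

# Patterns that usually indicate trouble in storagenode logs.
# The exact wording varies between versions -- adjust to what you see.
ALERT_PATTERNS = [
    re.compile(r"ERROR", re.IGNORECASE),
    re.compile(r"GET_AUDIT.*failed", re.IGNORECASE),
    re.compile(r"download failed", re.IGNORECASE),
]

def find_alerts(lines):
    """Return the log lines that match any alert pattern."""
    return [ln for ln in lines if any(p.search(ln) for p in ALERT_PATTERNS)]

if __name__ == "__main__":
    # "node.log" is a placeholder for your redirected log file
    with open("node.log") as f:
        hits = find_alerts(f)
    for h in hits:
        print(h.rstrip())  # pipe this into mail/Telegram/whatever you prefer
```

Note the caveat from earlier in the thread: a fully hung node logs nothing, so this only catches errors, not silence; pair it with an external check.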
That is exactly what UptimeRobot does, for free and without the trouble of setting all that up. If port monitoring is all you want, just go with UptimeRobot.
I still hold that I don’t worry too much about uptime tracking; it’s nice and all, but it’s highly unlikely that downtime will kill my node if I check it every few days.
Really it’s kernel panics, super high storage latencies, or data corruption that I worry most about to avoid DQ.
One might even count the storagenode software version, but again that isn’t very likely to go unnoticed given the complete lack of ingress or the version-out-of-date notification in the dashboard. Still, that is IMO more likely to kill a node than actual downtime,
as the time needed to be DQ’d for downtime is… I dunno… a lot… in the weeks-to-months range.
Sadly there still doesn’t exist a script to guard against such issues, but a hardware watchdog does the job in most cases. Even a software watchdog might work, though I’m not sure that can actually cut power at the motherboard level in extreme cases.
Yes, but I’m not using the VM only for Storj monitoring: it’s a VM I bought some time ago for personal stuff and decided to use for Storj service monitoring as well.
But the service is not down. It just doesn’t respond to high-level requests like “give me a piece for audit”, while it still responds to low-level (network-level) requests like “is the port open?”
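One way around that gap is to probe something higher-level than the TCP port, for example the node’s local dashboard API. A hedged sketch: I’m assuming the dashboard listens on 127.0.0.1:14002 and serves node info as JSON at `/api/sno` with a `nodeID` field; check what your node version actually exposes before relying on this. A hung process will time out here even if the port still accepts connections:

```python
import json
import urllib.request

def node_healthy(base_url="http://127.0.0.1:14002", timeout=5.0):
    """Probe the storagenode's local dashboard API instead of just the port.

    The /api/sno endpoint and the nodeID field are assumptions -- verify
    them against your node version. Returns False on timeout, connection
    failure, or a response that doesn't parse as the expected JSON.
    """
    try:
        with urllib.request.urlopen(base_url + "/api/sno", timeout=timeout) as r:
            data = json.load(r)
        return bool(data.get("nodeID"))
    except (OSError, ValueError):
        return False
```

Running this from the same box (or over an SSH tunnel) and alerting on False catches exactly the “port open but service hung” case described above.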