So I’ve had an issue a few times where UptimeRobot thinks my nodes are still up: I get no Down notifications from the service, but I find out later that my nodes were actually unresponsive. Normally I caught it quickly because I was home, and when I was using the computer the nodes run on, I noticed it was unresponsive.
But the other day, I was away from home for a few days, and I started getting emails saying my Nodes were disqualified! Again, NO notification from UptimeRobot at all.
Aaaaaand, now all 3 of my nodes are fully disqualified and I’ve lost a bunch of money that was being held.
So I’m rebuilding my nodes now, starting over, and I’d like to try to set up some kind of monitoring that will alert me to this scenario before I’m disqualified again… Anyone have any tips or solutions? Thanks
I don’t know of a currently available option for better downtime monitoring, but I am right now developing an app for node monitoring, and downtime notifications will be one feature. Unfortunately it is not yet finished and will take some more time until it can be released.
Sounds like DQ for failed audits after timeouts. Your nodes were so unresponsive that they were unable to provide pieces for audit.
I would suggest fixing this problem first. Maybe it’s hardware related (disks, controller, or a weak power supply; maybe the thermal grease has dried out and the CPU hangs).
It could be software too.
I use UptimeRobot, but also a Grafana dashboard to monitor many metrics. If they go out of spec, it alerts me immediately via Discord / Telegram.
You should use a hardware watchdog; most servers will have something like that by default.
It will basically just reboot / power-cycle the system if it doesn’t respond to various conditions such as a kernel panic. But one can also use scripts to interact with it, so that it would, say, create / delete / touch data on the storage to verify that the storage is functional; if that doesn’t complete, the hardware watchdog will trigger.
I have been tinkering with my watchdog, but can’t say it’s saved me… if anything I think it managed to ruin my OS once lol, so yeah, be careful how you use the power-cycle feature.
But my OS SSD didn’t have CoW or PLP at the time, so that’s kind of my own fault.
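The create / touch / read-back idea above can be sketched as a small standalone check. This is a hedged sketch, not the watchdog’s own interface: the path `/mnt/storagenode` is a placeholder for your node’s data directory, and wiring the exit code into an actual hardware watchdog (or cron + notifier) is left to you.

```python
import os
import tempfile
import threading

def storage_responsive(path, timeout=10.0):
    """Create, read back, and delete a small test file under `path`.

    Returns True if the round-trip completes within `timeout` seconds,
    False if it fails or hangs. A hung disk will block the worker thread,
    so we only wait on it with a timeout instead of joining forever.
    """
    result = {"ok": False}

    def worker():
        try:
            fd, name = tempfile.mkstemp(prefix=".watchdog-", dir=path)
            try:
                os.write(fd, b"ping")
                os.fsync(fd)          # force the write to actually hit the disk
                os.close(fd)
                with open(name, "rb") as f:
                    result["ok"] = f.read() == b"ping"
            finally:
                os.unlink(name)
        except OSError:
            result["ok"] = False

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)                   # give up if the filesystem hangs
    return result["ok"]

if __name__ == "__main__":
    import sys
    # placeholder path -- point this at your node's storage mount
    sys.exit(0 if storage_responsive("/mnt/storagenode") else 1)
```

A nonzero exit code is the trigger signal: a watchdog daemon (or a simple cron job) can power-cycle or alert when the check fails or times out.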
These errors are a direct indicator that something is wrong. But if your node hangs, it will not log anything, so you need a better indicator, like the external monitoring with Prometheus/Telegraf + InfluxDB + Grafana mentioned by @KernelPanick.
I have a little VM in the cloud (I’m paying less than €3/month) which continuously checks whether the Storj port is open. If the port is not open, it sends me a message on my mobile/PC through Keybase.
Nothing complicated, just a script running on the Internet. It also monitors my Internet connection.
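The core of such a script is just a TCP connect with a timeout. A minimal sketch, assuming the default storagenode port 28967 and a hypothetical hostname `my.node.example`; the alert line is where you would hook in Keybase, Telegram, or whatever notifier you use:

```python
import socket

def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # 28967 is the default storagenode port; "my.node.example" is a placeholder
    if not port_open("my.node.example", 28967):
        print("ALERT: Storj port closed")  # replace with your notifier of choice
```

Run it from cron every minute or two and you have roughly what UptimeRobot does, but under your own control.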
UptimeRobot seems to be working fine in terms of monitoring the port from the internet.
It’s local system issues that I need to figure out how to monitor better.
As a start, I’m funneling my logs to a file like @Alexey recommended above, and I’m going to keep my eye on that. Then I’ll see if I can set up some kind of alerting locally based on those keywords.
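Keyword-based alerting on the log file can be as simple as a pattern scan. A sketch under stated assumptions: the patterns below are a starting point (exact log wording varies between storagenode versions), and `node.log` is a placeholder for wherever you redirected the logs:

```python
import re

# Patterns that usually indicate trouble in storagenode logs.
# The exact wording varies between versions -- adjust to what you see.
ALERT_PATTERNS = [
    re.compile(r"ERROR", re.IGNORECASE),
    re.compile(r"GET_AUDIT.*failed", re.IGNORECASE),
    re.compile(r"download failed", re.IGNORECASE),
]

def find_alerts(lines):
    """Return the log lines that match any alert pattern."""
    return [ln for ln in lines if any(p.search(ln) for p in ALERT_PATTERNS)]

if __name__ == "__main__":
    # "node.log" is a placeholder for your redirected log file
    with open("node.log") as f:
        hits = find_alerts(f)
    for h in hits:
        print(h.rstrip())  # pipe this into mail/Telegram/whatever you prefer
```

Note the caveat from earlier in the thread: a fully hung node logs nothing, so this only catches errors, not silence; pair it with an external check.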
That is exactly what UptimeRobot does, for free and without the trouble of setting all that up. If port monitoring is all you want, just go with UptimeRobot.
I still hold that I don’t worry too much about uptime tracking; it’s nice and all, but it’s highly unlikely that downtime will kill my node if I check it every few days.
Really it’s kernel panics, super high storage latencies, or data corruption that I worry most about to avoid DQ.
One might even count the storagenode software version, but again that isn’t very likely to go unnoticed given the complete lack of ingress or the version-out-of-date notification in the dashboard. Still, that is IMO more likely to kill a node than actual downtime,
as the time needed to be DQ’d for downtime is… I dunno… a lot… in the weeks-to-months range.
Sadly there still doesn’t exist a script to guard against such issues, but a hardware watchdog does the job in most cases. Even a software watchdog might work, though I’m not sure that can actually cut power at the motherboard level in extreme cases.
Yes, but I’m not using the VM only for Storj monitoring: it’s a VM I bought some time ago for personal stuff and decided to use for Storj service monitoring as well.
But the service is not down. It just doesn’t respond to high-level requests like “give me a piece for audit”, while it still responds to low-level (network-level) requests like “is the port open?”
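One way around that gap is to probe something higher-level than the TCP port, for example the node’s local dashboard API. A hedged sketch: I’m assuming the dashboard listens on 127.0.0.1:14002 and serves node info as JSON at `/api/sno` with a `nodeID` field; check what your node version actually exposes before relying on this. A hung process will time out here even if the port still accepts connections:

```python
import json
import urllib.request

def node_healthy(base_url="http://127.0.0.1:14002", timeout=5.0):
    """Probe the storagenode's local dashboard API instead of just the port.

    The /api/sno endpoint and the nodeID field are assumptions -- verify
    them against your node version. Returns False on timeout, connection
    failure, or a response that doesn't parse as the expected JSON.
    """
    try:
        with urllib.request.urlopen(base_url + "/api/sno", timeout=timeout) as r:
            data = json.load(r)
        return bool(data.get("nodeID"))
    except (OSError, ValueError):
        return False
```

Running this from the same box (or over an SSH tunnel) and alerting on False catches exactly the “port open but service hung” case described above.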