My node is running 24/7 but the uptime is decreasing slowly!

NS1 · November 24, 2020, 3:19pm

Since I fixed the unsent order issue, my uptime has decreased. But on November 13th, I noticed my uptime is still decreasing slowly. So I started taking screenshot to confirm this.

Despite my node is running 24/7, in 11 days my uptime on:

satellite saltlake.tardigrade has decreased from 98.91% to 98.86%
satellite us-central-1.tardigrade has decreased from 99.09% to 99.04%
satellite europe-west-1.tardigrade has decreased from 98.3% to 98.21%

Why my uptime is decreasing slowly and how to fix it?
Will this uptime issue affect my November payout?

Please help, Thank you.

dragonhogan · November 24, 2020, 3:38pm

Any idea how reliable your internet connectivity is?

The only thing I can think of is if there are “blips” in your internet connectivity? I know I’ve seen where if I reset my router or if my ISP has some short downtime, my nodes will just sit there idle until the connectivity comes back and then just keep going once re-established…during that “offline” period if a satellite tries to audit your node and it’s unreachable that could cause a decrease in “online” score.

At least that’s my understanding of it…

kalloritis · November 24, 2020, 3:51pm

@littleskunk could some of these be larger repercussions of the k8s migrations or am I over associating coincidental events?

baker · November 24, 2020, 4:05pm

I think if that were the case, this would be observed in more than one person’s node. Both my nodes are at 100%. Also, this appears to have been happening to OP for over 1.5 weeks now.

It could be as dragonhogan suggested, minor hits to your internet connectivity that are long enough for the node to miss some pings from the satellite but short enough for your node to not reset the uptime value as seen on the dashboard. Not sure how to do it efficiently, but you could search your logs for gaps where there is no activity. I would expect order sending would throw errors if it happened during this downtime, or unresolvable host errors could be in there too.

kalloritis · November 24, 2020, 4:12pm

I’d agree with you, if it were not as well something I was part of:

Being an SA/NA, there isn’t much I don’t dabble in monitoring at home too and I’ve not seen any merit to any downtime (nothing in the logging agg, gateway monitoring, or VPN tunnels).

nerdatwork · November 24, 2020, 4:13pm

Any amount of downtime would result in decrease of online score. This isn’t reflected in real time hence you might observe a gradual decrease.

SGC · November 24, 2020, 4:22pm

so i was looking at my uptime this morning and was a bit surprised that it wasn’t at 100%
but i know exactly what happened… a couple of days ago i wanted to use a port in a network switch which the server was using… and due to me being a pro and not having labeled my patch cables
i had to figure out what went where… so i pulled up a docker log for a storage node and pulled a random cable…

ended up with maybe 30 sec of downtime for the storagenode before i found the cable i was looking for, that gave me this… which seemed a bit over the top imo

so did a bit of math on it, calculating that if it counts over 72 hours (duno that it does)
then i would have been registered as being offline from 25-75 seconds

so all in all i can say, the online score is extremely sensitive, most likely running over 72hour periods according to the number i got anyways.
and so even brief disconnects will be registered…

was helping out one guy recently which ended up figuring out that it was his ddns that took up to 10 minutes to update and because his isp would renew his ip address every day, that would give him a bit of downtime on the online score…

http://forum.storj.io/t/online-status-fluctuates-since-a-few-days

but should be fine if it doesn’t drop to low… and ofc it takes like 288 hours of downtime to get suspended atm… so the online score is a good indicator of small issues, but not really anything to worry about…

also will be interesting to see how long it takes for my online score to recover… because i know i was not offline for more than 30sec maybe 60

anyways good luck, hope some of you can find this ramble useful for identifying your problems

nerdatwork · November 24, 2020, 4:59pm

In case this helps

github.com

storj/storj/blob/257c8682d392cd078c820abac47fa464c94e981f/docs/blueprints/storage-node-downtime-tracking-with-audits.md

# Storage Node Downtime Tracking With Audits

## Abstract

This document describes a means of tracking storage node downtime with audits and using this information to suspend and disqualify.

## Background

The previous implementation of uptime reputation consisted of a ratio of online audits to offline audits. We encountered a problem where some nodes' reputations would quickly become destroyed over a relatively short period of downtime due to the frequency of auditing any particular node being directly correlated with the number of pieces it holds. To solve this problem we need a system that takes into account not only how many offline audits occur, but _when_ they occur as well.

## Design

The solution proposed here is to use a series of sliding windows to indicate a general timeframe in which offline audits occur. Each window keeps two separate tallies indicating how many offline audits and total audits a particular node received within its timeframe. Once a window is complete, it is scored by calculating the percentage of total audits for which it was offline. We can average these scores over a trailing period of time, called the _tracking period_, to determine an overall "offline score" to be used for suspension and disqualification. By granting each individual window the same weight in the calculation of the overall average, the effect of any particularly unlucky period can be minimized while still allowing us to take the failures into account over a longer period.

Storage node downtime can have a range of causes. For those storage node operators who may have fallen victim to a temporary issue, we want to give them a chance to diagnose and fix it before disqualifying them for good. For this reason, we are introducing suspension as a component of disqualification.

Once a node's offline score has risen above an _offline threshold_, it is _suspended_ and enters a period of review. A suspended node will not receive any new pieces, but can continue to receive download and audit requests for the pieces it currently holds. However, its pieces are considered to be unhealthy. We repair a segment if it contains too many unhealthy pieces, at which point we may transfer the repaired pieces from a suspended node to a more reliable node. If at any point during the review period we find that a node's score has fallen below the offline threshold, it is unsuspended, or _reinstated_, but it remains _under review_. This prevents nodes from alternating between suspension and reinstatement without consequence.

The review period consists of one _grace period_ and one _tracking period_. The _grace period_ is given to fix whatever issue is causing the downtime. After the grace period has expired, any offline audits will fall within the scope of the tracking period, and thus will be used in the node's final evaluation. If at the end of the review period, the node is still suspended, it is disqualified. Otherwise, the node is no longer _under review_.

This file has been truncated. show original

SGC · November 24, 2020, 5:51pm

a right i forgot about this window thing… which is kinda funny because i was one of the people initially (to my knowledge) suggesting why not just use audits for uptime tracking.

so this window thing would mean it tracked me as offline for an entire window, whatever that exactly means … and the reason it’s different for each satellite is because of the differences in the amounts of data / audits i got hit by…

still 30-60 sec but i guess if one or a few failed audit is enough to make me fail on a window…
ofc then each satellite should be about the same, but i suppose it could be the stored data that makes them uneven… my numbers aren’t really high enough to draw any conclusions aside from it only took 30-60 sec to discover that i was down… which is pretty impressive…

the BP is a big read to understand online score tho…
don’t get me wrong, i find it as interesting as the next guy… okay lets me fair, most regular people won’t find it interesting at all… they just want to know how online score works, and i’m a bit on that team also…

LinuxNet · November 24, 2020, 8:32pm

the online score is extremely sensitive

I can confirm. Yesterday I had my VPS off for only 1 minute and immediately lost 0.3% on the Asia East satellite.

Such short disconnections should not be a problem for the whole month. At least that’s what I’ve observed so far.

SGC · November 24, 2020, 8:54pm

it’s not really a problem, the online score is an indicator, i guess it’s good that it’s highly accurate so it can indicate even very small network issues before they develop into major issues for a SNO.

also keep in mind that it’s still kinda new, so some fine tuning may eventually be done before it takes on it’s final form… but yeah i was a bit surprised just how sensitive it was… which is good, in the past storj had trouble tracking nodes online time… now it’s basically built into the network architecture and well it knows in short order when one goes offline.

and from what i understood in reading some of the blueprint, being offline strikes the entire window, which i duno how long that is… but could be an hour or maybe 4…

just means you don’t want to have an intermittent connection to the network, as it could ruin the viability for a node, but also that if its like say 4 hour window… it doesn’t matter if you aren’t present for 1 single audit or for however many you will get over 1 or 4 hours as it’s all in the same window… meaning you can without issue shutdown and repair / maintain your system without your score being crippled… (to my understanding… but ill certainly test that at one point)

however if one has an internet connection that works 98% of lets say every hour… then your node will get suspended and i guess the network will the move the data else where before DQ the node eventually if the problem isn’t fixed… which seems fair, i mean sure the network has redundancy… but is it’s unstable… and if say 50% of the nodes are that way… .you could end up with brief minutes where data might simply not be accessible or repairs will be triggered without there really being a need for it… which requires more space to be used i would suppose…

thus decreasing the overall efficiency of the network, just because somebody / many has an unstable internet connection… this way atleast they will get a chance to fix it and the network can very accurately determine if it’s beneficial to keep or DQ a node…

i know it doesn’t do that yet… but it will eventually… since it’s the most efficient solution.

Pac · November 24, 2020, 11:11pm

@SGC & @LinuxNet: You’ve got to be damn unlucky to be sent an audit during such a short offline window of 30~60 seconds…
That seems really unlikely! My nodes got offline during 1 hour because of my ISP lately, and only 2 sats (out of 5) noticed it:

Rather don’t you think you’ve been offline more than just a few seconds at another moment you’re not aware of… ?

SGC · November 24, 2020, 11:16pm

i got nearly 15tb on one node… and get checked 12 days back i get just short of a thousand audits on a slow day and the peak was nearly 2k during a 24hour period… 60minute x 24 hours gives 1440 minutes in a day.

and so getting an audit every minute is just about the avg for me…
some of the sats didn’t notice i was missing for 30 seconds either 3 of them did tho, sorry 4 did

Alexey · November 25, 2020, 12:34am

The online score is dropping faster if your node is new. It should have 30 days audits to match the expectations (60% after 288 hours offline).

emm · November 25, 2020, 2:39am

I’m in a similar boat as well, the storage node updater failed to restart the node after an update.
I do not know how long it was down for before i found it and manualy restarted the node.
since then it has been online 222 hours now and it is still continuing to drop far more on one satelite than any of the others my internet is via 100Mbit fibre and has had 1297 hours of trouble free up time since i started monitoring it.
so at a bit of a loss

LinuxNet · November 25, 2020, 6:21am

You should monitor your node with Uptimerobot, Freshping or similar. Then you know exactly when and how long your node was offline.

SGC · November 25, 2020, 7:45am

well your log will tell you when you are offline… or in case the server stalls all out the only reason one needs uptime robot or such is because the signal needs to get online or into a similar distributed network so it can reach the owner and complain, ofc uptime robot will just see if the port is open and something is responding on the other side.

yeah i also skipped the uptime robot setup… it seemed rather steep to pay them 5$ a month for an sms service, i cannot really use an online based alert, thats kinda want i want to avoid… i mean for the price of 6$ i can take an old mobil and hook up to the server with a mobile subscription and have that send me sms’s if the server has problems…
only difference is i can make it much more elaborate, and fully custom able and working as a separate system which can still get data from the server for communicating directly issues that may arise…

so yeah not really a fan of uptime robot, i think they are using the paid service to pay for the free service, because the paid service is way to over priced when one can basically copy their setup for the same monthly cost…

i mean with that setup i could essentially run my own uptime robot and ping online servers and send sms’s and the biggest cost is the subscription for the mobile phone, ofc in that case i would most likely just use an online sms service instead… but the old mobile is a pretty nice solution compared to uptime robot.

also there being exactly 1$ in difference kinda makes me think that uptime robot looked at the prices of cheapest mobile subscriptions in general and just subtracted a $.

which again seems to indicate that their business model isn’t to be good to their paying customers, but to squeeze them to pay for their free advertisement offering a service that basically costs them the same thing as their paid service, since the sms service utilization is basically free if one really uses it on a large scale then there is no reason to take 5$ and even with 15 tb stored my server still uses more power than it been earning… at best it’s even, so i really don’t want to add more expenses for i can only classify as a poor service unless if one uses the free service which only works online, meaning it’s only relevant if my internet or server is down, which in both cases is painfully obvious.

maybe i will use uptime robot one day… but i really don’t want to … that would just help pay for all the other people that uses it for free, which seems like most people from when people talk about it.

pretty neat business model they got tho…

LinuxNet · November 25, 2020, 8:18am

But why pay? Uptimerobot has a free tariff. Clearly only every 5 minutes. If that doesn’t suit you can switch to Freshping or Hetrixtools. The latter has another great advantage. Namely, that you can configure it so that you get a notification for every little thing. e.g. I can enter the application “storagenode” there and receive a notification immediately when this service is down. This is how it works, for example. with InfluxDB and any other application. Hetrixtools is also the most reliable as it runs via an agent installed on the server. I just set up 2 accounts there. One for all the agents on my servers and one for the ping and port checks.

And if you are still not convinced, you can book a VPS for 1 euro (e.g. here in Germany at Ionos) or at Frantech for $ 2 and install Zabbix there.

However, 10GB of storage space might not be enough for Zabbix, depending on how many devices you want to monitor.

Take a look at Hetrixtools. <- advertising link

SGC · November 25, 2020, 8:38am

because i only want the sms service, so i can get notifications anywhere basically, the whole reason for uptime robot is so it’s able to hunt me down and scream at me even if i’m like at the bottom of a well.
xD

internet service tho great, is still pretty limited in some areas, even here there will be deadzones or atleast low reception areas and i want it to work 24/7 without taking any significant power on the notifying device.

this way i can go for many days without charging and the square kilometer area covered will be like 10-100 times bigger.
isn’t that kinda the whole point behind it… i mean its kinda the whole point behind it, that it’s able to contact me, if it doesn’t have sms service then i’m stuck with 3G, 4G and whatever else the mobile data networks are called… which are poor when moving over certain speeds, which ofc is solved by having special transmitters on trains and such, ofc the newer mobile hardware i’m sure is better at dealing with Doppler shift, duno if thats really still an issue tho… but the range is because of the whole bandwidth thing… the further you want to send a strong complicated signal the louder it has to be…

so low bandwidth simple data vs high speed data the lower bandwidth technology would almost always have a range advantage… much like modern wifi has less range or penetration power compared to older bandwidths… simply due to the whole the higher the frequency the more particles will interact with the environment as n.tesla states it.
thus making them bounce, refract and dissipate into heat / kinetic forces, so slower bandwidth / lower frequencies will always be more easily transmissible.

so yeah i want sms for range and power conservation… imo sms is the only logical way to go, everything else has like 1-4% the range

kevink · November 25, 2020, 8:39am

but why? If you got no internet to receive an email, how are you going to fix your node? Or would you crawl out of your well to fix your node in an area with internet?