Blueprint: Downtime Disqualification

I’m going to have to emphatically disagree here. At worst, we’ve been paid for exactly the service we’ve provided, the same way we would be in a more mature production situation. At best, we’ve been paid 5x as much. Traffic has been artificially increased by extensive testing, and earnings have been boosted by rather massive surge payouts. If your node hasn’t been profitable over the last few months, you need to seriously reconsider how you are running it.
In my case, I’ve made enough money to buy an additional 12TB HDD, making the space I supply to the network essentially free with massive room to spare. And with the payout that remains I can almost buy another one.
In contrast, Storj has been paying for their own test traffic several times over with surge payouts, without receiving any customer payments until now. It’s not even close to the same thing.

10 Likes

Or, consider longer periods before kicking the node. While my setup is pretty reliable, I still do not have everything redundant; for example, I do not have a generator, and during quarantine I may not even be able to rent one (even if the power company warned me in advance). It means I may possibly have an outage of a day or more (a server or switch fails while I am away on vacation, etc.), but that should only happen once every two years or more (99.3% uptime over a year means 2d 13h of downtime in total).
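
For what it’s worth, the downtime budget math is easy to check; a quick Go sketch (nothing Storj-specific, just the arithmetic):

```go
package main

import (
	"fmt"
	"time"
)

// downtimeBudget returns how much downtime a given uptime percentage
// allows over the given period.
func downtimeBudget(uptimePercent float64, period time.Duration) time.Duration {
	return time.Duration((1 - uptimePercent/100) * float64(period))
}

func main() {
	year := 365 * 24 * time.Hour
	// 99.3% uptime over a year allows roughly 2 days 13 hours of downtime.
	fmt.Println(downtimeBudget(99.3, year).Round(time.Hour)) // 61h0m0s
}
```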

I have a good example going on right now. It’s not a storage node, but a different server in a DC. The DC is having issues in one area; several racks have been offline for over ten hours now. Things are going to happen that cause long outages on single servers: an AC unit dies, the air handler has issues, a PDU has issues, bad memory, a bad motherboard, a bad power supply, etc. Take that into a consumer setting, where most ISPs take a minimum of three days for a service call unless you are paying for a business line in a residential area. Even with that, SLAs still run from several hours up to a couple of days. Two days may sound like a long time, but it quickly adds up over time as things happen. You hope for the best, but you also have to be reasonable with expectations of the underlying infrastructure.

I think downtime disqualification is a bad idea. (I recently had 10 hours down due to watchtower update failure - and NOT disqualified)
Could “bad” storagenodes be moved to a cheaper low availability tier instead?

1 Like

Multihomed host - a host that has multiple public IP addresses and can be reached through any of them. That is what my original message was about.

Multihoming an IP is more about renting an autonomous system (AS).
And I doubt there will be even one farmer with an AS, because an IPv4 AS is very expensive and there are no providers offering this service to home users. So “homed” is the wrong part of the word.

No home user can avoid 5 hours of downtime for years. That is why I talked to you about expecting massive GE (graceful exit) on old nodes.

Recently I was DQed due to a config error. I forgot to run daemon-reload after changing the storage path. And due to an error on the storjlabs side, I received a warning message from my script too late, right after the DQ on the europe satellite.

We are all human and make mistakes. Not all of them can be fixed within 5 hours. This SLA is not compatible with home users and plain payouts.

1 Like

To be fair, that’s what the whole suspension system is about: allowing you to recover from such errors.

A similar system is being worked on for audits, which would have helped in your case.

2 Likes

Many people can and do make services available from home internet accounts with four or five nines of uptime over a period of years. I’ve done it myself in multiple instances. The thing that is hard to do is guarantee low uptime, when you’re reliant on your ISP and on hardware not dying. As @BrightSilence said, that’s the point of this exercise: we are instituting a system to make it so the penalty is limited when problems do arise.

I don’t know what the final exact requirements will be. What I do know is that we will of course not have requirements that “no one” can meet. If our requirements are not reasonable for at least thousands of node operators, then our service will fail, and we don’t want that.

5 Likes

*guarantee high uptime :wink: low is not a problem :smiley:

1 Like

lol good catch :grimacing:

Every update, I read in chat about watchtower or the Windows updater leaving some nodes down. Every update, I see peaks in repair traffic. In March I received 1.3TB of repair ingress. I disagree with your points.

I can’t believe you’re even considering this when you have not provided a manager that keeps one apprised of how their nodes are doing. We’re just flying blind most of the time. I have to remote into each node to see your little uninformative Linux dashboard, and I have no other reason to do that daily or even weekly for most of my computers. This, combined with the limited number of shards one can transfer with a single IP address, is a recipe for diminishing interest in STORJ.

Please, try this one: https://documentation.storj.io/resources/faq/check-my-node
Also, there are plenty of other monitoring tools: https://forum.storj.io/tag/monitoring

4 Likes

As some have stated, suspension mode should already allow nodes to survive downtime, provided we set logical parameters and their downtime was reasonably temporary. This doesn’t preclude other potential measures to protect SNOs, like multiple IPs or planned downtime, however those are going to require further discussions and designs.
Some have asked what particular parameters we might use. I think that is open for discussion. What would you like them to be? Personally, I was thinking of a tracking period of 30 days with a 7 day grace period. The amount of allowed downtime we would probably set fairly high to begin with, maybe half the tracking period. What the ultimate long term amount of allowed downtime would be, I’m not sure. This is all presuming we even decide to implement this particular design as well.
Also, I like BrightSilence’s idea here Blueprint: Downtime Disqualification
This way if a node has fixed whatever issue it was having it doesn’t have to wait to come back, but given that we will evaluate it at the end of the “under-review” period it incentivizes continued uptime.
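
To make those numbers a bit more concrete, here is a rough sketch of what such a check could look like. The names and values are just placeholders pulled from this discussion, not a final design:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical parameters for illustration only; the real values and the
// design itself are still under discussion.
const (
	trackingPeriod  = 30 * 24 * time.Hour // window over which downtime is measured
	gracePeriod     = 7 * 24 * time.Hour  // new nodes are not evaluated during this time
	allowedDowntime = 15 * 24 * time.Hour // e.g. half the tracking period to begin with
)

// shouldSuspend sketches the decision: a node older than the grace period
// that exceeds the allowed downtime within the tracking window gets
// suspended (not immediately disqualified).
func shouldSuspend(nodeAge, downtimeInWindow time.Duration) bool {
	if nodeAge < gracePeriod {
		return false
	}
	return downtimeInWindow > allowedDowntime
}

func main() {
	fmt.Println(shouldSuspend(60*24*time.Hour, 10*time.Hour))    // false
	fmt.Println(shouldSuspend(60*24*time.Hour, 16*24*time.Hour)) // true
}
```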

1 Like

The backoff interval starts at 1 second. After each failure, the interval is multiplied by 2, until it reaches a maximum of the contact chore’s own interval, which is set to 1 hour by default

Why was it set like that instead of retrying the connection every minute or so?

The backoff interval starts at 1 second. After each failure, the interval is multiplied by 2, until it reaches a maximum of the contact chore’s own interval, which is set to 1 hour by default

No randomization? Then I fear the exponential back-off is counterproductive.
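
By randomization I mean adding jitter to the retry delay. A minimal Go sketch of the idea (not the actual contact chore code):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// backoffWithJitter returns the delay before retry number `attempt`
// (starting at 0): the deterministic 1s, 2s, 4s, ... schedule capped at max,
// with a random ("full jitter") delay drawn between 1ns and that cap.
func backoffWithJitter(attempt int, base, max time.Duration) time.Duration {
	d := base << uint(attempt) // base * 2^attempt
	if d <= 0 || d > max {     // <= 0 guards against overflow for large attempts
		d = max
	}
	return time.Duration(rand.Int63n(int64(d)) + 1)
}

func main() {
	for attempt := 0; attempt < 15; attempt++ {
		// base of 1 second, capped at the 1 hour contact chore interval
		fmt.Println(backoffWithJitter(attempt, time.Second, time.Hour))
	}
}
```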

Presumably to avoid likely fruitless contact attempts. However, since it’s in the SNO’s best interest to successfully connect ASAP, I can see a valid argument for changing that.

Yeah, but it’s not like the attempt costs anything. If a browser did that, the user would just click refresh instead of waiting 15 minutes for another attempt. I think that once or twice a minute would not result in high CPU usage or high network load.

I sometimes see a similar problem with DHCP - some devices (I think that also includes Windows) just “give up” after a number of failed DHCP discoveries, and then, when the problem is fixed, the user has to reboot the device for it to start working again. Other devices handle this better and keep trying to get an IP every few seconds indefinitely.

@Pentium100 I hear you. I’ve proposed making a change and am waiting for any counter-arguments.
@Toyoo How so?