Just had 7 hours of downtime, 4 hours of it yesterday.

Sometimes the RAID controller can allow you to switch to HBA mode or, in some cases, set each drive up as a single-drive RAID0 “array” to make the drives visible to the OS. I don’t know how well the second option would work when connected to a different controller though.

When I buy a server I buy an HBA for the drives (if there is none built in). I usually use an LSI SAS1068 for drives <=2TB and a SAS2008 for bigger drives, preferably flashing them with IT firmware.

Thanks, that’s probably what I need for a backup. But can you just plug and play if you have a config saved from the old hardware RAID to load on the new one?
Of all the hardware in my server, the hardware RAID card is what I worry about most. This is why I wouldn’t want to run a node on it.

We are SNOs, not the ambulance.
It is possible to set up a monitoring system that wakes you up with an alarm during the night, so you could go and work on the problem instead of sleeping and going to your main job in the morning. The question is how much Storj should pay SNOs for such reaction speed, and do we really need that?
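For illustration, a minimal sketch of such an alert check (a hedged example only: the node address, port, check interval and SMTP details are placeholders, not Storj defaults):

```python
# Minimal "wake me up" check: poll a TCP port on the node and email an alert
# after a few consecutive failures. Host, port and SMTP settings below are
# placeholders, not anything defined by Storj.
import smtplib
import socket
import time
from email.message import EmailMessage

NODE_HOST, NODE_PORT = "192.168.1.50", 14002   # hypothetical node address
CHECK_INTERVAL = 60                            # seconds between checks
FAILURES_BEFORE_ALERT = 3                      # avoid alerting on one blip

def node_is_up() -> bool:
    try:
        with socket.create_connection((NODE_HOST, NODE_PORT), timeout=5):
            return True
    except OSError:
        return False

def send_alert() -> None:
    msg = EmailMessage()
    msg["Subject"] = "Storage node appears to be DOWN"
    msg["From"] = "monitor@example.com"        # placeholder addresses
    msg["To"] = "me@example.com"
    msg.set_content(f"No response from {NODE_HOST}:{NODE_PORT}.")
    with smtplib.SMTP("smtp.example.com") as smtp:
        smtp.send_message(msg)

failures = 0
while True:
    failures = 0 if node_is_up() else failures + 1
    if failures == FAILURES_BEFORE_ALERT:      # alert once per outage
        send_alert()
    time.sleep(CHECK_INTERVAL)
```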

Most of us sleep longer than 5 hours, and most of us have a main job in the morning. So if the problem starts late at night, just after you went to sleep, the earliest you can start fixing it is the next day after your main job. That means 8 hours of sleep + 8 hours of work = 16 hours of downtime with no possibility to fix anything. Of course, some small things you can do in the morning before the main job, and some things you can do over SSH, but some cases need you to physically be there and spend a few hours. So in my opinion even 24 h of downtime (before DQ) is not enough time for some situations.

But I believe the Storj team already knows this very well, because it has been discussed many times here before. We also all know that sometimes the problem is on the ISP’s side, etc. So I hope there will be a clever and realistic solution for downtime calculation, DQ and so on.


…remembered a joke 🙂

"There are two ways to fix all problems that are on the planet earth:

Fantastic - we will manage everything by our selves.
Realistic - alians will come and sort it out for us."

With the 5 hours of downtime, in some cases we have the same two options 😄


I see a gap between 2 kinds of setups:

1/ Enterprise-grade hardware that costs a fortune, with failover everything and a 100% guaranteed ISP. In this scenario you can’t afford to be disqualified.

2/ Low-end stuff with a recycled PC or Raspberry Pi and inexpensive USB drives, used until they fail and the hosted node is lost, running over a personal ISP.

I decided to go with 2/, but with reliability improvements (a laptop with a decent battery that powers the USB-powered disks in case of a power outage). And I’m ready to lose my low investment in case of a failure.

Fortunately, my FTTH internet is extremely stable and so is my electricity provider.

5 hours a month is very short in case of an issue. Maybe this could be, let’s say, 10 hours for the first downtime in a period, then 5 hours, etc. And if there is no downtime for a certain number of months, reset it to 10 hours, etc.
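Just to make that idea concrete, a toy sketch of how such a sliding allowance could work (the numbers and the reset window are my own assumptions, not anything Storj has proposed):

```python
# Toy model of the suggested allowance: 10 h for the first incident in a
# period, 5 h afterwards, resetting to 10 h after enough clean months.
# All numbers are illustrative assumptions.
FIRST_INCIDENT_ALLOWANCE_H = 10
REPEAT_INCIDENT_ALLOWANCE_H = 5
CLEAN_MONTHS_TO_RESET = 3

def allowed_downtime_hours(incidents_this_period: int,
                           clean_months_in_a_row: int) -> int:
    """Return the downtime allowance for the next incident."""
    if clean_months_in_a_row >= CLEAN_MONTHS_TO_RESET:
        incidents_this_period = 0          # reset after a clean streak
    if incidents_this_period == 0:
        return FIRST_INCIDENT_ALLOWANCE_H
    return REPEAT_INCIDENT_ALLOWANCE_H

print(allowed_downtime_hours(0, 0))  # 10 -> first incident
print(allowed_downtime_hours(2, 0))  # 5  -> repeat incident
print(allowed_downtime_hours(2, 4))  # 10 -> allowance reset
```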


There are still ways around that. For example, you could get a pair of HBAs and connect directly to the drives, building a mirrored array from drives connected to different HBAs, or use them to connect to an external enclosure with redundant controllers that does RAID on its own.

This is how I have set up my main storage server. There are three HBAs, each connected to 8 drives. ZFS is set up so that each vdev is a raidz2 made of two drives from each HBA. One failed HBA would result in a degraded pool, but it would still work, and I could just move the drives to different slots (well, until I fill up the server).
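As a rough illustration of that layout (drive names and HBA numbering are invented; a real pool would be built from /dev/disk/by-id paths), a small script that groups 24 drives from 3 HBAs into raidz2 vdevs of 6, two per HBA per vdev:

```python
# Sketch of the described layout: 3 HBAs x 8 drives, each raidz2 vdev takes
# two drives from every HBA, so one dead HBA only degrades each vdev.
# Device names are invented; a real pool should use /dev/disk/by-id paths.
DRIVES_PER_HBA = 8
HBAS = 3
DRIVES_PER_HBA_PER_VDEV = 2

# e.g. hba0-d0 ... hba2-d7
drives = [[f"hba{h}-d{d}" for d in range(DRIVES_PER_HBA)] for h in range(HBAS)]

vdevs = []
for start in range(0, DRIVES_PER_HBA, DRIVES_PER_HBA_PER_VDEV):
    vdev = []
    for hba in drives:
        vdev.extend(hba[start:start + DRIVES_PER_HBA_PER_VDEV])
    vdevs.append(vdev)

# Each raidz2 vdev survives 2 missing drives, which is exactly
# what losing one HBA costs it.
for i, vdev in enumerate(vdevs):
    print(f"vdev {i}: raidz2 " + " ".join(vdev))
```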

I have an external storage device (Dell PowerVault, don’t remember the model), but I am not sure if it supports big drives. Though I may have to test that. It has two power supplies and two controllers, using dual-port hard drives (or an interposer card for SATA drives).

It is also possible to have a multipath-compatible backplane and connect it to two HBAs, all accessing the same drives.

Multipath SAS is not fun to set up unless you really, really know what the hell you’re doing and can set up the driver/software to support multipath.

Linux, out of the box, does not “zero-config” multipath. I don’t remember exactly, but I don’t think even NetApp does without extensive setup and cabling.

This likely will support >2TB drives if connected to an external HBA. I just sold an MD1000 that I had with a mix of drive sizes up to 12TB without issue.

When it comes to downtime, the real problem for me as a regular SNO is not the hardware, it’s the ISP. In most countries, residential subscribers get the slowest support from the ISP when something breaks. Where I’m at, they usually take about 2 to 3 days to fix the internet connection; there are instances where it even takes 5 days, and I don’t have any control over that.

Since Storj targets home users to become SNOs, slow support by ISPs should be taken into account.


I also don’t see the problem in ordering backup parts or in ISP downtimes, but purely in the fact that I sleep more than 5h every night.


Uptimerobot helps a bit with that: if something happens, you get notified within 5 minutes.

I made a suggestion in another post a while back about possibly having a separate staking account for SNOs, essentially giving them the option to put up collateral promising to repair their node/nodes in the event of prolonged downtime. For the sake of the network, nodes need to be online as much as possible, so the downtime rules can’t be too lenient. However, there should be an option to buy added leniency at a reasonable cost, especially for SNOs seriously dedicated to maintaining their reputations.

Any old hobbyist with a node or two isn’t going to lose much if they lose their nodes, but for people with real money invested there needs to be a way of allowing additional downtime for those inevitable circumstances, otherwise it will discourage people once they get burned a time or two. Think of it like an insurance policy: the more you stake, the longer the downtime you’re allowed, and the cost could be based on overall node reputation, with higher-rep nodes getting a lower insurance cost, or something like that. If your node doesn’t come back online in the allotted time, you can buy more time, or the staked account as well as the escrow account go towards paying the repair costs for the lost node.

Personally I have not yet run into any major issues causing significant downtime, but that doesn’t mean it won’t happen, so this is something I would fully take advantage of given the option.
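Purely to show the shape of that proposal (it is not an existing Storj mechanism, and every number below is made up), a toy model:

```python
# Toy illustration of the staking idea above: more stake buys more allowed
# downtime, and a better reputation lowers the "insurance" price per hour.
# All values are invented for illustration only.
BASE_ALLOWANCE_H = 5          # baseline allowed downtime per month
COST_PER_EXTRA_HOUR = 2.0     # tokens per extra hour at reputation 0.0

def extra_hours_for_stake(stake_tokens: float, reputation: float) -> float:
    """Hours of extra downtime a given stake buys; reputation in [0, 1]."""
    price = COST_PER_EXTRA_HOUR * (1.0 - 0.5 * reputation)  # high rep -> cheaper
    return stake_tokens / price

def total_allowance_hours(stake_tokens: float, reputation: float) -> float:
    return BASE_ALLOWANCE_H + extra_hours_for_stake(stake_tokens, reputation)

print(total_allowance_hours(0, 0.9))    # no stake: just the baseline
print(total_allowance_hours(20, 0.5))   # some stake, average reputation
print(total_allowance_hours(20, 1.0))   # same stake, perfect reputation
```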


That sounds interesting.

I agree with that. However, if Storj specifically targets home users rather than datacenters to run the nodes, they cannot expect datacenter reliability. While I can make my servers run relatively reliably, unlike a datacenter I do not have that many protections against rare events that cause long downtimes.

Normally, if a client wanted the service to be that reliable (5 hours max downtime, no allowance for planned downtime), I would recommend running the service in a cluster in a datacenter that has multiple ISPs (preferably ones that do not have active equipment in the same building), multiple power feeds and a generator. I would probably recommend having a spare server in another datacenter as well.

In my particular case, the server rebooted (instead of freezing this time) and showed an uncorrectable ECC error in memory. I replaced that RAM stick and have not had a problem since. Maybe that was the real problem instead of the backplane…

While I appreciate the discussion taking place here, I am a little surprised that it’s happening in complete isolation, ignoring the details of the new uptime measurement system.

Currently the threshold is set to allow for up to 288 hours of downtime per month. And while I’m sure that threshold will be made more strict in the future, it seems pretty clear from all recent communications that we’re never going to be allowed a mere 5 hours of downtime. The possible infrequency of audits doesn’t even allow this system to be precise enough to determine 5 hours of downtime accurately. As a result, Storj will have to allow for more downtime and incorporate a wide margin of error into those rules as well. I estimate that we will be talking about a maximum allowed downtime measured in days rather than hours when this system is finally tuned.
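For scale, assuming a 30-day month, that threshold works out like this (simple arithmetic, not an official Storj formula):

```python
# 288 allowed downtime hours against a 30-day month (simple arithmetic,
# not an official Storj formula).
hours_in_month = 30 * 24                       # 720
allowed_downtime = 288
print(allowed_downtime / hours_in_month)       # 0.4 -> up to 40% downtime
print(1 - allowed_downtime / hours_in_month)   # 0.6 -> 60% uptime required
print(allowed_downtime / 24)                   # 12.0 -> 12 full days
```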
Additionally, too much downtime won’t immediately lead to disqualification, but rather just suspension. You then get a long grace period of a week to correct any problems found, after which your node is monitored for another 30 days to see if the issues have been resolved. I would say this gives all nodes plenty of time to resolve issues and recover from the suspension without permanently losing the node.

That kind of makes this discussion moot.


Yeah, the new rules look ok, as long as they do not make them too strict.

I found this a real relief to learn, after all those months of uncertainty about how they would implement the dreaded “5h downtime limit”.

Well that changes things then. Been quite busy lately so I hadn’t had a chance to read up on the most recent updates yet. I just saw this topic and blindly started responding. Thanks for the info!

Yeah, my surprise was mainly aimed at @Pentium100, who I assumed was aware of this new system. Though I could have been wrong about that.

I saw something about uptime in that changelog, then saw “this will be changed”. I did not read it in detail at the time, and thought I would share my incident and the point that Storj should not expect datacenter-like reliability from a home system (even though my system is almost like a datacenter, the key word being “almost”).

Because really, my thought at the time was “hey, if the uptime limit was enforced my node would be dead now”.