Downtime of more than 5 hours after years of working

StoreMe2 · September 18, 2020, 10:46am

Hi, I have been with Storj as a node operator since V2 (3.5 years). Last night while I was asleep my server froze, and I had to restart it today. I don’t know the exact offline time, but it was certainly more than 5 hours.

So if Storj really disqualifies nodes for that, I can say that I am out of this project for good.

It was fun to be involved in Storj and I did my best with a UPS, reliable server hardware … In the 3.5 years I was offline for more than 5 hours 2-3 times because of ISP errors, and I was not disqualified because disqualification was not activated.

It’s not possible for an average person, who sleeps and works, to make a setup so reliable that a PC freeze or an ISP outage never causes trouble for me as an SNO.

In the long run every SNO has some issue with an ISP or hardware error and can’t control this at all.

After all these years I am a bit disappointed, having worked 3.5 years on this project with a good server and nearly 100 % reliability until now.

So if they disqualify my server, I am gone forever and never looking back, because then there is no sense in what I have done over all these years, and there is no way to do what they want in the long run.

I don’t know how this affects the network, but in the end there will be only real professional setups in central server farms. The average Joe (even with a UPS, server hardware …) is kicked out of this project in the long run.

Pentium100 · September 18, 2020, 10:49am

I had a similar incident recently.

However, it looks like Storj will be more lenient with the new uptime monitoring system.

StoreMe2 · September 18, 2020, 11:06am

Yes. I read your post some time ago and thought exactly: OK, that could be me too, and now I have the same error.

So in 3.5 years my server had about 15 hours of downtime all in all.

Relative to the total hours, that is: (15 / (365 × 24 × 3.5)) × 100 ≈ 0.0489 %

So I have an uptime of about 99.951 %, but that is not good enough for Storj because my downtime wasn’t spread over the years. The offline time fell on only 2-3 days (1 day in 2019 and 2 days in 2020), so after 1277 days of working at 100 % I would get f… up for this.

Time will tell how Storj handles SNOs, but I am not ready to put more into this project if they kick me out.

I think an uptime of about 99.951 % over 3.5 years is really more than enough for an average Joe.
And there are so many chances to fail without any control. Hardware and software errors can occur suddenly and you can do nothing.

Right now I think I will not be disqualified, but I don’t know if I want to run my server any more when I know that only 1-3 days can destroy the reliability record of a good server, even when it is doing fine with about 99.951 % uptime over 3.5 years.
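As a check on the arithmetic, here is a minimal sketch that recomputes the figures from the numbers given in this post (15 hours of downtime over 3.5 years). The function name `uptime_percent` is only illustrative. It comes out at roughly 0.049 % downtime, or about 99.951 % uptime.

```python
HOURS_PER_YEAR = 365 * 24  # leap days ignored, as in the post


def uptime_percent(downtime_hours: float, years: float) -> float:
    """Uptime as a percentage of all hours in the given number of years."""
    total_hours = HOURS_PER_YEAR * years
    return 100.0 * (1.0 - downtime_hours / total_hours)


total_hours = HOURS_PER_YEAR * 3.5          # 30660 hours in 3.5 years
downtime_percent = 100.0 * 15 / total_hours

print(f"downtime: {downtime_percent:.4f} %")
print(f"uptime:   {uptime_percent(15, 3.5):.4f} %")
```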

Pentium100 · September 18, 2020, 11:28am

Yeah, if Storj kept the 5 hours requirement then a cluster would be the only reasonable answer.

However, it looks like they have figured out that it is unreasonable to expect such high reliability from a server that someone is running at home with a single power line, a single internet line, and no employees who could repair the server 24/7.

baker · September 18, 2020, 1:06pm

Hi StoreMe2,

I think you might be getting ahead of yourself. Currently your node will not be disqualified for downtime. And when disqualification is enabled for downtime in the future, they will (as of the latest information) allow up to 288 hours per month as a starting point, although the current expectation is that after initial testing the allowable downtime will be somewhere around 36 hours per month. You can read about it here:
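To put those allowances in perspective, a rough sketch that converts a monthly downtime allowance into the minimum uptime percentage it implies. It assumes a 30-day month, which is my simplification and not something stated in the announcement; the helper name is illustrative.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month (720 hours)


def min_uptime_percent(allowed_downtime_hours: float,
                       hours_in_period: float = HOURS_PER_MONTH) -> float:
    """Minimum uptime (in percent) implied by a downtime allowance."""
    return 100.0 * (1.0 - allowed_downtime_hours / hours_in_period)


# 288 h is the starting allowance, 36 h the expected later value (from the post).
for allowed in (288, 36):
    print(f"{allowed:>3} h/month allowed -> "
          f"at least {min_uptime_percent(allowed):.1f} % uptime")
```

Under that assumption, 288 hours per month corresponds to about 60 % uptime and 36 hours per month to about 95 %.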

So please keep your node online! I too have been around since the v2 days. I think this is a great project otherwise I wouldn’t have kept up with it for so long. It seems like you think so too, so don’t throw in the towel yet! Bottom line is that your node should be just fine if the only problem was being offline.

andrew2.hart · September 18, 2020, 3:53pm

I understand that you are upset that you had some downtime. (how does it go? denial, anger, acceptance?)
3 years is a good uptime even by datacentre standards. Well done!

Rene · September 18, 2020, 5:22pm

What is your problem anyway? I don’t get it.

donald.m.motsinger · September 18, 2020, 5:32pm

If it happened to me, I would spin up a new storagenode again the next day. What else can you do with all the empty storage space? It was said from the beginning that you get disqualified after 5 hours of downtime, and you thought it was still worth doing.

And as others already pointed out, disqualification for downtime is disabled right now and the 5 hours will be increased.

BrightSilence · September 18, 2020, 5:33pm

There is no problem from what I can tell. But I can understand the fear of losing your investment over a single issue that couldn’t be prevented. There is no reason to fear though: the topics already linked explain how the new uptime system will allow for more downtime, and even if you still don’t manage to stay within that, you get plenty of time to correct the mistake and recover from it.

andrew2.hart · September 18, 2020, 7:40pm

I think you don’t understand until you get that gut wrenching, sickening feeling that you have been down for hours and didn’t know.

It takes time to get back, trust me.

OK, I’m a sad geek.

Edit: here was my moment: Watchtower killed my storagenode

Pentium100 · September 18, 2020, 7:42pm

Yeah, then there is the feeling of “why bother, it’s going to fail in a year or two again anyway” if the cause was something that was not reasonably preventable.

BrightSilence · September 19, 2020, 4:50am

Oh, I’ve been there. This was back when uptime was actually still used to suspend nodes. I had just arrived at work when I received the Uptime Robot notification. The system had become completely unresponsive, so I couldn’t fix the issue remotely. Over 7 hours of downtime. I now have a smart plug on the system so I can use that as a last resort to revive it remotely. Of course, these days I work at home to avoid the human malware, so I don’t need it anymore.

Floxit · September 19, 2020, 11:48pm

A solution would be to enable notifications on your phone or tablet to wake you up when it happens. You can configure that with several apps on Uptime Robot, even if, in some cases, the relay could fail. I haven’t done it myself, but I guess this is the way to go in the future. Also, it could wake you up for a 5-minute drop that your machine or connection would recover from on its own, wasting your sleep, so yeah…
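For anyone who prefers a self-hosted check, here is a minimal sketch of the same idea. It is not how Uptime Robot works internally: it simply probes the node’s TCP port from another machine and only raises an alert after several consecutive failures, so a short drop does not wake you up. The threshold of 3 and the function names are my own assumptions.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def should_alert(results: list, threshold: int = 3) -> bool:
    """Return True only if the last `threshold` probes all failed."""
    recent = results[-threshold:]
    return len(recent) == threshold and not any(recent)
```

A scheduler (cron, for example) could call `is_port_open` once a minute against the node’s public address and port (28967 in a default storagenode setup, adjust to your own), append the result to a list, and send a notification when `should_alert` returns True.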

SGC · September 20, 2020, 9:30am

I was tinkering with my new SLOG / L2ARC PCIe SSD and ofc stalled and crashed the server multiple times, mainly due to some 4Kn to 512B incompatibility over ZFS.

This made the pool basically inaccessible at times and stalled the entire kernel… Oddly though, the storagenode seemed to run fine, but that aside… I have a BMC hardware watchdog that I was expecting to reboot my server in case of issues, but it didn’t seem to catch the issue…

So I guess I will need to take a look at my watchdog again… But yeah, I’m also prone to the same issues even with a fairly HA setup… even though my uptime is nothing like 99.9%.
I have done a few 3-5 week runs lately, but my setup is also only about 7 months old…
and I’m still in the process of getting everything running as it’s supposed to…

I might try to decrease my watchdog timeout, or maybe have a script that checks access to the pool and, if access is lost, stops responding to the watchdog, which will then do a hard reset to try to restore the system to a working state.
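A rough sketch of what such a script could look like, assuming the hardware watchdog is exposed through the standard Linux watchdog device (`/dev/watchdog`), which is fed by writing to it. Whether a given BMC watchdog shows up that way depends on the driver, so treat that as an assumption. The probe runs in a separate thread so that a hung pool cannot block the loop itself.

```python
import os
import threading
import time


def pool_is_accessible(path: str, timeout: float = 10.0) -> bool:
    """Return True if listing `path` completes within `timeout` seconds."""
    result = []

    def probe():
        try:
            os.listdir(path)
            result.append(True)
        except OSError:
            result.append(False)

    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout)  # a stalled pool leaves `result` empty
    return bool(result) and result[0]


def feed_while_healthy(check, pet, interval: float = 5.0, sleep=time.sleep):
    """Call pet() every `interval` seconds for as long as check() is true.

    Returns as soon as check() fails, so the watchdog stops being fed
    and the hardware resets the machine when its timer expires.
    """
    while check():
        pet()
        sleep(interval)
```

In use, `check` would be something like `lambda: pool_is_accessible("/tank")` (a hypothetical mount point) and `pet` would write a byte to an open `/dev/watchdog` handle. The interval has to be shorter than the watchdog timeout, and the probe timeout should be too.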

Anyway, even though my hardware watchdog hasn’t helped me yet, I would still suggest that those with the hardware to do so use features like this to help their HA setups.

It’s always the things that we don’t think about that come back to bite us…
My most recent failure was a reboot after I had removed an SSD from the onboard SATA controller, which made my BIOS cycle to the wrong boot drive, and then the boot process ofc stalled… Luckily I caught it after a couple of hours… I will also need to get that fixed so it will not be a problem in the future.

jocelyn · September 21, 2020, 9:43pm

Hi @StoreMe2, I was wondering how you’re doing on the SNO side. Any updates on your situation?

madbitz · September 22, 2020, 8:46pm

Don’t worry, I had a power failure and had forgotten to enable the “last power state” setting. I also had another outage where my kids turned off the PC and I didn’t notice for at least 8 hours. But I am still showing 100%, not sure why or how. I used to get email notifications when the PC was down, but for the last 2 or 3 months or so these have stopped. Where or how can I turn this back on? It was handy.

Alexey · September 22, 2020, 9:42pm

https://documentation.storj.io/resources/faq/check-my-node

StoreMe2 · September 22, 2020, 11:01pm

Hi,

Actually, my server got stuck again. Today I restarted it, removed the container with docker rm storagenode, and set it up again from scratch. I also deleted Watchtower and restarted it from scratch.

My server is behaving strangely. Nothing happens. It looks like a freeze, but there is no error. I can tell it is frozen because the red HDD LED is dark. During normal operation the red HDD LED is always on, showing activity, but then suddenly the red LED does nothing. Then I know that the HDD is not working, and then I know Storj is not working.
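The dark-LED symptom can also be watched for in software. This is a minimal sketch, assuming a Linux host, that reads the completed read and write counters for one disk from `/proc/diskstats` and compares two samples. The device name `sda` is a placeholder. If the whole kernel is frozen the script will freeze too, so it only helps when the machine is still partly alive.

```python
def io_counts(diskstats_text: str, device: str):
    """Return (reads_completed, writes_completed) for `device`.

    In /proc/diskstats the device name is the 3rd field, reads
    completed the 4th and writes completed the 8th.
    """
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 8 and fields[2] == device:
            return int(fields[3]), int(fields[7])
    raise KeyError(f"device {device!r} not found")


def disk_was_idle(before, after) -> bool:
    """True if neither counter advanced between the two samples."""
    return after == before


def read_diskstats(path: str = "/proc/diskstats") -> str:
    """Read the current kernel disk statistics (Linux only)."""
    with open(path) as f:
        return f.read()
```

Sampling `io_counts(read_diskstats(), "sda")` twice, a minute or so apart, and logging a timestamp whenever `disk_was_idle` is true would at least show when the disk stopped, which can then be matched against the system log.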

I will watch this and try my best, but I think in the long run I am out of this game. Shit happens.

If I can’t find the error and get disqualified, I am sadly done.

Pentium100 · September 23, 2020, 6:16am

Try disabling C states in BIOS. It helps sometimes, especially if you use virtualization.

SGC · September 23, 2020, 6:28am

Usually when I troubleshoot I ask myself… what changed… because stuff in an unchanging state rarely has issues, if it worked to begin with…

This is also why I rarely change too many things at one time, because then it’s much more difficult to locate where a problem was.

If you don’t know what’s happening on the server when it stalls, I would shut down the storagenode, then troubleshoot the server until it seems stable again… You can have days of downtime without even getting a suspension… so better to just take some downtime and solve the problem…

instead of having the storagenode do something stupid because the server is half working and tries to run.

Do you have your OS on a mirror? If not, do all your cables work?
Are you sure the OS drive is okay? I like the mirrored drive setup for the OS, it’s very secure and stable…

After this long a run time your connections may be affected… I would basically pull out everything that is easy to pull out and reattach it… simply to see if something rattled loose… ofc that also depends on how big of a project that is.

Anyway, good luck, and do just accept that you will need a bit of downtime instead of killing the node by trying to stay online while troubleshooting a buggy system.

Ofc you can also try to verify that the stall doesn’t put the storagenode into a self-destructive state, and if it doesn’t, then you could still keep the storagenode up while troubleshooting…
but that certainly requires a bit more finesse and would have an inherently higher risk of causing some kind of damage to the node…

But that’s often my modus operandi: verify that it doesn’t run off the rails when the problem occurs, most often by keeping a live node log running. And if it’s all fine then I don’t worry too much about it…