Maximum downtime when running only during the daytime

Hey all,

What is the maximum downtime I can afford before getting disqualified?? My Dell server's iDRAC is faulty and I might need to fly in a new motherboard to replace it, or find a new server for the nodes :frowning: … the fans are going crazy

If I run only 13 hours a day and shut it down for 11 hours, how soon before I get disqualified…??

99.5% uptime requirement…
dunno if you can get DQed for downtime yet… but if not, it's on the horizon.
https://documentation.storj.io/before-you-begin/prerequisites
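For scale, a quick back-of-envelope (my assumption here: the 99.5% is measured over a rolling 30-day month):

```bash
# 99.5% uptime over a 30-day month leaves ~3.6 hours of allowed downtime:
echo "30 * 24 * 0.005" | bc -l    # => 3.600

# running 13h a day is only ~54% online -- nowhere near 99.5%:
echo "13 / 24 * 100" | bc -l      # => ~54.17
```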

I don’t think downtime DQ is enabled yet.
Besides, my understanding is that when it is, it will allow many-hours-long downtimes as long as they occur rarely. And the node would get suspended for 7 days if the downtime score gets too low, to give the SNO one last chance to get things right before disqualification.

However, if you keep having regular downtimes (roughly 50% of the time from what you said), your online score would almost certainly drop and stay around 50%, and my bet is that when downtime DQ is enabled you would get suspended immediately at best, or DQed at worst…

In short: if you have no other way, do that until you fix the issue, but it's better to avoid it if you can.


my problem isn't that he cannot start a node, the problem is that storagenodes make better money over longer periods… so in a year it will be nice and settled in, and then BOOM, DQ, because maybe he didn't notice the setup changed…
so unless he intends to let it run continuously, storj isn't really for his use case.

also i have seen people's uptime scores drop from rather short downtimes, so i would expect the downtime DQ to be around the corner…

so really it's near-24/7 or don't even try; anything else is simply a waste of effort


For the long run, obviously.

But because @zethsqx said:

I assumed it was a temporary solution.


i totally missed that… lol, not sure how i managed that… my bad
yeah, the time allotment before suspension for downtime is currently 12 days, so 288 hours
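(quick sanity check on those numbers, assuming the allotment really is a flat 288-hour budget:)

```bash
echo "12 * 24" | bc       # 12 days => 288 hours
echo "288 / 11" | bc -l   # at 11h of downtime per day, the budget lasts ~26 days
```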

haha thanks all… i guess i need to consider graceful exit of the nodes as well.
started 4 nodes in April, so it might be more worthwhile to do that. :slight_smile:

Why would you do that?

Exchanging a motherboard takes 2h max, so that's 2h of downtime… am I not getting something here?

haha it's an old dell server + no critical workload running on it except storj
so the parts are expensive + shipping to my country.
might not be practical to invest in it just to maintain the storj workload. :slight_smile:

can't you just turn it off? it's remote access… doesn't really seem like a critical part. some stuff may be a bit more impractical, but that's how it is with old broken stuff.

also you might not be stuck with one motherboard; ofc it's best to get one that fits the chassis, so going with the same model is simpler… but even with the exact same layout there will be many different models at different prices and with different availability.

so yeah, certainly make sure you make a good choice; also wattage usage can be pretty high on old servers

yeah… i kept trying to fix it until now the whole server refuses to boot up because the idrac is undetected… damn… :frowning:

anyway, i have 4 nodes.
previously, i backed up the credentials for all 4 nodes into storj1.tar, storj2.tar, storj3.tar, storj4.tar
each backup tar file contains
/root/.local/share/storj/identity/storagenode/

  • ca.key
  • identity.key
  • ca.1234567890.cert
  • ca.cert
  • identity.1234567890.cert
  • identity.cert

assuming i will be moving all 4 disks to another server, i now have a problem.

  1. how do i identify which disk is for which storj node?
  2. is there a ca.cert file or identity file inside the folder storing the data?

No, not if you didn't place it on the disk with the data (the recommended setup).

sorry, i don't quite understand.
so there's no way to match a disk to the correct node just by looking at the data files?

p.s. nvm, just thought of a solution. wouldn't i be able to just check the date the node was started? :slight_smile:

That should work. The earnings.py script should provide you with the information. So if the nodes were started a few months apart, you should be able to match the identity to the data.
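If you want to check directly on the disks too, something like this sketch might work (the /mnt/storj* mount points are placeholders, and I'm assuming the usual storagenode layout with pieces under storage/blobs/):

```bash
# print the oldest stored piece per disk; the node with the oldest piece
# is most likely the one that was started first
for d in /mnt/storj1 /mnt/storj2 /mnt/storj3 /mnt/storj4; do
  oldest=$(find "$d/storage/blobs" -type f -printf '%T+ %p\n' 2>/dev/null | sort | head -n 1)
  echo "$d -> oldest piece: $oldest"
done
```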

Guess you didn't save the docker run scripts and the fstab/mount points of the HDDs.

PS: I recommend putting the identity on the same HDD as the data so that won’t be a problem in the future.
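For reference, a sketch of what that setup looks like (the wallet, address, and paths are placeholders; the mounts follow the standard docker docs, with both identity and data on the same disk):

```bash
docker run -d --restart unless-stopped --stop-timeout 300 \
  -p 28967:28967 -p 14002:14002 \
  -e WALLET="0xYOURWALLET" \
  -e EMAIL="you@example.com" \
  -e ADDRESS="your.ddns.example:28967" \
  -e STORAGE="2TB" \
  --mount type=bind,source=/mnt/storj1/identity,destination=/app/identity \
  --mount type=bind,source=/mnt/storj1/data,destination=/app/config \
  --name storagenode1 storjlabs/storagenode:latest

# and a matching fstab line per disk so the mount point survives reboots
# (UUID is a placeholder, check yours with blkid):
# UUID=xxxx-xxxx  /mnt/storj1  ext4  defaults  0  2
```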

isn't the DRAC a separate module? pulling it out and putting in a new one might work.

it's actually a good question, i'm not sure how to correlate the identity files to the storagenode… seems like the identity should be stamped onto a storagenode when it first boots with one, so it's clear which identity belongs to which node… and actually it shouldn't be allowed to run with the wrong identity at all.

but for now… great question… move each identity folder into the storj dir it belongs to…
and it has to go with the correct storagenode, or else the node will end up DQed: the satellite can't find the files it expects if it thinks the data belongs to another node, and the identity is what identifies each node.
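once the pairing is confirmed, something like this (a sketch: the mount points are placeholders, and the --strip-components count depends on how the tar was created, so verify with tar -tf first):

```bash
# unpack each identity next to its data so they can't get separated again
mkdir -p /mnt/storj1/identity
tar -tf storj1.tar | head            # inspect the stored paths first
tar -xf storj1.tar -C /mnt/storj1/identity --strip-components=6
# 6 strips the root/.local/share/storj/identity/storagenode/ prefix
```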

thanks all for your input. will migrate soon

@sgc about the drac… mine is a dell r730. they baked the drac into the mb… and the drac controls the fan speed… what a lousy design… haha…

damn… how did you manage to break it anyway… isn't there something you can do, like reset it, rewrite it, or update its firmware to maybe fix the issue… also remember stuff like this could in theory require certain things to be operational… like, say, if it controls your fans.

for instance, if you used 3-pin rather than 4-pin fans, you might not get the RPM reading back from the fans, and then the drac might refuse to boot…

generally it's pretty rare that stuff baked into the mobo dies… not saying it's impossible; i'm sure there was a good reason for the drac being replaceable in the past… which i believe is also common with other brands, dunno if that was because they broke or because people might want different versions or custom setups.

but there would have been a reason, and it's far from the first time i've heard about these types of modules having trouble…
but if mine wasn't working, i would be very mindful of what i had changed… like, my BMC or whatever it's called doesn't work on linux because i need to install java or give it access to java or something… i dunno, it's kinda weird; i've been trying to make it work, and it does… but i just cannot truly remote into the computer. however, if i install windows then it works fine and i can hardware-remote the server during boot, even if it is a very old system with some general weirdness about it…
like when i reboot the system, it will not display while counting ram… which is fine… until one disables quick boot and counting ram takes like 10-15 minutes… and then sometimes it will seemingly lock up during reboot and forget to send the monitor output over the remote access

it works, but it's annoying and difficult to work with, and things take twice as long if not 3 times… still, it can work in an emergency, once i got windows installed…

so yeah, long story short: be sure you didn't change something that pissed off the drac
because it's a server and you might not reboot it often, it could have been quite a while between making the actual change and noticing the effect of it not being able to boot.

and it doesn't take long to read up on the basic requirements of the drac