How to perform maintenance (with downtime)

I’ve managed to get a node online and running for a couple of days now. Things are going well. I’d like to make some changes, however, namely to resize the drive that I’m using (from 1.5TB to 1.0TB) and move it to a new RAID 5 array.

Is there a way for me to have about an hour’s worth of downtime, or will this destroy my current reputation? Any guides or recommendations for how to go about node maintenance?

I’m not really asking for technical help in how to make these changes, but instead what amount (if any) of downtime is acceptable for a node.

Currently downtime is not used to pause or disqualify a node. The current formula for reputation recovers quite fast from 1 or 2 hours of downtime. So my advice would be to keep it as short as you can, but not worry too much about it.
The minimum uptime requirement in the agreement is 99.3%, but as I mentioned, this is not currently enforced.
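For a rough sense of scale: 99.3% uptime over a 30-day month (720 hours) allows about 0.7% × 720 ≈ 5 hours of downtime, so an hour of planned maintenance is well within that margin.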

How can you tell what your reputation is?

Wow, that’s surprisingly difficult. At first blush I’m not able to run the commands (jq: command not found). I’ll have to revisit this later.

Thank you for your help, BrightSilence.

It is a little difficult for now, but they have been working on a dashboard that will use this API to show all this info in a web interface. If this is a bit too complicated, the dashboard should be available very soon, so it might not be worth diving into the API just yet.
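If you do want to poke at the API in the meantime, the general shape is a curl call piped through jq. The port and paths below (14002, /api/sno/) are assumptions based on the node’s local dashboard address and may differ for your version and config:

  # jq is in most distro repos, e.g. apt install jq
  # Node overview (satellites, disk and bandwidth usage)
  curl -s http://localhost:14002/api/sno/ | jq .
  # Per-satellite stats, including audit and uptime counts
  curl -s http://localhost:14002/api/sno/satellites | jq .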

Yeah, I’ll just wait for that.
Thanks again for the help.

Good thing you waited!

It’s working!
That’s perfect. Thanks again!

As a side note, I’d definitely suggest using LVM for your volumes (assuming Linux here). If using LVM, the only part of this process that would require downtime is shrinking the volume because that can’t be done while it’s mounted. The rest of it (including moving the volume into RAID5) can be done entirely online while the node is fully available.

  1. Resize filesystem (offline).
  2. Create a RAID5 device with size >= the size of the volume.
  3. Create an LVM PV on the RAID5 device.
  4. vgextend the volume group holding the data LV with the new PV.
  5. pvmove the existing volume onto the new PV.

This can be done even if the volume is currently on a disk that will be in the RAID5; you can operate the array degraded until the move is complete.
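To make that concrete, here’s a rough sketch of those steps with md-raid and LVM. The device names and the vg_storj/lv_storj volume group and LV names are placeholders, so substitute your own; if one of the disks is still holding the current volume, create the array with the literal word missing in its place and add the disk back after the pvmove.

  # 2. Create the RAID5 array from three member disks
  mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd

  # 3. Make the array an LVM physical volume
  pvcreate /dev/md0

  # 4. Add the new PV to the volume group that holds the data LV
  vgextend vg_storj /dev/md0

  # 5. Move the LV's extents onto the new PV while it stays mounted
  pvmove -n lv_storj /dev/sdX1 /dev/md0

  # Afterwards the old PV can be retired
  vgreduce vg_storj /dev/sdX1
  pvremove /dev/sdX1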

I’m running this node on a VMware host (I have two). I’m still learning, and made the mistake of creating the VMware disk as independent. I’ve just recreated it as a dependent disk, so I should be able to vMotion my entire Storj node from one host to another, as well as clone the node.

I don’t know much about LVM volumes, so I’ll have to read up on that. I wasn’t aware that I could resize Linux disks on the fly, so I’ll have to check that out. Thanks for the tip, cdhowie.

No problem. Note that I’m not talking about resizing the disk, but the filesystem. resize2fs can resize an ext filesystem. If growing the filesystem, this can be done online (while it’s mounted), but if shrinking then the filesystem must not be mounted.
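As a sketch of the shrink itself (placeholder names again, assuming ext4 on an LVM logical volume, matching the 1.5TB to 1.0TB example): shrink the filesystem a bit below the target first, reduce the LV to the target, then grow the filesystem back to fill it.

  umount /mnt/storj
  e2fsck -f /dev/vg_storj/lv_storj
  # Shrink the filesystem slightly below the target LV size
  resize2fs /dev/vg_storj/lv_storj 950G
  # Reduce the LV to the target size (lvreduce --resizefs can do both steps in one go)
  lvreduce -L 1T /dev/vg_storj/lv_storj
  # Grow the filesystem back to fill the LV exactly
  resize2fs /dev/vg_storj/lv_storj
  mount /dev/vg_storj/lv_storj /mnt/storj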

Since you’re using VMware, you can probably hot-add/remove disks to the VM. In this case, using LVM gives you a ton of flexibility that you can’t get with raw partitions. The most useful is the ability to move volumes between physical devices while the volume remains mounted (this is what pvmove does).

Since any block device can be a PV, this also allows you to move logical volumes into a RAID device by layering LVM on top of md-raid, or into an encrypted device by layering a PV on top of a crypt container.

Or you can do it all: PV on crypt on md-raid (which is what I do).
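For reference, that full stack looks something like this (names are placeholders, and the array creation is the same as in the sketch above):

  # LUKS container on top of the RAID array
  cryptsetup luksFormat /dev/md0
  cryptsetup open /dev/md0 storj_crypt

  # The decrypted mapping is what becomes the LVM physical volume
  pvcreate /dev/mapper/storj_crypt
  vgcreate vg_storj /dev/mapper/storj_crypt
  lvcreate -n lv_storj -l 100%FREE vg_storj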

What do others do with their node as it concerns RAID? Right now I’ve got three 500GB drives in a RAID 0 config, originally thinking that if the array died it wouldn’t be a big deal. Now I’m thinking that I should be on RAID 5, so I’ve been working towards that.

If I lose all the data for my node, how catastrophic is that for my reputation? If I get back online within an hour or two, would I be able to save my node / reputation, or would I pretty much get the boot?

Should I give up the extra 500GB in order to gain redundancy?

If you lose data then your node is going to be effectively kicked out. Restarting your node with no data wouldn’t work; you’d need a new invite and a new identity, and start over from scratch.

Note that backups won’t really help because the data would have to be 100% up to date all of the time. Doing a realtime sync to an off-site system could work, but it would probably take you many hours to sync it back to recover from an array failure, and that downtime will likely disqualify you in production, anyway.

Backups aren’t realistic. The system is designed to be self-healing anyway, so we’re not talking about losing data that the system can’t rebuild.

IMO yes. It’s worth not having to start your node over from scratch every time a disk fails.

Currently I’m using RAID1 (with two disks). If I add disks later I will be using RAID10.

RAID5 is basically guaranteed to fail during a rebuild once the array is 12TB+, so as you approach 12TB the odds of a rebuild succeeding go way down. Use RAID6 or RAID10 instead… but with three disks, RAID6 is basically RAID1, and RAID10 wouldn’t even work.

RAID5 is probably fine for now but I would suggest switching to RAID10 as soon as is feasible.

Thank you - very clear and understandable!