Taking a node offline to rebuild a disk array?

Suppose the following:
A node has been running with 100% availability for 24 months.
In the 25th month, one of the disks fails and the SNO takes the node down for 12 hours to rebuild the disk array (rebuild from parity).
After being offline for 12 hours, the SNO restarts the docker container using the same identity, same storage-dir, same everything.

  1. Would this be a good strategy to handle a failed disk?

  2. Would the node be DQ’d for being offline for 12h (exceeding the SLA limit of 5h of downtime per month), even after 2 years of 100% availability?

  3. If a small number of files (example: 5 out of ALL the storage/ directory files) could not be recovered from parity, could the node still resume with its original identity having most of the files?

I’m trying to figure out the best way to handle a future disk failure and still preserve the node.

I read about the strategy of running “one node per disk”, and discarding the node if a disk fails, but this seems wasteful if 99.9% of the files can be recovered using a disk array with parity.

“Parity is a waste of space” you say? In my setup, I already have parity protection for my personal files, so I’ve already “lost” the space. I can protect the storj directory “for free”, but the rebuild time will take ~12 hours.

Thank you

Hello @diskhead,
Welcome to the forum!

You could do that. Disqualification for downtime is currently disabled. However, we have implemented the online tracking system, and it works as described here:

As soon as we have collected enough statistics, disqualification for downtime will be enabled.
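
To put the numbers from the question in perspective, here is a rough sketch of the arithmetic only. It assumes a 30-day month and a simple "hours offline" measure; the actual online tracking system follows its own rules, which are not reproduced here, and `online_percent` is a made-up helper name.

```python
HOURS_PER_DAY = 24

def online_percent(offline_hours: float, days_in_month: int = 30) -> float:
    """Share of the month the node was reachable, as a percentage.

    Assumption: plain hours-offline arithmetic, not the real tracking logic.
    """
    total_hours = days_in_month * HOURS_PER_DAY
    return 100.0 * (total_hours - offline_hours) / total_hours

allowed = online_percent(5)    # the 5h/month figure from the question
rebuild = online_percent(12)   # the 12h rebuild from the question

print(f"5h offline:  {allowed:.2f}% online")
print(f"12h offline: {rebuild:.2f}% online")
```

So a single 12-hour rebuild in an otherwise clean month would leave the node roughly one percentage point below the 5h figure.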

FYI: with today’s disks the rebuild could take days, and it could end with a dead array with 98.86% probability:

It depends on how much data is lost. Your audit score will be affected either way. If the audit score drops below 60%, the node will be disqualified.
If your node is lucky, it could avoid disqualification:

With 8 nodes (one per disk), a failed disk costs you 1/8 of the data, compared with an array of 8 disks, where you lose the whole array if the rebuild fails.
You can read the full discussion regarding RAID vs. no RAID here:
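
As a back-of-the-envelope comparison of those two cases (a sketch, not a reliability model): it takes the 98.86% rebuild-failure figure quoted above at face value, assumes 8 equally full disks, and ignores partial recovery. `expected_loss_fraction` is a hypothetical helper.

```python
def expected_loss_fraction(loss_if_failure: float, p_failure: float = 1.0) -> float:
    """Expected share of stored data lost after one disk dies.

    loss_if_failure: share of data lost when the bad outcome happens.
    p_failure: probability of that bad outcome (assumed, not measured).
    """
    return loss_if_failure * p_failure

# One node per disk: the failed disk's node is gone, the other 7 are untouched.
one_node_per_disk = expected_loss_fraction(1 / 8)

# 8-disk array: everything is lost, but only if the rebuild fails.
array_rebuild = expected_loss_fraction(1.0, p_failure=0.9886)

print(f"one node per disk: {one_node_per_disk:.4f}")
print(f"array with rebuild: {array_rebuild:.4f}")
```

The comparison is only as good as the assumed failure probability; with a much lower rebuild-failure rate the array comes out ahead.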

There is no need to take the node down while rebuilding.

Just set your shared capacity to 500GB in config.yaml so the node won’t accept new pieces during the time of the rebuild.
