Hello,
I am away on holidays and one of the disks connected to an LSI HBA card went offline. How long do I have until the node is completely banned?
The data on this HDD is intact. It is a cabling issue that has happened before, and the node should come back without problems once I am home and restart the computer with a new cable.
If the node is banned, is there an appeal process? Can I have it restored, or have the suspension period extended?
// AuditHistoryConfig is a configuration struct defining time periods and thresholds for penalizing nodes for being offline.
// It is used for downtime suspension and disqualification.
type AuditHistoryConfig struct {
	WindowSize time.Duration `help:"The length of time spanning a single audit window" releaseDefault:"12h" devDefault:"5m" testDefault:"10m"`
	TrackingPeriod time.Duration `help:"The length of time to track audit windows for node suspension and disqualification" releaseDefault:"720h" devDefault:"1h"`
	GracePeriod time.Duration `help:"The length of time to give suspended SNOs to diagnose and fix issues causing downtime. Afterwards, they will have one tracking period to reach the minimum online score before disqualification" releaseDefault:"168h" devDefault:"1h"`
	OfflineThreshold float64 `help:"The point below which a node is punished for offline audits. Determined by calculating the ratio of online/total audits within each window and finding the average across windows within the tracking period." default:"0.6"`
	OfflineDQEnabled bool `help:"whether nodes will be disqualified if they have low online score after a review period" releaseDefault:"false" devDefault:"true"`
	OfflineSuspensionEnabled bool `help:"whether nodes will be suspended if they have low online score" releaseDefault:"true" devDefault:"true"`
}
Since OfflineDQEnabled defaults to false in release builds, a low online score only gets the node suspended, not disqualified. You only need to come back online within 30 days.
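To make the help text for OfflineThreshold a bit more concrete, here is a rough sketch of how an online score could be computed from the release defaults above (12h windows, 720h tracking period, 0.6 threshold). This is not the satellite's actual code: the auditWindow type and onlineScore function are hypothetical names made up for this example, and skipping windows that contain no audits is an assumption on my part.

```go
package main

import (
	"fmt"
	"time"
)

// auditWindow is a hypothetical type for this sketch: it counts how many
// audits in one window found the node online, out of the total attempted.
type auditWindow struct {
	Online int
	Total  int
}

// onlineScore averages the per-window online/total ratio, as described in
// the OfflineThreshold help text. Windows without audits are skipped
// (an assumption, not confirmed by the struct above).
func onlineScore(windows []auditWindow) float64 {
	sum, counted := 0.0, 0
	for _, w := range windows {
		if w.Total == 0 {
			continue
		}
		sum += float64(w.Online) / float64(w.Total)
		counted++
	}
	if counted == 0 {
		return 1 // assumption: no audits at all means nothing to penalize
	}
	return sum / float64(counted)
}

func main() {
	// Release defaults taken from the struct tags above.
	windowSize := 12 * time.Hour
	trackingPeriod := 720 * time.Hour
	offlineThreshold := 0.6

	numWindows := int(trackingPeriod / windowSize) // 720h / 12h = 60 windows

	// Made-up scenario: online for the first 40 windows (20 days),
	// then offline for the remaining 20 windows (10 days).
	windows := make([]auditWindow, numWindows)
	for i := range windows {
		windows[i] = auditWindow{Online: 0, Total: 4}
		if i < 40 {
			windows[i].Online = 4
		}
	}

	score := onlineScore(windows)
	fmt.Printf("windows=%d score=%.2f suspended=%t\n",
		numWindows, score, score < offlineThreshold)
	// prints: windows=60 score=0.67 suspended=false
}
```

In this made-up scenario, ten days offline out of thirty leaves the average at about 0.67, which is still above the 0.6 threshold. A longer outage would push it below and trigger suspension.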
Thanks for your replies. I’m back from holidays, and unfortunately, one of the HDDs is dead and not spinning, so that node is obviously disqualified. It turns out it wasn’t a cabling issue as I initially suspected.
Meanwhile, the other five disks in my home server are working fine. I won’t be replacing the dead drive, as my nodes aren’t full yet.
The bigger problem isn’t the 30-day suspension, which is already quite significant. The real issue is that when you’re down, you lose a lot of the data accumulated over time.