Maintenance mode feature

Alexey · December 11, 2020, 12:24am

We have implemented online score accounting based on audits:

storj/storj/blob/46e26fa47d02efa0327a995ca6879b7fc02b486d/docs/blueprints/storage-node-downtime-tracking-with-audits.md

# Storage Node Downtime Tracking With Audits

## Abstract

This document describes a means of tracking storage node downtime with audits and using this information to suspend and disqualify.

## Background

The previous implementation of uptime reputation consisted of a ratio of online audits to offline audits. We encountered a problem where some nodes' reputations would quickly become destroyed over a relatively short period of downtime due to the frequency of auditing any particular node being directly correlated with the number of pieces it holds. To solve this problem we need a system that takes into account not only how many offline audits occur, but _when_ they occur as well.

## Design

The solution proposed here is to use a series of sliding windows to indicate a general timeframe in which offline audits occur. Each window keeps two separate tallies indicating how many offline audits and total audits a particular node received within its timeframe. Once a window is complete, it is scored by calculating the percentage of total audits for which it was offline. We can average these scores over a trailing period of time, called the _tracking period_, to determine an overall "offline score" to be used for suspension and disqualification. By granting each individual window the same weight in the calculation of the overall average, the effect of any particularly unlucky period can be minimized while still allowing us to take the failures into account over a longer period.
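As a rough illustration of the windowed scoring described above, here is a minimal Python sketch. The `Window` class, the function names, and the choice to skip windows with no audits are assumptions made for this example; they are not taken from the Storj codebase.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Window:
    """Audit tallies for one completed window (hypothetical structure)."""
    offline_audits: int
    total_audits: int


def window_score(w: Window) -> float:
    """Fraction of the window's audits for which the node was offline."""
    return w.offline_audits / w.total_audits


def offline_score(windows: List[Window]) -> float:
    """Unweighted average of per-window scores over the tracking period.

    Windows without any audits are skipped (an assumption for this sketch).
    """
    scores = [window_score(w) for w in windows if w.total_audits > 0]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


def naive_ratio(windows: List[Window]) -> float:
    """The older approach: offline audits divided by all audits."""
    offline = sum(w.offline_audits for w in windows)
    total = sum(w.total_audits for w in windows)
    return offline / total if total else 0.0


# One unlucky window with many audits, three quiet windows with none offline.
windows = [Window(0, 10), Window(50, 50), Window(0, 10), Window(0, 10)]
print(offline_score(windows))  # 0.25
print(naive_ratio(windows))    # 0.625
```

Because every window carries the same weight, the single bad window contributes one quarter of the score, whereas the plain ratio is dominated by the number of audits that happened to land in it.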

Storage node downtime can have a range of causes. For those storage node operators who may have fallen victim to a temporary issue, we want to give them a chance to diagnose and fix it before disqualifying them for good. For this reason, we are introducing suspension as a component of disqualification.

Once a node's offline score has risen above an _offline threshold_, it is _suspended_ and enters a period of review. A suspended node will not receive any new pieces, but can continue to receive download and audit requests for the pieces it currently holds. However, its pieces are considered to be unhealthy. We repair a segment if it contains too many unhealthy pieces, at which point we may transfer the repaired pieces from a suspended node to a more reliable node. If at any point during the review period we find that a node's score has fallen below the offline threshold, it is unsuspended, or _reinstated_, but it remains _under review_. This prevents nodes from alternating between suspension and reinstatement without consequence.

The review period consists of one _grace period_ and one _tracking period_. The _grace period_ is given to fix whatever issue is causing the downtime. After the grace period has expired, any offline audits will fall within the scope of the tracking period, and thus will be used in the node's final evaluation. If at the end of the review period, the node is still suspended, it is disqualified. Otherwise, the node is no longer _under review_.
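The suspension and review rules above can be sketched as a small state machine. This is only an illustration: `NodeState`, `update`, and the constant values are hypothetical (the 7-day and 30-day lengths follow the figures mentioned later in this post, and the threshold is a placeholder), not the satellite's actual implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Placeholder values for this sketch only.
GRACE_PERIOD = timedelta(days=7)
TRACKING_PERIOD = timedelta(days=30)
OFFLINE_THRESHOLD = 0.4


@dataclass
class NodeState:
    under_review_since: Optional[datetime] = None
    suspended: bool = False
    disqualified: bool = False


def update(node: NodeState, offline_score: float, now: datetime) -> NodeState:
    """Apply the suspension/review rules for one evaluation of a node."""
    if node.disqualified:
        return node  # disqualification is final

    above = offline_score > OFFLINE_THRESHOLD

    if node.under_review_since is None:
        # Not under review: crossing the threshold suspends the node
        # and starts the review period.
        if above:
            node.suspended = True
            node.under_review_since = now
        return node

    # Under review: the node may be reinstated or suspended again,
    # but the review clock keeps running either way.
    node.suspended = above

    review_period = GRACE_PERIOD + TRACKING_PERIOD
    if now - node.under_review_since >= review_period:
        if node.suspended:
            node.disqualified = True
        else:
            node.under_review_since = None  # review ends cleanly
    return node


start = datetime(2020, 12, 1)
node = update(NodeState(), 0.5, start)                  # suspended, review starts
node = update(node, 0.1, start + timedelta(days=10))    # reinstated, still under review
node = update(node, 0.1, start + timedelta(days=37))    # review over, node is clear
print(node)
```

Keeping `under_review_since` set while the node is reinstated is what prevents a node from alternating between suspension and reinstatement without consequence.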


The current version allows your node to be offline for up to 288 hours; beyond that, it will be suspended. After that you have a one-week grace period to fix the issue. Once you have fixed the problem, the node stays under review for the next 30 days. Only if your node manages to get suspended again will it be disqualified.
That leaves plenty of time to do maintenance before the node is disqualified.
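To put the 288-hour figure in perspective, here is a quick back-of-the-envelope calculation. It assumes the allowance is measured against a 30-day window, which is my reading of the numbers above rather than something stated explicitly; `offline_allowance` is just a helper name for this example.

```python
def offline_allowance(allowed_hours: float, window_days: int) -> tuple:
    """Return (allowed days offline, fraction of the window offline)."""
    window_hours = window_days * 24
    return allowed_hours / 24, allowed_hours / window_hours


days, fraction = offline_allowance(288, 30)  # 30-day window is an assumption
print(days)      # 12.0
print(fraction)  # 0.4
```

Under that assumption, 288 hours is 12 days, or 40% of a 30-day window.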

Each failed audit pushes your node closer to disqualification; with too many in a row, it will be disqualified pretty quickly.
The network itself is able to recover missing files, but they never come back to your node (because it has proved to be unreliable).
If your node has not lost too much, it can survive, but its audit score may never recover.