Changelog v1.12.3

littleskunk · September 9, 2020, 6:09pm

I have an update for you. The storage node dashboard API contains a new uptime score. By default it is 1 aka 100%. It is not visible on the storage node dashboard.

The score gets updated via audits. The current default settings are:

# The length of time to give suspended SNOs to diagnose and fix issues causing downtime. Afterwards, they will have one tracking period to reach the minimum online score before disqualification
# overlay.audit-history.grace-period: 168h0m0s

# The point below which a node is punished for offline audits. Determined by calculating the ratio of online/total audits within each window and finding the average across windows within the tracking period.
# overlay.audit-history.offline-threshold: 0.6

# The length of time to track audit windows for node suspension and disqualification
# overlay.audit-history.tracking-period: 720h0m0s

# The length of time spanning a single audit window
# overlay.audit-history.window-size: 12h0m0s

Let’s say a storage node gets 12 audits in 12 hours. If it fails one, that translates into 1 hour of offline time. If the storage node gets only 2 audits in 12 hours and fails one, that translates into 6 hours of offline time. The fewer audits a storage node receives, the less accurate the system gets. To compensate for that, we are starting with a fairly high tolerance. 0.6 means a storage node can be offline up to 288 hours in a month (best case with many audits and high accuracy). Worst case: 36 times one minute of offline time, plus the bad luck that the satellite was sending only 1 audit every 12 hours and hit exactly that one minute all 36 times.
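To make that arithmetic concrete, here is a minimal Python sketch of the calculation as described in the config comments: one online/total ratio per window, averaged across the windows. This is not the satellite's code. The function names are made up for illustration, and the constants simply mirror the defaults quoted above.

```python
from statistics import mean

# Values mirror the defaults quoted above; names are illustrative only.
WINDOW_HOURS = 12         # overlay.audit-history.window-size
TRACKING_HOURS = 720      # overlay.audit-history.tracking-period
OFFLINE_THRESHOLD = 0.6   # overlay.audit-history.offline-threshold


def window_ratio(online, total):
    """Share of audits in one window that found the node online."""
    return online / total


def online_score(windows):
    """Average of the per-window ratios; windows is a list of (online, total)."""
    return mean(window_ratio(online, total) for online, total in windows)


def implied_offline_hours(online, total):
    """Downtime that one window appears to represent, given its audit results."""
    return WINDOW_HOURS * (1 - window_ratio(online, total))


# 12 audits with 1 failed looks like roughly 1 hour offline;
# 2 audits with 1 failed looks like 6 hours offline.
print(implied_offline_hours(11, 12))
print(implied_offline_hours(1, 2))

# Best case for the node: with many audits per window, a score of 0.6
# corresponds to 40% of the 720h tracking period, i.e. about 288 hours.
print(TRACKING_HOURS * (1 - OFFLINE_THRESHOLD))
```

The fewer audits land in a window, the coarser `implied_offline_hours` becomes, which is the inaccuracy the high initial tolerance is meant to absorb.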

What happens if a storage node is not getting 1 audit every 12 hours? If the audit history is not filled with at least one data point every 12 hours for the full 30-day period, the score will stay at 1 and ignore all downtime. I would call it a bug and not a feature. The idea is that unvetted nodes are not getting suspended. We want to fill them with enough data first to be able to make a fair judgment. That idea is great, but it is currently implemented too early in the process. The score should be updated, but suspension shouldn’t kick in. An unvetted node should see the impact and correct its behavior. Currently I would expect to see a score of 1 for most of the nodes even if they had downtime. It will take 30 days before the first nodes might see the real score.
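The current behavior can be sketched roughly like this in Python. It is only an illustration of the rule as described in this post, not the actual implementation: `reported_score` and its input format (one `(online, total)` pair per 12 hour window) are assumptions made up for the example.

```python
from statistics import mean

# 720h tracking period / 12h window size = 60 windows
WINDOWS_IN_TRACKING_PERIOD = 720 // 12


def reported_score(windows):
    """Return the score a node would currently see.

    windows: list of (online, total) audit counts, one entry per 12h window.
    Hypothetical helper; the real satellite logic may differ in detail.
    """
    history_is_full = (
        len(windows) >= WINDOWS_IN_TRACKING_PERIOD
        and all(total > 0 for _, total in windows)
    )
    if not history_is_full:
        # Not enough data points yet: score stays at 1, downtime is ignored.
        return 1.0
    return mean(online / total for online, total in windows)


# Ten days of history, offline for every audit: still reported as 1.0.
print(reported_score([(0, 1)] * 20))
```

With a full 60 windows and at least one audit in each, the same function returns the averaged ratio instead of 1.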

These values are the current config values and we are going to update them over time. My expectation is that in the end we will suspend nodes earlier and won’t let them stay offline for up to 288 hours. On the other hand, I don’t think the 12 hour window is going to change much, which means the satellite has to make a decision with 60 data points. With 2 deployments in a month we need to tolerate at least 2 data points. This translates to at least 36 hours of downtime in a month. Likely a bit more. That is my personal expectation and of course I could be wrong about this. So let’s end my statement with: all these values can be changed. I will try to keep you all updated.