Downtime Tracking
The satellite is now tracking the downtime of storage nodes. It will also place nodes into and remove them from downtime suspension mode. Downtime suspension currently has no effect: a suspended node will still get selected for uploads. We want to see how the tracking behaves before we enable the penalties.
Periodic Filesystem Check
Last release we added a check whether the filesystem is available at all (e.g. the drive got disconnected). With this release we also check whether the filesystem is writable.
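Conceptually the check is simple. A minimal sketch of what such a writability probe could look like (this is an illustration only, not the actual storagenode code; the probe file name, path and interval are assumptions):

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// checkWritable creates, writes and removes a small probe file inside the
// storage directory. Any error means the filesystem is not usable for writes
// (read-only mount, disconnected drive, and so on).
func checkWritable(storageDir string) error {
	probe := filepath.Join(storageDir, ".write-test") // probe file name is made up for this sketch
	if err := os.WriteFile(probe, []byte("probe"), 0o600); err != nil {
		return err
	}
	return os.Remove(probe)
}

func main() {
	// Run the probe periodically; the interval here is arbitrary.
	for range time.Tick(time.Minute) {
		if err := checkWritable("/mnt/storj/storage"); err != nil {
			fmt.Println("filesystem not writable:", err)
		}
	}
}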
Payment Dashboard Surge
The storage node payment dashboard now shows the surge percentage for previous payouts. The payout table is currently a bit confusing: the surge line should appear after the gross total and before the held amount. We will fix the order in a following release.
Note
This release contains a Linux autoupdater. We are using it for internal tests only. Please don't use it yet; we have to fix a few bugs first.
For Customers
Delete Bucket Performance
uplink rb --force sj://mybucket should now take only a few seconds to delete all the files in a bucket.
I have an update for you. The storage node dashboard API now contains a new uptime score. By default it is 1, i.e. 100%. It is not yet visible on the storage node dashboard UI.
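If you want to look at the value before it shows up in the UI, you can poll the local dashboard API. A minimal sketch, assuming the default dashboard address 127.0.0.1:14002 and the /api/sno/satellites endpoint; the exact path and response fields may differ between versions:

package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// 127.0.0.1:14002 is the default dashboard address; adjust it if you changed
	// console.address in your config. The endpoint path is an assumption and may
	// differ between storagenode versions.
	resp, err := http.Get("http://127.0.0.1:14002/api/sno/satellites")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// Print the raw JSON; the uptime score should appear in here once the node
	// has received enough audits.
	fmt.Println(string(body))
}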
The score gets updated via audits. The current default settings are:
# The length of time to give suspended SNOs to diagnose and fix issues causing downtime. Afterwards, they will have one tracking period to reach the minimum online score before disqualification
# overlay.audit-history.grace-period: 168h0m0s
# The point below which a node is punished for offline audits. Determined by calculating the ratio of online/total audits within each window and finding the average across windows within the tracking period.
# overlay.audit-history.offline-threshold: 0.6
# The length of time to track audit windows for node suspension and disqualification
# overlay.audit-history.tracking-period: 720h0m0s
# The length of time spanning a single audit window
# overlay.audit-history.window-size: 12h0m0s
Let’s say a storage node gets 12 audits in 12 hours. If it fails one, that translates into 1 hour of offline time. If the storage node gets only 2 audits in 12 hours and fails one, that translates into 6 hours of offline time. The fewer audits a storage node receives, the less accurate the system gets. To compensate for that, we are starting with a fairly high tolerance. A threshold of 0.6 means a storage node can be offline for up to 288 hours in a month (best case, with many audits and high accuracy). Worst case: the satellite sends only one audit every 12 hours, the node has the bad luck to be offline for exactly that one minute, and this happens in enough windows to push the score below 0.6, which takes only about 25 such one-minute outages.
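To make the arithmetic explicit, here is a small sketch that derives these numbers from the config values above (my own reading of the scoring description, not the actual satellite code):

package main

import (
	"fmt"
	"time"
)

func main() {
	trackingPeriod := 720 * time.Hour // overlay.audit-history.tracking-period
	windowSize := 12 * time.Hour      // overlay.audit-history.window-size
	threshold := 0.6                  // overlay.audit-history.offline-threshold

	windows := int(trackingPeriod / windowSize) // 60 windows per tracking period
	// The score is the average per-window online ratio, so a node can afford at
	// most (1 - threshold) of the windows fully offline before it is punished.
	maxOfflineWindows := int(float64(windows) * (1 - threshold))
	maxOfflineTime := time.Duration(maxOfflineWindows) * windowSize

	fmt.Println("windows per tracking period:", windows)               // 60
	fmt.Println("offline windows without penalty:", maxOfflineWindows) // 24
	fmt.Println("best case offline time allowed:", maxOfflineTime)     // 288h0m0s
	// Worst case: only one audit lands in a window and the node is offline for
	// exactly that minute; each such minute then costs a full 12-hour window.
}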
What happens if a storage node does not get one audit every 12 hours? If the audit history is not filled with at least one data point every 12 hours for the full 30-day period, the score stays at 1 and ignores all downtime. I would call that a bug and not a feature. The idea is that unvetted nodes should not get suspended; we want to collect enough data first to be able to make a fair judgment. That idea is great, but it currently kicks in too early in the process. The score should still be updated, only the suspension shouldn't kick in, so that an unvetted node can see the impact and correct its behavior. Currently I would expect to see a score of 1 for most nodes even if they had downtime. It will take 30 days before the first nodes might see their real score.
These are the current config values, and we are going to adjust them over time. My expectation is that in the end we will suspend nodes earlier and not let them stay offline for up to 288 hours. On the other hand, I don't think the 12-hour window is going to change much, which means the satellite has to make its decision based on 60 data points. With 2 deployments in a month we need to tolerate at least 2 offline data points, which translates to at least 36 hours of accepted downtime per month, likely a bit more. That is my personal expectation and of course I could be wrong. So let me end with: all of these values can be changed. I will try to keep you all updated.
Yes, but not with the current content. I would expect some kind of information about downtime, for example: "In the last 30 days the satellite has detected that you have been offline 2 times, here … and here …". I don't expect this to happen any time soon, but long term it would be nice to make that visible.
The other information in that array is currently meaningless.
I still call it a bug. For 30 days you get no feedback at all. After 30 days you might or might not see your real score. In 30 days we might see that the score is not working at all. For that reason I would not trust the current score.
The way I understood it was that it mostly applied to unvetted nodes or nodes with only a small amount of data.
I’ll just have it as a graph for now.
I wish there was a “last contact” timestamp - I could generate an alert if that gets older than a few minutes, etc.
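That field isn't exposed like this today as far as I know, but if the dashboard API offered such a timestamp, a small watcher could raise exactly that alert. A hedged sketch; the endpoint and the field name lastPinged are assumptions, not a confirmed API:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// The /api/sno/ endpoint and the field name "lastPinged" are assumptions for
// illustration; check your node's actual API response before relying on them.
type dashboard struct {
	LastPinged time.Time `json:"lastPinged"`
}

func main() {
	resp, err := http.Get("http://127.0.0.1:14002/api/sno/")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var d dashboard
	if err := json.NewDecoder(resp.Body).Decode(&d); err != nil {
		panic(err)
	}
	if time.Since(d.LastPinged) > 5*time.Minute {
		fmt.Println("ALERT: no satellite contact since", d.LastPinged)
	}
}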
It only means a small node is not getting suspended right away. Even a node that holds only a single piece will get “unlucky” at some point, when the satellite happens to audit that one piece at least once every 12 hours. That might not happen in the first month, but it will happen at some point. I would expect that a lot of small nodes will just get slowly suspended over time, be unable to recover, and simply get DQed.
Besides that, the repair worker will not wait for your node to get suspended. If you are offline for a few hours, the repair job will simply move the data. That can make you a lucky node that never gets suspended, but you will constantly lose data. I don't think that is a good tradeoff.