Downtime Tracking
The satellite is now tracking the downtime of storage nodes. It will also place nodes into and remove them from downtime suspension mode. Downtime suspension currently has no effect: a suspended node will still get selected for uploads. We want to see how the tracking behaves before we enable the penalties.
Periodic Filesystem Check
Last release we added a check whether the filesystem is available at all (e.g. the drive got disconnected). With this release we also check whether the filesystem is writable.
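Conceptually the check is simple. A minimal sketch of what such a writability probe could look like (this is an illustration only, not the actual storagenode code; the probe file name, path and interval are assumptions):

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// checkWritable creates, writes and removes a small probe file inside the
// storage directory. Any error means the filesystem is not usable for writes
// (read-only mount, disconnected drive, and so on).
func checkWritable(storageDir string) error {
	probe := filepath.Join(storageDir, ".write-test") // probe file name is made up for this sketch
	if err := os.WriteFile(probe, []byte("probe"), 0o600); err != nil {
		return err
	}
	return os.Remove(probe)
}

func main() {
	// Run the probe periodically; the interval here is arbitrary.
	for range time.Tick(time.Minute) {
		if err := checkWritable("/mnt/storj/storage"); err != nil {
			fmt.Println("filesystem not writable:", err)
		}
	}
}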
Payment Dashboard Surge
The storage node payment dashboard now shows the surge percentage for previous payouts. The payout table is currently a bit confusing: the surge line should appear after the gross total and before the held amount. We will fix the order in a following release.
Note
This release contains a Linux autoupdater. We are using it for internal tests only. Please don't use it yet; we have to fix a few bugs first.
For Customers
Delete Bucket Performance
uplink rb --force sj://mybucket should now take only a few seconds to delete all the files in a bucket.
I have an update for you. The storage node dashboard API now contains a new uptime score. By default it is 1, i.e. 100%. It is not yet visible on the storage node dashboard UI.
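If you want to look at the value before it shows up in the UI, you can poll the local dashboard API. A minimal sketch, assuming the default dashboard address 127.0.0.1:14002 and the /api/sno/satellites endpoint; the exact path and response fields may differ between versions:

package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// 127.0.0.1:14002 is the default dashboard address; adjust it if you changed
	// console.address in your config. The endpoint path is an assumption and may
	// differ between storagenode versions.
	resp, err := http.Get("http://127.0.0.1:14002/api/sno/satellites")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// Print the raw JSON; the uptime score should appear in here once the node
	// has received enough audits.
	fmt.Println(string(body))
}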
The score gets updated via audits. The current default settings are:
# The length of time to give suspended SNOs to diagnose and fix issues causing downtime. Afterwards, they will have one tracking period to reach the minimum online score before disqualification
# overlay.audit-history.grace-period: 168h0m0s
# The point below which a node is punished for offline audits. Determined by calculating the ratio of online/total audits within each window and finding the average across windows within the tracking period.
# overlay.audit-history.offline-threshold: 0.6
# The length of time to track audit windows for node suspension and disqualification
# overlay.audit-history.tracking-period: 720h0m0s
# The length of time spanning a single audit window
# overlay.audit-history.window-size: 12h0m0s
Let’s say a storage node gets 12 audits in 12 hours. If it fails one, that translates into 1 hour of offline time. If the storage node gets only 2 audits in 12 hours and fails one, that translates into 6 hours of offline time. The fewer audits a storage node receives, the less accurate the system gets. To compensate for that, we are starting with a fairly high tolerance. A threshold of 0.6 means a storage node can be offline for up to 288 hours in a month (best case, with many audits and high accuracy). Worst case: the satellite sends only one audit every 12 hours, the node has the bad luck to be offline for exactly that one minute, and this happens in enough windows to push the score below 0.6, which takes only about 25 such one-minute outages.
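To make the arithmetic explicit, here is a small sketch that derives these numbers from the config values above (my own reading of the scoring description, not the actual satellite code):

package main

import (
	"fmt"
	"time"
)

func main() {
	trackingPeriod := 720 * time.Hour // overlay.audit-history.tracking-period
	windowSize := 12 * time.Hour      // overlay.audit-history.window-size
	threshold := 0.6                  // overlay.audit-history.offline-threshold

	windows := int(trackingPeriod / windowSize) // 60 windows per tracking period
	// The score is the average per-window online ratio, so a node can afford at
	// most (1 - threshold) of the windows fully offline before it is punished.
	maxOfflineWindows := int(float64(windows) * (1 - threshold))
	maxOfflineTime := time.Duration(maxOfflineWindows) * windowSize

	fmt.Println("windows per tracking period:", windows)               // 60
	fmt.Println("offline windows without penalty:", maxOfflineWindows) // 24
	fmt.Println("best case offline time allowed:", maxOfflineTime)     // 288h0m0s
	// Worst case: only one audit lands in a window and the node is offline for
	// exactly that minute; each such minute then costs a full 12-hour window.
}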
What happens if a storage node does not get one audit every 12 hours? If the audit history is not filled with at least one data point every 12 hours for the full 30-day period, the score stays at 1 and ignores all downtime. I would call that a bug and not a feature. The idea is that unvetted nodes should not get suspended; we want to collect enough data first to be able to make a fair judgment. That idea is great, but it currently kicks in too early in the process. The score should still be updated, only the suspension shouldn't kick in, so that an unvetted node can see the impact and correct its behavior. Currently I would expect to see a score of 1 for most nodes even if they had downtime. It will take 30 days before the first nodes might see their real score.
These are the current config values, and we are going to adjust them over time. My expectation is that in the end we will suspend nodes earlier and not let them stay offline for up to 288 hours. On the other hand, I don't think the 12-hour window is going to change much, which means the satellite has to make its decision based on 60 data points. With 2 deployments in a month we need to tolerate at least 2 offline data points, which translates to at least 36 hours of accepted downtime per month, likely a bit more. That is my personal expectation and of course I could be wrong. So let me end with: all of these values can be changed. I will try to keep you all updated.
Yes, but not with the current content. I would expect some kind of information about downtime, for example: "In the last 30 days the satellite has detected that you have been offline 2 times, here … and here …". I don't expect this to happen any time soon, but long term it would be nice to make that visible.
The other information in that array is currently meaningless.
I still call it a bug. For 30 days you get no feedback at all. After 30 days you might or might not see your real score. In 30 days we might see that the score is not working at all. For that reason I would not trust the current score.
The way I understood it was that it mostly applied to unvetted nodes or nodes with only a small amount of data.
I’ll just have it as a graph for now.
I wish there was a “last contact” timestamp - I could generate an alert if that gets older than a few minutes, etc.
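That field isn't exposed like this today as far as I know, but if the dashboard API offered such a timestamp, a small watcher could raise exactly that alert. A hedged sketch; the endpoint and the field name lastPinged are assumptions, not a confirmed API:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// The /api/sno/ endpoint and the field name "lastPinged" are assumptions for
// illustration; check your node's actual API response before relying on them.
type dashboard struct {
	LastPinged time.Time `json:"lastPinged"`
}

func main() {
	resp, err := http.Get("http://127.0.0.1:14002/api/sno/")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var d dashboard
	if err := json.NewDecoder(resp.Body).Decode(&d); err != nil {
		panic(err)
	}
	if time.Since(d.LastPinged) > 5*time.Minute {
		fmt.Println("ALERT: no satellite contact since", d.LastPinged)
	}
}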
It only means a small node is not getting suspended right away. Even a node that holds only a single piece will get “unlucky” at some point, when the satellite happens to audit that one piece at least once every 12 hours. That might not happen in the first month, but it will happen at some point. I would expect that a lot of small nodes will just get slowly suspended over time, be unable to recover, and simply get DQed.
Besides that, the repair worker will not wait for your node to get suspended. If you are offline for a few hours, the repair job will simply move the data. That can make you a lucky node that never gets suspended, but you will constantly lose data. I don't think that is a good tradeoff.