Design Draft: Storage Node Graceful Exit

brandon · July 12, 2019, 6:42pm

Hey, everyone! We’re excited to be moving our architecture design doc conversations to the forum! As a Product Manager at Storj Labs, I’m particularly excited to open up our design process to gain community engagement and input. On behalf of everyone, I can say that we’re looking forward to the valuable insights this increased feedback loop between our internal engineering team and our community will bring!

As I mentioned in our town hall, here is one of the design drafts we have been working on recently. I would love to hear some feedback on what we have so far!

https://github.com/storj/storj/blob/3a01106ab4b52f74e5f5c420365fd4a5c46e08d8/docs/design/storagenode-graceful-exit.md

# Storage Node Graceful Exit Design Document


## Overview

When a Storage Node wants to leave the network but does not want to lose their escrow, we need a mechanism for them to exit the network “gracefully”.

This process includes the Storage Node transferring its pieces to other nodes so that the satellite does not have to repair those pieces because of a node exiting abruptly. The process of a Storage Node exiting gracefully includes that node requesting a list of Storage Nodes to send its pieces to and updating the satellite with which nodes are now storing those pieces. Graceful Exit is beneficial for both the Storage Node and the satellite because the Storage Node receives its escrow and the satellite saves money by not having to repair files.


## Goals

- Give Storage Nodes a mechanism to leave the network while receiving their escrows, and reduce repair caused by node churn on satellites. 


## Non Goals

- Sending the storage nodes' escrows.
	- The sending of tokens will happen through our normal token payment process.

This file has been truncated.

Alexey · August 4, 2019, 10:45am

We should make the Graceful Exit feature available before production.
Otherwise it will cost Storj Labs money.
For example: an SN starts and somehow gets 2TB of data for a month. Let’s assume that 25% of it is downloaded back by the customer. The payout would be calculated as 2*($1.5+$20*0.25)=$13. 75% of it would be held, so the held amount would be $9.75.
If the SNO doesn’t like how it’s going and wants to exit, they could request a Graceful Exit (GE). Since it’s not implemented, they just shut down their node (because the held amount is still not very large).
We then have to repair that amount of data, and we have to pay for the repair. So we would pay $10*2 = $20 to repair 2TB of data, but the held amount is only $9.75.
And each following month makes this situation worse. The held amount is not enough to repair all the data if an SN abruptly exits the network.
So Graceful Exit is a must-have feature, and it should be available not after the 15 months described in the Storage Sharing Terms and Conditions, but much earlier.
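
The arithmetic in this post can be reproduced with a short Python sketch. The function and parameter names are hypothetical, and the rates ($1.5 per TB stored, $20 per TB downloaded, 75% held back, $10 per TB repaired) are simply the figures quoted above, not values taken from the Storj code base.

```python
def held_vs_repair(stored_tb, egress_fraction,
                   storage_rate=1.5,    # $ per TB stored per month (from the post)
                   egress_rate=20.0,    # $ per TB downloaded by the customer (from the post)
                   held_fraction=0.75,  # share of the payout that is held back (from the post)
                   repair_rate=10.0):   # $ per TB the satellite pays for repair (from the post)
    """Return (payout, held, repair_cost) in dollars for one month of storage."""
    payout = stored_tb * (storage_rate + egress_rate * egress_fraction)
    held = payout * held_fraction
    repair_cost = stored_tb * repair_rate
    return payout, held, repair_cost


payout, held, repair_cost = held_vs_repair(stored_tb=2, egress_fraction=0.25)
print(f"payout=${payout:.2f} held=${held:.2f} repair=${repair_cost:.2f}")
print(f"shortfall=${repair_cost - held:.2f}")
```

With the numbers from the post, the held amount ($9.75) falls short of the repair cost ($20) by $10.25.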

anon68609175 · August 4, 2019, 11:32am

Absolutely agree with Alexey. For example, I, as an operator, am interested in creating fast, high-quality nodes, but with a short life cycle in the range of 2 to 6 months.
For example, I now have the opportunity to launch a third node on a separate channel and location, but the estimated life of the node is 2-3 months. However, the current rules require a node to run for at least 15 months or lose money.

KernelPanick · August 4, 2019, 1:35pm

One solution could be to pay SNOs more, which may be temporarily addressed by the 4-5x payouts. Have you run the numbers in that scenario?

littleskunk · August 4, 2019, 1:56pm

The math is a bit more complicated. Let's take a simple example: a network with 80 storage nodes, each holding 2TB. Now 45 storage nodes go offline and repair gets triggered. Let's take a look at what happens.

The repair job has 35 pieces per file available. It has to download only 30 pieces to reconstruct the file. In our example it has to repair all files. In total the repair job will download 30 times 2TB. The price for that is 10$ per TB. -> 600$

The repair job is not finished. After reconstructing the file it can reconstruct the lost pieces and upload them to new storage nodes. To get back to 80 pieces the repair job has to upload 45 pieces. Again for all files in our example. A total of 45 * 2TB = 90TB. The repair job doesn’t need to pay the storage nodes for that but the repair job is running on a server and upload traffic has a price. Depending on the location the satellite has to pay something between 1.19€ per TB (Hetzner) and 90$ per TB (Google). -> 107$ - 8100$

Oh man, those are some big numbers. How can one storage node cover the costs?
Wait a moment. We have lost 45 storage nodes, so we can divide the cost by 45. The first storage node to leave will not trigger any repairs, but we can say that at some point one will.
The damage a 2TB storage node can do is somewhere between 16$ and 193$, depending on the satellite.
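
The per-node figures above can be checked with a small Python sketch. The function and parameter names are hypothetical. The inputs are the ones quoted in this post, and the 1.19€ upload price is treated as roughly $1.19, as the post itself does when it mixes the two currencies.

```python
def repair_cost_per_lost_node(lost_nodes=45,
                              pieces_needed=30,        # pieces downloaded to reconstruct a file
                              tb_per_node=2.0,         # data held by each storage node
                              download_rate=10.0,      # $ per TB paid to storage nodes for repair download
                              upload_rate_per_tb=1.19):  # satellite's own upload cost per TB (assumption: ~$)
    """Return the repair cost attributed to each lost node, in dollars."""
    download_cost = pieces_needed * tb_per_node * download_rate  # 30 * 2TB * $10 = $600
    upload_cost = lost_nodes * tb_per_node * upload_rate_per_tb  # 45 * 2TB * upload price
    return (download_cost + upload_cost) / lost_nodes


cheap = repair_cost_per_lost_node(upload_rate_per_tb=1.19)   # Hetzner-like pricing
pricey = repair_cost_per_lost_node(upload_rate_per_tb=90.0)  # Google-like pricing
print(f"per-node damage: ${cheap:.0f} to ${pricey:.0f}")
```

Rounded to whole dollars, this gives the same range as the post: about 16$ at the cheap end and about 193$ at the expensive end.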

(Note: Sooner or later we want to separate the satellite services. At that point the repair job can run on a separate server with cheap upload costs.)

Hold on. I am holding 2TB, but the held back amount is less than the calculated values. Does that mean the satellite is losing money?
Yes and no. Hopefully the customer was able to download the file a few hundred times. For every 1$ the customer pays to the satellite, 0.4$ goes to the storage nodes and 0.6$ covers the costs, including repair. Best for the network are customers with many downloads and storage nodes that stay with us long term.

What does Graceful Exit change?
With graceful exit we can copy the 45 * 2TB directly from one storage node to another. We save a lot of bandwidth, and with that we save costs. We will download the 2TB from the storage nodes but without having to pay for it. Instead, the incentive for the storage node is the held back amount.

Do you have any questions?

ethan · August 16, 2019, 8:11pm

Hello everyone! We have added some implementation details to the graceful exit document.
You can find the updates in this PR: https://github.com/storj/storj/pull/2734

Please let us know what you think.

anon68609175 · August 18, 2019, 2:22pm

I have a good question. If a graceful exit can be done 15 months after the start of the node, how will this date be determined? Each satellite has its own node registration date. If satellite A sees that the node is 15 months old, satellite B may see only 1 month. How will the exit happen?

ethan · August 20, 2019, 2:09pm

Hey @anon68609175, that's a great question! Graceful exits can be performed at any time, and they are on a per-satellite basis.

Konard · December 13, 2019, 5:18pm

It would be nice to have not only the ability to gracefully exit, but also the ability to reconnect the node later. For example, I plan to operate the node for 6 months, then take a 6-month vacation, and then reconnect. I am still considering keeping the node up all year round, but the option to gracefully exit and rejoin later would be very nice to have.