Graceful Exit Revamp

brandon · August 15, 2023, 7:50pm

Hey Storj Node Operators,

We have seen a few posts about the issues surrounding graceful exit, and we wanted to inform you we have been working on a solution to resolve them.

Currently, graceful exit is a complicated subsystem that keeps a queue of all pieces expected to be on a node and asks the node to transfer those pieces to other nodes one by one. The complexity of the system has, unfortunately, led to numerous problems and unexpected behaviors.

We have decided to remove this entire subsystem and restructure graceful exit. This new graceful exit will be more simple to maintain going forward. The new graceful exit will work as follows:

Nodes will signal their intent to exit gracefully, the same as before.
The satellite will not send any new pieces to gracefully exiting nodes
Pieces on gracefully exiting nodes will be considered by the repair subsystem as “retrievable but unhealthy.” The pieces will be retrieved as part of normal repair for those objects that the graceful exit would cause a need to repair. This also considers all pending graceful exits happening.
After one month, if the node´s online score is at or above the current online threshold set by the satellite and the node is not suspended, contained, or disqualified, the node exit will be considered successful, and the held amount for the node will be released with the regular node payment process. Otherwise, the graceful exit will be considered failed, and the held amount will not be released.
The repair worker will continue to fetch pieces from the node as long as it stays online, but the exiting node will not receive new pieces nor be relied on for data durability.
Nodes that triggered graceful exit prior to the date of this code change will be treated as having started graceful exit at the time that they initiated the graceful exit process, and once the time period has elapsed, they will be marked as done. Once the node software is updated, the nodes currently executing graceful exit will cease direct piece transfers.
- Any nodes that have been in graceful exit for more than the specified time frame and otherwise meet the new requirements for a successful graceful exit will be assumed to have successfully completed graceful exit as of the time of the node software update.

The code for this new graceful exit functionality is complete and is undergoing our code review and QA process. We expect it to be deployed in the next few weeks. The PR is available at:

https://review.dev.storj.io/c/storj/storj/+/10823

In the future, we may adjust the time the nodes must stay online after triggering the exit depending on the amount of data the node is storing or the amount of repair traffic the satellite will need to conduct. We will build types of enhancements as we see how the new functionality works in the live network. The highest priority was to alleviate the issues current node operators have raised. While this is a fundamental shift in the approach to nodes exiting the network, on balance, this is the simplest and most scalable approach.

We believe this iteration of graceful exit will be a better experience for our Storage Node Operators and alleviate most of the issues/bugs the old system had due to its complexity.

Thank you,
The Storj Labs Team

Vadim · August 15, 2023, 7:58pm

so repair worker will move pieces to new nodes?

daki82 · August 15, 2023, 8:04pm

Cool, this will go well with the proposed changes to the held amount and the possible longer payout periods.

I am appy to be part of the network !

Ottetal · August 15, 2023, 8:22pm

What a neat solution, while it’s not relevant for me at the time being, I look forward to seeing it in action

snorkel · August 16, 2023, 12:56am

Nice! This solution was hiding in plain saight all along. So simple and effective.
But… I wonder if this could creat problems for the network if a big chunk of nodes would GE in the same time, because now the pieces are not just requested and transfered from one node, but 80 nodes, for each node with GE.
I believe 1 month is a bit too much; 2 weeks would be better; but if this is adjusted automaticaly by the network, and it is ended sooner if the network repairs all the pieces sooner, then it’s alright.

Alexey · August 16, 2023, 3:27am

Because of:

Since all pieces will be marked as readable, but unhealthy, the repair worker would repair them to other nodes (if the segment requires it, i.e. the number of healthy pieces is below the specified threshold). We believes that month should be enough time, but 2 weeks is too short.
However,

penfold · August 16, 2023, 3:34am

Can you define a month more precisely here? Is it based on the calendar month? 30 days? 31?

SGC · August 16, 2023, 8:04am

Good to see that it wasn’t a wasted effort from my side.
sending my 9GB of affected storagenode GE logs to Storj.

Had been wondering if anything was going to come of that.

BrightSilence · August 16, 2023, 9:30am

I appreciate the solution for SNOs here, but it seems this does come at a cost. Since effectively, graceful exit doesn’t save on repair costs at all anymore. With this implementation all graceful exit does is protect against sudden loss of availability of a large amount of pieces. That said, I don’t think GE was used in the current form in most cases, so that might not be a big loss. But it does seem like a shame that nodes can no longer rehome their pieces peer to peer without incurring direct costs for Storj Labs.

Ruskiem · August 16, 2023, 9:32am

First of all its bad priorities, because why making exit easier?
I’m completely unaffected by this, because i don’t plan to exit.
And i would like to see efforts that help existing nodes, not the one that decided they don’t care any more. (its theirs problem if they want to exit and its unpleasant, too bad! Shouldn’t been exiting!)

You have limited manpower now, after cuts, why helping those who don’t care about network anymore vs those who do care?
What’s matters is nodes who decides to STAY, rather focus efforts to help them stay!

Besides if node ever decide to leave the network, its because a serious reason!
so he want to just leave!
not keeping a node online + another month after he exists!
So this no better than currently. In many situations its just rouge!
because if i want to quit, its because im moving, suddenly, or i need to sell the equipemnt NOW, or other emergency, need to take computer off like in days max! Not a month!
So You want me to give all parts, which in repair process will take less time now i assume?
And then You STILL expect me to stay 1 month in the network after that to unlock my funds?!
Wheres the change for better for the node?

jammerdan · August 16, 2023, 9:42am

That was exactly my thought. Thanks for confirming this.
So what will that mean for held amounts? With that new approach with GE even costing repair money why should the held amount be returned?
Is there really no more direct transfer between nodes taking place? Basically the only way to save money?

Interesting view. I have heard users have been shutting down nodes because they needed the space now. Couldn’t be bothered with waiting for user deletions or even GE.
Being able to offload data to reclaim space would be one of the useful priorities. Of course this has been mentioned in the past already:

BrightSilence · August 16, 2023, 9:51am

People don’t use exit currently, which can put a high immediate strain on repair workers. Giving them an incentive to use it by making it a viable option and also reducing strain and time required on the SNO hardware helps spread out that load on repair workers and reduces risks related to exit. It’s not just to help the exiting nodes, but also to protect the network.

Held amount never covered the cost to begin with. I guess it’s just an incentive to allow Storj more time to do the repair. Not a cost covering measure.

@brandon I guess I have some followup questions:

It sounds like graceful exit will not trigger cleanup of the data on the node. It would be helpful to do that after the exit is done. I expect many questions from people exiting a subset of satellites otherwise.
How would this impact the prospect of partial exit? It seems using this system, triggering a hypothetical partial exit would be more complicated as there is no data removed from the nodes during the process. I think this is still a feature many SNOs would some day like to have.

That said, I’m a big fan of the reduction in code complexity by using existing systems instead of a completely separate system. Even if it comes at some costs.

jammerdan · August 16, 2023, 9:58am

That leads to the question, do we need held amount at all or are there other viable solutions to reach the intended goal?

BrightSilence · August 16, 2023, 2:36pm

Going by this change I can only infer that getting more time to trigger required repair is enough for them to keep that incentive around. I guess also fostering long active nodes by releasing half of the held amount after 15 months is still an incentive they want to keep around. But the goal is more node stability than covering costs now. Maybe that has always been the bigger goal.

brandon · August 16, 2023, 3:56pm

The repair worker will repair segments that fall under the repair threshold because of the node exiting.

brandon · August 16, 2023, 3:58pm

we really appreciate it @SGC

brandon · August 16, 2023, 4:01pm

we calculated how much this would cost the current satellites to repair the data from exiting nodes and compared it to the complexity of the previous graceful exit implementation and it was not worth the cost.

in terms of prioritizing cost vs durability the durability is always number 1 so just making sure the satellite can account for exiting nodes and repair segments appropriately is a win for us!

brandon · August 16, 2023, 4:03pm

Those are good points, once we get this new version of graceful exit out we will iterate and make it better.

Vadim · August 16, 2023, 4:12pm

@brandon
May be we can run repair worker on SNO servers? data is encrypted on client side, we don’t need decrypt it on our side to repair, so Bandwidth will be much cheaper for storj, as I remember Hetzner bandwidth is 20 euro per TB, SNO is only 6$ per TB. So it worth it i think, as Storj getting more and more data it become more and more relevant.

brandon · August 16, 2023, 6:32pm

yea! its an idea we have talked about at Storj; it may make sense as the network grows for sure and it would also give the SNOs another way to earn.