Graceful Exit Guide

Give me a moment to spin up my test node.

docker exec -it storagenode /app/storagenode exit-satellite --config-dir /app/config --identity-dir /app/identity

docker exec -it storagenode /app/storagenode exit-status --config-dir /app/config --identity-dir /app/identity

7 Likes

Strange, I gave it a try and there was no warning about how many months the node needs before you can issue this process.
I thought this only worked after at least six months.
Anyway, I did not proceed :slight_smile:

Thanks @littleskunk good to know.

Can you post a screenshot of what it showed when you executed the above command that didn’t show the warning?

Thanks for sharing this, it is working.

How much time will it take to complete the Graceful Exit process? I have been running my node for 3 months.

You need a node that has been running for a minimum of 6 months to do a GE.

1 Like

What if I have only 2 days? Is there any way to fast-forward it?

It’s a way to prevent abuse of the system.

You will get the warning once you initiate Graceful Exit. The logs should show you that information, not before.

2 Likes

I think akshaybengani was saying they only have 2 days to complete the graceful exit. However, Vadim is correct: your node has to have been online a minimum of 6 months for graceful exit to be available. If the node is less than 6 months old, graceful exit is not an option.

This will depend on how much data you have to upload. The absolute minimum time would be the amount of data you are storing divided by your uplink speed. But that is only if it saturates your uplink bandwidth. In reality it will take longer.
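
To put a rough number on that lower bound, here is a minimal sketch in Python. The 2 TB stored and 20 Mbit/s uplink are assumptions for illustration only, not figures from this thread:

# Rough lower bound on graceful exit duration: stored data / uplink speed.
# The numbers below are illustrative assumptions, not measurements.
stored_tb = 2.0                    # assumed: 2 TB stored on the node
uplink_mbps = 20.0                 # assumed: 20 Mbit/s upstream bandwidth

stored_bits = stored_tb * 1e12 * 8           # TB -> bits (decimal TB)
seconds = stored_bits / (uplink_mbps * 1e6)  # bits / (bits per second)
print(f"Minimum transfer time: {seconds / 3600:.1f} hours ({seconds / 86400:.1f} days)")

With those assumed values the floor is roughly 9 days, and a real exit will take longer because the uplink is never fully saturated.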

1 Like

Should the node ignore the bandwidth cap in the case of GE, and also for audits and normal up/down traffic?

That is a question for @brandon

At the moment storage nodes would get disqualified when they hit the traffic limit.

Only bad nodes?
Yeah, you want to have a penalty to “punish” bad nodes which try to cheat with data.
But at the same time you encourage SNOs NOT to use any storage redundancy (like RAID 5 or more robust levels), since it is already covered at the network level. And I think that is the right way. Two levels of redundancy would be overkill.

But the way it works now, even tiny data corruption on a node, like from a few bad sectors on an HDD, can cause DQ and full escrow loss on normal nodes, even after migration to a new HDD.

If you insist on such very strict rules for graceful exit (100% or nothing), then you need to create a procedure that would allow SNOs to independently, in case they suspect some data corruption on their node, initiate a full (not selective like satellite audits) local integrity check of the data stored on the node.
I hope the checksums or hashes of each piece are stored in the local database or in the piece itself, so such checks could be done locally without network load?

And in case missing (file not found) or corrupted (file found, but its checksum does not match the value stored in the DB) data is detected, initiate repair of the corresponding pieces. Of course, with the cost of the traffic needed for repair charged to the SNO, deducted from the next payment or from escrow.
The way things are now, most normal SNOs will be losing their escrow, because in the long term (and you want nodes running for a few years at least) it is nearly impossible to ensure full 100% integrity of stored data without redundancy and backups. And you will be punishing mostly normal SNOs, not only “bad” ones.
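
Just to make the proposal concrete, here is a minimal sketch of what such a local check could look like. Everything about it is hypothetical: the blobs path is guessed from the docker setup shown earlier, and the piece_hashes.db database, its pieces table, and its columns are invented for illustration; they are not the real storagenode layout.

# Hypothetical sketch of a local integrity check: walk stored pieces, hash each
# file, and compare against an assumed local table of expected hashes.
# Database name, table name, and columns are made up for illustration only.
import hashlib
import os
import sqlite3

BLOBS_DIR = "/app/config/storage/blobs"          # assumed docker-style path
DB_PATH = "/app/config/storage/piece_hashes.db"  # hypothetical database

def file_sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

conn = sqlite3.connect(DB_PATH)
missing, corrupted = [], []
for piece_path, expected in conn.execute("SELECT path, sha256 FROM pieces"):
    full_path = os.path.join(BLOBS_DIR, piece_path)
    if not os.path.exists(full_path):
        missing.append(piece_path)            # "file not found" case
    elif file_sha256(full_path) != expected:
        corrupted.append(piece_path)          # "checksum mismatch" case

print(f"missing: {len(missing)}, corrupted: {len(corrupted)}")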

4 Likes

You need much more than just a few bad pieces to get DQed.

99.9% of the storage nodes are running just fine. Why do we need to implement something that is only useful to the remaining 0.1%? I think we can spend our resources on better features. Anyway, if you want to open a PR I see no reason why we shouldn’t merge it. Feel free to do it.

Game theory. DQ is a penalty and we will not implement a way around it. It simply doesn’t work! As a cheater I will abuse any loophole you might want to add. I can drop pieces, run a verification, and just pay for the repair traffic? Sure, I will abuse that. Only as long as there is a DQ penalty in the game and no way around it will I follow the rules.

1 Like

You can check the “blob” folder in the data storage folder. The node creates a separate folder there for each satellite to store data. So you can check whether any data is still stored on disk for the satellite that DQed you.
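
If it helps, here is a small sketch that sums up how much data is still on disk per satellite. The path is an assumption based on the docker setup shown earlier in the thread; adjust it to wherever your data storage folder actually lives.

# Sketch: report per-satellite data still on disk by walking the per-satellite
# subfolders of the blobs directory. The path below is an assumption.
import os

BLOBS_DIR = "/app/config/storage/blobs"   # assumed location of the "blob" folder

for satellite_folder in sorted(os.listdir(BLOBS_DIR)):
    sat_path = os.path.join(BLOBS_DIR, satellite_folder)
    if not os.path.isdir(sat_path):
        continue
    total = 0
    for root, _, files in os.walk(sat_path):
        total += sum(os.path.getsize(os.path.join(root, name)) for name in files)
    print(f"{satellite_folder}: {total / 1e9:.2f} GB")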

Really?
Yes, right now a normally running node can still work OK with a few bad pieces, because satellite checks (audits) are small and randomly picked. So if only a few pieces out of a few hundred thousand stored are bad, there is a very low probability that a satellite audit will hit one of the damaged pieces and trigger DQ. So you can keep running with minor data loss/corruption without DQ for some time.
But during graceful exit each and every piece is checked, isn’t it? (At least that is how I understood your statement that only 100% healthy nodes can complete GE; otherwise there would be a high risk of uploading corrupted data to other nodes during GE if the data were not checked for integrity before upload.)

So even a single bad piece can trigger DQ and loss of all escrow during GE, while it could go undetected by regular audits.
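
To put rough numbers on that “very low probability”, here is a tiny sketch; the piece count, number of bad pieces, and audit rate are all assumptions for illustration, not figures from the network:

# Illustration of why a handful of bad pieces rarely trips a random audit.
# All numbers are assumptions for the sake of the example.
total_pieces = 300_000          # assumed pieces stored for one satellite
bad_pieces = 10                 # assumed damaged/missing pieces
audits_per_month = 1_000        # assumed random audits in a month

p_single = bad_pieces / total_pieces
p_any_hit = 1 - (1 - p_single) ** audits_per_month
print(f"chance one audit hits a bad piece: {p_single:.5%}")
print(f"chance at least one of {audits_per_month} audits hits one: {p_any_hit:.2%}")

With these assumed numbers a single audit hits a bad piece about 0.003% of the time, and even a thousand audits catch it only a few percent of the time, whereas a full per-piece check during GE would catch it with certainty.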

That is because only a few months have passed since the current network launch (after the last data wipe), and because no real nodes have gone through GE yet.
It is not a problem right now, yes. But it will be in the future. So yes, it is not a priority now (until reaching full production stage), but I am thinking ahead.

Yes, game theory, of course!
But from a game-theory point of view, how is a cheater supposed to abuse such an option and benefit from it, if you have to pay for the repair of all lost/corrupted data whenever you trigger such a check on your side?
Yes, you can, for example, delete some of the stored data and later restore it by initiating repair.
But what is the benefit of doing that intentionally?
You cannot cheat egress traffic this way; on the contrary, you can only lose some egress traffic and the corresponding earnings, if an uplink or satellite wants data from the deleted part and the node does not have it and thus cannot send/generate egress.

You could still try to cheat on some storage payments (TB*month), but how much?
Currently it is $1.50 per TB*month. Let’s say you are a VERY lucky cheater: you manage to delete 1 TB of data, run the node with 1 TB missing for a whole year without being caught by an audit, and after that year you trigger the repair process and get the deleted data back on your node before GE.
What do you win? 1.5 * 12 = 18 USD. Just 18 bucks for such a cheat.
What do you lose? You would need to pay for the repair traffic. How much? For each missing piece the satellite needs to download 29 pieces (for the current erasure coding parameters) from other nodes, reconstruct the data segment, and apply erasure coding again to replace the missing piece. So it would be 29 TB of egress repair traffic from other nodes.
And you would be billed 29 * 10 = 290 USD for such a repair at current pricing.

So in the end you cheat and win 18 USD, but lose 290 USD. The net result is a 272 USD loss. A very badly played game, cheater!
It would probably exceed the node’s full escrow, and such a node could be DQed for losing all its escrow.
In any case there is no point in such cheating, as you always lose more than you gain and just punish yourself.

Not to mention that you bear a high risk of DQ and losing all escrow at any time while data is missing from the node. If any significant portion, like a few % or more, of the data is missing or corrupted, you will be caught by a regular audit and DQed very quickly.
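
The same arithmetic as a small sketch, using the figures quoted in this post ($1.50 per TB*month stored, a 29-piece repair expansion, and $10 per TB billed for repair egress); treat them as the post’s assumptions rather than confirmed official pricing:

# Worked version of the cheat-economics argument above.
deleted_tb = 1.0                 # data the cheater silently drops
months_undetected = 12           # months the cheat goes unnoticed
storage_price = 1.5              # USD per TB*month (figure from the post)
repair_expansion = 29            # pieces fetched per missing piece (per the post)
repair_egress_price = 10.0       # USD per TB of repair egress (assumed in the post)

gain = deleted_tb * storage_price * months_undetected          # 1.5 * 12 = 18 USD
repair_traffic_tb = deleted_tb * repair_expansion              # 29 TB
cost = repair_traffic_tb * repair_egress_price                 # 290 USD
print(f"gain: {gain:.0f} USD, repair bill: {cost:.0f} USD, net: {gain - cost:.0f} USD")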

2 Likes

I’m pretty sure the whole point of this is that people who decided to run 100+ nodes in datacenters aren’t going to be able to keep up with maintenance on all of them, so it ends up being in Storj’s favor if their nodes fail.