Graceful Exit (log inspection)

No worries, sorry for derailing the conversation.

Let me try to fix that.

I don't think there is currently anything implemented to exclude these in node selection, or better yet, to include only the IP versions that the transferring party supports (either uplink or GE node). But perhaps I overlooked something when I looked into node selection. That implementation has also recently changed significantly with the introduction of the node selection cache, and I haven't looked at it too closely since. From what I could tell, the conditions for node selection have remained the same though.

I know for sure that full nodes are explicitly excluded in node selection. Seeing that the number is relatively low, I think this is caused by a delay between the node becoming full and the satellite being updated with that info; this update doesn't happen instantly. They've already added a buffer so that selection doesn't take place if there is less than 100 MB available, but I don't think that is enough.

This one will hopefully be fixed soon by moving the used serials to RAM, but honestly I’m surprised by how low this number is considering how much it’s been talked about on the forums. Very interesting.

I have no idea how this could have happened as gracefully exited, disqualified and suspended nodes are explicitly excluded from node selection. Very weird indeed.

I responded mostly to the issues related to node selection, as I’m not entirely sure much can be done about the others and I’ve previously studied the code for node selection. My source for the info is the code itself. Conditions are listed here and contradict some of your findings, which I find quite curious. Not sure what is going on here.
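
Roughly speaking, the conditions mentioned above boil down to a filter like the sketch below. This is just an illustrative Go sketch with made-up struct fields and function names, not the actual satellite code or schema:

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of what the satellite tracks per
// node; the field names are invented for illustration.
type Node struct {
	ID             string
	Disqualified   bool
	Suspended      bool
	ExitFinished   bool // gracefully exited
	FreeSpaceBytes int64
}

// eligibleForUpload mirrors the selection conditions discussed above: DQed,
// suspended and exited nodes are excluded, and nodes reporting less than a
// small free-space buffer (100 MB here) are skipped as well.
func eligibleForUpload(n Node) bool {
	const freeSpaceBuffer = 100 * 1024 * 1024 // 100 MB
	if n.Disqualified || n.Suspended || n.ExitFinished {
		return false
	}
	return n.FreeSpaceBytes >= freeSpaceBuffer
}

func main() {
	nodes := []Node{
		{ID: "healthy", FreeSpaceBytes: 5 << 30},
		{ID: "nearly-full", FreeSpaceBytes: 50 << 20},
		{ID: "exited", ExitFinished: true, FreeSpaceBytes: 1 << 30},
	}
	for _, n := range nodes {
		fmt.Printf("%s eligible: %v\n", n.ID, eligibleForUpload(n))
	}
}
```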

Hope this made up for the slight detour earlier in the topic. :wink:

2 Likes

Some SNOs have already sent this kind of logs to support. I don't see those logs here. There is an SNO whose node got DQed 4 times on a single node. Support reset the DQ flag on that node 3 times, so the problem is not on his side.

1 Like

As far as I know they haven’t reset disqualifications in a very long time. They only did that when there was a known bug that could cause it, but that was fixed many months ago. So if it’s from that time, then I can imagine. But there is little to discuss if you can’t link to a post with more details or share your own experience.

2 Likes

You can't know anything if you are not involved with Storj Labs. You are just a regular user, aren't you?

You won't get to see my own experience. I'm not going to do a GE and put my escrow at risk with the probability of DQ that I see in chat rooms and forums.

1 Like

I think one of the key words to describe this may be 'optimization'.

100 GB/day (~9 Mbps) seems pretty slow. Decreasing the errors seen here may have a pretty good impact on increasing that rate.

1 Like

On the satellite side you have 0 failures (on the other storage nodes with the same wallet address as well).

GE worked just fine in your case.

I will skip that for the moment and respond later. Let me answer the easy questions first :slight_smile:

We are currently in contact with 1 storage node and are trying to fix graceful exit for them. It looks like the storage node has a corrupted piece but is not handling it as expected. The idea is that the storage node transfers the piece to a new storage node, the target node sends back a signed piece hash, the original storage node notices the mismatch and tells the satellite that the transfer failed. A storage node can get disqualified if it tries to call a transfer a success but the signed hash has a mismatch. It looks like something like that has happened on these 3 storage nodes. The developer team is on it and will fix it.
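
To make the expected behavior concrete, here is a rough sketch of the check the exiting node is supposed to do; the function names and data are made up for illustration and this is not the real storagenode code:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// hashPiece stands in for the piece hash the uplink originally signed and
// the hash the target node signs after receiving the transfer.
func hashPiece(data []byte) []byte {
	h := sha256.Sum256(data)
	return h[:]
}

func main() {
	// Hash the uplink originally signed for this piece.
	originalHash := hashPiece([]byte("original piece content"))

	// What the exiting node actually has on disk (corrupted in this example),
	// and therefore what the target node receives, hashes, and signs.
	targetHash := hashPiece([]byte("original piece c0ntent"))

	if !bytes.Equal(originalHash, targetHash) {
		// The exiting node must report this transfer as failed to the
		// satellite. Reporting a success with a mismatching signed hash is
		// what can get a node disqualified.
		fmt.Println("hash mismatch: report transfer as failed")
		return
	}
	fmt.Println("hashes match: report transfer as success")
}
```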

Nevertheless, 185 storage nodes finished graceful exit without any problems. Only 23 failed graceful exit, most of them because of a high error rate or because they went offline. From what I can see, 22 storage nodes failed graceful exit for a good reason, and the one that didn't is in contact with us.

What we should keep in mind is that these 22 storage nodes will claim that the storage node was perfectly fine even if they initiated GE after deleting data. When money is on the table, there is a good reason to lie.

This only means suspension mode is removing them from node selection. One other factor is the overall load on the network. I would expect a higher error rate in times with lots of upload or download traffic; that will drive some nodes to their current limits. Hopefully, the fix will increase the limit.

The overall failure rate for GE is currently ~11% (23 failures vs. 185 successes). If you haven't dropped any data and keep your node online until graceful exit finishes, the chance of failing GE is currently only ~0.5% (that one bug that will be fixed soon). So even with that one bug in mind, there is a 99.5% chance that you will get your held-back amount.
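
For reference, the rough arithmetic behind those percentages (a sketch based only on the numbers above):

$$
\frac{23}{23+185} \approx 0.11 \quad \text{(overall failure rate)}, \qquad
\frac{1}{23+185} \approx 0.005 \quad \text{(only the one known bug)}
$$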

Just don't lose data, don't play around with the graceful exit settings unless you know how to track the failure rate, and don't go offline before it's finished.

6 Likes

Please open a GitHub issue for that one. There is a good chance that the node selection cache is calling a different code path. Let's double-check that. If you don't want to publish the node ID, then please send a PM and I will create a ticket in our private Jira.

1 Like

OK, you've reassured me. Maybe I'll try it.

The GE node tells the satellite that it wants to exit. The satellite will respond with "give me a few minutes to find out which pieces you have to transfer", and the storage node will ask again every 15 minutes.

A few hours later the satellite starts the next metainfo loop. That is one loop that goes through all the pointers. It feeds the audit reservoir sampling, repair checker, tally, GE, and GC jobs with data. The GE job writes all the pieces the storage node is currently holding to a queue. The queue is ordered by segment health: pieces that are close to the repair threshold get a higher priority. On the other end, the GE node doesn't need to transfer pieces with more than 100% segment health. That can happen because the repair job has no long tail cancelation but uploads a few bonus pieces to compensate for possible upload errors.
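
A small Go sketch of that ordering (illustrative only; the real GE queue lives in the satellite database and the type below is invented):

```go
package main

import (
	"fmt"
	"sort"
)

// transferItem is an invented type for illustration.
type transferItem struct {
	PieceID       string
	SegmentHealth float64 // 1.0 == 100% of the target redundancy
}

// buildQueue skips pieces that are already above 100% segment health and
// hands out the rest lowest-health first, i.e. pieces closest to the repair
// threshold get the highest priority.
func buildQueue(pieces []transferItem) []transferItem {
	queue := make([]transferItem, 0, len(pieces))
	for _, p := range pieces {
		if p.SegmentHealth > 1.0 {
			continue // enough redundancy already, no need to transfer
		}
		queue = append(queue, p)
	}
	sort.Slice(queue, func(i, j int) bool {
		return queue[i].SegmentHealth < queue[j].SegmentHealth
	})
	return queue
}

func main() {
	pieces := []transferItem{
		{PieceID: "a", SegmentHealth: 1.05}, // over-provisioned by repair, skipped
		{PieceID: "b", SegmentHealth: 0.62}, // close to the repair threshold
		{PieceID: "c", SegmentHealth: 0.90},
	}
	for _, p := range buildQueue(pieces) {
		fmt.Println(p.PieceID, p.SegmentHealth)
	}
}
```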

After 15 minutes the GE node contacts the satellite again. This time the satellite will return 500 orders. Each order has a specific piece ID and the target node. The GE node will try to upload all 500 pieces. As soon as that work is done, the GE node will contact the satellite again and submit the 500 results.

For a successful upload, the GE node has to submit the signed piece hash from the target node, the order, and the original signed piece hash from the uplink side. The satellite will check that the uplink-signed piece hash matches the target-signed piece hash. The satellite will also check that both signatures are correct and not manipulated. On success, the entry is removed from the GE queue.
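
The shape of that satellite-side check looks roughly like the sketch below. This is illustrative only: the real network does not use raw ed25519 keys like this, but the idea is the same, both hashes must match and both signatures must verify against the expected signers:

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

// verifyTransfer is an invented helper: accept the transfer only if the
// uplink-signed hash matches the target-signed hash and both signatures are
// valid and unmanipulated.
func verifyTransfer(
	uplinkPub, targetPub ed25519.PublicKey,
	uplinkHash, uplinkSig, targetHash, targetSig []byte,
) bool {
	if !bytes.Equal(uplinkHash, targetHash) {
		return false // the target node did not receive the original data
	}
	if !ed25519.Verify(uplinkPub, uplinkHash, uplinkSig) {
		return false // uplink signature missing or manipulated
	}
	if !ed25519.Verify(targetPub, targetHash, targetSig) {
		return false // target node signature missing or manipulated
	}
	return true // success: the entry can be removed from the GE queue
}

func main() {
	uplinkPub, uplinkPriv, _ := ed25519.GenerateKey(rand.Reader)
	targetPub, targetPriv, _ := ed25519.GenerateKey(rand.Reader)

	pieceHash := []byte("stand-in for a real piece hash")
	uplinkSig := ed25519.Sign(uplinkPriv, pieceHash)
	targetSig := ed25519.Sign(targetPriv, pieceHash)

	fmt.Println("transfer accepted:",
		verifyTransfer(uplinkPub, targetPub, pieceHash, uplinkSig, pieceHash, targetSig))
}
```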

For a failed upload, the storage node has to contact the satellite as well. First of all, the satellite will always check if the piece was deleted or touched by repair; it will ignore those failures. For all other failures, the satellite will increase a retry counter in the GE queue, select a new target node, and return the new order in the next batch of 500.

If the counter reaches 5, the satellite will count that as a failed piece and remove it from the queue.
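
In code, the failure handling described above might look like this (an illustrative sketch; the real GE queue is a satellite database table and these names are made up):

```go
package main

import "fmt"

// queueEntry is an invented type for illustration.
type queueEntry struct {
	PieceID    string
	RetryCount int
	Failed     bool
}

const maxPieceRetries = 5

// handleFailure ignores failures caused by deletes or repair, bumps the
// retry counter otherwise, and marks the piece as failed after 5 attempts.
func handleFailure(e *queueEntry, pieceDeleted, touchedByRepair bool) {
	if pieceDeleted || touchedByRepair {
		return // not the node's fault, the satellite ignores this failure
	}
	e.RetryCount++
	if e.RetryCount >= maxPieceRetries {
		e.Failed = true // counts against the node's GE failure rate
		return
	}
	// Otherwise: a new target node is selected and the order is re-issued
	// in the next batch of 500.
}

func main() {
	e := queueEntry{PieceID: "piece-1"}
	for i := 0; i < 5; i++ {
		handleFailure(&e, false, false)
	}
	fmt.Printf("retries=%d failed=%v\n", e.RetryCount, e.Failed)
}
```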

There is a third outcome. A GE node might restart and forget which pieces it is currently transferring, so for some pieces no result comes back. The satellite has a separate counter for that and allows 10 retries.

At a 10% failure rate, GE will fail. There is also a deadline of 7 days to prevent nodes from going offline for a few months and later transferring the few pieces that didn't get moved by repair.
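
Putting the thresholds together, a rough sketch of the final pass/fail decision (illustrative only; whether the rate is computed over all pieces or only attempted ones is an assumption here, and the helper is made up):

```go
package main

import (
	"fmt"
	"time"
)

const (
	maxFailureRate = 0.10               // GE fails at a 10% piece failure rate
	exitDeadline   = 7 * 24 * time.Hour // and has to finish within 7 days
)

// exitFailed returns true if the exit should be counted as failed.
func exitFailed(failedPieces, totalPieces int, started, now time.Time) bool {
	if now.Sub(started) > exitDeadline {
		return true // ran out of time
	}
	if totalPieces == 0 {
		return false
	}
	return float64(failedPieces)/float64(totalPieces) >= maxFailureRate
}

func main() {
	started := time.Now().Add(-3 * 24 * time.Hour)
	fmt.Println("failed:", exitFailed(8, 100, started, time.Now()))  // 8% -> still fine
	fmt.Println("failed:", exitFailed(12, 100, started, time.Now())) // 12% -> GE fails
}
```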

Any questions? (I might not respond for a few days.)

9 Likes

I will send all the detailed information to you via PM.

Thanks a lot!
This information is exactly what I was looking for!

So, after reviewing my collected errors, I can say: my table and graph only show the status (errors) of remote nodes during GE and how often these errors occur on remote nodes (something like a remote storage node "weather report"). These errors don't mean that the storage node failed the transfer of the piece, because the satellite already has a mechanism for handling almost all of the bad situations.

I have just one simple question: do we have a native way to store logs in InfluxDB? (I would like to improve the quality of my analysis and automate manual tasks.) If yes, could you please point me in the right direction? If we don't have a native way, I will move forward with standard tools like Telegraf.

While collecting the detailed information, I realized that I checked the status on the wrong storage node, and the storage node specified in the error message did not do GE on the Stefan satellite.
It was my epic fail, I apologize for that :man_facepalming:

3 Likes

Are you sure you can consider these attempt failures independent?

Yes. New nodes are selected for each subsequent attempt. The only way several attempts could hit the same problem in a row is if the problem is on the GE node's end, or bad luck. The former would be a rightful failure, and the latter is covered by the chance you quoted.
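
One way to see why the independence assumption matters: each attempt goes to a freshly selected target node, and a piece only counts as failed after 5 attempts in a row fail. So with an illustrative per-attempt failure probability $p$ (not a measured value), the chance of a single piece failing for reasons outside the GE node's control is roughly

$$
P(\text{piece fails}) \approx p^{5},
$$

which for $p = 0.1$ is already only $10^{-5}$ per piece.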

1 Like

Stats of my GE.


Nice work dissecting the logs…

I suppose that is what Tardigrade has to deal with just to run… as SNOs we don't really see that aspect of the network much… all those errors do seem scary, but if it all worked nearly flawlessly anyway, isn't that a testament to the durability of the process?

I'm not saying it either way… because there is just way too much information and way too many unknowns for me to process right now…

@Krey just because a node appears to work perfectly doesn't mean the node was/is even still alive… it should be very possible to have zombie nodes that survive for extended periods; however, when such a zombie node attempts GE, it is signing its own death sentence because all the data will be run through.

So now that we can all see that GE is not dangerous and the probability of failing GE is fairly low, can you @Odmin maybe change the thread title to something "less scary" for people that only read headlines?

5 Likes

You see many similar errors on the uplink side. But the network is built to work with untrusted nodes. That means independent vetting by each satellite before they get too much data, but also dealing with misconfigured or misbehaving nodes. Everything is over-provisioned, in ways similar to this GE process. The difference is that the uplink uses RS encoding to start more uploads than it needs, all at the same time, to keep the highest possible speeds. That can't really be done with GE, since you're looking to transfer only the pieces on one node.
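
For illustration, long tail cancelation on the uplink side looks conceptually like the toy sketch below; the numbers are made up and this is not the actual uplink code:

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// uploadSegment starts more piece uploads than strictly needed, waits for
// the fastest `needed` of them, and cancels the slow remainder (the "long
// tail"). GE can't do this because it has to move the specific pieces held
// by one node.
func uploadSegment(ctx context.Context, started, needed int) int {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels whatever is still running once we return

	done := make(chan int, started)
	for i := 0; i < started; i++ {
		go func(id int) {
			select {
			case <-time.After(time.Duration(rand.Intn(200)) * time.Millisecond):
				done <- id // this upload finished in time
			case <-ctx.Done():
				// this upload was part of the long tail and got canceled
			}
		}(i)
	}

	finished := 0
	for finished < needed {
		<-done
		finished++
	}
	return finished
}

func main() {
	// e.g. start 10 uploads but only wait for the fastest 6
	fmt.Println("pieces stored:", uploadSegment(context.Background(), 10, 6))
}
```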

1 Like

You are right, I lacked information about the GE process in my initial analysis, but thanks @littleskunk for explaining how it works. I changed the topic title and removed "Dangerous" from the initial posts, because GE already has safety mechanisms to handle these errors.

4 Likes
