Graceful Exit (log inspection)

I have many ERROR gracefulexit:chore unable to send failure. {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW"}

Is it a normal error or should I be worried?

There should be another error above the “unable to send failure” line that says what the failure was. (Yes, that’s a pretty useless log line.)

It could be a failure to read a piece from local storage, a failure to validate a signature or hash on the piece (indicating corruption), a failure to send the piece to its destination storage node, or a bad signature from the destination storage node.
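To make that list of failure modes concrete, here is a minimal, hypothetical Go sketch (not the actual storagenode code; all helper functions are stand-in stubs) showing the stages at which a single graceful-exit transfer can fail, and why the real cause gets logged just before the generic “unable to send failure” line:

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in stubs; in the real node these would touch local piece storage,
// verify the piece hash/signature, and talk to the destination storage node.
func readPiece(id string) ([]byte, error)           { return []byte("piece data"), nil }
func verifyPieceHash(data []byte) error              { return nil }
func sendToDestination(data []byte) ([]byte, error)  { return []byte("dest-sig"), nil }
func verifyDestinationSignature(sig []byte) error    { return errors.New("signature did not verify") }

// transferPiece wraps each stage's error so the underlying cause stays visible.
func transferPiece(id string) error {
	data, err := readPiece(id)
	if err != nil {
		return fmt.Errorf("reading piece from local storage: %w", err)
	}
	if err := verifyPieceHash(data); err != nil {
		return fmt.Errorf("piece failed hash/signature check (corruption): %w", err)
	}
	sig, err := sendToDestination(data)
	if err != nil {
		return fmt.Errorf("sending piece to destination node: %w", err)
	}
	if err := verifyDestinationSignature(sig); err != nil {
		return fmt.Errorf("bad signature from destination node: %w", err)
	}
	return nil
}

func main() {
	if err := transferPiece("example-piece"); err != nil {
		// The specific cause is logged first...
		fmt.Println("ERROR", err)
		// ...followed by the generic line you are seeing.
		fmt.Println("ERROR gracefulexit:chore unable to send failure.")
	}
}
```

Again, this only illustrates the error-wrapping pattern; the exact wording in your version of storagenode may differ.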

Thanks for the answer. There are so many errors (most of them already discussed above) that I haven’t checked the logs for a while.
I just checked the exit status… and it is progressing. Around 10% after 2 days!

1 Like

Here to report that my GE successfully completed for stefan b :+1:

6 Likes

I tried it, and I stand by the position I voiced earlier: GE is broken through and through.
@littleskunk’s correction regarding the grouping of errors by chunks, although important, is quantitative in nature,
while the problem pointed out in the first post by @Odmin is qualitative.
The satellite really can and does DQ your node for errors that are not caused by your storage node.

2 Likes

Getting back to this topic after too long a delay, sorry!

What errors do you mean?

TOTAL ERRORS: 45946
reset by peer: 29781
not enough available disk space: 2006
context deadline exceeded: 5622
context canceled: 536
filestore error: 986
storage node overloaded: 1210
dial tcp error: 800
unable to send failure: 4321
database is locked: 72
broken pipe: 80
no space left on device: 288
A device which does not exist was specified: 2
cannot find the path specified: 2
out of space: 6
order created too far in the future: 6
use of closed network connection: 32
EOF: 711

This is from one node. In total, two nodes were DQed, for $277.88.
All with default settings. My other nodes completed GE successfully with more aggressive settings.

I expected this result, so I do not understand why it raises questions from you and why you are asking me about the errors.

When it comes to money, you, as the debtor, should show me why you are not paying the money I earned (what I did wrong as an SNO, or where, when, and how often my node failed to provide the service), and not just silently DQ it. Instead, Storj Labs keeps several months of my money because someone else on the network did not have enough space to receive the data or dropped the connection.

Reset by peer: 29781
The main error in my report means that SOHO routers, Raspberry Pis, or other similar gear cannot cope with the traffic on your network. Why should I be responsible for that with my money?!
I provide server-grade hardware to the Storj network: Intel Xeon on my routers, with network adapters ranging from the i210 to the X540.

Could you please clarify which version of storagenode this is? Or is it an old stat?

This is from open ticket 5950.
On 2020-07-27 you suggested stopping GE, and I have not run GE since.

1 Like

Storj does not disqualify nodes for seeing any of the above errors. Storj disqualifies nodes if they have significantly more errors when transferring pieces than other storage nodes, since that should only happen if they have not correctly stored the pieces they are responsible for.

I asked you for clarification because you made a statement which I believe to be untrue, and I wanted to understand why you thought that.

Yes, it would be helpful if we could provide more information to node operators when a disqualification is done. We’re continuing to try to make the whole process more friendly. But it is not correct to say that Storj Labs is keeping your money because someone else on the network did not have enough space. If your node encounters that error when transferring a piece, it is given a different target node. If that node fails, you get another. A node currently gets five tries in total for each piece that it needs to transfer. And even then, if all five transfers fail, it is possible for your graceful exit overall to succeed. The graceful exit only fails when a significant percentage of your pieces fail all five tries.
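As a rough illustration of that retry-then-threshold logic, here is a hypothetical Go sketch (not the real storagenode/satellite code; the 10% threshold below is made up purely for the example, and the source only states that 5 attempts are made per piece):

```go
package main

import "fmt"

const (
	maxAttemptsPerPiece = 5    // each attempt goes to a different target node
	maxFailedFraction   = 0.10 // hypothetical threshold; the real value is not shown here
)

// tryTransfer is a stand-in for one attempt to hand a piece to one target node.
// In reality this can fail with "connection reset by peer",
// "not enough available disk space", timeouts, and so on.
func tryTransfer(pieceID string, attempt int) error {
	return nil
}

func gracefulExit(pieceIDs []string) (succeeded bool, failedPieces int) {
	for _, id := range pieceIDs {
		ok := false
		for attempt := 1; attempt <= maxAttemptsPerPiece; attempt++ {
			if err := tryTransfer(id, attempt); err == nil {
				ok = true
				break // one successful attempt is enough for this piece
			}
		}
		if !ok {
			failedPieces++ // counted only after all attempts to different nodes failed
		}
	}
	failedFraction := float64(failedPieces) / float64(len(pieceIDs))
	return failedFraction <= maxFailedFraction, failedPieces
}

func main() {
	pieces := []string{"p1", "p2", "p3"}
	ok, failed := gracefulExit(pieces)
	fmt.Printf("exit succeeded: %v, pieces failed after all retries: %d\n", ok, failed)
}
```

The point of the sketch is only that individual transfer errors are absorbed by the retries; what decides the outcome is the fraction of pieces that fail every attempt.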

Server grade hardware sounds fine! I’m not sure how your routers or their network adapters come into play, but they sound fine as far as that goes. However, if most nodes attempting graceful exit succeed, and yours does not, that does not indicate that the problem is the rest of the network.

I have no other errors for you. :slight_smile:
These are all the errors; the next message after the last error is about the DQ. If these errors are not the reason for the DQ, please tell me why my nodes were disqualified.
And zfs shows 0 read, write, or checksum errors.

1 Like

We are talking about millions of pieces, so 5 tries does not look like a good reason to DQ my node if it is not the source of a single one of those failures.

Among these 50,000 errors I did not find any error caused by my node.

Again, the type of error is not significant. The percentage of pieces that your node could not transfer is significant. Your node (or nodes?) had a far higher percentage of pieces that it could not transfer, even with five tries to different nodes, than other gracefully-exiting nodes usually get.

I don’t know why that was the case. Possibly your network connection was flaky or did not have enough bandwidth. Possibly your hard drive or drives were too slow.

And again, nodes are not disqualified after a single piece cannot be transferred within 5 tries. They are disqualified after MANY pieces cannot be transferred, each one getting 5 tries. Other nodes do not fail to transfer pieces at such a high rate.

That is not correct. I see 2 errors in that list that do not look like issues on the target node.

That looks like a connection problem.

What do you mean? Are you suggesting that we do disqualify nodes for seeing those errors?

I am not sure about this specific case here but in general yes.

I have tested this with storj-sim. All the other error messages in the list are returned from the target node. Even with a 50% error rate, the overall failure rate would be 0.5^5 ≈ 3%. That means you can’t get DQed for that.
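Spelling out that arithmetic (assuming each of the 5 retries goes to an independent target node with the same per-attempt failure probability $p$):

$$P(\text{piece fails all retries}) = p^{5}, \qquad 0.5^{5} = 0.03125 \approx 3\%$$

So even a very high per-attempt error rate caused by target nodes shrinks to a small per-piece failure rate after the retries.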

2 errors in that list are not from the target node and indicate an issue on the storage node side itself. The router might have been overloaded: too many DNS requests or too many open connections. I can’t say that for sure just based on a list of errors; in the raw log file it would be more obvious whether the storage node was getting into this situation.

I’m not making any claim about where the errors come from. I’m saying that reporting any of those errors will not get a node disqualified, whether the error came from the target node or the exiting node. Disqualifications only result when a large percentage of pieces could not be transferred, no matter what the errors were.

Not on my side.
These two nodes exited with default settings. All the other nodes I ran GE on exited successfully with 5 workers and 50 concurrent transfers, which should place a much greater load on my equipment.

I can put forward a theory as to why I got DQed: at the time of my exit there were no more normal nodes left on the stefan-benten satellite network. Everybody else had left before me.

Your error list does not show that. You got only a few errors that we could attribute to a bad target node, and with 5 retries those are not going to increase the failure counter.

This is what you got DQed for, because even with 5 retries it might have increased the failure counter. This error is not from the target node.

Unfortunately, this is of great importance, because you can fail transfers if the remote side has a problem with its network (cheap router, a lot of DNS requests, overloaded network, overloaded node).