Graceful Exit (log inspection)

Even if 50% of the target nodes are overloaded, graceful exit will still finish with a failure rate of only about 3% (0.5^5 ≈ 3.1%).
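
A quick check of that arithmetic, as a minimal sketch assuming each of the 5 attempts independently hits an overloaded target with probability 0.5:

```python
# Probability that a single piece fails all 5 transfer attempts,
# assuming each attempt independently hits an overloaded target (p = 0.5).
p_attempt_fails = 0.5
attempts = 5
p_piece_fails = p_attempt_fails ** attempts
print(f"{p_piece_fails:.1%}")  # 3.1%
```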

The root cause must be on the storage node side and not on the target node.

1 Like

Of course! That’s why we give more tries with other target nodes.

From the perspective of the distributed network, we cannot trust the text of an error message. That can easily be faked. When a piece is successfully transferred and the target node has signed an acknowledgement of that, only then can the satellite verify that the exiting storage node actually had the piece it was responsible for.
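
A minimal sketch of that idea, not the actual satellite code; the names `ack`, `verify_signature`, and the fields used here are hypothetical and only illustrate why a signed acknowledgement is stronger evidence than an error string:

```python
# Hypothetical illustration: the satellite only counts a transfer as successful
# when the target node's signed acknowledgement checks out; it never relies on
# the exiting node's own error text, which could be fabricated.
def transfer_is_proven(ack, expected_piece_hash, target_node_pubkey, verify_signature):
    # verify_signature(public_key, message, signature) -> bool is assumed to be
    # an ordinary signature verifier provided elsewhere.
    if not verify_signature(target_node_pubkey, ack.piece_hash, ack.signature):
        return False  # the acknowledgement was not really signed by the target node
    # Matching the expected hash shows the exiting node actually held the piece.
    return ack.piece_hash == expected_piece_hash
```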

Please look at the changes on GitHub and search for “gracefulexit”, especially this commit.
I will not comment here… just look at what you find.

1 Like

Yes, we increased the timeout on the satellite side, because it was happening too frequently that nodes could not transfer pieces in time. That problem was happening with most exiting nodes. That is why we knew it was a problem on the satellite side and not the node side.

1 Like

Why must “reset by peer” necessarily be on my side? A connection has two sides.

1 Like

Because of the combination with:

Your storage node was unable to send the failure back to the satellite. That is also a connection with two ends, and guess which end was failing here.

1 Like

Let me just remind you: this is two-way trust, so if you don’t trust us… then every SNO can ask the same of you:

So we should make a fair judgment; only in this way can we build long-term cooperation.

In this case, I think @Krey already provided a full log to support, but you can ask for it again if needed.

1 Like

As an example:

2020-07-17T01:09:24.841+0300    INFO    piecestore      uploaded        {"Piece ID": "P3W3LYGPH6K2UWBHKE6V3U3GSSD67VFVR3BLDPZ2QNCYAIAPKP3Q", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT"}
2020-07-17T01:09:24.842+0300    INFO    piecestore      uploaded        {"Piece ID": "WKJC3EYWNLNTQ34K35BZ2WMF7YYS37MFNVCYCXLP4WWN7WPEG64A", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT"}
2020-07-17T01:09:24.866+0300    INFO    piecestore      upload started  {"Piece ID": "ZFSGHXIKJE3T5GNQDXZHWG5G7ROGJVFKBQESQD3GWLUZAG3Q4KAQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT", "Availa>
2020-07-17T01:09:24.870+0300    INFO    piecestore      upload started  {"Piece ID": "4FE6RT76QKRGCVPBN25QYBC6RJKLDI6Z5AGSE3AA5D3RQIV7DFVQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT", "Availa>
2020-07-17T01:09:24.915+0300    INFO    piecestore      upload canceled {"Piece ID": "ZFSGHXIKJE3T5GNQDXZHWG5G7ROGJVFKBQESQD3GWLUZAG3Q4KAQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT"}
2020-07-17T01:09:24.973+0300    INFO    piecestore      uploaded        {"Piece ID": "4FE6RT76QKRGCVPBN25QYBC6RJKLDI6Z5AGSE3AA5D3RQIV7DFVQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT"}
2020-07-17T01:09:25.366+0300    INFO    piecestore      downloaded      {"Piece ID": "4632IALML7LUKVKM2H5JUX5QUZAT3XXLFVSUJDMTDIHV65UTQ3TQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET"}
2020-07-17T01:09:25.374+0300    INFO    piecestore      upload started  {"Piece ID": "67Q5IM4KZMSF2UKCIIXOMDSLVC2GPSDXBPWLCIPKEZT5J3K4NCCQ", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT", "Availa>
2020-07-17T01:09:25.543+0300    INFO    piecestore      uploaded        {"Piece ID": "6HCIGBYCH54DU7GB63BLOQVLAAB5GZCZMRYJI7W3YZITGKPBCUGA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT"}
2020-07-17T01:09:25.768+0300    INFO    gracefulexit:chore      piece transferred to new storagenode    {"Storagenode ID": "1kSgMvcvNp21C5kwjzaxacdWBXAnxiVsSFS5PVRZFYSY8pNwhE", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtS>
2020-07-17T01:09:25.769+0300    ERROR   gracefulexit:chore      failed to transfer piece.       {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "error": "write tcp 192.168.156.4:38344->78.94.240.189:7777: write: c>
2020-07-17T01:09:26.613+0300    INFO    piecestore      download started        {"Piece ID": "XHJZ4C3EAEFM6EKNPJIFCWJGMX3YNRWSCC5VAROPKVLOBOACQK2A", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_R>
2020-07-17T01:09:26.650+0300    DEBUG   gracefulexit:chore      Failed uploading piece to node  {"Piece ID": "UQXHXRWGXTOM6BOJCIULKZXBD6CRSIJAN5FS4MKLR22JDAATJOUA", "Node ID": "1ZPDU1NJm9yEYV9Q1aKwDPQyRBEJbUZRoxshgZPCfrfJ1STQpk", "Node >
2020-07-17T01:09:26.650+0300    ERROR   gracefulexit:chore      failed to put piece.    {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Piece ID": "TPCFKYU6KNY2LZB7R6PBKBOU2HB7NT73UDSHTHZ6CYRILR3R5MUQ", "error": >
2020-07-17T01:09:26.650+0300    ERROR   gracefulexit:chore      unable to send failure. {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW"}
2020-07-17T01:09:26.650+0300    ERROR   gracefulexit:chore      failed to transfer piece.       {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "error": "protocol: expected piece hash; context deadline exceeded", >
2020-07-17T01:09:26.667+0300    INFO    piecestore      upload started  {"Piece ID": "74PR6TXK7IMYBXMOMO2YH73RKQX3BTIE3X2GBKQU4XN6HLDLUB3A", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT", "Availa>
2020-07-17T01:09:26.957+0300    INFO    piecestore      downloaded      {"Piece ID": "XHJZ4C3EAEFM6EKNPJIFCWJGMX3YNRWSCC5VAROPKVLOBOACQK2A", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_REPAIR"}
2020-07-17T01:09:27.762+0300    INFO    piecestore      upload started  {"Piece ID": "XD6XN56NVBGPG2DESC43NAWQOYK2V5THUEWRZVEKBTAGBH2XA5UQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT", "Availa>
2020-07-17T01:09:27.799+0300    INFO    piecestore      uploaded        {"Piece ID": "XD6XN56NVBGPG2DESC43NAWQOYK2V5THUEWRZVEKBTAGBH2XA5UQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT"}
2020-07-17T01:09:27.854+0300    INFO    piecestore      upload started  {"Piece ID": "V2F345GWJPTKPPWXQIQVMVREMCNRUP6PKH4JJOMCSB7DWI6SJ6MQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Availa>
2020-07-17T01:09:27.940+0300    INFO    piecestore      uploaded        {"Piece ID": "67Q5IM4KZMSF2UKCIIXOMDSLVC2GPSDXBPWLCIPKEZT5J3K4NCCQ", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT"}
2020-07-17T01:09:28.027+0300    INFO    piecestore      upload started  {"Piece ID": "LCEUBXNFKTB4WAIWFXPXI4IDDYURWMN4PG6BD3745NLB36UXNXXA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT", "Availa>
2020-07-17T01:09:28.053+0300    INFO    piecestore      uploaded        {"Piece ID": "LCEUBXNFKTB4WAIWFXPXI4IDDYURWMN4PG6BD3745NLB36UXNXXA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT"}
2 Likes

Yes, of course, that’s how distributed networks work. Participants don’t trust each other implicitly. Storj Labs has to build up a reputation that it will pay for storage as agreed. If you feel that this thread is evidence that we don’t pay as agreed, you should remove Tardigrade satellites from your trusted list.

I’m not trying to diagnose this case. I don’t know anything about the specific case. I was only responding to the claim that Storj Labs disqualifies nodes for errors out of the node’s control.

1 Like

This makes me believe something is unable to handle the connection count or DNS queries.

To be fair: yes, the storage node should be able to handle it. I am not 100% sure why this is happening. I am sure it is not a problem with the target node, because after 5 retries you would still finish graceful exit successfully. There must be an issue on your side, and it would be nice if we could implement some kind of workaround for that situation. I don’t know much more. I only see this unusual pattern in a few log files.

1 Like

Let me explain: I created this thread for log inspection during Graceful Exit (Graceful Exit (Log inspection)). In this thread we are discussing errors in the log during GE and how to prevent unfair DQ in the future.
Can I ask you, are you interested in improving this process?
At the moment we have a lack of transparency here; it is very hard for either side to prove the truth. What do you think about it, and how can we improve it?

PS.
Here is the original reason:

1 Like

We can see a “write tcp” error on any node at any time. Logs contain tons of them on any single host with a single node, whether mine or on cloud dedicated servers.
Following your example, I see one point of failure here, and it is not the server but your software.

2 Likes

My opinion: responding while ignoring the details is not fair. We are trying here to find the root cause of unfair DQ and how to prevent it. As evidence we are analyzing the logs; that is everything an SNO can do on the storage node side.

2 Likes

For a moment I was hoping that we would try to find the root cause.

Ok, then I have to agree with @thepaul. If you are saying it must be on the target node side, then I have to say that after 5 retries that point is irrelevant and you don’t get DQed for that.

2 Likes

I’m sorry if I’ve added too much noise. Maybe it can be split out to a separate topic? I don’t know how that works.

But I couldn’t let a claim like “Really satellite can and do DQ you node for errors not caused by you storage node.” sit uncontested. The Storj network is not perfect and has room to improve, but I would prefer that we have an understanding of what those problems are and what they aren’t.

I am extremely interested in preventing unfair DQs. I’ve done a lot of work to that end already. The desire to avoid unfair DQ is why we allow so many retries and allow such a high percentage of failed pieces (where the exiting node failed all five times to send a piece).
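
A rough sketch of that accounting; the 10% threshold below is only an assumed placeholder, since the real satellite limit is not stated in this thread:

```python
MAX_ATTEMPTS = 5          # retries per piece, as described in this thread
FAILURE_THRESHOLD = 0.10  # assumed placeholder, not the real satellite setting

def graceful_exit_failed(failed_attempts_per_piece):
    """failed_attempts_per_piece: number of failed transfer attempts for each piece."""
    permanently_failed = sum(1 for n in failed_attempts_per_piece if n >= MAX_ATTEMPTS)
    return permanently_failed / len(failed_attempts_per_piece) > FAILURE_THRESHOLD
```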

Very well, I apologize for taking the discussion in another direction. I’ll try to contribute to the original discussion.

It is very well understood on the satellite side that that error can and will happen a lot on an unreliable network like the Internet. Seeing that error at a certain rate is entirely normal. It can be caused by a problem on the local side, or the remote side, or many points in between. The problem only comes when your node sees far more of that error than other nodes do. That is what suggests that there is a problem that is specific to your node or your network connection.

4 Likes

Ok, I will put together a more robust log analyzer and count all attempts and pieces.
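
A minimal sketch of such a counter, assuming the plain-text storagenode log format quoted above; the file name is hypothetical:

```python
import re
from collections import Counter, defaultdict

# Matches storagenode log lines like the ones quoted above, e.g.
# 2020-07-17T01:09:26.650+0300  ERROR  gracefulexit:chore  failed to put piece.  {"Satellite ID": ...}
LINE = re.compile(r'^\S+\s+(\w+)\s+gracefulexit:chore\s+(.*?)\s+(\{.*)$')
PIECE = re.compile(r'"Piece ID":\s*"([A-Z0-9]+)"')

error_types = Counter()
failures_per_piece = defaultdict(int)

with open("node.log") as f:                    # hypothetical file name
    for line in f:
        m = LINE.match(line)
        if not m:
            continue
        level, message, fields = m.groups()
        if level != "ERROR":
            continue
        error_types[message] += 1
        piece = PIECE.search(fields)           # works even on truncated JSON
        if piece:
            failures_per_piece[piece.group(1)] += 1

print(error_types.most_common())
print("pieces with 5 or more failed transfers:",
      sum(1 for n in failures_per_piece.values() if n >= 5))
```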

2 Likes

But does it make sense to do it at all? Stefan is shut down; in a couple of days you will clear everything, and even if I do an additional analysis, you will not be able to finally sort out my problem, will you?

To clarify, money is not important to me; I am quite happy with the payments from benten. I just want to put an end to the question of who is to blame for my two DQs. Like in one of the very first topics on this forum, when I (with great help from @3bl3gamer) proved that the disqualification was not my fault.

From my point of view this is obvious: the network continued to work, as can be seen from the logs, since there are successful uploads and downloads. It only remains to figure out exactly which errors influenced the DQ counter and what their source is. I will continue to deal with this issue; will you?

Added: Oh, I see I have already received compensation. Big thanks.
I will continue to look at this problem, as I said, in my logs.

1 Like

I would blame your hardware for it, and I would like to find a way for the storage node to handle it so that it doesn’t fail GE.

This is obviously a wrong observation. The storage node has been DQed for connection issues. There is a difference between a graceful exit transfer and a customer upload or download. A problem with DNS queries could make your node fail graceful exit transfers while you would still see successful uploads and downloads. (I don’t expect a DNS issue here; I would expect different error messages in that case. I am only pointing out that the observation “all fine because uploads and downloads” is insufficient.)

If you still believe that the issue must be on the target node, then I would say we don’t need to continue. What we would like to focus on is the question of what is happening on your system that can cause connection issues. We would like to implement a fix for it.

2 Likes

My two DQs happened on two different hosts, OSes, locations, and providers. One was on a 100 Mbit connection, the other on 500 Mbit or 1 Gbps. I still need to clarify which of the two providers at the second location the node was using.

1 Like

I originally intended to cut the GE process out of the log files from the cloud servers and aggregate the errors by type. But from the storagenode logs it is difficult to frame the start and end of the GE process. Here is the proof:

So, I searched for all the log files where GE was started.

This log is from an OVH cloud dedicated server. We can see 393 “unable to send failure” errors.

Yes, there are an order of magnitude fewer of them with a successful GE than with my failed GE. But what is important is the very fact that there are hundreds of such errors on cloud server infrastructure.
If this infrastructure is not good enough for your software, isn’t the software the problem?

If I do the same analysis of the logs from a successful GE on my hardware, I will most likely get a similar number of these errors. And if two different setups show their presence, maybe it is not a host or network infrastructure problem?
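
A small sketch of that comparison, assuming two logs in the same format as above; the file names are hypothetical:

```python
# Compare the raw volume of "unable to send failure" errors between a failed
# and a successful graceful exit log.
def count_unable_to_send(path):
    with open(path) as f:
        return sum(1 for line in f
                   if "gracefulexit" in line and "unable to send failure" in line)

for log in ("failed_ge_node.log", "successful_ge_node.log"):   # hypothetical names
    print(log, count_unable_to_send(log))
```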

This is one more proof of my assertion that the success or failure of the exit is a quantitative and not a qualitative problem.

Maybe a bunch of similar proofs will someday make you come down from the mountain of your confidence and face reality, and finally rebuild the GE process so that it does not just count the percentage of errors but tries to understand their nature.

1 Like