Disqualification during Graceful Exit

raftoral · September 22, 2023, 3:37pm

Hi:

I’ve just realized that, a few days after I started Graceful Exit in July, my node was disqualified on the Saltlake satellite, and I don’t know why. Could anyone help me?

Besides, I would like to finalize the Graceful Exit as soon as possible and, apart from that satellite, it’s still pending for the europe-north-1 one. Does anyone know why it’s still pending?

Thanks and cheers!

Stob · September 22, 2023, 6:19pm

The disqualification reason will be shown in the node logs. You need to look for ERROR entries.

Yes, the satellite has been decommissioned and any held amount returned.

Alexey · September 23, 2023, 4:53am

From the guide:

So the reason is likely that your node has failed more than 10% of transfers.
As @Stob suggested, you need to search for transfer errors and errors like “file not found” (search for “graceful”/“piecetransfer” and “ERROR”/“failed”).
To calculate the number of failed transfers, you need to filter the logs by “piecetransfer” and group them by PieceID and “ERROR”/“transferred”. Please note: the node will attempt to transfer each piece at least 5 times before it is considered a failed transfer, so you need to count only the failed transfers which have 5 attempts for the same PieceID.
You may also give me the NodeID and I can provide you with exact numbers; however, I cannot provide a reason, as it’s stated only in your logs.
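
The counting described above could be sketched roughly as follows in Python. This is only an illustration, not an official tool: it assumes log lines are tab-separated (timestamp, level, logger, message, JSON) like the ERROR entries quoted in this thread, and it assumes successful transfers appear as non-ERROR “piecetransfer” lines whose message contains “transferred”. The exact wording of the success message is an assumption, and `summarize_transfers` is a hypothetical helper name.

```python
import json
from collections import Counter

MAX_ATTEMPTS = 5  # from the post above: a piece counts as failed after 5 attempts


def summarize_transfers(lines):
    """Return (transferred_pieces, failed_pieces) counted per Piece ID."""
    errors = Counter()   # Piece ID -> number of ERROR lines seen
    transferred = set()  # Piece IDs with at least one success line
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Expected layout: timestamp, level, logger, message, JSON payload
        if len(fields) < 5 or fields[2] != "piecetransfer":
            continue
        try:
            payload = json.loads(fields[4])
        except json.JSONDecodeError:
            continue
        piece_id = payload.get("Piece ID") if isinstance(payload, dict) else None
        if piece_id is None:
            continue
        if fields[1] == "ERROR":
            errors[piece_id] += 1
        elif "transferred" in fields[3]:  # assumed wording of the success message
            transferred.add(piece_id)
    # A piece is counted as failed only after MAX_ATTEMPTS errors and no success.
    failed = {p for p, n in errors.items()
              if n >= MAX_ATTEMPTS and p not in transferred}
    return len(transferred), len(failed)
```

Since an open file is an iterable of lines, the function can be fed a log file object directly; the result is a pair of piece counts that can be compared against the 10% limit.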

raftoral · September 23, 2023, 11:22am

Thanks, @Stob, for your answer. I can confirm that the satellite returned the held amount to me.

raftoral · September 23, 2023, 11:27am

Thanks, @Alexey, for your answer as well. Since I’m not an expert, here is the NodeID: 12oV5bdzCtFrxTequKDkkTBjH8Pd4Cf62sEfDNBPAJUUDuTT5zo. With the information you provide, I’ll try to search the logs.
In any case, it’s clear to me that I’ve lost the held amount on that satellite. Well, it’s a pity!

Alexey · September 23, 2023, 11:32am

It transferred 10,820 pieces and failed to transfer 2,235 pieces, a pretty high fail rate I would say. Either your node has corrupted data, or your connection was terribly bad.
For example, for US1 it transferred 494,039 pieces and 0 failed.
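
To put these numbers next to the 10% limit mentioned earlier in the thread, here is a small back-of-the-envelope calculation. It assumes the fail rate is failed pieces divided by all pieces attempted (transferred plus failed); how the satellite actually computes it is not stated here, and `fail_rate` is just an illustrative helper. It also assumes the first pair of numbers refers to the satellite that disqualified the node.

```python
def fail_rate(transferred, failed):
    """Share of pieces that failed, out of all pieces attempted (assumed definition)."""
    total = transferred + failed
    return failed / total if total else 0.0


THRESHOLD = 0.10  # the 10% limit quoted earlier in this thread

disqualified = fail_rate(10_820, 2_235)  # numbers from the post above
us1 = fail_rate(494_039, 0)

print(f"disqualified satellite: {disqualified:.1%}, over limit: {disqualified > THRESHOLD}")
print(f"US1: {us1:.1%}, over limit: {us1 > THRESHOLD}")
# prints:
# disqualified satellite: 17.1%, over limit: True
# US1: 0.0%, over limit: False
```

Dividing by the transferred pieces only would give an even higher figure, so the node is above 10% under either reading.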

raftoral · September 23, 2023, 11:50am

Thanks for the info. The problem was with my connection. I had issues when my telco provider changed my public IP. Thanks for the support. Cheers!

Pac · October 14, 2023, 9:50pm

I’m facing the same issue on a small node (ID: 1G7CA8T8NwUYLibFqR85TUpXEXKKbwFDLz93srFMPgRDpcvzFj) that I’m currently graceful-exiting: It got disqualified a few days ago on Saltlake:

There are a few things I don’t get:

It failed although all scores are perfect on the dashboard, as shown above
I thought that with the new graceful-exit mechanism, the node was not supposed to transfer files anymore and simply had to stay online for 30 days?

If I check the exit-status on this node, it tells me the following:

# docker exec -it storj_node_5 /app/storagenode exit-status --config-dir /app/config
2023-10-14T21:42:11Z	INFO	Configuration loaded	{"process": "storagenode", "Location": "/app/config/config.yaml"}
2023-10-14T21:42:11Z	INFO	Anonymized tracing enabled	{"process": "storagenode"}
2023-10-14T21:42:11Z	INFO	Identity loaded.	{"process": "storagenode", "Node ID": "1G7CA8T8NwUYLibFqR85TUpXEXKKbwFDLz93srFMPgRDpcvzFj"}

Domain Name                  Node ID                                              Percent Complete  Successful  Completion Receipt
saltlake.tardigrade.io:7777  1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE   15.83%            N           0a483046022100e35f852b2beadd3696ff4aa206dde97524c09e88470d75f9c47c9db9d48a3feb022100cf4d9d875d07320a50724f8fa6156f8d4cf1036e0727958ee70327ccdc50ff1710021a207b2de9d72c2e935f1918c058caaf8ed00f0581639008707317ff1bd0000000002220224d400222d8bb705507ddebe3bb8381a8d34edf85600ca604011670000000002a0c08b9e29ca906108baab0d203  
ap1.storj.io:7777            121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6  100.00%           Y           0a473045022100979aa4f52bdad116144a4d1e4ef160a99af8c8608065a5464237040ca6a94ef002204d0b5d2d07fdbc69049001aea5512413225f8fd2f0c394f7d97515441696d404122084a74c2cd43c5ba76535e1f42f5df7c287ed68d33522782f4afabfdb400000001a20224d400222d8bb705507ddebe3bb8381a8d34edf85600ca60401167000000000220c08e78785a90610ec87f08a01        
us1.storj.io:7777            12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S  100.00%           Y           0a473045022036280c2dd40a2647ab86cc7722deb8f551dbe6d43fa88fa29b72eff0ab3d8b06022100efa22d0fac30af1280d853684df3c049f4c19ca9c6995aa5d0ab7db3c96b9ef81220a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec00000001a20224d400222d8bb705507ddebe3bb8381a8d34edf85600ca60401167000000000220c08c8b9fea80610ee9deff501        
eu1.storj.io:7777            12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs  100.00%           Y           0a47304502202d2a568f6a9f5e02aae4eb51444bc109cb5e5b1521373dfbd1f0a4e3f3f87ee7022100a14894756d0e0ac1a5f69259fc14c512eb65a44e38e50be4f738561eef370ba01220af2c42003efc826ab4361f73f9d890942146fe0ebe806786f8e71908000000001a20224d400222d8bb705507ddebe3bb8381a8d34edf85600ca60401167000000000220c08f6acfea80610aec9dbf001

I see some percentages above although I thought they were not meant to be used anymore.
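
For anyone who wants to check this output from a script, here is a rough Python sketch that pulls the satellite, percentage and success flag out of text shaped like the exit-status table above. It assumes the column layout stays exactly as shown (whitespace-separated, third column ending in “%”, fourth column “Y” or “N”); `parse_exit_status` is a hypothetical helper, not part of the storagenode tooling.

```python
def parse_exit_status(output):
    """Extract per-satellite rows from exit-status text (layout assumed as above)."""
    rows = []
    for line in output.splitlines():
        parts = line.split()
        # Data rows look like: host:port, satellite ID, "NN.NN%", Y/N, receipt.
        # Header and log lines do not match this shape and are skipped.
        if len(parts) >= 4 and parts[2].endswith("%") and parts[3] in ("Y", "N"):
            rows.append({
                "satellite": parts[0],
                "percent": float(parts[2].rstrip("%")),
                "successful": parts[3] == "Y",
            })
    return rows
```

A list comprehension such as `[r["satellite"] for r in rows if not r["successful"]]` would then list the satellites where the exit did not succeed.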

I had a held amount of $1.30 on Saltlake on this small node so, you know… it doesn’t really matter but the behavior is somewhat confusing.

In my logs, I see millions of the following errors so I guess my node was not working properly:

2023-10-11T23:32:47Z	ERROR	piecetransfer	failed to put piece	{"process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Piece ID": "3M5STYLAPU4QEIIZMHXSBICTOL2QRP42PYOAFZGYQTIBH5I3NCBA", "Storagenode ID": "12A2ntyfgmDMmGDvpxwCy2ZDWgxB38MYNedn6rGYkYrPJYHMDBD", "error": "ecclient: upload failed (node:12A2ntyfgmDMmGDvpxwCy2ZDWgxB38MYNedn6rGYkYrPJYHMDBD, address:175.158.56.27:28968): protocol: write tcp 172.17.0.5:56956->175.158.56.27:28968: use of closed network connection; write tcp 172.17.0.5:56956->175.158.56.27:28968: use of closed network connection; piecestore: piecestore close: write tcp 172.17.0.5:56956->175.158.56.27:28968: use of closed network connection", "errorVerbose": "ecclient: upload failed (node:12A2ntyfgmDMmGDvpxwCy2ZDWgxB38MYNedn6rGYkYrPJYHMDBD, address:175.158.56.27:28968): protocol: write tcp 172.17.0.5:56956->175.158.56.27:28968: use of closed network connection; write tcp 172.17.0.5:56956->175.158.56.27:28968: use of closed network connection; piecestore: piecestore close: write tcp 172.17.0.5:56956->175.158.56.27:28968: use of closed network connection\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:244\n\tstorj.io/storj/storagenode/piecetransfer.(*service).TransferPiece:148\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:100\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49"}
2023-10-11T23:32:47Z	ERROR	gracefulexit:chore.1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE@saltlake.tardigrade.io:7777	failed to send notification about piece transfer.	{"process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "error": "EOF", "errorVerbose": "EOF\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:105\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49"}
2023-10-11T23:32:47Z	ERROR	gracefulexit:chore	worker failed	{"process": "storagenode", "error": "gracefulexit: context canceled while waiting to receive message from storagenode", "errorVerbose": "gracefulexit: context canceled while waiting to receive message from storagenode\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run:90\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).AddMissing.func1:82\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49"}

Taking care of a node is hard!

Alexey · October 15, 2023, 1:33am

Disqualification during Graceful Exit happens because your node failed to transfer more than 10% of its pieces (the node makes 5 attempts to transfer each piece to different nodes before the transfer is considered failed).

The new Graceful Exit mechanism is not deployed yet; it should be deployed soon:

Pac:

2023-10-11T23:32:47Z	ERROR	piecetransfer	failed to put piece	{"process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Piece ID": "3M5STYLAPU4QEIIZMHXSBICTOL2QRP42PYOAFZGYQTIBH5I3NCBA", "Storagenode ID": "12A2ntyfgmDMmGDvpxwCy2ZDWgxB38MYNedn6rGYkYrPJYHMDBD", "error": "ecclient: upload failed (node:12A2ntyfgmDMmGDvpxwCy2ZDWgxB38MYNedn6rGYkYrPJYHMDBD, address:175.158.56.27:28968): protocol: write tcp 172.17.0.5:56956->175.158.56.27:28968: use of closed network connection; write tcp 172.17.0.5:56956->175.158.56.27:28968: use of closed network connection; piecestore: piecestore close: write tcp 172.17.0.5:56956->175.158.56.27:28968: use of closed network connection", "errorVerbose": "ecclient: upload failed (node:12A2ntyfgmDMmGDvpxwCy2ZDWgxB38MYNedn6rGYkYrPJYHMDBD, address:175.158.56.27:28968): protocol: write tcp 172.17.0.5:56956->175.158.56.27:28968: use of closed network connection; write tcp 172.17.0.5:56956->175.158.56.27:28968: use of closed network connection; piecestore: piecestore close: write tcp 172.17.0.5:56956->175.158.56.27:28968: use of closed network connection\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:244\n\tstorj.io/storj/storagenode/piecetransfer.(*service).TransferPiece:148\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:100\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49"}
2023-10-11T23:32:47Z	ERROR	gracefulexit:chore.1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE@saltlake.tardigrade.io:7777	failed to send notification about piece transfer.	{"process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "error": "EOF", "errorVerbose": "EOF\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:105\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49"}
2023-10-11T23:32:47Z	ERROR	gracefulexit:chore	worker failed	{"process": "storagenode", "error": "gracefulexit: context canceled while waiting to receive message from storagenode", "errorVerbose": "gracefulexit: context canceled while waiting to receive message from storagenode\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run:90\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).AddMissing.func1:82\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49"}

These are communication errors, so likely your router wasn’t able to handle a lot of parallel transfers.
This is another reason why we want to change the complicated Graceful Exit.

Pac · October 15, 2023, 8:57am

Oh damn

Another node got disqualified this morning which feels unfair if it’s because my router can’t handle the load… Especially as I’m trying to help the network here by gracefully exiting. Is there anything I can do to slow down those transfers and stop nodes that are exiting from being disqualified?

I guess changing these parameters:

# number of concurrent transfers per graceful exit worker
# graceful-exit.num-concurrent-transfers: 5

# number of workers to handle satellite exits
# graceful-exit.num-workers: 4

What about these new numbers:

graceful-exit.num-concurrent-transfers: 2
graceful-exit.num-workers: 2

?

Alexey · October 16, 2023, 3:32am

Yes, you may try to reduce these parameters, save the config and restart the node.
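
If you prefer to script the edit rather than change the file by hand, something along these lines could work. It is only a sketch: it assumes the options appear in config.yaml as commented-out `# key: value` lines like the ones quoted above, and `set_option` is a hypothetical helper. The node still has to be restarted afterwards for the change to take effect.

```python
import re


def set_option(config_text, key, value):
    """Uncomment/overwrite `key` in the config text, or append it if missing."""
    # Match the option line whether or not it is commented out; only spaces and
    # tabs are allowed after '#', so description comments and blank lines are
    # left untouched.
    pattern = re.compile(rf"^#?[ \t]*{re.escape(key)}:.*$", re.MULTILINE)
    new_line = f"{key}: {value}"
    if pattern.search(config_text):
        return pattern.sub(lambda _m: new_line, config_text, count=1)
    sep = "" if (not config_text or config_text.endswith("\n")) else "\n"
    return config_text + sep + new_line + "\n"


example = (
    "# number of concurrent transfers per graceful exit worker\n"
    "# graceful-exit.num-concurrent-transfers: 5\n"
    "\n"
    "# number of workers to handle satellite exits\n"
    "# graceful-exit.num-workers: 4\n"
)
updated = set_option(example, "graceful-exit.num-concurrent-transfers", 2)
updated = set_option(updated, "graceful-exit.num-workers", 2)
print(updated)
```

The description comments are kept as they are; only the two option lines are uncommented and set to the lower values.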

Pac · October 16, 2023, 9:25pm

Thanks for your unfailing help, as always @Alexey.

Tried with 2 and 2. Still had quite a lot of errors.

I’m now down to 1 and 1, and although logs look better than with default values, there are still many errors. RaspbianOS is pretty chill right now (load average: 0.36, 0.26, 0.28 and an average constant upload of 2MiB/s). I don’t know what else I can do…

This graceful-exit mechanism seems pretty unreliable, I’m glad we’re moving to the shiny new one soon!

Too bad I misunderstood that it wasn’t live yet… that’s my bad.

raftoral · January 15, 2024, 3:57pm

Hi again, @Alexey!

After the graceful exit was done a few months ago, I’ve been waiting to receive a few dollars that are still undistributed. Do you know why I haven’t received them yet?

Thanks in advance!

Alexey · January 16, 2024, 1:52am

They are subject to the Minimum Payout Threshold. If your undistributed amount is lower, it will be held until it can clear the threshold.
You have three options:

Run a new node with the same wallet address; it will eventually collect enough to clear the threshold
If you have the identity of this node, you may opt in for zkSync (it has a lower threshold)
Wait until the transaction fee becomes low enough during the first two weeks of a month.

The last time the threshold was:

I would also recommend checking your wallet: you may actually have received the undistributed amount, and you still see it only for decommissioned satellites, see

raftoral · January 16, 2024, 1:26pm

Thanks for the clarification. I wasn’t sure whether the reason was the minimum payout threshold or the graceful exit.

The undistributed amount is the same as it was before I did the graceful exit, so I’m sure it’s not related to a decommissioned satellite.

So, I’ve just opted in for zkSync in order to get the pending amount.

Thanks again for your fast and efficient support!

Cheers!

raftoral · May 7, 2024, 5:30pm

Hi again @Alexey:

Since our last contact, I’ve been waiting for the undistributed payout. After opting in for zkSync, I received part of the pending amount, but a little less than $2 has remained since then. Do you know why I haven’t received the total amount? If I uninstall the node (Graceful Exit is finished), will I receive that amount?

Thanks and best regards.

Alexey · May 10, 2024, 1:15pm

You likely received all of the held amount. However, your databases may not have receipts for the decommissioned satellites.

This may happen if the held amount released by them was not enough to clear the Minimum Payout Threshold at that time, but was able to clear it one or two payout cycles later, when these satellites had already been shut down and thus were not able to send you receipts.

You may check your payout history on the dashboard around this time and find transactions from these satellites.

raftoral · May 10, 2024, 3:08pm

Thanks for the info. You’re right: it was due to those 2 decommissioned satellites. I didn’t surpass the Minimum Payout Threshold.

Best regards!