Graceful exit attempts to transfer an already transferred piece + misc other errors

I’ve recently started the graceful exit process on one of my nodes due to a predicted upcoming hardware failure (I will rejoin the network with new hardware in a few months). I’ve started with a GE on only the stefan-benten satellite, just to see how it goes before I start on the rest.

Most piece transfers have been successful; only a small number have failed, with the errors below.

I know graceful exit is a relatively new feature without a lot of use yet, so I am a little uncertain about these issues:

1. I seem to be having a strange issue where a few pieces fail to transfer because the “database is locked”, but grepping the logs shows the piece had already been transferred? (A small sketch of the log lookup I mean follows the list.)

2020-05-12T19:14:54.659Z        INFO    gracefulexit:chore      piece transferred to new storagenode    {"Storagenode ID": "1YwgqyPxA6n7enqJ3dRPhEasHMbuCSBacoFJeMerGr3DMgBmYz", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Piece ID": "4XZO7Y7HD3P63J7V7RAJMGEVY23GSHH3P75ZKFJIX6TX2E76MFVQ"}
2020-05-12T19:26:40.846Z        ERROR   gracefulexit:chore      failed to put piece.    {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Piece ID": "4XZO7Y7HD3P63J7V7RAJMGEVY23GSHH3P75ZKFJIX6TX2E76MFVQ", "error": "protocol: usedserialsdb error: database is locked", "errorVerbose": "protocol: usedserialsdb error: database is locked\n\tstorj.io/uplink/private/piecestore.(*Upload).Write:160\n\tbufio.(*Writer).Flush:593\n\tbufio.(*Writer).Write:629\n\tstorj.io/uplink/private/piecestore.(*BufferedUpload).Write:32\n\tstorj.io/uplink/private/piecestore.(*LockingUpload).Write:89\n\tio.copyBuffer:404\n\tio.Copy:364\n\tstorj.io/common/sync2.Copy:22\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:240\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43"}

2. I am also curious about the “storage node overloaded” error when transferring pieces for GE: is it hitting the limit on my node, or on the receiving node?

2020-05-12T19:28:18.128Z        ERROR   gracefulexit:chore      failed to transfer piece.       {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "error": "protocol: storage node overloaded, request limit: 8", "errorVerbose": "protocol: storage node overloaded, request limit: 8\n\tstorj.io/uplink/private/piecestore.(*Upload).Write:160\n\tbufio.(*Writer).Flush:593\n\tbufio.(*Writer).Write:629\n\tstorj.io/uplink/private/piecestore.(*BufferedUpload).Write:32\n\tstorj.io/uplink/private/piecestore.(*LockingUpload).Write:89\n\tio.copyBuffer:404\n\tio.Copy:364\n\tstorj.io/common/sync2.Copy:22\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:240\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43"}

3. I’ve also noticed a few failures where the common factor seems to be an IPv6 address involved in a “dial tcp” failure, but I’m not really sure what this means. If it matters, my network and ISP both support IPv6, and the node does have an IPv6 address assigned.

2020-05-12T19:46:51.426Z        ERROR   gracefulexit:chore      failed to put piece.    {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Piece ID": "XPRSKNTQB25EQMGLSIJIJ2VDG3GMYETYDKKNX7UXRAG4EQYOZPBA", "error": "piecestore: rpccompat: dial tcp [2a03:10c3:259a:4::10]:28967: connect: cannot assign requested address", "errorVerbose": "piecestore: rpccompat: dial tcp [2a03:10c3:259a:4::10]:28967: connect: cannot assign requested address\n\tstorj.io/common/rpc.Dialer.dialTransport:264\n\tstorj.io/common/rpc.Dialer.dial:241\n\tstorj.io/common/rpc.Dialer.DialNode:140\n\tstorj.io/uplink/private/piecestore.Dial:51\n\tstorj.io/uplink/private/ecclient.(*ecClient).dialPiecestore:68\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:198\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43"}
2020-05-12T19:46:51.429Z        ERROR   gracefulexit:chore      failed to transfer piece.       {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "error": "piecestore: rpccompat: dial tcp [2a03:10c3:259a:4::10]:28967: connect: cannot assign requested address", "errorVerbose": "piecestore: rpccompat: dial tcp [2a03:10c3:259a:4::10]:28967: connect: cannot assign requested address\n\tstorj.io/common/rpc.Dialer.dialTransport:264\n\tstorj.io/common/rpc.Dialer.dial:241\n\tstorj.io/common/rpc.Dialer.DialNode:140\n\tstorj.io/uplink/private/piecestore.Dial:51\n\tstorj.io/uplink/private/ecclient.(*ecClient).dialPiecestore:68\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:198\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43"}
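
For reference, the grep-style lookup I mention in issue 1 is roughly the following (a minimal sketch only; the log path is hypothetical and depends on where your setup writes the node log):

import sys

LOG_PATH = "node.log"  # hypothetical path - point this at your own log file

def history_for_piece(piece_id):
    # Yield every log line that mentions the piece ID, in file order,
    # so you can see a successful transfer followed by a later failure.
    with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
        for line in log:
            if piece_id in line:
                yield line.rstrip()

if __name__ == "__main__":
    # usage: python3 piece_history.py <PIECE_ID>
    for entry in history_for_piece(sys.argv[1]):
        print(entry)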

Would any of these transfer failures affect whether the graceful exit is successful or not?

Thanks for everyone’s help, and I apologize in advance if I have misinterpreted any of these errors, or if they aren’t important.

I can only answer the second one. The rejections are because the receiving node has set a limit for concurrent requests. So that is not an issue on your end.
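
If you want to confirm that, the limit value is printed in the error text itself, so you can tally what the receiving nodes report. A minimal sketch, assuming your log has been saved to a file (the path is hypothetical); seeing different limit values across failures shows the limit comes from the remote side:

import re
from collections import Counter

LOG_PATH = "node.log"  # hypothetical path

# The limit is part of the error text, e.g. "storage node overloaded, request limit: 8".
limit_re = re.compile(r"storage node overloaded, request limit: (\d+)")

counts = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = limit_re.search(line)
        if match:
            counts[int(match.group(1))] += 1

for limit, seen in sorted(counts.items()):
    print(f"request limit {limit}: rejected {seen} transfer attempt(s)")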

Hello @dfsn,
Welcome to the forum!

I would add that all of those errors were returned by the destination, i.e. the nodes targeted to receive the transferred pieces.
And yes, if your node fails more than 10% of transfers, it will be disqualified.
To make it less scary: each failed piece will be retried at least 5 times with other nodes (it should be 10, but I’m not sure what the current configuration on that satellite is).
So be careful if you tune the graceful exit parameters above the defaults.

So, GE errors depend on the transfers to destination nodes?
What happens if we have one or two slow destination nodes during GE? Will the GE fail?

GE will fail once the failure rate exceeds 10%.

How can we control it during GE?

Do not overload your node with values greater than the defaults.
With the default settings even a Raspberry Pi 3 is able to successfully finish the GE.

But how can we control the failure percentage during GE?

Ah, I’ve got it. You mean how to monitor it: take a look into the logs.
If you see any failed transfer, stop your node immediately and return the settings to their previous values.
It’s perhaps not a bad idea to reduce the allocated space as well, so that your node is not bothered with new upload requests.
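
If it helps, something roughly like this would count successes against failed attempts using the messages from the OP’s log (a sketch only; the log path is hypothetical, and it counts attempts rather than completely failed pieces):

LOG_PATH = "node.log"  # hypothetical path

succeeded = failed = 0
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "gracefulexit" not in line:
            continue
        if "piece transferred to new storagenode" in line:
            succeeded += 1
        elif "failed to transfer piece" in line:
            # Note: this is a failed attempt; each piece is retried several
            # times before it counts against the node.
            failed += 1

total = succeeded + failed
if total:
    print(f"succeeded: {succeeded}, failed attempts: {failed} "
          f"({100.0 * failed / total:.1f}% of attempts)")
else:
    print("no graceful exit transfers found in the log")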

Sorry, I understood the part about not touching any parameters :slight_smile:
I’m asking whether we have any way to track the failed transfer rate during the GE process (a counter, or anything)?

Yes, you will see a transfer error in the logs. I can’t give you an exact example, as I do not have one; they look similar to the errors in the OP of this topic.

Thanks, that’s enough for me; it will be in the log and I can parse it (or just record it).

The transfer is considered completely failed (which is what affects your node) only after those 5-10 unsuccessful attempts.
So, search for failed and the piece ID. Also, some stats should be available in satellites.db:

CREATE TABLE satellite_exit_progress (
    satellite_id BLOB NOT NULL,
    initiated_at TIMESTAMP,
    finished_at TIMESTAMP,
    starting_disk_usage INTEGER NOT NULL,
    bytes_deleted INTEGER NOT NULL,
    completion_receipt BLOB,
    PRIMARY KEY (satellite_id)
);
CREATE TABLE satellites (
    node_id BLOB NOT NULL,
    added_at TIMESTAMP NOT NULL,
    status INTEGER NOT NULL,
    PRIMARY KEY (node_id)
);
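
For example, roughly like this with the sqlite3 module (a sketch only; the database path is hypothetical, and you should query a copy of the file or stop the node first to avoid lock errors):

import sqlite3

DB_PATH = "satellites.db"  # hypothetical path - usually in the node's storage directory

# Read graceful exit progress; the columns come from the satellite_exit_progress table above.
db = sqlite3.connect(DB_PATH)
rows = db.execute(
    "SELECT hex(satellite_id), initiated_at, finished_at, "
    "starting_disk_usage, bytes_deleted FROM satellite_exit_progress"
)
for sat_id, initiated, finished, start_usage, deleted in rows:
    print(f"satellite {sat_id[:16]}...: initiated {initiated}, finished {finished}, "
          f"deleted {deleted} of {start_usage} starting bytes")
db.close()
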
Just an update for anyone who finds this thread in the future, or is wondering about what happened with the errors:

Even with these and some other occasional errors, my node has successfully completed graceful exit 100% on every satellite on which I initiated it.

Thanks to BrightSilence and Alexey for their help!

Thanks for reporting back. There seems to be an idea going around that graceful exit fails a lot. I don’t think that’s true at all; we tend to just see the bad cases on the forum. So thanks for sharing the other side as well.