Graceful exit attempts to transfer an already transferred piece + misc other errors

I’ve recently started the graceful exit process on one of my nodes due to a predicted upcoming hardware failure (I will rejoin the network with new hardware in a few months). I’ve started with a GE on only the stefan-benten satellite, just to see how it goes before I start on the rest.

Most piece transfers have been successful; only a small number have failed, with the errors below.

I know graceful exit is a relatively new feature without a lot of use yet, so I am a little uncertain about these issues:

1. I seem to be having a strange issue where a few pieces fail to transfer because the “database is locked”, but grepping the logs shows the piece had already been transferred? (A small sketch of the log lookup I mean follows the list.)

2020-05-12T19:14:54.659Z        INFO    gracefulexit:chore      piece transferred to new storagenode    {"Storagenode ID": "1YwgqyPxA6n7enqJ3dRPhEasHMbuCSBacoFJeMerGr3DMgBmYz", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Piece ID": "4XZO7Y7HD3P63J7V7RAJMGEVY23GSHH3P75ZKFJIX6TX2E76MFVQ"}
2020-05-12T19:26:40.846Z        ERROR   gracefulexit:chore      failed to put piece.    {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Piece ID": "4XZO7Y7HD3P63J7V7RAJMGEVY23GSHH3P75ZKFJIX6TX2E76MFVQ", "error": "protocol: usedserialsdb error: database is locked", "errorVerbose": "protocol: usedserialsdb error: database is locked\n\tstorj.io/uplink/private/piecestore.(*Upload).Write:160\n\tbufio.(*Writer).Flush:593\n\tbufio.(*Writer).Write:629\n\tstorj.io/uplink/private/piecestore.(*BufferedUpload).Write:32\n\tstorj.io/uplink/private/piecestore.(*LockingUpload).Write:89\n\tio.copyBuffer:404\n\tio.Copy:364\n\tstorj.io/common/sync2.Copy:22\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:240\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43"}

2. I am also curious about the “storage node overloaded” error when transferring pieces for GE: is it hitting the limit on my node, or on the receiving node?

2020-05-12T19:28:18.128Z        ERROR   gracefulexit:chore      failed to transfer piece.       {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "error": "protocol: storage node overloaded, request limit: 8", "errorVerbose": "protocol: storage node overloaded, request limit: 8\n\tstorj.io/uplink/private/piecestore.(*Upload).Write:160\n\tbufio.(*Writer).Flush:593\n\tbufio.(*Writer).Write:629\n\tstorj.io/uplink/private/piecestore.(*BufferedUpload).Write:32\n\tstorj.io/uplink/private/piecestore.(*LockingUpload).Write:89\n\tio.copyBuffer:404\n\tio.Copy:364\n\tstorj.io/common/sync2.Copy:22\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:240\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43"}

3. I’ve also noticed a few failures where the common factor seems to be an IPv6 address involved in a “dial tcp” failure, but I’m not really sure what this means. If it matters, my network and ISP both support IPv6, and the node does have an IPv6 address assigned.

2020-05-12T19:46:51.426Z        ERROR   gracefulexit:chore      failed to put piece.    {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Piece ID": "XPRSKNTQB25EQMGLSIJIJ2VDG3GMYETYDKKNX7UXRAG4EQYOZPBA", "error": "piecestore: rpccompat: dial tcp [2a03:10c3:259a:4::10]:28967: connect: cannot assign requested address", "errorVerbose": "piecestore: rpccompat: dial tcp [2a03:10c3:259a:4::10]:28967: connect: cannot assign requested address\n\tstorj.io/common/rpc.Dialer.dialTransport:264\n\tstorj.io/common/rpc.Dialer.dial:241\n\tstorj.io/common/rpc.Dialer.DialNode:140\n\tstorj.io/uplink/private/piecestore.Dial:51\n\tstorj.io/uplink/private/ecclient.(*ecClient).dialPiecestore:68\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:198\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43"}
2020-05-12T19:46:51.429Z        ERROR   gracefulexit:chore      failed to transfer piece.       {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "error": "piecestore: rpccompat: dial tcp [2a03:10c3:259a:4::10]:28967: connect: cannot assign requested address", "errorVerbose": "piecestore: rpccompat: dial tcp [2a03:10c3:259a:4::10]:28967: connect: cannot assign requested address\n\tstorj.io/common/rpc.Dialer.dialTransport:264\n\tstorj.io/common/rpc.Dialer.dial:241\n\tstorj.io/common/rpc.Dialer.DialNode:140\n\tstorj.io/uplink/private/piecestore.Dial:51\n\tstorj.io/uplink/private/ecclient.(*ecClient).dialPiecestore:68\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:198\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43"}
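
For reference, the grep-style lookup I mention in issue 1 is roughly the following (a minimal sketch only; the log path is hypothetical and depends on where your setup writes the node log):

import sys

LOG_PATH = "node.log"  # hypothetical path - point this at your own log file

def history_for_piece(piece_id):
    # Yield every log line that mentions the piece ID, in file order,
    # so you can see a successful transfer followed by a later failure.
    with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
        for line in log:
            if piece_id in line:
                yield line.rstrip()

if __name__ == "__main__":
    # usage: python3 piece_history.py <PIECE_ID>
    for entry in history_for_piece(sys.argv[1]):
        print(entry)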

Would any of these transfer failures affect whether the graceful exit is successful or not?

Thanks for everyone’s help, and I apologize in advance if I have misinterpreted any of these errors, or if they aren’t important.

I can only answer the second one. The rejections are because the receiving node has set a limit for concurrent requests. So that is not an issue on your end.
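
If you want to confirm that, the limit value is printed in the error text itself, so you can tally what the receiving nodes report. A minimal sketch, assuming your log has been saved to a file (the path is hypothetical); seeing different limit values across failures shows the limit comes from the remote side:

import re
from collections import Counter

LOG_PATH = "node.log"  # hypothetical path

# The limit is part of the error text, e.g. "storage node overloaded, request limit: 8".
limit_re = re.compile(r"storage node overloaded, request limit: (\d+)")

counts = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = limit_re.search(line)
        if match:
            counts[int(match.group(1))] += 1

for limit, seen in sorted(counts.items()):
    print(f"request limit {limit}: rejected {seen} transfer attempt(s)")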

Hello @dfsn,
Welcome to the forum!

I would add that all of those errors were returned by the destination, i.e. the nodes targeted to receive the transferred pieces.
And yes, if your node fails more than 10% of transfers, it will be disqualified.
To make it less scary: each failed piece will be retried at least 5 times with other nodes (it should be 10, but I’m not sure what the current configuration on that satellite is).
So be careful if you tune the graceful exit parameters above the defaults.

So, GE errors depend on the transfers to destination nodes?
What happens if we have one or two slow destination nodes during GE? Will the GE fail?

GE will fail once the failure rate exceeds 10%.

How can we control it during GE?

Do not overload your node with values greater than the defaults.
With the default settings even a Raspberry Pi 3 is able to successfully finish the GE.

But how can we control the failure percentage during GE?

Ah, I’ve got it. You mean how to monitor it: take a look into the logs.
If you see any failed transfer, stop your node immediately and return the settings to their previous values.
It’s perhaps not a bad idea to reduce the allocated space as well, so that your node is not bothered with new upload requests.
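
If it helps, something roughly like this would count successes against failed attempts using the messages from the OP’s log (a sketch only; the log path is hypothetical, and it counts attempts rather than completely failed pieces):

LOG_PATH = "node.log"  # hypothetical path

succeeded = failed = 0
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "gracefulexit" not in line:
            continue
        if "piece transferred to new storagenode" in line:
            succeeded += 1
        elif "failed to transfer piece" in line:
            # Note: this is a failed attempt; each piece is retried several
            # times before it counts against the node.
            failed += 1

total = succeeded + failed
if total:
    print(f"succeeded: {succeeded}, failed attempts: {failed} "
          f"({100.0 * failed / total:.1f}% of attempts)")
else:
    print("no graceful exit transfers found in the log")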

Sorry, I understood the part about not touching any parameters :slight_smile:
I’m asking whether we have any way to track the failed transfer rate during the GE process (a counter, or anything)?

Yes, you will see a transfer error in the logs. I can’t give you an exact example, as I do not have one; they look similar to the errors in the OP of this topic.

Thanks, that’s enough for me; it will be in the log and I can parse it (or just record it).

The transfer is considered completely failed (which is what affects your node) only after those 5-10 unsuccessful attempts.
So, search for failed and the piece ID. Also, some stats should be available in satellites.db:

CREATE TABLE satellite_exit_progress (
    satellite_id BLOB NOT NULL,
    initiated_at TIMESTAMP,
    finished_at TIMESTAMP,
    starting_disk_usage INTEGER NOT NULL,
    bytes_deleted INTEGER NOT NULL,
    completion_receipt BLOB,
    PRIMARY KEY (satellite_id)
);
CREATE TABLE satellites (
    node_id BLOB NOT NULL,
    added_at TIMESTAMP NOT NULL,
    status INTEGER NOT NULL,
    PRIMARY KEY (node_id)
);
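
For example, roughly like this with the sqlite3 module (a sketch only; the database path is hypothetical, and you should query a copy of the file or stop the node first to avoid lock errors):

import sqlite3

DB_PATH = "satellites.db"  # hypothetical path - usually in the node's storage directory

# Read graceful exit progress; the columns come from the satellite_exit_progress table above.
db = sqlite3.connect(DB_PATH)
rows = db.execute(
    "SELECT hex(satellite_id), initiated_at, finished_at, "
    "starting_disk_usage, bytes_deleted FROM satellite_exit_progress"
)
for sat_id, initiated, finished, start_usage, deleted in rows:
    print(f"satellite {sat_id[:16]}...: initiated {initiated}, finished {finished}, "
          f"deleted {deleted} of {start_usage} starting bytes")
db.close()
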
Just an update for anyone who finds this thread in the future, or is wondering about what happened with the errors:

Even with these and some other occasional errors, my node has successfully completed graceful exit 100% on every satellite on which I initiated it.

Thanks to BrightSilence and Alexey for their help!

Thanks for reporting back. There seems to be an idea going around that graceful exit fails a lot. I don’t think that’s true at all; we tend to just see the bad cases on the forum. So thanks for sharing the other side as well.