Graceful Exit (log inspection)

Odmin · May 31, 2020, 2:26pm

I creating this topic to pay attention to the Graceful Exit (GE) implementation and node selection on the satellite side during GE.
So, Let’s start to review GE process like an engineer.
Introduction:
I started GE on one of my storagenode for satellite.stefan-benten.de (118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW) to test how GE is going. All my databases located on separate SSD drives and I have ZFS storage.
When I started GE process I saw on the table with satellite names how much space is allocated by each satellite, for satellite.stefan-benten.de - 216GB I started the process and it was successfully finished.
Start time: 2020-05-27 12:24
Finished time: 2020-05-29 15:13
GE takes: ~51 hours
GE average speed: ~ 4,24GB/h (~101GB/day)
All settings on config.yaml are default.

I remember @Alexey sad:

When the process finished I collected the storage node log and starting analyze it.
docker logs -t storagenode >& output.log

I will continue with my analysis in the next post because it will be too long.

Odmin · May 31, 2020, 3:42pm

Let’s analyze log file and see what’s going on during GE process.

cat output.log |grep -c gracefulexit:chore
70956

70956 pieces are transferred/try to transfer during GE process.

cat output.log |grep gracefulexit:chore| grep -c failed
5963

5963 is failed during GE process

Why did it fail? I asked himself, my storage node is not loaded (controlling process with netdata), the network (at that moment) was silent. And I walk to the deep details of log… ( because the devil is always in the details )

Cut log file to GE errors only and go through the all log:
cat output.log |grep gracefulexit:chore| grep failed > errors.log

2020-05-28T05:48:17.769637997Z 2020-05-28T05:48:17.769Z ERROR gracefulexit:chore failed to put piece. {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “Piece ID”: “KQU5WWKHYMS6GGEGJEJSC4WIKLLONDC3C5TZQD
PQSGTHUCYQ3LTQ”, “error”: “protocol: pieces error: filestore error: truncate F:\temp\blob-655807176.partial: Espacio en disco insuficiente.", “errorVerbose”: "protocol: pieces error: filestore error: truncate F:\temp\blob-65580717
6.partial: Espacio en disco insuficiente.\n\tstorj.io/uplink/private/piecestore.(*Upload).Write:182\n\tbufio.(*Writer).Flush:593\n\tbufio.(*Writer).Write:629\n\tstorj.io/uplink/private/piecestore.(*BufferedUpload).Write:32\n\tstorj.io
/uplink/private/piecestore.(*LockingUpload).Write:89\n\tio.copyBuffer:404\n\tio.Copy:364\n\tstorj.io/common/sync2.Copy:22\n\tstorj.io/uplink/private/piecestore.(*Client).UploadReader:76\n\tstorj.io/uplink/private/ecclient.(*ecClient).
PutPiece:216\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43”}

It errors on the remote side and I can do nothing with it, just calculate how many…

cat errors.log | grep -c "filestore error: truncate"
154

2020-05-28T05:49:28.188261774Z 2020-05-28T05:49:28.187Z ERROR gracefulexit:chore failed to put piece. {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “Piece ID”: “H4FMR73ZJZOIXENHWGNYP7YLJAZFA5AY4E3EFT
ADPGLT2HM2OPOA”, “error”: “piecestore: rpccompat: dial tcp [2a0b:f4c0:58:1:ac25:857:a749:6715]:28967: connect: cannot assign requested address", “errorVerbose”: "piecestore: rpccompat: dial tcp [2a0b:f4c0:58:1:ac25:857:a749:6715]:2896
> 7: connect: cannot assign requested address\n\tstorj.io/common/rpc.Dialer.dialTransport:264\n\tstorj.io/common/rpc.Dialer.dial:241\n\tstorj.io/common/rpc.Dialer.DialNode:140\n\tstorj.io/uplink/private/piecestore.Dial:51\n\tstorj.io/up
link/private/ecclient.(*ecClient).dialPiecestore:65\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:193\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Worke
r).Run.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43”}

IPv6 node, why satellite is differing it fo me… but let’s calculate how many.

cat errors.log | grep -c "cannot assign requested address"
480

2020-05-28T05:49:37.015766784Z 2020-05-28T05:49:37.015Z ERROR gracefulexit:chore failed to put piece. {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “Piece ID”: “SVWBAKXQP4PESCHATYCL43JPATSMXM2FAOJTJJ
S6TC7VKV4HHUAQ”, “error”: "protocol: usedserialsdb error: disk I/O error", “errorVerbose”: “protocol: usedserialsdb error: disk I/O error\n\tstorj.io/uplink/private/piecestore.(*Upload).Write:182\n\tbufio.(*Writer).Flush:593\n\tbufio.
(*Writer).Write:629\n\tstorj.io/uplink/private/piecestore.(*BufferedUpload).Write:32\n\tstorj.io/uplink/private/piecestore.(*LockingUpload).Write:89\n\tio.copyBuffer:404\n\tio.Copy:364\n\tstorj.io/common/sync2.Copy:22\n\tstorj.io/upli
nk/private/piecestore.(*Client).UploadReader:76\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:216\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).R
un.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43”}

Disk I/O errors on the remote side, can’t do anything, just calculating.

cat errors.log | grep -c "disk I/O error"
104

2020-05-28T05:50:05.168852828Z 2020-05-28T05:50:05.168Z ERROR gracefulexit:chore failed to put piece. {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “Piece ID”: “2XPAGAXNXDWZ2ZZJCZKIRMUAHSQSA6T7YKAAITPBMUCYZQO3F4OQ”, “error”: “piecestore: rpccompat: context deadline exceeded”, “errorVerbose”: "piecestore: rpccompat: context deadline exceeded\n\tstorj.io/common/rpc.Dialer.dialTransport:264\n\tstorj.io/common/rpc.Dialer.dial:241\n\tstorj.io/common/rpc.Dialer.DialNode:140\n\tstorj.io/uplink/private/piecestore.Dial:51\n\tstorj.io/uplink/private/ecclient.(*ecClient).dialPiecestore:65\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:193\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43"}

Next error on the remote side, can’t do anything, calculating…

 cat errors.log | grep -c "deadline exceeded"
1912

2020-05-28T05:52:13.484607212Z 2020-05-28T05:52:13.484Z ERROR gracefulexit:chore failed to put piece. {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “Piece ID”: “IGFTUZXZZMFKJZQNM43AJUJYUTSIV2GJIAPC3UHGVI75DQ44PMOA”, “error”: “protocol: storage node overloaded, request limit: 2”, “errorVerbose”: “protocol: storage node overloaded, request limit: 2\n\tstorj.io/uplink/private/piecestore.(*Upload).Write:182\n\tbufio.(*Writer).Flush:5
93\n\tbufio.(*Writer).Write:629\n\tstorj.io/uplink/private/piecestore.(*BufferedUpload).Write:32\n\tstorj.io/uplink/private/piecestore.(*LockingUpload).Write:89\n\tio.copyBuffer:404\n\tio.Copy:364\n\tstorj.io/common/sync2.Copy:22\n\ts
torj.io/uplink/private/piecestore.(*Client).UploadReader:76\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:216\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:212\n\tstorj.io/storj/storagenode/gracefulexit
.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43”}

Overloaded storage node on the remote side (looks like have connection limit:2), calculating…

cat errors.log | grep -c "storage node overloaded"
184

2020-05-28T05:53:12.517507175Z 2020-05-28T05:53:12.517Z ERROR gracefulexit:chore failed to put piece. {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “Piece ID”: “EER6GZ7D6P5Y3JB6ZSYZYYUAJ32KZRLBB3UMIG
ZQDM6LRKUQP4EQ”, “error”: “protocol: pieces error: write config/storage/temp/blob-192182437.partial: no space left on device”, “errorVerbose”: “protocol: pieces error: write config/storage/temp/blob-192182437.partial: no space left on
> device\n\tstorj.io/uplink/private/piecestore.(*Upload).Write:182\n\tbufio.(*Writer).Flush:593\n\tbufio.(*Writer).Write:629\n\tstorj.io/uplink/private/piecestore.(*BufferedUpload).Write:32\n\tstorj.io/uplink/private/piecestore.(*Locki
ngUpload).Write:89\n\tio.copyBuffer:404\n\tio.Copy:364\n\tstorj.io/common/sync2.Copy:22\n\tstorj.io/uplink/private/piecestore.(*Client).UploadReader:76\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:216\n\tstorj.io/storj/sto
ragenode/gracefulexit.(*Worker).transferPiece:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43”}

No space on the remote side, but why satellite is offering storage node with no space for me? calculating…

cat errors.log | grep -c "no space left on device"
436

2020-05-28T05:57:35.202904511Z 2020-05-28T05:57:35.202Z ERROR gracefulexit:chore failed to put piece. {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “Piece ID”: “EWBDOJLYYNMKTBQXTUWG5Y567DLCLKSUY5D67E
5HRULXG22SUAOA”, “error”: “piecestore: rpccompat: EOF”, “errorVerbose”: "piecestore: rpccompat: EOF\n\tstorj.io/common/rpc.Dialer.dialTransport:284\n\tstorj.io/common/rpc.Dialer.dial:241\n\tstorj.io/common/rpc.Dialer.DialNode:140\n\ts
torj.io/uplink/private/piecestore.Dial:51\n\tstorj.io/uplink/private/ecclient.(*ecClient).dialPiecestore:65\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:193\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiec
e:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43"}

I did not completely understand this error, so just calculating…

cat errors.log | grep -c "rpccompat: EOF"
120

2020-05-28T05:57:49.315963180Z 2020-05-28T05:57:49.315Z ERROR gracefulexit:chore failed to put piece. {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “Piece ID”: “D2LSIGSO4TVUUIL3FSIIYRUPBBMXO776RSRDGM
ZYZGJLLDDNQQBA”, “error”: “protocol: pieces error: filestore error: open G:\Strong V3 interfase for WIN10\temp\blob-924928862.partial: There is not enough space on the disk.”, “errorVerbose”: "protocol: pieces error: filestore erro
> r: open G:\Strong V3 interfase for WIN10\temp\blob-924928862.partial: There is not enough space on the disk.\n\tstorj.io/uplink/private/piecestore.(*Upload).Write:182\n\tbufio.(*Writer).Flush:593\n\tbufio.(*Writer).Write:629\n\tsto
rj.io/uplink/private/piecestore.(*BufferedUpload).Write:32\n\tstorj.io/uplink/private/piecestore.(*LockingUpload).Write:89\n\tio.copyBuffer:404\n\tio.Copy:364\n\tstorj.io/common/sync2.Copy:22\n\tstorj.io/uplink/private/piecestore.(*Cl
ient).UploadReader:76\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:216\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/c
ommon/sync2.(*Limiter).Go.func1:43"}

Windows storagenode that not have space, the question is the same, why satellite is offering this storagenode

cat errors.log | grep -c "filestore error: open"
366

2020-05-28T06:02:47.243508871Z 2020-05-28T06:02:47.243Z ERROR gracefulexit:chore failed to put piece. {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “Piece ID”: “QKOEUBB6FJPOZMQX7SI5M7R7TOIHZVXC23GZG3
BIUWFKIE4RZUGA”, “error”: “protocol: expected piece hash; usedserialsdb error: database is locked”, “errorVerbose”: “protocol: expected piece hash; usedserialsdb error: database is locked\n\tstorj.io/storj/storagenode/gracefulexit.(*W
orker).transferPiece:224\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43”}

Our lovely error from our favorite topic on the remote side

cat errors.log | grep -c "database is locked"
150

2020-05-28T06:16:48.336886120Z 2020-05-28T06:16:48.336Z ERROR gracefulexit:chore failed to put piece. {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “Piece ID”: “A3622ZWQ5C2JYUTLRI7FH7NFYTJHYBPN4HLPDK
UDDNUWALJOQEFA”, “error”: “piecestore: rpccompat: dial tcp 72.80.88.83:28967: connect: no route to host”, “errorVerbose”: “piecestore: rpccompat: dial tcp 72.80.88.83:28967: connect: no route to host\n\tstorj.io/common/rpc.Dialer.dial
Transport:264\n\tstorj.io/common/rpc.Dialer.dial:241\n\tstorj.io/common/rpc.Dialer.DialNode:140\n\tstorj.io/uplink/private/piecestore.Dial:51\n\tstorj.io/uplink/private/ecclient.(*ecClient).dialPiecestore:65\n\tstorj.io/uplink/private
/ecclient.(*ecClient).PutPiece:193\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43”}

Communication issues to the Verizon side (wireless client!?)

cat errors.log | grep -c "no route to host"
162

2020-05-28T06:50:48.911313385Z 2020-05-28T06:50:48.911Z ERROR gracefulexit:chore failed to put piece. {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “Piece ID”: “WZSNWSRV3GE56H57223NKQ2ETU4XCJGMUV3ULF
4TS5NXA3RPGVBA”, “error”: “piecestore: rpccompat: dial tcp: lookup spinme.asuscomm.com on 192.168.83.254:53: server misbehaving”, “errorVerbose”: “piecestore: rpccompat: dial tcp: lookup spinme.asuscomm.com on 192.168.83.254:53: serve
r misbehaving\n\tstorj.io/common/rpc.Dialer.dialTransport:264\n\tstorj.io/common/rpc.Dialer.dial:241\n\tstorj.io/common/rpc.Dialer.DialNode:140\n\tstorj.io/uplink/private/piecestore.Dial:51\n\tstorj.io/uplink/private/ecclient.(*ecClie
nt).dialPiecestore:65\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:193\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/c
ommon/sync2.(*Limiter).Go.func1:43”}

DDNS resolving issue, but why satellite is offering name instead of IP? I remember it was reworked a long time ago.

cat errors.log | grep -c "dial tcp: lookup"
94

2020-05-28T07:00:02.723604423Z 2020-05-28T07:00:02.723Z ERROR gracefulexit:chore failed to put piece. {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “Piece ID”: “DED75GOCGNGYJXFMTRJTVNPR3ENA2N535NFVXJ
RZZC7BJS52BIRA”, “error”: “protocol: expected piece hash; context canceled”, “errorVerbose”: “protocol: expected piece hash; context canceled\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:224\n\tstorj.io/storj/sto
ragenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43”}

Remote side canceled request…

cat errors.log | grep -c "context canceled"
46

2020-05-28T07:10:14.789079205Z 2020-05-28T07:10:14.788Z ERROR gracefulexit:chore failed to put piece. {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “Piece ID”: “WB6OXPN2TOCG6APIGEOPVEYUSU5VFZA425ILYB
3EXOACL4MCZVWQ”, “error”: “piecestore: rpccompat: dial tcp 75.69.200.27:28967: connect: connection refused”, “errorVerbose”: “piecestore: rpccompat: dial tcp 75.69.200.27:28967: connect: connection refused\n\tstorj.io/common/rpc.Diale
r.dialTransport:264\n\tstorj.io/common/rpc.Dialer.dial:241\n\tstorj.io/common/rpc.Dialer.DialNode:140\n\tstorj.io/uplink/private/piecestore.Dial:51\n\tstorj.io/uplink/private/ecclient.(*ecClient).dialPiecestore:65\n\tstorj.io/uplink/p
rivate/ecclient.(*ecClient).PutPiece:193\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43”}

Remote side refused connections…

 cat errors.log | grep -c "connection refused"
280

2020-05-28T08:47:48.213756026Z 2020-05-28T08:47:48.213Z ERROR gracefulexit:chore failed to put piece. {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “Piece ID”: “GRQLBIYDPFBUOY6TA5Y2EAU2MHQSYDNVNA3GAI
44USIFQSZK7XIA”, “error”: "protocol: expected piece hash; pieces error: filestore error: sync S:\Storjplot_S1\temp\blob-134854034.partial: The request could not be performed because of an I/O device error.", “errorVerbose”: “protoc
ol: expected piece hash; pieces error: filestore error: sync S:\Storjplot_S1\temp\blob-134854034.partial: The request could not be performed because of an I/O device error.\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).trans
ferPiece:224\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43”}

Disk trouble on the remote side…

cat errors.log | grep -c "filestore error: sync"
2

2020-05-28T08:54:21.563290721Z 2020-05-28T08:54:21.562Z ERROR gracefulexit:chore failed to put piece. {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “Piece ID”: “5BU2IBIYCGMBQVFNBZY7SGO4YDM4HAG6HMHYTS
5TZF7FKRNLGEZA”, “error”: “protocol: out of space”, “errorVerbose”: “protocol: out of space\n\tstorj.io/uplink/private/piecestore.(*Upload).Write:182\n\tbufio.(*Writer).Flush:593\n\tbufio.(*Writer).Write:629\n\tstorj.io/uplink/private
/piecestore.(*BufferedUpload).Write:32\n\tstorj.io/uplink/private/piecestore.(*LockingUpload).Write:89\n\tio.copyBuffer:404\n\tio.Copy:364\n\tstorj.io/common/sync2.Copy:22\n\tstorj.io/uplink/private/piecestore.(*Client).UploadReader:7
6\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:216\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limite
r).Go.func1:43”}

Out of space on the remote side.

 cat errors.log | grep -c "out of space"
10

2020-05-28T10:32:04.022911065Z 2020-05-28T10:32:04.022Z ERROR gracefulexit:chore failed to put piece. {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “Piece ID”: “VKYODNEW2BB777EQ7VHW7NMXNLLZWOQRU7CCB3
ZO5L5B6PPIMWAA”, **“error”: “piecestore: rpccompat: dial tcp *****:: i/o timeout”, “errorVerbose”: “piecestore: rpccompat: dial tcp ******:: i/o timeout\n\tstorj.io/common/rpc.Dialer.dialTransport:264\n\tstorj.i
o/common/rpc.Dialer.dial:241\n\tstorj.io/common/rpc.Dialer.DialNode:140\n\tstorj.io/uplink/private/piecestore.Dial:51\n\tstorj.io/uplink/private/ecclient.(*ecClient).dialPiecestore:65\n\tstorj.io/uplink/private/ecclient.(*ecClient).Pu
tPiece:193\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:212\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limiter).Go.func1:43”}

I shocked… it was my node (secured with *****:) on another location that did GE a few weeks ago on Stefan’s satellite… It absolutely epic fail on the satellite side!
It definitely should not be offered to me during GE

cat errors.log | grep -c "i/o timeout"
44

2020-05-28T11:06:27.099539573Z 2020-05-28T11:06:27.099Z ERROR gracefulexit:chore failed to transfer piece. {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “error”: “write tcp 192.168.83.39:8376->78.
94.240.189:7777: write: broken pipe", “errorVerbose”: "write tcp 192.168.83.39:8376->78.94.240.189:7777: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).RawFlush:287\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:321\n\tstorj.
io/common/pb.(*drpcSatelliteGracefulExitProcessClient).Send:1345\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:274\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io/common/sync2.(*Limi
ter).Go.func1:43”}

Communication issues…

cat errors.log | grep -c "write: broken pipe"
1001

2020-05-28T11:38:41.658935955Z 2020-05-28T11:38:41.658Z ERROR gracefulexit:chore failed to transfer piece. {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “error”: "write tcp 192.168.83.39:14448->78
.94.240.189:7777: write: connection reset by peer", “errorVerbose”: “write tcp 192.168.83.39:14448->78.94.240.189:7777: write: connection reset by peer\n\tstorj.io/drpc/drpcstream.(*Stream).RawFlush:287\n\tstorj.io/drpc/drpcstream.(*S
tream).MsgSend:321\n\tstorj.io/common/pb.(*drpcSatelliteGracefulExitProcessClient).Send:1345\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:274\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\t
storj.io/common/sync2.(*Limiter).Go.func1:43”}

The remote peer is resetting connections…

cat errors.log | grep -c "connection reset by peer"
418

2020-05-28T22:22:10.727629759Z 2020-05-28T22:22:10.727Z ERROR gracefulexit:chore failed to transfer piece. {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “error”: "write tcp 192.168.83.39:50046->78
.94.240.189:7777: write: connection timed out", “errorVerbose”: “write tcp 192.168.83.39:50046->78.94.240.189:7777: write: connection timed out\n\tstorj.io/drpc/drpcstream.(*Stream).RawFlush:287\n\tstorj.io/drpc/drpcstream.(*Stream).M
sgSend:321\n\tstorj.io/common/pb.(*drpcSatelliteGracefulExitProcessClient).Send:1345\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).transferPiece:274\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func2:110\n\tstorj.io
/common/sync2.(*Limiter).Go.func1:43”}

Time out connections…

cat errors.log | grep -c "connection timed out"
149

So, let’s summarize this very long post. Most errors on the remote side and we can’t do anything with it. The next piece of error is a wrong offering from the satellite side and we also can’t do anything with it. Another big piece of error is routing/communication issues.

I prepare a simple table with graph:

Error type	Count of errors
filestore error: truncate	154
cannot assign requested address	480
disk I/O error	104
deadline exceeded	1912
storage node overloaded	184
no space left on device	436
rpccompat: EOF	120
filestore error: open	366
database is locked	150
no route to host	162
dial tcp: lookup	94
context canceled	46
connection refused	280
filestore error: sync	2
out of space	10
i/o timeout	44
write: broken pipe	1001
connection reset by peer	418
connection timed out	149

kevink · May 31, 2020, 3:48pm

Apprantly it was not implemented for GE, only for normal operations. Probably an oversight or priority was not high enough.

Guess we’ll need an official answer for that, however:

This means that you would actually have to fail 10% of the transfers at least 5 times before your node gets DQed.
You would still rely on remote nodes not throwing errors but it makes the probability of a failed GE a lot lower.

anon68609175 · May 31, 2020, 3:53pm

Doesn’t it seem strange that my success depends on the other side while I’m doing well? After all, in this case, the satellite may be interested in deliberately issuing bad receiving nodes.
In any case, we need to wait for an official answer why it was done as it was done.

kevink · May 31, 2020, 3:54pm

Of course, nobody is questioning that.

This is just plain unfair and ridiculous… The satellite gains nothing in a failed GE and has to rebuild the lost pieces eventually.

anon68609175 · May 31, 2020, 3:57pm

Why nothing? For example my escrow was $135 and used storage only 100Gb or $1 repair cost. So difference is $134

nerdatwork · May 31, 2020, 4:03pm

I always like how you are detailed and meticulous in your research and findings. I think this should be under SNO since its related to SNOs and not the platform.

Krey · May 31, 2020, 4:09pm

No, this is about everything, not just about SNO. The product is too raw to be called production.
His current state only scares away customers. And SNOs too.

Odmin · May 31, 2020, 4:43pm

Thanks!
I just decide put this investigation here because I pay attention to GE implementation, I not complain from SNO point of view, but I collected all this information like a “regular SNO”. So, I think it will be helpful information from an engineering point of view to improve the entire GE process and error handling, but I am open for move to any other location if needed.

littleskunk · May 31, 2020, 5:15pm

Can you group the errors by piece ID? (sort | uniq -c) hopefully not a single piece failed 5 times in a row which would mean your error rate was 0%

For fun I can check your node ID on the satellite side.

Odmin · May 31, 2020, 5:36pm

Did it with cat errors.log |cut -d : -f 8| sort | uniq -c | sort -g -r > errors_sorted.txt
The output is too long but you are right, no pieces failed 5 times in a row.
I will send my node id via PM now.

Odmin · May 31, 2020, 6:09pm

@littleskunk So, I think error handler already implemented?
I like your changelogs, could you please tell a little bit more about how GE is working? (especially how to remote errors is handling) I have no information, just a few posts from @Alexey and that all…

BrightSilence · May 31, 2020, 6:09pm

You’ve had roughly 8.4% of transfers failed. That sounds like a lot. But based on that number it’s easy to calculate the chance of failure of a single piece because of remote issues after 5 attempts.

8.4%^5=0.000418%

Given that the threshold is 10% I don’t think this means it is dangerous at all. In fact it’s negligible. That doesn’t mean that the issues listed should not be fixed though. At least those that have to do with selecting the wrong nodes.
But it sounds to me like the only way to fail GE is when you have issues on your own end.

So yeah, it really doesn’t depend on the other side, despite the initial impressions these numbers may give. You’ll be fine if everything is ok on your end.
It would also not be in the satellites best interest to make the graceful exit fail as the repair would be much more costly than the held amount returned. So they would be shooting themselves in the foot.

Edit: I guess they could make you fail 11% of the transfers and limit repairs to only that part. Maybe that would be slightly more profitable? I’m not even sure if it would be.

Odmin · May 31, 2020, 6:25pm

Yep, I absolutely agree with you. I used “dangerous” because all errors on the remote side, and I can’t do anything to improve the situation with errors, so my success rate depending on how many “good nodes” in the network and how many “bad nodes” satellite is offered for me. It like a Russian roulette risk is high and I named it “dangerous” Another inspiration of “dangerous” is that you can did GE only once and can’t retry it or pause (like a car without brake).
My error rate looks like 0% if I correct understand @littleskunk and how to satellite calculating real errors (I think should be 5 retry in row = 1 error).

Krey · May 31, 2020, 6:32pm

Nevertheless, we saw many GEs leading to disqualification on perfectly worked nodes. I would like to analyze the log with such ge. Something doubts take me about five errors from a piece.

BrightSilence · May 31, 2020, 6:38pm

Some nodes were disqualified after the GE succeeded. This is/was a bug, but they got the receipt of success and the payout as well. I’m not aware of any failed GE’s that didn’t have problems on their own end. Let me know if I overlooked something. Can you link me to one? Or have you encountered this yourself?

Well, in this case you could only fail 9.999582‬% instead of 10%. I’d say your success didn’t depend on that. But perhaps this would make it a little more clear. You’d have to fail 63% of transfers to get to 10% after 5 attempts. If that were the failure rate on the network, it wouldn’t exist anymore. So basically the network would go under before this is going to be an issue for graceful exit.

It is a one way street, so you have to be careful. But I don’t currently see any reason to blame a bad implementation for failing GE. But if I overlooked some aspect, please let me know.

Odmin · May 31, 2020, 6:50pm

Why are you thinking that I “blame implementation for failing GE” ?

Odmin · May 31, 2020, 6:52pm

@BrightSilence

I creating this topic to pay attention to the Graceful Exit (GE) implementation and node selection on the satellite side during GE.

This topic - is engeener discussions, no any blame/complain, only engineering constructive is here. So, let’s keep it clean.

BrightSilence · May 31, 2020, 7:05pm

I guess I was just misled by the word danger then. I see no danger to the SNO, the customer data or the network in any of this. The impact on the success chance is pretty much 0. But I guess you just mean errors in graceful exit. And I appreciate you pointing those out in a detailed way. Keep up the good work!

Odmin · May 31, 2020, 7:08pm

I’m sorry, maybe I chose a too loud title for the topic