GE problems: looks like something is broken

It looks like the GE system has some big problems.

023-05-15T06:58:16.383+0300 ERROR piecetransfer failed to put piece {“Satellite ID”: “12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB”, “Piece ID”: “IIKERLYPJFIZRP5IJTGIBPCHEK2AXPT6MXCPRK27BMWTVKA46O4A”, “Storagenode ID”: “12oR7DXV4ASc57kSGYbu99rLi4XkgJxjkBfnDJeDzX9Yjeu9hCm”, “error”: “ecclient: upload failed (node:12oR7DXV4ASc57kSGYbu99rLi4XkgJxjkBfnDJeDzX9Yjeu9hCm, address:185.24.52.123:28969): rpc: dial tcp 185.24.52.123:28969: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.”, “errorVerbose”: “ecclient: upload failed (node:12oR7DXV4ASc57kSGYbu99rLi4XkgJxjkBfnDJeDzX9Yjeu9hCm, address:185.24.52.123:28969): rpc: dial tcp 185.24.52.123:28969: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:244\n\tstorj.io/storj/storagenode/piecetransfer.(*service).TransferPiece:148\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:100\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49”}

I have a lot of these errors.

Are all nodes different?
If so, I suspect that something is blocking outgoing connections.

I checked everything but can't find anything blocking connections, and the nodes are also working OK.

The version is 1.76. Maybe some version difference is causing the problem, but as of today 1.76 is the latest.

Could it be related to the TCP Fast Open protocol?

hey @Vadim we are looking into this; please continue to post issues you encounter to help us find and fix them! CC @thepaul might have more questions for you.

I can send you my logs if that helps. My logs contain only errors, as I have the log level set to WARN. I can also send configs, IP, and so on if needed.
Also, as I posted in the 1.79 pre-release thread, I tried to update to 1.79 and the GE status is broken there as well.

Release preparation v1.79 - Product Discussions / STORJLINGS - Storj Community Forum (official)

I see several types of errors.
One I posted in the first post; the second is:
2023-05-15T17:13:05.093+0300 ERROR gracefulexit:chore.12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB@europe-north-1.tardigrade.io:7777 failed to send notification about piece transfer. {“Satellite ID”: “12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB”, “error”: “EOF”, “errorVerbose”: “EOF\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:105\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49”}

It looks like the whole TCP layer has been broken since you integrated TCP Fast Open.

I tried versions 1.75 and 1.74 and the errors are the same, so it looks more like the problem is somewhere on the receiver side: nodes ignoring connections or something. But the satellites are also not accepting the connection to receive the report.

That error (“piecetransfer failed to put piece”) is fine in most cases. It just means the destination node was offline. Your node will report its failure, and the satellite will assign a different destination node for that piece. Your node only completely fails to transmit the piece if it fails five transfers in a row.
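As a rough illustration of that rule (this is not the actual storagenode code; the function name and the list-of-booleans input are made up for the sketch), a piece only counts as completely failed once five attempts in a row have failed:

```python
MAX_CONSECUTIVE_FAILURES = 5  # "five transfers in a row", as described above


def piece_transfer_failed(attempt_results):
    """Return True if a piece should be considered completely failed.

    attempt_results is a hypothetical list of booleans, one per transfer
    attempt in order (True = delivered, False = failed).
    """
    consecutive = 0
    for ok in attempt_results:
        if ok:
            return False  # the piece was delivered; earlier failures do not matter
        consecutive += 1
        if consecutive >= MAX_CONSECUTIVE_FAILURES:
            return True
    return False  # fewer than five failures so far; another attempt is expected
```

So a piece that fails against four destination nodes and then succeeds on the fifth attempt is not a failed piece.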

The second error you mention (“failed to send notification about piece transfer”) is more of a problem, because it means your node may have to resend some pieces that it already sent. I think that is related to a connection timeout setting somewhere—I’m currently looking into where that timeout is set or how we fix this. But it has been going on for longer than we’ve had TCP fastopen, so I don’t think fastopen is related.

It just can't be that there are so many offline nodes.

Sorry to interrupt. But doesn't the satellite keep track of offline nodes? Why would it assign a piece to an offline node? Or are offline nodes selected in the hope that they are online and just falsely marked as offline?

How is it possible that all these nodes are offline at the same time?

2023-05-16T15:18:07.634+0300 ERROR piecetransfer failed to put piece {Satellite ID: 12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB, Piece ID: HSAZOMBIYJNRC3SVETMA2CACHX6VLSYHHAX4NGNFVRFAIJJXO7FA, Storagenode ID: 12pzXkKb18LwrGCF89tSPrqG4jwYPX3bz64HbipY17AyErQiNFP, error: ecclient: upload failed (node:12pzXkKb18LwrGCF89tSPrqG4jwYPX3bz64HbipY17AyErQiNFP, address:77.125.28.31:28967): rpc: dial tcp 77.125.28.31:28967: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond., errorVerbose: ecclient: upload failed (node:12pzXkKb18LwrGCF89tSPrqG4jwYPX3bz64HbipY17AyErQiNFP, address:77.125.28.31:28967): rpc: dial tcp 77.125.28.31:28967: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:244\n\tstorj.io/storj/storagenode/piecetransfer.(*service).TransferPiece:148\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:100\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49}
2023-05-16T15:18:19.332+0300 ERROR piecetransfer failed to put piece {Satellite ID: 12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB, Piece ID: 7G7LMP5F6JB64FNOIMYFBOPHVOXK345PKWJY3ZJJ2G5SET3GSG2Q, Storagenode ID: 127Cx52wLctkcDSpBzHQ9ytRJHEJyvtGd7aanF81rrR6uSSEMEE, error: ecclient: upload failed (node:127Cx52wLctkcDSpBzHQ9ytRJHEJyvtGd7aanF81rrR6uSSEMEE, address:129.151.157.172:29140): rpc: dial tcp 129.151.157.172:29140: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond., errorVerbose: ecclient: upload failed (node:127Cx52wLctkcDSpBzHQ9ytRJHEJyvtGd7aanF81rrR6uSSEMEE, address:129.151.157.172:29140): rpc: dial tcp 129.151.157.172:29140: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:244\n\tstorj.io/storj/storagenode/piecetransfer.(*service).TransferPiece:148\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:100\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49}
2023-05-16T15:18:20.268+0300 ERROR piecetransfer failed to put piece {Satellite ID: 12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB, Piece ID: 2TCSUAQ5IBT3Q2VQFCVYTTCTP2A3ISRK3JFX46AIKTLEX5FDZVEQ, Storagenode ID: 12tWYGyhAyKU7kXKA76hh1H7QpFvRtQygdpahupD7UJ9VijvdBt, error: ecclient: upload failed (node:12tWYGyhAyKU7kXKA76hh1H7QpFvRtQygdpahupD7UJ9VijvdBt, address:83.33.191.86:29106): rpc: dial tcp 83.33.191.86:29106: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond., errorVerbose: ecclient: upload failed (node:12tWYGyhAyKU7kXKA76hh1H7QpFvRtQygdpahupD7UJ9VijvdBt, address:83.33.191.86:29106): rpc: dial tcp 83.33.191.86:29106: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:244\n\tstorj.io/storj/storagenode/piecetransfer.(*service).TransferPiece:148\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:100\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49}
2023-05-16T15:18:21.591+0300 ERROR piecetransfer failed to put piece {Satellite ID: 12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB, Piece ID: PMCUH5JXSUYJ7JONGSK7MW6PSP4GJHQGU2GBEPJN4FFFYKLCGTZA, Storagenode ID: 12CwKXosvCqty4eErZTa5Lg7jkX4NpjHVb7MdnfenHvpMEvatAY, error: ecclient: upload failed (node:12CwKXosvCqty4eErZTa5Lg7jkX4NpjHVb7MdnfenHvpMEvatAY, address:152.70.148.92:29110): rpc: dial tcp 152.70.148.92:29110: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond., errorVerbose: ecclient: upload failed (node:12CwKXosvCqty4eErZTa5Lg7jkX4NpjHVb7MdnfenHvpMEvatAY, address:152.70.148.92:29110): rpc: dial tcp 152.70.148.92:29110: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:244\n\tstorj.io/storj/storagenode/piecetransfer.(*service).TransferPiece:148\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:100\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49}
2023-05-16T15:18:29.938+0300 ERROR piecetransfer failed to put piece {Satellite ID: 12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB, Piece ID: XI3JZADFBYA3A6WL43MND5DB2YGJXFADUYRT5CCBVHU54MV6AVPA, Storagenode ID: 12d26rnJQnvQ2P8oJTCmUNvnHwokSG3yEhA3TeBTJgtjM4JrqcH, error: ecclient: upload failed (node:12d26rnJQnvQ2P8oJTCmUNvnHwokSG3yEhA3TeBTJgtjM4JrqcH, address:51.161.88.68:28967): rpc: dial tcp 51.161.88.68:28967: connectex: No connection could be made because the target machine actively refused it., errorVerbose: ecclient: upload failed (node:12d26rnJQnvQ2P8oJTCmUNvnHwokSG3yEhA3TeBTJgtjM4JrqcH, address:51.161.88.68:28967): rpc: dial tcp 51.161.88.68:28967: connectex: No connection could be made because the target machine actively refused it.\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:244\n\tstorj.io/storj/storagenode/piecetransfer.(*service).TransferPiece:148\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:100\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49}
2023-05-16T15:18:32.091+0300 ERROR piecetransfer failed to put piece {Satellite ID: 12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB, Piece ID: TNI7GJHQUOVT5FP6DYPDQ2UI44Z5FUWOAMRSIRDAT2LM65HTRCMA, Storagenode ID: 16HHisawgQTrXmQNphVDsFB7NfT8DLnNNMwt3jnQLqGy7vkjxE, error: ecclient: upload failed (node:16HHisawgQTrXmQNphVDsFB7NfT8DLnNNMwt3jnQLqGy7vkjxE, address:152.67.91.180:443): protocol: expected piece hash; Upload too small: 768; EOF, errorVerbose: ecclient: upload failed (node:16HHisawgQTrXmQNphVDsFB7NfT8DLnNNMwt3jnQLqGy7vkjxE, address:152.67.91.180:443): protocol: expected piece hash; Upload too small: 768; EOF\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:244\n\tstorj.io/storj/storagenode/piecetransfer.(*service).TransferPiece:148\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:100\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49}
2023-05-16T15:18:33.579+0300 ERROR piecetransfer failed to put piece {Satellite ID: 12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB, Piece ID: TXODVPUSW7QXB67THGKKLHG6IF3T7PPWF5ADV7WGZ7EKU4ZM76OQ, Storagenode ID: 12uK5iyBJopiRCNYUeete5GHr8A82BAfxWpiqm4VAnD6nWLHt1v, error: ecclient: upload failed (node:12uK5iyBJopiRCNYUeete5GHr8A82BAfxWpiqm4VAnD6nWLHt1v, address:129.151.129.149:29103): rpc: dial tcp 129.151.129.149:29103: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond., errorVerbose: ecclient: upload failed (node:12uK5iyBJopiRCNYUeete5GHr8A82BAfxWpiqm4VAnD6nWLHt1v, address:129.151.129.149:29103): rpc: dial tcp 129.151.129.149:29103: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:244\n\tstorj.io/storj/storagenode/piecetransfer.(*service).TransferPiece:148\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:100\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49}

There were 469 times in 24h when nodes didn't respond to a piece transfer call. That is a bit too many offline nodes, so I think there is something bigger going on.
You can find all of them here; I made it sortable.

Not all of these are simple connection failures: some of them are nodes rejecting small uploads, some accept the connection but then time out, and some have something else go wrong. The individual error for each failure is given.
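To see how the failures split up, one option is to bucket the log lines by the error text. Here is a minimal sketch in Python; the category labels are my own, and the substrings are simply copied from the messages quoted in this thread, so anything that does not match ends up under "other":

```python
from collections import Counter

# (label, substring) pairs; the substrings come from the log lines quoted above.
CATEGORIES = [
    ("timeout", "did not properly respond after a period of time"),
    ("refused", "actively refused it"),
    ("upload too small", "Upload too small"),
]


def classify(line):
    """Return a category label for one 'failed to put piece' log line."""
    for label, needle in CATEGORIES:
        if needle in line:
            return label
    return "other"


def summarize(lines):
    """Count failures per category, skipping unrelated log lines."""
    counts = Counter()
    for line in lines:
        if "failed to put piece" in line:
            counts[classify(line)] += 1
    return counts
```

Passing it the lines of the node's log file (any iterable of strings works, including an open file object) gives a per-category count, which makes it easier to tell plain connection timeouts apart from the other kinds.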

If your node is transmitting 10000s or 100000s of pieces in 24h (a normal rate for GE), experiencing 469 failures on a public network is not a big problem. Storj is made to offer a reliable storage platform on top of less-reliable community-run storage nodes contacted over the public Internet, with all of the possible failure modes that entails.
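To put the 469 figure in proportion, here is the arithmetic as a small Python snippet. The transfer totals are assumed round numbers at the low end of the ranges mentioned above, not measurements from this node:

```python
def failure_rate(failures, transfers):
    """Fraction of piece transfers that failed."""
    if transfers <= 0:
        raise ValueError("transfers must be positive")
    return failures / transfers


FAILURES = 469  # failures in 24h, as reported earlier in the thread

# Assumed totals for illustration only.
for transfers in (10_000, 100_000):
    rate = failure_rate(FAILURES, transfers)
    print(f"{transfers} transfers: {rate:.2%} failed")
```

With those assumed totals, the failures come to roughly 4.7% of transfers in the first case and under 0.5% in the second.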

The satellite keeps track of nodes that haven’t checked in for some period of time, which isn’t exactly “all offline nodes” but it bears a passing resemblance. And you’re right, the satellite won’t assign a node to accept pieces if it thinks that node is offline. But of course, “node N checked in 45 minutes ago” is not exactly the same thing as “node N is online”. So it’s very normal to be assigned some nodes which can’t be contacted or time out when you try to upload data.
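A toy version of that distinction, in Python. The one-hour window and the function name are assumptions made for the example; the real window the satellite uses is not stated here:

```python
from datetime import datetime, timedelta

# Hypothetical window; the satellite's actual setting is not given in this thread.
CHECK_IN_WINDOW = timedelta(hours=1)


def looks_online(last_check_in, now, window=CHECK_IN_WINDOW):
    """True if the node checked in recently enough to be offered pieces."""
    return now - last_check_in <= window


now = datetime(2023, 5, 16, 15, 18)
last_seen = now - timedelta(minutes=45)

# The node passes the check, yet it may have gone offline at any point
# during the 45 minutes since it last checked in.
print(looks_online(last_seen, now))
```

The check only says when the node was last heard from, so a node that dropped off ten minutes ago would still be handed out as a destination.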

Vadim are you leaving us? :cry: Or are you just shutting down some nodes?

I am trying to save the node. It holds around 7 TB, and I need to get rid of around 4 TB to be able to copy the disk and send it in under warranty. The HDD disappears from time to time, which is not normal, and it has also become very slow, so I am doing GE from the test satellites on this node.

Could it be as simple as the SATA cable having wiggled itself out? (I had it happen before :slight_smile: )

The SATA cable, SATA port, and power port have all been changed, and SMART also shows errors. So, as there is a warranty, I hope they replace it with a new one; it is an enterprise disk.
