I see several types of errors. One I posted in the first post; the second is:
2023-05-15T17:13:05.093+0300 ERROR gracefulexit:chore.12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB@europe-north-1.tardigrade.io:7777 failed to send notification about piece transfer. {"Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "error": "EOF", "errorVerbose": "EOF\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:105\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49"}
It looks like the whole TCP layer has been broken since TCP fast open was integrated.
I tried versions 1.75 and 1.74 and the errors are the same, so it looks more like something on the receiver side: nodes ignoring connections or something. But the satellites are also not accepting the connection to get the report.
That error (“piecetransfer failed to put piece”) is fine in most cases. It just means the destination node was offline. Your node will report its failure, and the satellite will assign a different destination node for that piece. Your node only completely fails to transmit the piece if it fails five transfers in a row.
The second error you mention (“failed to send notification about piece transfer”) is more of a problem, because it means your node may have to resend some pieces that it already sent. I think that is related to a connection timeout setting somewhere—I’m currently looking into where that timeout is set or how we fix this. But it has been going on for longer than we’ve had TCP fastopen, so I don’t think fastopen is related.
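The retry behaviour described above can be sketched roughly like this (a minimal illustration only; all names here are hypothetical, and the real logic lives in the Go code under `storj.io/storj/storagenode/piecetransfer`):

```python
# Sketch of the retry policy: a piece is only counted as permanently
# failed after 5 consecutive failed attempts, each against a (possibly
# different) destination node assigned by the satellite.
MAX_CONSECUTIVE_FAILURES = 5

def transfer_piece(piece_id, pick_destination, upload):
    """Try to move one piece; return True on success, False only if
    the piece fails MAX_CONSECUTIVE_FAILURES transfers in a row."""
    failures = 0
    while failures < MAX_CONSECUTIVE_FAILURES:
        node = pick_destination(piece_id)   # satellite assigns a destination
        if upload(piece_id, node):          # one "failed to put piece" per miss
            return True
        failures += 1                       # destination offline / timed out
    return False                            # only now is the piece truly failed
```

So each "failed to put piece" line is one failed attempt, not a lost piece.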
Sorry to interrupt, but doesn't the satellite keep track of offline nodes? Why would it assign a piece to an offline node? Or are offline nodes selected in the hope that they are online and are just falsely marked as offline?
How is it possible that all these nodes are offline at the same time?
2023-05-16T15:18:07.634+0300 ERROR piecetransfer failed to put piece
{Satellite ID: 12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB, Piece ID: HSAZOMBIYJNRC3SVETMA2CACHX6VLSYHHAX4NGNFVRFAIJJXO7FA, Storagenode ID: 12pzXkKb18LwrGCF89tSPrqG4jwYPX3bz64HbipY17AyErQiNFP, error: ecclient: upload failed (node:12pzXkKb18LwrGCF89tSPrqG4jwYPX3bz64HbipY17AyErQiNFP, address:77.125.28.31:28967): rpc: dial tcp 77.125.28.31:28967: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond., errorVerbose: ecclient: upload failed (node:12pzXkKb18LwrGCF89tSPrqG4jwYPX3bz64HbipY17AyErQiNFP, address:77.125.28.31:28967): rpc: dial tcp 77.125.28.31:28967: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:244\n\tstorj.io/storj/storagenode/piecetransfer.(*service).TransferPiece:148\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:100\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49}
2023-05-16T15:18:19.332+0300 ERROR piecetransfer failed to put piece
{Satellite ID: 12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB, Piece ID: 7G7LMP5F6JB64FNOIMYFBOPHVOXK345PKWJY3ZJJ2G5SET3GSG2Q, Storagenode ID: 127Cx52wLctkcDSpBzHQ9ytRJHEJyvtGd7aanF81rrR6uSSEMEE, error: ecclient: upload failed (node:127Cx52wLctkcDSpBzHQ9ytRJHEJyvtGd7aanF81rrR6uSSEMEE, address:129.151.157.172:29140): rpc: dial tcp 129.151.157.172:29140: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond., errorVerbose: ecclient: upload failed (node:127Cx52wLctkcDSpBzHQ9ytRJHEJyvtGd7aanF81rrR6uSSEMEE, address:129.151.157.172:29140): rpc: dial tcp 129.151.157.172:29140: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:244\n\tstorj.io/storj/storagenode/piecetransfer.(*service).TransferPiece:148\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:100\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49}
2023-05-16T15:18:20.268+0300 ERROR piecetransfer failed to put piece
{Satellite ID: 12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB, Piece ID: 2TCSUAQ5IBT3Q2VQFCVYTTCTP2A3ISRK3JFX46AIKTLEX5FDZVEQ, Storagenode ID: 12tWYGyhAyKU7kXKA76hh1H7QpFvRtQygdpahupD7UJ9VijvdBt, error: ecclient: upload failed (node:12tWYGyhAyKU7kXKA76hh1H7QpFvRtQygdpahupD7UJ9VijvdBt, address:83.33.191.86:29106): rpc: dial tcp 83.33.191.86:29106: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond., errorVerbose: ecclient: upload failed (node:12tWYGyhAyKU7kXKA76hh1H7QpFvRtQygdpahupD7UJ9VijvdBt, address:83.33.191.86:29106): rpc: dial tcp 83.33.191.86:29106: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:244\n\tstorj.io/storj/storagenode/piecetransfer.(*service).TransferPiece:148\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:100\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49}
2023-05-16T15:18:21.591+0300 ERROR piecetransfer failed to put piece
{Satellite ID: 12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB, Piece ID: PMCUH5JXSUYJ7JONGSK7MW6PSP4GJHQGU2GBEPJN4FFFYKLCGTZA, Storagenode ID: 12CwKXosvCqty4eErZTa5Lg7jkX4NpjHVb7MdnfenHvpMEvatAY, error: ecclient: upload failed (node:12CwKXosvCqty4eErZTa5Lg7jkX4NpjHVb7MdnfenHvpMEvatAY, address:152.70.148.92:29110): rpc: dial tcp 152.70.148.92:29110: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond., errorVerbose: ecclient: upload failed (node:12CwKXosvCqty4eErZTa5Lg7jkX4NpjHVb7MdnfenHvpMEvatAY, address:152.70.148.92:29110): rpc: dial tcp 152.70.148.92:29110: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:244\n\tstorj.io/storj/storagenode/piecetransfer.(*service).TransferPiece:148\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:100\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49}
2023-05-16T15:18:29.938+0300 ERROR piecetransfer failed to put piece
{Satellite ID: 12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB, Piece ID: XI3JZADFBYA3A6WL43MND5DB2YGJXFADUYRT5CCBVHU54MV6AVPA, Storagenode ID: 12d26rnJQnvQ2P8oJTCmUNvnHwokSG3yEhA3TeBTJgtjM4JrqcH, error: ecclient: upload failed (node:12d26rnJQnvQ2P8oJTCmUNvnHwokSG3yEhA3TeBTJgtjM4JrqcH, address:51.161.88.68:28967): rpc: dial tcp 51.161.88.68:28967: connectex: No connection could be made because the target machine actively refused it., errorVerbose: ecclient: upload failed (node:12d26rnJQnvQ2P8oJTCmUNvnHwokSG3yEhA3TeBTJgtjM4JrqcH, address:51.161.88.68:28967): rpc: dial tcp 51.161.88.68:28967: connectex: No connection could be made because the target machine actively refused it.\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:244\n\tstorj.io/storj/storagenode/piecetransfer.(*service).TransferPiece:148\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:100\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49}
{Satellite ID: 12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB, Piece ID: TXODVPUSW7QXB67THGKKLHG6IF3T7PPWF5ADV7WGZ7EKU4ZM76OQ, Storagenode ID: 12uK5iyBJopiRCNYUeete5GHr8A82BAfxWpiqm4VAnD6nWLHt1v, error: ecclient: upload failed (node:12uK5iyBJopiRCNYUeete5GHr8A82BAfxWpiqm4VAnD6nWLHt1v, address:129.151.129.149:29103): rpc: dial tcp 129.151.129.149:29103: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond., errorVerbose: ecclient: upload failed (node:12uK5iyBJopiRCNYUeete5GHr8A82BAfxWpiqm4VAnD6nWLHt1v, address:129.151.129.149:29103): rpc: dial tcp 129.151.129.149:29103: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.\n\tstorj.io/uplink/private/ecclient.(*ecClient).PutPiece:244\n\tstorj.io/storj/storagenode/piecetransfer.(*service).TransferPiece:148\n\tstorj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:100\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49}
Nodes failed to respond to the piece transfer call 469 times in 24 hours. That is a bit too many offline nodes, so I think there is something bigger going on.
You can find it all here; I made it sortable.
Not all of these are simple connection failures: some of them are nodes rejecting small uploads, some accept the connection but then time out, and some have something else go wrong. The individual error for each failure is given.
If your node is transmitting 10000s or 100000s of pieces in 24h (a normal rate for GE), experiencing 469 failures on a public network is not a big problem. Storj is made to offer a reliable storage platform on top of less-reliable community-run storage nodes contacted over the public Internet, with all of the possible failure modes that entails.
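To put 469 in perspective, here is the failure rate at a few illustrative daily transfer counts (the volumes are assumptions for the sake of the arithmetic, not your node's actual numbers):

```python
failed = 469
# Hypothetical daily transfer counts for a graceful exit; the real
# number depends on node size and bandwidth.
for attempted in (10_000, 50_000, 100_000):
    rate = failed / attempted * 100
    print(f"{failed}/{attempted} failed = {rate:.2f}%")
# prints:
# 469/10000 failed = 4.69%
# 469/50000 failed = 0.94%
# 469/100000 failed = 0.47%
```

Even at the low end that is under 5% of attempts, and each attempt gets retried to a new destination anyway.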
The satellite keeps track of nodes that haven’t checked in for some period of time, which isn’t exactly “all offline nodes” but it bears a passing resemblance. And you’re right, the satellite won’t assign a node to accept pieces if it thinks that node is offline. But of course, “node N checked in 45 minutes ago” is not exactly the same thing as “node N is online”. So it’s very normal to be assigned some nodes which can’t be contacted or time out when you try to upload data.
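That "checked in recently" heuristic can be sketched like this (the window length and names are illustrative assumptions, not the satellite's actual values):

```python
from datetime import datetime, timedelta

# Illustrative staleness window; the satellite's real cutoff may differ.
ONLINE_WINDOW = timedelta(hours=4)

def looks_online(last_contact: datetime, now: datetime) -> bool:
    """Presume a node online if it checked in within the window.
    This is only a heuristic: 'checked in 45 minutes ago' is not
    the same thing as 'reachable right now'."""
    return now - last_contact <= ONLINE_WINDOW
```

A node that passes this filter can still refuse or time out on the actual upload, which is exactly what the "failed to put piece" lines show.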
I am trying to save the node; it is around 7 TB, and I need to get rid of around 4 TB to be able to copy the disk and bring it in for warranty. The HDD disappearing from time to time is not normal, and it has also become very slow, so I am doing a GE from the test satellites on this node.
The SATA cable, SATA port, and power connector have all been changed, and SMART also shows errors. So, as there is a warranty, I hope they replace it with a new one; it is an enterprise disk.
I’m getting the same kind of errors in the log file as Vadim, but the traffic is consistent with a GE (10 Mb/s outbound and free disk space increasing). What concerns me is the message I’m getting from the GE status report:
2023-05-19T19:34:27.734Z INFO Anonymized tracing enabled {Process: storagenode}
2023-05-19T19:34:27.758Z FATAL Failed to load identity.
{Process: storagenode, error: file or directory not found: open /identity.cert: no such file or directory, errorVerbose: file or directory not found: open /identity.cert: no such file or directory\n\tstorj.io/common/identity.Config.Load:326\n\tmain.cmdGracefulExitStatus:186\n\tmain.newGracefulExitStatusCmd.func1:59\n\tstorj.io/private/process.cleanup.func1.4:399\n\tstorj.io/private/process.cleanup.func1:417\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomOptions:113\n\tmain.main:29\n\truntime.main:250}
In the first 24 hours it was reporting correctly, but not anymore. Is there anything I can do about it?
Thanks. As it seems to be doing its work, I don’t worry too much about it. Is your GE progressing correctly despite the errors?
I looked around and still have no clue how to downgrade the version.
You do not need to downgrade, however, there is a bug when you call exit-status:
It’s reported to the team but I do not have ETA yet.
But you may download a 1.76.2 binary for your OS and call it with the mandatory parameters (you would use it only to check the status, not as a regular storagenode binary).
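For example (the paths below are placeholders for your own config and identity directories; the FATAL above suggests `exit-status` was run without being told where the identity lives):

```shell
# Example invocation only; adjust the paths to your setup.
# --identity-dir points exit-status at your node identity so it can
# authenticate to the satellites and fetch the exit progress report.
./storagenode exit-status \
    --config-dir /app/config \
    --identity-dir /app/identity
```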