Hi,
I run a Storj node in a FreeBSD jail (inside TrueNAS), and it has been running fine for 4 months.
It automatically updates to a new version.
About 2 days ago it auto-updated to version 1.86.1.
And now I get a QUIC misconfiguration error.
The port forwarding and IP have not changed and worked fine.
It did get a normal ingress/egress yesterday, so it doesn’t seem to have influenced the node.
We don’t know that. It did start happening with the 1.86.1 update, on all three of my unrelated machines. Whether this is a storagenode bug (including in its libraries) or simply a change that surfaced a bug somewhere else needs determining.
All other services work fine, so it’s related to storagenode. This does not mean it’s a storagenode bug. But it doesn’t mean the opposite either. We just don’t know; more investigation is needed. I would start by comparing go.mod before and after, in case some third-party networking lib was updated and introduced this behavior. I doubt the issue would be in the storagenode source itself.
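For anyone who wants to do that comparison without eyeballing two files, here is a minimal sketch in Go. It assumes you have the text of both go.mod files at hand. The `parseRequires` and `diffRequires` helpers are names I made up, the parser is deliberately simplistic (it ignores `replace` and `exclude` directives and does not report added or removed modules), and the embedded fragments are illustrative excerpts, not the real files. The `example.com/unchanged` module is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// parseRequires extracts module -> version pairs from go.mod text.
// It handles single-line `require a v1` and `require ( ... )` blocks.
// Simplification: replace/exclude directives are ignored.
func parseRequires(gomod string) map[string]string {
	deps := make(map[string]string)
	inBlock := false
	sc := bufio.NewScanner(strings.NewReader(gomod))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.Index(line, "//"); i >= 0 {
			line = line[:i] // drop comments such as "// indirect"
		}
		line = strings.TrimSpace(line)
		switch {
		case line == "require (":
			inBlock = true
			continue
		case inBlock && line == ")":
			inBlock = false
			continue
		case strings.HasPrefix(line, "require "):
			line = strings.TrimPrefix(line, "require ")
		case !inBlock:
			continue // module, go, replace, and other directives
		}
		if f := strings.Fields(line); len(f) == 2 {
			deps[f[0]] = f[1]
		}
	}
	return deps
}

// diffRequires lists modules present in both maps whose version differs,
// sorted by module path.
func diffRequires(before, after map[string]string) []string {
	var out []string
	for mod, oldV := range before {
		if newV, ok := after[mod]; ok && newV != oldV {
			out = append(out, fmt.Sprintf("%s: %s -> %s", mod, oldV, newV))
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Illustrative excerpts only; paste the real go.mod contents here.
	oldMod := `require (
	example.com/unchanged v1.0.0
	github.com/quic-go/quic-go v0.32.0
	golang.org/x/net v0.9.0
)`
	newMod := `require (
	example.com/unchanged v1.0.0
	github.com/quic-go/quic-go v0.37.4
	golang.org/x/net v0.10.0
)`
	for _, change := range diffRequires(parseRequires(oldMod), parseRequires(newMod)) {
		fmt.Println(change)
	}
}
```

A plain `diff` of the two files works too; the only advantage here is that unchanged lines and comment noise are filtered out.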
I did not analyze all code changes, but just looking at the difference in modules linked between “good” and “bad” releases, this catches the eye (see go.mod):
golang.org/x/net, which provides socket support among other networky things, was updated from v0.9.0 to v0.10.0.
github.com/quic-go/quic-go, which, well, does the obviously relevant to the present discussion thing, was updated from v0.32.0 to v0.37.4!
Related, github.com/quic-go/qtls-go1-20 was updated from v0.1.0 to v0.3.1!!
That’s quite a massive upgrade, and I bet the culprit is in there.
Were there any reasons for such a drastic update of these crucial dependencies? I would roll everything back and only upgrade things that need to be upgraded for the reasons well understood, not just because some fella released a new update.
I don’t know if they should roll back those things, because this update made my download success rates on updated nodes hit an all-time high. They hit 97% DSR, up from 75%. Maybe I’m mistaken somehow, or my setups have some hiccups, and the time frame of verification is too small, but for now I’m pretty happy with how things look.
Hence, carefully upgrade only what is needed, so as not to introduce new bugs. On the other hand, I doubt the success rate has anything to do with the backend libraries. You can revert to the previous version to see if the success rate follows.
But you said that you tried a previous version and this did not change anything?
I agree; however, I do not think that we have extensive tests for FreeBSD specifically (even Windows and macOS are not so well covered), especially for cases with multiple interfaces…
I’ll redo the experiments, maybe I’ve screwed up that one. Now that it’s not just me seeing this, and in both cases after the upgrade to 1.86.1, it looks suspicious.
@Alexey, yep, I’m the doofus: when I downgraded the node, my auto-updater happened to kick in shortly after and updated it back to the current version, and I did not notice.
Now I confirm it’s definitely a regression in 1.86.1.
Here is what I did:
removed the IP in config, and downgraded to 1.84.2:
I think this is a fairly conclusive confirmation that the regression is in 1.86.1
I could regress individual changes, but this will require more time, and I’m a bit rusty on the Go build system, assuming it’s possible to build a storagenode from scratch.
I suggest looking for the phrase "Your node is still considered to be online but encountered an error" in your logs. If that line does not appear in the logs since you started running with 1.90.2, then you may be seeing a dashboard bug, as described in the thread "Not so... QUIC... Correctly "Misconfigured"".
If you do find that phrase, it should have an error message attached describing why the server failed to contact your node with QUIC.
2023-10-23T14:44:38-07:00 WARN contact:service Your node is still considered to be online but encountered an error. {"process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Error": "contact: failed to ping storage node using QUIC, your node indicated error code: 0, rpc: quic: timeout: no recent network activity"}
2023-10-23T14:44:38-07:00 WARN contact:service Your node is still considered to be online but encountered an error. {"process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Error": "contact: failed to ping storage node using QUIC, your node indicated error code: 0, rpc: quic: timeout: no recent network activity"}
This indicates that the server was able to connect to your node with QUIC, but then had a timeout while trying to call a “ping” RPC on the node. So my guess was wrong, this probably isn’t related to the potential dashboard bug.
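If grepping by hand gets tedious, the search can be scripted. Below is a rough sketch in Go that picks out the warning lines and prints the satellite ID and error text from the JSON part of each line. It assumes the log format matches the two lines quoted above (free text followed by a single JSON object); `findContactWarnings` and `contactWarning` are names I made up, and the INFO line in the sample is a hypothetical filler line.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

const quicPhrase = "Your node is still considered to be online but encountered an error"

// contactWarning holds the two fields of interest from a warning line.
type contactWarning struct {
	SatelliteID string
	Error       string
}

// findContactWarnings scans log text and returns the parsed payload of every
// line containing quicPhrase. Lines whose JSON part fails to parse are skipped.
func findContactWarnings(logText string) []contactWarning {
	var found []contactWarning
	sc := bufio.NewScanner(strings.NewReader(logText))
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, quicPhrase) {
			continue
		}
		start := strings.Index(line, "{") // assumed start of the JSON object
		if start < 0 {
			continue
		}
		var fields map[string]interface{}
		if err := json.Unmarshal([]byte(line[start:]), &fields); err != nil {
			continue
		}
		sat, _ := fields["Satellite ID"].(string)
		msg, _ := fields["Error"].(string)
		found = append(found, contactWarning{SatelliteID: sat, Error: msg})
	}
	return found
}

func main() {
	// First line is copied from the log excerpt in this thread;
	// the second is a hypothetical unrelated line that should be ignored.
	sample := `2023-10-23T14:44:38-07:00 WARN contact:service Your node is still considered to be online but encountered an error. {"process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Error": "contact: failed to ping storage node using QUIC, your node indicated error code: 0, rpc: quic: timeout: no recent network activity"}
2023-10-23T14:44:39-07:00 INFO some other message {"process": "storagenode"}`

	for _, w := range findContactWarnings(sample) {
		fmt.Printf("satellite %s: %s\n", w.SatelliteID, w.Error)
	}
}
```

To run it against a real log, read the file into a string (for example with `os.ReadFile`) and pass that in instead of `sample`. Very long log lines may exceed the default `bufio.Scanner` buffer, so treat this as a starting point.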
Did QUIC start working on your nodes at some point since August? The linked bugs 6216 and 6186 don’t seem to indicate that anything was fixed.