QUIC misconfiguration in v1.86.1

etienneb · August 25, 2023, 7:45am

Hi,
I run a Storj node in a FreeBSD jail (inside TrueNAS), and it has been running fine for 4 months.
It updates automatically to new versions.
About 2 days ago it auto-updated to version 1.86.1, and now I get a QUIC misconfiguration error.
The port forwarding and IP have not changed, and they worked fine before.

It did get normal ingress/egress yesterday, so the error doesn’t seem to have affected the node.

Any ideas? Pretty noob on Storj nodes, sorry.

best,
Etienne

Alexey · August 25, 2023, 8:33am

Hello @etienneb ,
Welcome to the forum!

Looks like FreeBSD behavior:

etienneb · August 25, 2023, 9:34am

Thanks @Alexey, that did the trick: I entered my server IP in the config.yaml on the storagenode drive.

Strangely, it worked fine for all versions until 1.86.1. The system auto-updated whenever a new version became available, so 1.85 was fine too.

snorkel · August 25, 2023, 11:52am

So it’s just a FreeBSD thing?

arrogantrabbit · August 25, 2023, 3:13pm

We don’t know that. It did start happening with the 1.86.1 update, on all three of my unrelated machines. Whether this is a storagenode bug (including in the libraries) or simply a change that surfaced a bug somewhere else needs determining.

All other services work fine, so it’s related to storagenode. This does not mean it’s a storagenode bug. But it doesn’t mean the opposite either. We just don’t know; more investigation is needed. I would start by comparing go.mod before and after, in case some third-party networking lib was updated and introduced this behavior. I doubt the issue is in the storagenode source itself.

snorkel · August 25, 2023, 4:06pm

I have one Synology that updated, and it is working fine.

arrogantrabbit · August 25, 2023, 4:21pm

Of course. If it was broken for everyone it would not have passed QA.

arrogantrabbit · August 25, 2023, 4:26pm

That was the case for me too. But when I rolled back to 85 the issue persisted.

I’ll test with 85 again. Maybe there was some back-off mechanism after multiple failures, so it would not even try with the reverted node.

I’ll report here.

arrogantrabbit · August 26, 2023, 6:38am

I did not analyze all the code changes, but just looking at the difference in modules linked between the “good” and “bad” releases, this catches the eye (see go.mod):

golang.org/x/net, which provides socket support among other networking things, was updated from v0.9.0 to v0.10.0
github.com/quic-go/quic-go, which is obviously relevant to the present discussion, was updated from v0.32.0 to v0.37.4!
Related, github.com/quic-go/qtls-go1-20 was updated from v0.1.0 to v0.3.1!!

That’s quite a massive upgrade, and I bet the culprit is in there.

Were there any reasons for such a drastic update of these crucial dependencies? I would roll everything back and only upgrade things that need to be upgraded for well-understood reasons, not just because some fella released a new update.
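For anyone who wants to repeat this comparison, here is a minimal sketch of diffing the `require` entries of two go.mod files. The go.mod fragments embedded below are illustrative only: the three version bumps are the ones quoted in this post, while the module name, the `example.com/unchanged` entry, and the file layout are made up for the example. The helper names `parse_requires` and `changed_modules` are likewise hypothetical.

```python
import re

def parse_requires(gomod_text):
    """Return {module_path: version} for every require entry in go.mod text."""
    requires = {}
    in_block = False
    for raw in gomod_text.splitlines():
        # Drop trailing comments such as "// indirect".
        line = raw.split("//", 1)[0].strip()
        if not line:
            continue
        if in_block:
            if line == ")":
                in_block = False
                continue
            parts = line.split()
            if len(parts) >= 2:
                requires[parts[0]] = parts[1]
        elif re.fullmatch(r"require\s*\(", line):
            in_block = True
        elif line.startswith("require "):
            parts = line.split()
            if len(parts) >= 3:
                requires[parts[1]] = parts[2]
    return requires

def changed_modules(old_text, new_text):
    """Return {module: (old_version, new_version)} for entries that differ."""
    old, new = parse_requires(old_text), parse_requires(new_text)
    return {
        mod: (old.get(mod), new.get(mod))
        for mod in sorted(set(old) | set(new))
        if old.get(mod) != new.get(mod)
    }

# Illustrative fragments, NOT the real storj go.mod files.
OLD_GOMOD = """
module example.com/illustrative

go 1.20

require (
    golang.org/x/net v0.9.0
    github.com/quic-go/quic-go v0.32.0
    example.com/unchanged v1.0.0 // indirect
)

require github.com/quic-go/qtls-go1-20 v0.1.0
"""

NEW_GOMOD = """
module example.com/illustrative

go 1.20

require (
    golang.org/x/net v0.10.0
    github.com/quic-go/quic-go v0.37.4
    example.com/unchanged v1.0.0 // indirect
)

require github.com/quic-go/qtls-go1-20 v0.3.1
"""

for module, (before, after) in changed_modules(OLD_GOMOD, NEW_GOMOD).items():
    print(f"{module}: {before} -> {after}")
```

In practice you would read the two go.mod files from the release tags you are comparing instead of embedding them as strings. The sketch ignores `replace` and `exclude` directives, which can also change what actually gets linked.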

snorkel · August 26, 2023, 6:58am

I don’t know if they should roll back those things, because this update made my download success rates on updated nodes hit an ATH. They hit 97% DSR, up from 75%. Maybe I’m mistaken somehow, or my setups have some hiccups, and the time frame of verification is too small, but for now I’m pretty happy with how things look.

arrogantrabbit · August 26, 2023, 7:21am

Hence: carefully upgrade only what is needed, so as not to introduce new bugs. On the other hand, I doubt the success rate has anything to do with the backend libraries. You can revert to the previous version to see if the success rate follows.

Alexey · August 26, 2023, 9:08am

But you said that you tried a previous version and this did not change anything?

I agree; however, I do not think that we have extensive tests for FreeBSD specifically (even Windows and macOS are not that well covered), especially for cases with multiple interfaces…

arrogantrabbit · August 26, 2023, 5:23pm

I’ll redo the experiments; maybe I screwed that one up. Now that it’s not just me seeing this, and in both cases it appeared after the upgrade to 1.86.1, it looks suspicious.

arrogantrabbit · August 26, 2023, 8:13pm

@Alexey, yep, I’m the doofus: when I downgraded the node, my auto-updater happened to kick in shortly after and updated it back to the current version, and I did not notice.

Now I confirm it’s definitely a regression in 1.86.1.

Here is what I did:

Removed the IP from the config and downgraded to 1.84.2:

[Screenshot 2023-08-26 at 1.07.38 PM]

Then I upgraded to 1.86.1:

[Screenshot 2023-08-26 at 1.09.38 PM]

Then back to 1.84.2: OK; back to 1.86.1: Misconfigured (did not save a screenshot)

Then to 1.85.1: OK

[Screenshot 2023-08-26 at 1.16.26 PM]

Then back to 1.86.1: Misconfigured

[Screenshot 2023-08-26 at 1.16.51 PM]

Then added the IP back:

[Screenshot 2023-08-26 at 1.10.28 PM]

I think this is a fairly conclusive confirmation that the regression is in 1.86.1

I could regress individual changes, but this will require more time, and I’m a bit rusty on the Go build system, assuming it’s possible to build a storagenode from scratch.

arrogantrabbit · August 26, 2023, 8:49pm

I’ve filed a bug for this: 1.86.1: storagenode QUIC connectivity regression on FreeBSD · Issue #6216 · storj/storj · GitHub

etienneb · August 26, 2023, 10:48pm

Wow, thorough testing. Thanks.

arrogantrabbit · October 23, 2023, 9:27pm

My nodes updated to 1.90.2 two hours ago and all of them are now QUIC misconfigured. I won’t be able to look into regressing this again until evening.

@Alexey, does Storj test on FreeBSD as part of QA? This is so reliably broken; how come it was allowed to make it to the public?

thepaul · October 23, 2023, 9:44pm

I suggest looking for the phrase "Your node is still considered to be online but encountered an error" in your logs. If that line does not appear in the logs since you started running with 1.90.2, then you may be seeing a dashboard bug, as corroborated by the thread titled: Not so... QUIC... Correctly "Misconfigured".

If you do find that phrase, it should have an error message attached describing why the server failed to contact your node with QUIC.
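A rough sketch of that check, for anyone who prefers a script over eyeballing the log: it keeps only the lines that contain the phrase and whose leading timestamp is at or after the moment the new version started. It assumes each log line begins with an ISO 8601 timestamp that includes a UTC offset; the two sample lines are the ones posted in this thread, and the restart times are made-up placeholders. `warnings_since` is a hypothetical helper name, not part of any Storj tooling.

```python
from datetime import datetime

PHRASE = "Your node is still considered to be online but encountered an error"

def parse_ts(token):
    """Parse an ISO 8601 timestamp; a trailing 'Z' is treated as UTC."""
    return datetime.fromisoformat(token.replace("Z", "+00:00"))

def warnings_since(lines, since):
    """Return the log lines containing PHRASE with a timestamp >= since."""
    hits = []
    for line in lines:
        if PHRASE not in line:
            continue
        try:
            # The first whitespace-separated token is assumed to be the timestamp.
            if parse_ts(line.split(None, 1)[0]) >= since:
                hits.append(line.rstrip("\n"))
        except (ValueError, TypeError):
            # No parsable timestamp, or it lacks a UTC offset: skip the line.
            continue
    return hits

# Sample lines copied from this thread.
LOG_LINES = [
    '2023-10-23T14:44:38-07:00       WARN    contact:service Your node is still considered to be online but encountered an error.    {"process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Error": "contact: failed to ping storage node using QUIC, your node indicated error code: 0, rpc: quic: timeout: no recent network activity"}',
    '2023-10-23T14:44:38-07:00       WARN    contact:service Your node is still considered to be online but encountered an error.    {"process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Error": "contact: failed to ping storage node using QUIC, your node indicated error code: 0, rpc: quic: timeout: no recent network activity"}',
]

# Hypothetical restart time; replace with when your node actually came up on 1.90.2.
restarted_at = datetime.fromisoformat("2023-10-23T14:00:00-07:00")
hits = warnings_since(LOG_LINES, restarted_at)
print(f"{len(hits)} matching warning(s) since restart")
```

To run it against a real log, pass an open file object instead of `LOG_LINES`. An empty result would point toward the dashboard-only explanation; any hits mean the satellite really did fail to reach the node over QUIC.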

arrogantrabbit · October 23, 2023, 9:47pm

2023-10-23T14:44:38-07:00       WARN    contact:service Your node is still considered to be online but encountered an error.    {"process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Error": "contact: failed to ping storage node using QUIC, your node indicated error code: 0, rpc: quic: timeout: no recent network activity"}
2023-10-23T14:44:38-07:00       WARN    contact:service Your node is still considered to be online but encountered an error.    {"process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Error": "contact: failed to ping storage node using QUIC, your node indicated error code: 0, rpc: quic: timeout: no recent network activity"}

NodeID: 12ou4iHaJCon9MFDpNC8ZX3Byz3Rb7UJUiwv3MucHsBKEQCuTQA
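
The useful part of each warning is the JSON object at the end of the line. A small sketch for pulling out the "Satellite ID" and "Error" fields and counting the errors per satellite, assuming the log keeps the layout shown above (free text followed by a single JSON object). The function names are hypothetical; the sample lines are the two posted here.

```python
import json
from collections import Counter

def payload(line):
    """Return the trailing JSON object of a log line as a dict, or {} if absent."""
    start = line.find("{")
    if start == -1:
        return {}
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return {}

def errors_by_satellite(lines):
    """Map each Satellite ID to a Counter of the error strings logged for it."""
    result = {}
    for line in lines:
        fields = payload(line)
        satellite, error = fields.get("Satellite ID"), fields.get("Error")
        if satellite and error:
            result.setdefault(satellite, Counter())[error] += 1
    return result

# Sample lines copied from this thread.
LOG_LINES = [
    '2023-10-23T14:44:38-07:00       WARN    contact:service Your node is still considered to be online but encountered an error.    {"process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Error": "contact: failed to ping storage node using QUIC, your node indicated error code: 0, rpc: quic: timeout: no recent network activity"}',
    '2023-10-23T14:44:38-07:00       WARN    contact:service Your node is still considered to be online but encountered an error.    {"process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Error": "contact: failed to ping storage node using QUIC, your node indicated error code: 0, rpc: quic: timeout: no recent network activity"}',
]

for satellite, counts in errors_by_satellite(LOG_LINES).items():
    for error, n in counts.items():
        print(f"{satellite}: {n}x {error}")
```

Grouping by satellite makes it easier to see whether every satellite reports the same failure or only some of them do.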

thepaul · October 23, 2023, 10:03pm

This indicates that the server was able to connect to your node with QUIC, but then had a timeout while trying to call a “ping” RPC on the node. So my guess was wrong, this probably isn’t related to the potential dashboard bug.

Did QUIC start working on your nodes at some point since August? The linked bugs 6216 and 6186 don’t seem to indicate that anything was fixed.