QUIC misconfiguration in v1.86.1

Of course. If it was broken for everyone it would not have passed QA.

That was the case for me too. But when I rolled back to 85 the issue persisted.

I’ll test with 85 again. Maybe there was some backoff mechanism after multiple failures, so it would not even try with the reverted node.

I’ll report here.

I did not analyze all code changes, but just looking at the difference in modules linked between “good” and “bad” releases, this catches the eye (see go.mod):

  1. golang.org/x/net, which provides socket support among other networking things, was updated from v0.9.0 to v0.10.0
  2. github.com/quic-go/quic-go, which is obviously relevant to the present discussion, was updated from v0.32.0 to v0.37.4!
  3. Related to that, github.com/quic-go/qtls-go1-20 was updated from v0.1.0 to v0.3.1!!

That’s quite a massive upgrade, and I bet the culprit is in there.
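For anyone who wants to repeat that comparison, here is a minimal sketch that lists the modules whose versions differ between two go.mod files. The two excerpts are illustrative only: they contain just the three versions quoted above plus a made-up `example.com/unchanged` entry, and they are not the full go.mod of either release. The helper names are my own.

```python
import re

# Matches a dependency line such as "github.com/quic-go/quic-go v0.37.4",
# with or without a leading "require" keyword. Anything after the version
# (for example "// indirect") is ignored.
REQUIRE_RE = re.compile(r"^\s*(?:require\s+)?(\S+)\s+(v\S+)")


def parse_requirements(go_mod_text):
    """Return {module_path: version} for the dependency lines in go.mod text."""
    versions = {}
    for line in go_mod_text.splitlines():
        match = REQUIRE_RE.match(line)
        if match:
            versions[match.group(1)] = match.group(2)
    return versions


def changed_modules(old_text, new_text):
    """Return {module_path: (old_version, new_version)} for modules that changed."""
    old = parse_requirements(old_text)
    new = parse_requirements(new_text)
    return {
        module: (old[module], new[module])
        for module in sorted(old.keys() & new.keys())
        if old[module] != new[module]
    }


# Illustrative excerpts, not the real files. "example.com/unchanged" is a
# hypothetical module added to show that unchanged entries are skipped.
OLD_GO_MOD = """
require (
    example.com/unchanged v1.0.0
    github.com/quic-go/qtls-go1-20 v0.1.0
    github.com/quic-go/quic-go v0.32.0
    golang.org/x/net v0.9.0
)
"""

NEW_GO_MOD = """
require (
    example.com/unchanged v1.0.0
    github.com/quic-go/qtls-go1-20 v0.3.1
    github.com/quic-go/quic-go v0.37.4
    golang.org/x/net v0.10.0
)
"""

for module, (old, new) in changed_modules(OLD_GO_MOD, NEW_GO_MOD).items():
    print(f"{module}: {old} -> {new}")
```

In practice you would read the two go.mod files from the release tags you are comparing instead of using inline strings.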

Were there any reasons for such a drastic update of these crucial dependencies? I would roll everything back and only upgrade things that need to be upgraded for well-understood reasons, not just because some fella released a new update.

I don’t know if they should roll back those things, because this update made the download success rate (DSR) on my updated nodes hit an all-time high: 97%, up from 75%. Maybe I’m mistaken somehow, or my setups have some hiccups, and the time frame of the verification is too small, but for now I’m pretty happy with how things look.

Hence, carefully upgrade only what is needed, so as not to introduce new bugs. On the other hand, I doubt the success rate has anything to do with the backend libraries. You can revert to the previous version to see if the success rate follows.

But you said that you tried a previous version and this did not change anything?

I agree. However, I do not think that we have extensive tests for FreeBSD specifically (even Windows and macOS are not covered that well), especially for cases with multiple interfaces…

I’ll redo the experiments; maybe I’ve screwed up that one. Now that it’s not just me seeing this, and in both cases it happened after the upgrade to 1.86.1, it looks suspicious.

@Alexey, yep, I’m the doofus: when I downgraded the node, my auto-updater happened to kick in shortly after and updated it back to the current version, and I did not notice.

Now I confirm it’s definitely a regression in 1.86.1.

Here is what I did:

  1. Removed the IP from the config and downgraded to 1.84.2:

  2. Then I upgraded to 1.86.1

  3. Then back to 1.84.2 – OK, back to 1.86.1 – Misconfigured (did not save screenshot)

  4. Then to 1.85.1 – OK

  5. Then back to 1.86.1 – Misconfigured

  6. Then added IP back:

I think this is a fairly conclusive confirmation that the regression is in 1.86.1
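The reasoning behind that conclusion can be written down as a small check. This is only a sketch: the observations are the results stated explicitly in steps 3 to 5 above (the steps whose result was only in a screenshot are left out), and the function names are my own.

```python
def version_key(version):
    """Turn "1.86.1" into (1, 86, 1) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def first_bad_version(observations):
    """Return the lowest version that tested bad, or None if no regression shows.

    observations is a list of (version, quic_ok) pairs and may contain repeats.
    Raises ValueError if a version gave mixed results, or if a newer version is
    good after an older one was bad.
    """
    results = {}
    for version, ok in observations:
        results.setdefault(version, set()).add(ok)

    inconsistent = sorted(v for v, seen in results.items() if len(seen) > 1)
    if inconsistent:
        raise ValueError(f"inconsistent results for: {inconsistent}")

    ordered = sorted(results, key=version_key)
    status = [next(iter(results[v])) for v in ordered]
    if True not in status or False not in status:
        return None  # need at least one good and one bad version

    first_bad = status.index(False)
    if any(status[first_bad:]):
        raise ValueError("a newer version is good after an older one was bad")
    return ordered[first_bad]


# Results as stated in the steps above: True means QUIC was OK,
# False means the dashboard showed Misconfigured.
OBSERVATIONS = [
    ("1.84.2", True),
    ("1.86.1", False),
    ("1.85.1", True),
    ("1.86.1", False),
]

print(first_bad_version(OBSERVATIONS))
```

The repeated 1.86.1 entry matters: a flaky failure would show up as mixed results for one version and raise an error instead of pointing at a release.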

I could regress individual changes, but this will require more time, and I am a bit rusty on the Go build system. That is assuming it’s possible to build a storagenode from scratch.

I’ve filed a bug for this: 1.86.1: storagenode QUIC connectivity regression on FreeBSD · Issue #6216 · storj/storj · GitHub

Wow, thorough testing. Thanks.

My nodes updated to 1.90.2 two hours ago and all of them now show QUIC as misconfigured. I won’t be able to look into regressing this again until evening.

@Alexey, does Storj test on FreeBSD as part of QA? This is so reliably broken. How was it allowed to make it to the public?

I suggest looking for the phrase "Your node is still considered to be online but encountered an error" in your logs. If that line does not appear in the logs since you started running 1.90.2, then you may be seeing a dashboard bug, as corroborated by the thread Not so... QUIC... Correctly "Misconfigured".

If you do find that phrase, it should have an error message attached describing why the server failed to contact your node with QUIC.
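If it helps, here is a rough sketch for pulling those messages out of a log. It assumes each matching line ends with a JSON object holding "Satellite ID" and "Error" keys, as in the sample line used below (taken from this thread). The tab separators in the sample are an assumption about the raw log, and the code does not depend on them. The function name is my own.

```python
import json

PHRASE = "Your node is still considered to be online but encountered an error"


def quic_ping_errors(log_lines):
    """Yield (timestamp, satellite_id, error) for each matching log line."""
    for line in log_lines:
        if PHRASE not in line:
            continue
        brace = line.find("{")
        if brace == -1:
            continue  # phrase present but no structured fields attached
        try:
            fields = json.loads(line[brace:])
        except json.JSONDecodeError:
            continue  # truncated or otherwise unparsable line
        timestamp = line.split(None, 1)[0]
        yield timestamp, fields.get("Satellite ID"), fields.get("Error")


# Sample line from this thread; the "\t" separators are assumed.
SAMPLE_LINE = (
    "2023-10-23T14:44:38-07:00\tWARN\tcontact:service\t"
    "Your node is still considered to be online but encountered an error.\t"
    '{"process": "storagenode", '
    '"Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", '
    '"Error": "contact: failed to ping storage node using QUIC, your node '
    'indicated error code: 0, rpc: quic: timeout: no recent network activity"}'
)

for timestamp, satellite, error in quic_ping_errors([SAMPLE_LINE]):
    print(timestamp, satellite, error, sep="\n  ")
```

To run it against a real log, pass an open file object instead of the one-element list; the path to the log depends on your setup.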

2023-10-23T14:44:38-07:00       WARN    contact:service Your node is still considered to be online but encountered an error.    {"process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Error": "contact: failed to ping storage node using QUIC, your node indicated error code: 0, rpc: quic: timeout: no recent network activity"}
2023-10-23T14:44:38-07:00       WARN    contact:service Your node is still considered to be online but encountered an error.    {"process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Error": "contact: failed to ping storage node using QUIC, your node indicated error code: 0, rpc: quic: timeout: no recent network activity"}

NodeID: 12ou4iHaJCon9MFDpNC8ZX3Byz3Rb7UJUiwv3MucHsBKEQCuTQA

This indicates that the server was able to connect to your node with QUIC, but then had a timeout while trying to call a “ping” RPC on the node. So my guess was wrong; this probably isn’t related to the potential dashboard bug.

Did QUIC start working on your nodes at some point since August? The linked bugs 6216 and 6186 don’t seem to indicate that anything was fixed.

Yes, it worked after I specified the server address to include the local adapter IP explicitly, as described above.

# public address to listen on
#server.address: :28967
server.address: 10.0.70.2:28967

It worked up until 2 hours ago, when they updated to 1.90.2.

Now it does not work regardless of whether the address is specified. Apparently, it’s now more “broken” than it was before.

Ok, I see. I’ll poke around and see if there is any progress on those bug reports, which might help.

I have this error in my log too. The system updated overnight to 1.90.2; it was on 1.89.2 yesterday (I update automatically). I checked config.yaml, and my IP and port are still there.
Restarting the FreeBSD jail didn’t help either.
I do think I had ingress after the error first occurred, as I am at 16 GB so far today and the error started at 00:16.
Any thoughts?

My FreeBSD jails have also had this error since they upgraded from 1.89 to 1.90.

There hasn’t been any progress on those QUIC bugs. It’s considered low priority because apparently uplinks quite rarely use QUIC in production. I suppose we’re recommending not to worry about what the dashboard says about your QUIC status for now.

Synology, Docker container: same issue here, and only on the node with the latest version: