QUIC misconfiguration in v1.86.1

Hi,
I run a Storj node in a FreeBSD jail (inside TrueNAS), and it has been running fine for 4 months.
It automatically updates to new versions.
About 2 days ago it auto-updated to version 1.86.1,
and now I get a QUIC misconfiguration error.
The port forwarding and IP have not changed, and both worked fine before.

It did get normal ingress/egress yesterday, so the error doesn’t seem to have affected the node.

Any ideas? I'm pretty new to Storj nodes, sorry.

best,
Etienne

Hello @etienneb ,
Welcome to the forum!

Looks like FreeBSD behavior:

Thanks @Alexey, that did the trick:
I entered my server IP in the config.yaml on the storagenode drive.

Strangely, it worked fine for all versions until 1.86.1. The system auto-updated whenever a new version became available, so 1.85 was fine too.

So it’s just a FreeBSD thing?

We don’t know that. It did start happening with the 1.86.1 update, on all three of my unrelated machines. Whether this is a storagenode bug (including in its libraries) or simply a change that surfaced a bug somewhere else still needs to be determined.

All other services work fine, so it’s related to storagenode. This does not mean it’s a storagenode bug, but it doesn’t mean the opposite either. We just don’t know; more investigation is needed. I would start by comparing go.mod before and after, in case some third-party networking library was updated and introduced this behavior. I doubt the issue is in the storagenode source itself.

I have one Synology that updated and is working fine.

Of course. If it was broken for everyone it would not have passed QA.

That was the case for me too. But when I rolled back to 85 the issue persisted.

I’ll test with 1.85 again. Maybe there was some backoff mechanism after multiple failures, so it would not even try with the reverted node.

I’ll report here.

I did not analyze all code changes, but just looking at the difference in modules linked between “good” and “bad” releases, this catches the eye (see go.mod):

  1. golang.org/x/net, which provides socket support among other networking things, was updated from v0.9.0 to v0.10.0
  2. github.com/quic-go/quic-go, which is obviously relevant to the present discussion, was updated from v0.32.0 to v0.37.4!
  3. Related to that, github.com/quic-go/qtls-go1-20 was updated from v0.1.0 to v0.3.1!!

That’s quite a massive upgrade, and I bet the culprit is in there.
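
The go.mod comparison described above can be sketched in a few lines of Python. This is only an illustration, not a tool used in the thread: the function names are made up, the two go.mod fragments are abbreviated stand-ins that contain just the three versions quoted above plus one hypothetical unchanged module, and `replace` directives are ignored.

```python
def parse_requirements(go_mod_text):
    """Return {module_path: version} for every require entry in go.mod text."""
    requirements = {}
    in_block = False
    for raw in go_mod_text.splitlines():
        # Drop trailing comments such as "// indirect".
        line = raw.split("//", 1)[0].strip()
        if not line:
            continue
        if in_block:
            if line == ")":
                in_block = False
                continue
            parts = line.split()
            if len(parts) >= 2:
                requirements[parts[0]] = parts[1]
        elif line.startswith("require ("):
            in_block = True
        elif line.startswith("require "):
            # Single-line form: require example.com/mod v1.2.3
            parts = line.split()
            if len(parts) >= 3:
                requirements[parts[1]] = parts[2]
    return requirements


def changed_modules(old_text, new_text):
    """Return {module: (old_version, new_version)} for modules whose version differs."""
    old = parse_requirements(old_text)
    new = parse_requirements(new_text)
    return {
        module: (old[module], new[module])
        for module in sorted(old.keys() & new.keys())
        if old[module] != new[module]
    }


# Abbreviated, illustrative fragments; a real go.mod has many more entries.
OLD_GO_MOD = """
module storj.io/storj

require (
    golang.org/x/net v0.9.0
    github.com/quic-go/quic-go v0.32.0
    github.com/quic-go/qtls-go1-20 v0.1.0
    example.com/unchanged v1.0.0 // hypothetical module, for illustration only
)
"""

NEW_GO_MOD = """
module storj.io/storj

require (
    golang.org/x/net v0.10.0
    github.com/quic-go/quic-go v0.37.4
    github.com/quic-go/qtls-go1-20 v0.3.1
    example.com/unchanged v1.0.0 // hypothetical module, for illustration only
)
"""

for module, (before, after) in changed_modules(OLD_GO_MOD, NEW_GO_MOD).items():
    print(f"{module}: {before} -> {after}")
```

To run it against real releases, read the go.mod text from the two release tags and pass both strings to `changed_modules`.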

Were there any reasons for such a drastic update of these crucial dependencies? I would roll everything back and only upgrade things that need to be upgraded for well-understood reasons, not just because somebody released a new update.

I don’t know if they should roll back those things, because this update made the download success rate (DSR) on my updated nodes hit an all-time high: 97%, up from 75%. Maybe I’m mistaken somehow, or my setups have some hiccups, and the time frame of the measurements is too small, but for now I’m pretty happy with how things look.

Hence my point: carefully upgrade only what is needed, so as not to introduce new bugs. On the other hand, I doubt the success rate has anything to do with the backend libraries. You can revert to the previous version to see if the success rate follows.

But you said that you tried a previous version and this did not change anything?

I agree; however, I do not think that we have extensive tests for FreeBSD specifically (even Windows and macOS are not that well covered), especially for cases with multiple interfaces…

I’ll redo the experiments; maybe I screwed that one up. Now that it’s not just me seeing this, and in both cases it started after the upgrade to 1.86.1, it looks suspicious.

@Alexey, yep, I’m the doofus: when I downgraded the node, my auto-updater happened to kick in shortly after and updated it back to the current version, and I did not notice.

Now I confirm it’s definitely a regression in 1.86.1.

Here is what I did:

  1. removed the IP in config, and downgraded to 1.84.2:

  2. Then I upgraded to 1.86.1

  3. Then back to 1.84.2 – OK, back to 1.86.1 – Misconfigured (did not save screenshot)

  4. Then to 1.85.1 – OK

  5. Then back to 1.86.1 – Misconfigured

  6. Then added IP back:

I think this is a fairly conclusive confirmation that the regression is in 1.86.1

I could regress individual changes, but this will require more time, and I'm a bit rusty on the Go build system, assuming it’s possible to build a storagenode from scratch.

I’ve filed a bug for this: "1.86.1: storagenode QUIC connectivity regression on FreeBSD" (Issue #6216, storj/storj on GitHub)

Wow, thorough testing. Thanks.

My nodes updated to 1.90.2 two hours ago, and all of them now show QUIC as misconfigured. I won’t be able to look into regressing this again until this evening.

@Alexey, does Storj test on FreeBSD as part of QA? This is so reliably broken that I wonder how it was allowed to make it to the public.

I suggest looking for the phrase "Your node is still considered to be online but encountered an error" in your logs. If that line does not appear in the logs since you started running with 1.90.2, then you may be seeing a dashboard bug, as also reported in the thread "Not so... QUIC... Correctly "Misconfigured"".

If you do find that phrase, it should have an error message attached describing why the server failed to contact your node with QUIC.
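
A minimal sketch of that search in Python, assuming the WARN lines end with a JSON object containing "Satellite ID" and "Error" fields, as in the storagenode log lines posted in this thread. The function name and the `node.log` path in the usage note are hypothetical.

```python
import json

PHRASE = "Your node is still considered to be online but encountered an error"


def quic_ping_errors(log_lines):
    """Yield (timestamp, satellite_id, error) for each log line containing PHRASE."""
    for line in log_lines:
        if PHRASE not in line:
            continue
        # The first whitespace-separated token is the timestamp.
        timestamp = line.split(None, 1)[0]
        # The structured fields are assumed to be a JSON object at the end of the line.
        start = line.find("{")
        if start == -1:
            continue
        try:
            fields = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        yield timestamp, fields.get("Satellite ID"), fields.get("Error")


# Sample line in the format of the WARN entries quoted in this thread.
SAMPLE_LINES = [
    '2023-10-23T14:44:38-07:00 WARN contact:service '
    'Your node is still considered to be online but encountered an error. '
    '{"process": "storagenode", '
    '"Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", '
    '"Error": "contact: failed to ping storage node using QUIC, your node indicated '
    'error code: 0, rpc: quic: timeout: no recent network activity"}',
    "2023-10-23T14:45:00-07:00 INFO some unrelated line",
]

for timestamp, satellite, error in quic_ping_errors(SAMPLE_LINES):
    print(timestamp, satellite, error, sep="\n  ")
```

For a real log, open the file and pass the file object instead of the sample list, for example `with open("node.log") as f: list(quic_ping_errors(f))`. An empty result since the upgrade would point toward the dashboard explanation rather than a real QUIC failure.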

2023-10-23T14:44:38-07:00       WARN    contact:service Your node is still considered to be online but encountered an error.    {"process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Error": "contact: failed to ping storage node using QUIC, your node indicated error code: 0, rpc: quic: timeout: no recent network activity"}
2023-10-23T14:44:38-07:00       WARN    contact:service Your node is still considered to be online but encountered an error.    {"process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Error": "contact: failed to ping storage node using QUIC, your node indicated error code: 0, rpc: quic: timeout: no recent network activity"}

NodeID: 12ou4iHaJCon9MFDpNC8ZX3Byz3Rb7UJUiwv3MucHsBKEQCuTQA

This indicates that the server was able to connect to your node with QUIC, but then hit a timeout while trying to call a “ping” RPC on the node. So my guess was wrong; this probably isn’t related to the potential dashboard bug.

Did QUIC start working on your nodes at some point since August? The linked bugs 6216 and 6186 don’t seem to indicate that anything was fixed.
