Binding to a specific interface to stabilize QUIC connectivity

TL;DR: if your QUIC connectivity is flaky, open the config file and add an explicit interface address to the server address line, e.g.: server.address: 10.0.17.120:28967

Backstory: I noticed that this morning, at 7:30PM PST according to the logs, all three of my nodes, on three separate machines, in three separate states, connected in two different ways (one directly, and the other two over VPS), lost the ability to see QUIC. It was rather bizarre, and I don’t know why that happened. There is nothing in common between those nodes, except that all of them run in jails on FreeBSD machines.

I suspected a software update: since the updater is broken on FreeBSD, I update the nodes automatically as soon as a new build is suggested. I rolled back a few revisions, but that did not help.

I tried restarting the node, restarting the VPN service, and restarting everything else I could think of, but could not kick it back into operation. This has never happened before; QUIC has been rock solid.

So I went down the rabbit hole of packet sniffers, investigating where the UDP packets go, and it turns out there are no problems delivering them to the node’s jail interfaces. So why does the node claim there is no QUIC?

I then noticed that the node is not listening on some of the available interfaces, and the choice seems random. Even after restarting the node while all interfaces are available, it seems to listen on a random one. The wrong one.
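For anyone who wants to check this without a packet sniffer, here is a minimal sketch of a probe, assuming Python is available inside the same jail as the node. The function name `udp_port_in_use` is made up for illustration and is not part of any storagenode tooling. It tries to bind a UDP socket to one local address and port: if the bind fails with "address already in use", some process already holds that address and port.

```python
import errno
import socket


def udp_port_in_use(address: str, port: int) -> bool:
    """Return True if a UDP socket is already bound to address:port.

    Hypothetical helper for illustration only. It must run on the same
    host (or in the same jail) as the process being checked.
    """
    probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # No SO_REUSEADDR/SO_REUSEPORT here, so the bind fails if the
        # address and port are already taken.
        probe.bind((address, port))
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return True
        raise  # e.g. the address is not configured on this host
    else:
        return False
    finally:
        probe.close()
```

Calling `udp_port_in_use("10.0.17.120", 28967)` for each address the jail has would show which ones are taken. One caveat: if the node is bound to the wildcard address, every local address will typically report as in use, so the probe only tells you something when the listener ended up on one specific address.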

So I went and explicitly set the interface IP in that setting: on the first node to its LAN IP, and on the other two to their respective WireGuard endpoint addresses.

Rebooted each node – bam, QUIC is connected.

There seems to be some flakiness in how the detection is handled, but explicitly specifying the interface should not hurt. All those who experience intermittent QUIC failures: try it.


Interesting. The default bind address is 0.0.0.0, i.e. listen on all interfaces that are available when the node starts.
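To illustrate the difference between the two bind modes, here is a small sketch using plain UDP sockets over loopback. It is not storagenode code, and `echo_once` is a hypothetical name. A socket bound to a specific address always sends from that address. A socket bound to 0.0.0.0 has no fixed source address, so on a host with several interfaces the kernel picks the source for each reply from its routing table, unless the application sets it per packet. That is one general mechanism by which replies can leave from an unexpected address; nothing in this thread confirms it is what happened here.

```python
import socket


def echo_once(bind_host: str) -> tuple:
    """Bind a UDP echo server to bind_host, bounce one datagram off it
    over loopback, and return (server bound address, reply source address).

    Hypothetical helper for illustration only.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        server.settimeout(2.0)
        client.settimeout(2.0)
        server.bind((bind_host, 0))  # port 0: let the OS pick a free port
        port = server.getsockname()[1]

        client.sendto(b"ping", ("127.0.0.1", port))
        data, peer = server.recvfrom(1024)
        server.sendto(data, peer)  # echo back to the sender

        _, reply_source = client.recvfrom(1024)
        return server.getsockname()[0], reply_source[0]
    finally:
        server.close()
        client.close()


if __name__ == "__main__":
    # Specific bind: the socket's address is fixed.
    print(echo_once("127.0.0.1"))
    # Wildcard bind: the socket reports 0.0.0.0; the reply source is
    # whatever the kernel chose for this route.
    print(echo_once("0.0.0.0"))
```

Over loopback both cases work, because there is only one sensible source address. The difference only matters on a host with several addresses, such as a LAN interface plus a WireGuard endpoint.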

Right. The problem could be somewhere outside of node software — maybe in the network frameworks it relies on, or somewhere in FreeBSD’s VNET driver.

This, by the way, also resolved intermittent connectivity issues I had when trying to reach the dashboard of one of the nodes remotely over VPN: it would spin indefinitely. It felt like reply packets were going out the wrong interface.

Now it connects every time. The dashboard, by the way, is still listening on 0.0.0.0, so stabilizing only the storagenode endpoint fixed the network connectivity flakiness for both use cases.

I can’t explain why.


This appears to be a regression in 1.86.1. See this comment for proof:

and this comment for potential culprit:
