This is a really good question.
So, if the router the storage node is behind doesn’t support TCP_FASTOPEN, my understanding of the problem (though this would be good to test more rigorously against a variety of nonconforming routers) is that the router will simply drop the TCP_FASTOPEN connection-establishment packets, which is pretty much the same behavior as a packet hitting a firewall. So, as far as the client is concerned, it will appear to be talking to a peer that never responds.
If a storage node is behind such a router and some percentage of clients on the network attempt TCP_FASTOPEN, then when the storage node flags support for TCP_FASTOPEN, I imagine the rate of new connections the node receives would drop by roughly that same percentage, because the router is just dropping the connection-establishment packets. Unfortunately, the clients’ operating systems keep a memory of which peers support TCP_FASTOPEN, so disabling TCP_FASTOPEN wouldn’t necessarily fix the problem for the node immediately. The node would have to wait until the clients’ cached TCP_FASTOPEN state aged out (which does happen over time).
If the node’s network topology is friendly to TCP_FASTOPEN, then upon enabling it, the node would suddenly start seeing some percentage of successful TCP_FASTOPEN requests.
The hard case is when the node’s network topology has partial support (e.g., some routes support it well and some don’t). I don’t know how to even detect this case. If there is a slight drop in new connections after enabling, the node operator may assume that TCP_FASTOPEN was stalling on some routes, but it may also just be a natural lull in load. What’s more, the node operator might prefer to leave TCP_FASTOPEN on anyway because, of the connections the node is receiving, it is winning more of the races.
So, given all of this, I’m not sure what would be on the dashboard. A graph of the rate of new non-TCP_FASTOPEN connections and a graph of new TCP_FASTOPEN connections? I actually haven’t checked if these values are things an unprivileged storage node can ask the kernel for, so I’m not even sure if this dashboard is possible on an unprivileged node.
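For what it’s worth, on Linux the global TCP_FASTOPEN counters do live in `/proc/net/netstat`, which is world-readable, so an unprivileged process can at least poll those (they’re namespace-wide, not per-listener, so they’d lump all sockets together). A sketch of pulling them out — assuming a Linux node, with field names that vary by kernel version:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseTCPExt extracts the TcpExt counters from /proc/net/netstat
// content. The file alternates header lines ("TcpExt: Name1 Name2 ...")
// with value lines ("TcpExt: 1 2 ..."), so names and values are zipped
// together pairwise.
func parseTCPExt(content string) map[string]uint64 {
	counters := map[string]uint64{}
	lines := strings.Split(content, "\n")
	for i := 0; i+1 < len(lines); i += 2 {
		if !strings.HasPrefix(lines[i], "TcpExt:") {
			continue
		}
		names := strings.Fields(lines[i])[1:]
		values := strings.Fields(lines[i+1])[1:]
		for j, name := range names {
			if j < len(values) {
				v, _ := strconv.ParseUint(values[j], 10, 64)
				counters[name] = v
			}
		}
	}
	return counters
}

func main() {
	data, err := os.ReadFile("/proc/net/netstat")
	if err != nil {
		fmt.Println("netstat counters unavailable:", err)
		return
	}
	c := parseTCPExt(string(data))
	// Server-side TFO accepts and failures, cumulative since boot.
	fmt.Println("TCPFastOpenPassive:", c["TCPFastOpenPassive"])
	fmt.Println("TCPFastOpenPassiveFail:", c["TCPFastOpenPassiveFail"])
}
```

Sampling `TCPFastOpenPassive` alongside a count of total accepted connections over time might be enough raw material for the two-graph idea, without any extra privileges.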
QUIC is easy - it worked from the Satellite or it didn’t. I suppose we could extend the TCP_FASTOPEN design to work the same - the Satellite is the first check, and clients do not try TCP_FASTOPEN unless the Satellite had a successful TCP_FASTOPEN?
Open problems with that idea I need to think through:
- Maybe TCP_FASTOPEN works for most Uplinks even if it didn’t work from the Satellite! Maybe that’s enough for the SNO to win more races and earn more.
- This would mean the Satellite would have to dial twice (the first dial establishes to the OS that TCP_FASTOPEN works between the peers, and the second dial would confirm that it worked), so this is more resource usage for checkins.
- This certainly expands the scope of work for this task quite a bit (the Satellite checking, the database keeping track of which nodes the Satellite was successful with, the dashboard feature of showing whether the Satellite was successful). On the other hand, the graph-based approach is also quite a lift for the dashboard.
I feel like the graph-based approach (or even just a counter-based approach tracking the number of successful connections) leaves more control with the SNO than having the Satellite keep track of what worked, but yeah, I’m definitely open to feedback about what folks prefer.