Two new blueprints/design drafts seeking feedback: Replacing TLS with Noise and TCP_FASTOPEN

Well, running it locally is enough to tell you whether your operating system is set up, but you’re right: if you want a good hint that your network topology supports TCP_FASTOPEN, you’d want to run this Python tool in server mode with port forwarding set up in your network, then run the client somewhere far outside your network. I imagine that would look like:

./fastopen.py server :5996

locally, with port forwarding 5996, then somewhere else (a rented VPS or something):

./fastopen.py client yournetwork.yourdns.example.com:5996

and then watching

netstat -s | grep TCPFastOpen

on your server.
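If you would rather read the counters from a script than grep them by eye, here is a minimal sketch. It assumes a Linux host and reads /proc/net/netstat, where the kernel publishes the TcpExt counters as pairs of header and value lines. The function names are made up for this example and are not part of fastopen.py.

```python
def parse_fastopen_counters(netstat_text):
    """Extract TCPFastOpen* counters from /proc/net/netstat-style text."""
    counters = {}
    lines = [ln for ln in netstat_text.splitlines() if ln.strip()]
    # The file is a sequence of line pairs: a line of counter names followed
    # by a line of values, both starting with the same prefix (e.g. "TcpExt:").
    for header, values in zip(lines[0::2], lines[1::2]):
        names = header.split()
        nums = values.split()
        if len(names) != len(nums) or names[0] != nums[0]:
            continue  # not a matching header/value pair, skip it
        for name, num in zip(names[1:], nums[1:]):
            if name.startswith("TCPFastOpen"):
                counters[name] = int(num)
    return counters


def read_fastopen_counters(path="/proc/net/netstat"):
    """Read the live counters (Linux only; the path is an assumption)."""
    with open(path) as f:
        return parse_fastopen_counters(f.read())
```

Calling read_fastopen_counters() before and after a client run gives two dictionaries you can compare directly.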

For me this works over WAN with the server behind NAT on RouterOS.
I tried the client from two VPSes: two different locations, same provider, same Debian version, same config. It works from one, but on the other client I get TCPFastOpenActiveFail: 1 and TCPFastOpenBlackhole: 1, with or without a firewall attached to that instance. I ran both as root.
The server was a QEMU VM with a MacVTap network adapter.
I had to run sysctl -w net.ipv4.tcp_fastopen=3 on the server instance; the hypervisor is set to 1 and needed no change.
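For anyone wondering why 3 is needed on the server while 1 is enough elsewhere: net.ipv4.tcp_fastopen is a bitmask, where bit 0x1 enables the client side and bit 0x2 enables the server side. The sketch below decodes only those two bits (the kernel defines further bits that are ignored here), and the helper names are hypothetical.

```python
CLIENT_BIT = 0x1  # outgoing connections may send data in the opening SYN
SERVER_BIT = 0x2  # listening sockets may accept data in the opening SYN


def describe_tcp_fastopen(value):
    """Decode the client/server bits of a net.ipv4.tcp_fastopen value."""
    return {
        "client": bool(value & CLIENT_BIT),
        "server": bool(value & SERVER_BIT),
    }


def read_tcp_fastopen(path="/proc/sys/net/ipv4/tcp_fastopen"):
    """Read the current setting (Linux only; raises OSError if the file is absent)."""
    with open(path) as f:
        return int(f.read().strip())
```

So describe_tcp_fastopen(1) reports client-only support, which is all a machine needs if it only originates connections or forwards packets, while a machine that accepts TCP_FASTOPEN connections needs the server bit as well.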

What was the experience from that client? Did it fail or was it still able to send “completed.” successfully?

TBH I’m not sure now, I have already closed those terminals.
It was completing on the client side (the script terminated without any errors), but it wasn’t incrementing TCPFastOpenPassive on the server side. I tried a couple of times.
Trying it again now, it did print ‘complete’ on the server side but didn’t increment TCPFastOpenPassive. In that case it took a little longer to complete on the client side.
Trying it more, it completed pretty fast, again printing ‘complete’, and it did increment TCPFastOpenPassive.
Then again it took a little longer to complete, printing ‘complete’, but didn’t increment.
So it looks like it is working, just not all the time, but this might be related to network conditions, load, loss, etc.
And to add, all of this was from the VPS that was showing ActiveFail and Blackhole before.

# ./fastopen.py server :10001 && netstat -s | grep TCPFastOpen
b'complete'
    TCPFastOpenActive: 4
    TCPFastOpenActiveFail: 2
    TCPFastOpenPassive: 22
    TCPFastOpenCookieReqd: 4
# ./fastopen.py server :10001 && netstat -s | grep TCPFastOpen
b'complete'
    TCPFastOpenActive: 4
    TCPFastOpenActiveFail: 2
    TCPFastOpenPassive: 22
    TCPFastOpenCookieReqd: 4
# ./fastopen.py server :10001 && netstat -s | grep TCPFastOpen
b'complete'
    TCPFastOpenActive: 4
    TCPFastOpenActiveFail: 2
    TCPFastOpenPassive: 23
    TCPFastOpenCookieReqd: 4
# ./fastopen.py server :10001 && netstat -s | grep TCPFastOpen
b'complete'
    TCPFastOpenActive: 4
    TCPFastOpenActiveFail: 2
    TCPFastOpenPassive: 24
    TCPFastOpenCookieReqd: 4
# ./fastopen.py server :10001 && netstat -s | grep TCPFastOpen
b'complete'
    TCPFastOpenActive: 4
    TCPFastOpenActiveFail: 2
    TCPFastOpenPassive: 25
    TCPFastOpenCookieReqd: 4
# ./fastopen.py server :10001 && netstat -s | grep TCPFastOpen
b'complete'
    TCPFastOpenActive: 4
    TCPFastOpenActiveFail: 2
    TCPFastOpenPassive: 25
    TCPFastOpenCookieReqd: 4
# ./fastopen.py server :10001 && netstat -s | grep TCPFastOpen
b'complete'
    TCPFastOpenActive: 4
    TCPFastOpenActiveFail: 2
    TCPFastOpenPassive: 26
    TCPFastOpenCookieReqd: 4
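One way to read a log like this is to compare each TCPFastOpenPassive value with the one before it: a run that used TCP_FASTOPEN bumps the counter, a run that fell back to normal TCP does not. A small sketch of that comparison follows. The function name is invented, and it assumes nothing else on the machine is accepting TCP_FASTOPEN connections at the same time, since the counter is system-wide.

```python
def fastopen_runs(passive_counts, baseline):
    """Return one bool per run: True if TCPFastOpenPassive rose during that run."""
    results = []
    previous = baseline
    for count in passive_counts:
        results.append(count > previous)
        previous = count
    return results


# Values from the log above, using the first reading (22) as the baseline.
runs = fastopen_runs([22, 23, 24, 25, 25, 26], baseline=22)
fastopen_share = sum(runs) / len(runs)
```

With these numbers, four of the six later runs incremented the counter and two did not.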

This is actually fantastic news!! Edit: maybe good but not great news. See bottom.

The case we’re worried about is when “complete” isn’t sent at all, and as far as I can tell from your experiments, “complete” was always sent, even when TCP_FASTOPEN didn’t work, which means it always gracefully fell back to normal TCP. Which is perfect! That’s exactly what we want to happen.

What I’ve been worried about is if “complete” never made it, because the packets weren’t delivered, and the connection timed out. The ideal case is that TCP_FASTOPEN is completely successful, but it’s okay if it’s not, as long as the connection itself still is. It looks like your connections were always successful! Hooray!

Let’s see what more tests look like, but if every test is like yours, I have no qualms about suggesting TCP_FASTOPEN be enabled by default.

Edited to add: oops, I missed a detail. You said that in the cases where TCP_FASTOPEN appears to have failed and things fell back to normal TCP, the client side took longer? That’s interesting. Maybe your client-side kernel is timing out and then resending without TCP_FASTOPEN, and the case I’m worried about is actually happening. How much longer? If you’re able to reproduce your setup again, could you time how long the client takes for a few tests and give us some rough idea of how long a TCP_FASTOPEN complete takes vs a non-TCP_FASTOPEN complete?
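For collecting those timings, something like the following could work. It is only a sketch: it times whatever command line you hand it from the outside, so it measures process startup as well as the connection, and the helper names are hypothetical. Failed or timed-out runs are recorded as None rather than dropped, since those are exactly the cases of interest.

```python
import statistics
import subprocess
import time


def time_command(argv, runs=5, timeout=30.0):
    """Run argv several times; return wall-clock seconds per run, None on failure."""
    samples = []
    for _ in range(runs):
        start = time.monotonic()
        try:
            subprocess.run(argv, check=True, timeout=timeout,
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            samples.append(None)  # non-zero exit or no completion within timeout
            continue
        samples.append(time.monotonic() - start)
    return samples


def summarize(samples):
    """Condense a list of timings into counts, median and worst case."""
    ok = [s for s in samples if s is not None]
    return {
        "runs": len(samples),
        "failed": len(samples) - len(ok),
        "median_s": statistics.median(ok) if ok else None,
        "max_s": max(ok) if ok else None,
    }
```

For example, summarize(time_command(["./fastopen.py", "client", "yournetwork.yourdns.example.com:5996"])) would time the client from earlier in the thread. Checking the server-side counter after each run tells you which timings belong to TCP_FASTOPEN connections and which to fallbacks.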

After some testing I can confirm that it works locally on my server. Unfortunately the story is different when using an external client on a VPS. It doesn’t seem to increase the TCPFastOpenPassive counter on the server at all. And while it usually finishes near instantly, it can also take a few seconds. And worse… sometimes it just gets stuck, never finishes, and ends in a connection timeout. I’ve now also seen instances where the client finishes without an error but the server never receives the message. It seems to be really intermittent.

Edit: Running the client function from Windows throws the following error.

>>> client('192.168.1.100:5996')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in client
OSError: [WinError 10042] An unknown, invalid, or unsupported option or level was specified in a getsockopt or setsockopt call

I wanted to try from the local network to see if that would work, but it looks like I’d have to adjust something in Windows to make it work. Or find another Linux system. WSL also doesn’t work.

>>> client('192.168.1.100:5996')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in client
OSError: [Errno 92] Protocol not available

Probably a limitation of WSL as I can’t set the setting with sysctl either (though I was kind of expecting this).

sysctl: cannot stat /proc/sys/net/ipv4/tcp_fastopen: No such file or directory
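Errors like these can be detected up front instead of showing up as a traceback. The sketch below checks three things on the local machine: whether this Python build exposes the MSG_FASTOPEN constant that the client path relies on, whether the kernel accepts the TCP_FASTOPEN option on a listening socket, and whether the Linux sysctl file exists. The function name and result keys are invented for illustration. A positive result only says the local platform is willing; it says nothing about the network path.

```python
import os
import socket


def probe_fastopen_support(qlen=5):
    """Best-effort local check for TCP_FASTOPEN support; never raises OSError."""
    result = {
        "msg_fastopen_constant": hasattr(socket, "MSG_FASTOPEN"),
        "listen_option_accepted": False,
        "sysctl_present": os.path.exists("/proc/sys/net/ipv4/tcp_fastopen"),
    }
    option = getattr(socket, "TCP_FASTOPEN", None)
    if option is None:
        return result  # this platform's socket module has no such option
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.bind(("127.0.0.1", 0))  # any free port; nothing connects to it
            sock.setsockopt(socket.IPPROTO_TCP, option, qlen)
            sock.listen(1)
            result["listen_option_accepted"] = True
    except OSError:
        pass  # the kernel (or compatibility layer) rejected the option
    return result
```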

@jtolio do you have any tips on how to find out whether this could be fixed in my network? Maybe some changes on my router are required? (Though I doubt my ISP would let me change much.)

This is my experience with WSL

image

I faced the same error with getsockopt or setsockopt call. I am trying to figure it out too.

It worked with sudo

USER@USER$ sudo sysctl -w net.ipv4.tcp_fastopen=3
net.ipv4.tcp_fastopen = 3

My bad… I was apparently still on WSL1. I was able to test from my Windows system using WSL2 now. It works as intended. I see the counter increase and the response is instant. So fastopen is definitely working within my LAN. Now to find out how to fix it for WAN…

Edit: Further testing. I have (unfortunately) a dual NAT setup. My ISP makes it really difficult to use your own router and keep TV/Phone functionality. But testing with the WAN IP of my internal router, it still works… Which probably makes the ISP router the issue… as I feared.

Edit2: Interesting, testing with my external IP from my ISP also works when testing from within my network. So either NAT loopback is treated differently, or the issue was with the VPS I used.

Edit3: Tested from my phone (not connected to wifi) using pydroid and I’m seeing the same intermittent results as when testing from the VPS. It doesn’t increase the counter, sometimes the message is received instantly, sometimes after a few seconds and sometimes it times out. Though who knows if that app could even use fastopen. So this might be a client issue. However, either way, the fallback seems unstable.

Under Windows 10 Pro (22H2) these are the default parameters

image

I am finding it difficult so far to test this under Windows.

I also did some client testing from four VPS nodes located in USA/Spain/UK/DE to a server in DK (Denmark).
All servers are running Ubuntu (20.04.4 LTS), and no NAT/PAT should be in between any of the servers.
The net.ipv4.tcp_fastopen=3 option is also set.

USA → DK :

#
# --- Client ---
#
root@localhost:~# for x in {1..5}; do echo "Test #$x :"; time ./fastopen.py client server030.storj.dk:5996 ; netstat -s | grep TCPFastOpen; sleep 1; done

Test #1 :
real    0m0.159s
user    0m0.033s
sys     0m0.008s
    TCPFastOpenActive: 166

Test #2 :
real    0m0.154s
user    0m0.026s
sys     0m0.009s
    TCPFastOpenActive: 167

Test #3 :
real    0m0.161s
user    0m0.037s
sys     0m0.005s
    TCPFastOpenActive: 168

Test #4 :
real    0m0.167s
user    0m0.033s
sys     0m0.011s
    TCPFastOpenActive: 169

Test #5 :
real    0m0.158s
user    0m0.027s
sys     0m0.011s
    TCPFastOpenActive: 170

root@localhost:~# ping -c 3 server030.storj.dk
PING server030.storj.dk (89.249.2.94) 56(84) bytes of data.
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=1 ttl=45 time=122 ms
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=2 ttl=45 time=123 ms
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=3 ttl=45 time=122 ms

#
# --- Server ---
#
root@server030:~# while true; do ./fastopen.py server :5996; netstat -s | grep TCPFastOpen;  done
b'complete'
    TCPFastOpenPassive: 237
    TCPFastOpenCookieReqd: 1
b'complete'
    TCPFastOpenPassive: 238
    TCPFastOpenCookieReqd: 1
b'complete'
    TCPFastOpenPassive: 239
    TCPFastOpenCookieReqd: 1
b'complete'
    TCPFastOpenPassive: 240
    TCPFastOpenCookieReqd: 1
b'complete'
    TCPFastOpenPassive: 241
    TCPFastOpenCookieReqd: 1

Spain → DK :

#
# --- Client ---
#
Test #1 :
real    0m0.202s
user    0m0.032s
sys     0m0.012s
    TCPFastOpenActive: 19

Test #2 :
real    0m0.091s
user    0m0.024s
sys     0m0.008s
    TCPFastOpenActive: 20

Test #3 :
real    0m0.092s
user    0m0.025s
sys     0m0.009s
    TCPFastOpenActive: 21

Test #4 :
real    0m0.092s
user    0m0.017s
sys     0m0.017s
    TCPFastOpenActive: 22

Test #5 :
real    0m0.094s
user    0m0.030s
sys     0m0.004s
    TCPFastOpenActive: 23

root@localhost:~# ping -c 3 server030.storj.dk
PING server030.storj.dk (89.249.2.94) 56(84) bytes of data.
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=1 ttl=50 time=51.4 ms
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=2 ttl=50 time=51.4 ms
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=3 ttl=50 time=51.3 ms

#
# --- Server ---
#
b'complete'
    TCPFastOpenPassive: 242
    TCPFastOpenCookieReqd: 2
b'complete'
    TCPFastOpenPassive: 243
    TCPFastOpenCookieReqd: 2
b'complete'
    TCPFastOpenPassive: 244
    TCPFastOpenCookieReqd: 2
b'complete'
    TCPFastOpenPassive: 245
    TCPFastOpenCookieReqd: 2
b'complete'
    TCPFastOpenPassive: 246
    TCPFastOpenCookieReqd: 2

UK → DK

#
# --- Client ---
#
Test #1 :
real    0m0.132s
user    0m0.022s
sys     0m0.019s
    TCPFastOpenActive: 34

Test #2 :
real    0m0.071s
user    0m0.025s
sys     0m0.009s
    TCPFastOpenActive: 35

Test #3 :
real    0m0.074s
user    0m0.036s
sys     0m0.000s
    TCPFastOpenActive: 36

Test #4 :
real    0m0.077s
user    0m0.017s
sys     0m0.021s
    TCPFastOpenActive: 37

Test #5 :
real    0m0.076s
user    0m0.039s
sys     0m0.000s
    TCPFastOpenActive: 38

root@localhost:~# ping -c 3 server030.storj.dk
PING server030.storj.dk (89.249.2.94) 56(84) bytes of data.
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=1 ttl=49 time=34.0 ms
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=2 ttl=49 time=33.8 ms
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=3 ttl=49 time=34.1 ms

#
# --- Server ---
#
b'complete'
    TCPFastOpenPassive: 247
    TCPFastOpenCookieReqd: 3
b'complete'
    TCPFastOpenPassive: 248
    TCPFastOpenCookieReqd: 3
b'complete'
    TCPFastOpenPassive: 249
    TCPFastOpenCookieReqd: 3
b'complete'
    TCPFastOpenPassive: 250
    TCPFastOpenCookieReqd: 3
b'complete'
    TCPFastOpenPassive: 251
    TCPFastOpenCookieReqd: 3

DE → DK

#
# --- Client ---
#
Test #1 :
real    0m0.093s
user    0m0.037s
sys     0m0.015s
    TCPFastOpenActive: 18

Test #2 :
real    0m0.055s
user    0m0.026s
sys     0m0.008s
    TCPFastOpenActive: 19

Test #3 :
real    0m0.054s
user    0m0.030s
sys     0m0.000s
    TCPFastOpenActive: 20

Test #4 :
real    0m0.069s
user    0m0.041s
sys     0m0.004s
    TCPFastOpenActive: 21

Test #5 :
real    0m2.077s  <-- hmm...
user    0m0.024s
sys     0m0.008s
    TCPFastOpenActive: 22

root@localhost:~# ping -c 3 server030.storj.dk
PING server030.storj.dk (89.249.2.94) 56(84) bytes of data.
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=1 ttl=51 time=17.9 ms
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=2 ttl=51 time=18.0 ms
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=3 ttl=51 time=18.0 ms

#
# --- Server ---
#
b'complete'
    TCPFastOpenPassive: 252
    TCPFastOpenCookieReqd: 4
b'complete'
    TCPFastOpenPassive: 253
    TCPFastOpenCookieReqd: 4
b'complete'
    TCPFastOpenPassive: 254
    TCPFastOpenCookieReqd: 4
b'complete'
    TCPFastOpenPassive: 255
    TCPFastOpenCookieReqd: 4
b'complete'
    TCPFastOpenPassive: 256
    TCPFastOpenCookieReqd: 4

Th3Van.dk

Interesting! Is it possible to run the same test without fastopen for comparison?

@BrightSilence and @nerdatwork, I’m sorry, but I got Windows support wrong in my Python tool (though we got it right in the storagenode code). I’ve updated the Python tool to have support for Windows (at least on the server-side, probably not client side): fastopen.py · GitHub

I don’t have any confidence this works on the client side, but if you have the ability to test from a Linux client to a Windows server with this Python tool, that might work better.

On the Windows server, you may need to run

netsh int tcp set global fastopen=enabled
netsh int tcp set global fastopenfallback=disabled

Otherwise, my impression from this thread is that support is going to be hairy. I haven’t figured out how to tell from the client side whether a TCP_FASTOPEN connection was successfully established, but from all of your experiences it seems that even just timing the connection would be enlightening. I’m not sure if this will overburden the Satellite, but maybe the Satellite should try timing two connections to the storage node, one with TCP_FASTOPEN and one without. If the TCP_FASTOPEN one is slower, then perhaps the heuristic should be that we don’t try TCP_FASTOPEN with that storage node ever, and if the TCP_FASTOPEN one is noticeably faster, then maybe we leave it up to clients to try and do some similar benchmarking (maybe clients also disable TCP_FASTOPEN if it fails too much).
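To make that heuristic concrete, here is a rough sketch of how it might look. Nothing here is taken from the storagenode or Satellite code: the names, the 0.8 "noticeably faster" ratio, and the failure threshold are all placeholders, and the in-between case (not slower, but not clearly faster either) is left as "undecided" because the idea above does not settle it.

```python
def classify_node(plain_s, fastopen_s, faster_ratio=0.8):
    """Compare one plain and one TCP_FASTOPEN timing for a storage node.

    plain_s is assumed to be a successful measurement; fastopen_s is None
    when the TCP_FASTOPEN attempt failed or timed out.
    """
    if fastopen_s is None or fastopen_s > plain_s:
        return "disable"    # slower or broken: never try it with this node
    if fastopen_s <= plain_s * faster_ratio:
        return "allow"      # noticeably faster: let clients decide for themselves
    return "undecided"      # within noise; the heuristic above does not cover this


class ClientFastOpenGate:
    """Client-side switch that turns TCP_FASTOPEN off after repeated failures."""

    def __init__(self, max_consecutive_failures=3):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def record(self, succeeded):
        # A success resets the streak; a failure extends it.
        self.consecutive_failures = 0 if succeeded else self.consecutive_failures + 1

    @property
    def enabled(self):
        return self.consecutive_failures < self.max_consecutive_failures
```

A single pair of timings is noisy, so a real version would probably want several samples per node before deciding.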

I tried the Python script and have these results:

  1. A Linux (in my case Debian) router does not need to have this enabled. It can pass the packets properly with the default setting of 1.
  2. It only seems to speed up second and further connection attempts:

Without fastopen enabled (using nc) from the perspective of the client:

>07.229256 SYN
<07.231011 SYN+ACK
>07.231032 ACK
>07.231231 "complete"
<07.233009 ACK

3.75ms from SYN sent to the last ACK. I tested this over the internet (since I wanted to test my router), but I do not have servers far away.
2xRTT

With fastopen (using the script) the first time:

>21.889913 SYN, cookie
<21.892729 SYN+ACK
>21.892747 "complete"
>21.893031 ACK
<21.895860 ACK

5.95ms from SYN to last ACK. Longer, but it could just be the jitter, since ping is so low here. Anyway, still 2xRTT

However, the next time it becomes faster:

>25.086627 SYN, cookie, "complete"
<25.088446 SYN+ACK

1.8ms, 1xRTT

Maybe this is only news to me (I am new to TCP_FASTOPEN), but it looks like only the second and later connections from the same client are faster. The first connection takes the same time as normal TCP, so it would speed up downloading multiple files (assuming they are stored on the same nodes), but it would not speed up downloading one small file. The “session” probably has an expiration time.

This is right - the first connection establishes a TCP_FASTOPEN “cookie” that is then used for subsequent sessions.

This will make a major difference especially for our gateway, in which almost every connection is a subsequent connection.
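As a back-of-the-envelope illustration, using the same accounting as the captures above (SYN to final ACK: two round trips without a usable cookie, one with): the sketch below adds up connection setup time over a series of connections to the same server. It assumes the cookie stays valid for the whole series and that no connection has to fall back, and the function name is made up.

```python
def total_setup_ms(rtt_ms, connections, fastopen):
    """Estimated SYN-to-final-ACK time summed over several connections."""
    if connections <= 0:
        return 0.0
    if not fastopen:
        return 2.0 * rtt_ms * connections
    # The first connection only obtains the cookie and costs the usual two
    # round trips; every later connection carries its data in the SYN.
    return 2.0 * rtt_ms + 1.0 * rtt_ms * (connections - 1)


without_tfo = total_setup_ms(50.0, 5, fastopen=False)  # five connections, 50 ms RTT
with_tfo = total_setup_ms(50.0, 5, fastopen=True)
```

Under these assumptions, five connections over a 50 ms path cost 500 ms of setup without TCP_FASTOPEN and 300 ms with it, and the gap widens as the share of repeat connections grows.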

I used WSL2 as the client and Windows 10 Pro as the server, with the following results.

$ python3 fastopen.py client 127.0.0.1:5996
Traceback (most recent call last):
  File "fastopen.py", line 107, in <module>
    main()
  File "fastopen.py", line 103, in main
    dispatch[command](addr)
  File "fastopen.py", line 72, in client
    sock.sendto(b"complete", MSG_FASTOPEN, (host, int(port)))
ConnectionRefusedError: [Errno 111] Connection refused

Windows 10 Pro as server:

image

Please try using the local IP instead of 127.0.0.1. WSL2 is a VM with its own networking, and its networking with the Windows host is complicated.
Alternatively, you can run the client inside another PowerShell window; that way it connects to your server listening on the Windows host, and 127.0.0.1 will be reachable.

Here are the results with local IP on WSL2

$ python3 fastopen.py client 192.168.1.5:5996
Traceback (most recent call last):
  File "fastopen.py", line 107, in <module>
    main()
  File "fastopen.py", line 103, in main
    dispatch[command](addr)
  File "fastopen.py", line 72, in client
    sock.sendto(b"complete", MSG_FASTOPEN, (host, int(port)))
TimeoutError: [Errno 110] Connection timed out

This is using two PowerShell windows.
First with localhost

python .\fastopen.py client 127.0.0.1:5996
Traceback (most recent call last):
  File "D:\test\fastopen.py", line 107, in <module>
    main()
  File "D:\test\fastopen.py", line 103, in main
    dispatch[command](addr)
  File "D:\test\fastopen.py", line 72, in client
    sock.sendto(b"complete", MSG_FASTOPEN, (host, int(port)))
OSError: [WinError 10045] The attempted operation is not supported for the type of object referenced

Secondly with local IP

python .\fastopen.py client 192.168.1.5:5996
Traceback (most recent call last):
  File "D:\test\fastopen.py", line 107, in <module>
    main()
  File "D:\test\fastopen.py", line 103, in main
    dispatch[command](addr)
  File "D:\test\fastopen.py", line 72, in client
    sock.sendto(b"complete", MSG_FASTOPEN, (host, int(port)))
OSError: [WinError 10045] The attempted operation is not supported for the type of object referenced

Okay, right, the Python utility doesn’t work in client mode on Windows. To make it work I would need to figure out how to use the ConnectEx Windows call instead. You can use the Python utility in server mode on Windows, and then just find a Linuxy place to run the client.

Please allow TCP port 5996 through the Windows firewall, then try the client from WSL2 using the local IP of your PC.

That worked flawlessly. Here are the results after adding TCP port 5996 to firewall

On WSL2 client

$ python3 fastopen.py  client 192.168.1.5:5996 && netstat -s | grep TCPFastOpen
    TCPFastOpenActive: 1
    TCPFastOpenActiveFail: 2

On the Windows 10 Pro server

PS D:\test> python.exe .\fastopen.py server :5996
b'complete'