Well, running it locally is enough to tell you whether your operating system is set up, but you're right: if you want a good hint that your network topology supports TCP_FASTOPEN, you'd want to run this Python tool in server mode with port forwarding set up in your network, then run the client somewhere far outside your network. I imagine that would look like:
./fastopen.py server :5996
locally, with port 5996 forwarded, then somewhere else (a rented VPS or something):
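./fastopen.py client your.forwarded.example.com:5996
(with your.forwarded.example.com standing in for whatever public hostname or IP your port forward is reachable at)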
For me this works over WAN with the server behind NAT using RouterOS.
I tried the client from two VPSes in two different locations, same provider, same Debian including the config. It works from one, but on the other client I get TCPFastOpenActiveFail: 1 and TCPFastOpenBlackhole: 1, with or without a firewall attached to that instance. I ran both as root.
The server was a QEMU VM with a MacVTap network adapter.
I had to do sysctl -w net.ipv4.tcp_fastopen=3 on the server instance (the value is a bitmask: 1 enables client support, 2 enables server support, 3 enables both); the hypervisor was at 1 and needed no change.
TBH not sure now, I have closed those terminals already.
It was, however, completing on the client side (the script terminated without any errors), but it wasn't incrementing TCPFastOpenPassive on the server side. I tried a couple of times.
Trying it again now, and it did print ‘complete’ on the server side, but didn’t increment TCPFastOpenPassive. In that case it took a little longer to complete on the client side.
Trying it more and it completed pretty fast, again printing ‘complete’ and it did increment TCPFastOpenPassive.
Then again and it took a little longer to complete with printing ‘complete’, but didn’t increment.
So it looks like it is working, just not all the time, but this might be related to network conditions, load, loss, etc.
And to add, this all was from the VPS that was doing that ActiveFail and Blackhole before.
This is actually fantastic news!! Edit: maybe good but not great news. See bottom.
The case we’re worried about is when “complete” isn’t sent at all, and as far as I can tell from your experiments, “complete” was always sent, even when TCP_FASTOPEN didn’t work, which means it always gracefully fell back to normal TCP. Which is perfect! That’s exactly what we want to happen.
What I’ve been worried about is if “complete” never made it, because the packets weren’t delivered, and the connection timed out. The ideal case is that TCP_FASTOPEN is completely successful, but it’s okay if it’s not, as long as the connection itself still is. It looks like your connections were always successful! Hooray!
Let’s see what more tests look like, but if every test is like yours, I have no qualms about suggesting TCP_FASTOPEN be enabled by default.
Edited to add: oops, I missed a detail. You said that in the cases where TCP_FASTOPEN appears to have failed and things fell back to normal TCP, the client side took longer? That's interesting; maybe your client-side kernel is timing out and then resending without TCP_FASTOPEN, and the case I'm worried about is actually happening. How much longer? If you're able to reproduce your setup again, could you time how long the client takes for a few tests and give us some rough idea of how long a TCP_FASTOPEN complete takes vs a non-TCP_FASTOPEN complete?
After some testing I can confirm that it works locally on my server. Unfortunately the story is different when using an external client on VPS. It doesn’t seem to increase the TCPFastOpenPassive counter on the server at all. And while it usually finishes near instantly, it can also take a few seconds. And worse… sometimes it just gets stuck and never finishes at all and ends up with a connection timeout. I’ve now also seen instances where the client finishes without an error, but the server never received the message. It seems to be really intermittent.
Edit: Running the client function from Windows throws the following error.
>>> client('192.168.1.100:5996')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 5, in client
OSError: [WinError 10042] An unknown, invalid, or unsupported option or level was specified in a getsockopt or setsockopt call
I wanted to try from the local network to see if that would work, but it looks like I'd have to adjust something in Windows to make it work, or find another Linux system. WSL also doesn't work.
>>> client('192.168.1.100:5996')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 5, in client
OSError: [Errno 92] Protocol not available
Probably a limitation of WSL, as I can't set the setting with sysctl either (though I was kind of expecting this).
sysctl: cannot stat /proc/sys/net/ipv4/tcp_fastopen: No such file or directory
@jtolio do you have any tips on how to see whether this could be fixed in my network? Maybe some changes on my router are required? (Though I doubt my ISP would let me change much.)
My bad… I was apparently still on WSL1. I was able to test from my Windows system using WSL2 now, and it works as intended. I see the counter increase and the response is instant, so fastopen is definitely working within my LAN. Now to find out how to fix it for WAN…
Edit: Further testing. I have (unfortunately) a dual NAT setup; my ISP makes it really difficult to use your own router and keep TV/phone functionality. But testing with the WAN IP of my internal router, it still works… which probably makes the ISP router the issue, as I feared.
Edit2: Interesting, testing with my external IP from my ISP also works when testing from within my network. So either NAT loopback is treated differently, or the issue was with the VPS I used.
Edit3: Tested from my phone (not connected to Wi-Fi) using Pydroid, and I'm seeing the same intermittent results as when testing from the VPS. It doesn't increase the counter; sometimes the message is received instantly, sometimes after a few seconds, and sometimes it times out. Though who knows whether that app can even use fastopen, so this might be a client issue. Either way, the fallback seems unstable.
I also did some client testing from four VPS nodes located in USA/Spain/UK/DE to a server in DK (Denmark).
All servers are running Ubuntu (20.04.4 LTS), and no NAT/PAT should be in between any of the servers.
The net.ipv4.tcp_fastopen=3 option is also set on all of them.
USA → DK :
#
# --- Client ---
#
root@localhost:~# for x in {1..5}; do echo "Test #$x :"; time ./fastopen.py client server030.storj.dk:5996 ; netstat -s | grep TCPFastOpen; sleep 1; done
Test #1 :
real 0m0.159s
user 0m0.033s
sys 0m0.008s
TCPFastOpenActive: 166
Test #2 :
real 0m0.154s
user 0m0.026s
sys 0m0.009s
TCPFastOpenActive: 167
Test #3 :
real 0m0.161s
user 0m0.037s
sys 0m0.005s
TCPFastOpenActive: 168
Test #4 :
real 0m0.167s
user 0m0.033s
sys 0m0.011s
TCPFastOpenActive: 169
Test #5 :
real 0m0.158s
user 0m0.027s
sys 0m0.011s
TCPFastOpenActive: 170
root@localhost:~# ping -c 3 server030.storj.dk
PING server030.storj.dk (89.249.2.94) 56(84) bytes of data.
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=1 ttl=45 time=122 ms
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=2 ttl=45 time=123 ms
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=3 ttl=45 time=122 ms
#
# --- Server ---
#
root@server030:~# while true; do ./fastopen.py server :5996; netstat -s | grep TCPFastOpen; done
b'complete'
TCPFastOpenPassive: 237
TCPFastOpenCookieReqd: 1
b'complete'
TCPFastOpenPassive: 238
TCPFastOpenCookieReqd: 1
b'complete'
TCPFastOpenPassive: 239
TCPFastOpenCookieReqd: 1
b'complete'
TCPFastOpenPassive: 240
TCPFastOpenCookieReqd: 1
b'complete'
TCPFastOpenPassive: 241
TCPFastOpenCookieReqd: 1
Spain → DK :
#
# --- Client ---
#
Test #1 :
real 0m0.202s
user 0m0.032s
sys 0m0.012s
TCPFastOpenActive: 19
Test #2 :
real 0m0.091s
user 0m0.024s
sys 0m0.008s
TCPFastOpenActive: 20
Test #3 :
real 0m0.092s
user 0m0.025s
sys 0m0.009s
TCPFastOpenActive: 21
Test #4 :
real 0m0.092s
user 0m0.017s
sys 0m0.017s
TCPFastOpenActive: 22
Test #5 :
real 0m0.094s
user 0m0.030s
sys 0m0.004s
TCPFastOpenActive: 23
root@localhost:~# ping -c 3 server030.storj.dk
PING server030.storj.dk (89.249.2.94) 56(84) bytes of data.
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=1 ttl=50 time=51.4 ms
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=2 ttl=50 time=51.4 ms
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=3 ttl=50 time=51.3 ms
#
# --- Server ---
#
b'complete'
TCPFastOpenPassive: 242
TCPFastOpenCookieReqd: 2
b'complete'
TCPFastOpenPassive: 243
TCPFastOpenCookieReqd: 2
b'complete'
TCPFastOpenPassive: 244
TCPFastOpenCookieReqd: 2
b'complete'
TCPFastOpenPassive: 245
TCPFastOpenCookieReqd: 2
b'complete'
TCPFastOpenPassive: 246
TCPFastOpenCookieReqd: 2
UK → DK
#
# --- Client ---
#
Test #1 :
real 0m0.132s
user 0m0.022s
sys 0m0.019s
TCPFastOpenActive: 34
Test #2 :
real 0m0.071s
user 0m0.025s
sys 0m0.009s
TCPFastOpenActive: 35
Test #3 :
real 0m0.074s
user 0m0.036s
sys 0m0.000s
TCPFastOpenActive: 36
Test #4 :
real 0m0.077s
user 0m0.017s
sys 0m0.021s
TCPFastOpenActive: 37
Test #5 :
real 0m0.076s
user 0m0.039s
sys 0m0.000s
TCPFastOpenActive: 38
root@localhost:~# ping -c 3 server030.storj.dk
PING server030.storj.dk (89.249.2.94) 56(84) bytes of data.
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=1 ttl=49 time=34.0 ms
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=2 ttl=49 time=33.8 ms
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=3 ttl=49 time=34.1 ms
#
# --- Server ---
#
b'complete'
TCPFastOpenPassive: 247
TCPFastOpenCookieReqd: 3
b'complete'
TCPFastOpenPassive: 248
TCPFastOpenCookieReqd: 3
b'complete'
TCPFastOpenPassive: 249
TCPFastOpenCookieReqd: 3
b'complete'
TCPFastOpenPassive: 250
TCPFastOpenCookieReqd: 3
b'complete'
TCPFastOpenPassive: 251
TCPFastOpenCookieReqd: 3
DE → DK
#
# --- Client ---
#
Test #1 :
real 0m0.093s
user 0m0.037s
sys 0m0.015s
TCPFastOpenActive: 18
Test #2 :
real 0m0.055s
user 0m0.026s
sys 0m0.008s
TCPFastOpenActive: 19
Test #3 :
real 0m0.054s
user 0m0.030s
sys 0m0.000s
TCPFastOpenActive: 20
Test #4 :
real 0m0.069s
user 0m0.041s
sys 0m0.004s
TCPFastOpenActive: 21
Test #5 :
real 0m2.077s <-- hmm...
user 0m0.024s
sys 0m0.008s
TCPFastOpenActive: 22
root@localhost:~# ping -c 3 server030.storj.dk
PING server030.storj.dk (89.249.2.94) 56(84) bytes of data.
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=1 ttl=51 time=17.9 ms
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=2 ttl=51 time=18.0 ms
64 bytes from 89.249.2.94 (89.249.2.94): icmp_seq=3 ttl=51 time=18.0 ms
#
# --- Server ---
#
b'complete'
TCPFastOpenPassive: 252
TCPFastOpenCookieReqd: 4
b'complete'
TCPFastOpenPassive: 253
TCPFastOpenCookieReqd: 4
b'complete'
TCPFastOpenPassive: 254
TCPFastOpenCookieReqd: 4
b'complete'
TCPFastOpenPassive: 255
TCPFastOpenCookieReqd: 4
b'complete'
TCPFastOpenPassive: 256
TCPFastOpenCookieReqd: 4
@BrightSilence and @nerdatwork, I'm sorry, but I got Windows support wrong in my Python tool (though we got it right in the storagenode code). I've updated the Python tool to support Windows (at least on the server side, probably not the client side): fastopen.py · GitHub
I don’t have any confidence this works on the client side, but if you have the ability to test from a Linux client to a Windows server with this Python tool, that might work better.
On the Windows server, you may need to run
netsh int tcp set global fastopen=enabled
netsh int tcp set global fastopenfallback=disabled
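You should be able to confirm the settings took effect with netsh int tcp show global, which (at least on the Windows versions I've seen) lists the current Fast Open and Fast Open Fallback values.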
Otherwise, my impression from this thread is that support is going to be hairy. Even though I haven't figured out how to tell from the client side whether a TCP_FASTOPEN connection was successfully established, it seems from all of your experiences that even just timing the connection would be enlightening. I'm not sure if this would overburden the Satellite, but maybe the Satellite should try timing two connections to the storage node. If the TCP_FASTOPEN one is slower, then perhaps the heuristic should be that we never try TCP_FASTOPEN with that storage node, and if the TCP_FASTOPEN one is noticeably faster, then maybe we leave it up to clients to do some similar benchmarking (and maybe clients also disable TCP_FASTOPEN if it fails too much).
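To make that timing idea concrete, here is a rough sketch of what the comparison could look like in Python on a Linux client. The host and port are placeholders for your own test server (something like the fastopen.py server above), and it uses the same sendto-with-MSG_FASTOPEN trick as the tool:
import socket, time

HOST, PORT = "server030.storj.dk", 5996  # placeholders: point at your own test server
MSG_FASTOPEN = getattr(socket, "MSG_FASTOPEN", 0x20000000)  # 0x20000000 is the Linux value

def timed_plain():
    # Ordinary TCP: full three-way handshake first, then send the payload.
    start = time.monotonic()
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.connect((HOST, PORT))
        sock.sendall(b"complete")
    return time.monotonic() - start

def timed_fastopen():
    # TCP_FASTOPEN: sendto with MSG_FASTOPEN connects and sends in one step;
    # the kernel falls back to a normal handshake if the cookie/option isn't usable.
    start = time.monotonic()
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.sendto(b"complete", MSG_FASTOPEN, (HOST, PORT))
    return time.monotonic() - start

print("plain   :", timed_plain())
time.sleep(1)  # the test server handles one connection at a time
print("fastopen:", timed_fastopen())
A single sample of each is obviously noisy, so any real heuristic would want several alternating runs per node before deciding anything.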
Maybe this is only news to me (I am new to TCP_FASTOPEN), but it looks like only the second and later connections from the same client are faster. The first connection is the same as normal TCP, since it has to request the fast-open cookie that the client kernel then caches and reuses, so it would speed up downloading multiple files (assuming they are stored on the same nodes), but it would not speed up downloading one small file. The cached cookie probably has an expiration time, too.
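For anyone who wants to see that from a cold start, here is a rough sketch (Linux only; host and port are placeholders for a test server like the one above). As far as I understand it, on a client that has never received a cookie from that server, the first attempt just requests a cookie over a normal handshake (TCPFastOpenActive shouldn't move), while the second attempt sends the data in the SYN (the counter should go up by one):
import socket, time

MSG_FASTOPEN = getattr(socket, "MSG_FASTOPEN", 0x20000000)  # Linux value
HOST, PORT = "server030.storj.dk", 5996  # placeholders: point at your own test server

def tcpext(name):
    # Read one TcpExt counter (e.g. "TCPFastOpenActive") from /proc/net/netstat,
    # which is laid out as alternating header/value lines per section.
    with open("/proc/net/netstat") as f:
        lines = f.read().splitlines()
    for names, values in zip(lines[::2], lines[1::2]):
        if names.startswith("TcpExt:"):
            table = dict(zip(names.split()[1:], map(int, values.split()[1:])))
            return table.get(name, 0)
    return 0

for attempt in (1, 2):
    before = tcpext("TCPFastOpenActive")
    start = time.monotonic()
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.sendto(b"complete", MSG_FASTOPEN, (HOST, PORT))
    elapsed = time.monotonic() - start
    print("attempt %d: %.3fs, TCPFastOpenActive +%d"
          % (attempt, elapsed, tcpext("TCPFastOpenActive") - before))
    time.sleep(1)  # give the looping test server time to start listening again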
Please try to use the local IP instead of 127.0.0.1; WSL2 is a VM with its own networking, and its networking with the Windows host is complicated.
Or you could run the client inside another PowerShell window; that way it will connect to your server listening on the Windows host, and then 127.0.0.1 will be available.
$ python3 fastopen.py client 192.168.1.5:5996
Traceback (most recent call last):
File "fastopen.py", line 107, in <module>
main()
File "fastopen.py", line 103, in main
dispatch[command](addr)
File "fastopen.py", line 72, in client
sock.sendto(b"complete", MSG_FASTOPEN, (host, int(port)))
TimeoutError: [Errno 110] Connection timed out
This is using two PowerShell windows.
First, with localhost:
python .\fastopen.py client 127.0.0.1:5996
Traceback (most recent call last):
File "D:\test\fastopen.py", line 107, in <module>
main()
File "D:\test\fastopen.py", line 103, in main
dispatch[command](addr)
File "D:\test\fastopen.py", line 72, in client
sock.sendto(b"complete", MSG_FASTOPEN, (host, int(port)))
OSError: [WinError 10045] The attempted operation is not supported for the type of object referenced
Second, with the local IP:
python .\fastopen.py client 192.168.1.5:5996
Traceback (most recent call last):
File "D:\test\fastopen.py", line 107, in <module>
main()
File "D:\test\fastopen.py", line 103, in main
dispatch[command](addr)
File "D:\test\fastopen.py", line 72, in client
sock.sendto(b"complete", MSG_FASTOPEN, (host, int(port)))
OSError: [WinError 10045] The attempted operation is not supported for the type of object referenced
Okay, right, the Python utility doesn’t work in client mode on Windows. To make it work I would need to figure out how to use the ConnectEx Windows call instead. You can use the Python utility in server mode on Windows, and then just find a Linuxy place to run the client.