Please enable TCP fastopen on your storage nodes

You should rather compare success rates before and after enabling TCPFO, since enabling TCPFO helps you win more races. That means comparing uploads (successful, failed and canceled) with and without TCPFO.
Look here :point_down:

2 Likes

Yes, you’re right. I forgot that those stats account for lost races too.
I will compare the success rate between the 2 machines over a period of time, because a before-and-after comparison is not so precise. The traffic may vary in time, but for the same period it’s pretty similar for both machines; they are identical, and the internet connection is very similar. They are in the same location too, but on 2 ISPs.

Is there a way to track the success rate without changing the log level to info, and without Prometheus/Grafana?
Will the debug functions display this?

Prometheus just gets the information from the monkit endpoint of the storage node. Since you did not exclude that from your list, you can use it. I mean, the logical next step would be to set up Prometheus and Grafana, but that’s up to you.

I don’t want to set up Prometheus and co., because I think it creates logs and takes resources away from the node. I don’t want anything besides the node to use the HDD; even the log level is set to fatal.
Maybe I’m mistaken about how Prometheus and Grafana work, I really don’t know.
Anyway, I activated the debug port, restarted the node (with stop and rm), and now I can access the following URLs from a PC:

http://192.168.1.201:5999/mon/stats
http://192.168.1.201:5999/mon/ps
http://192.168.1.201:5999/mon/func

I can’t understand any of those outputs.
Do they show success rates for uploads and downloads?
How can I display just that info?
Since when are the stats counted? Since the node’s installation, its recreation (with rm), or its restart (without rm)?
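For reference, this is roughly how I enabled the debug endpoint on the docker node. This is just a sketch; I believe the setting is called debug.addr, but check it against your version, and the port and IP are from my setup:

# two additions to the usual docker run command (everything else unchanged):
#   publish the debug port:                -p 5999:5999
#   append the flag after the image name:  --debug.addr=:5999
# then the endpoint can be queried from another machine:
curl http://192.168.1.201:5999/mon/stats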

You’d usually look at the source code to understand the specific entries. They’re not documented anywhere else, and I assume they can change from release to release. The entries in the mon/stats endpoint that I’ve personally found useful are:

  • audit_success_count — a quick peek at the number of audits, to track node vetting progress.
  • upload_started_count — the number of uploads attempted (excluding attempts rejected due to the concurrent requests limit) since the last node restart; useful to compare the rates of change across several nodes.
  • upload_cancel_count/upload_failure_count/upload_success_count — these could probably replace log parsing for the success rate script; a rough sketch follows below.
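Something like this, assuming the debug endpoint above and that each counter is printed as one line ending in its value; the exact output format isn’t documented and may change between releases, so adjust the parsing as needed:

STATS=$(curl -s http://192.168.1.201:5999/mon/stats)
UP_OK=$(echo "$STATS" | grep -m1 upload_success_count | grep -o '[0-9.]*$')
UP_ALL=$(echo "$STATS" | grep -m1 upload_started_count | grep -o '[0-9.]*$')
# success rate = successful uploads / started uploads
awk -v ok="$UP_OK" -v all="$UP_ALL" 'BEGIN { printf "upload success rate: %.2f%%\n", ok / all * 100 }'
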
2 Likes

I believe, in the end, all that matters is the amount of data stored in a given time (a month or so) and the egress paid in that time.
I’ll wait for the end of the month for a comparison.

Here are the results after 19 hours. The uploads seem pretty similar between the 2 nodes, but with downloads, things get very different… and strange. Please notice the huge number of downloads started on NODE 2 vs NODE 1, and the similar number of successful downloads. This results in a very low success rate for downloads, but the earnings are quite similar. Can anyone explain this?

STRINGS:
========
upload_started_count
upload_cancel_count
upload_failure_count
upload_success_count

download_started_count
download_cancel_count
download_failure_count
download_success_count


RESULTS:
========
NODE 1 - TCPfast on:
upload_started_count=130564
upload_cancel_count=64
upload_failure_count=2997
upload_success_count=127502
upload_success_rate=97.65% /19h

download_started_count,action=GET=73097
download_cancel_count,action=GET=2997
download_failure_count,action=GET=144
download_success_count,action=GET=69956
download_success_rate=95.70% /19h

NODE 2 - TCPfast off:
upload_started_count=131193
upload_cancel_count=56
upload_failure_count=3368
upload_success_count=127766
upload_success_rate=97.39% /19h

download_started_count,action=GET=107658
download_cancel_count,action=GET=34173
download_failure_count,action=GET=18
download_success_count,action=GET=73466
download_success_rate=68.24% /19h
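
For reference, the success rates above are simply success_count divided by started_count; for example, for NODE 1 uploads:

# 127502 successful out of 130564 started uploads
awk 'BEGIN { printf "%.2f%%\n", 127502 / 130564 * 100 }'   # prints 97.65%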

After 24h, I checked all nodes, and there is no difference between TCP_fastopen enabled and disabled.
The discrepancy in the above results shows up on nodes with TCP_fastopen enabled too. The official info is that only a small percentage of clients have enabled TCP_fastopen, so maybe it is not used that much?
Here are the results; I can’t say why some nodes perform better for downloads than others, but the uploads are pretty similar. Maybe the routers play some part? The best performer is Router5, which costs $250; the others are $50-100. Each node has its own router and subnet. The number just denotes a different model. (USR = upload success rate, DSR = download success rate.)

NODE1, TCPF ON, ISP1, ROUTER1, USR 95.9%, DSR 72.72%.
NODE2, TCPF ON, ISP1, ROUTER2, USR 98.28%, DSR 85.04%.
NODE3, TCPF ON, ISP1, ROUTER3, USR 98.48%, DSR 91.31%.
NODE4, TCPF ON, ISP2, ROUTER4, USR 97.37%, DSR 74.39%.
NODE5, TCPF ON, ISP2, ROUTER5, USR 97.94%, DSR 95.57%.
NODE6, TCPF ON, ISP1, ROUTER1, USR 98.2%, DSR 89.43%.
NODE7, TCPF ON, ISP1, ROUTER1, USR 98.63%, DSR 71.49%.
NODE8, TCPF OFF, ISP2, ROUTER4, USR 97.63%, DSR 72.67%.

Thank you for that data point.
That’s quite reassuring for me, because the bulk of my data is on my NAS, which has been EOLed by the manufacturer.
It’s still based on Debian Jessie, and I don’t think that kernel has support for FastOpen, so at least I’m hoping I won’t be left at a great disadvantage :slight_smile:

Check the TCP fastopen wiki; it specifies the minimum Linux version that supports it. I’m not familiar with these names…
An interesting discovery: I found that all nodes can only satisfy about 2 uploads and 1 download per second, so 3 requests per second in total. The success counts are pretty similar between them for up and down, and when the requests increase, the success rate goes down. I see between 211,000 and 262,000 successful requests per 24h, which we can round to roughly 3/s; the window could be a bit more or less than 24h, I didn’t measure it by the minute.
Maybe this is just the maximum my setups can handle (Synology + Exos, no SSD), rather than a limit imposed by the router, switch or ISP.
It would be interesting if more SNOs checked these values and reported back, with details about their setups, to get a bigger picture.
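As a rough sanity check on that figure, assuming an exactly 24-hour window:

# 211,000 to 262,000 requests spread over 86,400 seconds
awk 'BEGIN { printf "%.1f to %.1f requests/second\n", 211000 / 86400, 262000 / 86400 }'   # about 2.4 to 3.0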

It might just as well mean that the additional connections are from a customer who has short bursts of uploads but is too far from you.

1 Like

Probably. I checked again after the update, and the success rate is 97-98% on the nodes that were showing 75% previously, including the one with TCPF off.

1 Like

I see that most have ActiveFails, e.g. TCPFastOpenActiveFail: 1192, but never anything like TCPFastOpenActive: 1111 (i.e. successful active opens). My nodes show the same: the FastOpen active attempts always fail. These failures slow down the associated connections by a small amount.

If I read the docs on fastopen correctly, these are FastOpen connections initiated by the node itself (the active side): “TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed because the remote does not accept it or the attempts timed out.”

As these always fail, I wonder if we have the “net.ipv4.tcp_fastopen” parameter set incorrectly. Rather than having it set to 3, which enables FastOpen for both incoming and outgoing connections, shouldn’t it be set only to 2?

As far as I know, it should be 3. And since

there are no disadvantages to having it enabled.
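The value is a bitmask, so 3 covers both directions. Checking and persisting it looks roughly like this (the sysctl.d path is just a common convention, adjust for your distro):

# 1 = enable for outgoing (client) connections, 2 = enable for incoming (server), 3 = both
sysctl net.ipv4.tcp_fastopen
sudo sysctl -w net.ipv4.tcp_fastopen=3
# make the setting survive a reboot
echo 'net.ipv4.tcp_fastopen=3' | sudo tee /etc/sysctl.d/99-tcp-fastopen.conf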

Oh… I thought enabling fastopen would reduce the traffic… instead, it will almost double it! Dang it, I was looking at it wrong.

So, it seems like you start needing a dedicated internet connection, because you can’t use that router for anything else, like playing games or watching movies, without lag…
They should take into account that we don’t have/buy a dedicated internet connection just for storage nodes; that would make running nodes worthless.

No, the lag is usually completely eliminated by SQM; many routers, including consumer ones, support it today. The problem is just the sheer number of concurrent connections that bridges and modems can’t handle.

And there is nothing users can do: dedicated line or not, it will die, and most people use ISP-provided modems; those that don’t still run ISP-provided firmware on their personal modems, because that’s how DOCSIS providers operate. So the solution for those fellas is to use a hosted gateway.

You can host your own gateway in the cloud, or use the Storj-managed one. And wait until Google or AT&T Fiber reaches your neck of the woods… and hope that your optical network terminal doesn’t exhibit the same issues.

Just a heads up that as of v1.89.x, storage nodes will support TCP_FASTOPEN on Windows and FreeBSD in addition to Linux.

Windows users (who aren’t running a storage node through WSL) shouldn’t need to do anything other than make sure they are running a recent build of Windows 10. (https://review.dev.storj.io/c/storj/storj/+/11221)

FreeBSD users may need to do a step (https://review.dev.storj.io/c/storj/storj/+/11241); a command to verify the setting follows the list:

  • enable with: sysctl net.inet.tcp.fastopen.server_enable=1
  • enable on-boot by setting net.inet.tcp.fastopen.server_enable=1 in /etc/sysctl.conf
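
To verify the current value before and after:

sysctl net.inet.tcp.fastopen.server_enable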
4 Likes

Ubuntu 22.
Added --sysctl net.ipv4.tcp_fastopen=3 \ to the docker run command.
Set net.ipv4.tcp_fastopen=3 in sysctl on the host.
The error in the logs about TCP fast open disappeared.

Like many other Ubuntu users here in this thread, I still get no result from "netstat -s | grep FastOpen" after a month. Is anyone using TCP fastopen with Ubuntu 22?
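One thing I’m not sure about: if the node runs in docker on a bridge network (its own network namespace), maybe the counters only show up inside the container and not in the host’s netstat -s. Something like this (container name assumed to be storagenode) reads them from inside the container:

# print all FastOpen counters from the container's /proc/net/netstat
docker exec storagenode cat /proc/net/netstat | \
awk '/^TcpExt:/ { if (!n) { n = split($0, h, " ") } else { split($0, v, " "); for (i = 2; i <= n; i++) if (h[i] ~ /FastOpen/) print h[i], v[i] } }'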

3 Likes