Node not starting - i/o timeout / connection timeout

mcanto73 · February 21, 2025, 5:30pm

Hello,
I have two Ubuntu nodes that stopped working after an OS upgrade and reboot.

I already tried destroying and recreating the storagenode Docker container (I use Docker Compose).

These are my logs:
2025-02-21T16:41:32Z INFO Configuration loaded {"Process": "storagenode", "Location": "/app/config/config.yaml"}
2025-02-21T16:41:32Z INFO Anonymized tracing enabled {"Process": "storagenode"}
2025-02-21T16:41:32Z INFO Operator email {"Process": "storagenode", "Address": "xxxxx"}
2025-02-21T16:41:32Z INFO Operator wallet {"Process": "storagenode", "Address": "xxxxx"}
2025-02-21T16:41:33Z INFO server existing kernel support for server-side tcp fast open detected {"Process": "storagenode"}
2025-02-21T16:41:41Z INFO hashstore hashstore opened successfully {"Process": "storagenode", "satellite": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "open_time": "5.562200562s"}
2025-02-21T16:41:47Z INFO hashstore hashstore opened successfully {"Process": "storagenode", "satellite": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "open_time": "5.454139228s"}
2025-02-21T16:41:52Z INFO hashstore hashstore opened successfully {"Process": "storagenode", "satellite": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "open_time": "5.547410667s"}
2025-02-21T16:41:58Z INFO hashstore hashstore opened successfully {"Process": "storagenode", "satellite": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "open_time": "5.397359886s"}
2025-02-21T16:42:28Z ERROR version failed to get process version info {"Process": "storagenode", "error": "version checker client: Get \"https://version.storj.io\": dial tcp 34.173.164.90:443: i/o timeout", "errorVerbose": "version checker client: Get \"https://version.storj.io\": dial tcp 34.173.164.90:443: i/o timeout\n\tstorj.io/storj/private/version/checker.(*Client).All:68\n\tstorj.io/storj/private/version/checker.(*Client).Process:89\n\tstorj.io/storj/private/version/checker.(*Service).checkVersion:104\n\tstorj.io/storj/private/version/checker.(*Service).CheckVersion:78\n\tmain.cmdRun:91\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:985\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1117\n\tgithub.com/spf13/cobra.(*Command).Execute:1041\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:272"}
2025-02-21T16:42:28Z INFO Telemetry enabled {"Process": "storagenode", "instance ID": "1xQxZaPRchNV8qd74uy6fEqDZzjPkX15Sk17WkaUNVbZZ4TRhy"}
2025-02-21T16:42:28Z INFO Event collection enabled {"Process": "storagenode", "instance ID": "1xQxZaPRchNV8qd74uy6fEqDZzjPkX15Sk17WkaUNVbZZ4TRhy"}
2025-02-21T16:42:28Z INFO db.migration Database Version {"Process": "storagenode", "version": 62}
2025-02-21T16:42:59Z WARN trust Failed to fetch URLs from source; used cache {"Process": "storagenode", "source": "https://static.storj.io/dcs-satellites", "error": "HTTP source: Get \"https://static.storj.io/dcs-satellites\": dial tcp 34.120.119.150:443: i/o timeout", "errorVerbose": "HTTP source: Get \"https://static.storj.io/dcs-satellites\": dial tcp 34.120.119.150:443: i/o timeout\n\tstorj.io/storj/storagenode/trust.(*HTTPSource).FetchEntries:68\n\tstorj.io/storj/storagenode/trust.(*List).fetchEntries:90\n\tstorj.io/storj/storagenode/trust.(*List).FetchURLs:49\n\tstorj.io/storj/storagenode/trust.(*Pool).fetchURLs:326\n\tstorj.io/storj/storagenode/trust.(*Pool).Refresh:209\n\tstorj.io/storj/storagenode.(*Peer).Run:1079\n\tmain.cmdRun:127\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:985\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1117\n\tgithub.com/spf13/cobra.(*Command).Execute:1041\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:272"}
2025-02-21T16:42:59Z INFO preflight:localtime start checking local system clock with trusted satellites' system clock. {"Process": "storagenode"}
2025-02-21T16:45:11Z ERROR preflight:localtime unable to get satellite system time {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "error": "rpc: tcp connector failed: rpc: dial tcp 34.150.199.48:7777: connect: connection timed out", "errorVerbose": "rpc: tcp connector failed: rpc: dial tcp 34.150.199.48:7777: connect: connection timed out\n\tstorj.io/common/rpc.HybridConnector.DialContext.func1:190"}
2025-02-21T16:45:11Z ERROR preflight:localtime unable to get satellite system time {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "error": "rpc: tcp connector failed: rpc: dial tcp 34.94.153.46:7777: connect: connection timed out", "errorVerbose": "rpc: tcp connector failed: rpc: dial tcp 34.94.153.46:7777: connect: connection timed out\n\tstorj.io/common/rpc.HybridConnector.DialContext.func1:190"}
2025-02-21T16:45:11Z ERROR preflight:localtime unable to get satellite system time {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "error": "rpc: tcp connector failed: rpc: dial tcp 34.126.92.94:7777: connect: connection timed out", "errorVerbose": "rpc: tcp connector failed: rpc: dial tcp 34.126.92.94:7777: connect: connection timed out\n\tstorj.io/common/rpc.HybridConnector.DialContext.func1:190"}
2025-02-21T16:45:11Z ERROR preflight:localtime unable to get satellite system time {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "error": "rpc: tcp connector failed: rpc: dial tcp 34.159.134.91:7777: connect: connection timed out", "errorVerbose": "rpc: tcp connector failed: rpc: dial tcp 34.159.134.91:7777: connect: connection timed out\n\tstorj.io/common/rpc.HybridConnector.DialContext.func1:190"}

So there is a strange i/o timeout on “dial tcp 34.173.164.90:443”, and also “connection timed out” errors when dialing the satellites on port 7777.

Any ideas?

Thanks
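
A quick way to see the pattern in a log like this one is to list each distinct endpoint that failed to dial, together with the reason. This is only a minimal sketch: it assumes the log text arrives on stdin, summarize_dial_failures is a hypothetical helper name, and the three error strings it matches are simply the ones that appear in this thread.

```shell
#!/usr/bin/env bash
# summarize_dial_failures: read storagenode log text on stdin and print one
# line per distinct "ip:port reason" pair found in failed TCP dials.
# Each log line repeats the error in "error" and "errorVerbose", so the
# output is de-duplicated with sort -u rather than counted.
summarize_dial_failures() {
  grep -oE 'dial tcp [0-9.]+:[0-9]+: (connect: )?(i/o timeout|connection timed out|no route to host)' \
    | sed -E 's/^dial tcp //; s/: (connect: )?/ /' \
    | sort -u
}
```

Usage, assuming the container is named storagenode4: `docker logs storagenode4 2>&1 | summarize_dial_failures`. If every endpoint fails, on both port 443 and port 7777, that points at outbound connectivity in general rather than at one particular service.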

nerdatwork · February 21, 2025, 5:44pm

How are your disks connected? Check your firewall settings too.

PieceKeeper · February 21, 2025, 5:46pm

If you run

curl -I "https://version.storj.io"

do you also get a timeout there? What if you do the same inside the container?
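
If the container image turns out not to include curl, one way to run the same check from the container's point of view is a throwaway container that shares its network namespace. A minimal sketch under stated assumptions: tcp_probe and probe_from_container are illustrative helper names (not part of any Storj tooling), the container name passed in is whatever your node container is called, and the public curlimages/curl image is assumed to be pullable on the host.

```shell
#!/usr/bin/env bash
# tcp_probe HOST PORT [SECONDS]: try a plain TCP connect from this machine.
# Uses bash's /dev/tcp pseudo-device, so curl is not required.
tcp_probe() {
  local host=$1 port=$2 secs=${3:-5}
  if timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
    echo "reachable $host:$port"
  else
    echo "unreachable $host:$port"
    return 1
  fi
}

# probe_from_container NAME URL: repeat the curl -I check inside the network
# namespace of the running container NAME, using a temporary curl container.
probe_from_container() {
  docker run --rm --network "container:$1" curlimages/curl -sSI --max-time 10 "$2"
}
```

For example, `tcp_probe version.storj.io 443` on the host, then `probe_from_container <node-container> https://version.storj.io`. If the first succeeds and the second fails, the problem most likely sits between the Docker bridge and the host's outbound path rather than upstream.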

mcanto73 · February 21, 2025, 6:29pm

Hello,
it seems to work from the server:

ubuntu@hpool:~$ curl -I "https://version.storj.io"
HTTP/2 405
date: Fri, 21 Feb 2025 18:20:58 GMT
strict-transport-security: max-age=15724800; includeSubDomains

Inside the storagenode container there is no curl, and I can’t install it because the container has no connectivity.
I’m not sure what happened.

Maybe I will try removing everything, cleaning up Docker and reinstalling it all.

Best regards

mcanto73 · February 21, 2025, 7:21pm

Unfortunately, no luck.

I reinstalled the OS (Ubuntu 24.04), so there is no leftover configuration.
I copied over my docker-compose.yml (also used on other nodes).

storagenode4 | 2025-02-21T19:16:43Z ERROR Error retrieving version info. {"Process": "storagenode-updater", "error": "version checker client: Get \"https://version.storj.io\": dial tcp 34.173.164.90:443: connect: no route to host", "errorVerbose": "version checker client: Get \"https://version.storj.io\": dial tcp 34.173.164.90:443: connect: no route to host\n\tstorj.io/storj/private/version/checker.(*Client).All:68\n\tmain.loopFunc:20\n\tstorj.io/common/sync2.(*Cycle).Run:102\n\tmain.cmdRun:139\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:985\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1117\n\tgithub.com/spf13/cobra.(*Command).Execute:1041\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tstorj.io/common/process.ExecWithCustomConfigAndLogger:77\n\tmain.main:22\n\truntime.main:272"}

How can I solve this error?

Best regards

Roxor · February 21, 2025, 9:47pm

Could you post a sanitized (no email, no wallet, no hostname etc) version of your docker-compose.yml? It’s like the container has no internet access at all.

PieceKeeper · February 21, 2025, 9:50pm

That error is a bit different: no route to host. Do you have other nodes on that server with a similar docker-compose file that work?
Did you install Docker following the instructions in the Docker docs (Ubuntu | Docker Docs)?
Are you using some kind of tunnelling in your setup? Check firewalls.
And yes, show the docker-compose file.

mcanto73 · February 22, 2025, 7:05am

Hello,
A bit of background: these two nodes (virtual machines) had been working for a long time (more than a year) with the same configuration. Yesterday I upgraded the OS and rebooted, and the nodes stopped working. The upgraded OS itself is not the problem, because I have since reinstalled one of them from scratch.
No tunnels. Regarding the firewall, which port should I check? 28967, or something else?

services:
  storagenode:
    image: storjlabs/storagenode:latest
    container_name: storagenode4
    volumes:
      - type: bind
        source: /NODE4/identity/storagenode4
        target: /app/identity
      - type: bind
        source: /NODE4
        target: /app/config
      - type: bind
        source: /STORJ_LOCAL-4
        target: /app/dbs
      - type: bind
        source: /STORJ_LOCAL-4/LOG
        target: /app/config/LOG
    ports:
      - 28967:28967/tcp
      - 28967:28967/udp
      - 14002:14002
    restart: unless-stopped
    stop_grace_period: 300s
    sysctls:
      net.ipv4.tcp_fastopen: 3
    environment:
      - WALLET=
      - EMAIL=xxx@gmail.com
      - ADDRESS=xxxxx:28967
      - STORAGE=8800GB
      - STORJ_PIECES_ENABLE_LAZY_FILEWALKER=true
      - STORJ_STORAGE2_PIECE_SCAN_ON_STARTUP=false
      #- STORJ_OPERATOR_WALLET_FEATURES=zksync
      - STORJ_LOG_LEVEL=info
      - STORJ_LOG_CUSTOM_LEVEL=piecestore=info,collector=error
      #- STORJ_RETAIN_CONCURRENCY=1

  watchtower:
    image: storjlabs/watchtower
    restart: always
    container_name: watchtower
    command: storagenode4 watchtower --stop-timeout 300s --interval 21600
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

  storj_exporter:
    image: thechristech/storj-exporter:latest
    restart: unless-stopped
    container_name: storj-exporter4
    environment:
      - STORJ_HOST_ADDRESS=storagenode4
    ports:
      - "9651:9651"
Thanks

mcanto73 · February 22, 2025, 7:22am

Hello,
I tried again this morning,
and the container has no internet access.

storagenode4 | downloading storagenode-updater
storagenode4 | --2025-02-22 07:20:19-- https://version.storj.io/processes/storagenode-updater/minimum/url?os=linux&arch=amd64
storagenode4 | Resolving version.storj.io (version.storj.io)… 34.173.164.90
storagenode4 | Connecting to version.storj.io (version.storj.io)|34.173.164.90|:443… failed: No route to host.
storagenode4 | http://: Invalid host name.

I really don’t know what happened here.
There is no firewall on the OS, and the ports are open.

Thank you
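
One way to test the "no firewall" assumption is to look at what is actually loaded in the filter table, since Docker itself relies on iptables chains for container traffic. A minimal sketch: check_docker_chains is a hypothetical helper that reads `iptables -S` output on stdin and reports the FORWARD policy and whether the DOCKER and DOCKER-USER chains exist. Other Docker chain names vary between versions, so only these two are checked, and a missing chain is a hint rather than a diagnosis.

```shell
#!/usr/bin/env bash
# check_docker_chains: read `iptables -S` output on stdin.
# Prints the FORWARD policy and whether Docker's base chains are defined.
# Returns non-zero if either chain is missing.
check_docker_chains() {
  rules=$(cat)
  policy=$(printf '%s\n' "$rules" | awk '$1 == "-P" && $2 == "FORWARD" { print $3 }')
  echo "FORWARD policy: ${policy:-unknown}"
  status=0
  for chain in DOCKER DOCKER-USER; do
    # "-N <chain>" is how iptables -S lists a user-defined chain.
    if printf '%s\n' "$rules" | grep -qx -- "-N $chain"; then
      echo "chain $chain: present"
    else
      echo "chain $chain: MISSING"
      status=1
    fi
  done
  return $status
}
```

Usage: `sudo iptables -S | check_docker_chains`.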

mcanto73 · February 22, 2025, 7:59am

Fixed it myself!

It was an issue with iptables, which had been broken by the apt-get dist-upgrade (OS patching).

To fix it, flush iptables back to the defaults and restart Docker, which will then recreate the rules it needs. Be aware that if you have additional iptables rules, you will need to apply them again.

ubuntu@hpool2:~$ sudo iptables -F
ubuntu@hpool2:~$ sudo iptables -X
ubuntu@hpool2:~$ sudo iptables -Z
ubuntu@hpool2:~$ sudo iptables -P FORWARD ACCEPT
ubuntu@hpool2:~$ sudo iptables -P INPUT ACCEPT
ubuntu@hpool2:~$ sudo iptables -P OUTPUT ACCEPT
ubuntu@hpool2:~$ sudo service docker restart

Thanks
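
Because the flush discards any custom rules, it may be worth saving the current rule set first. The sketch below wraps the same commands with an iptables-save backup and a dry-run default. Two things are assumptions rather than part of the fix above: the backup path is a placeholder, and the ACCEPT policies are set before the flush (the reverse of the order shown) so that a DROP policy cannot cut off a remote session partway through. The helper names run and reset_iptables_for_docker are hypothetical.

```shell
#!/usr/bin/env bash
# Dry run by default: commands are printed, not executed. Set APPLY=1 to run them.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# reset_iptables_for_docker [BACKUP_FILE]
reset_iptables_for_docker() {
  backup=${1:-/root/iptables-before-reset.rules}   # placeholder path
  # Keep a copy of the current rules so custom ones can be re-applied later.
  run sudo sh -c "iptables-save > '$backup'"
  # Open the default policies first.
  for chain in INPUT FORWARD OUTPUT; do
    run sudo iptables -P "$chain" ACCEPT
  done
  run sudo iptables -F   # flush all rules in the filter table
  run sudo iptables -X   # delete user-defined chains
  run sudo iptables -Z   # zero packet and byte counters
  # Docker recreates its own chains and rules when the daemon starts.
  run sudo service docker restart
}
```

Calling `reset_iptables_for_docker` prints the plan; `APPLY=1 reset_iptables_for_docker` executes it. The saved file is plain text, so individual custom rules can be copied out of it and re-applied afterwards.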

LxdrJ · February 23, 2025, 7:38am

I switched to nftables; it’s usually a bit simpler, imo.

arrogantrabbit · February 23, 2025, 8:48am

Better yet, use higher-level abstractions, like firewall-cmd.

Example: Simpler way to configure Oracle VPS as a VPN to get around CGSNAT for node hosting purposes

LxdrJ · February 23, 2025, 9:41am

I don’t know whether you should post your firewall rules to a forum, but I just have this simple script, and I like the layout:


table ip my_filter {
        chain input {
                type filter hook input priority filter; policy accept;
                iifname "eth0" tcp dport 28967 accept
                iifname "eth0" udp dport { 28967, 51820 } accept
                iifname "eth0" ip protocol icmp accept
                iifname "eth0" ct state established,related accept
                iifname "eth0" ct state invalid drop
                iifname "eth0" icmpv6 type { echo-request, nd-neighbor-solicit } accept
                iifname "eth0" drop
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
        }

        chain output {
                type filter hook output priority filter; policy accept;
        }
}
table ip nat {
        chain POSTROUTING {
                type nat hook postrouting priority srcnat; policy accept;
                oifname "WG45" masquerade
        }

        chain PREROUTING {
                type nat hook prerouting priority dstnat; policy accept;
                iifname "eth0" tcp dport 28967 dnat to 10.1.1.2
                iifname "eth0" udp dport 28967 dnat to 10.1.1.2
        }
}

arrogantrabbit · February 23, 2025, 10:03am

lol. You don’t have to, but it helps as a reference for future readers.

You can appreciate the complexity of your configuration compared to the few firewall-cmd commands that are sufficient to accomplish the same thing, regardless of the underlying setup. I strongly believe that if there’s a tool that works at a higher abstraction level, that tool should be used. Messing with iptables is too low-level, and the added complexity and opportunities for error add no value, especially for such trivial configurations.
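
For readers who have not used it, firewall-cmd is the command-line client of firewalld, which is not installed by default on Ubuntu. As a rough illustration only, the sketch below prints the invocations that would open a node port for TCP and UDP. It does not reproduce the DNAT and masquerade parts of the nftables file above, the port is taken from this thread, and firewalld_node_commands is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# firewalld_node_commands [PORT]: print (do not run) the firewall-cmd calls
# that permanently open PORT for TCP and UDP and then reload the firewall.
firewalld_node_commands() {
  port=${1:-28967}
  for proto in tcp udp; do
    echo "firewall-cmd --permanent --add-port=${port}/${proto}"
  done
  echo "firewall-cmd --reload"
}
```

Review the printed lines first; they can then be run by hand with sudo.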

LxdrJ · February 23, 2025, 5:45pm

It’s just an executable text file, which is what I like. I was also fiddling around with ufw commands, but it was not to my taste, and everywhere you read that it’s not compatible with Docker. The iptables commands were giving me a bit of a headache. On the other hand, I also read that with WireGuard, ifup/ifdown is not supported; you just hard-code the NAT rules in the tables, and the interface will be running anyway. So far I don’t know what happens when you mess up your nftables.