Node Offline for 3 days, now won't come online

Hello,

I’ve been running a node for months just fine. The other day, I had a drive die in my NAS while I was out of town. Because I couldn’t troubleshoot, I just remotely powered down the NAS.

Now that I am home, I powered things back up after replacing the drive. However, Storj won’t come back online. I can see that the Docker container is running:

Storage Node Dashboard ( Node Version: v1.9.5 )

======================

ID           12ZSweTMauixx1kBErvAQ2pNU8GADHo5E6vnqCHpnXdJeuTGk56
Last Contact OFFLINE
Uptime       14h4m50s

                   Available       Used     Egress     Ingress
     Bandwidth           N/A        0 B        0 B         0 B (since Aug 1)
          Disk        4.2 GB     6.0 TB
Internal 127.0.0.1:7778
External volcrypt.com:28967
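
For reference, this is roughly how I'm checking that the container itself is up and what its recent output looks like (assuming the default container name storagenode):

# confirm the storagenode container is running
docker ps --filter name=storagenode

# check the last part of the node's log for errors
docker logs --tail 50 storagenode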

I didn’t change anything in the configuration, so I would have expected it to come right back up. Could something have caused my node to get blacklisted for being down for a few days?

This is all I see in the logs:

2020-08-07T02:54:56.517Z INFO Node 12ZSweTMauixx1kBErvAQ2pNU8GADHo5E6vnqCHpnXdJeuTGk56 started
2020-08-07T02:54:56.517Z INFO Public server started on [::]:28967
2020-08-07T02:54:56.517Z INFO Private server started on 127.0.0.1:7778
2020-08-07T02:54:56.517Z INFO trust Scheduling next refresh {"after": "4h23m53.495204955s"}
2020-08-07T03:54:56.517Z INFO bandwidth Performing bandwidth usage rollups
2020-08-07T04:54:56.517Z INFO bandwidth Performing bandwidth usage rollups
2020-08-07T05:54:56.517Z INFO bandwidth Performing bandwidth usage rollups
2020-08-07T06:54:56.517Z INFO bandwidth Performing bandwidth usage rollups
2020-08-07T07:18:56.188Z INFO trust Scheduling next refresh {"after": "5h56m29.825057408s"}
2020-08-07T07:54:56.517Z INFO bandwidth Performing bandwidth usage rollups
2020-08-07T08:54:56.517Z INFO bandwidth Performing bandwidth usage rollups
2020-08-07T09:54:56.517Z INFO bandwidth Performing bandwidth usage rollups
2020-08-07T10:54:56.517Z INFO bandwidth Performing bandwidth usage rollups
2020-08-07T11:54:56.517Z INFO bandwidth Performing bandwidth usage rollups
2020-08-07T12:54:56.517Z INFO bandwidth Performing bandwidth usage rollups
2020-08-07T13:15:47.760Z INFO trust Scheduling next refresh {"after": "5h39m12.608699088s"}
2020-08-07T13:54:56.517Z INFO bandwidth Performing bandwidth usage rollups
2020-08-07T14:54:56.517Z INFO bandwidth Performing bandwidth usage rollups
2020-08-07T15:54:56.517Z INFO bandwidth Performing bandwidth usage rollups
2020-08-07T16:54:56.517Z INFO bandwidth Performing bandwidth usage rollups

Looks like your port is closed.
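
If you want to double-check that from a shell instead of the web checker, something like this from a machine outside your LAN (a VPS, a phone hotspot, etc.) should show the same result; testing from inside the network can be misleading if your router doesn’t do hairpin NAT:

# test the external address and port from the dashboard directly
nc -vz -w 5 volcrypt.com 28967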

Very strange that the port forwarding isn’t working. I recently added a second IP to that box. I wonder if somehow the two network interfaces are causing an issue?

I can confirm that the port shows open internally:

macbookpro:~ kupan787$ telnet 10.100.1.31 28967
Trying 10.100.1.31...
Connected to 10.100.1.31.
Escape character is '^]'.
^]
telnet> quit

But hitting the external IP fails:

macbookpro:~ kupan787$ telnet 73.192.200.255 28967
Trying 73.192.200.255...
telnet: Unable to connect to remote host: Connection refused

I’ll do some more digging.

Did you open the firewall port on your Windows PC?

I’m actually running on CentOS.

I can connect locally using nc:

macbookpro:~ kupan787$ nc -v 10.100.1.31 28967
Connection to 10.100.1.31 port 28967 [tcp/*] succeeded!
^C

So I am pretty sure the firewall on the CentOS box isn’t blocking anything.
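
For good measure, this is roughly how to double-check that on the CentOS side (assuming firewalld is managing the firewall):

# show what the active firewalld zone currently allows
sudo firewall-cmd --list-all

# and look for any iptables rules that mention the node port
# (Docker publishes the port via rules in the filter and nat tables)
sudo iptables -S | grep 28967
sudo iptables -t nat -S | grep 28967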

I turned on firewall logging for the port rule, and when I run the port scan from yougetsignal.com I can see the request coming in and being routed to the correct internal IP (10.100.1.31):

Aug  7 10:42:50 erl kernel: [WAN_IN-110-A]IN=eth1 OUT=eth0 SRC=198.199.98.246 DST=10.100.1.31 LEN=60 TOS=0x00 PREC=0x20 TTL=54 ID=37536 DF PROTO=TCP SPT=51529 DPT=28967 WINDOW=14600 RES=0x00 SYN URGP=0
Aug  7 10:42:50 erl kernel: [WAN_IN-110-A]IN=eth1 OUT=eth0 SRC=198.199.98.246 DST=10.100.1.31 LEN=60 TOS=0x00 PREC=0x20 TTL=54 ID=23338 DF PROTO=TCP SPT=51530 DPT=28967 WINDOW=14600 RES=0x00 SYN URGP=0
Aug  7 10:42:51 erl kernel: [WAN_IN-110-A]IN=eth1 OUT=eth0 SRC=198.199.98.246 DST=10.100.1.31 LEN=60 TOS=0x00 PREC=0x20 TTL=54 ID=23339 DF PROTO=TCP SPT=51530 DPT=28967 WINDOW=14600 RES=0x00 SYN URGP=0
Aug  7 10:42:51 erl kernel: [WAN_IN-110-A]IN=eth1 OUT=eth0 SRC=198.199.98.246 DST=10.100.1.31 LEN=60 TOS=0x00 PREC=0x20 TTL=54 ID=12327 DF PROTO=TCP SPT=51532 DPT=28967 WINDOW=14600 RES=0x00 SYN URGP=0
Aug  7 10:42:52 erl kernel: [WAN_IN-110-A]IN=eth1 OUT=eth0 SRC=198.199.98.246 DST=10.100.1.31 LEN=60 TOS=0x00 PREC=0x20 TTL=54 ID=12328 DF PROTO=TCP SPT=51532 DPT=28967 WINDOW=14600 RES=0x00 SYN URGP=0
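
So the router does appear to be forwarding those packets toward the host. One way to confirm they actually arrive on the box itself would be to capture on the LAN interface that holds 10.100.1.31 while re-running the port scan:

# watch for the forwarded SYNs arriving on the node's LAN interface
sudo tcpdump -ni enp6s0 tcp port 28967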

But if I try and hit the WAN IP, I get a “connection refused”:

macbookpro:~ kupan787$ nc -v 73.192.200.255 28967  
nc: connectx to 73.192.200.255 port 28967 (tcp) failed: Connection refused

So I am kind of at a loss right now. The Docker container for Storj is up and running. I can hit the port locally on the internal IP. From what I can tell, the router’s firewall is forwarding the port correctly to my host. But something is still refusing the connection.
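
Something else worth checking at this point is what is actually listening on 28967 and on which address; with the usual -p 28967:28967 mapping, this should normally show docker-proxy bound to 0.0.0.0:28967:

# show listening TCP sockets and the owning process for the node port
sudo ss -tlnp | grep 28967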

I’ll keep digging in, and see if I can figure it out.

The only other change I made recently was that I added another ethernet interface (10g card) that gets an IP from the DHCP server. Here is what I see on the box for ip routes:

[root@rocknas Storj]# ip r
default via 10.100.1.1 dev enp6s0 proto dhcp metric 101
default via 10.10.10.1 dev enp2s0 proto dhcp metric 102
10.10.10.0/24 dev enp2s0 proto kernel scope link src 10.10.10.20 metric 102
10.100.1.0/24 dev enp6s0 proto kernel scope link src 10.100.1.31 metric 101
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1

I’m not sure how that would affect anything, as the port forward rule is for my existing ethernet interface (enp6s0), which has been working fine for months.
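
With two default routes at different metrics, though, replies can end up leaving on a different interface than the one the traffic came in on. Something like this, using the scanner IP from the firewall log above, shows which route the kernel would actually pick:

# which route would a reply to the outside scanner take?
ip route get 198.199.98.246

# and specifically for traffic sourced from the node's LAN address
ip route get 198.199.98.246 from 10.100.1.31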

I’ll keep digging.

And now we are back online!

The only thing I changed, as far as I can tell, is switching enp2s0 to a static IP instead of a dynamic IP from DHCP. Maybe Docker was trying to bind to the wrong interface?

Either way, I seem to be back up and running now.
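
For anyone who hits the same thing, this is roughly the idea on CentOS with NetworkManager (assuming the connection profile is named enp2s0 and that 10.10.10.20/24 is the address it was getting from DHCP; adjust to your network):

# pin the second interface to a static address and stop it from installing a default route
nmcli con mod enp2s0 ipv4.method manual ipv4.addresses 10.10.10.20/24
nmcli con mod enp2s0 ipv4.never-default yes
nmcli con up enp2s0

The never-default option is just to avoid ending up with the second default route shown in the ip r output above.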
