Host unreachable but still up

challet · June 13, 2024, 11:01am

Hi,

My node is suddenly unreachable over ssh. Ping gets no response either, only “Host is down” errors.
But I can still hear the HDD being used. It may be the filewalker running.

Or, my assumption is that the recent switch to async writes combined with a slow drive (SMR) has filled the memory, and some services, like ssh, have shut down. Would that make sense?

Before rebooting it with the power button, I’m wondering whether that scenario is plausible. If so, I may just cut it off from the network (which has already happened, in a way) and wait for the pending writes to free the memory.

Any thoughts?

Roxor · June 13, 2024, 11:12am

If you have physical access to the power button… can you connect a monitor+keyboard? Or if it’s a server use the remote-management port? I’d expect the console to always work.

If you can get in… it may be a memory issue specifically for networking: I’d look for dmesg output that says something like “TCP: out of memory” (like here). So perhaps sshd is still running fine… but TCP/IP connections simply can’t be created anymore.

(Edit: but your guess is reasonable too. You may have to just reboot it and go through logs when it comes back. /var/log/syslog may have one or more smoking guns)
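
A minimal sketch of that log check, assuming a shell on the console. The function name scan_memory_errors and the pattern list are illustrative assumptions, not an exhaustive set of kernel messages:

```shell
# Filter log lines read from stdin for memory-pressure messages.
# The patterns are assumptions; extend them to match what your kernel prints.
scan_memory_errors() {
    grep -i -E 'out of memory|oom-killer' || true   # "|| true": no match is not an error
}

# Typical use on the node (reading the kernel log may need root):
#   sudo dmesg | scan_memory_errors
#   scan_memory_errors < /var/log/syslog
```

An empty result does not rule out memory pressure; it only means none of the assumed patterns appeared in the log that was read.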

challet · June 13, 2024, 11:28am

Update: it was something else, and somewhat mysterious.

I checked the host’s connectivity on my local router. Its DHCP lease was indeed marked as “unreachable”, but I noticed the IPv6 lease was still up.

I was then able to ssh to the node through this IPv6 address and everything was normal: memory not full, no huge load, all services up. The only exception was ifconfig, which showed the same thing as my router (no IPv4):

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 *::*:*:696f:4c36  prefixlen 64  scopeid 0x20<link>
        inet6 *:*:*:*:*:*:cc8:9d47  prefixlen 64  scopeid 0x0<global>
        ether *:*:*:*:*:*  txqueuelen 1000  (Ethernet)
        RX packets 207966471  bytes 261174448314 (243.2 GiB)
        RX errors 16  dropped 317  overruns 0  frame 14
        TX packets 126349520  bytes 90792180267 (84.5 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

After manually refreshing the DHCP lease, it went back online instantly:

$ sudo dhclient -r eth0 ; sudo dhclient eth0
$ ifconfig
[...]
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.0.*  netmask 255.255.255.0  broadcast 192.168.0.*
        inet6 *::*:*:696f:4c36  prefixlen 64  scopeid 0x20<link>
        inet6 *:*:*:*:*:*:cc8:9d47  prefixlen 64  scopeid 0x0<global>
        ether *:*:*:*:*:*  txqueuelen 1000  (Ethernet)
        RX packets 207967562  bytes 261175266000 (243.2 GiB)
        RX errors 16  dropped 317  overruns 0  frame 14
        TX packets 126350225  bytes 90792413683 (84.5 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

I’m still wondering how it could have happened. It’s probably on the router settings side; I’m going to go through them.
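
Until the cause is found, a missing IPv4 address could be detected automatically. The following is only a sketch: has_ipv4 is a hypothetical helper name, and it assumes the interface listing uses the ifconfig (or `ip addr`) layout where IPv4 addresses appear on `inet` lines:

```shell
# Succeeds if the interface listing read from stdin contains an IPv4 ("inet") line.
# "inet6" lines do not match because "inet" must be followed by whitespace.
has_ipv4() {
    grep -q -E '(^|[[:space:]])inet[[:space:]]'
}

# Hypothetical watchdog, e.g. run from cron as root; eth0 and dhclient are
# taken from the commands above and may differ on other systems:
#   if ! ifconfig eth0 | has_ipv4; then
#       dhclient -r eth0; dhclient eth0
#   fi
```

This only papers over the symptom; it does not explain why the lease was lost in the first place.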

Roxor · June 13, 2024, 11:42am

Glad to hear it wasn’t a node issue!

I’d prefer to think that you’ve been receiving so much ingress lately (and will be making so much money)… that your interface simply decided to take a vacation… and went to an all-inclusive resort in the Dominican Republic for a while.

And then you brutally dragged it back to work

Alexey · June 14, 2024, 4:59am

I agree with your assumption that the router could be the culprit. As poetically described by @Roxor, your router could have been overloaded and decided to drop all IPv4 connections to survive.

challet · June 14, 2024, 1:38pm

The network overload is most probably what happened. But Storj’s newly adopted async writes combined with an SMR disk are not the origin of my issue. In fact, apart from this specific case, I haven’t noticed any IO load increase since that change, even with higher ingress.

However, it was the exact scenario we talked about, only on a much worse level: the machine is also used to make regular personal backups to a dedicated disk. This disk got unmounted somehow, and the backups were being written to the empty mountpoint. Without the disk mounted, the mountpoint is just a directory on the root partition, which happens to be an SD card.

Heavier writes than the Storj node’s ingress, landing on an even slower device, could only lead to a catastrophe.

Alexey · June 16, 2024, 6:56am

That could happen if this is an external disk without its own power supply: it may get disconnected under load due to higher power usage. It may also go into sleep mode, and the OS may unmount it. But perhaps the machine just rebooted, and you do not have a record in /etc/fstab for that disk and rely only on automount?
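
One way to check the /etc/fstab question is a small helper like the one below. This is a sketch: in_fstab is a hypothetical function name, and it assumes the standard fstab layout with the mount point in the second whitespace-separated field:

```shell
# Succeeds if the given fstab file has a non-comment entry for the mount point.
# Usage: in_fstab FSTAB_FILE MOUNT_POINT
in_fstab() {
    awk -v mp="$2" '$1 !~ /^#/ && $2 == mp { found = 1 } END { exit !found }' "$1"
}

# Example (the path /mnt/backup is hypothetical):
#   in_fstab /etc/fstab /mnt/backup || echo "no fstab entry, relying on automount"
```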

challet · June 18, 2024, 11:54am

That’s exactly what happened, and it has happened before. This is just the first time the OS unmounted the disk and allowed writes to the empty mount point. The previous times, there were only errors saying my backups could not be written.

I looked for 2.5-inch enclosures with an external power supply; they don’t seem to be that popular.

donald.m.motsinger · June 18, 2024, 12:07pm

You shouldn’t write your backups directly in the root of a mounted disk. Create a subdirectory and check for its existence when doing the backup.
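
A minimal sketch of that check, assuming the subdirectory was created once while the disk was mounted and never on the bare mount point. The function name backup_dir_ready and all paths are hypothetical:

```shell
# Succeeds only if the backup subdirectory exists under the mount point.
# On a bare (unmounted) mount point the subdirectory is absent, so the check fails.
# Usage: backup_dir_ready MOUNT_POINT SUBDIR
backup_dir_ready() {
    [ -d "$1/$2" ]
}

# Example (paths and the rsync call are hypothetical):
#   if backup_dir_ready /mnt/backup backups; then
#       rsync -a /home/ /mnt/backup/backups/
#   else
#       echo "backup disk not mounted, skipping" >&2
#   fi
```

On systems with util-linux, `mountpoint -q /mnt/backup` could be added as a second guard, since it tests directly whether something is mounted at that path.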