My node has suddenly become unreachable over ssh. Ping is unresponsive too, with "Host is down" errors.
But I can still hear the HDD working. It may be the filewalker running.
Another assumption of mine is that the recent switch to async writes, combined with a slow (SMR) drive, has filled the memory and some services have shut down, like ssh. Would that make sense?
Before rebooting it with the power button, I'm wondering whether that scenario is plausible. If so, I may just cut it off from the network (which has already happened, in a way) and wait for the pending writes to free the memory.
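If that's what is happening, I'd expect the dirty page cache to be huge once I can get back in. Something along these lines should show it (just a rough check, the exact figures obviously depend on the machine):

```
# Data still waiting to be flushed to disk: large, slowly shrinking
# Dirty/Writeback values would support the write-back theory
grep -E '^(MemAvailable|Dirty|Writeback):' /proc/meminfo
free -h
```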
If you have physical access to the power button… can you connect a monitor+keyboard? Or, if it's a server, use the remote-management port? I'd expect the console to always work.
If you can get in… it may be a memory issue specifically for networking: I’d look for dmesg output that says something like “TCP: out of memory” (like here). So perhaps sshd is still running fine… but TCP/IP connections simply can’t be created anymore.
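Something like this should surface it from the console (the grep patterns are just the usual suspects, adjust as needed):

```
# Look for TCP memory-pressure messages and compare current usage to the limits
dmesg | grep -iE 'out of memory|orphaned sockets'
cat /proc/sys/net/ipv4/tcp_mem   # min / pressure / max, in pages
cat /proc/net/sockstat           # the "TCP: ... mem N" field is current usage, in pages
```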
(Edit: but your guess is reasonable too. You may have to just reboot it and go through logs when it comes back. /var/log/syslog may have one or more smoking guns)
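A rough sketch of the log digging I'd do once it's back (the patterns are only a starting point; journalctl applies to systemd machines with a persistent journal):

```
# Scan for OOM kills and DHCP/link trouble around the time it dropped off
grep -iE 'out of memory|oom|dhcp|link.*down' /var/log/syslog | tail -n 50
# Errors from the previous boot
journalctl -b -1 -p err
```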
Update: it was something else, somewhat mysterious.
I was checking the host's connectivity on my local router. Its DHCP lease was indeed marked as "unreachable", but I noticed the IPv6 lease was still up.
I was then able to ssh to the node through this IPv6 address and everything was normal: no memory exhaustion, no huge load, all services up. Except that ifconfig showed the same thing as my router: no IPv4 address.
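To get IPv4 back without rebooting, the usual move is to re-trigger the DHCP client; the exact command depends on the distro, and eth0/dhclient below are just placeholders for my setup:

```
ip -4 addr show            # confirm the interface really has no IPv4 address
ip link show eth0          # check the link itself is still UP
sudo dhclient -v eth0      # re-request a lease (ISC dhclient setups)
# on dhcpcd or NetworkManager based systems, instead:
# sudo systemctl restart dhcpcd
# sudo nmcli device reapply eth0
```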
I'd prefer to think that you've been receiving so much ingress lately (and will be making so much money)… that your interface simply decided to take a vacation… and went to an all-inclusive resort in the Dominican Republic for a while.
I agree with your assumption that the router could be the culprit. As poetically described by @Roxor, your router could have been overloaded and decided to drop all IPv4 connections to survive.
Network overload is most probably what happened. But Storj's newly adopted async writes and an SMR disk are not the origin of my issue. Actually, apart from that specific case, I haven't noticed any I/O load increase since that change, even with the bigger ingress.
However, it was that exact scenario we talked about, only on a whole worse level: the machine is also used for regular personal backups, on a dedicated disk. That disk somehow got unmounted, and the backups were being written to the empty mountpoint. Without the disk mounted, that path sits on the root partition, which happens to be an SD card.
Bigger ingress than the Storj node's, written to an even slower disk, could only lead to a catastrophe.
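The obvious fix on my side is to make the backup job refuse to run when the disk isn't mounted. Something like this at the top of the backup script (the path is just mine, adjust accordingly):

```
#!/bin/sh
# Abort if the backup disk is not actually mounted at its mountpoint,
# otherwise we would silently fill the SD card sitting underneath it.
BACKUP_MNT=/mnt/backup
if ! mountpoint -q "$BACKUP_MNT"; then
    echo "backup disk not mounted at $BACKUP_MNT, aborting" >&2
    exit 1
fi
```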
It could happen if this is an external disk without its own power supply: it may get disconnected under load due to higher power usage. It may also go into sleep mode and the OS may unmount it. But perhaps the machine just rebooted and you do not have an entry in /etc/fstab for that disk, relying only on automount?
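Roughly the kind of fstab entry I mean; `nofail` keeps the boot from hanging if the disk is absent, and the UUID and filesystem type are placeholders you would get from `blkid`:

```
# /etc/fstab
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /mnt/backup  ext4  defaults,nofail,x-systemd.device-timeout=10  0  2
```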
That's exactly what happened, and it has happened before. It's just the first time the OS unmounted it and allowed writes to the empty mount point. Previous times, there were only errors about my backups failing to be written.
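On top of checking the mountpoint in the backup script, I might also make the empty mountpoint itself immutable, so anything trying to write there while the disk is gone fails immediately (chattr needs a filesystem that supports it, e.g. ext4; /mnt/backup is again just my path):

```
sudo umount /mnt/backup      # the directory must not have anything mounted on it
sudo chattr +i /mnt/backup   # writes to the bare directory now fail
# the flag has no effect on the backup disk once it is mounted on top
```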
I looked for 2.5-inch enclosures with an external power supply; they don't seem to be that popular.