All of a sudden the node was no longer reachable.
The log showed:
dockerd[416]: runtime: out of memory: cannot allocate 1395810304-byte block (1248329728 in use)
dockerd[416]: fatal error: out of memory
dockerd[416]: runtime stack:
dockerd[416]: runtime.throw(0x22e619c, 0xd)
etc.
It seems this caused systemd to restart Docker:
systemd[1]: docker.service: Failed with result 'exit-code'.
systemd[1]: docker.service: Service RestartSec=2s expired, scheduling restart.
systemd[1]: docker.service: Scheduled restart job, restart counter is at 1.
systemd[1]: Stopped Docker Application Container Engine.
systemd[1]: Starting Docker Application Container Engine...
So the system almost recovered by itself, but it did not work: the node was still unreachable. I manually stopped and restarted the container, to no avail. In the end I had to remove the container and recreate it.
Any ideas how to catch and resolve such a failure and restart the container automatically?
Generally, the main times the storagenode will use larger amounts of memory are when it is running the filewalker or when the HDD cannot keep up with the I/O demand.
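If the filewalker turns out to be the trigger, newer storagenode versions have a config option to skip the piece scan on startup. The option name below is taken from the storagenode config and may differ between versions, so check your own config.yaml before relying on it:

```yaml
# storagenode config.yaml (option name may vary by version; verify against yours)
storage2.piece-scan-on-startup: false
```

This only avoids the startup scan; it does not help if the memory growth comes from the HDD falling behind on normal upload/download I/O.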
So, you’re suggesting rebuilding the entire system? That’s not very helpful. ECC memory will do absolutely nothing for out of memory errors, ZFS is pretty much the biggest memory hog of a file system, so that can only make it worse. And SSDs really should not be necessary at all.
@jammerdan usually high memory usage is related to I/O bottlenecks. What kind of HDD are you using (avoid SMR, as you probably know already)? Is there other stuff running that might impact I/O performance? Also make sure nothing else is gobbling up memory. A well-performing node shouldn’t use more than a two-digit to low three-digit number of MB of memory.
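As a quick sanity check for the SMR question, you can list the drive model and look it up against the manufacturer's SMR lists. A minimal sketch, assuming util-linux is installed:

```shell
# List block devices with model name, size and rotation flag (ROTA 1 = spinning disk).
# Look the MODEL string up on the manufacturer's site to rule out SMR.
lsblk -d -o NAME,MODEL,SIZE,ROTA
```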
Check the I/O wait on your CPU. That’s usually a pretty good indicator of an I/O bottleneck. Also check that it is actually the node that is using a lot of RAM.
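To put a number on the I/O wait without installing extra tools, you can read the iowait counter straight from /proc/stat (a rough sketch; vmstat or iostat give nicer rolling views, and docker stats --no-stream shows per-container memory to confirm it is really the node):

```shell
# Field 6 of the aggregate "cpu" line in /proc/stat is cumulative iowait ticks.
read -r _ user nice system idle iowait _ < /proc/stat
total=$((user + nice + system + idle + iowait))
echo "iowait: $iowait of $total ticks since boot"
```

These are counters since boot, so sample twice a few seconds apart and compare the deltas to get a current percentage.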
No, it is not an SMR disk. Now that I think of it, all logs had vanished afterwards. Maybe log rotation together with high disk usage from the storagenode could be a problem.
The free -m output for this node looks like this right now:
              total        used        free      shared  buff/cache   available
Mem:           1990         785         120          17        1083        1563
Swap:           995          23         971
Certainly not a great RAM size, but it’s an HC-2, so there is no way to change that. I wonder where the swap space is located; maybe I can either move or increase it.
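The "where is the swap" question can be answered directly with swapon, assuming util-linux is installed (it is on the usual Debian-based images for the HC-2):

```shell
# Show each active swap device/file with its type, size and priority.
swapon --show
# Memory and swap summary, same as the free -m output above.
free -m
```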
For now I have added the live-restore option for Docker so hopefully the containers remain active even if Docker dies.
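For reference, live-restore is a standard Docker daemon option and goes into /etc/docker/daemon.json:

```json
{
  "live-restore": true
}
```

It can be applied with a daemon reload (sudo systemctl reload docker); after that, running containers keep running even if dockerd itself dies or restarts.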
Thanks, yes, I already have the --memory option in the run command.
I just learned that there is zram configured. I am not sure that’s really helpful or that its size should be that large, and it seems there is no additional swap space.
So I might try to reduce the zram size and add additional swap space on the SD card.
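A plain swap file is straightforward to add; the path and size below are examples, and note that swapping heavily onto an SD card will wear it out faster:

```shell
# Allocate a 1 GiB file, restrict permissions, format and enable it as swap.
sudo dd if=/dev/zero of=/swapfile bs=1M count=1024
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Persist across reboots:
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
```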
That should be ok I think. For monitoring it might be useful to keep an eye on iotop. If you don’t have it on your system, you can run it through docker too.
docker run --rm -ti --privileged --net=host --pid=host esnunes/iotop:latest
That’s what I use from time to time on Synology as it doesn’t really allow me to install it.