Multiple Docker containers restarting at the same time upon reboot causing issues

On one of my machines, I run multiple nodes (it has many hard drives).

I have noticed that if they all start at the same time, some just fail to ever start receiving traffic. I have to stop, remove, and restart them.

I think this is due to the filewalker process running on every node at once when the server reboots and all the nodes restart at the same time.

Is there an easy system-wide Docker setting that would make containers restart at random times, between 1 and 100 minutes after reboot?

Do other people have this issue?

I run 3 nodes on the same array but have no problems on startup, certainly not with receiving traffic.
Regardless, I never found a solution for delaying a container's startup. The only solution I can come up with is to set your containers to never restart and use a script on boot that starts those containers manually (rough sketch below).
However, that also means that if a container fails, it will stay stopped and not try to restart itself.
So it’s not a perfect solution either.
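
If you go that route, something as simple as this could do it (a rough sketch only; the container names and the 10-minute gap are placeholders for your setup). You could run it from a cron @reboot entry or a small systemd unit that starts after docker.service:

```bash
#!/bin/bash
# Sketch: assumes containers named storagenode1..3 that were created
# with --restart=no, so Docker does not start them by itself on boot.

for node in storagenode1 storagenode2 storagenode3; do
    docker start "$node"
    # give the filewalker a head start before the next node comes up
    sleep 600
done
```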


Yes, that’s what I’m doing now, but I’d prefer it to be more automated. Thanks for the feedback.

Does kinda make me wonder how high latency has to get before the storagenode does its emergency shutdown because it thinks a disk or array is disconnected…

You are not the first talking about random reboots lately. The last guy was using iSCSI, and I think his connection dropped briefly every 1-3 days, which made his node shut down and not restart because he was on Windows and the service wasn’t set to restart by itself. Of course, according to him there was no problem with the iSCSI… but yeah… :smiley:

Node boot does cause a good deal of iowait, so maybe try and see if they will boot and run fine when started in sequence. When I measured and timed the iowait, it took like 20-40 minutes after boot before it dropped to more reasonable levels.

Of course this will be highly dependent on the IO of your setup; something like the ARC in ZFS may make it run better or worse, so if you are running a conventional RAID you may need to compare with other RAID users’ experiences.

I haven’t seen any indication of it while doing my migration, which did put a lot of workload on the drive while the storagenode was running, so I kinda doubt it’s IO related… unless maybe there is some kind of problem with your array, or it behaves differently from what I work with, which is ZFS.

Where did he talk about random reboots?

Phrased it wrong; I was talking about the storagenode shutdowns, which should have been reboots. It could be related to his issue, because the same storagenode HDD timeout thing might be triggered and thus shut down the node, or attempt to do so…

When I was tinkering and removed my L2ARC incorrectly, I lost most pool access but still had enough that some things would work. The storagenode kept running and wouldn’t even initiate a shutdown when I commanded it to… or the iowait was so high that it would take 20+ minutes before the system could shut down. However, I could still do a docker logs --tail 20 storagenode --follow,
and even a reboot command in the terminal didn’t seem to work… but it would still access the pool, just so slowly it was ridiculous.

So high iowait / latency to his storage may cause such things, even if rarely…
One time the storagenode even ran fine; another time it was just all errors because it had lost pool access.

Also, if one does end up with too much work, where an HDD or array has to spend a lot of time seeking, one can basically reach an infinite time scale for how long something takes to finish, depending on how parallel the data demands are.

Of course it could also be something like the VHD configuration or container limitations… In Proxmox, for example, one can set the priority of tasks, and if one did that to favor one container, then during boot the others might time out, or whatever we are calling it…

My Proxmox containers and VMs took a long time to get used to and to learn all the ins and outs…

You do have the latest Docker version, right?
I seem to remember there might be some issues if you get far behind… can’t remember what they were, but some sort of instability / suspension thing, I believe.

And is it always the same one that freezes / stalls, does it depend on which ones are started first, or does it seem random?

The more info you can find on how or why it goes wrong, the easier it might be to troubleshoot.

Latest Docker version. And not random reboots: rare system reboots that cause all the containers to start at once.

But it works fine when you start them one by one, with enough time in between so that the filewalker is done?

Because in that case I would just do a script like kevink suggested…

Something like this seems like it could be adapted with a bit of research; then you can script it to wait until a storagenode is running and accepting connections before it starts the next one, I think…
I don’t really understand all of it, but I imagine it works by having a script inside the container green-light the continuation of the startup-order script in Docker… I guess Docker can see that somehow, or that’s what seems to be suggested in the documentation.

And if it could work like that, it should also retain the ability to restart a storagenode container if it crashes, because the run command would still use the unless-stopped parameter.
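
A rough sketch of how I imagine that could look (the container names and dashboard ports are just placeholders, not from any docs): start each container in turn and poll its dashboard before moving on. One caveat: with unless-stopped, Docker will bring them all up on its own when the daemon starts, so a script like this only adds value with a policy that doesn’t auto-start containers at boot (on-failure, I believe, still restarts a crashed container but leaves boot-time starts to you):

```bash
#!/bin/bash
# Sketch only. Assumes three containers, storagenode1..3, created with
# --restart on-failure, and their dashboards published on host ports
# 14002, 14003 and 14004 -- adjust names and ports to your setup.

declare -A dashboard_port=(
    [storagenode1]=14002
    [storagenode2]=14003
    [storagenode3]=14004
)

for node in storagenode1 storagenode2 storagenode3; do
    docker start "$node"
    # don't start the next node until this one answers on its dashboard port
    until curl -sf "http://127.0.0.1:${dashboard_port[$node]}/" >/dev/null; do
        sleep 30
    done
done
```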


Why do you think that @joesmoe is running Docker Desktop for Windows?


To be clear, it is not on Windows.


Whoops! My bad, apologies! Removed.


What are your server specs? I run multiple nodes on one server just fine (on different disks of course).

Also, did you set a restart policy? With restart=always it won’t matter if a container crashes while starting; it’ll just try to start again.
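
For example (container name is just a placeholder), you can check and change the policy on an existing container without recreating it:

```bash
# check the current restart policy of a container
docker inspect -f '{{ .HostConfig.RestartPolicy.Name }}' storagenode1

# switch it to always (or unless-stopped) in place
docker update --restart=always storagenode1
```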