Docker health check not working

Hey guys,
I have a node running on an old and slow disk, so slow that the process seems to freeze sometimes

The container is still running, but ports 14002 and 28967 are no longer responding

I remembered that Kubernetes has a TCP health check (a TCP liveness probe), so I tried to add something similar with these parameters added to the docker run command:

    --health-cmd="bash -c 'exec 3<>/dev/tcp/localhost/14002 || exit 1'" \
    --health-interval=1m \
    --health-retries=5 \
    --health-timeout=10s \
    --health-start-period=30m \

For a reason I don’t understand, I got this

/home/user# docker exec -it storagenode bash
root@fc4a9ca0761e:/app# bash -c 'exec 3<>/dev/tcp/localhost/14002 || exit 1'
bash: connect: Connection refused
bash: /dev/tcp/localhost/14002: Connection refused
root@fc4a9ca0761e:/app# echo $?
1

So I get an exit code indicating an error, BUT!

/home/user# docker inspect --format='{{json .State.Health}}' storagenode
{
  "Status": "healthy",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2025-10-01T18:11:54.154565154Z",
      "End": "2025-10-01T18:11:55.158906829Z",
      "ExitCode": 0,
      "Output": ""
    },
    {
      "Start": "2025-10-01T18:12:55.160430966Z",
      "End": "2025-10-01T18:12:55.574776418Z",
      "ExitCode": 0,
      "Output": ""
    },
    {
      "Start": "2025-10-01T18:13:55.577041455Z",
      "End": "2025-10-01T18:13:55.639423661Z",
      "ExitCode": 0,
      "Output": ""
    },
    {
      "Start": "2025-10-01T18:14:55.640117067Z",
      "End": "2025-10-01T18:14:55.698302942Z",
      "ExitCode": 0,
      "Output": ""
    },
    {
      "Start": "2025-10-01T18:15:55.699620444Z",
      "End": "2025-10-01T18:15:55.749854431Z",
      "ExitCode": 0,
      "Output": ""
    }
  ]
}

Docker is like “yeah, everything is alright”

Do you have any better idea?
Or a better health check?

I already have a script that restarts unhealthy nodes; it's just waiting for the node to go into this state

Edit: I gave it a try with an Apache container and it works properly. I don't get what's wrong here

Hello @Celizior,
Welcome back!

What does your docker run command look like? Specifically, how do you map the port for the dashboard?

The problem with your --health-cmd="bash -c 'exec 3<>/dev/tcp/localhost/14002 || exit 1'" is that the exit 1 happens inside the second bash; the resulting exit code for the invoker ends up being 0, because that second bash itself exits successfully.
You can solve this in several ways.

  1. Use curl http://localhost:28967 || exit 1 or curl http://localhost:14002 || exit 1 instead.
  2. Use bash -eo pipefail -c 'exec 3<>/dev/tcp/localhost/14002 || exit 1' to pass the 1 to the invoker (but this may not work either; ideally you need to avoid calling the second bash).
  3. Use bash -eo pipefail -c 'exec 3<>/dev/tcp/localhost/14002' || exit 1 (note: the exit 1 is outside of the bash -c command).
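
If you want to double-check which command the daemon actually stored and runs for the health check, you can inspect the container's health check configuration (a small sketch; the exact output formatting may differ between Docker versions):

    docker inspect --format='{{json .Config.Healthcheck}}' storagenode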

Hello Alexey,

Thank you for your answer

Here is the exact command

docker run -d --restart unless-stopped --stop-timeout 300 \
    -p 28967:28967/tcp \
    -p 28967:28967/udp \
    -p 14002:14002 \
    -e WALLET="0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
    -e EMAIL="mymail@provider.com" \
    -e ADDRESS="MyDomainName:28967" \
    -e STORAGE="10TB" \
    --user $(id -u):$(id -g) \
    --mount type=bind,source="/mnt/storj/storj_identity/storagenode",destination=/app/identity \
    --mount type=bind,source="/mnt/storj/storj_data",destination=/app/config \
    --log-driver json-file \
    --log-opt max-size=10m \
    --log-opt max-file=3 \
    --health-cmd="bash -c 'exec 3<>/dev/tcp/localhost/14002 || exit 1'" \
    --health-interval=1m \
    --health-retries=5 \
    --health-timeout=10s \
    --health-start-period=30m \
    --name storagenode storjlabs/storagenode:latest

It's more or less the standard command, with extra parameters to rotate the logs (I had problems with the boot disk filling up) and for the health check

I just checked with docker exec -it storagenode bash: curl http://localhost:28967 || exit 1 doesn't work because curl is not present in the container, but bash -eo pipefail -c 'exec 3<>/dev/tcp/localhost/14002 || exit 1' and bash -eo pipefail -c 'exec 3<>/dev/tcp/localhost/14002' || exit 1 both work
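
In case it helps anyone else, a quick way to see which of these tools actually exist in the image (a small sketch, run from a shell inside the container):

    # Prints the path of each tool that is present; anything missing is simply not listed
    command -v curl wget bash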

I tried them with port 14003 to provoke an error

My command without || exit 1

root@77eb9df12f8c:/app# bash -c 'exec 3<>/dev/tcp/localhost/14003'
bash: connect: Connection refused
bash: /dev/tcp/localhost/14003: Connection refused
root@77eb9df12f8c:/app# echo $?
1

Your command 2

root@77eb9df12f8c:/app# bash -eo pipefail -c 'exec 3<>/dev/tcp/localhost/14003 || exit 1'
bash: connect: Connection refused
bash: /dev/tcp/localhost/14003: Connection refused
root@77eb9df12f8c:/app# echo $?
1

Your command 3

root@77eb9df12f8c:/app# bash -eo pipefail -c 'exec 3<>/dev/tcp/localhost/14003' || exit 1
bash: connect: Connection refused
bash: /dev/tcp/localhost/14003: Connection refused
exit
root@storj:~#

This one kicked me out of the container, which is surprising, but I'm not kicked out if I try it on port 14002

I'm short on time today, but I'll give all of them a try tomorrow and keep you informed

I see. Then you can replace it with a wget command, e.g.:

wget -O - http://127.0.0.1:14002 || exit 1

I don't get it; all of these commands now work to detect whether a port is closed (checked with port 14003):

bash -c 'exec 3<>/dev/tcp/localhost/14002 || exit 1'
bash -eo pipefail -c 'exec 3<>/dev/tcp/localhost/14002 || exit 1'
bash -eo pipefail -c 'exec 3<>/dev/tcp/localhost/14002' || exit 1
wget -O - http://127.0.0.1:14002 || exit 1

Maybe, as the process is frozen, the TCP connection is still established but nothing more happens. But I'm surprised that my Zabbix is able to detect it, maybe because of the default 3 s timeout with net.tcp.service[tcp,14002]

I think I'm going to have to wait for the container to freeze to be able to check it for real.

Update: while writing this post, I got lucky. The container froze!

The previous commands establish the TCP connection but then nothing happens, so there is no error. wget, on the other hand, establishes a TCP session and then expects an HTTP response which never comes.
I tried wget -q -O - --timeout=3 --tries=1 http://127.0.0.1:14002 >/dev/null 2>&1, which returns exit code 4. I think wget -O - http://127.0.0.1:14002 would do the job, as I already have --health-timeout=10s
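
For reference, here is a rough alternative that keeps the pure-bash /dev/tcp approach but should still catch the frozen-but-accepting case, by sending a minimal HTTP request and waiting for at least one byte back (it assumes the timeout utility is available in the image):

    # Non-zero exit if the connection fails; exit 124 if the dashboard accepts the
    # connection but never answers within 5 seconds.
    timeout 5 bash -c 'exec 3<>/dev/tcp/127.0.0.1/14002 && printf "GET / HTTP/1.0\r\n\r\n" >&3 && head -c1 <&3 >/dev/null'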

Let's give the wget check a try; I'll give you news soon

This is why I suggested curl first and wget as a second option. I suspected that TCP may still respond while the node is already frozen. And when the node cannot respond to a simple HTTP request, then it's dead.

I have good news: this config works as expected!

    --health-cmd="wget -O - http://127.0.0.1:14002" \
    --health-interval=1m \
    --health-retries=5 \
    --health-timeout=10s \
    --health-start-period=30m \

The container turns unhealthy after the 5 retries and the script restarts it
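
If you want to watch the state flip without digging through docker inspect, something like this shows the health status next to the uptime (a simple sketch using docker ps formatting):

    docker ps --filter name=storagenode --format '{{.Names}}: {{.Status}}'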

I'll add it here in case anyone is interested:

#!/bin/bash
# Restart every running container whose health check currently reports "unhealthy".
for container in $(docker ps -q); do
  # Containers without a health check make this inspect fail; errors are discarded.
  STATUS=$(docker inspect --format='{{.State.Health.Status}}' "$container" 2>/dev/null)

  if [ "$STATUS" = "unhealthy" ]; then
    # Container name without the leading slash, used for the log line below.
    NAME=$(docker inspect --format='{{.Name}}' "$container" | cut -c2-)
    echo "$(date -Is) restarting unhealthy container $NAME"
    docker restart "$container"
  fi
done

+ add it to the crontab.
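
For example, a crontab entry along these lines (the script path and the 5-minute interval are only placeholders, adjust them to your setup):

    # run the check every 5 minutes; the path is an example, point it at your own script
    */5 * * * * /usr/local/bin/restart-unhealthy.sh >> /var/log/restart-unhealthy.log 2>&1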

Thanks for your help Alexey

You may also use a sidecar container like willfarrell/autoheal or qmcgaw/deunhealth to restart unhealthy containers, or even run a one-node Docker Swarm.
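
For instance, willfarrell/autoheal is typically started roughly like this (a sketch based on the image's commonly documented usage, so double-check its README for current options):

    docker run -d \
        --name autoheal \
        --restart=always \
        -e AUTOHEAL_CONTAINER_LABEL=all \
        -v /var/run/docker.sock:/var/run/docker.sock \
        willfarrell/autoheal

It watches the Docker health status of running containers and restarts the ones that report unhealthy, which would replace the cron script above.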
