One of my disks got disconnected because I was fiddling with how they are organized on my shelf…
Unsurprisingly, my node stopped, which is by the way a good thing: better stopped than failing to find files and getting disqualified.
I issued a sudo mount -a to reconnect the disk, which worked as expected.
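For reference, a quick way to double-check that the disk really came back read-write before expecting the node to recover (the /mnt/storagenode mount point below is only an example, adjust to your own path):

# Re-mount everything declared in /etc/fstab.
sudo mount -a

# Check that the filesystem is mounted and that its options say "rw", not "ro".
findmnt -o TARGET,SOURCE,FSTYPE,OPTIONS /mnt/storagenode

# Quick write test to confirm the disk really accepts writes again.
sudo touch /mnt/storagenode/.write-test && sudo rm /mnt/storagenode/.write-test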
I was expecting the node to get back up on its own after a minute or two. However, after 10 minutes it was still not up, so I checked the logs and saw that it was not trying anything anymore; it had given up:
2022-04-23 21:01:10,598 INFO exited: storagenode (exit status 1; not expected)
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/supervisor/loggers.py", line 84, in emit
self.flush()
File "/usr/lib/python2.7/dist-packages/supervisor/loggers.py", line 64, in flush
self.stream.flush()
IOError: [Errno 30] Read-only file system
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/supervisor/loggers.py", line 84, in emit
self.flush()
File "/usr/lib/python2.7/dist-packages/supervisor/loggers.py", line 64, in flush
self.stream.flush()
IOError: [Errno 30] Read-only file system
2022-04-23 21:01:11,612 INFO spawned: 'storagenode' with pid 3112
2022/04/23 21:01:11 failed to check for file existence: stat config/config.yaml: input/output error
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/supervisor/loggers.py", line 84, in emit
self.flush()
File "/usr/lib/python2.7/dist-packages/supervisor/loggers.py", line 64, in flush
self.stream.flush()
IOError: [Errno 30] Read-only file system
[...]
2022-04-23 21:01:18,043 INFO exited: storagenode (exit status 1; not expected)
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/supervisor/loggers.py", line 84, in emit
self.flush()
File "/usr/lib/python2.7/dist-packages/supervisor/loggers.py", line 64, in flush
self.stream.flush()
IOError: [Errno 30] Read-only file system
2022-04-23 21:01:19,045 INFO gave up: storagenode entered FATAL state, too many start retries too quickly
Although it seems similar to the following post, I think my case is a different issue because this time it’s about the “node” process.
It feels to me like the node shouldn’t give up like that after 10 seconds, and it should certainly not retry every second.
I think something probably needs to be fine-tuned here, so that the node process is restarted less frequently, and ideally indefinitely? Or at least for a long period (24h maybe? a few days? more?) before giving up?
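For what it's worth, the retry behavior in the log looks like supervisord's standard startup handling, which is controlled by a few options in its config. Below is only a sketch of the kind of tuning I mean: the section name and command line are assumptions on my part (I have not checked the actual supervisord.conf shipped in the container), but the option names are standard supervisord ones.

; Sketch only: section name and command are assumptions, not the actual
; config shipped in the storagenode image.
[program:storagenode]
; keep whatever command the image already uses here
command=/app/storagenode run
; restart the process whenever it exits unexpectedly
autorestart=true
; the process must stay up this many seconds to count as successfully started
startsecs=10
; allow many more rapid failed starts before supervisord gives up and goes FATAL
; (the default is 3, which matches the handful of attempts in the log above)
startretries=50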
I am running version v1.52.2, and I believe this is due to the new way Nodes work: in versions 1.49 and earlier, they kept retrying forever because docker was the one restarting them automatically. This no longer works, as Node docker containers do not stop when the node dies inside the container.
Note: Restarting the node’s docker container gets the node back online.
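For completeness, this is all it takes once the disk is mounted again (assuming the container is named storagenode as in the official setup instructions; adjust if yours is named differently):

# Restart the container so supervisord (and the node) starts fresh.
docker restart storagenode

# Watch the logs to confirm the node came back up cleanly.
docker logs --tail 20 --follow storagenode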