One of my disks got disconnected because I was fiddling with how they are organized on my shelf…
Unsurprisingly, my node stopped, which is by the way a good thing: better stopped than failing to find files and getting disqualified.
I issued a sudo mount -a to reconnect the disk, which worked as expected.
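For reference, a quick way to double-check that the disk really came back read-write before expecting the node to recover (the /mnt/storagenode mount point below is only an example, adjust to your own path):

# Re-mount everything declared in /etc/fstab.
sudo mount -a

# Check that the filesystem is mounted and that its options say "rw", not "ro".
findmnt -o TARGET,SOURCE,FSTYPE,OPTIONS /mnt/storagenode

# Quick write test to confirm the disk really accepts writes again.
sudo touch /mnt/storagenode/.write-test && sudo rm /mnt/storagenode/.write-test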
I was expecting the node to get back up on its own after a minute or two. However, after 10 minutes it was still not up, so I checked the logs and saw that it was not trying anything anymore; it had given up:
2022-04-23 21:01:10,598 INFO exited: storagenode (exit status 1; not expected)
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/supervisor/loggers.py", line 84, in emit
self.flush()
File "/usr/lib/python2.7/dist-packages/supervisor/loggers.py", line 64, in flush
self.stream.flush()
IOError: [Errno 30] Read-only file system
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/supervisor/loggers.py", line 84, in emit
self.flush()
File "/usr/lib/python2.7/dist-packages/supervisor/loggers.py", line 64, in flush
self.stream.flush()
IOError: [Errno 30] Read-only file system
2022-04-23 21:01:11,612 INFO spawned: 'storagenode' with pid 3112
2022/04/23 21:01:11 failed to check for file existence: stat config/config.yaml: input/output error
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/supervisor/loggers.py", line 84, in emit
self.flush()
File "/usr/lib/python2.7/dist-packages/supervisor/loggers.py", line 64, in flush
self.stream.flush()
IOError: [Errno 30] Read-only file system
[...]
2022-04-23 21:01:18,043 INFO exited: storagenode (exit status 1; not expected)
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/supervisor/loggers.py", line 84, in emit
self.flush()
File "/usr/lib/python2.7/dist-packages/supervisor/loggers.py", line 64, in flush
self.stream.flush()
IOError: [Errno 30] Read-only file system
2022-04-23 21:01:19,045 INFO gave up: storagenode entered FATAL state, too many start retries too quickly
Although it seems similar to the following post, I think my case is a different issue because this time it’s about the “node” process.
It feels to me like the node shouldn’t give up like that after 10 seconds, and it should certainly not retry every second.
I think something probably needs to be fine-tuned here, so that the node process is restarted less frequently, and ideally indefinitely? Or at least for a long period (24h maybe? a few days? more?) before giving up?
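For what it's worth, the retry behavior in the log looks like supervisord's standard startup handling, which is controlled by a few options in its config. Below is only a sketch of the kind of tuning I mean: the section name and command line are assumptions on my part (I have not checked the actual supervisord.conf shipped in the container), but the option names are standard supervisord ones.

; Sketch only: section name and command are assumptions, not the actual
; config shipped in the storagenode image.
[program:storagenode]
; keep whatever command the image already uses here
command=/app/storagenode run
; restart the process whenever it exits unexpectedly
autorestart=true
; the process must stay up this many seconds to count as successfully started
startsecs=10
; allow many more rapid failed starts before supervisord gives up and goes FATAL
; (the default is 3, which matches the handful of attempts in the log above)
startretries=50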
I am running version v1.52.2, and I believe this is due to the new way Nodes work: in versions 1.49 and earlier, they kept retrying forever because docker was the one restarting them automatically. This no longer works, as Node docker containers do not stop when the node dies inside the container.
Note: Restarting the node’s docker container gets the node back online.
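For completeness, this is all it takes once the disk is mounted again (assuming the container is named storagenode as in the official setup instructions; adjust if yours is named differently):

# Restart the container so supervisord (and the node) starts fresh.
docker restart storagenode

# Watch the logs to confirm the node came back up cleanly.
docker logs --tail 20 --follow storagenode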