It appears this also restarts the node on
ERROR piecestore download/upload failed
These errors do not take the node offline, so I take it there is no need to restart the node when they occur.
@Alexey, is this OK with you, or should I change it to this…
* * * * * /bin/journalctl --since "1 min ago" --until "now" -eu NODESERVICENAME | awk '/ERROR/ && /ping satellite failed/ {a=1}; END { if (a == 1){system("systemctl restart NODESERVICENAME.service")} }'
or even
* * * * * /bin/journalctl --since "1 min ago" --until "now" -eu NODESERVICENAME | awk '/ping satellite failed/ {a=1}; END { if (a == 1){system("systemctl restart NODESERVICENAME.service")} }'
After all, the nodes DO NOT GO OFFLINE now with this hack implemented on the source server. Even if they do, it is for less than a second - during the service restart.
Again… Why this works:
awk uses a pattern-action paradigm. /ERROR/ and /ping satellite failed/ are the patterns and {a=1} is the action. After awk has processed all the journalctl output of the node it is testing, the END section is executed. A simple if (a == 1) test determines whether one or more matches occurred, and if so, the systemctl restart NODESERVICENAME.service command is executed to restart the node because of the error found in the node log for the last minute.
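
If you want to check the match logic before putting it in cron, here is a minimal sketch that runs the same awk pattern against sample lines and only prints what it would do, without touching systemctl. The function name should_restart and the sample log lines are made up for illustration; real node log lines will carry timestamps and extra fields, so treat this as an assumption-laden test harness rather than the actual log format.

```shell
#!/bin/sh
# should_restart (hypothetical helper): reads log lines on stdin and
# exits 0 if any line contains both ERROR and "ping satellite failed",
# otherwise exits 1. Same pattern-action idea as the cron one-liner,
# but the END block sets an exit status instead of calling systemctl.
should_restart() {
    awk '/ERROR/ && /ping satellite failed/ { found = 1 }
         END { exit(found ? 0 : 1) }'
}

# Made-up sample lines, only to exercise the two cases.
printf '%s\n' 'ERROR piecestore upload failed' \
    | should_restart && echo "would restart" || echo "no restart"

printf '%s\n' 'INFO node started' 'ERROR contact:service ping satellite failed' \
    | should_restart && echo "would restart" || echo "no restart"
```

The first call should report no restart, since a piecestore error alone does not match both patterns; the second should report that it would restart. To try it against the real journal, pipe the journalctl command from the cron line into should_restart instead of printf.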