1.55.1 Problems: ERROR contact:service ping satellite failed?

svet0slav · June 8, 2022, 9:05am

It appears it also restarts the node on

ERROR piecestore download/upload failed

These errors do not take the node offline, so I take it that it is OK not to restart the node on them.
@Alexey, is this OK with you, or could I modify it to this…

* * * * * /bin/journalctl --since "1 min ago" --until "now" -eu NODESERVICENAME | awk '/ERROR/ && /ping satellite failed/ {a=1}; END { if (a == 1){system("systemctl restart NODESERVICENAME.service")} }'

or even

* * * * * /bin/journalctl --since "1 min ago" --until "now" -eu NODESERVICENAME | awk '/ping satellite failed/ {a=1}; END { if (a == 1){system("systemctl restart NODESERVICENAME.service")} }'

After all, with this hack in place on the source server, the nodes DO NOT GO OFFLINE now. Even if they do, it is for less than a second, during the service restart.

Again… Why this works:

awk uses a pattern-action paradigm. /ERROR/ && /ping satellite failed/ is the pattern (a line must match both parts) and {a=1} is the action. After awk has processed all the journalctl output for the node, the END section is executed. A simple if (a == 1) test determines whether one or more matches occurred, and if so, the systemctl restart NODESERVICENAME.service command is executed. The node is therefore restarted only when the error appeared in its log during the last minute.
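
To try the matching logic without restarting anything, here is a minimal sketch of the same awk test that reports through its exit status instead of calling systemctl. The function name needs_restart is hypothetical, and the sample lines are simplified stand-ins built from the error strings quoted in this thread, not real node log output.

```shell
#!/bin/sh
# Sketch only: same pattern-action test as the cron one-liner, but the END
# block sets the exit status instead of restarting the service.

# needs_restart (hypothetical name) reads log lines on stdin and succeeds
# (exit 0) only if at least one line contains both "ERROR" and
# "ping satellite failed".
needs_restart() {
    awk '/ERROR/ && /ping satellite failed/ {a=1}; END { exit (a == 1 ? 0 : 1) }'
}

# Simplified stand-in lines, NOT real node log output.
if printf '%s\n' 'ERROR piecestore download failed' | needs_restart; then
    echo "piecestore error: restart"
else
    echo "piecestore error: no restart"
fi

if printf '%s\n' 'INFO ok' 'ERROR contact:service ping satellite failed' | needs_restart; then
    echo "ping failure: restart"
else
    echo "ping failure: no restart"
fi
```

Once the test behaves as expected on sample input, the same pattern can go back into the cron line with the system("systemctl restart NODESERVICENAME.service") action in the END block.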