Today I got “node suspended” emails for 3 satellites, and the dashboard shows that I’m suspended on 5 satellites. But if I search the log for “GET_AUDIT” and “failed”, the last errors are from 2020-06-24. I did get some “database locked” errors today, but no failed audits. (Update below)
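For anyone who wants to repeat that log search, here is a rough Python sketch of it. The helper name `last_matching_line` is made up for illustration, and it assumes a plain-text log with one entry per line; it only does substring matching and knows nothing about the node's actual log format.

```python
from typing import Iterable, Optional


def last_matching_line(lines: Iterable[str], *needles: str) -> Optional[str]:
    """Return the last line that contains every needle, or None if none does."""
    last = None
    for line in lines:
        # Plain substring match, the same idea as chaining two greps.
        if all(needle in line for needle in needles):
            last = line.rstrip("\n")
    return last
```

Usage would be something like `last_matching_line(open("node.log", errors="replace"), "GET_AUDIT", "failed")`, with the path adjusted to wherever the log is actually written.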
When I tried to stop the node to check the databases, the hard disk with the node’s data seemed to have locked up completely; I couldn’t even write an empty file to it. Some db-wal files were huge: as far as I remember, usedserial.db was about 90MB, but usedserial.db-wal was about 1GB. Even though top and iotop didn’t show anything unusual, it looks like the system couldn’t cope with the database files being this big.
So I had to reboot the server and after that I checked and vacuumed the database. There were no errors.
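In case it helps someone else, this is roughly what a check-and-vacuum pass looks like as a Python sketch using the standard `sqlite3` module. The function name `check_and_vacuum` is hypothetical, and it assumes the node is stopped so that nothing else holds the database open; it is not necessarily the exact procedure I ran.

```python
import sqlite3


def check_and_vacuum(db_path: str) -> list:
    """Run an integrity check; if it passes, checkpoint the WAL and VACUUM."""
    # isolation_level=None means autocommit; VACUUM cannot run inside a transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        result = [row[0] for row in conn.execute("PRAGMA integrity_check;")]
        if result == ["ok"]:
            # Fold the -wal file back into the main database and truncate it.
            conn.execute("PRAGMA wal_checkpoint(TRUNCATE);").fetchall()
            # Rebuild the database file to reclaim free pages.
            conn.execute("VACUUM;")
        return result
    finally:
        conn.close()
```

A healthy database returns `["ok"]`; anything else is the list of problems reported by SQLite, and in that case the sketch leaves the file untouched.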
Update: I just went through the logs and noticed that almost 6 hours of logs are missing. I run the dashboard and “tail -f node.log” in a tmux session, and there was definitely log output around that time. I guess it couldn’t actually be written to disk because of the lockup.
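A gap like that can be found without scrolling through the file by hand. The sketch below is only an illustration: `find_gaps` is a made-up name, and it assumes every log line starts with a timestamp of the form `2020-06-24T12:34:56`, which may not match every log setup. Lines that don't start that way are skipped.

```python
from datetime import datetime, timedelta


def find_gaps(lines, min_gap=timedelta(hours=1)):
    """Return (before, after) timestamp pairs where the log is silent for min_gap or longer."""
    gaps = []
    prev = None
    for line in lines:
        try:
            # Assumed format: the first 19 characters are YYYY-MM-DDTHH:MM:SS.
            ts = datetime.strptime(line[:19], "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # Not a timestamped line (e.g. a stack trace continuation).
        if prev is not None and ts - prev >= min_gap:
            gaps.append((prev, ts))
        prev = ts
    return gaps
```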
Since then there have been successful audits on all satellites and data is being uploaded again, but the web dashboard still says suspended on 5 satellites. How long will the suspension keep showing after the problem has been solved? I would hate to lose this node 2 days before it completes the 15 months…