There wasn’t a single GC run this week according to the built-in Prometheus metrics:
Usually my node dies on Wednesdays/Thursdays because of the GC, and there’s always an 8-9 hour run every Sunday morning.
oh, how do you monitor GC?
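One way to check, as a minimal sketch: assuming the node's debug endpoint serves Prometheus-format metrics and that the GC/retain series contain "retain" or "garbage" in their names (the port and those name filters are assumptions, not confirmed):

# Sketch: poll the storagenode's Prometheus-format metrics endpoint and print
# any series that look related to garbage collection / retain runs.
# METRICS_URL and the name substrings are assumptions; adjust to your setup.
import urllib.request

METRICS_URL = "http://localhost:5999/metrics"  # hypothetical debug/metrics address

def gc_related_metrics(url: str = METRICS_URL) -> list[str]:
    with urllib.request.urlopen(url, timeout=10) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return [
        line for line in text.splitlines()
        if not line.startswith("#") and ("retain" in line or "garbage" in line)
    ]

if __name__ == "__main__":
    for line in gc_related_metrics():
        print(line)

Pointing Prometheus at the same endpoint gives the week-over-week view the post above refers to.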
Why does it die? With what error?
My nodes died last Monday and again today at exactly the same time (2023-09-25T13:30:00Z), with the same errors you described in the other thread:
2023-09-25 03:35:12,662 INFO waiting for storagenode, processes-exit-eventlistener to die
2023-09-25 03:35:13,663 WARN killing 'storagenode' (57) with SIGKILL
2023-09-25 03:35:15,666 INFO waiting for storagenode, processes-exit-eventlistener to die
2023-09-25 03:35:18,670 INFO waiting for storagenode, processes-exit-eventlistener to die
2023-09-25 03:35:21,674 INFO waiting for storagenode, processes-exit-eventlistener to die
2023-09-25 03:35:23,676 WARN killing 'storagenode' (57) with SIGKILL
2023-09-25 03:35:24,678 INFO waiting for storagenode, processes-exit-eventlistener to die
2023-09-25 03:35:27,682 INFO waiting for storagenode, processes-exit-eventlistener to die
I wasn’t able to kill the container; only a reboot of the whole system helped. Since this happened two weeks in a row at exactly the same time, I don’t think it was a coincidence.
So there was a capacity issue with the database used by GC. We have now scaled up the storage and will continue to run GCs manually.