Repair queue has been increasing for 2 days. Any clue why?

Does it?

Has Storj finally lost some segments?

Thank you all for your help. The repair queue has been drained in the meantime.

There have been some reports about storage nodes being offline because of the impact of a garbage collection. I have filed a bug report for that: Garbage collection running with same IO priority as customer downloads, audits, repair · Issue #5349 · storj/storj · GitHub
In general, garbage collection can’t be the root cause. The online score you see on your dashboard might drop because audits are timing out, but as long as the hourly check-in is still contacting the satellite once per hour, the repair job will count you as online. So garbage collection alone can’t do it.
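Here is a minimal Go sketch of that distinction (hypothetical names like `NodeInfo` and `onlineForRepair`, not the actual satellite code): the repair job only cares about how recently the node checked in, not about its audit score.

```go
package main

import (
	"fmt"
	"time"
)

// NodeInfo is a hypothetical, simplified view of what the satellite
// tracks per node; the field names are assumptions, not the real schema.
type NodeInfo struct {
	LastContact time.Time // updated by the hourly check-in
	AuditScore  float64   // drops when audits time out, e.g. during garbage collection
}

// onlineForRepair sketches the point above: a node whose audits time out
// still counts as online for repair as long as it keeps checking in.
func onlineForRepair(n NodeInfo, window time.Duration) bool {
	return time.Since(n.LastContact) < window
}

func main() {
	node := NodeInfo{LastContact: time.Now().Add(-30 * time.Minute), AuditScore: 0.7}
	fmt.Println(onlineForRepair(node, 4*time.Hour)) // true, despite the low audit score
}
```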

My theory so far:
A short time before we sent out the garbage collection bloom filters, a few hundred nodes had been offline. The repair service queued up all of the segments that needed repair.
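Roughly what that queuing step looks like, as a simplified sketch (hypothetical types and a toy health model; the real segment-health calculation is more involved): segments whose healthy-piece count drops below the repair threshold get queued.

```go
package main

import "fmt"

// Segment is a hypothetical, simplified segment record.
type Segment struct {
	ID              string
	HealthyPieces   int // pieces on nodes currently considered online
	RepairThreshold int // below or at this count, the segment is queued for repair
}

// queueForRepair sketches the checker's behaviour: when a batch of nodes
// goes offline, the healthy-piece count of the segments they hold drops,
// and every segment at or below the threshold ends up in the repair queue.
func queueForRepair(segments []Segment) []Segment {
	var queue []Segment
	for _, s := range segments {
		if s.HealthyPieces <= s.RepairThreshold {
			queue = append(queue, s)
		}
	}
	return queue
}

func main() {
	segments := []Segment{
		{ID: "a", HealthyPieces: 60, RepairThreshold: 52},
		{ID: "b", HealthyPieces: 50, RepairThreshold: 52}, // its nodes went offline
	}
	fmt.Println(len(queueForRepair(segments))) // 1
}
```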

The reason we started ringing the alarm bell is that we thought our new garbage collection might have a problem. It was just the timing that was suspicious. If the repair queue had increased 2 days prior to garbage collection, we wouldn’t have worried about it.

The repair queue is ordered by segment health, so no matter how big the queue is, it will always repair the segment with the lowest segment health first. It is also acceptable for the repair job to temporarily fail to repair a segment; it will retry that segment a few hours later. No reason for panic.
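As a rough illustration of that ordering, here is a sketch using an in-memory min-heap with hypothetical field names (the real queue is a satellite database table, not this structure): the lowest-health segment always comes out first, and a failed attempt simply goes back into the queue.

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// queuedSegment is a hypothetical item in the repair queue.
type queuedSegment struct {
	ID          string
	Health      float64   // lower means more urgent
	AttemptedAt time.Time // zero until a repair attempt has failed
}

// repairQueue is a min-heap on Health, so Pop always returns the segment
// with the lowest segment health, regardless of how long the queue is.
type repairQueue []*queuedSegment

func (q repairQueue) Len() int            { return len(q) }
func (q repairQueue) Less(i, j int) bool  { return q[i].Health < q[j].Health }
func (q repairQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *repairQueue) Push(x interface{}) { *q = append(*q, x.(*queuedSegment)) }
func (q *repairQueue) Pop() interface{} {
	old := *q
	n := len(old)
	item := old[n-1]
	*q = old[:n-1]
	return item
}

func main() {
	q := &repairQueue{
		{ID: "a", Health: 3.1},
		{ID: "b", Health: 1.4},
		{ID: "c", Health: 2.2},
	}
	heap.Init(q)

	// The lowest-health segment is repaired first; if the attempt fails,
	// it is marked and pushed back so it can be retried later.
	next := heap.Pop(q).(*queuedSegment)
	fmt.Println(next.ID) // "b"
	next.AttemptedAt = time.Now()
	heap.Push(q, next)
}
```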

We noticed some inconsistencies in our metrics. We will continue looking into that and correct the metrics for next time. As far as we know we haven’t lost a segment, and it also doesn’t look like we have been close to that. We just overreacted here. Well, better to react too early than too late in this situation, so all good. Again, thank you so much for your help.
