Repair queue has been increasing for 2 days. Any clue why?

I'm wondering if this is just GC… I have a node which has been full for 6+ months now.
Yesterday morning, starting at 4 AM, it became unavailable for about 4 hours according to my uptime monitor. It is an RPi3B with a good old WD RED HDD connected through USB.
I checked it as soon as it became available again and it had 70GB in trash. Before that, it was always around 2-3GB max.
As the node was full, it did not have any significant ingress over the past days, certainly not 70GB…
The node is otherwise in perfect condition, no errors in the log.

GC has been off for a while. The fact that your trash increased a lot only confirms that that’s likely what it was.

2 Likes

And how much data was deleted during that?

So as @BrightSilence warned me, my node is being hammered again. Hopefully not for as long as last time.
Thanks for the warning that this would happen again, I’ll just sit back and wait :slight_smile:

(When I win Euromillions I’ll replace all the spinning rust with SSDs)

Maybe not just the bloom filter? I think it would be generally great if the node software were smart enough to recognize when resource-intensive activity can be performed and when not. Things like the file walker or GC should not happen while the node is busy serving downloads and uploads. And if they must run, maybe the software should be smart enough to throttle itself to reduce the overall load on the node.
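
Just to illustrate the kind of thing I mean (a rough sketch, not actual storagenode code; names like `activeTransfers` and `walkOnePiece` are made up): gate each filewalker step through a rate limiter and slow it right down whenever the node is actively serving transfers.

```go
// Rough sketch only: throttle the filewalker while customer transfers are active.
// activeTransfers and walkOnePiece are hypothetical names for illustration.
package walker

import (
	"context"
	"sync/atomic"

	"golang.org/x/time/rate"
)

// Incremented/decremented by the (hypothetical) upload/download handlers.
var activeTransfers atomic.Int64

func walkPieces(ctx context.Context, pieces []string, walkOnePiece func(string) error) error {
	limiter := rate.NewLimiter(rate.Limit(200), 1) // ~200 piece operations/s when the node is idle

	for _, p := range pieces {
		// Back off while customers are being served.
		if activeTransfers.Load() > 0 {
			limiter.SetLimit(rate.Limit(20))
		} else {
			limiter.SetLimit(rate.Limit(200))
		}
		if err := limiter.Wait(ctx); err != nil {
			return err // context cancelled
		}
		if err := walkOnePiece(p); err != nil {
			return err
		}
	}
	return nil
}
```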

5 Likes

Well, the network seems to be coping OK with this blip, so it's probably not worth spending a lot of developer time implementing a solution to a minimal problem…

Repair is costly though, and individual nodes do seem to suffer. It’s worth preventing nodes from toppling during GC. Though I agree that these runs are abnormally large.

2 Likes

GC and filewalker probably should run with reduced priority so that regular requests can be served.

3 Likes

That is possibly a better solution, yes…

At least that would be a good start. Maybe it would even be sufficient.
But Storj should do something. :grimacing:

While this process is rare, it might be worth introducing limits on the filewalker and GC. My NAS has suffered quite a bit…

Looks like it’s eu1’s turn today. I hope nodes are holding up ok.

Not here they’re not…

1 Like

Not all nodes get bloomfilters at exactly the same time, but I have several nodes running GC for EU1 right now.

Most likely it was not counted as “stored”. This came about when @BrightSilence noticed a discrepancy between the amount of data the node says it has and the amount the satellite says the node has.

For some reason my node was not as affected by this - before I had about 50-70GB of trash, right now ~170GB. This is with 24.5TB stored.

It wasn’t paid. GC cleans up pieces on your node that aren’t accounted for on the satellite end. Your node shouldn’t have them to begin with. So no, it wasn’t included in the storage graph or payout overview. But it was included in the pie chart, as that uses your node’s local storage totals.

2 Likes

As you said, the hammering lasted much less time than the first round. Everything is nominal again. :slightly_smiling_face:

Makes me wonder whether I should move my nodes to a higher-specced machine, although if this was just a one-off I’ll probably just wait and see… :thinking:

This time it was likely just a different satellite, EU1 most likely. New runs from the same satellite normally won’t happen within the same week. I wouldn’t worry about the spec; this shouldn’t normally happen to this extent.

2 Likes

Does it?

Has Storj finally lost some segments?

Thank you all for your help. The repair queue has been drained in the meantime.

There have been some reports about storage nodes being offline because of the impact of a garbage collection. I have filed a bug report for that: Garbage collection running with same IO priority as customer downloads, audits, repair · Issue #5349 · storj/storj · GitHub
In general, garbage collection can’t be the root cause. The online score you see on your dashboard might drop because audits are timing out, but as long as the hourly check-in is still contacting the satellite once per hour, the repair job will count you as online. So garbage collection alone can’t do it.
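
For what it’s worth, the knob that issue refers to is the Linux I/O scheduling class. A minimal sketch of what dropping GC disk work to the idle class could look like (illustrative only, not Storj code; note that `ioprio_set` with `who = 0` affects only the calling thread, so a Go service would need `runtime.LockOSThread` or a process-wide setting):

```go
// Illustrative sketch only (not Storj code): move the calling thread into the
// Linux "idle" I/O scheduling class before doing GC disk work, so customer
// reads keep priority. Constants mirror the kernel's include/linux/ioprio.h.
package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

const (
	ioprioWhoProcess = 1  // who is a thread/process ID; 0 means the calling thread
	ioprioClassIdle  = 3  // only touch the disk when nobody else wants it
	ioprioClassShift = 13 // the class lives in the top bits of the priority value
)

func setIdleIOPriority() error {
	prio := ioprioClassIdle << ioprioClassShift
	_, _, errno := unix.Syscall(unix.SYS_IOPRIO_SET, ioprioWhoProcess, 0, uintptr(prio))
	if errno != 0 {
		return fmt.Errorf("ioprio_set: %w", errno)
	}
	return nil
}

func main() {
	// Pin the goroutine to one OS thread so the thread-level ioprio sticks.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	if err := setIdleIOPriority(); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("GC disk work on this thread now runs at idle I/O priority")
	// ... garbage collection walk would go here ...
}
```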

My theory so far:
A short time before we sent out the garbage collection bloom filters, a few hundred nodes had been offline. The repair service queued up all of the segments that needed repair.

The reason we started ringing the alarm bell is that we thought our new garbage collection might have a problem. It was just the timing that was suspicious. If the repair queue had increased 2 days prior to garbage collection, we wouldn’t have worried about it.

The repair queue is ordered by segment health, so no matter how big the queue is, it will always repair the segment with the lowest health first. It is also acceptable for the repair job to momentarily fail to repair a segment; it will retry that segment a few hours later. No reason to panic.
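
In concrete terms, that behaviour boils down to a priority queue keyed on health. A toy sketch (illustrative only, not the satellite’s actual implementation) could look like this:

```go
// Toy sketch of a health-ordered repair queue (illustrative only, not the
// satellite's real implementation): a min-heap keyed on segment health, so
// the unhealthiest segment is always repaired first.
package repairqueue

import "container/heap"

type Segment struct {
	StreamID string
	Health   float64 // lower = fewer healthy pieces left, repair sooner
}

type queue []Segment

func (q queue) Len() int           { return len(q) }
func (q queue) Less(i, j int) bool { return q[i].Health < q[j].Health }
func (q queue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *queue) Push(x any)        { *q = append(*q, x.(Segment)) }
func (q *queue) Pop() any {
	old := *q
	n := len(old)
	s := old[n-1]
	*q = old[:n-1]
	return s
}

// Enqueue adds a segment; a segment whose repair failed can simply be
// re-enqueued and will come up again once it is the unhealthiest one left.
func Enqueue(q *queue, s Segment) { heap.Push(q, s) }

// Next returns the segment with the lowest health, regardless of queue size.
func Next(q *queue) Segment { return heap.Pop(q).(Segment) }
```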

We noticed some inconsistencies in our metrics. We will continue looking into that and correct the metrics for next time. As far as we know, we haven’t lost a segment, and it doesn’t look like we came close to that. We just overreacted here. Well, better to react too early than too late in this situation, so all good. Again, thank you so much for your help.

12 Likes