First off, let me start by saying this isn't a complaint, more of a curious observation from the past couple of months. I've been monitoring my nodes, and I've spotted something a bit quirky: on average, my repair egress is actually higher than my download usage!
Now, I’m not saying the sky is falling, but it does make me wonder—what’s going on here? It seems a little odd, doesn’t it? You’d think that in a healthy network, download egress would outpace repairs, right?
I’ve been checking the usual suspects here—no massive decrease in nodes or a sudden wave of suspensions. So, what gives? Is this just a temporary blip, or is there something deeper at play here that’s causing this uptick in repairs?
I’d love to hear your thoughts on what might be the reasons. Let’s keep the discussion lighthearted and constructive—after all, we’re all in this together!
Thanks for tossing in your 1/2 cent! You might be onto something with the test data theory. I’ve noticed this repair surge seems to coincide with the influx of test TTL data, so there could be a connection there. But why exactly would the test data traffic jam push repairs into the fast lane?
Repair traffic is unavoidable, while download traffic depends on the customers. So even in a healthy network it is possible to have more repair traffic than customer downloads.
True, that makes sense! So repair traffic is a given, while customer usage can vary, meaning they aren’t directly related. But I’m still curious about what’s driving this increase in repair activity. Any thoughts?
Repairs happen when SNOs that promised to keep data available don't actually make it available. More repairs come from more unreliable nodes.
Some of that will always be the churn of nodes simply leaving… but since our Active Node count still keeps creeping higher, it's not the main reason for repairs.
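To make the mechanism concrete, here's a minimal sketch (in Go, since that's what the node software is written in) of how a repair gets triggered. The threshold number is a made-up placeholder, not a production value; the real thresholds come from the per-satellite Reed-Solomon redundancy settings:

```go
package main

import "fmt"

// Illustrative repair-trigger check. The threshold below is hypothetical,
// not the production value; the real ones are part of the satellite's
// Reed-Solomon redundancy scheme.
type segment struct {
	healthyPieces int // pieces still held by reliable, online nodes
}

const repairThreshold = 35 // hypothetical: below this, repair is queued

func needsRepair(s segment) bool {
	// Every node that goes offline, gets disqualified, or exits lowers
	// healthyPieces; once a segment crosses the threshold, pieces are
	// downloaded and rebuilt, generating repair egress regardless of
	// customer activity.
	return s.healthyPieces < repairThreshold
}

func main() {
	fmt.Println(needsRepair(segment{healthyPieces: 40})) // false: still healthy
	fmt.Println(needsRepair(segment{healthyPieces: 33})) // true: queued for repair
}
```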
Egress from a node is download, ingress into a node is upload.
All the wording is from the customers' point of view, so the title is a little bit confusing: both words, "egress" and "downloads", mean the same thing in the Storj network, so I guess you mean either egress vs. ingress or downloads vs. uploads.
Thanks for your reply, Alexey. I'm talking only about egress here: I'm comparing the amount of repair egress against the customer usage egress. Would the topic title be easier to understand if I changed the word download to customer usage?
No, not needed; as I said, it was only a little bit confusing to me. It would be nice to have them aligned, but it's not required. You just need to read not only the title but also the message.
Otherwise, the title describes your observation exactly.
Graceful exit on a targeted satellite sounds plausible, but I'm not sure: that's a very delicate action, given how some operators are not even aware of how much they should have been paid. And for that traffic to compete with real traffic, it would need to happen at scale too.
We could run a poll to see who has already done a targeted graceful exit and who hasn't.
Some Operators have had an issue with a high load from the customers of the Saltlake satellite, so they decided to exit it.
The only problem here is that customers with a similar pattern would also use the production satellites. So, not a solution.
Unpaid data from Saltlake (now fixed) was crowding out the ingress of real customer data because nodes were full. Maybe someone gave up waiting for the fix, hoping that by the time that kind of data came to production, the fix would have arrived.
We have a fix now and are ready for those kinds of patterns in production without problems. Maybe some are thinking they should have exited earlier, given that their nodes are now forced to delete a lot of accumulated data that should have been deleted gradually over this period.
Perhaps it's not ethical to gracefully exit or untrust the test satellite at this stage of the project.
From what I can see, it shows an overall increase in nodes for SLC, while the number of exited nodes stayed the same. However, there’s also been an uptick in disqualified nodes.
So maybe the repair traffic isn’t so much due to graceful exits, but rather nodes getting disqualified because they couldn’t handle the test data load?
Disqualification usually happens because of lost or corrupted data, or because of being offline for too long (blocking access from the satellite can have a similar effect). Disqualification for being too slow is very rare (I have no evidence of it so far); the only impact of that I saw was there:
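For intuition on why lost or corrupted data leads to disqualification, here's a rough sketch of a decaying audit score in the spirit of the satellite's alpha/beta reputation model. All the constants (forgetting factor, weight, threshold) are assumptions for illustration, not the real satellite settings:

```go
package main

import "fmt"

// Rough sketch of a decaying audit reputation score; the constants are
// illustrative assumptions, not the production values.
type reputation struct {
	alpha, beta float64
}

const (
	lambda      = 0.95 // forgetting factor: how fast old audits fade (assumed)
	weight      = 1.0  // weight of each new audit result (assumed)
	dqThreshold = 0.6  // disqualify below this score (assumed)
)

func (r *reputation) audit(success bool) {
	r.alpha *= lambda
	r.beta *= lambda
	if success {
		r.alpha += weight
	} else {
		r.beta += weight
	}
}

func (r reputation) score() float64 { return r.alpha / (r.alpha + r.beta) }

func main() {
	r := reputation{alpha: 20, beta: 0} // a node with a clean audit history
	for i := 0; i < 10; i++ {
		r.audit(false) // every audit fails, e.g. the data was lost
		fmt.Printf("audit %d: score %.3f (disqualified: %v)\n",
			i+1, r.score(), r.score() < dqThreshold)
	}
}
```

With these assumed constants, the score drops steadily with each failed audit and crosses the disqualification threshold around the tenth failure; being slow, by contrast, doesn't fail audits at all, which matches why slowness alone rarely disqualifies anyone.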
I see huge repair traffic in the last months, even on nodes with no SL satellite. Many nodes quitting? Until May, repair barely reached 1 TB/month across my 17 nodes.
This is how it looks over the last months:
month 5: 2.73 TB
month 6: 4.45 TB
month 7: 5.56 TB
month 8: 9.63 TB
The oldest nodes have seen the biggest traffic, with a 970 GB peak on one node last month.
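For context, a quick sketch computing the month-over-month growth those figures imply (numbers taken from the list above):

```go
package main

import "fmt"

func main() {
	// Repair egress per month across the 17 nodes, from the post above.
	months := []int{5, 6, 7, 8}
	repairTB := []float64{2.73, 4.45, 5.56, 9.63}

	for i := 1; i < len(repairTB); i++ {
		growth := (repairTB[i] - repairTB[i-1]) / repairTB[i-1] * 100
		fmt.Printf("month %d -> %d: %.2f TB -> %.2f TB (%+.0f%%)\n",
			months[i-1], months[i], repairTB[i-1], repairTB[i], growth)
	}
}
```

That works out to roughly +63%, +25%, and +73% month over month, so the jump from month 7 to 8 is the steepest of the whole stretch.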