Why is there so much repair traffic, isn't that a danger to Storj's business model?

About 25% of the ingress traffic to my (relatively new) node is repair traffic. Repair traffic is something that Storj has to pay node operators for, but it isn't paid for by the customers. Isn't this high amount of repair traffic a danger to Storj's business model?
They are already overpaying node operators without counting the repair traffic. If at some point their pre-mined "free" STORJ tokens run out, they will have to drastically reduce the payments to node operators, which will lead to a lot of them leaving the network at the same time, again causing costly repair traffic.

We have increased the repair threshold recently; see the two linked posts.

I’ve seen that, but at this point I assumed all the data from Russia would have been “repaired” already. Is this still data from Russian nodes being re-created on non-Russian nodes?

Over the last week I had 337k GET_REPAIRs and only 82k GETs. It is indeed worrying.

Not sure why there's confusion about it; there are 1,400 nodes in Russia, and that's a lot of data that would need to be repaired.

jtolio’s post was 25 days ago. Moving even a few petabytes of data shouldn’t take that long with Storj’s massively distributed architecture.

There are about 3 PB of customer data stored which would need to be repaired. That shouldn’t take a month, especially since there is no single bottleneck; the non-Russian nodes just give each other the data. It should be a matter of days. If 10% of the nodes are in Russia, then I’d expect every non-Russian node to receive roughly 1/9th of the average capacity of a Russian node as repair ingress (since the data currently stored on Russian nodes gets redistributed among the non-Russian ones).
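
Rough back-of-envelope for that 1/9th figure (the 1,400 Russian nodes come from this thread; the 10% share and the 3 PB are assumptions, not official numbers):

```python
# Back-of-envelope only; all inputs are assumptions from this thread,
# not official Storj figures.

russian_nodes = 1_400                 # mentioned earlier in the thread
russian_share = 0.10                  # assumed ~10% of all nodes
data_on_russian_nodes_tb = 3_000      # assumed ~3 PB needing repair

total_nodes = russian_nodes / russian_share
other_nodes = total_nodes - russian_nodes

avg_per_russian_node_tb = data_on_russian_nodes_tb / russian_nodes
repair_ingress_per_other_node_tb = data_on_russian_nodes_tb / other_nodes

print(f"avg data per Russian node:     {avg_per_russian_node_tb:.2f} TB")
print(f"repair ingress per other node: {repair_ingress_per_other_node_tb:.2f} TB")
print(f"ratio: {repair_ingress_per_other_node_tb / avg_per_russian_node_tb:.3f} (~1/9)")
# Note: repair regenerates pieces rather than copying them 1:1, so this only
# approximates the volume of repair ingress.
```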

That’s not actually how it works; it doesn’t happen instantly and could take over a month, plus there is data already being repaired because of nodes being offline or suspended. There’s a lot more to it than just sending the data to new nodes.

Just think about how graceful exit works: it can take a long time depending on the amount of data, and that times 1,400 nodes.

But repair is a lot more complicated. With graceful exit, it’s just the node sending pieces directly to other nodes.

For repair, a Storj-managed repair worker needs to retrieve 29 pieces for each segment that needs repair, reconstruct the (encrypted) segment, and then generate a whole batch of new pieces, depending on how low the availability has gotten.
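
Conceptually, one repair pass over a segment looks something like the sketch below. The erasure-coding helpers and everything except the 29 are placeholders I made up, not Storj's actual code or any real library:

```python
# Conceptual sketch of a repair worker handling one segment.
# rs_decode / rs_encode / upload_piece are hypothetical stand-ins.

from typing import Callable, Dict

K = 29  # minimum pieces needed to reconstruct a segment

def repair_segment(
    healthy_pieces: Dict[int, bytes],                 # piece number -> retrievable piece
    rs_decode: Callable[[Dict[int, bytes]], bytes],   # K pieces -> encrypted segment
    rs_encode: Callable[[bytes], Dict[int, bytes]],   # segment -> full set of pieces
    upload_piece: Callable[[int, bytes], None],       # store a piece on a new node
) -> None:
    if len(healthy_pieces) < K:
        raise RuntimeError("segment unrecoverable: fewer than K pieces left")

    # 1. Download any K of the remaining healthy pieces, from whichever nodes hold them.
    subset = dict(list(healthy_pieces.items())[:K])

    # 2. Reconstruct the (still encrypted) segment from those K pieces.
    segment = rs_decode(subset)

    # 3. Regenerate the missing piece numbers and upload them to new nodes.
    all_pieces = rs_encode(segment)
    for piece_number in set(all_pieces) - set(healthy_pieces):
        upload_piece(piece_number, all_pieces[piece_number])
```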

In addition to that, I believe there is now soft geofencing in place that doesn’t remove pieces from nodes in the affected areas, but also doesn’t count them towards the repair threshold. As a result, segments will trigger repair sooner than normal, on top of the already higher repair threshold. So yeah… these situations cost money for Storj. But we’re probably still at the peak of that effect, and it will likely settle down once all repairs have caught up with the new situation.
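
If it does work that way, the health check would conceptually look like this (purely illustrative; the threshold value and the data shapes are my assumptions):

```python
# Illustrative only: counting "healthy" pieces while ignoring pieces held by
# nodes in soft-geofenced regions. The threshold value is an assumption.

from dataclasses import dataclass

REPAIR_THRESHOLD = 52  # assumed value; the real threshold was recently raised

@dataclass
class PieceLocation:
    node_id: str
    online: bool
    geofenced: bool  # node sits in a soft-geofenced region

def needs_repair(pieces: list[PieceLocation]) -> bool:
    # Geofenced pieces stay on their nodes but don't count as healthy,
    # so affected segments hit the repair threshold sooner.
    healthy = sum(1 for p in pieces if p.online and not p.geofenced)
    return healthy < REPAIR_THRESHOLD
```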

Yes, exactly. I'm just comparing it to something we all know, and we know it takes a long time, which means repairing takes a lot longer.

I understand, but it’s not necessarily true that it would take even longer. Graceful exit necessarily moves data from a single node, which then becomes the bottleneck. Repair can benefit from distributed load: while it will focus on repairing segments that have pieces stored on nodes in the affected areas, it can download pieces from any node storing data for those segments. This distributes the load over basically all nodes, across all segments that need repair. So the constraint is not so much a single node, but rather the number and speed of the repair workers. I believe Storj can fairly easily scale those up when needed.
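
A toy comparison of the two bottlenecks (every number below is an assumption picked for illustration, not a measured value):

```python
# Toy comparison: a single exiting node's upstream vs. a scalable repair fleet.
# All numbers are illustrative assumptions.

node_data_tb = 10             # data on one exiting node (assumed)
node_upstream_mbps = 40       # typical residential upload (assumed)

repair_workers = 50           # assumed repair-worker fleet size
worker_throughput_mbps = 500  # assumed effective throughput per worker
repair_total_tb = 3_000       # ~3 PB needing repair, from this thread

def days(tb: float, mbps: float) -> float:
    return (tb * 8e12) / (mbps * 1e6) / 86_400

print(f"graceful exit of one node: {days(node_data_tb, node_upstream_mbps):.1f} days")
print(f"repair fleet over ~3 PB:   {days(repair_total_tb, repair_workers * worker_throughput_mbps):.1f} days")
# Repair also has to download 29 pieces and upload the regenerated ones per
# segment, so "throughput" here is a simplification of the real pipeline.
```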

So yeah, different constraints determine how long it takes. In theory I think it can be done rather quickly, but that’s not really necessary in this case since availability of pieces is already quite high.

Lots of repairing can also hit bottlenecks on slower nodes; there's a lot to consider. We have seen that on here: people have complained about having a lot more IO than normal.

Sure, it can slow down specific nodes, but long-tail cancellation ensures it won't slow down the actual repair. Unless, of course, it gets so crazy that a significant number of nodes see problems, but they would need to be pushing repair a lot harder for that to happen.
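
Conceptually, long-tail cancellation just means starting more downloads than you need and cancelling the slowest ones. A sketch with a fake downloader (none of this is the actual uplink or repair code):

```python
# Sketch of long-tail cancellation: request more pieces than needed, keep the
# first K that finish, and cancel the slow stragglers.
# download_piece() is a stand-in, not a real Storj API.

import asyncio
import random

K = 29          # pieces needed to reconstruct a segment
OVERFETCH = 10  # extra requests started to absorb slow nodes (assumption)

async def download_piece(piece_number: int) -> tuple[int, bytes]:
    await asyncio.sleep(random.uniform(0.05, 1.0))  # simulated variable node speed
    return piece_number, b"piece-data"

async def fetch_with_long_tail_cancellation() -> list[tuple[int, bytes]]:
    tasks = [asyncio.create_task(download_piece(n)) for n in range(K + OVERFETCH)]
    done: list[tuple[int, bytes]] = []
    for finished in asyncio.as_completed(tasks):
        done.append(await finished)
        if len(done) == K:
            break
    for t in tasks:  # cancel the long tail of slow downloads
        if not t.done():
            t.cancel()
    return done

print(len(asyncio.run(fetch_with_long_tail_cancellation())), "pieces fetched")
```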