Continuous piece download attempt saturating bandwidth

I tried to raise this issue in the past. I believe that, in extreme cases, overloading the small number of nodes that hold pieces of the same very popular file may lead to a cascading failure where the file becomes unavailable to customers.

First, the slowest node gets hit by too many requests. Assuming bandwidth is shared equally across requests, even a node fast enough to serve a small number of requests will fail most, if not all of them when there are too many, because every request is served at a similarly slow speed. The load then spreads to other nodes: ones that would normally serve just their fair share of traffic now have to serve more of it, potentially overwhelming them as well. In effect, even though there are nodes that could have been serving traffic, most of their bandwidth is spent on downloads that will inevitably be canceled.
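A minimal sketch of the fair-share effect described above. The numbers (a 100 Mbps uplink and 5 Mbps needed per download to finish before the client cancels in favor of faster nodes) are made up purely for illustration:

```go
package main

import "fmt"

func main() {
	const uplinkMbps = 100.0 // node's total upload capacity (assumed)
	const neededMbps = 5.0   // per-download rate needed to finish before cancelation (assumed)

	for _, concurrent := range []int{10, 20, 50, 100} {
		perDownload := uplinkMbps / float64(concurrent)
		ok := perDownload >= neededMbps
		fmt.Printf("%3d concurrent downloads -> %.1f Mbps each, fast enough: %v\n",
			concurrent, perDownload, ok)
	}
}
```

With these numbers, 10 or 20 concurrent downloads all succeed, but at 50 or 100 every single one drops below the useful rate, so the node burns its whole uplink and still serves nothing.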

Obviously, as a storage node operator I would much prefer to handle only downloads that have a high chance of succeeding. This is difficult to predict, though, so in the post linked above I suggested tracking available bandwidth and not committing to serve a download until the node is sure it can do so at a decent speed.
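A rough sketch of that idea, not the actual storagenode code: track the bandwidth already committed to in-flight downloads and refuse new ones that could not be served at an acceptable rate. All names and numbers here are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type BandwidthGate struct {
	mu            sync.Mutex
	totalMbps     float64 // node's measured or configured uplink capacity
	committedMbps float64 // bandwidth reserved for downloads currently in flight
	minRateMbps   float64 // minimum rate worth committing to (below this, likely canceled)
}

// TryAccept reserves minRateMbps for a new download, or rejects it if the
// remaining capacity could not sustain another transfer at a useful speed.
func (g *BandwidthGate) TryAccept() error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.totalMbps-g.committedMbps < g.minRateMbps {
		return errors.New("node busy: cannot serve this download at a useful speed")
	}
	g.committedMbps += g.minRateMbps
	return nil
}

// Release returns the reserved bandwidth when a download finishes or is canceled.
func (g *BandwidthGate) Release() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.committedMbps -= g.minRateMbps
}

func main() {
	gate := &BandwidthGate{totalMbps: 100, minRateMbps: 5}
	accepted, rejected := 0, 0
	for i := 0; i < 50; i++ {
		if err := gate.TryAccept(); err != nil {
			rejected++
		} else {
			accepted++
		}
	}
	fmt.Printf("accepted %d downloads, rejected %d\n", accepted, rejected)
}
```

The point is that the rejected requests fail immediately and can be retried against other nodes, instead of tying up bandwidth on transfers that would have been canceled anyway.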

Right now we seem to have plenty of nodes with very high bandwidth, making this scenario very unlikely, so I don't think there is any urgency to act on it. On average, even low-bandwidth nodes can serve most traffic, and during peaks (or when nodes are running maintenance procedures like the file walkers) we still seem to have a lot of high-bandwidth nodes. But if the need arises, I suspect this suggestion might be cheaper to implement than storing more pieces of popular files.