Throttling of data flow when disk cannot keep up

I suggested something like this a long time ago in a topic that is now locked.

Since then some things have changed: we now have a node selection cache that may make it easier to keep track of some node stats without having to do constant database updates. I would specifically point to options 3 and 4 offered there.

  1. Add settings for the maximum number of uploads per minute and the maximum number of downloads per minute. These settings are then advertised to the satellites. Whenever a node is selected for a transfer, that gets logged in the satellite’s DB, and whenever nodes need to be selected for a transfer, nodes that have reached their threshold over the past minute are excluded from selection. The downside is that this doesn’t exactly limit the number of concurrent requests, only the number of started requests. Additionally, this would be a limit per satellite, so if the number of active satellites changes a lot, SNOs might need to adjust this setting accordingly. The upload rate limit would be the most important one, because you ideally don’t want to limit downloads, as that directly impacts how much your node makes. But I would still implement a download limit as well, to account for SNOs with slow upload connections; after all, upload speed is often drastically lower than download speed.
  2. @Pentium100: If a node is overloaded (as determined by the SNO via a concurrent request limit, a CPU iowait limit, or a Unix load limit), the node contacts each satellite and informs it that it is overloaded. The satellite then reduces the node’s performance coefficient by some amount.
    When the load drops below a lower limit and stays there for 10 minutes, the node contacts the satellite and informs it that it is almost idle. The satellite increases the performance coefficient by some amount that is smaller than the decrease.
    The coefficient is used when selecting a node. If a node would be selected, a random value between 0 and 1 is generated; if the value is greater than the coefficient, some other node is selected instead. (A rough sketch of this selection logic follows after this list.)
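
To make option 2 a bit more concrete, here is a rough Go sketch of what the satellite side might look like. Everything here is an assumption for illustration: the `nodeState` type, the step sizes (−10% on overload, +2% on recovery) and the bounded retry loop are made up and not taken from the actual satellite code.

```go
package nodeselection

import "math/rand"

// nodeState is a hypothetical per-node record kept by the satellite.
type nodeState struct {
	ID          string
	Coefficient float64 // 1.0 = full selection probability
}

// onOverloaded is what the satellite could do when a node reports overload.
func onOverloaded(n *nodeState) {
	n.Coefficient -= 0.10 // assumed decrease step
	if n.Coefficient < 0.05 {
		n.Coefficient = 0.05 // keep a small floor so the node can still recover
	}
}

// onRecovered is called when the node reports ~10 minutes of low load.
// The increase is smaller than the decrease, as proposed above.
func onRecovered(n *nodeState) {
	n.Coefficient += 0.02
	if n.Coefficient > 1.0 {
		n.Coefficient = 1.0
	}
}

// pickNode keeps a candidate only if a random draw falls under its coefficient;
// otherwise it retries with another candidate.
func pickNode(candidates []*nodeState) *nodeState {
	for i := 0; i < 20; i++ { // bounded retries so selection stays cheap
		c := candidates[rand.Intn(len(candidates))]
		if rand.Float64() < c.Coefficient {
			return c
		}
	}
	// fall back to an arbitrary candidate rather than failing the request
	return candidates[rand.Intn(len(candidates))]
}
```

The bounded retry loop is just one way to handle the case where many nodes are throttled at once; the point is only that the coefficient can be applied at selection time without any extra database round trips.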

I think updating a node selection coefficient or a simple selection count in the node selection cache may be a lot easier than it was when this data was all still in databases.

A simple version could be that the node provides a maximum number of transfers per minute. When the cache is built, this number can be translated into a number of transfers per cache cycle, and the node can simply be removed from the cache when that limit is reached.
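
A minimal sketch of that counter-in-the-cache variant, assuming a hypothetical cache build step; `cachedNode`, `buildCache` and `markSelected` are invented names for illustration, not the real selection cache code.

```go
package nodeselection

// cachedNode is a hypothetical entry in the node selection cache.
type cachedNode struct {
	ID        string
	Remaining int // selections left in this cache cycle
}

// buildCache converts each node's advertised per-minute limit into a
// per-cycle budget when the selection cache is (re)built.
func buildCache(limitsPerMinute map[string]int, cacheMinutes int) []*cachedNode {
	nodes := make([]*cachedNode, 0, len(limitsPerMinute))
	for id, perMinute := range limitsPerMinute {
		nodes = append(nodes, &cachedNode{ID: id, Remaining: perMinute * cacheMinutes})
	}
	return nodes
}

// markSelected decrements the budget and reports whether the node should be
// dropped from the cache because it hit its limit for this cycle.
func markSelected(n *cachedNode) (exhausted bool) {
	n.Remaining--
	return n.Remaining <= 0
}
```
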
Alternatively, the selection process could keep track of a coefficient that drops every time the node is selected. You would start at 100%, and if your node can only be selected 500 times during the cache window, every selection drops the coefficient by 0.2% (1/500), lowering the chance of the node being selected again after that.
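
And a sketch of that decaying-coefficient variant using the same 500-selections example; again, the names and structure are my assumptions, not real code.

```go
package nodeselection

import "math/rand"

// weightedNode is a hypothetical cache entry with a decaying selection weight.
type weightedNode struct {
	ID          string
	Coefficient float64 // starts at 1.0 (100%)
	decay       float64 // 1 / allowed selections per cache window
}

// newWeightedNode derives the decay from the per-window limit,
// e.g. a limit of 500 gives a decay of 0.002 (0.2%) per selection.
func newWeightedNode(id string, perWindowLimit int) *weightedNode {
	return &weightedNode{ID: id, Coefficient: 1.0, decay: 1.0 / float64(perWindowLimit)}
}

// trySelect accepts the node with probability equal to its current coefficient
// and lowers the coefficient on every successful selection.
func trySelect(n *weightedNode) bool {
	if rand.Float64() >= n.Coefficient {
		return false // skip this node this time; the caller picks another
	}
	n.Coefficient -= n.decay
	if n.Coefficient < 0 {
		n.Coefficient = 0
	}
	return true
}
```

Compared to the hard removal above, this version degrades a busy node's selection chance gradually instead of cutting it off at the limit.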