Throttling of data flow when disk cannot keep up

This comes from here:
https://forum.storj.io/t/initial-communication-to-node-operators-re-price-cost-change-feedback/13083/49?u=jammerdan

The problem is well known: SMR drives in particular can struggle to keep up with a constant data flow. Sometimes this even causes nodes to crash.
SNOs could stick to CMR drives, but the Storj narrative is: use what you have. So it is likely that new entrants in particular will do just that, which means SMR drives on nodes are here to stay.

I cannot suggest a technical solution, but the idea would be a way for the node to tell the satellite that it currently cannot keep up with the data flow, and to get a break to catch up instead of crashing.

Much needed in my opinion indeed!

I shared some thoughts on this in the past, there:

To sum up, I think what the node software could do is something along these lines (a rough sketch follows the list):

  • Monitor how much RAM is taken up by buffered pieces waiting to be written to disk.
  • Accept new pieces as long as this buffer does not go beyond a certain threshold.
  • This threshold could be set at 32MiB by default for instance, but it should be configurable so we can adjust it depending on how much RAM we can dedicate to our node.
  • Tell the satellite it cannot keep up if this buffer goes beyond the threshold, and maybe start rejecting incoming requests from clients.
  • Wait for the buffer to be empty, and tell the satellite we’re back in business and ready to accept new pieces again.
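To make the idea a bit more concrete, here is a minimal Go sketch of such a back-pressure mechanism. It is not actual storagenode code: the `uploadThrottle` type, its fields and methods are made up for illustration, and the real node would still have to do the satellite notification where the comments indicate it.

```go
package throttle

import "sync"

// uploadThrottle tracks how many bytes of uploaded pieces are still
// buffered in RAM, waiting to be flushed to disk.
// (Hypothetical type, not part of the actual storagenode code.)
type uploadThrottle struct {
	mu        sync.Mutex
	buffered  int64 // bytes currently buffered in RAM
	threshold int64 // e.g. 32 MiB by default, configurable by the SNO
	paused    bool  // true while we are asking the satellite to back off
}

// AcceptPiece is called before taking on a new upload. It returns false
// when accepting the piece would push the buffer over the threshold,
// meaning the node should reject the request and tell the satellite it
// cannot keep up right now.
func (t *uploadThrottle) AcceptPiece(size int64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.buffered+size > t.threshold {
		t.paused = true // the caller would notify the satellite here
		return false
	}
	t.buffered += size
	return true
}

// PieceFlushed is called once a buffered piece has been written to disk.
// It returns true when the buffer has fully drained, at which point the
// node can tell the satellite it is ready to accept new pieces again.
func (t *uploadThrottle) PieceFlushed(size int64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.buffered -= size
	if t.paused && t.buffered <= 0 {
		t.paused = false
		return true // the caller would notify the satellite here
	}
	return false
}
```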

That would prevent what’s currently happening, which is either:

  • the RAM consumption going up and up until it runs out, the node gets killed by the OOM killer and is then restarted by docker (which is a real problem, as it can cause many issues like DB corruption and piece file corruption, and it restarts the file walker process, which takes hours of high disk usage and is likely to degrade disk performance even further…).
  • or, the need to configure our nodes with a very low maximum number of concurrent requests, a setting that wasn’t designed to address this issue and that unnecessarily and massively throttles ingress even when the disk can keep up. Indeed, an SMR drive can typically absorb surges of massive ingress without any problem, as it uses its CMR cache section for that. It only struggles with long periods of massive ingress: only then would ingress need to be throttled down.

Of course, all of that is probably way more complicated to implement in practice than my bullet point list above suggests, but as @jammerdan said, SMR drives are everywhere and probably here to stay.
So I think they’re right: the node software needs to handle this by itself :slight_smile:

I suggested something like this a long time ago in a topic that is now locked.

Since then some things have changed: we now have a node selection cache that may make it easier to keep track of some node stats without having to do constant database updates. I would specifically point to options 3 and 4 offered there.

  1. Add settings for a maximum number of uploads per minute and a maximum number of downloads per minute. These settings are then advertised to the satellites. Whenever a node is selected for a transfer, that gets logged in the satellite’s DB, and whenever nodes need to be selected for a transfer, nodes that have reached their maximum threshold over the past minute are excluded from selection. The downside is that this doesn’t exactly limit the number of concurrent requests, but rather the number of started requests. Additionally, this would be a limit per satellite: if the number of active satellites changes a lot, SNOs might need to adjust this setting accordingly. The upload rate limit would be the most important one, because you ideally don’t want to limit downloads, as that directly impacts how much your node makes. But I would still implement a download limit as well, to account for SNOs with slow upload connections. After all, upload speed is often drastically lower than download speed.
  2. @Pentium100: If a node is overloaded (as determined by a SNO-set concurrent request limit, CPU iowait limit or unix load limit), the node contacts each satellite and informs it that it is overloaded. The satellite then reduces the node’s performance coefficient by some amount.
    When the load drops below a lower limit and stays there for 10 minutes, the node contacts the satellite and informs it that it is almost idle. The satellite increases the performance coefficient by some amount that is smaller than the decrease.
    The coefficient is used when selecting a node. If a node would be selected, a random value between 0 and 1 is generated. If the value is greater than the coefficient, some other node is selected instead (a rough sketch of this coefficient mechanism follows the list).
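To illustrate how option 2 could play out on the satellite side, here is a rough Go sketch assuming a per-node coefficient kept alongside the node selection data. The step sizes, the floor value and the function names are all assumptions for illustration, not anything from the actual satellite code.

```go
package selection

import "math/rand"

// nodeRecord is a hypothetical cache entry holding the performance
// coefficient described in option 2.
type nodeRecord struct {
	id          string
	coefficient float64 // 1.0 = fully eligible, lower = selected less often
}

// onOverloaded is called when a node reports it cannot keep up: the
// coefficient drops by a fixed step, with a small floor so the node is
// never locked out entirely.
func onOverloaded(n *nodeRecord) {
	n.coefficient -= 0.2
	if n.coefficient < 0.05 {
		n.coefficient = 0.05
	}
}

// onIdle is called when a node reports it has been almost idle for a
// while: the coefficient recovers by a smaller step, capped at 1.0.
func onIdle(n *nodeRecord) {
	n.coefficient += 0.05
	if n.coefficient > 1.0 {
		n.coefficient = 1.0
	}
}

// pickNode draws candidates until one passes the coefficient check: a
// node is kept only if a random value in [0,1) falls below its
// coefficient, so overloaded nodes are skipped more often.
func pickNode(candidates []*nodeRecord) *nodeRecord {
	for {
		n := candidates[rand.Intn(len(candidates))]
		if rand.Float64() < n.coefficient {
			return n
		}
	}
}
```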

I think updating a node selection coefficient or a simple selection count in the node selection cache may be a lot easier than it was when this data was all still in databases.

A simple version could be that the node just advertises a maximum number of transfers per minute. When the cache is built, this number can be translated into a number of transfers per cache cycle, and the node can simply be removed from the cache when the limit is reached.
Alternatively, the selection process could keep track of a coefficient that drops every time the node is selected. So you start at 100%, and if your node can only be selected 500 times during the cache window, every selection drops the coefficient by 0.2%, lowering the chance of the node being selected again after that. A rough sketch of the per-cycle budget variant is below.
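Here is what that per-cycle budget could look like as a Go sketch. The cache layout, cycle length and helper names are made up for illustration; the coefficient variant would simply replace the integer budget with a float that drops by a fixed percentage on every selection.

```go
package cache

// cachedNode is a hypothetical node selection cache entry with a
// per-cycle selection budget derived from the node's advertised limit.
type cachedNode struct {
	id        string
	remaining int // selections left in the current cache cycle
}

// newCacheEntry converts a node's advertised maximum transfers per
// minute into a budget for one cache cycle of cycleSeconds.
func newCacheEntry(id string, maxPerMinute, cycleSeconds int) cachedNode {
	return cachedNode{
		id:        id,
		remaining: maxPerMinute * cycleSeconds / 60,
	}
}

// selectNodes hands out up to `want` node IDs and drops any node whose
// budget is exhausted, so it simply stops being selectable until the
// cache is rebuilt. (Real selection would be randomized; this only shows
// the bookkeeping.)
func selectNodes(nodes []cachedNode, want int) (picked []string, kept []cachedNode) {
	for _, n := range nodes {
		if len(picked) < want && n.remaining > 0 {
			picked = append(picked, n.id)
			n.remaining--
		}
		if n.remaining > 0 {
			kept = append(kept, n)
		}
	}
	return picked, kept
}
```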