Limit node transfers through node selection

Problem

Some nodes are running into problems where a high amount of traffic hits a bottleneck. This bottleneck could be related to disk IO, like SMR disks running into their write limitations, slow older disks, or USB 2.0 disks. It can leave the node hardware unbearably slow and unresponsive at times. It also has an impact on success rates, with these nodes failing more transfers than average. The bottleneck could also be the internet connection speed, in which case it can bring the connection to a grinding halt for all devices on the network.

Current “solution”

Currently many nodes use the max-concurrent-requests setting to work around these issues, despite advice to the contrary. They use this setting because it's the only way for them to solve these problems other than stopping their node altogether. However, this setting gets applied too late in the process. The node has already been selected for the upload, and the satellite and uplink count on enough of the selected nodes finishing that upload for it to be successful. If those thresholds aren't reached, customers (uplinks) could run into errors and uploads might fail.

    Failed to copy: uplink: segment error: ecclient error: successful puts (79) less than success threshold (80)
    Failed to copy: uplink: segment error: ecclient error: successful puts (29) less than or equal to repair threshold (35)
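
For reference, this is the setting in question; on my node it sits in the storage node's config.yaml and, if I remember the key correctly, looks roughly like this (the value is purely illustrative):

    # storagenode config.yaml - illustrative value, not a recommendation
    # caps how many requests the node will accept at the same time; 0 means unlimited
    storage2.max-concurrent-requests: 40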

Limiting transfers based on node selection

Instead of the nodes rejecting transfers they are already selected for, a much better approach would be to not select busy nodes for transfers to begin with. There are several ways to go about this. I’ll outline some options here.

  1. When a node rejects a transfer, track those rejections and make that node less likely to be selected in the future. This has the big downside that these nodes also get selected less often during quiet times, which isn't ideal. Additionally, the rejection, which is currently part of the exchange between uplink and storagenode, would need to be reported to the satellite. The upside is that you could keep using the current setting in exactly the same way.
  2. Don’t select nodes when their max-concurrent-requests limit is reached. This solution would be ideal, as it would use the most current data to limit requests. From the SNO point of view the impact would be very similar to what this setting does now. But there would have to be constant communication between the node and satellite to update the number of active requests. This may run into scalability issues.
  3. Add settings for the maximum number of uploads per minute and maximum number of downloads per minute. These settings are then advertised to the satellites. Whenever a node is selected for a transfer, that gets logged in the satellite's DB. And whenever nodes need to be selected for a transfer, nodes that have reached their maximum threshold over the past minute are excluded from selection. The downside is that this doesn't exactly limit the number of concurrent requests, but the number of started requests. Additionally, this would be a limit per satellite; if the number of active satellites changes a lot, SNOs might need to adjust this setting accordingly. The upload rate limit would be the most important one, because you ideally don't want to limit downloads, as that directly impacts how much your node makes. But I would still implement a download limit as well to account for SNOs with slow upload connections. After all, upload speed is often drastically lower than download speed.
  4. @Pentium100: If a node is overloaded (as determined by an SNO-set concurrent request limit, CPU iowait limit, or unix load limit), the node contacts each satellite and informs it that it is overloaded. The satellite then reduces the node's performance coefficient by some amount.
    When the load drops below a lower limit and stays there for 10 minutes, the node contacts the satellite and informs it that it is almost idle. The satellite increases the performance coefficient by some amount that is smaller than the decrease.
    The coefficient is used when selecting a node. If a node would be selected, a random value between 0 and 1 is generated. If the value is greater than the coefficient, some other node is selected instead.
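
To make that coefficient idea a bit more concrete, here's a rough sketch in Go (hypothetical names, not actual satellite code) of how the probabilistic skip could work:

    // Sketch of option 4 (hypothetical, not actual satellite code): each node carries a
    // performance coefficient in [0, 1] that the satellite lowers when the node reports
    // overload and slowly raises when it reports being almost idle.
    package main

    import (
        "fmt"
        "math/rand"
    )

    type nodeInfo struct {
        ID          string
        Coefficient float64 // 1.0 = always eligible, 0.25 = kept ~25% of the time
    }

    // keepNode decides whether a tentatively selected node is actually used.
    func keepNode(n nodeInfo, rnd *rand.Rand) bool {
        // draw a value in [0, 1); keep the node only if it falls under its coefficient
        return rnd.Float64() < n.Coefficient
    }

    func main() {
        rnd := rand.New(rand.NewSource(1))
        busy := nodeInfo{ID: "overloaded-node", Coefficient: 0.25}
        kept := 0
        for i := 0; i < 1000; i++ {
            if keepNode(busy, rnd) {
                kept++
            }
        }
        fmt.Printf("%s kept %d of 1000 tentative selections\n", busy.ID, kept)
    }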

Personally I think option 3 is the most feasible. The max-concurrent-requests setting could then be deprecated after this solution is in place. SNOs would need to be informed about this change, preferably with an email to those who use the old setting. After giving them time to switch over, the old setting can be removed from the code.
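
As a rough illustration of option 3, the selection filter on the satellite side could look something like this sketch in Go (hypothetical types and names, assuming a simple one-minute sliding window; not actual satellite code):

    // Sketch of option 3: nodes advertise a per-minute upload limit, the satellite counts
    // how often it handed the node an upload in the last minute, and excludes nodes that
    // have reached their limit. Hypothetical names, not actual satellite code.
    package main

    import (
        "fmt"
        "time"
    )

    type candidate struct {
        ID               string
        MaxUploadsPerMin int         // 0 means "unlimited" (the default)
        recentSelections []time.Time // when this node was last selected for uploads
    }

    // eligible reports whether the node may be selected for another upload right now.
    func eligible(c *candidate, now time.Time) bool {
        if c.MaxUploadsPerMin == 0 {
            return true
        }
        cutoff := now.Add(-time.Minute)
        count := 0
        for _, t := range c.recentSelections {
            if t.After(cutoff) {
                count++
            }
        }
        return count < c.MaxUploadsPerMin
    }

    func main() {
        now := time.Now()
        n := &candidate{
            ID:               "slow-node",
            MaxUploadsPerMin: 2,
            recentSelections: []time.Time{now.Add(-30 * time.Second), now.Add(-10 * time.Second)},
        }
        fmt.Println(eligible(n, now)) // false: already started 2 uploads within the last minute
    }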

Advantages

Implementing one of these solutions would have the following advantages.

  • Implementing the rate limit in node selection prevents nodes from rejecting transfers and prevents transfers from failing because too many nodes rejected them
  • Node operators get a supported way to slow Storj down a bit to prevent bottlenecks in their setup
  • Increases upload speed by ensuring there is always a long tail that can be cancelled when the success threshold has been reached. (Currently, if enough nodes reject the transfer, it's possible that the uplink has to wait for all remaining nodes to finish, without long tail optimization)
  • Less overhead, because it might be possible to lower the number of nodes that have to be selected, since you no longer need to account for nodes rejecting the transfer

Great ideas.
However, no. 3 seems to be very complex. I for one would not be able to easily tell what numbers to put there.
And it seems that it would require constantly adjusting these numbers for temporary bottlenecks that can occur as well.
I would prefer something like no. 1, where the satellite is informed of actual problems and takes care of it. However, I don't know how. Couldn't a number of nodes be reserved as hot spares for failed downloads and submitted to the uplink in case a node cannot fulfill a request?

What do you find complex about it? Perhaps it could be simplified.

That would be messy. Either you start more downloads to begin with, which would be the same as just raising the number of selected nodes overall. Or these nodes would start only if the others have failed, which would cause a significant delay. Furthermore, you run into the same issue. What if these nodes are also rejecting transfers?
I’m not a big fan of option 1 because you start a transaction without knowing whether nodes will actually finish it. It’s kind of a half solution. You’d still have to dissuade people from using it, because if everyone had set some kind of limit, there could be a situation with such high demand that a majority of nodes run into their limits. At that point the chances of failing an upload altogether go up despite having spares.

PS: If you like the idea, you can vote for it in this part of the forum.

I am currently thinking of temporary issues with a node. The result for the uplink would be the same, so it does not really matter whether a bottleneck is temporary or permanent. Therefore, with no. 3 I would still keep the numbers for up-/downloads as high as possible. If the computer the node is running on is busy, then the result for the uplink is the same, so nothing has been gained.
Otherwise I would constantly have to adjust those settings according to the workload. I don't think that's feasible.
But I totally see the problem: the satellite should not select a node that cannot handle the load. I just think it would be better to base the decision on the actual situation than on some imaginary numbers the SNO has set.

I vote for Number 3 …

I would further expand on that idea to have an initialization field during node creation where the node operator can fill in a few details about the node's hardware capabilities, plus the ability for the node operator to update that hardware data once a month if need be.

Such hardware data, if shared, could be used to create tiered pricing of storage for customers as well as useful analytics for use in satellite algorithms.

There would be no way to guarantee the accuracy of this data. My node, for example, runs on a server with 4x20 core CPUs and 2TB of RAM :slight_smile:. I mean, if it gives me more data because of that…

There are definite methods of ensuring the accuracy of such hardware data collection. But I wasn’t suggesting that those methods be employed. Simple self-reporting of hardware would suffice in 99% of the cases. Cheating the system isn’t really possible… if an SNO lies about hardware capabilities, the node selection algorithm would reveal the false report quickly and the node would lose profitability.

If an SNO with a Pi says that his node is running on a big server, it may be worse for him. However, if it got me more traffic, I would say that my node is running on a bigger server than the real one, and if I see a bottleneck, I'll try to improve the performance.

So either the hardware reporting is just two classes - "full/server" vs "limited/RPi" - or reporting higher capabilities may give me slightly more data. Besides, how would the satellite figure out how many concurrent transfers a given CPU and hard drive can sustain?

For example, my node is running on a server with 2x Xeon X5687 CPUs (4 cores given to the node VM) and 100GB of RAM (16GB given to the node VM). The storage is a 6-drive raidz2 pool (2 drives are 5400 RPM and SMR) with ZIL and L2ARC on a couple of SATA SSDs.

So, how many concurrent connections can it sustain? I do not think I have seen the limit being reached with node versions that display the version number :slight_smile:

I feel like no one is going to be truthful about the hardware they are running. Everyone sees $$$$$, so they will say they have a high-end server when they are probably running it on an RPi3 with 2 USB hard drives on USB 2.0. Letting people select the hardware settings would be more hurtful for the network than if it set them on its own without input from the user. We both know it's already happening now, with people saying they have more hard drive space than they do. Then the person who is running a full-on server with a full RAID array with NVMe and SSDs is going to come in and take up all the data, because the person with the RPi3 thought they were being smart but was actually hurting themselves.

Realistically, there almost needs to be some kind of benchmark that runs, sets it for you, and doesn't allow you to change it.

Yes… and thus the number of SNOs trying to “outsmart” the system is a self-limiting variable.

That would be at least a little more helpful to the satellites.

With Storj v2 you got more data if you said that your node is bigger, so the “correct” way was to advertise the node as 8TB (maximum) and then change that to the real value once the drive is almost full.

AFAIK, v3 does not do that - as long as your node has “enough” free space it has the same chance to get the piece.

I guess it could be done with “some” accuracy, though it either needs to be repeated frequently or I can just run the benchmark on a fast server and then move the node VM to a slower one.

Still, I think it would be better if the node operator just specified the values. Start the default at "unlimited", and if the node runs into some bottleneck, the operator can set it to a lower value and get fewer requests, but a higher success rate and more actual data compared to leaving it at unlimited, getting more requests, and being unable to finish any of them.

Cryptographic hash of hardware.

I’m pretty sure the hash of my VM is the same no matter the hardware it’s running on.

It’s not.

However, that's not what I was suggesting… simply self-reporting would be useful, and most SNOs would report as accurately as possible… because after they reported incorrectly and their node became disqualified, they would find the forum posts saying "If you don't report accurately, your node will be less profitable".

OK, I am not going to debate this. I have never needed to fool a hardware hash after doing virsh migrate, so I do not know how easy or difficult that would be.

Why would it become disqualified? It would just run slower and have a bad success rate most of the time.

Also, I would report higher performance because I want to see where the bottleneck is so I can improve it. If my node is at 50% because the satellite limits it, then I will not see how I could make it run better. On the other hand, if my node is at 100%, it's easier to see how I can make it faster.

So, instead of reporting the hardware (and then trying to figure out how each non-standard configuration would behave), how about just reporting the allowed simultaneous requests etc.? I am not a big fan of the indirection - "OK, if I report my real CPU, I get 50 requests and they do not load my node at 100%, so what CPU should I report to get 60 requests?"

Also, who would determine the optimal number of requests for each configuration? For example, how much would changing my ZIL from SATA to NVMe increase the request count?

Also, the requests are not equal - let's say there are two groups of clients: in group A everyone has 10 Mbps, and in group B everyone has 1 Gbps. My node would be able to serve more simultaneous requests from group A than from group B. So, a single setting would not do it.

This happened to me with v2 - there was a setting to limit concurrent transfers, and if I set it to, say, 10, I sometimes ended up with 10 transfers at 10 kbps each.

Simultaneous requests have several limiting factors…

  • The drive(s) need to respond within the request expiration time.
  • Multiple connections with lower bandwidth per connection result in more efficient use of the TCP pipe.

So, limiting simultaneous connections is a blunt-force tool that sacrifices bandwidth efficiency in exchange for accommodating slower hardware.

If I weren’t in the middle of trying to teach my 1st and 2nd grader the difference between 8:30 a.m. and 8:30 p.m. … I would probably run an iperf test and hand you the results…

I know the difference between a single TCP connection and multiple ones. Multiple connections win, especially if there is packet loss.

However, let's say your suggestion gets implemented and I report that my node is running on a 1GHz Atom with 2GB of RAM and a 4200 RPM SMR drive.

What can the satellite do about it?

  1. Limit concurrent requests to not overload the drive, RAM or CPU.
  2. Limit the request rate to, well, not overload the drive or CPU.

Both are linked - by limiting the request rate you will limit the number of concurrent requests, because if a request takes 3 seconds to process and the node gets one request per second, it has 3 concurrent requests on average.
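In queueing terms that's just Little's law (assuming the node keeps up with the incoming rate):

    concurrent requests ≈ arrival rate × time per request = 1 request/s × 3 s = 3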

So, instead of making the SNO report the hardware and then trying to determine the optimal values for concurrent requests or request rate for each individual configuration, how about allowing the SNO to specify those values directly? Specifying them too low means little traffic. Specifying them too high means a low success rate and little useful traffic.

Storj is centralized data distribution. The satellites are making the choices. As indicated in the OP's post, having the SNOs make these decisions is too late in the network decision process.

While I enjoy the discussion, it got a little bit off track. This suggestion wasn’t at all about creating different tiers of nodes. I would in fact be strongly against that.

My suggestion was simply to give nodes that are having trouble with the "unlimited" default setting an option to lower the load on their systems without hurting customers. So basically, by using these settings you can only make your income worse, never better. I think that makes the entire discussion above moot.

The difference is that currently I can set my node to accept 10 requests and my node will reject all others.
The proposed solution is that my setting of 10 requests is given to the satellite so the satellite won’t give my node more than 10 requests.
