Remember also that the success rate script shows overstated values, i.e. higher than real. This is because a node might finish its work before it notices the dropped connection. In that case the logs will often still report that the data transfer finished successfully. So a high number in the success rate script does not translate to a high number of won races.
Whether this can be corrected for, I don't know. The simplest place to collect lost-races data would be uplink, because that is the code that actually implements the race. But right now there's no communication of these results implemented. AFAIU, in some circumstances the notification of a dropped connection might be delayed by many seconds, maybe even minutes, so trying to work around that might be difficult.
EDIT: after thinking about it a bit, it might be even more difficult. Both uplink and the node may close the connection without knowing that the other side is also closing it. Then, even if uplink closed the connection earlier, the node's OS may simply ignore uplink's close because the node had already closed it as well. You'd probably need to ask someone experienced with socket libraries across different OSes to get an answer.
Not exactly, the transfer probably actually finishes. The uplink terminates all remaining uploads the moment it has 80 successfully uploaded pieces. But there is of course latency between the uplink detecting this and the cancellation message being received by the remaining nodes. If lots of nodes are fast and roughly equally fast, they may all finish at almost the same time, before that cancellation message arrives. The node may then still end up losing the race, even though the entire transfer was completed successfully.
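To make the timing concrete, here is a toy sketch (plain Go, not the actual uplink code) of that long-tail cancellation: uploads to 110 simulated nodes run in parallel, the context is cancelled once 80 succeed, and any node that finishes during the cancellation latency has done all the work but still loses the race. Numbers and structure are illustrative only.

```go
// Illustrative sketch only, not the actual uplink code: run many uploads
// concurrently and cancel the rest once the first `needed` succeed. Nodes
// that finish during the cancellation latency did all the work but lose.
package main

import (
	"context"
	"fmt"
	"math/rand"
	"sync"
	"time"
)

func main() {
	const total, needed = 110, 80

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var (
		mu        sync.Mutex
		succeeded int
		wg        sync.WaitGroup
	)

	for i := 0; i < total; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Simulated transfer time; real per-node latencies vary much more.
			d := time.Duration(50+rand.Intn(50)) * time.Millisecond
			select {
			case <-time.After(d):
				mu.Lock()
				succeeded++
				if succeeded == needed {
					cancel() // the long tail gets cancelled, with some delay
				}
				mu.Unlock()
			case <-ctx.Done():
				// This node was cut off before finishing.
			}
		}()
	}
	wg.Wait()
	fmt.Printf("%d of %d uploads finished, but only %d count as won\n",
		succeeded, total, needed)
}
```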
Basically, if the script shows low numbers (<95%) you are losing more races than average… but if the scores are high… you still might be… we just can't know for sure. So low = problem, high = who knows…
True, but there is no upside for the customer to add additional communication with nodes to report that data back. And you want to keep any additional overhead as low as possible during transfers. So I doubt we're going to get that information reflected on our nodes.
Indeed! But uplink already passes this information to the satellite (for accounting reasons), so the satellite could tell the node which transfers it won or lost. Or, well, at least provide node operators some aggregated information via some web interface.
Hmmm… this might be a strictly better option than the one in post #15.
It's an interesting idea, but I'm pretty sure the satellite doesn't store the nodes it has offered to the uplink for upload, and the uplink only sends back those that finished. So it would require storing interrupted transfers somewhere as well, and also a heavy query to calculate percentages for each node. Probably too much overhead on the satellite DBs for Storj to consider.
Eh, the minimum necessary would be to collect four more integers counting events per node: attempted downloads, successful downloads, attempted uploads, successful uploads. So, let's say, whenever the satellite increments a node's bandwidth consumption, it would also increment the successful counter.
Then it would be a matter of periodically sending these counters, just like data for payments is sent now.
I have some experience working with telecom performance monitoring systems, and we routinely processed billions of counters like that on near-commodity hardware. 60k counters is nothing.
Sure, it would be great to have a more detailed view (e.g. per customer's area), but this would go a long way already.
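A rough sketch of what those four counters could look like on the satellite side, purely for illustration; none of these types or method names exist in the actual code, they just show where the increments would hook in next to the existing bandwidth accounting.

```go
// Hypothetical sketch of the per-node counters proposed above.
package accounting

import "sync/atomic"

// TransferCounters holds the four extra integers per node.
type TransferCounters struct {
	AttemptedDownloads  int64
	SuccessfulDownloads int64
	AttemptedUploads    int64
	SuccessfulUploads   int64
}

// RecordUploadOffered would be called when the node is included in the
// set of nodes offered to the uplink for a segment.
func (c *TransferCounters) RecordUploadOffered() {
	atomic.AddInt64(&c.AttemptedUploads, 1)
}

// RecordUploadSettled would be called wherever the satellite already
// credits a node's bandwidth for a settled upload order.
func (c *TransferCounters) RecordUploadSettled() {
	atomic.AddInt64(&c.SuccessfulUploads, 1)
}

// UploadSuccessRate is what a periodic job could report per node,
// alongside the data already sent for payments.
func (c *TransferCounters) UploadSuccessRate() float64 {
	attempted := atomic.LoadInt64(&c.AttemptedUploads)
	if attempted == 0 {
		return 0
	}
	return float64(atomic.LoadInt64(&c.SuccessfulUploads)) / float64(attempted)
}
```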
I think this kind of benchmark from different regions would be interesting to clients: not just speed tests from the US and a few EU countries, but test results from every country in the world.
I understand that it will change over time, but it would give people some measurement of what to expect. Also, if every upload made by uplink were logged by the satellite, that data would stay accurate all the time. This would give Storj Labs very good information about where things need to be improved.
That may be worse than just dumping the cancelled transfers in a table. What you're suggesting involves updating 110 records for each transfer. And because it's an update, that would require an index seek to find the record, locking the record, updating the counter and releasing the lock. All the while transfers are highly parallel and multiple processes might want to update the same record at the same time, leading to lock contention. And the worst part is that this then has to be implemented as part of handling a data transfer, instead of in a separate process.
The satellite updates bandwidth usage based on bandwidth orders sent by the nodes in batches, exactly to avoid the kind of updates I described above.
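For comparison, here is a minimal sketch of that batching pattern, assuming hypothetical table and column names (the real satellite schema differs): orders are aggregated in memory first, so each node's row is touched once per batch instead of once per transfer.

```go
// Rough sketch of the batching idea described above; table/column names
// are made up for illustration.
package accounting

import (
	"context"
	"database/sql"
)

// Order represents one settled bandwidth order reported by a node.
type Order struct {
	NodeID string
	Bytes  int64
}

// FlushBatch aggregates a whole batch of orders in memory and issues a
// single UPDATE per node, instead of one row update per transfer. This
// keeps lock contention on the hot accounting rows low.
func FlushBatch(ctx context.Context, db *sql.DB, orders []Order) error {
	perNode := make(map[string]int64)
	for _, o := range orders {
		perNode[o.NodeID] += o.Bytes
	}

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	for nodeID, bytes := range perNode {
		if _, err := tx.ExecContext(ctx,
			`UPDATE node_bandwidth SET used = used + $1 WHERE node_id = $2`,
			bytes, nodeID); err != nil {
			return err
		}
	}
	return tx.Commit()
}
```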
Ok, you might be right on the technical details, I don't know the code well enough. Still, the scale doesn't look anywhere close to the point where it would be a problem.
I could only invite everyone interested to try to implement these suggestions and create a PR on our GitHub. It could be a nice addition from the Community!
Would it convince you if we could probabilistically update them, e.g. once every 100 transfers? That should give good enough accuracy while limiting overhead.
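A tiny sketch of what that sampled update could look like, again with made-up names: only 1 in every 100 transfers touches the counters, and the increment is scaled back up by the sample rate so the totals stay approximately right.

```go
// Toy sketch of the sampled-update idea; names are hypothetical.
package accounting

import "math/rand"

const sampleRate = 100

// maybeCountTransfer records a transfer outcome for a node with
// probability 1/sampleRate. Over many transfers the scaled counts
// converge on the true totals while cutting the write volume to
// roughly 1/sampleRate of what per-transfer updates would need.
func maybeCountTransfer(counters map[string]int64, nodeID string, success bool) {
	if rand.Intn(sampleRate) != 0 {
		return
	}
	counters[nodeID+"/attempted"] += sampleRate
	if success {
		counters[nodeID+"/successful"] += sampleRate
	}
}
```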
Maybe it would be more efficient to monitor only successfully completed uploads and record the upload time for each piece; since the size of the chunk is always known, nothing more is needed for the analysis.
As every node gets data, it would be benchmarked all the time.
I wouldn't want anything that could impose lock contention in the code path for transfers itself. But the satellite could dump a random 1/x sample of transfers into a table and have a separate process calculate percentages on that. That could probably work. But that would still require the uplink to also send back info about failed/cancelled transfers.
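That separate process could be as simple as a periodic job over the sampled table. A very rough sketch, assuming a hypothetical transfer_samples table with node_id and won columns (nothing like this exists today):

```go
// Rough sketch of a periodic job that reads sampled transfer outcomes
// and computes a win percentage per node. Table, column names and the
// connection string are all hypothetical.
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/satellite?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.QueryContext(context.Background(), `
		SELECT node_id,
		       100.0 * SUM(CASE WHEN won THEN 1 ELSE 0 END) / COUNT(*) AS win_pct
		FROM transfer_samples
		GROUP BY node_id
	`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var nodeID string
		var winPct float64
		if err := rows.Scan(&nodeID, &winPct); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %.1f%% of sampled transfers won\n", nodeID, winPct)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```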