Remember also that the success rate script shows overstated values, i.e. higher than real. This is because a node might finish its work before it notices the dropped connection. In that case the logs will often still report that the data transfer finished successfully. So a high number in the success rate script does not translate to a high number of won races.
Whether this can be corrected for, I don't know. The simplest place to collect lost-races data would be uplink, because that is the code that actually implements the race. But right now there's no communication of these results implemented. AFAIU, in some circumstances the notification of a dropped connection might be delayed by many seconds, maybe even minutes, so trying to work around that might be difficult.
EDIT: after thinking about it a bit, it might be even more difficult. Both uplink and the node may close the connection without knowing that the other side is also closing it. Then, even if uplink closed the connection earlier, the node's OS may simply ignore uplink's close because the node had already closed it as well. You'd probably need to ask someone experienced with socket libraries across different OSes to get an answer.
Not exactly, the transfer probably actually finishes. The uplink terminates all remaining uploads the moment it has 80 successfully uploaded pieces. But there is of course latency between the uplink detecting this and the cancellation message being received by the remaining nodes. If lots of nodes are fast and roughly equally fast, they may all finish at almost the same time, before that cancellation message arrives. The node may then still end up losing the race, even though the entire transfer was completed successfully.
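To make the timing concrete, here is a toy sketch (plain Go, not the actual uplink code) of that long-tail cancellation: uploads to 110 simulated nodes run in parallel, the context is cancelled once 80 succeed, and any node that finishes during the cancellation latency has done all the work but still loses the race. Numbers and structure are illustrative only.

```go
// Illustrative sketch only, not the actual uplink code: run many uploads
// concurrently and cancel the rest once the first `needed` succeed. Nodes
// that finish during the cancellation latency did all the work but lose.
package main

import (
	"context"
	"fmt"
	"math/rand"
	"sync"
	"time"
)

func main() {
	const total, needed = 110, 80

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var (
		mu        sync.Mutex
		succeeded int
		wg        sync.WaitGroup
	)

	for i := 0; i < total; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Simulated transfer time; real per-node latencies vary much more.
			d := time.Duration(50+rand.Intn(50)) * time.Millisecond
			select {
			case <-time.After(d):
				mu.Lock()
				succeeded++
				if succeeded == needed {
					cancel() // the long tail gets cancelled, with some delay
				}
				mu.Unlock()
			case <-ctx.Done():
				// This node was cut off before finishing.
			}
		}()
	}
	wg.Wait()
	fmt.Printf("%d of %d uploads finished, but only %d count as won\n",
		succeeded, total, needed)
}
```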
Basically, if the script shows low numbers (<95%) you are losing more races than average… but if the scores are high… you still might be… we just can't know for sure. So low = problem, high = who knows…
True, but there is no upside for the customer to add additional communication with nodes to report that data back. And you want to keep any additional overhead as low as possible during transfers. So I doubt we're going to get that information reflected on our nodes.
Indeed! But uplink already passes this information to the satellite (for accounting reasons), so the satellite could tell the node which transfers it won or lost. Or, well, at least provide node operators some aggregated information via some web interface.
Hmmm… this might be a strictly better option than the one in post #15.
It's an interesting idea, but I'm pretty sure the satellite doesn't store the nodes it has offered to the uplink for upload, and the uplink only sends back those that finished. So it would require storing interrupted transfers somewhere as well, and also a heavy query to calculate percentages for each node. Probably too much overhead on the satellite DBs for Storj to consider.
Eh, the minimum necessary would be to collect four more integers counting events per node: attempted downloads, successful downloads, attempted uploads, successful uploads. So, let's say, whenever the satellite increments a node's bandwidth consumption, it would also increment the successful counter.
Then it would be a matter of periodically sending these counters, just like data for payments is sent now.
I have some experience working with telecom performance monitoring systems, and we routinely processed billions of counters like that on near-commodity hardware. 60k counters is nothing.
Sure, it would be great to have a more detailed view (e.g. per customer's area), but this would go a long way already.
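A rough sketch of what those four counters could look like on the satellite side, purely for illustration; none of these types or method names exist in the actual code, they just show where the increments would hook in next to the existing bandwidth accounting.

```go
// Hypothetical sketch of the per-node counters proposed above.
package accounting

import "sync/atomic"

// TransferCounters holds the four extra integers per node.
type TransferCounters struct {
	AttemptedDownloads  int64
	SuccessfulDownloads int64
	AttemptedUploads    int64
	SuccessfulUploads   int64
}

// RecordUploadOffered would be called when the node is included in the
// set of nodes offered to the uplink for a segment.
func (c *TransferCounters) RecordUploadOffered() {
	atomic.AddInt64(&c.AttemptedUploads, 1)
}

// RecordUploadSettled would be called wherever the satellite already
// credits a node's bandwidth for a settled upload order.
func (c *TransferCounters) RecordUploadSettled() {
	atomic.AddInt64(&c.SuccessfulUploads, 1)
}

// UploadSuccessRate is what a periodic job could report per node,
// alongside the data already sent for payments.
func (c *TransferCounters) UploadSuccessRate() float64 {
	attempted := atomic.LoadInt64(&c.AttemptedUploads)
	if attempted == 0 {
		return 0
	}
	return float64(atomic.LoadInt64(&c.SuccessfulUploads)) / float64(attempted)
}
```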
I think this kind of benchmark from different regions would be interesting to clients: not just speed tests from the US and a few EU countries, but test results from every country in the world.
I understand that it will change over time, but it would give people some measurement of what to expect. Also, if every upload made by uplink were logged by the satellite, that data would stay accurate all the time. This would give Storj Labs very good information about where things need to be improved.
That may be worse than just dumping the cancelled transfers in a table. What you're suggesting involves updating 110 records for each transfer. And because it's an update, that would require an index seek to find the record, locking the record, updating the counter and releasing the lock. All the while transfers are highly parallel and multiple processes might want to update the same record at the same time, leading to lock contention. And the worst part is that this then has to be implemented as part of handling a data transfer, instead of in a separate process.
The satellite updates bandwidth usage based on bandwidth orders sent by the nodes in batches, exactly to avoid the kind of updates I described above.
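For comparison, here is a minimal sketch of that batching pattern, assuming hypothetical table and column names (the real satellite schema differs): orders are aggregated in memory first, so each node's row is touched once per batch instead of once per transfer.

```go
// Rough sketch of the batching idea described above; table/column names
// are made up for illustration.
package accounting

import (
	"context"
	"database/sql"
)

// Order represents one settled bandwidth order reported by a node.
type Order struct {
	NodeID string
	Bytes  int64
}

// FlushBatch aggregates a whole batch of orders in memory and issues a
// single UPDATE per node, instead of one row update per transfer. This
// keeps lock contention on the hot accounting rows low.
func FlushBatch(ctx context.Context, db *sql.DB, orders []Order) error {
	perNode := make(map[string]int64)
	for _, o := range orders {
		perNode[o.NodeID] += o.Bytes
	}

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	for nodeID, bytes := range perNode {
		if _, err := tx.ExecContext(ctx,
			`UPDATE node_bandwidth SET used = used + $1 WHERE node_id = $2`,
			bytes, nodeID); err != nil {
			return err
		}
	}
	return tx.Commit()
}
```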
Ok, you might be right on the technical details, I don't know the code well enough. Still, the scale doesn't look anywhere close to the point where it would be a problem.
I could only invite everyone interested to try to implement these suggestions and create a PR on our GitHub. It could be a nice addition from the Community!
Would it convince you if we could probabilistically update them, e.g. once every 100 transfers? That should give good enough accuracy while limiting overhead.
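A tiny sketch of what that sampled update could look like, again with made-up names: only 1 in every 100 transfers touches the counters, and the increment is scaled back up by the sample rate so the totals stay approximately right.

```go
// Toy sketch of the sampled-update idea; names are hypothetical.
package accounting

import "math/rand"

const sampleRate = 100

// maybeCountTransfer records a transfer outcome for a node with
// probability 1/sampleRate. Over many transfers the scaled counts
// converge on the true totals while cutting the write volume to
// roughly 1/sampleRate of what per-transfer updates would need.
func maybeCountTransfer(counters map[string]int64, nodeID string, success bool) {
	if rand.Intn(sampleRate) != 0 {
		return
	}
	counters[nodeID+"/attempted"] += sampleRate
	if success {
		counters[nodeID+"/successful"] += sampleRate
	}
}
```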
Maybe it would be more efficient to monitor only successfully completed uploads and record the upload time for each piece; since the size of the chunk is always known, nothing more is needed for the analysis.
As every node gets data, it would be benchmarked all the time.
I wouldn't want anything that could impose lock contention in the code path for transfers itself. But the satellite could dump a random 1/x sample of transfers into a table and have a separate process calculate percentages on that. That could probably work. But that would still require the uplink to also send back info about failed/cancelled transfers.
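That separate process could be as simple as a periodic job over the sampled table. A very rough sketch, assuming a hypothetical transfer_samples table with node_id and won columns (nothing like this exists today):

```go
// Rough sketch of a periodic job that reads sampled transfer outcomes
// and computes a win percentage per node. Table, column names and the
// connection string are all hypothetical.
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/satellite?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.QueryContext(context.Background(), `
		SELECT node_id,
		       100.0 * SUM(CASE WHEN won THEN 1 ELSE 0 END) / COUNT(*) AS win_pct
		FROM transfer_samples
		GROUP BY node_id
	`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var nodeID string
		var winPct float64
		if err := rows.Scan(&nodeID, &winPct); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %.1f%% of sampled transfers won\n", nodeID, winPct)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```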