Your team has a lot of interesting challenges: a chance to try new things! I get that the ‘race’ system already biases data towards faster nodes… but I understand you may want to lean harder towards those fast nodes so clients have better interactive performance.
I guess that could also be offset the other way, so slower nodes aren’t penalized too much: perhaps the repair system could be biased towards slower nodes, because there’s no customer-visible time pressure on repair traffic. If it takes twice as long to move repair data around, nobody will care.
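Purely to illustrate the idea (this is a toy sketch, not Storj’s actual node-selection code; the latencyScore stat and the function names are things I’m making up for the example), repair placement could do a weighted random pick that favours the slower nodes:

```go
package main

import (
	"fmt"
	"math/rand"
)

// node is a stand-in for whatever per-node stats the satellite already tracks.
// latencyScore is an assumption for this sketch: higher means the node tends to be slower.
type node struct {
	id           string
	latencyScore float64
}

// pickRepairTarget picks one node at random, weighted by latencyScore, so repair
// uploads (which no customer is waiting on) tend to land on slower nodes instead
// of competing with customer traffic on the fast ones.
func pickRepairTarget(nodes []node, rng *rand.Rand) node {
	total := 0.0
	for _, n := range nodes {
		total += n.latencyScore
	}
	r := rng.Float64() * total
	for _, n := range nodes {
		r -= n.latencyScore
		if r <= 0 {
			return n
		}
	}
	return nodes[len(nodes)-1] // floating-point fallback
}

func main() {
	rng := rand.New(rand.NewSource(1))
	nodes := []node{{"fast", 0.2}, {"medium", 1.0}, {"slow", 3.0}}
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pickRepairTarget(nodes, rng).id]++
	}
	fmt.Println(counts) // the slow node should win most of the picks
}
```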
I have no suggestions or comments or criticisms. It can just be exciting to think about creative answers to technical questions. Have fun Storj devs!
Will we receive an update on the results as well, or are they only for internal purposes? For instance, will information about the peak bandwidth and other tests conducted be shared? (I hope I didn’t miss any updates among the many messages.)
You’d have to be pretty unlucky, wouldn’t you? I mean, repair triggers before there are too few pieces left to reconstruct a segment, or did I not understand that right?
Seems like the node was overloaded and couldn’t handle the IO pressure. Could also result from disk failure.
The rest of the stack trace is just the node forcibly shutting down after failing the readability check. You can increase the timeout (not recommended) with `--storage2.monitor.verify-dir-readable-timeout`, or change the config to only log a warning when the readability/writability check fails with `--storage2.monitor.verify-dir-warn-only`.
However, the default behaviour is to fail fast so you can fix the problem quickly and your node doesn’t get disqualified.
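If you do decide to change either of these, the same options can also go in the node’s config.yaml (same keys as the flags, if I’m remembering the syntax right); the timeout value below is only an example, not a recommendation:

```yaml
# only log a warning instead of shutting the node down when the check fails
storage2.monitor.verify-dir-warn-only: true

# or raise the readability-check timeout (example value)
storage2.monitor.verify-dir-readable-timeout: 1m30s
```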
Maybe I missed it, but is there a predetermined system/logic that determines which node actually gets tested? I mean, is the test traffic/load directed to one node, then the next one, then the next, and so on…? Does the test take place on each and every node? Or, since it comes from the SLC satellite, are only geographically close nodes tested?
I ask because I do see an increase in traffic, but far from enough to really put the nodes to the test. My traffic is around 15 Mbit/s, and it is shared across 6 nodes (behind the same IP)…
I had much more egress than usual on Saturday-Sunday and 3 times more ingress than usual today. But nothing more, nothing else. I have a 1000/300 Mbit connection, so plenty of room for more. The nodes are fine and happy, even the smallest one on the RPi3B+.
Repair must be done quickly. The safety of the data is the priority, so the sooner all segments are back to the standard piece count, the better. You can’t delay the repair. If this were tunable, I would choose the fastest nodes, not the slowest. This is actually the case now: the fastest nodes receive the pieces first, and the rest lose the races.
That’s certainly a nice theory. However… for as long as the numbers have been on the dashboard… the ‘healthy pieces’ minimums have been around 50, and the median around 65 (of 80). Repairing back to 80 is clearly done on a best-effort/good-enough basis. Storj knows that if you only need 29, you don’t need to stress when you’re missing a few.
So repairs must only be done “fast enough”.
And repairs aren’t sequential: they scale almost perfectly. If each repair took twice as long, Storj could certainly run twice as many repair jobs at the same time if they needed to.
I’m not saying the idea of a slow-node-tailored repair system is good or bad. Simply that “repair must be done quickly” isn’t supported by the numbers reported by the satellites that have been running for years.
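To put rough numbers on that (only the 29-of-80 requirement and the ~50/~65 healthy-piece figures come from the dashboard; the per-job time and worker count below are invented just to show the scaling):

```go
package main

import "fmt"

func main() {
	// A segment needs 29 of its 80 pieces to be reconstructable.
	const needed = 29
	for _, healthy := range []int{50, 65} { // observed min / median healthy pieces
		fmt.Printf("healthy=%d -> %d pieces of margin above the %d needed\n",
			healthy, healthy-needed, needed)
	}

	// Repair jobs run in parallel, so throughput is roughly workers / time-per-job.
	// Invented numbers, purely illustrative:
	jobMinutes, workers := 2.0, 100.0
	base := workers / jobMinutes * 60
	doubled := (2 * workers) / (2 * jobMinutes) * 60
	fmt.Printf("baseline: ~%.0f segments/hour; doubled job time with doubled workers: ~%.0f\n",
		base, doubled)
}
```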
I should have used more words. That data would end up on nodes that don’t do well under pressure. If the customer then downloaded that data, it would come from those slower nodes as well. So there is an incentive to just upload less to nodes that don’t handle the pressure well.
Repair running slowly might add to cost though if it means more repair workers are needed. But the speed of the repair itself is certainly not an issue.
This sounds pretty reasonable, especially considering that this use case has a TTL set at the time of the upload. It would only be logical that data that’s only stored for a short time doesn’t need as much redundancy.
I see it as an investment, so Storj can serve customers better.
We’re in it together for good and bad; if Storj needs this, I will provide what’s needed, because their success is also mine.