Your team has a lot of interesting challenges: a chance to try new things! I get that the ‘race’ system already biases data towards faster nodes… but I understand you may want to lean harder towards those fast nodes so clients have better interactive performance.
I guess that could also be offset the other way, so slower nodes aren’t penalized too much: perhaps the repair system could be biased towards slower nodes, because there’s no customer-visible time pressure on repair traffic. If it takes twice as long to move repair data around, nobody will care.
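Purely to illustrate the idea (this is a toy sketch, not Storj’s actual node-selection code; the latencyScore stat and the function names are things I’m making up for the example), repair placement could do a weighted random pick that favours the slower nodes:

```go
package main

import (
	"fmt"
	"math/rand"
)

// node is a stand-in for whatever per-node stats the satellite already tracks.
// latencyScore is an assumption for this sketch: higher means the node tends to be slower.
type node struct {
	id           string
	latencyScore float64
}

// pickRepairTarget picks one node at random, weighted by latencyScore, so repair
// uploads (which no customer is waiting on) tend to land on slower nodes instead
// of competing with customer traffic on the fast ones.
func pickRepairTarget(nodes []node, rng *rand.Rand) node {
	total := 0.0
	for _, n := range nodes {
		total += n.latencyScore
	}
	r := rng.Float64() * total
	for _, n := range nodes {
		r -= n.latencyScore
		if r <= 0 {
			return n
		}
	}
	return nodes[len(nodes)-1] // floating-point fallback
}

func main() {
	rng := rand.New(rand.NewSource(1))
	nodes := []node{{"fast", 0.2}, {"medium", 1.0}, {"slow", 3.0}}
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pickRepairTarget(nodes, rng).id]++
	}
	fmt.Println(counts) // the slow node should win most of the picks
}
```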
I have no suggestions or comments or criticisms. It can just be exciting to think about creative answers to technical questions. Have fun Storj devs!
Will we receive an update on the results as well, or are they only for internal purposes? For instance, will information about the peak bandwidth and other tests conducted be shared? (I hope I didn’t miss any updates among the many messages.)
You’d have to be pretty unlucky, wouldn’t you? I mean, repair triggers before there are too few pieces left to reconstruct a segment, or did I not understand that right?
Seems like the node was overloaded and couldn’t handle the IO pressure. Could also result from disk failure.
The rest of the stack trace is just the node forcibly shutting down after failing the readability check. You can increase the timeout (not recommended) with `--storage2.monitor.verify-dir-readable-timeout`, or change the config to only log a warning when the readability/writability check fails with `--storage2.monitor.verify-dir-warn-only`.
However, the default behaviour is to fail fast so you can fix the problem quickly and your node doesn’t get disqualified.
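If you do decide to change either of these, the same options can also go in the node’s config.yaml (same keys as the flags, if I’m remembering the syntax right); the timeout value below is only an example, not a recommendation:

```yaml
# only log a warning instead of shutting the node down when the check fails
storage2.monitor.verify-dir-warn-only: true

# or raise the readability-check timeout (example value)
storage2.monitor.verify-dir-readable-timeout: 1m30s
```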
Maybe I missed it, but is there a predetermined system/logic that determines which node actually gets tested? I mean, is the test traffic/load directed to one node, then the next one, then the next, and so on…? Does the test take place on each and every node? Or, since it comes from the SLC satellite, are only geographically close nodes tested?
I ask because I do see an increase in traffic, but far from enough to really put the nodes to the test. My traffic is around 15 Mbit/s, and it is shared across 6 nodes (behind the same IP)…
I had much more egress than usual on Saturday-Sunday and 3 times more ingress than usual today. But nothing more, nothing else. I have a 1000/300 Mbit connection, so plenty of room for more. The nodes are fine and happy, even the smallest one on the RPi3B+.
Repair must be done quickly. The safety of the data is the priority, so the sooner all segments are back to the standard piece count, the better. You can’t delay the repair. If this were tunable, I would choose the fastest nodes, not the slowest. This is actually the case now: the fastest nodes receive the pieces first, and the rest lose the races.
That’s certainly a nice theory. However… for as long as the numbers have been on the dashboard… the ‘healthy pieces’ minimums have been around 50, and the median around 65 (of 80). Repairing back to 80 is clearly done on a best-effort/good-enough basis. Storj knows that if you only need 29, you don’t need to stress when you’re missing a few.
So repairs must only be done “fast enough”.
And repairs aren’t sequential: they scale almost perfectly. If each repair took twice as long, Storj could certainly run twice as many repair jobs at the same time if they needed to.
I’m not saying the idea of a slow-node-tailored repair system is good or bad. Simply that “repair must be done quickly” isn’t supported by the numbers reported by the satellites that have been running for years.
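To put rough numbers on that (only the 29-of-80 requirement and the ~50/~65 healthy-piece figures come from the dashboard; the per-job time and worker count below are invented just to show the scaling):

```go
package main

import "fmt"

func main() {
	// A segment needs 29 of its 80 pieces to be reconstructable.
	const needed = 29
	for _, healthy := range []int{50, 65} { // observed min / median healthy pieces
		fmt.Printf("healthy=%d -> %d pieces of margin above the %d needed\n",
			healthy, healthy-needed, needed)
	}

	// Repair jobs run in parallel, so throughput is roughly workers / time-per-job.
	// Invented numbers, purely illustrative:
	jobMinutes, workers := 2.0, 100.0
	base := workers / jobMinutes * 60
	doubled := (2 * workers) / (2 * jobMinutes) * 60
	fmt.Printf("baseline: ~%.0f segments/hour; doubled job time with doubled workers: ~%.0f\n",
		base, doubled)
}
```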
I should have used more words. That data would end up on nodes that don’t do well under pressure. If the customer then downloaded that data, it would come from those slower nodes as well. So there is an incentive to just upload less to nodes that don’t handle the pressure well.
Repair running slowly might add to cost though if it means more repair workers are needed. But the speed of the repair itself is certainly not an issue.
This sounds pretty reasonable, especially considering that this use case has a TTL set at the time of the upload. It would only be logical that data that’s only stored for a short time doesn’t need as much redundancy.
I see it as an investment, so Storj can serve customers better.
We’re in it together for good and bad; if Storj needs this, I will provide what’s needed, because their success is also mine.