Strategy for TESTING NEW NODES

snorkel · May 6, 2024, 9:49am

In light of current events, with the stress tests run by the Saltlake sat and so many reports about nodes not handling the load, I believe there must be a new way to test and approve NEW NODES entering the network. It’s better to catch problems from the beginning than after 10-15TB of stored data.
This is what I can quickly draft, and any input would be very helpful.

When a new node joins the network, the Saltlake test satellite should run a set of tests (stress tests, bloom filter tests, etc.) to let the user see whether there could be any problems with their setup and adjust accordingly: change some hardware or software choices, improve some settings, or quit and not bother the network with “useless potatoes”.

The SL tests should be unpaid, time- and load-limited for each run, and give a score.
If all is good, the data satellites receive a thumbs up, start sending data and start the vetting process. If it’s bad, let the SNO know, give them the opportunity to adjust their setup, and redo the tests.
The SNO should be allowed to rerun the SL tests an unlimited number of times, until they are satisfied with the score (some abuse limit should be in place).

The test score could be a numeric one, like 0-100, or an attribute, like very good/ good/ medium/ bad.
The logs should display which parts of the tests got bad results, to help with troubleshooting.
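
To make the scoring idea concrete, here is a minimal sketch of how a 0-100 score could map to the attributes above, and how the failing test parts could be listed for the logs. The thresholds, the pass mark and the test names are all made up for illustration; nothing like this exists on the satellites today.

```python
# Hypothetical cut-offs: the proposal does not define where each band starts.
GRADE_BANDS = [(90, "very good"), (70, "good"), (50, "medium"), (0, "bad")]


def grade(score: int) -> str:
    """Map a numeric 0-100 test score to one of the proposed attributes."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for minimum, label in GRADE_BANDS:
        if score >= minimum:
            return label
    raise AssertionError("unreachable: the last band starts at 0")


def failing_parts(results: dict[str, int], pass_mark: int = 50) -> list[str]:
    """Return the names of the test parts that scored below the pass mark.

    `results` maps a test name (e.g. "stress", "bloom_filter") to its
    0-100 score; both the names and the pass mark are assumptions.
    """
    return sorted(name for name, score in results.items() if score < pass_mark)
```

For example, `failing_parts({"stress": 80, "bloom_filter": 30})` would point the SNO at the bloom filter test only.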

The Dashboard should display:
-the status of the node (in testing/ in vetting/ in production);
-the score of the testing from SL sat (passed/very good, or failed/bad - Check your logs);
-the vetting status with numbers from all sats (like US1 - 50/100, EU1 - 75/100, AP1 - 20/100, etc.).
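
As a rough illustration of the three items above, this sketch builds the text lines such a Dashboard section could show. The function name, the field labels and the input format are hypothetical; this is not how the current Dashboard is implemented.

```python
# The three node states named in the list above.
VALID_STATUSES = ("in testing", "in vetting", "in production")


def render_dashboard(
    status: str,
    test_passed: bool,
    vetting: dict[str, tuple[int, int]],
) -> list[str]:
    """Build the proposed Dashboard lines.

    `vetting` maps a satellite name to (audits done, audits required);
    this input shape is an assumption made for the example.
    """
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    test_line = "passed" if test_passed else "failed - check your logs"
    vetting_line = ", ".join(
        f"{sat} - {done}/{needed}" for sat, (done, needed) in vetting.items()
    )
    return [
        f"Node status: {status}",
        f"SL test result: {test_line}",
        f"Vetting: {vetting_line}",
    ]
```

Calling it with `"in vetting"`, `True` and `{"US1": (50, 100), "EU1": (75, 100), "AP1": (20, 100)}` gives a vetting line of `Vetting: US1 - 50/100, EU1 - 75/100, AP1 - 20/100`.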

I believe many SNOs, especially new ones, don’t check their logs, or run them in a low logging mode, and rely on the Dashboard, so it should be more informative than it is now.

Roxor · May 6, 2024, 12:30pm

Don’t we already have a 0-100 Test Score for nodes: the upload/download %-success numbers from @BrightSilence’s script?

If you’re winning 95%+ of races your setup is fine. If you’re winning 5% of them your setup is not fine.

The Satellites don’t have to do anything special: you either win races with normal traffic (and get paid normally)… or you lose them and make less. SNOs see the impact in their wallets, and can decide if they’re happy with their earning rate… or if they need to try to tune performance for more coins.
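
For reference, the success percentage is just races won over races attempted. A minimal sketch, assuming you already have the won/lost counts; it does not parse node logs and is not the script mentioned above. The 95% and 5% marks come from this post, and anything in between is labeled "unclear" purely as a placeholder.

```python
def success_rate(won: int, lost: int) -> float:
    """Percentage of races won out of all races attempted."""
    total = won + lost
    if total == 0:
        raise ValueError("no races recorded")
    return 100.0 * won / total


def verdict(rate: float) -> str:
    """Classify a success rate using the two marks from the post.

    The "unclear" middle band is an assumption; the post only gives
    the two extremes.
    """
    if rate >= 95.0:
        return "fine"
    if rate <= 5.0:
        return "not fine"
    return "unclear"
```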

snorkel · May 6, 2024, 1:13pm

That is not a stress test for future-proof nodes. It just shows the actual performance under the present load.

Roxor · May 6, 2024, 2:01pm

Actual performance with present load is what you need to measure success against, because present load includes anticipated future loads using synthetic test data.

Or did I understand it wrong? I thought the test Satellite handles simulated future workloads… gathered from the requirements of people who aren’t Storj customers yet?

Unless you opt-out of SLC… doesn’t that mean if you see consistently high race wins… that your node is “future proof”… because it’s winning with those simulated-future-workloads?

IsThisOn · May 6, 2024, 3:50pm

In theory, such a test would not be needed in a decentralized system. It should auto-correct itself: bad nodes lose races and get less paid egress. Also, you can’t test peering.

In reality, maybe 80 nodes is not wide enough?
Really depends on how many shit nodes there are.

I am not completely against benchmarking nodes, I just find it hard to decide where you would make the cut or how you would steer traffic based on that. And again, if the majority of nodes is not trash, it does not matter anyway.

snorkel · May 6, 2024, 5:13pm

I wasn’t thinking of using those tests for node selection in data allocation. They are for the SNO to see if their new setup can handle the load, and to not even let them start vetting if the setup does not pass the test phase.

Alexey · May 7, 2024, 4:51am

The libuplink will even start uploads to 110 nodes; once the currently required number (e.g. 80) have finished, all the others are canceled.
@snorkel These others will get an increased failure rate, as suggested,

but it influences node selection too: rejected nodes will not stick in the hot cache for that libuplink.
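
The "start 110, keep the first 80, cancel the rest" behavior can be sketched like this. It is only a toy simulation with `asyncio`, not libuplink’s actual code: `upload_piece`, the fake delays and `run_demo` are all invented for the example, and the hot cache is not modeled.

```python
import asyncio


async def upload_piece(node_id: int, delay: float) -> int:
    """Stand-in for uploading one piece to one node (hypothetical)."""
    await asyncio.sleep(delay)
    return node_id


async def long_tail_upload(
    delays: dict[int, float], required: int
) -> tuple[set[int], set[int]]:
    """Start one upload per node, keep the first `required`, cancel the rest."""
    tasks = {
        asyncio.create_task(upload_piece(node, delay)): node
        for node, delay in delays.items()
    }
    winners: set[int] = set()
    # Collect results in completion order until enough uploads have finished.
    for next_done in asyncio.as_completed(tasks):
        winners.add(await next_done)
        if len(winners) == required:
            break
    # Every node that did not make the cut counts as a lost race.
    losers = [task for task, node in tasks.items() if node not in winners]
    for task in losers:
        task.cancel()  # no-op for a task that already finished too late
    await asyncio.gather(*losers, return_exceptions=True)
    return winners, {tasks[task] for task in losers}


def run_demo(total: int = 110, required: int = 80) -> tuple[set[int], set[int]]:
    """Simulate `total` nodes where a higher node id means a slower node."""
    delays = {node: 0.001 * (node + 1) for node in range(total)}
    return asyncio.run(long_tail_upload(delays, required))
```

With the defaults, `run_demo()` returns 80 winning node ids and 30 canceled ones; the canceled nodes are the ones that would see the higher failure rate.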