Many nodes on the same HDD

Toyoo · May 24, 2022, 7:22pm

Solutions are always designed with some intention. That doesn’t mean the solution will actually achieve the intended effect. For vetting to be an actual verification of hardware, you’d probably need to guard against vetted nodes being moved with an effort similar to Microsoft’s attempts to prevent OEM-licensed Windows installations from being moved. And that’s a rather gargantuan task, I believe.

I assume you meant that a stable node that had time to grow might fail after being moved, losing all the growth gained after vetting finished. I understand this point of view. I just don’t find it practical. Any change to a system in operation risks introducing failures, and there are plenty of changes even without moving a node from one disk to another: intentional, well-meaning changes like operating system updates, ISP-level networking changes, storage node updates, hardware upgrades (like me recently swapping memory chips for bigger ones), etc. Early vetting is just not the tool to manage these kinds of disruptions.

Yet a large number of small nodes is actually what enables safer experimentation with these changes. It is possible to move a small node to a new setup without moving all data. If something’s wrong, it will only affect that one small node, and not the full dataset. It was thanks to moving a small node to ext4 that I was able to quickly determine that btrfs might have been at fault in this thread.

Another useful case is when there are two already vetted hardware/software stacks and the operator only wants to balance utilization of these two stacks, e.g. because they need more space for their primary purposes on one of the nodes. This is now only possible if the operator manages a large number of small nodes.

And again, there is a simple software solution to this problem: staged restarts. Storj has, by the way, implemented them recently; they might only need slight tuning to make the rollout more granular than going from 10% to 25% in a single step. So it’s not something hard to do. It’s almost just a configuration change for Storj to remedy this specific problem, and certainly easier than designing a whole new vetting process.

After a recent study session of the file walker code I noticed that its IO impact can be decently approximated by running du on the storage directories. I measured it. du takes around 1 minute per 100 GB of blobs on my simple ext4 setup, scaling pretty much linearly across my smaller and larger nodes. So the impact of the file walker should be the same whether we have a single large node or a bunch of smaller nodes with staged restart.
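To make the arithmetic concrete, here is a rough sketch of that estimate. It assumes only my measured rate of about 1 minute per 100 GB and perfectly linear scaling; `walker_minutes` is a made-up helper name for illustration, not anything from the storage node code.

```r
# Rough estimate only: assumes ~1 minute of du time per 100 GB of blobs
# (measured on one simple ext4 setup) and perfectly linear scaling.
minutes_per_100gb <- 1

# Hypothetical helper: estimated file walker duration in minutes
# for a node holding node_tb terabytes (1 TB taken as 1000 GB).
walker_minutes <- function(node_tb, rate = minutes_per_100gb) {
  node_tb * 1000 / 100 * rate
}

walker_minutes(2)                 # one 2 TB node: 20 minutes
walker_minutes(20)                # one 20 TB node: 200 minutes
sum(walker_minutes(rep(2, 10)))   # ten 2 TB nodes, one after another: 200 minutes
```

Under these assumptions the total IO is the same either way; smaller nodes only split it into shorter chunks that staged restarts can spread out.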

Let’s consider having 10 nodes on a single drive, each being 2 TB on a fresh-from-factory modern 20 TB drive. At the rate above, the file walker needs about 20 minutes per node. Having 1% steps, each taking 30 minutes, should be enough to avoid, with decent probability, any pair of nodes running their after-startup file walker at the same time. Simulation code:

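steps <- 100          # number of restart steps (1% each)
nodes <- 10           # nodes sharing the drive
replications <- 1000  # simulated rollouts
table(replicate(replications, nodes - length(unique(floor(runif(nodes) * steps)))))  # per rollout: nodes that landed in an already taken step (0 = no overlap)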
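The simulation can be cross-checked with a closed-form calculation, since this is the classic birthday problem. The sketch below assumes each node is assigned to a step uniformly and independently, and that only nodes in the same step overlap (a roughly 20-minute walk fits inside a 30-minute step). `steps_needed` is a hypothetical helper I made up for tuning, not a Storj setting.

```r
steps <- 100
nodes <- 10

# Probability that all nodes land in distinct steps:
# (steps/steps) * ((steps-1)/steps) * ... * ((steps-nodes+1)/steps)
p_distinct <- prod((steps - seq_len(nodes) + 1) / steps)
p_distinct        # about 0.63
1 - p_distinct    # chance that at least two nodes share a step

# Hypothetical helper: smallest number of steps that keeps the chance
# of any shared step at or below max_overlap.
steps_needed <- function(nodes, max_overlap) {
  s <- nodes
  while (1 - prod((s - seq_len(nodes) + 1) / s) > max_overlap) s <- s + 1
  s
}
steps_needed(10, 0.10)
```

The `0` column of the simulated table, divided by `replications`, should land close to `p_distinct`.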

Do you have a specific scenario in mind that doesn’t have the same outcome when these requests hit a single large node? A single large node also needs to support concurrent requests sharing bandwidth, disk IO and memory.