Many nodes on the same HDD

This is likely a topic for a separate thread, so please feel free to split it if you think it makes sense.

I always took the existence of the migration guide as a sign that vetting (and the held amount as well) is not exactly about the reliability of a node’s hardware, software, or network. Nodes have always been changing hardware or being moved from place to place. The Ship of Theseus analogy is a good fit here. I assume nobody here—neither the SNO nor Storj—wants to abandon a failing node that can be recovered and brought to a stable condition by just swapping some parts or moving it to a different network. The alternative would be the SNO losing revenue and Storj having to pay for repair traffic.

Instead, I thought vetting is about operator prowess. «Does the SNO have the skills necessary to set up a node?» «Can they configure hardware and software so that the node is reliable and fast enough?» We know—from many, many threads on this forum—that even the first steps might be troublesome, so the fact that a node survived vetting is more a testament to the SNO’s work than to good hardware. «Cheating» would then be a word I’d reserve for practices like selling pre-vetted nodes to other people.

And the difference between having two nodes on the same hard drive and two nodes on two different hard drives in the same computer is not big. You still share most of the hardware, all of the software and networking setup, and all of the wetware managing the nodes. At some point something might happen to the shared power supply or the motherboard, or there might be a botched Windows update. What guards against these events is not vetting, but the geographical spread of data.

Hence the statement:

doesn’t really make sense to me in this context. Besides, another of your statements also makes me ask a question:

…so if I plan from the start to have multiple nodes on a single drive, making that drive “the actual hardware” I plan to run all the nodes on, I am fine, because it’s the hardware that vetting “rightfully” tests?


I don’t recommend this, mainly because having two filewalkers going at the same time is painful.
If you ran nodes on one disk with IPs separated by /24, then this would possibly be negative for the network.


Yeah, I’ve got a staged restart script for that.

Indeed, not recommending this myself.

I can agree that vetting could be considered a test for the SNO too, but the intention was to lose as little of the customers’ data as possible during the initial period of storagenode setup and stabilization. So maybe both.
However, I disagree with the idea that it’s a test for the SNO only. The same SNO could make both a good setup and a bad setup, and perhaps the bad setup would only be discovered at scale, or during an attempt to repeat it somewhere with different hardware, software, or OS.

As I said in the original topic, running multiple nodes on the same HDD is a violation of the Node Operator Terms & Conditions, so it should not be a target setup.
And this is for a reason: using one disk for multiple nodes increases the risk of wearing it out faster, because it’s a mechanical drive and it doesn’t work well when multiple processes put a lot of stress on it, like the mentioned filewalker. So the customers’ data on these nodes is at a higher risk of being lost.
Such nodes will, with high probability, affect each other when they receive multiple requests, so they could lose the race for pieces more often, or even deny such requests, and the customer would have to retry the request, losing speed and bandwidth.
Perhaps both problems could be mitigated by using an SSD, except for the risk of losing the data of several nodes at once.

However, as I mentioned in the original thread

As much as I agree with you, the current implementation has some flaws for big nodes:

  • One node per HDD

    • But if the disk goes bad, you lose the reputation of this node and have to start all over again (vetting, held amount, etc.)
  • Running multiple nodes on the same machine, with each node assigned to an HDD

    • Unrealistic with the current vetting time. I started a new node 3 months ago; I’m at 27/100 on 2 of the satellites, and only 3 satellites have finished vetting.
      I have 16 drives to allocate to Storj; 16 nodes would take me years to finish.

The only viable option for a large node today is RAID 5/6, plus trying to slowly get a second instance running (which would take at least 6 months with the current vetting progress).
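For a rough sense of scale, here is a quick extrapolation of the vetting pace quoted above. It is only a sketch: it assumes the pace stays constant and that unvetted nodes behind the same /24 get vetted roughly one after another (both of these are my assumptions, not measurements).

audits_done     <- 27          # audits on the slowest satellites after 3 months
months_so_far   <- 3
months_per_node <- 100 / (audits_done / months_so_far)   # ~11 months to reach 100 audits
drives          <- 16
drives * months_per_node / 12                            # ~15 years if vetting is fully sequential

Even with generous overlap between nodes, the order of magnitude stays in the “years” range, which is the point being made here.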

But I guess it kind of makes sense, as 1000x1TB nodes are better for the network than 100x10TB. The issue is that the income from a 1TB node is not worth the trouble, especially if you take into account that you have to start all over when the disk goes bad.


A big node with a RAID setup in these conditions will wear out all the disks in the array and will take no less time to fill up, yet it may stay almost empty during that time compared to the size of a single HDD, so why wear out all of them?

Unfortunately it’s true for RAID5 too: with today’s disks the array could be lost during a rebuild after a single disk failure because of bitrot. With 16 disks the probability is almost 100%:

With separate disks you could lose 1/16 of the common data (it’s distributed between them), not all of it.
However, a RAID setup is different from running several nodes on one HDD.
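For context, the usual back-of-envelope estimate behind that kind of claim is based on the unrecoverable read error (URE) rating of the disks. A minimal sketch, assuming the common consumer rating of one URE per 1e14 bits and 20 TB disks (my assumptions, not figures from the linked stat):

ure_per_bit <- 1e-14                  # assumed URE rate: one error per 1e14 bits read
disk_tb     <- 20                     # assumed disk size
disks_read  <- 15                     # surviving disks read during a 16-disk RAID5 rebuild
bits_read   <- disks_read * disk_tb * 1e12 * 8
1 - (1 - ure_per_bit)^bits_read       # probability of hitting at least one URE: ~1

Enterprise disks with a better URE rating improve the odds, but an array this wide still has to read hundreds of terabytes without a single error to complete the rebuild.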

I’m not sure I understand this part. In my case those drives are already in use for other things, so there is no wear attributable to Storj alone. Furthermore, I have access to a pretty much unlimited number of disks of all kinds (magnetic and SSD), so it’s really not a worry for me.
The point for me is that with current vetting times it’s not possible to have 16 nodes behind the same IP.

I’m sorry, but as a Storage Engineer I disagree. With modern software RAID, RAID 5 or 6 (depending on individual disk size) is resilient enough. You just have to take the rebuild time into account and have decent monitoring.
And bitrot will make you lose one file at most, not the whole array.
A total failure of a parity RAID is very unlikely if you know what you are doing.

You are right, though: if we go back on topic, multiple nodes on a SINGLE HDD is a bad idea for sure.

Then it’s fine, I thought you had built a RAID for Storj only.

I agree that if you used small disks (less than 4TB), the probability would be lower. However, those are not my stats in the link above.
I have had problems with hardware RAID5 and 4-5 SCSI PRO disks smaller than 1TB in the past: we had 14 branch offices with a similar setup, and every year we had at least one failure during a rebuild in one of the branches. And this was in 2004-2012. Then we migrated to RAID10 and forgot about data loss.

If that happens during a rebuild, the rebuild process will fail: it has to read every single byte to recreate parity. In the best case you will end up with silently corrupted files. In the worst case, well, during a rebuild disks of the same age are under much more stress, and the probability of another disk failing is greatly increased. During a rebuild, RAID5 becomes RAID0, with all the consequences.

I realize I should have been more specific.

  • Hardware RAID is bad (and the article above talks about those); no matter what the vendors say, modern software parity RAID is always superior when it comes to data resiliency.
  • With parity RAID redundancy the key factor for data resiliency is rebuild time; software RAID usually uses the following concepts to help:
    • Disk “columns” to limit the spread of data to a manageable number of disks
    • Parallel rebuild to speed up the rebuild time
    • Predictive failure (this is where the monitoring is extremely important)
      • Personally, most (if not all) of the disks I removed from my arrays were not dead, but would probably have failed in the coming months. Doing this on modern RAID software does not kick off a RAID rebuild, but just moves the blocks from this disk to the others. This is much less stressful and considerably lowers the probability of having to do a RAID rebuild.
    • Throttling to limit the stress on the remaining disks during the rebuild, since the main issue is losing a second disk during the rebuild
    • And last but not least, usually in software RAID you have multiple “volumes” hosted on your physical disks, and when you rebuild you can rebuild them in the order you want (least likely to lose data first, most important first, etc.); that way, if at some point you lose a second disk, part of your data has already been saved

As a personal rule I try to keep the rebuild time below 8h for single-redundancy volumes and 24h for dual-redundancy ones.
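For a rough sense of what those targets imply, here is a back-of-envelope estimate assuming a naive whole-disk rebuild bounded by roughly 200 MB/s of sustained throughput (an assumption of mine; real rebuild times depend heavily on the implementation and on concurrent load):

disk_tb       <- c(4, 8, 20)                      # assumed disk sizes in TB
write_mb_s    <- 200                              # assumed sustained rebuild throughput, MB/s
rebuild_hours <- disk_tb * 1e6 / write_mb_s / 3600
data.frame(disk_tb, rebuild_hours = round(rebuild_hours, 1))   # ~5.6h, ~11.1h, ~27.8h

Which is roughly why a plain rebuild of a small disk fits the 8h target while a 20 TB one does not, and why the columns, parallel rebuilds, and predictive replacement listed above matter.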

That is a fact with single redundancy: you WILL get single corrupted files without a proper scrubbing process. However, you will also get those exact same corrupted files on a single drive without redundancy, so there is no real difference between the two scenarios on this point.

If you go with software RAID, you can use zfs. It not only has redundancy, it can also recover corrupted data during scrubbing or even on the fly.

The difference is that with RAID you have much higher I/O than with a single HDD due to how RAID works, which may provoke corruption more often, and the usual RAID without hash checks and auto-repair will just keep the corrupted files or make things worse.
Regarding Storj: in both cases, if such a piece is audited, the audit will fail every time. With hash checks and auto-repair the piece will be repaired before the audit.

Exactly. This is not what I use but it’s similar.

That’s the important part: a single corrupted file from time to time doesn’t hurt the network or the node, it will be repaired.

Solutions are always designed with some intention. That doesn’t mean the intended effect will actually be achieved by the solution. To make vetting an actual verification of hardware, you’d probably need to put in a similar effort to guard against vetted nodes being moved as Microsoft does to prevent OEM-licensed Windows installations from being moved. And that’s a rather gargantuan task, I believe.

I assume you meant that a stable node that has had time to grow might fail after being moved, losing all the growth gained after vetting finished. I understand this point of view, I just don’t find it practical. Any change to a system in operation carries a risk of introducing failures, and there are plenty of changes even without moving a node from one disk to another: intentional, well-meaning changes like operating system updates, ISP-level networking changes, storage node updates, hardware upgrades (like me recently swapping memory chips for bigger ones), etc. Early vetting is just not the tool to manage these kinds of disruptions.

Yet a large number of small nodes is actually what enables safer experimentation with these changes. It is possible to move a small node to a new setup without moving all the data. If something’s wrong, it will only affect that one small node and not the full dataset. It was thanks to moving a small node to ext4 that I was able to quickly determine that btrfs might have been at fault in this thread.

Another useful case is when there are two already vetted hardware/software stacks and the operator only wants to balance utilization of these two stacks, e.g. because they need more space for their primary purposes on one of the nodes. This is now only possible if the operator manages a large number of small nodes.

And again, there is a simple software solution to this problem—staged restarts. Which, by the way, Storj has implemented recently; it might only need a slight tuning, making it more granular than just going from 10% to 25% in a single step. So it’s not like it’s something hard to do; it’s actually almost just a configuration change for Storj to remedy this specific problem—certainly easier than designing a whole new vetting process.

After a recent study session of the file walker code I noticed that its IO impact can be decently approximated by running du on the storage directories. I measured it: du takes around 1 minute per 100 GB of blobs on my simple ext4 setup, scaling pretty much linearly across my smaller and larger nodes. So the impact of the file walker should be the same whether we have a single large node or a bunch of smaller nodes with staged restarts.
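To put that in perspective, a quick extrapolation from that measurement, assuming the linear scaling holds (the node sizes are just examples):

minutes_per_100gb <- 1                    # measured du pace on my ext4 setup
node_gb <- c(500, 2000, 10000)            # example node sizes in GB
data.frame(node_gb, walk_minutes = node_gb / 100 * minutes_per_100gb)   # 5, 20, 100 minutes

So a 2 TB node’s file walker finishes in roughly 20 minutes, which is why the 30-minute steps in the scenario below leave some headroom.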

Let’s consider having 10 nodes on a single drive, each being 2 TB, on a fresh-from-factory modern 20 TB drive. With 1% steps, each taking 30 minutes, it should be possible to avoid, with decent probability, any pair of nodes running their after-startup file walker process at the same time. Simulation code:

# Restart steps (1% granularity), nodes sharing the drive, simulation replications.
steps <- 100
nodes <- 10
replications <- 1000
# Each node picks a uniformly random step; count how many nodes share a step with
# another node ("collisions") and tabulate how often each collision count occurs.
table(replicate(replications, nodes - length(unique(floor(runif(nodes) * steps)))))

Do you have a specific scenario in mind that doesn’t have the same outcome when these requests hit a single large node? A single large node also needs to support concurrent requests sharing bandwidth, disk IO and memory.

One more thought. Due to a quirk we have recently discussed, the held amount is more or less capped per node. However, each node has a separate cap. Therefore a SNO who has one large node effectively has less held back than a SNO who has many small nodes. I assume this was not the intention of the held amount, yet it should make SNOs who host many small nodes more careful when sharing infrastructure!

A single 20 TB node and 10 nodes of 2 TB each will, under current conditions, earn about 80 USD per month. Yet the single node will have about 8 USD in the held amount, while the ten small nodes will have about 80 USD there. And adding an eleventh 2 TB node will increase that even further, making the escrow essentially linear in storage.
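A minimal sketch of that arithmetic, taking the per-node figures straight from this post rather than deriving them from Storj’s actual held-amount formula:

held_per_node  <- 8      # approximate held amount each node ends up with, USD (figure from the post)
monthly_income <- 80     # roughly the same for 1x20TB and for 10x2TB, USD/month
held_total <- c(one_big_node = 1, ten_small_nodes = 10) * held_per_node
held_total                      # 8 USD vs 80 USD held in total
held_total / monthly_income     # i.e. ~0.1 vs ~1 month of income held back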
