Scale up vs Scale out

I see a lot of effort being put into the performance of individual nodes: DBs on SSD, metadata caching, bloom filter sizes, file walker changes, and so on.

Surely the storagenode is ideally suited to “scale out”? If you have performance issues, just add another HDD, put a storagenode on it, and the load is (eventually) halved.

So is this an idea for the far future when devs are twiddling their thumbs? How about making this formal, so that you can add a supplementary node…
config.xml:
storagenode-group=12312323-232321-787878-234fedabc

When two or more nodes are in a storagenode group, the satellites try to balance load across them and transfer blocks from the fullest node to the least full node in the group.

This then allows other features people want:

storagenode-group=12312323-232321-787878-234fedabc
storagenode-group-role=decommission-fast

storagenode-group-role=test
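
For what it’s worth, here is a minimal Go sketch of how such settings might be declared if the node adopted them, assuming its usual pattern of config structs with help/default tags. The GroupConfig type, field names, tag texts, and defaults are all hypothetical:

```go
package main

import "fmt"

// GroupConfig is a hypothetical config struct for the proposed grouping
// feature. None of these fields exist in the current storagenode; the
// names, help texts, and defaults are illustrative only.
type GroupConfig struct {
	// GroupID ties this node to a group shared with other local nodes,
	// e.g. "12312323-232321-787878-234fedabc".
	GroupID string `help:"identifier of the storagenode group this node belongs to" default:""`

	// GroupRole hints to the satellite how the node should be treated
	// within its group, e.g. "", "decommission-fast" or "test".
	GroupRole string `help:"role of this node within its storagenode group" default:""`
}

func main() {
	// A node without grouping would simply keep the zero-value defaults.
	fmt.Printf("%+v\n", GroupConfig{})
}
```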


People generally just spin up another node that points to this new HDD. Others may use something like mergerfs to spread the load across drives without committing to something like RAID.

The ingress is spread across more drives, so there is technically less load, but when you run GC it still has to walk over all pieces of the node. So at some point, when all drives are close to full, there won’t be much benefit to adding another drive.
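
To put rough numbers on that, here is a back-of-envelope sketch; the piece density and per-drive IOPS figures below are assumptions for illustration, not measurements:

```go
package main

import "fmt"

func main() {
	// Assumed figures for illustration only.
	const piecesPerTB = 4_000_000.0 // rough piece count per TB of stored data
	const iopsPerHDD = 200.0        // random reads/sec a single HDD can sustain

	for _, drives := range []int{1, 2, 4} {
		pieces := piecesPerTB * 8 * float64(drives) // e.g. 8 TB drives, all near full
		iops := iopsPerHDD * float64(drives)
		hours := pieces / iops / 3600
		fmt.Printf("%d full drive(s): ~%.0f h to walk every piece once\n", drives, hours)
	}
	// The pieces-to-IOPS ratio is the same in every case, so a full
	// GC/filewalker pass takes roughly the same wall-clock time regardless:
	// extra full drives add capacity, not faster walks.
}
```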

It’s no better than RAID0: one disk failure and the node is gone. It’s better to run a second node on a second HDD instead, halving the risk and the load. But you are correct: if at least one node is full, all the load will go to the empty one.

You get 200 extra IOPS.


You also get a proportionally larger node. The ratio of pieces to available IOPS when the node is near full doesn’t change. If the drives are half full, then sure, there might be some benefit.

What you probably want is dual actuator drives configured in stripes.

Striping has no benefit for latency or random I/O, only sequential throughput, which is useless for Storj.

With dual actuators, I think what you might want is spanning. If data is spread out evenly, each actuator can service a request on its own half in parallel.

So for dual actuator drives, either:

  1. A single/spanned partition for better average I/O (when queue depth > 1)
  2. Run one smaller node on each half, doubling I/O, but that is likely against ToS (closest thing is to run small nodes on multiple larger drives)
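
As a toy illustration of the spanning point, here is a sketch that assumes the common dual-actuator layout where the lower half of the LBA range belongs to one actuator and the upper half to the other; the drive size and request count are made up:

```go
package main

import (
	"fmt"
	"math/rand"
)

// Toy model of a dual-actuator drive exposed as one spanned volume:
// the lower half of the LBA range is served by actuator 0, the upper
// half by actuator 1. All numbers are assumptions for illustration.
func main() {
	const capacity = 16 << 40 // 16 TB drive
	const requests = 1_000_000

	var perActuator [2]int
	for i := 0; i < requests; i++ {
		offset := rand.Int63n(capacity) // random small reads, roughly like Storj traffic
		actuator := 0
		if offset >= capacity/2 {
			actuator = 1
		}
		perActuator[actuator]++
	}
	fmt.Printf("actuator 0: %d requests, actuator 1: %d requests\n",
		perActuator[0], perActuator[1])
	// With requests spread ~50/50 and queue depth > 1, both actuators can
	// seek independently, roughly doubling random IOPS versus one actuator.
}
```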

No, this is the opposite of what I want… increasingly complex and error-prone systems, instead of just scaling out and adding another simple HDD-based storagenode.

Can’t you already achieve this using Docker? Set a lower size limit on one node to let it shrink over time as desired, and start a new node. Or start 2 new nodes – from what I’ve seen, all my nodes got their share of test traffic evenly.

That will even things out gradually with the least complexity – no shared configuration, no explicit block balancing by satellites, and you can take out or move any one node to a different system and everything keeps working.


It is generally understood that bandwidth is highest within a server, then within a cabinet, then a row, a room, a data centre, maybe an availability zone. Bandwidth is lowest between unrelated locations such as satellites or storagenodes.
Making storagenode grouping formal would allow fast balancing of node load and decommissioning of failing nodes over high-speed local networks, without huge repair costs to the satellites.
Informally, running two storagenodes on the same /24 network splits the ingest load, but the nodes can’t be balanced, decommissioned, or merged without going the long, expensive route over the public internet.
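
As a rough sketch of the informal case, assuming the satellite picks at most one node per /24 for each placement (the addresses and counts below are made up):

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

func main() {
	// Toy model of one-node-per-/24 selection: two nodes share
	// 203.0.113.0/24, a third sits alone on 198.51.100.0/24.
	nodes := []string{"203.0.113.10", "203.0.113.20", "198.51.100.5"}
	wins := map[string]int{}

	for i := 0; i < 100_000; i++ { // simulate piece placements
		bySubnet := map[string][]string{}
		for _, n := range nodes {
			subnet := n[:strings.LastIndex(n, ".")] // crude /24 key
			bySubnet[subnet] = append(bySubnet[subnet], n)
		}
		for _, members := range bySubnet {
			wins[members[rand.Intn(len(members))]]++ // at most one node per /24 gets the piece
		}
	}
	// Expect ~50k each for the two 203.0.113.x nodes and ~100k for the lone
	// node: sharing a /24 splits the ingress, it does not add any.
	fmt.Println(wins)
}
```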


I see only one problem here. This balancing could be supported only by repair (nodes no longer communicate with each other). So why would repair put a piece at risk of being in the same location as other pieces? It will not; the piece will be moved out of your setup entirely to an unrelated node.
You may do the same by shutting down one node for more than 4h (putting it at risk of DQ, of course), and it will start to lose pieces (they may be repaired to other nodes, since all offline pieces are considered unhealthy). The longer it’s offline, the more pieces it will lose (via GC, of course, but that’s still faster than just reducing the available space below the current usage), and with a constantly increasing risk of DQ with each hour, of course.

It is an idea for the future, when devs have nothing else to do. Nodes would communicate with each other within the storagenode-group, managed by the satellite of course.

An example message from a satellite could be:
“move bloomfilter|blocklist to storagenode-group node 123123123-12-3-12-3999” – audit, then delete, and when the satellite gets confirmation it just marks the blocks as being on the new node.
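
Purely as a sketch of that flow, assuming a simple request struct and handler; GroupTransferRequest, its fields, and handleGroupTransfer are all hypothetical, and nothing like them exists in the current protocol:

```go
package main

import "fmt"

// GroupTransferRequest is a hypothetical satellite -> node instruction for
// the proposed storagenode-group transfer; no such message exists today.
type GroupTransferRequest struct {
	GroupID    string   // e.g. "12312323-232321-787878-234fedabc"
	TargetNode string   // group member to receive the pieces, e.g. "123123123-12-3-12-3999"
	PieceIDs   []string // pieces (or a whole bloom-filter worth) to hand over
}

// handleGroupTransfer sketches the flow described above: send the pieces over
// the local network, wait for the target to pass an audit on them, delete the
// local copies, then let the satellite re-point its metadata.
func handleGroupTransfer(req GroupTransferRequest) error {
	for _, id := range req.PieceIDs {
		// 1. stream the piece to req.TargetNode over the LAN (omitted)
		// 2. target confirms it passes an audit for the piece (omitted)
		// 3. delete the local copy only after that confirmation (omitted)
		fmt.Printf("would move piece %s to node %s in group %s\n", id, req.TargetNode, req.GroupID)
	}
	// 4. report success so the satellite marks the pieces as living on TargetNode
	return nil
}

func main() {
	_ = handleGroupTransfer(GroupTransferRequest{
		GroupID:    "12312323-232321-787878-234fedabc",
		TargetNode: "123123123-12-3-12-3999",
		PieceIDs:   []string{"example-piece-1", "example-piece-2"},
	})
}
```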

Can SNOs do this now… just by placing the nodes behind the same IP (or at least same /24)? The satellite already balances those configs: they’d naturally share load. Most people are trying to control more IPs… but the same tech (like VPNs) can put more nodes behind the same IP.

The missing piece would be the transfer-block feature, which is a mistake anyway (it always adds cost: it’s cheapest to pay for data to be placed once… then never move it). Leave repair for actual integrity issues, not busy-work.

Yes, because repair will notice that and will try to move those pieces to other nodes. However, it won’t happen immediately.