Distributed architecture for SNOs: high availability for storage nodes

From a client perspective, the Tardigrade network has high availability (HA) by design, because each file is split into pieces spread across many nodes.

But on the SNO side the architecture is a single node, without HA. If a node dies, Tardigrade customers are not affected, but the SNO is in trouble: the node gets disqualified and the held-back amount is lost. Perhaps that operator will not join the network again.

An HA distributed architecture on the SNO side could be implemented in the following way, taking a three-node scenario as an example:

  1. if a node is configured for HA, for example as a three-node cluster, the identity stays unique but the storage node provides three different IP/port mappings to the satellites, one for every member of the cluster
  2. satellites keep a pool of destinations to connect to for a given identity: they try the first one, and if it is not available they try the second one, and so on; only no answer from any member counts as a failure (a sketch of this failover is shown after this list)
  3. the cluster member receiving a command (e.g. PUT, GET or DELETE) executes it locally and then replicates it to the other two members, much like a satellite does; this could be handled asynchronously
  4. the satellite that sent the command knows that one member is handling the request and will not forward the same request to the other members, so integrity is preserved
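
A minimal sketch of what step 2 could look like on the satellite side, written in Go. Everything here (type names, the node ID, the addresses) is an illustrative assumption, not actual satellite code:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// endpointPool holds the IP/port mappings a single node identity advertises
// for its cluster members (step 1 of the proposal above).
type endpointPool struct {
	identity  string
	endpoints []string // e.g. three addresses for a three-node cluster
}

// dialFirstAvailable tries each endpoint in order and returns the first
// connection that succeeds (step 2). Only when every endpoint fails is the
// node considered unreachable.
func (p *endpointPool) dialFirstAvailable(ctx context.Context) (net.Conn, error) {
	var lastErr error
	for _, addr := range p.endpoints {
		d := net.Dialer{Timeout: 2 * time.Second}
		conn, err := d.DialContext(ctx, "tcp", addr)
		if err == nil {
			return conn, nil // this member handles the request (step 4)
		}
		lastErr = err
	}
	return nil, fmt.Errorf("node %s unreachable on all endpoints: %w", p.identity, lastErr)
}

func main() {
	pool := endpointPool{
		identity:  "1AbCd...", // hypothetical node ID
		endpoints: []string{"203.0.113.10:28967", "203.0.113.11:28967", "203.0.113.12:28967"},
	}
	conn, err := pool.dialFirstAvailable(context.Background())
	if err != nil {
		fmt.Println("treat as a node failure:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```

The replication in step 3 would then happen behind whichever member accepted the connection, for example by queueing the PUT or DELETE for the other two members and applying it asynchronously.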

That’s just an example; there are a lot of systems which implement symmetric distributed HA without a single point of failure by replicating the storage on the backend side.

Another approach could be to implement HA only on the SNO side, leaving the satellite software untouched, by introducing a load-balanced architecture in front of the storage nodes.

This approach is not a duplicate of the HA on the client side, it’s rather a complement. Think about a typical enterprise storage environment: there are a lot of HA techniques on each architecture layer: on the app side, on the middleware side, on the server side and on the storage side with RAID. No one would decide, for example, to remove RAID redundancy on a storage box because the HA is handled by the application. A robust environment needs specific HA on every critical layer of the architecture.

If storage node operators can provide a higher level of availability, Tardigrade users will benefit from the increased availability and therefore Storj will benefit too. It’s a win-win-win.

I agree with the general idea, though I have not really analyzed the details.
Another good thing would be to allow backups of the node (the feature can have a reasonable time limit, so that I cannot restore a year-old backup).

Right now it is possible to implement HA, but without support from the node software it is more difficult to do.

1 Like

The replication should be at a different level - storage, not the application.
We should not reinvent the wheel - you can use Gluster or Ceph for that; they are robust these days and have a long history of development. That will be more reliable than introducing replication at the SN level, because we no longer have an implementation of replication in the network.
Your node can only spread its pieces across the network during Graceful Exit, and I don’t think you want that in case of a compute failure.

For compute failure protection you can use Kubernetes or Docker Swarm. Those solutions are proven in production too.

I don’t think that we should re-invent those technologies in the SN; it will be more complicated and less reliable than separate, specialized solutions.

5 Likes

One thing that would help with HA would be the ability to use MySQL for the databases. MySQL can be run in a cluster. That, combined with shared storage (NFS would not be a problem anymore since the db is not on it), could make it possible to run the node in a cluster as well.
Running multiple instances of the node with the same identity, using the same db cluster and storage, would make it much more reliable in case of failure.
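
The storagenode does not support anything like this today, but Go’s database/sql layer makes the idea easy to picture: the same query code could run against SQLite or a MySQL/Galera cluster by swapping the driver and connection string, assuming the queries were ever made dialect-compatible. A rough sketch with a made-up DSN and table (not the real storagenode schema):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // real driver, but its use here is purely illustrative
)

func main() {
	// With SQLite the node opens a local .db file; against a MySQL cluster the
	// same database/sql code would open a DSN pointing at the cluster (or its
	// load balancer). DSN, credentials and table below are invented examples.
	db, err := sql.Open("mysql", "storagenode:secret@tcp(db-cluster.local:3306)/storagenode")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var used int64
	// Example query shape only; not the actual storagenode schema.
	err = db.QueryRow("SELECT COALESCE(SUM(piece_size), 0) FROM pieces").Scan(&used)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("bytes used:", used)
}
```

The point is only that the clustering problem would then live inside MySQL, which already knows how to solve it.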

Can I do something like this now? Yes, I can use VMware HA or similar and run the VM on two hosts, but it is not as good as being able to run the node in a cluster.

In addition, the sqlite db does not work well with shared storage (SMB, NFS).

1 Like

Kubernetes or Swarm are not enough; you also need storage replication. At work we’re using Portworx, but it’s a commercial solution, not suited for normal SNOs.

GlusterFS is slow and buggy; if you want to be disqualified, that’s the right solution.

Anyway, I generally agree with you, but how many SNOs are able to set up an HA solution using one of the techniques you pointed out?

There are many products that use native replication rather than relying on replication in the underlying raw storage: etcd, Minio, Kafka, ActiveMQ, almost all databases (MongoDB, Postgres, etc.) and many other platforms all use internal replication. Did they all reinvent the wheel?

I’m already running HA; the logic is separated from the storage, all the way up to the OS. The only part that is not HA is the application itself.

I use CEPH for my Storj nodes, and while it isn’t quite as quick latency-wise as local storage using the same disks, I am sitting at approximately a 35% upload success rate and less than 0.01% audit failures. Thanks to that storage resiliency, the only downtime I take is when I have to reboot the VM hosting the storagenode after patching. I have run through over a dozen cycles of patching hosts while the CEPH storage remained available, just with degraded resiliency while a host was rebooting. I agree with @Alexey that there are solutions to cover areas of resiliency.

This was a question I asked at a town hall last year, because for SNOs with larger amounts of infrastructure it would be valuable - but really only at the application layer, as all other layers can be abstracted and made more resilient separately. Implementing stateful HA applications with a shared back end is not easy and adds a lot of complexity for Storj Labs to maintain from a code-base perspective.

I like the idea, but the vast majority of infrastructure resiliency can be achieved at the other infrastructure layers without having to modify the storagenode.

1 Like

You mean download success?

Upload in the sense used in the logs, i.e. from the customer’s perspective: my storage nodes win the race to store data 35% of the time. My download rate (customers downloading data from my nodes) is 99.8%.

It’s as simple as that - they should not. If they are knowledgeable enough to run such technologies, they will.
Also, take into consideration the cost of an HA solution.
From a practical point of view: run several nodes, one after another, starting the next when the previous one is full. If one fails, you still have the others. That is the cost of simplicity.

1 Like

This is the big thing here: if deploying an HA solution at any of the other levels of the infrastructure stack is too much, then deploying HA storage nodes will be too much as well. Setting up a Kubernetes cluster is pretty trivial in the scope of setting up highly available platforms.

Also this. If you want to run a truly highly available solution, then you are looking at the entire stack: redundant switches, multiple internet connections, a server cluster, some sort of redundant storage backend. If you aren’t doing all of that and want an application to be the only point of fault tolerance in your infrastructure, then there is really no point. If you have everything on one switch and it dies, then everything goes. If it is all going out one internet connection and that goes down, then everything goes. If you have multiple VMs running on one server and it dies, everything goes.

Then there’s bulk storage. How is data redundancy handled? Replicated volumes are too expensive - paying twice for the same amount of provided storage is ridiculous, with far too much overhead. So you need a solution that uses shards with erasure coding to ensure that losing one node doesn’t take everything out. At that point you are looking at a distributed file system like CEPH, or you could go the enterprise route, roll a SAN with redundant controllers, and pay through the nose per TB of capacity.

Essentially it comes down to this: if you aren’t willing and able to build everything else in your infrastructure to handle every failure other than a storage node, then having a highly available storage node isn’t going to help either.

Sorry for the long post, but if an HA platform is being evaluated, then everything underneath it needs to be built for that too. Banking on the application to handle all of that in a self-contained manner is asking for trouble, and it is trouble that will ultimately land on Storj Labs’ plate. Best to stick with the KISS principle when building an application, especially one that is already as complex as Storj.

Just my long-winded $0.02 on the subject.

2 Likes

I agree that full HA requires everything to be redundant. However, with a limited amount of money, you have to look at probabilities.

Can my internet connection fail? Yes, and if it does, my node could be offline for days. So I have a second connection.
Can my switch fail? Yes, however, it is not as likely as the other failures.
Can a hard drive fail? Yes, and it is likely, so I use RAID.
Can a power supply fail? Yes, that’s why my servers have two of them.
Can an external drive array fail? Yes, but it is not that likely.
Can a UPS fail? Yes, but dual-PSU servers work nicely with two separate UPSes.
and so on…

Here’s how I do HA with web servers:

  1. Multiple app servers.
  2. Two load balancers.
  3. DB cluster and, if needed, shared storage for some files.
  4. Backups.

While it is possible to run the node VM in an HA setup without any support from the software, it would be better if the software itself supported some parts of HA.
For example:
1. Currently the database is sqlite, and sqlite does not work correctly on SMB or NFS. That prevents me from just having two VMs with shared storage; I have to use iSCSI and make sure the array is never mounted twice at the same time. Being able to use either separate storage for the database, or better yet a db cluster, would help.
2. (depends on #1) If the node used a DB cluster, then I could most likely run two instances with shared storage and just have a simple HAProxy or similar balancing between them (a rough sketch of that idea follows below).
3. Right now it is not possible to have a backup of the node in case something bad does happen - not even a database dump or replication (and while database corruption threads are less frequent now, they still appear from time to time).
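
As a rough illustration of point 2, this is essentially what “a simple HAProxy or similar” would do, sketched as a tiny TCP failover proxy in Go that prefers a primary node instance and falls back to a standby. The addresses are made up, and in practice you would simply use HAProxy itself:

```go
package main

import (
	"io"
	"log"
	"net"
	"time"
)

// Backends in preference order: primary node instance first, standby second.
// The addresses are invented for the example.
var backends = []string{"10.0.0.11:28967", "10.0.0.12:28967"}

func main() {
	ln, err := net.Listen("tcp", ":28967") // the address/port the satellites see
	if err != nil {
		log.Fatal(err)
	}
	for {
		client, err := ln.Accept()
		if err != nil {
			log.Print(err)
			continue
		}
		go forward(client)
	}
}

// forward proxies one client connection to the first reachable backend.
func forward(client net.Conn) {
	defer client.Close()
	for _, addr := range backends {
		backend, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err != nil {
			continue // primary is down, try the standby
		}
		defer backend.Close()
		go io.Copy(backend, client) // client -> backend
		io.Copy(client, backend)    // backend -> client, returns when the backend closes
		return
	}
	log.Print("no backend available for ", client.RemoteAddr())
}
```

This only makes sense once the two instances can safely share the databases and the storage, which is exactly why points 1 and 2 go together.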

1 Like

HA setups have little advantage for an SNO, however. If I have three 6TB drives, why would I run a 12TB RAID5 array when I could run three 6TB nodes and have 1.5x the potential income (18TB of sellable space instead of 12TB)? Any hardware that is dedicated to providing storage redundancy is hardware that could be providing additional income.

Nodes are meant to be somewhat disposable under the current model.

or lower actual income if/when a drive fails.

Let’s see what happens when a node fails and I create a new one.

  1. I lose the escrow money, especially if my node is younger than 15 months (in my case the held amount is $232, so I stand to get $115 back once the node is old enough).
  2. The new node will start empty.
  3. The new node will have to go through the vetting process for a month, so the first month brings essentially zero income.
  4. After the vetting process ends, my node will slowly start to fill up (my current node holds 9.5TiB of data; at the current ingress rate it would take 35 days to accumulate that amount again, assuming there are no deletes and my PUT success rate is 100%).
  5. The escrow percentage goes back up and stays up for a long time.
  6. The node potentially has a lower reputation and gets less traffic than normal (I do not know if this is implemented yet).

So, when a node fails you lose a lot of income for months.
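
To put rough numbers on it (approximate, and ingress rates change): 9.5 TiB refilled over 35 days implies roughly 9.5 / 35 ≈ 0.27 TiB, or about 280 GiB, of ingress per day. With the vetting month in front of that, the node is looking at two months or more before it is back where it was, on top of the forfeited held amount.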

Also, looking after multiple nodes adds more overhead: remembering to update/monitor/expand/etc. many nodes instead of one.

1 Like

Indeed, but if your drives aren’t lasting at least 15 months maybe it’s time to change vendors… an HDD failure within 15 months should not be expected unless you are using old drives.

The additional income from the extra node(s) should be more than enough to offset all of these points.

You could consider redundancy to be an insurance policy (a reduction in risk), and every insurance policy has a premium to be paid. In this case it’s the loss of income from the additional nodes you could be running.

Of course, if a disk fails after just a few months, the policy “pays out” so to speak and your revenue is higher for a brief period than if you weren’t using any redundancy.

This is IMO the only valid point. As an SNO, using redundancy is a technique to decrease maintenance burden, not increase revenue, and this is a trade-off that each person will need to evaluate for themselves.

However, the number of SNOs interested in an HA setup beyond storage redundancy is probably extremely small.

1 Like

“Potential” income. For example, if I reconfigured my node to be RAID0 instead of RAID6 (thus gaining two drives’ worth of space), I would not get any more income; my node would just have more empty space.

What I would be really interested in is backups. At the very least keeping a snapshot, even if it is on the same pool as the node, just in case some mistake or other issue results in lost files.

Running the node in HA? It really depends on how long the node can be offline before DQ. I do not have employees to fix/reboot my servers if something happens while I am on vacation, etc.

1 Like

I read the entire thread and I basically agree with you all, but the real reason I opened this suggestion is exactly what @Pentium100 stated: if a node fails, you lose the entire escrow. For me it is about $250 at the moment.

@cdhowie said nodes should be disposable. Right, as long as you consider $250 disposable.

We all know that the main problem is disk failure: every other component in the infrastructure is stateless. I have a couple of spare switches and routers, a bunch of compute boards, a UPS, etc. I also have spare hard disk capacity, but the problem is still the same: since backup is not possible, if my HDD fails, even with just a limited number of unrecoverable bad sectors, I lose what it takes a year to earn.

So perhaps the suggestion that started this thread is not viable because Storj is not willing to invest in it, but an HA reference architecture from Storj would be highly appreciated. It would then be up to the SNOs to implement it, but at least we would know which HA solution is best from Storj’s perspective.

You’re missing the point. Given an example setup with three disks, if you run RAID5 then you are reducing your income potential by a third. In the average case, a third node is going to make you more money than spending that disk on redundancy. Yes, when a node fails you may lose some escrow, but the income from running an additional node should make up for that and then some.

Indeed. In one system I have 2x4TB disks and I ran them in a RAID1 – until the node filled up. Then I removed one of the mirrors and started a new node on it.

In other words, the insurance policy is free… until it’s not.

1 Like

That’s in the plans of Storj:

because:

and SNOs will lose money.

Source: https://storj.io/storj.pdf

So Storj recognises the need for a more robust software solution.

I think @Pentium100 is talking about backups of storage nodes, not customer-side metadata backups.