Single point of Failure if a Satellite is down?

jtolio · February 20, 2020, 5:14pm

Hey friends!

Yeah, this issue is something we’ve thought long and hard about. @nerdatwork is right that a Satellite is not a single computer, but Pentium100 seems to understand this - Satellites should be run with industry-leading high-availability techniques, just like other cloud services. But yes, if a disaster hits the right data centers, all data stored with a specific Satellite (which may itself be spread across multiple data centers) could be lost. Of course, if a user chooses to replicate their data to multiple Satellites, this problem is lessened, but I still understand why this might feel unsatisfying from a true-decentralization perspective.
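
To put a rough number on "lessened": here's a back-of-the-envelope sketch, not anything from our codebase. It assumes each Satellite loses its data independently with the same probability over some period, which is a simplification (Satellites sharing a cloud provider or region would be correlated). The function name and the 1% figure are made up for illustration.

```python
def total_loss_probability(per_satellite_loss: float, copies: int) -> float:
    """Chance that every copy is lost, assuming independent Satellite failures.

    per_satellite_loss: hypothetical probability that one Satellite loses
        its data in the period of interest (illustrative, not a measured value).
    copies: number of distinct Satellites holding a full copy of the data.
    """
    if not 0.0 <= per_satellite_loss <= 1.0:
        raise ValueError("per_satellite_loss must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # Independent events: all copies are lost only if each Satellite fails.
    return per_satellite_loss ** copies


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, total_loss_probability(0.01, n))
```

Under those assumptions, each extra Satellite multiplies the chance of total loss by the per-Satellite probability, so the risk shrinks quickly but never reaches zero.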

We think there’s a long road ahead in decentralizing the internet, and we’re trying to take pragmatic, incremental steps to get there. Here’s an excerpt about this very issue from our whitepaper, section 6.2:

6.2. Improving user experience around metadata

In our initial concrete implementation, we place significant burdens on the Satellite operator to maintain a good service level with high availability, high durability, regular payments, and regular backups. We expect a large degree of variation in quality of Satellites, which led us to implement our quality control program (see section 4.21).

Over time, clients of Satellites will want to reduce their dependence on Satellite operators and enjoy more efficient data portability between Satellites besides downloading and uploading their data manually. We plan to spend significant time on improving this user experience in a number of ways.

In the short term, we plan to build a metadata import/export system, so users can make backups of their metadata on their own and transfer their metadata between Satellites.

In the medium term, we plan to reduce the size of these exports considerably and make as much of this backup process as automatic and seamless as possible. We expect to build a system to periodically back up the major portion of the metadata directly to the network.

In the long term, we plan to architect the Satellite out of the platform. We hope to eliminate Satellite control of the metadata entirely via a viable Byzantine fault tolerant consensus algorithm, should one arise. The biggest challenge to this is finding the right balance between coordination avoidance and Byzantine fault tolerant consensus, where storage nodes can interact with one another and share encoded pieces of files while still operating within the performance levels users will expect from a platform that is competing with traditional cloud storage providers. Our team will continue to research viable means to achieve this end.

See section 2.10 and appendix A for discussions on why we aren’t tackling the Byzantine fault tolerant consensus problem right away.

The referenced sections are also worth reading. Lots of trade-offs in this field!