Single point of failure if a Satellite is down?

I’ve been running a Storj node on V3 since launch, and I’m quite pleased to be here, taking part in the testing and providing storage to the network as an SNO.

However, during this period I have noticed from time to time that a Satellite is unavailable (there are a few threads with examples, and I won’t mention specifics).

What actually happens, from a customer’s point of view, to the data they uploaded to Storj via the Satellite (region) they selected, when that Satellite is offline or temporarily unavailable?

  • Does this mean the customer is unable to access any of their data?
  • Is this a single point of failure, which would suggest centralization, even though there are hundreds of SNOs storing the pieces of any given file for that Satellite?

Is this potential single point of failure a form of centralization in Storj? What can be done to overcome this obstacle and make the network truly decentralized, even from a Satellite’s perspective?

I’d love to see an in-depth discussion with both SNOs and the Storj team.

CC: @jocelyn

Yes, if the satellite goes down, the customer cannot access their data, even though the nodes are online. And if something bad happens to the satellite itself, customers may lose access to their data permanently, even though the data is still on the nodes.

Storj is not completely decentralized (in the way Bitcoin or trackerless torrents are). The data is decentralized, so you get the performance and lower cost that comes with that, but the metadata is more or less centralized (depending on how many backup datacenters the satellite uses), and a few bombs could take it out.
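To make the data/metadata split concrete, here is a minimal sketch (not Storj’s actual implementation; the piece-ID scheme and splitting are invented) of why losing the satellite’s metadata makes files unreachable even though every piece survives on the nodes:

```python
# Toy model: nodes hold pieces under opaque IDs; only the satellite's
# metadata knows which pieces belong to which file, and in what order.

import hashlib

nodes = {}     # "storage nodes": piece_id -> piece bytes
metadata = {}  # "satellite": file path -> ordered list of piece IDs

def upload(path, data, num_pieces=4):
    """Split data into pieces, store them on nodes, record locations."""
    size = -(-len(data) // num_pieces)  # ceiling division
    piece_ids = []
    for i in range(num_pieces):
        piece = data[i * size:(i + 1) * size]
        piece_id = hashlib.sha256(path.encode() + bytes([i])).hexdigest()
        nodes[piece_id] = piece
        piece_ids.append(piece_id)
    metadata[path] = piece_ids

def download(path):
    """Reassemble a file -- impossible without the satellite's metadata."""
    return b"".join(nodes[pid] for pid in metadata[path])

upload("photos/cat.jpg", b"meow meow meow meow")
assert download("photos/cat.jpg") == b"meow meow meow meow"

metadata.clear()  # the satellite is destroyed; the nodes are untouched
assert len(nodes) == 4  # every piece is still there...
try:                    # ...but nothing maps the file to its pieces
    download("photos/cat.jpg")
except KeyError:
    print("data intact on nodes, but unreachable without metadata")
```

The pieces are opaque blobs to the nodes, so with the path-to-pieces mapping gone, there is no way to find or order them.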


I am surprised; only one reply?

I think I saw somewhere that if one satellite goes down, another one will pick up where it left off, but I could be wrong.

I do not think so; the satellites seem to be completely separate: separate billing, a node can be disqualified on one satellite but not another, and so on.

I hope you know a Satellite is NOT a single computer but a cluster of them.


Yes, but it is still centralized. While the data itself is spread out across multiple nodes and is presumably very resilient, the whole service is only as reliable as the central satellite (which, I assume, uses HA and similar measures, like other cloud services such as AWS).

So a couple of bombs on the data centers, or a strong earthquake, would result in lost data, even though the files themselves would be intact.

At least in my opinion, Storj/Tardigrade is no more decentralized than, say, AWS or similar services as far as reliability is concerned; using SNOs for the data just makes it possible to offer the same service cheaper and possibly faster. Reliability would be essentially the same.

Hey friends!

Yeah, this issue is something we’ve thought long and hard about. @nerdatwork is right that a Satellite is not a single computer, and Pentium100 seems to understand this: Satellites should be run with industry-leading techniques for high availability and so on, like other cloud services. But yes, if a disaster hits the right data centers, all data stored with a specific Satellite (which may be spread across multiple data centers) could be lost. Of course, if a user replicates their data to multiple Satellites at their discretion, this problem is lessened, but I still understand why this might feel dissatisfying from a true-decentralization perspective.
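The replication-across-Satellites idea can be sketched in a few lines (purely illustrative, user-side logic; not a real Storj client, and the satellite names are made up):

```python
# Conceptual sketch: if a user records their objects on two independent
# satellites, losing one still leaves the data reachable via the other.

satellites = {"us-central": {}, "europe-west": {}}  # per-satellite metadata

def upload_everywhere(path, piece_ids):
    """Record the object's metadata on every satellite the user trusts."""
    for meta in satellites.values():
        meta[path] = list(piece_ids)

def lookup(path):
    """Ask each satellite in turn; succeed if any one of them survives."""
    for name, meta in satellites.items():
        if path in meta:
            return name, meta[path]
    raise KeyError(path)

upload_everywhere("report.pdf", ["p1", "p2", "p3"])
satellites["us-central"].clear()     # disaster takes out one satellite
name, pieces = lookup("report.pdf")  # still reachable via europe-west
print("still reachable via", name, "->", pieces)
```

The trade-off, of course, is that the user pays for storage (and manages accounts) on each Satellite they replicate to.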

We think there’s a long road ahead in decentralizing the internet, and we’re trying to take pragmatic, incremental steps to get there. Here’s an excerpt about this very issue from our whitepaper, section 6.2:

6.2. Improving user experience around metadata

In our initial concrete implementation, we place significant burdens on the Satellite operator to maintain a good service level with high availability, high durability, regular payments, and regular backups. We expect a large degree of variation in quality of Satellites, which led us to implement our quality control program (see section 4.21).

Over time, clients of Satellites will want to reduce their dependence on Satellite operators and enjoy more efficient data portability between Satellites besides downloading and uploading their data manually. We plan to spend significant time on improving this user experience in a number of ways.

In the short term, we plan to build a metadata import/export system, so users can make backups of their metadata on their own and transfer their metadata between Satellites.
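The short-term import/export idea could look something like this (a conceptual round-trip sketch; the JSON format, version field, and function names are all invented, not Storj’s actual design):

```python
# Hypothetical metadata export/import round trip. This only illustrates
# that metadata (paths -> piece locations) is small and portable on its
# own, so a user could back it up or move it between satellites.

import json

def export_metadata(metadata):
    """Serialize satellite-side metadata to a portable JSON document."""
    return json.dumps({"version": 1, "objects": metadata}, sort_keys=True)

def import_metadata(blob):
    """Restore metadata from an export, e.g. onto a different satellite."""
    doc = json.loads(blob)
    assert doc["version"] == 1, "unknown export format version"
    return doc["objects"]

# A user's view of their own metadata: file paths mapped to piece IDs.
original = {
    "backups/2019-06.tar": ["piece-aa01", "piece-aa02", "piece-aa03"],
    "photos/cat.jpg": ["piece-bb01", "piece-bb02"],
}

backup = export_metadata(original)   # the user keeps this file safe
restored = import_metadata(backup)   # later, load it elsewhere
assert restored == original
print("metadata round trip OK:", len(restored), "objects")
```

Because the export contains only references to pieces, not the pieces themselves, it stays tiny compared to the data it describes.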

In the medium term, we plan to reduce the size of these exports considerably and make as much of this backup process as automatic and seamless as possible. We expect to build a system to periodically back up the major portion of the metadata directly to the network.

In the long term, we plan to architect the Satellite out of the platform. We hope to eliminate Satellite control of the metadata entirely via a viable Byzantine-fault tolerant consensus algorithm, should one arise. The biggest challenge to this is finding the right balance between coordination avoidance and Byzantine fault tolerant consensus, where storage nodes can interact with one another and share encoded pieces of files while still operating within the performance levels users will expect from a platform that is competing with traditional cloud storage providers. Our team will continue to research viable means to achieve this end.

See section 2.10 and appendix A for discussions on why we aren’t tackling the Byzantine fault tolerant consensus problem right away.

The referenced sections are also worth reading. Lots of trade-offs in this field!