I’m new and really like the concept of Storj and Tardigrade. I know that data is duplicated and spread as chunks across many different SNOs. But what happens when many SNOs suddenly shut down their servers?
Let’s assume a scenario:
One person sets up around 100 storage nodes, each with 1 TB of storage, in different parts of the globe, lets them fill up over many months, and then suddenly shuts down every one of them without a graceful exit.
What happens to the network? How much data is lost?
Is there any protection against a hostile takeover? For example, against enemies of decentralized storage (which could be a corporation or a government)?
Sorry if this is already discussed somewhere else; I couldn’t find anything.
That’s a really interesting question.
I’m not a specialist by any means, but the answer seems to be that your data is never 100% safe. That also applies to big data centres: you can only reduce risk, never entirely get rid of it.
Firstly, it is unlikely that someone would run hundreds of nodes with the sole purpose of removing them from the network ungracefully. I would imagine there are easier, cheaper, and quicker ways of attacking the network’s integrity.
Secondly, as the network grows and you have more and more SNOs, the risk of ONE bad actor holding a significant number of shards becomes smaller and smaller. If the enemy has 100 nodes out of a total of 200, they can cause damage. If they have 100 out of 100,000, the chance of someone’s data ending up only on the compromised nodes becomes minuscule (but still possible).
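Just to illustrate how fast that risk shrinks, here is a rough sketch in Python. I’m assuming illustrative Reed-Solomon parameters of 29 required pieces out of 80 stored (roughly the figures from the whitepaper era; actual settings may differ), each piece on a distinct random node, and the node counts are just the examples above, so treat all of it as illustrative rather than real network figures:

```python
from math import comb

def p_chunk_lost(total_nodes, bad_nodes, n=80, k=29):
    """Probability that a chunk, whose n pieces sit on n distinct random nodes,
    loses so many pieces to the attacker that fewer than k good ones remain.
    n and k are illustrative erasure-coding parameters, not confirmed values."""
    good_nodes = total_nodes - bad_nodes
    total = comb(total_nodes, n)
    lost = 0
    # The chunk is unrecoverable if more than n - k of its pieces sit on bad nodes.
    for bad_pieces in range(n - k + 1, n + 1):
        lost += comb(bad_nodes, bad_pieces) * comb(good_nodes, n - bad_pieces)
    return lost / total

print(p_chunk_lost(200, 100))      # attacker holds half the nodes: enough chunks lost to damage many files
print(p_chunk_lost(100_000, 100))  # attacker holds 0.1% of the nodes: essentially zero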
Do you want to consider a catastrophic failure or a deliberate action? Even if a lot of SNOs are angry, there is an incentive for many of them to first send data to other SNOs before shutting down their nodes: graceful exit, which pays money. It is currently available for nodes that are at least 6 months old, and one could argue that younger nodes just don’t hold enough data to matter (though I guess only Storj has the statistics to confirm that).
In case of a catastrophic failure, you can start by consulting section 7.3 of the Storj Whitepaper. It contains some math and simulations that attempt to answer your question: the parameters for the network were initially selected to offer better safety than the competition. There are three caveats when reading the paper, one working in our favor and two against:
The simulations assume the smallest time period is a month, but that works in our favor: it means they assume the soonest Storj can take action is the next month, while in practice Storj can act much more often.
The simulations talk about single chunks. Currently a single chunk is about 2.5 MB, and a single file might be composed of many chunks. Loss of any chunk is essentially a loss of the whole file. Hence if you compute the probability of losing a single chunk to be X (let’s say, 10⁻²⁰), the probability of losing a file consisting of 42 chunks will be 1-(1-X)⁴² (in this case about 4×10⁻¹⁹). The larger the file, the bigger this effect becomes.
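To make the arithmetic concrete, here’s a tiny check in Python (the per-chunk loss probability and the 42-chunk file are just the illustrative numbers from above, not real figures):

```python
import math

X = 1e-20       # assumed probability of losing any single chunk (illustrative)
chunks = 42     # number of chunks making up the file (illustrative)

# Losing ANY chunk loses the file, so the file is lost with probability 1-(1-X)^chunks.
# log1p/expm1 keep the result accurate even though X is far below float precision.
p_file_loss = -math.expm1(chunks * math.log1p(-X))
print(p_file_loss)   # ~4.2e-19
print(chunks * X)    # the small-X approximation gives essentially the same number
```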
The math doesn’t take correlation of failures into account:
Consider a case where two nodes initially hosted separately are now migrated to a single piece of hardware. For any chunk that happened to have its stripes hosted on those two nodes, before the migration a failure would take down one stripe at a time; now it will take both at once. This is pretty much equivalent to a loss of one stripe. It would be possible to simply repair the chunks affected by this problem in the regular way, but so far we have no word from Storj that this actually happens, so for the purpose of risk estimation we must assume it doesn’t.
Given a large enough number of nodes this is not a problem though, again because of all the redundancy already in the network: it is very unlikely that a given chunk will be affected by a migration like that more than once. I believe Storj has had more than enough nodes for a long time now not to be affected by this problem. However, only satellite operators (currently only Storj employees) can measure the actual correlation in the network.
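For a feel of the numbers involved, here is a small back-of-the-envelope estimate in Python. The 80 pieces per chunk, the 10,000-node network and the 500 merged node pairs are all made-up illustrative values, not measured figures:

```python
from math import comb, exp

def expected_merge_hits(total_nodes, merged_pairs, pieces_per_chunk=80):
    """Expected number of merged node pairs that both hold a piece of one given
    chunk, assuming its pieces sit on distinct, uniformly random nodes.
    All parameters here are illustrative assumptions, not measured values."""
    p_one_pair = comb(pieces_per_chunk, 2) / comb(total_nodes, 2)
    return merged_pairs * p_one_pair

lam = expected_merge_hits(10_000, 500)
print(lam)                        # ~0.03: a given chunk is rarely hit even once
print(1 - exp(-lam) * (1 + lam))  # Poisson estimate of "hit twice or more": ~5e-4
```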
One last thing: in case of a large-scale failure, raw capacity is easy and cheap to rebuild on top of regular cloud storage, and the current tooling already allows that. It won’t be as massively redundant as a proper SNO-based network, but it would be enough to survive the failure temporarily.