Uptime requirement is inadequate

Taking Tardigrade into consideration, I think it is fair to compare Storj to AWS S3. And if you look at their SLA, even that is not as strict as the requirement for a single node in Storj: https://aws.amazon.com/en/s3/sla/
What I’m trying to say is that with multiple shards spread across the network, data should remain repairable even with a much lower uptime requirement per node.
Also, graceful shutdown is not implemented, and it is not clear to me, at least, what will happen if a node is down for longer than 5 hours. Let’s assume 6 hours.
It would be nice for a node operator to be able to tell the network in advance that the node will be offline, specify the duration, and let the network prepare for and approve the time frame. A node shouldn’t be punished too hard even after 12 hours of downtime if it comes back online and all of the content is still there.
This is especially true for those who run a node at home with solid networking but may occasionally need to clean their computer, upgrade parts, or something like that.


You mixed up the SLA for customers and the SLA for components.
The storagenode is a component.
There are no comparable requirements.

At the moment, nothing: the uptime disqualification is currently disabled. However, downtime will still affect the node’s reputation.
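For a feel of how that could play out, here is a minimal sketch of an exponential-moving-average style uptime score. The decay factor, check frequency, and numbers below are my own illustrative assumptions, not Storj’s actual parameters:

```python
# Minimal sketch of an exponential-moving-average uptime score.
# The decay factor and check frequency are illustrative assumptions,
# not Storj's actual parameters.
LAMBDA = 0.99  # how much of the old score is kept per uptime check (assumed)

def update_score(score: float, online: bool) -> float:
    """Blend the result of one uptime check into the running score."""
    return LAMBDA * score + (1.0 - LAMBDA) * (1.0 if online else 0.0)

score = 1.0
# 2 days offline at one (hypothetical) check per hour:
for _ in range(48):
    score = update_score(score, online=False)
print(f"after 48 missed checks: {score:.3f}")     # ~0.617

# then a month back online:
for _ in range(24 * 30):
    score = update_score(score, online=True)
print(f"after a month back online: {score:.3f}")  # ~1.000, i.e. it recovers
```

With a score like this, downtime hurts quickly but also heals as long as the node comes back and stays online; whether the actual reputation model behaves this way is still an open question.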

You can add or vote for the idea there: https://ideas.storj.io


Not necessarily; you can and will have higher reliability when sharding across multiple nodes in different locations compared to a single node. Hence the requirements for individual nodes should be lower than for the system as a whole (e.g. S3).
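To make that concrete, here is a rough back-of-the-envelope in Python. The 29-of-80 numbers are illustrative Reed-Solomon parameters, not necessarily what any satellite uses in production:

```python
# P(segment retrievable) given per-node availability, assuming pieces sit on
# independent nodes and 29-of-80 erasure coding (illustrative parameters).
from math import comb

def segment_availability(p_node: float, k: int, n: int) -> float:
    """P(at least k of the n pieces are on nodes that are currently online)."""
    return sum(comb(n, i) * p_node**i * (1 - p_node)**(n - i)
               for i in range(k, n + 1))

for p in (0.90, 0.95, 0.99):
    print(p, segment_availability(p, k=29, n=80))
```

Even at 90% per-node uptime the segment is effectively always retrievable, which is the intuition behind the argument that per-node requirements can be looser than the end-user SLA. (This ignores durability and repair cost, which is what the fuller model discussed below accounts for.)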

I’m new to Storj, so this may be a silly question, but is the reputation recoverable? Let’s say I was online for half a year, then offline for 2 days, and then online for a year again. What will be the outcome? Will reputation recover fully after some time, or is there even such a concept here?


I suggest you read the whitepaper first; the “sharding” is already in place, for sure :wink:
There is a whole section, “2.5 Durability, device failure, and churn”, that describes the online requirements and the math behind them; see also “3.4 Redundancy”, “3.8 Data repair”, and so on.
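Roughly, the durability argument in those sections boils down to something like this; the 5% churn figure and the 29-of-80 coding parameters below are illustrative assumptions, not the whitepaper’s exact values:

```python
# P(segment permanently lost) if more than n - k pieces disappear (their nodes
# fail for good) before the repair job rebuilds it. Figures are illustrative.
from math import comb

def p_segment_lost(p_piece_lost: float, k: int, n: int) -> float:
    """P(more than n - k of the n pieces are gone before repair runs)."""
    return sum(comb(n, i) * p_piece_lost**i * (1 - p_piece_lost)**(n - i)
               for i in range(n - k + 1, n + 1))

# e.g. 5% of a segment's pieces lost between repair passes (assumed):
print(p_segment_lost(0.05, k=29, n=80))  # astronomically small
```

That headroom is what ongoing repair maintains, and it is also why the math in “3.8 Data repair” ties node churn and downtime back to real repair costs.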

Also, you can read the blog posts:

You can read the design draft here: Design draft: New way to measure SN uptimes


@nazar-pc You are right that it seems like the uptime requirement for a single node should be lower than that of the overall service itself, because the service can continue working even when many nodes are offline. However, the model is quite a bit more complicated than it may seem at first. It has to allow for nine nines of segment durability as well as the availability requirement, plus considerations like auditing, ongoing repair, and service profitability. And with the best predictions and estimates we can come up with, the model requires fairly high availability from successful nodes.
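To give a sense of one of those considerations, here is a hedged back-of-the-envelope showing how node churn turns into repair traffic. Every figure in it (stored data, churn rate, repair threshold, coding parameters) is an assumption for illustration, not a Storj number:

```python
# Back-of-the-envelope: repair traffic as a function of node churn.
# All numbers below are assumptions for illustration only.
D = 1000.0     # logical (pre-expansion) data stored on the network, TB (assumed)
k, n = 29, 80  # illustrative erasure-coding parameters
m = 35         # hypothetical repair threshold (healthy pieces that trigger repair)
churn = 0.05   # fraction of pieces that disappear per month (assumed)

# A segment loses roughly n * churn pieces per month and is repaired each time
# it falls from n healthy pieces to m, i.e. after losing n - m pieces.
repairs_per_segment_per_month = n * churn / (n - m)

# Each repair downloads k pieces (one segment's worth of data) to reconstruct,
# then re-uploads the missing n - m pieces.
repair_download = D * repairs_per_segment_per_month
repair_upload = D * repairs_per_segment_per_month * (n - m) / k

print(f"repair download ≈ {repair_download:.0f} TB/month")
print(f"repair upload   ≈ {repair_upload:.0f} TB/month")
```

Doubling the churn (or the long-downtime rate) roughly doubles that traffic, which is one reason the model still asks for fairly high availability from individual nodes even though any single segment is easy to keep retrievable.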

We do plan to add features to the storagenode software that make this requirement more manageable; for example, allowing live migration to a new hard drive without downtime. Storj wants storage node operators to succeed: without a good storage network, the whole service could fail.


I’m new to Storj, so this may be a silly question, but is the reputation recoverable? Let’s say I was online for half a year, then offline for 2 days, and then online for a year again. What will be the outcome? Will reputation recover fully after some time, or is there even such a concept here?

To be frank, the details of the disqualification and reputation-recovery policies are not yet decided. It will depend on how the network grows and the types of demand that become most popular. What we do know is that nodes which satisfy the current uptime guidance will be fine*.

  • that is, “fine” with respect to uptime disqualification and uptime effects on reputation. Of course there are lots of things a node could do to lose reputation or get disqualified in other ways, like not actually storing data or attacking the rest of the network.

Is this where we could make suggestions on downtime and recovery? I think individual nodes should have more leniency for downtime, given that the network has many redundant nodes as backup. Downtime of a few days because a home operator is away on vacation should not disqualify a node. It should depreciate its reputation and reduce its reliability score, but if the node comes back online and stays online with good audits, it should recover its reputation quickly and keep supporting the network.

One plan would be for the satellite to track the number of online nodes needed to maintain file integrity, and then disqualify only the offline nodes with the lowest reputation and the longest time offline. This concept would theoretically eliminate the need for a fixed limit on downtime hours per month while still maintaining file stability across the network. It would allow longer downtime on individual nodes, with disqualification becoming a risk only if the overall durability of the network’s files were at risk.
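As a sketch of that idea (my own illustration, not anything Storj has designed), the satellite would only consider disqualifying offline nodes once a segment’s healthy-piece count actually threatens durability, and then start with the lowest-reputation, longest-offline nodes:

```python
# Sketch of the proposed policy above. K and REPAIR_AT are illustrative
# erasure-coding/repair numbers, not Storj's real configuration.
from dataclasses import dataclass

K = 29          # minimum pieces needed to reconstruct a segment (illustrative)
REPAIR_AT = 35  # hypothetical healthy-piece count that triggers repair

@dataclass
class Node:
    id: str
    reputation: float
    hours_offline: float

def durability_at_risk(healthy_pieces: int) -> bool:
    """Only repair (and consider disqualifications) when durability is at risk."""
    return healthy_pieces <= REPAIR_AT

def disqualification_order(offline_nodes: list[Node]) -> list[Node]:
    """Lowest reputation first, longest offline first among equals."""
    return sorted(offline_nodes, key=lambda node: (node.reputation, -node.hours_offline))

nodes = [Node("a", 0.95, 6), Node("b", 0.60, 72), Node("c", 0.80, 24)]
if durability_at_risk(healthy_pieces=34):
    print([node.id for node in disqualification_order(nodes)])  # ['b', 'c', 'a']
```

A node away for a weekend would then only be at risk if the pieces it holds were also the ones keeping segments above the repair threshold.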

I see 300+ neighbors now, and my upload success rate has dropped dramatically, I think because of the network size. So there should be plenty of active nodes available to quickly repair the data of offline nodes whenever the number of remaining shards puts file integrity at risk.

Maybe Tardigrade keeps that integrity calculation and requests file-integrity maintenance based on the nodes available for the files stored.

You can read a design here: