In case you missed it - number of segments per data block to be reduced

rit · October 4, 2020, 10:58pm

The latest blog entry includes the following under "Coming Down the Pipe":

“Reed Solomon for uploads to the network: We’re fine-tuning these numbers to reduce the amount of redundancy and keep our durability consistent with our SLA of 11’9s. Our initial choice for RS numbers was very conservative and since we have data about our network for over a year, we can begin to fine-tune it to help increase our performance and decrease redundancy without sacrificing any durability of the files.”

As an SNO this will affect any calculations you may be making as it will result in fewer blocks being stored across the network for any particular amount of data stored and less repair traffic being generated when blocks need to be reformed.

On the upside, this change will not just improve performance for anyone storing data, but it will also provide Storj an improved cash flow as they will retain more of the fees charged for data storage. This may seem like a bad thing from an SNO’s point of view, but if you do the maths on the current data structures you find that Storj has not really left themselves much ‘wiggle room’ to vary the fees charged without also having to vary the amount paid to SNOs per TB of storage space and traffic.

What is not clear is what will happen with current data. Depending on the new Reed Solomon values, this change could result in the satellites deleting 10-30% of the data blocks already held by SNOs, as the repair process repairs data held across the system when nodes exit the network. The consequence may be that anyone with a large storage node sees the amount of storage used drop, because the repair process reduces the number of blocks stored faster than new blocks are added.

BrightSilence · October 4, 2020, 11:31pm

A bit of a mixed bag especially since ingress has already been dropping recently. But an understandable move.

I don’t see how this would lead to satellites removing pieces from nodes in good standing though? Can’t the repair simply create fewer new pieces when triggered while leaving the old ones intact? That way it won’t impact data already on healthy nodes.

rit · October 4, 2020, 11:58pm

That’s the reason I said it’s not clear - if they go from a calculation that generates 80 blocks down to one that uses, say, 56, it would clearly make sense to repair data using the new 56 block standard rather than maintain the 80 block standard, as there would be a monthly saving of 30% on current storage payments. With that type of saving it would be worth transforming all the data currently stored, as a background task that the satellites run when traffic volumes are light, regardless of the resulting one-off egress data payments made to SNOs.
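The 30% figure follows directly from the piece counts. A quick check, assuming storage payouts scale linearly with the number of blocks stored and that 56 is only the example value used here (the function name is illustrative):

```python
def storage_saving(old_pieces: int, new_pieces: int) -> float:
    """Fraction of stored blocks (and, by assumption, storage payout) saved."""
    return (old_pieces - new_pieces) / old_pieces


# 80 is the current count discussed in the thread; 56 is a hypothetical target.
saving = storage_saving(80, 56)
print(f"{saving:.0%}")  # 30%
```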

The thing is that it is not clear whether they plan to change the erasure code (it is currently something like 34:80, but I’m not sure what exactly) or just store fewer of the blocks generated by the current erasure code.

I think the main thing is that everyone just needs to plan for change, and this change may be rather noticeable.

BrightSilence · October 5, 2020, 12:11am

It’s 29/80 now. What you’re suggesting wouldn’t involve repair at all. They could just purge pieces over the amount of 56 from metadata and let garbage collection take care of it. I doubt that will actually happen though, since pieces above the minimum of 80 that have already finished transferring are kept as well. Eventually enough nodes will leave to drop it below 56 anyway, so there is little need to rush that, especially as it would effectively punish older and reliable nodes the hardest.

And since repair only recreates the missing pieces, letting repair deal with it like normal would not impact data on healthy nodes at all.
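To put numbers on the 29/80 figure: a minimal sketch of the expansion factor, assuming the number of pieces required to rebuild stays at 29 and only the number of stored pieces drops. The 56 is the example value from this thread, not an announced setting, and the function name is illustrative.

```python
def expansion_factor(required: int, stored: int) -> float:
    """Stored bytes per byte of original data for a required/stored scheme."""
    return stored / required


current = expansion_factor(29, 80)       # scheme quoted in the thread
hypothetical = expansion_factor(29, 56)  # assumed target, for illustration only
print(f"current: {current:.2f}x, hypothetical: {hypothetical:.2f}x")
# current: 2.76x, hypothetical: 1.93x
```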

Cmdrd · October 5, 2020, 3:57am

I’m guessing this is what I just saw with my nodes deleting ~500GB of data combined. Was wondering what the cause was.

TheMightyGreek · October 5, 2020, 6:32am

I don’t think they will go down the “repair all at once” route since repair is done on AWS servers that cost them a lot to run (especially egress).
However I do agree that it’s a good decision, I’ve always been in favor of a slightly lower payout to increase performance and utilization by clients.
I do however agree with @BrightSilence that it shouldn’t disproportionately impact big nodes that have been on the network for a long time.
I think the best way would be to let the number of pieces per data block slowly decrease by repairing the data only when that number drops below 56.
That would mean almost no repair for a while and save a huge chunk of money.

stefanbenten · October 5, 2020, 6:36am

Our repair workers are luckily not running in either AWS, GCP or Azure

As of right now, I can say that there is no plan to cause either a big garbage collection round or a heavy repair process to convert them all in one go. From my current knowledge the plan is to only update them once they have reached the repair threshold. That means that it will be a very slow process that happens over time, and it can definitely be done independently on each satellite.
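The "only update at the repair threshold" behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not satellite code: the `Segment` type, the function name, and the threshold and target values in the usage lines are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    healthy_pieces: int  # pieces currently held by nodes in good standing


def maybe_repair(segment: Segment, repair_threshold: int, new_target: int) -> int:
    """Repair only once the segment has fallen to the repair threshold.

    Returns the number of pieces uploaded. Segments above the threshold are
    left untouched, so pieces on healthy nodes are never deleted here.
    """
    if segment.healthy_pieces > repair_threshold:
        return 0
    uploaded = new_target - segment.healthy_pieces
    segment.healthy_pieces = new_target
    return uploaded


# Hypothetical numbers: threshold 35, new target 56.
old_segment = Segment(healthy_pieces=80)
print(maybe_repair(old_segment, 35, 56), old_segment.healthy_pieces)  # 0 80

worn_segment = Segment(healthy_pieces=35)
print(maybe_repair(worn_segment, 35, 56), worn_segment.healthy_pieces)  # 21 56
```

Under this policy a segment only converges on the new piece count after enough of its nodes have left, which matches the slow, per-satellite migration described in the post.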

TheMightyGreek · October 5, 2020, 6:43am

Sorry, my bad, it’s done on Hetzner or Google servers.
I was referring to this post by Alexey:

where he highlights that repair costs are exorbitant for Storj.

Toyoo · October 5, 2020, 1:35pm

Is this also why Storj became so lenient with storage node downtime? There’s likely a trade-off to be made between allowable downtime and redundancy given a fixed reliability target.
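One way to see that trade-off is a simple binomial model, sketched below. It assumes every piece sits on a node that is reachable independently with the same probability, which real networks do not guarantee, and the piece counts and probabilities are example values rather than Storj's actual parameters.

```python
from math import comb


def unavailable_probability(required: int, stored: int, p_online: float) -> float:
    """Probability that fewer than `required` of `stored` pieces are reachable.

    Assumes independent nodes, each online with probability `p_online`.
    """
    return sum(
        comb(stored, i) * p_online**i * (1 - p_online) ** (stored - i)
        for i in range(required)
    )


# Same assumed uptime, two levels of redundancy (29 pieces needed in both cases).
for stored in (80, 56):
    print(stored, unavailable_probability(29, stored, 0.9))
```

Holding the target probability fixed, storing fewer pieces leaves less room for nodes to be offline, so the allowed downtime has to shrink, or the other way around.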

BrightSilence · October 5, 2020, 8:45pm

Nope, the old uptime measurement wasn’t working reliably and was eventually disabled, while work was being done on an alternative. This became really expensive which prompted them to move repairs off GCP. So it was quite the other way around.
You’re right about the trade off though. Which is why the new system is being implemented. That’s being done really carefully though to prevent issues when too many nodes get disqualified.

Toyoo · October 7, 2020, 7:13pm

Oh, indeed. Reminds me that I wanted to ask at the last town hall whether Storj measures correlation between node downtimes in any way.