Thinking about RS ratios scrambles my brain. So this may sound like nonsense. But is this right?
Long ago the network uploaded 80 pieces, with 29 needed to recover: so that’s where a 2.7 expansion factor came from?
Then at some point 29/65 was configured… which is more like 2.2x (so the network can fit a bit more customer data in the same raw SNO space)
Now you’re testing 16/30, which is about 1.9x… so the network capacity will go up again.
And as that expansion factor goes down… it gives Storj-the-company a bit more breathing room: since you live in the margin between the $4/TB/m charged to the customer and the $1.5/TB/m for SNOs. It does increase the risk of unrecoverable data, but you've been running the network for years and carefully watching things, so I understand you can still meet your recoverability/availability guarantees?
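To sanity-check my own math, here's a tiny Go sketch of that arithmetic (the rates are just the ones quoted above; the assumption that the $1.50/TB/m is paid on the raw, post-expansion bytes is mine):

```go
package main

import "fmt"

func main() {
	// Expansion factor = pieces uploaded / pieces needed to reconstruct.
	schemes := []struct {
		name          string
		needed, total float64
	}{
		{"old 29-of-80", 29, 80},
		{"current 29-of-65", 29, 65},
		{"tested 16-of-30", 16, 30},
	}

	const customerPrice = 4.0 // $/TB/month charged to the customer
	const snoPrice = 1.5      // $/TB/month paid to SNOs (assumed: paid on raw stored bytes)

	for _, s := range schemes {
		expansion := s.total / s.needed
		// Raw storage cost to keep 1 TB of customer data for a month.
		rawCost := expansion * snoPrice
		fmt.Printf("%-18s expansion %.2fx, raw storage cost ~$%.2f/TB/m vs $%.2f revenue\n",
			s.name, expansion, rawCost, customerPrice)
	}
}
```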
If so: nice job! I really want to see Storj run with a profit every month, for many months, before the treasury tokens get too low. Good luck!
The flat part is showing the limit of the network and not the limit of your node. We still need some fine tuning. We have to balance two objectives: hit the peak load (= no flat part in the curve) and upload enough data every day to reserve enough capacity. We figured out that we can hit the peak load by increasing the choiceofn factor or increasing the long tail, but we can still change these params a few days before it goes live.

The other part, reserving enough space, requires time, so that has higher priority at the moment. I am still running the math and the data points don't make sense yet. The dashboard for management is saying the upload rate is too low and we don't reserve enough space, while our upload tool is saying we are hitting the target. So there could be some hidden bugs that we still need to find. Best case it is just a bug in our math and the dashboard is just off. Worst case the dashboard is correct and we might see some extensive garbage collection. We don't know yet, and while we keep investigating we don't worry about not hitting the peak load.
Besides that, at least some of the target customers have a decent tolerance around the peak load. Imagine the flat part were 3 hours every day. What would happen on the side of the customers is that the uploads would lag behind. So instead of a fast upload they kind of queue up a bit: initially just by a few seconds, but later we might be talking about minutes. The moment the current load reduces enough, the queue will clear and all files will have been uploaded. So we can cut off the peak load to some extent and these customers wouldn't have a problem with that. Of course we want to give the best service. We are going to communicate with these customers about what we see on our side and how it looks on their side.
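Just to picture that queueing effect, here is a toy model in Go (the 3-hour peak and all the numbers are made up purely for illustration, not measurements):

```go
package main

import "fmt"

func main() {
	// Toy model of the peak: hourly upload demand vs. network capacity
	// (arbitrary units, made up purely for illustration).
	const capacity = 100.0
	demand := []float64{80, 90, 120, 130, 125, 90, 70, 60} // three hours above capacity

	backlog := 0.0
	for hour, d := range demand {
		backlog += d - capacity
		if backlog < 0 {
			backlog = 0 // nothing queued: uploads finish as fast as they arrive
		}
		fmt.Printf("hour %d: demand %3.0f, queued backlog %3.0f\n", hour, d, backlog)
	}
}
```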
We are working on a few plans to increase throughput. One of the plans is that we rent a few servers ourselves to fill the initial gap. So if your nodes can't take the load we might just add 1000 nodes ourselves and wait for the network to keep growing over time. We would reduce this surge capacity over time and don't intend to run it for long. Just some surge capacity for times of rapid growth.
Maybe to clarify this part a bit: by "your nodes" I mean the entire network and not any single node. By increasing the choiceof factor we would increase the load on the fast nodes. That might sound like a great choice for the nodes that benefit from it, but it also has some downsides. Your single node might be very happy about filling its free space in just 2 weeks, but what do we do after that? To maintain the maximum throughput it is better to add more nodes instead of dialing in more and more on the fast nodes.
So at the end of the day I expect both to happen. We will increase the choiceof factor and also add surge capacity. That sounds like the best of both worlds.
The ideal situation would be us getting enough space reserved. At that point this kind of becomes trivial: we don't need to guess which choiceof factor, long tail or surge capacity we need; we can simply test it with the remaining nodes. We would only have to hit the peak load once, and the TTL will make sure it only gets better, not worse, from that point on.
Or… to make sure your own burst-capacity doesn’t get filled… could something like the repair system constantly be trying to empty the Storj burst nodes into the regular network? So for those few important hours per day they’re soaking up the extra data… but the rest of the day they’re slowly pushing those pieces off of themselves so they’re ready for the next surge? Kinda the opposite of a regular SNO: you’re trying for those nodes not to hold data long-term.
It’s like the rest of the SNOs are old lead-acid batteries… and you’d be running a bank of capacitors: speedy when you need it but they never have to hold a lot. Clever!
It was reduced because we had an off-by-one bug in the initial calculation. After correcting it, 65 pieces would still give us the high durability we are targeting. The long tail is still 110 nodes. That feels a bit excessive to me, especially with the new node selection. So I expect the long tail to go down, but we will most likely stick with 65 as the upload target.
With network capacity you mean free space? This RS will coexist with the previous one and only be used for specific customers, not for everything. The downside of these RS numbers is more repair traffic if the data stayed around for long. So the short TTL allows us to go down with the numbers, but only for this data.
In total you could say: old size before these RS numbers + new uploads × 1.9 expansion factor. We are not going to replace the old files with the new RS numbers. Maybe one day there will be an alternative RS number and a need to migrate old data, but my feeling is it wouldn't be an expansion factor of 1.9.
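As a back-of-the-envelope sketch of that formula (the terabyte numbers below are placeholders, not real network figures):

```go
package main

import "fmt"

func main() {
	// Placeholder numbers, only to show the formula from the post:
	// total raw space = old size (already stored under the old RS numbers)
	//                 + new uploads under the new RS numbers * 1.9
	const oldRawSize = 20000.0 // TB, hypothetical
	const newUploads = 5000.0  // TB of new customer data, hypothetical
	const newExpansion = 1.9

	totalRaw := oldRawSize + newUploads*newExpansion
	fmt.Printf("total raw space: %.0f TB\n", totalRaw) // 29500 TB
}
```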
As long as we keep the repair costs low and have enough throughput for the upload peak, yes. The expansion factor isn't really a consideration right now. Decreasing the storage expansion factor by 0.1 also means 10% more throughput with the same set of nodes. So that was the reason we optimized in that direction. We figured out it was beneficial for hitting our throughput goal.
Still ongoing. My current estimation is that we might end up with 5% repair traffic, meaning 5% of all uploaded segments will need repair before hitting the 30-day TTL. This is an early estimation and I will keep watching it. In a few weeks we will have more data to work with.
The TTL is beneficial here. We can start with any RS numbers and still change them without having to run any kind of migration. If it turns out that the repair traffic is too high, we just add a few more extra nodes to the RS numbers and within 30 days that will reduce the repair traffic. So no need to overthink it.
In terms of losing a segment, I already added one extra node. The RS numbers we are currently testing are based on some very old simulations. Some of our early load tests killed a few nodes all at the same time, so I added an additional safety margin on top and came up with a repair threshold that is +1 compared to the old simulations. Can't hurt to take that extra safety. The optimal threshold is technically 2 nodes short compared to the old simulations. I am willing to give that a try. Worst case it means more repair traffic, but as explained we can fine-tune that part on the fly as long as the repair threshold guarantees that no segments are getting lost.
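To illustrate how those thresholds relate (the concrete repair/optimal values below are hypothetical, picked only to match the 16-of-30 shape discussed earlier; the real values are still being tuned):

```go
package main

import "fmt"

// Hypothetical RS thresholds in the 16-of-30 shape discussed above; the real
// repair and optimal values are still being tuned, as explained in the post.
const (
	minimum = 16 // pieces needed to reconstruct a segment
	repair  = 20 // at or below this many healthy pieces, repair kicks in
	optimal = 30 // pieces created at upload time
)

// segmentStatus reports whether a segment should be queued for repair and
// whether it can no longer be reconstructed at all.
func segmentStatus(healthy int) (needsRepair, lost bool) {
	return healthy <= repair, healthy < minimum
}

func main() {
	for _, healthy := range []int{optimal, 22, repair, 17, 15} {
		r, l := segmentStatus(healthy)
		fmt.Printf("healthy=%2d needsRepair=%v lost=%v\n", healthy, r, l)
	}
}
```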
We still have some internal debate about the RS numbers. How much repair traffic is too much? So the current situation is that I try to measure how much repair traffic the current RS numbers will cause, and management has to figure out whether that is acceptable or too much. If it is too much we will adjust the RS numbers.
That's a smart idea, but I don't think it will work here. The TTL is a bit short for that: why repair data that will go away after 30 days anyway? All it would do is incentivise the nodes that can't take the peak load and are happy to get the data shuffled around. I would still prefer more nodes, so that we can tell our nodes to stop accepting any uploads and 30 days later they will be empty.
I think the pay is fair, and a short (less than one month) vetting period is fair… but the 9-month 75%/50%/25% holdback period may have to go.
New SNOs need to go from setting up their new node to their first payout in no more than 2 months, guaranteed. That may mean, for their very first payout only, you ignore the minimum monthly payout threshold. If you owe them 5 cents in Storj… they need to see those coins hit their wallet to know the system works. The ETH network costs for that first payout are just a cost-of-doing-business (and attracting new nodes)
I understand why the holdback exists. But at the scale Storj is reaching, repairs for disappearing nodes should just be another cost of doing business baked into their pricing. And then the whole graceful-exit system can be decommissioned.
My guess is more new SNOs would be attracted… if they didn’t face the unknowns of perhaps running for months, with variable ingress (and potential maintenance/upgrade issues)… not having any idea if things even work (because their first payout will arrive God-knows-when)
Money has to come from somewhere. If repairing the segments costs $x, then the $x can either come from the held back amount (which incentivizes running the node for longer) or from reduced payouts for everyone. This is assuming the customer is not willing to pay more.
Maybe it would be possible to have the nodes do some of the repair themselves (assemble the segment, recreate the missing pieces and distribute them), but that either has some problems I haven't thought of or is just less important than other tasks (making the network work better for a new customer, etc.).
Reduce payouts for everyone. Or more likely, as Storj experiences economies-of-scale and improved margins on Select… keep payouts the same when there would otherwise be the opportunity to increase them. Like every other cost of doing business: not every expense needs to be itemized to clients or providers.
Expanding on that a little bit (correct me if I'm wrong).
The way it works is that the old style node selection just selects more nodes than before. This is where the subnet filter still takes place. So this selects subnets and then picks a random node in that subnet.
The new part happens after that: now a bigger set of nodes is selected, and they compete against each other (with different logic depending on which competition mechanism is used) on success rate, which results in the set of nodes that actually gets used for the transfer.
So it first spreads the load over all nodes in a subnet and then each node gets to compete based on their own metrics.
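A rough Go sketch of that two-stage selection as I understand it (the names, the candidate factor and the success-rate metric are my own illustration, not the actual satellite code):

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

type Node struct {
	ID          string
	Subnet      string  // the /24 the node lives in
	SuccessRate float64 // metric the candidates compete on in stage two
}

// selectNodes models the two stages: pick want*factor distinct subnets at
// random and one random node from each (the old subnet-based selection, just
// oversized), then let that bigger candidate set compete on success rate and
// keep the best `want` nodes.
func selectNodes(all []Node, want, factor int) []Node {
	// Stage 1: group by subnet, shuffle subnets, pick one random node per subnet.
	bySubnet := map[string][]Node{}
	for _, n := range all {
		bySubnet[n.Subnet] = append(bySubnet[n.Subnet], n)
	}
	subnets := make([]string, 0, len(bySubnet))
	for s := range bySubnet {
		subnets = append(subnets, s)
	}
	rand.Shuffle(len(subnets), func(i, j int) { subnets[i], subnets[j] = subnets[j], subnets[i] })
	if len(subnets) > want*factor {
		subnets = subnets[:want*factor]
	}
	candidates := make([]Node, 0, len(subnets))
	for _, s := range subnets {
		nodes := bySubnet[s]
		candidates = append(candidates, nodes[rand.Intn(len(nodes))])
	}

	// Stage 2: the oversized candidate set competes on its own metrics.
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].SuccessRate > candidates[j].SuccessRate
	})
	if len(candidates) > want {
		candidates = candidates[:want]
	}
	return candidates
}

func main() {
	nodes := []Node{
		{"a", "10.0.1.0/24", 0.92},
		{"b", "10.0.1.0/24", 0.88},
		{"c", "10.0.2.0/24", 0.97},
		{"d", "10.0.3.0/24", 0.61},
		{"e", "10.0.4.0/24", 0.85},
	}
	fmt.Println(selectNodes(nodes, 2, 2)) // 2 nodes wanted, with 2x oversampling
}
```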
Edit: I should have read the rest of the topic first. @littleskunk already expanded on it with more detail. At least it seems I wasn’t wrong.
I would imagine that one way of stimulating organic network growth would be to inform the current SNO pool when you’re likely to have a large amount of data to store.
I am happy to expand my storage if I know it’ll get filled up.
Also, have you considered creating a sort of Plug and Play "Storj Appliance" and deploying it to a few hundred selected (or random!) SNOs with instructions to "plug it into the mains and into your router, and here you go, you can keep the profits"?
Would those surge nodes not also receive data from other customers with a longer or no TTL? I could imagine a feature to label nodes as not-preferred (similar to how they are labeled as unhealthy now), so that when repair kicks in it also moves pieces held by those nodes onto regular SNO nodes. That way you don't trigger repair until it's needed and still gradually move that data out of burst nodes. TTL doesn't really matter anymore, as most pieces with a short TTL won't hit repair anyway. Worth a thought, since I liked @Roxor's idea as well.
How many times do I need to ask for more nodes? Why didn't you expand yet? Do you see the gap now? I understand that you might want to wait with expanding until the traffic actually comes from US1 and isn't test data anymore. But that means we might have to spin up some nodes ourselves to buy you time to expand. If you are going to bring up more nodes on day 1, then yes, we can turn down our surge capacity a short time after. That's fine. But the same way you don't want to take an extra risk, we are also not willing to take an extra risk. Some surge capacity is the middle ground that allows you to wait until the last moment before expanding, while still allowing us to sign the deals with some surge capacity to fill the gaps.
I don’t understand that part. Are you asking for free hardware?
Correct me if I'm wrong, but weren't you guys saying that with the reserved capacity there would be enough online capacity and we wouldn't need to add anything else?
I was the one who asked for a timeframe on how soon you want us to add disks. The reply was “you don’t need to add anything now”.
No. The surge nodes would be part of the Storj Select program. We can tag them as surge nodes and exclude them from all other uploads except those from the customers with TTL data and the new RS numbers. It works similar to geofencing: we add geofencing to a bucket and all uploads to that bucket follow a different placement constraint. Just that this placement constraint is not a geo location. It would be a different RS number + choiceof6 + a set of surge nodes.
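Conceptually something like this (purely illustrative; not the real placement configuration format or field names):

```go
package main

import "fmt"

// Placement is an illustrative stand-in for a placement constraint: instead of
// (or on top of) a geofence, it pins a bucket to a node tag, an RS shape and a
// choice-of-n factor. All names and fields here are made up for the example.
type Placement struct {
	Name     string
	NodeTag  string // e.g. only nodes tagged "surge" may receive these uploads
	RSNeeded int
	RSTotal  int
	ChoiceOf int
}

type Node struct {
	ID   string
	Tags []string
}

// eligible reports whether a node may receive uploads for this placement.
func (p Placement) eligible(n Node) bool {
	for _, t := range n.Tags {
		if t == p.NodeTag {
			return true
		}
	}
	return false
}

func main() {
	surge := Placement{Name: "ttl-surge", NodeTag: "surge", RSNeeded: 16, RSTotal: 30, ChoiceOf: 6}
	fmt.Println(surge.eligible(Node{ID: "a", Tags: []string{"surge"}}))   // true
	fmt.Println(surge.eligible(Node{ID: "b", Tags: []string{"regular"}})) // false
}
```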