Updates on Test Data

I guess the easiest way to attract more SNOs would be to just pay them better. :wink:

Correct.

It was reduced because we had an off-by-one bug in the initial calculation. After correcting it, 65 pieces would still give us the high durability we are targeting. The long tail is still 110 nodes. That feels a bit excessive to me, especially with the new node selection. So I expect the long tail to go down, but we will most likely stick with 65 as the upload target.

By network capacity, do you mean free space? This RS will coexist with the previous one and only be used for specific customers, not for everything. The downside of this RS number is more repair traffic if the data were to stay around for long. So the short TTL allows us to go down with the numbers, but only for this data.

In total you could say: old size before this RS number + new uploads × 1.9 expansion factor. We are not going to replace the old files with the new RS numbers. Maybe one day there will be an alternative RS number and a need to migrate old data, but my feeling is it wouldn’t be at an expansion factor of 1.9.
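
To make that concrete with made-up numbers (purely illustrative, and reading “old size” as the space already consumed under the previous RS numbers): if the network held 20 PB under the old scheme and customers uploaded 5 PB of new TTL data, the raw space consumed would be roughly

$$20\ \text{PB} + 1.9 \times 5\ \text{PB} = 29.5\ \text{PB}$$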

As long as we keep the repair costs low and have enough throughput for the upload peak, yes. The expansion factor isn’t really a consideration right now. Decreasing the storage expansion factor by 0.1 also means 10% more throughput with the same set of nodes. That was the reason we optimized in that direction; we figured out it was beneficial for hitting our throughput goal.

Still ongoing. My current estimate is that we might end up with 5% repair traffic, meaning 5% of all uploaded segments will need repair before hitting the 30-day TTL. This is an early estimate and I will keep watching it. In a few weeks we will have more data to work with.

The TTL is beneficial here. We can start with any RS numbers and still change them without having to run any kind of migration. If it turns out that the repair traffic is too high, we just add a few extra nodes to the RS numbers, and within 30 days that will reduce the repair traffic. So no need to overthink it.

In terms of losing a segment, I already added one extra node. The RS numbers we are currently testing are based on some very old simulations. Some of our early load tests killed a few nodes all at the same time, so I added an additional safety margin on top and ended up with a repair threshold that is +1 compared to the old simulations. It can’t hurt to take that extra safety. The optimal threshold is technically 2 nodes lower than in the old simulations, and I am willing to give that a try. Worst case it means more repair traffic, but as explained we can fine-tune that part on the fly as long as the repair threshold guarantees that no segments get lost.
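
For anyone who wants a feel for how the repair threshold and node churn interact, here is a minimal sketch of that kind of estimate. It assumes every piece of a segment is lost independently within the TTL window, which real durability modelling does not; the repair threshold of 52 and the 13% churn are made-up placeholders, and only the 65 uploaded pieces comes from the numbers above.

```go
package main

import (
	"fmt"
	"math"
)

// binomPMF returns C(n,k) * p^k * (1-p)^(n-k), computed in log space for stability.
func binomPMF(n, k int, p float64) float64 {
	lg := func(x float64) float64 { v, _ := math.Lgamma(x); return v }
	logC := lg(float64(n)+1) - lg(float64(k)+1) - lg(float64(n-k)+1)
	return math.Exp(logC + float64(k)*math.Log(p) + float64(n-k)*math.Log(1-p))
}

// repairProbability estimates the chance that a segment needs repair within the
// TTL: the chance that the number of healthy pieces drops to or below the repair
// threshold, assuming each piece is lost independently with probability churn.
func repairProbability(uploaded, repairThreshold int, churn float64) float64 {
	survive := 1 - churn
	p := 0.0
	for healthy := 0; healthy <= repairThreshold; healthy++ {
		p += binomPMF(uploaded, healthy, survive)
	}
	return p
}

func main() {
	// 65 uploaded pieces (from the post above); the repair threshold and churn
	// rate are hypothetical values for illustration only.
	fmt.Printf("share of segments hitting repair before TTL: %.4f\n",
		repairProbability(65, 52, 0.13))
}
```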

We still have some internal debate about the RS numbers: how much repair traffic is too much? So the current situation is that I try to measure how much repair traffic the current RS numbers will cause, and management has to figure out whether that is acceptable or too much. If it is too much, we will adjust the RS numbers.

That’s a smart idea, but I don’t think it will work here. The TTL is a bit short for that. Why repair data that will go away after 30 days anyway? All it would do is incentivise the nodes that can’t take the peak load and are happy to get the data shuffled around. I would still prefer more nodes, so that we can tell our nodes to stop accepting any uploads; 30 days later they will be empty.

I think the pay is fair, and a short (less than one month) vetting period is fair… but the 9-month 75%/50%/25% holdback period may have to go.

New SNOs need to go from setting up their new node to their first payout in no more than 2 months, guaranteed. That may mean, for their very first payout only, you ignore the minimum monthly payout threshold. If you owe them 5 cents in Storj… they need to see those coins hit their wallet to know the system works. The ETH network costs for that first payout are just a cost of doing business (and of attracting new nodes).

I understand why the holdback exists. But at the scale Storj is reaching, repairs for disappearing nodes should just be another cost of doing business baked into their pricing. And then the whole graceful-exit system can be decommissioned.

My guess is more new SNOs would be attracted… if they didn’t face the unknowns of perhaps running for months, with variable ingress (and potential maintenance/upgrade issues)… not having any idea if things even work (because their first payout will arrive God-knows-when).

Money has to come from somewhere. If repairing the segments costs $x, then the $x can either come from the held back amount (which incentivizes running the node for longer) or from reduced payouts for everyone. This is assuming the customer is not willing to pay more.

Maybe it would be possible to have the nodes do some of the repair themselves (assemble the segment, recreate the missing pieces and distribute them), but that either has some problems I haven’t thought of yet, or it is just less important than other tasks (making the network work better for a new customer, etc.).

Reduce payouts for everyone. Or more likely, as Storj experiences economies of scale and improved margins on Select… keep payouts the same when there would otherwise be the opportunity to increase them. Like every other cost of doing business: not every expense needs to be itemized to clients or providers.

Expanding on that a little bit and correct me if I’m wrong.
The way it works is that the old-style node selection just selects more nodes than before. This is where the subnet filter still takes place: it selects subnets and then picks a random node in each subnet.
The new part happens after that: the bigger set of nodes compete against each other (with different logic depending on which competition mechanism is used) on success rate, which results in the set of nodes that actually get used for the transfer.
So it first spreads the load over all nodes in a subnet, and then each node gets to compete based on its own metrics.
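
As a rough sketch of that two-stage flow (not the actual satellite code; the grouping and the “compete on success rate” step are simplified stand-ins for whichever competition mechanism is configured):

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// Node is a minimal stand-in for a storage node as the satellite sees it.
type Node struct {
	ID          string
	Subnet      string  // the /24 subnet the node lives in
	SuccessRate float64 // whatever per-node metric the competition uses
}

// selectNodes sketches the two-stage selection described above:
//  1. pick up to `candidates` distinct subnets and one random node from each
//     (the old subnet filter), then
//  2. let those candidates compete on success rate and keep the best `needed`
//     (a simplified stand-in for the choice-of-n competition).
func selectNodes(all []Node, candidates, needed int) []Node {
	// Stage 1: group by subnet, pick one random node per selected subnet.
	bySubnet := map[string][]Node{}
	for _, n := range all {
		bySubnet[n.Subnet] = append(bySubnet[n.Subnet], n)
	}
	subnets := make([]string, 0, len(bySubnet))
	for s := range bySubnet {
		subnets = append(subnets, s)
	}
	rand.Shuffle(len(subnets), func(i, j int) { subnets[i], subnets[j] = subnets[j], subnets[i] })
	if len(subnets) > candidates {
		subnets = subnets[:candidates]
	}
	pool := make([]Node, 0, len(subnets))
	for _, s := range subnets {
		nodes := bySubnet[s]
		pool = append(pool, nodes[rand.Intn(len(nodes))])
	}

	// Stage 2: the bigger candidate set competes; keep the best performers.
	sort.Slice(pool, func(i, j int) bool { return pool[i].SuccessRate > pool[j].SuccessRate })
	if len(pool) > needed {
		pool = pool[:needed]
	}
	return pool
}

func main() {
	nodes := []Node{
		{"a", "1.2.3.0/24", 0.97}, {"b", "1.2.3.0/24", 0.80},
		{"c", "5.6.7.0/24", 0.91}, {"d", "9.9.9.0/24", 0.99},
	}
	for _, n := range selectNodes(nodes, 3, 2) {
		fmt.Println("selected:", n.ID)
	}
}
```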

Edit: I should have read the rest of the topic first. @littleskunk already expanded on it with more detail. At least it seems I wasn’t wrong. :stuck_out_tongue:

I would imagine that one way of stimulating organic network growth would be to inform the current SNO pool when you’re likely to have a large amount of data to store.
I am happy to expand my storage if I know it’ll get filled up.

Also, have you considered creating a sort of plug-and-play “Storj Appliance” and deploying it to a few hundred selected (or random!) SNOs with instructions to “plug it into the mains and into your router, and here you go, you can keep the profits”?

I’m afraid I have a prediction…

Would those surge nodes not also receive data from other customers with a longer TTL or no TTL? I could imagine a feature to label nodes as not-preferred (similar to how they are labeled as unhealthy now), so that if repair kicks in, it also moves pieces off those nodes onto SNO nodes. That way you don’t trigger repair until it’s needed and still gradually move that data out of burst nodes. TTL doesn’t really matter anymore, as most pieces with a short TTL won’t hit repair anyway. Worth a thought, since I liked @Roxor’s idea as well.

How many times do I need to ask for more nodes? Why haven’t you expanded yet? Do you see the gap now? I understand that you might want to wait with expanding until the traffic actually comes from US1 and isn’t test data anymore. But that means we might have to spin up some nodes ourselves to buy you time to expand. If you are going to bring up more nodes on day 1, then yes, we can turn down our surge capacity a short time after. That’s fine. But the same way you don’t want to take an extra risk, we are also not willing to take an extra risk. Some surge capacity is the middle ground that allows you to wait until the last moment before expanding, but also allows us to still sign the deals, with some surge capacity to fill the gaps.

I don’t understand that part. Are you asking for free hardware?

Correct me if I’m wrong, but weren’t you guys saying that with the reserved capacity there would be enough online capacity and we don’t need to add anything else?

I was the one that asked for a timeframe on how soon you want us to add disks. The reply was “you don’t need to add anything now”.

No. The surge nodes would be part of the Storj Select program. We can tag them as surge nodes and exclude them from all other uploads except those from these customers with TTL data and the new RS numbers. It works similar to geofencing: we add geofencing to a bucket, and all uploads to that bucket follow a different placement constraint. It’s just that this placement constraint is not a geo location. It would be a different RS number + choiceof6 + a set of surge nodes.
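
Purely as an illustration of that “geofencing-style constraint, but not geographic” idea, a bucket placement rule could be pictured as something like the following. The type and field names are made up for this sketch and are not the satellite’s real placement configuration:

```go
package placement

// RSScheme holds Reed-Solomon numbers: the minimum needed to reconstruct a
// segment, the repair trigger, the upload success target, and the long-tail total.
type RSScheme struct {
	Min, Repair, Success, Total int
}

// PlacementRule is a hypothetical bucket-level constraint. Instead of a geo
// restriction it bundles special RS numbers, a choice-of-n factor, and a node
// tag so that only tagged surge nodes are eligible for these uploads.
type PlacementRule struct {
	RS              RSScheme
	ChoiceOf        int    // e.g. 6, for the "choiceof6" selection mentioned above
	RequiredNodeTag string // e.g. "surge"
}
```

The point being that the same placement machinery that today says “only nodes in this region” could just as well say “only nodes tagged as surge, with these RS numbers and choice-of-6”.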

Pretty much.
You have to spend the money on your 1000 nodes anyway; at least this way you get the decentralisation… :man_shrugging:t2:

Yea nice try :smiley:

The difference is that giving away free hardware is at least twice as expensive as some short-term surge capacity.

Fair point, I have no idea how much either of those costs. :slight_smile:

I think he’s saying that capacity-reservation can solve the raw-space issue… but the current limitation is throughput: existing nodes have free space but can’t receive ingress fast enough?

So the request is maybe for us to have faster Internet plans (or non-SMR HDDs, or whatever else addresses slow writes). Like the request for “more nodes” is really “more network connections we can balance ingress across” (and more nodes, especially from new SNOs, could do that)?

I don’t see that on my end. If I leave the nodes raw (i.e. on the internet line I have them on), I’m nowhere near saturating my connection. If I route them through an EU-central datacenter, I instantly saturate it and keep it saturated.

That, to me, means that nodes closer to the satellite are selected more often. There is no logical scenario in my head that matches this traffic pattern: I can’t saturate my link, but routing halfway across the globe (plus added VPN overhead) gets it saturated?

Capacity, yes, but we are concerned about throughput. Bringing online 1 PB of storage on a 100 Mbit/s connection isn’t going to help. I think we filled something like 3000 nodes in the past week. So if those nodes added some extra hard drives (if possible with their setup), that would get us back to the throughput we had a week ago.

That is an old statement. We now see that the number of nodes with free space needs to stay high enough. So any additional node on a different internet connection or additional space on full nodes helps.

What is not useful is adding an additional 1 PB to the 100 Mbit/s node I mentioned above.
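
Back-of-the-envelope, that example works out to

$$100\ \text{Mbit/s} \approx 12.5\ \text{MB/s} \approx 1\ \text{TB/day}, \qquad \frac{1\ \text{PB}}{1\ \text{TB/day}} \approx 1000\ \text{days},$$

so the extra petabyte would sit mostly empty for years; additional space only helps on connections that can actually absorb the ingress.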

I can’t even predict for my own nodes how much used space I will have at the end. As explained a few posts earlier we are trying to reserve enough capacity in the network to answer that question. In the meantime we are preparing some alternative solutions to that problem. The surge capacity would be short term.

Ah, I get it. I can see they can cap my Internet connection… but I haven’t filled any HDDs. So I just watch and wait…

For now I have enough free space both on the node and in the pool. I increase the size of the virtual disk and the node when it is close to running out of space. Expanding the pool may pose a problem, but there is enough space for now.
