Updates on Test Data

I’m sure if they had a crystal ball that told them this usage pattern would be required by a large prospective customer in the future, they would have. :laughing:

It seems I’m in the same place as you are when it comes to expansion decisions. But we just come at it with a different attitude. I don’t find much use in complaining about how long it takes for them to resolve issues. The change logs clearly show they are very busy right now and need to prioritize things.

But rather than complain, I just tell them this is influencing my willingness to upgrade and leave it at that. That doesn’t mean I’ll shut up about the issues either. I’m just not willing to get agitated or be so negative about it. This is supposed to be fun for me with some compensation on the side. I can’t imagine you’re having much fun with it right now.

6 Likes

It’s not just this customer. Those issues with the filewalker and databases would have shown up much sooner. For the last few years, test data from Saltlake was minimal, if I remember correctly.
They could easily have done some general stress testing instead, to check whether the network and the node software can keep up. I mean, the plan was to grow to exabyte scale. If that’s the plan, then you should test whether the network and your node software can handle it at all. And if you look at this GitHub issue

you can see that even on the Select network there are issues on datacenter-grade nodes, even with today’s workload. I think these kinds of issues could have been detected earlier with appropriate testing.

It was fun until it became clear that the data from the nodes is all wrong and that tons of garbage has not been deleted, or is not getting deleted, for various reasons. Since then it has been an ugly mess.

2 Likes

@littleskunk About the testing setup.
Did you try any approach other than making 80 pieces and sending them to 110 nodes? In that case you waste about 25-30% of bandwidth resources on the client side and also on the node side.

Did you try, for example, making 85 sends (since you now choose the fastest nodes) and, if some pieces fail, simply re-sending those failed pieces to backup nodes? That way there would be fewer concurrent connections at the same time and more free bandwidth for the existing connections. It would prevent uploads from failing, because the process would keep going until everything is uploaded, and the long tail could also be shorter. For bulk sends this approach could use bandwidth more effectively and give a better guarantee that the file is sent on the first try.
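
As a rough Go sketch of the idea (hypothetical names, not the actual uplink code): dispatch only the required pieces plus a small surplus, and re-send a failed piece to a backup node instead of racing all 110 transfers for 80 slots from the start. A real implementation would run the transfers concurrently; the sequential loop just shows the retry logic.

```go
package upload

import "errors"

// sendPiece stands in for a real piece transfer to a single node.
type sendPiece func(pieceIndex int, node string) error

// uploadSegment sends pieceCount pieces, falling back to backup nodes for
// any piece whose primary transfer fails, until every piece is stored.
func uploadSegment(pieceCount int, primary, backup []string, send sendPiece) error {
	nextBackup := 0
	for i := 0; i < pieceCount; i++ {
		node := primary[i]
		for send(i, node) != nil {
			if nextBackup >= len(backup) {
				return errors.New("ran out of backup nodes")
			}
			node = backup[nextBackup] // retry the same piece on a backup node
			nextBackup++
		}
	}
	return nil
}
```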

This would reduce the upload speed for the customer, because it would be performed client-side.
So it would be better not to involve the client.

I understand that the test tries to optimise uploading data to the nodes as fast as possible.
Is the optimisation also the best for downloads?
If a customer needs quick retrieval, do these tests cover that too? Is the improved node selection the best in both cases?

We haven’t checked that so far. As far as I understand, it’s expected to be only uploads with a TTL. It may include downloads too, but I think we can implement it for high-speed downloads as well, if that is required.

Usually an upload maxes out the client’s upstream, and at the same time 25-30% of that traffic is unnecessary because of the 40 additional pieces. Those extra uploads also lower the upload speed on the client side. So my approach would make it possible to upload even faster, as the uploader would re-upload only the pieces that failed. Also, from time to time people complain that an upload fails because more than 40 pieces failed and 80 were not uploaded; this approach would fix that problem too, because it is better for an upload to take a little longer but succeed on the first try. It would give clients a better experience.

@Vadim
They are already testing this.

1 Like

Only if you are bypassing the /24 rule. Otherwise all these download requests will go to physically different nodes, so there is no bottleneck and the download will be much faster for the client, because it will cancel all slow connections once it has received the required minimum of pieces to reconstruct the segment.

You are speaking about node upload; I am speaking about client upload. Don’t forget that 90% of users don’t have 10-100 Gbit upload.

This is exactly the same case anyway. In the ideal case the client will connect to physically different nodes across the world, so it will cancel all slow connections when the minimum required pieces are uploaded. However, we could be risky here and set that minimum equal to the repair threshold, so all remaining pieces would be reconstructed by the repair workers. But I believe that’s too risky.
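
For reference, this is roughly what long-tail cancellation looks like as a generic Go sketch (assumed names, not the actual uplink implementation): start all transfers concurrently and cancel the stragglers as soon as the minimum number of pieces has succeeded.

```go
package longtail

import (
	"context"
	"sync"
)

// transfer stands in for a single piece upload or download.
type transfer func(ctx context.Context, pieceIndex int) error

// run starts total transfers concurrently and returns once need of them
// have succeeded, cancelling whatever is still in flight (the long tail).
func run(ctx context.Context, total, need int, do transfer) bool {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	results := make(chan error, total) // buffered so late finishers never block
	var wg sync.WaitGroup
	for i := 0; i < total; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results <- do(ctx, i)
		}(i)
	}
	go func() { wg.Wait(); close(results) }()

	succeeded := 0
	for err := range results {
		if err == nil {
			succeeded++
			if succeeded >= need {
				return true // the deferred cancel() drops the remaining slow transfers
			}
		}
	}
	return false // fewer than need transfers succeeded
}
```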

No no no, I am not talking about lowering the number of pieces below 80, only about the pieces over 80. Usually the client starts uploading 110 pieces; those 30 additional pieces should only be uploaded if some of the first 80 fail. So those additional 40 pieces are about 25-30% of the client-side bandwidth competing for speed. If the client has a 100 Mbit connection, these additional uploads take about 30 Mbit that the first 80 pieces could otherwise use to upload faster.

I think it is something to test; I am not suggesting blindly changing everything. It could also be a behaviour the client can choose to use or not on their side.

1 Like

You mean 30 additional pieces. 110-80=30 :grin:

On the question of how long it takes for an SNO to expand: for me it is usually a process of a few weeks, but unfortunately I have to go on vacation with the family, so right now I can bring another 20 nodes online in about a month.

I have to wait for delivery, actually build the nodes, and then deliver them to their assigned locations. I have a lot of places where I can put nodes (with permission), but it does take a little time. I suggest Storj keep a large enough buffer of data that can be quickly removed to cover the time it takes for most SNOs to expand. As the network grows, this buffer can then be made smaller.

I have noticed a possible bug, not that serious but still. I have historically been running nodes with the used-space filewalker disabled. Combined with a few OOM-killer oopses etc., this screws up the space calculations for the node, as we all know. I’ve been fixing this, and I notice that when the filewalker completes and the node becomes aware of more space, this is not reported right away to the satellites because there is no incoming data. If I restart the node, data begins to flow in.

1 Like

I believe the sat gets the available space once per hour. It’s by design.

5 Likes

That’s good to know. I experienced something like that yesterday but only gave it half an hour before restarting the node. Now I know. Thank you :slight_smile:

2 Likes

Sorry yes, this is correct. 30

2 Likes

Yes, with choice of two we now select 220 nodes and pick the 110 fastest of them (those with the best success rate); then the usual long-tail cancellation rule is applied.
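
A simplified Go illustration of the effect described here (the real selector may work differently, e.g. compare candidates in pairs, and the names are assumptions): oversample the candidate set and keep the subset with the best tracked success rate before the long-tail race starts.

```go
package selection

import "sort"

// Node carries the per-node statistic used for the comparison.
type Node struct {
	ID          string
	SuccessRate float64 // recently observed upload success rate
}

// chooseFastest takes an oversampled candidate list (e.g. 220 nodes) and
// keeps the need (e.g. 110) nodes with the best success rate; the usual
// long-tail cancellation then runs on top of that subset.
func chooseFastest(candidates []Node, need int) []Node {
	sorted := append([]Node(nil), candidates...) // don't mutate the caller's slice
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].SuccessRate > sorted[j].SuccessRate
	})
	if len(sorted) > need {
		sorted = sorted[:need]
	}
	return sorted
}
```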

Yes… But how do we know that the remaining 80 will be enough this exact time? We use past stats every time… and these nodes could be changing their IP right now, or the SNO may have set a rate limit and restarted the node, or the node may just be offline at this moment…
We must succeed at any cost, otherwise additional round trips will make it even worse. We want to always be fast, without any unpredictable failures along the way.
So the implemented “choice of n” feature does not solve all the other possible hiccups.

After upgrading the node VM to Debian 12, the load average (on the VM) started jumping to about 100 for a couple of minutes roughly every 20 minutes. This stopped the ingress for a while, then ingress resumed, then another load spike hit.

Enabling sync on the node made the load spikes go away, ingress went up a bit, and the load average now stays at about 6. I have posted graphs in the other thread.

I am not completely sure what caused it, but since enabling sync fixed it, it was probably the OS trying to flush too much data at once, overloading the I/O and causing the load spike.
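
For context, a generic Go illustration of the difference (not the storagenode’s own write path): with buffered writes the kernel accumulates dirty pages and flushes them in large bursts, while opening the file with O_SYNC makes every write hit stable storage immediately, spreading the I/O load at the cost of per-write latency.

```go
package syncwrites

import "os"

// writeSync writes data so that the call returns only after it has reached
// stable storage, instead of piling up dirty pages for a later bulk flush.
func writeSync(path string, data []byte) error {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_SYNC, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.Write(data)
	return err
}
```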

1 Like