Mass testing data from saltlake?

Not really, it's the opposite. The disk can't keep up → the storagenode software tries to save the day by accumulating pending writes in memory → the disk stays hopelessly busy → memory usage grows → the OS looks at the process that seems to be inhaling all the RAM as if there is no tomorrow and nukes it (to save the day as well, albeit in a murderous way).

Moving files from one place to another is a bad comparison: it goes at the maximum speed the receiving drive can sustain, and if writes slow down, reads slow down with them. Storagenode ingress is controlled from outside: if writes are slow, the bucket overflows and the node chokes. (Leaving aside the conversation about why there is no feedback loop.)
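If you want to check whether a node is in that state, here is a minimal sketch (assuming a Linux host with sysstat installed and a container named storagenode; swap in your own device and container name):

```bash
# Watch disk utilization and write latency: a disk pinned near 100% util
# with growing w_await is the "disk can't keep up" symptom described above.
iostat -x 5 /dev/sda

# In another terminal, watch the container's memory climb as pending
# writes pile up in RAM (the precursor to the OOM kill).
docker stats storagenode

# Check whether the kernel already killed the process for eating all the RAM.
dmesg | grep -i "out of memory"
```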

2 Likes

Some follow-up questions:

- What is the purpose of this test?
- Can tests like this be announced, and can SNOs be offered the option to opt out of the test without completely dropping the Salt Lake satellite?
- How long will this take?
- Do we get paid to store the test data? If so, at the same rate?

Not off the top of my head. If you have a Grafana instance up and running, it should be easy to get an IOPS dashboard going. It doesn't require any data from the storage nodes, so maybe that is an easy first step toward getting more Grafana experience.
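As a minimal sketch of that path (assuming Prometheus plus node_exporter on the host that runs the node; the job name and target address are placeholders):

```yaml
# prometheus.yml: scrape host-level metrics from the machine running the node
scrape_configs:
  - job_name: "storagenode-host"
    static_configs:
      - targets: ["192.168.1.50:9100"]   # node_exporter's default port
```

In Grafana, a panel query like `rate(node_disk_writes_completed_total[5m])` (and the matching reads counter) then gives per-device IOPS, without touching the storagenode at all.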

I am sorry I don’t have the time to explain that right now. I am a bit behind with some other more important tasks.

Unlikely. If I can't find the time to write a good announcement, why should anyone else find the time right now? We need to get to the top of the hill first to get a better view of what exactly we have to announce. Right now there are just too many unclear variables, and we are trying our best to get answers on as many of them as possible.

The data is getting uploaded with a TTL. We are meeting once per day, so I would say that if you see load now, the chances are high it will stay that way for at least one more day.

Yes and yes.

4 Likes

After investigating, I found that one of my nodes on v1.102 was definitely experiencing large ingress from Saltlake. Ten seconds after node startup, docker stats showed 200+ active PIDs, growing to about 6000 and pushing the load above 1000. To remedy this I rate-limited the number of active requests to 100 via the max-concurrent-requests parameter. The average bandwidth is still around 15 Mbit/s, and the node rejects about ten uploads every couple of seconds.

Weirdly, this doesn't happen on a different node that's also seeing big ingress from Saltlake, and its disk seems fine. I've recreated the image and container as well.
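For reference, a minimal sketch of where that limit lives (assuming the standard Docker setup; the value of 100 is just an example, tune it to your disk):

```yaml
# In config.yaml inside the storage location (restart the node afterwards):
storage2.max-concurrent-requests: 100

# Or append it as a flag after the image name in the docker run command:
#   docker run -d ... storjlabs/storagenode:latest --storage2.max-concurrent-requests=100
```

Uploads above the limit get rejected instead of queueing up in memory, which is exactly the "rejects about ten uploads every couple of seconds" behaviour above.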

Okay, that message sounds a little weird and quite unlike you. You’re usually very transparent. I can imagine wanting to test certain things without node operators being aware or reacting to details of a test, but it’s okay to just say: “We can’t say what the tests are right now.”

For what it’s worth, I’ve not seen a spike on any of my nodes. Without info I’m left to speculate that it’s either geo-restricted or perhaps something to do with QUIC or TCP Fast Open, as my current setup doesn’t support those. Or perhaps something that targets specific nodes. Hard to say, but it doesn’t seem to hit everyone.

1 Like

I guess it’s preparation for a big customer? Like a peak performance stress test?

1 Like

I also see abnormally high ingress from Saltlake on all the nodes that I didn’t GE because they are new, but it’s not high ingress overall. I get much more from US1 or EU1: for SL it’s 1.65 GB, but on US1 it’s more like 50 GB.
I don’t know how to check the number of peers.
I run Synology units and an Ubuntu box, all Exos drives on SATA connections, with enough RAM. I don’t get crashes or restarts.

Have a look at darkstat
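A minimal sketch of both options, assuming the node listens on the default port 28967 and the WAN interface is eth0 (adjust to your setup):

```bash
# Count currently established peer connections to the node port
ss -Htn state established '( sport = :28967 )' | wc -l

# Or run darkstat against the interface and browse its web UI (port 667 by default)
sudo darkstat -i eth0
```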

1 Like

It’s clearly stress testing of nodes, and there is clearly a purpose, although we don’t seem to be told why at the moment, which suggests they are looking at something they don’t want us to fiddle with to fake the results.

Like others, I’ve seen the uptick in load, although it’s well within the sort of thing we had a few years ago, when the number of nodes was in the 4-6k range and we had the SB server and the US Beta server.

The code has had the ability to select which nodes to upload to for a while now, for geo-fencing, other things which can’t be written on the forum, and probably the new Pro tier, but its uses are kind of unlimited :slight_smile:

I hope they are adding tags to identify nodes on VPNs that are faking Geo-IP and are slow to respond, so the customer can exclude these nodes. Maybe those nodes end up making up the current network at a lower $$ rate for storage, while the pure nodes, run in the spirit of one node per disk or a good private disk setup, become the new default public network that customers can use at the current $$ rate, where they don’t have to wait for the 79th percentile node to upload at 1 kbps…

:popcorn:

CP

If a node can’t keep up with ingress… that’s clearly a node problem. We’re clearly getting paid to store it. We’re clearly getting paid when a client requests to download it. What we’re seeing is clearly completely normal traffic.

What satellite is managing it… doesn’t matter.

I hope the current rate continues: perhaps we can stay ahead of forever-free account deletions! :+1:

3 Likes

Interesting :sunglasses: We clearly have different views. A node exists to support the customer by storing data quickly and making it available quickly. A node that can’t keep up on ingress/egress just causes a really negative customer experience, and it’s technically very challenging to pin that down.

I agree the extra inbound data is really good, and we get paid for it :+1: - however, I would not expect this data to hang around long; it’s probably to test out the garbage collection changes as well, as the trash folder is a mess at the moment.

I just really dislike having my disk array exercised - if I had known, I would have GE’d the Saltlake sat…

I have two nodes that connect over VPN. They are not slow; they are pretty fast. My third node is directly connected but is pretty slow, because of old cable technology. So use of a VPN does not really correlate with response time. Yes, the geo-IP would be wrong: the VPN server is not in the same city, perhaps 200 miles away.

If use of a VPN or bridges is blocked, then those nodes will be offline. There is no other way to connect them: there is carrier-grade NAT and no port forwarding available.

With more providers implementing NAT and not allowing port forwarding, more people will be required to use some sort of external bridge if they want to host services. It’s not good or bad; it’s just a fact of life.
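To illustrate what such a bridge can look like: one option (just an example, not necessarily what anyone here uses) is a reverse SSH tunnel to a cheap VPS with a public IP, with the VPS address and port as placeholders:

```bash
# On the node behind CGNAT: publish local port 28967 on the VPS's public side.
# Needs "GatewayPorts yes" in the VPS's sshd_config so the port binds publicly.
ssh -N -R 0.0.0.0:28967:localhost:28967 user@vps.example.com
```

A WireGuard tunnel with a port forward on the VPS end does the same job with less per-connection overhead; either way, the node’s contact address has to point at the VPS.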

I don’t think it is worth anyone’s time to get concerned with cheaters that deploy 70 nodes on the same Raspberry Pi using 20 VPN services, because it makes no sense to do… so much work for $5/month. Seems silly.

On the other hand, in any system there will be cheaters and abusers. Ensuring 100% compliance ranges from exponentially difficult to impossible, and it’s unnecessary anyway. As long as the number of cheaters is small, it’s best to focus efforts elsewhere.

You would have thrown away disks at the end of their service life in pristine condition?

They wear out regardless of the amount of writes or reads. You paid for them; you might as well get your money’s worth out of them.

You both are saying the same thing. Slow node — bad experience for the customer.

2 Likes

Can you not get a new cable? That might make it better, no?

Ah ok, I didn’t understand correctly. I thought they said they wanted $$ at any cost. My misunderstanding, sorry.

My definition of transparency is that I will give the stakeholders a fair chance to come up with an announcement themselves. Only if that plan doesn’t work will I step in and fill the gap myself. I want to encourage the stakeholders to reach out to the community. That would have some additional benefits over me posting something half-baked. It is a balancing act: how long can I leave you in the dark vs. how fast can I get the stakeholders to respond?

Also, I am behind on a task that might block the rest of the team. Ok, that could be an overstatement. It just sucks going to bed while shifting too much work into the next day. Squeezing in some additional tasks isn’t something I can justify to myself right now. I need to get some work done first.

7 Likes

Lol :slight_smile: No, poor wording on my side. It’s not that I’m using a crappy cable; it’s the cable technology (Data Over Cable, DOCSIS) that adds a lot of latency. Xfinity/Comcast is slowly deploying a “next-gen” network that promises 10x better upstream (up to 400 Mbps) and lower latency, but of course, it’s still not where I live.

2 Likes

Perfectly understandable. I appreciate what you do. It wasn’t intended as a negative comment. It just seemed a little out of character. Hope your work quiets down a bit. Don’t overwork yourself!

3 Likes

I think the test traffic has stopped. But still no announcement (and there won’t be one?).

I still get it on one of my nodes

This should mark this topic as resolved.

4 Likes