Should I change max-concurrent-requests? >> Note: The implementation has changed. You shouldn't use this setting any longer. Leave it commented out
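In the node's config.yaml that just means the line keeps its leading "#". A rough sketch of what that looks like (the exact key name here is written from memory, so double-check it against your own config.yaml):

```yaml
# leave the limit disabled; the key name below is quoted from memory and may
# differ slightly in your config.yaml, so treat it as a placeholder
# storage2.max-concurrent-requests: 0
```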

running raid with different drives is not a great idea tho… it will usually strain the weakest drive and thus make some drives fail before others.
the only setup i’m familiar with that can run multiple dissimilar drives is Windows Storage Spaces.
in that you can just add the drives and Windows will tier them, put the most-used data on the fastest drive, and even split files across drives to ensure data reliability for critical data.

i was planning on going that route myself, until i learned of ZFS… there is something to be said for snapshot backups, checksums on everything, and ofc CoW.

but yeah you really want just one node… if possible

Drives failing at different times is exactly what you want in a RAID, since simultaneous failure can be fatal. I would say uneven wear is not really a concern. It may impact performance though.

It is 30/10 Mbps but I don’t think I can go past 5 nodes realistically

It seems like a very bad idea to limit your nodes to 1 simultaneous connection with those speeds. Are your nodes full yet? If not, don’t add new nodes until they are. What is your nodes’ performance with this setting commented out?


Storage Spaces can run raid across different-sized drives while still utilizing all the space, from what i remember, basically like btrfs just that it actually works well … xD
never tested it myself tho…

Anyways i tested the max concurrent setting…
for my setup it doesn’t seem to affect anything, except when i go past hundreds, then for some reason it’s like my egress drops really low… not sure why…

but will give it a couple of weeks of testing to really verify that…

and config.yaml works fine… just some stuff one should remove the # from
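for example, something like this (key name quoted from memory, so double-check it against your own config.yaml before copying):

```yaml
# commented out: the built-in default applies and the limit is effectively off
# storage2.max-concurrent-requests: 0

# remove the leading "#" and give it a value to actually enable the limit
storage2.max-concurrent-requests: 25
```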

the really weird thing is that it’s not like the downloads fail or are cancelled… i just don’t get the downloads… been pondering that if i have too many connections uploading at one time… then new uploads with high bandwidth might not get full speed and so they end up being cancelled, and maybe those are often the ones that get downloaded again quickly.

5 nodes… lol i run 1 node on 400mbit fiber full duplex
like bright said, you need much more bandwidth to support your nodes… 25/5 is the minimum now… so you are way low.

and with one connection at a time… you will still get a lot of network activity that just gets rejected… it still takes up hardware capacity

I feel like I should repeat. The answer to the question posed in this topic has changed. It’s now simply NO! You shouldn’t change it. There may be very rare cases where it’s useful, but those cases are likely far from ideal setups to begin with.


Just to confirm: is my experience the very rare case that supports use of this setting, or maybe I should seek a different solution?

The numbers you showed do not warrant using this setting. Yes, the success percentages go down a little, but the total number of successful transfers is still higher than when you don’t limit it. You’re not gaining anything by limiting it, you’re actually losing business by setting it. This is the case for almost everyone.
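To illustrate with made-up numbers (not your actual stats): without a limit you might accept 1000 uploads and complete 600 of them, a 60% success rate but 600 pieces stored; with a limit you might accept only 400 and complete 320, an 80% success rate but only 320 pieces stored. The percentage looks better, yet you end up with fewer pieces and therefore less future egress income.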

Additionally, if too many nodes use this setting and start rejecting transfers it could lead to failed uploads for customers, because not enough pieces are actually accepted. So, not only is it bad for you, it’s bad for customers as well.

What I experience is I/O getting so slow as to impact the actual services run next to the storage node. Storj is the secondary concern here, as it is not a priority service on this hardware. Besides, wasn’t Storj supposed to take advantage of spare resources?

Also, now, after I collected more data, it seems that at around 25 in-flight connections I start to get fewer pieces accepted in total, not just a lower per-piece success rate. Extrapolating, if you get 1k in-flight attempts at the same time, there’s a chance none of the uploads will get accepted, reducing the success rate to flat zero.

What kind of hardware are you running on? How is your HDD connected to the node? What kind of services are you experiencing issues with?
If we know a little more it might be easier to determine what’s going on.

This node is an old laptop with a consumer HDD on SATA serving as a home server, connected with a cable to a router with 1000/1000 service. I host some small personal webapps on this laptop, and I’m sometimes running a freeciv game server. When the freeciv server starts having problems writing its savefile once per turn, it’s really worrisome :stuck_out_tongue:

Not the most ideal hardware for a node. It’s most likely the old laptop HDD that’s slowing things down significantly. Yes, I think your experience is a bit of an exception because of this setup. It would help a lot if you could run the node on a different HDD. But I guess that kind of goes against using what you already have.


Some more cute data. The top number of in-flight requests in the last 3 weeks was 776. I was getting success rates below 1% above 546 in-flight requests (got a total of 298 samples above that number, so I find it pretty representative). Show me consumer hardware that can do a good job with a flood like that…

you can get near 1k concurrent requests!!!
your max concurrent is at 25, right… how many uploads do you reject then?

1k sounds absurd when looking at my node… i mean i can run at like maybe 12 concurrent and not get any rejected uploads

Such high numbers will only happen if your node is severely bottlenecked somewhere. I had my node running at a max of 40 for a long time, even before the far better optimized dRPC was implemented, and it never hit that limit even once. So I once again have to conclude you’re seeing a fringe scenario play out.

I also noticed in your description earlier that you didn’t mention upload failed as one of the things you take into account. This could artificially increase your numbers.

Yeah, I actually consider both … canceled and … failed as the same for the purposes of computing the success rate. Forgot to write about that, as failures are very rare on this node—a total of 26 in the last 3 weeks.
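Roughly, what I count looks like this (a simplified sketch, not my actual script; the matched strings assume the usual storagenode log wording and may differ between versions):

```python
# sketch of a success-rate tally; the log phrases below are assumptions
# based on the usual storagenode output and may need adjusting
import sys

started = done = canceled = failed = 0

# pipe the node log in, e.g.: docker logs storagenode 2>&1 | python3 success_rate.py
for line in sys.stdin:
    if "upload started" in line:
        started += 1
    elif "upload canceled" in line:
        canceled += 1
    elif "upload failed" in line:
        failed += 1
    elif "uploaded" in line:  # completed uploads
        done += 1

finished = done + canceled + failed
print(f"started: {started}, success: {done}, canceled: {canceled}, failed: {failed}")
if finished:
    # canceled and failed are treated the same, as described above
    print(f"success rate: {100 * done / finished:.1f}%")
```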

And yes, this top case started with 21 uploads started in a window of 5 seconds, and then it kept piling up at a slower pace. For some reason the uplinks must have been quite patient, as usually I’d see an upload canceled quickly. But no, after about an hour of traffic like that it reached the top number. This actually makes me wonder how cheap it would be to run a denial-of-service attack against a node with this kind of patient uplinks.

i run on a 5-disk raidz1 array with an ssd L2ARC and usually my peaks in disk latency are 100-150ms
but at times that can be up to nearly 1000ms, very rarely, very briefly, and i’m looking into it…

(might be when my computer wakes up and connects to the server)

and the funny thing is that it’s actually the ssd that’s the worst bottleneck… ofc it might also see the most work…

i cannot imagine what the latency would become on a single hdd, but like you say the system becomes unresponsive… then there you go… either each task gets so little throughput that it’s worthless, or the latency is so high it’s basically cancelled before the disk even has time to respond.

i’m putting in 5 more drives to improve my node, then maybe a dedicated ZIL device to take more IO load off the ssd, and adding 24gb more ram for the same reason.
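(if anyone wants to do the same, adding those is just something along the lines of the commands below; pool name and device paths are placeholders, so check the zpool docs for your system first)

```
zpool add tank log /dev/sdX     # dedicated ZIL / SLOG device
zpool add tank cache /dev/sdY   # L2ARC read cache on the ssd
```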

so yeah your laptop server is a special case, i would set your concurrent connections to maybe 4 or 5
anything more than that i wouldn’t expect to run well…

and can i just say, my node seems to run so nicely at max-concurrent 14
it rejects one once in a while, but quite rarely… for now… but we will see what the graphs say…
but this does seem to have completely removed my issues with my ssd constantly being backlogged in the netdata monitoring system.

ofc it might just be insane traffic right now… which could be why it’s running so well, but if it breaks all my previous records then i’ll keep it for a week or two… if it doesn’t start to reject a lot.
just doesn’t seem right if my ssd causes me to get cancelled uploads because it’s overworked… xD

and i know you say it shouldn’t matter… but seriously, i set it to 0 (unlimited)…
my entire internet basically stalled, so i turned it back down before i got complaints. haven’t gotten QoS set up yet, not that it should matter much with limited connections.
i have yet to break past 40mbit ingress sustained for any meaningful period of time…
maybe today tho… it’s so close…

Is the “SMR drive” case an exception? :slight_smile:

More globally, something’s not clear to me: does this setting impact egress requests? ingress requests? or both?

I’m wondering if we should tinker with this parameter when facing issues with an SMR drive? But I guess what would be ideal is a way to limit the number of ingress requests (to go easy on disk writes) without limiting egress requests… ?

That is not possible, is it?