High load after upgrade to v. 1.3.3

Hi there,

After upgrading to v1.3.3 the load is significantly higher. Yes, I do have an 8TB SMR drive, but before the upgrade to v1.3.3 the load was below 2, and io-wait is much higher now.
Any advice would be appreciated. :wink:

You may want to look into the suggestion here.

I have similar issues; so far I'm dealing with it by giving the drive some time to breathe: every 8 hours the node is recreated with a lower amount of storage, which stops data intake. An hour later the change is reverted to allow uploads again. This amount of time seems to be enough for the SMR drive to clean up its cache area.
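
For what it's worth, here is a rough sketch of how that kind of schedule could be scripted, assuming a Docker-based node; the container name, sizes, port, mount path and RUN_ARGS below are placeholders, so reuse whatever flags your normal docker run command already has.

#!/bin/bash
# Hypothetical sketch of the "breathing room" workaround; values below are placeholders.
FULL_SIZE="8TB"       # the real allocation
REDUCED_SIZE="4TB"    # anything below current usage makes the node stop accepting uploads
RUN_ARGS="-p 28967:28967 --mount type=bind,source=/mnt/storj,destination=/app/config"  # plus your usual identity/WALLET/EMAIL/ADDRESS flags

recreate_node() {
    docker stop -t 300 storagenode
    docker rm storagenode
    docker run -d --name storagenode --restart unless-stopped \
        -e STORAGE="$1" $RUN_ARGS storjlabs/storagenode:latest
}

recreate_node "$REDUCED_SIZE"   # stop data intake
sleep 3600                      # give the SMR drive an hour to flush its cache region
recreate_node "$FULL_SIZE"      # allow uploads again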


Hi. I have had the same problem since 1.3.3.
The blame is always put on the SMR hard drives, but I would be interested in why the problem has only occurred for me since 1.3.3. It never happened before, and I already had a lot more traffic back then. (Now the hard drive is full.)

SMR drives might not do so great if filled to capacity; however, issues like that shouldn't show up just after an update… but I suppose they could…
I have noticed that if I have downtime, my ingress often increases greatly until I'm back in sync with the network… so if your update is a bit slow, then maybe that could affect it…

I mean, I had some issues the last couple of days, and when I finally got back on the network I peaked at 110 Mbit ingress with an average of nearly 5 MB/s,
but I doubt that is it.

It seems to be a trend that the 1.3.3 upgrade has caused higher loads, which is to be expected sometimes when new code is being developed…

SMR drives read just fine… but if you are near max node capacity, leave a bit free so the disk doesn't get too cluttered / fragmented… remember it has to read and move blocks around just like an SSD, so you will give it a shit ton of extra work if you fill it to 100%.
I've got no idea where the sweet spot is, though… with SSDs they say 80%, but with an SMR drive I would think the number is higher… maybe 90 or even 95% is fine… you would often feel within a few weeks if it's having trouble, because it will get slower and slower at writing… though reading should be just fine.

I would just adjust the max concurrent setting in the config.yaml to something like 10-20.
It might look a bit ugly when you boot the node, but it will keep the network from flooding your node with requests your system isn't fast enough to answer anyway…

I run at 20 with 400 Mbit fiber, 48 GB RAM, a dedicated SSD for my OS, another dedicated SSD handling SLOG and L2ARC, and 5 HDDs in raidz1.
It's a monster that eats whatever the network throws at it… and still, my local 1 Gbit network infrastructure doesn't like the strain, and I have other people using my network, so latency is a thing.
My system can keep up with the network, even if it doesn't manage to complete every upload successfully; 15% or so get cancelled, but it rejects zero… when running at max concurrent 20.
Though booting the node is a mess, because pings and cleaning orders count as concurrent requests.

Anyway, it keeps the number of concurrent actions down so that both my computer and my network connection can keep everything running smoothly.

And I know not everybody will agree (looks at Brightsilence), but I wouldn't run without it… I've turned it on and off like 15 times, sometimes leaving it off for a few days of stable activity and then turning it on again… to me it makes a noticeable difference in my performance, latency and such.

Though now that I've got my system optimized to a T, I might be able to run it at 200+ again, or unlimited at 0.
But like I noted earlier, it's most likely my network that needs better gear… I'm using old 1 Gbit Ethernet-capable routers converted into switches… or I assume switches, I suppose they could be hubs, maybe that's the issue… the server connection hops through a couple of those before hitting the fiber.

Anyway, long story short: limiting max concurrent has helped my node and infrastructure run smoothly.

Add this to the end of your config.yaml
# Maximum number of simultaneous transfers
storage2.max-concurrent-requests: 7
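
Note that the node only picks up config.yaml changes on startup, so restart it afterwards for the setting to take effect; assuming a Docker setup with the default container name, that's simply:

docker restart storagenode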

SGC goes back to check if he can finally run unlimited with his current rig

WELCOME to the network: get spammed by 20 or so upload requests at once, plus cleanup and auth procedures. My system was at 20% iowait!!! for the first 10 minutes… seems to be back down to near 0% now… but it isn't a gentle start-up… ridden hard and put up wet…

It's the large number of deletes coming through that is causing the high system load.

$ grep -i deleted /opt/storj/node.log |wc -l
419057
$ grep -i uploaded /opt/storj/node.log |wc -l
189679
$ grep -i downloaded /opt/storj/node.log |wc -l
147082

Since the beginning of May, my node has received more deletes than uploads and downloads combined.

I got a few days of those too… but it seems like I've gotten through them.
I've only got 775 deletes logged for today, though I don't think those include cleaning. If I were to hazard a guess, deletes get registered by the node when they are logged, and the pieces are then finally removed some amount of time later, during the next cleaning cycle.

I did get like 500,000 deletes or more over a few days.
Here is the successrate.sh result on the log from today.
I just set my max concurrent to 0 when I did the first post,
but I can already see a drop in my upload success rates of 0.1%…
though my system seems to be getting close to being able to keep up.

========== AUDIT ==============
Critically failed:     0
Critical Fail Rate:    0.000%
Recoverable failed:    7
Recoverable Fail Rate: 1.039%
Successful:            667
Success Rate:          98.961%
========== DOWNLOAD ===========
Failed:                91
Fail Rate:             1.650%
Canceled:              7
Cancel Rate:           0.127%
Successful:            5417
Success Rate:          98.223%
========== UPLOAD =============
Rejected:              81
Acceptance Rate:       99.859%
---------- accepted -----------
Failed:                0
Fail Rate:             0.000%
Canceled:              8855
Cancel Rate:           15.406%
Successful:            48623
Success Rate:          84.594%
========== REPAIR DOWNLOAD ====
Failed:                3
Fail Rate:             7.692%
Canceled:              0
Cancel Rate:           0.000%
Successful:            36
Success Rate:          92.308%
========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              154
Cancel Rate:           16.348%
Successful:            788
Success Rate:          83.652%
========== DELETE =============
Failed:                0
Fail Rate:             0.000%
Successful:            775
Success Rate:          100.000%

The 3 failures were because I changed something in the BIOS which didn't agree with my server, and it started shutting down randomly… seems to be fixed now, whatever it was.
I bet you that if I leave it at max concurrent 0, my network graph will start trailing downwards and never recover.
Also note how much one gets punished when returning to the network… I would fear dealing with that on a single HDD…

It also becomes a bit of a self-reinforcing cycle…
If the server falls behind, the node seems to get even more traffic directed at it, leading to higher load on the system; then for the most part it starts an upload, gets halfway, and it's cancelled, and then it's on to the next one, maybe failing that too because it's trying to handle 4 other uploads that will also end up getting cancelled…

I've most often seen performance gains from running a lower max concurrent (to a point, of course) rather than going too high, because that just lets it run into the self-reinforcing downward spiral of latency and overload death… xD

It wastes tons of bandwidth on maybe 50% cancelled ingress uploads, and maybe increases disk latency, though I would hope uploads stay in RAM until they are complete and ready to go to disk.

About 3 hours in with max concurrent = 0 / unlimited.

Doesn't look too promising…

========== AUDIT ==============
Critically failed:     0
Critical Fail Rate:    0.000%
Recoverable failed:    8
Recoverable Fail Rate: 0.995%
Successful:            796
Success Rate:          99.005%
========== DOWNLOAD ===========
Failed:                114
Fail Rate:             1.593%
Canceled:              16
Cancel Rate:           0.224%
Successful:            7027
Success Rate:          98.184%
========== UPLOAD =============
Rejected:              81
Acceptance Rate:       99.881%
---------- accepted -----------
Failed:                0
Fail Rate:             0.000%
Canceled:              10480
Cancel Rate:           15.355%
Successful:            57770
Success Rate:          84.645%
========== REPAIR DOWNLOAD ====
Failed:                3
Fail Rate:             6.818%
Canceled:              0
Cancel Rate:           0.000%
Successful:            41
Success Rate:          93.182%
========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              172
Cancel Rate:           15.665%
Successful:            926
Success Rate:          84.335%
========== DELETE =============
Failed:                0
Fail Rate:             0.000%
Successful:            889
Success Rate:          100.000%

It does seem to be holding its ground according to the success rate, but the graph tells me overall throughput is way down…

I've also gone from 16-30 ms latency peaks on my HDDs to about 20-700 ms of backlog.
My SSD has gone from peaks of a few ms, maybe 10 ms while cleaning, to an average of about 40 ms.

I'll keep it running for a bit, but I don't expect it to get any better. I know people tell me they can run unlimited, but I sure don't have the resources for it yet… I might hook up 4 additional disks in another vdev for the pool, and stripe the SLOG SSDs via partitions, and see if I can't squeeze out enough performance…

These egress numbers look interesting, though… but they usually do this peak thing and then it just ends up being an overall lower average anyway… maybe if my machine could keep up… I could get higher…

And like I said, I've tried this many times, always with the same results long term… slowly trailing downwards over days, and when I turn it off, it goes back up nearly immediately…

So yeah, I cannot stress enough how important the max concurrent setting is for smooth operation.
And on top of that, limiting it will also keep your other internet usage from being disrupted; in theory you should be able to game on the connection when running the right max concurrent setting.

I'm not saying your node should reject a ton of requests, just enough that it can actually keep its latency down and thus keep performance up…

Think of it like this… nah, I can't come up with an analogy that makes sense…
The fact is, the higher the max concurrent is, the slower the disks get, and that hurts overall performance.
If somebody thinks they can explain exactly why, then please enlighten me…

So even if the success rate remains around the same mark, the throughput goes down,
as clearly shown by experimentation. Of course this might not hold in all cases, but it sure does in mine.

I'm sure Storj will try to optimize it more eventually, but for now I am not aware of any other way than limiting the max concurrent connections in the config.yaml, as described above and discussed in the post I linked.

You do realize that this means the setting isn't making any difference, right?

All this setting does is reject uploads when you cross a certain limit. If you see no rejections, it may as well be set to unlimited.
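
A quick way to check is to count rejections in your log (reusing the log path from earlier in this thread; the exact rejection message may vary between storagenode versions, and successrate.sh also reports a "Rejected" count):

$ grep -i rejected /opt/storj/node.log | wc -l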

It's pretty rare that an update actually impacts something like this. Most of the time it's a change in traffic patterns, like a large amount of deletes. Additionally, the node can do some IO-heavy maintenance during updates. I guess that could kickstart CMR cache saturation and get your node into trouble around the time of the update. It would be nice to test this by giving your node some downtime by lowering the available space, like @hoarder does.

You keep saying that… however, I see a huge change in my storagenode's performance when I use it…
I dunno why… but it's difficult to argue with the numbers; no matter what you say, it won't change that fact.

I keep saying it because the source code doesn’t lie.

I don’t know what effect you’re seeing, but it’s not that setting.

But it's been the only thing I've changed, and because of our continued discussion about it, I've run it for like a week with and a week without to compare… I'm not saying it's the programming per se, but it clearly has an effect on how smoothly my HDDs are running, because once in a blue moon it rejects some requests, which keeps the system from getting flooded with more than it can keep up with and from entering a downward spiral of continuously bad performance.

HDDs tend to stall out if they are given too many requests at one time, which I assume is why limiting max concurrent improves performance on a system that seems stressed… I'm sure that if the system could keep up, the limit would only be detrimental… but my current system just isn't powerful enough…

I seem to be getting kinda close now, though… I just added a second SLOG SSD for even lower-latency write caching. Some of my HDDs are still having trouble, though… maybe it's the one I have been having issues with; it seems to keep giving me higher latency than I get from the others…
I've been thinking of just adding an additional vdev of 4 drives in raidz1 to the pool, to take half the load off the 5-drive raidz1 already in the pool.

That doesn't quite remove my issue with one drive giving me 100-200 ms backlogs.
I also found out I had disabled the cache on that particular HDD, so that wasn't helping either, I'm sure…
I was trying to find out why it was throwing read errors… something which ZFS seems to have fixed for now… but I might be ordering a couple of extra drives soon, so I have a couple of spares ready to go… it's pretty bad to have a raidz1 and no ability to replace a drive quickly.
So I might just replace it to be safe and find a less critical use case for it.
A pity, though; it's a nice enterprise SAS drive with fairly low hours on it…

Anyway… soon you may be right… but it still seems my system benefits from having max concurrent set to something other than unlimited… I dunno why… it just does; that's what the multiple monitoring tools I've got tell me.

And if I can't keep up… I ain't surprised that people with just moderate interest and limited hardware can't keep their setups from blowing up.

I will continue to watch this; I have another guess that could solve the problem (in my case). Minimizing the storage space would also be worth a test, but it is not a “sweet” solution ^^

I found a write cache helps a lot, but if it is an SMR drive it most likely already has something like 256 MB, which is probably more than it can use anyway…
On my ZFS setup I set sync=always so that everything goes through the now dual MLC SSD write cache (which, surprisingly enough, also helped reduce the SSDs' latency, going from one to two of them (not mirrored, obviously)).

Anyway, so everything goes to the SSD cache and then every 5 seconds it's written to the HDDs in one big sequential sweep… I found that minimizes my HDD IO and improves read latency.
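
For anyone curious, this is roughly what that setup looks like in commands; "tank" and the device paths are placeholders for your own pool and SSDs, so treat it as a sketch rather than a recipe.

# add two SSDs as a striped SLOG (separate intent log) for the pool
$ zpool add tank log /dev/disk/by-id/ssd-slog-1 /dev/disk/by-id/ssd-slog-2
# treat every write as synchronous, so it hits the SSD intent log before being acknowledged
$ zfs set sync=always tank
# dirty data is flushed to the main disks roughly every 5 seconds by default (zfs_txg_timeout on Linux)
$ cat /sys/module/zfs/parameters/zfs_txg_timeout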

I might need to get a good NVMe drive, because my SSDs just cannot keep up without getting into 30-40 ms backlog peaks.

An SSD write cache that both async and sync writes go to, and which is then flushed to disk in one go, seems to be the optimal solution for optimizing random reads while sustaining reasonable writes… at least in my case… not sure if you can implement something like that.

I was responding to this part.

Now you’re saying it’s not 0, but once in a blue moon. Going by literal definitions of different types of blue moons that’s anywhere from twice a year to once every 3 years. I’m going to guess I shouldn’t be taking you that literally. But if it’s really as infrequent as that phrasing suggests, I’d say the difference between results in 2 different weeks is much more likely to be a simple result of slightly different traffic patterns than the few times this setting actually rejects some traffic.

Either that, or it’s actually rejecting quite a few more uploads, in which case it could actually cause problems for customers.

In the end, I trust code. Code never lies and I can easily see how this setting is applied. If it doesn’t reject, it doesn’t do anything. That much I can be sure about.

It seems to happen when I have just booted up the system; otherwise it basically doesn't… unless I reboot the node, then of course it goes all crazy at 20… I would almost bet I could go a week without rejecting a single request, and then change it to unlimited and see my performance drop… I might sometimes reject a few… but it's so rare that I basically… wait, I've got logs… :smiley:
Bearing in mind that the machine has seen very heavy use the last couple of weeks, moving around 20 TB if not more, while also scrubbing a lot… I might reject 1 request per hour.
Otherwise, on a day when I wasn't using it and was just letting it mind the storagenode… it would maybe reject 5 in 24 hours.
So yeah, it does reject a few… especially at boot, that shit looks scary lol.
And those few rejects basically gave me about 50% better performance on the storagenode… which IMO is a shit ton for very little waste… the success rates go up, and the network transfer both up and down improves… sure, if it could keep up with exactly 0 rejections it would without a doubt do better…

But I'm looking at it from a real-world performance perspective… you cannot tell me Storj doesn't benefit more from my node putting up better numbers than from me taking 1 extra request an hour, which slowly, over days, stresses my system until performance eventually tanks massively.

I have little doubt that's what happens to many nodes, and it doesn't help the network one bit… in fact it hurts its overall performance greatly.

But hey, maybe it's just my poorly managed system… xD

I wonder if that initial bottleneck after starting the node could have lasting effects. With more concurrent transfers, each transfer takes longer to complete, hence the number of concurrent transfers stays higher, hence random writes become more random and concurrent, leading to more bottlenecks.

And yes, your node performing better is obviously good for customers. But the slight performance increase of a single node probably does very little to the experience of the customer. However, rejecting transfers can lead to uploads failing. This is obviously much worse for customers. My main worry around this is not the few transfers you reject now when your node reboots, as those are dependent on events that are unique to your node. My worry is that if the load on the network goes up across the board, suddenly all nodes that have a limit start rejecting transfers at the same time. This would be a coordinated failure that could really impede functioning of the network on uploads.

So I would argue based on the information I have as an outsider that you are putting the performance benefit of your own node above a risk factor for customers. I’m not blaming you for that. If this is indeed a problem, the option simply shouldn’t be there for you to use it. The network can’t rely on people playing nice. But I’m not buying the argument that it’s for the good of storj or the customers.


I find no fault in that argument, and yes, I do believe it's due to overloading the system short or long term… if the reject option is there and shouldn't be used, then it's pointless to allow it… but not allowing nodes to at least to some degree define what their system can manage wouldn't help the network as a whole either.

I didn't mean that my node makes a difference; clearly, out of the thousands of nodes it's basically irrelevant, no matter how well it performs or how powerful it is… which is kinda the whole idea of it being distributed.

But I mean that if this is a problem affecting many nodes, it could greatly affect the network performance as a whole… if I can lose 33-50% of my traffic due to taking too many requests, then the overall network might be able to gain about that same amount in performance, as I doubt most nodes can keep up with my gear, even if it's mostly antiquated from a technology standpoint.

If rejections cannot work, maybe nodes should send some sort of information to a satellite at boot and the satellite would then limit the number of requests… or something akin to that… I really have little clue about the exact details of that software / hardware / its features.
For now it might not matter, but if the network performs 33-50% better with the same hardware, that is essentially 33-50% more customers served, and thus more payout for everybody… when we get to that point… well, I don't intend to put this to rest… I'll take it up in a performance troubleshooting / ideas-and-suggestions vote or something when I feel well settled with my own system and maybe have some better ideas about possible solutions.

Breaking records now, still running at unlimited… of course it's only been on for a few hours… it is picking up speed, though… lol

Does the web dashboard autoscale if I break 300 GB a day, or do I go off the chart? lol xD That would be awesome… at the current pace, and if the system keeps up, I should break 300 GB in the next graph day.

I agree with everything you just said, hence this idea: Limit node transfers through node selection
It’s closed now, so I don’t think you can vote for it anymore.