V137.5 - high load?

Hi,

I’m wondering if anyone has seen high load on version 137.5?

I’m seeing PIDs around 1,000+ and load at 500+. Restarting seems to help for about an hour, then the load rises again.

4TB WD Purple HDD - no HDD errors

Seen this happening on 2 nodes, same spec.

Thanks in advance


Yes. Absolutely! I don’t know if it’s due to the high number of deletions, the high number of repairs, or the high number of potatoes.


@Alexey - are you able to report this back?

No difference between the 136 and 137 software here, everything is working smoothly. However, there is a never-before-seen amount of egress at the moment. Maybe piecestore can’t handle the high load? All my nodes are migrated to hashstore.


For me everything looks good too. My CPU usage only rose about 2% across 6 nodes, and that is just due to the “high” egress. All my Linux Docker nodes are in good condition.

Loads look normal for me. If the load is high in ‘top’: look at the iowait/“wa” field to see if it’s high too. If so then your applications are likely waiting on your HDD. ‘iotop’ will tell you what’s going on.
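If you would rather log the "wa" value than watch ‘top’, here is a minimal Python sketch that samples it on Linux. It assumes the standard `/proc/stat` layout, where the fifth counter on the aggregate `cpu` line is iowait; the function names are just made up for illustration.

```python
import time

def read_cpu_times(path="/proc/stat"):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    with open(path) as f:
        for line in f:
            if line.startswith("cpu "):
                return [int(x) for x in line.split()[1:]]
    raise RuntimeError("no aggregate cpu line found")

def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # Fields: user nice system idle iowait irq softirq steal (guest, guest_nice).
    # The guest fields are already counted inside user/nice, so skip them.
    total = sum(deltas[:8])
    return 100.0 * deltas[4] / total if total else 0.0

def sample_iowait(interval=1.0):
    """Take two samples 'interval' seconds apart and return the iowait share."""
    first = read_cpu_times()
    time.sleep(interval)
    return iowait_percent(first, read_cpu_times())
```

Calling `sample_iowait(5)` every so often and logging the result would show whether the load spikes line up with the disk being the bottleneck.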


There’s been more repair egress traffic on most of my nodes.

I had my hard drives suffering yesterday. Part of it was due to partially migrated nodes trying to run the used-space filewalker at the same time as the increased traffic.

Likely yes. This is one of the reasons why hashstore was implemented.

@ItsHass Did you migrate to hashstore? Perhaps it’s time. I would expect that the repair traffic will only rise.

Why though? Was there a change on the satellites? I saw a high number of audits two days before the big spike. Was there a manual file check?

The Production Cloud Global:

I think it might be implemented via repairs. For example, a team member uploads their raw video to a bucket in the USA, but their colleagues (editors, colorists, etc.) are in Europe. To ensure quick access, file pieces must be distributed in both the USA and Europe.


What is the best simple guide to migrate?

Additional parameters:


Are the machine requirements different to support hashtables/store?

This has nothing to do with the storage node software, healthy or hashstore migration.

A few months ago we decreased the RS settings from 29/43/65/110 down to 29/46/49/70. That reduced the storage expansion factor from 65/29 to 49/29. The downside is that each file has to get repaired more often. It doesn’t happen right away because old files take months to enter the repair queue, and when they do they get migrated to the new RS settings. Slowly over time the repair traffic goes up.
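To put numbers on the two expansion factors, here is a quick Python sketch. It only uses the two ratios quoted above; the helper name is made up for illustration.

```python
from fractions import Fraction

def expansion_factor(success_threshold, required):
    """Stored pieces per segment divided by the pieces needed to rebuild it."""
    return Fraction(success_threshold, required)

old = expansion_factor(65, 29)
new = expansion_factor(49, 29)
print(f"old: {float(old):.2f}x, new: {float(new):.2f}x")  # old: 2.24x, new: 1.69x
```

So every byte a customer uploads used to occupy roughly 2.24 bytes on the network and now occupies roughly 1.69 bytes.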

Last week we noticed that the repair queue of US1 has grown to the point that we do need to scale up repair workers. No customer data is at risk. Even the worst segments on the repair queue still have enough pieces that there is nothing to worry about. If we don’t scale up the repair workers, this repair queue bell curve might move closer to the minimum threshold until eventually a segment gets lost. That’s why we are scaling up repair workers and will keep observing the situation.


If it is just normal repair traffic why don’t we see the same increase on the ingress side?

46/49 sounds… tight. I guess the network has been reliable enough that potentially paying for a bit more repair… is better than for-sure constantly paying SNOs for 65/29 instead of 49/29?

Fair enough.

Repairs are only replacing some pieces… but the system still has to download the minimum number to recreate them. Like if the repair system needs to replace 3 pieces of some data it still has to download 29 existing pieces to regenerate those 3.

So we probably are seeing increased ingress… it’s just small because it only covers a low count of freshly-regenerated pieces (while egress needs to always send those 29 out first)

At least that’s how it works in my head :wink: - a Storjling will correct me…
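Here is that mental model as a tiny Python sketch. It is a simplification and an assumption on my part: it counts exactly 29 pieces downloaded per repaired segment, ignores any extra pieces the repair worker may request, and treats all pieces as the same size.

```python
def repair_traffic(pieces_to_replace, required=29):
    """Return (pieces downloaded from nodes, pieces uploaded to nodes)
    for repairing one segment in this simplified model."""
    if pieces_to_replace <= 0:
        return (0, 0)
    return (required, pieces_to_replace)

egress, ingress = repair_traffic(3)
print(f"egress {egress} pieces, ingress {ingress} pieces, "
      f"ratio {egress / ingress:.1f}:1")  # ratio 9.7:1
```

With only 3 pieces replaced per segment, nodes as a whole send almost ten times more repair data than they receive, which would explain why the ingress bump is hard to spot.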


Going from 65/29 to 49/29 will mean a lot of deletions: about 25% less data for all nodes.
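The rough arithmetic behind that figure, as a Python sketch. It assumes every existing segment eventually ends up at the new settings, which per the explanation above only happens gradually as segments pass through repair.

```python
def relative_reduction(old_pieces, new_pieces):
    """Fraction of stored data that disappears when the piece count drops."""
    return 1 - new_pieces / old_pieces

print(f"{relative_reduction(65, 49):.1%}")  # 24.6%
```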

At least I now understand why my nodes have seen no growth for months. There should be better communication about such fundamental changes.


So you are telling us, that all that is manual labor and not automated? Isn’t that a bit problematic?