No difference between the 136 and 137 software here; everything is working smoothly. However, there’s an unprecedented amount of egress at the moment. Maybe piecestore can’t handle the high load? All my nodes are migrated to hashstore.
For me everything looks good too. My CPU usage only rose by about 2% across 6 nodes, and that’s just due to the “high” egress. All my Linux Docker nodes are in good condition.
Loads look normal for me. If the load is high in ‘top’, look at the iowait/“wa” field to see whether it’s high too. If so, your applications are likely waiting on your HDD. ‘iotop’ will tell you what’s going on.
there’s been more repair egress traffic on most of my nodes.
I had my hard drives suffering yesterday. Part of it was due to partially migrated nodes trying to run the used-space filewalker at the same time as handling the increased traffic.
I think it might be implemented via repairs. For example, a team member uploads their raw video to a bucket in the USA, but their colleagues (editors, colorists, etc.) are in Europe. To ensure quick access, file pieces must be distributed in both the USA and Europe.
This has nothing to do with the storage node software, node health, or the hashstore migration.
A few months ago we decreased the RS settings from 29/43/65/110 down to 29/46/49/70. This reduced the storage expansion factor from 65/29 to 49/29. The downside is that each file has to get repaired more often. It doesn’t happen right away, because old files take months to enter the repair queue, and when they do they get migrated to the new RS settings. So the repair traffic goes up slowly over time.
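A rough back-of-the-envelope on those numbers (just a sketch of the storage-overhead arithmetic, not Storj source code; I’m reading the tuples as minimum/repair/success/total thresholds as quoted above):

```python
# RS tuple: (minimum, repair_threshold, success_threshold, total)
old_rs = (29, 43, 65, 110)
new_rs = (29, 46, 49, 70)

def expansion_factor(rs):
    minimum, _repair, success, _total = rs
    # A segment is normally stored as `success` pieces, any `minimum` of
    # which can reconstruct it, so storage overhead is success/minimum.
    return success / minimum

print(f"old: {expansion_factor(old_rs):.2f}x")  # 65/29 ≈ 2.24x
print(f"new: {expansion_factor(new_rs):.2f}x")  # 49/29 ≈ 1.69x
```

So the network stores roughly 25% less redundant data per segment, at the cost of segments dropping toward the repair threshold sooner.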
Last week we noticed that the repair queue of US1 has grown to the point that we need to scale up repair workers. No customer data is at risk. Even the worst segments in the repair queue still have enough pieces that we don’t need to worry about them. But if we don’t scale up the repair workers, this repair-queue bell curve might move closer to the minimum threshold until eventually a segment gets lost. That’s why we are scaling up repair workers and keeping an eye on the situation.
46/49 sounds… tight. I guess the network has been reliable enough that potentially paying for a bit more repair… is better than for-sure constantly paying SNOs for 65/29 instead of 49/29?
Repairs only replace some pieces… but the system still has to download the minimum number to recreate them. For example, if the repair system needs to replace 3 pieces of a segment, it still has to download 29 existing pieces to regenerate those 3.
So we probably are seeing increased ingress too… it’s just small, because it only covers the low count of freshly regenerated pieces (while egress always has to send those 29 out first).
At least that’s how it works in my head - a Storjling will correct me…
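That asymmetry is easy to put in numbers (a sketch under the assumptions in the posts above: the repair worker downloads `minimum` pieces from nodes, which is node egress, and uploads only the regenerated pieces, which is node ingress; the 2 MiB piece size is just an illustrative figure, not an official value):

```python
def repair_traffic(piece_size_bytes, minimum, pieces_to_replace):
    # Node egress: pieces the repair worker must download to reconstruct
    # the segment, always at least `minimum`.
    node_egress = minimum * piece_size_bytes
    # Node ingress: only the freshly regenerated pieces get uploaded out.
    node_ingress = pieces_to_replace * piece_size_bytes
    return node_egress, node_ingress

# Replacing 3 pieces of a segment with 2 MiB pieces, minimum = 29:
egress, ingress = repair_traffic(2 * 1024**2, 29, 3)
print(f"egress: {egress / 1024**2:.0f} MiB, ingress: {ingress / 1024**2:.0f} MiB")
# → egress: 58 MiB, ingress: 6 MiB
```

Which would explain why nodes see a big bump in repair egress but only a small bump in ingress.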