No difference between the 136 and 137 software here; everything is working smoothly. However, there’s an unprecedented amount of egress at the moment. Maybe piecestore can’t handle the high load? All my nodes are migrated to hashstore.
For me everything looks good too. My CPU usage only rose by about 2% across 6 nodes, and that’s just due to the “high” egress. All my Linux Docker nodes are in good condition.
Loads look normal for me. If the load is high in ‘top’, look at the iowait/“wa” field to see whether it’s high too. If so, your applications are likely waiting on your HDD. ‘iotop’ will tell you what’s going on.
there’s been more repair egress traffic on most of my nodes.
I had my hard drives suffering yesterday. Part of it was due to partially migrated nodes trying to run the used-space filewalker at the same time as handling the increased traffic.
I think it might be implemented via repairs. For example, a team member uploads their raw video to a bucket in the USA, but their colleagues (editors, colorists, etc.) are in Europe. To ensure quick access, file pieces must be distributed in both the USA and Europe.
This has nothing to do with the storage node software, node health, or the hashstore migration.
A few months ago we decreased the RS settings from 29/43/65/110 down to 29/46/49/70. This reduced the storage expansion factor from 65/29 to 49/29. The downside is that each file has to get repaired more often. It doesn’t happen right away, because old files take months to enter the repair queue, and when they do they get migrated to the new RS settings. So the repair traffic goes up slowly over time.
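A rough back-of-the-envelope on those numbers (just a sketch of the storage-overhead arithmetic, not Storj source code; I’m reading the tuples as minimum/repair/success/total thresholds as quoted above):

```python
# RS tuple: (minimum, repair_threshold, success_threshold, total)
old_rs = (29, 43, 65, 110)
new_rs = (29, 46, 49, 70)

def expansion_factor(rs):
    minimum, _repair, success, _total = rs
    # A segment is normally stored as `success` pieces, any `minimum` of
    # which can reconstruct it, so storage overhead is success/minimum.
    return success / minimum

print(f"old: {expansion_factor(old_rs):.2f}x")  # 65/29 ≈ 2.24x
print(f"new: {expansion_factor(new_rs):.2f}x")  # 49/29 ≈ 1.69x
```

So the network stores roughly 25% less redundant data per segment, at the cost of segments dropping toward the repair threshold sooner.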
Last week we noticed that the repair queue of US1 has grown to the point that we need to scale up repair workers. No customer data is at risk. Even the worst segments in the repair queue still have enough pieces that we don’t need to worry about them. But if we don’t scale up the repair workers, this repair-queue bell curve might move closer to the minimum threshold until eventually a segment gets lost. That’s why we are scaling up repair workers and keeping an eye on the situation.
46/49 sounds… tight. I guess the network has been reliable enough that potentially paying for a bit more repair… is better than for-sure constantly paying SNOs for 65/29 instead of 49/29?
Repairs only replace some pieces… but the system still has to download the minimum number to recreate them. For example, if the repair system needs to replace 3 pieces of a segment, it still has to download 29 existing pieces to regenerate those 3.
So we probably are seeing increased ingress too… it’s just small, because it only covers the low count of freshly regenerated pieces (while egress always has to send those 29 out first).
At least that’s how it works in my head - a Storjling will correct me…
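That asymmetry is easy to put in numbers (a sketch under the assumptions in the posts above: the repair worker downloads `minimum` pieces from nodes, which is node egress, and uploads only the regenerated pieces, which is node ingress; the 2 MiB piece size is just an illustrative figure, not an official value):

```python
def repair_traffic(piece_size_bytes, minimum, pieces_to_replace):
    # Node egress: pieces the repair worker must download to reconstruct
    # the segment, always at least `minimum`.
    node_egress = minimum * piece_size_bytes
    # Node ingress: only the freshly regenerated pieces get uploaded out.
    node_ingress = pieces_to_replace * piece_size_bytes
    return node_egress, node_ingress

# Replacing 3 pieces of a segment with 2 MiB pieces, minimum = 29:
egress, ingress = repair_traffic(2 * 1024**2, 29, 3)
print(f"egress: {egress / 1024**2:.0f} MiB, ingress: {ingress / 1024**2:.0f} MiB")
# → egress: 58 MiB, ingress: 6 MiB
```

Which would explain why nodes see a big bump in repair egress but only a small bump in ingress.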