On-line RAID expansion - node pause option?

The node runs fine, but the expansion has become absolutely sluggish since the recent update. It was progressing a couple of percent per day before, which was fine, but now it takes a week to gain 1%.

There is no risk to data or node performance thanks to the configuration, because disk IO priority goes to the workload (and I can’t change that). However, that workload seems to have increased massively recently, leaving no room for the background expansion task. Past expansions with similar amounts of data stored were much faster because there was far less activity from STORJ.

Is there a way to have my node gracefully paused for a day or so without penalty to let it finish, or am I stuck balancing an hour of downtime a week to let it make progress without dipping too low on uptime?

The node has zero suspension or warning history, has never gone offline for more than 4 hours in any sliding month period (only once for 3 hours, during a physical migration), and has a solid 100% audit history, all of it stretching back almost 2 years. I would hate to lose that.

Thanks in advance all!

I have been down a whole week, and was not suspended.
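For a rough sense of the allowed budget, here is a back-of-envelope sketch; the 60% online-score threshold and the 30-day window are assumptions taken from the commonly cited guidance, so double-check them against the current docs:

```python
# Downtime budget sketch - the threshold and window are assumptions,
# not guarantees from Storj.
WINDOW_HOURS = 30 * 24        # rolling window: 720 hours
ONLINE_THRESHOLD = 0.60       # assumed suspension threshold for the online score

max_offline = WINDOW_HOURS * (1 - ONLINE_THRESHOLD)
print(f"Max offline per 30-day window: {max_offline:.0f} h (~{max_offline/24:.0f} days)")
# -> about 288 h, so a one-day pause stays well inside the budget
```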
You can also reduce your node's allocation to 500GB so you are not getting any new data while you expand the RAID.
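If it is a Windows/GUI node, the allocation lives in config.yaml; a minimal sketch of the change, assuming the stock key name and value format (verify against your own config before editing, and restart the node service afterwards):

```yaml
# storagenode config.yaml - drop the allocation below what is already stored
# so the node stops accepting new uploads
storage.allocated-disk-space: 500.00 GB
```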

3 Likes

Thanks, I hadn’t considered that.
I reduced the node's storage setting below the stored amount, but it's still doing a whole bunch of IO across multiple blob files.
It does seem somewhat reduced but not by much.
Time will tell I suppose.

Thanks!

If this is not responding to priority changes, add RAM. Nodes generate very little IO these days, but disk responsiveness will benefit massively from having the metadata cached.
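To get a feel for how much RAM that means, here is a purely illustrative sizing sketch; the stored amount, average piece size, and per-file metadata footprint are all assumptions, not measurements:

```python
# Illustrative metadata-cache sizing - every input below is an assumption.
stored_tb = 8            # hypothetical amount of data stored by the node
avg_piece_kb = 512       # hypothetical average piece file size
meta_per_file_kib = 1    # assumed filesystem metadata per file (ballpark)

files = stored_tb * 1e12 / (avg_piece_kb * 1e3)
cache_gib = files * meta_per_file_kib * 1024 / 2**30
print(f"~{files/1e6:.1f}M files -> ~{cache_gib:.0f} GiB to keep metadata hot")
```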

What raid configuration and filesystem is this? Any reason you are expanding the array as opposed to adding a new pool?

A periodic RAID scrub is a similar IO workload to a rebuild/expansion; are you planning to shut the node down monthly?

1 Like

I’ll preface with the fact that it’s an odd configuration but there are reasons for it (mostly because of requirements I can’t go into detail about).

It’s a VM under Hyper-V on an R730XD host running Windows (the GPU rendering workloads require it). I’ve tried adding RAM for metadata cache as well as adding a RAM cache for the underlying filesystem on the host OS, with little to no benefit from either (L1 and L2 cache actually, with NVMe as L2).

RAID6 with 9 drives (long story), adding 3 more to expand total storage space for the primary storage workload: backups. It’s an older box (ish, R730s still rock!) so I didn’t have much say in doing it another way. All drives are 8TB HGST SAS on a PERC H730P controller.

I’m also quite aware of just how long a RAID6 expansion takes when adding 3 drives, and the risk is mitigated in multiple ways (primarily by the controller not making any of the additional space available until it’s finished, and by disabling write cache, which also degrades performance).

All that said… the VM's primary boot volume is on a totally separate mirrored volume with no other workloads, and its swap file is on an Intel DC P3600 1.2TB SSD.

The VM is quite responsive and the array still easily handles reads at over 300 MB/s.

Not adding a new pool because I need it all in 1 array (reasons).

A RAID scrub is far less intense in my understanding, and experience, since it’s just validating the data, not re-computing it and shuffling it around. Scrubs have also completed very quickly in the past alongside the workloads I was running (heavy drive workloads for some render jobs); totally different from STORJ for sure, but somewhat relevant.

I’ve not found a way to tell the controller to bump up the priority for the reconstruction/expansion task and I really don’t want to take it offline till it finishes.

Thanks!

Oh wow. Yes, this is the worst case scenario in terms of IO latency amplification, unfortunately.

It’s not throughput, it’s IO: every stripe needs both sets of parity recomputed and written. Disks can sustain around 200 IOPS each, so if you have small blocks it can take ages. 300 MB/s sequential read for 6 drives is a bit low, though.
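To put rough numbers on that, a back-of-envelope sketch; every value here is an illustrative assumption, not a measurement of this particular controller:

```python
# Why an online RAID6 expansion crawls once the spindles are busy.
# All numbers are illustrative assumptions.
used_tb = 40          # data that has to be restriped
stripe_kb = 256       # controller stripe unit
io_per_stripe = 4     # read old layout, write new layout, write P and Q parity (rough)
spare_iops = 50       # IOPS per drive left over for the expansion
drives = 9

stripes = used_tb * 1e12 / (stripe_kb * 1e3)
seconds = stripes * io_per_stripe / (spare_iops * drives)
print(f"~{seconds/86400:.0f} days at {spare_iops} spare IOPS per drive")
```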

Perfectly fine for backup (albeit RAID6 is overkill for just 9 drives), but now that you add random IO into the mix, with modern NTFS likely tuned toward flash storage, it won’t scale.

Maybe migrating to a faster filesystem/RAID arrangement sooner rather than later is the solution. It’s not very encouraging when a RAID repair takes the server effectively offline for weeks; it defeats the purpose of having redundancy in the first place.

Adding RAM and disabling sync should still have helped by drastically reducing the Storj-generated IO so that the controller can focus on rebuilding.

1 Like

Yea - it’s definitely a worst case.

I know it’s IO and not throughput; just giving a data point that shows the array is still decently responsive, all things considered.

It’s actually ReFS but that doesn’t make much of a difference. No need to scale past this once it finishes.

No option to migrate this box unfortunately; it's a loner. Luckily it’s not offline, just slooow. All functionality is intact; I just wanted to stay ahead of things and learn more about STORJ for potential future problems. The whole point of running STORJ is that it makes use of unused space on a box that would otherwise go unutilized.

Yea - I don’t get how adding RAM didn’t help - I gave it 128GB and all 56 virtual cores (dual 2690V4) and it barely uses any of either :\

Maybe it’s ReFS?

What I was hoping for was an option to contact STORJ support, or some built-in contract for extended maintenance windows, or something along those lines.

Ultimately nothing bad is happening: all services (backup, renders, and STORJ) are running fine; the reconstruction will just take ages. There’s risk, sure, but it is somewhat controlled for, and I can wait if I have to. I just like to stay ahead :slight_smile:

2 Likes

Hello @AnoStor,
Welcome to the forum!

Please read: Topics tagged refs

So it seems ReFS is not a good choice for a storagenode yet, especially in combination with big RAID arrays, unfortunately.

1 Like

Strange, it used to be just fine.
Guess I’ll have to figure out a way to move off of it.
I wonder if it was a recent Windows update cycle that actually did something…

Perhaps the described problems were in the past and things have since improved, but right now ReFS looks slower than NTFS.
Maybe it requires a cache enabled, like ZFS does; otherwise it works slower than a regular native FS (NTFS for Windows and ext4 for Linux).

2 Likes