How can a handful of small nodes completely trash a large array?

Ottetal · December 22, 2023, 7:37pm

Hiya forum

I have yet another technical question, regarding disk operations.

I have here a screenshot of one of my Synology boxes. It hosts cold storage as a backup target, and has 4 nodes.

1 node on 4TB, which is full
3 nodes on 500GB, each holding ~300GB

The storage array is running 12 disks between 8 and 20TB, for a total of just north of 100TB. Additionally, there is a read/write NVMe RAID1 array of 400GB disks. The volume is SHR2 btrfs, which is not the best with regard to write amplification, but should be alright with so many disks.

Here is my problem: The array gets completely trashed when two nodes are turned on at the same time. This was not always a problem, but it seems that as soon as I turn on two nodes at the same time, the SSD cache will speedrun to 100% filled, and then volume performance will drop to 2000ms+ latencies.

While lazydisk is running, I expect lower performance, but it just never gets good again. No errors in the log as far as I can see.

The array has recently been scrubbed, it has recently been defragmented and all the nodes are running locally via docker and are up to date. Any ideas? Thanks!

arrogantrabbit · December 22, 2023, 7:58pm

I don’t know the details, but wasn’t there a setting to cache metadata only?

If Synology cannot do that and dumbly caches everything ever accessed indiscriminately, that cache does you little good.

Moreover, SHR2 of a bunch of random sized disks is not just raid6. It’s an amalgamation of a number of raid6 arrays and mirrors, to maximize space utilization. The iops are horrific, especially on such a small array, and moreso with different sized disks. Some of those end up being raid6 of 4 disks, with the array performance slower than one disk. This is perfectly exemplified by the performance you are seeing even on these tiny nodes.

My advice, speaking from experience: your setup is hopelessly unfixable. Scrap Synology and their inefficient BTRFS over md implementation, let alone SHR, and use something that supports ZFS and configure a special metadata device. All your other services will thank you too.

As a side note — please don’t defragment BTRFs.

Edit: this one

Pin all metadata shall be ON.

You want to see over 99% cache hit rate.

Edit2: Did you disable sync writes and atime updates for node data and databases? I would also disable consistency verification and checksums; there is no need for them with storagenode. All of this will drastically reduce IOPS. Right now the cache does not help at all, looking at the IOPS reported.

Ottetal · December 22, 2023, 8:07pm

Hiya @arrogantrabbit, thank you so much for your response.

Synology doesn’t have the ability to cache only metadata like the special device on ZFS; the cache has to hold metadata and also act as a normal cache. I don’t know why the space required is so large on Synology, but I need ~600GB to enable it. The current cache does not seem to matter: I’ll try disabling it and remounting as RAID-0 read only to see if I am able to pin BTRFS metadata. Otherwise I am getting larger cache drives.

I agree on all of your points. The array has -obviously- been upgraded over time, and as it’s just cold storage, performance has seldom been a great concern of mine.

Questions:

Why is defragmentation on btrfs ill-advised?
If I swapped all the disks with the same capacity as my largest disks, would the underlying issues with the RAID implementation be nullified, or would I still have many partitions, where just some of them are raided together?
Let’s say I were to move all the data to another box, and recreate the array as it is now, could I then get better utilization of the data?

Kind regards friend, and happy holidays to you

EDIT: Just saw your edit too, lol.

Did you disable sync writes and atime updates for node data and databases? I would also disable consistency verification and checksums, there is no need to do that for storagenode.

I have not done either of those. Can I do that for just the StorJ folders? I want the btrfs features enabled on the other data

arrogantrabbit · December 22, 2023, 8:28pm

Happy holidays to you too!

That pinning feature shall accomplish that (I’ve made a bunch of edits to the post, they’re crossed in the mail)

That actually shall help too (and be much safer; I have horror stories to tell about Synology’s write cache implementation, and I doubt anything has changed since): you’ll have double the space, and storagenode reads more than it writes, so that shall still be helpful. With a consistent load, caching writes does not help much: they still have to end up on the disk, and there is no reason to buffer in hopes of lower load in the future if the load is pretty consistent.

Exactly - I was going to write that you can keep this one as cold archival storage, but I’m so disappointed with Synology that I don’t want to give more reasons to keep using them

On one hand, it may increase data space consumption, because it does not preserve extent sharing. On the other, storagenode data is mostly small files that are not fragmented in the first place. They are scattered on disk, yes, but they are also randomly accessed, so piling them up in one corner won’t help much.

I’m not sure. I’ve never done it on synology (I pretended SHR does not exist) but I did read that not all configurations can scale. You may end up replacing all disks with the same size and not being able to utilize full size, because of the pre-existing mini-arrays that synology could not detangle.

I think in this case you will likely end up in generally the same place. While the order in which you added drives matters, and perhaps Synology could do better when creating an array of all disks at once, as opposed to incrementally, the very nature of SHR is to take advantage of disk size variety, so if disks are of different sizes, it will have to create multiple arrays. For example, if you have 10 6-TB disks and 4 8-TB disks, creating SHR2 will result in one RAID6 array of 14x 6-TB disks and one RAID6 array of 4x 8-6=2TB “disks”. That latter array will be horrifically slow, so while you do have 14 disks to share IOPS indeed, when writing to that space you will experience RAID6 of 4 disks horribleness.
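To make that layering concrete, here is a minimal sketch of the slicing idea. It is a simplified model only, not Synology's actual SHR algorithm: the function name `shr_layers` is made up for illustration, and it assumes every layer is a RAID6-style group with two parity members, ignoring minimum member counts and any mirror fallbacks.

```python
def shr_layers(disk_sizes_tb, parity_disks=2):
    """Simplified model: slice disks at each distinct size and build one
    parity group per slice. Not Synology's actual algorithm."""
    layers = []
    floor = 0  # capacity per disk already assigned to lower layers
    remaining = sorted(disk_sizes_tb)
    while remaining:
        smallest = remaining[0]
        slice_tb = smallest - floor      # height of this layer on each member
        members = len(remaining)         # every disk at least this large joins
        layers.append({
            "members": members,
            "slice_tb": slice_tb,
            "usable_tb": max(members - parity_disks, 0) * slice_tb,
        })
        floor = smallest
        # Only disks with capacity left over continue to the next layer.
        remaining = [size for size in remaining if size > floor]
    return layers


if __name__ == "__main__":
    # The example from the post: ten 6 TB disks plus four 8 TB disks.
    for layer in shr_layers([6] * 10 + [8] * 4):
        print(layer)
```

For the 10x 6TB + 4x 8TB example this yields a 14-member layer of 6TB slices and a 4-member layer of 2TB slices, which is the small, slow array described above.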

I can never write a coherent post in one go. Usually I try to fit in under 5 minutes before Discourse shows Edited flag.

Yes, I think this is manageable on a per-share (subvolume) level in Synology. You can disable atime updates (definitely) and sync writes (probably) on the fly, but to disable checksumming you need to re-create the shares (perhaps not worth the trouble) because Synology decided so. In reality, btrfs allows configuring this at file granularity, but Synology understandably did not expose that complexity to the end users to minimize confusion.
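If you want to check whether atime updates are really off after changing the setting, one rough way is to look at the mount options. The sketch below parses text in the `/proc/mounts` format (device, mount point, filesystem, options). The helper names and the sample line are hypothetical, and `/volume1` is only an assumed mount point; how DSM actually applies the setting per share is not something this checks.

```python
def mount_options(mounts_text, mount_point):
    """Return the set of mount options for mount_point, or None if absent.

    Expects text in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            return set(fields[3].split(","))
    return None


def atime_disabled(options):
    """True only when the volume is mounted with noatime."""
    return options is not None and "noatime" in options


# Hypothetical sample lines; on a Linux box the text would come from /proc/mounts.
SAMPLE = (
    "/dev/example0 /volume1 btrfs rw,relatime 0 0\n"
    "/dev/example1 /volume2 btrfs rw,noatime 0 0\n"
)

if __name__ == "__main__":
    for point in ("/volume1", "/volume2"):
        print(point, atime_disabled(mount_options(SAMPLE, point)))
```

Note that `relatime` still updates access times occasionally, so the sketch only treats `noatime` as disabled.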

Ottetal · December 22, 2023, 9:25pm

Tension is rising, as removal of the RAID1 array of caching devices is nearing completion:

And all tension fell to the floor after creation of a RAID0 read array, as “pin all BTRFS metadata” is not possible in read only mode … what a bummer

So far, so bad. I have had this array since it was 4x 4TB disks, so I can guarantee that I have tons of underlying configs behind the volume. Not great. I have identified about ~15TB of data that can be deleted. That pushes usage down to ~60TB. Once again; not great; not terrible.

I have that spare on one of my other boxes. I could move the data to another box, tear down the entire primary array and rebuild as regular RAID6. 2x 18TB, 2x 16TB, 2x 14TB, 4x 10TB and 2x 8TB would yield ~70TB usable under RAID 6, with 50ish TB wasted.
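A quick sanity check of those numbers, as a rough sketch. It assumes a classic RAID6 that truncates every member to the size of the smallest disk and spends two members' worth of space on parity; the function names are made up for illustration.

```python
def raid6_capacity(disk_sizes_tb):
    """Return (usable_tb, unused_tb) for a plain RAID6 over mixed disks.

    Assumes each member is truncated to the smallest disk and that
    two members' worth of space goes to parity.
    """
    count = len(disk_sizes_tb)
    if count < 4:
        raise ValueError("RAID6 needs at least 4 disks")
    smallest = min(disk_sizes_tb)
    usable_tb = (count - 2) * smallest
    unused_tb = sum(disk_sizes_tb) - count * smallest  # space above the smallest size
    return usable_tb, unused_tb


def tb_to_tib(terabytes):
    """Convert decimal terabytes to binary tebibytes."""
    return terabytes * 10**12 / 2**40


if __name__ == "__main__":
    disks = [18] * 2 + [16] * 2 + [14] * 2 + [10] * 4 + [8] * 2
    usable, unused = raid6_capacity(disks)
    print(f"usable: {usable} TB (~{tb_to_tib(usable):.1f} TiB), unused: {unused} TB")
```

With the twelve disks listed, this gives 80 TB usable (roughly 73 TiB) and 56 TB left unused above the 8TB mark, which lines up with the ~70TB and 50ish TB estimates.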

I could also just purge all 10TB and 8TB disks, purchase two new 14TB disks and be around the same place as the above config. I am fairly sure that selling 4x 10TB and 2x 8TB could finance 2x new 14-16TB disks.

… I could of course also just accept less than stellar performance, as this is after all just a cold storage box.

arrogantrabbit · December 22, 2023, 10:00pm

Indeed:

This feature is only available for SSD read-write caches created in DSM 7.0 and mounted on Btrfs volumes.

Classic Synology. Implement half a feature as a proof of concept, ship it anyway, and move on.

Alexey · December 23, 2023, 10:54am

And this is expected. We explicitly ask to run each node on its own disk; in this case the pool = 1 disk, because RAID works as slowly as a single disk AND the pool is one disk for the OS.
You decided to break this rule, so… It’s expected.

100%

Even if I agree, this is bad advice for Synology and not useful: you cannot use ZFS there. The only way is to move the second node to a different pool; nothing else could help here, unfortunately. Maybe, only MAYBE, an SSD as a cache device could help there.

because BTRFS is still not production-ready. And as far as I know, defrag doesn’t help; only migrating to ext4 or adding an SSD cache could.

the same for me

Ottetal · December 23, 2023, 5:01pm

Six minutes have passed - I am safe to respond now, lol

And this is expected. We explicitly ask to run each node on its own disk; in this case the pool = 1 disk, because RAID works as slowly as a single disk AND the pool is one disk for the OS.
You decided to break this rule, so… It’s expected.

Synology installs the OS on all disks in the system, so it can never really be any other way.
It is true though, I did break that rule. I think @arrogantrabbit hit the nail on the head in his comment about SHR possibly being a great many partitions on disk behind the scenes, which causes bad writes.

Had I not used SHR, and had normal RAID instead, I would have greater IOPS than the single disk could muster.

If we just follow a simple RAID calculation, then I have 12 disks capable of ~75 IOPS each. Modern disks should do more, but let’s keep it at 75.

12 disks is 900 IOPS.

Let’s change numbers for the calculation a bit, and say StorJ does 30% writes and 70% reads

(900 Raw IOPS * .3 / 4) + (900 * .7) = 698 IOPS. How’s that not plenty for a pair of nodes? Especially with one being full, it will be above 90% reads. Of course, these numbers don’t hold water because I have the SHR2 implementation, which probably skews the numbers a bit.

Man, underlying storage architecture is exciting!
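The same back-of-the-envelope estimate as a small sketch, so the inputs can be varied. The function name is made up for illustration. One caveat: 4 is the write penalty usually quoted for RAID5, while the usual figure for RAID6 is 6, so the penalty is left as a parameter here.

```python
def effective_iops(disks, iops_per_disk, write_fraction, write_penalty):
    """Rule-of-thumb array IOPS: writes divided by the RAID write penalty,
    reads served at the raw rate. Ignores caching, SHR layering and btrfs."""
    raw = disks * iops_per_disk
    write_iops = raw * write_fraction / write_penalty
    read_iops = raw * (1 - write_fraction)
    return write_iops + read_iops


if __name__ == "__main__":
    # Numbers from the post: 12 disks, ~75 IOPS each, 30% writes.
    print(effective_iops(12, 75, 0.3, write_penalty=4))  # the estimate above
    print(effective_iops(12, 75, 0.3, write_penalty=6))  # usual RAID6 penalty
```

With a penalty of 4 this reproduces the ~698 figure (697.5 before rounding); with a penalty of 6 it drops to about 675.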

Now off to redo my setup at home

Toyoo · December 24, 2023, 12:49am

The way storage nodes write files is just a bad match to btrfs, see some of my old measurements.

Parity schemes necessarily perform small random writes slower than the slowest drive in the parity scheme. If a storage node needs to update a single sector in a stripe, it has to first read the whole stripe from N-3 disks, only then write the new sector and new parity data to 3 drives. Again, not sure how specifically this works on btrfs, but I’ve identified that on ext4 you would do around 10-20 small random writes for a single upload. I suspect btrfs won’t do much better than that. Given that a single HDD can do 250 IOPS best case, even a few concurrent uploads will have you observe thrashing.

I do not have any specific advice on what you can do with your Synology unit, as I do not have experience with them. But if you can, avoid parity schemes, and avoid btrfs. Or, if you are willing to tinker, patch your storage node code to remove the synchronous writes and the use of the temp directory for uploads. This will make you violate the current node T&C, but at least your hard drives will thank you.
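As a rough illustration of how small that budget is, here is a sketch using only the figures above. It assumes, per the argument above, that the parity array handles small random writes no faster than a single drive, and it ignores reads entirely; the function name is made up for illustration.

```python
def max_uploads_per_second(drive_iops, writes_per_upload):
    """Upper bound on uploads per second if every upload costs
    writes_per_upload small random writes on a one-drive IOPS budget."""
    return drive_iops / writes_per_upload


if __name__ == "__main__":
    for drive_iops in (250, 75):          # best case vs. the 75 IOPS used earlier in the thread
        for writes in (10, 20):           # measured range per upload on ext4
            rate = max_uploads_per_second(drive_iops, writes)
            print(f"{drive_iops} IOPS, {writes} writes/upload -> {rate} uploads/s")
```

Even in the 250 IOPS best case that is only 12.5 to 25 uploads per second before the drive saturates, and at 75 IOPS it is 3.75 to 7.5.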

Ottetal · December 27, 2023, 8:11am

Thank you @Toyoo - this is a very (!) interesting read. Good resource, great contribution to the thread.

@Alexey @arrogantrabbit, I’ve purchased some additional disks (And now I’m breaking the rule about not purchasing gear only for StorJ!). These disks should allow me to move all additional nodes to their own disk and get rid of RAID altogether. We’ll see what happens.

Cheers

Alexey · December 27, 2023, 8:14am

Please try to use ext4 instead of BTRFS if that is possible.
Or make another pool to run the other node.

So, in general - avoid using the same pool for several nodes.

Ottetal · December 27, 2023, 8:21am

Absolutely. I am running ext4 at my home location, and have none of the issues described here. Take care Alexey

EasyRhino · December 30, 2023, 5:40am

So I’m a little late to this party. But if you could take those four 4TB drives and just set up four nodes, each of which has exclusive access to a single drive, it would be a million times simpler and faster than what you currently have set up. And it would hold more storj data for higher payments.

Ottetal · January 1, 2024, 12:09pm

But if you could take those four 4TB drives and just set up four nodes

No, this cannot be done. The drives are long gone, and I don’t have the available slots in the unit as the screenshot shows

It would be a million times simpler

I disagree. It would not be simpler. It would be more volumes, more places to have files and an additional layer of management. It would not hold any more data, as datacap is not my current ceiling - performance on the array is.

I am fixing the drive performance - I am not reconfiguring the setup to run StorJ alone. This is against the StorJ way. We should be running on existing hardware.

Alexey · January 1, 2024, 12:53pm

You are right.
But you shouldn’t run multiple nodes on the same array. This is similar to running several nodes on the same disk (prohibited by ToS), and it has been confirmed multiple times (including by you) that it’s a bad idea. The nodes will inevitably affect each other in this case.
Especially for an array with parity (operations run at the speed of the slowest disk in the array), and especially for BTRFS (it’s not optimized for millions of small files).

Ottetal · January 1, 2024, 2:58pm

Agree - exactly as we have been discussing earlier in this section

Happy new year to you friend. May 2024 be great!

Ottetal · January 3, 2024, 9:56am

Update:

I’ve got a pair of new 2TB NVMe SSDs on the way to the array. I’ll try to see if pinning the BTRFS metadata has any meaningful impact on the cache hitrate as @arrogantrabbit suggested.

If that fails, I’ll move data off the array, and rebuild it in “regular” RAID5.

I’ll keep BTRFS, as I need some of the features. I expect at least two additional updates in this thread.

Ottetal · January 24, 2024, 4:00pm

New caching SSDs have been installed. I’ll report back in a few days on whether the cache hit rate is any higher.

Ottetal · January 26, 2024, 10:50am

After a few days, the cache hit rate is now at 99%, and I’ve got all the BTRFS metadata pinned in the cache. A bit surprised it takes almost a TB, but it’s not rising so I’m happy. I’m also a bit surprised that it has only cached 21GB worth of data, but what does it realistically cache in normal node operation apart from databases?

Either way, the location of this NAS has been cleared to be upgraded to 8x 18TB drives. I’ll take down my horrible implementation of SHR2, and will be running RAID6, still with BTRFS on top. This is not perfect for StorJ, but it is a requirement for the NAS’ primary purpose.

Alexey · January 27, 2024, 8:19am

You may use Storj for your primary purpose