Rethinking ZFS (draid)

With the current improvements to Hashstore and the resulting reduction in disk stress, I started thinking about doing something with ZFS.

Let’s use “The Van” as an example. To optimize disk space and eliminate all that unused capacity, I thought it could be interesting to evaluate a ZFS configuration based on dRAID3.

In setups like this, the limiting factor is almost always IOPS (that’s the first major one that comes to mind), and until now it was basically unthinkable to run 100 nodes on a system with, say, 60 HDDs (a 60-disk JBOD is fairly common, but the same idea applies to smaller or larger systems).

With Hashstore, things have changed, and I’m wondering: would it make sense, on a system like “The Van”, to move everything to a 60-disk dRAID3 and try to simplify the entire system?
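For concreteness, a single 60-bay dRAID3 vdev could be declared roughly like this. This is a sketch only: the pool name, device names, group width, and spare count are all my assumptions, not Th3Van’s actual layout:

```sh
# Hypothetical: triple parity (draid3), 4 data disks per redundancy
# group, 60 children total, 4 distributed spares. That leaves 56
# active drives ≈ 8 redundancy groups of (4 data + 3 parity),
# rotated across all drives.
zpool create -o ashift=12 tank \
  draid3:4d:60c:4s /dev/disk/by-id/ata-EXAMPLE-{01..60}
```

Narrower data groups burn more capacity on parity, but they give you more independent redundancy groups, which is what sets the vdev’s aggregate random-read IOPS.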

The idea of adding two mirrored special devices would solve many of the IOPS problems, but it would still increase complexity (I know it’s simple to implement, it’s just a stylistic choice to keep the system a sealed ‘black box’ from day one, with close to zero maintenance) and introduce a major single point of failure. If I wanted to avoid using special devices, could a setup like this handle the workload of 100 nodes?

If you just have enough RAM, the advantage of a special device is much smaller. A special device is a very nice thing, though. I’d do a three-way mirror for my special device and as many 10-wide RAIDZ2 pools as you’d need.
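A minimal sketch of that shape, assuming hypothetical device names and two 10-wide RAIDZ2 vdevs in a single pool (one reading of “as many pools as you’d need”):

```sh
# Two 10-wide RAIDZ2 vdevs plus a three-way mirrored special vdev
# for metadata (and small blocks, if special_small_blocks is set).
zpool create tank \
  raidz2 /dev/disk/by-id/hdd-{01..10} \
  raidz2 /dev/disk/by-id/hdd-{11..20} \
  special mirror /dev/disk/by-id/ssd-{1..3}
```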

I don’t know what the ideal number is, but it seems like you’d still want smaller pools (than 60) so fewer disks need to be accessed at once for each batch of IO? There may even be some SAS considerations (with external enclosures you only have so many cables, at 4 lanes per cable; are there inefficiencies with the HBA touching ~60 disks for every task)?

I was thinking about this the other day, but for smaller configs, like the common 24-bays-in-4U. Perhaps carve things up into three 8-disk RAIDZ2 pools? If you had 20TB HDDs, then that’s (6 x 20) = 120TB per pool… and if Th3Van has nodes of around 6TB each now… that’s around 20 nodes per pool. Would an 8-disk RAIDZ2 have the IOPS for 20 nodes at once?

Maybe four 6-disk RAIDZ2 pools would be better? (Then that’s 4 x 20 = 80TB per pool, and only 13-ish nodes per pool.)
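Either carving is straightforward to express; e.g. the four-pool variant might look like this (device names hypothetical):

```sh
# Four independent 6-disk RAIDZ2 pools: 4 data disks x 20TB ≈ 80TB
# usable each, and a failed pool only takes ~13 nodes with it.
zpool create pool1 raidz2 /dev/disk/by-id/hdd-{01..06}
zpool create pool2 raidz2 /dev/disk/by-id/hdd-{07..12}
zpool create pool3 raidz2 /dev/disk/by-id/hdd-{13..18}
zpool create pool4 raidz2 /dev/disk/by-id/hdd-{19..24}
```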

I have no answers, only questions :wink:

But it does circle back to the idea of SNOs using parity to preserve nodes. At first I was all on board with the idea of no redundancy. But… although the network doesn’t need it… it takes so long to fill a node that, until it’s filled, a SNO may be better off spending extra space on parity, just so they don’t lose a node and take 2+ years to fill it again.

(Edit: Maybe in my example you could do DRAID 3x7+3-spares in a 24-bay system? Interesting…)
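One way to read “3x7+3-spares” in OpenZFS dRAID notation (my interpretation, not confirmed): three redundancy groups of 7 drives each (4 data + 3 parity) plus 3 distributed spares across the 24 bays:

```sh
# draid3:4d:24c:3s = 24 children, 3 spares -> 21 active drives,
# i.e. 3 groups of (4 data + 3 parity). Device names hypothetical.
zpool create tank draid3:4d:24c:3s /dev/disk/by-id/hdd-{01..24}
```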

I have lost 20+ HDDs without parity so far, but all nodes survived. Therefore I still think that close monitoring of SMART data is sufficient.


So basically when the system said a drive was unhealthy… you still managed to copy off enough data that audits didn’t find enough damage to disqualify the node? Lucky! I’ve had drives that simply wouldn’t spin up anymore.


Wait, I definitely want to hear more about losing 20 drives.

I had a transient corruption on one drive and that was enough to disqualify one node. Let alone a full HDD failure.

There is a thread about my Toshiba drives: Toshiba MAMR experience?

All drives failed slowly with increasing numbers for pending and reallocated sectors. I lost some files but never reached the threshold for disqualification.
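For anyone wanting to replicate that kind of close monitoring, a smartmontools one-liner along these lines tracks the two attributes mentioned (/dev/sda is a placeholder, and exact attribute names vary a bit by vendor):

```sh
# Watch for growing pending/reallocated sector counts on a drive.
smartctl -A /dev/sda | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector'
```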


I wonder. Writes are supposed to be well-coalesced with ZFS and hashstore. Reads, with a decently-sized stripe, wouldn’t hit many drives. It might indeed be a good idea if you tuned the stripe size. I admit I do not have much experience with ZFS, though.

I agree, special metadata device is a great idea generally, but with the new hash store implementation, it’s less useful than it was before, and OP specifically states

I know it’s simple to implement, it’s just a stylistic choice to keep the system a sealed ‘black box’ from day one, with close to zero maintenance

Under that consideration, which I totally understand, I’d still go with more RAM.

Just to chime in here, using a mega array to host multiple nodes seems like a bad idea. Regardless of hashstore.

First, a single node may have a theoretical maximum size. I know that with the bloom filters it used to be a little over 20TB, but I don’t see it published now. There’s also a practical limit, where data just doesn’t fill in faster than it deletes. I personally don’t have a node with more than 8TB right now.

Second, if the limit is IOPS, then that’s due to the highly random nature of reads and writes coming in. You know the best way to increase IOPS on independent random data? Give each node its own disk.
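That is, the classic one-pool-per-disk layout, something like (hypothetical names):

```sh
# One single-disk pool per node: full IOPS isolation, no parity,
# and one dead drive costs exactly one node.
zpool create node01 /dev/disk/by-id/hdd-01
zpool create node02 /dev/disk/by-id/hdd-02
# ...and so on, one per bay
```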


That’s the part I’m unsure about. Nodes write async, so even with, say, 20 nodes on a giant parity-protected pool, those writes go to memory (and the HDDs get them eventually). But random reads for 20 nodes could be an issue… and with that many nodes, the chances are that a compaction is often running for one of them. Maybe the theoretical 20-node pool really would be IOPS-starved?
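A back-of-the-envelope check, using the rule of thumb that a raidz vdev delivers roughly one drive’s worth of random-read IOPS (OpenZFS’s dRAID docs estimate read IOPS as the number of redundancy groups times single-drive IOPS, and a raidz vdev is the one-group case). Assuming ~180 random IOPS for a 7200 rpm drive, a typical figure rather than a measurement:

$$\text{pool IOPS} \approx \left\lfloor \frac{c - s}{d + p} \right\rfloor \times \text{IOPS}_\text{drive} \quad\Rightarrow\quad \frac{1 \times 180}{20\ \text{nodes}} \approx 9\ \text{IOPS per node}$$

Whether ~9 random reads per second per node is survivable presumably depends on how much of the hot set the ARC absorbs.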

I can’t test… because I don’t have a large number of large nodes to combine into a large pool with a large number of HDDs. That would be a wonderful problem to have :money_mouth_face:

The stripe size needs to be large; then many reads can be satisfied by just a single HDD, or a small subset of them.
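In ZFS terms the knob is the dataset’s recordsize; a sketch, with a hypothetical pool/dataset name:

```sh
# Large records so a typical piece read is one stripe on one vdev.
zfs set recordsize=1M tank/storj
```

Going above 1M needs extra module tuning, so 1M is the practical ceiling on most systems.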

You might build your array using multiple vdevs. That provides better parallelism: read speed is theoretically multiplied by the number of vdevs, while write speed is basically the same. But you get less space.
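That is, one pool striped across several vdevs rather than one wide vdev, e.g. (hypothetical names):

```sh
# One pool, three 8-disk RAIDZ2 vdevs: ~3x the random-read
# parallelism of a single 24-wide vdev, minus 2 parity disks per vdev.
zpool create tank \
  raidz2 /dev/disk/by-id/hdd-{01..08} \
  raidz2 /dev/disk/by-id/hdd-{09..16} \
  raidz2 /dev/disk/by-id/hdd-{17..24}
```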

I know I’m getting off-topic, but I still have a hard time wrapping my head around this idea. If, on average, some of the data being uploaded isn’t getting regularly deleted… then it seems like, as nodes age, they’d gather more and more layers of that kind of data, like sediment. So even with regular bloom filters, shouldn’t they grow forever?

I guess I don’t understand the big picture: like if I ever get a node to grow to 10TB… should I start a second node on the same IP for the next 10TB? (because it’s easier/faster to get 2x10TB on the same IP than 1x20TB?).

(Edit: I guess… in a large-scale, hand-wavy way, there could be the idea of an average data-retention time. Like, across the whole Storj network, a piece of data is held on average 3 years before deletion. Then there would be an average max node size (however much upload that node receives in 3 years). But that would be per subnet: for a SNO to get more used space, they’d need another node on another /24?)
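That hand-wavy model actually has a clean steady state: with roughly constant ingress and an average piece lifetime, node size converges to ingress rate times mean retention time. With invented numbers:

$$S_\text{steady} \approx r_\text{ingress} \times T_\text{retention} \approx 0.25\ \tfrac{\text{TB}}{\text{month}} \times 36\ \text{months} = 9\ \text{TB}$$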

(Edit #2: It’s also possible to combine the ideas. Like, an average node hits an average max of, say, 8TB over, say, 3 years. Then growth slows drastically (but doesn’t stop) as those sediment layers of rare, never-deleted data keep stacking up…)