SSD CACHE for RAID

Hello,

Is anybody using an SSD cache under Linux?
I have a spare SSD and think it could improve the speed of writing data to the node, and maybe also let it accept more concurrent requests?

Any suggestions? Is it worth it?
If yes, what works better for a Storj node: bcache, lvmcache or something else?

Thank you.

I use zfs with hard drives and two 240GB SSDs.
The SSDs are partitioned into a small partition (4 GB) and a large partition (200 GB); the remaining space is left for overprovisioning.
The small partitions are joined in a RAID1 as the ZIL (write cache; it has to be reliable).
The large partitions are joined in a RAID0 as the L2ARC (read cache; reliability does not matter here).
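For illustration, the partition layout could be created roughly like this with sgdisk (a sketch only; the device name is hypothetical and your sizes may differ):

```
# One of the two 240 GB SSDs; repeat the same layout on the second one.
sgdisk -n 1:0:+4G  /dev/sda    # small partition: one half of the mirrored ZIL/SLOG
sgdisk -n 2:0:+200G /dev/sda   # large partition: one leg of the striped L2ARC
# The remaining ~36 GB stays unpartitioned as overprovisioning.
```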

The most important thing, as @Pentium100 hinted, is that a write cache must be reliable. If the cache drive dies after a write has been committed to it but before the data is pushed to the underlying volume, then the underlying volume is likely to be corrupted. So, at a minimum, you need two SSDs in a RAID1 and use the RAID device as the cache.
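For example, the mirror itself is a one-liner with mdadm (just a sketch; the device names /dev/sdc and /dev/sdd and the md number are made up):

```
# Build a RAID1 from the two cache SSDs first...
mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
# ...then hand /dev/md10 to the caching layer (lvmcache, bcache, a ZFS
# log vdev, ...) instead of a bare SSD, so a single SSD failure cannot
# take the write cache with it.
```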

Depending on how much revenue your node makes, the cost for two SSDs may be overkill – and don’t forget that they will consume drive bays¹ that you could otherwise use to add more storage to the node. Given the choice between a cache and adding storage, I would choose to add storage.

¹ Unless you use NVMe SSDs, I suppose. I’m still unsure if the (slight) revenue increase that results from the faster disk performance would pay off the SSDs within a few years. And with NVMe you don’t get hotplug replacement for a failed disk. (PCI hotplug is still just barely available in the consumer market. I’ve only seen it on high-end motherboards and support is usually experimental.)

Thank you for your answers.

@Pentium100 can I overlay ZFS on an existing RAID6 array? Or does it have to be done from scratch?

@cdhowie I don’t want to spend money on SSDs, but I do have a few of them. I also still have connections for the drives, so that’s not a problem :slight_smile:

The main question then would be: is it worth doing? Will it be visible somehow in the payouts? :smiley:

You might see a slightly increased upload success rate, meaning that your node can fill its storage at a slightly faster rate. This means your storage income would probably be a bit larger. More storage used also means more opportunity to serve piece downloads, which could increase your egress income.

Depending on the details of your setup, the download success rate may or may not see any impact, and probably only on pieces that are frequently fetched. (Consider that a read cache depends on a high rate of cache hits; with a high rate of cache misses, which I would expect on a Storj node, your download success rate would not improve.)

The tl;dr is that it’s not something we can easily answer because it’s highly dependent on both your storage configuration and the access patterns that your node sees.

Edit: It’s also worth noting that I’ve heard horror stories of lvmcache corrupting the underlying volume even though there was no storage failure. I don’t know how true this is or how recently it happened, but I plan to pretty thoroughly test lvmcache in recent kernels before I actually consider using it in production anywhere. At least in my mind, the risk outweighs the potential reward.
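For reference, if you do want to test it yourself, a minimal lvmcache setup looks roughly like this (the volume group "vg0", the origin LV "data" and the cache PV /dev/md10, e.g. the mirrored device from above, are all assumptions):

```
# Add the (mirrored) cache device to the existing volume group.
vgextend vg0 /dev/md10
# Create a cache pool on it and attach it to the origin LV.
lvcreate --type cache-pool -L 100G -n cachepool vg0 /dev/md10
lvconvert --type cache --cachepool vg0/cachepool vg0/data
# The default cache mode is writethrough; writeback helps writes but is
# exactly the mode where losing the cache can corrupt the origin volume.
```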

Honestly, unless you’re just playing with ZFS as an enthusiast looking for something to do, I wouldn’t recommend using ZFS. Someone recently asked about recordsize optimization, and the response gets into the overhead and cost/benefit of RAID6 vs raidz2. In a nutshell, you’ll lose about 20% of your actual usable capacity, on top of what the parity disks already take.

Is it REALLY worth it to you for a ~20-30 ms difference in response time?

ZFS works great, in production environments too. The 20% overhead does not seem right to me, though I am not storing Storj data on ZFS directly. The Storj node runs inside a VM and uses ext4 for the data, but the virtual disk is a zvol (volblocksize=64K; small volblocksizes combined with raidz do result in large overhead).
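For illustration, such a zvol might be created like this (the pool name "tank", the zvol name and the size are assumptions):

```
# 64K volume blocks keep raidz parity/padding overhead reasonable.
zfs create -V 2T -o volblocksize=64K tank/storj-vm
# The VM sees /dev/zvol/tank/storj-vm as a plain block device and
# formats it with ext4 inside the guest.
```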

ZFS works better with direct access to the disks rather than on top of an array. It will work on an array; however, in case of problems with the disks etc., it would be better if ZFS could access the disks directly.

ZFS is a bit different. If you use one drive for the ZIL and it fails, you may not lose data, since the ZIL is essentially a “backup” of the write cache that lives in RAM and is only read after an unclean shutdown. But if you have an unclean shutdown and then the SSD fails, you have problems.

Right, I should have clarified that this is the situation I was describing – or attempting to describe.

Yes, unfortunately what happens when a zvol goes over 80% full is a painful reality that most people find out about the hard way. :broken_heart: I can’t even imagine trying to run VMs on a volume that is over 80% full!!! (Fragmentation also becomes a huge performance issue over time … but let’s save that for another time.)

Oracle, which got ZFS through its acquisition of Sun Microsystems, seems to claim that 90% is the new threshold if you’re running Solaris … but the rest of the internet still seems to argue that 80% is the sweet spot.

https://blogs.oracle.com/zfs/current-zfs-pool-capacity-recommendations
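Whichever number you believe, it’s easy to keep an eye on where a pool currently stands (the pool name "tank" is just an example):

```
zpool list -o name,size,allocated,free,capacity,fragmentation tank
```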

That’s just performance. If the pool is on SSDs or you do not need it to be very fast, then it is possible to fill it up more.
As it applies to my Storj node, I think I could most likely get away with filling the pool up - I have an SSD cache and not every file on the node is equally likely to be accessed, so the frequently accessed files would get cached and the remaining reads would probably be few enough for the pool to cope.
Then again, if the pool were filling up I’d start thinking about adding more drives.

@Pentium100 since I’m using a mirror of ~120 GB SSDs for the OS etc., I still have plenty of fast and reliable space. It’s not a directly accessed disk, but it’s still unused space. I could leave ~20 GB for the OS etc.; could the remaining ~100 GB be a ZFS cache?

I would create a partition for the cache, though I think you can also set up a loop device and add that as the cache. I have not tried that, so I do not know how the performance compares to a separate partition.
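The untested loop-device variant would look roughly like this (the file path, size and pool name are assumptions):

```
# Back the cache with a file on the SSD mirror...
truncate -s 80G /mnt/ssd/l2arc.img
losetup /dev/loop7 /mnt/ssd/l2arc.img
# ...and add the loop device as an L2ARC device.
zpool add tank cache /dev/loop7
```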

Yes, I do have additional space on both SSD drives as a mirrored spare partition. There is swap (~4 GB), root/OS (~20 GB) and empty space (~80 GB).
So I believe this is what you are talking about? :slight_smile:

Is it hardware RAID or software RAID?
You could create two partitions, one small (2-4GB, say sda3) for write cache and one big (the rest, sda4) for read cache.
If it’s software RAID, then you can do that for both drives and add sda3, sdb3 as mirror log and sda4, sdb4 as cache.
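In command form that would be roughly (assuming the eventual pool is called "tank"):

```
zpool add tank log mirror /dev/sda3 /dev/sdb3   # mirrored write log (SLOG)
zpool add tank cache /dev/sda4 /dev/sdb4        # L2ARC; no redundancy needed
```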

It’s mdadm running inside Ubuntu:
RAID1 from SSDs for the OS (here I can create and delete partitions in the unused space).
RAID6 from HDDs for the storage node data.

Everything is set up and running, so there is no way I can build an array from the raw drives. But if, in this situation, I could apply a fast cache to the running system without destroying it, that would be great :slight_smile:

Would it not be better to just have a good RAID card with a flash-backed write cache?