Slow SLOG device, now looking for a fast one

been browsing my options for a new / used slog device for a few months… i knew that i would eventually have to abandon my old ssds, was looking at a Samsung 970 Pro, but it doesn't really give me the performance i want… which means ssd tech is out the window

during my long time browsing my options i found quite a few of the so-called IO accelerators, tho it seems to be very much a hp / ibm product line, and most of these also use ssd tech.

what i'm looking for is a pcie module with 16gb of ram or so (i think the one i found a while back had 16gb, maybe even replaceable ram), thus greatly exceeding ssd random iops, with the data secured by capacitors on the card…

think it cost like $300 and would do a few million iops, much better than the 20k random read iops a samsung 970 pro will do…

but at present i cannot seem to find another card like it and was wondering if anyone in the storage crowd might remember what this type of card is called…

Using volatile memory for a SLOG drive seems like a very bad idea, even if it is backed by some capacitors. Especially if you're looking at used ones. How long will they hold their data when power goes out? What if you need to do some maintenance and need to have the system off for a bit? This is a recipe for disaster. I did some Googling and the only ones I can find have either PCIe 1.0 or 2.0 interfaces, which suggests to me this is not a solution that is commonly used anymore. We're seeing the first SSDs pop up now with PCIe 4.0 interfaces that would actually be able to take advantage of the added speeds.

I'm assuming you're not buying this for storj either? Since that would be complete overkill. I personally use SATA SSDs as read/write cache on my NAS, which serves my main node and the DBs of 2 other nodes, as well as a ton of other personal uses including media serving and running games over SMB (with a local SSD cache in the game system. I know, kinda crazy.), and they do just great. In fact I have a node running on a USB 2 connected HDD that is performing quite fine (though as mentioned its DBs are moved to my SSD accelerated array). The only difference in performance I see for the USB 2 node is a download success rate of 98.4% compared to 99.8% on my other nodes. It also collects more trash than the other nodes. But that difference alone shouldn't be worth spending massive amounts on SSD acceleration.

I’d like to hear from some people who have experience with these IO accelerators. But it would be my guess that these are almost exclusively used for read acceleration and definitely not for write caching. It honestly makes me shiver to even think of using them for write caching. Please really think about what you’re doing here. Are you certain this is the right approach? Because intuitively all I hear is alarm bells.

on power outage there are a couple of mirrored flash modules that the ram will be offloaded onto…
much akin to the PLP on enterprise ssds.

this was standard practice for many years; even today ssds are only just taking over that market, and in many cases the accelerator cards' ssds will have a ddr front end.
the issue isn't really present unless one starts to do all writes as sync writes; that is a measure to ensure the highest level of data integrity, because written data can be "ack"nowledged within the nanosecond range.
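just to be concrete about what a sync write means at the syscall level, here's a rough sketch (the path and write size are made up, run it against your own pool for real numbers); the point is the sync write doesn't return until the data is on stable storage, which is why the slog's latency is the whole game…

```python
# rough illustration of a buffered write vs a sync write
# (file path and size are placeholders, not from my setup)
import os
import time

data = os.urandom(4096)
path = "/tank/synctest"          # placeholder path on the pool being tested

# buffered write: returns as soon as the data sits in the page cache / zfs dirty data
t0 = time.perf_counter()
with open(path, "wb") as f:
    f.write(data)
buffered_ms = (time.perf_counter() - t0) * 1000

# sync write: fsync blocks until the data is on stable storage,
# which with sync=always means "until the zil / slog device has acknowledged it"
t0 = time.perf_counter()
with open(path, "wb") as f:
    f.write(data)
    f.flush()
    os.fsync(f.fileno())
sync_ms = (time.perf_counter() - t0) * 1000

print(f"buffered: {buffered_ms:.3f} ms, fsync'd: {sync_ms:.3f} ms")
```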

other advantages of this model are that there is essentially no wear on the device, because it's RAM based and doesn't use its flash storage for anything aside from power loss protection, and the ram can be replaced with bigger and better modules later on… in some cases…

one of my main issues is i can only fit low profile cards…

yes i think this is the right solution, i could just turn sync=always off… but sync=always is really my PLP, because i've got highly stable power; i shouldn't have to worry about power outages and thus can save on not getting a UPS, even if a UPS might make better sense… but a UPS won't improve my system…

and the two ssds i use as slog now barely cut it… i'm still seeing near 100ms latency on the older one of them…
this method of operation also turns all my random writes into sequential writes when they hit the hdds, and thus reduces fragmentation of the data on disk, and because it's sequential the hdds will be able to manage much higher throughput…

started researching my storage solution when i first installed storj, and this is not just a storagenode; the general idea is to have all storage in a massive pool and then run most of the local network through it, maybe do some nvme over lan (nvme over fabrics) or whatever it's called… and ofc run all kinds of vms and stuff on the pool, basically cutting down the price of storage, because it doesn't end up sitting mostly idle in devices all over.

oh yeah, and because the hdds write sequentially they will have less latency for reads, which is another nice advantage.

so many reasons for this… it will also increase cpu performance for non-trivial tasks, when working on datasets larger than my ram i'm guessing… hmmm, or will it, if the dataset is smaller than my ram… maybe i do need one with a decent ssd component to it…

like if one looks at this… then i would think that the speed gained is due to the ability to work faster with a larger dataset… might have to dig a little deeper into this…
but no matter what i’m pretty confident that the best route is to have a RAM frontend for writes with PLP and flash backing … then how big that flash has to be becomes the real question i guess.

Isn’t it a bit contradictory to insist on sync writes for everything, but then ack when the data is in volatile RAM? You’re now trusting transistors not to fail on you. You’d need at least two of these in RAID1.

I tried my best to find some more recent information, but there doesn't seem to be anything from 2015 or later. The most recent I found was documentation from 2014. Which is a different age in computer hardware. Could you link to one of the ones you might be considering?


these look pretty nice; the problem with more recent models of io accelerators is the price…
the modern models start at $1000 and upwards, and it seems the companies selling them will only take orders in the thousands, or the hundreds once the models start to age…

thus the stuff on the consumer market is a few years behind at the very least…
i really like the oracle F640, but that's like a 6.4TB card and will do random writes in the millions, because it will be using a RAM frontend and then ssd backing… or that was how they used to do it…
but even today a 2TB samsung 970 evo plus does 31k random writes… and samsung are the leaders in memory tech i believe… so i cannot imagine how they would get into the millions of random writes without a ram front end… but ofc their reads are the same…

so maybe it's just a huge nexus of interconnected MLC / SLC cells; doesn't really matter too much to me, so long as the card can do what i want.

this FusionIo series seems reasonably priced and will still do 10-15x the random write iops of what a 970 can…

www.ebay.com/sch/i.html?_from=R40&_nkw=fusion+io

ECC memory has extra chips so that if there is a problem it has internal redundancy; on top of that one may use a sparing setup, which is essentially the raid5 of ram… but i guess only a mirror setup would be fully reliable…

well these accelerator cards are very high end gear… i'm sure they will work just fine… also the write cache is a redundant system… it only comes into play if the system fails… otherwise the data will also be in ram… thus the SLOG can be removed at any time without issue…

ZFS mainly uses memory for this stuff; like the l2arc, the slog only contains data that also exists elsewhere (in ram, until it has been written to disk)… thus it can be purged without any issue, and if there is an error somewhere it can be used for data recovery…

but yeah, some people do run mirror setups of such cards or more i guess… but i do feel that is a bit overkill; i like the speed, and sync always gives me very high integrity; if i ever lose a file i'll consider it… but even thus far, using regular ssds and pulling devices at random for testing purposes, it doesn't seem to matter…

i would rather add a second accelerator striped in, to double the io writes lol… but hopefully i won't need to do that for a long time…

i see your point, and it might be valid… but i’m not trying to make a perfect system… just like to remove the worst of the integrity issues.

and with this setup i can basically pull the power cord and the system doesn't lose any meaningful data… sure some data will be lost… but maybe a couple of milliseconds' worth, and ZFS is CoW so there should be no corruption because of it…

sure, if the slog device is then also damaged, there will be a few seconds of data loss, but there still shouldn't be any corruption… which is what i really want… a system running well without data errors, so i know it's software issues and not hardware issues… 99.9% of the time.

but yeah long story short…
i’m looking at the fusion io drives… they seem reasonable in price… even if they are kinda old tech.
i would like something in the TB range because then i might place the l2arc on it also…

and so i should get something like 1.6TB, because that's like 1.2TB after the 25% set aside for limiting device wear, and then overhead, so maybe 1tb left, which means i can do l2arc for a pool of like 100-200TB, which is the recommended range for the L2ARC when dealing with large database type workloads.
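roughly the napkin math i'm working from (all numbers below are my assumptions, including the l2arc header cost, which varies by openzfs version):

```python
# rough sizing sketch for a combined slog + l2arc card (numbers are assumptions)
card_tb       = 1.6                  # advertised capacity
overprovision = 0.25                 # leave 25% untouched to limit wear
slog_gb       = 16                   # a handful of GB is plenty for the zil
fs_overhead   = 0.10                 # assumed partitioning / metadata overhead

usable_tb = card_tb * (1 - overprovision) * (1 - fs_overhead)
l2arc_tb  = usable_tb - slog_gb / 1000
print(f"usable after overprovisioning / overhead: {usable_tb:.2f} TB")
print(f"left for l2arc: {l2arc_tb:.2f} TB")

# l2arc headers live in ram; assuming ~100 bytes of arc per cached record
# at 128k recordsize, this is the ram cost of indexing the whole l2arc
recordsize_kb = 128
header_bytes  = 100                  # assumption, varies by openzfs version
ram_gb = l2arc_tb * 1e12 / (recordsize_kb * 1024) * header_bytes / 1e9
print(f"approx ram used for l2arc headers: {ram_gb:.1f} GB")
```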

my temp fix of adding another ssd as a span… did make my upload improve… ofc right now only the storagenode is running… and the bad hdd seems to be acting totally fine now… seems it might just have been an artifact of the slog device not being able to keep up…
right now i should have about 13k random write iops… and it kinda makes me wonder if one can remove all errors and all cancelled uploads by disk performance alone :smiley:
which would be interesting lol

ofc it’s kinda relative… only works until tech catches up…

========== AUDIT ==============
Critically failed:     0
Critical Fail Rate:    0.000%
Recoverable failed:    0
Recoverable Fail Rate: 0.000%
Successful:            996
Success Rate:          100.000%
========== DOWNLOAD ===========
Failed:                1
Fail Rate:             0.001%
Canceled:              51
Cancel Rate:           0.076%
Successful:            67044
Success Rate:          99.923%
========== UPLOAD =============
Rejected:              0
Acceptance Rate:       100.000%
---------- accepted -----------
Failed:                0
Fail Rate:             0.000%
Canceled:              8
Cancel Rate:           0.033%
Successful:            24265
Success Rate:          99.967%
========== REPAIR DOWNLOAD ====
Failed:                0
Fail Rate:             0.000%
Canceled:              0
Cancel Rate:           0.000%
Successful:            14934
Success Rate:          100.000%
========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              0
Cancel Rate:           0.000%
Successful:            7704
Success Rate:          100.000%
========== DELETE =============
Failed:                0
Fail Rate:             0.000%
Successful:            4542
Success Rate:          100.000%

What an absolute madman!

I'll be the first to admit I'm not an expert on these devices or ZFS. But I'm still certain that if you put forcing sync writes for everything at a higher priority than having your SLOG device be redundant, you don't have your priorities straight. Especially since you are looking at second hand hardware as well. Everything can fail. You can't rely on redundancy on the drive itself because the drive might fail in its entirety. In the case of RAM-based devices, capacitors wear out, and you're planning on using old devices.

I think the problem with your approach is that you're throwing everything at the SLOG device by forcing sync on everything. Take Storj for example (after all, that's why we're here). The database writes need to be sync writes, because they need to ensure consistency, and the software will make sure they are. But writing pieces can easily be done asynchronously. Worst case you could corrupt pieces for ongoing transfers, and in almost all cases those transfers would then not complete and your node will never be held accountable for that loss. The vast majority of writes for the node software are piece data though. So instead of bothering the SLOG device with everything, it could be working on just the database writes, and any SSD at all would be plenty. But now you're forcing everything to be written to the SLOG, and that slows down the database operations to the point of lock congestion. Just because you don't trust the software to determine which writes need to be sync writes and which don't.
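Just to illustrate what I mean, something like this is all it would take (dataset names are placeholders, adjust to your own layout); sync stays forced where it actually buys consistency, and piece data is left to the application:

```python
# sketch: set the sync policy per dataset instead of pool-wide
# (dataset names are placeholders, not from SGC's actual pool)
import subprocess

def zfs_set(prop: str, dataset: str) -> None:
    """Apply a ZFS property through the zfs CLI."""
    subprocess.run(["zfs", "set", prop, dataset], check=True)

zfs_set("sync=standard", "tank/storj/pieces")  # piece data: let the app decide (mostly async)
zfs_set("sync=always", "tank/storj/db")        # node databases: keep every write synchronous
```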

Lol, yeah I kind of am. But no worries, I have about 800GB of local cache on SSD. Which basically means that I'm running all recently played games at local SSD speeds. It just lets me have an endless size game drive on my game PC (while only actually using a 1TB Samsung 970 Pro), so I never have to uninstall anything. Loading times on games I haven't played in a long time will be a bit longer if I pick them up again, because it may need to retrieve some stuff from the NAS, but in practice this slowdown is barely noticeable. It works pretty damn well actually. Although unsafe shutdowns are a bitch, since that forces it to copy the entire local cache to the NAS again to ensure consistency. This can be done at low priority in the background though.

Care to tell me which software you use to do that? And if that solution is available for Windows :slight_smile:

the slog is redundant; stuff is written to it to be retrieved in case of a power failure, system crash or such…

thus 99.9% of the time it will not be used… because zfs will grab the data directly from ram when putting it down onto the hdds…

thus the slog is essentially a redundant system… sure, if it breaks, and the system loses power / crashes before the slog issue is fixed, then i could lose 5 or 10 sec worth of data, but i shouldn't get any data corruption because ZFS is Copy on Write.

a second slog disk would make even that impossible… but the risk is very minimal in my view… also i may get a solar setup which would double as a UPS to cut down on the electricity costs…

that would also reduce the risk of a power failure, it’s risk management… and i will ofc have to cut some corners… just like your system is most likely prone to bit rot…

i've survived for decades using regular hdds… i'm sure this insane setup will help keep my data more stable long term… and if i lose a storj file… well i'm also a redundant part of the network… so no great risk there… but like this i would see very limited database errors and audit failures, and just have much less trouble keeping the systems and data stable long term, i believe.

and ofc i get insane random write iops performance :smiley: which makes sync always still viable.

Sure, I didn’t want to derail the topic too much though. I use Stablebit Clouddrive, which is indeed windows software, but kind of still looking for another solution. It works pretty well, but you have to tinker with the settings to get everything performing right. Like setting it to background uploading and using only 1 parallel process, otherwise it’s going to slow down reads from the NAS drive too much. It doesn’t seem to be made entirely for this kind of implementation though. You can actually use cloud providers as the remote drive, hence the name. But that definitely wouldn’t be a good idea for running your games from. :stuck_out_tongue:
Probably best to PM me if you want more info. I don’t want to take over @SGC’s topic.

Anyway, back to the topic at hand. I’m aware that SLOG is only used in case of failure. It just seems counterproductive to use a drive that may be the cause of that failure. :slight_smile:

This was kind of the point I was making too. Forcing sync writes on all piece data is really not helping you. In fact, if something happens that causes some corruption on ongoing writes, the damage would be so small it would almost certainly never even be caught in an audit. You'd definitely not lose your node for it. Much of the data written to disk has similar properties. So I think you're creating an unnecessary bottleneck by forcing sync on everything.

In the end the question is, is the cost to make this happen really worth it? For storj alone it definitely isn’t. But you may have other needs.


That was already everything I needed to know; my NAS is located several hundred kilometres away from my gaming rig, so this isn't a viable scenario anyway :slight_smile:

yes, i am creating a bottleneck, of that there is no doubt, and yes the system will run a lot better without sync always… at least short term…

the immediate effects of turning sync from always to standard are things like higher rates of file handling when moving around mostly empty files, which go up by a factor of like x50 if not more…
it would give me higher throughput because i wouldn't have to go through the ssd, and i wouldn't create extra internal bandwidth utilization from moving data around inside the system.

pushing all data to the SLOG ofc makes all writes essentially double, because one copy goes to the SLOG and the other goes to memory and then towards the disks (i'll count memory-then-disk as one path)… so it will essentially use the SLOG's bandwidth on top of everything else, depending on the configuration and host hardware… in case of a mirror SLOG that bandwidth usage would of course be doubled again… so yeah, running sync standard would be nice…
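just to put rough numbers on that double-write effect (the ingest rate below is made up, not a measurement from my system):

```python
# back-of-the-envelope write traffic with sync=always and a dedicated slog
ingest_mb_s  = 200                         # assumed incoming write rate, not a measurement
slog_copies  = 1                           # set to 2 for a mirrored slog

slog_traffic = ingest_mb_s * slog_copies   # every sync write lands on each slog device first
pool_traffic = ingest_mb_s                 # and is written again from ram to the main pool
total = slog_traffic + pool_traffic
print(f"{total} MB/s of device traffic for {ingest_mb_s} MB/s of ingest "
      f"({total / ingest_mb_s:.0f}x)")
```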

but i also gain improved read speeds now, and in the future i gain even more from less fragmented data because there are "no" random writes to my hdds; plus i ensure the data is written onto the hdds instead of sloshing around in memory for up towards 167 sec for non-sync writes in rare cases… again risking data integrity, because there is limited redundancy on that data… even with ECC RAM.
Database writes are known to cause heavy fragmentation on storage, and thus for special large scale, long term database loads a sync=always approach is recommended and taken by most.

it's not for fun that companies like oracle use and produce these cards, and the gains from having such cards can be immense… sync always just forces me to take that route…

and no, i will not be turning off sync always even tho i can… i've tested it on and off until now… and tho there are some issues with it, they will be about fully mitigated in any practical sense by an io accelerator card. i'm by far no expert on all of this stuff, i've just researched it a lot recently and follow what i consider to make sense per the recommendations of the oracle zfs manuals, ixsystems blog posts and other such sources i have been able to find that actually deal with this kind of stuff…

also i plan to make this pool massive… so upgrading to something two sizes too small is no advantage long term… i don't expect to be replacing this card for a long time and might move it into the next server, because i doubt ssd tech will become 10-15x better before i upgrade that…

yeah, like i said… i want to just throw everything into one pool… make storage on my entire local network obsolete… an expensive luxury… why have a 500GB nvme ssd in a gaming pc when it's used maybe 5-8 hours a week… i think that tomorrow's storage, and most likely computers, will be much more local-datacenter-like…

and sure this is a crazy kinda setup… which might not make sense, but i think it does… i think there are plenty of good reasons to run the setup like this… especially long term… after all a storagenode takes 9 months just to start giving 100% return, so data reliability should be counted in the years…

i don't expect any errors to be caused by this… i haven't seen any yet and i run scrubs on a frequent basis… a zfs scrub reads and verifies checksums on everything…

so if even a bit was wrong, then i would know about it… and thus far i haven’t had one recorded on drive.

even with me being kinda mean to it for the first long while… ofc other factors come into play… like neglect and disk redundancy… as you wisely stated early on when talking about storage redundancy, raid with 1 redundant drive isn't an optimal solution… and me running 3x raidz1 is imo the weakest point in my setup… and then maybe my ram, since i only run ECC without any mirroring or spare setup
but again, ram data is redundant just like slog data is redundant… so long as one of them is working and/or powered…

so really, if i wanted to improve my data integrity even more, i should find a way to do raidz2 (two redundant drives), which covers missing that a disk was bad, or having a bad drive i was unaware of, during a rebuild… ofc i mitigated some of that risk by having multiple raidz1s with few drives each, so i can rebuild each drive in little time and with greatly reduced data overhead, which limits my system's time exposed to data corruption… while bigger arrays / raids will need to work much longer to fix the issue and thus increase the chance of another disk failure in the meantime…

and i want to do that… it's just not really practical if one wants nice raw iops on the hdds while not losing too much data capacity, and keeping everything affordable and easy to manage…

60-80ms to reach the other side of the globe or something like that… so shouldn't be too bad… ofc bandwidth limitations might become an issue the further away one gets…

Just buy 2 Intel Optane 16GB and run them in raid 1. They are cheap and they are designed for this use case. If one gets destroyed you can simply replace it without having to worry about data loss.

A 970 EVO wouldn't be good for this. They are designed for a different use case and, most importantly, they are more expensive.

I am running the SLOG for 6 hard drives on this cheap setup. Each drive has a 2GB partition for SLOG. This is all I need to buffer 5 seconds * maximum internet connection. I am using one drive for my personal data which means 5 seconds * LAN connection. This drive has the last 6GB partition.

If I ever add additional drives I would split this 6GB partition. That might mean less performance for my personal data but my goal is maximum performance for the storage nodes. I wouldn’t mind. I could add a bigger Intel Optane but just for my personal data that is not worth it.
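As a quick sanity check of that rule of thumb (the link speeds below are just examples):

```python
# the slog only has to hold roughly one transaction group (~5 seconds) of sync writes
def slog_size_gb(link_mbit: float, seconds: float = 5.0) -> float:
    """Rough SLOG capacity needed for a given link speed."""
    return link_mbit / 8 * seconds / 1024     # Mbit/s -> MB/s -> GB

print(f"1 Gbit/s internet: ~{slog_size_gb(1000):.2f} GB")    # fits easily in a 2GB partition
print(f"10 Gbit/s LAN:     ~{slog_size_gb(10000):.2f} GB")   # roughly the 6GB partition
```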


I guess my side step wasn't as irrelevant as I thought. But the answer is latency. There is a reason I have a massive local cache. Latency matters in games. You don't want assets popping in or, worse, causing frame drops because the storage is at a large distance. I recommend you also look into what they have recently been showing about the super fast SSDs that will be in the PS5, as that opens paths for games that didn't previously exist. It's a scenario where local storage can still have a big impact.

In theory this may be possible if you have 500mbit symmetrical or more as well as a large local cache. But any asset that is loaded during gameplay could lead to massive hitches. See my previous comment.

Of course, this is probably why in recent years you don’t see those IO accelerators anymore. Optane has probably replaced them in similar scenarios. Somehow my brain had archived the existence of optane away somewhere.

yeah, i looked at the optane drives for this use… but i plan to have many hdds (already got 9, which makes my sequential writes peak at 1 gbyte per sec), and the cheaper optane drives… tho having mindblowingly good latency… have very limited write throughput… and i plan to hook up a disk shelf eventually…
so already a cheap-end optane would greatly limit my writes…

i also considered the higher end optane, like the P800, but in higher load situations the latency increases a bit too quickly for my taste, which is why i initially thought i could simply use a samsung 970, but that also runs into limitations that one has to go back something like 10 years to find in the enterprise accelerator cards.

but yeah, for a long time i thought a small optane drive was my solution; not sure what its random write iops is… most likely pretty great… would also be nice to combine the SLOG and L2ARC into one drive for better utilization of system resources such as pcie slots, else i will eventually end up wanting to get yet another pcie ssd-like device to contain the L2ARC

think i’m going to go with one of these
1.6TB version

lol was looking for the optane specs
found this, haven’t considered these, doubt my mobo supports that also… might be cool tho

this is the smallest performance-oriented m.2 i could find on intel's optane site; its numbers aren't bad… but if memory serves it will become slow when overloaded… and still it's not going to be immensely cheaper…

and i would need an adapter card… and it's pcie 3.0 and my mobo is pcie 2.0, and tho it can most likely run without issue… some say there can be added latency from using the newer pcie cards in an older bus… and i think their fullsize pcie accelerator card is like $1000+

and on top of that the sequential write performance isn't too great… it would make great l2arc tech tho…
one major thing that may make your suggestion valid is the power consumption…

a bit below 4 watts, and the card i was looking at is like 20 watts… ofc it's a much larger ssd and thus will consume more power because of it…

might dig a bit more through optane… but it might be too much of an expense

the H10 was what i was looking at before… it has even more limited sequential writes, but is very cheap, hybrid optane / QLC

and mirroring isn’t worth it imo…

btw i changed back from 16k record sizes… as it made my scrubs and migration of storagenode data nearly impossible… increased transfer times by 4-8 times… so now i'm running 512k :smiley: should have been on 256k or 128k… but it's okay… i think… at least it works :smiley:
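for reference, recordsize is just a per-dataset property, and it only applies to files written after the change, so data migrated earlier keeps whatever block size it was written with… a minimal sketch (the dataset name is a placeholder):

```python
# sketch: check / change recordsize on a dataset (dataset name is a placeholder)
import subprocess

subprocess.run(["zfs", "set", "recordsize=128k", "tank/storj/pieces"], check=True)
out = subprocess.run(["zfs", "get", "-H", "-o", "value", "recordsize", "tank/storj/pieces"],
                     capture_output=True, text=True, check=True)
print("recordsize is now", out.stdout.strip())
```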

last scrub of my pool with 15tb data on it took 8 hours i think it was… new record :smiley:

I'm using a pair (stripe mode) of ZeusRAM Z4RZF3D-8UC 3.5" 8GB STEC SSD 95100-02049-011U Solid State Drives. An old (SAS2) and pretty neat piece of equipment. It uses 8GB of RAM during normal operation and capacitor+ssd in case of power failure (dumping RAM contents onto SSD). It's decently fast and as reliable as it can be. Bought my pair off of eBay for ~$300. Happy to answer any questions.


that is a pretty impressive piece of hardware, basically a perfect purpose-built slog device, and it looks to even be able to keep up with most modern ssd specs… just low in capacity and throughput…

i saw somebody mention there were two sas ports on it… for HA, but to my understanding SAS already has redundant connections, so that doesn't really make sense unless it's for throughput or iops because of a sas controller / cable limitation.

does it really have two sas ports?

not exactly what i'm looking for but a very interesting variation; i run sata and thus it's not possible to mix in sas or i run into trouble… had two 6tb sas drives in the pool which only caused me many hours of grief, so now they sit in their own mirror; it would be a perfect slog for those tho…

i suppose that's another good reason for me to get a pcie device… then it doesn't matter what kind of pool i need to put it into in the future.

These small drives are barely better than SATA SSDs. Proper optane, on the other hand…