Write cache strategic deactivation - SMR disks

Propagandalf · December 22, 2020, 8:25pm

I have SMR drives, and they are known to have lousy random read/write efficiency unless Windows write cache is enabled. But, enabling it makes the data on the drive more vulnerable to corruption if the system shuts off unexpectedly and such.

Is the write cache mostly beneficial in this case for ingress, compared to egress? If so, then could it be a good idea to turn off the write cache once the drive has filled up, or will this degrade performance too much, and thereby reduce potential income?

deathlessdd · December 22, 2020, 8:32pm

If you are prone to power outages you should turn it off, no question. If you have a UPS you could keep it enabled. But you always run the risk of corrupting data if you keep it enabled and don’t have a UPS. If you have an unstable system I would disable it as well.

Propagandalf · December 22, 2020, 8:34pm

Thanks, that’s good advice, but I still would like some more information about how write cache relates to performance of ingress vs. egress, so that it’s possible for someone to willingly “take more risk” when filling the drive up, and then reduce that risk when full.

deathlessdd · December 22, 2020, 8:36pm

If the hard drive can’t keep up with the ingress, the data will be stored in RAM first and then go to the hard drive. But if the power fails or the system crashes, that data will be lost. Egress for an SMR drive isn’t too much of an issue, since SMR drives do well at reading but not at writing.
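The behaviour described above can be sketched as a toy model. This is only an illustration, not how Windows actually implements its write cache: the class name `WriteBackCache`, the `drive_rate` parameter and the piece counts are all made up for the example.

```python
from collections import deque


class WriteBackCache:
    """Toy model of a RAM write cache in front of a slow drive (hypothetical)."""

    def __init__(self, drive_rate):
        self.drive_rate = drive_rate  # pieces the drive can persist per tick
        self.pending = deque()        # acknowledged writes that exist only in RAM
        self.on_disk = []             # writes that have actually reached the drive

    def write(self, piece):
        # The write returns immediately; the data is only in RAM for now.
        self.pending.append(piece)

    def tick(self):
        # Each tick the drive drains at most drive_rate pieces from RAM.
        for _ in range(min(self.drive_rate, len(self.pending))):
            self.on_disk.append(self.pending.popleft())

    def crash(self):
        # Power loss or system crash: everything still in RAM is gone.
        lost = list(self.pending)
        self.pending.clear()
        return lost


cache = WriteBackCache(drive_rate=2)
for tick in range(3):
    for n in range(5):            # ingress: 5 pieces arrive per tick
        cache.write(f"piece-{tick}-{n}")
    cache.tick()                  # the drive only manages 2 per tick

lost = cache.crash()
print(len(cache.on_disk), "pieces on disk,", len(lost), "pieces lost")
```

With ingress faster than the drive, the backlog in RAM keeps growing, and that backlog is exactly what is at risk in a crash.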

If you’re trying to avoid failed or canceled transfers, enabling or disabling the write cache isn’t really going to help.

Just using a CMR hard drive would help with this. It sucks, I know, but CMR is much better than SMR if you have a choice of which to get.

Propagandalf · December 22, 2020, 8:45pm

I’m not quite sure what you mean here. Are you maybe saying that having write cache on will give me no benefits whatsoever regarding filling up my node over time? If that’s true, then I see no reason to leave it on, and will turn it off, even if I have a UPS?

deathlessdd · December 22, 2020, 8:52pm

Well, let me try to explain better. Since the drive can’t keep up, having write cache enabled will help with ingress and speed it up, if:

You have plenty of RAM. This can also be bad: if a lot of transfers are sitting in RAM and the system runs out of RAM before it can write them to the hard drive (i.e. the system crashes), then all that data is lost and your node could get DQed or suspended. It can also corrupt the database.
You have backup power, so that if your power goes out the system is able to write all the data from RAM to the drive before it shuts down.
But yes, it does add to the performance of the drive.

But if you do not have backup power, it’s not recommended.
You could also corrupt your database if, let’s say, your hard drive has filled up to the max and can’t write to the database because you’re not leaving 10% overhead.
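The 10% overhead mentioned above can be checked with a few lines of Python. This is a minimal sketch: the helper name `has_headroom` is made up, the 10% threshold is just the figure from this post, and the path `"."` stands in for wherever the node’s data actually lives.

```python
import shutil


def has_headroom(total_bytes, free_bytes, min_free_fraction=0.10):
    """Return True if at least min_free_fraction of the drive is still free."""
    return free_bytes >= total_bytes * min_free_fraction


# "." is a placeholder; point this at the drive that holds the node's data.
usage = shutil.disk_usage(".")
print("enough free space:", has_headroom(usage.total, usage.free))
```

`shutil.disk_usage` is in the standard library and works on both Windows and Linux, so the same check could run from a scheduled task or a cron job.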

SGC · December 23, 2020, 12:55pm

if you are forced to disable caching on your hdd’s, then you might want to consider finding an SSD with PLP (power loss protection) and figuring out some way to use that as a cache for the system.

this would help optimize the performance of your SMR HDD’s without the chance of data loss, because the ssd cache would not lose any data: it has an internal power backup and is thus able to flush the SSD’s cache/dram, writing the data to the SSD’s persistent memory cells.

i haven’t tried to run such a setup myself, but it should be possible… if the software / windows doesn’t overwrite old cache data stored on the PLP SSD when booting up…

but that should be more of a configuration / software issue rather than an actual hardware limitation.
so in short, do some research into doing a setup using a PLP SSD to ward against power outages…

a bit more expensive, but easier to implement, solution is a UPS (Uninterruptible Power Supply)… basically a box outside your computer with a battery that can supply the computer with power. one sets it up so the UPS sends a shutdown signal to the computer upon a power outage, and the computer is then able to shut down correctly…

ofc these systems can have large batteries, but that also makes them much more expensive. the most affordable solution is without a doubt to just have enough emergency power for a non data corrupting shutdown of the system.

maybe 30 to 60 minutes of backup power should be more than enough… windows shouldn’t take more than a few minutes to shut down after getting the signal… ofc with battery degradation and such one might want a bit more to keep the UPS useful after a period of years…
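a rough way to estimate runtime is battery energy divided by load, derated for inverter losses and battery aging. the sketch below assumes a simple linear model, and every number in it (100 Wh battery, 100 W load, 85% efficiency, 60% remaining capacity) is a made-up example value, not the spec of any real UPS.

```python
def ups_runtime_minutes(battery_wh, load_watts, inverter_efficiency=0.85):
    """Estimate runtime in minutes with a simple linear model (an assumption)."""
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    return battery_wh * inverter_efficiency / load_watts * 60


def runtime_after_aging(runtime_minutes, remaining_capacity=0.6):
    """Scale the runtime by the fraction of battery capacity left after some years."""
    return runtime_minutes * remaining_capacity


# Hypothetical example: 100 Wh battery feeding a 100 W server.
new_battery = ups_runtime_minutes(battery_wh=100, load_watts=100)
old_battery = runtime_after_aging(new_battery)
print(f"new: {new_battery:.0f} min, aged: {old_battery:.0f} min")
```

real batteries deliver less energy at high discharge rates, so treat the result as an upper bound and compare it with the few minutes a clean shutdown needs.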

dunno if that one is any good, do your own research, it’s certainly an affordable and easy to implement solution to power stability issues… and ofc also helps with surge protection and general power filtering, ofc for 60$ it will never be perfect, but should solve the problem posed.

running without caching is terrible for performance…

Propagandalf · December 23, 2020, 11:29pm

Thanks for the alternative input! I like the idea of a PLP SSD. I only vaguely knew what it was before, but now it sounds like it might be a good alternative for people with SMR disks. However, I already have a UPS in my server room and I plan on setting up a few HDDs for Storj there, so I will look into a good configuration for protecting the data against power loss and system crashes.

Regarding the non-cache setup, I was mainly considering turning the cache off once the drives had filled, and then seeing whether egress performance would suffer and impact income (for the sake of the reputation/quality of my node?). If my nodes suffered from that, I would consider turning the cache back on.

SGC · December 24, 2020, 6:52am

yeah for SMR write is the only big issue… else they are like CMR in performance
i got a PLP SSD myself, because i didn’t want to spend on a big battery since power loss here is generally my fault, if it happens
many SSD’s today have PLP, but it’s not the most widely promoted feature for consumer grade stuff…
i’m running zfs, but i really beat the hell out of my system, had a period with like 70 crashes / random reboots in over a week, because i thought it was funny to test out the system and i didn’t know how to fix it…

took it like a champ and not a byte lost or misplaced. i know many will say… oh my system can do that… even without such safeguards… but there will most often be degradation, most things just have methods to mitigate the damage sustained in such instant power loss cases,
so it might lose data, but it basically won’t care.

i scrubbed the entire drive after the problem was fixed (a zfs checksum verification of the entire drive). i was running storj at the time… with a 7-8tb, if not more, node, and i didn’t get 1 bad audit or any bad behavior over like 70 random restarts after i added my PLP disk.
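the idea behind a scrub can be sketched in a few lines: keep a checksum for every block, then later re-read everything and compare. this is only an illustration of the principle, not how zfs does it internally; the helper names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum of one block (SHA-256 chosen only for illustration)."""
    return hashlib.sha256(data).hexdigest()


def scrub(blocks, expected):
    """Return the names of blocks whose current checksum differs from the recorded one."""
    return [name for name, data in blocks.items() if checksum(data) != expected[name]]


blocks = {"a": b"piece one", "b": b"piece two"}
expected = {name: checksum(data) for name, data in blocks.items()}

blocks["b"] = b"piece twO"        # simulate silent corruption of one block
print(scrub(blocks, expected))    # lists the damaged block names
```

a clean scrub, meaning an empty list here, is what tells you the restarts didn’t damage anything on disk.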

but my power is generally rock stable… this way i just don’t have to care about it, because i know the system will not misplace any data or lose it.

i’m sure an error can creep in… it’s rare that a solution is perfect…
i think my netdata broke… but netdata was on a disk without the PLP caching
and netdata always seems to break for me…
right now it makes half graphs because i updated it…
haven’t tried rebooting the server yet… but i really should… i’m just so mad that to just update a freaking piece of monitor software i should need to do that…
that would never have happened in windows, but most likely it’s just me that’s bad at linux.

but yeah PLP is amazing, i can warmly recommend it. you can get 10 year old enterprise ssd’s that have it… so it’s very easy to find something 2nd hand, if you aren’t buying new… and today i think many, tho certainly not all, consumer ssd’s have PLP