PSA: Beware of HDD manufacturers submarining SMR technology in HDDs without any public mention

This is really bad rep for WD indeed. I hope it won’t have too much effect on the HDD market (other than the effect of manufacturers learning how not to treat customers, of course).

And yes, it’s Synology’s best bet if they don’t want to be flooded with support requests once users fill up their SMR drives or have to rebuild an array.

3 Likes

I’ve also noticed that most HDDs (at least NAS ones) listed on Amazon now explicitly say CMR or SMR directly in the product title. You can search for “CMR NAS HDD” and get really useful results.

2 Likes

They probably started getting a lot of refund requests.

Some good news from WD. CMR disks will now be branded as Red Plus. So we can now just buy Red Plus or Red Pro and be sure we get CMR drives.

5 Likes

Of course they’ll charge a premium for the privilege… :roll_eyes:

1 Like

Exactly what I was about to say. The days when we could buy an external USB enclosure and shuck it for a nice WD Red/Helium CMR drive are numbered…

Update after my node switched to version 1.6.4: the SMR disk still stalls eventually, but only after 3-4 hours of ingress, which is an improvement.

But I still have to stop it if I don’t want it to eat all my RAM up, crash completely or whatnot…

So… putting used serials in RAM was definitely a great idea, but unfortunately this enhancement is not enough for some SMR drives :sweat:

This said, I’m sure it is still going to put less pressure on all disks, which is a good thing :relieved:

3 Likes

In the end, I played with --storage2.max-concurrent-requests, and I had to go down to 4 for the disk to stay responsive with no sign of stalling. It’s been receiving ingress for almost 20 hours now (at a pace of approximately 1.85GB/hour), it still has decent response times, and the system’s load average is stable (between 1 and 2).
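For reference, this is roughly what that looks like in the node’s config.yaml (the file location depends on your setup, and 4 is simply the value that happened to work for my disk, not a general recommendation):

```yaml
# config.yaml of the storage node
# maximum number of uploads/downloads handled at the same time
# (0 means unlimited; 4 is just the value that kept my SMR disk responsive)
storage2.max-concurrent-requests: 4
```

If I remember correctly, docker users can alternatively append --storage2.max-concurrent-requests=4 after the image name in their docker run command.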

So… for me, after several weeks of tests, tinkering and article reading about SMR disks, that’s the only option I found to make sure my SMR node does not crash: limiting its number of concurrent requests to 4. Which is a sadly low number, especially as the disk can perform way, way better than that when it is not reorganizing its own data. But once it has no more CMR cache space available and starts writing to SMR sectors, it can only handle so many (or should I say so few…) Storj requests per second.

And although I really get that it’s not ideal, especially from a customer’s perspective, StorjLabs will have to decide whether it is acceptable for SMR nodes to reject massive numbers of ingress requests so they stay alive.

As already mentioned somewhere in the forum (maybe even in this thread, I don’t remember), I really think an improvement would be for the node to automatically adjust this setting depending on the current responsiveness of the disk: it really feels like the node should start rejecting requests when the storage device can’t keep up, rather than relying on a fixed setting like --storage2.max-concurrent-requests.

Easy said, hard to do probably :slight_smile:
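To illustrate what I mean, here is a rough and purely hypothetical sketch in Go (it has nothing to do with the actual storagenode code, and all names and thresholds are made up): the limit shrinks whenever recent writes get slow and grows back once the disk catches up.

```go
// Package limiter is a hypothetical sketch of an adaptive request limit.
// This is NOT how the storagenode works today; it only illustrates the idea.
package limiter

import (
	"sync"
	"time"
)

// AdaptiveLimiter lowers the allowed number of concurrent requests when
// recent disk writes are slow, and raises it again when the disk keeps up.
type AdaptiveLimiter struct {
	mu     sync.Mutex
	limit  int           // current maximum of concurrent requests
	active int           // requests currently being served
	min    int           // never drop below this
	max    int           // never grow beyond this
	slowAt time.Duration // write latency considered "too slow"
}

// NewAdaptiveLimiter starts with a generous limit and adapts from there.
func NewAdaptiveLimiter(min, max int, slowAt time.Duration) *AdaptiveLimiter {
	return &AdaptiveLimiter{limit: max, min: min, max: max, slowAt: slowAt}
}

// TryAcquire returns false if the node should reject this request.
func (l *AdaptiveLimiter) TryAcquire() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.active >= l.limit {
		return false
	}
	l.active++
	return true
}

// Release records how long the disk write took and adapts the limit.
func (l *AdaptiveLimiter) Release(writeLatency time.Duration) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.active--
	switch {
	case writeLatency > l.slowAt && l.limit > l.min:
		l.limit-- // disk is struggling: shed load
	case writeLatency < l.slowAt/2 && l.limit < l.max:
		l.limit++ // disk has headroom: allow more
	}
}
```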

Anyways, I can finally sleep without the fear of my node crashing at any moment now! Pheww :blush:

6 Likes

The idea behind this being that it could cause some uploads to the network to fail, from a customer’s point of view.

However, surely the Storj network has been designed to reroute/retry/other-magic things so customer requests fail as rarely as possible… ? At worst they should be delayed I guess? :neutral_face:

I haven’t seen those errors posted in a while, likely because most people stopped using this setting. But the upload actually fails with an error. At least at the time there was no automatic retry, and zombie segments could also be a problem for that, so it might not be possible to immediately retry. For what it’s worth, if your node is the only one, or one of very few, to reject the piece, the upload will be fine. If too many nodes reject it, though, it’ll fail.

This might be of interest in this thread as well:

Got a bit confused.
As I understood it, SMR is NOT recommended for RAID arrays.

But what about using a single external USB drive that is SMR? Will it also cause problems?

SMR causes problems with any sustained write load. For normal external USB drive use, SMR is great. Even for a game disk, for example, it’s perfectly fine: reads aren’t any slower, and writes are either small or incidental on such HDDs. SMR is a great tech for many use cases, but a sustained load will never give the HDD time to do the internal reshuffling required to settle the CMR cache onto the shingled areas.

Unfortunately, Storj can be considered a sustained load for these purposes. Additionally, the db files get many small writes. As a result the HDD doesn’t get the time to write the CMR cache to the SMR areas and will at some point pretty much stall. Therefore it’s not a good fit for Storj as a single-disk solution either.

There are some things you can do to mitigate the issue somewhat, though. Adding more nodes on other disks can spread the load, and some SNOs have seen that fix the problem. You can also move the db files to another HDD, though that comes with its own risks and should only be done as a last resort.
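If I remember correctly, recent versions expose a storage2.database-dir option for that last one; in config.yaml it would look roughly like this (the path is only an example, and you have to move the existing .db files to that folder yourself before restarting the node):

```yaml
# config.yaml – example only, double-check the option name for your version
# keep the node's SQLite databases on a different (non-SMR) disk
storage2.database-dir: /mnt/other-disk/storagenode-dbs
```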

1 Like

This means there are no CMR drives that are 2.5" and larger than 2TB? :expressionless:

That I don’t know, but it wouldn’t surprise me. Obviously density is going to be needed on those smaller disks. I wouldn’t recommend using them for Storj anyway.

I just came across this list. Looks like Seagate eventually started listing their CMR/SMR models. I’ll add the link to the top post as well.

1 Like

This currently seems to be the case, yes.
/EDIT: the Toshiba MQ03ABB300 (3TB/2.5in/5400rpm) is listed as CMR!

Little update (I’m the guy with the Exos 5E8 SMR drive): there is 1TB left on my node and I still can’t see any strange behaviour; ingress is coming in as normal and the drive doesn’t seem to be busier than usual.

I did however order a 12TB replacement today, gonna sell the SMR one just to be safe (I bought it used anyways).

2 Likes

Is this the only node you’re running? If so, then I guess the Exos models are a bit more resilient to higher loads. Which would make sense.

I don’t know if you should do that. Once it’s full, you’re never looking at large sustained write loads again. Might as well keep it running, though I would make sure you leave a fairly healthy margin of free space. Instead, I’d suggest just running an additional node on the new 12TB.

2 Likes

Nope, but it’s the only node on that machine.

I actually wanted to refrain from doing so until it is officially supported via a convenient interface. Currently the way to go would be using Vadim’s toolbox, right?

Also, said node is in a pretty small computer case, so mounting a 2nd HDD might require a bit of fiddling around. Aaaaand the motherboard actually has to supply power for the SATA devices; I’m not yet sure that setup can handle 2 rather hungry drives.

If you run multiple nodes in the same IP subnet, then you spread the load across them. This is likely why your SMR drive had less of a problem keeping up with the load.

On Windows that’s the easiest way to do it right now. I don’t expect an official interface for this to be available soon. You can also opt for docker, but on Windows that is probably not the best way to go.

You can also use the HDD in an external enclosure (with its own power supply) if internal is not an option.
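And for completeness, if someone does go the docker route, a second node is essentially a second container with its own identity, storage location and external ports; very roughly along these lines (all values are placeholders, so check the official docs for the exact command):

```sh
# Second node: own identity, own storage folder, different external ports.
# Placeholders only – adapt to your own setup and the official documentation.
docker run -d --restart unless-stopped --name storagenode2 \
  -p 28968:28967 -p 127.0.0.1:14003:14002 \
  -e WALLET="0xYOUR_WALLET_ADDRESS" \
  -e EMAIL="you@example.com" \
  -e ADDRESS="your.external.address:28968" \
  -e STORAGE="11TB" \
  --mount type=bind,source=/path/to/identity2,destination=/app/identity \
  --mount type=bind,source=/mnt/new-disk/storagenode2,destination=/app/config \
  storjlabs/storagenode:latest
```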

1 Like