Can a SMR drive handle these requirements?

With the scale of the graph showing these drops, I wouldn’t be surprised if these weren’t just “drops” but stalls of tens of seconds that stop all I/O activity, seriously impacting latency. That’s what I observed on my Seagate Backup 8TB.


I don’t know how Storj would handle such a delay.
I would guess that the 128 MB cache can handle all bursts from Storj.
I sure will let you guys know :slight_smile:

Allow me to disagree here: Some SMR drives are better than others, but generally they’re not a good fit for Storj.
I can tell you that my poor 2.5" 2TB SMR drive could not keep up with Storj when they were carrying out some kind of load tests in the past: even though the average throughput wasn’t crazy, the disk had to go all over the place to write AND read thousands of small files. It could do that for a couple of hours, and then could barely respond, almost stalling to a stop… it could take tens of seconds to respond and was unusable. At that point, Storj ingress requests stacked up in RAM until it got full and…

The OOM Killer started to strike.
I had to turn off the node completely to avoid docker from restarting it, and leave the disk alone for ages to let it reorganize itself and get back to a working state.

The node software got better with time and I learned to optimize a few settings, so things are not as bad as they used to be, but I think @BrightSilence is totally right and that SMR drives should be avoided when possible.


Yeah, it’s definitely not a myth. I wanted to respond to that before, but declined because I kind of know myself and didn’t feel like my response would be very kind if I jumped on it in the moment.

Calling it a myth on a forum that is littered with people complaining about nodes grinding to a halt on SMR HDDs is quite a bold statement after all. SMR drives aren’t all made equal. Just because one didn’t cause an issue (yet) doesn’t mean there aren’t many problematic ones out there. However, I think almost all of them will eventually hit a wall. There are two important factors that may help or hurt. The size of the CMR cache and the available space on the HDD. CMR cache gives the HDD a place to write new data to temporarily, without having to rewrite entire sections of tracks, but under constant load, it only provides a buffer that will fill up eventually. Now if there is tons of free space on the HDD, chances are this isn’t really an issue, since the HDD will likely be able to find adjacent tracks that are both free and just write data there.

The big problem though, is that these HDDs are designed for intermittent use. And they perform really well in those cases. The reason is that when there is no use from the OS, the HDD does internal maintenance to optimize the way data is stored. This includes writing CMR cache to SMR areas and rewriting shingled tracks to optimize free space. However, Storj never lets the HDD fully relax. As long as either CMR cache or free adjacent tracks are still available, you would see performance curves like the tests posted earlier. They actually still perform quite ok. But when both of those run out, it’s like hitting a wall: both write and read performance plummet to the KB/s level, with multi-second stalls. At that point, the writes just stack up in memory, and transfers start failing on reads as well if it gets bad enough.
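To put rough numbers on that “buffer that fills up eventually” point, here’s a tiny model. Every number in it is an assumption for illustration; actual CMR cache sizes and background flush rates aren’t published:

```python
# Illustrative model of an SMR drive's persistent (CMR) cache under
# sustained load. Every number here is an assumption, not a drive spec.

CMR_CACHE_GB = 40     # assumed size of the CMR cache region
INGRESS_MBPS = 1.25   # ~10 Mbit/s of sustained node ingress
DRAIN_MBPS = 0.5      # assumed background flush rate while the drive is busy

fill_rate = INGRESS_MBPS - DRAIN_MBPS                  # net MB/s into the cache
hours_until_wall = CMR_CACHE_GB * 1024 / fill_rate / 3600

print(f"Cache saturates after ~{hours_until_wall:.0f} hours of sustained load")
```

With these made-up numbers the wall arrives after roughly 15 hours. Idle periods let the drive drain the cache again, which is exactly why intermittent workloads never notice it and a never-resting node eventually does.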

It’s annoying… Because you won’t immediately know the HDD is going to be a problem… And when it does become a problem, it’s going to be really hard to migrate all that data.

What might help though is making sure the DBs are moved to a different HDD (or SSD). And if you do hit that wall, lower the node’s capacity to below what is already stored and give the SMR HDD room to breathe and do all that stacked-up internal maintenance. You could then still keep your node online, as writes are very limited at that point.
Since this requires a restart, it would be best to disable the file walker at that point to prevent a large read spike on restart while the HDD is already stalled.
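For reference, the settings involved might look like this in the node’s config.yaml. The option names are to the best of my recollection and the paths and sizes are examples only — verify against the current storagenode documentation:

```yaml
# Example only: move the node databases off the SMR drive
storage2.database-dir: /mnt/ssd/storagenode-db

# Lower the allocation to below what is already stored
storage.allocated-disk-space: 4 TB

# Skip the piece scan (file walker) on startup to avoid a read spike
storage2.piece-scan-on-startup: false
```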

@IsThisOn I hope your SMR drive won’t run into these issues, but with only 3TB filled right now, it’s simply too soon to tell. Should you run into these issues later, I hope this post provides some useful info to work around it.


Yeah, it’s definitely not a myth.

Maybe not a myth, but totally overblown, or let’s say misunderstood.

Calling it a myth on a forum that is littered with people complaining about nodes grinding to a halt on SMR HDDs is quite a bold statement after all.

People on forums complain about a lot.
They add layers of complexity (VMs, ZFS over NFS or iSCSI, BTRFS), have exotic setups, or don’t use optimized settings. A big part of that is the not-so-great default settings, in my opinion, like storing data and the DBs in the same directory.

So people complaining on the forum is not a very strong argument in my opinion :slight_smile:

SMR drives aren’t all made equal.

The same is true for every drive; that is why we have to look at two things: how the drive behaves, and whether that behaviour suits the use case.

Just because one didn’t cause an issue (yet) doesn’t mean there aren’t many problematic ones out there. However, I think almost all of them will eventually hit a wall. There are two important factors that may help or hurt. The size of the CMR cache and the available space on the HDD. CMR cache gives the HDD a place to write new data to temporarily, without having to rewrite entire sections of tracks, but under constant load, it only provides a buffer that will fill up eventually. Now if there is tons of free space on the HDD, chances are this isn’t really an issue, since the HDD will likely be able to find adjacent tracks that are both free and just write data there.

That is a good technical explanation on SMR.

The big problem though, is that these HDDs are designed for intermittent use.

Which is exactly the use case of Storj, so they’re a match made in heaven? :slight_smile:

The big problem though, is that these HDDs are designed for intermittent use.

You could also phrase it the other way around. These HDDs are not made for sustained writes.
Thank god Storj has no high sustained writes, otherwise we would all have filled up nodes :slight_smile:

Here is how I would rephrase your sentence:
Because of their SMR nature, these HDDs are not made for sustained writes that exceed the 128 MB cache of the drive itself plus the additional 10 MB per-connection RAM buffer from Storj. This is especially true if the CMR cache is full.

In the end it comes down to

  1. what write spikes does a node experience?
  2. can we cache these spikes?
  3. what average write speeds does a node experience?
  4. can the HDD(s) sustain that write speed?
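Questions 1 and 2 can be sketched with quick math, using the 50 Mbit/s spike / 10 Mbit/s average figures quoted in this thread (the spike duration is my assumption):

```python
# Can the drive's volatile cache absorb a write spike above the average?
SPIKE_MBIT = 50        # peak ingress quoted in this thread
AVG_MBIT = 10          # average ingress quoted in this thread
DRIVE_CACHE_MB = 128   # the drive's volatile cache
SPIKE_SECONDS = 60     # assumed spike length

excess_mb = (SPIKE_MBIT - AVG_MBIT) / 8 * SPIKE_SECONDS  # data above the average
print(f"A {SPIKE_SECONDS}s spike must buffer ~{excess_mb:.0f} MB")
print(f"Fits in the {DRIVE_CACHE_MB} MB drive cache: {excess_mb <= DRIVE_CACHE_MB}")
```

With these assumptions, a one-minute spike overflows the 128 MB cache after about 25 seconds, so the cache only helps with short bursts; for anything longer, the platters themselves have to keep up.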

And when it does become a problem, it’s going to be really hard to migrate all that data.

Why should that be hard? Migrating data to an SMR drive is hard (like a RAID rebuild). Rsyncing data from SMR to CMR is easy, because SMR read speeds are fine.

file walker at that point to prevent a large read spike on restarts while the HDD is already stalled.

The way the file walker is configured right now is another “not so great default”, in my opinion.

I hope your SMR drive won’t run into these issues, but with only 3TB filled right now, it’s simply too soon to tell. Should you run into these issues later, I hope this post provides some useful info to work around it.

Again, all these technical details are just theories. But let’s look at the reality and real numbers. What use case do we actually have at hand? Sustained writes of 4 MB files with spikes up to 50 Mbit/s and an average of 10 Mbit/s, sprinkled with random reads? Thanks to the OS cache we don’t even see those 50 Mbit/s spikes, just a constant 10 Mbit/s write load.

You could argue that this is an even better use case than the original one Seagate promoted by selling them as Archive Disks :slight_smile:

Unfortunately I could not get good data.
Anandtech says

at the outer edge of the platters. We see a gradual steady drop down to 50-70 MBps

If that is true even for a full SMR drive, and with random writes instead of sequential, that would be more than sufficient.

I will fill up my disk with trash data to give you some benchmarks. Would you mind sharing what real world writes you experience on your nodes?

You don’t have to be so dismissive. They complained for good reasons. If you read the actual topics, you’ll find that IO stalled, nodes got killed by the OOM killer, or systems had constant high IO wait, causing performance degradation.

Those aren’t small issues, so no, it’s not overblown. It’s a problem.

That’s a matter of scale… Storj doesn’t ever let the drive rest; even if it’s not at 100% utilisation, that is still highly problematic for SMR drives.

They won’t be fine if the drive has already hit that wall and is using all its resources to rewrite the shingled tracks. You seem to still be missing the point of how hard a wall you hit when an SMR drive is saturated.

Nope, they are not. They are based on actual experiences with this specific use case.

You’re using a sequential write test for these stats. A famously easy thing to do for SMR drives, because they don’t have to rewrite and reorganize a single track. And in that best case scenario, you already have significant drops in performance. Now imagine Storj, where lots of data is randomly deleted and rewritten. Leaving lots of gaps in the data and entire sections of shingled tracks needing to be rewritten in order to store a tiny file. And lots of those at the same time.
The test results you quoted are meaningless for this use case.

I’m sorry… but this just shows you have no idea why an SMR disk filling up causes slowdowns. It has nothing to do with the edge of the platters. You should be so lucky that that’s all that would impact it.

Here’s a hint… This is an image a user posted of fragmentation of free space on a 10 month old node.


Original post here: Disk fragmentation is inevitable... Do we need to prepare? - #47 by JDA


I’ve got a Python script that replays actual I/O performed by a node based on real node logs, that would probably be the most realistic approach. I used it to prepare some posts here in the past. But I’d have to prepare it for publication first, it’s tied too strongly to my infrastructure right now.

It would be interesting to see a comparison of results with an SMR drive.


You don’t have to be so dismissive. They complained for good reasons. If you read the actual topics, you’ll find that IO stalled, nodes got killed by the OOM killer, or systems had constant high IO wait, causing performance degradation.

Those aren’t small issues, so no, it’s not overblown. It’s a problem.

To be fair, I haven’t spent days looking through the forum. But I haven’t found a forum post that describes this problem WHILE following all best practices for SMR drives (like not putting the DBs in the data folder).

I don’t say SMR drives are the same as CMR drives! I recognize their shortcomings.

That’s a matter of scale… Storj doesn’t ever let the drive rest; even if it’s not at 100% utilisation, that is still highly problematic for SMR drives.

We are going in circles. You list downsides of SMR drives, while I accept those downsides but argue, and make guesses, that they are not relevant for this use case. There is a good probability that I am wrong in these guesses; I don’t deny that. But I need facts to believe it.

You’re using a sequential write test for these stats.

You misunderstood, I am not using any tests or stats here! I am observing the real-life usage that Storj puts on my node. I get around 8 Mbit/s download, with spikes to 30 Mbit/s, while the HDD idles at 10% usage. These are just my numbers, rounded up to leave some headroom. And again, I would love to see numbers from other nodes.

I’m sorry… but this just shows you have no idea why an SMR disk filling up causes slowdowns.

Again, we are going in circles. I don’t deny these things; you are putting words in my mouth that I did not say and judging my knowledge based on them. I just wanna see some numbers.

I will do some testing and report back. I am very happy to admit that I was wrong, or that after 7TB the drive becomes unusable. I like finding out these things, and I also like being proven wrong and changing my opinion :smile:

It is a complex discussion. Maybe it is easier to split it into two questions

  • What real life stress does a Storj Node have?
  • Can a SMR drive handle these requirements?

But to answer the second question, we first need to answer the first.
And the only real data I could find to answer the first question is in this thread:

which shows around 19 GB a day of ingress in the best case.

Some quick and probably wrong math:
That is an average of about 1.8 Mbit/s.
19456 MB / 2 MB (file size) × 4 (operations per file) = 38912 operations per day; 38912 / 24 / 60 / 60 ≈ 0.45 operations per second

I guess a SMR can handle this but I am happy to be proven wrong on both, my maths and my guessing :laughing:
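The arithmetic above does check out; as a runnable sanity check (19456 MB/day, 2 MB files, and 4 operations per file are the assumptions from the post):

```python
# Sanity-checking the math above: 19456 MB/day of ingress, 2 MB files,
# 4 operations per file (all assumptions from the post).
ingress_mb_per_day = 19456
seconds_per_day = 24 * 60 * 60

avg_mbit_s = ingress_mb_per_day * 8 / seconds_per_day
ops_per_s = ingress_mb_per_day / 2 * 4 / seconds_per_day

print(f"Average ingress: {avg_mbit_s:.2f} Mbit/s")       # ~1.80 Mbit/s
print(f"Average load:    {ops_per_s:.2f} operations/s")  # ~0.45 ops/s
```

Note these are averages; they say nothing about bursts, random access patterns, or the actual (much smaller) average file size discussed further down.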

Where do you see 2 MB files? Maybe 5% of all Storj files?

This discussion is like the ones about SSDs.
Set a 1 MB test size in CrystalDiskMark and see good speed: yahoo, it’s the best! That’s the man talking about SMR and his experience with it.
Set a 64 GB test size and see the pain: those are the other people on this forum who have seen SMR in hard production (one Storj node with 70% free space is not production).

See you after 1-1.5 years, when you will copy this disk onto another, larger disk and… suffer.

Half of my files are <4kB. It would be so much easier if all of them were large…


I don’t expect you to. But then if people who have read most forum posts because they’ve been here since the forum was launched tell you this has been a real problem, I don’t see why you would call it a myth or overblown.

That’s my bad. It looked like your numbers matched the test numbers you linked earlier. However, that is not surprising, because I’m pretty sure your HDD isn’t rewriting any tracks yet during active writes, so it is expected that your performance is still mostly the same. Regardless, the point is that none of the numbers you have quoted represent an SMR HDD in distress, with both the CMR cache and empty adjacent tracks saturated.

I have to get back to this one. 128MB is the flash cache on the HDD. CMR cache, or persistent cache, is different: it consists of zones on the platters that don’t have overlapping tracks and thus perform at roughly the same speed as CMR HDDs. This is probably in the tens to hundreds of GB, but I haven’t found a reliable source for the actual size. Either way, this additional level of cache is what makes SMR HDDs perform quite ok for most use cases.

So how can I be so sure that your performance numbers don’t represent an HDD having to do rewrites of tracks during write operations (synchronously)? Simple… there is just no way you would see anything close to the performance numbers you quote. SMR zones are generally in the order of 256MB. If no adjacent empty tracks are available, ANY write operation, no matter the size, would incur a read and write of all or most of that zone. With the average file size being around 270KB (yeah, not the 2MB you quoted; I just grabbed a random folder on us1 and got the actual average), that’s up to a 1000x amplification of write size. And hence it becomes clear why this wall is such a big performance hit when you do hit it.
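A quick check of that amplification figure, using the zone and piece sizes quoted above:

```python
# Worst-case write amplification when a whole SMR zone must be rewritten
# to store one piece (zone and piece sizes as quoted above).
ZONE_MB = 256
AVG_PIECE_KB = 270

amplification = ZONE_MB * 1024 / AVG_PIECE_KB
print(f"One {AVG_PIECE_KB} KB write can trigger ~{amplification:.0f}x its size in I/O")
```

That comes out to roughly 970x in the worst case, which is where the “up to 1000x” above comes from.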

Make sure you replicate the random writes and deletes. It’s not just that the drive fills up, but most importantly that the empty space is scattered all around the drive. @Toyoo’s script would be really helpful with that.
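A crude way to approximate that churn without the replay script might look like the sketch below. The target directory, file counts, and sizes are all made up for illustration; scale them up (a lot) for a meaningful test, and only point it at a disposable filesystem on the drive under test:

```python
# Crude churn generator: write many small files, delete a random half,
# repeat. This scatters free space the way node churn does.
import os
import random

TARGET = "smr-churn-test"   # hypothetical path; point at the drive under test
FILE_KB = 270               # roughly the average piece size quoted above
FILES_PER_ROUND = 200       # scale these up a lot for a real test
ROUNDS = 5

os.makedirs(TARGET, exist_ok=True)
counter = 0
for _ in range(ROUNDS):
    # write a batch of small "pieces"
    for _ in range(FILES_PER_ROUND):
        path = os.path.join(TARGET, f"piece-{counter}.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(FILE_KB * 1024))
        counter += 1
    # delete a random half of everything stored so far, scattering free space
    existing = os.listdir(TARGET)
    for name in random.sample(existing, len(existing) // 2):
        os.remove(os.path.join(TARGET, name))
```

This is still friendlier than real node traffic (no concurrent reads, no databases), so treat any result as a best case.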

This would surely help, but I’m not sure it’s a complete solution. It does save a lot of small writes. Unfortunately, for many SNOs this isn’t even an option.


Thanks for the input about the file size. I never took a look at the files and did not realize that they’re not all 2 MB chunks.

See you after 1-1.5 years, when you will copy this disk onto another, larger disk and… suffer.

I still fail to see the suffering part. Oh no, my SMR drive performance is not good, so I have to stop the node and sync it to another drive. Or I just stop accepting new files by limiting the size and add another disk. I don’t get where the suffering is.

I don’t expect you to. But then if people who have read most forum posts because they’ve been here since the forum was launched tell you this has been a real problem, I don’t see why you would call it a myth or overblown.

Because I still could not find a single user who reported a real problem while following best practices. Can you provide me a link?

Make sure you replicate the random writes and deletes.

True, if I just fill it up, the fragmentation will be better than in real life. But we could at least test the performance once the CMR cache is out of play.

Would be interesting to try to defrag a SMR disk :slight_smile:

This would surely help, but I’m not sure it’s a complete solution. It does save a lot of small writes. Unfortunately, for many SNOs this isn’t even an option.

I don’t know if it is a “complete solution”. I just know it would be pretty stupid to put any DB on an SMR drive. Why should that be a problem for a SNO? Don’t most operators have an SSD for the OS anyway?

I don’t know if I trust Windows defrag; it shows me 1% fragmentation at the moment.

Here is an idea: you don’t need to do fancy tests. Just fill up the free space on your SMR disk with some big files (you can use whatever is at hand: Linux ISOs, movies, game files, etc.), leaving, let’s say, less than 1 TB free. Keep your node running, then come back and tell us how it behaves.

The point is that your disk will not have time to copy and serve downloads at the same time. You will fail download requests, and face disqualification in the end. That’s all. 7TB via rsync is… a few weeks :wink:


Here is an idea: you don’t need to do fancy tests. Just fill up the free space on your SMR disk with some big files (you can use whatever is at hand: Linux ISOs, movies, game files, etc.), leaving, let’s say, less than 1 TB free. Keep your node running, then come back and tell us how it behaves.

That is exactly what I did. I left 60GB free.
Disk usage is still low. A CrystalDiskMark RND512K Q1T4 run with a 32 GB test size got me 80 MB/s.

Either fine or bad testing on my part :face_with_monocle:


The point is that your disk will not have time to copy and serve downloads at the same time. You will fail download requests, and face disqualification in the end. That’s all. 7TB via rsync is… a few weeks :wink:

In my mind, I will stop the node, copy 7TB to the new node in under 16 hours, start the node, and that is it. So not weeks but hours. No “fines”, still a 100% audit score, just a bit of downtime.
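For what it’s worth, the arithmetic behind “7TB in under 16 hours”, assuming the ~270 KB average piece size quoted earlier in the thread:

```python
# The throughput and file counts behind "copy 7TB in under 16 hours".
# Average piece size is the ~270 KB figure quoted earlier in the thread.
TB_TO_COPY = 7
HOURS = 16
AVG_PIECE_KB = 270

required_mb_s = TB_TO_COPY * 1024 * 1024 / (HOURS * 3600)
file_count = TB_TO_COPY * 1024 * 1024 * 1024 / AVG_PIECE_KB  # TB -> KB / piece size

print(f"Needs a sustained ~{required_mb_s:.0f} MB/s")   # ~127 MB/s
print(f"Across ~{file_count / 1e6:.0f} million files")  # ~28 million
```

A sustained ~127 MB/s is plausible for large sequential files, but with tens of millions of small files even a few milliseconds of per-file overhead adds up to more than a day, which is why sequential-throughput estimates are misleading here.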


SMR drives are not good for Storj once a node starts getting a lot of data. I can tell you 100% it’s not a myth, and from experience SMR drives generally don’t do well unless you have more than one SMR drive and more than one node on the same network splitting the traffic. Once an SMR drive is full you will probably have some issues if you don’t allow for 10% or more free space, or you will not have a good time, because of the way SMR drives work. I have lost nodes because of missing data or missing identity files, due to how SMR drives shuffle data around; if you lose power or something while that happens, you will lose everything.

I am running a 5.6TB node on a 6TB Seagate SMR drive and have never had any issues so far (databases are on NVMe SSDs). But yes, the filewalker takes some time…

I bet you won’t be able to transfer 7TB of node data in 16 hours. I have done some rsync node transfers from different kinds of drives already, and it was much slower, unfortunately. Even with fast CMR drives, 16 hours for 7TB is more than unrealistic.
