Storage drive health status as part of dashboard

HDD SMART data is commonly used to help predict impending drive failure…

I would suggest it makes sense to utilize such features on a storagenode to help ensure drive reliability,
either through special/custom configuration, or by default: the storagenode could simply identify which drive or drives it's set to utilize and then track their SMART data,

then display their SMART health status via an indicator directly on the dashboard, to help notify SNOs of potential storage issues.

More on special/custom configuration details
In the case of more advanced RAID setups, the storagenode could be made aware of multiple drives to be monitored.

In my case, for example, I have 11 drives related to my storagenode. I of course have my own monitoring setup for keeping track of them, but I can also understand that a casual SNO might not want an elaborate setup, and would rather just have the storagenode keep track of what it can without overcomplicating the issue.

Thus, simply a custom configuration option in, say, the config.yaml (or somewhere else if that's a docker thing) :smiley: where one simply lists the drives to be monitored; bad SMART data on any of them would then affect the storage health indicator on the web dashboard, as sketched below.
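
A minimal sketch of what such an option might look like (the health-monitor keys are purely hypothetical, not existing storagenode settings):

# hypothetical entries in config.yaml - not part of the current storagenode
health-monitor.enabled: true
health-monitor.drives:
  - /dev/sda
  - /dev/sdb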

More thoughts on the drive/storage health indicator

The health indicator could be Green/Yellow/Red and located somewhere clearly visible on the dashboard, maybe next to the refresh button, or as part of the status.

Maybe below the online / offline indicator,
simply have a storage media health status: good / unhealthy / bad,
in green, yellow or red…

One could also change the graph colors, or have the total disk space pie chart turn yellow or red in case of an issue, to draw attention to it…

That sounds like an excellent idea!
I'm not an expert, but it doesn't sound too hard to implement. I especially like the part about being able to monitor multiple hard drives; I personally only have one HDD, but I know that many people have RAID arrays.

I believe that in order for a docker container to obtain access to hardware, it needs to be run in privileged mode. It is not recommended practice to let docker containers have access to hardware, since exploits can happen.
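
For reference, hardware access can sometimes be scoped more narrowly than full privileged mode; a sketch, assuming an image with smartmontools installed (the image name is a placeholder), passing through a single device and a single capability:

docker run --rm \
  --device=/dev/sda \
  --cap-add=SYS_RAWIO \
  some-smartctl-image smartctl -H /dev/sda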


@litori
Maybe… I'm not too familiar with the details of that… it does make logical sense…

But on the other hand, it's not like one needs much access to read SMART data… might need some externally installed software in the worst case…

It would of course also be difficult on a VM using a virtual HDD… but let's assume this is mostly for home users, and that the health software could be 3rd party software that simply helps access the health status…

I dunno how it's best done… I'm sure there is a way, if people want it…

I think this is difficult, because even if you use smartctl on Linux to read SMART data, which values do you consider important? Will you only warn when SMART says that your drive is unhealthy? For a single-drive node that might already be too late, because by the time you buy a new drive and copy everything over, the node might fail because the hard drive is already too unhealthy.
But I have no experience with that; my HDDs haven't died on me yet.
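
For reference, a minimal sketch of the kind of checks this would involve (attribute names vary by vendor; these are commonly watched ones):

# overall health self-assessment (prints PASSED or FAILED)
smartctl -H /dev/sda

# attributes often treated as early failure signals
smartctl -A /dev/sda | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'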

Anyway, long story short…

HDDs often give bad SMART data before failure, though not always…
I'm sure somebody who has worked with thousands of disks can confirm this…

Failure prediction has been a very active market at the higher end of datacenters… most likely because if one can predict failure, one can reduce the levels of redundancy…
However, after gaining more experience recently with actual RAID and small datacenter-type setups, I suspect it has much more to do with predicting disk failure before it becomes a performance issue, and of course with avoiding having to rebuild data…

Those are the two main factors I keep running into that give my pools / arrays the most workload and cause the biggest increases in pool latency and iowait…

SMART can predict many failures well ahead of time… and in this use case it wouldn't have to be more than 5% effective to still save lots of nodes from dying… though I think it's more in the 75% range… maybe higher for some high-end solutions.

That point where the short version grows to about the same length as the version you tried to shorten xD

Rants and reasons…
I find it's very rare that a drive doesn't give a fair warning some time before failure…
But there are many forms of failure, and AFAIK the big datacenters use SMART data to identify and predict failures with some degree of accuracy.

And I've more often had a drive that could survive and keep working while unhealthy, even if in the long term I've often gotten slightly corrupted data out of such drives… Generally one can, more often than not, coax the data out of a failing drive if one just doesn't overload it…

I trust that if SMART data says my disk is bad, then it's bad… might not be 100% accurate, but maybe 90% or better. Though I've had my fair share of HDDs, I'm not at the level where I can discern 10% with ease :smiley: I might have managed a couple of hundred, maybe 300.

With my personal disks I have quite often gone too far… mostly ignored bad SMART data… and it's usually right… but drives often run fine for a good while. It's also difficult to say, because in the past I didn't monitor the data this accurately… Now ZFS will yell if one byte is out of place; in the past I would have had to see it directly in the data, or notice that the machine acted weird or the OS failed.

But from that I've learned that usually a disk with bad SMART data will run for a good while, and I've always been able to retrieve my personal data… though sometimes with a fair bit of corruption, so not sure that counts lol as retrieving my data; it does help reconstruct what is gone, though…

This sounds like another nice idea to include a part of the OS in the storagenode.

All SMART/monitoring tools must be separate services with privileged access to hardware; they must not be part of the storagenode.

Of course, the storagenode dashboard could be modified to show metrics from an external source, but that would just be a Grafana, which already exists and is supported by the Community. Why should we integrate another dependency into the storagenode, or build a self-made light version of Grafana?
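
As one example of that external pattern (assuming a Prometheus + Grafana stack), SMART metrics are often exported via node_exporter's textfile collector, fed by a script such as the community-maintained smartmon.sh; the cron entry below is illustrative, and the paths depend on the setup:

# illustrative cron entry, not a storagenode feature
*/5 * * * * /usr/local/bin/smartmon.sh > /var/lib/node_exporter/textfile_collector/smart.prom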

Let's not overload the simple storage service client, shall we?


Good point. I'm not saying how it should be done, or which method would be best suited to implement this; I'm just saying it's a feature that would improve quality of life for just about everybody, and sure, I also suggested using 3rd party software in the initial suggestion.

If it's to keep the node clean and efficient, which is something I can only completely agree with… then put it in the dashboard or wherever… not really important. What's important is that as many casual future SNOs as possible get a better warning before the impending doom of their storagenode… without having to do more than a 3-click install and a glancing monitoring of their node.

It makes good sense to keep the storagenode simple and stable, and then let the more advanced features be able to crash without taking down the node…

Many people, my past self included, have considered HDDs reliable storage, when in fact under continuous workloads, and especially under heavy load, they can be extremely unreliable…

Especially when it's old drives people think are good… hmmmmm
Maybe some sort of HDD reliability test, as a recommendation or as a default (yet possible to disable) part of allocating space… It seems like essential hard drive monitoring data would make sense as part of the storagenode / storagenode dashboard…
Doesn't have to be anything elaborate… well, just got a good idea there… I sketched it out in voting :smiley:

Why stop there? Let's include a ZFS-like RAID subsystem, a disk checker and fixer, a SMART control utility (which should alert the corresponding monitoring system in case of problems), it would be nice to have a guess-recovery subsystem to recover lost data right from the failed disk, also sqlite repair, an alerting (including SMS and email) and monitoring system, an email server (or at least a client), a firewall, an ssh server, and an OS and bootloader running from a stick.

I don't think those are suitable tasks for the storage service. This is why all of them are separate utilities and services.
But it's certainly possible to build: take a light Linux distro, add all of those components, inject the storagenode, then keep it updated.


Well, to be fair… since the storagenode is in docker, I suppose it already has most of the things you are talking about, as it's hosted on some sort of OS, which usually has most of those features by default.

But isn't that like saying people who print books shouldn't care about the paper they're printed on…

It might not seem that important at first, but it does a lot for durability… and a SNO having a green indicator telling them their storage medium is in working order seems pretty useful…

It doesn't cost anything meaningful, it's not really in the way, it saves data (thereby reducing repair costs on the network), and it makes life better for just about everyone…

At least as far as I can tell…
But it's a valid point. I just think there will be many much more casual SNOs in the future… and then Storj will want 3-click installs and 3-click management of nodes, without anything having to be learned…

One just sees the green light, and one is good…

Much like the dash lights in your car tell you everything is okay… in the past, just turning the key or pushing the button would have represented maybe an hour of prep, and no less than 15 minutes…

The fact of the matter is that the evolution of technology keeps combining multiple technologies into one and making them smaller… We can have libraries, supercomputers, TV, radio stations, a flashlight, a camera, and, oh yeah, let's not forget the actual purpose of the device that was made to do all this… a phone xD. All previously physical things that are now simply one little thing, if 'simply' can be used about something as complex as a smartphone.

I suppose today we can add financial institution to that as well…

But yeah, a valid argument, and I'll agree with you on some points… but on HDD health I have to strongly disagree that it doesn't belong on a storagenode dashboard, especially since it's basically a free, already existing feature.

The end goal is to have a standalone binary, installed in a meaningful way on each OS. The binary can't be sure that it will have all the needed tools, so otherwise you would have to bundle them into the storagenode.

I don’t think it is a good idea.

But building some kind of distro for the storagenode - that would be a nice task for us, as a Community.


That would also solve the whole privileged access issue. I see no problem with the "monitoring" software being separate from the storagenode; say people are running massive systems and want their own monitoring, having it separate would enable such advanced users to simply not install it, and thus have their individual nodes require fewer resources.

Going to set that as the solution, to push that idea to the top along with the initial outline of my thoughts, as I think it makes perfect sense…

Sorry if I didn't understand your line of reasoning at first, @Alexey; my coding skills and experience are quite limited, so I had to parse it xD

Sorry that I didn't include all this information in my first response. I thought you knew how those tools work, what they require, and how they could be used by the storagenode.

If the Community made this distro, it could help new Storage Node Operators start operating a node.

However, I don't think it would be the main way to deliver the storagenode binary.

I have an HP server, and the RAIDs are managed by HP's proprietary system on a P440 controller. So there is no software that can monitor the disks under the HP system, none at all; the software does not detect the disks.

Actually, with the right commands one can request SMART data through at least LSI RAID controllers, and I believe the P440 is an LSI controller at its core…

So yeah, with the correct configuration one could request the drives' SMART data, at least on LSI RAID, for example as below.
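
A sketch with smartmontools, which can address drives behind many LSI-based controllers directly (the disk index and device path depend on the setup):

smartctl -a -d megaraid,0 /dev/sda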

The problem is not the hardware of my system, but the software that manages the whole system. I have installed Windows Server 2016 with all the HP packages, in which different tools are installed to verify that the disks are in perfect condition at all times. But if you want to incorporate some kind of third-party monitoring software, it does not detect the disks. You can use software to see the performance of the disks within their RAIDs, but seeing their status is impossible.

Reading SMART data doesn't affect anything… it would simply give one a health status on the dashboard… if one wanted it…

If the vendor's drivers allow you to…
We are talking about HP and Windows.


I believe it is:

smartctl -a -d cciss,0 /dev/sda

Here /dev/sda is a placeholder, and the digit 0 selects the disk.

Alternatively, there is the HP tool (called hpacucli, then hpssacli, then ssacli):

ssacli ctrl slot=1 pd all show

Sorry, but this is Linux, not Windows.
