Can bad disk I/O performance impact all other nodes hosted on the same server?

Hi!

I have a VM dedicated to Storj that hosts 5 nodes (on 5 different disks).
One of my nodes has pretty bad I/O performance that pushes the I/O wait to around 80-95%.
When I stop this node, the overall iowait is fine again.
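
In case it helps, this is roughly how I narrowed it down to that one disk (a quick sketch, assuming a Linux host and the third-party psutil package; this is not part of any Storj tooling):

```python
# Sample per-disk busy time twice and estimate utilisation over the interval.
# Assumes Linux and psutil (pip install psutil).
import time
import psutil

INTERVAL = 5  # seconds between samples

before = psutil.disk_io_counters(perdisk=True)
time.sleep(INTERVAL)
after = psutil.disk_io_counters(perdisk=True)

for disk, stats in after.items():
    if disk not in before:
        continue
    # busy_time is the cumulative time (ms) the device spent doing I/O (Linux only).
    busy_ms = stats.busy_time - before[disk].busy_time
    util = 100.0 * busy_ms / (INTERVAL * 1000)
    print(f"{disk}: ~{util:.0f}% busy over the last {INTERVAL}s")
```

The disk that stays near 100% busy is the one driving the iowait.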

My question is: is this I/O performance issue (caused by only 1 node) a problem for all the other nodes? Could it affect my other nodes' performance (and therefore my potential earnings)?

Thanks for your advice!

Hi @jeremyfritzen ,

It’s certainly possible. But if the only thing running on that drive is just one Storj node, it wouldn’t be expected. The team said it sounds like you should replace that one drive because it seems like it’s having issues.


Thanks @bre

But if the only thing running on that drive is just one Storj node, it wouldn’t be expected.
Indeed, that drive serves only this one node.

So you mean it shouldn’t impact the other nodes’ performance?

Indeed, I think I should replace the drive. It’s an SMR disk. I’m checking with Amazon to be reimbursed, since it was not described as an SMR disk.

Thanks for your help!


When a storage controller hits long iowaits it will stall many operations. In most cases it’s at worst the blink of an eye, but if the disk is in the wrong kind of trouble it can throw these stalls multiple times a minute.

The disk doesn’t have to produce data errors for this to happen, and it’s most visible in live tasks: working in the console, live data routing such as VoIP, or live streaming.
The performance loss is usually minimal, but it is annoying as hell and can be very difficult to get rid of without the right tools.

If you are a Linux user I would recommend something like netdata, which can usually show you disk latencies / backlog.
Just keep in mind the backlog is an approximation of the total queue length, not the actual latency.
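
To give a rough idea of where that number comes from, here is a small sketch (assuming Linux; the field layout follows the kernel’s iostats documentation) that samples the weighted I/O time counter in /proc/diskstats that backlog-style charts are typically derived from:

```python
# Sample the "weighted time spent doing I/O" counter from /proc/diskstats.
# Assumes Linux; field 13 (0-based) is the weighted I/O time in milliseconds.
import time

def weighted_io_ms():
    """Return {device: cumulative weighted milliseconds spent doing I/O}."""
    result = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 14:
                result[fields[2]] = int(fields[13])
    return result

INTERVAL = 5  # seconds between samples

before = weighted_io_ms()
time.sleep(INTERVAL)
after = weighted_io_ms()

for dev, ms in after.items():
    delta = ms - before.get(dev, ms)
    if delta > 0:
        print(f"{dev}: ~{delta} ms of queued I/O time accumulated over {INTERVAL}s")
```

A device whose queued I/O time keeps climbing while the others stay near zero is usually the one causing the stalls.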

I always seem to have a drive that is acting up; if I get rid of one, another less visible one rears its head. It’s a really annoying issue that I haven’t found a good solution for, and one I have seen in almost all storage setups.

I think this comes down to the fact that if a running process loses its connection to the storage media, it will break things or create inconsistencies. To mitigate that, storage systems basically halt everything on a lost connection, in the hope that it can be restored in a reasonable time.

HDDs will usually make a ticking sound when they have these kinds of issues, and you will see high backlog spikes in netdata.

These issues also tend to be aggravated by higher workloads, and disk errors are much more likely to surface when a drive is heavily loaded, so it can be a good idea to run storage systems lightly loaded so as not to overwork the devices.

I like to split my drives into two classes: the lighter-workload disks, which may have issues and which I use to store less important data, often in mirrors;
and the near-flawlessly performing drives, which run everything else.

Ideally one might want a second host for the poorly performing drives to keep their latency isolated. Of course, if they are not heavily loaded this is much less of an issue; drives can often run under light loads for years without problems, even if they continually fail under heavy loads.

Of course you can’t trust bad drives, which is why I usually end up putting them in mirrors.

Thank you for your explanation!
That said, I’m still not sure about the answer to my original question.

Could my “bad drive” (an SMR disk with poor I/O performance) impact my other, flawless drives?

Thanks and take care!

It should not, but it can. It depends on many variable factors. For example, when your SMR disk is struggling under load, the storagenode will use more RAM, and your system could start to use swap and slow down overall.
Some motherboards also have a poorly designed architecture where the whole bus could be affected. So YMMV.
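
If you want to check whether that is what’s happening, here is a minimal sketch (assuming Linux and psutil; the process name is illustrative, match it to how your node actually runs):

```python
# Watch swap usage and the memory footprint of the storagenode process.
# Assumes Linux and psutil; PROCESS_NAME is illustrative and may differ on your setup.
import time
import psutil

PROCESS_NAME = "storagenode"  # hypothetical; use the actual process name of your node

while True:
    swap = psutil.swap_memory()
    rss_bytes = sum(
        p.info["memory_info"].rss
        for p in psutil.process_iter(["name", "memory_info"])
        if p.info["name"] == PROCESS_NAME and p.info["memory_info"] is not None
    )
    print(f"swap used: {swap.percent}%  |  {PROCESS_NAME} RSS: {rss_bytes / 2**20:.0f} MiB")
    time.sleep(10)
```

If swap usage climbs whenever the node’s memory grows while the SMR disk is saturated, the slow disk is indirectly dragging the whole host down.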