Storage performance monitoring: How to diagnose problems in customers' storage systems

Learn how to pinpoint a performance problem in your customers’ storage systems via storage performance monitoring, as well as how to fix the problems once you’ve found them.

If you have been an integrator for any period of time, you have heard the complaint from customers, “My storage isn’t fast enough.” Your first instinct may be to sell them a storage system that you think will be faster. That is a dangerous response, since there is no guarantee that a storage system that is technically faster will actually perform faster in the customer’s environment. Nor is there any guarantee that the faster system will be fast enough. Before you can make that determination, you first need to understand what the problem is, and storage performance monitoring is how you find out.

One constant in the IT universe is that the storage system gets blamed for all kinds of problems. While storage certainly deserves some of this blame, the reality is that it’s not always the problem.

Before you focus on storage performance as the troublemaker, you should first rule out application performance as the source of the problem. Utilities from companies such as SolarWinds, Aptare, Virtual Instruments and others can help with this task. Even built-in utilities that come with the operating systems will report on some of these key statistics.

If application problems are ruled out, you can move on to examining the storage infrastructure.

The simplest statistic to check when confirming a storage performance problem is CPU utilization, using either one of the above applications or an operating system utility. If CPU utilization is high (50% or higher), there is more than likely no storage performance problem; if it is low, chances are good that there is one. Low CPU utilization means the CPU is waiting on something. For an application, that is either user input or storage.
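On Linux, one way to check this is to sample the aggregate `cpu` line of `/proc/stat` twice and compare busy time with time spent waiting on I/O. The following is a minimal sketch rather than a replacement for the monitoring tools above; the function names are illustrative, and the 50% cutoff is simply the rule of thumb from this article.

```python
def parse_cpu_line(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def cpu_breakdown(before_text, after_text):
    """Return (busy_pct, iowait_pct) between two /proc/stat samples."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    # Counter order: user nice system idle iowait irq softirq steal ...
    # Guest time is already counted inside user/nice, so stop at steal.
    deltas = [b - a for a, b in zip(before, after)][:8]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    idle, iowait = deltas[3], deltas[4]
    busy = total - idle - iowait
    return 100.0 * busy / total, 100.0 * iowait / total


def storage_suspected(busy_pct, busy_threshold=50.0):
    """Article's rule of thumb: low CPU utilization hints at a storage wait."""
    return busy_pct < busy_threshold


def read_proc_stat(path="/proc/stat"):
    """Read the current counters (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

To use it, call `read_proc_stat()` twice a few seconds apart and pass both results to `cpu_breakdown`. Low busy time combined with high I/O wait is consistent with a CPU that is waiting on storage.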

Another area to look at is queue depth, again using the applications or utilities above. Specifically, you are looking for whether applications are issuing requests faster than storage can service them, so that requests pile up in a queue. A queue can be caused by a single application making multiple requests or by multiple applications making simultaneous requests. Queue depth is probably the second quickest way to confirm a storage performance problem. While any queue depth at all means there is a performance problem, typically you should pay attention to anything greater than 3 or 4. In most cases, if there is a storage performance problem, the queue depth will be in the triple digits. (Many third-party storage management tools break queue depth down by storage adapter, storage controller or disk volume for more detailed analysis and diagnosis.)
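As an illustration of where a queue depth figure can come from, the sketch below derives an average queue depth for one device from two samples of Linux's `/proc/diskstats`, which is the same idea behind the average queue size that `iostat` reports. The function names are hypothetical, and the thresholds (above 4, and 100 or more) are just this article's guidelines expressed as defaults.

```python
def parse_diskstats(diskstats_text):
    """Map device name -> (ios_in_flight, weighted_io_ms) from /proc/diskstats text."""
    stats = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or unexpected lines
        # Layout: major minor name, then the I/O counters. The 9th counter is
        # I/Os currently in progress; the 11th is weighted ms spent doing I/O.
        stats[fields[2]] = (int(fields[11]), int(fields[13]))
    return stats


def average_queue_depth(before_text, after_text, device, interval_seconds):
    """Average number of outstanding requests for one device over the interval."""
    weighted_before = parse_diskstats(before_text)[device][1]
    weighted_after = parse_diskstats(after_text)[device][1]
    return (weighted_after - weighted_before) / (interval_seconds * 1000.0)


def classify_queue_depth(depth, attention_above=4, severe_at=100):
    """Bucket a queue depth using the article's rough guidelines."""
    if depth >= severe_at:
        return "severe"
    if depth > attention_above:
        return "investigate"
    return "ok"
```

Sampling over several intervals, rather than trusting a single reading, gives a better sense of whether the queue is sustained or just a momentary burst.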

Another statistic to investigate (via the applications mentioned above) is latency; this is how long the system takes to respond to a request. Latency in mechanical hard drives is largely the time it takes the heads to seek to the right track and the platters within the disk volume to rotate to where the data is. If queue depth is less than 10 but latency is noticeable, the system is probably having problems with seeks, which can happen, for example, with a database trying to find a large range of records. Latency can also be caused by high data fragmentation since, in that situation, the drive has to seek and rotate repeatedly to assemble the requested data.
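The rotational part of that latency follows directly from spindle speed: on average, the drive waits half a revolution for the data to come around. The short sketch below works through the arithmetic. The function names are illustrative, and the IOPS estimate is a rough single-drive ceiling that ignores transfer time, caching and command queuing.

```python
def average_rotational_latency_ms(rpm):
    """Average wait for the sector to come around: half a revolution."""
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2.0


def estimated_random_iops(rpm, average_seek_ms):
    """Rough ceiling on random IOPS for one drive (seek + rotation only)."""
    service_time_ms = average_seek_ms + average_rotational_latency_ms(rpm)
    return 1000.0 / service_time_ms


for rpm in (7200, 10000, 15000):
    # Prints 4.17, 3.0 and 2.0 ms respectively.
    print(rpm, round(average_rotational_latency_ms(rpm), 2))
```

Even at 15,000 rpm the average rotational delay is 2 ms before any seek time is added, which is why a single mechanical drive tops out at a few hundred random I/Os per second.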

A final measurement is the combined IOPS of the disk volume, again using the tools mentioned above. The IOPS numbers of a disk volume normally rise and fall with the workload. If instead they hold at a near-steady level, the storage system has hit the performance wall and is constantly pegged.
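One simple way to spot that flat line in a series of IOPS samples is to check how many of them sit close to the observed peak. This is only a heuristic sketch; the function name and the 5% and 90% defaults are assumptions chosen for illustration, not figures from any monitoring product.

```python
def looks_pegged(iops_samples, tolerance=0.05, min_fraction=0.9):
    """Return True when most samples sit within `tolerance` of the peak."""
    if not iops_samples:
        raise ValueError("need at least one IOPS sample")
    peak = max(iops_samples)
    if peak <= 0:
        return False  # an idle volume is flat but not pegged
    floor = peak * (1.0 - tolerance)
    near_peak = sum(1 for sample in iops_samples if sample >= floor)
    return near_peak / len(iops_samples) >= min_fraction
```

A light but very regular workload would also look flat by this test, so it is worth reading the result alongside queue depth and latency rather than on its own.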

Once you’ve confirmed a problem via storage performance monitoring, the next step is deciding what to do about it. You can reduce disk queue depth by adding more mechanical disk drives to the disk volume. As a rule of thumb, each drive you add reduces the queue depth by one. The problem with this approach is that the queue depth might be 200 or more. Are you really going to tell the customer to buy 200 drives? Remember, they don’t have a capacity problem; they have a performance problem. This is an ideal scenario for a solid-state drive (SSD) system.
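To make that arithmetic concrete, the sketch below applies the one-drive-per-queue-slot rule of thumb from the paragraph above. The function name and the default target of 4 are illustrative assumptions; real arrays will not scale this neatly.

```python
import math


def drives_to_add(observed_queue_depth, target_queue_depth=4):
    """Drives needed if each added drive lowers queue depth by about one."""
    return max(0, math.ceil(observed_queue_depth - target_queue_depth))


print(drives_to_add(200))  # 196 additional drives to get from 200 down to 4
```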

To reduce latency, you need to speed up the hard drive. The problem with this is that the industry has been stuck at 15,000 rpm drives as the fastest speed for the past decade, so much of the time, a faster hard drive isn’t an option. In that case, the only thing you can do is shorten the distance that the drive’s heads have to travel to complete a seek, a technique known as short-stroking. To do this, you format the drive at one-half or one-third of its actual capacity, forcing data to the fast, outer edge of the drive.
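The cost of that trade-off is easy to work out. In the sketch below, the function name is illustrative and the 600 GB drive is a hypothetical example, not a specific product.

```python
def short_stroke(raw_capacity_gb, capacity_divisor):
    """Return (usable_gb, cost_multiplier_per_usable_gb) for a reduced format."""
    if capacity_divisor < 1:
        raise ValueError("capacity_divisor must be at least 1")
    usable_gb = raw_capacity_gb / capacity_divisor
    # Paying for the whole drive but using 1/N of it costs N times per usable GB.
    return usable_gb, float(capacity_divisor)


print(short_stroke(600, 3))  # (200.0, 3.0)
```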

Imagine going to your customer and suggesting that they buy the most expensive mechanical drive on the market and then use only a third of its capacity. The suggestion won’t go over well. This is why solid-state drives are now selling so well. They can be used in a surgical way to address specific performance problems. And in these situations, they are more cost-effective than adding a lot of drives.

Installing the fastest drive technology possible (SSD) is almost always going to help performance. But not all SSDs are created equal, and you don’t always need to sell the fastest one available. Careful study using the tools above will help you recommend the right solution.

The next step is to understand the storage network and how to optimize that, something that we will cover in our next article.

George Crump is president of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments.

Join the conversation

1 comment


Does ASM or Exadata help with queue depth or IOPS problems?
I imagine these kinds of problems will not occur in cloud storage.