How to diagnose disk array controller performance problems

Learn about the components in a disk array controller and find out how to determine whether a customer’s performance problem originates with the controller as well as what’s causing the problem.

Most of the emphasis around improving a customer’s storage system performance is placed on the disk drives themselves. The typical solutions to a performance problem are either to add drives to the RAID group or to add solid-state disk (SSD) to the system. The challenge with those approaches is that at some point in the evolution of the storage system, the controllers become the bottleneck. In that case, it doesn’t matter how fast the drives or the storage network are; the disk array controller can’t keep up. So when you upgrade the storage system and storage network, it’s critical to evaluate the storage controller to make sure it can handle all of the surrounding upgrades and that your customers get the performance results they expect.

The disk array controller typically has three primary components. The first is a CPU that processes all the data that is sent to it and that is stored on the storage system. Twenty years ago, this processing was relatively simple. Today, it’s not so simple, with more sophisticated storage systems and loads of storage-centric services.

The second component of the storage controller is its I/O ports. These are the connections that receive data from and send data to the connected hosts, and the connections that write data to and read data from the storage subsystem. The available bandwidth that these connections support is also a factor in how fast the controller can drive the underlying disk media.

The final component is the software executed by the controller’s processor. Thanks to advancements in technology, the storage controller is responsible for a growing number of increasingly complex functions. Each service provided by the controller’s software consumes more of the processor’s resources.

Disk array controller bottlenecks

At the simplest level, the disk array controller handles all the volume management functions, removing that responsibility from the hosts that connect to it. This includes making multiple independent disk drives look like one, as well as the now-expected RAID-level protection such as RAID 1 (mirroring), RAID 4, RAID 5 and RAID 6. The goal of these services is to make the management of storage easier on the connected host so that it can spend its processing power elsewhere.
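To give a feel for the kind of work RAID protection puts on the controller’s CPU, here is a minimal sketch of XOR parity, the basic idea behind RAID 5. The block contents and function name are made up for illustration, and a real controller works on large stripes, usually with hardware assistance.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical data blocks, one per drive in the stripe.
data = [b"disk0", b"disk1", b"disk2"]

# The controller computes parity on every full-stripe write.
parity = xor_blocks(data)

# If the second drive fails, its block is rebuilt from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

Every write and every rebuild involves this kind of calculation, which is one reason controller CPU load grows with the amount of storage behind it.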

In the past two decades, storage systems have become more intelligent, taking on additional storage chores—such as snapshots, thin provisioning and automated tiering—each adding to the burden on the controller.

For customers experiencing storage performance problems, it’s not safe to assume that you can just sell SSD to them and the problem will go away. The storage controller could be the source of the I/O problem. To determine whether the controller is to blame, there are two areas to consider. The first is the CPU utilization of the storage controller. Most storage system software will report this information, but usually not with the detail you need. Rather, you need the average utilization metrics over a given time period and within a specific time window. A report of the average CPU utilization of the controller that factors in 9 p.m. to 6 a.m., when no one is in the office, will skew the results. To get the data you need, look to advanced storage management software available from third parties—providing you another opportunity to add value.
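The effect of the reporting window is easy to see with a small sketch. The samples below are invented, and the 6 a.m. to 9 p.m. active window is simply the complement of the idle period mentioned above; substitute whatever your monitoring tool exports and whatever hours your customer actually works.

```python
from datetime import datetime, time
from statistics import mean

# Hypothetical samples: (timestamp, controller CPU utilization in percent).
samples = [
    (datetime(2024, 1, 8, 2, 0), 5.0),    # overnight, office empty
    (datetime(2024, 1, 8, 4, 0), 7.0),
    (datetime(2024, 1, 8, 10, 0), 82.0),  # working hours
    (datetime(2024, 1, 8, 14, 0), 90.0),
    (datetime(2024, 1, 8, 16, 0), 86.0),
    (datetime(2024, 1, 8, 22, 0), 6.0),   # after 9 p.m.
]

def average_utilization(samples, window=None):
    """Average utilization, optionally limited to a daily window (start, end)."""
    if window is None:
        values = [pct for _, pct in samples]
    else:
        start, end = window
        values = [pct for ts, pct in samples if start <= ts.time() < end]
    return mean(values)

all_day = average_utilization(samples)
active_hours = average_utilization(samples, window=(time(6, 0), time(21, 0)))

print(f"24-hour average:       {all_day:.1f}%")
print(f"6 a.m.-9 p.m. average: {active_hours:.1f}%")
```

With these made-up samples, the 24-hour average is 46% while the active-window average is 86%, so the all-day figure hides a controller that is nearly saturated when people are working.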

The second area to examine is I/O utilization to the controller from the hosts and from the controller to the storage shelves. This information may be harder to get. Some manufacturers make this info available, but many do not, in which case, you might need a third-party software tool.

In both cases, the software tool should be able to point out when the disk array controller has excessive CPU utilization or when one of the I/O bandwidth limits is being exceeded. At this point, problem detection becomes a math issue: subtract what is being demanded from the CPU capacity or I/O bandwidth that is available. If the result is less than 10% headroom at peak periods, then you should look at the potential causes.
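The headroom arithmetic can be sketched in a few lines. All capacity and peak figures here are hypothetical, and the resource names are only labels; the 10% threshold is the rule of thumb from the paragraph above.

```python
def headroom_pct(capacity, peak_demand):
    """Share of capacity left unused at peak, as a percentage."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return (capacity - peak_demand) / capacity * 100.0

# Hypothetical peak-period measurements for one controller: (capacity, peak demand).
resources = {
    "CPU (%)": (100.0, 93.0),
    "host-side I/O (MB/s)": (1600.0, 1100.0),
    "shelf-side I/O (MB/s)": (2400.0, 2250.0),
}

THRESHOLD = 10.0  # minimum acceptable headroom at peak, in percent

results = {}
for name, (capacity, peak) in resources.items():
    results[name] = headroom_pct(capacity, peak)
    flag = "investigate" if results[name] < THRESHOLD else "ok"
    print(f"{name}: {results[name]:.1f}% headroom -> {flag}")
```

In this example the CPU and the shelf-side links fall under the threshold, while the host-side links still have room to spare.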

If it turns out that your customer does have a storage controller bottleneck, there are three possible causes. The first is too much capacity behind the controller. Most storage systems claim a supported drive count, but under full load the controller can keep up with only about half of it. Vendors are banking on the storage being idle often enough that the storage system services (snapshots, thin provisioning, etc.) can quickly catch up even if there are too many drives in the system. If the storage system is going to be heavily utilized and all the drives in the system need to perform at the highest-possible rate, the system might not scale to anything close to what the brochure indicates.
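A rough sizing sketch shows how the number of drives a controller can keep up with shrinks as the drives get busier. The controller ceiling, the per-drive rate and the duty cycles are all assumed numbers for illustration, not vendor figures.

```python
def max_drives(controller_iops, per_drive_iops, duty_cycle):
    """Largest drive count whose combined demand fits under the controller's ceiling.

    duty_cycle is the fraction of its peak rate that each drive is actually
    driven at (1.0 means every drive is running flat out).
    """
    if not 0 < duty_cycle <= 1:
        raise ValueError("duty_cycle must be in (0, 1]")
    return int(controller_iops // (per_drive_iops * duty_cycle))

# Hypothetical controller ceiling and per-drive peak rate.
CONTROLLER_IOPS = 60_000
PER_DRIVE_IOPS = 150

mostly_idle = max_drives(CONTROLLER_IOPS, PER_DRIVE_IOPS, duty_cycle=0.5)
full_load = max_drives(CONTROLLER_IOPS, PER_DRIVE_IOPS, duty_cycle=1.0)

print(f"Drives supportable at 50% duty cycle: {mostly_idle}")
print(f"Drives supportable at full load:      {full_load}")
```

Under these assumptions the controller handles 800 half-busy drives but only 400 fully busy ones, which is the gap between a brochure figure and a heavily utilized system.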

Another possible cause of a disk array controller bottleneck is too many services being performed. The list of functions that the storage controller performs in a modern storage system is pretty extensive compared with 20 years ago, with more on the way, such as deduplication and compression. In an active system, this may just be too much responsibility for the controller.

The final cause of a bottleneck is increased random I/O thanks to server virtualization. Before server virtualization, each connecting host had one application, and oftentimes you would run out of storage network ports before you could exceed the performance of the storage controller. Nowadays, each connecting host might support dozens of workloads, each with its own I/O pattern. This means that each host now has a steady stream of random I/O that makes the storage controller work harder to find the data that each virtual machine is requesting.
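The blending effect can be illustrated with a toy simulation. It assumes four virtual machines that each read sequentially from their own region of the array, and it uses a simple round-robin merge as a stand-in for how a host mixes their requests; the region size, block size and function names are all invented for the example.

```python
def sequential_stream(start_lba, count, block=8):
    """Logical block addresses for one VM reading its virtual disk front to back."""
    return [start_lba + i * block for i in range(count)]

def interleave(streams):
    """Round-robin merge: a simplified model of a host mixing requests from its VMs."""
    return [lba for group in zip(*streams) for lba in group]

def mean_jump(lbas):
    """Average distance between consecutive requests, in blocks."""
    jumps = [abs(b - a) for a, b in zip(lbas, lbas[1:])]
    return sum(jumps) / len(jumps)

# Hypothetical layout: four VMs, each starting 1,000,000 blocks apart.
vms = [sequential_stream(start_lba=i * 1_000_000, count=100) for i in range(4)]

single = mean_jump(vms[0])            # what the controller sees from one VM alone
blended = mean_jump(interleave(vms))  # what it sees from the host as a whole

print(f"Mean jump, one VM alone:  {single:,.0f} blocks")
print(f"Mean jump, four VMs mixed: {blended:,.0f} blocks")
```

Each VM on its own is perfectly sequential, yet the merged stream that reaches the controller jumps across the array on nearly every request, which is why it looks random from the controller’s point of view.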

How vendors are addressing the problem

Storage vendors have a wide variety of solutions to this controller bottleneck problem, including increasing raw processing power, developing specific CPUs to handle specific services, scaling out storage and offloading storage services to the virtual infrastructure or hypervisor. In our next article, we will detail these potential solutions so you can better position them with your customers.

George Crump is president of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments.
