In a recent tech tip, “RAID technology advances with wide striping and erasure coding,” Stephen Foskett correctly described the modern advances in drive-based data protection, also known as RAID.
RAID has served the data center well. It takes a group of individual hard drives and aggregates them to provide better performance and protection from failure. It is that failure or, more importantly, the recovery from failure that has led many, including myself, to predict that RAID, at least in its current form, is on its last legs.
When a drive fails in a RAID group, the RAID algorithm uses data from the other drives in the group, leveraging parity data, to reassemble the data from the failed drive onto a new drive. In principle this method of protection makes sense, but because modern drives hold so much data, it takes far too long to rebuild those drives. With high-capacity drives, we talk about RAID rebuilds in terms of days instead of hours.
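A back-of-envelope calculation shows why high-capacity drives push rebuilds from hours into days. This is a rough sketch; the 2 TB capacity and the 10 MB/s effective rebuild rate (real-world rebuilds often run far below a drive’s raw throughput because of contention with application I/O) are illustrative assumptions, not figures from the tip:

```python
# Rough rebuild-time estimate: time = drive capacity / sustained rebuild rate.
# Both figures below are illustrative assumptions.

def rebuild_hours(capacity_tb: float, rate_mb_s: float) -> float:
    """Hours needed to rewrite an entire drive at a given sustained rate."""
    capacity_bytes = capacity_tb * 1e12
    rate_bytes_s = rate_mb_s * 1e6
    return capacity_bytes / rate_bytes_s / 3600

# A 2 TB drive rebuilt at an effective 10 MB/s takes roughly 56 hours --
# well over two days of degraded operation.
hours = rebuild_hours(2, 10)
print(round(hours, 1), "hours =", round(hours / 24, 1), "days")
```

Doubling the drive capacity doubles the rebuild window, which is why each new generation of larger drives makes the problem worse.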
But the problem isn’t just the amount of time that the RAID rebuild takes; it’s also the impact of the rebuild process on applications and users: Storage performance is in most cases significantly degraded during the RAID rebuild. That means that applications can grind to a halt or come close to it. With many arrays, you can choose to throttle the amount of resources allocated to the rebuild process so that regular storage performance is not hindered. But with that strategy, the rebuild takes longer, and you remain in a degraded, exposed state for a longer period of time.
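The throttling tradeoff is proportional: give the rebuild a quarter of the bandwidth and the exposure window stretches four times as long. A minimal sketch (the 2 TB capacity, 50 MB/s unthrottled rate, and 25% throttle fraction are all hypothetical figures for illustration):

```python
def rebuild_hours(capacity_tb: float, rate_mb_s: float, fraction: float = 1.0) -> float:
    """Hours to rebuild when only `fraction` of the rebuild bandwidth is allowed."""
    return capacity_tb * 1e12 / (rate_mb_s * 1e6 * fraction) / 3600

# Throttling the rebuild to protect application performance stretches
# the exposure window proportionally (all figures are illustrative):
full = rebuild_hours(2, 50)            # unthrottled: ~11 hours
quarter = rebuild_hours(2, 50, 0.25)   # 25% of bandwidth: ~44 hours
print(round(full, 1), round(quarter, 1))
```

Either way you pay: fast rebuilds hurt application performance now; slow rebuilds extend the window in which another failure is fatal.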
In addition, for the duration of the RAID rebuild -- which can be shorter or longer depending on whether you’ve throttled the rebuild process to avoid hindering system performance -- the array can suffer total failure if one additional drive fails (under RAID 5) or two additional drives fail (under RAID 6). If your customer has a complete RAID failure of this nature, typically the only option is to begin a full recovery from backup, which of course takes time. While this type of failure might seem to be extremely rare, the impact of the failure is enormous. I’m also not sure just how rare a total RAID failure is anymore. While it’s certainly not common, every week I speak to end users who have experienced it.
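You can estimate the exposure with a simplified model: the chance that at least one of the surviving drives fails during the rebuild window. This sketch assumes independent failures at a constant annual failure rate, which understates real risk -- drives in a group are often the same age and model, so failures correlate -- and the group size, AFR, and rebuild window below are illustrative assumptions:

```python
def second_failure_prob(n_drives: int, afr: float, rebuild_hours: float) -> float:
    """Probability that at least one surviving drive in the group fails
    during the rebuild window. Assumes independent failures at a constant
    annual failure rate (a simplification; real failures often correlate)."""
    p_one = afr * rebuild_hours / 8760     # fraction of a year spent exposed
    return 1 - (1 - p_one) ** (n_drives - 1)

# 12-drive RAID 5 group, 3% AFR, 48-hour rebuild (all illustrative):
risk = second_failure_prob(12, 0.03, 48)
print(f"{risk:.4%}")   # on the order of 0.2% per rebuild event
```

A fraction of a percent per rebuild sounds small, but multiplied across many RAID groups, many arrays and many rebuild events per year, it explains why total RAID failures are no longer a purely theoretical concern.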
As Foskett points out in his tip, there are ways around these typical RAID problems, with wide striping, erasure coding and other techniques. And as we mentioned in our recent article on SSD reliability, solid-state storage systems could be another option because they are smaller in capacity per drive and of course very fast, especially on reads. SSD-only storage systems could return RAID rebuild times to minutes.
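The SSD claim follows from the same capacity-over-rate arithmetic: smaller drives rewritten at higher sustained rates. The 200 GB SSD, 250 MB/s rate, and the HDD comparison figures below are illustrative assumptions, not measurements:

```python
def rebuild_minutes(capacity_gb: float, rate_mb_s: float) -> float:
    """Minutes to rewrite a drive of `capacity_gb` at `rate_mb_s`."""
    return capacity_gb * 1e9 / (rate_mb_s * 1e6) / 60

hdd = rebuild_minutes(2000, 50)   # 2 TB HDD at 50 MB/s: ~667 minutes (~11 hours)
ssd = rebuild_minutes(200, 250)   # 200 GB SSD at 250 MB/s: ~13 minutes
print(round(hdd), round(ssd))
```

With the rebuild window measured in minutes rather than hours or days, the odds of a second failure landing inside it shrink dramatically.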
Now you know the “why” behind all the discussion of RAID failures. Use this foundation to start the conversation with your customer. You can add value by educating them on exactly what the RAID problem is and how they can plan around it or -- as Foskett’s article points out -- leverage technology to overcome today’s RAID challenge.
This was first published in March 2011