When it comes to the amount of attention the storage industry pays to various technologies, only cloud storage scores higher than deduplication. But hype and overhype cause a problem for the channel, which has to separate fact from fiction and decide what customers really need and what they're actually going to spend money on. The simple truth is that most customers, especially when budgets are tight, spend money only in the areas that are causing the most pain. Is
Before we answer that question, first let's discuss what primary storage deduplication is and what some of the motivators are for a customer selecting a primary storage deduplication product. Primary storage deduplication is largely based on the same technology as backup or archive deduplication. Redundant blocks of data are identified and stored only once. This requires some overhead to build the metadata database that manages the reference points to the data. For the overhead to be worthwhile, there should be a significant return on the investment in the form of increased capacity.
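To make the mechanism concrete, here is a minimal sketch of block-level deduplication. It is a toy, not any vendor's implementation: the 4 KB fixed block size, the use of SHA-256 fingerprints and the `DedupStore` name are all assumptions made for illustration. The dictionaries stand in for the metadata database that tracks the reference points.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products vary


class DedupStore:
    """Toy block store: each unique block is kept once, keyed by its hash."""

    def __init__(self):
        self.blocks = {}    # fingerprint -> block bytes (stored only once)
        self.refcount = {}  # fingerprint -> number of references to the block
        self.files = {}     # file name -> ordered list of fingerprints

    def write(self, name, data):
        # Overwrites and deletes are not handled in this sketch.
        recipe = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()
            if fp not in self.blocks:
                self.blocks[fp] = block  # first time this block is seen
            self.refcount[fp] = self.refcount.get(fp, 0) + 1
            recipe.append(fp)
        self.files[name] = recipe

    def read(self, name):
        # Rebuild the file by following its list of reference points.
        return b"".join(self.blocks[fp] for fp in self.files[name])

    def logical_bytes(self):
        return sum(len(self.blocks[fp])
                   for recipe in self.files.values() for fp in recipe)

    def physical_bytes(self):
        return sum(len(block) for block in self.blocks.values())


store = DedupStore()
shared = b"A" * BLOCK_SIZE * 3
store.write("file1", shared + b"B" * BLOCK_SIZE)
store.write("file2", shared + b"C" * BLOCK_SIZE)
print(store.logical_bytes(), store.physical_bytes())
```

The two files hold eight blocks between them, but only three distinct blocks are physically stored. The index itself is the overhead the paragraph above refers to: it has to be built, kept consistent and consulted on every read.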
The problem is that primary storage is unlike backup storage in key ways. In a backup scenario, the same data is sent to the backup store over and over again. As a result, backup storage deduplication can deliver data reduction rates as high as 20X. But primary storage does not typically have a high level of redundancy and so primary storage deduplication can't deliver similar reduction rates or a similar ROI.
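The arithmetic behind that gap can be sketched in a few lines. The function name and the figures below are hypothetical, chosen only to show the shape of the calculation: a backup store receives the same dataset repeatedly, so the logical data grows with each full backup while the physical data grows only by what changed.

```python
def backup_reduction_ratio(dataset_gb, full_backups, change_rate):
    """Logical-to-physical ratio after a series of full backups.

    change_rate is the assumed fraction of the dataset that is new
    between one full backup and the next.
    """
    logical = dataset_gb * full_backups
    physical = dataset_gb + dataset_gb * change_rate * (full_backups - 1)
    return logical / physical


# Hypothetical numbers: a 1,000 GB dataset backed up in full 20 times.
print(backup_reduction_ratio(1000, 20, 0.0))   # nothing changes: 20.0
print(backup_reduction_ratio(1000, 20, 0.02))  # 2% changes between backups
print(backup_reduction_ratio(1000, 1, 0.0))    # a single copy: 1.0
```

A primary store resembles the last case: each piece of data is usually written once, so the only savings come from whatever redundancy exists within that single copy.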
Beyond that, primary storage has less spare performance headroom than backup storage in which to perform the deduplication. Primary storage has to sustain application performance rates; if deduplication consumes headroom that isn't there, application performance on that primary storage will suffer.
Finally, while primary storage is more expensive than backup storage, its cost per unit of capacity has come down, making it less expensive to buy more primary storage than in the past. It's relatively easy to keep adding more shelves of storage with more capacity to primary storage. Customers may see this as the path of least resistance.
With all these factors, why would customers consider primary storage deduplication, and why should VARs pay attention to this market? One motivator relates to power. The challenge with just throwing disks at the data growth problem on primary storage is finding room on the electrical grid to power all those drives. Even though the space savings from dedupe on primary storage are much lower than in a backup scenario, the ROI of squeezing more data on to the same storage may be compelling on its own; when combined with the potential power savings, it may be irresistible.
The second key motivator in primary storage deduplication is the ever-growing deployment of virtual machines. In most virtual server environments, the entire server image is loaded on to a shared storage platform. This includes the operating system and other key files, all of which tend to be very similar across servers. In an environment with 100 virtual servers, there is a lot of duplicate data that wasn't there before server virtualization. This data tends to be read-heavy, so reading it from a deduplicated area carries less of a performance penalty than write-heavy data would. In large virtual server environments -- say, more than 50 VMs -- the amount of redundant data makes an investment in primary storage deduplication worthwhile, even without taking into consideration the power savings discussed above.
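A back-of-the-envelope estimate shows why the VM count matters. The sketch below assumes an idealized case in which every VM's operating system image is identical and everything else is unique; the function name and the sizes are hypothetical, so treat the result as an upper bound rather than a prediction.

```python
def vm_capacity_gb(vm_count, os_image_gb, unique_gb_per_vm):
    """Return (raw_gb, deduplicated_gb) for a set of similar VMs.

    Idealized: the OS image is assumed to deduplicate perfectly across
    VMs, and the per-VM data is assumed to contain no duplicates.
    """
    raw = vm_count * (os_image_gb + unique_gb_per_vm)
    deduplicated = os_image_gb + vm_count * unique_gb_per_vm
    return raw, deduplicated


# Hypothetical sizing: 100 VMs, 20 GB OS image, 10 GB unique data each.
raw, deduplicated = vm_capacity_gb(100, 20, 10)
print(raw, deduplicated)  # 3000 1020
```

With a handful of VMs the shared image is a small fraction of the total; as the count grows, it dominates the raw capacity and the savings follow.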
There are a number of products that handle primary storage deduplication and data reduction. For instance, there are content-aware deduplication/compression tools (from the likes of Ocarina Networks) available that allow for content-specific examination. For example, say you have two photos of the same image stored on a system that are identical except that one has had the "red eye" removed. To most deduplication systems, these are totally different files. A content-aware dedupe product stores almost the entire image just once, retaining separate data for the area in the photo where the images differ.
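The red-eye example can be sketched as a base copy plus a list of differing regions. This is a deliberate simplification and not how any named product works: real content-aware tools have to understand the file format first, whereas the hypothetical `diff_regions` and `rebuild` helpers below simply compare two equal-length byte buffers standing in for decoded image data.

```python
def diff_regions(base, variant):
    """Return [(offset, bytes)] runs where variant differs from base."""
    if len(base) != len(variant):
        raise ValueError("this sketch only handles equal-length buffers")
    regions = []
    start = None
    for i, (a, b) in enumerate(zip(base, variant)):
        if a != b:
            if start is None:
                start = i  # a differing run begins here
        elif start is not None:
            regions.append((start, variant[start:i]))
            start = None
    if start is not None:
        regions.append((start, variant[start:]))
    return regions


def rebuild(base, regions):
    """Recreate the variant from the shared base and its differing regions."""
    out = bytearray(base)
    for offset, patch in regions:
        out[offset:offset + len(patch)] = patch
    return bytes(out)


original = bytes(range(256)) * 400   # stand-in for decoded pixel data
edited = bytearray(original)
edited[5000:5064] = b"\x00" * 64     # the "red eye" correction
edited = bytes(edited)

regions = diff_regions(original, edited)
extra_bytes = sum(len(patch) for _, patch in regions)
print(len(original), extra_bytes)
```

Storing the second photo costs only the bytes in the differing region plus a little bookkeeping, instead of a second full copy.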
Beyond content-aware deduplication/compression products, there are solutions available (from the likes of Ocarina and Storwize) that can keep the data in its optimized state across storage tiers. For example, data could be examined for redundancy, compressed and then moved to a disk archive. This not only frees up the primary storage pool but it also more deeply optimizes the secondary storage tier. In some cases this data can even be sent to the backup target in its optimized format.
Another approach, from Storwize, focuses on compression rather than deduplication. Storwize's compression appliances sit inline in front of NAS heads. Although they don't deduplicate at all, they compress data universally (as opposed to deduplication, which obviously acts only on duplicate data). Interestingly, in almost every test case, the Storwize appliance has not impacted storage performance, primarily because with compression, while there's processing required to compress the data, there's less data to transport, cache and compute.
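The difference between the two techniques is easy to demonstrate on data that contains no duplicate blocks at all. The sketch below uses Python's standard `zlib` and `hashlib` modules; the 4 KB block size and the synthetic log data are assumptions for illustration, and it says nothing about how Storwize's appliances work internally.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed block size for the dedupe comparison


def dedup_size(data):
    """Bytes left after fixed-block deduplication (unique blocks only)."""
    unique = {}
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        unique[hashlib.sha256(block).digest()] = len(block)
    return sum(unique.values())


def compressed_size(data):
    """Bytes left after general-purpose compression."""
    return len(zlib.compress(data))


# Text-like data with no repeated blocks: every line carries a unique number.
log = "".join(
    f"request {i} served in {i % 97} ms\n" for i in range(20000)
).encode()

print(len(log), dedup_size(log), compressed_size(log))
```

Deduplication finds nothing to remove here because no two blocks are identical, while compression still shrinks the data by exploiting repetition inside each block.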
Another inline primary storage data reduction method that uses deduplication, along with compression, is WhipTail Technologies' Racerunner SSD. The product's use of deduplication and compression means it won't be as fast as more traditional SSDs, but it may be a happy medium for many customers. Customers that need more performance than mechanical drives can offer but not the extreme performance of traditional SSDs are good candidates. Racerunner SSD is the only product that does block-based inline primary storage deduplication.
While those two examples address inline primary storage data reduction, most primary storage data reduction tools are post-process and work on "near-active" data -- data that is idle but is not quite ready to be archived to a secondary or archive tier. Many customers decide that they don't want to or can't migrate to that secondary tier at all and so this form of optimization may be ideal for them.
NAS systems are the most common deployment of this technique. Companies like EMC, NetApp and Nexenta all have a deduplication component in their offering now (NetApp as part of its Data ONTAP operating system, EMC within its Celerra and Nexenta with a product based on Sun's ZFS). In most cases, files are examined when NAS utilization is low. The deduplication component will identify files or blocks of files that have not been accessed in a period of time, compare them at the appropriate level with other data segments on the NAS for similarities and then eliminate redundant segments. This process could produce savings ranging from 10% to 5X depending on the dataset.
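The first step of that post-process pass, picking out data that has gone idle, can be sketched with nothing more than file access times. This is an illustration rather than any NAS vendor's logic: the `idle_files` helper and the 90-day threshold are hypothetical, and it relies on the file system recording last-access times, which some mount options disable.

```python
import os
import tempfile
import time


def idle_files(root, idle_days, now=None):
    """List files under root not accessed in the last idle_days days."""
    now = time.time() if now is None else now
    cutoff = now - idle_days * 86400
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_atime < cutoff:
                candidates.append(path)
    return sorted(candidates)


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    old_path = os.path.join(root, "old.dat")
    new_path = os.path.join(root, "new.dat")
    for path in (old_path, new_path):
        with open(path, "wb") as handle:
            handle.write(b"data")
    year_ago = time.time() - 365 * 86400
    os.utime(old_path, (year_ago, year_ago))  # pretend it was last touched a year ago
    found = idle_files(root, idle_days=90)
    found_names = [os.path.basename(p) for p in found]

print(found_names)
```

Only the candidates returned by a scan like this would then be compared against existing segments, which is why the work can be scheduled for periods when NAS utilization is low.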
The downside to most of these NAS implementations is that efficiencies are gained only on a platform-by-platform basis. If your customer has a mixed NAS environment, you may want to look at an external solution that can work across different systems. Today, Ocarina Networks and Cofio Software have products in this space. Both companies' products can identify redundant segments of data and store only one copy of that segment. Both also have the capability to move data from primary storage to secondary storage and maintain the deduplication efficiency.
So, are we "there" with primary storage deduplication? To a large extent, yes. The solutions are beginning to mature and customer need -- thanks to constricting access to power, rampant growth in virtualization and unstructured data -- is increasing quickly. Now is an ideal time for resellers to develop a strategy around primary storage deduplication and data reduction.
About the author:
George Crump is president and founder of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments. With 25 years of experience designing storage solutions for data centers across the United States, he has seen the birth of such technologies as RAID, NAS and SAN. Prior to founding Storage Switzerland, George was chief technology officer at one of the nation's largest storage integrators, where he was in charge of technology testing, integration and product selection.
This was first published in December 2009