Deduplication initially gained prominence as the enabling technology in a disk backup appliance almost 10 years ago. A number of improvements have occurred along the way that upgraded its overall performance and effectiveness in backup systems as well as in primary storage. Global deduplication, developed as a method to increase efficiency and enable greater scalability, has been integrated into products in a number of categories. This article will explore how global deduplication has been implemented in the storage space and offer some insights into the benefits it can provide.
Briefly, the dedupe process parses an input data stream into (typically) sub-file-sized blocks, runs a hashing algorithm on them (somewhat like a checksum) and creates a unique identifier for each. These "hash keys" are then stored in an index or hash table, which is used to compare subsequent data blocks and determine which are duplicates. When a duplicate is encountered, a pointer to the existing block is created, instead of storing the block a second time. In this way, only unique blocks and hash keys are stored and redundant blocks are eliminated, or "deduplicated," from the data set.
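The process described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the fixed 4 KB block size, the use of SHA-256 and the names `deduplicate` and `restore` are all assumptions made for the example, and many real products use variable-size chunking instead.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for illustration


def deduplicate(stream: bytes, block_size: int = BLOCK_SIZE):
    """Split a stream into blocks and store each unique block only once."""
    index = {}     # hash key -> position of the block in the store
    store = []     # unique blocks, in order of first appearance
    pointers = []  # one entry per input block, referencing the store

    for offset in range(0, len(stream), block_size):
        block = stream[offset:offset + block_size]
        key = hashlib.sha256(block).hexdigest()  # the "hash key"
        if key not in index:
            # First time this block is seen: store it and index its key.
            index[key] = len(store)
            store.append(block)
        # Duplicate or not, record a pointer to the stored copy.
        pointers.append(index[key])
    return store, pointers, index


def restore(store, pointers) -> bytes:
    """Rebuild the original stream by following the pointers."""
    return b"".join(store[i] for i in pointers)
```

Feeding this function three blocks in which the third repeats the first leaves two blocks in the store and three pointers, and `restore` reproduces the original stream from them.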
A dedupe system's effectiveness is a function of its ability to find duplicate blocks, which, in turn, is directly related to the size of the pool of blocks it can store and represent in the hash table. In general, more blocks and a larger hash table mean better deduplication. Dedupe systems also need to scale as storage grows without sacrificing performance, which further drives the need for a larger pool of data blocks to support the dedupe process. Global deduplication is the way many dedupe vendors are addressing this requirement.
But exactly how to make that pool larger depends on the implementation (where the dedupe engine sits) and the architecture of the storage system it's connected to. This has led to dedupe systems sharing hash keys so that different dedupe engines can compare against a larger number of blocks. It has also led to larger block pools that support large-scale, clustered storage systems, along with methods for sharing the correspondingly large hash tables that reside on multiple storage modules.
The term "global" doesn't refer to a consistent process or architecture when applied to deduplication. It's most commonly used as a relative term, meant to differentiate a process that makes use of an expanded or shared index (global dedupe) from one that has a single index (local dedupe). When implemented in backup software (most enterprise backup software applications do have some kind of global deduplication functionality), this index or hash table is shared among individual dedupe processing engines as a method for improving dedupe efficiency and reducing data handling. In hardware, it's more typically a method for scaling the dedupe system, by sharing a larger pool of common blocks and a larger hash table among multiple controllers or storage modules; Data Domain takes this approach in global deduplication hardware.
Some manufacturers have called their systems "global" when all they've done is connect independent modules together with replication software, without actually sharing any blocks or an index, so these systems don't actually apply dedupe across the group of independent modules. While they certainly have the right to promote their products as they see fit, care must be taken to understand the functionality that they claim makes them "global."
Global deduplication in software
In backup software, "source side" dedupe runs on the client server, where data blocks are hashed and keys created by the backup client. But each key is compared with a hash table that's stored on the backup server, media agent or on dedicated backup storage hardware -- not on the client server. Unique blocks and keys are sent to these same devices, and duplicates are referenced by the client. All clients share the same "global" index and pool of unique blocks as opposed to earlier client-side "local" dedupe processes that only compared blocks within each client server's backup jobs. In general, source-side dedupe was made significantly more effective when a shared hash table was implemented. As a variant, some applications have the option to run software dedupe on the media agent or backup server instead of the client, a process called "target-side" dedupe, similar to the way hardware deduplication is implemented.
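A small sketch can make the difference between a shared index and per-client indexes concrete. Everything here is hypothetical: `BackupServer`, `BackupClient`, the 4-byte block size and the SHA-256 hash are illustrative assumptions, and the server is modeled as an in-memory object rather than a networked service.

```python
import hashlib


def block_keys(data: bytes, block_size: int = 4):
    """Yield (hash key, block) pairs; the tiny block size is for illustration."""
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        yield hashlib.sha256(block).hexdigest(), block


class BackupServer:
    """Holds the shared ("global") hash table and pool of unique blocks."""

    def __init__(self):
        self.blocks = {}  # hash key -> unique block

    def has_key(self, key: str) -> bool:
        return key in self.blocks

    def put(self, key: str, block: bytes) -> None:
        self.blocks[key] = block


class BackupClient:
    """Hashes blocks locally but checks each key against the server's index."""

    def __init__(self, server: BackupServer):
        self.server = server
        self.blocks_sent = 0

    def backup(self, data: bytes):
        recipe = []  # ordered list of keys needed to rebuild this backup
        for key, block in block_keys(data):
            if not self.server.has_key(key):
                # Only unique blocks cross the wire.
                self.server.put(key, block)
                self.blocks_sent += 1
            recipe.append(key)
        return recipe
```

If two clients attached to the same `BackupServer` back up `b"AAAABBBBCCCC"` and `b"BBBBCCCCDDDD"` in that order, the first sends three blocks and the second sends only one. Giving each client its own server object, which models the earlier "local" approach, would have each of them send all three of its blocks.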
Global deduplication in hardware
One of the earliest implementations of dedupe was within a dedicated backup appliance, which connected to the backup server and presented itself as a NAS device or virtual tape library (VTL). It was essentially a large local dedupe system, since the process was performed in a single box. In response to the need to scale, however, some of these target-side dedupe hardware vendors have also come out with their version of global dedupe. These systems essentially combine two separate dedupe processors with an expanded storage capacity and share the hash table between them, enabling them to scale into the low hundreds of terabytes. There are other hardware-based dedupe systems that have leveraged this same shared-controller/shared-hash-table design but expanded it to a multiple-node architecture to support even larger data sets.
Similar to the target-side dedupe appliance, there are backup systems that scale by adding modules or nodes. The dedupe processing is done by a dedicated node or nodes, which share the index across all the nodes storing data blocks. These systems, which present as a large NAS or VTL to the backup software, can scale into the petabyte range.
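One way to picture a single index shared across storage nodes is the sketch below. It is an assumption-laden illustration rather than a description of any shipping product: the `StorageNode` and `DedupeCluster` classes, the least-full placement rule and the in-memory dictionaries are all invented for the example.

```python
import hashlib


class StorageNode:
    """A storage module that holds unique blocks."""

    def __init__(self, name: str):
        self.name = name
        self.blocks = {}  # hash key -> block stored on this node


class DedupeCluster:
    """Dedupe front end whose one index spans every storage node."""

    def __init__(self, node_names):
        self.nodes = [StorageNode(n) for n in node_names]
        self.index = {}  # hash key -> node holding the block (shared index)

    def add_node(self, name: str) -> None:
        # Scaling out adds capacity without splitting the index.
        self.nodes.append(StorageNode(name))

    def write(self, block: bytes) -> str:
        key = hashlib.sha256(block).hexdigest()
        if key not in self.index:
            # Placement policy is an assumption: pick the least-full node.
            target = min(self.nodes, key=lambda n: len(n.blocks))
            target.blocks[key] = block
            self.index[key] = target
        return key

    def read(self, key: str) -> bytes:
        return self.index[key].blocks[key]
```

Because every write consults the same index, a block already stored on one node is never stored again on another, and a newly added node simply takes new unique blocks as they arrive.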
Some clustered storage systems, also called object-based storage, now offer deduplication but differ from clustered backup appliances. This node-based topology supports extremely large and distributed infrastructure, and its use of data objects instead of files is well-suited for distributed, global deduplication. These systems typically run the hash calculation on objects within each node, compiling a hash index for the node, but share its access with other nodes. Not specifically designed for backup, they represent one way that dedupe has moved into primary storage.
For a VAR, understanding various vendors' products in a space like deduplication is essential, since customers frequently call on VARs to explain the differences between technologies that carry the same label, such as "deduplication." Adding the label "global" to a dedupe system most often indicates that it's more efficient than a similar "local" system or that it can scale to larger capacities without degrading performance. To the extent that scalability is needed, a global data deduplication system would be more desirable than a "local" dedupe system, provided the cost is appropriate.
About the author
Eric Slack, a senior analyst for Storage Switzerland, has more than 20 years of experience in high-technology industries holding technical management and marketing/sales positions in the computer storage, instrumentation, digital imaging and test equipment fields. He's spent the past 15 years in the data storage field, with storage hardware manufacturers and as a national storage integrator, designing and implementing open systems storage solutions for companies in the Western United States.