Explaining deduplication rates and single-instance storage to clients

In order for clients to make an educated purchasing decision on a deduplication system, they need to understand the key factors that influence deduplication rates. Learn about these factors and further your understanding of single-instance storage.

Solution provider takeaway: Learn how deduplication rates could influence your clients' deduplication purchasing choice, as well as how to clarify single-instance storage and perform proper testing of deduplication systems.

Deduplication almost doesn't need to be defined anymore, but just to make sure we're all on the same page, I'll define it here: It's the process of identifying redundant segments of data and storing only the unique instances of that data. The results are most beneficial in repetitive data copy processes like backup and archiving.
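If it helps to make that definition concrete for a customer, here is a minimal sketch of the idea in Python. It is an illustration only, not any vendor's implementation: the fixed 4 KB segment size, the use of SHA-256 as the fingerprint and the function names dedupe and restore are all assumptions chosen for the example.

```python
import hashlib

def dedupe(data: bytes, segment_size: int = 4096):
    """Split data into segments and keep one copy of each unique segment."""
    store = {}    # fingerprint -> segment bytes (unique segments only)
    recipe = []   # ordered fingerprints needed to rebuild the original data
    for offset in range(0, len(data), segment_size):
        segment = data[offset:offset + segment_size]
        fingerprint = hashlib.sha256(segment).hexdigest()
        store.setdefault(fingerprint, segment)  # store only if not seen before
        recipe.append(fingerprint)
    return store, recipe

def restore(store, recipe) -> bytes:
    """Rebuild the original data from the unique segments and the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)

# Hypothetical backup stream: ten identical segments followed by one different one.
backup = b"A" * 4096 * 10 + b"B" * 4096
store, recipe = dedupe(backup)
stored_bytes = sum(len(segment) for segment in store.values())
```

In this toy stream, eleven segments go in but only two unique segments are kept, and restore(store, recipe) returns the original bytes. Real products differ in how they pick segment boundaries and fingerprints, which is where the rate differences discussed below come from.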

But exactly how beneficial will deduplication be? While storage efficiencies of 20X are not uncommon (that is, only one-twentieth of the data will need to be stored), the actual rate you see might be lower. Since you're out there on the front lines, it's critical that you set accurate customer expectations. To do that, you'll need to understand all the factors that play into deduplication rates and educate customers before they commit to a product. Those rates will vary depending on factors such as the deduplication technique being used, data types and data sources. And when a customer reaches the testing phase, it's key that the tests reflect real-world conditions.

Realistic deduplication rates

The first question that a customer will ask about deduplication is, "How much space will I save?" The only right answer is that it depends on a number of factors.

In addition to factors such as data type and source, which I'll discuss in more detail below, deduplication rates will vary depending on the data change rate and the length of the retention period. From a data perspective, there are two processes common to deduplication: backup and archiving.

The first full backup will generate some level of deduplication as redundancy is identified across files, volumes and servers within the enterprise. Be careful here, though -- some deduplication systems are not global; they only dedupe on a single server or volume. That said, typical rates for the first full backup can be 2X to 4X efficiencies.

Subsequent incremental backup jobs will typically capture efficiencies of 6X or 7X. Most of the data in an incremental backup consists of either new or modified documents or updated database or email stores. Even if the documents are new, a comparison can be made to similar files for redundant patterns. The data segments that together represent the modified files will be compared to data segments of the original copy, and only the changed segments need to be stored. Because they tend to be very large files, databases will be the big gainer. For example, a 200 GB Oracle database that only had a 1% change during the course of the day will only require the storing of 2 GB of new data rather than the entire 200 GB that would be stored without deduplication.

Subsequent full backups will see a 50X to 60X reduction in data stored. This is because, as a percentage, there is not much changed data between two full backups, and in the case of deduplication a high percentage of those changes were captured during the incremental jobs; essentially, from a storage perspective, subsequent fulls require no more space than the prior incremental.
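To show a customer how those per-job figures roll up into one overall rate, a back-of-the-envelope model like the following can help. Treat it as a rough sketch under stated assumptions: the 1 TB environment, the 50 GB nightly incrementals and the weekly full schedule are hypothetical, and the reduction factors are simply midpoints of the ranges quoted above.

```python
def cumulative_ratio(jobs):
    """jobs is a list of (logical_gb, reduction_factor) pairs.

    Returns total logical data protected divided by total data actually stored.
    """
    logical = sum(size for size, _ in jobs)
    stored = sum(size / factor for size, factor in jobs)
    return logical / stored

def schedule(weeks: int):
    """Hypothetical schedule: one initial full, then weekly cycles."""
    jobs = [(1000, 3.0)]              # first full: 1 TB at an assumed 3X
    for _ in range(weeks):
        jobs += [(50, 6.5)] * 6       # six nightly incrementals at an assumed 6.5X
        jobs += [(1000, 55.0)]        # subsequent weekly full at an assumed 55X
    return jobs

four_week_ratio = cumulative_ratio(schedule(4))
```

With these assumed inputs, four weeks of retention works out to roughly 10X overall, and the figure climbs toward roughly 20X as more weeks are retained. That is why the retention period matters so much when a customer asks how much space they will save.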

Single-instance storage vs. deduplication

In the channel, you'll often hear the term "single-instance storage" used synonymously with deduplication. They're different, but it's easy for customers to get confused.

Single-instance storage (SIS) is a form of data reduction, but it's not data deduplication. The difference between SIS and deduplication lies in the level of granularity that can be applied. As explained above, data deduplication works at a segment or sub-block level; SIS works at the file level and eliminates redundant copies of files.

Here's an illustration of the difference between SIS and deduplication: Say there's a PowerPoint file that has been stored in the home directory of each member of the marketing department. Because the copies are identical, single-instance storage would store only one copy of the file, and so would data deduplication. If the company then changed its logo and each marketing person updated the presentation with the new logo, SIS would save every new version of the file in full; data deduplication would store only the segments that changed in each file.

An even better example of the difference between SIS and deduplication is seen with databases. Since changes are made every day to databases, at every backup, the database appears to the backup application as a new version of a file and is sent to the backup target as such. A SIS-based system would also see this as a new version of the file and store a new copy of the file each day. A data deduplication system would store only the blocks of the database that had changed from the previous night's backup.
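The granularity difference can be sketched in a few lines of Python. This is a simplified illustration, not how any particular SIS or deduplication product works: the 4 KB fixed segments, the SHA-256 fingerprints, the 50-segment "presentation" and the helper names are all assumptions made for the example.

```python
import hashlib

SEGMENT = 4096

def sis_stored_bytes(files):
    """File-level (SIS): keep one copy of each distinct whole file."""
    unique = {hashlib.sha256(f).digest(): len(f) for f in files}
    return sum(unique.values())

def dedupe_stored_bytes(files, segment_size=SEGMENT):
    """Segment-level: keep one copy of each distinct segment across all files."""
    unique = {}
    for f in files:
        for i in range(0, len(f), segment_size):
            segment = f[i:i + segment_size]
            unique[hashlib.sha256(segment).digest()] = len(segment)
    return sum(unique.values())

def edited_deck(user: int) -> bytes:
    """Hypothetical 50-segment presentation in which one segment is user-specific."""
    segments = [k.to_bytes(2, "big") * (SEGMENT // 2) for k in range(50)]
    segments[0] = (1000 + user).to_bytes(2, "big") * (SEGMENT // 2)
    return b"".join(segments)

decks = [edited_deck(user) for user in range(10)]
sis_bytes = sis_stored_bytes(decks)        # ten full files
dedupe_bytes = dedupe_stored_bytes(decks)  # 49 shared segments + 10 edited ones
```

Because every edited copy is a different file, the file-level approach keeps all ten in full, while the segment-level approach keeps the 49 shared segments once plus the ten segments that differ.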

Single-instance storage is typically implemented by the backup or archive software, whereas deduplication is typically performed at a standalone storage appliance. Software-based SIS operates only on the duplicate data that is actually processed by the backup or archive application. In practice, though, redundant copies usually come from a variety of sources. In databases, that can be the backup application, the built-in database backup utility or an external third-party application; in many data centers, all three are used. Deduplication systems, as a target for all of these sources, work across all of the data, yielding a much higher level of efficiency than a SIS implementation.

Block-level deduplication vs. variable segment deduplication

Any customer who wants to talk about block-level deduplication vs. variable segment deduplication is interested in the more technical aspects of the technology. The main challenge for fixed block-level data deduplication comes from block shifting. Block shifting occurs when all the data in a file is rewritten on Save or Save As, so that even a small insertion moves the data that follows across different block boundaries. The challenge is that some fixed block-level systems may identify this shifted data as unique.

Systems that deduplicate on variable-length segments, on the other hand, anchor segments based on smaller data patterns and as a result are less sensitive to block shifting because they can pick up commonality within the file even after the file has been rewritten.
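For customers who want to see block shifting in action, the following Python sketch compares the two approaches on a file that has had a single byte inserted at the front. It is a teaching example under assumptions, not a vendor algorithm: the simple Gear-style rolling hash, the random lookup table, the 1 KB block size and the boundary mask are all stand-ins picked for illustration.

```python
import hashlib
import random

_table_rng = random.Random(0)   # fixed seed so the example is repeatable
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]

def fixed_chunks(data: bytes, size: int = 1024):
    """Cut the data every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data: bytes, mask: int = 0x3FF):
    """Cut wherever a rolling hash of the most recent bytes hits a pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF   # older bytes shift out of the hash
        if (h & mask) == 0:                        # content-defined boundary
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def new_segments(old_chunks, new_chunks) -> int:
    """Count chunks in the new version that were not present in the old one."""
    known = {hashlib.sha256(c).digest() for c in old_chunks}
    return sum(1 for c in new_chunks if hashlib.sha256(c).digest() not in known)

data_rng = random.Random(1)
original = bytes(data_rng.getrandbits(8) for _ in range(65536))
shifted = b"X" + original       # one inserted byte shifts everything after it

fixed_new = new_segments(fixed_chunks(original), fixed_chunks(shifted))
variable_new = new_segments(variable_chunks(original), variable_chunks(shifted))
```

With fixed blocks, every block in the shifted copy looks new because each one now starts one byte later. With content-defined boundaries, the cut points move along with the data, so only the segment or so nearest the insertion has to be stored again.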

Be realistic in your testing

A common mistake resellers and their customers make when performing tests or evaluations is to evaluate for a short period of time and not run real-world simulations. When testing a deduplication system, you should test both multiple-stream backup -- a lot of data from a lot of computers -- and single-stream backup -- a large database or file from a single server. Make sure that performance is acceptable under both conditions. Test all types of data in the environment: large files, images, data from the backup applications, and direct copies from operating system or database utilities.

Most importantly, you should test recovery performance from older generations of backup data. Without proper built-in intelligence, it is quite easy for data deduplication systems to become fragmented, significantly affecting restore performance of older files and backup sets. Restore performance can drop well below what it should be, guaranteeing issues down the road.

Be realistic about OEM relationships

You know how this works: OEM relationships are almost always a matter of convenience. They come about when a market is legitimized before a major vendor has had a chance to respond; the vendor will OEM the technology to get a toe in the market. In some cases, the vendor is not adding any value to the relationship and is merely trying to cash in on the revenue stream.

Typically, when a vendor OEMs a product, they have to handle the bulk of the support calls; because they didn't produce the product, they're not as well-equipped to handle those calls. Even organizations that are known for excellent service on their own products can be slow to understand products they OEM. Many times these OEM relationships are not designed to last, leaving customers in limbo.

In the deduplication market, a storage vendor might OEM a data deduplication application to accompany its storage product. If you recommend an OEMed deduplication system to a customer, they might end up unhappy with the recommendation due to poor support. For more on this topic, see my Channel Marker blog entry on manufacturer innovation.

References are key

As with any other data center technology gaining in popularity, suppliers are rushing to catch up with the deduplication market, and sometimes hastily so. Customers are yearning for the technology. You as the trusted advisor need to be the voice of reason. Develop a set of your own references that can speak to potential customers about their actual experience with the technology. Be prepared with these. It amazes me how often a channel representative gets that "deer in the headlights" look when asked for references. Being able to answer without hesitation will give the customer greater confidence that you're the go-to guy.

About the author

George Crump is president and founder of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments. With 25 years of experience designing storage solutions for data centers across the United States, he has seen the birth of such technologies as RAID, NAS and SAN. Prior to founding Storage Switzerland, George was chief technology officer at one of the nation's largest storage integrators, where he was in charge of technology testing, integration and product selection.

