Since the first bytes were written to magnetic media, the desire to store data has outpaced our ability to store it affordably. Magnetic tape, the first true mass-storage medium, sought to solve the problem with storage compression.
Fortunately, more processing power has become available to compress data. Compressing more data into smaller spaces is no fad. Storage is often the largest expense in the data center, so it makes sense to use modern processing power to reduce the amount of storage required to meet the needs of the business. In this tech tip, I'll discuss storage compression options to present to your customers, including traditional compression and data deduplication tools.
There are two basic methods of compression: lossy and lossless. Lossy compression reduces the size of a file by discarding bits whose loss will not drastically affect the quality of the information as perceived by a human. Examples include MP3 audio files and JPEG images. Lossy compression is commonly used at the application layer, and works well there. Data owners may elect to exchange the integrity of the original information for reduced storage space, but infrastructure people rarely have the liberty to trade the quality of the data entrusted to them for disk space. Therefore, your customer may only have lossless compression techniques at their disposal.
Traditional storage compression
The oldest and most widespread storage compression technique is traditional compression. This method works best on plain text, raw images and database files. The compression engine examines a relatively small segment of data, looking for patterns that can be reduced. For example, the ASCII string "aaaaabbb" could be reduced to "a5b3", shrinking eight bytes to four. The compression engine can be implemented in hardware or software.
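The "aaaaabbb" to "a5b3" reduction is a form of run-length encoding, and it can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular product works; the function names are hypothetical, and the decoder assumes the original text contains no digit characters.

```python
import re
from itertools import groupby


def rle_encode(text: str) -> str:
    """Collapse each run of a repeated character into <char><run length>."""
    return "".join(f"{char}{len(list(run))}" for char, run in groupby(text))


def rle_decode(encoded: str) -> str:
    """Expand <char><run length> pairs back into the original text.

    Assumes the original text contained no digits, so every digit
    in the encoded form belongs to a run length.
    """
    pairs = re.findall(r"(\D)(\d+)", encoded)
    return "".join(char * int(count) for char, count in pairs)


encoded = rle_encode("aaaaabbb")
print(encoded)              # a5b3
print(rle_decode(encoded))  # aaaaabbb
print(rle_encode("abc"))    # a1b1c1
```

The last line shows why this only pays off when the data contains patterns: with no repeated runs, the "compressed" form is larger than the input.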
Traditional hardware compression uses a dedicated microprocessor designed specifically to handle the compression workload without restricting throughput. In almost all circumstances, hardware compression will provide better performance, both in speed and compacting ability, as compared to software compression. Hardware compression has been a staple of tape drive technology for quite some time, and all modern physical tape drives have it built in. Network Appliance recently announced that its NearStore Virtual Tape Library now supports hardware compression. The high overhead of software compression means that most VTLs take a 50% hit in throughput performance when compression is enabled. The Network Appliance device, however, is actually reported to run faster when hardware compression is enabled!
Traditional software compression uses a server or storage controller's main processor to compact data. This is generally slower and less efficient than hardware compression but offers a significant advantage: It's cheaper to implement and update. Software compression is everywhere: You can find it in backup software clients, which compress backup data at its source, saving network bandwidth. Some server file systems, like NTFS, JFS and ZFS, also support compression, increasing the usable capacity of the file system at the expense of I/O performance. In some cases, applications also use compression mechanisms. For example, IBM's DB2 database engine claims a 50% savings in disk space when compression is enabled. Most VTLs also support software compression, trading throughput performance for density.
When you have a choice, in almost all cases it will be favorable to use hardware-based compression over software compression. It is important to avoid using both hardware and software compression on the same data stream. Doing so will yield little or no capacity improvement, but will certainly slow throughput.
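The effect of compressing the same stream twice can be seen with Python's standard zlib module. The sample data below is made up for illustration (pseudo-random log-like text built from a small vocabulary), and zlib stands in for whatever hardware or software engines are actually in the data path.

```python
import random
import zlib

# Hypothetical sample data: text built from a small vocabulary,
# loosely resembling log or database content.
rng = random.Random(0)
words = ["backup", "restore", "volume", "tape", "status", "ok", "error", "retry"]
data = " ".join(rng.choice(words) for _ in range(20000)).encode("ascii")

once = zlib.compress(data)   # first compression pass
twice = zlib.compress(once)  # second pass over already-compressed data

print("original bytes:        ", len(data))
print("compressed once bytes: ", len(once))
print("compressed twice bytes:", len(twice))
```

The first pass removes the patterns, so its output looks close to random. The second pass has almost nothing left to remove and spends processor time for no meaningful gain in capacity.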
Data deduplication is much like traditional compression, except that it operates on much larger datasets, eliminating duplicate chunks across all of the data under management. The deduped data is then often compressed using more traditional pattern-based compression techniques. The amount of space required to store deduped data is highly dependent upon the amount of redundancy in the data. Some dedupe vendors, like Data Domain, claim that their technology (Global Compression, in Data Domain's case) can obtain an average reduction ratio of 20:1. Until recently, data deduplication was only available at the file level. As more processing power becomes available, deduplication mechanisms work on smaller chunks of data, down to the byte level.
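A minimal sketch of chunk-level deduplication follows. It assumes fixed-size chunks identified by a SHA-256 hash, which is one common approach and not a description of any vendor's engine; the chunk size, function names and sample "backups" are all hypothetical.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen for illustration


def dedupe(stream: bytes, store: Dict[bytes, bytes]) -> List[bytes]:
    """Store each unique chunk once and return the list of chunk hashes
    (the "recipe") needed to rebuild the stream."""
    recipe = []
    for offset in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).digest()
        store.setdefault(digest, chunk)  # keep only the first copy
        recipe.append(digest)
    return recipe


def restore(recipe: List[bytes], store: Dict[bytes, bytes]) -> bytes:
    """Rebuild the original stream from its recipe."""
    return b"".join(store[digest] for digest in recipe)


# Two hypothetical backups that share two of their three chunks.
chunk_a, chunk_b = b"A" * CHUNK_SIZE, b"B" * CHUNK_SIZE
chunk_c, chunk_d = b"C" * CHUNK_SIZE, b"D" * CHUNK_SIZE
monday = chunk_a + chunk_b + chunk_c
tuesday = chunk_a + chunk_b + chunk_d

store: Dict[bytes, bytes] = {}
monday_recipe = dedupe(monday, store)
tuesday_recipe = dedupe(tuesday, store)

print("logical chunks:", len(monday_recipe) + len(tuesday_recipe))  # 6
print("stored chunks: ", len(store))                                # 4
```

The more the backups overlap, the fewer new chunks each one adds to the store, which is why the achievable ratio depends so heavily on redundancy in the data.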
Many deduplication tools are already on the market. They are implemented as standalone software products and embedded directly into storage hardware. Symantec's PureDisk is a software product that bolts on to NetBackup, deduplicating backup streams. It is often leveraged to reduce the total bandwidth required to back up remote offices. EMC recently purchased Avamar Technologies, which produces a deduplication software product called Axion. It runs on the host and allows storage shops to write deduplicated archives to any existing disk technology, enabling long-term cost-effective archive storage.
EMC's Centera CAS array was one of the first storage devices to implement file-level deduplication. While the Centera's deduplication services are not as powerful as more cutting-edge, byte-level dedupe products, they are a stable and innovative way to efficiently store archive data. Newer to the market is Data Domain, which builds dedupe engines into its storage arrays or, alternatively, front-ends any existing storage with dedupe gateways.
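File-level deduplication can be illustrated with a toy content-addressed store, in which a file's address is a hash of its contents. This is a generic sketch of the concept, not a description of how the Centera or any other product is built; the class and method names are hypothetical.

```python
import hashlib
from typing import Dict


class ContentAddressedStore:
    """Toy file-level dedupe: identical files share a single stored object."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        """Store a file and return its content address (a SHA-256 hex digest)."""
        address = hashlib.sha256(content).hexdigest()
        self._objects.setdefault(address, content)  # duplicates add nothing
        return address

    def get(self, address: str) -> bytes:
        """Fetch a file by its content address."""
        return self._objects[address]

    def object_count(self) -> int:
        """Number of unique objects physically stored."""
        return len(self._objects)


cas = ContentAddressedStore()
first = cas.put(b"quarterly report")
second = cas.put(b"quarterly report")    # same file archived again
edited = cas.put(b"quarterly report v2")  # small edit, whole new object

print(first == second)     # True
print(cas.object_count())  # 2
```

The edited file shows the limit of working at the file level: a small change produces an entirely new object, whereas finer-grained dedupe would store only the part that differs.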
Data deduplication is building momentum. The swell of information pooling up in every data center makes deduplication one of the most important developments to hit storage technology users in at least a decade. As more manufacturers scramble to capitalize on the frenzy, you'll see more VTLs, intelligent fabrics, host software and disk arrays implement sophisticated data deduplication compression mechanisms.
About the author: Brian Peterson is an independent IT infrastructure analyst. He has a deep background in enterprise storage and open systems computing platforms. A recognized expert in his field, he has held positions of great responsibility on both the supplier and customer sides of IT.
This was first published in April 2007