While multiple factors cause data to grow and take on a life of its own, the growth is mostly driven by email and digital media files. The files in the latter group typically aren't the kinds of files that are used for legitimate business at most companies, but they certainly have a way of making it onto the corporate network. During one data migration project for a financial services customer, it became apparent to me that much of that data wasn't important to the business; the users had just accumulated it over time and didn't want to part with it, even though most of the files hadn't been accessed in years. Users hang onto those files for a couple of reasons: They think that someday they may need them, and/or it takes too much time to sort through them and figure out what's really important.
When viewed from the employee's perspective, those extraneous files seem innocuous. But, of course, from the macro perspective, the data growth problem can add costs and headaches in many areas. For example, in the data migration project I mentioned above, while moving data from Windows servers to a network-attached storage (NAS) environment, the cutover nights took more time than was budgeted because users' Exchange email archives were so big (3 GB to 6 GB each). When you're trying to migrate 100-plus users a week, that adds up to roughly 300 GB to 600 GB of email archives alone each week.
To try to control the problem of overburdened and slow email databases, many IT departments have instituted space quotas on email accounts. Many users respond to the quota by archiving their email to their home directory on the network, which essentially shifts the burden from the email server to network storage devices and leads to the problem I ran into in the data migration project I just mentioned. Saving that email data onto storage systems has a large ripple effect. Data saved on a file system gets captured in snapshots, gets replicated to secondary arrays and gets backed up onto a virtual tape library (VTL) or traditional tape library. One way to combat that data growth problem and cut down on the costs associated with such bad user habits is by tiering data off of expensive primary storage.
We know that a large percentage of data has not been accessed for over a year. For instance, one analysis I ran within a global HSM environment showed that we'd save roughly 50% of storage space by archiving files that had not been accessed in a year. (While some experts suggest that you can move 70% of data off of primary storage, I feel comfortable using 50% as the estimate. And while 50% is substantially lower than 70%, it's still a huge number.)
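An analysis of this kind can be approximated with a short script. The following is a minimal Python sketch, not the tool used in the HSM analysis described above; the function name `stale_fraction` and its parameters are hypothetical. It walks a directory tree and reports what share of the bytes belongs to files whose last-access time is older than a cutoff (one year by default).

```python
import os
import stat
import time

SECONDS_PER_DAY = 86400


def stale_fraction(root, days=365, now=None):
    """Return (stale_bytes, total_bytes, fraction) for regular files under root.

    A file counts as stale when its last-access time is more than `days` days
    before `now`. Hypothetical helper for illustration only.
    """
    now = time.time() if now is None else now
    cutoff = now - days * SECONDS_PER_DAY
    stale = total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symbolic links
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # count regular files only
            total += st.st_size
            if st.st_atime < cutoff:
                stale += st.st_size
    fraction = stale / total if total else 0.0
    return stale, total, fraction
```

Calling `stale_fraction("/path/to/share")` on a sample share (the path is a placeholder) gives a rough idea of whether a figure near 50% holds for a particular environment before any tiering product is purchased.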
There are some things to be wary of when applying storage tiering to customers' unstructured data. For example, if the data is moved from a Windows/Unix environment to a NAS environment, make sure that the files' metadata (stored in the inode on Unix file systems) is retained. The most important metadata, at least when it comes to tiering, are the creation, modified and access dates. If that metadata isn't retained, every file will show the wrong access date after the migration, because the move resets the access date to the current date. (To learn how to retain the inode information during a data migration, refer to the documentation for whatever migration tool you use, such as robocopy or rsync.) If the original metadata is lost, it is impossible to tier data correctly.
Assuming the metadata has been properly retained, you should archive data based on the access date of the files; the access date changes every time a file is opened. (Using the modified date is not advisable since a file doesn't need to be modified to be of recent value to an organization. Think of how many times you open a Word or Excel document but never change it.)
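The difference between the two dates can be checked directly. The following minimal Python sketch (the function name `misjudged_by_mtime` is hypothetical) lists files that a modified-date policy would archive even though they were opened within the cutoff period. It assumes the file system actually records access dates; some systems are configured to update them rarely or not at all (for example, the `noatime` and `relatime` mount options on Linux), in which case the access date is less reliable.

```python
import os
import time

SECONDS_PER_DAY = 86400


def misjudged_by_mtime(root, days=365, now=None):
    """Return sorted paths that are old by modified date but recent by access date.

    These are the files a modified-date archiving policy would move off
    primary storage although users still open them. Illustration only.
    """
    now = time.time() if now is None else now
    cutoff = now - days * SECONDS_PER_DAY
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # skip files that vanished or cannot be read
            modified_long_ago = st.st_mtime < cutoff
            accessed_recently = st.st_atime >= cutoff
            if modified_long_ago and accessed_recently:
                hits.append(path)
    return sorted(hits)
```

A long list from a sample share suggests that users regularly read documents they never change, which is the case the access date is meant to catch.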
One big mistake I have seen some customers make is to tier their data without moving it to cheaper drives. In other words, they go through the pain and suffering of having an inline device move their users' data onto another drive, but the other drive is, for example, Fibre Channel. If the main reason you want to use storage tiering on your customers' data is to reduce their capital expenditure, then you need to migrate data to less expensive storage, such as SATA drives. If you're going to go through the process of moving data to control data growth, it should be done to save capital expenditure and not just to free up disk space on the primary drives.
About the author
Seiji Shintaku is a principal consultant for RTP Technology. Before joining RTP Technology, he was global NetApp engineer for Lehman Brothers, Celerra and DMX engineer for Credit Suisse First Boston, principal consultant for IBM, and global Windows engineer for Morgan Stanley. RTP Technology is a VAR for storage-related products and professional services for NetApp, EMC, F5, Quantum, VMware and Brocade. He can be reached at email@example.com.
This was first published in November 2009