The use of compression and dedupe, in particular, really complicates the sizing calculation. The data reduction rates from those techniques will be different at different stages of the backup process and on different types of data. Beyond that, decreasing disk costs, plus these data reduction techniques, mean that customers expect to keep data on disk for a significantly longer time than in the past; those expectations also need to be considered when determining the optimum VTL size.
Here's a process to follow to determine how big your customer's virtual tape library needs to be:
- Determine how big the existing data set intended for backup is. For the sake of this exercise, we'll assume a 10 TB data set to be backed up.
- Determine what percentage of that data consists of databases and messaging environments and what percentage consists of other types of files. Databases have to be treated specially: Even though most backup applications can back up databases "hot," most still back up the entire database every night, so there is a lot of redundancy within backup sets. Beyond that, databases, as well as messaging systems, compress very well.
- Determine the weekly change rate. You should be able to derive this from the customer's backup application. The simplest way is to have the customer execute a differential backup job the day before the next full job starts. In most backup applications, a differential is a backup of all the data that has changed since the last full; executing such a job the day before a full provides a fairly accurate estimate of what has changed during the week. Our example uses a 10% weekly data change rate.
- Based on the types of data to be backed up and deduplication/compression, calculate the amount of data resulting from the first full backup. The first full backup won't see significant gains from deduplication, but there will be some. For example, if your customer is backing up 20 Windows servers, the core OS files are likely to be similar across those servers. Using a relatively small deduplication factor, 2X, is generally a safe starting point. All the data is eligible for compression, but not all data compresses to the same degree; for example, databases and text files can be compressed by 90% or more, whereas JPEG and Office 2007 documents may not compress at all. A good rule of thumb is to figure on a compression rate of 50% across the data set. The goal is to be relatively conservative with your calculations: While you don't want to oversize the solution, undersizing it is worse. With most systems, compression happens before deduplication, so applying these numbers to a 10 TB full backup yields 5 TB after compression and 2.5 TB after deduplication.
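As a sanity check, the first-full arithmetic can be sketched in a few lines. The 50% compression and 2X deduplication figures are the rule-of-thumb assumptions from this article, not fixed properties of any particular product:

```python
def first_full_size_tb(data_set_tb, compression=0.50, dedupe_factor=2.0):
    """Estimate the landed size of the first full backup on a VTL
    that compresses before deduplicating (the common order).
    Ratios are rule-of-thumb assumptions, not guarantees."""
    compressed = data_set_tb * compression   # 50% compression
    return compressed / dedupe_factor        # 2X deduplication

print(first_full_size_tb(10))  # prints 2.5 (10 TB -> 5 TB -> 2.5 TB)
```

Because compression runs first, swapping the order of the two factors doesn't change this simple multiplication, but real systems report the two ratios separately, so it's worth keeping them as distinct inputs.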
- Calculate the amount of data resulting from the daily incremental backups. With the big exception of databases and messaging system data, the data in this job will mostly be net-new and, like the initial full, will not deduplicate at a high rate, so plan on about a 2X reduction from deduplication. Compression follows the same guidelines as above, but be aware that in many cases this data is likely to be Office 2007 documents and as a result may not compress well. Databases and messaging systems are a different animal and need to be treated separately. Most backup applications, even when they do hot backups, still send a full copy of the database/messaging system to the enterprise VTL. Not only does all this data compress very well (beyond the 50% figure above), it also deduplicates very well: The majority of a database is identical to the previously backed-up copy, so the level of redundancy is very high. Take guidance from the customer, but in general, daily database growth is relatively small. The data resulting from the daily incrementals is covered in the next step, since the weekly backup is a roll-up of the dailies.
- Calculate the amount of data resulting from the weekly backups. Using the example of a 10% weekly change rate on a 10 TB data set, we'd have 1 TB from the weekly backup. Of this 1 TB, it's not uncommon for at least 250 GB to come from databases, which reduce by about 90%, leaving roughly 25 GB of real net-new database growth. The remaining 750 GB will likely compress by 50%, down to 375 GB, and with a deduplication rate similar to the initial full's 2X, down to 187.5 GB of net-new data per week after compression and deduplication, for a total of 212.5 GB of data from the weekly backup in our example.
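The weekly arithmetic can be sketched the same way. The default values below are the example's assumptions (10% weekly change, 25% of the change coming from databases that reduce by about 90%, 50% compression and 2X deduplication on the rest), and should be replaced with figures measured at the customer site:

```python
def weekly_net_new_gb(data_set_gb, change_rate=0.10, db_share=0.25,
                      db_reduction=0.90, compression=0.50,
                      dedupe_factor=2.0):
    """Estimate net-new data landed on the VTL per week after
    compression and deduplication. All ratios are assumptions."""
    changed = data_set_gb * change_rate              # 1,000 GB in the example
    db_data = changed * db_share                     # 250 GB of database change
    db_net = db_data * (1 - db_reduction)            # ~25 GB of real growth
    other = changed - db_data                        # 750 GB of file data
    other_net = other * compression / dedupe_factor  # 375 GB -> 187.5 GB
    return db_net + other_net

print(weekly_net_new_gb(10_000))  # prints 212.5
```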
- Calculate the amount of data resulting from subsequent fulls. Subsequent full backups will have a very high level of redundancy with the backups already run; by definition, 90% of the full backup didn't change, and the 10% weekly change was already picked up by the daily backup jobs. As a result, the full backup will contribute very little new data and should be calculated like an additional weekly job. In our example, that's another 212.5 GB per week in a worst-case scenario.
- Determine how long the customer intends to keep the data on disk. The longer the customer keeps data on disk the more the deduplication ratio should improve: There will be more full backups and therefore a greater chance of duplicate data.
- Add everything up. Using our example factors (2.5 TB from the first full backup, 212.5 GB per week from the weekly backups and another 212.5 GB per week from the subsequent fulls, or about 425 GB of net-new data per week), a 20 TB VTL should be able to store about 8 months' worth of backups on disk, assuming no abnormal data growth.
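Pulling the steps together, a minimal retention sketch under the example's assumptions:

```python
def retention_weeks(vtl_capacity_gb, first_full_gb, weekly_gb, full_gb):
    """Weeks of backups a VTL can hold: capacity left after the
    first full, divided by net-new data landed per week.
    Assumes no abnormal data growth."""
    per_week = weekly_gb + full_gb   # 425 GB/week in the example
    return (vtl_capacity_gb - first_full_gb) / per_week

weeks = retention_weeks(20_000, 2_500, 212.5, 212.5)
print(round(weeks, 1))  # prints 41.2
```

The raw arithmetic gives roughly 41 weeks of theoretical headroom; quoting "about 8 months" (roughly 35 weeks) against that number builds in margin for data growth, which fits the article's advice to size conservatively.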
Once you've determined the sizing requirements for your customer, look for a VTL that can scale easily without interrupting the backup process. Essentially, this means picking a virtual tape library that can be configured at the initial size based on your forecast, can scale in granular increments and can quickly take advantage of new drive technologies as they become available.
About the author
George Crump is president and founder of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments. With 25 years of experience designing storage solutions for data centers across the United States, he has seen the birth of such technologies as RAID, NAS and SAN. Prior to founding Storage Switzerland, George was chief technology officer at one of the nation's largest storage integrators, where he was in charge of technology testing, integration and product selection. Find Storage Switzerland's disclosure statement here.
This was first published in May 2009