In this podcast, W. Curtis Preston, backup expert and executive editor for TechTarget's Storage Media Group, explains how to approach backup testing and verification for your customers. Read the transcript below or download the podcast.
When customers don't do backup testing, what kinds of things happen?
It's more like what kinds of things don't happen -- and that's backups. I got bit with this early in my career, where I assumed that because backups were working, restores would work. And as we say in this world, backups are one thing, restores are everything. So what happens is that if they don't do backup testing, at a minimum, there are lessons that they're not going to learn until the wrong time -- things that they needed to include, instructions that were necessary. If you're not doing regular backup testing, you simply aren't familiar with how the process is going to work when you're restoring. The second [thing] is that you're going to be surprised, statistically speaking, with failed restores, because there are things being missed that you're simply not going to know about unless you're doing restore testing.
Is there any data on what percent of restores totally fail?
The numbers that I've seen out there are really, really bad. It's beyond 50% of production restores that fail, which is just not a good number at all.
How often do customers that do it right do backup testing?
At a minimum, once a year you need to do a major test, where you're testing restores of major applications, and hopefully, you're doing a disaster recovery test, where you're doing restores at an alternate location with alternate gear and alternate people. That's the bare minimum, and oddly enough, many people don't [do even that]. I would suggest that people actually automate restore testing. Get somebody who's good at scripting, and do restore testing. Do it once a day … just to make sure that your system functions. Restore a file once a day. Restore a large file system once a month. Your people should be familiar with doing restores and familiar with how long restores take and the process, and the only way you're going to do that is with restores done on a regular basis.
I am lucky that at the company that I worked at when I first started my career, we did about 10 restores a day. [I worked for] a very large bank, with something like 10,000 employees, and any user could call up and ask for a restore of any file at any time. And so we did restores all the time. I know for a fact that we made changes to our backups along the way because of things that we learned when we did those restores.
So, once a year at a minimum; once a month is a good thing. Once a week would be better. And I would like to automate daily backup testing. It doesn't hurt anybody, it just runs. … You'd do an alternate restore, where you restore the file to another location so you're not overwriting the production data.
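A scripted daily test of the kind described above might look something like the following minimal Python sketch. It is an illustration under stated assumptions, not a recipe for any particular backup product: `restore_file` is a hypothetical hook that would wrap whatever restore command your backup software provides, and `stand_in_restore` simply copies the file so the example can run anywhere. The check restores to a scratch directory (the alternate-location restore) and compares checksums, which assumes the production file has not changed since it was backed up.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_restore(restore_file, original: Path, scratch_dir: Path) -> bool:
    """Restore `original` into `scratch_dir` and verify the restored copy.

    `restore_file(original, target)` is a hypothetical hook: in real use it
    would call the backup product's restore command. The production file is
    only read, never overwritten.
    """
    target = scratch_dir / original.name
    restore_file(original, target)
    return target.is_file() and sha256_of(target) == sha256_of(original)


def stand_in_restore(original: Path, target: Path) -> None:
    """Placeholder for a real restore: copies the file (assumption)."""
    shutil.copy2(original, target)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work:
        production = Path(work) / "production"
        scratch = Path(work) / "scratch"
        production.mkdir()
        scratch.mkdir()
        sample = production / "report.txt"
        sample.write_text("quarterly numbers")
        ok = check_restore(stand_in_restore, sample, scratch)
        print("restore test passed" if ok else "restore test FAILED")
```

Run from a scheduler once a day, a script like this could alert someone whenever the check fails, so a broken restore is discovered during a test and not during an outage.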
So the businesses where it's part of the business function to have those restores required have it easier, ironically, because they can test it out naturally?
What types of things are you testing for when you test restores?
The biggest thing is you're just testing the process. If someone says, "I need a file," well, there's a person, generally speaking, sitting at a keyboard, who gets that request. Well, what do they do? What's the process that they go through? Do they have to notify somebody that they're going to do the restore? Can they do the restore themselves? In the case of the bank [that I mentioned], we had a help desk; they didn't do the restore, all they would do is issue the restore request, and then that restore request was the help desk ticket that would go to the restore guy, and then he would do something, and that would initiate the actual restore process. [And then the next part of the process was that it asked for a tape.] (Back in those days, it was absolutely a tape; now it could be disk or something else.) And then, what happens? … Some people do the, in my opinion, incredibly ill-advised thing of sending the only copy of their backups offsite immediately. And so if you do restores on a regular basis, what you'll find on a regular basis is what has to happen to make that restore happen. So, do we have to call Iron Mountain and have them bring a tape back on a very regular basis? If that's the case, then I would suggest that you're doing something wrong with that process.
And then the next thing that we're looking for is just things like performance and verification that the restore actually worked. How long does it take to restore a file system of a certain size or a database of a certain size? I would suggest that you have a test system of each major database type -- Oracle, Exchange, SQL Server, etc. -- and that you do regular test restores of a production system to that test system. And so you'll have an idea that when somebody calls and it's the monster database -- the one that you're scared to death to restore, and you never do test restores [of it] because you don't have 5TB of disk sitting around -- but at least you'll know that [when you did a backup test last month of the 500GB database, it took a certain amount of time, and so by extrapolation, you can say,] "OK, it took us three hours to do that one, so it's going to take us a day and a half to [do] yours." And you can tell them that upfront and set an expectation because you've got that knowledge from testing. That's a lot better than sitting there for 18 hours wondering how long it's going to take.
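The extrapolation described above can be sketched in a few lines of Python. This is a rough illustration that assumes restore time scales linearly with data size, which real restores will not follow exactly; the function name and the choice of 1TB = 1,000GB are assumptions made for the example. With the figures from the interview (500GB restored in three hours, 5TB to restore), strict linear scaling gives 30 hours, in the same ballpark as the "day and a half" quoted.

```python
def estimate_restore_hours(test_size_gb: float, test_hours: float,
                           target_size_gb: float) -> float:
    """Linearly scale a measured test restore to a larger data set.

    Assumes the same throughput for both restores, which is only an
    approximation; treat the result as a planning figure, not a promise.
    """
    if test_size_gb <= 0 or test_hours <= 0:
        raise ValueError("test size and test duration must be positive")
    return target_size_gb * test_hours / test_size_gb


if __name__ == "__main__":
    # Figures from the interview: a 500GB test restore took three hours,
    # and the database to restore is 5TB (taken here as 5,000GB).
    hours = estimate_restore_hours(500, 3, 5000)
    print(f"Estimated restore time: {hours:.1f} hours")
```

Padding the estimate, as the speaker does, leaves room for the differences between test and production hardware.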
In doing backup testing for a really big database, is there any way to do it other than extrapolation based on the time it takes to restore a smaller database?
I think it's an environment thing, it's a culture thing. If you can get across to the powers that be that [backup testing and verification] is important, then what you can do is make it part of your process of installing new systems that you use a new system that's been installed, and before it goes [to] production, you use that new system to test the restore of an older system that's of a smaller size. We do tend to install newer and bigger and faster systems every time. And so when you bring in that new 1TB database or 1.5TB database, or whatever it is … before that system goes [into] production, use it to test the smaller system. The really good thing there is [that] you're testing an alternate-location restore, which is a bit more complicated than the regular restore, and if you can do the alternate-location restore, you can do the regular restore. And so it's about making [backup] testing a priority and making it more important than getting that new fancy production system into production. Because that's what happens: It's the new shiny box, and we want to turn it on and get the new umpty-squat application up and running as soon as possible. I get that that's important. But if we get that new umpty-squat application up and running and we can't restore it, people aren't going to be happy, so you've got to get that into the culture, so that you're just doing that [backup testing] as a matter of course.
And so if somebody has a huge database that they need to test, that's an excellent way to do it?
Absolutely. In my opinion, the only real way to know that your restore of the large database is going to work is to do a restore of the large database. … I'm at companies all the time, and -- I can think of a client that I was at recently, and they've got a 90TB Oracle database -- and you ask them, "Have you ever done a test restore?" And the answer is no. [I said,] "OK. I don't know how you know [it'll restore]. The database is really important. I'm not sure why you haven't tested the restore."
What about documentation?
[I can't overemphasize the importance of documentation.] This isn't little notes to yourself on the things you need to remember in the case of a restore. This is, You are gone. You are on vacation, you are in Tahiti, you got hit by a bus. You are no longer accessible -- you or any other backup people that know the ins and outs. And so you've got to document this to the point that a person who is technically minded but not familiar with the backup system can do a restore. Then and only then, frankly, can you take a vacation. I can think back to 14 years ago, when my first daughter was born, and I was in the hospital with my wife looking at our newborn. And I got a call from the bank on a restore that was important. And I was basically able to slam the phone down because I said, "Did you read the documentation?" And they said, "No." And I just said, "Call me back if you've read it and it doesn't work," because I knew that wasn't going to be the case because I knew I had fully documented it. But if you don't, you're never going to be able to take a vacation away from the backup system, a real vacation where you unplug yourself, and your company won't be at risk for when you ever take a permanent vacation.
But it requires that your customers actually read the documentation. It doesn't mean you won't get those phone calls, it just means that you won't have to leave the hospital.
Agreed. … Nobody reads anything. But it's more along the lines of, in the case where the worst thing happens and they can't make that phone call. Whoever it was who made the documentation, made the system, manages the system, that person is completely gone for whatever reason -- whether something horrible happened to them or … maybe you had to lay them off, and they're not going to answer the phone when you call. The documentation has got to be good enough so that [someone else] can pick it up and figure something out.
What about server virtualization? Does it have a role in backup testing and verification?
Absolutely. Using either the free products, like VMware Server, or the production product, you can do a lot of restore testing that simply wasn't possible before or wasn't feasible. So you can create a new system. You can even test bare-metal restore, so [test] the restore of a completely new system that doesn't even have an operating system. You can mimic that with a virtual machine. If you've got a bare-metal recovery system, then use it and test it on your virtual machine. You can also … take one virtual machine that happens to be running the application that you're restoring and you can copy that and create an alternate system that looks exactly like the production system you want to restore, and then you can do a direct restore with impunity. You can do all sorts of interesting things with restore testing if you happen to have virtual servers. I've even seen some CDP products that will actually continually update a restored system. Even though the system that you're backing up may be a physical system, I've seen a CDP system that will continually update a virtual image, so in the case of an outage, you'll actually be able to start using a virtual server. There are all sorts of possibilities. … When the system dies … go over to the virtual machine that's been acting as the backup resource of that system, and power that new virtual machine up and -- boom -- it's already been updated with the most recent backup information from that client. That's just awesome.
It doesn't mean that you won't have lost data since the time of the last backup.
Right, but it should be minutes or seconds, not hours. And the other phenomenal thing about it is that you don't have to actually do a restore. The restore's already done; you just have to start [the machine] up.
And so that obviously saves time.
Right. And that's not so much a [backup] testing question. I just got talking about virtual servers and the things they can provide, and that's one of them.
Are you seeing companies that are actually using server virtualization in their backup testing and verification process?
Absolutely. Especially when you consider that many of these [vendors] offer free versions and so you don't need the most high-end ESX Server to be a target for restores. You can use an older server, like VMware Server, which is free, and I know the other products have similar versions that you can use. And all you need is a bunch of storage. You're not necessarily testing performance here as much as you're just testing the process. And it's really cheap to make a multiterabyte VMware server using a 1TB or 1.5TB SATA drive and a motherboard that supports SATA RAID. A few hundred dollars later, and you've got yourself a big old VMware server with several terabytes of disk. And you can do all sorts of [backup] testing [and verification] with that. I wouldn't recommend that necessarily for a production system, but for restore testing, knock yourself out. It's not going to be as fast, but you can extrapolate that. [If it takes you two hours to do the restore with cheap SATA drives and you have Fibre Channel on the live system, the actual restore should be faster.]
Go to Part 1 of the Channel Expert Podcast, on the goals and technologies of a modern backup strategy.