Data Recovery Best Practices
Before I dig into troubleshooting scenarios for various types of database recovery problems, let me share with you perspective on best practices for you to use to avoid problems. This section is not a complete study of disaster recovery strategies, but I give you some information you might find helpful based on my experience with data recovery customer cases.
BACKUP/RESTORE Best Practices
Your database is only as good as your last backup.
If you have a great backup strategy, more power to you. If you have any doubts, read on. You would be surprised how many customers call Microsoft PSS and have not backed up their database (or don't have a recent backup). If I had responsibility for any database being used in a business, my number one priority would be the safety and security of that data, which includes disaster recovery situations. I have many non-computer-savvy friends who I talk to about the files on their computer. They ask me how often they should back them up. I always tell them, "Your data is only as good as your most recent backup." As good as computer hardware is today, I've seen too many situations where someone didn't have a recent backup and lost data. It just takes one time of losing key data to learn to have timely and good backups. Don't let yourself run into that situation. You could run for years without ever needing a backup, but when you really need it, you will be so thankful that you spent careful time and consideration ensuring your backups are recent and up-to-date based on your needs and requirements.
Use the BACKUP CHECKSUM feature.
I personally would take any hit in backup performance that you may encounter when using the WITH CHECKSUM option of BACKUP. There is simply no better method in the database engine to detect whether the pages written to the backup are valid and to verify that the backup medium has not been damaged or altered after it is written. This feature actually makes RESTORE VERIFYONLY a viable option to determine whether you have a reasonable chance of restoring the database.
RESTORE is the only guaranteed method to verify backups.
Even with the checksum feature, if you want a guarantee that a backup can be verified, the only way is to restore it. This is because even if you use the WITH CHECKSUM feature, other problems could occur. For example, what if there is some logical problem with transaction log records so that when recovery runs during restore, a failure occurs? If the log record is physically correct but some logical problem exists that prevents redoing that log record within the engine, restoring the backup is the only way to find out.
As part of my advice on restore, I think it is important to test your disaster recovery strategy. For example, how long does it take to restore your backups at any given day or time of day? If your business requires the database to be back and available within 30 minutes during the day Monday through Friday, but with your current disaster recovery strategy, it would take 4 hours to restore your backups on Thursday at 2 p.m., clearly you have a problem. You may need to seek high-availability solutions such as database mirroring or log shipping.
Avoid network drive backups.
Customers seem to encounter more problems than not when backing up their databases or logs to a network drive. I've seen everything from damaged backups to network errors during the backup to Windows errors due to insufficient kernel resources when customers back up large databases to network drives. Using a storage area network (SAN) system seems to be fairly reliable, but I personally don't recommend you back up your database to a network mapped drive on another server on your network. If you have to store your backups on another server and you don't have a SAN system, I recommend you try to back up the database to a local drive, use a program or utility to compress the file (they usually put a checksum on the file as part of this), and then copy this file over the network to the remote server. If the compression utility supports checksum features, such as a cyclic redundancy check (CRC), when you uncompress you can at least safely know the bytes were copied to the remote server. Another alternative is to use a vendor backup solution that uses SQL Server VDI to stream the backup remotely to another computer.
Don't forget about page-level restore.
Remember when making key decision about restoring a database that you have a new feature to reduce the time to restore a backup sequence called page-level restore. Of course, to use this feature you need to know what page to restore, but in some cases, you may have damage to a single page for a multiterabyte database. If you have Enterprise Edition, why not then restore just the damaged page using database and log backups while users are online in the database? It could save you a lot of time and grief. I highly recommend you test this feature and understand exactly how it works before you rely on this possible strategy. An exercise at the end of this chapter helps you understand what is required to use this feature.
Back up system databases.
Don't forget the importance of your system databases. The failure to open these databases at startup can result in a failure to start SQL Server. If you can't restore from a backup, you could be looking at a longer recovery process to rebuild the system databases, attach user databases, re-create user logins, and so on. Because tempdb is re-created at each server startup, you (of course) don't back this up, but model is important because you may add objects in model for new databases, and tempdb requires model to be available to be created at server startup. I have an entire section of this chapter on recovery of system databases, and I think you will see that having a valid backup to restore makes your life easier.
Database and Transaction Log Best Practices
Don't delete your transaction log.
I hope one thing you get out of this chapter is how important your transaction log is to the logical consistency of your database. So, just because database recovery is taking a long time or the log is getting really big, don't think that simply shutting down SQL Server and deleting the log is a good idea. As with other scenarios, this is not uncommon for customers calling Microsoft PSS. They delete the log file and then call us wondering what to do because the database cannot be started.
If recovery is taking a long time, and you think it is stuck or will never recover, it is possible Microsoft PSS may have to help you rebuild the log, but consider contacting PSS first. You might just choose to restore from a backup, but if the backup is good, that is a much better alternative than just deleting your log.
The one scenario in which you can safely delete the log is if the database is detached and shut down cleanly and the recovery mode is SIMPLE. In this case, the log has nothing in it for recovery purposes, so you could delete it. Personally, I would use this method. I would simply shrink the log file using DBCC SHRINKFILE. Remember, if you choose to do this, you have to detach the database first.
Always detach before attaching a database.
The proper method to attach a database requires that you detach it first. Don't just shut down SQL Server, copy the database and log files to another computer, and then attach them. If you need to copy the database, use BACKUP/RESTORE or the Copy Database Wizard in Management Studio.
The process of detaching cleanly shuts down the database. Furthermore, if the database is suspect, detach fails in SQL Server 2005 (SQL Server 2000 didn't do this), preventing you from getting into a situation where you can't attach.
Moving the resource database.
If you need to move the resource database, there is a documented method to do this in SQL Server 2005 Books Online. However, you must move the database to the exact same drive and directory as the master database. Failure to do this will result in problems when trying to install a service pack or hotfix package.
Don't ignore runtime errors.
If you encounter any error related to a database that indicates possible consistency problems, don't ignore these or take care of them "when you have time." Take these seriously and put them on your high-priority list of things to do. The read retry feature is nice in SQL Server 2005, but it means some underlying problem might exist with your disk system. A successful read retry one day could be a checksum failure the next. If you have any doubts or questions about these errors related to a possible database consistency problem and I don't have the information in this chapter, consult with experts such as MVPs in the SQL Server newsgroups, or contact Microsoft PSS.
DBCC CHECKDB Best Practices
Just use DBCC CHECKDB.
In SQL Server 2005, DBCC CHECKDB does it all. No need to run CHECKALLOC or CHECKCATALOG separately in SQL Server 2005. CHECKDB does it all. You may choose to use CHECKTABLE on individual tables, but if CHECKDB runs in a reasonable time frame, I recommend you check the entire database. You can also consider the WITH PHYSICAL_ONLY option to perform a reasonable minimum check for the database and reduce the time of overall execution.
Don't just run REPAIR.
I talk more about restoring a backup versus repair in the section on CHECKDB errors, but I'll say briefly here that I recommend that you not just blindly run CHECKDB with the REPAIR_ALLOW_DATA_LOSS option. Just because DBCC CHECKDB doesn't report errors after repair doesn't mean you don't have issues to deal with. Repair may de-allocate a page containing data rows, which now means you have logical inconsistency of your data. If you have to use repair because you can't restore a database, try to keep information about errors in CHECKDB and either a backup of the damaged database or a copy of the damaged page(s) before using repair. This is the only way to learn about the possible cause of the damage to the database.
Use RESTORE rather than emergency mode REPAIR.
Emergency mode repair is a great feature (especially for PSS). It makes the process to repair and rebuild a log simple and easy to use. But the key to its usage is the term emergency. The SQL Server development team put in this feature for situations that you simply didn't account for in your disaster recovery strategy. But don't rely on this feature. I hope you never have to use this in your usage and administration of SQL Server, but it is nice to know the feature exists should it be needed.
You can't overuse CHECKDB.
So how often should you use CHECKDB? Well, I would say first that you can't run this too often. It might affect performance because of its resource usage, but you won't burn up your hard drive by using this command every day. Having said that, you probably should use this command only on some type of regular, but minimal, basis. There is no single formula for how often to run this command, but here is my opinion of some checkpoints for when CHECKDB should be run:
- Whenever you see a critical error in the ERRORLOG, especially ones I've documented as runtime consistency problems.
- Before and after any SQL Server service pack or hotfix installations.
- Before and after any major database application upgrades.
- Before and after any Windows operating system service pack or hotfix installations or upgrade.
- Before and after any hardware or system maintenance such as drivers, firmware updates, new hardware installation, or replacement of hardware components.
- On a regular basis that makes you comfortable with the consistency of your database. (This may be daily, weekly, or monthly. I wouldn't go any longer than one month to use CHECKDB unless you have something like a read-only database you can easily re-create at any time.)
CHECKDB on backup servers is not a guarantee.
If you choose to restore backups to another server to keep an updated copy of your database available, be careful if regular DBCC CHECKDB on the backup server reports no errors. The primary server could have problems unique to that machine that are not carried into a restore on the backup server. Now if you restore a full database backup and immediately run CHECKDB on that restored backup, it is reasonable to assume the primary databases based on that backup are clean.
Use the following table of contents to navigate to chapter excerpts, or click here to view Data corruption and recovery issues in its entirety.
|About the book:|
|In this book, bestselling author and SQL Server guru Ken Henderson has worked with the SQL Server Development team to write "the" troubleshooting SQL Server 2005 book. All of the content comes directly from the creators and developers of SQL Server. Ken has edited each chapter to ensure that the writing is clear and sound. Written by the people who know the product best - the people who built it - SQL Server 2005 Practical Troubleshooting teaches developers and dbas how to troubleshoot and diagnose problems they may encounter with SQL Server. Purchase the book from Addison-Wesley Publishing|
|ABOUT THE AUTHOR:|
|Ken Henderson has been a developer for more than 25 years. He has worked with SQL Server since 1990 and has built software for a number of firms throughout his career, including H&R Block, the Central Intelligence Agency, the U.S. Navy, the U.S. Air Force, Borland International, JP Morgan, and various others. He joined Microsoft in 2001 and is currently a developer in the Manageability Platform group within the SQL Server development team. He is the creator of SQL Server 2005's SQLDiag facility and spends his days working on SQL Server management tools and related technologies. He is the author of eight books on a variety of computing topics, including the popular Guru's Guide series of SQL Server books available from Addison-Wesley. He lives with his family in the Dallas area and may be reached via email at email@example.com.|