SQL Server 2005 Practical Troubleshooting

When it comes to data corruption and recovery, BACKUP and RESTORE are your most important tools; find out what to do when these commands fail in this book excerpt.


Because BACKUP and RESTORE are your most important tools to recover your data, I think it's important to briefly discuss some scenarios where these commands can fail and how to handle them.

BACKUP Failures

The T-SQL BACKUP command can fail for various reasons, some of them simpler to solve than others. For example, if you try to use BACKUP LOG and the database is using SIMPLE recovery, you will encounter the following error:

Msg 4208, Level 16, State 1, Line 1 The statement BACKUP LOG is not allowed while the recovery model is SIMPLE. Use BACKUP DATABASE or change the recovery model using ALTER DATABASE. Msg 3013, Level 16, State 1, Line 1 BACKUP LOG is terminating abnormally.

This error is descriptive and tells you exactly the problem. Other errors or problems are not as intuitive and simple to resolve, so I focus on those in this section.

Media Failures

One of the most common issues I've seen in technical support regarding backup is a failure writing to the target media. Target media is the destination for the backup stream, such as a disk file(s) or tape.

Tape failures were more common with previous releases of SQL Server because tape was the most common format for backing up large databases. I don't see as many of these cases anymore, but when they do occur, the cause of the failure is almost always a problem with the tape drive, the tape, or device drivers associated with the tape drive. I've seen this time and time again. Here is a sample ERRORLOG showing a failure writing to the tape that doesn't appear very intuitive.

spid324 BackupMedium::ReportIoError: write failure on backup device '.tape0'. Operating system error 1117(The request could not be performed because of an I/O device error.).

The steps to resolve a problem such as this or other tape errors is to focus on the tape drive, tape, or device drivers. Try a different tape or a different tape drive. Make sure the device drivers for the tape drive are up-to-date. One common misconception for troubleshooting a problem with writing to tape is that if the Windows NTBACKUP program works, the problem must be SQL Server. Although it is a good idea to see whether you can back up a file using NTBACKUP to the tape drive, if it works but you still encounter an error with SQL Server, the problem could still be the tape system. This is because SQL Server uses Windows API calls that NTBACKUP doesn't when writing to the tape.

Before I talk about some common issues when writing to a disk drive, one simple troubleshooting step to help in many backup cases is to simplify what you are backing up. For example, if you are struggling to back up a 30GB database, try backing up just a small database, such as the sample AdventureWorks database. If you cannot back up even a small sample database, there is a good chance you should focus on the target media, whether it is a tape system, local disk, SAN, or network drive. If a small database sample works, but your database does not, it could still be the target media system, but there is a little more evidence that it could be a SQL Server problem. This is good information to give to a technical support engineer should you decide to contact Microsoft for assistance.

Let's talk about a few more common scenarios that you could encounter today—failures writing to a local disk, SAN, or network drive. As with tape, the cause of these problems is almost always something to do with the target media, current operating system environment, or target operating system environment. Let's take a look at two possible scenarios you may encounter where the solution to the problem is not obvious.

2005-05-02 09:15:28.05 spid69 BackupMedium::ReportIoError: write failure on backup device 'G:MSSQLBackupscaamdb.BAK'. Operating system error 64(The specified network name is no longer available.).

In this situation, the error is indicative of a network problem. The error has nothing to do with SQL Server. You should troubleshoot this problem as though there was a network I/O issue communicating with whatever disk is mapped to the G: drive. This may even be a SAN drive.

BackupMedium::ReportIoError: read failure on backup device 'FFGPSD$MSSQLBACKUPFSPRD_DB.BAK'. Operating system error 1450(Insufficient system resources exist to complete the requested service.).

In this situation, the error is coming from Windows when writing the backup file. I've especially seen this situation when backing up to a network drive. I've also seen this problem with the use of the /3GB switch in boot.ini because there is not enough nonpaged pool memory for Windows to handle the I/O associated with the backup. Your choices are 1) Don't back up to a networked drive, 2) Remove the use of /3GB, 3) Tune nonpaged pool according to the guidelines in Knowledge Base article 304101.


I discussed earlier in the chapter a new feature for SQL Server 2005 called backup checksum. When the WITH CHECKSUM option is specified for the BACKUP command, the engine verifies the checksum for any pages it reads from the database file that previously had a checksum calculated for it. If any page fails this verification check, BACKUP fails with the following error:

Msg 3043, Level 16, State 1, Line 1 BACKUP 'test_checksum' detected an error on page (1:177) in file 'C:Program FilesMicrosoft SQL ServerMSSQL.1MSSQLDATAtest_checksum.mdf'. Msg 3013, Level 16, State 1, Line 1 BACKUP DATABASE is terminating abnormally.

The ERRORLOG shows the following information:

2006-08-25 14:18:21.49 Backup Error: 3043, Severity: 16, State: 1. 2006-08-25 14:18:21.49 Backup BACKUP 'test_checksum' detected an error on page (1:177) in file 'C:Program FilesMicrosoft SQL ServerMSSQL.1MSSQLDATAtest_checksum.mdf'. 2006-08-25 14:18:21.52 Backup Error: 3041, Severity: 16, State: 1. 2006-08-25 14:18:21.52 Backup BACKUP failed to complete the command BACKUP DATABASE test_checksum. Check the backup application log for detailed messages.

Your choices here are simple:

  • Correct the checksum problem.
  • Remember the cause of the problem is something within the disk system, so be sure to check all is well here first. I say this because if you choose to repair the database to correct the problem, but the disk still has problems, you will continue to run into other inconsistency issues. To repair the current problem, you can choose to restore from a backup (remember page-level backup restore is a possible option) or you can use DBCC CHECKDB to repair the page (which will result in data loss as the page will be de-allocated to "fix" it).
  • Ignore the error and continue the BACKUP.

    Let's say you just want to go ahead and back up the database ignoring this error, knowing it will be a bad page. You have an option with BACKUP called CONTINUE_AFTER_ERROR that will let you do just that. If you use this option, you will get the following warning after the BACKUP completes:

    Processed 208 pages for database 'test_checksum', file 'test_checksum' on file 1. Processed 1 pages for database 'test_checksum', file 'test_checksum_log' on file 1. BACKUP WITH CONTINUE_AFTER_ERROR successfully generated a backup of the damaged database. Refer to the SQL Server error log for information about the errors that were encountered. BACKUP DATABASE successfully processed 209 pages in 1.879 seconds (0.910 MB/sec). The ERRORLOG will contain the following information: 2006-08-25 14:20:30.83 Backup Error: 3043, Severity: 16, State: 1. 2006-08-25 14:20:30.83 Backup BACKUP 'test_checksum' detected an error on page (1:177) in file 'C:Program FilesMicrosoft SQL ServerMSSQL.1MSSQLDATAtest_checksum.mdf'. 2006-08-25 14:20:31.03 Backup Database backed up. Database: test_checksum, creation date(time): 2006/02/22(11:35:35), pages dumped: 219, first LSN: 22:16:26, last LSN: 22:28:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {'c:test_checksum.bak'}). This is an informational message only. No user action is required. Why would you consider this option? One reason may be that you want to back up the database in its current state before trying to repair it and correct the damaged page. I'm not saying repair will cause problems for you so you must back up your current database, but I've met careful DBAs who want to back up the current state of the database before trying any fairly major modification operation. The only way for you to do this is to use the CONTINUE_AFTER_ERROR option.

    One point of clarification about the WITH CHECKSUM option for BACKUP. The BACKUP command itself does not verify any data after writing it to disk. In other words, BACKUP doesn't verify checksums, write the backup stream checksum, and then go back and verify everything. What it does is verify database page checksums as it reads pages from the database file, write these pages to the backup media, and then write a backup checksum value based on all data written to the media. You can do this yourself by immediately executing RESTORE WITH VERIFYONLY right after the BACKUP.

    If you use Management Studio to back up a database, you are presented with an Options dialog box where you can select to use checksum and verify the integrity of the database (see Figure 2-10).

    Figure 2-10 Option to use checksum and verify the integrity of the database

    Look at the options in the Reliability section of the dialog box. If you select these options and use the Script menu at the top of the dialog box (this is a nice feature in Management Studio where you can generate scripts based on options picked in a dialog box for different types of tasks), the resulting T-SQL script will look like this:

    BACKUP DATABASE [pubs] TO DISK = N'c:pubs.bak' WITH NOFORMAT, NOINIT, NAME = N'pubs-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10, CHECKSUM GO declare @backupSetId as int select @backupSetId = position from msdb..backupset where database_name=N'pubs' and backup_set_id=(select max(backup_set_id) from msdb..backupset where database_name=N'pubs' ) if @backupSetId is null begin raiserror(N'Verify failed. Backup information for database ''pubs'' not found.', 16, 1) end RESTORE VERIFYONLY FROM DISK = N'c:pubs.bak' WITH FILE = @backupSetId, NOUNLOAD, NOREWIND GO

    This is effectively what Management Studio (through SMO) will use to back up the database with checksum and then verify the backup when it is complete.

    Performance Problems: Memory

    One possible scenario you may encounter will not show up as an error message. The issue is performance problems with your BACKUP. Perhaps you have seen this before and were not sure as to the cause. We have had customers contact technical support and say that their backups are slowing down and in some cases taking twice as long. In some situations, this degradation has occurred over time. Let's explore how to recognize this behavior and what you can do to resolve it.

    When customers have contacted us and complained about backup performance, we quickly noticed in most cases Performance Monitor would show a significant drop in backup throughput via the following counter: Physical Disk/Avg Disk Bytes Per Transfer. We observed this for the disk drive where the customer was writing backup. We noticed that this value on that drive was something much lower than we expected. For a normal SQL Server disk backup, this number should be close to 1MB. This is because we knew internally in SQL Server that the backup code will write out buffers of ~1MB in size to the target disk. The numbers we observed were more like 64KB. What this means is that the SQL engine is writing out a large number of small disk writes versus a smaller number of larger writes. This is not efficient for writing out to a disk drive (assuming no bottlenecks exist).

    After some investigation, we discovered the problem. The SQL Server backup code calculates what will be its buffer size (called MAXTRANSERSIZE) based on the number of files for the database/log and the number of target disk drives. (Remember you can back up to multiple disk files at the same time, called a stripe dump.) The backup code then takes the total number of buffers and tries to allocate memory for them at a default size of 960KB. If any of these memory allocations fail, the code downgrades its MAXTRANSFERSIZE to 64KB. In SQL Server 2005 (and in SQL Server 2000 SP4), if you observe the following message in the ERRORLOG:

    2005-12-02 02:00:21.93 spid100 Downgrading backup buffers from 960K to 64K

    you know that the default MAXTRANSERSIZE could not work. As with any memory problem, the issue is not having enough free memory for these allocations. But here is the twist to this problem. The issue is not having enough contiguous memory to allocate a 960KB block. Furthermore, the memory for the backup buffers doesn't come from the SQL Server buffer pool. This comes from the remaining virtual address space after the buffer pool reserves its memory. This is a precious resource for systems with 2GB+ of physical memory, because the buffer pool consumes most of the virtual address space. This means there are two resolutions to this problem:

    • Free up more virtual address space or find out why it is fragmented.
    • Use the new MAXTRANSFERSIZE for a lower default setting than 960KB.

    RESTORE Failures

    Like BACKUP, there are various reasons for restore failing, but two that are the most complex are as follows:

    • A consistency problem with the backup media
    • A failure during recovery

    The second scenario should be investigated just like the scenarios I've already described for deferred transactions and a SUSPECT database. This is because the problems are the same. One is a failure during crash recovery (when the database comes online), and the other is a failure of recovery during media recovery (when the database is being restored).

    The first case is the one for us to focus on. This is where database and backup checksums can make the process of determining the cause of a restore failure so valuable. Remember we talked about how restore verifies database page checksums and the backup stream checksum if they exist in the backup.

    As we talked about for verifying the integrity of a backup, you can use RESTORE WITH VERIFYONLY. If this fails with the following error, you know the backup media was altered after it was written. You can't tell from this message whether the problem was with a damaged page, but you know the backup media itself was altered and the checksum is now different than when it was written.

    Msg 3189, Level 16, State 1, Line 1 Damage to the backup set was detected. Msg 3013, Level 16, State 1, Line 1 VERIFY DATABASE is terminating abnormally.

    If you run the actual RESTORE DATABASE command and the damage was to a database page that has checksum values on it, you can see this type of error:

    Msg 3183, Level 16, State 1, Line 1 RESTORE detected an error on page (1:177) in database "test_checksum" as read from the backup set. Msg 3013, Level 16, State 1, Line 1 RESTORE DATABASE is terminating abnormally.

    In either case, a problem was detected with the integrity of the backup. In the case of an error with a database page checksum in the backup, the error could be due to the use of CONTINUE_AFTER_ERROR with BACKUP, or it could be that the page was damaged after the backup was written. Remember this key point about checksum. As with a standard database page checksum, the intention is for SQL Server to detect that data was modified after SQL Server wrote it to disk. So if you get one of the preceding two errors (assuming you didn't use BACKUP WITH CONTINUE_AFTER_ERROR) you must focus your root cause analysis of the disk system used to store the backup media.

    You can always restore a different backup should you encounter these errors. But if that is not an option, you do have another choice. One of the difficult problems we had in technical support when we determined that a customer's backup was damaged was to determine the severity of the problem. In other words, if the RESTORE fails when restoring the fifth page in the backup, it stops. How can we determine whether the rest of the backup is good? In some extremely urgent cases, I even had to resort to a debugger to "skip" errors during the restore to see how badly it was damaged. SQL Server 2005 introduces a new option for RESTORE called CONTINUE_AFTER_ERROR. If you use this option, RESTORE ignores any error and proceeds to copy all data from the backup media. This way you can try to recover as much data as you possibly can. At this point, you can use DBCC CHECKDB WITH REPAIR_ALLOW_DATA_LOSS to de-allocate the damaged page(s). The damaged page from the backup was lost anyway, but now you can use the rest of the database that is still valid. One key point when using CHECKDB here: You must use the WITH TABLOCK option, too, because online repair will not provide consistent results after you have used CONTINUE_AFTER_ERROR.

    Use the following table of contents to navigate to chapter excerpts, or click here to view Data corruption and recovery issues in its entirety.

    SQL Server 2005 practical troubleshooting
      Home: Introduction
      1: Fundamentals & SQL Server 2005 storage internals
      2:SQL Server 2005 enhancements
      3:Data Recovery Best Practices
      4:Data recovery trouble shooting scenarios
      5: User Database Inaccessible
      6: BACKUP/RESTORE Failures
      7: Database consistency errors
    About the book:   
    In this book, bestselling author and SQL Server guru Ken Henderson has worked with the SQL Server Development team to write "the" troubleshooting SQL Server 2005 book. All of the content comes directly from the creators and developers of SQL Server. Ken has edited each chapter to ensure that the writing is clear and sound. Written by the people who know the product best - the people who built it - SQL Server 2005 Practical Troubleshooting teaches developers and dbas how to troubleshoot and diagnose problems they may encounter with SQL Server. Purchase the book from Addison-Wesley Publishing
    Ken Henderson has been a developer for more than 25 years. He has worked with SQL Server since 1990 and has built software for a number of firms throughout his career, including H&R Block, the Central Intelligence Agency, the U.S. Navy, the U.S. Air Force, Borland International, JP Morgan, and various others. He joined Microsoft in 2001 and is currently a developer in the Manageability Platform group within the SQL Server development team. He is the creator of SQL Server 2005's SQLDiag facility and spends his days working on SQL Server management tools and related technologies. He is the author of eight books on a variety of computing topics, including the popular Guru's Guide series of SQL Server books available from Addison-Wesley. He lives with his family in the Dallas area and may be reached via email at [email protected]

  • Dig Deeper on Database software management