Solution provider takeaway: VMware ESX Server is today's leading virtual infrastructure platform in mission-critical environments. This section of the chapter excerpt from the book VMware ESX Server in the Enterprise: Planning and Securing Virtualization Servers will focus on using the platform for disaster recovery and backup.
Disaster recovery (DR) takes many forms, and the preceding chapter on dynamic resource load balancing (DRLB) covers a small part of DR. Actually, DRLB is more a preventative measure than a prelude to DR. However, although being able to prevent the need for DR is a great goal, too many disasters happen to rely on any one mechanism. In this chapter, we categorize disasters and provide solutions for each one. You will see that the backup tool does not dictate how to perform DR; rather, the DR plan dictates which backup tool to use. In addition to DR, there is the concept of business continuity (BC), or the need to keep things running even if a disaster happens. Some of what we discuss in this chapter is BC and not truly DR. However, the two go hand in hand because BC plans are often tied to DR plans and handle a subset of the disasters.
There are various forms of well-defined disasters and ways to prevent or work around these to meet the defined goal. There is no one way to get around disasters, but knowing they exist is the first step in planning for them. Having a DR or BC plan is the first step toward prevention, implementation, and reduction in downtime. At a conference presentation, I asked a room of 200 customers if any of them had a DR or BC plan. Only two people stated they had a DR or BC plan, which was disconcerting but by no means unexpected.
Putting the DR and BC plan in writing will help immensely when it is needed, because there will be no confusion about it in an emergency. For one customer, the author was requested to make a DR plan to cover all possible disasters. Never in the customer's wildest dreams did they think it would need to be used. Unfortunately, the "wildest dream" scenario occurred, and the written DR plan enabled the customer to restore the environment in an orderly fashion extremely quickly. It is in your best interest to have a written DR plan that covers all possible disasters to minimize confusion and reduce downtime when, and not if, a disaster occurs.
Yes, this last best practice sounds like so many other truisms in life, but it is definitely worth considering around DR and BC, because failures will occur with surprising frequency, and it is better to have a plan than everyone running around trying to do everything at once. So what should be in a DR and BC plan? First, we should understand the types of disasters possible and use these as a basis for a DR and BC plan template. Granted, some of the following examples are scary and unthinkable, but they are not improbable. It is suggested that you use the following list and add to it items that are common to your region of the world as a first step to understanding what you may face when you start a DR or BC plan. A customer I consulted for asked for a DR plan, and we did one considering all of these possibilities. When finished, we were told that a regional disaster was not possible and that it did not need to be considered. Unfortunately, Katrina happened, which goes to show that if we can think it up, it is possible. Perhaps a disaster is improbable, but nature is surprising.
Disasters take many forms. The following list is undoubtedly not exhaustive, but it includes many different types of potential disasters.
- Application failure An application failure is the sudden death of a necessary application, which can be caused by poorly coded applications that are exploited by denial-of-service (DoS) attacks that force an application to crash.
- VM failure A VM failure could be man-made, by nature, or both. Consider the man-made possibilities, such as where a security patch needs to be applied or software is to be added to the VM. By nature could be the failure of the VM due to an OS bug, an unimplemented procedure within the virtualization layer, or an application issue that used up enough resources to cause the VM to crash. In general, VM failures are unrelated to hardware because the virtualization layer removes the hardware from the equation. But it does not remove OS bugs from the equation.
- ESX Server failure A machine failure can be man-made, by nature, or even both. For example, a man-made failure could be a planned outage to upgrade firmware, hardware, or the ESX OS; another possibility is a hardware failure of some sort that causes a crash. Another example is if power is inadvertently shut off to the server.
- Communication failure A communication failure is unrelated to ESX, but will affect ESX nonetheless. Communication can be either via Fibre Channel or Ethernet. The errors could be related to a communication card, cable, switch, or a device at the non-ESX side of the communication. An example of this type of failure is a Fibre or network cable being pulled from the box or a switch being powered off or rebooted.
- Rack disaster Rack failures are extremely bad and are often caused by the rack being moved around or even toppling over. Not only will such an incident cause failures to the systems or communications, but it could cause physical injury to someone caught by the rack when it topples. Another rack failure could be the loss of power to the fans in and around the whole rack, causing a massive overheat situation where all the servers in the rack fail simultaneously.
- Datacenter disaster Datacenter disasters include air conditioning failures that cause overheating, power spikes, lack of power, earthquakes, floods, fire, and anything else imaginable that could render the datacenter unavailable. An example of this type of disaster is the inadvertent triggering of a sprinkler system or a sprinkler tank bursting and flooding the datacenter below. It may seem odd, but some datacenters still use water and no other flame prevention system. Use of halon and other gasses can be dangerous to human life and therefore may not be used.
- Building disaster Like datacenter disasters, these disasters cause the building to become untenable. These include loss of power or some form of massive physical destruction. An example of this type of disaster is what happened to the World Trade Center.
- Campus disaster Campus disasters include a host of natural and man-made disasters where destruction is total. An example of this type of disaster is a tornado, which will strike one place and skip others but will render anything in its path rubble.
- Citywide disaster Citywide disasters are campus disasters on a much larger scale. In some cases, the town is the campus (as is the case for larger universities). Examples range from earthquakes, to hurricanes, to atomic bombs.
- Regional disaster Regional disasters include massive power outages similar to the 2003 blackout in the northeastern United States and hurricanes such as Katrina that cover well over 200 miles of coastline.
- National disaster For small countries such as Singapore or Luxembourg, a national disaster is equivalent to a citywide disaster and could equate to a regional disaster. National disasters in larger countries may be unthinkable, but they are not impossible.
- Multinational disaster Again, because most countries touch other countries and there are myriad small countries all connected, this must be a consideration for planning. Tsunamis, earthquakes, and other massive natural disasters are occurring around us. Another option is a massive planned terrorist attack on a single multinational company.
- World disaster This sort of disaster is unthinkable and way out of scope!
Now that the different levels of disasters are defined, a set of tools and skills necessary to recover from each one can be determined. The tools and skills will be specific to ESX and will outline physical, operational, and backup methodologies that will reduce downtime or prevent a disaster:
- Application failure The recovery mechanism for a failed application is to have some form of watchdog that will launch the application anew if it is detected to be down. Multiple VMs running the same application connected to a network load balancer will also help in this situation by reducing the traffic to any one VM, and hence the application, and will remove the application from the list of possible targets if it is down. Many of these types of clusters also come with ways of restarting applications if they are down. Use of shared data disk clustering à la Microsoft clusters is also a possible solution.
- VM failure Recovery from a VM failure can be as simple as rebooting the VM in question via some form of watchdog. However, if the VM dies, it is necessary to determine why the problem occurred, and therefore this type of failure often needs debugging. In this case, the setup of some form of shared data disk cluster à la Microsoft clusters will allow a secondary VM to take over the duties of the failed VM. Any VM failure should be investigated to determine the cause. Another mechanism is to have a secondary VM ready and waiting to take over duties if necessary. If the data of the primary VM is necessary to continue, consider placing the data on a second VMDK and have both VMs point to the second disk. Just make sure that only one of them is booted at any one time. Use DRLB tools to automatically launch this secondary VM if necessary.
- Machine failure Hardware often has issues. To alleviate machine failures, have a second machine running and ready to take on the load of the first machine. Use VMware High Availability (HA) or other high-availability tools to automatically specify a host on which to launch the VMs if a host fails. In addition, if you know the host will fail due to a software or hardware upgrade, first VMotion all the VMs to the secondary host. VMware HA can be set up when you create a VMware cluster or even after the fact. We discussed the creation of VMware clusters in Chapter 11, "Dynamic Resource Load Balancing." VMware HA makes use of the Legato Automated Availability Manager (Legato AAM) suite to manage the ESX Server cluster failover.
- Communication failure Everyone knows that Fibre and network connections fail, so ensure that multiple switches and paths are available for the communications to and from the ESX Server. In addition, make local copies of the most important VMs so that they can be launched using a local disk in the case of a SAN failure. This often requires more local disk for the host and the avoidance of booting from SAN.
- Rack disaster To avoid a rack disaster, make sure racks are on earthquake-proof stands, are locked in place, and perhaps have stabilizers deployed. But also be sure that your ESX host and switches are located in separate racks in different locations on the datacenter floor, so that there is no catastrophic failure and that if a rack does fail, everything can be brought back up on the secondary server.
- Datacenter disaster To avoid datacenter disasters, add more hosts to a secondary datacenter either in the same building or elsewhere on the campus. Often this is referred to as a hot site and requires an investment in new SAN and ESX hosts. Also ensure there are adequate backups to tape secured in a vault. In addition, it is possible with ESX version 3 to VMotion VMs across subnets via routers. In this way, if a datacenter was planned to go down, it would be possible to move running VMs to another datacenter where other hosts reside.
- Building disaster The use of a hot site and offsite tape backups will get around building disasters. Just be sure the hot site is not located in the same building.
- Campus disaster Just like a building disaster, just be sure the other location is off the campus.
- Citywide disaster Similar to campus disasters, just be sure the hot site or backup location is outside the city.
- Regional disaster Similar to campus disasters, just be sure the hot site or backup location is outside the region.
- National disaster Similar to campus disasters, just be sure the hot site or backup location is outside the country, or if the countries are small, in another country far away.
- Multinational disaster Because this could be considered a regional disaster in many cases, see the national DR strategy.
- World disaster We can dream some here and place a datacenter on another astronomical body or space station.
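The watchdog mentioned for application and VM failures in the list above can be sketched in a few lines. This is only a minimal illustration, not a tool that ships with ESX: the names `watchdog`, `probe`, and `relaunch` are hypothetical, and the "service" here is just a flag standing in for a real application or VM. A real probe might test a TCP port or a process ID, and a real relaunch would start the application or power on the VM.

```python
import time
from typing import Callable


def watchdog(is_alive: Callable[[], bool],
             restart: Callable[[], None],
             checks: int,
             interval_s: float = 5.0,
             sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll is_alive() `checks` times and call restart() whenever it reports down.

    Returns the number of restarts performed.
    """
    restarts = 0
    for _ in range(checks):
        if not is_alive():
            restart()
            restarts += 1
        sleep(interval_s)
    return restarts


# Hypothetical stand-in for a real application or VM: a flag that records
# whether the service is up.
service = {"up": False}


def probe() -> bool:
    """Report whether the stand-in service is currently up."""
    return service["up"]


def relaunch() -> None:
    """Bring the stand-in service back up."""
    service["up"] = True


# The sleep is replaced with a no-op so the demonstration finishes at once.
restart_count = watchdog(probe, relaunch, checks=3, sleep=lambda _s: None)
print(restart_count)
```

Passing the probe and restart actions in as functions keeps the polling loop independent of what is being watched, so the same loop could watch an application inside a VM or the VM itself. As the VM failure item notes, a restart does not replace investigating why the failure occurred.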
Now that the actions to take for each disaster are outlined, a list of best practices can be developed to define a DR or BC plan to use. The following list considers an ESX Server from a single host to enterprisewide with the thought of DR and BC in mind. The list covers mainly ESX, not all the other parts to creating a successful and highly redundant network. The list is divided between local practices and remote practices. This way the growth of an implementation can be seen. The idea behind these best practices is to look at our list of possible failures and to have a response to each one and the knowledge that many eggs are being placed into one basket. On average for larger machines, ESX Servers can house 20+ VMs. That is a lot of service that could go down if a disaster happens. First we need to consider the local practices around DR:
- Implement ESX using N+1 hosts where N is the necessary number of hosts to run the VMs required. The extra host is used for DR.
- When racking the hosts, ensure that hosts are stored in different racks in different parts of the datacenter.
- Be sure there are at least two Fibre Channel cards using different PCI buses if possible.
- Be sure there are at least two NIC ports for each network to be attached to the host using different PCI buses if possible.
- When cabling the hosts, ensure that redundant cables go to different switches and that no redundant path uses the same PCI card.
- Be sure that all racks are stabilized.
- Be sure that there is enough cable available so that machines can be fully extended from the rack as necessary.
- Ensure there is enough local disk space to store exported versions of the VMs and to run the most important VMs if necessary.
- Ensure HA is configured so that VMs running on a failed host are automatically started on another host.
- Create DRLB scripts to start VMs locally if SAN connectivity is lost.
- Create DRLB scripts or enable VMware DRS to move VMs when CPU load is too high on a single host.
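The N+1 rule in the first item above can be sanity-checked with simple arithmetic. The sketch below is a rough illustration under stated assumptions: it sizes on memory only, treats capacity as a single pool, and ignores CPU, per-VM placement, and virtualization overhead. The function names and the figures in the comments are hypothetical.

```python
from typing import Sequence


def hosts_to_deploy(hosts_required: int) -> int:
    """N+1: the hosts needed to run the VMs, plus one extra host for DR."""
    return hosts_required + 1


def survives_one_host_failure(host_memory_gb: Sequence[float],
                              vm_memory_gb: Sequence[float]) -> bool:
    """Return True if the remaining hosts can hold all VMs after any one host fails.

    The worst case is losing the largest host, so only that case is checked.
    Assumption: memory is the limiting resource and can be treated as a pool.
    """
    if len(host_memory_gb) < 2:
        # With a single host there is nowhere to restart the VMs.
        return False
    remaining = sum(host_memory_gb) - max(host_memory_gb)
    return remaining >= sum(vm_memory_gb)


# Hypothetical sizing: two 64 GB hosts are needed for four 30 GB VMs,
# so N+1 calls for three hosts.
print(hosts_to_deploy(2))
print(survives_one_host_failure([64, 64, 64], [30, 30, 30, 30]))
print(survives_one_host_failure([64, 64], [30, 30, 30, 30]))
```

A pooled check like this is optimistic, because a large VM may not fit on any single surviving host even when the total is sufficient. It is meant as a first filter before a proper capacity plan.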
Second, we need to consider the remote practices around DR:
- When creating DR backups, ensure there is safe storage for tapes onsite and offsite.
- Follow all the local items listed above at all remote sites.
- Create a list of tasks necessary to be completed if there is a massive site failure. This list should include who does what and the necessary dependencies for each task.
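The task list in the last item, with its owners and dependencies, can be kept in a form that a script can order. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to produce one valid order. The task names and owners are hypothetical examples, not a recommended plan.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Tuple

# Hypothetical site-failure plan: each task names who does it ("owner")
# and which tasks must be finished first ("after").
PLAN: Dict[str, dict] = {
    "declare disaster": {"owner": "DR coordinator", "after": []},
    "retrieve offsite tapes": {"owner": "backup operator",
                               "after": ["declare disaster"]},
    "power on hot-site ESX hosts": {"owner": "ESX administrator",
                                    "after": ["declare disaster"]},
    "restore VMs from backup": {"owner": "ESX administrator",
                                "after": ["retrieve offsite tapes",
                                          "power on hot-site ESX hosts"]},
    "verify applications": {"owner": "application owner",
                            "after": ["restore VMs from backup"]},
}


def ordered_plan(plan: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Return (task, owner) pairs in an order that respects every dependency."""
    for task, info in plan.items():
        for dep in info["after"]:
            if dep not in plan:
                raise ValueError(f"{task!r} depends on unknown task {dep!r}")
    graph = {task: set(info["after"]) for task, info in plan.items()}
    # static_order() raises graphlib.CycleError if the dependencies are circular.
    return [(task, plan[task]["owner"])
            for task in TopologicalSorter(graph).static_order()]


for step, (task, owner) in enumerate(ordered_plan(PLAN), start=1):
    print(f"{step}. {task} ({owner})")
```

Tasks with no dependency between them, such as retrieving tapes and powering on the hot-site hosts, may appear in either order and can be worked in parallel by their owners. A circular dependency is reported as an error, which is worth finding while writing the plan and not during an emergency.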
The suggestions translate into more physical hardware to create a redundant and safe installation of ESX. They also translate into more software and licenses. Before going down the path of hot sites and offsite tape storage, the local DR plan needs to be fully understood from a software perspective, specifically the methods for producing backups, and there are plenty of methods. Some methods adversely impact performance; others do not. Some methods lend themselves to expansion to hot sites, and others will take sneakernets and other mechanisms to get the data from one site to the other.
Backup and Business Continuity
The simplest approach to DR is to make a good backup of everything so that restoration is simplified when the time comes, but backups can happen in two distinctly different ways with ESX. In some cases, some of these suggestions do not make sense because the application in use can govern how things go. As an example, we were asked to look at DR backup for an application with its own built-in DR capabilities, with a DR plan that called for the machine to be reinstalled on new hardware if an issue occurred. The time to redeploy in their current environment was approximately an hour, and it took the same amount of time for a full DR backup through ESX. Because of this, the customer decided not to go with full DR backups.
About the book
VMware ESX Server in the Enterprise: Planning and Securing Virtualization Servers is the definitive, real-world guide to planning, deploying, and managing today's leading virtual infrastructure platform in mission-critical environments. Purchase the book from Prentice Hall.
Reproduced from the book VMware ESX Server in the Enterprise. Copyright 2008, Prentice Hall. Reproduced by permission of Pearson Education, Inc., 800 East 96th Street, Indianapolis, IN 46240. Written permission from Pearson Education, Inc. is required for all other uses.