Shaping a solution provider's ESX DR plan: Disaster types

Solution providers can avoid surprises late in the ESX disaster recovery (DR) game by forming a solid DR plan and knowing how to deal with different disaster types such as chassis failure or rack disaster.

Solution provider’s takeaway: To create a strong disaster recovery (DR) plan, VARs need to read up on all the potential disasters that can occur in their customer’s ESX environment. Read through the types of disasters to prepare for, from application failure to chassis or rack disaster.

Disaster recovery, business continuity and backup
In this chapter, we categorize disasters and provide solutions for each one. You will see that the backup tool to use will not dictate how to perform disaster recovery (DR), but it’s the other way around. In addition to DR, there is the concept of Business Continuity (BC) or the need to keep things running even if a disaster happens. Some of what we discuss in this chapter is BC and not truly DR. However, the two go hand in hand because BC plans are often tied to DR plans and handle a subset of the disasters.

What is the goal of disaster recovery?
The goal of DR is to either prevent or recovery quickly from downtime caused by either man or nature.

What is the goal of business continuity or disaster avoidance?
The goal of BC is to maintain the business functions in the face of possible downtime caused by man or nature.

As you can see, DR and BC are interrelated: DR is intended to prevent or recover from downtime while BC attempts to maintain business function. Virtualization provides us the tools to do each of these. In the following discussion, unless otherwise stated, ESX and ESXi can be used interchangeably.

Disaster types
There are various forms of well-defined disasters and ways to prevent or work around these to meet the defined goal. There is no one way to get around disasters, but knowing they exist is the first step in planning for them. Having a DR or BC plan is the first step toward prevention, implementation, and reduction in downtime. At a conference presentation, I asked a room of 200 customers if any of them had a DR or BC plan. Only two people stated they had a DR or BC plan, which was disconcerting but by no means unexpected.

Best practice for DR and BC: Create a written DR and BC plan.
Writing the DR and BC plan will help immensely, in the case that it is needed, because there will be absolutely no confusion about it in an emergency situation. For one customer, the author was requested to make a DR plan to cover all possible disasters. Never in the customer’s wildest dreams did they think it would need to be used. Unfortunately, the “wildest dream” scenario occurred, and the written DR plan enabled the customer to restore the environment in an orderly fashion extremely quickly. It is in your best interest to have a written DR plan that covers all possible disasters to minimize confusion and reduce downtime when, and not if, a disaster occurs.

Best practice for DR and BC: Plan for failure; do not fail to plan.
Yes, this last best practice sounds like so many other truisms in life, but it is definitely worth considering around DR and BC, because failures will occur with surprising frequency, and it is better to have a plan than everyone running around trying to do everything at once. So what should be in a DR and BC plan? First, we should understand the types of disasters possible and use these as a basis for a DR and BC plan template. Granted, some of the following examples are scary and unthinkable, but they are not improbable. It is suggested that you use the following list and add to it items that are common to your region of the world as a first step to understanding what you may face when you start a DR or BC plan. A customer I consulted for asked for a DR plan, and we did one considering all the following possibilities. When finished, we were told that a regional disaster was not possible and that it did not need to be considered. Unfortunately, Katrina happened, which goes to show that if we can think it up, it is possible. Perhaps a disaster is improbable, but nature is surprising. Disasters take many forms. The following list is undoubtedly not exhaustive, but it includes many different types of potential disasters.

Application failure: An application failure is the sudden death of a necessary application, which can be caused by poorly coded applications and are exploited by denial-of-service (DoS) attacks that force an application to crash.

VM failure: A VM failure could be man-made, by nature, or both. Consider the manmade possibilities such as where a security patch needs to be applied or software is to be added to the VM. By nature could be the failure of the VM due to an OS bug, an unimplemented procedure within the virtualization layer, or an application issue that used up enough resources to cause the VM to crash. In general, VM failures are unrelated to hardware because the virtualization layer removes the hardware from the equation. But it does not remove OS bugs from the equation.

ESX host failure: A machine failure can be man-made, by nature, or even both. For example, a man-made failure could be the planned outage to upgrade firmware, hardware, the ESX OS, or the possible occurrence of a hardware failure of some sort that causes a crash. Another example is if power is inadvertently shut off to the server.

Communication failure: A communication failure is unrelated to ESX, but will affect ESX nonetheless. Communication can be via Fibre Channel, Ethernet, or for such items as VMCI, VMsafe, VIX, and so on via some out-of-band mechanism. The errors could be related to a communication card, cable, switch, or a device at the non-ESX side of the communication. An example of this type of failure is a Fibre or network cable being pulled from the box or a switch is powered off or rebooted.

Chassis failure: Chassis failures can cause either a single host to fail or multiple hosts to fail. As datacenters become denser, more and more blade and other shared hardware chassis come into play. This could simply be a failure of a single fan to the back or mid plane components failing outright. This type of failure could cause many ESX hosts to fail or have a communication failure that could affect more than one host.

Rack disaster: Rack failures are extremely bad and are often caused by the rack being moved around or even toppling over. Not only will such an incident cause failures to the systems or communications, but it could cause physical injury to someone caught by the rack when it topples. Another rack failure could be the removal of power to fans of and around the whole rack, causing a massive overheat situation where all the servers in the rack fail simultaneously.

Datacenter disaster: Datacenter disasters include air conditioning failures that cause overheating, power spikes, lack of power, earthquakes, floods, fire, and anything else imaginable that could render the datacenter unavailable. An example of this type of disaster is the inadvertent triggering of a sprinkler system or a sprinkler tank bursting and flooding the datacenter below. It may seem odd, but some datacenters still use water and no other flame prevention system. Use of halon and other gasses can be dangerous to human life and, therefore, these gasses may not be used.

Building disaster: Like datacenter disasters, these disasters cause the building to become untenable. These include loss of power or some form of massive physical destruction. An example of this type of disaster is what happened to the World Trade Center.

Campus disaster: Campus disasters include a host of natural and man-made disasters where destruction is total. An example of this type of disaster is tornadoes, which may strike one place and skip another but can render to rubble anything in its path.

Citywide disaster: Citywide disasters are campus disasters on a much larger scale. In some cases, the town is the campus (as is the case for larger universities). Examples range from earthquakes, to hurricanes, to atomic bombs.

Regional disaster: Regional disasters include massive power outages similar to the blackout in the New England area in 2003 and hurricanes such as Katrina that cover well over 200 miles of coastline.

National disasters: For small countries, such as Singapore or Luxembourg, a national disaster is equivalent to a citywide disaster and could equate to a regional disaster. National disasters in larger countries may be unthinkable, but it is not impossible.

Multinational disaster: Again, because most countries touch other countries and there are myriad small countries all connected, this must be a consideration for planning. Tsunamis, earthquakes, and other massive natural disasters are occurring around us. Another option is a massive planned terrorist attack on a single multinational company.

World disaster: This sort of disaster is unthinkable and way out of scope!

Disaster recovery methods: Now that the different levels of disasters are defined, a set of tools and skills necessary to recover from each one can be determined. The tools and skills will be specific to ESX and will outline physical, operational, and backup methodologies that will reduce downtime or prevent a disaster:

Application failure: The recovery mechanism for a failed application is to have some form of watchdog that will launch the application anew if it was detected to be down. Multiple VMs running the same application connected to a network load balancer will also help in this situation by reducing the traffic to any one VM, and hence the application, and will remove application from the list of possible targets if it is down. Many of these types of clusters also come with ways of restarting applications if they are down. Use of shared data disk clustering à la Microsoft clusters is also a possible solution.

VM failure: Recovery from a VM failure can be as simple as rebooting the VM in question via some form of watchdog such as VMware HA VM Monitoring or VMware FT. However, if the VM dies, it is often necessary to determine why the problem occurred, and therefore this type of failure often needs debugging. In this case, the setup of VMware FT or some form of shared data disk cluster à la Microsoft clusters will allow a secondary VM to take over the duties of the failed VM. Any VM failure should be investigated to determine the cause. Another mechanism is to have a secondary VM ready and waiting to take over duties if necessary. If the data of the primary VM is necessary to continue, consider placing the data on a second VMDK and have both VMs pointing to the second disk. Just make sure that only one is booted at the same time. Use DRLB tools to automatically launch this secondary VM if necessary.

With VMware FT, this last suggestion may seem unnecessary, but if there is a Guest OS or application failure the shadow VM created by FT may also fail as the primary VM and shadow VM are in vLockStep.

Machine failure: Hardware often has issues. To alleviate machine failures have a second machine running and ready to take on the load of the first machine. Use VMware HA or other high-availability tools to automatically specify a host on which to launch the VMs if a host fails. In addition, if you know the host will fail due to a software or hardware upgrade, first vMotion all the VMs to the secondary host. VMware HA can be set up when you create a VMware cluster or even after the fact. We discussed the creation of VMware clusters in Chapter 11 , “Dynamic Resource Load Balancing.” VMware HA makes use of the Legato Automated Availability Management (Legato AAM) suite to manage the ESX host cluster failover. There is more on HA later in this chapter in the section “Business Continuity.” VMware DPM, used in conjunction with VMware HA and VMware DRS, would enable another machine to act as a hot spare. This would of course require one node (usually the +1 node) to be in a rack, installed, kept updated, and otherwise ready to be used as dictated by VMware DRS.

Communication failure: Everyone knows that Fibre and network connections fail, so ensure that multiple switches and paths are available for the communications to and from the ESX host. In addition, make local copies of the most important VMs so that they can be launched using a local disk in the case of a SAN failure. This often requires more local disk for the host and the avoidance of booting from SAN.

Chassis disaster: To avoid devastating chassis disasters, it is best to divide your most important VMs between multiple chassis but also maintaining enough headroom on all blades within a chassis so that if VMware HA needs to be used, the VMs have a home on a new chassis. In large datacenters, it may be useful to have a hot spare chassis with blades in it waiting to be used via VMware DRS and DPM or one that is ready to accept blades.

Rack disaster: To avoid a rack disaster, make sure racks are on earthquake-proof stands, are locked in place, and perhaps have stabilizers deployed. But also be sure that your ESX hosts and switches are divided and placed into separate racks in different locations on the datacenter floor, so that there is no catastrophic failure and that if a rack does fail, everything can be brought back up on the other rack.

Datacenter disaster: To avoid datacenter disasters, add more hosts to a secondary datacenter either in the same building or elsewhere on the campus. Often this is referred to as a hot site and requires an investment in new SAN and ESX hosts. Also ensure there are adequate backups to tape secured in a vault. In addition, it is possible with ESX version 3 to vMotion VMs across subnets via routers. In this way, if a datacenter was planned to go down, it would be possible to move running VMs to another datacenter where other hosts reside.

VMware Site Recovery Manager (SRM) is one tool that can be used to maintain a hot site as could Veeam Backup or Vizioncore vReplicator. 

EMC’s VPLEX technology could also be used to maintain consistent writes between two different datacenters no more than 60km away. VPLEX offers the capability to maintain a complete synchronous backup of data on two different and distinct storage subsystems. EMC VPLEX with vTeleport could even move VMs from datacenter to datacenter as needed. Granted, as we discussed in Chapter 5 , “Storage with ESX,” use of vTeleport (long distance vMotion) requires a stretched Layer-2 network between the datacenters.

Building disaster: The use of a hot site and offsite tape backups will get around building disasters. Just be sure the hot site is not located in the same building. EMC VPLEX and vTeleport would also allow for a solid BC by maintaining both datacenters in a synchronous model.

Campus disaster: Just like a building disaster, just be sure the other location is off the campus.

Citywide disaster: Similar to campus disasters, just be sure the hot site or backup location is outside the city.

Regional disaster: Similar to campus disasters, just be sure the hot site or backup location is outside the region.

National disasters: Similar to campus disasters, just be sure the hot site or backup location is outside the country, or if the countries are small, in another country far away.

Multinational disasters: Because this could be considered a regional disaster in many cases, see the national DR strategy.

World disasters: We can dream some here and place a datacenter on another astronomical body or space station.

The major tools to use for DR and BC follow:

• Application Monitoring

• VMware Fault Tolerance or VMware HA VM Monitoring

• VMware HA, DRS, and DPM all working together with the use of hot

spare systems or chassis

• VMware SRM, Veeam Backup, PhD Virtual Backup, or Vizioncore vReplicator

• EMC VPLEX with vTeleport

Disaster recovery best practices
Now that the actions to take for each disaster are outlined, a list of best practices can be developed to define a DR or BC plan to use. The following list considers an ESX host, from a single host to enterprisewide, with DR and BC in mind. The list covers mainly ESX, not all the other parts to creating a successful and highly redundant network. The list is divided between local practices and remote practices. This way the growth of an implementation can be seen. The idea behind these best practices is to look at our list of possible failures and to have a response to each one and to know that many eggs are being placed into one basket. On average for larger machines, ESX hosts can house 20+ VMs. That is a lot of service that could go down if a disaster happens.

First, we need to consider the local practices around DR:

• Implement ESX using N+1 hosts where N is the necessary number of hosts to run the VMs required. The extra host is used for DR.

• When racking the hosts, ensure that hosts are stored in different racks in different parts of the datacenter.

• Be sure there are at least two Fibre Channel (FC) cards, if employing FC SAN, using different PCI buses if possible.

• Be sure there are at least two NIC ports for each network to be attached to the host using different PCI buses if possible.

• When cabling the hosts, ensure that redundant cables go to different switches and that no redundant path uses the same PCI card.

• Be sure that all racks are stabilized.

• Be sure that there is enough cable available so that machines can be fully extended from the rack as necessary.

• Ensure there is enough local disk space to store exported versions of the VMs and to run the most important VMs if necessary.

• Ensure HA is configured so that VMs running on a failed host are automatically started on another host.

• Use storage replication (VPLEX, SRM, and the like) to ensure SANs are redundant, either within the same datacenter or across datacenters.

• Create DRLB scripts to start VMs locally if SAN connectivity is lost.

• Create DRLB scripts or enable VMware DRS to move VMs when all resources loads are too high on a single host. Second, we need to consider the remote practices around DR:

• When creating DR backups, ensure there is safe storage for tapes onsite and offsite.

• Follow all the local items, listed previously, at any remote sites.

• Create a list of tasks necessary to be completed if there is a massive site failure. This list should include who does what and the necessary dependencies for each task.

The suggestions translate into more physical hardware to create a redundant and safe installation of ESX. It also translates into more software and licenses, too. Before going down the path of hot sites and offsite tape storage, the local DR plan needs to be fully understood from a software perspective, specifically the methods for producing backups, and there are plenty of methods. Some methods adversely impact performance; others do not. Some methods and security controls lend themselves to expansion to hot sites, and others will take sneaker nets and other mechanisms to get the data from one site to the other.

About the author
Edward L. Haletky is also the author of VMware vSphere and Virtual Infrastructure Security: Securing the Virtual Environment. Edward owns AstroArch Consulting, Inc., where he provides virtualization, security, and network consulting.

Printed with permission from Pearson Publishing. Copyright 2011. VMware ESX and ESXi in the Enterprise: Planning Deployment of Virtualization Servers (2nd Edition) by Edward L. Haletky. For more information about this title and other similar books, please visit

Dig Deeper on Server management, sales and installation

Start the conversation

Send me notifications when other members comment.

Please create a username to comment.