Implementing VMware High Availability
High availability is an industry buzzword that has stood the test of time. The need or desire for high availability is often a significant component of any infrastructure design. Within the scope of an ESX/ESXi host, VMware High Availability (HA) is a component of the vSphere 4 product that provides for the automatic failover of virtual machines. But (and it's a big but at this point in time) HA does not provide high availability in the traditional sense of the term. Commonly, HA means the automatic failover of a service or application to another server.
The VMware HA feature provides an automatic restart of the virtual machines that were running on an ESX/ESXi host at the time it became unavailable, as shown in Figure 11.15.
Figure 11.15: VMware HA provides an automatic restart of virtual machines that were running on an ESX/ESXi host when it failed.
In the case of VMware HA, there is still a period of downtime when a server fails. Unfortunately, the duration of the downtime cannot be calculated in advance because it is unknown how long it will take to boot a series of virtual machines. From this you can gather that, at this point in time, VMware HA does not provide the same level of high availability as a Microsoft server cluster solution. When a failover occurs between ESX/ESXi hosts as a result of the HA feature, there is potential for data loss because each virtual machine is abruptly powered off when the server fails and is then brought back up minutes later on another server.
HA Experience in the Field
With that said, I want to mention my own personal experience with HA and the results I encountered. Your mileage might vary, but this should give you a reasonable idea of what to expect. I had a VMware ESX/ESXi host that was a member of a five-node cluster. This node crashed sometime during the night, and when the host went down, it took anywhere from 15 to 20 virtual machines with it. HA kicked in and restarted all the virtual machines as expected.
What made this an interesting experience is that the crash must have happened right after the monitoring and alerting server had polled. All the virtual machines that were on the general alerting schedule were restarted without triggering any alerts. Some of those virtual machines were under more aggressive monitoring and did trip alerts, but they had recovered before anyone was able to log on to the system and investigate. I tried to argue the point: if an alert never fired, did the downtime really happen? I did not get too far with that argument but was pleased with the results.
In another case, during testing I had a virtual machine running on a two-node cluster. I pulled the power cords on the host that the virtual machine was running on to create the failure. My time to recovery from pull to ping was between five and six minutes. That's not too bad for general use but not good enough for everything. VMware Fault Tolerance can now fill that gap for even the most important and critical servers in your environment. I'll talk more about FT in a bit.
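A pull-to-ping figure like the one above can be worked out from any log of timestamped reachability probes. The following is a minimal sketch, not a tool from the original test: the function name `recovery_time` and the sample probe log are hypothetical, and the result can be no more precise than the probe interval.

```python
from typing import List, Optional, Tuple

# One probe result: (seconds since the test started, did the VM answer?)
Probe = Tuple[float, bool]


def recovery_time(probes: List[Probe]) -> Optional[float]:
    """Return the seconds between the first failed probe and the next success.

    Returns None if the target never failed or never came back.
    """
    first_down: Optional[float] = None
    for timestamp, answered in probes:
        if not answered and first_down is None:
            first_down = timestamp          # outage begins here
        elif answered and first_down is not None:
            return timestamp - first_down   # first answer after the outage
    return None


# Hypothetical probe log with one probe every 30 seconds.
sample_log = [(0.0, True), (30.0, False), (60.0, False), (330.0, False), (360.0, True)]
print(recovery_time(sample_log))
```

With the sample log, the first missed probe is at 30 seconds and the first answer after it is at 360 seconds, so the measured recovery time is 330 seconds, in the same five-to-six-minute range as the test described above.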
In the VMware HA scenario, two or more ESX/ESXi hosts are configured in a cluster. Remember, a VMware cluster represents a logical aggregation of CPU and memory resources, as shown in Figure 11.16. By editing the cluster settings, you can enable the VMware HA feature for a cluster. The HA cluster then determines the number of host failures it must support.
Figure 11.16: A VMware ESX/ESXi host cluster logically aggregates the CPU and memory resources from all nodes in the cluster.
HA: Within, but Not Between, Sites
A requisite of HA is that each node in the HA cluster must have access to the same SAN LUNs. This requirement prevents HA from being able to fail over between ESX/ESXi hosts in different locations unless both locations have been configured to have access to the same storage devices. It is not sufficient for the LUNs merely to contain the same data because of SAN replication software. Mirroring data from a LUN on a SAN in one location to a LUN on a SAN in a hot site does not satisfy the shared-storage requirement for HA (or for VMotion or DRS).
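The shared-storage requirement can be checked mechanically once you have a list of the LUNs each host can see. The sketch below is illustrative only: the function name `luns_missing_per_host`, the host names, and the LUN identifiers are hypothetical, and how you collect the per-host LUN list is left open.

```python
from typing import Dict, Set


def luns_missing_per_host(host_luns: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each host, report LUNs that some other host sees but this one does not.

    An empty result means every host sees exactly the same set of LUNs.
    """
    # Union of every LUN identifier seen by any host in the cluster.
    all_luns: Set[str] = set().union(*host_luns.values())
    missing = {}
    for host, luns in host_luns.items():
        gap = all_luns - luns
        if gap:
            missing[host] = gap
    return missing


# Hypothetical inventory: esx03 cannot see lun-b.
inventory = {
    "esx01": {"lun-a", "lun-b"},
    "esx02": {"lun-a", "lun-b"},
    "esx03": {"lun-a"},
}
print(luns_missing_per_host(inventory))
```

Note that the comparison is on LUN identity, not contents. A replicated copy at a second site is a different LUN with a different identifier, which is exactly the case the sidebar warns about.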
When ESX/ESXi hosts are configured into a VMware HA cluster, they receive all the cluster information. vCenter Server informs each node in the HA cluster about the cluster configuration.
HA and vCenter Server
Although vCenter Server is most certainly required to enable and manage VMware HA, it is not required to execute HA. vCenter Server is a tool that notifies each VMware HA cluster node about the HA configuration. After the nodes have been updated with the information about the cluster, vCenter Server no longer maintains a persistent connection with each node. Each node continues to function as a member of the HA cluster independent of its communication status with vCenter Server.
When an ESX/ESXi host is added to a VMware HA cluster, a set of HA-specific components is installed on the ESX/ESXi host. These components, shown in Figure 11.17, include the following:
- Automatic Availability Manager (AAM)
- Vmap
- vpxa
Figure 11.17: Adding an ESX/ESXi host to an HA cluster automatically installs the AAM, Vmap, and possibly the vpxa components on the host.
The AAM, effectively the engine or service for HA, is a Legato-based component that keeps an internal database of the other nodes in the cluster. The AAM is responsible for the intracluster heartbeat used to identify available and unavailable nodes. Each node in the cluster establishes a heartbeat with each of the other nodes over the Service Console network, or you can use or define another VMkernel port group for the HA heartbeat. As a best practice, you should provide redundancy to the AAM heartbeat by establishing the Service Console port group on a virtual switch with an underlying NIC team. Though the Service Console could be multihomed and have an AAM heartbeat over two different networks, this configuration is not as reliable as the NIC team. The AAM is extremely sensitive to hostname resolution; the inability to resolve names will most certainly result in an inability to execute HA. When problems arise with HA functionality, look first at hostname resolution. Having said that, during HA troubleshooting, you should identify the answers to questions such as these:
- Is the DNS server configuration correct?
- Is the DNS server available?
- If DNS is on a remote subnet, is the default gateway correct and functional?
- Does the /etc/hosts file have bad entries in it?
- Does the /etc/resolv.conf have the right search suffix?
- Does the /etc/resolv.conf have the right DNS server?
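The first two questions in the checklist above can be answered quickly by trying to resolve every cluster node's name from a machine that uses the same DNS servers. This is a minimal sketch assuming Python 3 on an administrative workstation; the function names `check_resolution` and `unresolved` are hypothetical, and only the standard library call `socket.gethostbyname` is used for the lookup.

```python
import socket
from typing import Callable, Dict, Iterable, List, Optional


def check_resolution(
    hostnames: Iterable[str],
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, Optional[str]]:
    """Map each hostname to its resolved IPv4 address, or None if lookup fails."""
    results: Dict[str, Optional[str]] = {}
    for name in hostnames:
        try:
            results[name] = resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results


def unresolved(results: Dict[str, Optional[str]]) -> List[str]:
    """Return the hostnames that failed to resolve, sorted for stable output."""
    return sorted(name for name, addr in results.items() if addr is None)
```

Calling `unresolved(check_resolution(["esx01.example.com", "esx02.example.com"]))` with your own node names returns the names that need attention. The `resolver` parameter exists only so the check can be exercised without a live DNS server.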
Adding a Host to vCenter Server
When a new host is added to the vCenter Server inventory, the host must be added by its hostname, or HA will not function properly. As just noted, HA is heavily reliant on successful name resolution. ESX/ESXi hosts should not be added to the vCenter Server inventory using IP addresses.
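If you have an export of the host names as they appear in the inventory, spotting hosts that were added by IP address is a simple check. The sketch below assumes a plain list of inventory names; the function name `hosts_added_by_ip` and the sample names are hypothetical.

```python
import ipaddress
from typing import Iterable, List


def hosts_added_by_ip(inventory_names: Iterable[str]) -> List[str]:
    """Return the inventory entries that are IP addresses rather than hostnames."""
    flagged = []
    for name in inventory_names:
        try:
            ipaddress.ip_address(name)  # raises ValueError for anything that is not an IP
        except ValueError:
            continue                    # a hostname, which is what HA expects
        flagged.append(name)
    return flagged


# Hypothetical inventory export.
print(hosts_added_by_ip(["esx01.example.com", "10.0.0.12", "esx02"]))
```

Any entry the function returns is a candidate for removal and re-adding by hostname.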
The AAM on each ESX/ESXi host keeps an internal database of the other hosts belonging to the cluster. All hosts in a cluster are considered either a primary host or a secondary host. However, only one ESX/ESXi host in the cluster is considered the primary host at a given time, with all others considered secondary hosts. The primary host functions as the source of information for all new hosts and defaults to the first host added to the cluster. If the primary host experiences failure, the HA cluster will continue to function. In fact, in the event of primary host failure, one of the secondary hosts will move up to the status of primary host. Promotion of secondary hosts to primary host is limited to four other hosts, so only five hosts in total can assume the role of primary host in an HA cluster.
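The promotion behavior described above can be pictured with a toy model. This is not how the AAM is implemented. It is a sketch that encodes only what the paragraph states: one primary at a time, the first host added as the default, and a limit of five hosts that can ever hold the role. The class name `HaClusterModel` is hypothetical, and treating the first five hosts added as the eligible ones is an assumption made for illustration.

```python
from typing import List, Optional, Set

MAX_PRIMARY_CANDIDATES = 5  # from the text: only five hosts can assume the primary role


class HaClusterModel:
    """Toy model of primary/secondary roles; not the actual AAM logic."""

    def __init__(self) -> None:
        self.hosts: List[str] = []   # hosts in the order they were added
        self.failed: Set[str] = set()

    def add_host(self, name: str) -> None:
        self.hosts.append(name)

    def fail_host(self, name: str) -> None:
        self.failed.add(name)

    def primary(self) -> Optional[str]:
        """Return the current primary, or None if no eligible host is left.

        Assumption: the eligible hosts are the first five that were added.
        """
        for host in self.hosts[:MAX_PRIMARY_CANDIDATES]:
            if host not in self.failed:
                return host
        return None


cluster = HaClusterModel()
for n in range(1, 8):
    cluster.add_host(f"esx{n:02d}")

print(cluster.primary())      # the first host added
cluster.fail_host("esx01")
print(cluster.primary())      # a secondary has been promoted
```

In this model, once all five eligible hosts have failed, `primary()` returns None even though esx06 and esx07 are still running, which illustrates why the five-host limit matters when you plan how many failures a large cluster must tolerate.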
While the AAM is busy managing the internode communications, the vpxa service (or vCenter Server agent) manages the HA components. The vpxa service communicates with the AAM through a third component called the Vmap.
Name Resolution Tip
If DNS is set up and configured correctly, then you should not need anything else for name resolution. However, as a method of redundancy, consider adding entries for the other VMware ESX hosts and the vCenter Server to the local hosts file (/etc/hosts). If there is a failure and the ESX/ESXi host is unable to reach DNS, this setup helps ensure that HA still works as designed.
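Keeping those /etc/hosts entries identical on every node is easier if they are generated from a single list. The sketch below is one way to do that, under the assumption that you maintain a mapping of fully qualified names to addresses; the function name `hosts_file_lines`, the names, and the addresses (taken from the documentation ranges) are all hypothetical.

```python
from typing import Dict


def hosts_file_lines(addresses: Dict[str, str]) -> str:
    """Build /etc/hosts lines of the form: IP, canonical name, short alias."""
    lines = []
    for fqdn, ip in sorted(addresses.items()):
        short = fqdn.split(".", 1)[0]
        # Add the short name as an alias only when it differs from the full name.
        names = fqdn if short == fqdn else f"{fqdn} {short}"
        lines.append(f"{ip}\t{names}")
    return "\n".join(lines)


# Hypothetical cluster nodes and vCenter Server.
nodes = {
    "esx01.example.com": "192.0.2.11",
    "esx02.example.com": "192.0.2.12",
    "vcenter.example.com": "192.0.2.10",
}
print(hosts_file_lines(nodes))
```

The output can be appended to the hosts file on each node. Review it before applying it, because a stale or wrong entry in /etc/hosts is itself one of the problems listed in the troubleshooting questions earlier.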
Printed with permission from Wiley Publishing Inc. Copyright 2009. Mastering VMware vSphere 4 by Scott Lowe. For more information about this title and other similar books, please visit Wiley Publishing.
This was first published in July 2010