Exploring vSphere HA features for failure protection

vSphere HA provides protection for areas in your customer's virtual environment where failures may occur, such as guest OSes and the network.

Solution provider takeaway: Ensuring that your customer's virtual environment is shielded from an array of potential failures is critical to providing business continuity and reliability. See what these vSphere HA features can do to defend against failures in a customer's guest OS or network.

Many solution providers are leery of having high consolidation ratios and scaling up because a host or infrastructure failure can affect a large number of virtual machines (VMs). Availability and continuity are critical in vSphere to protect against a failure in your customer's virtual environment that can take down a great number of VMs.

There are a variety of potential failure points in any virtual environment. To ensure maximum availability, eliminate single points of failure and leverage some of vSphere's advanced features. The failure points in vSphere include the host server, data center, network connections, storage devices and VM/guest OSes. This tip explains how you can take advantage of these features and configure your customer's environment properly to protect against any type of failure.

Protecting against a VM guest OS failure with vSphere HA

A sub-feature of High Availability (HA) called Virtual Machine Monitoring (VMM) may have slipped under your radar. It relies on VMware Tools, which sends a heartbeat every second from inside the guest OS. The host monitors that heartbeat to confirm that the VM is functioning properly.

If a major failure occurs within the OS, such as a Windows Blue Screen of Death (BSOD), VMware Tools will no longer send heartbeats. If the host doesn't detect a heartbeat after a predetermined amount of time -- such as 30 seconds -- the host assumes the VM had an OS failure and restarts the VM on the same host.

Figure 1: Solution providers have the option to change the VMM settings.

Users found early on that well-functioning VMs occasionally stopped sending heartbeats, which resulted in unnecessary VM resets. To avoid this scenario, VMM was enhanced to also check for network or disk I/O activity on the VM. The VM is restarted only if its heartbeats have stopped and the I/O statistics show no activity over a two-minute period. You can change this interval with the HA advanced setting das.iostatsInterval: edit the cluster settings, select VMware HA and then click the Advanced Options button.
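The reset decision described above can be sketched in a few lines of Python. This is only an illustrative model, not VMware's implementation: the VmStatus class and the should_reset function are hypothetical names, and the defaults simply mirror the 30-second heartbeat example and the two-minute I/O interval mentioned in this tip.

```python
from dataclasses import dataclass


@dataclass
class VmStatus:
    # Hypothetical snapshot of what the host has observed for one VM.
    seconds_since_heartbeat: float  # time since the last VMware Tools heartbeat
    seconds_since_io: float         # time since the last disk or network I/O


def should_reset(vm, failure_interval=30.0, iostats_interval=120.0):
    """Reset only when heartbeats have stopped AND the VM shows no I/O."""
    heartbeat_lost = vm.seconds_since_heartbeat >= failure_interval
    io_idle = vm.seconds_since_io >= iostats_interval
    return heartbeat_lost and io_idle


# A crashed guest: no heartbeat and no I/O for longer than both intervals.
print(should_reset(VmStatus(seconds_since_heartbeat=45, seconds_since_io=150)))  # True
# A healthy guest whose heartbeats stalled but that is still doing I/O.
print(should_reset(VmStatus(seconds_since_heartbeat=45, seconds_since_io=10)))   # False
```

Requiring both conditions is what prevents the unnecessary resets that users saw with the heartbeat-only check.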

Guarding against a network failure

Use multiple network adapters in your vSwitches to protect the network connections between the vSwitch on a host and the physical network switch. Having at least two physical network interface controllers (NICs) in a vSwitch ensures that network connectivity stays up in case of a failure between one NIC and the physical switch.

If you're using multi-port NICs to assign multiple NICs to a vSwitch, be sure to assign NICs on separate physical cards. This ensures that the vSwitch has continuous network connectivity in the event that a whole multi-port NIC fails. Solution providers also need to connect the NICs to separate physical switches so that at least one NIC stays up in case of a switch failure in one of the paths. This redundancy is especially crucial for vSwitches that carry the Service Console and VMkernel networks, which must remain up.
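One way to audit these rules is to write down which physical card and which physical switch each vSwitch uplink uses, and then check for shared components. The sketch below works on hand-entered data only. The Uplink class, the redundancy_gaps function and the card and switch labels are all hypothetical, and nothing here queries a real host.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uplink:
    name: str             # NIC name, for example "vmnic0"
    physical_card: str    # hypothetical label for the adapter card the port sits on
    physical_switch: str  # hypothetical label for the upstream physical switch


def redundancy_gaps(uplinks):
    """Return a list of single points of failure found in a vSwitch's uplinks."""
    gaps = []
    if len(uplinks) < 2:
        gaps.append("fewer than two NICs")
    if len({u.physical_card for u in uplinks}) < 2:
        gaps.append("all NICs share one physical card")
    if len({u.physical_switch for u in uplinks}) < 2:
        gaps.append("all NICs connect to one physical switch")
    return gaps


# Two ports on the same multi-port card, cabled to different switches.
same_card = [
    Uplink("vmnic0", "card-A", "switch-1"),
    Uplink("vmnic1", "card-A", "switch-2"),
]
print(redundancy_gaps(same_card))  # ['all NICs share one physical card']
```

An empty result means the vSwitch survives the loss of any single NIC, card or switch in this simplified model.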

In addition to having redundant paths for multiple NICs in a vSwitch, you can also benefit from the load-balancing feature to ensure that network I/O is spread across multiple NICs. Set this by going to the vSwitch properties and editing the vSwitch object that appears on the Ports tab.

On the NIC Teaming tab, set the type of load-balancing policy that you want to use as well as additional failover options. You can also override these settings for an individual port group: edit the port group and, on its NIC Teaming tab, check the box next to the field that you want to change and select a new option. Another configuration option when there are limited NICs is to put the Service Console port group and VMkernel port group on the same vSwitch with two NICs assigned to it.

Figure 2: Sort through the different vSwitch configuration options.

Both NICs will be active by default, which means traffic for the Service Console and VMkernel will be intermixed. Giving each port group its own active NIC, with the other NIC as a backup, is much more desirable for solution providers than having the traffic mixed together.

To fix this, edit each port group and, on the NIC Teaming tab, select the option to override the vSwitch failover order. Then set one NIC to Active and the other to Standby, as shown below:

  • VMkernel: vmnic0 – Active, vmnic1 – Standby
  • Service Console: vmnic1 – Active, vmnic0 – Standby

Figure 3: You can change port groups to Active or Standby in the NIC Teaming tab.

The advantage of doing this is that both the VMkernel and Service Console port groups have their own dedicated NIC. If a NIC failure occurs, the remaining NIC becomes active for both port groups.
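The Active/Standby behavior can be modeled as a simple ordered lookup. The following Python sketch is a simplified illustration rather than how the vSwitch is implemented. The FAILOVER_ORDER table copies the two port groups listed above, and the nic_in_use function is a hypothetical helper.

```python
# Failover order per port group: first entry is Active, second is Standby.
FAILOVER_ORDER = {
    "VMkernel": ["vmnic0", "vmnic1"],
    "Service Console": ["vmnic1", "vmnic0"],
}


def nic_in_use(port_group, failed=frozenset()):
    """Return the first NIC in the port group's failover order that is still up."""
    for nic in FAILOVER_ORDER[port_group]:
        if nic not in failed:
            return nic
    return None  # every NIC in the order has failed


# Normal operation: each port group has a dedicated NIC.
print(nic_in_use("VMkernel"))         # vmnic0
print(nic_in_use("Service Console"))  # vmnic1

# vmnic0 fails: both port groups now share vmnic1.
print(nic_in_use("VMkernel", failed={"vmnic0"}))         # vmnic1
print(nic_in_use("Service Console", failed={"vmnic0"}))  # vmnic1
```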

How to avoid a storage failure

Most environments use shared storage that requires either a network or Fibre Channel connection to a switch that connects to the storage device. As with networking, multiple paths should be available to your storage device. You should have either two NICs (NFS and software iSCSI) or two storage adapters (Fibre Channel and hardware iSCSI) that connect to different switches. The target storage device should also have a connection to each switch.

NFS and software iSCSI use the VMkernel network. You can create a separate vSwitch for this traffic with two VMkernel port groups and assign one NIC to each port group. To do so, set the vSwitch failover order for each port group to active/unused, making sure that each port group has a different active NIC.

For Fibre Channel and hardware iSCSI, if you have two storage adapters configured, multipathing is set up by default, but only one NIC or storage adapter will be active at one time. To gain maximum efficiency, change the multipath options so that both are active. You can accomplish this by opening the properties of a Virtual Machine File System (VMFS) volume, selecting Manage Paths and then changing the path selection policy from Fixed to Round Robin so you have two active (I/O) paths.

Figure 4: Manage Fibre Channel and the hardware iSCSI paths to help customers achieve greater efficiency.
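To see why the path selection policy matters, the sketch below counts how many I/Os land on each path under a fixed policy and under a round robin policy. It is a deliberately simplified model: the path names are made up for illustration, the helper functions are hypothetical, and the rotation happens on every I/O, which may not match the switching granularity of the real policy.

```python
from itertools import cycle


def fixed_policy(preferred):
    """Always send I/O down the single preferred path."""
    while True:
        yield preferred


def round_robin_policy(paths):
    """Rotate through all available paths in turn."""
    return cycle(paths)


def count_ios(policy, n):
    """Issue n I/Os through a policy and count how many each path carried."""
    counts = {}
    for _ in range(n):
        path = next(policy)
        counts[path] = counts.get(path, 0) + 1
    return counts


# Illustrative path names only.
paths = ["vmhba1:C0:T0:L0", "vmhba2:C0:T0:L0"]

print(count_ios(fixed_policy(paths[0]), 6))
# {'vmhba1:C0:T0:L0': 6}
print(count_ios(round_robin_policy(paths), 6))
# {'vmhba1:C0:T0:L0': 3, 'vmhba2:C0:T0:L0': 3}
```

With the fixed policy the second adapter sits idle until a failure, while round robin spreads the load across both.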

Avoiding a host failure

To protect against host failures, solution providers can use either the HA or Fault Tolerance features, which are available in some editions of vSphere. HA offers high availability for VMs running on a host. In the event of a host failure, all the VMs go down in a crash-consistent state and are powered up on other hosts in the cluster.

The average downtime for a VM is a few minutes. HA uses a heartbeat concept similar to the one in the VMM feature, except that the heartbeats are exchanged between the hosts in the cluster. If a host's heartbeats are missing for 15 seconds, its VMs will power up on other hosts.
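The heartbeat timeout can be illustrated with a small Python sketch. It only models the detection step under the 15-second threshold mentioned above. The failed_hosts function and the host names are hypothetical, and the real HA agent logic is considerably more involved.

```python
def failed_hosts(last_heartbeat, now, timeout=15.0):
    """Return the hosts whose last heartbeat is at least `timeout` seconds old.

    last_heartbeat maps a host name to the time (in seconds) its most
    recent heartbeat was seen.
    """
    return sorted(
        host for host, seen in last_heartbeat.items() if now - seen >= timeout
    )


# Hypothetical cluster observed at t = 100 seconds.
heartbeats = {"esx01": 100.0, "esx02": 88.0, "esx03": 84.0}
print(failed_hosts(heartbeats, now=100.0))  # ['esx03']
```

In this example esx02 has been silent for 12 seconds and is still considered alive, while esx03 has passed the 15-second mark, so its VMs would be candidates for restart on other hosts.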

The Fault Tolerance feature takes HA to the next level and provides continuous availability for a VM in case of a host failure, resulting in zero downtime. Having no downtime is a product of keeping a secondary VM copy running on another host server. In the case of a host failure, that VM then becomes the primary VM and a new secondary is created on another functional host.

The primary VM and secondary VM stay in sync with each other by using a technology called Record/Replay, which works by recording the computer execution on a VM and saving it into a log file. It can then take that recorded information and replay it on another VM to have a replica copy that is a duplicate of the original VM.

Figure 5: Keep your customer's VMs in sync with Record/Replay.
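The idea behind Record/Replay can be shown with a toy example: if a machine's state changes only in response to its inputs, then logging those inputs on the primary and replaying them on a secondary produces an identical copy. The Python sketch below illustrates that principle only. ToyVM, run_primary and replay are hypothetical names, and the sketch says nothing about how VMware actually captures or ships the log.

```python
import random


class ToyVM:
    """A toy deterministic machine: its state depends only on the inputs applied."""

    def __init__(self):
        self.state = 0

    def apply(self, event):
        self.state = (self.state * 31 + event) % 1_000_003


def run_primary(vm, n_events, log, rng):
    """Run the primary with unpredictable inputs, recording each one in the log."""
    for _ in range(n_events):
        event = rng.randint(0, 255)  # stands in for a nondeterministic input
        log.append(event)            # record
        vm.apply(event)


def replay(vm, log):
    """Feed the recorded inputs to another VM in the same order."""
    for event in log:
        vm.apply(event)


primary, secondary, log = ToyVM(), ToyVM(), []
run_primary(primary, 100, log, random.Random())
replay(secondary, log)
print(primary.state == secondary.state)  # True
```

Only the inputs are transferred, not the full machine state, which is what makes keeping a running duplicate practical.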

Although Fault Tolerance provides continuous availability for a VM, it does have many requirements and limitations. The biggest obstacle may be that it can be used only on VMs that have a single vCPU. To see a list of all the requirements and limitations, look at VMware's Availability Guide.

How SRM can help keep the data center safe

These vSphere features protect against certain components failing in your data center, but what if a major event occurs that affects your customer's whole data center?

To avoid that type of disaster, use a replication product to ensure that your data is continually replicated to an off-site data center. You can replicate at the VM level or at the storage array level. VMware has a product called Site Recovery Manager (SRM) that provides an automated solution for failover of virtual environments to a recovery site.

SRM can help you create recovery plans using vCenter Server, extend recovery plans with custom scripts, perform non-disruptive testing, use a single command to automate recovery plan execution and reconfigure VM networking at the recovery site. By itself, SRM is not a complete solution for disaster recovery. It relies on a supported third-party storage replication application to handle the replication of VM data to a recovery site.

Figure 6: Execute precise recovery plans by using SRM.

VMware works closely with storage vendors to certify that SRM supports and integrates with their storage arrays. SRM provides a nice front-end application that both integrates storage replication with virtualization and automates recovery failover in VMware environments. Although SRM is a great solution for protecting a whole virtual environment, it can be expensive to implement. Products like Veeam Backup & Replication, Vizioncore vReplicator or Double-Take Availability can also implement replication at the VM level while keeping costs down.

VMotion and Storage VMotion are not really considered true HA features, but they can definitely help with HA by allowing live migration of VMs to other hosts or storage devices when host or storage maintenance is performed. VMotion and Storage VMotion are proactive features that can help prevent VM downtime during maintenance or a pending failure, while other HA features are reactive to situations that may occur in your customer's environment.

In your customer's virtual environment, a single failure of any type can have a big effect. Know your options in small environments, where HA may be good enough protection, as well as in larger environments that require SRM to ensure continuous availability.

About the expert

Eric Siebert is a 25-year IT veteran whose primary focus is VMware virtualization and Windows server administration. He is one of the 300 vExperts named by VMware Inc. for 2009. He is the author of the book VI3 Implementation and Administration and a frequent TechTarget contributor. In addition, he maintains vSphere-land.com, a VMware information site.

