By Stephen J. Bigelow, Senior Technology Writer
Every modern enterprise must prepare for disaster -- be it a natural disaster, such as flooding or fire, or a terrorist attack. Disaster recovery (DR) planning is the process of deciding what data the business needs most in order to continue operations in the case of a disaster. This normally involves moving corporate data to remote locations that can restore the data later -- or even continue day-to-day operations independently. This puts tremendous importance on proper network planning and implementation.
Unfortunately, most businesses are unprepared to handle DR, so solution providers can often help their clients make the best choices and implement the most beneficial network infrastructure to accommodate the client's disaster recovery needs. Ultimately, successful DR preparedness relies on proper site planning, software tools and wide area network (WAN) connectivity. This first installment of our Hot Spot Tutorial introduces critical WAN issues and site planning considerations related to network disaster recovery.
Site selection and the wide area network
The underlying technologies for disaster recovery are readily available. While they present a sales opportunity for you, clients are more likely to need help with their DR strategy, particularly in identifying their data recovery needs and planning accordingly.
"We've found many orgs think DR is nothing more than putting one more server in another data center, and when the primary fails, they somehow bring up the remote system," said Rand Morimoto, president of Convergent Computing, a network solution provider in Oakland, Calif. "Sounds great until you actually try it and realize users cannot easily connect to this new server."
The first consideration in any DR plan is a careful assessment of the disasters that your client is planning for. This crucial consideration will influence the performance requirements -- as well as the cost -- of your client's WAN and remote site, yet it's often overlooked.
"Disaster recovery does involve quite a gamut of potential problems," said Dave Sobel, CEO of Evolve Technologies, a solution provider located in Fairfax, Va. "We look to solve the worst case and work back from that."
Some organizations simply need to protect themselves against storage failures or small local disasters like fires, and can economically replicate data to a nearby facility down the street or across the city. Other organizations face the prospect of a broader regional disaster like a flood or an earthquake, and must shepherd data to other geographic locations beyond the danger zone. Still other organizations are global, and may choose to preserve their data at backup data centers on entirely different continents.
Site selection is really a tradeoff. You want to establish a DR site that's far enough away that it won't be affected by the same disaster, but not so far away that WAN bandwidth costs will be prohibitive. "Recent power grid failures by Pacific Gas and Electric that have brought down the entire PGE northern California and southern California grid have taught orgs that a failure in one PGE location will impact another PGE location," Morimoto said. "So orgs are looking to split data centers across grids."
The physical distance involved will often dictate the type of replication used (e.g., asynchronous or synchronous) to move data between sites. Synchronous replication moves data in real time so that the data center and DR site contain the same data moment to moment, but synchronous data transfers often need high-bandwidth, low-latency wide area network connections and impose distance limitations. Asynchronous replication moves data on a bandwidth-available basis. This allows data movement using cheaper, lower-bandwidth connections, but presents a possibility of data loss because the data center and DR site may be out of sync by up to several hours.
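To see why distance limits synchronous replication, consider a rough sketch of best-case round-trip delay. It assumes light travels through fiber at about 200,000 km per second (roughly two-thirds of its speed in a vacuum) and ignores switching, queuing and protocol overhead, so real-world latency will be higher. The function name and sample distances are illustrative only.

```python
FIBER_KM_PER_MS = 200.0   # assumption: ~200,000 km/s propagation speed in fiber
KM_PER_MILE = 1.609344


def round_trip_ms(distance_miles: float) -> float:
    """Best-case round-trip propagation delay (ms) over a fiber path of this length."""
    if distance_miles < 0:
        raise ValueError("distance_miles must not be negative")
    distance_km = distance_miles * KM_PER_MILE
    # A synchronous write waits for the data to arrive AND the acknowledgment to return.
    return 2 * distance_km / FIBER_KM_PER_MS


if __name__ == "__main__":
    for miles in (5, 50, 500, 3000):
        print(f"{miles:>5} miles: at least {round_trip_ms(miles):.2f} ms added per write")
```

Because every synchronous write waits for that round trip, the delay accumulates across thousands of writes per second, which is why long-distance links usually push a design toward asynchronous replication.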
Data recovery requirements and WAN bandwidth
Second, a solution provider must understand what client data needs to be protected, how much data that involves, and how much of that data changes on a regular basis. A popular misconception about DR planning is that every byte of data needs to be available. This is not the case. While everything should ultimately be backed up and restorable, only a limited number of critical business applications may need DR protection for rapid recovery, leaving other noncritical data to be restored from backups later.
These issues not only dictate the storage requirements at DR site(s), but they also impact the WAN bandwidth needed to carry new and changing data volumes within an acceptable timeframe. For example, it will take considerably more bandwidth to synchronize 500 GB of data than 50 GB of data each day. Consequently, more data means more time or costlier bandwidth for the client. Further, the bandwidth needed to move data should not impact the other business uses of that bandwidth. Some organizations may adapt to these demands by synchronizing data during off hours, or throttling up available bandwidth during the day for synchronization tasks.
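The arithmetic behind that comparison can be sketched as follows. This is a back-of-the-envelope estimate, not a sizing tool: the eight-hour off-hours window and the 80% usable link efficiency are assumptions chosen for illustration, and the function name is hypothetical.

```python
def required_mbps(changed_gb: float, window_hours: float,
                  link_efficiency: float = 0.8) -> float:
    """Sustained link rate (megabits per second) needed to move
    changed_gb of data within window_hours.

    link_efficiency is an assumed fraction of the nominal link rate
    that is left for payload after protocol overhead.
    """
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    if not 0 < link_efficiency <= 1:
        raise ValueError("link_efficiency must be in (0, 1]")
    bits = changed_gb * 8 * 1000**3      # decimal gigabytes to bits
    seconds = window_hours * 3600
    return bits / seconds / link_efficiency / 1e6


if __name__ == "__main__":
    for gb in (50, 500):
        print(f"{gb} GB in 8 hours needs about {required_mbps(gb, 8):.1f} Mbps")
```

Under these assumptions, 50 GB per night needs roughly 17 Mbps of sustained throughput and 500 GB needs roughly 174 Mbps. Shrinking the window or sharing the link with daytime traffic raises the requirement accordingly.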
A solution provider can often minimize WAN bandwidth requirements by deploying a WAN acceleration product from Riverbed Technology, Cisco Systems, Silver Peak Systems, Blue Coat Systems or another vendor. WAN acceleration speeds data transfers by compressing data on the fly, altering packet sizes for greater efficiency and reducing the latency normally associated with TCP/IP handshakes.
Acceptable downtime and data loss
An important third consideration is the client's allowable downtime and data loss. Solution providers should help clients determine how long they can afford to be offline during a disaster (the recovery time objective or RTO) and how much data they can afford to lose (the recovery point objective or RPO). These factors profoundly influence the way that data is copied and made available at the remote site, which also greatly affects DR costs.
For example, suppose a client can be offline (without replication) for up to four hours. That DR implementation would likely use some form of asynchronous replication across an inexpensive WAN link to a simple (idle) site that would take time to bring online. By comparison, a customer that can't afford to lose any data and needs to be available again in less than 30 seconds might employ synchronous replication across a high-bandwidth WAN link to a hot site that is always online -- or even a redundant data center that actually shares the processing load with the main site.
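The two examples above can be expressed as a simple decision sketch. The cut-off values here are hypothetical and exist only to reproduce those two cases; real thresholds come out of the client's own business analysis and budget, and the names are illustrative.

```python
from dataclasses import dataclass

# Hypothetical cut-off: an RTO at or below this is assumed to need a hot site.
HOT_SITE_RTO_SECONDS = 60


@dataclass(frozen=True)
class DrApproach:
    replication: str   # "synchronous" or "asynchronous"
    site: str          # "hot" or "idle"


def suggest_approach(rto_seconds: float, rpo_seconds: float) -> DrApproach:
    """Map a recovery time objective and recovery point objective to a rough DR approach."""
    if rto_seconds < 0 or rpo_seconds < 0:
        raise ValueError("RTO and RPO must not be negative")
    # Zero tolerable data loss implies both sites must hold the same data at all times.
    replication = "synchronous" if rpo_seconds == 0 else "asynchronous"
    site = "hot" if rto_seconds <= HOT_SITE_RTO_SECONDS else "idle"
    return DrApproach(replication, site)


if __name__ == "__main__":
    four_hours = 4 * 3600
    print(suggest_approach(rto_seconds=four_hours, rpo_seconds=four_hours))
    print(suggest_approach(rto_seconds=30, rpo_seconds=0))
```

In practice the choice is rarely this binary, since cost, distance and the mix of applications all weigh in, but writing the objectives down as numbers is what makes the tradeoff discussable with a client.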
Network disaster recovery architecture and installation considerations
While essential planning concepts have changed very little over the last decade, advances in WAN and virtualization technologies have changed the way that DR sites are implemented. Traditionally, DR sites have fallen into categories that define their availability. Cold sites act as data repositories with little (if any) processing capability. In a disaster, the cold site must be brought online and its data restored to another work site. Warm sites include more processing capability and connectivity, and can restore data or assume some processing duties with minimal preparation. Hot sites duplicate much of the data center, possessing the processing and connectivity to operate the business from that remote location with almost no preparation.
Today, cold and warm DR sites are fading into obscurity. Sites that were traditionally cold can now be made hot in a matter of moments, and high availability needs are converging with DR. "Rather than having a high availability plan and a separate DR plan, [it is now possible] to cluster across a WAN so that you have both high availability if a server fails and disaster recovery if a site fails," Morimoto said. Sobel agreed, noting frequent instances in the SMB space where clients replicate data to a hot on-premises appliance and also replicate the data to a cold off-site appliance -- accommodating both availability and data protection in the same strategy.
This dual availability/recovery theme is also repeated at the enterprise level. For example, WAN bandwidth and replication technologies allow a synchronous DR site to be as far as 50 to 63 miles from the data center within the same region for high availability, and a third asynchronous site can be deployed far outside of the region for recovery. "I need to get as far away as possible and maintain my synchronous distance," said Bob Laliberte, analyst with the Enterprise Strategy Group in Milford, Mass. "If something happened regionally that incapacitated a region, you would still be able to recover and operate."
Server virtualization is also revolutionizing DR sites. Hot DR sites once had to duplicate the original data center's equipment, adding expense and complexity. However, virtualization now allows DR data to reside and be available from almost any server hardware -- keeping data available while consolidating and improving the versatility of DR site equipment.
Any DR plan should consider availability on the WAN itself. Wide area networks like the Internet involve numerous local and regional providers that can also be affected by disasters. Try to avoid single points of failure that will affect multiple sites. For example, suppose your client's main data center and DR site are both serviced by the same WAN provider running through the same central office (CO). Anything that affects that CO might then disable both sites.
Similarly, multiple WAN links between the data center and DR site may be essential for high availability, but choose links from different providers that route through different COs or geographical areas. The issue is even more complicated for high-bandwidth links like dense wavelength division multiplexing (DWDM) fiber, where service may only be available in a small number of areas, limiting the choice of data center and DR site locations.
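A link inventory can be checked for this kind of shared dependency with a few lines of code. The sketch below is a minimal illustration that assumes each link is recorded with its provider and central office; the record layout, carrier names and CO labels are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class WanLink(NamedTuple):
    name: str
    provider: str
    central_office: str


def shared_failure_points(links: Iterable[WanLink]) -> Dict[Tuple[str, str], List[str]]:
    """Return every provider or central office that more than one link depends on,
    mapped to the names of the links that share it."""
    usage: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for link in links:
        usage[("provider", link.provider)].append(link.name)
        usage[("central_office", link.central_office)].append(link.name)
    return {key: names for key, names in usage.items() if len(names) > 1}


if __name__ == "__main__":
    links = [
        WanLink("primary", "CarrierA", "CO-1"),
        WanLink("backup", "CarrierB", "CO-1"),   # different carrier, same CO
    ]
    for (kind, value), names in shared_failure_points(links).items():
        print(f"Shared {kind} {value}: {', '.join(names)}")
```

Here the two links come from different carriers but still converge on one central office, which is exactly the kind of hidden overlap worth raising with the client before a contract is signed.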
DR personnel considerations
In the past, DR sites received data across the WAN and normally only needed IT personnel to bring the site online. Today's shift toward high-availability DR usually requires some IT personnel permanently stationed at the DR site to perform regular maintenance, updates and load balancing between sites. Solution providers can sometimes be engaged to maintain a client's DR site(s) and bring them live if the need arises. But solution providers need to get their clients thinking about more than data preservation and IT staffing when planning a DR site.
Even if a company implements the best DR technologies and design strategies, and the DR site is available to operate at a moment's notice without losing a single byte of data, the post-disaster business will still suffer if the people who run the business are ignored in the planning. A skeleton crew of IT professionals won't be handling customer service, accounting, legal, product development, shipping and receiving, manufacturing or any of the other myriad tasks involved in normal daily operation.
"If something causes the primary data center to go offline like an earthquake or fire, then no one will be in the building to care that their data is still there," Morimoto said. "People will go home to take care of family or not be able to make it into work the next day."
Encourage your client to consider where displaced personnel will work in the hours and days following a disaster. A DR site may include workspace, telephones and other resources for key personnel, though the majority of displaced personnel may be able to access the DR site and work remotely once local telecommunication service is available. In larger enterprises, everyday personnel may be split across multiple locations -- working collaboratively between sites and taking over operations when another site is disrupted.