By Mehran Hadipour
In most enterprises, mission-critical applications are fundamental to the core business. Failure of those applications can be disastrous to the business and, in some cases, terminal. This is why protecting high-value data and delivering 24x7x365 business continuity is by all counts the top objective of any IT organization. Meeting this goal, however, presents a number of challenges: traditional data protection strategies are complex, expensive and hard to manage, and they require extensive additional infrastructure. Many data protection technologies also limit the distance between primary and secondary sites, because replicating over longer distances can degrade the performance of the primary application. At the same time, the need for protection against regional disasters and power outages, together with numerous regulations, is driving demand for extended-distance disaster protection.
In September 2002, the Federal Reserve, the Securities and Exchange Commission (SEC), and the Office of the Comptroller of the Currency (OCC) jointly published the “Draft Interagency White Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System,” in direct response to the terrorist attacks of September 11, 2001 (www.sec.gov/rules/concept/34-46432.htm). This outlined “preliminary conclusions with respect to the factors affecting the resilience of critical markets and activities in the U.S. financial system; sound practices to strengthen financial system resilience; and an appropriate timetable for implementing these sound practices.” The agencies solicited comments on the draft white paper, and received many letters from leaders of financial firms, industry associations, technology companies, and others (www.sec.gov/rules/concept/s73202.shtml).
Probably the most controversial aspect of this interagency white paper was the suggestion that those financial institutions that “play significant roles in critical financial markets” must have fully operational recovery sites located at least 200-300 miles away from the primary data center site. Beyond protection against regional disasters, there can be other reasons for deploying systems across multiple data centers in different geographic locations, such as providing “local access” to users spread across a wide geographic area, or taking advantage of the existing IT resources, skills and infrastructure in the company's geographically dispersed data centers.
Is Only the Mainframe Data Critical?
Windows server operating systems have become accepted in high-end, mission-critical applications and, as a result, requirements for disaster tolerance and business continuance for these systems are becoming more and more important. Microsoft offers a robust clustering technology (MSCS) as part of the Windows operating system; however, it is generally deployed for failure protection within a campus or data center environment.
The goal is to ensure that there is no single point of failure. In other words, the loss of a single component or complete site failure cannot cause applications to become unavailable. In extreme cases, a complete site can fail, either due to a total loss of power or through a natural or artificial disaster. More and more businesses are recognizing the value of deploying mission critical solutions across multiple geographically dispersed sites.
A new data protection approach that couples an intelligent replication appliance with Windows clustering technology can be used to create a highly resilient infrastructure across data centers that are thousands of miles apart, protecting applications automatically against all types of failures as well as local or regional disasters. In addition, this new network-based data protection architecture can provide unique features such as bandwidth optimization and support for heterogeneous storage and server environments.
Replication Requirements Unique to Clusters
Clusters are defined as two or more computer systems that together provide a highly available and highly scalable platform for hosting applications. MSCS clusters host applications that use failover to achieve high availability. The failover mechanism is automatic, and the configuration ensures that loss of one site does not cause a loss of the application.
To make a multi-site MSCS configuration work, the replication infrastructure has to solve several specific issues:
- Making sure that multiple sites have independent copies of the same data
- Making sure that each site has its own copy of the data so that if one site is lost, the applications can continue
- Ensuring that changes to the data at one site are replicated in a consistent manner to the other sites, so that if the first site fails, the changes are available at the second site and the applications keep running uninterrupted
- Ensuring that the data between two sites stays consistent at all times
- Replicating data across sites in both directions to support both failover and failback
Geographically dispersed cluster configurations should be implemented with care, especially around the storage and data replication components of the solution. The system should ensure that no failure results in data corruption and that cluster integrity is always maintained.
The most difficult challenge, specifically with geographically dispersed clusters, is to be able to distinguish between a communication failure between sites where the other site is still alive and a site failure where it is no longer available to run applications.
The MSCS architecture handles this issue using a single quorum resource in the cluster that is used as the tie-breaker to avoid split-brain scenarios. A split-brain scenario can happen in the above case when all of the network communication links between two or more cluster nodes fail. In these cases, the cluster may be split into two or more partitions that cannot communicate with each other.
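The tie-breaker idea can be illustrated with a minimal sketch. It is not MSCS's actual arbitration mechanism; the `QuorumResource` class and `surviving_partitions` function are hypothetical names introduced here, and the only property modeled is that the quorum resource can have a single owner, so at most one partition keeps running.

```python
import threading


class QuorumResource:
    """Hypothetical stand-in for the single quorum resource (one owner at most)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None

    def try_acquire(self, partition_id):
        """Atomically claim the quorum; return True only for the owning partition."""
        with self._lock:
            if self._owner is None:
                self._owner = partition_id
            return self._owner == partition_id


def surviving_partitions(quorum, partition_ids):
    """Return the partitions allowed to keep running applications.

    Every partition that loses contact with the others tries to claim the
    quorum. Only the first claim succeeds, so the result has one entry and
    the remaining partitions must stop, which avoids split-brain.
    """
    return [p for p in partition_ids if quorum.try_acquire(p)]


if __name__ == "__main__":
    quorum = QuorumResource()
    # Both sites are alive but cannot reach each other.
    print(surviving_partitions(quorum, ["site-A", "site-B"]))
```

In this sketch a site that has truly failed never attempts the claim, so the remaining site acquires the quorum and continues; when both sites are alive but disconnected, exactly one of them wins.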
Using an Intelligent Replication Appliance to Set up a Geographically Dispersed Cluster
A geographically dispersed cluster generally deploys multiple storage arrays, with a minimum of one at each site. The replication system is configured to replicate the application data in both directions so that, in the event of a site failure, the application data is preserved and the failover servers can continue to provide the services and applications. In addition, the consistency of the quorum volume should be maintained synchronously to guarantee operation of the MSCS cluster regardless of the type of failure.
Synchronous replication is used for the quorum volume, which means that any data written by MSCS on one node at one site will not complete until the change has been made on the other site.
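A minimal sketch of this behavior follows, together with a rough estimate of why distance matters for synchronous writes. The `Volume` and `SyncReplicatedVolume` classes and the `min_round_trip_ms` helper are hypothetical, and the figure of about 200,000 km/s for signal speed in optical fiber is an approximation assumed here, not a number from the article.

```python
class Volume:
    """Hypothetical in-memory volume; `reachable` simulates a link or site failure."""

    def __init__(self, reachable=True):
        self.blocks = {}
        self.reachable = reachable

    def write(self, block, data):
        if not self.reachable:
            return False  # no acknowledgement
        self.blocks[block] = data
        return True  # acknowledgement


class SyncReplicatedVolume:
    """A write completes only after both the local and the remote copy acknowledge it."""

    def __init__(self, local, remote):
        self.local = local
        self.remote = remote

    def write(self, block, data):
        if not self.local.write(block, data):
            raise IOError("local volume did not acknowledge the write")
        if not self.remote.write(block, data):
            raise IOError("remote site did not acknowledge the write")
        return True


def min_round_trip_ms(distance_km, fiber_km_per_s=200_000):
    """Lower bound on the delay each synchronous write pays for the remote acknowledgement."""
    return 2 * distance_km / fiber_km_per_s * 1000


if __name__ == "__main__":
    quorum = SyncReplicatedVolume(Volume(), Volume())
    quorum.write(0, b"cluster state")
    # 300 miles is roughly 483 km.
    print(f"{min_round_trip_ms(483):.2f} ms added per write, at minimum")
```

Because every write waits for the round trip, this mode suits a small, lightly written volume such as the quorum better than busy data volumes.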
Asynchronous replication is used for the data volumes, which means that a change made to the data at one site completes locally and is then replicated to the second site. It is important, however, that the consistency of the data volumes is maintained. This means that the replication system guarantees write-order fidelity, so the remote data is always consistent. This matters because most applications can recover from a crash-consistent state, but very few (if any) can recover from out-of-order I/O sequences, which may leave the application totally unusable.
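Write-order fidelity can be sketched as a first-in, first-out queue of pending writes. This is an illustration under simplifying assumptions, not the appliance's implementation; the `AsyncReplicator` class and its method names are hypothetical.

```python
from collections import deque


class AsyncReplicator:
    """Hypothetical asynchronous replicator that preserves the original write order."""

    def __init__(self):
        self.local = {}
        self.remote = {}
        self._pending = deque()  # writes not yet applied at the remote site

    def write(self, block, data):
        """Complete the write locally and queue it; do not wait for the remote site."""
        self.local[block] = data
        self._pending.append((block, data))
        return True

    def drain(self, max_writes=None):
        """Apply queued writes at the remote site strictly in the order they were issued."""
        shipped = 0
        while self._pending and (max_writes is None or shipped < max_writes):
            block, data = self._pending.popleft()
            self.remote[block] = data
            shipped += 1
        return shipped


if __name__ == "__main__":
    r = AsyncReplicator()
    r.write("log", "begin")
    r.write("data", "v1")
    r.write("log", "commit")
    r.drain(max_writes=2)  # the link fails after two writes are shipped
    print(r.remote)        # a prefix of the write history, never "commit" without "data"
```

Because the remote copy always reflects a prefix of the write history, a failure at any moment leaves it in a state the application could also have reached by crashing locally, which is the crash-consistent state described above.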
A Cost-Effective, Less Complex Alternative
Intelligent replication appliances, coupled with clustering configurations, effectively address the cost, complexity and management issues that have limited traditional data protection solutions, simplifying infrastructures while extending application protection efficiently over long distances. The ability to use a network-based appliance to enable low-cost connectivity, bandwidth optimization and long distance replication services allows IT organizations to deliver 24x7x365 availability of business information (with dramatic savings in operational costs) and ensures that information will be immediately available in the event of a complete or partial site failure.
Bandwidth Optimization: Using intelligent bandwidth reduction technologies, appliances can deliver unprecedented reduction in bandwidth requirements. This enables the system to dramatically reduce WAN costs, particularly over long distances.
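The article does not say which bandwidth reduction techniques the appliance uses. As an illustration only, the sketch below assumes two generic ones: coalescing repeated writes to the same block within a batch, and compressing the batch with `zlib` before it crosses the WAN. The function names and the wire format are hypothetical, and the batch is assumed to be applied at the remote site as a single unit so that consistency is not affected by the coalescing.

```python
import zlib


def pack_batch(writes):
    """Pack a batch of (block_number, data_bytes) writes into a compressed payload.

    Later writes to the same block replace earlier ones, so only the final
    contents of each block in the batch are sent.
    """
    latest = {}
    for block, data in writes:
        latest[block] = data
    raw = b"".join(
        block.to_bytes(8, "big") + len(data).to_bytes(4, "big") + data
        for block, data in latest.items()
    )
    return zlib.compress(raw)


def unpack_batch(payload):
    """Decode a payload produced by pack_batch into a {block_number: data_bytes} dict."""
    raw = zlib.decompress(payload)
    blocks = {}
    i = 0
    while i < len(raw):
        block = int.from_bytes(raw[i:i + 8], "big")
        size = int.from_bytes(raw[i + 8:i + 12], "big")
        blocks[block] = raw[i + 12:i + 12 + size]
        i += 12 + size
    return blocks


if __name__ == "__main__":
    writes = [(7, b"a" * 4096), (9, b"b" * 4096), (7, b"c" * 4096)]
    payload = pack_batch(writes)
    original = sum(len(data) for _, data in writes)
    print(f"{original} bytes written locally, {len(payload)} bytes sent")
```

How much such techniques save depends heavily on the workload; highly repetitive data like this example compresses far better than typical application data.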
Bi-Directional Replication Over Existing Infrastructures: The appliance can replicate in both directions across heterogeneous server and storage platforms, with guaranteed data consistency across multiple servers and storage platforms in the event of a failure or disaster.
Protection for the Entire Data Center: Using an intelligent replication appliance, users can protect the transactional data, multi-tiered applications and other business-important information in a data center (including operating systems, working files and e-mail), giving all applications point-in-time protection and immediate end-to-end recovery after a failure.
We have entered a time when intelligent replication solutions can protect data centers located thousands of miles apart, automatically and cost-effectively.
Mehran Hadipour is vice president of marketing for Kashya, Inc. (San Jose, CA)