By Jim McKinstry
In the past, only the largest companies could afford data replication technology to implement a disaster recovery solution for their e-mail systems. Most other companies relied on tape solutions that could lead to extended outages, or even data loss, in the case of a disaster. Over the years, mainframe-class replication solutions have appeared in the Open Systems environment and costs have dropped. Today’s replication solutions make it affordable for most companies to at least consider replicating their e-mail data as part of their disaster recovery plans.
The ultimate goal of data replication is to create a complete copy of the source data. There are a variety of ways to do this:
Snapshots
The simplest form of replication is the snapshot. Snapshot technology is considered replication because replication is, essentially, just a copy of data—which is exactly what a snapshot is. With snapshot technology, data can be replicated with minimal impact on the host application. Usually, it takes longer for the e-mail application to shut down or the e-mail database to enter some sort of “hot-backup mode” than it takes to perform the snapshot. There are a couple of ways to perform snapshots:
- Snap-copy (or Volume-copy) snapshots create a complete second copy of data that can be used at a later time to recover the data. The advantage to this method is that there is a full copy of the original data that may be stored on separate physical disk drives. The disadvantages are that creating the snapshot can be time-consuming and can affect performance while the copy is in progress. In addition, with a snap-copy snapshot there must be enough storage to accommodate not only the original copy but the snapshot as well (100% overhead per snapshot). When a snap-copy is created, the data is copied to another area of storage; the snapshot becomes available only once that copy completes.
- Pointer-based snapshots are not exact copies of the data but a set of pointers that point to the original data. When a block on the snapshot source is about to be overwritten, the original block is first copied to the snapshot reserve area, the snapshot's pointer for that block is changed to point to the copied block and the new block is then written to the snapshot source. This process is called “copy on write.” Subsequent writes to the same block are not copied to the snapshot reserve area because the original data has already been preserved there. Pointer-based snapshots are very attractive because the snapshot is available instantaneously and the snapshot reserve area needs just a fraction of the original disk space, since only the changed blocks are copied. Because pointer-based snapshots require such a small amount of additional space, they can be implemented cost effectively. The disadvantage to pointer-based snapshots is that if the source is write-intensive, maintaining the “copy on write” can affect the performance of the source.
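The copy-on-write behavior described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not any vendor's implementation: the class names Volume and Snapshot are hypothetical, and a Python list and dictionary stand in for the disk blocks and the snapshot reserve area.

```python
class Volume:
    """A toy block device: a list of fixed-size blocks (hypothetical)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def snapshot(self):
        # Creating the snapshot copies no data, so it is available at once.
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, index, data):
        # Copy on write: save the old block for each snapshot that has not
        # already preserved it, then overwrite the block on the source.
        for snap in self.snapshots:
            snap.preserve(index)
        self.blocks[index] = data


class Snapshot:
    """Pointer-based snapshot: reads fall through to the source unless the
    block was changed after the snapshot was taken."""

    def __init__(self, source):
        self.source = source
        self.reserve = {}  # block index -> original contents (reserve area)

    def preserve(self, index):
        if index not in self.reserve:  # only the first overwrite copies
            self.reserve[index] = self.source.blocks[index]

    def read(self, index):
        if index in self.reserve:
            return self.reserve[index]
        return self.source.blocks[index]


volume = Volume(["a0", "b0", "c0", "d0"])
snap = volume.snapshot()
volume.write(1, "b1")  # first write to block 1: "b0" goes to the reserve
volume.write(1, "b2")  # second write to block 1: nothing more is copied

snapshot_view = [snap.read(i) for i in range(4)]
reserve_blocks = len(snap.reserve)
```

In this sketch the snapshot still reads as the original four blocks while the reserve area holds a single block, which is why the space overhead tracks the amount of changed data rather than the size of the volume.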
Remote Replication
Snapshots are generally controlled by a single device (host, disk subsystem or SAN appliance). If there is a need to have a copy of the data on another disk subsystem or in a remote location, then a remote replication solution should be considered. There are, essentially, two options when replicating data remotely.
- Synchronous replication has been around for a long time in the high-end, mainframe-class storage devices and more recently in the higher-end open-systems storage devices. Due to cost and complexity, synchronous replication has traditionally been implemented by larger enterprises.
With synchronous replication, every write from the application is sent to the local disk system, which sends it to the remote storage system. The write is not acknowledged back to the application until the remote system has received the write and acknowledged that the write is complete. In other words, the application has to wait for the data to be written to both the local and remote storage before it can continue processing. This “double-write penalty” can be very significant to an application that performs a large number of writes. As the distance between the local and remote systems grows, the delay in each write of data will also increase. The networking between the two systems needs to be as fast as possible to keep the latency to a minimum.
The benefit of synchronous mirroring is that in the case of a disaster at the primary site, the remote site can be brought online and processing can continue from the exact point in time that the primary site died. Unfortunately, the networking between the sites that is needed to implement a synchronous solution has historically been very expensive. As a result, only the most critical applications could afford to take advantage of it (banking systems, brokerage houses, government, etc.). Today, synchronous replication has been ported from very high-end solutions to much more affordable solutions and has allowed many more companies to implement solutions they would not have been able to deploy a few years ago.
- Asynchronous replication, like synchronous replication, creates an exact copy of the local data on a remote system. Unlike synchronous replication, there is no practical distance limitation. Where synchronous replication waits for a response from the remote system before acknowledging the write back to the application, asynchronous replication acknowledges the write back to the application immediately and then transmits the data to the remote site. Since the application is not waiting for the remote system to respond, the application performance is not impacted, no matter how long the response takes. With no requirement for a timely response, the remote system can now be hundreds or thousands of miles away.
The obvious benefit of asynchronous replication is that there is no performance impact to the application. One drawback is that the remote data is not necessarily useable because writes may occur out of order (many databases need to have the writes occur in order). Some asynchronous implementations provide a facility called “write order consistency groups”. Write order consistency groups guarantee that all writes are delivered in the proper order. For example, all the volumes used by an e-mail system can be grouped in a write order consistency group, which will guarantee that the remote copy of the data is useable.
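The difference between the two acknowledgement models, and the role of a write order consistency group, can be illustrated with a small simulation. This is only a sketch: the Replicator class is hypothetical, the latency figures are assumed values chosen for illustration, and a single shared sequence counter stands in for whatever ordering mechanism a real product uses.

```python
from collections import deque

LOCAL_MS = 1.0        # assumed time to commit a write locally
ROUND_TRIP_MS = 20.0  # assumed round trip to the remote site


class Replicator:
    """Toy model of one write order consistency group (hypothetical)."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.local = []         # writes applied at the primary site
        self.remote = []        # writes applied at the remote site
        self.pending = deque()  # acknowledged, not yet at the remote site
        self.next_seq = 0       # one counter shared by every volume

    def write(self, volume, data):
        """Apply a write; return the latency the application observes."""
        record = (self.next_seq, volume, data)
        self.next_seq += 1
        self.local.append(record)
        if self.synchronous:
            # The acknowledgement waits for the remote copy.
            self.remote.append(record)
            return LOCAL_MS + ROUND_TRIP_MS
        # Acknowledge now, transmit later.
        self.pending.append(record)
        return LOCAL_MS

    def drain(self, count):
        """Ship up to `count` queued writes, oldest first."""
        for _ in range(min(count, len(self.pending))):
            self.remote.append(self.pending.popleft())


writes = [("log", "begin"), ("db", "page 7"), ("log", "commit")]

sync_rep = Replicator(synchronous=True)
sync_latency = sum(sync_rep.write(v, d) for v, d in writes)

async_rep = Replicator(synchronous=False)
async_latency = sum(async_rep.write(v, d) for v, d in writes)

async_rep.drain(2)  # the link has delivered only part of the backlog
remote_is_prefix = async_rep.remote == async_rep.local[:2]
```

Because the queue is drained strictly in sequence order across both volumes, the remote copy in the asynchronous case is always a prefix of the local write history: it may be behind, but it is never out of order.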
A third type of remote replication is to initially establish the remote mirror and then suspend replication. Periodically, replication can be resumed and when the remote site is back in sync replication can be suspended again. This methodology allows users to have stable, point-in-time copies of their data at a remote location.
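One common way to make the periodic resume step cheap is to track which blocks changed while replication was suspended and copy only those. The sketch below assumes that approach; the article does not say how any particular product does it, and the PeriodicMirror class and its methods are hypothetical names.

```python
class PeriodicMirror:
    """Toy suspend/resume mirror with changed-block tracking (hypothetical)."""

    def __init__(self, blocks):
        self.local = list(blocks)
        self.remote = list(blocks)  # initial full synchronization
        self.dirty = set()          # blocks changed while suspended

    def write(self, index, data):
        # Replication is suspended: update locally and remember the block.
        self.local[index] = data
        self.dirty.add(index)

    def resync(self):
        """Resume replication, copy the changed blocks, then suspend again."""
        copied = len(self.dirty)
        for index in sorted(self.dirty):
            self.remote[index] = self.local[index]
        self.dirty.clear()
        return copied


mirror = PeriodicMirror(["a", "b", "c"])
mirror.write(0, "A")
mirror.write(0, "AA")  # same block again: still one block to copy
mirror.write(2, "C")

remote_before = list(mirror.remote)  # stable point-in-time copy
copied = mirror.resync()
```

Until resync runs, the remote side keeps the stable point-in-time image from the previous cycle; the resync itself moves two blocks, not three writes.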
Types (or Options)
As discussed, replication used to be very expensive with limited options available to those trying to implement a solution. In the past, replication solutions resided in large, monolithic storage systems. Today, there are a variety of options available to users and replication solutions are available for various points in the I/O path.
Host-Based: Host-based replication resides on the application server that needs to have its data replicated. Software that supports all forms of snapshots and remote replication is available for most of the popular operating systems. The major benefits of a host-based solution are that the cost can be very low and heterogeneous storage can be used. As more servers need to use replication, the cost goes up (initial cost of software, implementation, service and ongoing maintenance). Software may need to be purchased from different vendors to support the mix of servers, which means different management interfaces and a much more complicated environment to manage. Users of less mainstream operating systems may have a challenge locating products to implement host-based replication. Operating system upgrades or patches may cause the replication software to stop functioning. Another issue with host-based replication is that it takes processing cycles away from the applications running on the host: the operating system has to spend CPU, memory and network resources managing the replicated data.
Appliance-Based: Appliance-based (also called SAN-based) replication technology, like host-based, supports all the types of replication. Unlike the host-based solutions, all intelligence needed to perform the replication is housed in an appliance that resides in the I/O path between the host and the storage, typically in a SAN. Appliance-based replication has many advantages over host-based. To start, there is no overhead on the application server and the application has little or no knowledge that the appliance exists or that the replication is taking place. Management is centralized on the appliance and any operating system that the appliance supports can utilize the replication features. As with host-based solutions, a heterogeneous storage pool can be utilized. There are some major issues with an appliance-based solution. For a highly available solution, there should be at least two appliances in the local site, configured as failover for each other, and at least one appliance remotely. Since the appliance is involved with every I/O (not just the replicated data), each appliance should use at least four switch ports and typically more, which can add significant cost and complexity to the SAN infrastructure. Modern disk subsystems can deliver very high I/O rates (IOPS) and throughput (megabytes per second, MB/s), and application servers can now drive these appliances to their limits. A SAN appliance in the stream of the I/O can become a major bottleneck. An environment with large I/O needs can easily overpower a pair of appliances. Some appliance-based solutions are limited to a pair of appliances while others can scale beyond two. As additional appliances are added, the cost of the solution rises (cost of appliance, SAN switch ports, support, etc.).
Storage-Based: Storage-based replication combines the best aspects of host-based and appliance-based solutions. The application servers have little or no knowledge of the replication; there is no overhead on the application servers. Management is centralized and any host supported by the storage system can use the replication functions of the storage device. Unlike appliance-based solutions, there are no extra SAN switch ports needed to implement storage-based replication. Since the replication is native to the storage controllers, the impact is minimal to the application servers utilizing the storage. The only drawback to storage-based replication is that replication can only take place between homogeneous storage systems. In the past, this could prove to be costly; but today, many storage devices allow replication from more expensive Fibre Channel drives to less expensive SATA drives and also support remote replication from a higher-end model to a lower-end model. For example, a StorageTek FLX280 with a mix of FC and SATA drives can perform snapshots of the data on the FC drives and store them on the SATA drives and also remotely replicate the data on the FC drives to a StorageTek FLX240 that may have only SATA drives.
Uses
Implementing replication can be a costly proposition, but it usually addresses one or more important business issues that make it easy to justify the purchase.
Disaster Recovery: Replication technology has typically been used to address disaster recovery issues. Disaster recovery is still the driving business case behind replication. Remote replication can be implemented from the production site to one or more remote sites across a campus, across town, across a state or across the country. When a disaster strikes the primary location, the applications can be brought up at the remote site and continue processing against the replicated copies. When the primary site is back online, the replication can be reversed and, once the data is resynchronized, processing can be switched back to the primary site and business can continue. In the past, if an e-mail system experienced a disaster it was an “oh well” moment. The loss of a day or more of e-mail was not considered important. Today, e-mail is a critical component of many companies’ business plans, and recovering e-mail quickly and completely after a disaster is a requirement.
Maintenance: Once a disaster recovery solution is in place and fully tested with documented processes and procedures, the infrastructure can be used to solve other business needs. E-mail servers may need periodic maintenance that can take hours to complete. With remote replication in place, the downtime can be minimal (as long as it takes to bring the remote peer of the primary e-mail server online). The primary server can be worked on (patches, hardware upgrades, etc.) and then brought back online and into production. A whole datacenter can be failed over to a remote site on purpose to perform maintenance on generators, air conditioning, etc. Replication can also be used to perform a datacenter move with minimal downtime (fail everything to the DR site, move the production datacenter to its new location then fail the DR site back to the new datacenter).
Backup: Backing up data is frequently the biggest daily challenge for an IT manager. Backup windows have been shrinking while data has been growing. In the past, the only way to address the issue was to add larger and larger tape libraries. Today, by using snapshots and SAN technology, backup windows can shrink to virtually zero. For example, an e-mail server can be placed in “hot-backup mode” and its data can be snapped. The database can then be placed back into normal operation. The snapshot can be mounted onto a dedicated backup server, backed up directly to tape or SATA disk, and then archived to tape.
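The backup sequence in the example above can be sketched as follows. Everything here is a stand-in: MailStore, hot_backup_mode and backup are hypothetical names rather than the interface of any real e-mail server, and a tuple copy plays the part of the storage snapshot.

```python
from contextlib import contextmanager


class MailStore:
    """Toy e-mail database (hypothetical stand-in for a real server)."""

    def __init__(self):
        self.messages = []
        self.hot_backup = False

    @contextmanager
    def hot_backup_mode(self):
        # Enter hot-backup mode, and always leave it again afterwards.
        self.hot_backup = True
        try:
            yield
        finally:
            self.hot_backup = False

    def deliver(self, message):
        self.messages.append(message)

    def snapshot(self):
        # Stands in for an array or host snapshot of the mail volumes.
        return tuple(self.messages)


def backup(store, archive):
    """Snap the store while in hot-backup mode, then archive the snapshot."""
    with store.hot_backup_mode():
        snap = store.snapshot()
    # The store is already back in normal operation; the slow copy to
    # tape or SATA disk works from the snapshot, not from production.
    archive.append(snap)
    return snap


store = MailStore()
archive = []
store.deliver("m1")
snap = backup(store, archive)
store.deliver("m2")  # production continues while the backup is kept
```

The point of the ordering is that the e-mail server is in hot-backup mode only for the duration of the snapshot itself; the time spent moving data to the backup target no longer counts against the backup window.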
Summary
The role of e-mail systems in the enterprise has gained in stature since their humble beginnings. Thankfully, the technologies used to support an e-mail system's infrastructure have matured at the same time. Replication is one of those technologies that has greatly matured since its inception. Today’s entry-level replication solutions are as robust and reliable as enterprise-class solutions from 20 years ago and far more affordable. The attractive pricing allows companies of all sizes to protect themselves from disasters and implement business efficiencies that were previously available to only the largest organizations.
Jim McKinstry is senior systems engineer at Engenio Information Technologies, Inc. (Milpitas, CA)