An organization’s servers, systems and data are the lifeblood of its business, and maintaining the continuity of these business systems is a complex, full-time endeavor. There are often hundreds of servers or sub-components that can fail, and many more ways that environmental and human factors can trigger such failures. Any one of the mistakes described below has the potential to introduce fatal failures into a company’s business system continuity plans.
Backing Up Only the Data
Organizations have gotten into the habit of thinking that user data is the only volatile part of the environment, and therefore the only part requiring protection. But there is significant risk at the OS level as well. Today’s data protection processes should include not only backups of user data, but also backups of the OS layer and all applications, along with the ability to quickly restore and recover each.
Bare-metal technology allows a full snapshot of the operating system to be taken as well as all applications, allowing it to be fully restored to a new “bare-metal” server in a matter of one or two hours, instead of several days. More sophisticated systems provide this bare-metal capability across multiple operating systems, so organizations that are running Windows, Linux, and other operating systems can continue to use a single integrated backup solution.
Regardless of the operating environment, it’s critical to remember that everything needs protection, not just user data.
Allowing Backups to Go Untested
Organizations often spend an enormous amount of time making backups (weekly masters, nightly incremental backups and so on). If the backup volumes being created cannot be restored on a reliable basis, the process has effectively failed. A solid disaster recovery plan must include redundant backups intended to adequately compensate for normal error rates, and must incorporate time factors that reflect real world data from actual test restorations.
“Bare-metal” technology allows a full snapshot of a system’s operating environment—including the operating system, all applications, as well as complete user data—and allows the environment to be restored in a fraction of the time necessary to rebuild it from scratch.
Best practice is to test some part of the restoration process every month. Operating systems and environments change rapidly enough that the normal flow of changes in a 30-day period may have compromised something in the current backup resources. This suggests a test restoration of at least some user data, and preferably an entire server, once each month.
Finally, an important best practice is to capture and analyze the data that the organization gathers about error rates of the test restorations, and the time required to conduct them, and feed this knowledge into its disaster recovery plan.
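As a rough illustration of what capturing and analyzing this data might look like, the sketch below summarizes a set of test restorations into a failure rate and restore-time figures. The record layout, field names and sample values are all hypothetical; a real organization would feed in whatever its backup tooling actually logs.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RestoreTest:
    """One test restoration (all field names are hypothetical)."""
    system: str
    succeeded: bool
    minutes: float  # wall-clock time the restoration attempt took


def summarize(tests):
    """Return the failure rate and restore-time figures for a set of tests."""
    if not tests:
        raise ValueError("no test restorations recorded")
    failures = sum(1 for t in tests if not t.succeeded)
    # Only successful restorations give a meaningful recovery time.
    times = [t.minutes for t in tests if t.succeeded]
    return {
        "failure_rate": failures / len(tests),
        "mean_minutes": mean(times) if times else None,
        "worst_minutes": max(times) if times else None,
    }


# Made-up sample data, for illustration only.
sample = [
    RestoreTest("file-server", True, 90.0),
    RestoreTest("mail-server", True, 110.0),
    RestoreTest("db-server", False, 45.0),
    RestoreTest("web-server", True, 70.0),
]
summary = summarize(sample)
print(summary)
```

Figures like the worst observed restore time are the kind of real-world input a disaster recovery plan can use in place of optimistic estimates.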
Lack of Adequate Recovery Planning
Most IT organizations have some level of a “plan” that they’ll reach for in the event of a serious IT failure or natural disaster. These plans often merely describe the systems that exist today, as opposed to laying out a roadmap for a successful recovery from a range of different possible emergencies.
For this reason, smart organizations try to standardize around one or two backup recovery systems/technologies. For example, Unitrends’ systems support more than 30 operating systems, simplifying both planning and recovery in the event of a disaster.
Best practices associated with the planning part of the process include the following:
One person should be ultimately responsible for the execution of the disaster plan, and that person should have one backup person in the event that he or she is unavailable.
Just like basic backup and restoration testing, disaster plans themselves must be tested frequently and completely.
Testing should focus not just on whether the plan can be executed successfully, but also on finding ways to simplify the plan for the future.
Finally, remember to keep multiple copies of the plan itself in multiple locations, just as you would data archives or backup tapes.
Not Planning for a Dissimilar Recovery Environment
We think of laptops, desktops and servers as nearly universal devices, and often don’t consider the risks and complexities of moving software and data between different brands or models. System driver, application and patch complexities and incompatibilities can undermine the best laid plans. It is critical for IT professionals to think about these issues, and plan for realistic restoration choices in the event of a disaster.
Today’s careful planners will catalog all their hardware and operating system environments in advance, including:
What current machines are available from their preferred manufacturers, and how today’s generation of machines differs from the generation currently in use.
Advanced identification of replacement sources for the servers that are used, and the documentation on what will be ordered in the event of a disaster.
Confirmation that the dissimilar hardware models they may have to restore to are supported by their bare-metal software technology.
Duplicate CDs of all necessary drivers, both stored onsite and at an offsite disaster recovery location.
Testing, to be sure that all of the company’s software technology—and all of the hours that were spent making backups—lead to a successful restoration in the event of a disaster.
Not Having Offsite Copies
Software security threats (viruses, spyware, malware, zero-day exploits) often grab headlines and the attention of IT execs. But what is often forgotten is physical security, which can also disrupt a company’s operations and undermine its backup and recovery plans.
Organizations think of offsite storage as protection against true natural disasters—hurricanes or other events that can physically destroy a building. While there’s ample recent evidence of this risk, the benefits of offsite storage cover a multitude of less dramatic but equally damaging potential problems.
Finally, we hear all too frequently of disgruntled employees who gain enormous power over an employer by destroying or holding hostage a crucial set of backups. Again, this risk can be mitigated easily by maintaining current copies offsite.
Best practices involve keeping five successive weekly master backups offsite at all times, as well as end of month archives of the full system.
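The retention guideline above can be expressed as a simple selection rule. The sketch below is only one possible reading of it: the (date, kind) tuple layout is hypothetical, and because the guideline does not say how many end-of-month archives to retain, the sketch assumes all of them are kept.

```python
from datetime import date

WEEKLY_MASTERS_TO_KEEP = 5  # "five successive weekly master backups"


def offsite_set(backups):
    """Pick the backups that should be held offsite.

    `backups` is a list of (date, kind) tuples, where kind is "weekly"
    or "monthly". This layout is hypothetical.
    """
    weekly = sorted((b for b in backups if b[1] == "weekly"), reverse=True)
    monthly = sorted((b for b in backups if b[1] == "monthly"), reverse=True)
    # Newest five weekly masters, plus every month-end archive (assumption).
    return weekly[:WEEKLY_MASTERS_TO_KEEP] + monthly


# Made-up catalog, for illustration only.
catalog = [
    (date(2024, 1, 7), "weekly"),
    (date(2024, 1, 14), "weekly"),
    (date(2024, 1, 21), "weekly"),
    (date(2024, 1, 28), "weekly"),
    (date(2024, 1, 31), "monthly"),
    (date(2024, 2, 4), "weekly"),
    (date(2024, 2, 11), "weekly"),
    (date(2024, 2, 18), "weekly"),
    (date(2024, 2, 29), "monthly"),
]
kept = offsite_set(catalog)
```

With this sample catalog, the two oldest weekly masters fall out of the offsite set while both month-end archives remain in it.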
Confusing Replication and Vaulting
The idea of replicated disk storage originated in the fail-safe systems of 20 years ago. In this technology, two disks or disk arrays simultaneously stored exactly the same data, such that if one had a failure, no data was lost.
Today, replication is typically used by SANs and “data mirror” products. Replication uses near real-time, block-level communications to constantly synchronize data between two different locations. The advantages of this method are near real-time data protection and ease of configuration and maintenance. The disadvantage is that the synchronization occurs at a very low level, below the level at which data and file system errors or corruption can be seen.
A preferable approach to offsite data storage is called “vaulting,” which uses a less real-time technology, but allows a level of file system integrity not found in a replication environment. Vaulting operates at the file level, where the built-in operating system checks and balances on file system integrity are allowed to operate. This ensures that data stored in the offsite environment is accurate, complete, and able to be used for recovery in the event of a disaster.
Today’s best practices suggest that IT professionals should look for a vaulting system that moves only changed data, not entire backups or the original data source; this minimizes bandwidth requirements and makes the most of the available backup window.
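To make the idea of moving only changed data concrete, here is a minimal sketch of file-level change detection: it records a content digest for every file and, on the next run, lists only the files that are new or different. This is an illustration under stated assumptions, not a description of how any particular vaulting product works; commercial systems may track changes at a finer granularity. The function names are hypothetical, and the sketch does not handle deleted files.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file's contents, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its content digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(previous, current):
    """Paths that are new or whose contents differ since the previous run."""
    return sorted(p for p, d in current.items() if previous.get(p) != d)


# Demo on a throwaway directory with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("first version")
    (root / "b.txt").write_text("unchanged")
    before = build_manifest(root)
    (root / "a.txt").write_text("second version")
    (root / "c.txt").write_text("new file")
    after = build_manifest(root)
    to_send = changed_files(before, after)
```

In the demo, only the modified file and the new file would need to cross the wire; the unchanged file stays out of the transfer.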
Pick a business system continuity solution that accommodates the majority of the needs you’ve already identified, plus others that may not be in today’s plan but could still be useful. The more options you have in a real disaster, the better the chances that you’ll be back in business quickly.
Maria Ellison is SVP of products and services at Unitrends.
