Bringing World-Class Data Reduction to Primary Data Storage
Server virtualization and data storage capacity optimization are high on the wish list of most CIOs. Each technology enables IT operations to do more with their existing IT infrastructures. But can these technologies be combined to bring even more value to an organization? In the case of VMware and data de-duplication there is an ideal fit.
To date, data de-duplication (data de-dupe) has largely been relegated to backup and other secondary storage applications for three reasons. First, primary storage pools do not have nearly as much data redundancy as secondary storage, which limits the usefulness of applying data de-dupe to this class of storage. Second, there is the performance issue: the processing overhead of performing de-duplication in real time can slow primary applications. Third, secondary storage presents a much larger data management challenge for storage administrators, given the massive amounts of data involved compared with primary storage.
The economic case for data reduction/capacity optimization on primary storage becomes clearer when considering the cost of the higher-performance yet lower-capacity Fibre Channel and Serial Attached SCSI drives deployed for primary storage to provide the fastest access to active production data and applications. Bringing data de-duplication to primary storage that hosts applications such as VMware presents a new set of opportunities and challenges. A VMware deployment is a departure from typical primary storage because of the high level of data redundancy inherent in the VMware architecture. VMware is highly efficient at consolidating multiple physical servers onto a single hardware platform by creating multiple virtual machines. However, VMware is not quite as adept at consolidating storage assets and relies on third-party storage for redundancy; this is one of the reasons EMC bought VMware.
When a VMware virtual machine is added to a physical server, its VM image is cloned from a master template. Every cloned image is the same size as the original and identical to it, creating a highly redundant storage environment. As the VMware virtual environment grows with the addition of more virtual machines, so does the level of data redundancy. It is estimated that VMware users typically run six to ten virtual machines per physical server, and a high-end server can host 50 or more virtual machines. The storage capacity consumed by these cloned templates quickly adds up, especially in a Storage Area Network (SAN) where storage is shared by multiple physical servers and dozens or hundreds of VMware virtual machines.
Data de-duplication is well suited to eliminating the excessive data redundancy that VMware imposes, but employing data de-dupe in a VMware environment creates another issue: performance degradation when de-duplication is applied in real time to a primary storage application. Most data de-duplication solutions are software-based and are loaded onto an “appliance” that is typically powered by an x86-class Intel server. The general-purpose processor is expected to handle the compute-intensive de-duplication algorithms with no problems. This can be an effective solution for secondary storage de-duplication using a post-processing technique, in which data is written to disk first and then scanned for redundancy as the de-duplication algorithms are applied. Performance degradation is not an issue there because the de-duplication process is applied to non-production data. But post-processing is less useful for primary storage applications, where real-time disk I/O operations are expected to be maintained with no performance degradation. So for primary storage environments such as VMware, data de-duplication needs to be performed in real time. That requires inline de-dupe, in which the data is analyzed for redundancy and the de-duplication algorithms are applied before the data is written to disk.
Data de-duplication systems impose an extreme processing load to handle the data reduction algorithms. Data de-dupe works by detecting blocks of duplicated data within large storage arrays and replacing all repeated instances of the data with a pointer to a single copy. The core operation in most de-duplication systems is a hash function that characterizes a block of data with a single signature, or key, so that subsequent searches can simply compare hash keys rather than comparing entire data blocks byte by byte. A suitable hash function must be nearly collision-free to avoid false positive matches, and it must generate signatures that are long enough and well distributed enough to build nearly collision-free hash tables.
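The hash-and-pointer scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the `DedupeStore` class, the fixed 4 KB block size and the synthetic "template" data are all assumptions made for the example, and the sketch trusts the SHA-1 signature alone rather than confirming matches byte by byte.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size in bytes (illustrative choice)


class DedupeStore:
    """Toy block store that keeps one copy of each unique block."""

    def __init__(self):
        self.blocks = {}  # SHA-1 digest -> the single stored copy of a block
        self.files = {}   # file name -> ordered list of digests ("pointers")

    def write(self, name, data):
        pointers = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            key = hashlib.sha1(block).digest()
            # Store the block only if its signature has not been seen before.
            self.blocks.setdefault(key, block)
            pointers.append(key)
        self.files[name] = pointers

    def read(self, name):
        # Follow the pointers to rebuild the original data.
        return b"".join(self.blocks[key] for key in self.files[name])

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


# Synthetic "master template" of 64 distinct blocks, cloned into ten images.
template = b"".join(bytes([i]) * BLOCK_SIZE for i in range(64))
store = DedupeStore()
for vm in range(10):
    store.write(f"vm{vm}.img", template)

logical_bytes = 10 * len(template)
print("logical bytes:", logical_bytes)
print("stored bytes: ", store.stored_bytes())
```

Because every clone in this contrived case is identical, the store holds only one template's worth of blocks no matter how many images are written. Real VM images diverge over time, so the savings would be smaller.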
The cryptographic hash functions SHA-1 and MD5 have exactly these properties. They generate reasonably long hash strings, and a difference of even a single bit between two data blocks produces a large difference in the resulting hashes. It is these computationally intensive hash functions that overwhelm the processing power of x86-class appliances. A better de-dupe solution is required to handle the real-time demands of VMware data de-duplication.
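The sensitivity to single-bit changes can be observed directly with Python's standard `hashlib` module. The sketch below is illustrative only: the all-zero 4 KB block is arbitrary sample data, and `differing_bits` is a helper written for this example.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


block = bytes(4096)        # arbitrary sample block: 4 KB of zeros
modified = bytearray(block)
modified[0] ^= 0x01        # flip exactly one bit of the input
modified = bytes(modified)

for name in ("sha1", "md5"):
    h1 = hashlib.new(name, block).digest()
    h2 = hashlib.new(name, modified).digest()
    print(f"{name}: {len(h1) * 8}-bit digest, "
          f"{differing_bits(h1, h2)} digest bits differ")
```

For a well-behaved hash, roughly half of the digest bits would be expected to change, which is why two blocks that differ by one bit do not end up with similar keys.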
The obvious solution is to add a dedicated offload processor that is optimized for compute-intensive algorithmic acceleration. The easiest way to do so is to add a plug-in PCI card that offloads the de-dupe hashing load from the main CPU and accelerates these time-consuming de-duplication computations. This type of hardware acceleration can scan data blocks and generate hashes at up to 250 megabytes per second. For very high-performance environments, the workload can be load-balanced across up to four PCI cards, providing aggregate performance of 1 gigabyte per second.
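As a rough back-of-the-envelope check of these figures, the following sketch computes aggregate hashing throughput and the time to hash a given volume of data. It assumes ideal linear scaling across cards, which real load balancing may not achieve; the function names and the 100 GB workload are hypothetical.

```python
CARD_MB_PER_S = 250  # peak per-card hashing rate quoted in the article
MAX_CARDS = 4        # maximum number of load-balanced cards quoted


def aggregate_mb_per_s(cards: int) -> int:
    """Aggregate rate, assuming ideal linear scaling across cards."""
    if not 1 <= cards <= MAX_CARDS:
        raise ValueError(f"cards must be between 1 and {MAX_CARDS}")
    return cards * CARD_MB_PER_S


def seconds_to_hash(data_mb: float, cards: int) -> float:
    """Time to hash data_mb megabytes at the aggregate peak rate."""
    return data_mb / aggregate_mb_per_s(cards)


workload_mb = 100_000  # hypothetical 100 GB (decimal) of incoming data
for cards in range(1, MAX_CARDS + 1):
    print(f"{cards} card(s): {aggregate_mb_per_s(cards)} MB/s, "
          f"{seconds_to_hash(workload_mb, cards):.0f} s")
```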
The plug-in data de-duplication accelerator card is a simple solution that can be applied to any network storage environment, working equally well in both SAN and NAS storage architectures. Whether an organization is using NAS filers or iSCSI storage appliances as a primary storage repository, the drop-in de-duplication accelerator is a significant value-add for the storage vendor, creating true product differentiation and a new value proposition for the user. In addition, it delivers this new functionality to NAS and SAN storage appliances without any long or complex integration process, creating a time-to-market advantage for storage OEMs and integrators.
In addition to high-performance hashing, the PCI cards can perform LZS data compression and cryptography functions, also at up to 250 MB/s, making them well suited to a wide variety of data reduction and data security applications. Once the data has been de-duplicated, it can be further reduced by applying LZS compression, producing a cumulative data reduction effect. Data de-dupe algorithms can produce data reduction on the order of 2-3:1. LZS compression then works on that reduced data load, shrinking it with a typical compression ratio of 2-3:1. Because the two ratios multiply, the total data reduction and storage capacity savings are on the order of 4:1 to 9:1. The de-duplicated and compressed data can be securely stored by adding data encryption, including AES 256. This solution works for most data de-duplication software platforms, as today’s acceleration processors support a variety of de-dupe and encryption algorithms, including AES 256, SHA-1 and MD5.
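The cumulative effect of the two stages can be worked through with simple arithmetic. The sketch below assumes the de-duplication and compression stages are independent, so their ratios multiply; measured results depend on the data. The function names and the 1,000 GB starting figure are hypothetical.

```python
def combined_ratio(dedupe_ratio: float, compression_ratio: float) -> float:
    """Overall reduction when compression runs on already de-duplicated data."""
    return dedupe_ratio * compression_ratio


def stored_gb(logical_gb: float, dedupe_ratio: float,
              compression_ratio: float) -> float:
    """Physical capacity needed for logical_gb of data after both stages."""
    return logical_gb / combined_ratio(dedupe_ratio, compression_ratio)


logical_gb = 1000.0  # hypothetical amount of primary data
for dedupe, compress in ((2, 2), (2, 3), (3, 3)):
    print(f"de-dupe {dedupe}:1, compression {compress}:1 -> "
          f"{combined_ratio(dedupe, compress):.0f}:1 overall, "
          f"{stored_gb(logical_gb, dedupe, compress):.1f} GB stored")
```

With both stages at the low end of the quoted range (2:1 each), 1,000 GB shrinks to 250 GB. With both at the high end (3:1 each), it shrinks to roughly 111 GB.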
The VMware example demonstrates the viability of data de-duplication for primary storage applications. As more applications create more redundancy within a primary storage tier, data de-duplication will become an essential technology beyond its traditional home in secondary storage. But data de-dupe will only succeed in primary storage applications if the de-dupe platform is robust enough to handle the demands of de-dupe algorithm processing in real time to preclude a performance hit for production applications and primary storage access. Delivering a powerful algorithm processor on a drop-in PCI card to offload data reduction and security processing offers OEMs, ODMs and VARs a fast and simple solution to expand data de-duplication beyond the backup and secondary storage market. This level of algorithmic processing horsepower promises to fulfill the potential of data de-duplication technology by applying its unprecedented data reduction capabilities across all storage tiers.
John Matze is vice president of business development at Hifn.
