Data Integrity In The Cloud


by Chris Marsh

This article is the second in a three-part series looking at the role of tape in the cloud, data integrity verification in the cloud, and archiving in the cloud.

Cloud storage can be an attractive means of outsourcing the day-to-day management of data, but ultimately the responsibility and liability for that data fall on the company that owns the data, not the hosting provider. With this in mind, it is important to understand some of the causes of data corruption, how much responsibility a cloud service provider holds, some basic best practices for utilizing cloud storage safely, and some methods and standards for monitoring the integrity of data regardless of whether that data resides locally or in the cloud.

Integrity monitoring is essential in cloud storage for the same reasons that data integrity is critical in any data center. Data corruption can happen at any level of storage and with any type of media. Bit rot (the weakening or loss of bits of data on storage media), controller failures, deduplication metadata corruption, and tape failures are all examples of corruption arising in different parts of a storage system. Metadata corruption can result from any of the vulnerabilities listed above, such as bit rot, but metadata is also susceptible to software glitches that fall outside hardware error rates. Unfortunately, a side effect of deduplication is that a corrupted file, block, or byte affects every piece of data tied to that metadata. The truth is that data corruption can happen anywhere within a storage environment. Data can even become corrupted simply by being migrated to a different platform, for example when it is sent to the cloud.

Cloud storage systems are still data centers, with hardware and software, and are still vulnerable to data corruption. One needs to look no further than the recent, highly publicized Amazon failure. Not only did many companies suffer prolonged downtime, but 0.07 percent of their customers actually lost data. It was reported that this data loss was caused by “recovering an inconsistent data snapshot of … Amazon EBS volumes.” What this translates to is that data in Amazon’s system became corrupted, and as a result, customers lost data.

Whenever data is lost, especially valuable data, there is a propensity to scramble to assign blame. Often in the IT world, this can result in lost jobs, lost company revenue, and, in severe cases, business demise. As such, it is critical to understand how much legal responsibility the cloud service provider bears under the service level agreement (SLA) and to ensure that every possible step has been taken to prevent data loss. As with many legal documents, SLAs are often written to the benefit of the provider, not the customer. Many cloud service providers offer varying tiers of protection, but as with any storage provider, they do not assume liability for the integrity of your data.

It is common practice for cloud SLAs to contain explicit statements protecting the cloud provider if data is lost or corrupted. An example of this language is found in the Amazon Web Services Customer Agreement, which states, “WE… MAKE NO REPRESENTATIONS OR WARRANTIES OF ANY KIND … THAT THE SERVICE OFFERINGS OR THIRD PARTY CONTENT WILL BE UNINTERRUPTED, ERROR FREE OR FREE OF HARMFUL COMPONENTS, OR THAT ANY CONTENT … WILL BE SECURE OR NOT OTHERWISE LOST OR DAMAGED.” In fact, this agreement even goes so far as to suggest that a customer make “frequent archives” of their data. As mentioned before, the responsibility for managing the integrity of data, whether in a data center, private cloud, hybrid cloud, or public cloud, always falls on the company that owns the data.

There are some common-sense best practices that allow a company to take advantage of the flexibility and accessibility of the cloud without putting its data at risk. The premise of data protection is to distribute risk so that the probability of data loss is minimized. Even when storing data in the cloud, it makes sense to keep a primary copy and a backup copy of the data onsite so that access to the data does not depend on network performance or connectivity. A company that adheres to these basic best practices and knows the details of its cloud provider’s SLA has the building blocks in place to implement a method for proactively monitoring the integrity of data, regardless of the storage platform or location.

One method for verifying the integrity of a set of data is based on hash values. A hash value is derived by condensing a set of data into a single fixed-size value by way of a pre-defined algorithm; with a well-chosen algorithm, two different sets of data are extremely unlikely to produce the same value. To compare two copies, a hash value is computed for each. Since each hash value is derived from the data itself, if the two hash values are not identical, it is an indicator that at least one of the two copies has been either altered or corrupted.
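The comparison described above can be sketched in a few lines of Python. This is only an illustration, not a tool referenced by the article: the choice of SHA-256 and the function names `file_sha256` and `copies_match` are assumptions made for the example.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks.

    SHA-256 is an assumed choice of algorithm; any hash that both
    copies are checked with consistently would serve the same purpose.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(primary_path, replica_path):
    """True when both copies hash to the same value."""
    return file_sha256(primary_path) == file_sha256(replica_path)
```

A mismatch only shows that the copies differ. Deciding which copy is the damaged one requires a third reference, such as a hash recorded when the data was first written.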

Make sure that the cloud provider offers the ability to check the hash value of the data it stores and compare it to the hash value of a second copy of the data, regardless of where that copy is stored. Undertaking this level of data monitoring manually would be beyond cumbersome. Fortunately, other methods are available, including programmatic checks. Spectra Logic and the other members of the Active Archive Alliance offer tools that automatically monitor the integrity of the data within their systems.
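One possible shape for such a programmatic check is a manifest: record a hash for every file when the data is known to be good, then periodically re-hash the files and report anything missing or changed. The sketch below is a hypothetical illustration and does not represent any vendor's tool. The names `build_manifest` and `audit`, the use of SHA-256, and the idea of running it against a locally mounted copy are all assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of one file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its current hash."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root, manifest):
    """Return {relative_path: "missing" | "changed"} for failed checks."""
    root = Path(root)
    problems = {}
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file():
            problems[rel] = "missing"
        elif sha256_of(target) != expected:
            problems[rel] = "changed"
    return problems
```

An empty result from `audit` means every file listed in the manifest is present and unchanged. The manifest should be stored separately from the data it describes, so that one failure cannot damage both.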

While an active archive is one method of monitoring data integrity, there remains a critical need for a widely adopted cloud standard protocol that supports integrity monitoring and interoperability. Because not all data centers have homogeneous equipment, and their equipment is not necessarily the same as the cloud hosting infrastructure, interoperability between different storage devices is crucial. The Cloud Data Management Interface (CDMI) standard was put forth in 2010 by the Storage Networking Industry Association (SNIA). A CDMI-compliant system can query another CDMI-compliant system for the hash value of an object, thus verifying that the two copies of data are still identical. By comparing the hash value of the primary copy of data with that of a backup copy, a company can verify that the copy of data stored in the cloud has not been corrupted. How frequently these data sets need to be monitored can be determined by the value of the data. Industry standards such as CDMI not only ensure interoperability between compliant heterogeneous systems, but also provide a convenient mechanism for data integrity monitoring.
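As a rough illustration of the hash query described above, the following Python sketch builds a CDMI-style request for an object's hash and compares the answer with a locally computed digest. Several details are assumptions to verify against the CDMI version a provider actually implements: that the hash is exposed as a hex-encoded `cdmi_hash` metadata field, that it can be requested with a `?metadata:cdmi_hash` query, and that the specification version header shown is accepted. The base URL, object path, and function names are placeholders invented for the example, and the request is only constructed here, not sent.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Assumed specification version; use the one your provider supports.
CDMI_VERSION = "1.0.2"


def build_hash_request(base_url, object_path):
    """Build (but do not send) a GET asking only for the object's hash.

    The "?metadata:cdmi_hash" query form is an assumption based on the
    CDMI convention of selecting metadata fields by prefix.
    """
    path = quote(object_path.lstrip("/"))
    url = f"{base_url.rstrip('/')}/{path}?metadata:cdmi_hash"
    headers = {
        "Accept": "application/cdmi-object",
        "X-CDMI-Specification-Version": CDMI_VERSION,
    }
    return Request(url, headers=headers, method="GET")


def extract_hash(response_body):
    """Pull the cdmi_hash value out of a CDMI JSON response, or None."""
    document = json.loads(response_body)
    return document.get("metadata", {}).get("cdmi_hash")


def matches_local(local_hex_digest, response_body):
    """Compare a locally computed hex digest with the remote hash."""
    remote = extract_hash(response_body)
    if remote is None:
        raise ValueError("response carries no cdmi_hash metadata")
    # Hex digests are compared case-insensitively.
    return remote.lower() == local_hex_digest.lower()
```

For the comparison to mean anything, the local digest must be computed with the same algorithm the remote system used for the object.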

It’s hard to dispute that the cloud industry has taken a few punches in the media recently, especially with large vendors like Iron Mountain discontinuing their basic cloud storage services and the previously discussed data loss at Amazon. However, the moral of this story isn’t that the cloud is an unwise storage platform, but rather that when investigating and implementing cloud strategies, there are more factors to consider than simply cost per gigabyte stored. Cloud storage offers many advantages to companies of any size when properly implemented. What the cloud doesn’t do is eliminate the need for intelligent data management strategies. Regardless of how or where data is stored, it is absolutely crucial to make certain it will be accessible and restorable when needed. This assurance is at the very heart of data integrity monitoring and verification.

Chris Marsh is the IT market and development manager at Spectra Logic (Boulder, CO).