Attacking the Growth Barrier of Data Storage

IDC estimated that the amount of data stored worldwide in 2007 was nearly 300 exabytes, and that this number would grow ten-fold by 2011. The same research shows that ninety percent of all data created in the next five years will be digital content, namely video, audio and image objects.

Storing, archiving, and backing up large volumes of digital content is quickly creating demand for multi-petabyte storage systems (a petabyte equals a thousand terabytes or a million gigabytes), but today’s storage industry -- and its technological approach -- is not set up to meet this demand effectively.

Enterprises have become comfortable with RAID (redundant array of inexpensive disks) and replication (storing copies of data to multiple storage devices) as the default means of storage, and to date, they have worked well for storing structured data. Little consideration, however, has been given to the idea that one day (a day that is fast approaching for many organizations) there will be too much data to efficiently, securely and cost-effectively store using replication. When that day comes, what happens?

Enterprises are approaching the growth barrier of data storage -- the point at which the volume of data to be stored outgrows what the current storage model can functionally and economically support.

The Scalability Problems of RAID and Replication

Current data storage systems based on RAID arrays were designed for storing structured data, which was significantly smaller than digital content. RAID schemes are based on parity, meaning that if more drives in an array fail simultaneously than its parity can cover (one drive for RAID 5, two for RAID 6), data is not recoverable. The statistical likelihood of multiple drive failures has not been an issue in the past, but as systems grow to hundreds of terabytes and petabytes (brought on by digital content), multiple drive failures have become a credible threat.
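To make the scaling argument concrete, here is a rough sketch of how the chance of an unrecoverable failure grows with the number of RAID groups. It assumes drives fail independently, that each group has five drives and tolerates one failure, and that each drive has a 0.1 percent chance of failing within the same rebuild window. All of these figures and function names are hypothetical and chosen only for illustration.

```python
from math import comb

def prob_group_loss(drives, p_fail, tolerated):
    """Probability that more than `tolerated` drives in one group fail together.

    Assumes independent drive failures (a simplification).
    """
    p_survive = sum(
        comb(drives, i) * p_fail**i * (1 - p_fail) ** (drives - i)
        for i in range(tolerated + 1)
    )
    return 1 - p_survive

def prob_any_group_loss(groups, drives=5, p_fail=0.001, tolerated=1):
    """Probability that at least one of `groups` RAID groups loses data."""
    p_group = prob_group_loss(drives, p_fail, tolerated)
    return 1 - (1 - p_group) ** groups

# The per-group risk is tiny, but it compounds as the system grows.
for groups in (1, 100, 10_000):
    print(groups, f"{prob_any_group_loss(groups):.6f}")
```

Under these assumptions the risk for a single group is negligible, yet it rises steadily with the number of groups, which is the effect the paragraph above describes.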

When enterprises were first confronted with the shortcomings of RAID, they turned to replication, a technique of making additional copies of their data to guard against unrecoverable read errors and lost data. Most organizations today rely on replication to deal with failure scenarios such as location failures, power outages and bandwidth unavailability, but replication will present a serious problem in the future -- as storage grows from the terabyte to the petabyte range, the number of copies required to keep data protection constant increases to the point of unmanageability. Not only does the size of the data undermine the functionality of the storage, but it also means storage systems will get exponentially more expensive while simultaneously decreasing in reliability.

Theoretically, RAID and replication are viable approaches for protecting data. But in the real world, enterprises face resource and financial limitations. Under those constraints, and with massive storage requirements, storage based on replication simply doesn’t “pencil out.”

Typically, RAID increases raw storage to at least 125 percent of the data stored (with RAID 4+1, that is, four data drives plus one parity drive). If the storage is then replicated to one or two additional locations to provide offsite protection and avoid a single point of failure, the raw storage requirement swells to 250 percent for two full copies and 375 percent for three. This dramatic increase in raw storage requirements increases both capital expenditures and the total cost of ownership. This illustrates the growth barrier of data storage -- enterprises either need to spend exorbitant amounts for data protection, or spend less but inevitably experience data loss. Bottom line -- because RAID and replication require more raw storage per terabyte to maintain data protection as the amount of data increases, they are cost prohibitive at scale.
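The raw-storage arithmetic above can be reproduced with a short sketch. It reads the RAID layout as four data drives plus one parity drive, which is an assumption consistent with the 125 percent figure, and the function name is hypothetical.

```python
def raw_storage_factor(data_drives, parity_drives, copies):
    """Raw storage needed per unit of data, for RAID plus full-copy replication."""
    raid_factor = (data_drives + parity_drives) / data_drives  # 5 / 4 = 1.25
    return raid_factor * copies

# One, two and three full copies of a four-data, one-parity RAID array.
for copies in (1, 2, 3):
    print(copies, f"{raw_storage_factor(4, 1, copies):.0%}")
# prints 125%, 250% and 375%
```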

Handling the Tide of Unstructured Data

A new approach to storage, via dispersal rather than replication, is better suited to handle the unbridled tide of unstructured data. Instead of copying data, dispersal divides data into "slices" and stores the slices in a secure network. This network can either exist within a single site, or be geographically dispersed. When geographically dispersed, it provides additional site failure protection. Each slice contains too little information to be useful, but any threshold number of the slices can be used to perfectly re-create the original data. The total size of all the slices is still less than that of the multiple copies of the original data kept with RAID and replication. Real-time data retrieval is always bit-perfect as long as a threshold number of slices is available.
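The threshold idea can be illustrated with a minimal sketch of a generic erasure code. This is not a description of any vendor's actual algorithm. It treats each group of k bytes as the coefficients of a polynomial over a small prime field and stores one evaluation per slice, so any k of the n slices are enough to solve for the original bytes. The field size, the function names and the choice of k = 3, n = 5 are all assumptions made for illustration.

```python
P = 257  # small prime field; slice values are integers in 0..256 (illustrative)

def encode(data, k, n):
    """Split `data` into n slices so that any k of them can rebuild it."""
    padded = list(data) + [0] * (-len(data) % k)
    slices = [[] for _ in range(n)]
    for i in range(0, len(padded), k):
        coeffs = padded[i:i + k]            # k bytes become polynomial coefficients
        for x in range(1, n + 1):           # slice x holds the polynomial's value at x
            y = 0
            for c in reversed(coeffs):      # Horner's rule
                y = (y * x + c) % P
            slices[x - 1].append(y)
    return slices

def _invert(matrix):
    """Invert a square matrix over GF(P) by Gauss-Jordan elimination."""
    k = len(matrix)
    aug = [row[:] + [int(i == j) for j in range(k)] for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], P - 2, P)  # modular inverse via Fermat's little theorem
        aug[col] = [v * inv % P for v in aug[col]]
        for r in range(k):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [(a - f * b) % P for a, b in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def decode(available, k, length):
    """Rebuild the data from a dict {slice_number: slice_values} with >= k entries."""
    xs = sorted(available)[:k]
    inv = _invert([[pow(x, j, P) for j in range(k)] for x in xs])
    out = []
    for pos in range(len(available[xs[0]])):
        ys = [available[x][pos] for x in xs]
        for row in inv:                     # recover each coefficient (original byte)
            out.append(sum(a * y for a, y in zip(row, ys)) % P)
    return bytes(out[:length])

message = b"digital content"
slices = encode(message, k=3, n=5)
survivors = {2: slices[1], 4: slices[3], 5: slices[4]}  # slices 1 and 3 are lost
assert decode(survivors, k=3, length=len(message)) == message
```

In this sketch the five slices together hold 5/3 of the original size, compared with 2 or 3 times the size for two or three full copies, while still tolerating the loss of any two slices.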

The beauty of dispersal is that as storage increases, the cost per unit of storage does not. Dispersal also meets the same reliability target. The table below gives a generic example of the cost savings of dispersal over replication, with the cost fixed at a representative, but not necessarily actual, “raw” storage cost of $2.75 per GB.

 

Table: Data storage cost comparison

Source: http://www.cleversafe.com/vision/dispersal-smart-economics.
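The arithmetic behind such a comparison can be sketched as follows. Only the $2.75 per GB raw cost and the 375 percent replication figure come from the article. The dispersal configuration of 16 slices with a threshold of 10 is purely hypothetical, and the results are not the figures from the table.

```python
COST_PER_GB = 2.75  # representative raw storage cost quoted in the article

def cost_per_usable_tb(expansion_factor, cost_per_gb=COST_PER_GB):
    """Raw storage cost of holding 1 TB of data (taken here as 1,000 GB)."""
    return expansion_factor * 1000 * cost_per_gb

replication_cost = cost_per_usable_tb(3.75)    # RAID 4+1 with three full copies
dispersal_cost = cost_per_usable_tb(16 / 10)   # hypothetical: 16 slices, threshold 10

print(f"replication: ${replication_cost:,.2f} per usable TB")
print(f"dispersal:   ${dispersal_cost:,.2f} per usable TB")
```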

Dispersed storage will enable organizations to scale cost-effectively without the bandwidth overhead and additional capacity required by a copy-based storage system. The million-dollar question is: where will your organization be when it reaches the growth barrier of data storage?

Chris Gladwin is the CEO at Cleversafe.