by Chris Marsh
When the cloud concept first entered public consciousness and took its initial fluffy form, the idea was that the cloud was a sort of heavenly version of the Internet, with lots of distributed storage so that data never went away. It sounded terrific. Then lightning struck: data was destroyed, and the fluff fell off the cloud.
According to some in the industry, initial cloud infrastructures depended solely or heavily on disk. Disk keeps latency low but does not adequately protect data. The smaller data sets initially stored in the cloud could tolerate the disk-only model; as data sets grew, however, the disk-only infrastructure became brittle. Self-propagating errors and cascading failures took down some user services, some users lost data (it rained), and some businesses aren't in business any longer (it poured).
Since then, cloud providers have taken steps towards protecting data through the integration of tape. This prevents catastrophic data loss. With tape protecting data, the data could be brought back into the cloud and made accessible to users, even in the event of significant disk failure.
The inclusion of tape is in many ways the single biggest change since the early days of cloud architectures. It is estimated that at least 90 percent of cloud platforms rely on tape to protect data. In fact, cloud consumers may want to make sure that tape is part of the storage infrastructure of any cloud in which they store data.
Cloud is now more than a concept. It has matured into a true business model, with serious storage infrastructure and strategy. In this model, some data remains exclusively on tape, providing inexpensive storage for latency-tolerant data and applications. Examples of latency-tolerant data include personal multimedia files and financial records: a user can typically wait a minute or two for large images and music files to come back from the cloud, or wait a brief period to see pay stubs online or images of checks written two years ago.
Large Data Sets Change the Cloud Model
Cloud storage originated in well-networked storage behind secure firewalls. This developed into a more widely available system of storage that relied on public networks to provide services to a broad set of users. The initial adopters tended to be smaller organizations, partly because their data sets were smaller and therefore more easily sent, stored and retrieved from the cloud. With the cloud, economies of scale became available to small businesses that could not have accessed these infrastructures any other way.
As larger and larger institutions started adopting public and private clouds, data quantities became a significant issue. For example, how do you move very large files, such as video, and very large data sets (say, a petabyte) to the cloud?
Both disk and tape address these emerging issues. Disk supports multiple concurrent user access to very large files, while tape makes it easy to move large quantities of data to and from the cloud. In extreme situations, moving big data sets over the wire becomes virtually impossible due to time limitations and the high likelihood of interruption during the lengthy transmission period.
Simple math demonstrates the problems associated with transferring large data sets between the customer and the cloud. Moving a petabyte of data from disk to the cloud, assuming the continuous, uninterrupted use of a 1 Gb Ethernet network, requires 104 days. Using tape to move the same data, assuming a 1.24 GB/s ingest rate, requires a total of 9.5 days plus one day to overnight the tape to the cloud provider.
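The arithmetic behind those figures can be sketched in a few lines. This is a back-of-the-envelope calculation: the petabyte is taken here as a binary petabyte (2^50 bytes), which reproduces the 104-day network figure; the exact tape number shifts by a day or so depending on decimal versus binary unit conventions.

```python
# Rough transfer-time comparison for moving 1 PB to the cloud.
# Assumptions: binary petabyte (2**50 bytes), fully saturated links,
# no interruptions or protocol overhead.
PETABYTE = 2**50              # bytes
NET_RATE = 1e9 / 8            # 1 Gb/s Ethernet, expressed in bytes per second
TAPE_RATE = 1.24e9            # assumed 1.24 GB/s tape ingest rate
SECONDS_PER_DAY = 86_400

net_days = PETABYTE / NET_RATE / SECONDS_PER_DAY
tape_days = PETABYTE / TAPE_RATE / SECONDS_PER_DAY + 1  # +1 day to ship the tapes

print(f"Over the wire: {net_days:.0f} days")   # roughly 104 days
print(f"Via tape:      {tape_days:.1f} days")  # on the order of 10 days
```

Even granting the network an uninterrupted, fully saturated link, tape wins by roughly an order of magnitude at this scale.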
As some like to point out, a box of tapes in a shipping company’s airplane (or a station wagon) moves a petabyte of data a lot faster than an Ethernet line can.
Switching Cloud Providers or Exiting Altogether
Some organizations start with a small data set, others with a large data set. Either way, it’s very likely that with what appears to be unstoppable data growth, the original data set will end up being a much larger data set. This means that exiting the cloud will pose a significant challenge for nearly everyone.
Tape fits in well here. For moving large quantities of data, tape used as NAS is a straightforward solution: organizations can manage data in the cloud while ensuring easy transfer between the user and the cloud provider. The Linear Tape File System (LTFS) is an open format that enables data portability between heterogeneous environments. You can send a tape regardless of the destination platform, and any tape you receive from the provider is in an open format supported by free drive-level software.
Assuming the cloud provider follows best practices, data is transmitted along with information (checksums) confirming that the data received is identical to the data sent. This supports what is sometimes called self-healing: the system uses the before-and-after comparison to detect corruption and repair the data as necessary.
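As a sketch of that before-and-after comparison (SHA-256 is an assumption here; the article does not specify which checksum algorithm providers use):

```python
import hashlib

def digest(data: bytes) -> str:
    """Checksum recorded before transmission and re-checked on receipt."""
    return hashlib.sha256(data).hexdigest()

payload = b"block of user data"
sent_digest = digest(payload)   # computed by the sender, shipped with the data

# Receiving side: recompute and compare against the shipped digest.
received = payload              # stand-in for the transmitted copy
if digest(received) != sent_digest:
    # A mismatch means corruption in transit; a self-healing system would
    # repair the block from a replica or request retransmission.
    raise IOError("checksum mismatch -- data corrupted in transit")
```

The "healing" part is simply acting on a failed comparison: rather than storing the bad copy, the system fetches a known-good one.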
Smart Development: Turning Files into Data Objects
Increasingly, cloud structures are adopting standards to simplify data movement between cloud platforms and customers. More and more, files are being turned into objects, where an object is simply file data wrapped together with its own metadata. This lets data move around without losing its self-descriptive information.
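A minimal sketch of the idea: bundle the raw file bytes with descriptive metadata so the result is self-contained. The field names here are hypothetical illustrations, not the schema of any particular object store.

```python
import hashlib
import json
import time

def make_object(name: str, data: bytes) -> dict:
    """Wrap file data with its own metadata so the object stays
    self-describing wherever it moves (hypothetical schema)."""
    return {
        "name": name,
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),  # integrity info travels with the data
        "stored_at": time.time(),
        "data": data.hex(),  # hex-encoded so the object serializes cleanly to JSON
    }

obj = make_object("statement.pdf", b"%PDF-1.4 example bytes")

# The metadata survives serialization, so any receiving platform can
# describe and verify the object without out-of-band information.
print(json.dumps({k: v for k, v in obj.items() if k != "data"}, indent=2))
```

Because the checksum and description ride along inside the object, integrity checks can be enforced at every hop rather than only at the endpoints.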
Objects enable the enforcement of data integrity checks throughout the data transmission process. The Storage Networking Industry Association’s (SNIA’s) cloud data management interface (CDMI) is an example of an industry standard protocol that supports object-based storage and movement. Increasing standardization will make it easier to move data objects across cloud providers and heterogeneous platforms, and between public and private clouds.
At first, the cloud seemed more conceptual than real. But today, it is very real: a key component of data protection for many organizations. The technological advantages spawned by cloud development will be useful for large data sets independent of any specific platform or provider.
Data dissemination is a nearly inevitable result of data creation, so the importance of data transmission will increase in step with data growth. New technologies simplify data transmission and data preservation. The technology supporting portable, self-describing, self-contained data is a significant advantage to the storage industry and to just about everyone who stores data.
Chris Marsh is the market development manager at Spectra Logic (Boulder, CO). www.spectralogic.com