by Robin Schumacher
It seems like a marriage made in heaven: big data and cloud computing.
Big data has certainly grabbed its share of the spotlight these days. Earlier this year, Forbes writer John Furrier wrote: “We all know that Big Data is the hottest sector in IT at the moment.”
Confirming Furrier’s thought was Jeff Kelly of Wikibon, who wrote a recent report on big data that seemed to scold any IT executive who dares not have a big data strategy in the works: “Big Data is the new definitive source of competitive advantage across all industries. For those organizations that understand and embrace the new reality of Big Data, the possibilities for new innovation, improved agility, and increased profitability are nearly endless.”
Hey, let’s be honest: Who doesn’t want near-endless agility and profitability?
At the same time, cloud computing is muscling its way into the hearts of many more devotees, and it’s beginning to show. The amount of information that resides only in the cloud today may be small, but a recent study by IDC estimates that by 2015, nearly 20 percent of all information will be “touched” (stored or processed) in a cloud. No matter how you look at it, that isn’t small potatoes.
So big data and cloud computing are certainly colliding, but the question is: How well will that “collision” take place? To make it more personal – if you’re going to be tasked with developing a big data strategy, and if part of that strategy involves deciding whether to deploy your big data infrastructure on premises or in the cloud (or a combination of both), what criteria should you look at to help make that decision?
Let’s look at some defining characteristics of both a cloud database and a big data database that can help you narrow your focus, and then see how those can be combined into a solution that actually delivers what you hope it will.
Must Haves for a Cloud Database
It’s probably fair to say that every database software vendor claims its database can be a cloud database. It’s also probably fair to say that many of those databases are in no way a cloud database.
Although industry experts and analysts disagree over the exact characteristics that constitute a cloud database, they pretty much all agree that taking a single relational database management system (RDBMS) installation and deploying it on a cloud provider’s platform does not make it a cloud database.
This, of course, raises the question: What does constitute a cloud database?
Simply put, a cloud database is architected to maximize the benefits of a global cloud platform. What this practically boils down to is being able to support the following key attributes:
- Transparent elasticity and scalability – this equates to being able to add/subtract database capacity to meet the current workload, and do so in a way that is transparent to the underlying application.
- Continuous availability – this isn’t “high availability”, which can still involve unplanned downtime, but continuous availability, which means you don’t go down unless you want to.
- Location independence – this means being able to read and write to any node that is part of a database anywhere in the world, and have that data distributed to wherever it needs to go.
- Management simplicity – a cloud database should minimize administration tasks and simplify (or automate) the provisioning and altering of a cloud database cluster.
- Efficient cost – naturally, one of the goals in moving to the cloud is lower cost than managing things on premises.
Of course, there are other, more minor cloud database features that some will deem important (e.g., multi-tenancy), but the five characteristics above are the must-haves you’ll want supported in your cloud database.
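One way to keep these five must-haves in front of you during an evaluation is to treat them as a literal checklist. The sketch below is only an illustration: the `Candidate` structure, the function name, and the sample database are hypothetical, and the sample values are placeholders rather than an assessment of any real product.

```python
from dataclasses import dataclass, field

# The five must-have attributes named in the list above.
MUST_HAVES = (
    "transparent elasticity and scalability",
    "continuous availability",
    "location independence",
    "management simplicity",
    "efficient cost",
)


@dataclass
class Candidate:
    """A database under evaluation (hypothetical structure, for illustration)."""
    name: str
    supported: set = field(default_factory=set)


def missing_must_haves(candidate):
    """Return the must-haves the candidate lacks, in the order listed above."""
    return [attr for attr in MUST_HAVES if attr not in candidate.supported]


# Made-up candidate; the supported set is a placeholder, not real product data.
example = Candidate(
    name="example-db",
    supported={"continuous availability", "efficient cost"},
)
gaps = missing_must_haves(example)
print(f"{example.name} is missing: {gaps}")
```

An empty result would mean the candidate covers all five attributes, at least on paper; a proof of concept is still needed to confirm the claims.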
What Qualifies as a Big Data Database?
In the same way that nearly all DBMS vendors say they have a cloud database, almost every database supplier also claims to be in the big data business. Most are not.
Analysts such as Gartner define big data as involving not merely tera/petabytes (i.e. data volume), but also high-velocity data, a variety of data (structured, semi-structured and unstructured), and data complexity, which basically means taking the first three characteristics of big data and managing them across many different geographies, data centers, etc.
Surprisingly, nearly all industry analysts and experts sing the same tune where the definition of big data management is concerned: Something new – technology-wise – is needed to tackle it.
For example, IDC says: “Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery and/or analysis.” O’Reilly agrees and says: “Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.”
Implied in a big data database like that described by Gartner, IDC and O’Reilly is the capability to handle real-time data coming in very fast from transactional systems or devices, while also offering mechanisms for easily analyzing and searching that data once it has settled into the database.
Combining a Cloud and Big Data Database
If we join together the key needed traits of a cloud database with the requirements for a big data database, a semi-clear picture emerges as to what you should be looking for.
First, you’ll want something other than a traditional RDBMS. Legacy RDBMSs are not on the short list because of their inability to handle the variety of data in big data, their use of master-slave architectures that don’t meet the cloud requirement for location independence, their inability to scale write operations, and the management burden they typically impose when you try to scale out massively.
Instead, your “big data cloud database” will have the following attributes:
- A masterless or peer-to-peer architecture that is designed to elastically distribute data wherever needed and meets the requirement for location independence (i.e. read/write anywhere);
- Built-in replication for easy data redundancy that is geared for multiple data centers and/or different zones on a cloud
- A flexible or dynamic data model that can accommodate all types of data with ease versus the rigidity found in RDBMSs;
- Transparent scalability that can reach into the petabyte and high concurrent user range if necessary, with linear performance gains being realized through node additions;
- Architecture to handle high incoming data velocity, which keeps up with data that is sensor or machine generated on a
- A bundled web-based management and monitoring
- A licensing/cost structure that costs much less than that of traditional RDBMS vendors.
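To make the masterless idea a little more concrete, here is a minimal sketch of consistent hashing, one common technique for spreading data across peers without a master. It is a toy under stated assumptions, not any vendor’s implementation: the `Ring` class, the node names, and the choice of 64 virtual nodes and 3 replicas are all hypothetical, and replica placement here ignores data-center or zone awareness, which a real system would need.

```python
import bisect
import hashlib


class Ring:
    """Toy masterless ring: every node is a peer and can locate any key's replicas."""

    def __init__(self, nodes, vnodes=64, replicas=3):
        self.replicas = replicas
        self._vnodes = vnodes
        self._nodes = set()
        self._tokens = []  # sorted list of (token, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        # MD5 serves only as a stable, well-spread hash here, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        """Add a peer; its virtual nodes take over slices of the token space."""
        self._nodes.add(node)
        for i in range(self._vnodes):
            bisect.insort(self._tokens, (self._hash(f"{node}#{i}"), node))

    def replicas_for(self, key):
        """Walk clockwise from the key's token, collecting distinct nodes."""
        wanted = min(self.replicas, len(self._nodes))
        start = bisect.bisect(self._tokens, (self._hash(key), ""))
        owners = []
        for offset in range(len(self._tokens)):
            node = self._tokens[(start + offset) % len(self._tokens)][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == wanted:
                    break
        return owners


# Usage: add a fifth node and count how many keys change their primary owner.
ring = Ring(["dc1-a", "dc1-b", "dc2-a", "dc2-b"])
keys = [f"sensor-{i}" for i in range(1000)]
before = {k: ring.replicas_for(k)[0] for k in keys}
ring.add_node("dc2-c")
after = {k: ring.replicas_for(k)[0] for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys changed primary owner")
```

The point of the design is that adding a node only shifts the keys that the new node takes over; the rest stay where they were, which is what lets capacity grow without a disruptive reshuffle of the whole data set.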
So what types of technology are we talking about that can do these things and run well in the cloud?
With RDBMSs not being a first choice, the next natural place to turn is to a NoSQL database. While it’s true that NoSQL databases can measure up quite well to the big data/cloud challenge, it’s important to note that they are not all created equal.
For example, some still use master/slave architectures, require manual data distribution or sharding (partitioning), can’t handle multiple data centers or geo zones, and aren’t geared for large data volumes (e.g. they are main memory only). The underlying data model is also important to understand, as it may or may not be equipped to handle high-velocity, time-series data that could have lots of updates.
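To see why the data model matters for time-series data, consider a minimal sketch of one common approach: grouping each sensor’s readings into a partition per day, with the rows inside a partition ordered by time. Everything here is an assumption for illustration, including the `TimeSeriesStore` class, the daily bucket size, and the sample readings; it is an in-memory stand-in, not the API of any particular database.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(sensor_id, timestamp):
    """Group one sensor's readings for one UTC day into a single partition."""
    day = timestamp.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (sensor_id, day)


class TimeSeriesStore:
    """In-memory stand-in for a partitioned store (hypothetical, for illustration)."""

    def __init__(self):
        # partition key -> {timestamp: value}
        self._partitions = defaultdict(dict)

    def write(self, sensor_id, timestamp, value):
        # Writing the same timestamp again overwrites the old value (an update).
        self._partitions[partition_key(sensor_id, timestamp)][timestamp] = value

    def read_day(self, sensor_id, day):
        """Return one day's (timestamp, value) pairs for a sensor, ordered by time."""
        rows = self._partitions.get((sensor_id, day), {})
        return sorted(rows.items())


store = TimeSeriesStore()
t0 = datetime(2012, 6, 1, 11, 59, tzinfo=timezone.utc)
t1 = datetime(2012, 6, 1, 12, 0, tzinfo=timezone.utc)
store.write("sensor-1", t1, 21.5)
store.write("sensor-1", t0, 21.4)  # arrives out of order
store.write("sensor-1", t1, 21.7)  # update to an existing reading
readings = store.read_day("sensor-1", "2012-06-01")
print([value for _, value in readings])  # prints [21.4, 21.7]
```

Bucketing by sensor and day keeps any single partition from growing without bound and lets a day’s readings be fetched in one lookup; the right bucket size depends on how fast each source produces data.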
In addition, you should look for a NoSQL database that either builds a bridge to Hadoop for analytic processing or integrates directly with Hadoop for a more seamless experience. Lastly, if search is important to you, then find out what (if any) built-in search abilities the database has.
A quick survey of NoSQL options shows that Apache Cassandra, Amazon’s DynamoDB, and Riak appear to have the most checkmarks where the above criteria are concerned.
There’s no doubt that big data and the cloud are here to stay, and if you’re not already tasked with coming up with a strategy that marries the two now, you probably will be soon.
The criteria above will help you narrow the candidate field for your big data cloud database, but of course, nothing takes the place of an in-house proof of concept that puts the software vendor’s claims to the test where your particular application’s needs are concerned.
Robin Schumacher is VP of Products at DataStax (San Mateo, CA). www.datastax.com