by Wayne Salpietro
The time has arrived when we can deploy deduplication technology everywhere up and down the information stack, from operating systems to applications and across the storage tiers, including the cloud. Surprised? You shouldn't be. Think of deduplication as an evolving technology. It is no longer the old standalone backup appliance that only a few select people with the secret password can touch. Rather, today's deduplication technology is a flexible tool that can save storage costs across the board. In a way, dedupe has "come out of the box." There are now high performance deduplication solutions that can be deployed as software (usually embedded into the storage) and that have new features making them flexible enough to be deployed everywhere, enabling them to optimize IT. Why everywhere? That's easy: because data is everywhere. Dedupe saves costs and improves efficiency, so it makes sense to deploy it everywhere it can make a positive impact on your business.

Before we delve into the many benefits of dedupe, though, let's examine why we need it in the first place. The explanation is simple, really: we have too much data, and we are saving more of it every day. In fact, IDC predicts that 35 zettabytes of data will have been created by 2020 (1 zettabyte is equal to one billion terabytes). What's more, the total will keep growing annually as we create and save all of our information, just in case we need it. That may sound like a good idea, but with IT budgets under extreme pressure and expanding by a paltry 3% or so per year, there is an ever widening gap between the need to store information and the ability to afford storage for the ever increasing amounts of it.
Today's businesses have tools at their disposal that help, such as storage tiering and data migration to lower performing, lower cost media. The cost of storage per GB also continues to decline. However, the data growth we see, 50% or more per year, is outstripping both businesses' and the industry's ability to cope with the deluge. Something must be done, or something is going to break!
Dedupe Isn’t Just Dedupe
Dedupe technology was first developed to solve a specific backup problem: how to store more backup data sets on disk and make disk an affordable alternative to tape. Developers of backup dedupe knew they would be seeing small numbers of large files, so products were built to handle those backup files, not the billions of objects found in a primary storage environment. In fact, backup dedupe is most often employed as a post-process approach to minimize the performance impact on running applications. As a result of those design choices, backup deduplication approaches are neither scalable nor fast enough to apply to broader use cases in primary storage or other dedupe-everywhere applications.
Thanks to tireless work on indexing technology and memory utilization, today's dedupe technology can scale to multiple petabytes. Industry leading dedupe engines are now fast and resource efficient enough to overcome the limitations of first generation deduplication, even in backup use cases. They can ingest data at rates that do not degrade overall storage performance, can be deployed inline, and can analyze billions of data chunks in microseconds, enabling deduplication to scale to multiple petabytes of data.
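The core mechanism behind these engines is a content-addressed chunk index: data is split into chunks, each chunk is fingerprinted, and the index is consulted so each unique chunk is stored only once. The following is a minimal, illustrative sketch of that idea (fixed-size chunks, an in-memory dict as the index), not any vendor's actual implementation:

```python
import hashlib

class ChunkStore:
    """Toy content-addressed store: each unique chunk is kept once;
    duplicates are detected by a fingerprint lookup in the index."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.index = {}         # fingerprint -> chunk bytes
        self.logical_bytes = 0  # bytes written by clients
        self.stored_bytes = 0   # bytes physically kept

    def write(self, data: bytes) -> list:
        """Split data into fixed-size chunks and store only unseen ones.
        Returns the list of fingerprints (the object's chunk recipe)."""
        refs = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.index:       # new chunk: store it once
                self.index[fp] = chunk
                self.stored_bytes += len(chunk)
            self.logical_bytes += len(chunk)
            refs.append(fp)
        return refs

    def read(self, refs: list) -> bytes:
        """Reassemble an object from its chunk recipe."""
        return b"".join(self.index[fp] for fp in refs)
```

Production engines use variable-size chunking and carefully tuned on-disk indexes to reach the scale and speed described above; the sketch only shows why duplicate data costs almost nothing once the first copy is indexed.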
New Approaches To Dedupe
New functionality begets new deployment approaches that maximize the value and resiliency of your data assets. With an IO-optimized ingest rate that runs circles around its backup-only predecessor, dedupe now offers nearly limitless scalability. Today's high performance dedupe engines can be employed anywhere: primary storage (including SSD), tier 2, archive, replication, backup or even the cloud. A version of the same dedupe engine can sit in each storage tier and thus be deployed universally across all storage. As unified storage evolves, a dedupe engine can sit inside the unified storage array and apply to as many tiers as the vendor needs. In any deployment, the benefit is the same: once the initial data chunk is seen and analyzed for duplication, usually in primary storage, it never needs to be rehydrated as it moves across the storage tiers during its lifecycle, because the dedupe engines are common across the tiers. This yields two dramatic results. First, it saves storage space at each layer of storage (primary, tier 2 and so on). Second, it reduces processing load: fewer cycles are needed to analyze the data at each tier, and less storage is required, because the deduplication occurred upstream. The process grows increasingly efficient as data moves down the tiers.
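The no-rehydration migration described above can be illustrated with a toy sketch. Here each tier is assumed to keep its own fingerprint-to-chunk index (plain dicts for illustration), and an object is represented by its chunk recipe, a list of fingerprints; the object itself is never reassembled in transit:

```python
def migrate(refs, src_index, dst_index):
    """Move a deduplicated object between tiers via its chunk recipe.
    The destination index is queried per chunk, and only chunks the
    destination lacks are copied; the object is never rehydrated.
    Hypothetical sketch; real tiers use durable on-disk indexes."""
    copied = 0
    for fp in set(refs):          # each unique chunk at most once
        if fp not in dst_index:   # destination already has it? skip
            dst_index[fp] = src_index[fp]
            copied += 1
    return copied
```

If the destination tier has already seen a chunk (from any other object), nothing is transferred for it, which is the source of the compounding savings at each layer.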
But let's not stop at storage. Data is created higher up the stack, in applications, so why not apply deduplication there? Until now, this was unheard of. With the deduplication engine embedded in the application, data storage is optimized because duplicate checks are made at the application layer: before a chunk of data is sent to the next tier, that tier is queried to determine whether the duplicate already exists at that location. This is done at very high speed, and the process is efficient enough not to impede application processing. It is a completely new approach to data deduplication. If you can optimize data at its point of creation, the efficiency benefits are realized across the entire data stack. That saves CAPEX (capital expenditure) on storage purchases and OPEX (operating expense) on management throughout the data lifecycle, because there is no need for rehydration and the performance impact it brings, and no need for duplicate storage to be bought, managed and warehoused. The same approach could also be deployed in an operating system, adding to the overall data efficiency and financial impact.
On the other end of the data storage spectrum, cloud-based storage deployments also benefit from dedupe everywhere. With a deduplication engine on board, the uploading client queries the cloud storage to determine whether the data already exists there. If it does, the data is not resent, which saves both communications bandwidth and storage space. Data is stored without rehydration and remains as space efficient in the cloud as it was before the upload. Cloud implementations thus save storage costs and let the enterprise capture the CAPEX and OPEX savings typically seen in external cloud storage.
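The query-before-send protocol used by both the application-layer and cloud cases can be sketched as follows. The `cloud_has` and `cloud_put` callables here are hypothetical stand-ins for a real storage API, and real protocols batch these round trips rather than issuing one per chunk:

```python
import hashlib

def upload_deduped(local_chunks, cloud_has, cloud_put):
    """Send only chunks the remote side does not already hold.
    cloud_has(fp) -> bool and cloud_put(fp, chunk) are assumed,
    hypothetical remote calls. Returns (chunks_sent, chunks_skipped)."""
    sent = skipped = 0
    for chunk in local_chunks:
        fp = hashlib.sha256(chunk).hexdigest()
        if cloud_has(fp):
            skipped += 1          # duplicate: no bandwidth or space used
        else:
            cloud_put(fp, chunk)  # unique: transferred exactly once
            sent += 1
    return sent, skipped
```

Because only fingerprints cross the wire for duplicate chunks, the bandwidth saved scales directly with the duplication rate of the data being uploaded.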
The Impact Of Dedupe Everywhere
Taking a step back, the impact of dedupe everywhere is that it catches duplicates as early in the data lifecycle as possible, in the operating system or the originating application. Storage space is saved immediately at the first level of storage (SSD or primary). Because dedupe everywhere can be deployed across all storage, data migration between higher performing, expensive storage and lower performing, less costly storage requires no rehydration and maintains the storage efficiency. The approach is complete and integrated; as a result, both the effective cost of storage and the actual physical amount of storage are reduced at every layer. Simply put, money is saved on storage acquisition, space is saved on storage devices, and operating costs are dramatically reduced across the board. With that kind of compound savings, the CAPEX and OPEX impact makes the case for dedupe everywhere universal.
Deduplication has traditionally been a checkbox feature found in backup and a few primary storage implementations. With the financial and efficiency impact dedupe everywhere brings, however, it is about to become a core competency and requisite capability for storage and full service vendors. Dedupe everywhere will change the information storage landscape dramatically because it scales, performs, integrates and maintains data integrity for the duration of the data lifecycle (from application through multiple tiers of storage, including backup and archive, through to the cloud).
If you are a client of a full service vendor that can provide operating systems, applications, databases and multiple tiers of storage, the benefits of dedupe everywhere deployed by that vendor will be the most significant, and your savings and data efficiency will be the greatest. Full service vendors have long promoted the benefits of vendor continuity across an IT deployment; making dedupe everywhere one of their core competencies will yield correspondingly large cost savings for their customers.
In any business, data is growing at breakneck speed, and taming that growth is the real objective of every IT professional. If, for example, we could reduce that 35ZB figure to 8ZB, data storage would be far more affordable and within reach of many IT budgets. Dedupe everywhere enables a change of that magnitude.
The economics of IT are about to change, and dedupe everywhere is a technology that will make it feasible. Clients that deploy it will reap financial benefits in OPEX and CAPEX savings. Vendors that deploy dedupe everywhere first will offer their customers competitive cost/benefit advantages, allowing those vendors to leapfrog market positions and increase market share and revenue. As the old saying goes, "to the victor go the spoils!"
Wayne Salpietro is the director of product/social media marketing at Permabit (Cambridge, MA). www.permabit.com