Policy as an Ally in Navigating the Data Lake

by Jim McGann

Data lakes. Big data. Digital exhaust. The buzzwords are flying fast and furious, with all signs pointing to a continuing data explosion within corporate America’s firewalls. User data is growing faster than ever before. Some estimates state that 90 percent of all data was created in the last two years. How do organizations keep up with this rapid growth? Buying storage is easy, but it doesn’t solve the problem.

Big data is a hot topic these days on one side of the data center. Across the building, the legal and compliance department is dealing with issues such as SOX compliance and legal holds and preservation in support of litigation. Information governance is the emerging trend that encompasses regulatory compliance and litigation support. What happens when these buzzwords collide? How can corporate policies and the management of data liability survive in the world of big data? Will the massive data lakes overflowing at your organization drown out policies?

Organizations started getting serious about policies once regulatory compliance issues such as SOX appeared 10 years ago. Additional compliance requirements are still emerging, but SOX has been the primary driver over the past decade in the development of robust corporate policies. Many firms still struggle when it comes to policy and tend to over-save user data. However, a new focus on the preservation and hold of user data in support of litigation has forced policy development around managing legacy data.

Companies are quickly learning that policy is critical to safeguarding their organization from harm. Organizations that have always saved everything, such as email from every user, are finding that this glut of legacy content is coming back to haunt them. User data, even if it exists as part of legacy disaster recovery backups, has become fair game for the courts.

A Google search for “backup tapes” and “litigation” turns up case after case where judges have requested the production of email from these legacy archives. Tapes are one source of data lakes, and a significant one that is hidden. Consider tapes as the iceberg in the lake: lots of hidden danger underneath the calm waters. Other sources of data are obvious: email, desktops, PST files, and servers. Gaining insight into this content has been challenging. When you start talking about big data and data lakes, how do you protect your organization from the iceberg, apply policy, and manage content proactively so your firm is protected from the liability hidden below?

Sound policy can be applied across massive volumes of user content. This is where legal and records management come into play. Regulatory requirements, current litigation and other compliance issues dictate policies. It is common for companies to have policies where about 10 percent of user email is proactively archived for a set period of time and then released. Most policies require archiving or preserving far less than 50 percent of unique user data. Simply put, when you are saving everything, you are saving far too much and creating a long-term legal liability.

SOX compliance has been in effect for 10 years now. It is still confusing for many organizations, which end up saving everything rather than understanding the requirements and developing a sound policy. If, after 10 years, IT organizations have still not absorbed requirements such as SOX and worked with their legal counterparts to incorporate them into the information management agenda, then it is no surprise we are talking about big data.

Policies are designed not only to safeguard your organization; they have significant IT benefits as well. A smart policy can help purge large quantities of unnecessary user data. One of the best approaches to taming big data is to develop and apply policy. As data ages, the bulk of it becomes not only irrelevant but also a liability for the organization. If over time you could purge over 90 percent of user data rather than archiving it on legacy backup tapes, wouldn’t that be beneficial? Eliminating the offsite storage and management of those tapes could save companies millions annually. What if you ceased making backup tapes for long-term archiving? Simply apply policy to the tapes as they rotate off disaster recovery cycles, and only save what you need. You can then recycle the tapes and reuse them for future backups, and also stop filling the lake with more data.

Policy is one of the best tools for managing data, but it can be difficult to define and execute. One of the challenges is the relationship between legal and IT. These two sectors are typically worlds apart. Fortunately, it is becoming commonplace to have an IT/Legal liaison who understands enough about both disciplines to effectively manage information in a legal and defensible manner. Many consulting organizations have helped to bridge this gap. Consultants who monitor regulatory requirements and legal issues in order to help you define a sound, defensible information management policy are great facilitators in policy development. Either way, a solid relationship between legal and IT is crucial to surviving and managing risk in the big data world.

Developing a policy has become easier due to advanced technology. Direct indexing technology can scan all data sources, whether online, nearline or on offline tapes, in order to inventory and profile the content. Take, for example, a recent full backup of your user networks and email. This is a significant volume of content, and also a wealth of knowledge. Using indexing technology that understands backup formats from popular vendors such as Symantec, EMC and IBM, tape data can be directly indexed without the original backup software, generating a high-level profile of the content. Of course, indexing hundreds of terabytes may take a few days or weeks, but the index that is generated is beneficial downstream when you need to apply the policy.

Once data is indexed, the data mapping can begin. Using backup tapes is convenient as they provide a comprehensive snapshot of the network, and processing them is an offline activity that won’t require a new crawl of your systems. By using the metadata in the index, many canned reports can then be generated. Dates are key to policy, and the index can tell you the age of data, when it was last accessed or modified, and more. Generating a rich index will allow profiling of the content and provide insight into decisions that need to be made. This is the beginning of the policy definition. Using this information, IT can now work with legal to ask questions and map out a plan.
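As a rough illustration of the kind of age report described above, the following Python sketch totals indexed content by the number of whole years since it was last modified. The `IndexEntry` fields and the function names are hypothetical, chosen only for this example; an actual index from any vendor will have its own schema and reporting tools.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class IndexEntry:
    """One row of a hypothetical metadata index (not a vendor schema)."""
    owner: str
    path: str
    size_bytes: int
    last_modified: date


def age_in_years(entry: IndexEntry, today: date) -> int:
    """Whole years elapsed since the item was last modified."""
    years = today.year - entry.last_modified.year
    # Subtract one if this year's anniversary has not been reached yet.
    if (today.month, today.day) < (entry.last_modified.month,
                                   entry.last_modified.day):
        years -= 1
    return max(years, 0)


def profile_by_age(entries, today):
    """Return total bytes per age bucket, keyed by age in whole years."""
    totals = Counter()
    for entry in entries:
        totals[age_in_years(entry, today)] += entry.size_bytes
    return dict(totals)
```

A report like this gives IT and legal a shared starting point: it shows how much content falls into each age bracket before any retention period has been agreed.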

Once the hard work is done and the policy is developed, having an actionable index of the data simplifies the execution and management of data. Most policies revolve around user email. Having a set of user mailboxes and a retention period allows you to simply apply these parameters to the indexed data and extract and preserve the responsive results. The rest can be purged. In fact, you will find yourself purging far more data than you preserve, addressing the big data challenge and streamlining your storage environment.
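The mailbox-plus-retention-period approach described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any product: `IndexedMessage`, `apply_retention_policy`, and the example addresses are all hypothetical names, and the rule shown (preserve messages from named mailboxes that fall within the retention window, mark everything else as a purge candidate) is the simplest reading of the paragraph.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class IndexedMessage:
    """A hypothetical index record for one email message."""
    mailbox: str
    message_id: str
    sent: date


def apply_retention_policy(messages, mailboxes, retention_days, today):
    """Split messages into (preserve, purge_candidates).

    A message is preserved only if it belongs to one of the named
    mailboxes AND was sent within the retention window.
    """
    cutoff = today - timedelta(days=retention_days)
    preserve, purge_candidates = [], []
    for msg in messages:
        if msg.mailbox in mailboxes and msg.sent >= cutoff:
            preserve.append(msg)
        else:
            purge_candidates.append(msg)
    return preserve, purge_candidates


if __name__ == "__main__":
    index = [
        IndexedMessage("cfo@example.com", "m1", date(2023, 6, 1)),
        IndexedMessage("cfo@example.com", "m2", date(2020, 6, 1)),
        IndexedMessage("intern@example.com", "m3", date(2023, 6, 1)),
    ]
    keep, drop = apply_retention_policy(
        index, {"cfo@example.com"}, retention_days=365, today=date(2024, 1, 1)
    )
    print("preserve:", [m.message_id for m in keep])
    print("purge candidates:", [m.message_id for m in drop])
```

The second list is deliberately called purge candidates rather than a purge list. In practice, anything under an active legal hold would need to be excluded before deletion, and that check is omitted from this sketch.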

Additionally, you can start to apply these policies to legacy tapes and remediate what is no longer required, thus saving significant offsite storage expenses. Getting your data house in order, which involves applying policy and information governance principles, has many benefits. Technology makes this goal achievable and cost-effective, and in the world of data lakes and big data, the time has come to act. Information is valuable when managed effectively and according to policy. Without a sound policy to keep you afloat, you may find yourself drowning in the deep dark lake of data.

Jim McGann is the VP of Information Discovery for Index Engines (Holmdel, NJ). www.indexengines.com

 