We often see corporate customers who make fundamentally incorrect, and sometimes costly, assumptions about foreign language data that they may hold. They assume either that all of their data is in English, or at least that all data that could even remotely be considered relevant is in English. Rarely is either assumption correct. At the very least, it is probable that two work colleagues who share a mother tongue other than English will often turn to their native language in two-person communication simply for convenience. And, of course, communication with customers in a particular locale will usually be in the appropriate language. Further, sometimes communication in another language becomes a sheer matter of necessity, because the other party simply doesn’t speak English.
Documents and certainly email threads will frequently contain mixed language content. This occurs for many reasons including forwarding a quote from a customer in a language other than English, or because, for very technical terms, even bilingual people may not be conversant with the requisite English vocabulary.
When it is known that data will be coming in from a particular foreign country, the assumption is often made that this will add exactly one more language to handle. However, such a situation will generally bring several more languages into the collection. There are many reasons for this: many business operations span more than one country, especially in the case of smaller countries, and in places like Europe there are likely to be more multilingual professionals generating content in several languages at a time.
Many software programs can identify the language used in a document. When a document is written in just one language, this is actually quite easy to accomplish. For example, if the language in use is English, you won’t go two sentences without seeing “a” and “the.” Where it gets trickier is correctly identifying multiple languages in use within the same document, and how much content is enough to flag a document as containing content in that particular language. For example, most people would agree that just because an English language document contains the phrase “c’est la vie,” it should not be flagged as containing French language content. On the other hand, even a small amount of content that is written in a different language than the rest of the document can contain critical information on which an investigation might turn. In addition, it is important to be aware that some documents contain the same content, translated into multiple languages. But if the search or categorization software only uses the first few words to identify the language type of the document (as quite a few systems do), such documents will not be appropriately recognized.
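To make the distinction between whole-document and passage-level identification concrete, here is a minimal Python sketch. It is not how any particular product works: the stopword lists, the `min_hits` threshold, and the function names `guess_language` and `language_shares` are all assumptions chosen for illustration. It scores each sentence separately against a few common function words, so a short borrowed phrase is ignored while a full sentence in another language is still counted.

```python
import re
from collections import Counter

# Tiny illustrative stopword lists (an assumption for this sketch; production
# tools rely on much larger vocabularies or character n-gram models).
STOPWORDS = {
    "en": {"the", "a", "and", "of", "to", "is", "in", "that", "it", "for"},
    "fr": {"le", "la", "les", "et", "de", "des", "est", "un", "une", "que"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "zu", "mit"},
}

def guess_language(sentence, min_hits=3):
    """Return the best-matching language code, or None if evidence is too thin."""
    words = re.findall(r"[^\W\d_]+", sentence.lower())
    scores = {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None

def language_shares(text):
    """Share of classifiable sentences written in each detected language."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    counts = Counter(lang for lang in map(guess_language, sentences) if lang)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}

doc = (
    "The contract is ready for the client. "
    "Le prix est de dix euros et la livraison est gratuite. "
    "Please send it to the team in the morning."
)
print(language_shares(doc))
```

A tool that looked only at the first sentence of `doc` would label the whole document English and miss the French sentence in the middle. How large a share should trigger a flag is a policy decision for the reviewer, not something the sketch can settle.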
In the context of a document review, the first tasks to undertake are to determine how much foreign language content there is, which languages are present, whether or not they are likely to be relevant to the matter at hand, and whether the use of a particular language in a particular context is unusual or unexpected. Certain advanced analytical software programs can assist with all four of these areas. It is important to do this analysis as early in the process as possible to avoid the risk of a significant amount of unexpected foreign language content generating a bottleneck at the end of the process when the necessary foreign language reviewers may not be available.
In the related context of investigation, it can be critically important to understand how any foreign language data may impact the facts of the case. People have a natural tendency to revert to a more familiar mother tongue when discussing uncomfortable topics, which could be either business or personal in nature. And of course, the use of a language other than English can serve as a partial shield against the eyes of prying co-workers. While most such communication is innocent and understandable, some small portion of it can be quite sinister. Using an unexpected language makes a search for specific keyword terms likely to fail and increases the chances that someone looking at the data shrugs his or her shoulders and determines it is not worth exploring. In most cases, unanticipated foreign language content will at least insert a significant delay in understanding the data while a translator is found. (Translation software does exist, but is not yet of sufficiently reliable quality to depend on in critical situations. Further, it is even more likely to produce poor results when confronted with slang, and will fail completely in the face of spelling errors.)
From a compliance perspective, it is fairly well known that most compliance filters in U.S. companies are English language only. So whether someone wants to use one of the seven dirty words without getting caught, or wants to sneak under the radar exchanging sensitive information, use of another language provides an easy means of accomplishing the objective undetected. However, such behavior can often be trapped by looking for content that is written in a language that is unexpected given the identities of the actors involved. For example, if you always send emails to your grandmother in Greek because she speaks only Greek, that is a pattern of behavior that is consistent and predictable. But if you start selectively using Greek in communications with someone with whom you have routinely used English in the past, the probability of it being associated with a compliance violation is greatly increased.
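The idea of an "unexpected" language can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each message is assumed to have already been labeled with a language code, and the function name `unexpected_language_messages` and the thresholds `min_history` and `max_prior_share` are hypothetical. The sketch keeps a running tally of the languages each pair of correspondents has used and flags a message when its language is rare in that pair's prior history.

```python
from collections import Counter, defaultdict

def unexpected_language_messages(messages, min_history=5, max_prior_share=0.1):
    """Flag messages whose language is rare for that pair of correspondents.

    messages: sequence of (sender, recipient, language_code) tuples in
    chronological order. Returns the indexes of the flagged messages.
    """
    history = defaultdict(Counter)  # pair of people -> language counts so far
    flagged = []
    for index, (sender, recipient, lang) in enumerate(messages):
        seen = history[frozenset((sender, recipient))]
        total = sum(seen.values())
        # Only judge a pair once there is enough history to define "usual".
        if total >= min_history and seen[lang] / total <= max_prior_share:
            flagged.append(index)
        seen[lang] += 1
    return flagged

messages = (
    [("ann", "grandmother", "el")] * 6   # consistently Greek: predictable
    + [("ann", "bob", "en")] * 6         # consistently English
    + [("ann", "bob", "el")]             # sudden switch to Greek
)
print(unexpected_language_messages(messages))
```

The Greek emails to the grandmother are never flagged because Greek is the established pattern for that pair. Only the switch to Greek with a correspondent who has always been addressed in English stands out. A flag of this kind is a prompt for review, not evidence of a violation.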
One obvious, but often overlooked, peril in all of these scenarios is that keyword lists will fail unless the list is translated into every language that occurs in the data, which assumes both that the languages are known ahead of time and that translating the list is practical. Some vendors of topic clustering or automated topical categorization software assert that their technology is language independent. The basic idea underlying these technologies is that they find words that numerous documents have in common, and form clusters based on these commonalities. This will almost always fail to pull together conceptually related content in different languages, because too few words are shared among the documents in the different languages. Thus a potentially dangerous blind spot is created. It is therefore advisable for companies to monitor which languages their employees use to a significant degree, and on that basis determine which languages in addition to English are worth focusing on continuously.
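A small Python sketch shows why clustering on shared words breaks down across languages. It is an illustration, not a description of any vendor's algorithm: Jaccard overlap of word sets stands in here for whatever similarity measure a real product uses, and the three sample sentences are invented.

```python
import re

def vocabulary(text):
    """Set of distinct lowercase words in the text."""
    return set(re.findall(r"[^\W\d_]+", text.lower()))

def jaccard(text_a, text_b):
    """Shared words divided by all distinct words across both texts."""
    vocab_a, vocab_b = vocabulary(text_a), vocabulary(text_b)
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0

english_1 = "Please wire the payment to the offshore account before the audit."
english_2 = "The audit will review every payment sent to the offshore account."
# Same request as english_1, written in German.
german = "Bitte überweisen Sie die Zahlung vor der Prüfung auf das Auslandskonto."

print(jaccard(english_1, english_2))  # two English messages on the same topic
print(jaccard(english_1, german))     # same topic, different language
```

The two English messages share roughly half of their vocabulary, while the German message, which says the same thing as the first English one, shares no words with it at all. Any method that groups documents purely by common words would place it in a different cluster.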
To summarize, it is important for all organizations to be aware of the potential pitfalls of making unwarranted assumptions about the language content of their data. Rather, independent real-world risk assessment must be performed on a case-by-case basis for specific litigations and investigations, and on a general basis for compliance. A necessary part of this risk assessment is an examination of the multi-language capabilities, and shortcomings, of the different available technologies.
Elizabeth Charnock is the CEO of Cataphora.


