Posted by Drew Farris on March 27, 2014
The concept of ‘big data’ brings to mind complex numerical sequences, data points or data sets and an infinite amount of information that needs to be structured and refined. Today, I’d like to think about data in another way. Let’s talk about the text, the words, the phrases that businesses have at their fingertips.
Businesses today have access to internal and external sources of text. This can be content shared on social media, product reviews, or posts on experience-review sites such as Yelp and TripAdvisor. It can also be the ‘open data’ published by civil and commercial sources, as well as the growing body of publicly available text, from research publications to e-books, novels and memoirs.
So, how can we take advantage of this text?
First, it’s important to start with an understanding of what kind of text we have at our disposal. Text analysis can operate at many levels, from the individual word, phrase or clause up to individual documents, large collections of documents (known as corpora), or comparisons across collections. The crux of all of these analyses lies in a combination of heuristic and statistical approaches.
Basic rules govern our use of language, as I’m sure you remember from grade school, and the encoding of these rules enables computers to understand the basic structure of language and how the words we use are transformed into meaning. Statistical approaches, such as the analysis of word co-occurrence, can then be layered on top of these language rules to identify latent topic structure or interesting word combinations that further extract meaning from text.
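To make the co-occurrence idea concrete, here is a minimal sketch in Python. The function name, window size and sample sentences are invented for illustration; real pipelines would add tokenization, stop-word handling and statistical weighting on top of raw counts like these:

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often pairs of words appear within `window` tokens of each other."""
    pairs = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Pair the current word with each neighbor inside the window.
            for other in tokens[i + 1 : i + 1 + window]:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs

sentences = [
    "adverse drug reaction",
    "drug label information",
    "adverse reaction report",
]
counts = cooccurrence_counts(sentences)
```

Frequent pairs like ("adverse", "reaction") surface as candidate word combinations worth a closer look, which is the intuition behind more sophisticated latent-topic methods.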
Let me share an example of what this looks like in practice. In 2013, our Booz Allen team was asked to pilot a process that allowed us to link multiple sources of chemical compound information to pharmaceutical research that explored off-label usage of prescription and over-the-counter medications. Sounds like a tough task, right? Well, by using information gleaned from public sources (drug labels, intellectual property information, medical subject headings, research funding reports and adverse reaction information), we were able to present a prototype information system that allowed our clients to identify new areas of compound research and associated risks.
Our success was rooted in a three-pronged approach to text analysis that combines the following essential techniques:
- Search, also known as Information Retrieval, requires us to think through how we need to store and index text so that it can be found by a regular user issuing ordinary queries, e.g. a Google-style keyword search.
- When we classify our text, we’re able to look through the text or document to identify entities that will be important, based on what we know about our business. This is the ‘spam’ filter and it weeds out text that is irrelevant to your business needs.
- Lastly, clustering makes sense out of the unknown by identifying items that are similar. This ‘duplicate detection’ system speeds up the text analysis process by reducing time and resources needed to sift through the data.
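The three techniques above can be sketched together in a toy Python example. Everything here is illustrative and simplified by assumption: retrieval is reduced to term overlap, classification to a keyword filter, and clustering to greedy grouping by Jaccard similarity. Real systems would use an inverted index, a trained classifier and proper similarity metrics:

```python
def tokenize(text):
    """Split text into a set of lowercase terms."""
    return set(text.lower().split())

def search(query, docs):
    """Search: rank documents by how many query terms they share (toy retrieval)."""
    q = tokenize(query)
    scored = [(len(q & tokenize(d)), d) for d in docs]
    return [d for score, d in sorted(scored, reverse=True) if score > 0]

def classify(doc, relevant_terms):
    """Classify: keep only documents mentioning business-relevant terms (the 'spam' filter)."""
    return bool(tokenize(doc) & relevant_terms)

def jaccard(a, b):
    """Similarity between two documents as overlap of their term sets."""
    ta, tb = tokenize(a), tokenize(b)
    return len(ta & tb) / len(ta | tb)

def cluster_duplicates(docs, threshold=0.6):
    """Cluster: greedily group near-duplicate documents ('duplicate detection')."""
    clusters = []
    for doc in docs:
        for group in clusters:
            if jaccard(doc, group[0]) >= threshold:
                group.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters

docs = [
    "aspirin adverse reaction report",
    "aspirin adverse reaction report filed",
    "travel blog about restaurant reviews",
]
hits = search("adverse reaction", docs)    # retrieval keeps only the relevant docs
groups = cluster_duplicates(docs)          # the two near-duplicate reports land together
```

Even at this toy scale, the division of labor is visible: search finds candidate text, classification discards what the business doesn't need, and clustering collapses near-duplicates so analysts review each item only once.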
The sheer volume of text will only grow in the coming years. And while there is no silver bullet for text analytics, adopting our three-pronged approach as your base strategy and coupling it with metrics and measurement tools will ensure that you’re delivering value to your business.
Stay tuned for more data-driven posts in the coming days as Booz Allen continues to celebrate Analytics Week! And, follow the social conversation with @BoozAllen and via on-site reporting from @joshdsullivan, @angelazutavern, and @akherlopian.