Posted by Drew Farris on February 20, 2015
Recently, Mike Kim wrote an excellent post on overfitting, a common problem data scientists face when applying machine learning. Earlier, Paul Yacci and Aaron Sander both tackled the topics of feature selection and feature creation. These are all key problems data scientists encounter when building models that seek to describe and predict real-world outcomes based on observed data or hidden patterns in datasets.
This week I’ve identified a few other common problems data scientists face when working with data. These problems go beyond technology and machine learning and are broadly encountered regardless of the task at hand: interpreting the problem, sourcing the data, and describing the outcomes.
Interpreting the Problem
One of the most significant challenges a data scientist will encounter in examining a real-world problem is identifying the aspects of that problem that can be addressed using data science. A recent article about a University of Chicago Data Science for Social Good Fellow project describes how data science was used for health care reform in Illinois. Facing low enrollment in Affordable Care coverage among Illinois' uninsured residents, data scientists cast the problem in terms of data, developing mechanisms to predict which individuals are least likely to be insured. Using this model, the group developed targeted efforts to get people signed up. Translating a problem in this way requires both an understanding of the capabilities, tools, and techniques behind data science and the ability to get out from behind the keyboard and ask questions to inform the data process. Interpreting the problem is as much an art as it is a science.
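To make the translation concrete, the Illinois example boils down to a supervised classification problem: given features about an individual, estimate the probability they are uninsured, then rank individuals for outreach. The sketch below illustrates that framing only — the features, labels, and data are entirely synthetic stand-ins, not the actual model or data from the project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 200

# Hypothetical, synthetic features standing in for real survey variables.
income = rng.normal(0, 1, n)
age = rng.normal(0, 1, n)
X = np.column_stack([income, age])

# Synthetic labeling rule: lower income -> more likely uninsured (label 1).
y = (income + 0.2 * rng.normal(0, 1, n) < 0).astype(int)

clf = LogisticRegression().fit(X, y)

# Rank individuals by predicted probability of being uninsured,
# mirroring the idea of targeting outreach at those least likely to enroll.
probs = clf.predict_proba(X)[:, 1]
top_targets = np.argsort(probs)[::-1][:10]
```

The model itself is almost beside the point; the hard part the post describes is deciding that "low enrollment" can be recast as a ranking problem like this in the first place.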
Sourcing the Data
A number of struggling data science projects suffer from a lack of data. The scientists working the problem have excellent ideas of what can be done, what tools and algorithms can be used, what features will be the most important, and even how to validate their assumptions and outcomes. What's missing is the data required to support all this. Availability issues range from not having a sufficient volume or variety of data to having extremely inconsistent or "dirty" data, where the effort to clean, filter, or repair is so monumental that it increases the risk of the effort beyond what is tolerable to the organization. As with the Illinois insurance case, policy hurdles can prevent the analysis of raw, individual-level data. Related problems arise from using representative data, a dataset that serves as a stand-in while the team waits for the real data to be obtained. More often than not, representative data, especially synthetic or generated data, does not accurately capture the nuances found in real data. Finally, scientists often underestimate the amount of work required to acquire, clean, and understand data. The time or money budget is exhausted and a great dataset is in hand, but little or no real analytic activity has occurred.
Exploring the Outcomes
Making predictions isn't always easy. Even given a dataset with expected outcomes, a supervised learning project becomes an exercise in feature exploration, algorithm choice, and rigorous model selection. Given sufficient time and processing power, you can crank through hyper-parameters and develop models of reasonable predictive strength requiring little interpretation. Lacking a pre-existing labeled set to use as a gold standard, there are options such as recruiting subject matter experts (friends and relatives or complete strangers; see Mechanical Turk) to manually tag or code datasets. When going beyond the realm of supervised learning problems, data scientists must fully leverage their ability to interpret what the data and algorithms tell them, in order to shape, understand, represent, and convey the underlying story locked within their data. Statistical analysis and unsupervised machine learning approaches, such as clustering and topic modeling, power these outcomes, but a practicing data scientist finds that most cases require interpretation and explanation to convey subtle meaning. As with the interpretation of the problem, this kind of narration of results is also an art.
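The unsupervised case is where interpretation matters most: an algorithm like k-means will happily hand back cluster labels, but the labels mean nothing until a human names them. A minimal sketch on synthetic data (the two groups here are artificially well-separated; real data rarely cooperates this cleanly):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two synthetic, well-separated groups of points standing in for
# real observations (e.g. two distinct populations in a dataset).
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# Fit k-means; the algorithm returns numeric labels and centroids,
# but says nothing about what the clusters *are*.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_
centers = model.cluster_centers_
```

Inspecting `centers` and the members of each cluster is the part no library does for you — deciding what the groupings mean, and whether they mean anything at all, is the narration the post describes.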
These are just some of the problems all data scientists have faced at one time or another. While machines can help us with a major portion of the work associated with data science, a significant portion depends on the human ability to theorize, interpret, analyze, and associate in the problem exploration or solution space. However, even the best approaches can suffer if the data just isn't available, or is too expensive or risky to obtain. Keep these challenges in mind as you embark upon your next data adventure.