By this point, data science seems ubiquitous and wildly successful. Recommendation engines govern the content we see on Netflix and YouTube; ranking algorithms govern news streams and our Facebook feed; survival analyses estimate queue times; and neural networks play a huge role in everything from self-driving cars, to reverse image searches, to estimating our credit scores. The vast field of data science seems magical, capable of anything. Indeed, many of my friends working as software developers will pencil in “data science magic goes here” when designing solutions. But those of us who are data scientists know we have our limitations. In fact, we’re more like chefs than magicians: we can only make something as good as the ingredients we have to work with. Delicious data science cuisine requires quality data, and lots of it.
Machine learning methods have recently come under fire for providing profoundly socially unacceptable results. Some endeavors have resulted in the likes of jail-breaking robots (Promobot IR77), and some image recognition algorithms can fail to appropriately detect African Americans. The algorithms’ engineers frequently take the blame, but the culprit behind such catastrophes is often the data. Machine and deep learning algorithms have progressed to beat humans at chess, Go (the board game), DotA 2 (the video game), image recognition, and many other ostensibly human tasks. Algorithms are phenomenal at learning to do exactly what they’re taught to do; the problem arises when the data they’re given is poorly curated. Microsoft’s chatbot Tay is one example: it learned from what people on the internet tweeted to it, and we all know the internet can be ugly. Google’s offensive image classification is another: most of the images available for that application were of white men. My own recent challenges with deep learning solutions have been much more innocent, but they suffered from the same basic problem. Specifically, the data I used had many conflicting labels, so my algorithms had no idea how to appropriately classify certain data. We had to gather more data to resolve the conflicting examples.
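To make the conflicting-labels problem concrete, here is a minimal sketch of how such conflicts can be surfaced before training. It assumes each example is a hashable input paired with a label; the function name `find_conflicting_labels` and the toy support-ticket data are hypothetical and only for illustration, not the data or tooling from my own project.

```python
from collections import defaultdict


def find_conflicting_labels(examples):
    """Return every input that appears with more than one distinct label.

    `examples` is an iterable of (features, label) pairs, where `features`
    must be hashable (a string or a tuple, for instance).
    """
    labels_by_input = defaultdict(set)
    for features, label in examples:
        labels_by_input[features].add(label)
    # Keep only the inputs whose labels disagree with each other.
    return {
        features: sorted(labels)
        for features, labels in labels_by_input.items()
        if len(labels) > 1
    }


# Hypothetical toy data: the same text was labeled two different ways.
examples = [
    ("refund not received", "billing"),
    ("refund not received", "shipping"),
    ("package arrived damaged", "shipping"),
    ("package arrived damaged", "shipping"),
]

conflicts = find_conflicting_labels(examples)
print(conflicts)  # {'refund not received': ['billing', 'shipping']}
```

A check like this only catches exact duplicates with different labels; near-duplicates that conflict would need a similarity measure, which is a harder problem.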
Machine learning algorithms are like impressionable toddlers: they can learn a lot, and rapidly, but they’ll only be able to repeat what they’ve been told. So, there is a tremendous burden to curate quality data, which is easier said than done, especially when a lot of data is required.
Without first principles (basic facts that must be true about the problem being solved and can be formulated mathematically) to govern a model, we must compensate for our ignorance with a more powerful, general model. More general models require more model parameters. More model parameters require more data. So, the more complicated the problem, the more data we need. Unfortunately, it can frequently be challenging to find sufficient quality, labeled data with which to train a model. Need to categorize text or images in a unique or proprietary domain space? Be prepared to generate many thousands of examples. Unsupervised techniques (algorithms that don’t need examples of correct output in order to learn) do exist for some problem types, but these typically demand even more data to form meaningful conclusions.
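As a rough illustration of how quickly parameters pile up, the sketch below counts the weights and biases in a fully connected network. The layer sizes are assumptions chosen for the example (a 28×28 image flattened to 784 inputs, 10 output classes), and parameter count is only a crude proxy for how much data a model needs.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases in a fully connected network.

    `layer_sizes` lists the width of each layer, input first, output last.
    Each pair of adjacent layers contributes n_in * n_out weights
    plus n_out biases.
    """
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# A simple linear classifier: inputs connected directly to outputs.
linear_params = dense_parameter_count([784, 10])

# A more general model: two hidden layers of 512 units each.
deeper_params = dense_parameter_count([784, 512, 512, 10])

print(linear_params)  # 7850
print(deeper_params)  # 669706
```

In this example, adding two hidden layers multiplies the number of values to be learned by roughly 85, and every one of them has to be pinned down by the training data.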
There is no silver bullet, but there are options to increase the quality and quantity of your data:
- Build a data set using a crowdsourcing service such as Amazon Mechanical Turk if the problem is highly domain-specific.
- Cluster data in a natural way and then collectively label the clusters.
- Use archives of data sets that have been appropriately curated or collected from neutral sources on neutral topics, such as the UCI Machine Learning Repository (conveniently, there is a robust list of data archives on Wikipedia).
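The cluster-then-label option in the list above can be sketched in a few lines. This is a minimal, standard-library-only illustration under stated assumptions: the toy 2-D points, the helper names, and the cluster names are all hypothetical, and the deterministic farthest-point seeding is a simplification chosen so the example is reproducible (real projects would more likely use a library implementation of k-means).

```python
import math


def farthest_point_seeds(points, k):
    """Pick k starting centroids: the first point, then repeatedly the point
    farthest from every centroid chosen so far (deterministic, for illustration)."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds


def kmeans(points, k, iterations=20):
    """Plain k-means: alternate between assigning points and moving centroids."""
    centroids = farthest_point_seeds(points, k)
    assignments = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignments


def label_by_cluster(assignments, cluster_labels):
    """Give every point the label that was chosen for its whole cluster."""
    return [cluster_labels[a] for a in assignments]


# Hypothetical unlabeled data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
assignments = kmeans(points, k=2)

# One human decision per cluster instead of one per example.
cluster_labels = {0: "group A", 1: "group B"}
labels = label_by_cluster(assignments, cluster_labels)
print(list(zip(points, labels)))
```

The savings come from the last step: a person inspects a handful of points from each cluster and names the cluster once. The catch is that this only works when the clusters line up with the categories you care about, which needs to be spot-checked.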
Going forward, however, I believe this data problem merits considerable attention, and begs for the creation of meta-algorithms. One option is an algorithm that can help generate data from other similar, but different, data sets (which is the goal of domain adaptation). Another is to cluster, adapt, and map between multiple disparate data types and data sets in an unsupervised fashion.
This second option touches a bit on the philosophy behind AI and what would qualify as AGI (artificial general intelligence). The basic idea is that an unsupervised algorithm cannot divine any inherent meaning in how certain data clusters together. However, if there are many different data sets that naturally cluster in some way, then there could be a way to map some clusters in one set to clusters in others, resulting in a kind of emergent learning. It might be possible to supervise the algorithm by simply labeling the mappings; or it could potentially be unsupervised.
I believe exciting times for data science lie ahead. Resolving this data quality and quantity conundrum poses some interesting challenges, and, as those challenges are resolved, I believe we’ll see even more ‘magical’ advances in the field.