Full Stack Data Science in Two Weeks: A Painful Retrospective
In the wake of the announcement that the city of Washington, DC would receive its first Michelin Guide, the local branch of General Assembly issued a challenge to the Data Science community. Who among them could predict, using Data Science methodologies, which restaurants in the District would receive Michelin stars? Not only that but how many stars (on a scale from 1 to 3) each predicted restaurant would be awarded? No sample data was provided, no guidance as to what methodologies would be most useful, just a simple challenge and quick outline of the submission guidelines.
The deadline was set at midnight the day before the actual winners would be announced, two weeks from the date the challenge was released on their website. Being somewhat naïve about the level of effort this might entail, I jumped at the chance to work on this, looping in as many colleagues as were interested.
Due to the ambiguity of the challenge and the dearth of time that came with it, several quick decisions were made regarding what tools to use for the project. Sidestepping the existential question of whether to use Python or R (if you google “Python vs. R” you get +6,000 articles with this question in the title), I chose Python right out of the gate. This was mostly for reasons of personal familiarity, as well as the flexibility that a language like Python affords.
Although Python is a general-use programming language, it has been so widely adopted in the last two decades that several mature, data-specific packages are readily available and easy to use for anyone with a dataset and an interesting question.
A key reason that many Data Scientists might be inclined to choose Python over R is the existence of scikit-learn, an actively maintained and developed Python package that offers a standardized library of Machine Learning models. It’s a veritable smorgasbord of advanced statistical modeling tools.
Flexibility is, ultimately, what sealed the deal. Python would allow me to pull data from multiple sources, store it in a NoSQL database (MongoDB), query that same database using a Python wrapper, and run my data through a Machine Learning module in order to generate predictions. And all of this without having to switch languages or generate excessive amounts of new code. Or so I thought initially, but I’m getting ahead of myself.
Brainstorms Are an Absolute Necessity
The day the competition was announced, several colleagues expressed interest in the project but did not have any extra time to contribute. However, they all volunteered to make themselves available as sounding boards via a dedicated Slack channel throughout the course of the project. This proved to be an invaluable aspect of putting a cohesive project together.
The group was especially valuable during the initial planning phase of the project. The biggest challenge in doing Data Science effectively tends to be asking the right question at an early enough stage in the project.
So What Did We Come Up With?
We determined that the best way to generate a robust enough data set in such a small amount of time was to pull our primary restaurant data from the Yelp Search API, which was easily available once you requested a developer key. However, quickly recognizing the likelihood that crowd-sourced restaurant ratings on Yelp would not be significantly correlated with the characteristically sophisticated tastes of the French restaurant guide, we decided to supplement this data with a flag indicating whether or not each restaurant had received a positive review from a local food critic. This we then added to the database as a binary feature (i.e. positive review = “yes” or “no”).
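As a rough illustration of the critic-review flag described above, here is a minimal sketch in plain Python. The record fields, the `normalize` helper, and the sample values are all hypothetical; they are not taken from the project's code or from the actual Yelp response format.

```python
def normalize(name):
    """Lowercase and collapse whitespace so names from different sources match."""
    return " ".join(name.lower().split())


def add_critic_flag(restaurants, critic_picks):
    """Return copies of the restaurant records with a 0/1 critic_positive field."""
    picks = {normalize(name) for name in critic_picks}
    return [
        {**r, "critic_positive": int(normalize(r["name"]) in picks)}
        for r in restaurants
    ]


# Hypothetical records; the names and numbers are made up for illustration.
restaurants = [
    {"name": "Example Bistro", "rating": 4.5, "review_count": 812},
    {"name": "Sample  Diner", "rating": 4.0, "review_count": 95},
]
critic_picks = ["example bistro"]

flagged = add_critic_flag(restaurants, critic_picks)
for row in flagged:
    print(row["name"], row["critic_positive"])
```

Matching on a normalized name is the simplest possible join and will miss restaurants whose names are spelled differently across sources, so a real data set would likely need something more forgiving.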
To help inform our predictions on DC restaurants’ chances, we queried the API for data on restaurants from other cities that had their own Michelin guides and Michelin-starred restaurants. We then supplemented this “training dataset” with the actual Michelin-star values for the restaurants that had received the award. This separate data would ultimately be used to train our Machine Learning model, and the trained model would then predict the values for DC’s restaurants.
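The split between labeled training data and the DC restaurants to be predicted might look something like the sketch below. The function name, the field names, and the choice to label restaurants outside the guide with 0 stars are assumptions made for illustration, not a description of what the project actually did.

```python
def build_datasets(rows, stars_by_name, target_city, features):
    """Split rows into labeled training data and an unlabeled target-city set."""
    X_train, y_train, X_target, target_names = [], [], [], []
    for row in rows:
        vector = [row[f] for f in features]
        if row["city"] == target_city:
            # Target-city restaurants have no label yet; they are what we predict.
            X_target.append(vector)
            target_names.append(row["name"])
        else:
            # Assumption: restaurants absent from the guide are labeled 0 stars.
            X_train.append(vector)
            y_train.append(stars_by_name.get(row["name"], 0))
    return X_train, y_train, X_target, target_names


# Hypothetical rows; names and values are made up for illustration.
rows = [
    {"name": "Alpha", "city": "New York", "rating": 4.5, "review_count": 900, "critic_positive": 1},
    {"name": "Beta", "city": "Chicago", "rating": 3.5, "review_count": 120, "critic_positive": 0},
    {"name": "Gamma", "city": "Washington", "rating": 4.5, "review_count": 300, "critic_positive": 1},
]
stars_by_name = {"Alpha": 2}
features = ["rating", "review_count", "critic_positive"]

X_train, y_train, X_target, target_names = build_datasets(
    rows, stars_by_name, "Washington", features
)
print(len(X_train), "training rows;", len(X_target), "rows to predict")
```

Lists in this shape could then be handed to a scikit-learn classifier through its `fit(X_train, y_train)` and `predict(X_target)` methods.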
What Could Be Improved? The Naïve Folly of Time Management on a One-Man Project
Given that the window for assembling a data set was relatively small, a concerted effort to avoid a full-stack solution may have been preferable. In fact, the person who won had a relatively simple approach to the problem.
In the pressure-cooker of a short turnaround, the early and little-considered decisions that were made at the beginning of the project began to compound the amount of time it took to correctly implement each step of the solution. Additionally, the decision to keep the data in its JSON format caused several data-formatting issues when it came time to load the data into our model.
With two or three additional team members, all of these issues likely would not have been a problem and the work could have been better distributed. However, a one-man team attempting to cover every aspect of the work was a recipe for sleepless nights and buggy code.
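To make the JSON problem concrete: nested documents have to be turned into flat rows before most modeling tools will accept them. The sketch below shows one way to do that with the standard library alone. The sample document and its field names are hypothetical, and this is not the code the project used.

```python
import json


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        elif isinstance(value, list):
            # Lists are reduced to their length, which is lossy but keeps the row flat.
            flat[f"{full_key}{sep}count"] = len(value)
        else:
            flat[full_key] = value
    return flat


# Hypothetical nested document; the structure is made up for illustration.
raw = json.loads("""
{"name": "Example Bistro",
 "rating": 4.5,
 "location": {"city": "Washington",
              "coordinate": {"latitude": 38.9, "longitude": -77.0}},
 "categories": [["French", "french"], ["Bars", "bars"]]}
""")

row = flatten(raw)
print(sorted(row))
```

Deciding up front how to treat lists and missing keys is the kind of small choice that is cheap early on and expensive to revisit the night before a deadline.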
Ultimately, the project only produced a single prediction for one restaurant in the DC area. It was, to put it lightly, a colossal flop. A phrase oft-repeated in the Data Science community is that you’ll spend upwards of 85% of your time on the process of actually making your data usable, a process typically referred to as “data wrangling”.
As a young data scientist, I hubristically ignored the wisdom of those who knew better. This, coupled with the struggles of coding on a deadline and failing to anticipate the amount of time each step of our solution would take, led to an incomplete project.
If you would like to see the raw code for the project, as well as the brief overview that I threw together as the project was getting started, feel free to check out my GitHub repository. Pull requests welcome! (Within reason.)