
Data Engineering at the Speed of Software Development

Rapid changes in data management technology and network infrastructure enabled the field of data science as we know it. Traditional methodologies for handling data projects are too slow, driving the need for a new way to manage data teams. The DataOps Manifesto was created in response to these challenges, borrowing from the Agile Manifesto, the DevOps movement, and Lean Manufacturing. By taking the wins of other domains that have faced similar situations and tailoring them to data and analytics, the Manifesto provides a baseline for discussing how to implement a DataOps culture. This talk covers the principles of the DataOps Manifesto, the challenges that led to it, and how and where it is already being applied in the industry.

The same improvements in technology that helped bring about a revolution in software development, especially cloud computing, also brought about massive changes in the tools available for analyzing data. These changes have made modern machine learning and other data science projects possible without an enterprise platform, have led to new tooling for data visualization and traditional analysis, and ultimately led to a push for a data-driven culture at many organizations. Collectively, these changes have been referred to as “Big Data,” and they gave rise to the new discipline of data engineering, which supplies these various uses with data in a reasonable timeframe.

Unfortunately, while the available technology has changed, the methodologies around the people and processes involved in managing and scaling this development have not changed significantly. Customers see traditional software teams deliver valuable features within two weeks of starting new development and deploy updates constantly without disruption. They are understandably frustrated with data teams that need downtime, require a lot of upfront work before any product is delivered, and face a host of other issues that seem trivial on the surface. The reasons given for these challenges are reasonable to anyone who has ever worked with data. Data projects face different constraints than standard development: traditional software takes a set of questions with controlled inputs and gives answers to one set of customers. In contrast, data projects analyze answers coming from numerous sources outside the project’s control and try to determine what questions can be answered with that data (which usually brings up more questions).

With that in mind, the authors of the DataOps Manifesto, as part of a larger group of innovators kicking off the DataOps movement, realized that something needed to be done for the customers. The idea behind DataOps was that this space is in a situation similar to the one traditional software was in during the mid to late 2000s, so they looked to that field for help. They looked back to the Agile Software movement, added lessons learned from DevOps culture, and then modified the result for data. As Lean Manufacturing inspired both areas, the authors also adapted something from that movement that is very familiar to any data practitioner: continuously sampling and analyzing data. The one source of data that any project can control is the data the project itself creates (e.g., how fast jobs run, how many failures, etc.), which makes that data easier to analyze and use, and very valuable in the hands of people who work with data all day.
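As a rough illustration of analyzing the data a project creates about itself, the sketch below summarizes job run times and failure counts. It is not taken from the talk or the Manifesto; the `JobRun` record, the `summarize` function, and the sample job names are all hypothetical and stand in for whatever a team's scheduler or logs actually record.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JobRun:
    """One execution of a data job (hypothetical record layout)."""
    job: str
    seconds: float
    succeeded: bool


def summarize(runs):
    """Return run count, failure rate, and mean duration for each job."""
    by_job = {}
    for run in runs:
        by_job.setdefault(run.job, []).append(run)

    summary = {}
    for job, job_runs in by_job.items():
        failures = sum(1 for r in job_runs if not r.succeeded)
        summary[job] = {
            "runs": len(job_runs),
            "failure_rate": failures / len(job_runs),
            "mean_seconds": mean(r.seconds for r in job_runs),
        }
    return summary


# Made-up sample data, for illustration only.
runs = [
    JobRun("load_orders", 120.0, True),
    JobRun("load_orders", 180.0, False),
    JobRun("build_report", 60.0, True),
]
report = summarize(runs)
for job, stats in report.items():
    print(job, stats)
```

Tracking figures like these over time is one simple way a team might spot slowing jobs or rising failure rates and feed that back into continuous improvement.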

Watch this talk as Eric Schiller takes a deep dive into this history and then presents the DataOps Manifesto itself as a baseline to discuss DataOps practices. The Manifesto itself should look familiar to anyone who has read the Agile Manifesto or has knowledge of modern engineering and DevOps culture. It hinges on the idea that all artifacts of a data project are increasingly based in code, and even the ones that are not should still be treated as if they were code intended for production. With this, modern engineering and management practices can be applied, metrics captured, and a culture of continuous improvement added to the project. Then, Eric explains how to implement these practices, mentioning some that are likely already in place at many organizations.
