Toggle Menu

Insights > Modern Analytics > Open Source ETL Tools

Open Source ETL Tools

Imagine that you have been charged with getting data from multiple sources – a flat file, a query from your data warehouse – and you need to bring it together so that it can be used to feed a report or a dashboard. What are your options? You’re not a developer who writes scripts and […]

By

June 10, 2016

Imagine that you have been charged with getting data from multiple sources – a flat file, a query from your data warehouse – and you need to bring it together so that it can be used to feed a report or a dashboard. What are your options? You’re not a developer who writes scripts and you’re tired of copying-pasting into Excel spreadsheets. You also don’t have any budget for a tool. As the Data Integration (DI) space moves to a model that encourages self-service integration and analysis services, this scenario is becoming more common, so we decided to take a look at some of the free open source tools available for this task.

As anyone who has combined data between sources to enable analysis will tell you, DI (or ETL) is more than just combining data. DI involves designing the data structures in a manner that supports the analysis being performed, moving the data between the systems (the actual Extract, Transform, Load processes), monitoring and improving the data quality, and ensuring the security of the data.  Here are just a few of the items that you’ll need to consider:

Data structure design
  • How does the data need to be stored and presented based upon it’s expected use (e.g. reports, dashboards, extracts, ad hoc queries)?
Data movement
  • Designing the optimal path from raw source data to cleansed and enhanced target state
  • Leveraging staging data structures where appropriate to enable processing checkpoints and mid-process restartability
  • Adhering to tool best practices to create consistent, intuitive solutions that can be transitioned across team members easily
Data quality
  • Identifying optimal data quality checkpoints throughout the data processing cycle
  • Enabling relevant data quality alerts and reports
Data security & privacy
  • Identifying data security requirements to meet federal, state, local and organizational standards
  • Leveraging data encryption and obfuscation best practices tailored to meet the need, not one size fits all

We took two of the most popular Open Source DI tools, Pentaho Kettle and Talend, to determine how they could handle some of these tasks.  Both tools are free with an active user community to assist with any questions, an adequate pool of existing users for long term support, out of the box connections to most file formats and databases, and intuitive User Interfaces in guiding you through your transformations.  These transformations not only format your data as needed but also allow you to validate the data quality and handle errors gracefully.

Let's Talk

You Might Also Like

Modern Analytics

5 Ways Data Analysts Can Transform Your Organization

Leading organizations realize how critical it is to prioritize data-informed decision-making. When quality data work...

Modern Analytics

The Importance of Effective Data Literacy

What is data literacy? Data literacy isn’t all that different from literacy in any other...

Modern Analytics

30-Day Data Analysis

What is Data Analysis? Data analysis is the process of turning raw data into actionable...