Toggle Menu

Insights / Modern Analytics / Open Source ETL Tools

June 10, 2016

Open Source ETL Tools

2 mins read

Imagine that you have been charged with getting data from multiple sources – a flat file, a query from your data warehouse – and you need to bring it together so that it can be used to feed a report or a dashboard. What are your options? You’re not a developer who writes scripts and you’re tired of copying-pasting into Excel spreadsheets. You also don’t have any budget for a tool. As the Data Integration (DI) space moves to a model that encourages self-service integration and analysis services, this scenario is becoming more common, so we decided to take a look at some of the free open source tools available for this task.

As anyone who has combined data between sources to enable analysis will tell you, DI (or ETL) is more than just combining data. DI involves designing the data structures in a manner that supports the analysis being performed, moving the data between the systems (the actual Extract, Transform, Load processes), monitoring and improving the data quality, and ensuring the security of the data.  Here are just a few of the items that you’ll need to consider:

Data structure design
  • How does the data need to be stored and presented based upon it’s expected use (e.g. reports, dashboards, extracts, ad hoc queries)?
Data movement
  • Designing the optimal path from raw source data to cleansed and enhanced target state
  • Leveraging staging data structures where appropriate to enable processing checkpoints and mid-process restartability
  • Adhering to tool best practices to create consistent, intuitive solutions that can be transitioned across team members easily
Data quality
  • Identifying optimal data quality checkpoints throughout the data processing cycle
  • Enabling relevant data quality alerts and reports
Data security & privacy
  • Identifying data security requirements to meet federal, state, local and organizational standards
  • Leveraging data encryption and obfuscation best practices tailored to meet the need, not one size fits all

We took two of the most popular Open Source DI tools, Pentaho Kettle and Talend, to determine how they could handle some of these tasks.  Both tools are free with an active user community to assist with any questions, an adequate pool of existing users for long term support, out of the box connections to most file formats and databases, and intuitive User Interfaces in guiding you through your transformations.  These transformations not only format your data as needed but also allow you to validate the data quality and handle errors gracefully.

You Might Also Like

Resources

Simplifying Tech Complexities and Cultivating Tech Talent with Dustin Gaspard

Technical Program Manager, Dustin Gaspard, join host Javier Guerra, of The TechHuman Experience to discuss the transformative...

Resources

How Federal Agencies Can Deliver Better Digital Experiences Using UX and Human-Centered Design

Excella UX/UI Xpert, Thelma Van, join host John Gilroy of Federal Tech Podcast to discuss...