Imagine that you have been charged with getting data from multiple sources – a flat file, a query from your data warehouse – and you need to bring it together so that it can be used to feed a report or a dashboard. What are your options? You’re not a developer who writes scripts and you’re tired of copying-pasting into Excel spreadsheets. You also don’t have any budget for a tool. As the Data Integration (DI) space moves to a model that encourages self-service integration and analysis services, this scenario is becoming more common, so we decided to take a look at some of the free open source tools available for this task.
As anyone who has combined data between sources to enable analysis will tell you, DI (or ETL) is more than just combining data. DI involves designing the data structures in a manner that supports the analysis being performed, moving the data between the systems (the actual Extract, Transform, Load processes), monitoring and improving the data quality, and ensuring the security of the data. Here are just a few of the items that you’ll need to consider:
Data structure design
- How does the data need to be stored and presented based upon it’s expected use (e.g. reports, dashboards, extracts, ad hoc queries)?
- Designing the optimal path from raw source data to cleansed and enhanced target state
- Leveraging staging data structures where appropriate to enable processing checkpoints and mid-process restartability
- Adhering to tool best practices to create consistent, intuitive solutions that can be transitioned across team members easily
- Identifying optimal data quality checkpoints throughout the data processing cycle
- Enabling relevant data quality alerts and reports
Data security & privacy
- Identifying data security requirements to meet federal, state, local and organizational standards
- Leveraging data encryption and obfuscation best practices tailored to meet the need, not one size fits all
We took two of the most popular Open Source DI tools, Pentaho Kettle and Talend, to determine how they could handle some of these tasks. Both tools are free with an active user community to assist with any questions, an adequate pool of existing users for long term support, out of the box connections to most file formats and databases, and intuitive User Interfaces in guiding you through your transformations. These transformations not only format your data as needed but also allow you to validate the data quality and handle errors gracefully.