February 06, 2018

Engineering Legacy Data is Hard. Here’s What to Do About It.

4 mins read

Written by

Eric Schiller

Lead Consultant, Data Engineering

Have you ever tried to get two software systems to share information with each other? How about combining data from multiple systems into a shared repository like a data warehouse? What about converting application data to another database system as part of an upgrade? It’s a challenge even with modern systems. If any of these sources are legacy systems, the challenges can increase dramatically. Here are a few common pitfalls to data integration with legacy environments and how to mitigate them.

Drastically Different Data Stores and More of Them

Challenge
The first challenge in data integration is extraction from a variety of disparate data sources. This is also true with legacy systems, with complications increasing over time the system has been around.

The original source may be a mainframe or other unsupported format, limiting accessibility to the data.

Opportunity
Consolidate the source data into a single landing area, and start analysis as soon as possible. This standard procedure is even more crucial with legacy systems.

The landing area should be flexible in terms of storage and querying since there will be missing/malformed data. With relational databases, this means denormalization and most data types stored as strings. The flexibility of some NoSQL databases may be helpful here as the schema will likely change with analysis.
Beginning analysis now is crucial because it allows maximum time for unforeseen complications. While most of the data likely lives in a monolith, there are often many different levels of data integrity and documentation.
Data inputs outside the core system are more difficult to integrate, because they are often stored as files yet represent the complex edges of the system. Examples include one-to-one relationships that should be one-to-many, or valid entities that don’t fit the allowed model. Make sure that the data is as standardized as possible before loading.

Legacy Software Contains Years of Technical Debt and Payback

Challenge
Technical debt is a normal part of software engineering projects, which can pay off the backlog over time. Data from those applications, however, must consider the whole backlog at once.

There is often a lack of documentation for the database structure and expected values.
Database structures change with bug fixes and features, and new data is often not backfilled consistently.
If the system is old enough and has meaningful historical data (e.g. financial, insurance systems), previous upgrades often cause data loss. Data unnecessary at the time may have been scrapped due to time constraints.

Opportunity
Take a holistic approach to understanding the application.

Get embedded with the legacy team to understand the unwritten history of the application.
Group the data structures by processes that create them or other similar logical separation.
Analyze and design for the most complex cases first.
Identify appropriate defaults for backfilling missing data that is necessary in the target.
Determine what data would cause problems if entered today, along with potential errors (e.g. zero or negative numbers, blank strings in common fields). Automatically log these errors and warnings for further analysis.

Computers Do Exactly What They Are Told

Challenge
As the saying goes “Garbage in, garbage out”. Whatever was allowed into the system at the time is the only record to work with. This parallels technical debt, in that it is often not worth spending resources to correct every erroneous entry in the system.

The older and more complex a system gets, the more entropy affects the data. This includes issues like improper typing (i.e. strings instead of numbers), creative workarounds (e.g. an entity entered as a person), or blank entries in a now required field. Systems which did not enforce strict normalization may have orphaned and non-unique records.
A special case of bad data is that of potential security violations. These can take the form of SQL Injection, unencoded special characters, and reserved keywords in the target system.
Legacy systems often did not have strict data governance. This led to not only inconsistent fixes for the above, but also multiple locations for the same data and inconsistent spellings of names. This is especially true for data stored outside the core systems (i.e. spreadsheets).

Opportunity
Much like working with technical debt, this needs to be done holistically with help from the customer.

Identify the subject matter experts (SMEs) early, and involve them in the process. This ensures customer confidence in data quality, and leads to quick decisions on data fixes.
Profile data early in the process to identify anomalies, and pass on to the SMEs for guidance.
When possible, fix bad data in the source systems. Otherwise, automate fixes as part of the loading process.
Use the logging technique described above to show progress in quality over time.