I recently attended the 2017 Data Architecture Summit hosted by DATAVERSITY in Chicago. The audience, as you would imagine, was mostly data architects from a variety of industries coming together to review (and debate) architecture trends and tools. Architects are rarely afraid to speak their minds, especially if you ask for an opinion. After attending the sessions and speaking with many attendees and sponsors, here are the key themes I identified:
Takeaway 1: Data Lake Zones start to make sense of the Swamp
Long gone are the days of a ‘single source of truth’ for analytics via the enterprise data warehouse (EDW). Adding a data lake is now regarded as a common strategy to meet more advanced analytics needs, and the lake can also serve as a repository that feeds other downstream analytics solutions. For users, it can be confusing to know where to go and which tools to use. Enter the idea of data lake zones.
The concept of presenting a single Lake with data zones targeted for different user audiences is compelling as users seek to understand the new enterprise data landscape. A presentation at the Summit by Robert Nocera from NEOS LLC illustrated this best in my opinion.
Robert proposed five zones in the Lake – Raw, Structured, Curated, Consumer, and Analytics. As data moves through the first four zones it becomes more structured and more transformations are applied.
- Raw is the initial ingestion layer where data enters the lake. It is also where the metadata collection (tagging) occurs and it acts as the permanent data storage layer.
- Structured facilitates querying of raw data by savvy analysts and data SMEs by putting it into a queryable table format.
- Curated is used for processing and is positioned to be akin to the base data warehouse; no users have access to this layer.
- Consumer contains aggregated tables and views that support regular analysis needs. Dashboards, reports and self-service queries to common business questions would happen here.
- The final Analytics Zone is a workplace for advanced users (like data scientists) to manually pull data from any zone for further analysis.
While a single lake with zones simplifies user messaging, under the covers each zone can use the same or different data platforms and tools, which gives teams flexibility in development and maintenance. The zone concept and its terminology (who liked “data lake island”?) are gaining traction, and we are introducing them at Excella as well.
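To make the zone layout easier to picture, here is a minimal Python sketch of the five zones as a simple lookup table. It is my own illustration, not something shown at the Summit: the `Zone` class, the `zones_for` helper, and the audience labels (`analyst`, `data_sme`, `business_user`, `data_scientist`) are all hypothetical names, and leaving the Raw zone without a listed audience is an assumption, since the talk only described what happens there.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Zone:
    name: str
    purpose: str
    audiences: Tuple[str, ...]  # hypothetical audience labels, not from the talk


# Order follows the description above: data becomes more structured
# as it moves through the first four zones.
ZONES: List[Zone] = [
    # Assumption: no direct user audience is listed for Raw.
    Zone("Raw", "initial ingestion, metadata tagging, permanent storage", ()),
    Zone("Structured", "raw data in a queryable table format", ("analyst", "data_sme")),
    # Curated is a processing layer; no users have access.
    Zone("Curated", "processing layer akin to a base data warehouse", ()),
    Zone("Consumer", "aggregated tables and views for dashboards and reports", ("business_user",)),
    Zone("Analytics", "workspace for advanced analysis", ("data_scientist",)),
]


def zones_for(audience: str) -> List[str]:
    """Return the names of the zones that list the given audience label."""
    return [zone.name for zone in ZONES if audience in zone.audiences]


if __name__ == "__main__":
    for label in ("analyst", "business_user", "data_scientist"):
        print(label, "->", zones_for(label))
```

A table like this could serve as the starting point for user-facing documentation ("which zone do I go to?"), while the platform behind each zone stays an implementation detail.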
Takeaway 2: The technology marketplace for analytics remains fragmented
Almost everyone I spoke to agreed on one thing: the vast market of tools and platforms is difficult to navigate. There are simply too many choices in the current marketplace and little agreement on where to invest money, time, and training efforts.
Best advice? The analytics tech market is expected to consolidate over the next few years, so either hold tight before making a purchase decision or choose tools that you can deploy and switch more easily (in case you want/need to keep pace with change).
If you are not doing so already, plan for a Cloud-first strategy for analytics workloads to provide the elastic scalability and cost efficiency that modern processing and users demand. Over time, migrate to tools that meet your functional needs and are easily deployed on popular Cloud platforms like AWS and Azure.
Takeaway 3: Data engineers take center stage
In a keynote led by Forrester analyst Michele Goetz, the continuing challenge of hiring “data science unicorns” caught my attention. We all know that finding candidates with sophisticated statistics skills AND advanced computer science skills is difficult and expensive (some command salaries of up to $500K).
Forrester recommended creating a collaborative work environment that pairs data engineers who have computer science and automation skills with data scientists who have strong math and statistics skills, to tackle 21st-century data projects. This reinforces the (successful) approach we’ve taken at Excella; see our recent blog post “No More Unicorns” for more details on how we do this.
The demand for data engineers goes further. They also have a vital role in data governance efforts: building data interfaces that connect data consumers in an organization to data governance policies and master (reference) data. They are the glue behind enterprise data and analytics enablement. Forrester quoted a stat from Indeed showing that 13% of all data-related job postings were for data engineers, versus 1% for data scientists. In-demand skills are Python, Java, C++, and working with APIs to build data pipelines.
Looking for more insights? Contact us at Excella.com.