An Industry Insight from Arjuna Chala, Sr Director of Special Projects for HPCC Systems®

Data Lakes: A Serious Consideration for Serious Data Science

On a recent phone call with a company, their chief data scientist told me his organization had set up a data lake. After further discussion, I realized they had actually set up a data warehouse using big data technologies. A data warehouse implementation using big data technologies is not a data lake.

That said, this is a very common misunderstanding. Big data technologies process and analyze data to identify specific actions that improve products or services. Data lakes and data warehouses, by contrast, concern how the data used in that analysis is stored.

Before we dive into a discussion about data lakes, let’s first answer the question, “What’s wrong with using a data warehouse?” The answer is “Nothing, depending on the use case.” A data warehouse solution is sufficient if your data:
  • Is already well structured
  • Won’t be used in advanced analytics
  • Can migrate to a single system relatively easily
  • Supports system-to-system integration
  • Is used for generating static reports
To put it succinctly, a data warehouse is a static environment. Making changes to data stored in a data warehouse is a cumbersome process, and access to raw data (data that hasn’t yet been formatted into a specific schema) is non-existent. A data warehouse does provide some advantages for Online Analytical Processing (OLAP) applications, but these advantages are far outweighed by the benefits of a data lake.

When presenting the concept of data lakes to confused students, I recommend they think of a data warehouse as a book, and a data lake as a library. With a book (data warehouse), someone has already determined what content is contained in that book, while a library (data lake) allows you to choose whatever content you want. More specifically, several key differences distinguish a data lake from a data warehouse, including:

  • Schema on read
  • Unlimited storage
  • The ability to access both raw and processed data
  • The ability to link data from many individual clusters
The last bullet is particularly important as it allows for the integration of different data sets. For a real-world example of a data lake in action, consider Google. The Google search engine backbone is composed of many individual clusters, distributed across the world. But when you execute a search, Google transparently executes the search across all clusters and collates the results before providing them to the user.
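The fan-out-and-collate pattern in the search example above can be sketched in a few lines of Python. This is only an illustration of the general idea, not a description of how Google or any particular platform implements it. The cluster names, the in-memory records, and the functions search_cluster and federated_search are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "clusters"; a real data lake would query remote systems.
CLUSTERS = {
    "us-east": [
        {"id": 1, "text": "data lake basics"},
        {"id": 2, "text": "warehouse reports"},
    ],
    "eu-west": [{"id": 3, "text": "data lake governance"}],
    "ap-south": [{"id": 4, "text": "machine learning on a data lake"}],
}


def search_cluster(name, term):
    """Return matching records from one cluster, tagged with their origin."""
    return [
        dict(record, cluster=name)
        for record in CLUSTERS[name]
        if term in record["text"]
    ]


def federated_search(term):
    """Fan the query out to every cluster, then collate the results by id."""
    with ThreadPoolExecutor() as pool:
        partial_results = pool.map(lambda name: search_cluster(name, term), CLUSTERS)
        merged = [record for part in partial_results for record in part]
    return sorted(merged, key=lambda record: record["id"])


results = federated_search("data lake")
for row in results:
    print(row["cluster"], row["id"], row["text"])
```

The caller sees one collated result list and never needs to know which cluster held which record, which is the transparency the search example describes.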

Accordingly, when referring to a data warehouse, we should envision a centralized platform for basic importing, exporting, and preprocessing of data gathered from a collection of linked systems and using one data schema. When referring to a data lake, picture a distributed but integrated data platform that supports schema-less data (both structured and unstructured) and performs queries on data in real time by leveraging metadata to quickly find, transform, and load data between systems.
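To make the idea of schema-less storage with schema on read more concrete, here is a minimal Python sketch. It assumes raw records are kept as JSON text exactly as they arrived, and that each reader supplies its own schema at query time. The sample events, PURCHASE_SCHEMA, and read_with_schema are hypothetical names used only for illustration.

```python
import json

# Raw events stored exactly as they arrived; fields vary from record to record.
RAW_EVENTS = [
    '{"user": "ann", "amount": "19.99", "ts": "2024-01-05"}',
    '{"user": "bob", "amount": 5, "channel": "mobile"}',
    '{"user": "cy", "note": "no purchase"}',
]

# Hypothetical read-time schema: field name -> (converter, default value).
PURCHASE_SCHEMA = {
    "user": (str, ""),
    "amount": (float, 0.0),
}


def read_with_schema(raw_lines, schema):
    """Parse each raw record and project it onto the schema chosen by the reader."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else default
            for field, (convert, default) in schema.items()
        }


purchases = list(read_with_schema(RAW_EVENTS, PURCHASE_SCHEMA))
total = sum(row["amount"] for row in purchases)
print(purchases)
print(round(total, 2))
```

The raw events are never rewritten. A different application could read the same records with a different schema (for example, one that keeps the channel field), which is not possible once data has been forced into a single warehouse schema at load time.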

However, there is an issue that companies interested in adopting a data lake model need to be aware of: the use of more than one big data platform with a data lake. A company may use different big data platforms to analyze data used in different applications. For example, a company may use Spark to conduct data analysis for machine learning applications, but use HPCC Systems to ingest, profile, clean, enhance, and build attributes for data. Both platforms would be sourcing data from the same data lake, but they use different programming languages (Spark code is typically written in R or Python, while HPCC Systems uses ECL). This causes problems when the data created by one platform’s analysis needs to inform the analysis of the other platform. While it’s certainly possible to insert a step in the data flow where data generated by Spark is translated into ECL for consumption by HPCC Systems (and vice versa), the additional time required to do so eliminates most of the performance gains that come from storing data in a data lake. Fortunately, big data analytics providers are developing solutions that address the problem. To address the use case described above, HPCC Systems developed a connecting API that allows both Spark and HPCC Systems to leverage the results of each other’s analytics directly, eliminating the performance bottleneck caused by translation.

In light of these differences, it becomes clear that a well-designed and well-implemented data lake provides faster access to data than a data warehouse, even if the data warehouse is using the same big data analytics engine as the data lake. This faster access to data is critical for more advanced big data applications requiring near real-time performance, such as machine learning. As data sets grow and more companies look to implement increasingly advanced big data applications, expect adoption of the data lake model to increase.
