HPCC/Hadoop data integration, or how the mighty Thor rides the yellow elephant

You probably thought that the HPCC Systems platform and Hadoop were two technologies at opposite ends of a spectrum, and that choosing one would make using the other unrealistic. If that is what you believed, think again (and keep reading).

The HPCC Systems platform has just released its Hadoop data integration connector. The HPCC/HDFS integration connector provides a way to seamlessly access data stored in HDFS (the Hadoop Distributed File System) from within the Thor component of HPCC. And, as an added bonus, it also allows you to write to HDFS from within Thor.
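To give you a flavor of what this looks like from ECL, here is a minimal sketch of reading a delimited file from HDFS into Thor and writing a filtered result back. It assumes PipeIn/PipeOut style macros along the lines of the H2H documentation; the macro names, parameters, host address and file paths shown here are illustrative and may differ from the released connector.

    // A minimal sketch; HDFSConnector.PipeIn/PipeOut, their parameters and the
    // connector IMPORT (omitted here) are assumptions -- check the H2H docs.

    // Record layout matching the delimited file stored in HDFS
    PersonRec := RECORD
        STRING20  firstname;
        STRING20  lastname;
        UNSIGNED4 age;
    END;

    // Read the HDFS file into a Thor dataset (hypothetical macro call)
    HDFSConnector.PipeIn(people, '/user/hadoop/people.csv', PersonRec, CSV, '192.168.1.10', 54310);

    // From here on it is plain ECL: HDFS is just another kind of data source
    adults := people(age >= 18);

    // Write the filtered result back to HDFS (hypothetical macro call)
    HDFSConnector.PipeOut(adults, '/user/hadoop/adults.csv', PersonRec, CSV, '192.168.1.10', 54310, 'hadoopuser');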

As you can see, this new feature opens up several opportunities to leverage HPCC components from within your existing Hadoop cluster. One such application would be to connect the Roxie real-time distributed data analytics and delivery system to data processed in your Hadoop cluster, providing real-time access to complex data queries and analytics. It would also allow you to leverage the distributed machine learning and linear algebra libraries that the HPCC platform offers through its ECL-ML (ECL Machine Learning) module. And if you needed a highly efficient and reliable data workflow processing system, you could take advantage of the HPCC Systems platform and ECL, or even combine it with Pentaho Kettle/Spoon to add a graphical interface for ETL and data integration.

So what does it take to use the HPCC/HDFS connector (or H2H, as we like to call it)? Not much! The H2H connector has been packaged to include all the necessary components, which need to be deployed to every HPCC node. HPCC can coexist with Hadoop on the same nodes, or run on a separate set of nodes (the latter is normally recommended for performance reasons).

How did we do it? We leveraged the ability of ECL to pipe data in and out of a running workunit, through the ECL PIPE command, and we created some clever ECL Macros (did I mention before that ECL Macros are awesome?) to provide the appropriate data and function mappings from within an ECL program. Thanks to this, using H2H is transparent to the ECL software developer, and HDFS becomes just another type of data repository.
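For the curious, the snippet below is a rough sketch of the generic ECL PIPE mechanism that the connector's macros build on: an input pipe turns the standard output of an external command into a Thor dataset, and an output pipe streams a recordset to a command's standard input. The 'hdfsreader' and 'hdfswriter' commands are hypothetical placeholders, not the actual H2H binaries.

    // Rough sketch of the generic ECL PIPE mechanism the H2H macros wrap;
    // 'hdfsreader' and 'hdfswriter' are placeholder commands, not the real binaries.

    SampleRec := RECORD
        STRING20 firstname;
        STRING20 lastname;
    END;

    // Input pipe: each Thor node runs the command and parses its standard
    // output (CSV-formatted here) into records of the dataset
    sample := PIPE('hdfsreader -file /user/hadoop/sample.csv', SampleRec, CSV);

    // Output pipe: the recordset is streamed to the command's standard input
    OUTPUT(sample, , PIPE('hdfswriter -file /user/hadoop/sample_copy.csv', CSV));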

What are the gotchas? Well, HDFS is not as efficient as the distributed filesystem used by HPCC, so reads and writes through the connector will not be any faster than HDFS allows (but they won’t be noticeably slower either). Another caveat is that the transparent access to compressed data that HPCC normally provides is not available for data read from HDFS (although decompression can easily be performed in a subsequent step, after the data is read).

I hope you are as excited as we are about this HPCC/Hadoop data integration initiative. Please take a look at the H2H section of our HPCC Systems portal for more information: https://hpccsystems.com/download/free-modules/hadoop-data-integration, and don’t hesitate to send us your feedback. The HPCC/HDFS connector is still in beta, but we expect to have a 1.0 release very soon.

Flavio Villanustre