Spark-HPCC Systems Integration

HPCC Systems-Spark Integration consists of a plug-in for the HPCC Systems platform and a Java library that gives a Spark cluster read and write access to data stored on an HPCC Systems cluster.

The HPCC Systems Spark plug-in integrates Spark into your HPCC Systems platform. Once installed and configured, the Sparkthor component manages the Integrated Spark cluster. It dynamically configures, starts, and stops your Integrated Spark cluster when you start or stop your HPCC Systems platform.

HPCC Systems-Spark Integration

⚡ Note: This project references log4j, which has been reported to include security vulnerabilities in versions prior to v2.15.0.

The Spark-HPCC Systems Distributed Spark Connector uses the standard remote file read facility to read data from, and write data to, either sequential or indexed HPCC datasets.

The data on an HPCC cluster is partitioned horizontally, with data on each cluster node. Once configured, the HPCC data is available for reading and writing in parallel by the Spark cluster.
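The idea of horizontally partitioned data being read in parallel can be illustrated with a minimal, self-contained Java sketch. It does not use the Spark-HPCC connector API. The class name PartitionedReadSketch, its methods, and the round-robin placement of records are assumptions made purely for illustration; a real HPCC cluster decides how records are distributed across nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: models file parts as in-memory lists.
public class PartitionedReadSketch {

    // Split records into one part per node (round-robin is an assumption,
    // chosen only to keep the example short).
    static List<List<String>> partition(List<String> records, int nodes) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < nodes; i++) {
            parts.add(new ArrayList<>());
        }
        for (int i = 0; i < records.size(); i++) {
            parts.get(i % nodes).add(records.get(i));
        }
        return parts;
    }

    // Process each part independently and combine the per-part results,
    // which is the pattern a parallel reader follows.
    static int countInParallel(List<List<String>> parts) {
        return parts.parallelStream().mapToInt(List::size).sum();
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        List<List<String>> parts = partition(records, 3);
        System.out.println("parts: " + parts.size());
        System.out.println("total records: " + countInParallel(parts));
    }
}
```

Each part is handled without reference to the others, so the work scales with the number of parts rather than with the size of the whole dataset.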

The HPCC Systems Spark Connector requires Spark 2.10 or 2.11 and the org.hpccsystems.wsclient library available from the Maven Repository.

  • Find the source code and examples in the spark-hpccsystems repository
  • Get the latest JAR or javadocs files from the Maven Repository
  • Example Maven dependency information (be sure to update <version> to the version you are using):
    <dependency>
        <groupId>org.hpccsystems</groupId>
        <artifactId>spark-hpcc</artifactId>
        <version>7.12.0</version>
    </dependency>

Known Limitations:

HPCC-21511: A Spark-HPCC write operation (HpccFileWriter.saveToHPCC) on a cluster whose nodes are configured with multiple NICs can fail because the wrong IP addresses are reported as the dataset source location. This is resolved in HPCC + Spark versions 7.2.0 and later.