Two integrated clusters, a declarative programming language, and a standards-based web services platform form the basis of this comprehensive, massively scalable big data solution.
Powerful Open Source Big Data Analytics Platform
Born from the deep data analysis history of LexisNexis® Risk Solutions, HPCC Systems® helps businesses of all sizes find the answers they need by making data easier to process, analyze, and understand.
HPCC Systems incorporates a software architecture implemented on commodity shared-nothing computing clusters to provide high-performance, data-parallel processing and delivery for applications utilizing Big Data. The HPCC Systems platform includes system configurations to support both parallel batch data processing (Thor) and high-performance data delivery applications using indexed data files (ROXIE). It also includes a high-level, implicitly parallel, data-centric declarative programming language for parallel data processing, called Enterprise Control Language (ECL).
SALT: Scalable Automated Linking Technology
SALT addresses the most common data integration tasks and can be used to generate complete, ready-to-execute applications for record linking and clustering, data profiling, data hygiene, data source consistency monitoring, data file delta changes, and data ingest. It is also easy to configure and execute, with as little as 40–50 lines of code.
Linking and clustering (MDM)
Data profiling, cleansing, standardization and normalization
Sophisticated specificity and relatives based linking and clustering
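To make the idea of specificity-based linking concrete, here is a minimal Python sketch. It is not SALT code and does not reproduce SALT's actual algorithm. It assumes one common formalization, in which the specificity of a field value is the negative base-2 logarithm of its frequency, so that agreement on a rare value counts for more than agreement on a common one. The names `field_specificity` and `match_score` and the sample records are hypothetical.

```python
import math
from collections import Counter

def field_specificity(values):
    """Map each value to -log2(frequency), so rare values score higher (assumed formula)."""
    counts = Counter(values)
    total = len(values)
    return {value: -math.log2(count / total) for value, count in counts.items()}

def match_score(rec_a, rec_b, specificities):
    """Sum the specificity of every field on which the two records agree."""
    score = 0.0
    for field, spec in specificities.items():
        if rec_a[field] == rec_b[field]:
            score += spec[rec_a[field]]
    return score

# Hypothetical sample data: "Smith" is common, "Zywicki" is rare.
records = [
    {"last": "Smith", "city": "Dayton"},
    {"last": "Smith", "city": "Dayton"},
    {"last": "Smith", "city": "Dayton"},
    {"last": "Smith", "city": "Atlanta"},
    {"last": "Smith", "city": "Atlanta"},
    {"last": "Smith", "city": "Atlanta"},
    {"last": "Zywicki", "city": "Dayton"},
    {"last": "Zywicki", "city": "Atlanta"},
]

specificities = {
    field: field_specificity([r[field] for r in records])
    for field in ("last", "city")
}

# Both pairs agree only on last name, but the rare name is stronger evidence.
common_pair = match_score(records[0], records[3], specificities)
rare_pair = match_score(records[6], records[7], specificities)
```

In this sketch, two records sharing the rare surname score higher than two records sharing the common one, which is the intuition behind weighting matches by specificity.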
Thor: The Data Refinery Cluster for Big Data Ingest and Transformation
The HPCC Systems Data Refinery Cluster — known as “Thor”, after the hammer-wielding god of thunder — is responsible for ingesting, cleaning, transforming, linking, and indexing vast amounts of data. It functions as a distributed file system with parallel processing power spread across the nodes. A Thor cluster can scale from a single node to thousands of nodes.
A Thor cluster:
Provides a massively parallel job execution environment for programs coded in ECL.
Utilizes a master-slave topology in which slaves provide localized data storage and processing power, while the master monitors and coordinates the activities of the slave nodes and communicates job status information.
Provides a record-oriented distributed file system (DFS). A big data input file containing fixed or variable length records in standard or custom formats is partitioned across the cluster’s DFS, with each node getting approximately the same amount of record data and with no splitting of individual records.
Is fault resilient, based on configurable replication of file parts within the cluster. Utilizes middleware components that provide name services and other services in support of the distributed job execution environment.
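The record-oriented partitioning described above can be illustrated with a short Python sketch. It is an illustration only, not the Thor implementation: it assumes records are byte strings, and the function name `partition_records` and the sample data are hypothetical. The point it shows is that each node receives roughly the same number of bytes while every record stays whole.

```python
def partition_records(records, num_nodes):
    """Assign whole records to nodes in contiguous runs, aiming for equal bytes per node."""
    total_bytes = sum(len(rec) for rec in records)
    target = total_bytes / num_nodes
    parts = [[] for _ in range(num_nodes)]
    node, used = 0, 0
    for rec in records:
        # Advance to the next node once the running total reaches this node's share.
        if node < num_nodes - 1 and used >= target * (node + 1):
            node += 1
        parts[node].append(rec)  # the record is never split across nodes
        used += len(rec)
    return parts

# Hypothetical variable-length records (120 bytes in total).
records = [b"a" * 10, b"b" * 30, b"c" * 20, b"d" * 25, b"e" * 15, b"f" * 20]
parts = partition_records(records, 3)
sizes = [sum(len(rec) for rec in part) for part in parts]
```

With three nodes the parts hold 40, 45, and 35 bytes: close to the 40-byte target, but not exact, because boundaries fall only between records.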
ROXIE: The Data Delivery Engine Supporting Up to Thousands of Requests Per Second
ROXIE — for Rapid Online XML Inquiry Engine — is the front-end cluster providing high-performance online query processing and data warehouse capabilities.
How the engine runs:
Data and indexes to support queries are pre-built on Thor and then deployed to ROXIE.
ROXIE uses an index-based distributed file system, based on a custom B+ tree structure, to enable fast, efficient data retrieval.
Queries may include joins and other complex transformations, and payloads can contain structured or unstructured data.
Each ROXIE node runs a Server process and an Agent process. The Server process handles incoming query requests from users, allocates the processing of the queries to the appropriate Agents across the ROXIE cluster, collates the results, and returns the payload to the client.
A ROXIE cluster is fault resilient, based on data replication within the cluster.
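The Server and Agent roles described above can be sketched in a few lines of Python. This is a simplified illustration, not ROXIE code: the `Agent` and `Server` classes are hypothetical, a binary search over a sorted key list stands in for the custom B+ tree, and the server asks every agent rather than routing only to the appropriate ones.

```python
import bisect

class Agent:
    """Holds one slice of a sorted index as parallel lists of keys and payloads."""

    def __init__(self, entries):
        ordered = sorted(entries, key=lambda entry: entry[0])
        self.keys = [key for key, _ in ordered]
        self.payloads = [payload for _, payload in ordered]

    def lookup(self, key):
        # Binary search stands in for descending a B+ tree to the matching leaf range.
        lo = bisect.bisect_left(self.keys, key)
        hi = bisect.bisect_right(self.keys, key)
        return self.payloads[lo:hi]

class Server:
    """Receives a query, fans it out to the agents, and collates their results."""

    def __init__(self, agents):
        self.agents = agents

    def query(self, key):
        results = []
        for agent in self.agents:  # simplification: ask every agent
            results.extend(agent.lookup(key))
        return results

# Hypothetical index slices held by two agents.
server = Server([
    Agent([("smith", "rec1"), ("jones", "rec2")]),
    Agent([("smith", "rec3"), ("lee", "rec4")]),
])
matches = server.query("smith")
```

Here `matches` collects the payloads for "smith" from both agents, mirroring the collate-and-return step the Server process performs.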
Interlok: Seamless Data Integration
Provides for seamless data integration with hundreds of data stores, real-time data ingestion and flexible stream processing.
KEL: Knowledge Engineering Language
KEL, a graph-centric descriptive programming language, represents knowledge as an entity. An entity is an intellectual representation of raw data. It is the computational equivalent of a real-life person, business, object, event, or activity.
Features of KEL:
Declarative — Describe what things are, rather than how to execute
High level — Vertices and edges are first class citizens
Simple — A single model to describe graphs and queries
With the help of KEL, most challenges can be addressed in three steps, each of which takes only a few lines of code:
Entity ideology — Define input data as knowledge
Define USE strategy — Map files to Entity and Associations
Define query strategy — Output entity or graph via queries, shells or packages
Attributes describe the behavior of an entity. Associations between entities describe various perspectives of the data. The relationships among entities and associations outline the data hierarchy and allow those perspectives to be presented in graph form.
A social graph example
In most cases, KEL is capable of producing networks-of-interest in under 75 lines of code with optimal performance.
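The three steps (define entities, map raw data to entities and associations, then query) can be mirrored in plain Python for readers unfamiliar with graph modeling. This is not KEL syntax and says nothing about how KEL executes. The `Person` entity, the `knows` association, the sample rows, and the `network_of_interest` function are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Step 1: define the entity (a vertex type).
@dataclass(frozen=True)
class Person:
    uid: int
    name: str

# Step 2: map raw rows to entities and to an undirected "knows" association.
raw_people = [(1, "Ann"), (2, "Bob"), (3, "Cy"), (4, "Dee")]
raw_links = [(1, 2), (2, 3), (3, 4)]

people = {uid: Person(uid, name) for uid, name in raw_people}
knows = defaultdict(set)
for a, b in raw_links:
    knows[a].add(b)
    knows[b].add(a)

# Step 3: query the graph for everyone within a given number of hops.
def network_of_interest(start, hops):
    seen = {start}
    frontier = {start}
    for _ in range(hops):
        frontier = {n for uid in frontier for n in knows[uid]} - seen
        seen |= frontier
    return sorted(people[uid].name for uid in seen - {start})

within_two_hops = network_of_interest(1, 2)
```

Starting from Ann, the two-hop network contains Bob and Cy but not Dee, who is three hops away.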
ECL: The Powerful, Efficient Programming Language Built for Big Data
Enterprise Control Language (ECL) is a key factor in the flexibility and capabilities of the HPCC Systems platform. This declarative programming language was designed specifically to enable the processing of massive data sets as efficiently as possible.
Advantages of ECL:
Accomplishes big data processing and analysis objectives with a minimum of coding.
The sophisticated ECL compiler is cluster-aware and automatically optimizes code for parallel processing. Programmers needn’t be concerned about whether their code will be deployed on one node or hundreds of nodes.
An included graphical IDE for ECL simplifies development, testing, and debugging.
ECL code compiles into optimized C++ and can be easily extended using C++ libraries.
ECL can be used both for complex data processing on a Thor cluster and for query and report processing on a ROXIE cluster.
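The claim that programmers need not care how many nodes run their code can be illustrated with a small Python sketch. It is not ECL and does not show what the ECL compiler does. It only demonstrates the principle that a job written without any reference to the cluster size can give the same answer on one partition or many. The functions `count_by_state` and `run_on_cluster` and the sample rows are hypothetical.

```python
from collections import Counter

def count_by_state(rows):
    """The job: count adults per state. Nothing here mentions nodes or partitions."""
    return Counter(row["state"] for row in rows if row["age"] >= 18)

def run_on_cluster(job, rows, num_nodes):
    """Stand-in for a runtime: split the rows, run the job per part, merge the counts."""
    parts = [rows[i::num_nodes] for i in range(num_nodes)]
    merged = Counter()
    for part in parts:
        merged += job(part)
    return merged

# Hypothetical input rows.
rows = [
    {"state": "OH", "age": 34},
    {"state": "FL", "age": 17},
    {"state": "OH", "age": 52},
    {"state": "FL", "age": 41},
    {"state": "GA", "age": 12},
]

on_one_node = run_on_cluster(count_by_state, rows, 1)
on_four_nodes = run_on_cluster(count_by_state, rows, 4)
```

Both runs produce the same counts. The sketch relies on per-partition counts being safe to add together; it makes no claim about how the ECL compiler plans operations that are harder to merge.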
Introduction to HPCC Systems
Learn more about the types of big data problems the HPCC Systems platform can solve, and how it solves them.