- Processing clusters using commodity hardware and high-speed networking
- Linux operating system
- Supports SOAP, XML, HTTP, REST and JSON
- Enterprise Services Platform (ESP) enables end-user access to ROXIE queries via common web services protocols
- Thor and ROXIE are both fault-resilient, based on replication within the cluster.
- The systems store file part replicas on multiple nodes to protect against disk or node failures.
- Both are designed for resiliency and continued availability in the event of hardware failures.
- Administrative tools for environment configuration, job monitoring, system performance management, distributed file system management, and more.
- Extension modules for web log analytics, natural language parsing, machine learning, data encryption, and more.
The declarative, modular, extensible Enterprise Control Language (ECL) is designed specifically for processing big data.
- Highly efficient — accomplish big data tasks with far less code.
- Flexible — can be used both for complex data processing on a Thor cluster and for query and report processing on a ROXIE cluster.
- Graphical IDE for ECL simplifies development, testing and debugging.
- ECL compiler is cluster-aware and automatically optimizes code for parallel processing.
- ECL code compiles into optimized C++ and can be easily extended using C++ libraries.
The two main systems — Thor and ROXIE — work together to provide an end-to-end solution for big data processing and analytics. Data and indexes to support queries are pre-built on Thor and then deployed to ROXIE.
Thor, the Data Refinery, is the extraction, transformation and loading engine.
- Thor uses a master-slave topology in which slaves provide localized data storage and processing power, while the master monitors and coordinates the activities of the slave nodes and communicates job status information.
- Middleware components provide name services and other services in support of the distributed job execution environment.
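The division of labor described above, where workers process the data they hold locally and a master coordinates them and reports job status, can be illustrated with a small sketch. This is a minimal illustration in Python, not Thor's implementation: the names `worker_task` and `run_job`, the use of threads in place of cluster nodes, and the summing workload are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_task(worker_id, local_records):
    """Hypothetical worker: processes only the records it stores locally."""
    return {
        "worker": worker_id,
        "records": len(local_records),
        "partial_sum": sum(local_records),
    }


def run_job(partitions):
    """Hypothetical master: dispatches one task per worker and merges the reports."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        futures = [
            pool.submit(worker_task, worker_id, part)
            for worker_id, part in enumerate(partitions)
        ]
        # Collect results in worker order so the status report is deterministic.
        reports = [future.result() for future in futures]
    return {
        "workers_done": len(reports),
        "records_processed": sum(r["records"] for r in reports),
        "result": sum(r["partial_sum"] for r in reports),
    }


# Each inner list stands in for the data held by one worker node.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
status = run_job(partitions)
print(status)
```

The master never touches individual records here. It only hands out work and aggregates per-worker reports, which is the coordination role the text describes.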
ROXIE, the Data Delivery Engine, provides high-performance online processing and data warehouse capabilities.
- Each ROXIE node runs a Server process and an Agent process. The Server process handles incoming query requests from users, allocates the processing of the queries to the appropriate Agents across the ROXIE cluster, collates the results, and returns the payload to the client.
- Queries may include joins and other complex transformations, and payloads can contain structured or unstructured data.
- Thor DFS is record-oriented and optimized for big data ETL (extract-transform-load). A big data input file containing fixed or variable length records in standard or custom formats is partitioned across the cluster’s DFS, with each node getting approximately the same amount of record data and with no splitting of individual records.
- ROXIE DFS is index-based and optimized for concurrent query processing. Based on a custom B+ tree structure, the system enables fast, efficient data retrieval.
- Horizontal scalability from one node to thousands of nodes.
- Thor Data Refinery can process up to billions of records per second.
- ROXIE Data Delivery Engine can support thousands of users with sub-second response time, depending on the application.
- Technical Whitepaper: Comparison of HPCC Systems® Thor vs Apache Spark Performance on AWS (PDF)
- Webinar Overview: Baselines & Benchmarks – Making Open Source Big Data Analytics Easy
- Terasort Benchmark Results (PDF)
- PigMix Comparison
- Comparison to Hadoop
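As an illustration of the Thor DFS partitioning described earlier, where each node receives approximately the same amount of record data and no record is split, the following Python sketch assigns whole variable-length records to nodes by byte offset. It is a simplified assumption-laden example, not the algorithm HPCC Systems uses: the function name `partition_records` and the midpoint rule are hypothetical choices made for clarity.

```python
def partition_records(records, num_nodes):
    """Assign whole records to nodes, balancing by byte size.

    Each record goes to the node whose equal-sized byte range contains the
    record's midpoint, so records are never split and input order is kept.
    """
    parts = [[] for _ in range(num_nodes)]
    total_bytes = sum(len(rec) for rec in records)
    if total_bytes == 0:
        return parts

    offset = 0  # byte offset of the current record within the whole file
    for rec in records:
        # Integer form of: node = floor(midpoint / (total_bytes / num_nodes))
        node = (2 * offset + len(rec)) * num_nodes // (2 * total_bytes)
        parts[min(node, num_nodes - 1)].append(rec)
        offset += len(rec)
    return parts


# Six variable-length records, 20 bytes in total, spread over three nodes.
records = [b"aaaa", b"bb", b"cccccc", b"d", b"eee", b"ffff"]
parts = partition_records(records, 3)
for node_id, part in enumerate(parts):
    print(node_id, part, sum(len(r) for r in part), "bytes")
```

Because records are indivisible, the per-node byte counts are only approximately equal. A single very large record can still skew one node, which is why the source text says "approximately the same amount".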
Free videos are available to help you get started quickly and get the most from HPCC Systems. The videos cover a variety of topics, from beginners trying to solve their first problem to advanced users looking to tune their programs to meet performance requirements.