FAQs
About HPCC Systems
Who is HPCC Systems?
HPCC Systems from LexisNexis Risk Solutions offers a proven, open-source, data-intensive supercomputing platform designed for the enterprise to solve big data problems. As an alternative to Hadoop, the HPCC Platform offers a consistent data-centric programming language, two data platforms and a single architecture for efficient processing. Customers who already leverage the HPCC Platform through LexisNexis products and services include banks, insurance companies, law enforcement agencies, the federal government and other enterprise-class organizations.
HPCC Systems is a cluster of computers, anywhere from two to a thousand or more, built from commodity off-the-shelf (COTS) hardware, which works together as one computer to process and deliver Big Data. It has three main components: a programming language, a data refinery engine and a data delivery engine.
Why is LexisNexis Risk Solutions offering this technology under HPCC Systems?
HPCC Systems is the open source Big Data initiative of LexisNexis Risk Solutions.
What does HPCC Systems offer?
HPCC Systems offers the HPCC platform, which includes one programming language, Enterprise Control Language (ECL); a data refinery engine, Thor; and a data delivery engine, ROXIE.
How does HPCC Systems differ from other open source platforms?
HPCC Systems from LexisNexis Risk Solutions has a more than 10-year track record of proven, enterprise-ready data-intensive supercomputing. LexisNexis is an organization with more than 35 years of experience in managing big data and in developing the technology platforms to handle large amounts of data. LexisNexis is best known for its legal, media and online databases. The LexisNexis Risk Solutions division is best known for its data, technology and analytics used to locate people, businesses and assets, verify identities, and prevent and predict fraud. LexisNexis Risk Solutions is a $1.5 billion business unit of LexisNexis, which is a $6 billion information solutions company. LexisNexis is owned by Reed Elsevier, which had revenues of $9 billion in 2013.
How does HPCC Systems differ from Hadoop?
HPCC Systems from LexisNexis Risk Solutions offers a proven, open-source, data-intensive supercomputing platform designed for the enterprise to solve big data problems. As an alternative to Hadoop, HPCC Systems offers a consistent data-centric programming language, two data platforms and a single architecture for efficient processing. The platform offers speed, scalability and faster analysis of big data using less code and fewer nodes for greater efficiencies. The platform is based on high performance computing cluster technology. Read more about how HPCC Systems differs from Hadoop.
General Information
Which training classes fit my role?
HPCC Systems offers a variety of training classes and tracks geared to high-level managers, developers and administrators who want to learn the HPCC platform components, including ECL, Thor, Roxie, the ECL IDE and maintaining the environment. View our recommended learning tracks to see which training program is best suited to your role.
What types of training classes are offered?
HPCC Systems currently offers free online training. For any other training options (instructor-led, in person, etc.), please contact us for more information.
What is HPCC Systems?
HPCC Systems is a cluster of computers, anywhere from two to a thousand or more, built from commodity off-the-shelf (COTS) hardware, which works together as one computer to process and deliver Big Data. It has three main components: a programming language, a data refinery engine and a data delivery engine.
HPCC Systems stores and processes large quantities of data, processing billions of records per second using massive parallel processing technology. Large amounts of data across disparate data sources can be accessed, analyzed and manipulated in fractions of seconds. The HPCC Systems platform functions as both a processing and a distributed data storage environment, capable of analyzing terabytes of information.
How long has HPCC Systems been in active development?
The HPCC Systems platform has been in active development and use for over 10 years.
Who is currently using HPCC Systems?
This technology has been proven in the marketplace for the past ten years. Our HPCC Systems technology powers the products and solutions of the LexisNexis Risk Solutions business unit, whose mission is to provide essential insights to advance and protect people, industry and society. LexisNexis Risk Solutions customers include top government agencies, insurance carriers, banks and financial institutions, health care organizations, credit card issuers, top retail card issuers, cell phone providers and a range of other industries. HPCC Systems technology is also used to provide enhanced content to the new Lexis electronic products that serve legal, academic and research industries.
Is HPCC Systems more efficient than Java-based platforms like Hadoop, and why?
Yes. Starting at the lowest level, the HPCC Systems platform generates C++ rather than Java, which immediately gives it an efficiency advantage. HPCC has also been in critical production environments for over a decade, and the time and effort invested in individual components give a tangible performance boost. Our analysis of ECL code translated directly from the PigMix benchmark shows an average performance improvement of 3.7x.
That said, the real performance of the HPCC Systems platform begins to show when the programming language (ECL) is used to its fullest to express data problems in their most natural form. In the hands of a skilled coder, speed improvements in excess of an order of magnitude are common, and two orders of magnitude are not out of the question.
Is HPCC Systems a transactional database?
No. The HPCC platform is not a traditional transactional database.
Is there a limit to the size of database HPCC Systems can handle?
No. HPCC Systems is completely scalable, capable of meeting any database need regardless of size. It can be used for almost any data-centric task.
Can I access the query interfaces with REST instead of SOAP?
You can call queries deployed on HPCC using SOAP and REST/JSON. You can also use a web form which is provided for testing.
Can I install HPCC Systems on my Apple OSX computer?
Although we do not currently test this configuration, our source code is available for developers to explore these possibilities. Currently, only Client Tools is supported on Apple OSX.
Will HPCC Systems run on Amazon’s EC2?
Yes. The Thor cluster works well on Amazon AWS EC2. More information is available, including how to get started and other instructional material.
How does HPCC Systems differ from Oracle/MySQL/Sybase clustering?
HPCC Systems is built from the ground up to work as a single cohesive supercomputer, and managing and developing solutions for the HPCC platform is far simpler than with traditional database clustering.
How is HPCC Systems different from a Beowulf computing cluster?
Historically, Beowulf clusters have defined their space in the field of computational analysis and mathematics. HPCC Systems, by contrast, is designed and geared specifically for data manipulation.
For example, in a Beowulf Cluster the programmer explicitly controls the inter-node communication via a facility such as MPI (Message Passing Interface) to perform a global data operation; while in an HPCC system the inter-node communication is performed implicitly.
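To make that concrete, here is a minimal ECL sketch (the record layout and inline data are hypothetical): the DISTRIBUTE and SORT below are global, cross-node operations, yet the code contains no explicit message passing; the platform moves the data between nodes implicitly.
Rec := RECORD
  UNSIGNED4 id;
  STRING20  name;
END;
ds := DATASET([{3,'Jones'},{1,'Smith'},{2,'Brown'}], Rec);
spread  := DISTRIBUTE(ds, HASH32(id)); // implicit redistribution of records across nodes
ordered := SORT(spread, name);         // implicit global sort across the whole cluster
OUTPUT(ordered);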
What is ECL?
ECL (Enterprise Control Language) is a programming language designed and used with HPCC Systems. It is specifically designed for data management and query processing. ECL code is written using the ECL IDE programming development tool.
ECL is a transparent and implicitly parallel programming language which is both powerful and flexible. It is optimized for data-intensive operations, declarative, non-procedural and dataflow oriented. ECL uses intuitive syntax which is modular, reusable, extensible and highly productive. It combines data representation and algorithm implementation.
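As a hedged taste of that declarative, dataflow style (the record layout and data below are made up for illustration), each line simply names a result; the compiler decides how and where the work actually runs.
Person := RECORD
  STRING20  name;
  UNSIGNED1 age;
END;
people := DATASET([{'Ann',34},{'Bob',19},{'Cara',42}], Person);
adults    := people(age >= 21);  // a declarative filter
byAgeDesc := SORT(adults, -age); // sort the filtered set, oldest first
OUTPUT(byAgeDesc);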
Learn more about ECL.
What is the ECL IDE?
The ECL IDE is an integrated development environment for the ECL language designed to make ECL coding easy and programmer-friendly. Using the ECL IDE you can build, edit and execute ECL queries, and mix and match your data with any of the ECL built-in functions and/or definitions that you have created.
The ECL IDE offers a built-in Attribute Editor, Syntax Checking, and ECL Repository Access. You can execute queries and review your results interactively, making the ECL IDE a robust and powerful programming tool.
For a more detailed look at the ECL IDE, see the HPCC Data Tutorial that provides a walk-through of the development process from beginning to end using the ECL IDE.
What is Roxie?
Roxie (Rapid Online XML Inquiry Engine) is the data delivery engine used in HPCC Systems to serve data quickly; it can support many thousands of requests per node per second.
While the refinery (Thor) is configured to run one or two processes against very large amounts of data, ROXIE, the delivery engine cluster, is designed to return thousands of concurrent queries against that data. Think of it this way: the refinery catalogues all the stars in the sky, and the delivery engine returns you a single star out of billions.
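A hypothetical sketch of a ROXIE-style query is shown below: input arrives through a STORED parameter and the query returns the matching rows. The names are assumptions, and a production query would normally read a pre-built INDEX rather than an inline dataset.
PersonRec := RECORD
  STRING20 lastName;
  STRING20 firstName;
END;
people := DATASET([{'SMITH','ANN'},{'JONES','BOB'}], PersonRec);
STRING20 searchLastName := '' : STORED('LastName'); // value supplied by the caller at query time
OUTPUT(people(lastName = searchLastName));          // return only the matching rows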
What is Thor?
Thor (The Data Refinery Cluster) is responsible for consuming vast amounts of data, transforming, linking and indexing that data. It functions as a distributed file system with parallel processing power spread across several nodes. A cluster can scale from a single node to thousands of nodes.
Thor does the heavy lifting of big data: cleaning data, merging and transforming it, profiling and analyzing it, and preparing it for use by end-user queries. This refinery engine handles petabytes of data extremely efficiently. Depending on the data, a cluster of 500 nodes can crunch through a petabyte of data in less than ten seconds.
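The sketch below gives a hedged flavor of that kind of refinement work (the layout, data and cleansing rules are invented for illustration): cleanse, standardize and deduplicate records in one declarative dataflow.
IMPORT Std;
Rec := RECORD
  STRING20 name;
  STRING2  state;
END;
raw := DATASET([{' smith','fl'},{'Smith ','FL'},{'jones','ga'}], Rec);
cleaned := PROJECT(raw,
                   TRANSFORM(Rec,
                             SELF.name  := Std.Str.ToUpperCase(TRIM(LEFT.name, LEFT, RIGHT)); // standardize names
                             SELF.state := Std.Str.ToUpperCase(LEFT.state)));                 // standardize states
deduped := DEDUP(SORT(cleaned, name, state), name, state); // remove duplicate records
OUTPUT(deduped);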
What is Big Data?
Big Data is a term that refers to very large (e.g., terabyte- or petabyte-scale) data sets and the secure storage facilities, hardware and software tools, and processes and procedures used to create and manipulate them.
As a leading information provider, LexisNexis has more than 35 years of experience in managing big data, from publicly available information such as worldwide newspapers, magazines, articles, research, case law, legal regulations, periodicals, and journals – to public records such as bankruptcies, liens, judgments, real estate records – to other types of information.
To manage, sort, link, and analyze billions of records in sub-seconds, LexisNexis Risk Solutions designed a data-intensive supercomputer built on our own high-performance computing cluster (the HPCC Systems platform), proven over the past 10 years with customers who need to sort through billions of records. Customers such as leading banks, insurance companies, utilities, law enforcement and the federal government depend on LexisNexis technology and information solutions to help them make better decisions faster.
How did HPCC Systems evolve?
To manage, sort, link, and analyze billions of records within seconds, LexisNexis Risk Solutions designed a data-intensive supercomputer that has been proven over the past 10 years with customers who need to process billions of records within seconds. Customers such as leading banks, insurance companies, utilities, law enforcement and the federal government depend on LexisNexis Risk Solutions. LexisNexis has offered this platform as an open source solution under HPCC Systems. LexisNexis Risk Solutions is a $1.5 billion business unit of LexisNexis, which is a $6 billion information solutions company. LexisNexis is owned by Reed Elsevier, which had revenues of $9 billion in 2013.
What is ESP?
ESP (Enterprise Services Platform) provides an easy to use interface to access ECL queries using XML, HTTP, SOAP (Simple Object Access Protocol) and REST (Representational State Transfer).
ECL and Developers
What does an ECL program look like?
In its simplest form, an ECL program is simply one or more expressions; when run, the value of each expression is written to the workunit or console.
To create more complex expressions without becoming unreadable, you can use the definition syntax to give a name to an expression, then use that name in subsequent ECL expressions. For extra flexibility, a definition can be parameterized to make a function definition.
When the ECL program is compiled, all references to defined symbols are substituted for the appropriate expressions to form a single complex expression representing the entire program. This expression is then optimized and compiled into executable code.
Because the entire program is an expression (and not a series of statements), there is no implied order of execution. Logically (and, in many cases, actually), the entire program can execute at once, in parallel, across as many cores and/or nodes as are available in the cluster.
Here is a simple example:
TheWorld := ' and the world';
Message := 'Hello HPCC' + TheWorld;
OUTPUT(Message);
Resulting in:
'Hello HPCC and the world'
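Building on that example, here is a hedged sketch of the parameterization mentioned above: giving a definition a parameter turns it into a reusable function definition.
Greet(STRING who) := 'Hello ' + who + ' and the world';
OUTPUT(Greet('HPCC')); // 'Hello HPCC and the world'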
How do I know if I need ECL?
You should look at ECL if any one of the following scenarios is true. If two or more of them are true, then you need ECL:
1. You have large volumes of data. ECL has repeatedly been proven to be the most efficient solution to large scale data processing. In other words, you can do more with fewer nodes (on the terabyte sort we needed 4x fewer nodes than Hadoop to get the same result). Thus whether you are dealing with terabytes on ten nodes or petabytes on thousands, ECL is the cost-effective solution if the data is huge.
2. You need to do more with limited staff. ECL is a highly productive language and usually requires 10x fewer lines of code than other languages.
3. You need to perform complex data processing. ECL is a highly expressive language; it doesn’t just perform simple data operations. It can do everything from natural language processing to complex statistical modeling and from data clustering to non-obvious relationship analysis.
4. Your data is ‘dirty.’ The data comes in a variety of formats, but none of the formats are entirely well specified. The data has errors, has been accumulated over time and is full of both human and system artifacts. ECL has a rich variety of data types, file types and most importantly a strong ‘library building’ mechanism that allows the data to be methodically knocked into shape.
5. Many people need to work with the data. ECL has a very strong module encapsulation mechanism and revision control mechanism built in. It is designed to allow hundreds of people to work on many datasets which need to be integrated and fused. It is designed to maximize code-reuse and provide long-term code maintainability.
6. Analysts rather than strong programmers need to use data. A somewhat unusual feature of ECL is that it is dictionary based. It is actually a domain specific meta-language. Put another way, it is possible for the stronger coders to EXTEND the ECL language with simpler, domain specific terms which then allow analysts to code in ECL.
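As a small, hypothetical illustration of that extension mechanism (the helper name and cleansing rule are assumptions), a stronger coder can wrap the messy details once so that analysts simply call a domain term:
IMPORT Std;
// Hypothetical domain-specific term exposed to analysts.
NormalizeName(STRING s) := Std.Str.ToUpperCase(TRIM(s, LEFT, RIGHT));
OUTPUT(NormalizeName('  smith  ')); // 'SMITH'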
Why is ECL described as highly optimizing?
In the compiler world, a highly (sometimes called globally) optimizing compiler performs a series of mappings or changes to the program it is given, in order to produce code which is better than the one the programmer specified, while achieving the same effect. Obviously the point at which you declare a compiler ‘highly optimizing’ is somewhat subjective. However, we believe ECL exceeds the mark by a good distance. Some of the global optimizations it performs are:
1. Common sub-expression elimination: Across a large, complex data process it is very common for the same value or the same file to be accessed multiple times. Common sub-expression elimination is the process by which the compiler scans everything a program is doing, looking for anything that happens twice; if something does happen twice, the compiler causes the first occurrence to be stored so that the second time around it does not have to be computed again. ECL is particularly strong here in that it does NOT require both occurrences to be called the same thing or be written the same way. ECL is declarative, so if you are effectively doing the same thing twice it will only do it once.
2. Redundant operation elimination: In a large piece of code that reuses other large pieces of code, it is not uncommon for operations to be specified that are not really needed. As a trivial example, a given attribute might return a file SORTed by a particular key. Another piece of code uses that file but immediately JOINs it to something else, destroying the SORT order. ECL spots that the SORT wasn't used and eliminates it. Conversely, if a file was already sorted and an upcoming JOIN would normally sort the data to perform the join, it will skip the redundant sort and use the existing SORT order.
3. Operation hoisting: It is possible that two operations in a row both have to occur but are currently in a sub-optimal order. For example, if the code is filtering after a SORT, it is better to move the filter before the SORT. ECL will perform this optimization and others like it (a sketch at the end of this answer illustrates this).
4. Operation substitution: Sometimes two operations are both required and in the correct sequence, but because BOTH have to happen it is possible to substitute a third operation which is more optimal. Alternately, it may be wise to split a single heavy operation into a number of simpler ones. An example is the global aggregate table operation; rather than redistribute all of the data to aggregate it, ECL expands the operation into a local aggregate followed by a slightly different global aggregate. This saves network bandwidth and avoids nasty skew errors when aggregating data with some high-frequency values.
5. Redundant field elimination: If many different processes use a file it is not uncommon for the file to contain fields which are not used in a particular process. Again ECL scans the whole graph and will strip out fields from every dataflow as soon as it can. At the very least this reduces memory consumption and network utilization. If the field was a computed field it may also save processor cycles.
6. Global resource coloring: When all of the pure optimizations have taken place there will (usually) still be work that needs to be done. To make sure that work happens as quickly as possible, it is necessary to ensure that all the parts of the machine (disk, network, CPU, memory) are working as hard as they can all of the time. It is also important to ensure that the machine components are not being over-worked, as this can cause the system to 'thrash' (degrade in a non-linear fashion due to resource contention). ECL therefore takes the graph of all the operations that need to occur, together with the sequencing constraints, and 'colors' the graph with the target cluster resources. Put simply, it organizes the work to keep everybody busy.
The above lists the major optimizations that justify the title 'highly optimizing.' Combined, they result in an execution graph that is usually far superior to anything that a team of programmers would have the time or patience to put together by hand.
There are also other features in the system that help maximize system performance, but they are not strictly language optimizations.
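As a hedged illustration of operation hoisting (item 3 above), using hypothetical data: the filter below is written after the SORT, but because ECL is declarative the compiler is free to apply the filter first and sort the smaller result.
Rec := RECORD
  UNSIGNED4 id;
  STRING2   state;
END;
ds := DATASET([{3,'FL'},{1,'GA'},{2,'FL'}], Rec);
ordered := SORT(ds, id);
flOnly  := ordered(state = 'FL'); // filter written after the sort...
OUTPUT(flOnly);                   // ...but executed as filter-then-sort by the optimizer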
Is ECL a functional language?
ECL is not a functional language. It is a declarative language, though it does share a number of important characteristics with functional languages. Perhaps the most obvious is that ECL does not have variables and thus the majority of ‘functions’ and ‘procedures’ do not have state, though it is possible to declare external ECL functions as having state. It is also true that most ECL constructs do not have side-effects which is again similar to a functional language. This allows the compiler a great deal of flexibility when optimizing.
What type of problems can be solved using ECL?
Using ECL, you can find 'hidden' relationships that can be at least N² in complexity, where simple partitioning is not effective. Information extraction can be performed on human-language texts and other data sources, including the extraction of entities, events and relationships.
How is ECL considered powerful and flexible?
ECL is designed in a way which reduces the complexity of parallel programming. The built-in parallelism makes creating a high performance parallel application easier. ECL applications can be developed and implemented rapidly.
It can manipulate huge volumes of data efficiently, and the parallel processing means that "order of magnitude" performance increases over other approaches are possible. Also, data-intensive algorithms that were previously thought to be intractable or infeasible are achievable with ECL.
ECL is ideal for the implementation of ETL (Extract, Transform and Load), Information Retrieval, Information Extraction and other data-intensive work.
Is ECL a general purpose or domain specific language?
ECL is a data-centric language. It is not general purpose, yet it is not tied to a particular domain; it is tied to the data.
ECL has been used in everything from business intelligence to healthcare, from debt collection to counterintelligence and from law enforcement to theology. The one feature all of these tasks share is the need to process data in volumes, forms or ways which are not possible, or at least not realistic, using conventional technology.
Does ECL support visualization operations?
Out of the box ECL supports tables and graphs as native visualizations. More statistical visualizations such as charts and scatter plots are supported using the ‘export to Excel’ mechanism. ECL also offers XSLT extensibility of its output format; so with a little coding, extra visualizations can be available without using Excel.
How is programming in ECL different from programming in terms of MapReduce?
MapReduce is a patented software framework introduced by Google to support distributed computing on large data sets on clusters of computers.
The MapReduce framework is based on Map and Reduce functions used to explicitly manipulate distributed data. This framework does very little to hide the complexity of parallel programming. Moreover, expressing certain data problems in terms of map-reduce is not trivial, and most data problems require multiple sequential phases of map-reduce to accomplish the task.
ECL, on the other hand, is based on the functional dataflow paradigm and is designed in a way which reduces the complexity of parallel programming. ECL supports built-in parallelism, both extrinsic and intrinsic (across multiple nodes and across multiple operations), and hides the complexity of this parallelism from the programmer. This makes creating a high-performance parallel application easier. ECL applications can be developed and implemented rapidly.
ECL can manipulate huge volumes of data efficiently. Also, data intensive algorithms that were previously thought to be intractable or infeasible are achievable with ECL.
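For a hedged comparison point (the layout and data are invented), the grouped aggregation below is written as a single declarative TABLE operation; in MapReduce the same result would typically require an explicit map phase, a shuffle and a reduce phase, whereas here the optimizer chooses the execution strategy.
Rec := RECORD
  STRING20 name;
  STRING2  state;
END;
people := DATASET([{'Smith','FL'},{'Jones','GA'},{'Brown','FL'}], Rec);
countsByState := TABLE(people, {state, UNSIGNED cnt := COUNT(GROUP)}, state); // grouped count per state
OUTPUT(countsByState);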
How do I transition my current Pig scripts to ECL?
HPCC Systems provides a utility program called Bacon, which can automatically translate Pig programs into the equivalent ECL. The Bacon-translated versions of the PigMix tests are presented in "The PigMix Benchmark on HPCC".
What development languages are supported?
HPCC Systems is designed, built and optimized around the ECL language. It is a declarative, data centric programming language built specifically for big data. We offer training videos, tutorials, and reference materials to get you started quickly. We also offer training sessions. ECL generates C++ and also allows C++ to be used to extend the language or plug into other libraries. Other languages can be integrated using PIPE, SOAPCALL and External DLLs.
What type of queries can be created to manipulate the data stored on HPCC Systems?
Efficient cross-partition operations can be performed, including global JOINs and SORTs that manipulate partitioned data as a single logical object. These data-intensive operations minimize the overhead of reading and writing large amounts of data by executing the algorithm close to the data and encapsulating it internally, rather than moving large amounts of data to the processors.
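A minimal sketch (with hypothetical layouts and inline data) of such cross-partition operations: both datasets are partitioned across the cluster, yet the JOIN and SORT below treat them as single logical objects.
NameRec := RECORD
  UNSIGNED4 id;
  STRING20  name;
END;
AddrRec := RECORD
  UNSIGNED4 id;
  STRING30  city;
END;
OutRec := RECORD
  UNSIGNED4 id;
  STRING20  name;
  STRING30  city;
END;
names := DATASET([{1,'Smith'},{2,'Jones'}], NameRec);
addrs := DATASET([{1,'Boca Raton'},{2,'Atlanta'}], AddrRec);
joined := JOIN(names, addrs, LEFT.id = RIGHT.id,
               TRANSFORM(OutRec, SELF.city := RIGHT.city; SELF := LEFT)); // global join across partitions
OUTPUT(SORT(joined, name));                                               // global sort of the joined result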
How do I import custom record formats into Thor?
In order to distribute the data across a Thor cluster, some form of record delimiter or record length indication will be needed so that the data may be partitioned. ECL has a rich syntax to allow data to be read in a wide variety of formats, structured layouts and in completely unstructured form. In the unlikely event that ECL really cannot handle the format natively, it is possible to use the PIPE command to ingest the data via a program written in some alternate language.
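As a hedged sketch (the logical file name, layout and delimiters are assumptions), declaring a record structure and delimiters is what lets Thor split a sprayed file on record boundaries and read it in parallel:
PersonRec := RECORD
  STRING20 name;
  STRING8  dob;
  STRING2  state;
END;
// Hypothetical logical file previously sprayed onto the cluster.
people := DATASET('~example::person::raw', PersonRec,
                  CSV(SEPARATOR(','), TERMINATOR('\n')));
OUTPUT(people);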
Can I write directly to the filesystem?
One of the major features of ECL is it has a fully integrated file management system that allows data to be moved, utilized and maintained using a suite of tools. The tools also have a programming level interface to allow them to be used from code. It is vastly preferable to work within this framework if at all possible. Ultimately however, if you really have to do it, it is quite possible to write directly onto the disks without the framework knowing.
Will a data read look like a continuous read or is it chunked?
The technical answer is that ECL is not built upon a chunked file system. By default, the data resident for a single file on a single node is stored in one continuous piece, which greatly speeds up the sequential reading of a large file. It is important to realize that most ECL programmers do not worry about the underlying format at all: the ECL file management system encapsulates all of the low-level details of a distributed file system, including failover. To the programmer, a file is simply a logical entity that contains data.
How can I access other libraries and code?
There are a number of options for incorporating other libraries and code, refer to each section in the ECL Language Reference:
- Command-line applications (see PIPE in the ECL Language Reference)
- Soap Calls (see SOAPCALL in the ECL Language Reference)
- External DLLs that implement well-defined standards using the ECL SERVICE and PlugIn interfaces (refer to the External Service Implementation in the ECL Language Reference)
- In-line C++ code (see BEGINC++ in the ECL Language Reference)
- Embedded code from other programming languages, such as Java, JavaScript, Python and R, which can be run from within ECL in addition to the existing C++ integration (see the sketch below)
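For example, here is a hedged sketch of embedding another language (it assumes the Python plugin is installed on the cluster): the embedded body runs as part of the ECL workflow.
IMPORT Python;
INTEGER addOne(INTEGER n) := EMBED(Python)
  return n + 1
ENDEMBED;
OUTPUT(addOne(41)); // 42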
Can I query with SQL?
HPCC uses a declarative programming language called ECL to greatly minimize development time when working with extremely large, complex data problems. However, we now offer multiple SQL interfaces: the HPCC JDBC Driver and the WsSQL service, which both support a subset of SQL commands, and embedded MySQL support within ECL.
Why ECL, not SQL?
The scope of the two languages is completely different and disjoint, which makes a direct comparison unfair. However, there are a number of characteristics of ECL that make it the preferred choice. ECL is more expressive, allowing most algorithms to be represented completely, and one ECL query can replace the functionality of multiple SQL queries.
ECL allows complex processing algorithms to be executed on the server instead of in the middleware or in client programs. It also moves the algorithm close to the data and encapsulates it internally, rather than moving large amounts of data to the processing algorithm externally. Algorithms can also be implicitly executed on data in parallel rather than moving the data to a single client.
ECL also encourages deep data comprehension and code reuse which enhances development productivity and improves the quality of the results.
What are ECL Definitions?
ECL definitions are the basic building blocks of ECL. An ECL definition asserts that something is true; it defines what is done but not how it is to be done. ECL Definitions can be thought of as a highly developed form of macro-substitution, making each succeeding definition more and more highly leveraged upon the work that has gone before. This results in extremely efficient query construction.
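A minimal sketch of that layering (the names and values are invented): each definition asserts what something is, builds on earlier definitions, and nothing executes until an action such as OUTPUT uses the result.
BasePrice  := 100;
TaxRate    := 0.07;
Tax        := BasePrice * TaxRate;  // builds on the two definitions above
TotalPrice := BasePrice + Tax;
OUTPUT(TotalPrice); // 107.0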
How is data loaded onto a ROXIE cluster?
Data gets deployed to a ROXIE cluster automatically whenever a query is loaded onto the cluster. Using the Roxieconfig tool, select a query to load, and Roxieconfig will look up all files/keys used by that query. The information is then sent to the ROXIE cluster to be used. Depending on the cluster configuration, ROXIE will either copy the files or use them directly from the remote location.
How is data loaded onto a Thor cluster?
Data loading is controlled by the Distributed File Utility (DFU) Server. The data files are placed onto the Landing Zone and then copied and distributed (sprayed) onto the THOR cluster. A single file is distributed into multiple physical files across the nodes in the THOR cluster. One logical file is also created which is used in the code written by ECL programmers.
Is there JDBC support?
Yes, an HPCC JDBC Driver is available which supports a subset of SQL commands. In conjunction with your favorite JDBC Client it is possible to query HPCC data with SQL.
Why doesn’t the HPCC platform have the well-known Hadoop “long tail problem”?
There are multiple causes to the “long tails” problem in Hadoop. Some of these causes, related for example to data skews and slow nodes, get amplified by the fact that multiple MapReduce cycles are normally serialized over a single data workflow (when, for example, performing a multi-join, working through a graph traversal problem or executing a clustering algorithm).
HPCC Systems utilizes several mechanisms to minimize the lasting effect of these long tails, including the additional parallelization described in the next answer, a record-oriented filesystem which ensures that each node receives approximately the same load (in terms of the number of data records processed by each node, even for variable-length and/or XML record layouts) and enough instrumentation to make the user aware of the data skew levels at each step in the data workflow execution graph.
In comparing to Hadoop MapReduce, how is the HPCC Platform truly parallel?
The fundamental design concepts in HPCC Systems are not based in the MapReduce paradigm postulated by Google in 2004. As a matter of fact, HPCC predates that paper by several years.
The idea behind the way data workflows are architected in HPCC Systems is based on high level data primitives (SORT, PROJECT, DISTRIBUTE, JOIN, etc.), exposed through the ECL language, and a powerful optimizer which, at ECL compile time, determines how these operations can be parallelized during execution, and what the execution strategy should be to achieve the highest performance in the system.
ECL is a declarative language, so ideally the programmer doesn’t need to define the control flow of the program. A large number of data operations are commutative in nature, and since transferring (big) data is normally very expensive, the optimizer can, for example, move a filter closer to the beginning to reduce the amount of data that is carried over in subsequent operations. Other optimizations such as lazy execution are also utilized to eliminate throwaway code and data structures.
The specific execution plan varies depending on what the particular data workflow (ECL program) looks like, and the system provides a graphical display of the exact execution plan that the optimizer determined to be the most appropriate for that workflow. Once you submit a workunit from the ECL IDE, you can visualize the execution plan for that workunit, along with key metrics for each intermediate step, including the number of data records processed, data skews and the specific operation represented. A complex execution graph is normally subdivided into multiple subgraphs, and many of those operations are parallelized if there is no need for a synchronization barrier (or if the optimizer thinks that excessive parallelization would affect overall performance negatively).
It is recommended that you download the Virtual Machine and/or binaries of the platform and play with some of the examples that we provide in our portal to understand how this all works in practice. Although in real life you would never need to tinker with the platform itself, if you feel inclined to see how things work under the hood, feel free to download the C++ source code of the HPCC Systems platform from our Git repository and take a look at the inner implementation details of the platform and the ECL compiler and optimizer.
Another source of reference is the PigMix Benchmark on HPCC.
Operations and Planning
What are the system requirements?
HPCC Systems is designed to run on thousands of nodes, but at minimum you need one current, supported Linux system to run a single-node cluster. In addition, the ECL IDE is available for Windows and the ECL Eclipse plugin is available for all platforms. Detailed information is available in the system requirements guide.
Are any special hardware requirements needed to run HPCC?
No. HPCC Systems runs on commodity off-the-shelf (COTS) hardware. The minimum memory per node is 4 GB.
Any limitations on file size and number of files?
Currently, no single file can be greater than 4 exabytes (4 million terabytes). This is not a practical concern, as other limitations in real-world systems, such as the total available disk space on your cluster, are reached long before this limit. There are no theoretical limits to the number of files.
What are the differences in configuration between Roxie Remote versus Local Data Access Settings?
The data files and index files referenced by the query’s ECL code are made available in one of these ways, depending on the configuration of the Roxie cluster.
Two configuration settings in Configuration Manager determine how this works:
copyResources
Copies necessary data and key files from the current location when the query is published.
useRemoteResources
Loads necessary data and key files from the current location when the query is published.
These options may appear to be mutually exclusive, but the chart below shows what each possible combination means.
- copyResources=T, useRemoteResources=T: Directs the ROXIE cluster to use the remote copy of the data until it can copy the data locally. This allows a query to be available immediately, using the remote data until the copy completes.
- copyResources=T, useRemoteResources=F: Directs the ROXIE cluster to copy the data locally. The query cannot be executed until the data copy completes. This ensures optimum performance but may delay the query's availability until the file copy completes.
- copyResources=F, useRemoteResources=T: Directs the ROXIE cluster to load the data from a remote location and never copy it locally. The query can be executed immediately, but performance is limited by network bandwidth. This allows queries to run without using any ROXIE node disk space, but reduces throughput. This is the default for a single-node system, because Thor and ROXIE are on the same node and share the same disk drives.
- copyResources=F, useRemoteResources=F: Uses data and indexes previously loaded, but will not copy or read remote data.
If I redistribute or provide a service using the Community Edition of HPCC: must I disclose or publish my ECL source code under the same license?
No, you are free to keep your ECL source code for yourself. Only changes or improvements to the HPCC platform must be disclosed. There are no obligations attached to ECL programs.
How do I get started?
The HPCC Systems website has documentation and training to support you from your first few moments with HPCC Systems right through to power use. Check out the Try Now section to explore the ECL Playground or launch a cluster on the AWS cloud.
How can I learn the ECL programming language?
To learn ECL, visit the Documentation area and read the ECL Programmers Guide and/or the ECL Reference Guide. There is also online self-paced training available. The courses are free for anyone wanting to learn ECL and the HPCC Systems platform. Additional resources include videos, tutorials, and our YouTube channel.
What is the relationship between CPU cores, amount of memory, volume of data and number of concurrent queries for Thor and ROXIE?
We typically size ROXIE nodes at about ¼ the size of the Thor (worker) nodes; if you have 12 Thor (worker) nodes, you will need roughly 3 ROXIE nodes. ROXIE nodes are also set up on independent instances (VMs as well) to accommodate high availability and redundancy. Typical ROXIE workloads see high disk utilization and lower CPU usage. ROXIE automatically caches data based on the available memory.
How do you calculate/measure the throughput of one ROXIE agent?
Since ROXIE is primarily a data query engine, throughput depends on the complexity of the query. If the query is very straightforward, a single node (dual CPU) can scale up to 2,000-3,000 TPS. ROXIE is typically not compute intensive, so CPU is not the limiting factor; I/O, network speed and RAM make the difference. A typical configuration is servers with 16 GB of memory, 2 CPUs, 8 cores and SSD drives. A 1 GbE network is the minimum; our higher workloads might use 40 GbE.
How is data reliability organized in HPCC? Can I control replication factor?
On Thor the replication factor is 2; that is, each partition of the data has one additional copy elsewhere on the cluster, so a failed node can be recovered automatically. With containerized environments, we may be able to provide even more flexibility. On ROXIE, you can have more than 2 replicas, and the cluster will use the replicas actively to fulfill query requests; for example, you can lose 50% of the ROXIE nodes and still be able to fulfill incoming queries.
How fault tolerant is the system? If one hardware server goes down will the cluster remain available?
HPCC Systems can run in a Data Lake configuration where multiple individual HPCC Systems clusters operate on shared data; a cluster can remotely access data from another cluster, and so on. These options give you many ways to harden your environment as well as the agility to manipulate data.
Does HPCC know which ROXIE Agent or Thor Worker nodes are hosted on the same HW server and which are not during copying data for redundancy? Can I add variability to locations to evenly distribute data between hardware servers but not Thor Worker nodes?
If you have two Thor Worker nodes, you will have two logical partitions of a data file, one on each node. If you enable replication, then Worker A's data copy will exist on Worker B, and Worker B's data copy will exist on Worker A. A different design for data redundancy is the way some of our customers leverage EBS on AWS.
How reliable is HPCC if I plan to use it as a primary data store? Does it have "endless" horizontal scalability?
Data loss is very rare given the amount of built-in redundancy. Horizontal scalability depends on the types of processing you will perform on Thor: if jobs execute sequences that require a lot of network interaction, there may be an upper threshold above which horizontal scaling does not achieve better performance; but if the data is partitioned to accommodate a high level of parallelism, then you can expand HPCC to very large cluster sizes.
In terms of reliability, is HPCC fault tolerant (99.99%) or highly available (99.999%)?
Our customers run production ROXIE configurations in 99.999% configurations.