HPCC Frequently Asked Questions
Read the FAQ for answers to common questions about HPCC, ECL and more
About HPCC Systems
HPCC Systems from LexisNexis Risk Solutions offers a proven, open-source, data-intensive supercomputing platform designed for the enterprise to solve big data problems. As an alternative to Hadoop, the HPCC Platform offers a consistent data-centric programming language, two data platforms and a single architecture for efficient processing. Customers who already leverage the HPCC Platform through LexisNexis products and services include banks, insurance companies, law enforcing agencies, federal government and other enterprise-class organizations.
LexisNexis has created a small, dedicated group called HPCC Systems to better interact with the open source community and deliver enterprise-ready, open source solutions to organizations with big data challenges.
HPCC Systems offers the HPCC platform which includes the programming language called Enterprise Control Language (ECL), the Roxie data delivery engine and the Thor data refinery cluster. The platform is based on high performance computing cluster (HPCC) technology and offers a consistent data-centric programming language, two data platforms and a single architecture for efficient processing. HPCC Systems offers a (free) community version, access to its technology platform and an enterprise-ready version that includes support, training and consulting.
HPCC Systems from LexisNexis Risk Solutions has a 10-year track record of proven, enterprise-ready data-intensive supercomputing. LexisNexis is an organization with more than 35 years of experience in managing big data and for developing the technology platforms to handle large amounts of data. LexisNexis is best known for its legal, media and online databases. The LexisNexis Risk Solutions division is best known for its data, technology and analytics to locate people, businesses, assets; verify identity and prevent and predict fraud. LexisNexis Risk Solutions is a $1.5 billion business unit of LexisNexis, a $6 billion information solutions company. LexisNexis is owned by Reed Elsevier, which had revenues of $12 billion in 2010.
HPCC Systems offers a community version (free) access to its technology platform and an enterprise-ready version that includes support, training and consulting. Read more about how HPCC differs from Hadoop.
General Information and Capabilities
The HPCC has been in active development and use for over 10 years.
This technology has been proven in the marketplace for the past ten years. Our HPCC technology powers the products and solutions of the LexisNexis Risk Solutions business unit, whose mission is to provide essential insights to advance and protect people, industry and society. LexisNexis Risk Solutions customers include top government agencies, insurance carriers, banks and financial institutions, health care organizations, credit card issuers, top retail card issuers, cell phone providers and a range of other industries. HPCC technology is also used to provide enhanced content to the new Lexis electronic products that serve legal, academic and research industries.
Yes. Starting at the lowest level HPCC generates C++ and not Java; that immediately gives it an efficiency advantage. HPCC has also been in critical production environments for over a decade. The time and effort placed in individual components give a tangible performance boost. Our analysis of ECL executing code translated directly from the PigMix shows an average performance improvement of 3.7x.
That said, the real performance of the HPCC begins to show when the ECL language is used to its fullest to express data problems in their most natural form. In the hands of a skilled coder, speed improvements in excess of an order of magnitude are common, and two orders of magnitude are not out of the question.
Yes. HPCC works over the internet and/or over a private network. It also operates on either distributed or centralized systems.
No. The HPCC is not a traditional transactional database.
HPCC is completely scalable, capable of meeting any database need regardless of size. It can be used for almost any data-centric task.
You can call queries deployed on HPCC using SOAP and REST/JSON. You can also use a web form which is provided for testing.
Although we do not currently test this configuration, our source code is available for developers to explore these possibilities. Currently, only Client Tools is supported on Apple OSX.
Yes. The HPCC Thor works well on Amazon AWS EC2. More information is available in the Install Thor on AWS documentation.
The HPCC is built from the ground up to work as a single cohesive super computer. Managing and developing solutions for the HPCC is far simpler.
Historically Beowulf clusters have defined their space in the field of computational analysis and mathematics. HPCC is designed for the purpose of data manipulation and is geared for that specific purpose.
For example, in a Beowulf Cluster the programmer explicitly controls the inter-node communication via a facility such as MPI (Message Passing Interface) to perform a global data operation; while in an HPCC system the inter-node communication is performed implicitly.
ECL (Enterprise Control Language) is a programming language designed and used with the HPCC system. It is specifically designed for data management and query processing. ECL code is written using the ECL IDE programming development tool.
ECL is a transparent and implicitly parallel programming language which is both powerful and flexible. It is optimized for data-intensive operations, declarative, non-procedural and dataflow oriented. ECL uses intuitive syntax which is modular, reusable, extensible and highly productive. It combines data representation and algorithm implementation.
The ECL IDE is an integrated development environment for the ECL language designed to make ECL coding easy and programmer-friendly. Using the ECL IDE you can build, edit and execute ECL queries, and mix and match your data with any of the ECL built-in functions and/or definitions that you have created.
The ECL IDE offers a built-in Attribute Editor, Syntax Checking, and ECL Repository Access. You can execute queries and review your results interactively, making the ECL IDE a robust and powerful programming tool.
For a more detailed look at the ECL IDE, see the HPCC Data Tutorial that provides a walk-through of the development process from beginning to end using the ECL IDE.
Roxie (Rapid Online XML Inquiry Engine) is the data delivery engine used in HPCC to serve data quickly and can support many thousands of requests per node per second.
Thor (The Data Refinery Cluster) is responsible for consuming vast amounts of data, transforming, linking and indexing that data. It functions as a distributed file system with parallel processing power spread across several nodes. A cluster can scale from a single node to thousands of nodes.
Big Data is a term that refers to very large (e.g., tera or petabyte) data sets and secure storage facilities that are created and manipulated by hardware and software tools, and the processes and procedures used behind them to do this.
As a leading information provider, LexisNexis has more than 35 years experience in managing big data, from publicly available information such as worldwide newspapers, magazines, articles, research, case law, legal regulations, periodicals, and journals – to public records such as bankruptcies, liens, judgments, real estate records – to other types of information.
To manage, sort, link, and analyze billions of records within sub-seconds, LexisNexis Risk Solutions designed a data intensive supercomputer built on our own high performing computing cluster (HPCC) platform that is proven for the past 10 years with customers who need to sort through billons of records. Customers such as leading banks, insurance companies, utilities, law enforcement and Federal government depend on LexisNexis technology and information solutions to help them make better decisions faster.
To manage, sort, link, and analyze billions of records within seconds, LexisNexis Risk Solutions designed a data intensive supercomputer that has been proven for the past 10 years with customers who need to process billons of records within seconds. Customers such as leading banks, insurance companies, utilities, law enforcement and Federal government depend on LexisNexis Risk Solutions. LexisNexis has offered this platform as an open source solution under HPCC Systems. LexisNexis Risk Solutions is a $1.5 billion business unit of LexisNexis, a $6 billion information solutions company. LexisNexis is owned by Reed Elsevier, which had revenues of $12 billion in 2010.
ESP (Enterprise Services Platform) provides an easy to use interface to access ECL queries using XML, HTTP, SOAP (Simple Object Access Protocol) and REST (Representational State Transfer).
The ECL Language and Developers
In its simplest form, an ECL program is simply one or more expressions, and, when run, the value of the expressions are written to the workunit or console.
To create more complex expressions without becoming unreadable, you can use the definition syntax to give a name to an expression, then use that name in subsequent ECL expressions. For extra flexibility, a definition can be parameterized to make a function definition.
When the ECL program is compiled, all references to defined symbols are substituted for the appropriate expressions to form a single complex expression representing the entire program. This expression is then optimized and compiled into executable code.
Because the entire program is an expression (and not a series of statements) there is no implied order of execution. Logically (and actually in many cases) the entire program can execute at once in parallel across as many cores and/or nodes are available in the cluster.
Here is a simple example:
TheWorld := ' and the world';
Message := 'Hello HPCC' + TheWorld;
“Hello HPCC and the world”
You should look at ECL if one of the following scenarios listed below is true. If two of the scenarios below are true, then you need ECL:
1. You have large volumes of data. ECL has repeatedly been proven to be the most efficient solution to large scale data processing. In other words, you can do more with fewer nodes (on the terabyte sort we needed 4x fewer nodes than Hadoop to get the same result). Thus whether you are dealing with terabytes on ten nodes or petabytes on thousands, ECL is the cost-effective solution if the data is huge.
2. You need to do more with limited staff. ECL is a highly productive language and usually requires 10x fewer lines of code than other languages.
3. You need to perform complex data processing. ECL is a highly expressive language; it doesn’t just perform simple data operations. It can do everything from natural language processing to complex statistical modeling and from data clustering to non-obvious relationship analysis.
4. Your data is ‘dirty.’ The data comes in a variety of formats, but none of the formats are entirely well specified. The data has errors, has been accumulated over time and is full of both human and system artifacts. ECL has a rich variety of data types, file types and most importantly a strong ‘library building’ mechanism that allows the data to be methodically knocked into shape.
5. Many people need to work with the data. ECL has a very strong module encapsulation mechanism and revision control mechanism built in. It is designed to allow hundreds of people to work on many datasets which need to be integrated and fused. It is designed to maximize code-reuse and provide long-term code maintainability.
6. Analysts rather than strong programmers need to use data. A somewhat unusual feature of ECL is that it is dictionary based. It is actually a domain specific meta-language. Put another way, it is possible for the stronger coders to EXTEND the ECL language with simpler, domain specific terms which then allow analysts to code in ECL.
In the compiler world, a highly (sometimes called globally) optimizing compiler performs a series of mappings or changes to the program it is given, in order to produce code which is better than the one the programmer specified, while achieving the same effect. Obviously the point at which you declare a compiler ‘highly optimizing’ is somewhat subjective. However, we believe ECL exceeds the mark by a good distance. Some of the global optimizations it performs are:
1. Common sub-expression elimination: across a large complex data process it is very common for the same value or same file to be accessed multiple times. Common sub-expression elimination is the process by which the compiler scans ALL that a program is doing and looking for anything that happens twice. If anything does happen twice the compiler causes the first occurrence to be stored so that second time around it does not have to be computed again. ECL is particularly strong in that it does NOT require both occurrences to be called the same thing or be written the same way. ECL is declarative so if you were effectively doing the same thing twice it will only do it once.
2. Redundant operation elimination: In a large piece of code that reuses other large pieces of code it is not uncommon for operations to be specified that are not really needed. As a trivial example, a given attribute might return a file SORTed by a particular key. Another piece of code uses that file but immediately JOINs it to something else destroying the SORT order. ECL spots that the SORT wasn't used and eliminates it. Conversely, if a file was sorted and an upcoming JOIN would normally sort the data to perform the join, it will skip the second join and use the existing SORT order.
3. Operation hoisting: It is possible that two operations in a row both have to occur but at the moment they are in a sub-optimal order. For example if the code is filtering after a SORT it is better to move the filter before the SORT. ECL will perform this optimization and others like it.
4. Operation substitution: Sometimes two operations are both required and in the correct sequence, but because BOTH have to happen it is possible to substitute a third operation which is more optimal. Alternately, it may be wise to split a single heavy operation into a number of simpler ones. An example is the global aggregate table operation; rather than redistribute all of the data to aggregate, it will expand the operation into a local aggregate, followed by a slightly different global aggregate – this saves network bandwidth and avoids nasty skew errors when aggregating data in with some high frequency values.
5. Redundant field elimination: If many different processes use a file it is not uncommon for the file to contain fields which are not used in a particular process. Again ECL scans the whole graph and will strip out fields from every dataflow as soon as it can. At the very least this reduces memory consumption and network utilization. If the field was a computed field it may also save processor cycles.
6. Global Resource Coloring: When all of the pure optimizations have taken place there still will (usually) be work that needs to be done. To make sure that work happens as quickly as possible it is necessary to ensure that all the parts of the machine (disk, network, cpu, memory) are working as hard as they can all of the time. It is also important to ensure that all of the machine components are not being over-worked as this can cause the system to ‘thrash’ (degrade in a non-linear fashion due to resource contention). ECL therefore takes the graph of all the operations that need to occur together with the sequencing constraints and ‘colors’ the graph with the target cluster resources. Put simply, it organizes the work to keep everybody busy.
The above lists the major optimizations that justify the title ‘heavily optimizing.’ Combined they result in an execution graph that is usually far superior to anything that a team of programmers would have the time or patience to put together by hand.
There are also other features in the system that help maximize system performance, but they are not strictly language optimizations.
ECL is not a functional language. It is a declarative language, though it does share a number of important characteristics with functional languages. Perhaps the most obvious is that ECL does not have variables and thus the majority of ‘functions’ and ‘procedures’ do not have state, though it is possible to declare external ECL functions as having state. It is also true that most ECL constructs do not have side-effects which is again similar to a functional language. This allows the compiler a great deal of flexibility when optimizing.
Using ECL, you can find ‘hidden’ relationships that can be at least N2 complexity where simple partitioning is not effective. Information extraction can be done from human language texts and other data sources as well as the extraction of entities, events and relationships.
ECL is designed in a way which reduces the complexity of parallel programming. The built-in parallelism makes creating a high performance parallel application easier. ECL applications can be developed and implemented rapidly.
It can manipulate huge volumes of data efficiently and the parallel processing means that ‘’order of magnitude’’ performance increases over other approaches are possible. Also, data intensive algorithms that were previously thought to be intractable or infeasible are achievable with ECL.
ECL is ideal for the implementation of ETL, (Extract, Transform and Load), Information Retrieval, Information Extraction and other data intensive work.
ECL is a data centric language. It is not general purpose yet not tied to a domain; it is tied to the data.
ECL has been used in everything from business intelligence to healthcare, from debt collection to counter intelligence and from law enforcement to theology. The one common feature all of these tasks have in common is the need to process data in volumes, forms or ways which are not possible or at least not realistic using conventional technology.
Out of the box ECL supports tables and graphs as native visualizations. More statistical visualizations such as charts and scatter plots are supported using the ‘export to Excel’ mechanism. ECL also offers XSLT extensibility of its output format; so with a little coding, extra visualizations can be available without using Excel.
MapReduce is a patented software framework introduced by Google to support distributed computing on large data sets on clusters of computers.
The MapReduce framework is based on Map and Reduce functions used to explicitly manipulate distributed data. This framework does very little to hide the complexity of parallel programming. Moreover, expressing certain data problems in terms of map-reduce is not trivial, and most data problems require multiple sequential phases of map-reduce to accomplish the task.
ECL, on the other hand, is based on the functional dataflow paradigm and designed in a way which reduces the complexity of parallel programming. ECL supports built-in parallelism, both extrinsic and intrinsic (among multiple nodes and across multiple operations) and hides the complexity of this parallelism from the programmer standpoint. This makes creating a high performance parallel application easier. ECL applications can be developed and implemented rapidly.
ECL can manipulate huge volumes of data efficiently. Also, data intensive algorithms that were previously thought to be intractable or infeasible are achievable with ECL.
HPCC Systems provides a utility program called Bacon, which can automatically translate Pig programs into the equivalent ECL. The Bacon-translated versions of the PigMix tests are presented at "The PigMix Benchmark on HPCC".
The HPCC is designed, built and optimized around the ECL language. It is a declarative, data centric programming language built specifically for Big Data. We offer training videos, tutorials, and reference materials to get you started quickly. We also offer training sessions. ECL generates C++ and also allows C++ to be used to extend the language or plug into other libraries. Other languages can be integrated using PIPE, SOAPCALL and External DLLs.
Efficient Cross-Partition Operations can be performed including 'Global Joins' and 'Sorts' to manipulate partitioned data as a single logical object. Data intensive operations can be performed to minimize overheads when reading/writing large amounts of data. This kind of operation also executes the algorithm closer to the data and encapsulates it internally (rather than moving large amounts of data to the processors).
In order to distribute the data across a Thor cluster, some form of record delimiter or record length indication will be needed so that the data may be partitioned. ECL has a rich syntax to allow data to be read in a wide variety of formats, structured layouts and in completely unstructured form. In the unlikely event that ECL really cannot handle the format natively, it is possible to use the PIPE command to ingest the data via a program written in some alternate language.
One of the major features of ECL is it has a fully integrated file management system that allows data to be moved, utilized and maintained using a suite of tools. The tools also have a programming level interface to allow them to be used from code. It is vastly preferable to work within this framework if at all possible. Ultimately however, if you really have to do it, it is quite possible to write directly onto the disks without the framework knowing.
The technical answer is that ECL is not built upon a chunked file system. By default the data resident for a single file upon a single node will be in one continuous piece. This greatly speeds up the sequential reading of a large file. It is important to realize that most ECL programmers do not worry about underlying formats at all. The ECL file management system encapsulates all of the low-level details of a distributed file system including failover. To the programmer, a file has data and contains data.
- Command-line applications (see PIPE in the ECL Language Reference)
- Soap Calls (see SOAPCALL in the ECL Language Reference)
- External DLLs that implement well-defined standards using the ECL SERVICE and PlugIn interfaces (refer to the External Service Implementation in the ECL Language Reference)
- In-line C++ code (see BEGINC++ in the ECL Language Reference)
HPCC uses a descriptive programming language called ECL to greatly minimize development time while working with extremely large complex data problems.
However, the HPCC JDBC Driver* supports a subset of read-only SQL commands. In conjunction with your favorite JDBC Client it is possible to query HPCC data with SQL.
*Currently in BETA
The scope of these languages is completely different and disjointed, which makes it an unfair comparison. However, there are a number of characteristics of ECL that make it the preferred choice of programming languages. ECL is more expressive allowing most algorithms to be represented completely. One ECL query can replace the functionality of multiple SQL queries.
ECL allows complex processing algorithms to be executed on the server instead of in the middleware or in client programs. It also moves the algorithm close to the data and encapsulates it internally, rather than moving large amounts of data to the processing algorithm externally. Algorithms can also be implicitly executed on data in parallel rather than moving the data to a single client.
ECL also encourages deep data comprehension and code reuse which enhances development productivity and improves the quality of the results.
An ODBC connection is not currently available, but we are looking into this possibility.
Data gets deployed to a ROXIE cluster automatically whenever a query is loaded on the cluster. Use the Roxieconfig tool, select a query to load and Roxieconfig will lookup all files / keys used by that query. The information is then sent to the Roxie cluster to be used. Depending on the cluster configuration, Roxie will either copy the files or use them directly from the remote location.
Data loading is controlled by the Distributed File Utility (DFU) Server. The data files are placed onto the Landing Zone and then copied and distributed (sprayed) onto the THOR cluster. A single file is distributed into multiple physical files across the nodes in the THOR cluster. One logical file is also created which is used in the code written by ECL programmers.
Yes, the HDFS to HPCC Connector (H2H)* provides access to your HDFS data. Use H2H to create ECL datasets based on your flat, or CSV HDFS data files.
*Currently in BETA
Yes, an HPCC JDBC Driver* is available which supports a subset of read-only SQL commands. In conjunction with your favorite JDBC Client it is possible to query HPCC data with SQL.
*Currently in BETA
There are multiple causes to the “long tails” problem in Hadoop. Some of these causes, related for example to data skews and slow nodes, get amplified by the fact that multiple MapReduce cycles are normally serialized over a single data workflow (when, for example, performing a multi-join, working through a graph traversal problem or executing a clustering algorithm).
HPCC utilizes several mechanisms to minimize the lasting effect of these long tails, including the additional parallelization that was described in the previous post, a record oriented filesystem which ensures that each node receives an approximate similar load (in terms of number of data records processed by each node, even for variable length and/or XML record layouts) and enough instrumentation to make the user aware of the data skew levels at each step in the data workflow execution graph.
The fundamental design concepts in HPCC are not based in the MapReduce paradigm postulated by Google in 2004. As a matter of fact, HPCC predates that paper by a several years.
The idea behind the way data workflows are architected in HPCC is based on high level data primitives (SORT, PROJECT, DISTRIBUTE, JOIN, etc.), exposed through the ECL language, and a powerful optimizer which, at ECL compile time, determines how these operations can be parallelized during execution, and what the execution strategy should be to achieve the highest performance in the system.
ECL is a declarative language, so ideally the programmer doesn’t need to define the control flow of the program. A large number of data operations are commutative in nature, and since transferring (big) data is normally very expensive, the optimizer can, for example, move a filter closer to the beginning to reduce the amount of data that is carried over in subsequent operations. Other optimizations such as lazy execution are also utilized to eliminate throwaway code and data structures.
The specific execution plans vary, depending on how the particular data workflow (ECL program) looks like, and the system provides for a graphical display of the exact execution plan that the optimizer determined to be the most appropriate for that workflow. Once you submit a workunit from the ECL IDE, you can visualize the execution plan for that workunit, and even key metrics in each intermediate step which include number of data records processed, data skews and the specific operation represented. As you can see, a complex execution graph is normally subdivided in multiple subgraphs, and many of those operations are parallelized if there is no need for a synchronization barrier (or if the optimizer thinks that excessive parallelization will affect the overall performance negatively).
It is recommended that you download the Virtual Machine and/or binaries of the platform and play with some of the examples that we provide in our portal, to understand how this all works in practice. Although in real life you would never need to tinker with the platform itself, if you feel inclined to seeing how things work under the hood, please feel free to download the C++ source code of the HPCC platform from our GIT repository and take a look at the inner implementation details of the platform and ECL compiler and optimizer.
Another source of reference is the PigMix Benchmark on HPCC.
Operations, Administration, Planning
HPCC is designed to run on thousands of nodes. At minimum, you need one current and supported Linux system to run a single node cluster. If you are interested in running HPCC on Windows please contact us. In addition, the ECL IDE is available for Windows and ECL Eclipse Plugin is available for all platforms. Detailed information is available in the system requirements guide.
No. It integrates Commodity-Off-the-Shelf (COTS) hardware.
Minimum memory per node: 4GB
Currently no single file can be greater than 4 exabytes (or 4 million terabytes). This is not a current concern as other limitations occur in real-world systems before this limitation is reached; such as total available disk space on your cluster. There are no theoretical limits to the number of files.
The data files and index files referenced by the query’s ECL code are made available in one of these ways, depending on the configuration of the Roxie cluster.
Two configuration settings in Configuration Manager determine how this works:
copyResources Copies necessary data and key files from the current location when the query is published.
useRemoteResources Loads necessary data and key files from the current location when the query is published.
These options may appear to be mutually exclusive, but the chart below shows what each possible combination means.
-------------------------------------------------------------------- copyResources T useRemoteResources T Directs the Roxie cluster to use the remote copy of the data until it can copy the data locally. This allows a query to be available immediately using the remote data until the copy completes.
-------------------------------------------------------------------- copyResources T useRemoteResources F Directs the Roxie cluster to copy the data locally. The query cannot be executed until the data copy completes. This ensures optimum performance but may delay the query's availability until the file copy completes.
-------------------------------------------------------------------- copyResources F useRemoteResources T Directs the Roxie cluster to load the data from a remote location and never copy locally. The query can be executed immediately, but performance is limited by network bandwidth. This allows queries to run without using any Roxie node disk space, but reduces its throughput capabilities. This is the default for a single node because Thor and Roxie are on the same node and share the same disk drives.
-------------------------------------------------------------------- copyResources F useRemoteResources F Will use data and indexes previously loaded but will not copy or read remote data.
Logging in (e.g., being a registered user) allows you to post to the forums, contribute code and submit issues. Otherwise, anonymous users can only read these areas and cannot post or contribute.
No, you are free to keep your ECL source code for yourself. Only changes or improvements to the HPCC platform must be disclosed. There are no obligations attached to ECL programs.
A step-by-step guide with links to system requirements, documentation and other reference materials is available to help you get started.
HPCC Editions and Licensing
The Community Edition will come with source code, available through a versioning control system and periodic snapshots. Binary versions of the Community Edition will also be made available at important milestones. Support for the Community Edition will be provided by the community.
The Enterprise Edition is provided as an enterprise-ready set of binary installable packages, together with documentation, support, indemnity and additional modules normally required in highly available critical or production type environments. There are fees (i.e., License Fees) associated to the Enterprise Edition and its support.
The Community Edition is free to download, includes the source code, and as of v3.10 is licensed under the Apache License, Version 2.0. New code from HPCC Systems and Community contributors will land here first, so expect to see it continuously changing and growing. By the nature of the open source development process and the fact that the source code repository will be in constant evolution, it is recommended that binary versions be used for anything beyond experimentation and/or platform code development. If you are a developer, or you are just interested in seeing how things work under the hood, this is the version for you.
The Enterprise Edition is available under a paid commercial license. The Enterprise Edition includes support, indemnification and additional modules that are usually required for critical and production environments. If you represent an organization intending to use the HPCC platform to support a critical or production type business/application, this is the version that you should use.
Of course! The ECL language does not differ between Community and Enterprise Editions. Applications written under either edition will run unmodified on the other platform; the only exception being ECL code requiring ECL libraries or plugins only available under the Enterprise Edition or which are part of our Add-on modules.