HPCC Frequently Asked Questions - The ECL Language and Developers
In its simplest form, an ECL program is simply one or more expressions, and, when run, the value of the expressions are written to the workunit or console.
To create more complex expressions without becoming unreadable, you can use the definition syntax to give a name to an expression, then use that name in subsequent ECL expressions. For extra flexibility, a definition can be parameterized to make a function definition.
When the ECL program is compiled, all references to defined symbols are substituted for the appropriate expressions to form a single complex expression representing the entire program. This expression is then optimized and compiled into executable code.
Because the entire program is an expression (and not a series of statements) there is no implied order of execution. Logically (and actually in many cases) the entire program can execute at once in parallel across as many cores and/or nodes are available in the cluster.
Here is a simple example:
TheWorld := ' and the world';
Message := 'Hello HPCC' + TheWorld;
“Hello HPCC and the world”
You should look at ECL if one of the following scenarios listed below is true. If two of the scenarios below are true, then you need ECL:
1. You have large volumes of data. ECL has repeatedly been proven to be the most efficient solution to large scale data processing. In other words, you can do more with fewer nodes (on the terabyte sort we needed 4x fewer nodes than Hadoop to get the same result). Thus whether you are dealing with terabytes on ten nodes or petabytes on thousands, ECL is the cost-effective solution if the data is huge.
2. You need to do more with limited staff. ECL is a highly productive language and usually requires 10x fewer lines of code than other languages.
3. You need to perform complex data processing. ECL is a highly expressive language; it doesn’t just perform simple data operations. It can do everything from natural language processing to complex statistical modeling and from data clustering to non-obvious relationship analysis.
4. Your data is ‘dirty.’ The data comes in a variety of formats, but none of the formats are entirely well specified. The data has errors, has been accumulated over time and is full of both human and system artifacts. ECL has a rich variety of data types, file types and most importantly a strong ‘library building’ mechanism that allows the data to be methodically knocked into shape.
5. Many people need to work with the data. ECL has a very strong module encapsulation mechanism and revision control mechanism built in. It is designed to allow hundreds of people to work on many datasets which need to be integrated and fused. It is designed to maximize code-reuse and provide long-term code maintainability.
6. Analysts rather than strong programmers need to use data. A somewhat unusual feature of ECL is that it is dictionary based. It is actually a domain specific meta-language. Put another way, it is possible for the stronger coders to EXTEND the ECL language with simpler, domain specific terms which then allow analysts to code in ECL.
In the compiler world, a highly (sometimes called globally) optimizing compiler performs a series of mappings or changes to the program it is given, in order to produce code which is better than the one the programmer specified, while achieving the same effect. Obviously the point at which you declare a compiler ‘highly optimizing’ is somewhat subjective. However, we believe ECL exceeds the mark by a good distance. Some of the global optimizations it performs are:
1. Common sub-expression elimination: across a large complex data process it is very common for the same value or same file to be accessed multiple times. Common sub-expression elimination is the process by which the compiler scans ALL that a program is doing and looking for anything that happens twice. If anything does happen twice the compiler causes the first occurrence to be stored so that second time around it does not have to be computed again. ECL is particularly strong in that it does NOT require both occurrences to be called the same thing or be written the same way. ECL is declarative so if you were effectively doing the same thing twice it will only do it once.
2. Redundant operation elimination: In a large piece of code that reuses other large pieces of code it is not uncommon for operations to be specified that are not really needed. As a trivial example, a given attribute might return a file SORTed by a particular key. Another piece of code uses that file but immediately JOINs it to something else destroying the SORT order. ECL spots that the SORT wasn't used and eliminates it. Conversely, if a file was sorted and an upcoming JOIN would normally sort the data to perform the join, it will skip the second join and use the existing SORT order.
3. Operation hoisting: It is possible that two operations in a row both have to occur but at the moment they are in a sub-optimal order. For example if the code is filtering after a SORT it is better to move the filter before the SORT. ECL will perform this optimization and others like it.
4. Operation substitution: Sometimes two operations are both required and in the correct sequence, but because BOTH have to happen it is possible to substitute a third operation which is more optimal. Alternately, it may be wise to split a single heavy operation into a number of simpler ones. An example is the global aggregate table operation; rather than redistribute all of the data to aggregate, it will expand the operation into a local aggregate, followed by a slightly different global aggregate – this saves network bandwidth and avoids nasty skew errors when aggregating data in with some high frequency values.
5. Redundant field elimination: If many different processes use a file it is not uncommon for the file to contain fields which are not used in a particular process. Again ECL scans the whole graph and will strip out fields from every dataflow as soon as it can. At the very least this reduces memory consumption and network utilization. If the field was a computed field it may also save processor cycles.
6. Global Resource Coloring: When all of the pure optimizations have taken place there still will (usually) be work that needs to be done. To make sure that work happens as quickly as possible it is necessary to ensure that all the parts of the machine (disk, network, cpu, memory) are working as hard as they can all of the time. It is also important to ensure that all of the machine components are not being over-worked as this can cause the system to ‘thrash’ (degrade in a non-linear fashion due to resource contention). ECL therefore takes the graph of all the operations that need to occur together with the sequencing constraints and ‘colors’ the graph with the target cluster resources. Put simply, it organizes the work to keep everybody busy.
The above lists the major optimizations that justify the title ‘heavily optimizing.’ Combined they result in an execution graph that is usually far superior to anything that a team of programmers would have the time or patience to put together by hand.
There are also other features in the system that help maximize system performance, but they are not strictly language optimizations.
ECL is not a functional language. It is a declarative language, though it does share a number of important characteristics with functional languages. Perhaps the most obvious is that ECL does not have variables and thus the majority of ‘functions’ and ‘procedures’ do not have state, though it is possible to declare external ECL functions as having state. It is also true that most ECL constructs do not have side-effects which is again similar to a functional language. This allows the compiler a great deal of flexibility when optimizing.
Using ECL, you can find ‘hidden’ relationships that can be at least N2 complexity where simple partitioning is not effective. Information extraction can be done from human language texts and other data sources as well as the extraction of entities, events and relationships.
ECL is designed in a way which reduces the complexity of parallel programming. The built-in parallelism makes creating a high performance parallel application easier. ECL applications can be developed and implemented rapidly.
It can manipulate huge volumes of data efficiently and the parallel processing means that ‘’order of magnitude’’ performance increases over other approaches are possible. Also, data intensive algorithms that were previously thought to be intractable or infeasible are achievable with ECL.
ECL is ideal for the implementation of ETL, (Extract, Transform and Load), Information Retrieval, Information Extraction and other data intensive work.
ECL is a data centric language. It is not general purpose yet not tied to a domain; it is tied to the data.
ECL has been used in everything from business intelligence to healthcare, from debt collection to counter intelligence and from law enforcement to theology. The one common feature all of these tasks have in common is the need to process data in volumes, forms or ways which are not possible or at least not realistic using conventional technology.
Out of the box ECL supports tables and graphs as native visualizations. More statistical visualizations such as charts and scatter plots are supported using the ‘export to Excel’ mechanism. ECL also offers XSLT extensibility of its output format; so with a little coding, extra visualizations can be available without using Excel.
MapReduce is a patented software framework introduced by Google to support distributed computing on large data sets on clusters of computers.
The MapReduce framework is based on Map and Reduce functions used to explicitly manipulate distributed data. This framework does very little to hide the complexity of parallel programming. Moreover, expressing certain data problems in terms of map-reduce is not trivial, and most data problems require multiple sequential phases of map-reduce to accomplish the task.
ECL, on the other hand, is based on the functional dataflow paradigm and designed in a way which reduces the complexity of parallel programming. ECL supports built-in parallelism, both extrinsic and intrinsic (among multiple nodes and across multiple operations) and hides the complexity of this parallelism from the programmer standpoint. This makes creating a high performance parallel application easier. ECL applications can be developed and implemented rapidly.
ECL can manipulate huge volumes of data efficiently. Also, data intensive algorithms that were previously thought to be intractable or infeasible are achievable with ECL.
HPCC Systems provides a utility program called Bacon, which can automatically translate Pig programs into the equivalent ECL. The Bacon-translated versions of the PigMix tests are presented at "The PigMix Benchmark on HPCC".
The HPCC is designed, built and optimized around the ECL language. It is a declarative, data centric programming language built specifically for Big Data. We offer training videos, tutorials, and reference materials to get you started quickly. We also offer training sessions. ECL generates C++ and also allows C++ to be used to extend the language or plug into other libraries. Other languages can be integrated using PIPE, SOAPCALL and External DLLs.
Efficient Cross-Partition Operations can be performed including 'Global Joins' and 'Sorts' to manipulate partitioned data as a single logical object. Data intensive operations can be performed to minimize overheads when reading/writing large amounts of data. This kind of operation also executes the algorithm closer to the data and encapsulates it internally (rather than moving large amounts of data to the processors).
In order to distribute the data across a Thor cluster, some form of record delimiter or record length indication will be needed so that the data may be partitioned. ECL has a rich syntax to allow data to be read in a wide variety of formats, structured layouts and in completely unstructured form. In the unlikely event that ECL really cannot handle the format natively, it is possible to use the PIPE command to ingest the data via a program written in some alternate language.
One of the major features of ECL is it has a fully integrated file management system that allows data to be moved, utilized and maintained using a suite of tools. The tools also have a programming level interface to allow them to be used from code. It is vastly preferable to work within this framework if at all possible. Ultimately however, if you really have to do it, it is quite possible to write directly onto the disks without the framework knowing.
The technical answer is that ECL is not built upon a chunked file system. By default the data resident for a single file upon a single node will be in one continuous piece. This greatly speeds up the sequential reading of a large file. It is important to realize that most ECL programmers do not worry about underlying formats at all. The ECL file management system encapsulates all of the low-level details of a distributed file system including failover. To the programmer, a file has data and contains data.
- Command-line applications (see PIPE in the ECL Language Reference)
- Soap Calls (see SOAPCALL in the ECL Language Reference)
- External DLLs that implement well-defined standards using the ECL SERVICE and PlugIn interfaces (refer to the External Service Implementation in the ECL Language Reference)
- In-line C++ code (see BEGINC++ in the ECL Language Reference)
HPCC uses a descriptive programming language called ECL to greatly minimize development time while working with extremely large complex data problems.
However, the HPCC JDBC Driver* supports a subset of read-only SQL commands. In conjunction with your favorite JDBC Client it is possible to query HPCC data with SQL.
*Currently in BETA
The scope of these languages is completely different and disjointed, which makes it an unfair comparison. However, there are a number of characteristics of ECL that make it the preferred choice of programming languages. ECL is more expressive allowing most algorithms to be represented completely. One ECL query can replace the functionality of multiple SQL queries.
ECL allows complex processing algorithms to be executed on the server instead of in the middleware or in client programs. It also moves the algorithm close to the data and encapsulates it internally, rather than moving large amounts of data to the processing algorithm externally. Algorithms can also be implicitly executed on data in parallel rather than moving the data to a single client.
ECL also encourages deep data comprehension and code reuse which enhances development productivity and improves the quality of the results.
An ODBC connection is not currently available, but we are looking into this possibility.
Data gets deployed to a ROXIE cluster automatically whenever a query is loaded on the cluster. Use the Roxieconfig tool, select a query to load and Roxieconfig will lookup all files / keys used by that query. The information is then sent to the Roxie cluster to be used. Depending on the cluster configuration, Roxie will either copy the files or use them directly from the remote location.
Data loading is controlled by the Distributed File Utility (DFU) Server. The data files are placed onto the Landing Zone and then copied and distributed (sprayed) onto the THOR cluster. A single file is distributed into multiple physical files across the nodes in the THOR cluster. One logical file is also created which is used in the code written by ECL programmers.
Yes, the HDFS to HPCC Connector (H2H)* provides access to your HDFS data. Use H2H to create ECL datasets based on your flat, or CSV HDFS data files.
*Currently in BETA
Yes, an HPCC JDBC Driver* is available which supports a subset of read-only SQL commands. In conjunction with your favorite JDBC Client it is possible to query HPCC data with SQL.
*Currently in BETA
There are multiple causes to the “long tails” problem in Hadoop. Some of these causes, related for example to data skews and slow nodes, get amplified by the fact that multiple MapReduce cycles are normally serialized over a single data workflow (when, for example, performing a multi-join, working through a graph traversal problem or executing a clustering algorithm).
HPCC utilizes several mechanisms to minimize the lasting effect of these long tails, including the additional parallelization that was described in the previous post, a record oriented filesystem which ensures that each node receives an approximate similar load (in terms of number of data records processed by each node, even for variable length and/or XML record layouts) and enough instrumentation to make the user aware of the data skew levels at each step in the data workflow execution graph.
The fundamental design concepts in HPCC are not based in the MapReduce paradigm postulated by Google in 2004. As a matter of fact, HPCC predates that paper by a several years.
The idea behind the way data workflows are architected in HPCC is based on high level data primitives (SORT, PROJECT, DISTRIBUTE, JOIN, etc.), exposed through the ECL language, and a powerful optimizer which, at ECL compile time, determines how these operations can be parallelized during execution, and what the execution strategy should be to achieve the highest performance in the system.
ECL is a declarative language, so ideally the programmer doesn’t need to define the control flow of the program. A large number of data operations are commutative in nature, and since transferring (big) data is normally very expensive, the optimizer can, for example, move a filter closer to the beginning to reduce the amount of data that is carried over in subsequent operations. Other optimizations such as lazy execution are also utilized to eliminate throwaway code and data structures.
The specific execution plans vary, depending on how the particular data workflow (ECL program) looks like, and the system provides for a graphical display of the exact execution plan that the optimizer determined to be the most appropriate for that workflow. Once you submit a workunit from the ECL IDE, you can visualize the execution plan for that workunit, and even key metrics in each intermediate step which include number of data records processed, data skews and the specific operation represented. As you can see, a complex execution graph is normally subdivided in multiple subgraphs, and many of those operations are parallelized if there is no need for a synchronization barrier (or if the optimizer thinks that excessive parallelization will affect the overall performance negatively).
It is recommended that you download the Virtual Machine and/or binaries of the platform and play with some of the examples that we provide in our portal, to understand how this all works in practice. Although in real life you would never need to tinker with the platform itself, if you feel inclined to seeing how things work under the hood, please feel free to download the C++ source code of the HPCC platform from our GIT repository and take a look at the inner implementation details of the platform and ECL compiler and optimizer.
Another source of reference is the PigMix Benchmark on HPCC.