Data Descriptors and Simple Example Programs

The following Data Descriptors and Simple Example Programs have been provided by members of the HPCC Systems Community.

Building ECL datasets from ODBC data sources

Query ODBC data sources from within HPCC. This is particularly useful in Roxie for connecting to transactional databases that receive real-time data.

This program includes a C++ header file and an ECL file. The header file "hpcc-odbc.h" must be installed in a location on your default include path; on Linux systems, /usr/include is typically a safe bet. Non-default locations can also be accommodated through special compiler options given in the ECL code.

The ECL file "ODBC.ecl" contains the embedded C++ snippet in a BEGINC++ structure, which gets data from the ODBC source and serializes it into the ECL dataset. The file also specifies the layout of the resulting dataset and ends with an action that makes it runnable. The action assumes you have an ODBC DSN called 'MySQL-Test' that contains a table called 'hotlist'. You can change the name of the DSN and the SQL statement to whatever you like.

KNOWN LIMITATION: The code is known to freeze at times in a locked state when executing multiple SQL statements in parallel with connection pooling turned ON in the system ODBC configuration. If you experience this, try turning off connection pooling or changing your driver threading level.

Sigma.js gexf example (Sudoku)

This gives you a working example of implementing a Roxie service that returns a custom XML format (in this case GEXF, a graph XML format).

The attached zip file contains the following:

  • ECL folders: Sudoku and gexf
  • Visualization folder: demos

Load the ECL folders into your ECL repository and deploy the demos folder on your web server. I generally cheat and use ESP as the web server: I copy the demos folder to the /opt/HPCCSystems/componentfiles/files folder on the HPCC Systems server (this is probably breaking all sorts of rules!). Then compile and publish the Sudoku query on Roxie and load Sudoku.html.

"Sentilyze" Twitter Sentiment Analysis

Sentilyze classifies tweets as having positive or negative sentiment. Language classification is also included. (UPDATED 9/6/12, see thread in the 'Contributors' subforum in 'Forums' for details).

Documentation is included in the .zip file. All files are included except the data from Twitter, because redistributing it is not allowed under Twitter's terms of service; you will have to acquire that data yourself. If you have questions about this, please comment and I will help out as much as I can. You can also get the ECL by cloning the Machine Learning Library repo at https://github.com/hpcc-systems/ecl-ml/.

FINCEN Money Services List

Locate Check Cashing businesses or Currency Dealers in your area

This sample code reads in the Financial Crimes Enforcement Network's list of money services businesses (names, address and services provided) across the US. You can use this list to locate and count how many check cashing businesses are in your area.

National Highway Traffic Safety Administration (NHTSA)

Database of complaints reported to the NHTSA by consumers over the last twenty years, consisting of over 850,000 records covering automobiles and automobile accessories.

The NHTSA Complaints database contains complaints reported to the National Highway Traffic Safety Administration (NHTSA) by consumers over nearly the last twenty years. It consists of over 850,000 records and includes complaints regarding not only automobiles, but also automobile accessories.

Heritage Healthcare Prize

Analyze Healthcare data and prepare a contest entry for the Heritage Healthcare Prize.

The Heritage Provider Network is sponsoring a contest whose goal is to develop a predictive algorithm that can identify patients who will be admitted to the hospital within the next year using historical claims data. This example shows how to create a simple entry using ECL.

WikipediaStats

This is an initial draft of Wikipedia Descriptors that focus on pages and their links to other Wikipedia Pages, aka "The Wikipedia Graph".

Like the IMDB data, it is a simple example of a graph, the Wikipedia Graph, and the foundation from which more graph examples can be built. Unlike IMDB, this is a Directed Graph, so some interesting insights can be gleaned. You can have fun looking at pages that have the most inbound, outbound or bi-directional links, but there is so much more you can do once you wrap your mind around it. "Which US President/Vice President has the most influence in WikipediaLand!?"
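To make the inbound, outbound and bi-directional link counts concrete, here is a minimal sketch in Python rather than ECL. It is not the code from the submission: the page names and the link_stats helper are hypothetical, and the sample links are made up for illustration.

```python
from collections import Counter

# Hypothetical sample links as (source page, target page) pairs.
# These are illustrative only and not taken from the Wikipedia data.
links = [
    ("A", "B"),
    ("B", "A"),
    ("A", "C"),
    ("D", "A"),
    ("C", "D"),
]


def link_stats(edges):
    """Count outbound, inbound and bi-directional links per page."""
    edge_set = set(edges)  # ignore duplicate links
    outbound = Counter(src for src, _ in edge_set)
    inbound = Counter(dst for _, dst in edge_set)
    # A link is bi-directional when the reverse edge also exists.
    bidirectional = Counter(
        src for src, dst in edge_set if src != dst and (dst, src) in edge_set
    )
    return outbound, inbound, bidirectional


outbound, inbound, bidirectional = link_stats(links)
for page in sorted({p for edge in links for p in edge}):
    print(page, outbound[page], inbound[page], bidirectional[page])
```

Because the graph is directed, the inbound and outbound counts for a page can differ, which is what makes questions about "influence" interesting.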

Airline Performance Monitor (APM)

Keeping the Airline Industry on its feet!

Using data from the Bureau of Transportation Statistics, this code allows you to trivially find out what flights got delayed or cancelled from a given airport or on a given day. For example, in no time at all, you can get the information about all the flights that got delayed from LAX and the factors contributing to those delays or you can get information about all the flights that got cancelled on 5th February, 2011 and so on.

MusicMoz Artists, Albums and Tracks

I like music, and when I saw the MusicMoz Public Data Source I wanted to take it for a spin in HPCC and ECL. I was amazed at the simplicity and power of ECL when processing a fairly complex XML file.

This submission showcases the power and support of ECL with XML documents. The RECORD definitions allow you to drill down to the information that you need, and the string library support allows you to easily clean redundant or obsolete data as needed. The result is a powerful and quick way to search for music artists and their albums and tracks in the HPCC.

Project Gutenberg ebook feed analysis

Analyze feed data from Project Gutenberg

This sample code uses Project Gutenberg's feeds and performs some basic querying and analysis to find potentially miscategorized books.

U.S. Patent and Trademark Office (USPTO) Data

U.S. Patent filing data for researching inventions and statistics.

This example demonstrates some aggregation functions on large datasets created from XML data. Using publicly available patent filings from the U.S. Patent and Trademark Office (USPTO), this code demonstrates how to load U.S. patent filings into the HPCC platform for useful research and analysis, such as finding out how many patents have been filed for a particular invention or discipline or researching patents by company and/or geographical area.

Surrey Police Spending Analysis

Via the web, I traveled to the United Kingdom and found a public data source pertaining to the Surrey Police Department. Using ECL, I was able to generate some nifty cross-tabulation reports and analyze their spending habits in more detail.

The data source was a set of comma-delimited files containing Surrey Police spending records for each month between April 2009 and March 2010. Pre-processing these files into a single CSV file simplified the spray, although this step could also have been automated in ECL. This code demonstrates cross-tabulation reports using TABLE and a number of aggregate functions that are standard in ECL. Enjoy! The attached zip file contains the composite CSV file and the ECL code, ready for spray and import respectively.
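For readers unfamiliar with cross-tabulation, the idea can be sketched in a few lines of Python: group the records by one field and compute aggregates per group. This is only an analogue of what the ECL code does with TABLE, not a translation of it. The field names, the cross_tab helper and the sample rows are all hypothetical and do not come from the Surrey Police files.

```python
from collections import defaultdict

# Hypothetical spending rows: (month, expense_area, amount_in_pounds).
# Made-up values for illustration; the real CSV layout may differ.
rows = [
    ("2009-04", "Fleet", 1200),
    ("2009-04", "IT", 800),
    ("2009-05", "Fleet", 300),
    ("2009-05", "IT", 950),
    ("2009-05", "IT", 50),
]


def cross_tab(records):
    """Group by expense area and compute count, total and largest payment."""
    groups = defaultdict(list)
    for _month, area, amount in records:
        groups[area].append(amount)
    return {
        area: {"count": len(amounts), "total": sum(amounts), "max": max(amounts)}
        for area, amounts in sorted(groups.items())
    }


report = cross_tab(rows)
for area, stats in report.items():
    print(area, stats)
```

Grouping by month instead of expense area only requires changing which field is used as the dictionary key.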

Ingest Drug Data and Find Drugs Containing Specific Ingredients

This example uses drug data provided by the FDA. It demonstrates how to ingest this data, which comes as multiple separate files, and how to use that data to locate all drugs that contain a particular ingredient.

Files included:

  • BWR_FDA_Drugs - A stand-alone BWR that ingests the drug data and outputs all of the drugs that contain a particular ingredient. This BWR doesn't write out the data to the system.
  • BWR_ETL - A BWR that ingests the drug data into the system for later use.
  • BWR_FindDrugsByIngredient - A BWR that demonstrates how to use the data ingested into the system to locate all drugs that contain a particular ingredient. BWR_ETL has to be run before running this one.
  • Layouts - The record layouts used for this data.
  • Datasets - Dataset definitions used for this data.

Gene Data from Pseudomonas

Ingest of data with nested child data sets.

This sample shows how to ingest data that includes nested child sets of strings and nested child data sets. The data used for the sample is Chromosome data downloaded from Pseudomonas.

Zip Code Tabulation Areas (ZCTA)

U.S. Census Bureau Zip Code Tabulation Area (ZCTA) data and its use in zip code-to-zip code distance calculations.

Using data from the U.S. Census Bureau, this code allows you to trivially calculate distances between all zip codes. This permits extremely fast lookups for geospatial calculations on the zip code level -- for example, in determining what cities are neighbors of a given city, how far it is (as the crow flies) between two different zip codes, and so on.
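An "as the crow flies" distance between two zip codes is commonly computed as a great-circle distance between their latitude/longitude points. The Python sketch below uses the haversine formula for this; it is an illustration under stated assumptions, not the method used in the submitted ECL code. The centroids dictionary, its approximate coordinates and the haversine_km helper are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical, approximate centroids (latitude, longitude) for two zip codes.
# These values are for illustration and are not read from the ZCTA data.
centroids = {
    "10001": (40.75, -73.99),
    "90001": (33.97, -118.25),
}

distance = haversine_km(*centroids["10001"], *centroids["90001"])
print(f"10001 -> 90001: {distance:.0f} km")
```

Precomputing these distances for every pair of zip codes, as the description suggests, turns later lookups into simple key matches.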

Share your Code

Share your code with the HPCC Systems Community.
