Data Descriptors and Simple Example Programs
The following Data Descriptors and Simple Example Programs have been provided by members of the HPCC Systems Community.
Gene Data from Pseudomonas
Ingest of data with nested child data sets.
This sample shows how to ingest data that includes nested child sets of strings and nested child data sets. The data used for the sample is Chromosome data downloaded from Pseudomonas.
Ingest Drug Data and Find Drugs Containing Specific Ingredients
This example uses drug data provided by the FDA. It demonstrates how to ingest this data, which comes as multiple separate files, and how to use that data to locate all drugs that contain a particular ingredient.
Files included:
- BWR_FDA_Drugs – A stand-alone BWR that ingests the drug data and outputs all of the drugs that contain a particular ingredient. This BWR doesn’t write out the data to the system.
- BWR_ETL – A BWR that ingests the drug data into the system for later use.
- BWR_FindDrugsByIngredient – A BWR that demonstrates how to use the data ingested into the system to locate all drugs that contain a particular ingredient. BWR_ETL has to be run before running this one.
- Layouts – The record layouts used for this data.
- Datasets – Dataset definitions used for this data.
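The ingredient lookup described above can be pictured as a join between separate files on a shared key. The samples themselves are written in ECL; the following is only a minimal Python sketch of the idea, assuming one file of products and one file of ingredients. The field names (`product_id`, `name`, `ingredient`) and the sample rows are hypothetical and are not taken from the FDA files or from the BWRs.

```python
from collections import defaultdict

# Hypothetical rows standing in for two of the separate input files.
products = [
    {"product_id": 1, "name": "Drug A"},
    {"product_id": 2, "name": "Drug B"},
    {"product_id": 3, "name": "Drug C"},
]
ingredients = [
    {"product_id": 1, "ingredient": "IBUPROFEN"},
    {"product_id": 2, "ingredient": "ACETAMINOPHEN"},
    {"product_id": 2, "ingredient": "CAFFEINE"},
    {"product_id": 3, "ingredient": "Ibuprofen"},
]


def find_drugs_by_ingredient(products, ingredients, wanted):
    """Return the sorted names of products whose ingredients include `wanted`."""
    wanted = wanted.strip().upper()
    # Index the ingredient rows by product key (the "join" step).
    by_product = defaultdict(set)
    for row in ingredients:
        by_product[row["product_id"]].add(row["ingredient"].strip().upper())
    # Keep only the products whose ingredient set contains the requested one.
    return sorted(
        p["name"] for p in products if wanted in by_product[p["product_id"]]
    )


print(find_drugs_by_ingredient(products, ingredients, "ibuprofen"))
```

Normalizing case before comparing is a design choice made for this sketch; whether the real data needs it is not stated in the sample description.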
Surrey Police Spending Analysis
The data source was a set of comma-delimited files containing Surrey Police spending records for each month between April 2009 and March 2010. Pre-processing these files into a single CSV file simplified the spray, although this step could also have been automated in ECL. This code demonstrates cross-tabulation reports using TABLE and a number of the aggregate functions that are standard in ECL. Enjoy! The attached zip file contains the composite CSV file, ready for spray, and the ECL code, ready for import.
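A cross-tabulation report groups records by one or more fields and computes aggregates for each group. The attached sample does this in ECL with TABLE; the following is only a minimal Python sketch of the same idea. The column choice (month, department, amount) and the sample rows are hypothetical and do not come from the Surrey Police CSV file.

```python
from collections import defaultdict

# Hypothetical spending rows: (month, department, amount).
spending = [
    ("2009-04", "Fleet", 1200.0),
    ("2009-04", "IT", 800.0),
    ("2009-05", "Fleet", 300.0),
    ("2009-04", "Fleet", 500.0),
]


def crosstab(rows):
    """Group rows by (month, department) and compute count, total and largest."""
    groups = defaultdict(list)
    for month, department, amount in rows:
        groups[(month, department)].append(amount)
    # One summary record per group, ordered by the grouping key.
    return {
        key: {"count": len(amounts), "total": sum(amounts), "largest": max(amounts)}
        for key, amounts in sorted(groups.items())
    }


for key, stats in crosstab(spending).items():
    print(key, stats)
```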
U.S. Patent and Trademark Office (USPTO) Data
U.S. Patent filing data for researching inventions and statistics.
This example demonstrates some aggregation functions on large datasets created from XML data. Using publicly available patent filings from the U.S. Patent and Trademark Office (USPTO), this code demonstrates how to load U.S. patent filings into the HPCC platform for useful research and analysis, such as finding out how many patents have been filed for a particular invention or discipline or researching patents by company and/or geographical area.
Sigma.js gexf example (Sudoku)
The attached zip file contains the following:
- ECL folders: Sudoku and gexf
- Visualization folder: demos
Load the ECL folders into your ECL repository and deploy the demos folder on your web server. I generally cheat and use ESP as the web server: I copy the demos folder to the /opt/HPCCSystems/componentfiles/files folder on the HPCC Systems server (this is probably breaking all sorts of rules!). Compile and publish the Sudoku query on Roxie, then load Sudoku.html.
Building ECL datasets from ODBC data sources
Query ODBC data sources from within HPCC. Particularly useful in Roxie to connect to transactional databases receiving real-time data.
This program includes a C++ header file and an ECL file. The header file “hpcc-odbc.h” needs to be installed on your system in a location that is in your default include path. On Linux systems, /usr/include is typically a safe bet. Non-default locations can also be accommodated through special compiler options given in the ECL code.
The ECL file “ODBC.ecl” contains an embedded C++ snippet in a BEGINC++ structure, which gets data from the ODBC source and serializes it into the ECL dataset. The layout of the resulting dataset is also specified in the file. An action at the end of the file makes it runnable. The action assumes you have an ODBC DSN called ‘MySQL-Test’ which contains a table called ‘hotlist’. You can change the name of the DSN and the SQL statement to whatever you like.
KNOWN LIMITATION: The code is known to freeze at times in a locked state when executing multiple SQL statements in parallel with connection pooling configured to be ON in the system ODBC configuration. If you experience this, try turning off your connection pooling, or change your driver threading level.
National Highway Traffic Safety Administration (NHTSA)
A database of over 850,000 consumer complaints reported to the NHTSA over the last twenty years, covering automobiles and automobile accessories.
The NHTSA Complaints database contains complaints reported to the National Highway Traffic Safety Administration (NHTSA) by consumers for nearly the last twenty years. It consists of over 850,000 records and includes complaints regarding not only automobiles, but also automobile accessories.
Project Gutenberg ebook feed analysis
Analyze feed data from Project Gutenberg.
This sample code uses Project Gutenberg’s feeds to perform some basic querying and analysis that looks for potentially miscategorized books.
MusicMoz Artists, Albums and Tracks
I like music, and when I saw the MusicMoz Public Data Source I wanted to take it for a spin in HPCC and ECL. I was amazed at the simplicity and power of ECL when processing a fairly complex XML file.
This submission showcases the power and support of ECL with XML documents. The RECORD definitions allow you to drill down to the information that you need, and the string library support allows you to easily clean redundant or obsolete data as needed. The result is a powerful and quick way to search for music artists and their albums and tracks in the HPCC.
Airline Performance Monitor (APM)
Keeping the Airline Industry on its feet!
Using data from the Bureau of Transportation Statistics, this code allows you to trivially find out what flights got delayed or cancelled from a given airport or on a given day. For example, in no time at all, you can get the information about all the flights that got delayed from LAX and the factors contributing to those delays or you can get information about all the flights that got cancelled on 5th February, 2011 and so on.
FINCEN Money Services List
Locate Check Cashing businesses or Currency Dealers in your area.
This sample code reads in the Financial Crimes Enforcement Network’s list of money services businesses (names, addresses and services provided) across the US. You can use this list to locate and count the check cashing businesses in your area.
“Sentilyze” Twitter Sentiment Analysis
Sentilyze classifies tweets with positive or negative sentiment. Also included is Language Classification. (UPDATED 9/6/12, see thread in the ‘Contributors’ subforum in ‘Forums’ for details).
Documentation is included in the .zip file. All files are included except the data from Twitter, because redistributing it is not allowed under Twitter’s TOS; you will have to acquire that data yourself. If you have questions about this, please comment and I will help out as much as I can. You can also get the ECL by cloning the Machine Learning Library repo at https://github.com/hpcc-systems/ecl-ml/.
Like the IMDB data, it is a simple example of a graph, the Wikipedia Graph, and the foundation from which more graph examples can be built. Unlike IMDB, this is a Directed Graph, so some interesting insights can be gleaned. You can have fun looking at pages that have the most inbound, outbound or bi-directional links, but there is so much more you can do once you wrap your mind around it. “Which US President/Vice President has the most influence in WikipediaLand!?”
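Counting inbound, outbound and bi-directional links in a directed graph can be sketched as follows. This is only an illustrative Python example, not the ECL from the sample; the page names and links are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical directed links: (source page, target page).
links = [("A", "B"), ("B", "A"), ("A", "C"), ("D", "A")]


def link_stats(links):
    """Return (inbound counts, outbound counts, set of mutually linked pairs)."""
    edges = set(links)  # ignore duplicate links between the same two pages
    outbound = Counter(src for src, _ in edges)
    inbound = Counter(dst for _, dst in edges)
    # A pair is bi-directional when both (s, d) and (d, s) exist.
    # Each pair is stored once, with the page names sorted.
    mutual = {
        tuple(sorted((s, d))) for s, d in edges if s != d and (d, s) in edges
    }
    return inbound, outbound, mutual


inbound, outbound, mutual = link_stats(links)
print(inbound.most_common(1), outbound.most_common(1), mutual)
```

Because the graph is directed, the inbound and outbound counts for a page can differ, which is what makes rankings such as "most linked-to page" meaningful.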