Want to see HPCC Systems in action?
Check out these interactive demos on the key features and components within the HPCC Systems platform.
Data profiling is a key step in data analysis and data integration whenever a new data file needs to be processed. This demo shows how the ECL language can be used to analyze and profile raw data. The user pastes a .csv (comma-separated values) data file into a query window and receives a data profiling report as output, plus an ECL record definition that could be used with the data file. For each field, the report includes cardinality, record counts by field length, record counts by number of word tokens in the field, record counts by character appearing in the field values, record counts for the top patterns, and record counts for the most frequent values. Analyzing this report quickly reveals anomalies, invalid values, or other problems with the data.
- Data profiling can be accomplished easily using the ECL language as a data analysis tool.
- Application code can be developed on a Thor cluster and then deployed to Roxie to provide a useful online service.
- The ECL language supports a flexible data model as shown by the child datasets in the data profiling output dataset/report. ECL can operate on data in any format, structured or unstructured.
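The per-field statistics listed in the report description can be illustrated with a short sketch. It is written in Python rather than ECL and is not the demo's code. The function names (`pattern_of`, `profile_csv`), the pattern notation (`A`/`a` for letters, `9` for digits), and the sample data are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def pattern_of(value):
    """Reduce a value to its shape: uppercase -> A, lowercase -> a, digit -> 9.

    This notation is an assumption; the demo's own pattern format may differ.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)  # keep punctuation and spaces as-is
    return "".join(out)


def profile_csv(text, top_n=5):
    """Profile every column of CSV text whose first row is a header."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames or []}
    for row in reader:
        for name in columns:
            columns[name].append(row.get(name) or "")

    report = {}
    for name, values in columns.items():
        report[name] = {
            # number of distinct values in the field
            "cardinality": len(set(values)),
            # record counts by length of field
            "by_length": Counter(len(v) for v in values),
            # record counts by number of whitespace-separated word tokens
            "by_word_count": Counter(len(v.split()) for v in values),
            # number of records in which each character appears
            "by_character": Counter(ch for v in values for ch in set(v)),
            # most common shapes and most frequent raw values
            "top_patterns": Counter(pattern_of(v) for v in values).most_common(top_n),
            "top_values": Counter(values).most_common(top_n),
        }
    return report


# Hypothetical sample data with one malformed zip code.
sample = "name,zip\nAnn Lee,33487\nBob,3348X\nAnn Lee,30005\n"
report = profile_csv(sample)
```

In this sample, the `zip` field's top patterns would show two records matching `99999` and one matching `9999A`, which is the kind of anomaly the profiling report is meant to surface.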
This entire demo is built end to end using HPCC Systems. The ETL prep and data delivery queries are all coded in ECL. The search results are ordered by a ranking algorithm. The demo shows how Roxie queries can deliver data in a format and shape that make it easy to integrate Web network visualizations such as Sigma.js. Building the demo involved a simple example of processing XML, parsing links from Wikipedia pages using PARSE, calculating Google PageRank in ECL, and creating Roxie queries that integrate with the visualization and deliver indexed data rapidly.
- Inbound Only links are calculated by searching all 16 million Wikipedia pages to find which pages point at a particular page, then matching those against that page's outbound links to keep the links that are not reciprocated.
- The Ranked Search query finds page titles that start with the text entered and returns the matching pages ordered with the highest-ranking page first.
- Graph visualization uses Sigma.js, which natively handles the GEXF graph XML output of the ECL queries, to show the page rank of Wikipedia pages visually.
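The link calculations described above can be sketched on a toy graph. The sketch is in Python rather than ECL and is not the demo's implementation. It assumes the textbook PageRank iteration with a damping factor of 0.85, and the function names, page titles, and link data are hypothetical.

```python
def inbound_only(links, page):
    """Pages that link to `page` but are not linked back from it."""
    inbound = {src for src, targets in links.items() if page in targets}
    return inbound - links.get(page, set())


def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank over a dict of page -> set of outbound links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outbound links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def ranked_search(rank, prefix, limit=10):
    """Titles starting with `prefix` (case-insensitive), highest rank first."""
    wanted = prefix.lower()
    matches = [p for p in rank if p.lower().startswith(wanted)]
    return sorted(matches, key=lambda p: rank[p], reverse=True)[:limit]


# Hypothetical four-page link graph.
links = {
    "Alpha": {"Beta", "Graph"},
    "Beta": {"Graph"},
    "Graph": {"Alpha"},
    "Grid": {"Graph"},
}

rank = pagerank(links)
unreciprocated = inbound_only(links, "Graph")
results = ranked_search(rank, "gr")
```

Here "Graph" is linked from "Alpha", "Beta", and "Grid" but links back only to "Alpha", so "Beta" and "Grid" are its inbound-only links. A search for "gr" lists "Graph" ahead of "Grid" because it collects more rank from its inbound links.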
Cancer Rate Demo
The data for the examples was taken from http://seer.cancer.gov/.
The Surveillance, Epidemiology, and End Results (SEER) Program of the National Cancer Institute works to provide information on cancer statistics.
The SEER research data includes SEER incidence and population data associated by age, sex, race, year of diagnosis, and geographic areas (including SEER registry and county).
SEER collects data on cancer cases from various locations and sources throughout the United States. Data collection began in 1973 with a limited number of registries and continues to expand to include more areas and demographics today.
About the Demo:
As an initial demo, the SEER data was cleaned and analyzed using the HPCC Systems platform to derive a set of useful reports. The next goal is to perform predictive analytics using the machine learning capabilities of the HPCC Systems platform.