The Download: Tech Talks by the HPCC Systems Community, Episode 6
On August 1, 2017, HPCC Systems hosted the latest edition of The Download: Tech Talks. These technically-focused talks are for the community, by the community. The Download: Tech Talks is intended to provide continuing education through high quality content and meaningful development insight throughout the year.
Episode Guest Speakers and Subjects:
Lorraine Chapman, Consulting Business Analyst, LexisNexis Risk Solutions
Lorraine starts the session off with information on the HPCC Systems intern program and a special introduction to each of our featured interns.
Lorraine has worked alongside software developers for over 20 years in supporting roles ranging from producing documentation, including developing online help systems, to software testing and release management.
Lorraine joined LexisNexis in 2004 and, as well as continuing to work alongside the HPCC Systems platform development team, also administers the HPCC Systems Intern Program and manages our application to be an accepted organization for Google Summer of Code.
Lily Xu – PhD student of Computer Science at Clemson University
Lily presents: Extending the YinYang K-Means machine learning algorithm in ECL
Vivek Nair – PhD student of Computer Science at North Carolina State University
Vivek presents: Working to allow Spark to use HPCC Systems as a datastore and provide ECL programmers with the ability to access Spark algorithms.
George Mathew – PhD student of Computer Science at North Carolina State University
George presents: Implementing the Gradient Trees machine learning algorithm in ECL to build predictive models.
Key Discussion Topics:
1:05- Flavio Villanustre provides community updates:
HPCC Systems 6.4 is now gold! Features include:
- More performance improvements on Roxie
- New ML Bundles for Logistic Regression & Linear Regression
- Colorization & icon options in ECL IDE
- Extended embedded language support for R, Python & SWS AWS plugins
- Enhanced support for Dynamic ESDL
- WsSQL 6.4.0 and wsclient 1.2 coming soon!
- Reminder: Call for Poster Abstracts still open for the 2017 HPCC Systems Community Day!
11:24- Lorraine Chapman- HPCC Systems Intern Program
Lorraine reviews the HPCC Systems intern program and shares key information about it.
16:25- Lily Xu: YinYang K-Means Clustering Algorithm in HPCC Systems
Lily explains what clustering is and how to cluster data using the clustering algorithms in the HPCC Systems Machine Learning Library.
Lily also discusses her intern experience.
Q: Should YinYang K-means provide results equivalent to standard K-means, or do you expect it to be less accurate due to the optimization?
A: You achieve higher speed through better optimization, but exactly the same accuracy in the result.
Q: What have been your biggest challenges when coding and optimizing this algorithm in ECL?
A: Last summer, the biggest challenge was the learning curve for ECL. This summer, my challenge has been really understanding the algorithm and making the sequential algorithm work well on the HPCC Systems platform.
Q: Which of these makes YinYang faster than standard K-means: a large number of points, a large number of dimensions, or a large number of centroids? When should I prefer it over standard K-means?
A: This depends on the environment. In the original sequential environment, YinYang K-means should be faster in all of these respects. HPCC Systems differs from a sequential programming environment, so YinYang K-means will outperform standard K-means when there are a large number of dimensions and centroids as well.
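The answers above can be illustrated with a minimal NumPy sketch (an illustration of the idea only, not the ECL implementation from the HPCC Systems Machine Learning Library; all function names here are hypothetical). It contrasts plain Lloyd's K-means with a simplified YinYang-style global filter: triangle-inequality bounds let most points skip the full distance scan in each iteration, while the assignments and centers stay exactly the same, which is why the optimization costs no accuracy.

```python
import numpy as np

def lloyd_update(X, C, a):
    """Exact Lloyd center update: each center moves to the mean of its points."""
    C = C.copy()
    for j in range(len(C)):
        pts = X[a == j]
        if len(pts):
            C[j] = pts.mean(axis=0)
    return C

def kmeans_naive(X, C, iters=50):
    """Plain Lloyd's K-means: full distance scan for every point, every iteration."""
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        a = d.argmin(axis=1)
        C = lloyd_update(X, C, a)
    return a, C

def kmeans_yinyang_style(X, C, iters=50):
    """Same Lloyd iterations, but triangle-inequality bounds skip 'easy' points."""
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    a = d.argmin(axis=1)
    srt = np.sort(d, axis=1)
    ub, lb = srt[:, 0].copy(), srt[:, 1].copy()  # dist to assigned / next-best center
    for _ in range(iters):
        newC = lloyd_update(X, C, a)
        drift = np.linalg.norm(newC - C, axis=1)  # how far each center moved
        C = newC
        ub += drift[a]       # assigned center moved at most its own drift
        lb -= drift.max()    # every other center moved at most the max drift
        hard = ub >= lb      # only these points could possibly switch centers
        if hard.any():
            d = np.linalg.norm(X[hard, None, :] - C[None, :, :], axis=2)
            a[hard] = d.argmin(axis=1)
            srt = np.sort(d, axis=1)
            ub[hard], lb[hard] = srt[:, 0], srt[:, 1]
    return a, C
```

When `ub < lb`, the assigned center is provably still the closest, so the scan is skipped without changing the result; the savings grow with the number of centroids, matching Lily's answer.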
43:30- George Mathew- Gradient Boosting Trees
George reviews implementing the Gradient Trees machine learning algorithm in ECL to build predictive models.
Q: This is super cool! When do we expect to get a bundle?
A: George hopes to put a bundle together. Keep watching the forum for more information.
Q: What would be the reason to select gradient boosting and a weak learner instead of a strong learner? Is it because there are less hyper-parameters to deal with? Or is there an entirely different reason?
A: A strong learner would require a lot of memory, and its parameters would be more difficult to estimate. A weak learner consumes less memory but may increase run time. If you can trade off run time against memory, this is a good approach to take.
Q: You are mostly talking about using gradient boosting with decision trees, but wouldn’t it be better to just use random forests instead? Or should both be combined, using gradient boosting with random forests?
A: These are two different ensemble techniques. With random forest, you have multiple strong classifiers and you select the output value by polling the results of all the classifiers. Random forest has the advantage over gradient boosting that it can run in parallel. In gradient boosting, each classifier is a weak classifier, so you can use gradient boosting to prioritize memory. Random forest can also be supercharged with gradient boosting to reduce the number of trees as well as the number of iterations.
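The weak-learner trade-off discussed above can be sketched in a few lines of NumPy (a toy illustration, not George's ECL implementation; the helper names are hypothetical): each boosting round fits a one-split decision stump to the residuals, which are the negative gradient of the squared loss. Each individual model is tiny in memory, and the ensemble becomes strong only in aggregate over many rounds, which is the run-time cost George mentions.

```python
import numpy as np

def fit_stump(X, r):
    """Fit a one-split decision stump to residuals r by least squares."""
    best = (np.inf, 0, 0.0, r.mean(), r.mean())
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            lv, rv = r[left].mean(), r[~left].mean()
            err = ((r[left] - lv) ** 2).sum() + ((r[~left] - rv) ** 2).sum()
            if err < best[0]:
                best = (err, j, t, lv, rv)
    _, j, t, lv, rv = best
    return lambda Z: np.where(Z[:, j] <= t, lv, rv)

def gradient_boost(X, y, n_rounds=50, lr=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = y.mean()
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        residual = y - pred          # negative gradient of squared loss
        h = fit_stump(X, residual)   # weak learner: tiny memory footprint
        stumps.append(h)
        pred = pred + lr * h(X)      # shrinkage step
    def predict(Z):
        out = np.full(len(Z), base)
        for h in stumps:
            out = out + lr * h(Z)
        return out
    return predict
```

Note the sequential dependency: round t needs the residuals left by round t-1, which is why gradient boosting cannot be parallelized across rounds the way random forest can across trees.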
1:07:05- Vivek Nair: Spark-HPCC: HPCC Systems with Spark
Vivek discusses working to allow Spark to use HPCC Systems as a datastore and provide ECL programmers with the ability to access Spark algorithms.
Vivek discusses his project in detail.
Vivek also discusses possible solutions and shows a demonstration of how to run Spark using HPCC Systems data.
1:26:55 – Q&A
Q: Do you have an estimation for the performance penalty of using HPCCFuseJ to access data on HPCC from Spark?
A: HPCCFuseJ does not download all of the data from HPCC Systems; instead, it reads chunks of the data. Large data sets become performance-heavy, and our results show that performance decays with the size of the data. We would like to see this speed improve.
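The chunked-access pattern described in this answer can be sketched in Python (a hypothetical illustration of the general idea, not HPCCFuseJ's actual code): data is streamed in fixed-size chunks on demand, so memory stays bounded, but the total number of reads, and hence total transfer time, still grows with the size of the data, which is the performance decay Vivek describes.

```python
def chunked_read(path, chunk_size=1 << 20):
    """Lazily yield fixed-size chunks instead of loading the whole file.

    Memory use is bounded by chunk_size, but the number of reads --
    and so total transfer time -- still grows with the file size.
    """
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk
```

A consumer then processes each chunk as it arrives, e.g. `for chunk in chunked_read("part.dat", 64 * 1024): process(chunk)`, never holding the full dataset in memory.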
Q: Are there any limitations on the structure or type of the HPCC data that Spark can process?
A: Currently, HPCCFuseJ is only available for Thor files. It exposes the data as JSON, and on the Spark platform that JSON structure is directly readable by Spark.
Q: How does HPCC stand up to Spark in general working with similar data? Just comparing each system separately, not the connectivity.
A: This is not related to the project, but we are doing something similar in directly comparing HPCC Systems and Spark. We hope to have some results soon, including a comparison of these two systems.
Q: Does HPCCFuseJ allow a remote user command-line access to data stored in an HPCC cluster?
A: HPCCFuseJ acts like a plugin, translating file operations into web service calls that talk to the cluster.
Have a new success story to share? We would welcome you to be a speaker at one of our upcoming The Download: Tech Talks episodes.
- Want to pitch a new use case?
- Have a new HPCC Systems application you want to demo?
- Want to share some helpful ECL tips and sample code?
- Have a new suggestion for the roadmap?
Be a featured speaker for an upcoming episode! Email your idea to Techtalks@hpccsystems.com
Visit The Download Tech Talks wiki for more information about previous speakers and topics: https://wiki.hpccsystems.com/display/hpcc/HPCC+Systems+Tech+Talks
Watch past episodes of The Download:
Links to resources mentioned in Tech Talk 6:
- Blogs about the program: https://hpccsystems.com/blog
- Available projects: https://wiki.hpccsystems.com/x/yIBc
- Previously completed projects: https://wiki.hpccsystems.com/x/g4BR
- Student wiki: https://wiki.hpccsystems.com/x/HwBm
- HPCC Systems Technical Presentation Competition 2016: https://wiki.hpccsystems.com/x/FQCv
More about HPCC Systems Community Day 2017
The HPCC Systems Summit Community Day and new training workshop are taking place October 3 & 4 in Buckhead, Atlanta, Georgia. Learn More and Register
- Community Day will be held in Atlanta on October 4, 2017
- Poster Competition held on October 3, submission instructions available on the Wiki
- Thank you to our Sponsors! Datum, Infosys and Cognizant!
NEW THIS YEAR! Pre-Event Workshop on October 3 – Mastering Your Big Data with ECL
- Registration is open to the public to attend
- Details at https://hpccsystems.com/hpccsummit2017
- This class is for attendees who want to understand the HPCC Systems platform and learn ECL to build powerful data queries. Anyone who needs basic familiarity with ECL and wants to learn best practices should attend. The one-day class will take the student through the entire ETL cycle, from Spray (Extract) to Transform (THOR) and finally to Load (ROXIE).
Part 1: Data Extraction and Transformation
Quick overview of THOR cluster, and the parallel distributed data processing concept, setting up a cluster, ECL Watch overview, spraying data, ECL IDE, ECL language essentials, and more…
Part 2: Prepare the Data Search Engine
Defining and building an INDEX, getting single and batch results, data indexing, filtering and normalization, searching, and more…
Part 3: Write and Publish ROXIE query
Call Search, Implicit function, publish in ECL Watch, test in WS-ECL, and more…