The Download: Tech Talks by the HPCC Systems® Community, Episode 15

On June 28, 2018, HPCC Systems hosted the latest edition of The Download: Tech Talks. This series of workshops is designed for the community, by the community, with the goal of sharing knowledge, sparking innovation, and strengthening relationships within the HPCC Systems community.

Watch the Tech Talk  

Links to resources mentioned in Tech Talk 15:

Presentation Slides

Support Forums

New Beta Release

TensorFlow Repository

ECL Tutorial

ECL Playground

Sample Code

Episode Guest Speakers and Subjects

Jingqing Zhang, PhD Candidate, Data Science Institute, Imperial College London, Jingqing.zhang15@imperial.ac.uk

Jingqing Zhang is a first-year PhD student (HiPEDS) in the Department of Computing, Imperial College London, under the supervision of Prof. Yi-Ke Guo. His research interests include Text Mining, Data Mining, Deep Learning and their applications. He received his MRes degree in Computing from Imperial College with Distinction in 2017 and his BEng in Computer Science and Technology from Tsinghua University in 2016.

Bob Foreman, Senior Software Engineer, LexisNexis Risk Solutions, Robert.Foreman@lexisnexisrisk.com

Bob Foreman has worked with the HPCC Systems technology platform and the ECL programming language for over 5 years and has been a technical trainer for over 25 years. He is the developer and designer of the HPCC Systems Online Training Courses and is the Senior Instructor for all classroom and Webex/Lync-based training.

Key Discussion Topics:

1:11 – Flavio Villanustre provides community updates:

HPCC Systems Platform updates

  • 6.4.20-1 Gold available
  • 7.0.0-beta2 available

    • Better performance and usability
    • New ECL language and library features
    • WsSQL now integrated into the platform
    • Spark-HPCC Systems Connector to read Thor files natively
    • Download today – we need feedback!

Latest Blogs

Systemd – Easier management of your HPCC Systems components

First Look – HPCC Systems log visualizations using ELK

HPCC Systems 7.0.0 beta release – Try it now!

Reminder

  • 2018 HPCC Systems Community Day, Atlanta
  • We need speakers! CFP deadline on July 20
  • Sponsor packages still available
  • Workshop & Poster Competition on October 8
  • Main event on October 9
  • Visit hpccsystems.com/hpccsummit2018

8:45 – Jingqing Zhang, Imperial College London – Deep Sequence Learning in Traffic Prediction and Text Classification

Jingqing introduces two recent projects, both of which use deep learning models. The traffic prediction project (accepted by KDD’18) releases a new large-scale traffic dataset with auxiliary information, including search queries from the Baidu Map app, and proposes hybrid models that achieve state-of-the-art prediction accuracy. The text classification project uses a knowledge graph and a two-step classification policy to achieve zero-shot learning.

31:22 – Q&A

Q: Why do you use a two-phase framework to classify text?

A: Our framework has two phases. The first phase is a confidence prediction with seen classes only. The second phase uses seen classes or unseen classes. We find that the model is more reliable on seen classes than on unseen classes during testing. Therefore, we use the second phase to transfer knowledge from the seen classes to the unseen classes, based on the class hierarchy, so that the prediction of unseen classes can be more accurate.
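To make the two-phase idea concrete, here is a minimal Python sketch. It is not the speaker's model: the cosine similarity against hand-made class vectors is only a stand-in for trained classifiers and knowledge-graph features, and the names `two_phase_predict` and `threshold` are hypothetical. It only shows the control flow of scoring the seen classes first and falling back to the unseen classes when confidence is low.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_phase_predict(doc_vec, seen_classes, unseen_classes, threshold=0.5):
    """Return (class_name, "seen" | "unseen") for one document vector.

    seen_classes / unseen_classes map a class name to a vector.
    The vectors and the threshold are illustrative assumptions.
    """
    # Phase 1: confidence prediction using the seen classes only.
    confidences = {name: cosine(doc_vec, vec) for name, vec in seen_classes.items()}
    best_seen = max(confidences, key=confidences.get)

    # Phase 2: stay with the seen classes when phase 1 is confident,
    # otherwise choose among the unseen classes.
    if confidences[best_seen] >= threshold:
        return best_seen, "seen"
    scores = {name: cosine(doc_vec, vec) for name, vec in unseen_classes.items()}
    return max(scores, key=scores.get), "unseen"


seen = {"sports": [1.0, 0.0, 0.0], "politics": [0.0, 1.0, 0.0]}
unseen = {"science": [0.0, 0.0, 1.0]}
print(two_phase_predict([0.9, 0.1, 0.0], seen, unseen))
print(two_phase_predict([0.1, 0.1, 0.9], seen, unseen))
```

In a real system the unseen-class vectors would have to come from side information such as a class hierarchy or knowledge graph, since no training documents exist for those classes.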

Q: How many events were discovered based on the search queries?

A: The data we collected covers April 1, 2017 to May 31, 2017, so basically 2 months. During this time in Beijing, we discovered 932 events. You can find more details in our paper.

If you have additional questions, please contact Jingqing Zhang.

33:50 – Bob Foreman, LexisNexis Risk Solutions – ECL Summer Code Camp Review

On May 16, 2018, five HPCC Systems Ambassadors along with Flavio Villanustre met with eight iRISE2 members for a two-hour ECL Code Camp. The event was a great success and Bob shares with the community what they did and some of the ECL ideas that came out of it. Tips from Data Ingestion to ECL to Data Evaluation will be included in this segment.

59:14 – Q&A

Q: To import the raw data, a dynamic file technique was used. Would super files be a better way to go?

A: Well, I think so. In the everyday operations of LexisNexis we use super files a lot. The attached code also includes the super file implementation, because I wanted to test whether there was a performance hit with either technique. The answer is that with a single definition, a single super file, the project becomes extensible. Instead of manually copying and pasting new files into a dynamic file (which can grow over time), you just add them to the super file, and the coder only has to reference the super file definition. Whether there are 18, 24 or 36 months of data, the code doesn’t have to change; the data changes dynamically.

Q: What other profiling tools are currently available?

A: That is a great question. On GitHub, we have a download called Data Patterns. It captures some of the best practices that we use for profiling data, and it is downloadable ECL that you can apply to any data set. We also have a product called SALT that is not open source; the profiling part of SALT was exposed and made open source as pure ECL. All you have to do is pass it the data set, and it will give you complete profiling results for the data you are working with. It is very processor-intensive code, so you have to run it on a large cluster. If you try to run it on a small cluster, you may run out of memory, or it will take too long to be useful.
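For readers unfamiliar with data profiling, the following Python sketch shows the kind of per-field statistics a profiler typically reports. It is not the Data Patterns code and does not use its API; the function name `profile` and the chosen statistics (fill rate, cardinality, value lengths) are assumptions made for illustration.

```python
def profile(rows):
    """Compute simple per-field statistics for a list of dict records.

    Assumes every record has the same keys and that an empty or
    whitespace-only value means the field is missing.
    """
    total = len(rows)
    report = {}
    fields = rows[0].keys() if rows else []
    for field in fields:
        values = [str(r[field]) for r in rows]
        filled = [v for v in values if v.strip()]
        lengths = [len(v) for v in filled]
        report[field] = {
            "fill_rate": len(filled) / total,   # share of non-empty values
            "cardinality": len(set(filled)),    # number of distinct values
            "min_length": min(lengths) if lengths else 0,
            "max_length": max(lengths) if lengths else 0,
        }
    return report


records = [
    {"name": "Ann", "zip": "30005"},
    {"name": "Bob", "zip": ""},
    {"name": "Ann", "zip": "30301"},
    {"name": "Carla", "zip": "30005"},
]
for field, stats in profile(records).items():
    print(field, stats)
```

Even this toy version scans every value of every field, which hints at why profiling a large data set is resource-intensive.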

If you have additional questions, please contact Bob Foreman.

Have a new success story to share? We would welcome you to be a speaker at one of our upcoming The Download: Tech Talks episodes.

  • Want to pitch a new use case?
  • Have a new HPCC Systems application you want to demo?
  • Want to share some helpful ECL tips and sample code?
  • Have a new suggestion for the roadmap?

Be a featured speaker for an upcoming episode! Email your idea to Techtalks@hpccsystems.com

Visit the Download Tech Talks wiki for more information about previous speakers and topics.