On February 15, HPCC Systems hosted the latest edition of The Download: Tech Talks. This series of workshops is designed for the community, by the community, with the goal of sharing knowledge, sparking innovation, and further building and linking the relationships within our HPCC Systems community.
Links to resources mentioned in Tech Talk 11:
Episode Guest Speakers and Subjects:
Raj Chandrasekaran, CTO & Co-Founder, ClearFunnel
Raj is the CTO and Co-Founder of ClearFunnel, a Big Data Analytics as a Service platform startup, where he leads Product Strategy and Solutions. ClearFunnel focuses on enabling Marketing Analytics, Advanced Text Analytics, Bioinformatics, and Image Processing for various clients in the Technology, Maritime, Publishing, and Healthcare domains.
James McMullan, Software Engineer III, LexisNexis Risk Solutions
James has a broad range of Software Engineering experience, from developing low-level system drivers for X-Ray fluorescence equipment to mobile video games and web applications. He is a recent addition to the LexisNexis team and is part of an internal R&D group where he has been working on multiple projects, including HPCC Systems & Spark benchmarks, integration projects between the HPCC Systems, Spark, and Hadoop ecosystems, and document storage systems.
Bob Foreman, Senior Software Engineer, LexisNexis Risk Solutions
Bob Foreman has worked with the HPCC Systems technology platform and the ECL programming language for over 5 years, and has been a technical trainer for over 25 years. He is the developer and designer of the HPCC Systems Online Training Courses, and is the Senior Instructor for all classroom and Webex/Lync-based training.
Key Discussion Topics:
1:35 – Flavio Villanustre provides community updates:
HPCC Systems Platform updates
- 6.4.10-1 is the latest gold version / Community Changelog
- 6.4.12 RC1 coming soon
- 7.0.0 Beta planned for early Q2 – among the key features:
Reminder: 2018 Summer Internship Proposal Period Open
- Interested candidates can submit proposals from the Ideas List
- Visit the Student Wiki for more details
- Deadline to submit is April 6, 2018
- Program runs late May through mid-August
- Don’t delay!
9:00 – Raj Chandrasekaran, CTO & Co-Founder, ClearFunnel – Scaling Data Science capabilities: Leveraging a homogeneous Big Data ecosystem
With so many big data processing engines available, how does a start-up decide which one to use? For ClearFunnel, the answer was easy – HPCC Systems. Raj Chandrasekaran discusses both Hadoop and Spark, recent incremental innovations, and the challenges of using these big data processing engines. He then talks about HPCC Systems and how the simple, homogeneous tech stack makes it a breeze to operate with minimal investment in resources and time. Raj also touches on how ClearFunnel leveraged HPCC Systems for commercial success and how they implemented a full spectrum of complex data engineering use cases with HPCC Systems.
Q: How does HPCC Systems handle micro-batching for real-time analytics requirements?
A: We leveraged the super index super key format and looked at how we could extend it for micro-batching. There are two constraints to weigh when going through this process: size and time. We looked at how soon we should run a batch, how often we should run it, and at what size. It performed very well. For intervals of less than 10 seconds, we found that it does not scale well because of the inherent nature of how the data has to transfer from Thor to ROXIE. For intervals of more than 10 seconds, it handles the data adequately.
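The size-versus-time trade-off Raj describes can be pictured with a small sketch. This is a generic Python illustration, not ClearFunnel's implementation and not an HPCC Systems API: the `MicroBatcher` class, its parameters, and the `flush` callback (standing in for whatever step publishes a batch) are all hypothetical names chosen for this example. The 10-second default simply mirrors the interval mentioned in the answer.

```python
import time
from typing import Callable, List, Optional


class MicroBatcher:
    """Buffers records and closes a batch on a size or age limit, whichever hits first.

    Hypothetical helper for illustration only; not part of HPCC Systems.
    """

    def __init__(self, flush: Callable[[List[dict]], None],
                 max_records: int = 1000, max_seconds: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush = flush              # assumed hook: publishes one finished batch
        self._max_records = max_records  # size constraint
        self._max_seconds = max_seconds  # time constraint
        self._clock = clock              # injectable so the timing can be tested
        self._buffer: List[dict] = []
        self._opened_at: Optional[float] = None

    def add(self, record: dict) -> None:
        """Add one record; the batch closes immediately if a limit is reached."""
        if not self._buffer:
            self._opened_at = self._clock()  # batch age starts at its first record
        self._buffer.append(record)
        if self._due():
            self.close_batch()

    def tick(self) -> None:
        """Call periodically so a quiet stream still closes batches on time."""
        if self._buffer and self._due():
            self.close_batch()

    def _due(self) -> bool:
        too_big = len(self._buffer) >= self._max_records
        too_old = self._clock() - self._opened_at >= self._max_seconds
        return too_big or too_old

    def close_batch(self) -> None:
        """Hand the buffered records to the flush hook and start a new batch."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._opened_at = None
            self._flush(batch)
```

A caller would feed records through `add()` and run `tick()` on a timer. Tuning `max_records` and `max_seconds` is where the "how soon, how often, at what size" questions from the answer show up.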
Q: Is the work done for rolling updates open sourced as well? If so, where do I find the code?
A: Rolling updates are not open sourced. Bob Foreman, Senior Software Engineer, LexisNexis Risk Solutions, is looking into how super key index and rolling updates can be handled within the HPCC Systems capabilities. Bob added that, with a super key, you can update the sub key while the query is active and use package maps to help with the deployment, so the query never has to come down.
Q: Regarding deep learning with geospatial data, is there any specific driver for geospatial functions?
A: No, we are using the same functions which are already available in the HPCC Systems library.
Q: For image processing, are you taking advantage of the image processing libraries of HPCC Systems or are you using your own?
A: Yes, we started by using the basics of image processing with HPCC Systems.
Q: Have you needed to implement stream processing to deal with IoT-based requirements and, if so, can you explain how you implemented it?
A: I will give two hints: streaming into S3 and micro-batching. These are the biggest changes that were required. Simply understand how to spray into the landing zone continuously using any open source version. Then do micro-batching, and that should be more than adequate to satisfy any IoT processing.
If you have additional questions, please contact Raj Chandrasekaran.
37:05 – James McMullan, Software Engineer III, LexisNexis Risk Solutions – HDFS Connector Preview
James walks us through the motivations and goals behind the creation of the HDFS connector and provides a brief overview of the architecture of HDFS, specifically how data is stored in HDFS and how we can read and write data to and from HDFS. He touches on how the connector works and how it achieves parallelism, and then takes us through a brief demo.
1:03:05 – Q&A
Q: Very useful tool for the community. When will it be available?
A: Great question. It is still in progress and will be on the company GitHub when it is available.
Q: Did you test performance? Does the performance really scale with the number of HPCC and HDFS nodes?
A: We are still testing performance, but it does scale with the number of nodes in the HPCC cluster.
Q: Does this work for writing to HDFS as well as reading from HDFS?
A: Yes, it does and hopefully the demonstration showed how easy that is.
If you have additional questions, please contact James McMullan.
1:04:53 – Bob Foreman, Senior Software Engineer, LexisNexis Risk Solutions – Building a RELATIONal Dataset – A Valentine’s Day Special!
In our ongoing ECL Tips and Tricks series, Bob explains that most of the datasets on an HPCC cluster are organized in a normalized architecture. A unique linking field in one dataset can be used to join with other datasets using a one-to-one or a one-to-many relationship. At LexisNexis, it is often referred to as the “Data Donut”.
Bob shows us how using a denormalized dataset can improve the power of your queries and uncover hidden relationships in the data. He further explains that ECL has powerful and easy support for moving from a normalized to a denormalized format when needed. In summary, he reminds us that knowing how to move both ways, and the best practices in doing so, is a good skill for all ECL developers to have. As always, Bob walks through an ECL code demonstration.
1:27:38 – Q&A
Q: Is there an easy way to flatten the datasets with child datasets?
A: That is what I did with denormalization – that is how you do that.
Q: Can you have a hierarchical nested child dataset in a Russian doll fashion? For example, real estate properties for a person including the value of each property for a number of years?
A: I imagine you can. The sky is the limit as long as you understand the data you are working with.
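For readers who want a concrete picture of moving both ways, here is a minimal sketch of the idea in Python. It is not Bob's ECL demonstration and does not use any ECL or HPCC Systems functions. The helper names `denormalize` and `flatten`, and the people, properties, and yearly values sample data, are all made up for illustration and loosely follow the real estate example from the question above.

```python
from collections import defaultdict
from typing import Dict, List


def denormalize(parents: List[Dict], children: List[Dict],
                key: str, child_field: str) -> List[Dict]:
    """Nest each child row under the parent that shares its linking field."""
    grouped = defaultdict(list)
    for child in children:
        # Drop the linking field from the child; the parent already carries it.
        grouped[child[key]].append({k: v for k, v in child.items() if k != key})
    return [{**p, child_field: grouped.get(p[key], [])} for p in parents]


def flatten(nested: List[Dict], child_field: str) -> List[Dict]:
    """Emit one flat row per child; parents without children produce no rows."""
    rows = []
    for parent in nested:
        base = {k: v for k, v in parent.items() if k != child_field}
        for child in parent[child_field]:
            rows.append({**base, **child})
    return rows


# Hypothetical sample data: people own properties, properties have yearly values.
people = [{"person_id": 1, "name": "Ann"}, {"person_id": 2, "name": "Raj"}]
properties = [{"person_id": 1, "property_id": 10, "city": "Atlanta"}]
values = [
    {"property_id": 10, "year": 2016, "value": 200000},
    {"property_id": 10, "year": 2017, "value": 210000},
]

# "Russian doll" nesting: yearly values inside properties, properties inside people.
properties_with_values = denormalize(properties, values, "property_id", "values")
nested_people = denormalize(people, properties_with_values, "person_id", "properties")

# Going back the other way, one level at a time.
flat_properties = flatten(nested_people, "properties")
```

Nesting is applied from the innermost level outward, and flattening peels off one level per call, so each level of the hierarchy stays easy to reason about.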
If you have additional questions, please contact Bob Foreman.
Have a new success story to share? We would welcome you to be a speaker at one of our upcoming The Download: Tech Talks episodes.
• Want to pitch a new use case?
• Have a new HPCC Systems application you want to demo?
• Want to share some helpful ECL tips and sample code?
• Have a new suggestion for the roadmap?
Be a featured speaker for an upcoming episode! Email your idea to Techtalks@hpccsystems.com.
Visit The Download Tech Talks wiki for more information about previous speakers and topics.
Watch past episodes of The Download: