On January 18, HPCC Systems hosted the latest edition of The Download: Tech Talks. This series of workshops is designed for the community, by the community, with the goal of sharing knowledge, sparking innovation, and further building and linking relationships within our HPCC Systems community.
Links to resources mentioned in Tech Talk 10:
Episode Guest Speakers and Subjects:
Chris Gropp, PhD Candidate, Clemson University
Chris Gropp is a PhD candidate at Clemson University. His research interests include machine learning, high performance computing, and data analysis. Chris is currently working on refining topic modeling approaches to text analysis, both by improving the algorithms themselves and by developing new methods to analyze output. He is also working with a number of other researchers to apply existing tools to new domains.
Rodrigo Pastrana, Software Architect, LexisNexis Risk Solutions
Rodrigo is an Architect with the HPCC Systems supercomputer, focusing on platform integration and plug-in development. He has been a member of the HPCC core technology team for over five years and a member of the LexisNexis team for seven. Rodrigo is the principal developer of WsSQL, the HPCC JDBC connector, the HPCC Java APIs library and tools, and the Dynamic ESDL component. He has more than fifteen years of experience in design, research and development of state-of-the-art technology, including IBM’s embedded text-to-speech and voice recognition products and Eclipse’s device development environment. Rodrigo holds an MS and BS in Computer Engineering from the University of Florida and during his professional career has filed more than ten patent disclosures through the USPTO.
Richard Taylor, Chief Trainer, HPCC Systems, LexisNexis® Risk Solutions
Richard Taylor has worked with the HPCC Systems technology platform and the ECL programming language for over 15 years. He is the original author of the ECL documentation, the developer and designer of the HPCC Systems Training Courses, and the Chief Instructor for all classroom and remote-based training.
Key Discussion Topics:
1:12 – Flavio Villanustre provides community updates:
HPCC Systems Platform updates
• 6.4.6-1 is the latest gold version
• 6.4.8 RC2 available now
Reminder: 2018 Summer Internship Proposal Period Open
• Interested candidates can submit proposals from the Ideas List
• Visit the Student Wiki for more details
• Deadline to submit is April 6, 2018
• Don’t delay as some proposals may get accepted earlier
• Program runs late May through mid-August
2018 HPCC Systems Summit Community Day
• October in Atlanta
• Pre-event workshop, Poster Competition, Public Admission & Sponsorship packages – All returning this year!
4:18 – Chris Gropp, PhD candidate, Clemson University – Asking the Right Questions with Machine Learning
The HPCC Systems Machine Learning Library contains a number of powerful tools, but it is important to use them properly. Chris will discuss how to ask the right questions by taking a step backwards from the methods themselves and examining the requirements defined by the applications.
Chris begins by explaining why it is important to ask the right questions, using a brief example to demonstrate the point. He then provides a brief introduction to topic models before discussing the quest that led to the development of Clustered Latent Dirichlet Allocation (CLDA), which solved one of these problems. After discussing the case study, he wraps up by providing some general advice on how you can apply this to solve broader problems.
17:12 – Q&A
Q: How much domain expertise do you need to apply these methods for topic classification?
A: Certainly, if you know nothing about the domain you are working with, you might have some problems. I recommend you work with a domain expert to confirm the topics make sense. But to actually run the models, most of the models are reasonably straightforward to use. You do need some management to ensure your data is in the right format, but you do not need to be a machine learning expert to get things out of these models. I would say they are moderately accessible, but I hope to make them more accessible by providing additional information on what to look out for as you go through the process.
Q: How do you determine the right number of topics to configure the parameters a priori for topic modeling algorithms?
A: Interesting question. There are a couple of metrics that people have used, but there are problems with these metrics, so I am actually writing a paper on this topic right now. It is much harder to answer this question than it sounds, so stay tuned. The right answer will come from looking at what you need from it. For example, if your application is about trying to sort documents into various buckets, then you need the right number of buckets for that.
Q: What type of intervals are appropriate for segmenting documents into timesteps? Is a day fine or a month? Are there good rules of thumb?
A: Also an interesting question. This is the reason we started the process of putting this into ECL, so we can run experiments on it. For our experiments, we used one-year timesteps. The data set that we worked with was abstracts from conference submissions which were scattered throughout the year, so it did not make sense to break it up by month. For other applications, a day or month might be the right answer. It comes back to looking at the application you have and thinking about what you are looking for and what you need from it. Also make sure it does not violate one of your constraints: your bucket size should be large enough that you have enough data to work with, but fine-grained enough that you get the information you need out of it.
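The bucket-size constraint described in this answer can be checked before any model is run. The following is a minimal Python sketch, not code from the talk: the function names, the sample dates, and the minimum document count are all hypothetical, chosen only to illustrate comparing yearly and monthly timesteps.

```python
from collections import Counter
from datetime import date

# Hypothetical mapping from granularity name to a strftime bucket label.
BUCKET_FORMATS = {"year": "%Y", "month": "%Y-%m", "day": "%Y-%m-%d"}

def bucket_counts(dates, granularity):
    """Count how many documents fall into each timestep bucket."""
    fmt = BUCKET_FORMATS[granularity]
    return Counter(d.strftime(fmt) for d in dates)

def sparse_buckets(dates, granularity, min_docs):
    """Return the labels of buckets holding fewer than min_docs documents."""
    counts = bucket_counts(dates, granularity)
    return sorted(label for label, n in counts.items() if n < min_docs)

# Hypothetical submission dates, scattered unevenly through two years.
submission_dates = [
    date(2016, 3, 1), date(2016, 3, 15), date(2016, 9, 2),
    date(2017, 3, 3), date(2017, 3, 20), date(2017, 10, 5),
]

# Yearly buckets each hold three documents; some monthly buckets hold only one.
print(sparse_buckets(submission_dates, "year", min_docs=3))
print(sparse_buckets(submission_dates, "month", min_docs=2))
```

If a granularity leaves many buckets below the threshold, that suggests the timestep is too fine for the data, which matches the reasoning given for using yearly timesteps with conference abstracts.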
Q: Can you tell me a little more about how CLDA combines topics across timesteps?
A: We take all of the topics and throw them into one giant bin and run clustering on that. In this context, each topic can be looked at as a vector. Future work will also look at other ways of clustering, but what you get out of this is that each of these clusters has a bunch of topics assigned to it, drawn from any of the timesteps, since we throw them all into the bucket first. This means we can look at the clusters and say that in this timestep we have a couple of topics that all capture different parts of the same idea, while in that timestep the idea was not mentioned at all. So we get information on topics being born and dying off, or even periodic topics that show up every four years or so. Interesting graphs are included in the paper.
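The pooling step described in this answer can be illustrated with a small Python sketch. This is not the CLDA implementation: the answer does not say which clustering algorithm is used, so plain k-means with farthest-point seeding is an assumption here, and the topic vectors, vocabulary size, and function names are hypothetical.

```python
from collections import defaultdict
from math import dist

def farthest_point_seeds(vectors, k):
    """Pick k deterministic seeds: the first vector, then repeatedly the
    vector farthest from the seeds chosen so far."""
    seeds = [vectors[0]]
    while len(seeds) < k:
        seeds.append(max(vectors, key=lambda v: min(dist(v, s) for s in seeds)))
    return [list(s) for s in seeds]

def kmeans(vectors, k, iterations=20):
    """Plain k-means; returns one cluster label per input vector."""
    centroids = farthest_point_seeds(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(v, centroids[c])) for v in vectors]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical per-timestep topics: (timestep, word distribution over 3 words).
local_topics = [
    (2015, [0.8, 0.1, 0.1]),
    (2015, [0.1, 0.8, 0.1]),
    (2016, [0.7, 0.2, 0.1]),
    (2017, [0.75, 0.15, 0.1]),
    (2017, [0.1, 0.1, 0.8]),
]

def cluster_presence(local_topics, k):
    """Pool topics from every timestep, cluster them, and report which
    timesteps contribute a topic to each cluster."""
    vectors = [vec for _, vec in local_topics]
    labels = kmeans(vectors, k)
    presence = defaultdict(set)
    for (timestep, _), label in zip(local_topics, labels):
        presence[label].add(timestep)
    return {label: sorted(steps) for label, steps in presence.items()}

print(cluster_presence(local_topics, k=3))
```

In this toy data, one cluster draws topics from all three years, while the other two each appear in a single year, which is the kind of "born", "dying off", or absent pattern the answer describes reading off the clusters.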
If you have additional questions, please contact Chris Gropp.
23:38 – Rodrigo Pastrana, Software Architect, LexisNexis Risk Solutions – Creating Front-facing Web Services to Deliver your HPCC Systems Query Data
The HPCC Systems platform provides everything you need to easily create production grade web services to deliver your query data. Rodrigo will discuss the tools and frameworks provided by the HPCC Systems platform and walk through the end-to-end creation of a sample web service.
The HPCC Systems web services framework involves many technologies, components, and methodologies, but today Rodrigo will focus on the main components and talk about how they are applied in a methodology called Dynamic ESDL, which is used to create professional, production-level web services. Rodrigo will also show us a quick demo.
50:57 – Q&A
Q: Can I use ESDL to create interfaces for Thor too?
A: I cannot commit to this 100%, but I believe as long as you can publish your query on the HPCC Systems platform, you can target your Thor. That doesn’t always make sense. When you want to deliver your data, you want to target a ROXIE platform which is tailored for very fast delivery of data. The Thor platform is not geared for that – it is more for data prep work. You might want to target ROXIE even if we allow it.
Q: Does it support other backend programming languages in addition to Java?
A: Currently, it does not. C++ and Java are the only two targeted high-level languages supported. As we see the need to add new languages, we are looking into opening support for other languages.
Q: How do you restrict your new services to a subset of HPCC Systems users?
A: That is done through our security manager framework, which lets you tie your security to different backends or to a proprietary security manager. At that point, the rules are created in that backend, and the security manager authenticates the user and asks the backend to authorize access to the web services or features based on the credentials provided.
If you have additional questions, please contact Rodrigo Pastrana.
54:19 – Richard Taylor, Chief Trainer, HPCC Systems, LexisNexis Risk Solutions – ECL Tips and Cool Tricks
Richard gives us a demo of the latest tips and tricks for using ECL. In this session, he will be parsing the PARSE function, exploring some interesting PARSE techniques used in date parsing.
1:20:14 – Q&A
Q: Beautiful code, well done. What was your biggest challenge in this project?
A: The biggest challenge was the transform function and determining exactly how to figure out what I was doing. There is a lot of complexity and the code was originally written a couple of years ago, so I had to fine-tune it for this specific presentation. Other challenges included using validate. I had never used validate, and when you use validate, you should always use a function with it, specifically a Boolean function. That is the easiest way, and then you can write the code any way you want. Another challenge was getting all the different patterns to handle the optional differences and then using the match function (which I had not used a lot) to determine which ones were matched and which ones were not.
Q: In the case of using OPT, do I get back information on if the OPT part was matched or not?
A: Yes, you do. I showed that in the time parsing. Remember, seconds on the am/pm was optional. The matched function tells me whether I got anything in seconds or not. That is how you do that.
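For readers who did not see the demo, the idea of an optional pattern part plus a check on whether it matched can be sketched outside ECL. The following is a Python analogue using regular expressions, not the ECL code from the talk: the pattern, the group names, and the `parse_time` helper are hypothetical and stand in for the OPT pattern and the matched check described in the answer.

```python
import re

# Hours and minutes are required; seconds and the am/pm marker are optional.
TIME_PATTERN = re.compile(
    r"(?P<hour>\d{1,2}):(?P<minute>\d{2})"
    r"(?::(?P<second>\d{2}))?"
    r"(?:\s*(?P<ampm>[AaPp][Mm]))?"
)

def parse_time(text):
    """Parse a time string and report which optional parts were present."""
    m = TIME_PATTERN.fullmatch(text.strip())
    if m is None:
        return None
    # An optional group that did not take part in the match yields None.
    second = m.group("second")
    return {
        "hour": int(m.group("hour")),
        "minute": int(m.group("minute")),
        "second": int(second) if second is not None else 0,
        "has_seconds": second is not None,
        "has_ampm": m.group("ampm") is not None,
    }

print(parse_time("10:30"))
print(parse_time("10:30:45 pm"))
```

The `has_seconds` and `has_ampm` flags play the role the answer attributes to the matched function: they distinguish "the optional part was absent" from "the optional part was present", even when a default value is filled in.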
If you have additional questions, please contact Richard Taylor.
Have a new success story to share? We would welcome you to be a speaker at one of our upcoming The Download: Tech Talks episodes.
• Want to pitch a new use case?
• Have a new HPCC Systems application you want to demo?
• Want to share some helpful ECL tips and sample code?
• Have a new suggestion for the roadmap?
Be a featured speaker for an upcoming episode! Email your idea to Techtalks@hpccsystems.com.
Visit The Download Tech Talks wiki for more information about previous speakers and topics.
Watch past episodes of The Download: