The Download: Tech Talks by the HPCC Systems® Community, Episode 17
On September 13, 2018, HPCC Systems hosted the latest edition of The Download: Tech Talks. This series of workshops is specifically designed for the community by the community with the goal to share knowledge, spark innovation, and further build and link the relationships within our HPCC Systems community. In this very special edition of Tech Talks, we featured more of our 2018 HPCC Systems summer interns and the work they are doing with machine learning.
Links to resources mentioned in Tech Talks Episode 17:
2018 HPCC Systems Community Day
HPCC Systems Machine Learning Library
HPCC Systems Internship Program
Episode Guest Speakers and Subjects:
Farah Al Shanik, Clemson University – Equivalence Terms for Text Search Bundle
Farah Alshanik is a Ph.D. student of computer science at Clemson University. She received her B.S from Jordan University of Science and Technology. She is working with Dr. Amy Apon as a Research Assistance in Data Intensive Computing Ecosystems (DICE) Lab. Her interest is focused on applying high performance computing to machine learning problems.
Soukaina Filali, Georgia State University – Fraud Detection on Transactional Data using a Time Series Mining Approach
Soukaina Filali Boubrahimi is a 3rd year PhD student in the department of computer science at Georgia State University and research assistant member of the data mining lab at GSU under the supervision of Dr. Rafal Angryk. In this talk, Soukaina will discuss the fraud detection project using time series data mining approach she worked on during her internship. Her research interest is on time series data mining (classification, clustering…). She has publications on the topic in IEEE Big Data, IEEE DSAA, DEXA and the applied APJs journal. Soukaina received a Master in Science in Software Engineering and a Bachelor in Computer science from Al Akhawayn University (AUI) in Ifrane, Morocco.
Lily Xu, Clemson University & Gus Reyna, LexisNexis – Using HPCC Systems ML to Map Thousands of Public Records Data Descriptions to Standard Codes
Lily Xu is a PhD candidate from DICE lab directed by Dr. Apon in the school of computing of Clemson University. It’s her third time interning with the HPCC Systems team working on machine learning applications. Her research area is machine learning, natural language processing, and high-performance computing. She can speak only three languages, but she can program more than three languages.
Gus Reyna is a Director with LexisNexis Risk Solutions where he leads the engineering team for the Motor Vehicle Report (MVR) data products. He has been working at LexisNexis for 9 years building data solutions on the HPCC Systems platform.
Key Discussion Topics:
1:07 – Jessica Lorti provides community updates:
Reminder: 2018 HPCC Systems Community Day, Atlanta
- Registration open to our external Community through September 21
- Workshop & Poster Competition on October 8
- Main event sessions on October 9
- Thank you to our sponsors!
- Visit hpccsystems.com/hpccsummit2018
Community Day Pre-Event Workshops
Morning Workshop: Data Extraction and Transformation with ECL
- Understand the HPCC Systems platform and learn ECL to build powerful data queries.
- This three-hour workshop will take the student through the data extraction and transformation cycle using HPCC Systems and ECL.
- Anyone who needs a basic familiarity and or wants to learn best practices with ECL should attend.
Afternoon Workshop: Real World Data Problems and Easy ECL Solutions
- Participants can build on their knowledge of the HPCC Systems Platform and ECL to solve real world data problems.
- This three-hour workshop will use a real-world case study to illustrate how to manage disparate data with ease and efficiency. Attendees will learn how to profile, transform, aggregate, and analyze New York City taxi data to extract some important conclusions.
- Morning workshop participants or anyone with a basic familiarity with ECL should attend.
Code examples and hands-on lessons will be included in both workshops. See additional course details and prerequisites at hpccsummit2018.eventbrite.com to register and attend.
7:30 – Farah Al Shanik, Clemson University – Equivalence Terms for Text Search Bundle
Text Search Bundle (TSB) is an open source project for searching on XML text documents and contains many subtasks, one being equivalence terms. We can consider equivalence terms as strong synonyms for TSB. Several term equivalences: initialism, abbreviation, synonyms & similarity based on context.
We used HPCC Systems to develop a Text search tool via Moby thesaurus to return a set of synonyms, word2vec algorithm to return similar words, and then built a dataset for state names & its abbreviation to return the set of related documents while improving the initialism for TSB to find strings with or without the punctuation.
15:55 – Q&A
Q: Will the use of word2vec replace traditional search engines?
A: In my opinion, word2vec will not replace traditional search engines, but it will improve the performance of traditional search engines.
If you have additional questions, please contact Farah Al Shanik.
18:58 – Soukaina Filali, Georgia State University – Fraud Detection on Transactional Data using a Time Series Mining Approach
The project consists of detecting fraudulent pre-paid cards from non-fraudulent ones using mined patterns on their respective historical bank transactions data. There are numerous types of card programs, each of which comes with different fraud risk levels. Every fraud category has representative patterns that a human manually monitors on a daily basis. The goal here is to combine the domain expert engineered features with time series shapelets mining techniques to provide an automated fraud detection solution, which can potentially help in early fraud detection.
44:50 – Q&A
Q: What was the most challenging part of this project?
A: The most challenging part of the project was the need to get familiar with the domain and the TSB specifics. Also, negative class labels are not 100% negative, which is not the typical problem in machine learning.
Q: Which of the three data sources were most difficult to work with?
A: Time series is more challenging to work with.
Q: Which set of features were not very useful for the task of classification?
A: I think the right answer is the time series was not very helpful in this particular case.
If you have additional questions, please contact Soukaina Filali.
49:52 – Lily Xu, Clemson University & Gus Reyna, LexisNexis Risk Solutions – Using HPCC Systems ML to Map Thousands of Violation Descriptions to Standard Violation Codes
Insurance companies face a challenge to underwrite and rate their policies using different descriptions across states for similar traffic violations. LexisNexis Risk Solutions provides Standard Violation Codes (SVCs) that give one consistent meaning for insurers. Lily and Gus explain how HPCC Systems ML addressed the problem of mapping thousands of disparate violation descriptions to a corresponding SVC and the future for this approach.
1:08:27 – Q&A
Q: Was the machine learning technique more helpful than the non-machine learning technique for solving this use case and can you please elaborate?
A: When we started to program the rules directly, we found that our match rates were not very high. Once we applied the machine learning technique, our match rates were higher and we were able to harness all of the body of knowledge behind that vs writing our own direct programs.
Q: Where can we get these great machine learning tool kits?
A: In the attachments/links for this presentation, you can go to the HPCC Systems website and download the free models there.
Q: How long did it take to develop the machine learning mapping program?
A: From start to finish, it took 8 months. But understanding the data is where the bulk of our time was spent and the exploratory analysis upfront was critical.
Q: Did you consider other approaches for this problem besides machine learning?
A: The reason we continue to use machine learning is because the data is changing fast and machine learning can handle all of the changing data very quickly. Machine Learning is also less time consuming and more effective to use.
If you have additional questions, please contact Lily Xu or Gus Reyna.
Have a new success story to share? We would welcome you to be a speaker at one of our upcoming The Download: Tech Talks episodes.
- Want to pitch a new use case?
- Have a new HPCC Systems application you want to demo?
- Want to share some helpful ECL tips and sample code?
- Have a new suggestion for the roadmap?
Be a featured speaker for an upcoming episode! Email your idea to Techtalks@hpccsystems.com
Visit The Download Tech Talks wiki for more information about previous speakers and topics.