The Download: Tech Talks by the HPCC Systems Community, Episode 1
On January 12, 2017, HPCC Systems hosted the first ever episode of The Download: Tech Talks. These technically-focused talks are for the community, by the community. The Download: Tech Talks is intended to provide continuing education through high quality content and meaningful development insight throughout the year.
Watch the webcast here: https://www.brighttalk.com/webcast/15091/240577
Episode 1 Guest Speakers and Subjects:
Anirudh Shah, Co-Founder and CTO, 3Loq Labs
Anirudh Shah is the Co-Founder and CTO of 3Loq Labs. He is currently working on his second startup and has more than a decade of experience in machine learning, natural language processing, mobile software development and management. Anirudh has been using HPCC Systems for the past four years.
3LOQ meshes proprietary Advanced Computing techniques with Human Wisdom to produce Actionable Insights at speed. These insights allow your brand to partner with the right customer, at the right time, to fulfill their purchase intention. We create a Unique Customer View by contextualizing billions of transaction data points. The customer can be viewed individually or in relation to other customers.
- How 3Loq Labs uses HPCC Systems to process more than 500 monthly marketing campaigns at the largest private bank in India across the bank’s entire portfolio.
- Experience with HPCC Systems in production
- Automation and data sanity frameworks
Allan Wrobel, Sr Software Engineer, LexisNexis Risk Solutions
Allan has spent his entire career in the technology industry (that’s since 1976!), and he has been working with databases since the mid-eighties.
Allan has worked with LexisNexis Risk Solutions since 2011 and the inception of LexisNexis Risk Solutions in the UK. Initially working with Data Operations, Allan now serves as an ECL developer on both Thor and ROXIE.
- Thor is well known for making short work of processing billions of records, and this promotes the tendency to use brute force in its deployment. Watch how the UK team implemented efficiency over brute force to reduce the processing time for a daily build of a billion record ingest file from 12 hours to 2 hours, and enabled further speed increases in other processes.
- Making full use of Superfiles to make order of magnitude improvements to build times on Thor. (plus fringe benefits)
Lorraine Chapman, Consulting Business Analyst, HPCC Systems
Lorraine has worked alongside software developers for over 20 years in a supportive role which has ranged from producing documentation including developing on-line help systems to software testing and release management.
Lorraine joined LexisNexis in 2004. As well as continuing to work alongside the HPCC Systems platform development team, she administers the HPCC Systems Intern Program and manages our application to be an accepted organization for Google Summer of Code.
As an active blogger on hpccsystems.com, Lorraine covers a wide range of subjects, from new release information, features and improvements to the work students have completed during their internships.
- In 2015, HPCC Systems was an accepted organization for Google Summer of Code (GSoC) taking on 2 students involved in this program. However, we had the bandwidth to support more students and so the HPCC Systems summer internship program was born. Four students joined the program in 2015 and four more in 2016. We will apply for GSoC and run our intern program again in 2017. Hear how the programs work, how projects are identified and find out about student successes on these programs.
Key Discussion Topics:
1:55- Flavio Villanustre discusses the purpose of The Download: Tech Talks. Learn more about his vision for delivering leading edge technology information on a monthly basis.
7:05- Anirudh Shah: Automation using ECL+Jinja2
Anirudh provides an overview of 3Loq Labs and how they use Machine Learning (ML) and Natural Language Processing (NLP) for marketing organizations. Anirudh provides examples of how they utilize Jinja2 to process billions of transactions and customer data for their clients.
Q. How does Jinja2 compare to ECL macros? In which ways is Jinja2 better than ECL macros?
A. There are 2 major reasons to use Jinja2:
- Ease of use. Jinja2-based files are easier to read, and the kind of text manipulation you can do with Jinja2 filters is not available with ECL macros.
- Being able to connect to a database or other system and pull the parameters into the macros.
The advantage of ECL macros over Jinja2 is that ECL macros allow you to read from the HPCC Systems platform directly.
The key to remember is that you can use both of them together. They are not mutually exclusive.
Q. Is Jinja2 freely available?
A. Yes, Jinja2 is an open source library and it is available under a BSD license. It has been in development for some time, so it is quite stable.
Q. What kind of features does Jinja2 support?
A. Jinja2 has some advanced filtering and it also has the capability to create a hierarchy of templates allowing for inheritance. Very sophisticated templates can be created using Jinja2, as it supports custom filters, template inheritance, extensions, etc. It is quite powerful.
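To make custom filters and template inheritance concrete, here is a minimal sketch in Python. It is not taken from the talk: the template names, the `ecl_string` filter, the `customers` dataset and the `segment` field are all hypothetical, and the quoting rule in the filter is an assumption about how an ECL string literal should be escaped.

```python
from jinja2 import DictLoader, Environment


def ecl_string(value):
    """Custom filter (hypothetical): quote a value as an ECL string literal."""
    return "'" + str(value).replace("\\", "\\\\").replace("'", "\\'") + "'"


# Two in-memory templates: a base layout and a child template that extends it.
TEMPLATES = {
    "base.ecl.j2": (
        "// generated for campaign {{ campaign | upper }}\n"
        "{% block body %}{% endblock %}\n"
    ),
    "filter.ecl.j2": (
        '{% extends "base.ecl.j2" %}'
        "{% block body %}"
        "OUTPUT(customers(segment = {{ segment | ecl_string }}));"
        "{% endblock %}"
    ),
}

env = Environment(loader=DictLoader(TEMPLATES))
env.filters["ecl_string"] = ecl_string  # register the custom filter

rendered = env.get_template("filter.ecl.j2").render(
    campaign="spring_sale", segment="gold"
)
print(rendered)
```

The child template only fills in the `body` block, so a shared header lives in one place, while the built-in `upper` filter and the custom `ecl_string` filter handle the text manipulation.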
Q. How stable is the library?
A. The library has been in development for some time and it is widely used in web development. It is quite stable and it is being used by quite a few people. We have not encountered any issues with Jinja2. Moreover, the community support is great.
Q. How complex is it to set up the connections with ECL?
A. It is really quite simple. You are not actually connecting to Thor or ECL and all the manipulation is happening before the fact. You are using Jinja2 only to generate the ECL code. Once the code is generated, you would use the command line or other method to invoke the ECL. They are not tightly coupled at all.
20:45- Allan Wrobel: Leveraging Superfiles on Thor
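The loose coupling described in this answer can be sketched as follows. This is an illustrative Python example, not 3Loq's code: the template text, the `transactions` dataset, the campaign parameters and the output file name are hypothetical, and in practice the parameters might be pulled from a database rather than hard-coded.

```python
import tempfile
from pathlib import Path

from jinja2 import Template

# Hypothetical template: one ECL OUTPUT statement per campaign.
ECL_TEMPLATE = Template(
    "{% for c in campaigns %}"
    "OUTPUT(COUNT(transactions(amount >= {{ c.min_amount }})), NAMED('{{ c.name }}'));\n"
    "{% endfor %}"
)


def generate_ecl(campaigns, out_dir):
    """Render the template to a .ecl file. No cluster connection is made here."""
    ecl_source = ECL_TEMPLATE.render(campaigns=campaigns)
    path = Path(out_dir) / "campaign_counts.ecl"
    path.write_text(ecl_source)
    return path


# Illustrative parameters; these could equally come from a database query.
campaigns = [
    {"name": "gold_card", "min_amount": 5000},
    {"name": "travel", "min_amount": 1200},
]

with tempfile.TemporaryDirectory() as tmp:
    ecl_file = generate_ecl(campaigns, tmp)
    generated = ecl_file.read_text()

print(generated)
```

The script produces nothing but text. Submitting the generated file to the cluster is a separate step, done with the command line or whatever method you already use to run ECL.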
Allan has been with LexisNexis Risk Solutions since 2011, initially with Data Operations and now developing on Thor and ROXIE.
In this presentation, Allan discusses handling daily updates on monolithic logical files while retaining the ability to roll back when needed. After several months of accumulated data, de-duplicating against the previous day’s data becomes very time consuming. The process Allan reviews coordinates multiple logical files: each daily file is rolled into a monthly file holding all the data for that month. Monthly files could be rolled up into yearly files, if desired. While the build is slightly longer, the roll back capability is maintained.
- Build times reduced by orders of magnitude. The bigger the data the greater the improvement.
- Allowed ad-hoc searches of very large un-indexed data, where the search is date bound.
- Archiving historical data becomes a trivial exercise that only uses the DFU server.
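The daily-to-monthly roll-up idea can be modeled in a few lines of Python. This is only a sketch of the bookkeeping, not the library Allan presented and not ECL: the function name, the file naming scheme and the rule that only completed months are merged are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


def plan_rollup(daily_files, today):
    """Group daily files from completed months into one monthly file each.

    daily_files maps a date to a logical file name. Returns (monthly, remaining):
    monthly maps 'YYYY-MM' to the daily files to merge for that month, and
    remaining lists the current month's daily files, which stay separate so
    that an individual day can still be rolled back.
    """
    monthly = defaultdict(list)
    remaining = []
    for day in sorted(daily_files):
        if (day.year, day.month) < (today.year, today.month):
            monthly[f"{day:%Y-%m}"].append(daily_files[day])
        else:
            remaining.append(daily_files[day])
    return dict(monthly), remaining


# Hypothetical logical file names for three daily ingests.
daily = {
    date(2016, 11, 29): "base::daily::20161129",
    date(2016, 11, 30): "base::daily::20161130",
    date(2016, 12, 1): "base::daily::20161201",
}
monthly, remaining = plan_rollup(daily, today=date(2016, 12, 2))
print(monthly)
print(remaining)
```

Because each file covers a known date range, a date-bound search only needs to read the files whose range overlaps the query, and archiving an old month is a matter of moving a single file.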
Q. Can you share with us the number of IOPS (input/output operations per second) the Thor cluster typically handles for the data coming in?
A. We are getting 50GB of data coming in each day in the Thor base file.
Q. How does one manipulate data held in a Logical file using generic attributes, when each file is tied to a specific layout?
A. Our library functions only use the DFU server, i.e. LogicalFileList, move, copy and rename. The only functionality that requires Thor is the ‘Merge’ of daily files into month files, and this is achieved using a callback function where the temporary superfile filename list and the target ‘Month’ filename are passed as parameters to said callback function. The ‘Merge’ itself is done in this callback.
Q. Do you have to be at a specified version, or above, of the HPCC Systems platform before you can use this approach to manipulating superfiles/logicalfiles?
A. This has been used in production since the end of 2012, and it uses nothing but standard DFU Server functions.
Q. Is this functionality available to the HPCC Systems community?
A. Not as of now. This presentation is about one approach used to manipulate superfiles/logicalfiles that works for our business case. The ideas shown here may be applicable to your business and, if they are, can dramatically improve build times. That being said, there is a proposal to make a bundle of this functionality available on the HPCC community site.
51:50- (late clarification request)- While the particular functions Allan showed today are not specifically available, the HPCC Systems library has information on promoting superfile lists. If you have petabyte files, it will move all of the data to the father. There isn’t yet functionality in the standard library to just move particular logical files from one generation to the next.
36:05- Lorraine Chapman: Student Opportunities with HPCC Systems
Lorraine discusses Student Opportunities with HPCC Systems and the three programs through which you can get involved: Google Summer of Code, HPCC Systems Summer Intern Program, LexisNexis Corporate Intern Program. For all programs, Lorraine needs both students and mentors. If you have projects for our community which are suitable for students, or if you are interested in mentoring a student, you should reach out to Lorraine. Lorraine reviews program overviews and differences as well as what qualities make a great proposal. Of course, the greatest advantage a student can have is familiarizing themselves with HPCC Systems.
49:25- Key links for more information including available projects, where to access proposal help as well as blogs and past project information. Check out Lorraine’s blogs on past intern programs to understand more about what other students have done.
Q: I’m a community member but not a LexisNexis employee. Can I still become a mentor for the HPCC Systems intern program?
A: Yes! The HPCC Systems intern program does welcome community members to be mentors. The same is true for Google Summer of Code. This does not apply for the LexisNexis Intern Program, which requires internal LexisNexis employees to serve as mentors.
Q. Have there been people who have completed more than one project over the years?
A. Yes, although I wouldn’t want to fill all of the spaces with returning students each year, so that we have room for new participants.
Q. Are internships only for the northern hemisphere or are there opportunities for internships during the summer in the southern hemisphere?
A. Lorraine is open to the suggestion. Please reach out to Lorraine.
Q. Does the program provide moving stipends or other moving expenses?
A. Google Summer of Code is run through Google and the majority of the students will be working remotely and they are paid a flat rate. The HPCC Systems program is similar. We did provide some housing assistance for one student who worked from our Boca Raton office. That said, this is mainly a flat fee for remote working employees. The LexisNexis Intern Program information needs to come from that program office. Please email Lorraine and she will assist in making connections to Renu Midha, who manages this program.
Q. What is the code check in and review process for interns?
A. This primarily is related to the Google Summer of Code projects. If you are working on a coding project related to HPCC Systems, you become a developer like any other. The review process would be the same as for any other developer working on a project, utilizing GitHub.
Have a new success story to share? We would welcome you to be a speaker at one of our upcoming The Download: Tech Talks episodes.
- Want to pitch a new use case?
- Have a new HPCC Systems application you want to demo?
- Want to share some helpful ECL tips and sample code?
- Have a new suggestion for the roadmap?
Be a featured speaker for an upcoming episode! Email your idea to Techtalks@hpccsystems.com
Visit The Download Tech Talks wiki for more information: https://wiki.hpccsystems.com/display/hpcc/HPCC+Systems+Tech+Talks