The Download: Tech Talks by the HPCC Systems Community, Episode 3

On March 30, 2017, HPCC Systems hosted the latest edition of The Download: Tech Talks.  These technically focused talks are for the community, by the community.  The Download: Tech Talks is intended to provide continuing education through high-quality content and meaningful development insight throughout the year.

Episode Guest Speakers and Subjects:

Joselito (Joey) Chua, PhD, Manager Software Engineer, Optimal Decisions Group

Joselito (Joey) Chua leads the software engineering team in the Optimal Decisions Group in LexisNexis Risk Solutions.  He specialises in information-theoretic approaches to machine learning.  He is a fan of anime.

Joey presents:

Prescriptive Analytics – a Software Engineering Perspective

Prescriptive tools are key components in realising the value proposition of data analytics and business intelligence solutions.  The ultimate goal of analytics is to improve outcomes using insights from data.  Descriptive tools summarise what has happened, whereas predictive tools model what is likely to happen.  Prescriptive tools, on the other hand, suggest a course of action that will likely achieve the best outcomes. 

This talk presents an overview of prescriptive techniques involving simulation and optimisation, the engineering challenges in building prescriptive tools, and HPCC solutions for those challenges.

Jill Luber, Senior Architect, LexisNexis Risk Solutions

Jill Luber is a Senior Architect for LexisNexis Risk Solutions with leadership responsibility for strategy, implementation, and stability of all US and international data linking products, including the LexID, Business LexID, Healthcare Provider ID, and UK LexID.  With 17 engineers across multiple geographies, the Linking Team develops the core competences underpinning all products at Risk executing on the HPCC big data platform.  Jill has presented patented Risk linking concepts at the 2015 RELX leadership conference as well as the RELX Board meeting.  She has been a member of the technology organization for over 13 years.

Jill presents:

Migrating an ECL code repository into Git, Part II

This session will take a quick look at a migration plan that moved ECL production code, production processes and developers out of MySQL/SVN and into a Git code management culture.  This includes migrating both Roxie and Thor processes to use Git branches across multiple HPCC Systems environments, all while continuing production data builds and releases.

Michael Gardner, Software Engineer II, LexisNexis Risk Solutions

Michael Gardner is an HPCC Systems Platform team member and developer.  He is responsible for the HPCC Platform init system, various build issues, administrative scripts, and HPCC Java projects.  His most recent active work includes systems integration for the HPCC Systems Platform, and an antlr3 (c) to antlr4 (cpp) migration for the wssql project.

Michael presents:

HPCC Systems Platform: Java APIs and Tools

This presentation covers the Java APIs and tools released by the HPCC Systems Platform team.  These projects include wsclient, rdf2hpcc, clienttools, and jdbc.  These open source projects, which can be found in the hpcc-systems GitHub repositories, are designed to give downstream developers a consistent means of interfacing with the HPCC Systems Platform and to facilitate the workflow of common tasks a downstream developer might be concerned with.

Bob Foreman, Senior Software Engineer, HPCC Systems, LexisNexis Risk Solutions

Bob Foreman has worked with the HPCC Systems technology platform and the ECL programming language for over 5 years, and has been a technical trainer for over 25 years. He is the developer and designer of the HPCC Systems Online Training Courses, and is the Senior Instructor for all classroom and Webex/Lync based training.

Bob presents:

In Search of the Lost ECL Tutorial

In this presentation, Bob explores David Bayliss’ ECL Bible Tutorial, with particular focus on the GRAPH function and building the inverted index for the ROXIE search.  A recorded screen share helps you navigate and better understand how to use the functionality.

Key Discussion Topics:

1:50- Flavio Villanustre discusses:

  • The HPCC Systems Summer Intern program, deadline April 22
  • Call for presentations and poster abstracts for the 2017 HPCC Systems Community Day, to be held the week of October 2nd in Atlanta, Georgia
  • Machine Learning Update
    • Two new bundles coming in 6.4.0 in a mid-year release
    • Linear and logistic regression
    • Check for first release candidate in May

13:45- Jill Luber: Migrating an ECL code repository into Git, Part II

15:05- Jill discusses the LexisNexis Risk Solutions migration to GitLab.  She covers the reasons for the migration and the benefits the team will receive from it, including branching capability, release management, and a distributed code base.

Topics include:

  • How to migrate
  • How the code is structured in Git vs. MySQL
  • How to manage the migration
  • How to begin migration
  • What to consider before migration

32:35- Q&A

Q.  Should I use tags or branches to manage attribute versions and releases and if so, why?

A.  Tags are like a frozen point in time, so they fit well with a release, like a time stamp.  Branches can be altered and hot-fixed, and they keep a commit history, so there is a view into what changed.  It is a little trickier to do this with tags.  There are a lot of options in using branches and tags.  We used branches because of the option to alter what is in production and keep track of that.

Q. How does Git prevent your code from being overwritten if two people check out code one after the other?

A.  Merge conflict functionality determines the head of the branch, so it knows the branch you started with.  If two people start with the same branch with the same ancestry, the system knows the status of the code at the head of the branch.  Git will accept the first pull request.  The subsequent submission will receive a merge conflict and allow for resolution before acceptance.  This functionality is one of the main reasons the team is moving to Git.

36:40- Joey Chua: Prescriptive Analytics – a Software Engineering Perspective

Joey explains that the ultimate goal of data analytics is to improve outcomes using insights from data.  He walks through the phases of development, moving from data curation to insights and from insights to decisions.

Joey explains how to move to prescriptive analytics with prescriptive tools that improve how insights can deliver better outcomes.  Topics of conversation include:

  • The value proposition of data analytics and business intelligence solutions.
  • How to simulate many possible scenarios, and select options that achieve the desired outcomes and satisfy constraints.
  • Several characteristics and elements that present software engineering challenges
    • Multiple competing objectives and constraints
    • Large number of options
    • Causal and response models that require behavioural attributes
  • How a high-performance cluster computing platform helps meet the engineering challenges.
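The simulate-then-select idea in the list above can be illustrated with a small Python sketch.  This is not code from Joey's talk: the pricing options, the linear demand model, the unit cost, and the minimum-volume constraint are all made-up assumptions chosen only to show the pattern of simulating many scenarios per option, discarding options that break a constraint, and keeping the best of the rest.

```python
import random
from statistics import mean


def simulate_option(price, n_scenarios=2000, seed=0):
    """Simulate expected profit and volume for one candidate price."""
    # Reusing the same seed means every option faces the same scenarios,
    # so differences between options are not caused by sampling noise.
    rng = random.Random(seed)
    unit_cost = 6.0  # hypothetical cost per unit
    profits, volumes = [], []
    for _ in range(n_scenarios):
        # Hypothetical response model: demand falls linearly with price,
        # plus random noise for everything the model does not capture.
        demand = max(0.0, 1000.0 - 60.0 * price + rng.gauss(0.0, 50.0))
        volumes.append(demand)
        profits.append((price - unit_cost) * demand)
    return mean(profits), mean(volumes)


def prescribe(options, min_volume):
    """Return (price, profit, volume) for the best feasible option, or None."""
    best = None
    for price in options:
        profit, volume = simulate_option(price)
        if volume < min_volume:
            continue  # constraint violated: this option is not eligible
        if best is None or profit > best[1]:
            best = (price, profit, volume)
    return best


options = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
choice = prescribe(options, min_volume=350.0)
print(choice)
```

In this toy setup the most profitable price overall fails the volume constraint, so the sketch recommends the next best one.  Each option is simulated independently of the others, which is the kind of workload that can be spread across the nodes of a cluster when the number of options and scenarios grows.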

55:10- Q&A

Q.  Did you need to create visualizations in HPCC Systems?

A.  There are several visualization tools available in HPCC Systems which are open source and available for use.

Q. Are there any open source analytical tools that are available in ECL bundles?

A.  There is a machine learning bundle available which would be very useful.

Q. If prescriptive tools can already determine which actions will lead to the best outcome, what is left for the decision makers to do?

A.  Prescriptive tools do just that: they prescribe.  Executing the actions, as in self-driving cars or high-frequency trading, is outside the scope.  Prescriptive tools are just tools, and the responsibility for the outcomes falls to the decision makers.  Outcomes are only as good as the options and assumptions, which also come from the decision makers.  There is also no substitute for common sense or the decision maker's skill in balancing gut feel and objectivity.

Questions not addressed on air:

Q. What is the rate of adoption of prescriptive tools?

A.  In 2013, Gartner reported 3% of the companies they surveyed use prescriptive tools, compared to 30% who use predictive tools.  In 2016, Gartner estimated the rate for the use of prescriptive tools at 10%.  IDC predicts that by 2020, 50% of business analytics software will have prescriptive tools built-in. I would suggest prescriptive analytics are actually already ubiquitous in our gadgets and apps.  For example, navigation apps suggest the best routes based on crowd-sourced traffic reports.  Self-driving cars will be an interesting development to watch. 

Q.  Is it necessary to develop predictive capabilities before we develop prescriptive capabilities in our business analytics tools?

A.  While predictive models are required in prescriptive tools, there is a strong argument that it is actually better to start from a “prescriptive position” and work backwards.  That is, start with questions about the decisions the business wants to make, rather than try to “boil the ocean” of data in the hope of finding something useful and interesting.  By starting with the decisions and options that need to go into the prescriptive tools, efforts in building descriptive and predictive models can be more directed towards the business needs.

47:30- Michael Gardner: HPCC Systems Platform: Java APIs and Tools

Michael discusses the Java APIs and tools released by the HPCC Systems Platform team.

These projects are found in the hpcc-systems GitHub repositories and are designed to give downstream developers a consistent means of interfacing with the HPCC Systems Platform.  These projects can be used to facilitate the workflow of common tasks a downstream developer might be concerned with.  Projects discussed include:

  • wsclient
  • rdf2hpcc
  • clienttools
  • jdbc

1:06:45- Q&A

Q:  How efficient is the JDBC interface?  Do queries get converted into ECL and do they use existing ROXIE indexes and queries?

A: They do get converted into a form of ECL.  You don’t get to use the full features of the language, unfortunately, but as far as ROXIE indexes and queries go, we will need to check on that.

Q. Is there a JNI interface for ECL or HPCC?

A.  Wsclient is pure Java; it has no C/C++ dependencies and neither requires nor provides JNI-based interfaces.  Separate from this project, the HPCC platform does provide a mechanism to create cpp-based client code to interface with the ESP web services.

Q: Is wsclient compatible with multiple versions of the platform?

A: Yes, if a developer is using the client/platform/utils portion of the interface.  We encapsulate the appropriate SOAP calls and handle any differences in the target platforms.  That being said, the WSDL-generated SOAP calls are available to developers if they wish to utilize them and take on the risk of compatibility themselves.

Q.  Is there a tutorial available for using Java ECL plugins?

A.  The project contains sample code which illustrates typical use cases.  Also, a YouTube video is in the works.

Questions not answered on air:

Q: I have already created my own logic to execute ECL.  How would I benefit from using the wsclient library?

A:  The largest benefit of using the wsclient library is that it is well tested and mature.  You don’t have to reinvent the wheel or worry about slight differences between platform version targets.  It leaves developers more time to worry about the actual problem they’re trying to solve, instead of how to tell the platform what they want done.

Q: What is the purpose of hosting the HPCC Systems Java projects on the Maven Central repository?

A:  We hope that by hosting outside our intranet, it will allow more developers to easily access and leverage the tools we’ve created. 

1:10:10- Bob Foreman: In Search of the Lost ECL Tutorial

Bob talks about the lost ECL tutorial, which has been well hidden on David Bayliss’s personal website.  He reveals the tutorial and walks through its three phases, including an interactive demonstration of how to take these steps yourself.  Bob covers the following parts of the tutorial:

Part I
Getting, Smacking, Organizing, Structuring, and Preparing the Data

Part II
The GRAPH Function, Heart of the Search Engine

Part III
Build a ROXIE Query
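
For readers new to the idea, an inverted index maps each word to the records that contain it, so a search only has to look up the query words instead of scanning all the text.  The tutorial builds its index in ECL; the Python sketch below is not the tutorial's code and only illustrates the general technique.  The three sample verses, the numeric (book, chapter, verse) keys, and the function names are assumptions made for the illustration.

```python
import re
from collections import defaultdict

# Illustrative records: a numeric (book, chapter, verse) key plus the verse text.
verses = [
    ((1, 1, 1), "In the beginning God created the heaven and the earth."),
    ((1, 1, 3), "And God said, Let there be light: and there was light."),
    ((43, 1, 1), "In the beginning was the Word, and the Word was with God, "
                 "and the Word was God."),
]


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(records):
    """Map every word to the set of record keys whose text contains it."""
    index = defaultdict(set)
    for key, text in records:
        for word in tokenize(text):
            index[word].add(key)
    return index


def search(index, query):
    """Return the sorted keys of records that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


index = build_index(verses)
print(search(index, "beginning God"))
print(search(index, "light"))
```

The index itself holds only words and numeric keys; the verse text can be fetched by key once the matching records are known.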

1:37:46- Q&A

Q. I see that the index is using numeric fields in this case and the text is in the payload.  Could we use alphanumeric fields instead and are indexes using numeric fields more efficient?

A.  Indexes using a numeric field may be a little more efficient, but you can use anything as an index field element.  Alphanumeric and string fields are absolutely supported.

Q. What popular ECL technique was used to “smack the data into shape”?

A: We were looking at several different techniques.  We were using PROJECT, ROLLUP, and memory tables.  We were using a memory table with default values.  It was reading through the original raw text and extracting the chapter, the verse, and all the pertinent information into a memory table that we were able to continue working with, add the book number to, and eventually build an index on for our final result.

Q.  What if my memory table doesn’t fit in the memory of a single Thor node?  Does it still work?  Does it use the filesystem to spill what doesn’t fit in the memory?

A.  Yes, you have RAM on your machine and you have RAM for each node on your cluster.  If you have a single node and you run out of RAM, a disc spill will happen to use disc as additional RAM.  The only caveat is that your performance may drop a little bit because it is going into disc for additional RAM.

Questions not answered on air:

Q. How does David process the Bible text to get things like word count and finding text? 

A: With the standard string function libraries.

More information on HPCC Systems Training can be found here:  HPCC Systems Training and Current Class Schedule

Have a new success story to share?  We would welcome you to be a speaker at one of our upcoming The Download: Tech Talks episodes.

  • Want to pitch a new use case?   
  • Have a new HPCC Systems application you want to demo?   
  • Want to share some helpful ECL tips and sample code?   
  • Have a new suggestion for the roadmap?

Be a featured speaker for an upcoming episode! Email your idea to Techtalks@hpccsystems.com  

Visit The Download Tech Talks wiki for more information: https://hpccsystems.atlassian.net/wiki/display/hpcc/HPCC+Systems+Tech+Talks

Watch Past The Download: Tech Talks Webcasts:

The Download: Tech Talks by the HPCC Systems Community, Episode 1

  • Anirudh Shah, Co-Founder, 3Loq
    • How we use HPCC Systems to process more than 500 monthly marketing campaigns at the largest private bank in India across the bank’s entire portfolio.
    • Our experience with HPCC Systems in production
    • Automation and data sanity frameworks
  • Allan Wrobel, Senior Engineer, LexisNexis
    • Making full use of Superfiles to make order of magnitude improvements to build times on THOR. (plus fringe benefits)
    • Thor is well known for making short work of processing billions of records, and this promotes the tendency to use brute force in its deployment. Watch how the UK managed to implement efficiency over brute force to reduce the processing time for a daily build of a billion-record ingest file from 12 hours to 2 hours, and enabled further speed increases in other processes.
  • Lorraine Chapman, Consulting Business Analyst, HPCC Systems
    • In 2015, HPCC Systems was an accepted organization for Google Summer of Code (GSoC) taking on 2 students involved in this program. However, we had the bandwidth to support more students and so the HPCC Systems summer internship program was born. Four students joined the program in 2015 and four more in 2016. We will apply for GSoC and run our intern program again in 2017. Hear how the programs work, how projects are identified and find out about student successes on these programs.

The Download: Tech Talks by the HPCC Systems Community, Episode 2

  • Fujio Turner, Solutions Architect, Couchbase – Mobile/IoT & HPCC Systems
    • Fujio discusses the challenges around IoT and addresses the following questions:
    • As there are more mobile and embedded devices all generating more data, what does that mean now and for the future?
    • What has to change in an organization’s infrastructure to keep up?
    • And how can I best take advantage of this new stream of information?
  • Jacob Pellock, Sr Director Software Engineering, LexisNexis Risk Solutions
    • Jacob presents Operationalizing jobs on Thor utilizing Python, Git and HPCC Systems client tools – Part I
  • Roger Dev, Sr Architect, LexisNexis Risk Solutions
    • Roger’s presentation addresses: Basic Linear Algebra Subsystem (BLAS) and Parallel Block BLAS (PBBlas) libraries.  Manipulation of matrix data via Linear Algebra operations lies at the heart of many data-mining and machine-learning techniques. New modules for HPCC provide highly scalable and performant implementations of these operations.
  • Richard Taylor, Chief Trainer, HPCC Systems
    • Richard provides an overview on HPCC Systems Training: Updates and Deep Dives on Cool Code as well as an update on what is going on with ECL/HPCC/SALT/KEL training courses.   

Jessica comes from an extensive background defining and implementing strategic programs across a variety of marketing disciplines for the technology, financial services, and energy industries. She has held senior marketing roles at GE, Intel, Compaq and Grant Thornton where she managed product marketing and brought new technologies to market, developed and launched social media and online marketing efforts, and developed new business models in conjunction with sales and key corporate partners. Jessica holds a Bachelor of Science in International Economics from Texas Tech University and a Masters in International Management with a concentration in Marketing from Thunderbird, the Global School of International Management. She has also earned the LEAN Six Sigma Green Belt certification.