5.0 has been released! Right on time for the festivities celebrating the third anniversary of our Open Source HPCC Systems platform!
What makes this event even more exciting is the fact that a number of new integrations are available too: How would you like to use SQL queries through a web services interface, to access published Roxie ECL queries? WsSQL now makes this possible! And how about a complete deployment of an entire HPCC environment with a click of a mouse under Canonical’s Ubuntu Juju? The free HPCC charm is available in the Juju store! And if you use Nagios, the new seamless integration allows direct monitoring of HPCC infrastructure with your existing monitoring system.
This is a great milestone for the HPCC community; so head over to the Downloads section now, and don’t forget to use the forums to tell others about your experience, get answers to your questions and socialize with the rest of the community.
And Happy Birthday, HPCC Systems!
On May 1, the report “Big Data: Preserving Values, Seizing Opportunities” was released by the Executive Office of the President in response to a directive from President Obama to examine the impact of big data technology on society at large.
Big data is a movement in its own right, but after the White House report was released there was an influx of media articles questioning big data and, in particular, its safety. Questions began to circulate not only about how secure the data is, but also about the privacy rights of the citizens whose very personal information is stored in it.
For example, GigaOM posted an article titled “It’s 2014. Do You Know Where your Data Is?” and a LinkedIn blog post declared, “Big Data has Big Problems.” Both stories addressed the security and privacy of information stored and utilized for big data purposes.
Recently, I gave an interview to discuss how LexisNexis Risk Solutions protects our data and customer information, to address the recent concerns raised in the media regarding big data, and to explain what we are doing on our HPCC Systems big data platform to maximize security. Below is the Q&A transcript from the interview.
Moderator: Why is LexisNexis’ information safe and then why should customers trust us?
Flavio Villanustre: We are secure because we have a comprehensive security and privacy program in place. We continuously test our security posture and address any weaknesses that we may find, and we also have state of the art controls around access to the data.
But security goes far beyond just technology. Security isn’t just about making your software secure so that it cannot be breached; you also need to make your processes secure. You need to provide awareness to your employees so that they don’t get socially engineered, for example, and apply controls around security across your entire organization. It’s not just technology, and it’s not just customer support or customer operations.
What are some specific things we do to protect the data?
We do a lot of things. On the administrative side, we have a comprehensive set of security policies, procedures and standards. We provide security through training and we require that employees have knowledge of our policies. We do internal and external (independent third party) risk assessments to ensure that every part of the process is assessed from a risk standpoint, and that controls are commensurate with the exposure and overall risk.
We also employ technical controls, which are things like firewalls and network segmentation, data loss prevention systems and anti-malware protection systems. Administrative and technical controls complement each other.
In addition, we draw a distinction between “security” and “compliance.” Too often, we see organizations “checking the box” to assure themselves that they are compliant with respect to some framework. Our viewpoint is: if we do a very good job with information security (at a technical level and at a process level), compliance more or less takes care of itself.
Controls can be classified in three categories: preventive, detective, and corrective. In general, the most important ones are the preventive controls, which are put in place to prevent an attack or mitigate a threat. You need to keep in mind that it is very difficult to undo the damage when sensitive data is leaked or exposed. This is why we put significant emphasis on preventive controls. At the same time, we always have to be prepared for the possibility that data might be leaked or exposed, which is where detective controls come in handy: the sooner we can detect an intrusion or malicious attack, the more we can minimize potential damage, as opposed to detecting the event weeks or months later.
How does the security of HPCC Systems compare to the threat of other big data systems like Hadoop?
[HPCC Systems] is a lot better. We have been doing this for 20 years, and we built HPCC Systems specifically to support our business. As a result, many of the requirements that we have in our business around security and privacy are also incorporated into the HPCC Systems platform.
By contrast, Hadoop was designed from the ground up to allow people to store massive amounts of data on relatively inexpensive hardware and then be able to perform searches like the "find the needle in a haystack" type of problem. This is what it has been focused on for the longest time, rather than on making it work securely. So the security on Hadoop came as an afterthought, and even the basic authentication mechanisms weren't deployed until a couple of years ago.
I saw that HPCC Systems went open source in 2011. Does that cause any vulnerability issues?
Not at all! On the contrary, this increases security through transparency and collaborative development. Generally speaking, the open source movement – started back in the 80s – is about releasing not just the compiled (or binary) version of the software, but also the programming language version of the software, the source code from which the binary code is generated. Rather than making things less secure, the increased number of reviewers and contributors can identify and correct security issues much faster with their combined efforts, making HPCC Systems even less vulnerable.
When legacy systems are converged onto the HPCC Systems platform, are there any concerns that one needs to be aware of? Some leading journals suggest that technology has progressed so quickly that legacy systems may have issues with merging into a new platform?
It's true that technology has changed, and that it changes very rapidly. It’s no longer a time where we have a new generation of technology every 20 years. Now, a new generation happens every two or three years.
These days, big data encompasses many things, such as social media, videos and free text, which are not well supported by legacy systems. When you’re trying to deploy HPCC Systems in that environment, there are two ways you can do it. You can say, “Well, I’m going to phase out all my legacy systems completely and move all the data,” but that might not be viable for many companies since they may need to continue to operate while they do that migration, so integration is needed. As with any other integration process, there is some complexity, which could generate some potential security issues in the interfaces, while you are trying to connect one system to the other and move data. This is why, when we migrate legacy systems onto the HPCC Systems platform, we pay close attention to the security controls that may need to be implemented or refactored as a function of the migration.
Do we ever have risks of losing data?
Well, the reality is that everyone does. There is the myth of complete security, and it’s just that, a myth. There is no way you can say, “I’m 100 percent exempt from having any security issues ever.” Of course, we believe, based on applying best-in-class security practices, having thorough and comprehensive monitoring and surveillance and having a mature set of processes that adapts to the ever changing security threat landscape, that we have a very low risk of losing data.
Maybe I’ve been watching too many political action and sci-fi shows lately, but I was watching 24 and my mind kind of races, which makes me ask: do we ever have anybody try to intentionally hack our systems to see if they can get in?
We don’t have any controls against aliens from outer space, but we do try to intentionally hack into our systems. We have security assessments and penetration testing and we regularly perform both, internally and externally. In addition to our own security experts - who are very well-trained and very knowledgeable of these practices - we also have third parties that we hire on a regular basis, to attempt to break into our systems.
Unfortunately, the number of hackers or wannabe hackers is very large, and you can bet they will be creative in trying to find new ways of breaking into your systems. So, if you’re not performing continuous surveillance and scanning for new threats and attack methodologies, it will potentially expose you in the long run.
What are some challenges that you see with protecting big data in general in the future? And what do you think we need to do to combat those threats?
First of all, we need to draw a distinction between security and privacy. I think the biggest challenges are going to be potentially around privacy, which is a very touchy subject because there is no universal concept of privacy. This distinction is necessary because some people often confuse a perceived lack of privacy with a lack of security.
What is considered acceptable privacy in the US might not be acceptable privacy in Germany or China. What’s privacy to us today is not the type of privacy we used to consider 20 years ago, and it won’t be the privacy 20 years from now. It’s important to have a clear understanding of what the society accepts as privacy, to ensure that you don’t go beyond that. You never want to be seen as creepy, and I can’t define exactly what creepy is, but you will know when you see it.
There can also be better education. For example, when you go to install an application on your smart phone, and the list of permissions pops up, the list is so long, you probably don’t even read it. You say, “I need the application, so accept.” Well, I don’t think that is the right way of doing it. There needs to be some bullet points, saying, “Look, you give me your data, and for giving me your data, I will use your data in this way.” It needs to be clear and understandable by anyone.
At the end of the day, there needs to be an exchange of value between the data owner (the person) and the data user, and that value needs to be measurable and tangible. I am glad to allow others to use my data, if that gives me access to better credit, simplifies my access to online services and makes my children safer in the streets.
With the holidays almost upon us and while we start to wind down to spend some quality time with our families, I thought that it would be a good opportunity to recount some of the things that we have done during 2013.
For starters, a major release of the platform (4.0) and several minor releases are a tribute to the hard work of the entire HPCC Systems community. The full self-paced online training content for HPCC, covering basic and advanced Thor, Roxie and ECL content, programming exercises and self-assessment materials, is also a significant step up from our more traditional face-to-face training, allowing hundreds or thousands of people to learn HPCC from the comfort of their sofas. The addition of the ECL bundle functionality is another great example of a relatively simple enhancement that can go a long way to open the door for code sharing and reuse, and we will be building on this concept in 2014 to support seamless download and installation of ECL bundles from public and private repositories. We also made some good progress in integration and tooling, with the release of the technical preview of the new ECLWatch, the enhanced support for Hadoop HDFS, an improved JDBC/SQL driver and a myriad of enhancements in the areas of graphical user interfaces to HPCC, charts and dashboards. Outside of the platform itself, a substantial amount of progress in our academic and research outreach programs has started to create good traction in those communities too.
But it's not just about the past, and there are already exciting things happening in our development branch, including state of the art optimizations to better utilize the existing hardware (multiple CPU cores per node in Thor, for example), porting more machine learning algorithms to the new PB-BLAS framework for vectorized distributed linear algebra arithmetic and the ongoing work in supporting available CUDA compatible hardware for accelerated processing. And this doesn't even start to scratch the surface for what will come in 2014.
But if you got to this point, you probably already spent too much time reading and should be going back to celebrating.
Have a fun, safe and enjoyable Holiday Season with your family, and come back for more in 2014!
Last week, close to 300 people met in Delray Beach, Florida, to follow an intensive and densely packed agenda full of technology content on the HPCC Systems platform. It was our third annual HPCC Summit event and doubled its attendance from the previous year. Besides the great food, party and accommodations (yeah, I thought I should mention that too), there were many plenary and break-out sessions covering a broad range of areas including HPCC roadmap, new developments, recent enhancements, applications, integration with third party platforms and more.
Some of the presentations were plain brilliant, providing material that would take weeks to fully digest, and many of the attendees will be watching the video recordings of them for months to come (in case you were wondering, we also recorded most of the presentations). And some very funny videos entered our video competition, with the winner implementing an HHPCC (Human HPCC): worth a watch if you're up for some light humor.
An interesting aspect of the conference was the contest on ECL Bundles, which got the competitive juices flowing and brought very good submissions from several community members, both internal and external to LexisNexis. Being one of the judges proved quite difficult as it was hard to pick a single winner, and we ended up giving prizes (neat iPad minis) to both the winner and the runner-up. We also ended up disqualifying a submission from a clever to-remain-unnamed contestant who decided that it was a good idea to submit work done for his/her regular job as an entry to this contest (btw, the specific piece of work is quite sophisticated and very useful, and you will see it permeate into the platform in our upcoming 4.0.2 release).
And speaking of contributions, as one of the ideas floated within the conference, there is an ongoing effort to create entries of ECL code samples for the Rosetta Code project, so if you have some time to kill and/or a neat idea on how to implement one of the code examples in Rosetta Code in ECL, feel free to head over to their site and submit it there. These entries will surely be useful to people trying to get started in ECL and/or trying to learn new ECL coding tricks.
We'll continue to make some of the material used during the HPCC Summit 2013 publicly available over the next few weeks, but if you are particularly interested in something, please do not hesitate to ask.
After almost two years of continuous development, version 4.0 of the HPCC Systems platform has finally been released, with an impressive number of exciting new features and capabilities!
But the changes don't stop in the underlying platform, and face lifts have been given to the user interfaces, including a production release of the Eclipse IDE ECL plugin, and a technology preview of the next generation ECL Watch, using the latest web technologies to deliver a more streamlined and consistent user interface.
On the Roxie front, support for JSON queries (in addition to the existing SOAP access) and improvements to the Roxie "Packages" file management make it easier than ever to support Roxie queries.
And the new "ECL Bundles" functionality deserves an honorable mention: now it's very easy to package and distribute bundles of ECL capabilities (modules, functions, etc.) in a consistent manner, supporting encapsulation, versioning, dependencies, updates, licensing information and more. This should facilitate ECL code sharing both internally and externally, and a public repository of these ECL bundles is already in the works.
4.0 just went out, and I'm already looking forward to what 5.0 will bring!
As the end of the 2012 calendar year approaches, at least for a good chunk of the world (and may come to an end on 12/21 for a crazy bunch), some people start celebrating holidays in different cultures and countries. I consider this season a good time to go over things to come in the HPCC Systems platform arena.
Our 3.8 release is out (3.8.6-4 can be downloaded from here) and, while there may still be minor bug fixes (no, there are never bugs, just "under-appreciated features" that we may want to get rid of), 3.10 is well in the works. 3.10, which should be released in the next few weeks, changes the open source license to Apache 2.0 and brings a number of enhancements to the platform in different areas, as you can see from the commit history in our GitHub source code repository.
4.0, the next major release, is already in the plans and will bring a number of exciting new features, including improvements to our ECL Playground (if you haven’t played with it, I strongly recommend it), significant improvements to ECL Watch, support for very fast linear algebra through PB-BLAS (not to be confused with PBLAS) and some other interesting developments, such as more thorough integration with R and reporting tools through external connectors, improved SQL/JDBC connectivity on dynamic Roxie queries for interactive reporting tools, and improvements around documentation and usage examples.
In the next few weeks, also expect to see improvements to our Portal, with the addition of a Wiki for collaborative documentation and a description of our general HPCC Systems roadmap and ongoing projects, to help community members decide if they’d like to join any of these efforts. As part of this move, we are planning to include specific projects that could be good starters for some community members, so please let us know if you would like to tackle any of those.
During 2013, we will be actively working to continue raising awareness for the HPCC Systems platform, and will be specifically focused on community building activities (details coming later). And, from the Exciting Training Department, we are currently working on creating a significant amount of materials for our upcoming MOOC (Massive Open Online Courses), which will help you learn everything that you ever wanted to know about ECL, but were afraid to ask.
And now the shameless plug: Trish and I are tasked with the organization of the first Big Data track as part of the 2013 Symposium on Collaborative Technologies, in cooperation with ACM and IEEE, to happen in May, in San Diego, so please feel free, and a little bit compelled? :) to submit papers, present posters and let us know of any other way that you may want to help.
Now go and enjoy with your families and have a great holiday season! Happy Hacking!
I often get asked about comparing the HPCC Systems platform and Hadoop. As many of you probably know already, there are a number of substantial differences between them, and several of these differences are described here.
In a few words, HPCC and Hadoop are both open source projects released under an Apache 2.0 license, and are free to use, with both leveraging commodity hardware and local storage interconnected through IP networks, allowing for parallel data processing and/or querying across this architecture. But this is where most of the similarities end.
From a timeline perspective, HPCC was originally designed and developed about 12 years ago (1999-2000); our first patent around HPCC technology was even filed back in 2002, and HPCC was in production across our systems back in 2002. To put things in perspective, it wasn’t until December 2004 that the two researchers from Google described the distributed computing model based on Map and Reduce. The Hadoop project didn’t start until 2005, if I remember correctly, and it was around 2006 when it split from Nutch to become its own top level project.
This doesn’t mean that certain HPCC operations don’t use a scatter and gather model (equivalent to Map and Reduce) where applicable, but HPCC was designed under a different paradigm to provide a comprehensive, consistent, high-level and concise declarative dataflow oriented programming model, represented by the ECL language used throughout it. What this really means is that you can express data workflows and data queries in a very high level manner, avoiding the complexities of the underlying architecture of the system. While Hadoop has two scripting languages which allow for some abstractions (Pig and Hive), they don’t compare with the formal aspects, sophistication and maturity of the ECL language, which provides a number of benefits such as data and code encapsulation, the absence of side effects, the flexibility and extensibility through macros, functional macros and functions, and the libraries of production ready high level algorithms available.
One of the significant limitations of the strict MapReduce model utilized by Hadoop is the fact that internode communication is left to the Shuffle phase, which makes certain iterative algorithms that require frequent internode data exchange hard to code and slow to execute (as they need to go through multiple phases of Map, Shuffle and Reduce, each one of these representing a barrier operation that forces the serialization of the long tails of execution). In contrast, the HPCC Systems platform provides for direct inter-node communication at all times, which is leveraged by many of the high level ECL primitives. Another disadvantage for Hadoop is the use of Java as the programming language for the entire platform, including the HDFS distributed filesystem, which adds overhead from the JVM; in contrast, HPCC and ECL are compiled into C++, which executes natively on top of the Operating System, leading to more predictable latencies and overall faster execution (we have seen anywhere between 3 and 10 times faster execution on HPCC, compared to Hadoop, on the exact same hardware).
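To make the three phases concrete, here is a minimal single-process sketch in Python of a word count expressed as Map, Shuffle and Reduce. It is only an illustration of the model, not Hadoop or HPCC code, and the function names are hypothetical. Note that the shuffle step cannot complete until every mapper has emitted its pairs, which is the barrier described above.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit (key, value) pairs from each record independently."""
    for record in records:
        for word in record.split():
            yield word.lower(), 1


def shuffle_phase(pairs):
    """Shuffle: group values by key. On a real cluster this is the only
    step where data moves between nodes, and it acts as a barrier."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def map_reduce(records):
    return reduce_phase(shuffle_phase(map_phase(records)))


counts = map_reduce(["big data", "Big problems", "data data"])
print(counts)
```

An iterative algorithm would have to call `map_reduce` once per iteration, paying for the full shuffle barrier every time.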
The HPCC Systems platform, as you probably saw, has two components: a back-end batch oriented data workflow processing and analytics system called Thor (equivalent to Hadoop MapReduce), and a front-end real-time data querying and analytics system called Roxie (which has no equivalent in the Hadoop world). Roxie allows for real-time delivery and analytics of data through parameterized ECL queries (think of them as equivalent to stored procedures in your traditional RDBMS). The closest to Roxie that you have with Hadoop is HBase, which is a strict key/value store and, thus, provides only for very rudimentary retrieval of values by exact or partial key matching. Roxie, on the other hand, allows for compound keys, dynamic indices, smart stepping of these indices, aggregation and filtering, and complex calculations and processing.
But above all, the HPCC Systems platform presents the users with a homogeneous platform which is production ready and has been proven for many years in our own data services, from a company which has been in the Big Data Analytics business even before Big Data was called Big Data.
As I was preparing the Keynote that I delivered at World-Comp'12, about Machine Learning on the HPCC Systems platform, it occurred to me that it was important to remark that when dealing with big data and machine learning, most of the time and effort is usually spent on the data ETL (Extraction, Transformation and Loading) and feature extraction process, and not on the specific learning algorithm applied. The main reason is that while, for example, selecting a particular classifier over another could raise your F score by a few percentage points, not selecting the correct features, or failing to cleanse and normalize the data properly can decrease the overall effectiveness and increase the learning error dramatically.
This process can be especially challenging when the data used to train the model, in the case of supervised learning, or that needs to be subject to the clustering algorithm, in the case of, for example, a segmentation problem, is large. Profiling, parsing, cleansing, normalizing, standardizing and extracting features from large datasets can be extremely time consuming without the right tools. To make things worse, it can be very inefficient to move data during the process, just because the ETL portion is performed on a system different to the one executing the machine learning algorithms.
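As a small illustration of two of these steps, the sketch below cleanses a raw column and then standardizes it to z-scores. It is plain single-machine Python with hypothetical function names, not ECL or SALT, and it treats "cleansing" as simply dropping values that cannot be parsed as numbers, which is an assumption made for brevity.

```python
import statistics


def cleanse(raw_values):
    """Keep only the entries that parse as numbers."""
    cleaned = []
    for raw in raw_values:
        try:
            cleaned.append(float(str(raw).strip()))
        except ValueError:
            continue  # unparseable entry: drop it
    return cleaned


def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return [0.0 for _ in values]
    return [(v - mean) / stdev for v in values]


raw = [" 10", "20", "n/a", "30", ""]
features = standardize(cleanse(raw))
print(features)
```

On a large dataset each of these passes touches every record, which is why running them where the data already lives matters.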
While all these operations can be parallelized across entire datasets to reduce the execution time, there don't seem to be many cohesive options available to the open source community. Most (or all) open source solutions tend to focus on one aspect of the process, and there are entire segments of it, such as data profiling, where there seem to be no options at all.
Fortunately, the HPCC Systems platform includes all these capabilities, together with a comprehensive data workflow management system. Dirty data ingested on Thor can be profiled, parsed, cleansed, normalized and standardized in place, using either ECL, or some of the higher level tools available, such as SALT (see this earlier post) and Pentaho Kettle (see this page). And the same tools provide for distributed feature extraction and several distributed machine learning algorithms, making the HPCC Systems platform the open source one stop shop for all your big data analytics needs.
If you want to know more, head over to our HPCC Systems Machine Learning page and take a look for yourself.
More than 12 years ago, back in 2000, LexisNexis was pushing the envelope on what could be done to process and analyze large amounts of data with commercially available solutions at the time. The overall data size, combined with the large number of records and the complexity of the processing required made existing solutions non-viable. As a result, LexisNexis invented, from the ground up, a data-intensive supercomputer based on a parallel share-nothing architecture running on commodity hardware, which ultimately became the HPCC Systems platform.
To put this in a time perspective, it wasn't until 2004 (several years later) that a pair of researchers from Google published a paper on the MapReduce processing model, which fueled Hadoop a few years later.
The HPCC Systems platform was originally designed, tested and refined to specifically address big data problems. It can perform complex processing of billions (or even trillions) of records, allowing users to run analytics in their entire data repository, without resorting to sampling and/or aggregates. Its real-time data delivery and analytics engine (Roxie) can handle thousands of simultaneous transactions, even on complex analytical models.
As part of the original design, the HPCC Systems platform can handle disparate data sources, with changing data formats, incomplete content, fuzzy matching and linking, etc., which are paramount to LexisNexis' proprietary flagship linking technology known as LexID(sm).
But it is thanks to ECL, the high-level data-oriented declarative programming language powering the HPCC Systems platform, that this technology is truly unique. With advanced concepts such as data and code encapsulation, lazy evaluation, prevention of side effects, implicit parallelism and code reuse and extensibility, data scientists can focus on what needs to be done, rather than on superfluous details around the specific implementation. These characteristics make the HPCC Systems platform significantly more efficient than anything else available in the marketplace.
Last June, almost a year ago, LexisNexis decided to release its supercomputing platform, under the HPCC Systems name, giving enterprises the benefit of an open source data intensive supercomputer that can solve large and complex data challenges. One year later, HPCC Systems has made a name for itself and built an impressive Community. Moreover, the HPCC Systems platform has been named one of the top five "start-ups" to watch and has been included in a recent Gartner 2012 Cool IT Vendors report.
LexisNexis has made an impact in the marketplace with its strategic decision to open source the HPCC Systems platform: a bold and innovative decision that can only arise from a company which prides itself on being a thought leader when it comes to technology and Big Data analytics.
One of our community members recently asked about fraud detection using the HPCC Systems platform. The case that this person described involved identifying potentially fraudulent traders, who were performing a significant number of transactions over a relatively short time period. As I was responding to this post in our forums, and trying to keep the answer concise enough to fit in the forums format, I thought that it would be useful to have a slightly more extensive post, around ideas and concepts when designing an anomaly detection system on the HPCC Systems platform.
For this purpose I'll assume that, while it may be viable to come up with a set of rules (possibly a large one) that defines what normal activity looks like, it's probably infeasible to come up with rules that would describe every potential anomalous behavior (fraudsters can be very creative!). I will also assume that while, in certain cases, individual transactions could be flagged as anomalous due to characteristics in the particular data record, in the most common case it is through aggregates and statistical deviations that an anomaly can be identified.
The first thing to define is the number of significant dimensions (or features) the data has. If there is one dimension (or very few dimensions), where most of the significant variability occurs, it could be conceivable to manually define rules that, for example, would mark transactions beyond 3 or 4 sigma (standard deviations from the mean for the particular dimension) as suspicious. Unfortunately, things are not always so simple.
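A rule of this kind can be sketched in a few lines of Python. The function name and the sample amounts are made up for illustration, and the sketch assumes the mean and standard deviation are estimated from historical "normal" activity rather than from the batch being scored, so that a large outlier cannot inflate the sigma it is measured against.

```python
import statistics


def flag_beyond_sigma(baseline, candidates, threshold=3.0):
    """Return candidates more than `threshold` standard deviations
    away from the mean of the baseline values."""
    mean = statistics.fmean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        return [x for x in candidates if x != mean]
    return [x for x in candidates if abs(x - mean) / stdev > threshold]


# Hypothetical transaction amounts for a single dimension.
baseline = [100, 102, 98, 101, 99, 103, 97, 100]
suspicious = flag_beyond_sigma(baseline, [101, 96, 150, 40])
print(suspicious)
```

With this baseline the mean is 100 and the sample standard deviation is 2, so 96 sits at 2 sigma and passes, while 150 and 40 are flagged.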
Generally, there are multiple dimensions, and identifying by hand those that are the most relevant can be tricky. In other cases, performing some type of time series correlation can identify suspicious cases (for example, in the case of logs for a web application, seeing that someone has logged in from two locations a thousand miles apart in a short time frame could be a useful indicator). Fortunately, there are certain machine learning methodologies that can come to the rescue.
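The login example can be sketched as a simple "implied travel speed" check. This is an illustrative single-machine Python sketch with hypothetical names; the 600 mph cutoff and the sample coordinates are assumptions, and a real system would also have to deal with VPNs, proxies and imprecise geolocation.

```python
import math
from datetime import datetime

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def impossible_travel(logins, max_mph=600.0):
    """logins: iterable of (user, timestamp, lat, lon).
    Flags consecutive logins by the same user whose implied
    travel speed exceeds max_mph."""
    flagged = []
    last_seen = {}
    for user, ts, lat, lon in sorted(logins, key=lambda e: e[1]):
        if user in last_seen:
            prev_ts, prev_lat, prev_lon = last_seen[user]
            hours = (ts - prev_ts).total_seconds() / 3600
            miles = haversine_miles(prev_lat, prev_lon, lat, lon)
            if miles > 0 and (hours == 0 or miles / hours > max_mph):
                flagged.append((user, prev_ts, ts, round(miles)))
        last_seen[user] = (ts, lat, lon)
    return flagged


logins = [
    ("alice", datetime(2012, 5, 1, 9, 0), 40.7128, -74.0060),   # New York
    ("bob", datetime(2012, 5, 1, 9, 5), 33.7490, -84.3880),     # Atlanta
    ("alice", datetime(2012, 5, 1, 9, 30), 25.7617, -80.1918),  # Miami
    ("bob", datetime(2012, 5, 1, 10, 0), 33.7490, -84.3880),    # Atlanta
]
flagged = impossible_travel(logins)
print(flagged)
```

Only the two logins by "alice" are flagged, since covering the distance from New York to Miami in thirty minutes is not physically plausible.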
One way to tackle this problem is to assume that we can use historical data on good cases to train a statistical model (remember that bad cases are scarce and too variable). This is known as a semi-supervised learning technique, where you train your model only on the "normal" activity and expect to detect anomalous cases that exhibit characteristics which are different from the "norm". One specific method that can be used for this purpose is called PCA (Principal Components Analysis), which can automatically reduce the number of dimensions to those that present the largest significance (there is a loss of information as a consequence of this reduction, but this tends to be minimal compared to the value of reducing the computational complexity). KDE (Kernel Density Estimation) is another semi-supervised method to identify outliers. On the HPCC Systems platform, PCA is supported through our ECL-ML machine learning module. KDE is currently available on HPCC through the ECL integration with Paperboat.
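To illustrate the semi-supervised idea, the sketch below fits a one-dimensional Gaussian kernel density estimate on "normal" values only and flags new values that fall in regions of lower density than almost all of the training data. It is plain Python for illustration, not the ECL-ML or Paperboat implementation; the function names, the fixed bandwidth and the 5% quantile cutoff are all assumptions.

```python
import math


def gaussian_kde(training, bandwidth):
    """Return a density function estimated from 1-D training points."""
    norm = 1.0 / (len(training) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - t) / bandwidth) ** 2) for t in training
        )

    return density


def find_outliers(training, candidates, bandwidth=1.0, quantile=0.05):
    """Flag candidates whose estimated density is below the density
    found at the given low quantile of the training points."""
    density = gaussian_kde(training, bandwidth)
    train_scores = sorted(density(t) for t in training)
    cutoff = train_scores[int(quantile * len(train_scores))]
    return [c for c in candidates if density(c) < cutoff]


# Hypothetical "normal" activity, clustered around 10.
normal = [10, 11, 9, 10.5, 9.5, 10.2, 9.8, 10.1, 9.9, 10.3]
outliers = find_outliers(normal, [10.0, 25.0, 9.7, -3.0])
print(outliers)
```

Because the model only ever sees normal cases, no example of fraud is needed to train it; anything sufficiently unlike the training data is reported.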
A possibly more interesting approach is to use a completely unsupervised learning methodology. Using a clustering technique such as agglomerative hierarchical clustering, supported in HPCC as part of the ECL-ML machine learning module, can help identify those events which don't cluster easily. Another clustering method also available in ECL-ML, k-means, is less effective as it requires defining the number of centroids a priori, which could be very difficult. When using agglomerative hierarchical clustering, one of the aspects that could require some experimentation is identifying the number of iterations that gives the best effectiveness: too many iterations and there will be no outliers, as all the data will be clustered; too few iterations and many normal cases could still be outside of the clusters.
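The effect of the iteration count can be seen in a small sketch. The code below is a naive single-linkage agglomerative clustering in plain Python, written for illustration only (it is not the ECL-ML implementation, and the names and sample points are hypothetical). Points that are still alone in their cluster after the chosen number of merges are reported as outliers.

```python
import math
from itertools import combinations


def single_linkage(a, b):
    """Cluster distance: the closest pair of points across a and b."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(points, iterations):
    """Start with one cluster per point and merge the two closest
    clusters, repeating `iterations` times."""
    clusters = [[p] for p in points]
    for _ in range(iterations):
        if len(clusters) < 2:
            break
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters[j])  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


def find_outliers(points, iterations):
    """Points that remain in a singleton cluster."""
    return [c[0] for c in agglomerate(points, iterations) if len(c) == 1]


points = [(1, 1), (1.2, 1.1), (0.9, 1.0),
          (5, 5), (5.1, 4.9), (5.2, 5.1),
          (9, 0)]

too_few = find_outliers(points, 2)
about_right = find_outliers(points, 4)
too_many = find_outliers(points, 6)
print(too_few, about_right, too_many)
```

With two merges, two perfectly normal points are still unclustered; with four, only the isolated point at (9, 0) remains; with six, everything has been absorbed into a single cluster and nothing is reported.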
Beyond these specific techniques, the best possible approach probably includes a combination of methods. If there are clear rules that can quickly identify suspicious cases, those could be used to validate or rule out results from statistical algorithms. And since a strictly rules-based system would be ineffective at detecting every possible outlier, also using some of the machine learning methodologies described above would be highly recommended.