Blogs

HPCC/Hadoop data integration, or how the mighty Thor rides the yellow elephant

You probably thought that the HPCC Systems platform and Hadoop were two technologies at opposite ends of a spectrum, and that choosing one would make using the other unrealistic. If this is what you believed, think again (and keep reading).

The HPCC Systems platform has just released its Hadoop data integration connector. The HPCC/HDFS integration connector provides a way to seamlessly access data stored in your HDFS distributed filesystem from within the Thor component of HPCC. And, as an added bonus, it also allows you to write to HDFS from within Thor.

As you can see, this new feature creates several opportunities to leverage HPCC components alongside your existing Hadoop cluster. One such application would be to connect Roxie, the real-time distributed data analytics and delivery system that provides real-time access to complex data queries and analytics, to data processed in your Hadoop cluster. It would also allow you to leverage the distributed machine learning and linear algebra libraries that the HPCC platform offers through its ECL-ML (ECL Machine Learning) module. And if you needed a highly efficient and highly reliable data workflow processing system, you could take advantage of the HPCC Systems platform and ECL, or even combine it with Pentaho Kettle/Spoon, to add a graphical interface to ETL and data integration.

So what does it take to use the HPCC/HDFS connector (or H2H, as we like to call it)? Not much! The H2H connector has been packaged to include all the necessary components, which are to be deployed to every HPCC node. HPCC can coexist with Hadoop, or run on a different set of nodes (which is normally recommended for performance reasons).

How did we do it? We leveraged the capabilities of ECL to pipe data in and out of a running workunit, through the ECLPipe command, and we created some clever ECL Macros (did I mention before that ECL Macros are awesome?) to provide adequate data and function mappings from within an ECL program. Thanks to this, using H2H is transparent to the ECL software developer, and HDFS becomes just another type of data repository.

What are the gotchas? Well, HDFS is not as efficient as the distributed filesystem used by HPCC, so these reads and writes will not be any faster than HDFS allows (but they won't be noticeably slower either). Another caveat is that transparent access to compressed data (as normally provided by HPCC) is not available for data accessed from within HDFS (although decompression can be achieved easily in a following step, after the data is read).

I hope you are as excited as we are, about this HPCC/Hadoop data integration initiative. Please take a look at the H2H section of our HPCC Systems portal for more information: http://hpccsystems.com/H2H, and don't hesitate to send us your feedback. This HPCC/HDFS connector is still in beta stage, but we expect to have a 1.0 release very soon.

Flavio Villanustre

Real time classification on the HPCC Systems platform

It is not uncommon to find situations where a classification model needs to be trained using a very large amount of historic data, but the ability to perform classification of new data in real time is required. There are many examples of this need, from real time sentiment analysis in tweets or news, to anomaly detection for fraud or fault identification. The common theme in all these cases is that the value of the real time data feeds has a steep decrease over time, and delayed decisions taken on this data are significantly less effective.

When faced with this challenge, traditional platforms tend to fall short of expectations. Platforms that can deal with significant amounts of historical data and a very large number of features to create classification models (Hadoop is one example) have no good option for real time classification using these models. This type of problem is quite common, for example, in text classification. In these cases, people usually need to resort to different tools, and even homegrown systems using Python and a myriad of other tools, to cope with this real time need.

The problem with these homegrown tools is that they need to meet all the concurrency and availability requirements that real time systems impose, as these online systems usually fulfill critical internal or external roles for the business (the one anomaly that you just missed because your real time classifier didn't work properly could represent significant losses for the business).

What makes this even more challenging is the fact that, many times, it is desirable to retrieve and compare specific examples from the training set used to create the model, in real time too. And while developing a system that can classify data in real time using a pre-existing model may be quite doable, being able to also retrieve analogous or related cases would certainly require coupling the system with a database of sorts (just another moving part that adds complexity and cost to the system and potentially reduces its overall reliability).

But look no more, as the HPCC Systems platform may be just what you have been looking for all along: a consistent and homogeneous platform that provides for both functions, and a seamless workflow to move new and updated models, from the system where they are developed (Thor), to the real time classifier (Roxie).

At this point, it's probably worth explaining a little bit how Roxie works. Roxie is a distributed, highly concurrent and highly available, programmable data delivery system. Data queries (in a way equivalent to the stored procedures in your legacy RDBMS) are coded using ECL, which is the same high level data-oriented declarative programming language that powers Thor. Roxie is built for the most stringent high availability requirements, and the data and system redundancy factor is defined by the user at configuration time, with no single point of failure across the entire system. ECL code developed can be reused across both systems, Thor and Roxie.

A scenario like the one I described above, can be easily implemented in the HPCC Systems platform, using one of the classifiers provided by the ECL-ML (ECL Machine Learning) modules on Thor, and running your entire historical training set. To make this even more compelling, all the classifiers in ECL-ML have been designed with a common interface in mind, so plugging in a different classifier (for example, switching from a generative to a discriminative model) is as simple as changing a single line of ECL code. After a model (or several) is created, it can be tested on a test and/or verification set to validate it, and moved to Roxie for real time classification and matching. The entire training set can also be indexed and moved to Roxie, if real time retrieval of related records is required.

Powerful, simple, elegant, reliable. And every one of these components is available under an open source license, for you to play with.

For more information, head over to our HPCC Systems portal (http://hpccsystems.com).

New HPCC Systems Podcast

At HPCC Systems we have been very busy finding better ways to communicate with our Community. As a result of this, we have just released the first edition of our official HPCC Systems podcast, in which the Host and our Community Manager, Trish McCall, has a conversation with our senior trainer Bob Foreman around different aspects of the HPCC Systems platform, the ECL data-intensive programming language and some other topics that we hope you will find interesting.

In upcoming editions, we plan on having guests (Hint, hint! Let us know if you would like to be one of them!) covering new developments and the roadmap for HPCC Systems, discussions on specific capabilities around Machine Learning and Natural Language Processing, some coverage on SALT, our Scalable Automated Linking Technology, and much more.

For this first edition, Trish and Bob tried hard to keep the content under 30 minutes, which is just about perfect for a medium sized commute.

Don't waste a minute and head over to our podcasts page, or find it in iTunes. Please send us feedback and don't forget to rate it in iTunes, if you like it.

Flavio Villanustre

Consume your data with some SALT on it

Don't be surprised by the title: I'm not trying to play down the link between high blood pressure and a diet rich in Sodium. In the HPCC Systems platform world, SALT has a completely different meaning.

SALT is an acronym for Scalable Automated Linking Technology, and it's a programming environment support tool which functions as an ECL code generator to automatically produce ECL code for a variety of data integration applications, addressing common data processes based on a small configuration file of user-defined specification statements.

SALT has been designed from the ground up, as a system for effective record linking and clustering, but it also has capabilities around data ingest, data profiling, data hygiene, data source consistency monitoring and data quality, and data update management. In addition to this, SALT can generate the inverted data records required for Boolean Search Engine applications.

SALT offers many advantages when developing a new data-oriented application. SALT encapsulates a significant amount of ECL programming knowledge, experience, and best practices for the types of applications supported and can result in significant increases in developer productivity. It affords significant reductions in implementation time and cost over a hand-coded approach.

In case you wonder how this all works in practice, the SALT process begins with a user defined specification file. This is a text file with statements and parameters that define the data file and fields to be processed, and the associated processing options such as the module into which the generated code will be imported. The SALT command line interface reads the specification file, and based on various command line options, generates an output file in ECL, which can be easily imported into an existing ECL project. The resulting file includes attributes which can be executed on a Data Refinery (THOR) to perform the process for which the code has been generated, such as linking the records in a file. Depending on the process, SALT also generates code to define and build appropriate key files and queries for deployment to the Rapid Data Delivery Engine (Roxie).

In general, record linking fits into a general class of data processing known as data integration, which can be defined as the problem of combining information from multiple heterogeneous data sources. Data integration can include data preparation steps such as profiling, cleansing, normalization, and parsing and standardization of the raw input data prior to record linkage, to improve the quality of the input data and to make the data more consistent and comparable (these data preparation steps are sometimes referred to as ETL, or extract, transform, load). SALT provides data profiling and data hygiene applications to support the data preparation process. In addition, SALT provides a general data ingest application which allows input files to be combined or merged with an existing base file. You can also use SALT to generate a parsing and classification engine for unstructured data which can be used for data preparation. The data preparation steps are usually followed by the actual record linking or clustering process. SALT provides applications for several different types of record linking including internal, external, and remote.

Data profiling, data hygiene and data source consistency checking, while key components of the record linking process, have their own value within the data integration process. All of these are also supported by SALT and can be leveraged even when record linking is not a necessary part of a particular data work unit.

SALT uses advanced concepts such as term specificity to determine the relevance/weight of a particular field in the scope of the linking process, and a mathematical model based on the input data, rather than the need for hand coded user rules, which is key to the overall efficiency of the method.
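
To give a feel for what term specificity means, here is a minimal sketch in Python (it is not the ECL that SALT generates). It assumes that the specificity of a field value can be approximated as the negative base-2 logarithm of its relative frequency in the data, so that agreeing on a rare value counts for more than agreeing on a common one. The function names and the sample data are hypothetical.

```python
import math
from collections import Counter

def value_specificities(values):
    """Map each distinct value to -log2(relative frequency of that value)."""
    counts = Counter(values)
    total = len(values)
    return {value: -math.log2(count / total) for value, count in counts.items()}

def match_score(record_a, record_b, specs_by_field):
    """Sum the specificities of the fields on which two records agree."""
    score = 0.0
    for field, specs in specs_by_field.items():
        if record_a[field] == record_b[field]:
            score += specs[record_a[field]]
    return score

# Hypothetical sample: a common surname carries less weight than a rare one.
last_names = ["SMITH", "SMITH", "SMITH", "SMITH",
              "JONES", "JONES", "VILLANUSTRE", "NGUYEN"]
specs = value_specificities(last_names)
print(specs["SMITH"], specs["JONES"], specs["VILLANUSTRE"])  # 1.0 2.0 3.0
```

Because the weights come from the data itself, nobody has to hand code a rule saying that a match on a rare surname is stronger evidence than a match on a common one.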

For more information on SALT, head over to our HPCC Systems portal and take a look for yourself.

Flavio Villanustre

Sailing through machine learning on a PaperBoat

While the ECL-ML (ECL Machine Learning) libraries currently support a variety of prevalent algorithms in machine learning, there could always be the need for the one that has not been added just yet. And, the fact that ECL-ML provides a distributed linear algebra library, which greatly simplifies distributed vectorized implementations, is a blessing, but it still requires some coding in ECL to add new algorithms.

Fortunately, the folks over at Ismion, Inc., and particularly Nick Vasiloglou, have done something about it. They have ported their highly optimized PaperBoat library to HPCC, and made the integration so seamless (through some very clever abstraction layer based on ECL Macros) that PaperBoat functions are available as native ECL definitions.

The most interesting aspects of PaperBoat are around efficiency, and I couldn't say it more clearly than the authors themselves:

PaperBoat has been developed based on the C++ template metaprogramming principles. This approach makes PaperBoat easily configurable and efficient. For example the data are stored always on the minimum data precision needed. Every column is stored in the precision the user specifies. Columns with the same precision are stored next to each other. This triggers vectorization speedups offered in any modern preprocessor. Also libraries like BLAS/LAPACK/FLAME can speed up vector operations. Templatization also avoids the virtual function overhead and it allows the compiler to do extensive optimizations, since all the code is available at compile time. Our experiments showed 4x speedup over an implementation of the library with virtual functions. Another feature of PaperBoat is threading. All fundamental algorithms are tasks that are executed asynchronously. Synchronization of tasks is based on a data availability model inspired by datalog. Another advantage of the PaperBoat is the multidimensional indexing structures that can speed up orders of magnitude machine learning algorithms. Multidimensional trees can speed up things in two ways, either by clever stratified sampling algorithms or by clever geometric tricks, that lead to efficient branch and bound pruning.

The list of algorithms currently supported by PaperBoat is also enticing. While there is some overlap with what is currently available in ECL-ML, and ECL-ML could be preferable for massive amounts of data with a large number of features, PaperBoat could be a choice for less extreme cases. And the beauty of it is that the user gets to choose by changing a single line of ECL (or, why not, even run both and compare the results?).

One interesting algorithm that is available under PaperBoat and not ECL-ML yet, for example, is LASSO, a method for regression shrinkage and selection in linear models, which is favored by some people in the scoring and analytics industry (and if you're curious about LASSO, you can check out the original paper here: http://www-stat.stanford.edu/~tibs/lasso/lasso.pdf).

I hope that, by now, I have piqued your interest, so don't waste any more time and head over to the HPCC Systems portal, and then check out http://ismion.com/documentation/ecl-pb/index.html for a good tutorial on PaperBoat and ECL.

Flavio Villanustre

New release of our ECL Machine Learning libraries (ECL-ML)

A lot has happened since the version 1.0 release of our Machine Learning libraries. As you can see by checking out our ML portal (http://hpccsystems.com/ML), there are a ton of new algorithms, and significant improvements to existing ones.

Logistic regression, for example, has received a much needed revamp. This matters because discriminative classification methods tend to be more widely used than generative methods, due to their lower asymptotic error (if you're curious about this, you could check out this classic paper from Andrew Ng: http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf). However, if you are a generative methods fan, check out our Naive Bayes classifier implementation too.

But what is possibly more interesting is the fact that now all our classifiers, including perceptron, logistic regression and Naive Bayes, sit behind a unified classifier interface, which allows them to be swapped in and out very easily. This is extremely convenient if you need to see which classifier is the best choice for a particular problem, and/or if you want to learn multiple models at once.
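
To illustrate why a unified interface makes swapping so cheap, here is a minimal sketch in Python. It is not the ECL-ML interface: the `Classifier` base class, the `learn` and `classify` method names and the two toy classifiers are all hypothetical, chosen only to show that the evaluation code never needs to know which model it is running.

```python
from abc import ABC, abstractmethod
from collections import Counter, defaultdict

class Classifier(ABC):
    """Hypothetical common interface: every model learns, then classifies."""
    @abstractmethod
    def learn(self, rows, labels): ...
    @abstractmethod
    def classify(self, row): ...

class MajorityClass(Classifier):
    """Baseline that always predicts the most frequent training label."""
    def learn(self, rows, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self
    def classify(self, row):
        return self.label

class NearestCentroid(Classifier):
    """Predicts the label whose mean training vector is closest to the row."""
    def learn(self, rows, labels):
        grouped = defaultdict(list)
        for row, label in zip(rows, labels):
            grouped[label].append(row)
        self.centroids = {
            label: [sum(col) / len(members) for col in zip(*members)]
            for label, members in grouped.items()
        }
        return self
    def classify(self, row):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(row, centroid))
        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))

def accuracy(classifier, train, test):
    """Train on (rows, labels) and return the fraction of test rows classified correctly."""
    model = classifier.learn(*train)
    rows, labels = test
    return sum(model.classify(r) == y for r, y in zip(rows, labels)) / len(labels)

train = ([[0, 0], [0, 1], [5, 5], [6, 5], [5, 6]], ["a", "a", "b", "b", "b"])
test = ([[1, 0], [5, 4]], ["a", "b"])

# Swapping models is a one-line change; comparing them is a loop.
for candidate in (MajorityClass(), NearestCentroid()):
    print(type(candidate).__name__, accuracy(candidate, train, test))
```

On this toy data the baseline gets half of the test rows right and the nearest-centroid model gets both, which is exactly the kind of side by side comparison a common interface makes trivial.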

I can't leave the topic on classifiers, without mentioning classical decision trees, which are also part of this release.

Clustering now includes K-D Trees (http://en.wikipedia.org/wiki/K-d_tree), a very cool data structure widely used for efficient multi-dimensional nearest neighbor searches and GIS storage and retrieval, among others.

The discretizers, essentially providing functionality to allow for easy conversion of continuous values into discrete buckets that can be used for a classifier, have also been subject to a substantial review and improvement, with a much better and easier to use interface.
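
As a rough illustration of what a discretizer does, the following Python sketch buckets continuous values into equal-width intervals. This is only one possible strategy and the function name is hypothetical; the ECL-ML discretizers have their own interface and may bucket values differently.

```python
def equal_width_buckets(values, n_buckets):
    """Assign each value a bucket index in 0..n_buckets-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls in a single bucket.
        return [0 for _ in values]
    width = (hi - lo) / n_buckets
    # The maximum value would land in bucket n_buckets, so clamp it to the last one.
    return [min(int((v - lo) / width), n_buckets - 1) for v in values]

ages = [18, 22, 35, 41, 58, 63, 78]
print(equal_width_buckets(ages, 3))  # [0, 0, 0, 1, 2, 2, 2]
```

The resulting bucket indices can be fed to a classifier that expects discrete features, such as Naive Bayes.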

On the document n-gram extraction functions, the inclusion of a Porter stemmer (http://tartarus.org/~martin/PorterStemmer/) is a much welcome addition, to ensure that English word terminations and inflections don't get in the way of NLP related tasks.

A couple of other interesting newcomers are Singular Value Decomposition (SVD) and Principal Component Analysis (PCA), for dimensionality reduction, such as, for example, when building anomaly detection systems. SVD is also useful to deal with certain linguistic ambiguity problems and, if you're interested in this particular topic, this general tutorial should help: http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html.

One area that had significant development is visualization. With the addition of visualization components to ECL, collectively included within our VL sub-tree, HPCC provides a simple and straightforward way to display graphical charts representing the tabular data results, making it very convenient to quickly spot interesting patterns in those results. I can't emphasize enough the ease of use, and the fact that it doesn't require resorting to any external tools (batteries included here, too!).

In sum, if you are a machine learning professional, or even if you have some interest in highly scalable distributed machine learning implementations, head over to: http://hpccsystems.com/ML and take a look. You won't be disappointed.

Flavio Villanustre

What hardware should I use for my next HPCC?

I get asked frequently about the optimal configuration for a Thor or Roxie system. Besides the pro-forma "it depends" and "how will you use the system?" statements, I think it would be useful to describe what the guiding principles are, when defining the architecture for a given HPCC Systems platform implementation.

When it comes to Thor, one of the first variables that needs to be considered is the size of the input data that will be ingested into the system over a reasonable time frame. For the majority of applications, Thor ends up becoming a live archive for all data ever ingested or, at least, until the data becomes useless and/or must be destroyed due to regulatory and/or legal requirements. In addition to the size of the input data, the size of the output data and any space required for intermediate and temporary files need to be accounted for. A good rule of thumb is that the total amount of space needed for the system is usually about 2.5 to 3 times the size of the aggregated input data over time. If, as is usually recommended, there is a need for redundancy, the total amount of space needs to be doubled to account for the mirroring of the data slices on other nodes (although the use of RAID 5 or RAID 6 may reduce or eliminate the need for this extra redundancy). But deciding the amount of space needed in Thor is just the beginning: the next area that needs to be covered is related to processing time and overall system performance.
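
The sizing rule of thumb above is easy to turn into arithmetic. The Python sketch below is only a back-of-the-envelope helper: the function name and the 40 TB input figure are hypothetical, and the factors are the 2.5 to 3 times multiplier and the doubling for mirroring described in the previous paragraph.

```python
def thor_storage_estimate_tb(input_tb, working_factor=3.0, mirrored=True):
    """Estimate total Thor disk space in TB from the aggregated input size.

    working_factor covers input, output, intermediate and temporary files
    (2.5 to 3 per the rule of thumb); mirroring doubles the total.
    """
    total = input_tb * working_factor
    if mirrored:
        total *= 2
    return total

# Hypothetical system expected to accumulate 40 TB of input data over time.
low = thor_storage_estimate_tb(40, working_factor=2.5)
high = thor_storage_estimate_tb(40, working_factor=3.0)
print(f"{low:.0f} to {high:.0f} TB")  # 200 to 240 TB
```

Passing `mirrored=False` models the case where RAID 5 or RAID 6 is judged sufficient, which halves the estimate.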

Thor has been designed as a massively parallel data-intensive workflow processing system and, as such, is relatively tolerant of reasonable I/O latencies. Moreover, the majority of the operations require reading a certain amount of data from the drives (almost always sequentially, from beginning to end of the logical file slice residing in a particular node), performing some processing (which could be either entirely contained in memory, or require mostly sequential spills to drives) and eventually writing the output sequentially (again) to the drives. With this particular behavior in mind, when running a single Thor worker per node, it makes perfect sense to use SATA drives (even 7200 RPM 3.5" drives will do) and pick the largest available at the time. If more than a couple of workers are running on each node (which could be used, in certain cases, to take advantage of special hardware configurations), the additional concurrency could turn the drive I/O into a more random pattern, which would substantially impair the performance of SATA drives and would benefit significantly from SAS drives (particularly 2.5" 10,000 RPM drives). SSD drives, due to their high cost and limited capacity, and to the fact that their sequential read/write performance is not significantly better than that of mechanical drives, are not recommended for this environment.

The number of CPU cores in Thor is not one of the most critical aspects of the configuration, with the processing time usually overshadowed by the I/O read/write times, so even a single socket motherboard with a quad or six core CPU should be fine, for the majority of the applications.

The total amount of Random Access Memory across a Thor cluster determines if a particular task (subgraph) can be kept entirely in memory, or if it needs to be spilled to drives. For this reason, having enough memory to handle entire subgraphs in memory can speed up workunits considerably, as memory access is orders of magnitude faster than drive I/O.

Roxie, on the other hand, has a very different performance profile. On the one hand, Roxie workflows (queries) usually require a high amount of indirection due to indexed data access, and this has the side effect of introducing a large number of drive head seeks. On the other hand, Roxie tends to be quite sensitive to I/O latencies, particularly when the total query time needs to be kept to a minimum. For these reasons, it's desirable to consider only SAS drives (2.5" 10,000 RPM drives) or even SSD drives. The total amount of space required in Roxie is usually much smaller than in Thor, because primary indices have compressed payloads (and the data has been subject to some reduction at this point), and secondary keys are just references to the payloads present in the primary indices, so SSD drives, even with their limited space, could be sufficient.

Roxie is noticeably more demanding in terms of CPU utilization, and it's recommended to have as many CPU cores as possible to minimize latencies (Roxie multithreads and spreads the load across as many cores as there are in the node).

Memory, in Roxie, is not as critical for overall performance. However, certain in-memory operations can benefit the overall query times, for example when a small and frequently used dataset can be kept in RAM.

And last, but not least, the network interconnect for both systems needs to accommodate the bandwidth required to transfer the data efficiently between nodes, to avoid bottlenecks. While Gigabit Ethernet tends to be sufficient for the majority of the systems, high performance nodes could benefit from 10GE or even Infiniband (the latter offering 5 times as much bandwidth as 10GE, with an equivalent cost).

While the intent of this blog post is not to replace an expert analysis and architectural design phase for particular applications, it should be a sufficient guideline to estimate the overall hardware cost of a given HPCC related project, and provide some good rules of thumb and best practices for people implementing these platforms.

Flavio Villanustre

Sentiment Analysis on HPCC

A couple of days ago, I was discussing with a few colleagues how to get a small project on sentiment analysis going. I was explaining that the HPCC Systems platform, with its ECL-ML (ECL Machine Learning) libraries, currently has a good set of tools to get a large chunk of the work done, effortlessly.

When asked how to get started, I indicated that there are a few good public labeled corpus repositories that could be used to train a classifier, such as this one, or this one, or even this one. Of course, if you feel courageous enough, you could mine a corpus yourself from, for example, Twitter, following these guidelines.

But having a good training corpus is only the beginning of the story. You'll still need to decide how you'll architect your features and which classifier you will use, and there are some tricks of the trade that always need to be considered, if you want reasonable accuracy.

I would recommend, before doing anything else, taking a look at the latest version (1.2b) of the ECL-ML guide (yes, this is the docbook source, as the PDF for this version should be coming out this week, or possibly next), and particularly at the "Using ML with documents (ML.Docs)" section. Several enhancements and new functionality have been added since ECL-ML version 1.0, so I strongly encourage using the latest version of the libraries.

When you get to the definition of features, there are a couple of tricks that tend to be quite useful, specifically around the treatment of negations and semantically irrelevant frequent words, and also the use of Laplacian smoothing.

When dealing with unigrams and short n-grams, the inverse connotation created by negation can be non-obvious for the classifier, unless context is provided for the entire phrase. For example, a phrase that goes: "This movie isn't really good" could appear as positive, if the classifier ignores the fact that "isn't" is negating "good". For the English language, one simple trick is to parse the phrases and prepend "NOT_" to every word appearing after "not" and "n't". The net effect is potentially doubling the dictionary size, but the classifier can now accurately differentiate a "NOT_good" from a "good" and count the sentiment appropriately.
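
The negation trick can be sketched in a few lines of Python. This is an illustration rather than the ML.Docs implementation: the function name is hypothetical, and it adds one assumption that is not stated above, namely that a punctuation mark ends the negated span.

```python
import re

PUNCTUATION = set(".,!?;:")

def mark_negation(text):
    """Lowercase and tokenize text, prefixing NOT_ to words that follow a negation."""
    tokens = re.findall(r"[a-z']+|[.,!?;:]", text.lower())
    marked, negating = [], False
    for token in tokens:
        if token in PUNCTUATION:
            negating = False  # assumption: punctuation closes the negated span
            continue
        marked.append("NOT_" + token if negating else token)
        if token == "not" or token.endswith("n't"):
            negating = True
    return marked

print(mark_negation("This movie isn't really good"))
# ['this', 'movie', "isn't", 'NOT_really', 'NOT_good']
```

The tokens produced this way can then be counted as features, so "good" and "NOT_good" accumulate evidence for opposite sentiments.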

Frequent words not semantically relevant are also known as "stop words". There are some good lists out there, so procure one (or a couple) of them, like this one, or this one.

If you are planning to use Naive Bayes as the classifier (this applies to other classifiers too), and since, unfortunately, your training set will never contain every word that you will ever see, it's good practice to use Laplacian smoothing (multiplying by 0 is never good :) ), and reserve a little bucket of probabilities for those unseen words.
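
Here is a minimal sketch, in Python, of what that little bucket of probability looks like for the word likelihoods of a Naive Bayes model. The function names and the tiny corpus are hypothetical, add-one smoothing (`alpha=1.0`) is just one common choice, and class priors are left out on the assumption that both classes are equally likely.

```python
import math
from collections import Counter

def train_word_log_probs(docs_by_label, alpha=1.0):
    """Per label, return smoothed log P(word | label) plus one bucket for unseen words."""
    vocab = {w for docs in docs_by_label.values() for doc in docs for w in doc}
    model = {}
    for label, docs in docs_by_label.items():
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts.values())
        # The extra +1 slot reserves probability mass for words never seen in training.
        denom = total + alpha * (len(vocab) + 1)
        model[label] = {
            "seen": {w: math.log((counts[w] + alpha) / denom) for w in vocab},
            "unseen": math.log(alpha / denom),
        }
    return model

def score(model, label, words):
    """Sum of log likelihoods; unseen words fall back to the reserved bucket."""
    entry = model[label]
    return sum(entry["seen"].get(w, entry["unseen"]) for w in words)

docs = {
    "pos": [["good", "great"], ["good"]],
    "neg": [["bad"], ["NOT_good", "bad"]],
}
model = train_word_log_probs(docs)
review = ["good", "movie"]  # "movie" never appears in the training data
best = max(model, key=lambda label: score(model, label, review))
print(best)  # pos
```

Without smoothing, the unseen word "movie" would have probability zero under both classes and wipe out the evidence contributed by "good".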

And finally you'll need to decide on what classifier to use. While Naive Bayes tends to be an easy choice (and there is an excellent Naive Bayes implementation as part of the ECL-ML libraries), the strong independence assumption that Naive Bayes makes can lead to less than optimal accuracy. Other classifiers, such as Logistic Regression, Support Vector Machines and perceptrons usually make for a better choice. Good implementations of these algorithms are also part of the ECL-ML libraries, and utilizing any one of them from within HPCC is extremely simple (please refer to the documentation on ECL-ML linked above).

I hope this short guide serves as a good introduction to sentiment analysis and text classification on HPCC, and don't forget to take a look at the ECL-ML portal at: http://hpccsystems.com/ml.

Flavio Villanustre

Bottom up programming and ECL

I was having a conversation with a friend of mine yesterday, and we were discussing how programmers are not supposed to fight the language that they are using. If the language is an obstacle rather than a helper, it's probably time to look for a different programming language for that particular task.

As we were exchanging ideas around this, in the context of how ECL works (ECL as the programming language behind the HPCC Systems platform, not to be confused with this ECL), I remembered having read an excellent essay from Paul Graham, who wrote the book "On Lisp" back in 1993, which is, in my opinion, one of the best books on Lisp out there.

So I looked up the essay and read it, and to my surprise, the entire document is absolutely accurate if you replace every occurrence of the word "Lisp" with the word "ECL". One of the paragraphs that caught my attention is the following (I replaced "Lisp" with "ECL" for the reader's benefit):

"In ECL, you don't just write your program down toward the language, you also build the language up toward your program. As you're writing a program you may think "I wish ECL had such-and-such an operator." So you go and write it. Afterward you realize that using the new operator would simplify the design of another part of the program, and so on. Language and program evolve together. Like the border between two warring states, the boundary between language and program is drawn and redrawn, until eventually it comes to rest along the mountains and rivers, the natural frontiers of your problem. In the end your program will look as if the language had been designed for it. And when language and program fit one another well, you end up with code which is clear, small, and efficient.

It's worth emphasizing that bottom-up design doesn't mean just writing the same program in a different order. When you work bottom-up, you usually end up with a different program. Instead of a single, monolithic program, you will get a larger language with more abstract operators, and a smaller program written in it. Instead of a lintel, you'll get an arch."

The entire purpose behind the design of ECL, was to create a language that would grow with the programmer, with enough plasticity to become almost a DSL suited to solve the current problem, but general enough to be applied to analogous tasks too, reusing most of the code (and the data attached to it).

The high level data oriented primitives, the encouragement of purity across the language (but not to the point where it could become an obstacle to practical implementation) and the ability to encapsulate code and data behind standard interfaces provide an excellent collaborative development environment, which not only produces readable and reusable code, but also prevents a vast number of potential programming defects.

The declarative nature of the ECL programming paradigm lets the programmer focus on the task at hand and be less concerned with the specific implementation details. The implicit parallelism abstracts the details around multiprocessing and multiple nodes, allowing the exact same ECL program to run on a small virtual machine or a cluster composed of hundreds of nodes. The powerful optimizer works to produce an efficient execution plan, and the fact that ECL compiles to C++, and then to machine code, means that your process will run at native speed on the hardware, without any extra overhead.

So an extensible declarative data-oriented high level programming language, on top of an extremely high performance data-intensive parallel processing system: What else could you ask for?

Head over here and take a look for yourself (and maybe, while you are at it, take HPCC for a ride).

Flavio Villanustre

Recommended Items

I personally hate the Amazon recommendation system. As someone who has half a dozen major hobbies, who buys a lot of technical books, even more theology books and books for four different children, and whose wife is an English teacher, I find that the recommendations when I log in to Amazon are rather more eclectic than if they had picked half a dozen books at random.

That said, once I have a couple of books in my shopping cart the recommendations suddenly transform; I go from ignoring them all to having a compulsive need to buy exactly what they recommend. Often they will recommend the book I would have chosen first if only I had found it.

This shift is not my imagination and it is not magic; it is machine learning. Specifically it is a branch of machine learning referred to as Frequent Pattern Mining (FPM). FPM first appears in the literature in the early nineties; the concept is very very simple:

what groups of items have appeared together in more than 'N' different shopping carts over time?

If you can answer that question then you can pretty much answer the question: given they have these M items in their cart, what is the (M+1)th item most likely to be?
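
As a brute force illustration of that second question, the Python sketch below looks at every past cart that contains the current items and counts what else was in it. The function name and the carts are hypothetical, and this is a plain counting exercise rather than the ECL-ML implementation; following the definition above, an item only qualifies if it co-occurs in more than `n` carts.

```python
from collections import Counter

def recommend_next(carts, current_items, n=1):
    """Items that appeared alongside current_items in more than n past carts, most frequent first."""
    current = set(current_items)
    candidates = Counter()
    for cart in carts:
        items = set(cart)
        if current <= items:  # this cart contains everything in the current cart
            candidates.update(items - current)
    return [(item, count) for item, count in candidates.most_common() if count > n]

past_carts = [
    ["lisp book", "ecl guide", "coffee"],
    ["lisp book", "ecl guide"],
    ["lisp book", "ecl guide", "coffee"],
    ["lisp book", "tea"],
    ["ecl guide", "coffee"],
]
print(recommend_next(past_carts, ["lisp book", "ecl guide"]))  # [('coffee', 2)]
```

Scanning every cart for every request is exactly the kind of approach that stops working once the data grows, which is where precomputed frequent patterns come in.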

Simple to express, logically simple to code - and quite likely to bring your machine to its knees if you have any sizable amount of data!

Of course the HPCC platform was precisely designed to handle the problems that bring most machines to their knees; so as we began to build the core of our machine learning library it was one of the first problems we tackled.

There are three main algorithms for FPM: Apriori, Eclat & FP-Growth. We have started with the most popular, Apriori, and coded it two different ways:
a) Old school - simple code, fixed-width data structures & rely on the metal
b) New school - nested loops, variable data structures - less brute force
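
For readers who have not met Apriori before, here is a compact textbook-style sketch in Python. It is neither of the two ECL versions mentioned above: the function name and sample carts are hypothetical, and the code only shows the level-wise idea that a set of k items can be frequent only if all of its subsets of k-1 items are frequent. As in the definition earlier, an itemset is kept if it appears in more than `n` carts.

```python
from itertools import combinations

def apriori(carts, n):
    """Return {itemset: support} for every itemset found in more than n carts."""
    transactions = [frozenset(cart) for cart in carts]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}
    current = {s for s in current if support(s) > n}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) > n}
    return frequent

sample_carts = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
patterns = apriori(sample_carts, n=2)
print(len(patterns))  # 6: three single items and three pairs
```

The repeated support counting over all carts is what makes the naive version expensive, and it is the part that benefits most from a parallel platform.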

Once we have some performance stats I'll let you know which method wins ...

For many of you the question will be: so what? If you don't have a shopping cart - do you care about FPM? Well - you should. FPM can be used to track shopping carts but really applies to any groups of behaviors that can be placed in a session:

Cyber Security: FPM can tell if a given session has a common collection of properties (and thus if it has an anomalous collection of properties)
Document Clustering: If given collections of words frequently occur in documents then they probably form a centroid of a topic
Fraud Detection: Similar to cyber security - if you can detect all common patterns then you can find the odd ones.

To return to Amazon: the issue is that the 'simple averages' of behavior simply don't work once the data gets big enough. I have exhibited so many different behaviors that you cannot tell 'on average' what I'm about to do next. However, if you have FPM, then as soon as the first few parts of a behavior pattern are manifest, the rest becomes very predictable indeed.
