I personally hate the Amazon recommendation system. I have half a dozen major hobbies; I buy a lot of technical books, even more theology books, and books for four different children; and my wife is an English teacher. As a result, the recommendations I see when I log in to Amazon are rather more eclectic than if they had picked half a dozen books at random.
That said, once I have a couple of books in my shopping cart the recommendations suddenly transform; I go from ignoring them all to having a compulsive need to buy exactly what they recommend. Often they will recommend the book I would have chosen first if only I had found it.
This shift is not my imagination and it is not magic; it is machine learning. Specifically, it is a branch of machine learning referred to as Frequent Pattern Mining (FPM). FPM first appeared in the literature in the early nineties; the concept is very, very simple:
what groups of items have appeared together in more than 'N' different shopping carts over time?
If you can answer that question, then you can pretty much answer the next one: given that a shopper has these M items in the cart, what is the (M+1)th item most likely to be?
Simple to express, logically simple to code - and quite likely to bring your machine to its knees if you have any sizable amount of data!
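To make the question concrete, here is a minimal brute-force sketch in Python. It is only an illustration, not the HPCC code: the function names `frequent_itemsets` and `recommend_next`, the `max_size` cap and the sample carts are all assumptions made for this example. It counts every group of items in every cart, which is exactly the approach that stops scaling once the data gets large.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(carts, n, max_size=3):
    """Return {item_group: count} for groups seen in more than n carts.

    Brute force: every group of up to max_size items in every cart is counted.
    """
    counts = Counter()
    for cart in carts:
        items = sorted(set(cart))  # sorted so each group has one canonical tuple
        for size in range(1, max_size + 1):
            for group in combinations(items, size):
                counts[group] += 1
    return {group: c for group, c in counts.items() if c > n}

def recommend_next(cart, frequent):
    """Return the item that most often extends this cart to a frequent group, or None."""
    have = set(cart)
    best_item, best_count = None, 0
    for group, c in frequent.items():
        extra = set(group) - have
        # Keep only groups that are the current cart plus exactly one more item.
        if len(group) == len(have) + 1 and len(extra) == 1 and c > best_count:
            best_item, best_count = next(iter(extra)), c
    return best_item

# Hypothetical sample data, invented for this sketch.
carts = [
    ["tent", "stove", "lantern"],
    ["tent", "stove", "lantern"],
    ["tent", "stove", "lantern"],
    ["tent", "stove", "novel"],
    ["novel", "cookbook"],
    ["tent", "stove"],
]
frequent = frequent_itemsets(carts, n=2)
print(recommend_next(["tent", "stove"], frequent))  # prints: lantern
```

The number of groups per cart grows combinatorially with cart size, which is why the sketch caps the group size with `max_size`.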
Of course the HPCC platform was precisely designed to handle the problems that bring most machines to their knees; so as we began to build the core of our machine learning library it was one of the first problems we tackled.
There are three main algorithms for FPM: Apriori, Eclat and FP-Growth. We have started with the most popular, Apriori, and coded it two different ways:
a) Old school - simple code, fixed-width data structures & rely on the metal
b) New school - nested loops, variable data structures - less brute force
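For readers who have not met Apriori, the following is a textbook-style sketch in Python. It is neither of the two HPCC implementations described above; the function name `apriori` and the "more than n carts" threshold are assumptions chosen to match the question posed earlier. The idea is to work level by level: only groups whose every smaller subgroup is already frequent are counted at the next size up.

```python
from itertools import combinations

def apriori(carts, n):
    """Return {frozenset_of_items: count} for groups found in more than n carts."""
    baskets = [frozenset(c) for c in carts]

    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {group: c for group, c in counts.items() if c > n}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge pairs of frequent (k-1)-groups into candidate k-groups.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset of a candidate must itself be frequent.
            if all(frozenset(s) in current for s in combinations(union, k - 1)):
                candidates.add(union)

        # Count step: one pass over the carts for the surviving candidates.
        counts = {c: 0 for c in candidates}
        for basket in baskets:
            for c in candidates:
                if c <= basket:
                    counts[c] += 1

        current = {group: c for group, c in counts.items() if c > n}
        frequent.update(current)
        k += 1
    return frequent
```

The prune step is what separates this from counting every possible group: a group is never counted unless all of its smaller subgroups have already cleared the threshold.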
Once we have some performance stats I'll let you know which method wins ...
For many of you the question will be: so what? If you don't have a shopping cart - do you care about FPM? Well - you should. FPM can be used to track shopping carts but really applies to any groups of behaviors that can be placed in a session:
Cyber Security: FPM can tell if a given session has a common collection of properties (and thus if it has an anomalous collection of properties)
Document Clustering: If given collections of words frequently occur in documents then they probably form a centroid of a topic
Fraud Detection: Similar to cyber security - if you can detect all common patterns then you can find the odd ones.
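One way to picture the "find the odd ones" idea is sketched below in Python. This is an assumption about how such a check could look, not a description of any production system: the `familiarity` score, the restriction to pairs of properties and the sample sessions are all invented for illustration. A session scores high when most of its property pairs are common patterns and low when few or none are.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(sessions, n):
    """Return the set of property pairs seen in more than n sessions."""
    counts = Counter()
    for session in sessions:
        for pair in combinations(sorted(set(session)), 2):
            counts[pair] += 1
    return {pair for pair, c in counts.items() if c > n}

def familiarity(session, common):
    """Fraction of the session's property pairs that are common patterns (0.0 to 1.0)."""
    pairs = list(combinations(sorted(set(session)), 2))
    if not pairs:
        return 1.0  # nothing to compare, so nothing looks unusual
    return sum(pair in common for pair in pairs) / len(pairs)

# Hypothetical sessions, invented for this sketch.
sessions = [["login", "browse", "logout"]] * 4 + [["login", "export", "delete"]]
common = frequent_pairs(sessions, n=2)
for session in (sessions[0], sessions[-1]):
    print(session, familiarity(session, common))
```

A session with a score near zero shares almost nothing with the common patterns, which makes it a candidate for a closer look rather than proof of anything.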
To return to Amazon: the issue is that the 'simple averages' of behavior simply don't work once the data gets big enough. I have exhibited so many different behaviors that you cannot tell 'on average' what I'm about to do next. However, if you have FPM, then as soon as the first few parts of a behavior pattern are manifest, the rest becomes very predictable indeed.