One of our community members recently asked about fraud detection using the HPCC Systems platform. The case this person described involved identifying potentially fraudulent traders who were performing a significant number of transactions over a relatively short time period. As I was responding to this post in our forums, trying to keep the answer concise enough to fit the forum format, I thought it would be useful to write a slightly more extensive post on ideas and concepts for designing an anomaly detection system on the HPCC Systems platform.
For this purpose I'll assume that, while it may be viable to come up with a set of rules (possibly a large one) that defines what normal activity looks like, it's probably infeasible to come up with rules that describe every potentially anomalous behavior (fraudsters can be very creative!). I will also assume that while, in certain cases, individual transactions can be flagged as anomalous due to characteristics of the particular data record, in the most common case it is through aggregates and statistical deviations that an anomaly can be identified.
The first thing to define is the number of significant dimensions (or features) the data has. If most of the significant variability occurs in one dimension (or very few dimensions), it is conceivable to manually define rules that, for example, mark transactions beyond 3 or 4 sigma (standard deviations from the mean for the particular dimension) as suspicious. Unfortunately, things are not always so simple.
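To make the single-dimension case concrete, here is a minimal sketch of a sigma rule in Python. The data, the function names and the default threshold of 3 are illustrative assumptions, not part of any HPCC Systems API; on the platform the same aggregates would be computed in ECL.

```python
from statistics import mean, stdev

def fit_baseline(normal_values):
    """Return (mean, sample standard deviation) of known-normal values."""
    return mean(normal_values), stdev(normal_values)

def is_suspicious(value, mu, sigma, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the mean."""
    if sigma == 0:
        # Degenerate baseline: any deviation at all is unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical daily transaction counts for one trader during normal activity.
history = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]
mu, sigma = fit_baseline(history)

# Check three new daily counts against the baseline.
flags = [is_suspicious(v, mu, sigma) for v in (14, 18, 60)]
print(flags)
```

Note that the baseline is fitted only on activity believed to be normal; if anomalous values leak into it, they inflate the standard deviation and hide themselves.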
Generally, there are multiple dimensions, and identifying by hand those that are the most relevant can be tricky. In other cases, performing some type of time series correlation can identify suspicious cases (for example, in the case of logs for a web application, seeing that someone has logged in from two locations a thousand miles apart in a short time frame could be a useful indicator). Fortunately, there are certain machine learning methodologies that can come to the rescue.
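The login example can be sketched as a simple "impossible travel" check: compare consecutive logins for the same user and flag pairs that would require an implausible speed. This is only an illustration in Python; the event layout, the 600 mph speed limit and the 50 mile minimum distance are assumptions chosen for the example.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def impossible_travel(events, max_mph=600.0, min_miles=50.0):
    """Return (user, earlier_ts, later_ts) for consecutive logins that imply
    a speed above max_mph. events: iterable of (user, epoch_seconds, lat, lon)."""
    suspicious = []
    last_seen = {}
    for user, ts, lat, lon in sorted(events, key=lambda e: (e[0], e[1])):
        if user in last_seen:
            prev_ts, prev_lat, prev_lon = last_seen[user]
            miles = haversine_miles(prev_lat, prev_lon, lat, lon)
            hours = (ts - prev_ts) / 3600.0
            # Ignore short hops, which are often just geolocation noise.
            if miles > min_miles and (hours == 0 or miles / hours > max_mph):
                suspicious.append((user, prev_ts, ts))
        last_seen[user] = (ts, lat, lon)
    return suspicious

# Hypothetical login log: alice appears in New York, then Los Angeles 30 minutes later.
logins = [
    ("alice", 0, 40.71, -74.01),
    ("bob", 0, 40.71, -74.01),
    ("alice", 1800, 34.05, -118.24),
    ("bob", 6 * 3600, 42.36, -71.06),   # Boston, six hours later
]
suspicious = impossible_travel(logins)
print(suspicious)
```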
One way to tackle this problem is to assume that we can use historical data on good cases to train a statistical model (remember that bad cases are scarce and too variable). This is known as a semi-supervised learning technique, where you train your model only on the "normal" activity and expect to detect anomalous cases because they exhibit characteristics that differ from the "norm". One specific method that can be used for this purpose is PCA (Principal Component Analysis), which can automatically reduce the number of dimensions to those that account for most of the variance in the data (some information is lost in this reduction, but the loss tends to be minimal compared to the value of reducing the computational complexity). KDE (Kernel Density Estimation) is another semi-supervised method to identify outliers. On the HPCC Systems platform, PCA is supported through our ECL-ML machine learning module. KDE is currently available on HPCC through the ECL integration with Paperboat.
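One common way to use PCA for this purpose, sketched below in plain Python, is to fit the principal components on normal data only and then score new records by their reconstruction error, that is, how far they fall from the subspace spanned by the retained components. This is not the ECL-ML implementation: the hand-rolled power iteration, the single retained component, the sample data and the threshold of three times the largest training error are all assumptions made to keep the example self-contained.

```python
from math import sqrt

def column_means(rows):
    """Mean of each feature (column) across all rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def first_component(rows, means, iterations=200):
    """Leading eigenvector of the sample covariance matrix, via power iteration."""
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    dims, n = len(means), len(rows)
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    v = [1.0] * dims
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def reconstruction_error(row, means, component):
    """Distance between a record and its projection onto the component."""
    c = [x - m for x, m in zip(row, means)]
    score = sum(a * b for a, b in zip(c, component))
    residual = [a - score * b for a, b in zip(c, component)]
    return sqrt(sum(x * x for x in residual))

# Hypothetical normal activity: (trades per day, volume in thousands).
normal = [[10, 20.5], [12, 23.8], [14, 28.3], [16, 31.9], [18, 36.2], [20, 39.7]]
means = column_means(normal)
component = first_component(normal, means)

# Assumed threshold: three times the worst error seen on normal data.
threshold = 3 * max(reconstruction_error(r, means, component) for r in normal)

for record in ([15, 30.1], [15, 60.0]):
    err = reconstruction_error(record, means, component)
    print(record, "anomalous" if err > threshold else "normal")
```

In the sample data, volume grows roughly in proportion to the number of trades, so a record with an ordinary trade count but double the usual volume sits far from the principal axis even though neither value is extreme on its own.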
A possibly more interesting approach is to use a completely unsupervised learning methodology. A clustering technique such as agglomerative hierarchical clustering, supported in HPCC as part of the ECL-ML machine learning module, can help identify those events that don't cluster easily. Another clustering method available in ECL-ML, k-means, is less effective here because it requires defining the number of centroids a priori, which can be very difficult. With agglomerative hierarchical clustering, one aspect that may require some experimentation is the number of iterations that gives the best effectiveness: with too many iterations there will be no outliers, as all the data will have been clustered; with too few, many normal cases could still be outside of the clusters.
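The effect of the iteration count can be seen in a small sketch. The Python code below is a naive agglomerative clustering that merges the two closest clusters once per iteration and then reports the points still sitting alone. The single-linkage distance, the sample points and the "cluster of size one means outlier" rule are assumptions for illustration, and this is not how ECL-ML implements it.

```python
from math import dist  # Python 3.8+

def single_linkage(a, b):
    """Distance between two clusters: the closest pair of points across them."""
    return min(dist(p, q) for p in a for q in b)

def agglomerate(points, iterations):
    """Start with one cluster per point; merge the closest pair each iteration."""
    clusters = [[p] for p in points]
    for _ in range(iterations):
        if len(clusters) < 2:
            break
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def outliers(clusters, min_size=2):
    """Points that belong to clusters smaller than min_size."""
    return [p for c in clusters if len(c) < min_size for p in c]

# Hypothetical 2-D feature vectors: two tight groups and one stray point.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
          (5.0, 5.0), (5.1, 4.8), (4.9, 5.2),
          (10.0, 0.0)]

for n in (1, 4, 6):
    print(n, "iterations ->", outliers(agglomerate(points, n)))
```

With these points, one iteration leaves five points unclustered, four iterations isolate only the stray point, and six iterations absorb everything into a single cluster so that no outliers remain.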
Beyond these specific techniques, the best possible approach probably includes a combination of methods. If there are clear rules that can quickly identify suspicious cases, those can be used to validate or rule out results from the statistical algorithms. And since a strictly rules-based system would be ineffective at detecting every possible outlier, using some of the machine learning methodologies described above as well is highly recommended.