ML - calculate Euclidean distance

Topics related to the set of Machine Learning libraries and Matrix processing algorithms

Tue Aug 25, 2015 12:48 pm

I have just started exploring the HPCC ML module. I am trying to use KNN to classify my test dataset (which contains 4 feature fields). Please let me know if my approach is right:

    1. Read my training and test data files into datasets.
    2. Use ML.ToField on each dataset.
    3. Call ML.Cluster.Distances with the training and test datasets as parameters. Does this compute the Euclidean distance between every row on the left and every row on the right, taking all features into account? Is Euclidean the default for the 3rd parameter of Distances?
    4. Call ML.Cluster.Closest on the result of the previous step. Does this compute the closest neighbour for each row? How do I pass x to it to get the x closest neighbours?
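
For reference, the arithmetic that steps 2 and 3 are meant to perform can be sketched outside ECL. The Python below is only an illustration, not the ML.Cluster.Distances implementation: it assumes that ML.ToField turns each row into (id, number, value) cells, and the helper names to_field and pairwise_euclidean are made up for this example.

```python
import math
from collections import defaultdict

def to_field(rows):
    """Flatten dense rows into (id, number, value) cells, numbered from 1.

    Hypothetical stand-in for the long format assumed above.
    """
    return [(rid, num, float(val))
            for rid, row in enumerate(rows, start=1)
            for num, val in enumerate(row, start=1)]

def pairwise_euclidean(left_cells, right_cells):
    """Return {(left_id, right_id): distance} for every left/right row pair."""
    # Index the right-hand cells by field number so cells can be matched up.
    right_by_number = defaultdict(list)
    for rid, num, val in right_cells:
        right_by_number[num].append((rid, val))

    # Accumulate squared differences per (left row, right row) pair.
    squared = defaultdict(float)
    for lid, num, lval in left_cells:
        for rid, rval in right_by_number[num]:
            squared[(lid, rid)] += (lval - rval) ** 2

    return {pair: math.sqrt(total) for pair, total in squared.items()}

# One test row against two training rows, 4 features each.
test_rows = [[0.0, 0.0, 0.0, 0.0]]
train_rows = [[3.0, 4.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
distances = pairwise_euclidean(to_field(test_rows), to_field(train_rows))
```

Here distances holds one entry per (test id, training id) pair, so every row on the left is compared with every row on the right across all 4 features.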

- Gayathri
Posts: 75
Joined: Wed May 08, 2013 5:03 am

Tue Aug 25, 2015 3:55 pm

Your steps 1 and 2 are correct. But I believe you want a KNN supervised learning algorithm, i.e. you use your training set with a learning algorithm to learn some kind of model, which you then use for classification. With KNN, the model is simply all the rows of your training set. The classifier compares the new rows (the independent variables, also called features or X) with those of your training set, and the class (the dependent variable, or Y) assigned to each new row is that of the closest training rows.

Look at ML.Tests.Explanatory.KNN_KDTree.ecl, which is an example using KNN_KDTree (in ML.Lazy.ecl). The same module also contains KNN, which you use just like KNN_KDTree.
Posts: 240
Joined: Mon May 07, 2012 6:23 pm

Wed Aug 26, 2015 10:08 am

Yes Tim, I want to implement supervised learning using KNN and I am using Euclidean distance for measurement. I have 2 labelled sets - training set and test set.

I want to use training set to learn and predict for the test set so that I can cross-verify predictions with labels from test set.

This is what I want to do:
for each row of the test set
    Compute the Euclidean distance (over all features) to every row of the training set
    Take the k closest distances
    Assign the most frequent label among the k neighbours as the current row's label
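
The loop above can be sketched in plain Python to check the logic before porting it to ECL. This is a minimal illustration under two assumptions: the label assigned is the most frequent one among the k neighbours (a majority vote), and the function name knn_predict and the sample data are invented for the example. It does not use or represent the HPCC ML module's KNN.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, test_x, k):
    """Predict a label for each test row by majority vote of its k nearest training rows."""
    predictions = []
    for row in test_x:
        # Distance from this test row to every training row, paired with that row's label.
        scored = sorted(
            (math.dist(row, train_row), label)
            for train_row, label in zip(train_x, train_y)
        )
        # Labels of the k closest training rows.
        k_labels = [label for _, label in scored[:k]]
        # Most frequent label among those neighbours.
        predictions.append(Counter(k_labels).most_common(1)[0][0])
    return predictions

# Invented data: 4 features per row, two classes.
train_x = [[1, 1, 1, 1], [1, 2, 1, 1], [8, 8, 8, 8], [9, 8, 8, 9]]
train_y = ["a", "a", "b", "b"]
test_x = [[1, 1, 2, 1], [8, 9, 8, 8]]
test_y = ["a", "b"]

predictions = knn_predict(train_x, train_y, test_x, k=3)
# Cross-check the predictions against the known test labels.
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
```

Comparing predictions with test_y at the end corresponds to cross-verifying the predictions against the labels of the test set. math.dist needs Python 3.8 or later.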

To implement this, given a training matrix X and a test matrix Y, for each row y in Y and each row x in X, I need to compute sqrt((x1-y1)^2 + (x2-y2)^2 + ...). Will I be able to achieve this using ML.Cluster.Distances?

- Gayathri
Posts: 75
Joined: Wed May 08, 2013 5:03 am

Wed Aug 26, 2015 4:12 pm

You might be able to use ML.Cluster.Distances, but I have a feeling it will be difficult because that function was set up only for the clustering algorithms in ML.Cluster.
Posts: 240
Joined: Mon May 07, 2012 6:23 pm

Wed Jul 13, 2016 10:37 pm

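Code:
IMPORT ML.Types AS Types;

// Euclidean distance between two rows held as NumericField cells (one cell per feature).
REAL euclidean_distance(DATASET(Types.NumericField) a, DATASET(Types.NumericField) b) := FUNCTION
    // Match cells by field number and square each difference.
    temp := JOIN(a, b, LEFT.number = RIGHT.number,
                 TRANSFORM(Types.NumericField,
                           SELF.id := -1; // placeholder, the id is not used
                           SELF.number := LEFT.number;
                           SELF.value := POWER(LEFT.value - RIGHT.value, 2)));
    RETURN SQRT(SUM(temp, temp.value));
END;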
Posts: 11
Joined: Wed Oct 15, 2014 3:43 am
