Sun Jun 24, 2018 10:41 am
Login Register Lost Password? Contact Us


Calculate Median

Topics related to the set of Machine Learning libraries and Matrix processing algorithms

Sat Dec 01, 2012 4:51 pm Change Time Zone

Hi,

How to calculate the median for a particular column..?

Tried looking at the concept of FieldAggregates.Medians, but not clear about the usage.

Can anybody share a example on how to use Median..?

Can median be used for cross tab reports (to calculate median based on a grouped value)...?

Kindly help me regarding this.

Thanks a lot in advance.

Regards,
ksviswa
ksviswa
 
Posts: 129
Joined: Sat Jun 09, 2012 9:43 am

Mon Dec 03, 2012 8:40 pm Change Time Zone

ksviswa,

This code shows you how it works:
Code: Select all
IMPORT ML,STD;

ds := DATASET([{20120101},{20120102},{20120103},{20120104},{20120105},{20120106}],{integer d});

ML.Types.NumericField XF(ds L, integer C) := TRANSFORM
  SELF.id := C;
   SELF.number := 1;
   SELF.value := STD.Date.FromJulianDate(L.D);
END;

P := PROJECT(ds,XF(LEFT,COUNTER));

AVG2(REAL L, REAL R) := AVE(L,R);
DateStr(REAL jdate) := STD.Date.ToString(STD.Date.ToJulianDate(ROUND(jdate)),'%Y-%m-%d');

Simples := ML.FieldAggregates(P).Simple;
Minval  := Simples[1].minval;
Maxval  := Simples[1].maxval;
Meanval := Simples[1].mean;

Quarts := ML.FieldAggregates(P).NTileRanges(4);
Quart1 := AVG2(Quarts[1].max,Quarts[2].min);
// Quart2 := AVG2(Quarts[2].max,Quarts[3].min);
Quart2 := ML.FieldAggregates(P).Medians[1].median;
Quart3 := AVG2(Quarts[3].max,Quarts[4].min);

OutDS := DATASET([{'Min',DateStr(Minval)},
                           {'1st Qu',DateStr(Quart1)},
                           {'Median',DateStr(Quart2)},
                           {'Mean',DateStr(Meanval)},
                           {'3rd Qu',DateStr(Quart3)},
                           {'Max',DateStr(Maxval)}],
                 {STRING10 Prompt,STRING10 Val});
OutDS;
HTH,

Richard
rtaylor
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 1359
Joined: Wed Oct 26, 2011 7:40 pm

Mon Dec 03, 2012 11:02 pm Change Time Zone

Hi ksviswa,

Here's how you can use the FieldAggregates.Medians:

IMPORT ML;

myData := DATASET([{1,1,1}, {2,2,1}, {3,5,2}, {4,7,6}],{UNSIGNED rid; REAL f1; REAL f2;});
// Turn into regular NumericField file (with continuous variables)
ML.ToField(myData,o);
ML.FieldAggregates(o).Medians;

The output will have median values for both f1 and f2 columns.

Please make sure your ecl-ml code is up to date. You can get the most recent version at : https://github.com/hpcc-systems/ecl-ml

Edin
edin.muharemagic@lexisnexis.com
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 3
Joined: Wed Jul 13, 2011 9:38 pm

Tue Dec 04, 2012 6:02 am Change Time Zone

Thanks Richard and Edin,

The recent code from github solved the median calculation issue for a particular column..

How the same can be applied to a group..?

Sample Input to calculate median based on a grouped value.

Code: Select all

'A','Boston',23
'B','Boston',43
'C','Boston',29

'D','Chicago',15

'E','NY',12
'F','NY',55
'G','NY',57
'H','NY',61



Problem Statement : Median age per city

Sample Output :

Code: Select all

city       MedianAge
---------- -----------
Boston     29
Chicago    15
NY         56




Can we use FieldAggregates.Medians to compute this result or any other logic to be implemented.

Any pointers regarding the same would be highly appreciated.

Thanks a lot in advance.

Thanks and Regards,
ksviswa
ksviswa
 
Posts: 129
Joined: Sat Jun 09, 2012 9:43 am

Tue Dec 04, 2012 3:17 pm Change Time Zone

ksviswa,

Try this:
Code: Select all
IMPORT ML,STD;

rec := RECORD
  STRING1 Ltr;
  STRING10 City;
   unsigned1 age;
END;

ds := dataset([ {'A','Boston',23},
                        {'B','Boston',43},
                        {'C','Boston',29},
                        {'D','Chicago',15},
                        {'E','NY',12},
                        {'F','NY',55},
                        {'G','NY',57},
                        {'H','NY',61}],rec);

t1 := SORT(TABLE(ds,{city},City),City);                        

CityNums := PROJECT(t1,TRANSFORM({UNSIGNED4 Num,STRING10 City},SELF.Num := COUNTER;SELF := LEFT));

ML.Types.NumericField XF(ds L, integer C) := TRANSFORM
  SELF.id := C;
   SELF.number := CityNums(City = L.City)[1].Num;
   SELF.value := L.age;
END;

P := PROJECT(ds,XF(LEFT,COUNTER));

ML.FieldAggregates(P).Medians;
HTH,

Richard
rtaylor
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 1359
Joined: Wed Oct 26, 2011 7:40 pm

Wed Dec 05, 2012 7:49 am Change Time Zone

Hi,

Thanks a lot..:)

Regards,
ksviswa
ksviswa
 
Posts: 129
Joined: Sat Jun 09, 2012 9:43 am

Mon Jan 30, 2017 7:17 pm Change Time Zone

Perhaps not most efficient, but in a situation where I couldn't directly use the ML package I decided to calculate the median by taking the middle value(s):
Code: Select all
rec := RECORD
  STRING1 Ltr;
  STRING10 City;
  unsigned1 age;
END;

ds := dataset([ {'A','Boston',23},
{'B','Boston',43},
{'C','Boston',29},
{'D','Chicago',15},
{'E','NY',12},
{'F','NY',55},
{'G','NY',57},
{'H','NY',11}],rec);

// count records per city
t_counts_format := RECORD
   STRING10 City:=ds.City;
   UNSIGNED3 recordCount:=COUNT(GROUP);
END;
t_counts := TABLE(
   ds,
   t_counts_format,
   ds.City
);
// join ds with count
ds_city_count_layout:=RECORD
  STRING1 Ltr;
  STRING10 City;
  unsigned1 age;
   UNSIGNED3 recordCount;
END;
ds_city_count_layout join_ds_city_count(ds L, t_counts R):=TRANSFORM
   SELF:=L;
   SELF.recordCount:=R.recordCount;
END;
ds_city_count:=JOIN(
   ds,
   t_counts,
   LEFT.City=RIGHT.City
);

//iterate and only keep the median(s): 1 value for uneven recordCount, 2 values if the group count is even.
ds_city_count_grouped_city := GROUP(SORT(ds_city_count,City,age),City);  //then group them
ds_city_count_layout MedianValues(ds_city_count_grouped_city L, INTEGER C) :=
   TRANSFORM, SKIP((L.recordCount % 2 = 0 AND (C < L.recordCount/2 OR C > ((L.recordCount/2)+1))) OR (L.recordCount % 2 != 0 AND C != ((L.recordCount+1)/2)))
   SELF := L;
END;
ds_city_count_grouped_city_median_values := PROJECT(ds_city_count_grouped_city,
          MedianValues(LEFT,COUNTER));
// median is now average of values in the group.
city_median_age_format:=RECORD
      STRING10 City:=ds_city_count_grouped_city_median_values.City;      
      UDECIMAL10_5 median_age:=AVE(GROUP,ds_city_count_grouped_city_median_values.age);
END;
city_median_age:=TABLE(
   ds_city_count_grouped_city_median_values,
   city_median_age_format,
   ds_city_count_grouped_city_median_values.City
);
OUTPUT(city_median_age);
jeroenbaas
 
Posts: 11
Joined: Mon Sep 30, 2013 2:20 pm

Tue Jan 31, 2017 3:51 pm Change Time Zone

Jeroen,

Here's a slightly different approach:
Code: Select all
rec := RECORD
  STRING1 Ltr;
  STRING10 City;
  UNSIGNED1 age;
END;

ds := DATASET([ {'A','Boston',23},
                {'B','Boston',43},
                {'C','Boston',29},
                {'D','Chicago',15},
                {'E','NY',12},
                {'F','NY',55},
                {'G','NY',57},
                {'H','NY',11}],rec);

DNrec := RECORD
  STRING10 City;
  DATASET({UNSIGNED1 age}) ages;
END;
pds := PROJECT(TABLE(ds,{city},city),
               TRANSFORM(DNrec,SELF.Ages := [],SELF := LEFT));
dnds := DENORMALIZE(pds, ds,
                    LEFT.city = RIGHT.city,
                    TRANSFORM(DNrec,
                              SELF.Ages := LEFT.Ages +
                                           ROW({RIGHT.age},{UNSIGNED1 age});
                              SELF := LEFT));
Outrec := RECORD
  STRING10     City;
  UDECIMAL10_1 MedianAge;
END;
OutRec XF2(dnds L) := TRANSFORM
  AgeCnt := COUNT(L.ages);
  S_ages := SORT(L.ages,age);
  MidRec := AgeCnt - (AgeCnt DIV 2);
  SELF.MedianAge := IF(AgeCnt % 2 = 1,
                       S_ages[MidRec].age, 
                       (S_ages[MidRec].age + S_ages[MidRec+1].age)/2);
  SELF := L;   
END;
PROJECT(dnds,XF2(LEFT));
I created a nested child dataset (using your example data) of cities and their ages to achieve the required data grouping. All the real work is done in the TRANSFORM for the final PROJECT.

I ran each version several times, and this version seems to be consistently faster than your original code (although that could change given large amounts of data -- so YMMV :) );

HTH,

Richard
rtaylor
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 1359
Joined: Wed Oct 26, 2011 7:40 pm


Return to Machine Learning

Who is online

Users browsing this forum: No registered users and 0 guests

cron