Wed Jun 23, 2021 3:47 pm
Login

## Calculate Median

Topics related to the set of Machine Learning libraries and Matrix processing algorithms
Hi,

How to calculate the median for a particular column..?

Tried looking at the concept of FieldAggregates.Medians, but not clear about the usage.

Can anybody share a example on how to use Median..?

Can median be used for cross tab reports (to calculate median based on a grouped value)...?

Kindly help me regarding this.

Thanks a lot in advance.

Regards,
ksviswa
ksviswa

Posts: 129
Joined: Sat Jun 09, 2012 9:43 am

ksviswa,

This code shows you how it works:
Code: Select all
`IMPORT ML,STD;ds := DATASET([{20120101},{20120102},{20120103},{20120104},{20120105},{20120106}],{integer d});ML.Types.NumericField XF(ds L, integer C) := TRANSFORM  SELF.id := C;   SELF.number := 1;   SELF.value := STD.Date.FromJulianDate(L.D);END;P := PROJECT(ds,XF(LEFT,COUNTER));AVG2(REAL L, REAL R) := AVE(L,R);DateStr(REAL jdate) := STD.Date.ToString(STD.Date.ToJulianDate(ROUND(jdate)),'%Y-%m-%d');Simples := ML.FieldAggregates(P).Simple;Minval  := Simples.minval;Maxval  := Simples.maxval;Meanval := Simples.mean;Quarts := ML.FieldAggregates(P).NTileRanges(4);Quart1 := AVG2(Quarts.max,Quarts.min);// Quart2 := AVG2(Quarts.max,Quarts.min);Quart2 := ML.FieldAggregates(P).Medians.median;Quart3 := AVG2(Quarts.max,Quarts.min);OutDS := DATASET([{'Min',DateStr(Minval)},                           {'1st Qu',DateStr(Quart1)},                           {'Median',DateStr(Quart2)},                           {'Mean',DateStr(Meanval)},                           {'3rd Qu',DateStr(Quart3)},                           {'Max',DateStr(Maxval)}],                 {STRING10 Prompt,STRING10 Val});OutDS;`
HTH,

Richard
rtaylor
Community Advisory Board Member Posts: 1582
Joined: Wed Oct 26, 2011 7:40 pm

Hi ksviswa,

Here's how you can use the FieldAggregates.Medians:

IMPORT ML;

myData := DATASET([{1,1,1}, {2,2,1}, {3,5,2}, {4,7,6}],{UNSIGNED rid; REAL f1; REAL f2;});
// Turn into regular NumericField file (with continuous variables)
ML.ToField(myData,o);
ML.FieldAggregates(o).Medians;

The output will have median values for both f1 and f2 columns.

Please make sure your ecl-ml code is up to date. You can get the most recent version at : https://github.com/hpcc-systems/ecl-ml

Edin
edin.muharemagic@lexisnexis.com
Community Advisory Board Member Posts: 3
Joined: Wed Jul 13, 2011 9:38 pm

Thanks Richard and Edin,

The recent code from github solved the median calculation issue for a particular column..

How the same can be applied to a group..?

Sample Input to calculate median based on a grouped value.

Code: Select all
`'A','Boston',23'B','Boston',43'C','Boston',29'D','Chicago',15'E','NY',12'F','NY',55'G','NY',57'H','NY',61`

Problem Statement : Median age per city

Sample Output :

Code: Select all
`city       MedianAge---------- -----------Boston     29Chicago    15NY         56`

Can we use FieldAggregates.Medians to compute this result or any other logic to be implemented.

Any pointers regarding the same would be highly appreciated.

Thanks a lot in advance.

Thanks and Regards,
ksviswa
ksviswa

Posts: 129
Joined: Sat Jun 09, 2012 9:43 am

ksviswa,

Try this:
Code: Select all
`IMPORT ML,STD;rec := RECORD  STRING1 Ltr;  STRING10 City;   unsigned1 age;END;ds := dataset([ {'A','Boston',23},                        {'B','Boston',43},                        {'C','Boston',29},                        {'D','Chicago',15},                        {'E','NY',12},                        {'F','NY',55},                        {'G','NY',57},                        {'H','NY',61}],rec);t1 := SORT(TABLE(ds,{city},City),City);                        CityNums := PROJECT(t1,TRANSFORM({UNSIGNED4 Num,STRING10 City},SELF.Num := COUNTER;SELF := LEFT));ML.Types.NumericField XF(ds L, integer C) := TRANSFORM  SELF.id := C;   SELF.number := CityNums(City = L.City).Num;   SELF.value := L.age;END;P := PROJECT(ds,XF(LEFT,COUNTER));ML.FieldAggregates(P).Medians;`
HTH,

Richard
rtaylor
Community Advisory Board Member Posts: 1582
Joined: Wed Oct 26, 2011 7:40 pm

Hi,

Thanks a lot.. Regards,
ksviswa
ksviswa

Posts: 129
Joined: Sat Jun 09, 2012 9:43 am

Perhaps not most efficient, but in a situation where I couldn't directly use the ML package I decided to calculate the median by taking the middle value(s):
Code: Select all
`rec := RECORD  STRING1 Ltr;  STRING10 City;  unsigned1 age;END;ds := dataset([ {'A','Boston',23},{'B','Boston',43},{'C','Boston',29},{'D','Chicago',15},{'E','NY',12},{'F','NY',55},{'G','NY',57},{'H','NY',11}],rec);// count records per cityt_counts_format := RECORD   STRING10 City:=ds.City;   UNSIGNED3 recordCount:=COUNT(GROUP);END;t_counts := TABLE(   ds,   t_counts_format,   ds.City);// join ds with countds_city_count_layout:=RECORD  STRING1 Ltr;  STRING10 City;  unsigned1 age;   UNSIGNED3 recordCount;END;ds_city_count_layout join_ds_city_count(ds L, t_counts R):=TRANSFORM   SELF:=L;   SELF.recordCount:=R.recordCount;END;ds_city_count:=JOIN(   ds,   t_counts,   LEFT.City=RIGHT.City);//iterate and only keep the median(s): 1 value for uneven recordCount, 2 values if the group count is even.ds_city_count_grouped_city := GROUP(SORT(ds_city_count,City,age),City);  //then group themds_city_count_layout MedianValues(ds_city_count_grouped_city L, INTEGER C) :=   TRANSFORM, SKIP((L.recordCount % 2 = 0 AND (C < L.recordCount/2 OR C > ((L.recordCount/2)+1))) OR (L.recordCount % 2 != 0 AND C != ((L.recordCount+1)/2)))   SELF := L;END;ds_city_count_grouped_city_median_values := PROJECT(ds_city_count_grouped_city,          MedianValues(LEFT,COUNTER));// median is now average of values in the group.city_median_age_format:=RECORD      STRING10 City:=ds_city_count_grouped_city_median_values.City;            UDECIMAL10_5 median_age:=AVE(GROUP,ds_city_count_grouped_city_median_values.age);END;city_median_age:=TABLE(   ds_city_count_grouped_city_median_values,   city_median_age_format,   ds_city_count_grouped_city_median_values.City);OUTPUT(city_median_age);`
jeroenbaas

Posts: 11
Joined: Mon Sep 30, 2013 2:20 pm

Jeroen,

Here's a slightly different approach:
Code: Select all
`rec := RECORD  STRING1 Ltr;  STRING10 City;  UNSIGNED1 age;END;ds := DATASET([ {'A','Boston',23},                {'B','Boston',43},                {'C','Boston',29},                {'D','Chicago',15},                {'E','NY',12},                {'F','NY',55},                {'G','NY',57},                {'H','NY',11}],rec);DNrec := RECORD  STRING10 City;  DATASET({UNSIGNED1 age}) ages;END;pds := PROJECT(TABLE(ds,{city},city),               TRANSFORM(DNrec,SELF.Ages := [],SELF := LEFT));dnds := DENORMALIZE(pds, ds,                    LEFT.city = RIGHT.city,                    TRANSFORM(DNrec,                              SELF.Ages := LEFT.Ages +                                            ROW({RIGHT.age},{UNSIGNED1 age});                              SELF := LEFT));Outrec := RECORD  STRING10     City;  UDECIMAL10_1 MedianAge;END;OutRec XF2(dnds L) := TRANSFORM  AgeCnt := COUNT(L.ages);  S_ages := SORT(L.ages,age);  MidRec := AgeCnt - (AgeCnt DIV 2);   SELF.MedianAge := IF(AgeCnt % 2 = 1,                       S_ages[MidRec].age,                         (S_ages[MidRec].age + S_ages[MidRec+1].age)/2);  SELF := L;   END;PROJECT(dnds,XF2(LEFT));`
I created a nested child dataset (using your example data) of cities and their ages to achieve the required data grouping. All the real work is done in the TRANSFORM for the final PROJECT.

I ran each version several times, and this version seems to be consistently faster than your original code (although that could change given large amounts of data -- so YMMV );

HTH,

Richard
rtaylor
Community Advisory Board Member Posts: 1582
Joined: Wed Oct 26, 2011 7:40 pm

Return to Machine Learning

### Who is online

Users browsing this forum: No registered users and 1 guest