Mon Nov 29, 2021 7:59 pm

## Calculate Median

Topics related to the set of Machine Learning libraries and Matrix processing algorithms
Hi,

How to calculate the median for a particular column..?

Tried looking at the concept of FieldAggregates.Medians, but not clear about the usage.

Can anybody share a example on how to use Median..?

Can median be used for cross tab reports (to calculate median based on a grouped value)...?

Kindly help me regarding this.

Regards,
ksviswa
ksviswa

Posts: 129
Joined: Sat Jun 09, 2012 9:43 am

ksviswa,

This code shows you how it works:
Code: Select all
IMPORT ML,STD;

ds := DATASET([{20120101},{20120102},{20120103},{20120104},{20120105},{20120106}],{integer d});

ML.Types.NumericField XF(ds L, integer C) := TRANSFORM
SELF.id := C;
SELF.number := 1;
SELF.value := STD.Date.FromJulianDate(L.D);
END;

P := PROJECT(ds,XF(LEFT,COUNTER));

AVG2(REAL L, REAL R) := AVE(L,R);
DateStr(REAL jdate) := STD.Date.ToString(STD.Date.ToJulianDate(ROUND(jdate)),'%Y-%m-%d');

Simples := ML.FieldAggregates(P).Simple;
Minval  := Simples[1].minval;
Maxval  := Simples[1].maxval;
Meanval := Simples[1].mean;

Quarts := ML.FieldAggregates(P).NTileRanges(4);
Quart1 := AVG2(Quarts[1].max,Quarts[2].min);
// Quart2 := AVG2(Quarts[2].max,Quarts[3].min);
Quart2 := ML.FieldAggregates(P).Medians[1].median;
Quart3 := AVG2(Quarts[3].max,Quarts[4].min);

OutDS := DATASET([{'Min',DateStr(Minval)},
{'1st Qu',DateStr(Quart1)},
{'Median',DateStr(Quart2)},
{'Mean',DateStr(Meanval)},
{'3rd Qu',DateStr(Quart3)},
{'Max',DateStr(Maxval)}],
{STRING10 Prompt,STRING10 Val});
OutDS;
HTH,

Richard
rtaylor

Posts: 1604
Joined: Wed Oct 26, 2011 7:40 pm

Hi ksviswa,

Here's how you can use the FieldAggregates.Medians:

IMPORT ML;

myData := DATASET([{1,1,1}, {2,2,1}, {3,5,2}, {4,7,6}],{UNSIGNED rid; REAL f1; REAL f2;});
// Turn into regular NumericField file (with continuous variables)
ML.ToField(myData,o);
ML.FieldAggregates(o).Medians;

The output will have median values for both f1 and f2 columns.

Edin
edin.muharemagic@lexisnexis.com

Posts: 3
Joined: Wed Jul 13, 2011 9:38 pm

Thanks Richard and Edin,

The recent code from github solved the median calculation issue for a particular column..

How the same can be applied to a group..?

Sample Input to calculate median based on a grouped value.

Code: Select all

'A','Boston',23
'B','Boston',43
'C','Boston',29

'D','Chicago',15

'E','NY',12
'F','NY',55
'G','NY',57
'H','NY',61

Problem Statement : Median age per city

Sample Output :

Code: Select all

city       MedianAge
---------- -----------
Boston     29
Chicago    15
NY         56

Can we use FieldAggregates.Medians to compute this result or any other logic to be implemented.

Any pointers regarding the same would be highly appreciated.

Thanks and Regards,
ksviswa
ksviswa

Posts: 129
Joined: Sat Jun 09, 2012 9:43 am

ksviswa,

Try this:
Code: Select all
IMPORT ML,STD;

rec := RECORD
STRING1 Ltr;
STRING10 City;
unsigned1 age;
END;

ds := dataset([ {'A','Boston',23},
{'B','Boston',43},
{'C','Boston',29},
{'D','Chicago',15},
{'E','NY',12},
{'F','NY',55},
{'G','NY',57},
{'H','NY',61}],rec);

t1 := SORT(TABLE(ds,{city},City),City);

CityNums := PROJECT(t1,TRANSFORM({UNSIGNED4 Num,STRING10 City},SELF.Num := COUNTER;SELF := LEFT));

ML.Types.NumericField XF(ds L, integer C) := TRANSFORM
SELF.id := C;
SELF.number := CityNums(City = L.City)[1].Num;
SELF.value := L.age;
END;

P := PROJECT(ds,XF(LEFT,COUNTER));

ML.FieldAggregates(P).Medians;
HTH,

Richard
rtaylor

Posts: 1604
Joined: Wed Oct 26, 2011 7:40 pm

Hi,

Thanks a lot..

Regards,
ksviswa
ksviswa

Posts: 129
Joined: Sat Jun 09, 2012 9:43 am

Perhaps not most efficient, but in a situation where I couldn't directly use the ML package I decided to calculate the median by taking the middle value(s):
Code: Select all
rec := RECORD
STRING1 Ltr;
STRING10 City;
unsigned1 age;
END;

ds := dataset([ {'A','Boston',23},
{'B','Boston',43},
{'C','Boston',29},
{'D','Chicago',15},
{'E','NY',12},
{'F','NY',55},
{'G','NY',57},
{'H','NY',11}],rec);

// count records per city
t_counts_format := RECORD
STRING10 City:=ds.City;
UNSIGNED3 recordCount:=COUNT(GROUP);
END;
t_counts := TABLE(
ds,
t_counts_format,
ds.City
);
// join ds with count
ds_city_count_layout:=RECORD
STRING1 Ltr;
STRING10 City;
unsigned1 age;
UNSIGNED3 recordCount;
END;
ds_city_count_layout join_ds_city_count(ds L, t_counts R):=TRANSFORM
SELF:=L;
SELF.recordCount:=R.recordCount;
END;
ds_city_count:=JOIN(
ds,
t_counts,
LEFT.City=RIGHT.City
);

//iterate and only keep the median(s): 1 value for uneven recordCount, 2 values if the group count is even.
ds_city_count_grouped_city := GROUP(SORT(ds_city_count,City,age),City);  //then group them
ds_city_count_layout MedianValues(ds_city_count_grouped_city L, INTEGER C) :=
TRANSFORM, SKIP((L.recordCount % 2 = 0 AND (C < L.recordCount/2 OR C > ((L.recordCount/2)+1))) OR (L.recordCount % 2 != 0 AND C != ((L.recordCount+1)/2)))
SELF := L;
END;
ds_city_count_grouped_city_median_values := PROJECT(ds_city_count_grouped_city,
MedianValues(LEFT,COUNTER));
// median is now average of values in the group.
city_median_age_format:=RECORD
STRING10 City:=ds_city_count_grouped_city_median_values.City;
UDECIMAL10_5 median_age:=AVE(GROUP,ds_city_count_grouped_city_median_values.age);
END;
city_median_age:=TABLE(
ds_city_count_grouped_city_median_values,
city_median_age_format,
ds_city_count_grouped_city_median_values.City
);
OUTPUT(city_median_age);
jeroenbaas

Posts: 11
Joined: Mon Sep 30, 2013 2:20 pm

Jeroen,

Here's a slightly different approach:
Code: Select all
rec := RECORD
STRING1 Ltr;
STRING10 City;
UNSIGNED1 age;
END;

ds := DATASET([ {'A','Boston',23},
{'B','Boston',43},
{'C','Boston',29},
{'D','Chicago',15},
{'E','NY',12},
{'F','NY',55},
{'G','NY',57},
{'H','NY',11}],rec);

DNrec := RECORD
STRING10 City;
DATASET({UNSIGNED1 age}) ages;
END;
pds := PROJECT(TABLE(ds,{city},city),
TRANSFORM(DNrec,SELF.Ages := [],SELF := LEFT));
dnds := DENORMALIZE(pds, ds,
LEFT.city = RIGHT.city,
TRANSFORM(DNrec,
SELF.Ages := LEFT.Ages +
ROW({RIGHT.age},{UNSIGNED1 age});
SELF := LEFT));
Outrec := RECORD
STRING10     City;
UDECIMAL10_1 MedianAge;
END;
OutRec XF2(dnds L) := TRANSFORM
AgeCnt := COUNT(L.ages);
S_ages := SORT(L.ages,age);
MidRec := AgeCnt - (AgeCnt DIV 2);
SELF.MedianAge := IF(AgeCnt % 2 = 1,
S_ages[MidRec].age,
(S_ages[MidRec].age + S_ages[MidRec+1].age)/2);
SELF := L;
END;
PROJECT(dnds,XF2(LEFT));
I created a nested child dataset (using your example data) of cities and their ages to achieve the required data grouping. All the real work is done in the TRANSFORM for the final PROJECT.

I ran each version several times, and this version seems to be consistently faster than your original code (although that could change given large amounts of data -- so YMMV );

HTH,

Richard
rtaylor