Wed Jan 19, 2022 5:09 pm

Getting a weighted sample from a DATASET

Questions around writing code and queries
Hi,
My math is not up to this.
I have to generate a sample of exactly 6 million records from the set summarised below.
The sample has to contain examples of records from every 'id', but the count of records held by each 'id' varies enormously: id 10884 (bottom) could have all 2,345 of its records included in the sample, whereas the ids whose counts run into the millions obviously have to be cut down by some amount.
It's not a straight percentage from each bucket because, as I said, the smaller buckets can be accommodated completely in the sample.
err - help
id,Count
111,61617819
222,57579975
333,42477398
8888,39820652
1234,26562216
4321,26060023
1122,17037490
27409,14852329
18309,12606673
27651,10219386
32228,8268928
27724,6577956
1378,5142707
10573,4551204
18560,3845789
2440,2894748
15780,1899968
19480,1101994
20773,1049426
33473,983285
23118,880476
11214,679229
19836,549526
19060,488667
11356,267270
19811,130791
5004,51367
13733,42917
10884,2345
Allan

Posts: 442
Joined: Sat Oct 01, 2011 7:26 pm

Allan,

Thanks, this was fun!

Code:
SampleSize := 6000000; //six million
ds := DATASET([{111,61617819},{222,57579975},{333,42477398},{8888,39820652},
{1234,26562216},{4321,26060023},{1122,17037490},{27409,14852329},
{18309,12606673},{27651,10219386},{32228,8268928},{27724,6577956},
{1378,5142707},{10573,4551204},{18560,3845789},{2440,2894748},
{15780,1899968},{19480,1101994},{20773,1049426},{33473,983285},
{23118,880476},{11214,679229},{19836,549526},{19060,488667},
{11356,267270},{19811,130791},{5004,51367},{13733,42917},{10884,2345}],
{UNSIGNED id,UNSIGNED ctr});
RecCnt := COUNT(ds);

//What's the definition of "small" number (keep all recs)?
// SmallNum := ROUND(SampleSize/RecCnt);       //average?
SmallNum := ROUND((SampleSize/RecCnt)/2);   //half average?

SmallSet := ds(ctr <= SmallNum);
LargeSet := ds(ctr > SmallNum);

SmallSetSum := SUM(SmallSet,ctr);
LargeSetSum := SUM(LargeSet,ctr);
LargeTarget := SampleSize - SmallSetSum; //number to get from larger IDs

OutRec := RECORD
ds;
UNSIGNED SampleCnt := ds.ctr;
END;
SmallRes := TABLE(SmallSet,OutRec);
LargeRes := PROJECT(LargeSet,
TRANSFORM(OutRec,
Pct := LEFT.ctr/LargeSetSum; //percentage of total Large
SELF.SampleCnt := ROUND(LargeTarget * Pct),
SELF := LEFT));

CntSum := SUM(SmallRes+LargeRes,SampleCnt);
Diff   := CntSum - SampleSize;        //How close are we to the number we want?

//minor final adjustments to produce exact SampleSize number
LargeFinal(DATASET(OutRec) d,INTEGER n) :=
PROJECT(d,
TRANSFORM(OutRec,
NewCnt := IF(n < 0,LEFT.SampleCnt+1,LEFT.SampleCnt-1);
SELF.SampleCnt := IF(ABS(n) >= COUNTER,NewCnt,LEFT.SampleCnt);
SELF := LEFT));
FinalRes := SmallRes +
            IF(Diff = 0,
               LargeRes,
               LargeFinal(SORT(LargeRes,-SampleCnt),Diff)); //modify largest first

FinalRes;                //Exact numbers of records to get for each ID value
SUM(FinalRes,SampleCnt); //This should be the same as SampleSize defined above
Then you only need to write the code to get the precise number of sample recs defined for each ID value, something like this:
Code:
//get actual samples something like this:
ds := DATASET( ... );   //the real dataset to get samples from

//d = filtered dataset of one id, n = number of sample recs to get
GetSample(DATASET(RECORDOF(ds)) d,UNSIGNED n) := FUNCTION
Interval := TRUNCATE(COUNT(d)/n);
Samples  := SAMPLE(d,n);
RETURN CHOOSEN(Samples,n);
END;
PROJECT(FinalRes,
TRANSFORM({FinalRes,DATASET(RECORDOF(ds)) ChildData},
SELF.ChildData := GetSample(ds(ID = LEFT.ID),LEFT.SampleCnt),
SELF := LEFT));
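For readers who don't write ECL, the allocation arithmetic above can be sketched in Python (an illustration only, not the ECL itself; the function and variable names are mine): small IDs keep all their records, large IDs split the remainder in proportion to their counts, and then the largest IDs are nudged by ±1 until the grand total is exact.

```python
# Python illustration of the allocation arithmetic in the ECL above:
# small ids keep all their records, large ids share the remainder in
# proportion to their counts, then the largest ids are nudged by +/-1
# until the grand total is exactly sample_size.

def allocate(counts, sample_size):
    """counts: dict of id -> record count; returns dict of id -> sample count."""
    small_num = round(sample_size / len(counts) / 2)        # half the average
    small = {i: c for i, c in counts.items() if c <= small_num}
    large = {i: c for i, c in counts.items() if c > small_num}
    large_target = sample_size - sum(small.values())        # like LargeTarget
    large_sum = sum(large.values())                         # like LargeSetSum
    alloc = dict(small)                                     # small ids kept whole
    for i, c in large.items():
        alloc[i] = round(large_target * c / large_sum)      # proportional share
    diff = sum(alloc.values()) - sample_size                # rounding error
    for i in sorted(large, key=large.get, reverse=True):    # modify largest first
        if diff == 0:
            break
        step = -1 if diff > 0 else 1
        alloc[i] += step
        diff += step
    return alloc
```

With the 29 id/count pairs from the original question and sample_size = 6,000,000, the three smallest ids (5004, 13733, 10884) are kept whole and the rest share the remaining ~5.9 million.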

HTH,

Richard
rtaylor Posts: 1610
Joined: Wed Oct 26, 2011 7:40 pm

Brilliant Richard!

After a bit of experimentation I decided 1/2 the average was the 'best' cutoff between large and small buckets.
By the way, small typo: in your example use of the distribution you calculate 'Interval' but never use it; I expect you meant SAMPLE(d,Interval) instead of SAMPLE(d,n).

I was wondering if a more general solution would be useful to the community,
e.g. a FUNCTIONMACRO whose inputs are:
1. The input dataset.
2. A callback function which, given a record, returns the quartile/decile in which to place that record.
3. The total number of records to return in the sample.
4. A SET of 4, 10, or 'n' percentages giving the maximum percentage of records to return from any one quartile/decile/bucket. Obviously, if the input does not contain enough records fulfilling the criteria for a bucket, then 100% of that bucket is returned.

The FUNCTIONMACRO then returns a DATASET (same record type as the input) where the count of sample records for each bucket is the given percentage of the total requested (parameter 3).
There would have to be an ASSERT FAIL if the SUM of parameter 4 != 100.

Parameter 2 would be tricky, as the decision on which bucket to place a record in might require metadata, as is the case with my COUNT of records: the callback would need to know the COUNT of records in the input dataset for that record's ID.
But then again, any such metadata could be calculated by the user of the FUNCTIONMACRO and made available to the callback function. It may be handy to give the callback a parameter that is just a pass-through of metadata supplied to the FUNCTIONMACRO; that way the callback will always have metadata easily available to it.
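To make the metadata pass-through idea concrete, here's a hypothetical sketch in Python (the names and thresholds are mine, purely illustrative): the per-id counts are precomputed once and captured, so the bucket decision can depend on how common a record's id is.

```python
# Hypothetical sketch of the metadata-aware callback idea: per-id counts
# are precomputed once (the "metadata") and captured, so the discriminator
# can bucket each record by how common its id is.

def make_discriminator(id_counts, thresholds):
    """id_counts: dict id -> count; thresholds: ascending count cut-offs,
    one per bucket boundary."""
    def bucket_of(record):
        c = id_counts[record["id"]]
        for bucket, limit in enumerate(thresholds, start=1):
            if c <= limit:
                return bucket
        return len(thresholds) + 1      # counts above every cut-off
    return bucket_of
```

The caller computes `id_counts` up front and the FUNCTIONMACRO only ever sees the resulting closure, which is the pass-through arrangement described above.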

Anyway,

Thanks very much, Richard, for helping me progress my project.
Yours
Allan

All,
The example code above that actually generates the sample can take a VERY long time, as you're filtering the entire input for every ID. It can also blow the maximum size of spill files (10 MB default).

I've used good old PROCESS again, with the RIGHT transform keeping a running tally of the records processed for each ID. The LEFT transform then just SKIPs if, for that record's ID, the running tally >= the maximum sample size for that ID.
We're selecting sequential records from the input, so there is no 'Interval', but it completes in seconds and there is no spill file to cause problems.
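The running-tally idea can be sketched as follows (in Python, purely to illustrate the logic; the actual implementation uses ECL's PROCESS with a SKIP in the LEFT transform):

```python
# Sketch of the single-pass running-tally idea: walk the input once,
# keep a per-id tally, and drop (the "SKIP" case) any record whose id
# has already contributed its allotted sample count.

from collections import defaultdict

def tally_sample(records, quota):
    """records: iterable of (id, payload); quota: dict id -> max sample count."""
    tally = defaultdict(int)
    out = []
    for rid, payload in records:
        if tally[rid] >= quota.get(rid, 0):
            continue                    # the SKIP case
        tally[rid] += 1
        out.append((rid, payload))
    return out
```

As the post says, this selects sequential records per id rather than interval-spaced ones, but it needs only one pass and no per-id filtering of the whole input.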

Yours

Allan

Allan,
I expect you meant SAMPLE(d,Interval) instead of SAMPLE(d,n).
You are absolutely correct! And notice that for the small IDs the COUNT(d) and n values are the same, making Interval = 1, and SAMPLE(d,1) returns all the records in the dataset. Turning this all into a FUNCTIONMACRO is the next logical step, and you only need to pass in the dataset you want to sample from, the unique ID field, and the number of records you want returned, because the id/count dataset the rest of the code works on can easily be created with a simple TABLE, like this:
Code:
TABLE(ds,{idfield, ctr := COUNT(GROUP)},idfield);
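The equivalent of that crosstab TABLE, sketched in Python for comparison (illustration only):

```python
# Python equivalent of the ECL crosstab TABLE above:
# produce one (id, count) entry per distinct id value.
from collections import Counter

def id_counts(rows, idfield="id"):
    return Counter(row[idfield] for row in rows)
```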

HTH,

Richard

Here is a working example, with unit tests, of a generalised FUNCTIONMACRO:
Code:
/*
BucketSample returns a sample of 'TotalSampleSize' records from DATASET 'ds',
where the fraction of the total sample in any one bucket is dictated by a
SET OF UNSIGNED 'BucketPercents'. The SUM of all the elements of 'BucketPercents'
must equal 100.
The callback FUNCTION 'BucketDescriminator' dictates which 'bucket' any one record
belongs in.
The 'BucketDescriminator' FUNCTION must take two parameters:
1. RECORDOF(ds)
2. 'BucketDescriminatorArgument'. This is a pass-through argument that allows metadata
on the DATASET, or anything else for that matter, to be easily available to the callback.
'BucketDescriminator' returns an UNSIGNED in the range 0 <= f(ds) <= COUNT(BucketPercents)
that indicates, as an index, which of the buckets the record belongs to.
If the discriminator returns 0 (zero), the record is explicitly excluded
from any sample.

Note the input dataset 'ds' may well have to be read in its own graph, i.e. use INDEPENDENT.
*/
BucketSample(ds,TotalSampleSize,BucketPercents,BucketDescriminator,BucketDescriminatorArgument) := FUNCTIONMACRO
#UNIQUENAME(Id);
LOCAL Buckets := DATASET(BucketPercents,{UNSIGNED bucket});
LOCAL Chk100Percent := ASSERT(SUM(Buckets,bucket) = 100,'Total allocation to buckets must equal 100%.',FAIL);
LOCAL Checks := PARALLEL(Chk100Percent);
LOCAL SampleSizes := PROJECT(Buckets,
        TRANSFORM({UNSIGNED %id%, UNSIGNED size},
                  SELF.%id% := COUNTER,
                  SELF.size := LEFT.bucket / 100 * TotalSampleSize));

// Use 'SKIP' in preference to filtering to reduce spilling to disk.
{UNSIGNED %Id%,RECORDOF(ds)} BucketUp(RECORDOF(ds) L) := TRANSFORM,SKIP(BucketDescriminator(L,BucketDescriminatorArgument) = 0)
SELF.%Id% := BucketDescriminator(L,BucketDescriminatorArgument);
SELF := L;
END;
LOCAL AllocatedRecsToBucket := PROJECT(ds,BucketUp(LEFT));

{DATASET(RECORDOF(ds)) d} GatherSample(RECORDOF(SampleSizes) L) := TRANSFORM
SELF.d := PROJECT(ENTH(AllocatedRecsToBucket(%Id% = L.%Id%),L.size)
,TRANSFORM(RECORDOF(ds);SELF := LEFT));
END;
LOCAL GatheredByBucket := PROJECT(SampleSizes,GatherSample(LEFT));
LOCAL Gathered := NORMALIZE(GatheredByBucket,LEFT.d,TRANSFORM(RIGHT));
RETURN WHEN(Gathered,Checks);
ENDMACRO;

/////////////////////////////////////////////////////
// Unit Test
/////////////////////////////////////////////////////

INData := NORMALIZE(DATASET([{1}],{INTEGER1 x}),1000000,TRANSFORM({UNSIGNED someDataRow},SELF.someDataRow := COUNTER)) : INDEPENDENT;

////////////////////////////////////////////////////////////////
// Example call-back
////////////////////////////////////////////////////////////////

ExcludeSet := {UNSIGNED ExcludeLowBound,UNSIGNED ExcludeHighBound};

UNSIGNED db(RECORDOF(INData) rec,DATASET(ExcludeSet) Exclude) := FUNCTION
  // A record falling in any exclude range is not sampled at all (bucket 0).
  DontUse := EXISTS(Exclude(rec.someDataRow BETWEEN ExcludeLowBound AND ExcludeHighBound));
  RETURN IF(DontUse,0,(rec.someDataRow % 5)+1);
END;

Smpl := BucketSample(INData,100000,[2,70,8,12,8],db,DATASET([{100,1000},{80000,90000}],ExcludeSet));

ExpectedResults := DATASET([{'Set ending 0 or 5',COUNT(Smpl(someDataRow % 5 = 0)) =  2000}
,{'Set ending 1 or 6',COUNT(Smpl(someDataRow % 5 = 1)) = 70000}
,{'Set ending 2 or 7',COUNT(Smpl(someDataRow % 5 = 2)) =  8000}
,{'Set ending 3 or 8',COUNT(Smpl(someDataRow % 5 = 3)) = 12000}
,{'Set ending 4 or 9',COUNT(Smpl(someDataRow % 5 = 4)) =  8000}
,{'Contains invalid records 100 to 1000',NOT EXISTS(Smpl(someDataRow BETWEEN 100 AND 1000))}
,{'Contains invalid records 80000 to 90000',NOT EXISTS(Smpl(someDataRow BETWEEN 80000 AND 90000))}
],{STRING Id;BOOLEAN Pass});
// Empty dataset is a PASS
OUTPUT(ExpectedResults(NOT Pass),NAMED('FAILED'));

Note an empty dataset result from the unit tests is a PASS.
Yours
Allan