

Getting a weighted sample from a DATASET


Tue Jan 22, 2019 4:52 pm

Hi,
My math is not up to this.
I have to generate a sample of exactly 6 million records from the set summarised below.
The sample has to contain examples of records from every 'id', but the count of records held by each 'id' varies enormously: id 10884 (bottom) could have all its 2345 records included in the sample, whereas the ids whose counts run into the millions obviously have to be cut down by some amount.
It's not a straight percentage from each bucket; as I said, the smaller buckets can be accommodated completely in the sample.
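(For scale: the counts below total roughly 348 million records, so a flat 6M/348M ≈ 1.7% draw would take only about 40 of id 10884's 2345 records.)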
err - help
    id,Count
    111,61617819
    222,57579975
    333,42477398
    8888,39820652
    1234,26562216
    4321,26060023
    1122,17037490
    27409,14852329
    18309,12606673
    27651,10219386
    32228,8268928
    27724,6577956
    1378,5142707
    10573,4551204
    18560,3845789
    2440,2894748
    15780,1899968
    19480,1101994
    20773,1049426
    33473,983285
    23118,880476
    11214,679229
    19836,549526
    19060,488667
    11356,267270
    19811,130791
    5004,51367
    13733,42917
    10884,2345
Allan
 

Tue Jan 22, 2019 9:33 pm

Allan,

Thanks, this was fun! :)
Code:
SampleSize := 6000000; //six million
ds := DATASET([{111,61617819},{222,57579975},{333,42477398},{8888,39820652},
               {1234,26562216},{4321,26060023},{1122,17037490},{27409,14852329},
               {18309,12606673},{27651,10219386},{32228,8268928},{27724,6577956},
               {1378,5142707},{10573,4551204},{18560,3845789},{2440,2894748},
               {15780,1899968},{19480,1101994},{20773,1049426},{33473,983285},
               {23118,880476},{11214,679229},{19836,549526},{19060,488667},
               {11356,267270},{19811,130791},{5004,51367},{13733,42917},{10884,2345}],
              {UNSIGNED id,UNSIGNED ctr});
RecCnt := COUNT(ds);

//What's the definition of "small" number (keep all recs)?
// SmallNum := ROUND(SampleSize/RecCnt);       //average?         
SmallNum := ROUND((SampleSize/RecCnt)/2);   //half average?                

SmallSet := ds(ctr <= SmallNum);
LargeSet := ds(ctr > SmallNum);

SmallSetSum := SUM(SmallSet,ctr);
LargeSetSum := SUM(LargeSet,ctr);
LargeTarget := SampleSize - SmallSetSum; //number to get from larger IDs

OutRec := RECORD
  ds;
  UNSIGNED SampleCnt := ds.ctr;
END;
SmallRes := TABLE(SmallSet,OutRec);
LargeRes := PROJECT(LargeSet,
                    TRANSFORM(OutRec,
                              Pct := LEFT.ctr/LargeSetSum; //percentage of total Large
                              SELF.SampleCnt := ROUND(LargeTarget * Pct),
                              SELF := LEFT));

CntSum := SUM(SmallRes+LargeRes,SampleCnt);                                             
Diff   := CntSum - SampleSize;        //How close are we to the number we want?

//minor final adjustments to produce exact SampleSize number
LargeFinal(DATASET(OutRec) d,INTEGER n) :=
   PROJECT(d,
           TRANSFORM(OutRec,
                     NewCnt := IF(n < 0,LEFT.SampleCnt+1,LEFT.SampleCnt-1);
                     SELF.SampleCnt := IF(ABS(n) >= COUNTER,NewCnt,LEFT.SampleCnt);
                     SELF := LEFT));
LargeAdjust := IF(Diff = 0,
                  LargeRes,
                  LargeFinal(SORT(LargeRes,-SampleCnt),Diff)); //modify largest first

FinalRes := SmallRes & LargeAdjust;
FinalRes;                //Exact numbers of records to get for each ID value
SUM(FinalRes,SampleCnt); //This should be the same as SampleSize defined above
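In words: any id whose count is at or below the "small" cutoff keeps all of its records; the remaining quota (SampleSize minus the small-id total) is split across the large ids in proportion to their counts; and since ROUND introduces a little drift, the largest buckets are then nudged up or down by one each until the grand total is exactly SampleSize.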
Then you only need to write the code to get the precise number of sample recs defined for each ID value, something like this:
Code:
//get actual samples something like this:
ds := DATASET( ... );   //the real dataset to get samples from

//d = filtered dataset of one id, n = number of sample recs to get
GetSample(DATASET(rec) d,UNSIGNED n) := FUNCTION
  Interval := TRUNCATE(COUNT(d)/n);
  Samples  := SAMPLE(d,n);
  RETURN CHOOSEN(Samples,n);
END;
PROJECT(FinalRes,
        TRANSFORM({FinalRes,DATASET(RECORDOF(ds)) ChildData},
                  SELF.ChildData := GetSample(ds(ID = LEFT.ID),LEFT.SampleCnt),
                  SELF := LEFT));

HTH,

Richard

Wed Jan 23, 2019 11:33 am

Brilliant Richard!

After a bit of experimentation I decided half the average was the 'best' cutoff between the large and small sample sizes.
By the way, there's a small typo: in your example use of the distribution you calculate 'Interval' but don't use it. I expect you meant SAMPLE(d,Interval) instead of SAMPLE(d,n).

I was wondering if a more general solution would be useful to the community,
e.g. a FUNCTIONMACRO whose inputs are:
1. the input dataset;
2. a callback function which, given a record, returns the quartile/decile in which to place the record;
3. the total number of records to return in the sample;
4. a SET of 4 or 10 or 'n' percentages giving the maximum percentage of records to return in any one quartile/decile/<other number of buckets>. Obviously, if the input does not contain enough records fulfilling the criteria for a bucket, then 100% of that bucket is returned.

The FUNCTIONMACRO then returns a DATASET (same record type as the input) where the count of sample records for each bucket is that percentage of the total requested (parameter 3).
There would have to be an ASSERT FAIL if SUM(parameter 4) != 100.

Parameter 2 would be tricky, as the decision on which bucket to place a record in might require metadata, as is the case with my COUNT of records: the callback would need to know the COUNT in the input dataset for that particular record type (in the case above, the ID).
But then again, any such metadata could be calculated by the user of the FUNCTIONMACRO and made available to the callback function. It may be handy to have a parameter to the callback which is just a pass-through of the metadata passed into the FUNCTIONMACRO; that way the callback will always have its metadata easily available.

Anyway,

Thanks very much, Richard, for helping me progress my project.
Yours
Allan

Wed Jan 23, 2019 2:55 pm

All,
The example code above to actually generate the sample can take a VERY long time, since you're filtering the entire input for every ID. It can also blow the maximum size of spill files (10 MB by default).

I've used good old PROCESS again, with the RIGHT transform keeping a running tally of the records processed for each ID, and the LEFT transform just SKIPping a record when the running tally for that record's ID has reached the maximum sample size for that ID.
We're selecting sequential records from the input, so there is no 'Interval', but it does complete in seconds and there is no spill file to cause problems.
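A minimal sketch of that PROCESS approach, assuming the input is already sorted by id and that each record has been JOINed to the FinalRes table above so it carries its target SampleCnt (the names SortedInput, payload, etc. are illustrative, not my actual code):
Code:
InRec := RECORD
  UNSIGNED id;
  UNSIGNED SampleCnt;  //target number of sample records for this id (from FinalRes)
  STRING20 payload;    //stand-in for the real data fields
END;
StateRec := RECORD
  UNSIGNED id;         //id currently being tallied
  UNSIGNED tally;      //records processed so far for that id
END;

//RIGHT transform: maintain the running tally, resetting it when the id changes
StateRec DoTally(InRec L, StateRec R) := TRANSFORM
  SELF.id    := L.id;
  SELF.tally := IF(L.id = R.id,R.tally + 1,1);
END;

//LEFT transform: SKIP once enough records have been kept for this id
//(the IF yields the tally so far for this id, or 0 if the id just changed)
InRec DoKeep(InRec L, StateRec R) := TRANSFORM,
      SKIP(IF(L.id = R.id,R.tally,0) >= L.SampleCnt)
  SELF := L;
END;

//SortedInput = the full dataset, sorted by id, with SampleCnt joined on
SampledRecs := PROCESS(SortedInput,ROW({0,0},StateRec),
                       DoKeep(LEFT,RIGHT),DoTally(LEFT,RIGHT));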

Yours

Allan

Wed Jan 23, 2019 3:24 pm

Allan,
> I expect you meant SAMPLE(d,Interval) instead of SAMPLE(d,n).
You are absolutely correct! And notice that, for the small IDs, the COUNT(d) and n values are the same, making Interval=1, and SAMPLE(d,1) returns all the records in d. :)
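For instance, a quick demonstration with a made-up three-record dataset:
Code:
d := DATASET([{1},{2},{3}],{UNSIGNED n});
OUTPUT(SAMPLE(d,1)); //an interval of 1 keeps every record, so all 3 rows come back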

Turning this all into a FUNCTIONMACRO is the next logical step. You would only need to pass in the dataset you want to sample from, the unique ID field, and the number of records you want returned, because the id/count dataset the rest of the code works on can easily be created with a simple TABLE, like this:
Code:
TABLE(ds,{idfield, ctr := COUNT(GROUP)},idfield);
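This crosstab TABLE yields one row per idfield value with its record count, which is the same {id,ctr} layout the allocation code above starts from.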

HTH,

Richard

Tue Feb 26, 2019 10:37 am

Here is a working example, with unit tests, as a generalised FUNCTIONMACRO:
Code:
/*
  BucketSample returns a sample of 'TotalSampleSize' records from DATASET 'ds',
  where the fraction of the total sample in any one bucket is dictated by a
  SET OF UNSIGNED 'BucketPercents'. The SUM of all the elements of 'BucketPercents'
  must equal 100.
  The call-back FUNCTION 'BucketDiscriminator' dictates which 'bucket' any one record
  belongs in.
  The 'BucketDiscriminator' FUNCTION must take two parameters:
      1. RECORDOF(ds)
      2. 'BucketDiscriminatorArgument'. This is a pass-through argument that allows metadata
         on the DATASET, or anything else for that matter, to be easily available to the callback.
  'BucketDiscriminator' returns an UNSIGNED in the range 0 <= result <= COUNT(BucketPercents),
  indicating, as an index, which of the buckets the record belongs to.
  If the discriminator returns 0 (zero), the record is explicitly excluded
  from any sample.

  Note the input dataset 'ds' may well have to be read in its own graph, i.e. use INDEPENDENT.
*/
BucketSample(ds,TotalSampleSize,BucketPercents,BucketDiscriminator,BucketDiscriminatorArgument) := FUNCTIONMACRO
    #UNIQUENAME(Id);
    LOCAL Buckets := DATASET(BucketPercents,{UNSIGNED bucket});
    LOCAL Chk100Percent := ASSERT(SUM(Buckets,bucket) = 100,'Total allocation to buckets must equal 100%.',FAIL);
    LOCAL Checks := PARALLEL(Chk100Percent);
    LOCAL SampleSizes := PROJECT(Buckets
                                ,TRANSFORM({UNSIGNED %id%;UNSIGNED size}
                                           ;SELF.%id% := COUNTER
                                           ;SELF.size := LEFT.bucket / 100 * TotalSampleSize));
   
    // Use 'SKIP' in preference to filtering to reduce spilling to disk.
    {UNSIGNED %Id%,RECORDOF(ds)} BucketUp(RECORDOF(ds) L) := TRANSFORM,SKIP(BucketDiscriminator(L,BucketDiscriminatorArgument) = 0)
        SELF.%Id% := BucketDiscriminator(L,BucketDiscriminatorArgument);
        SELF := L;
    END;
    LOCAL AllocatedRecsToBucket := PROJECT(ds,BucketUp(LEFT));

    {DATASET(RECORDOF(ds)) d} GatherSample(RECORDOF(SampleSizes) L) := TRANSFORM
        SELF.d := PROJECT(ENTH(AllocatedRecsToBucket(%Id% = L.%Id%),L.size)
                         ,TRANSFORM(RECORDOF(ds);SELF := LEFT));
    END;
    LOCAL GatheredByBucket := PROJECT(SampleSizes,GatherSample(LEFT));
    LOCAL Gathered := NORMALIZE(GatheredByBucket,LEFT.d,TRANSFORM(RIGHT));
    RETURN WHEN(Gathered,Checks);
ENDMACRO;

/////////////////////////////////////////////////////
// Unit Test
/////////////////////////////////////////////////////

INData := NORMALIZE(DATASET([{1}],{INTEGER1 x}),1000000,TRANSFORM({UNSIGNED someDataRow},SELF.someDataRow := COUNTER)) : INDEPENDENT;

////////////////////////////////////////////////////////////////
// Example call-back
////////////////////////////////////////////////////////////////

ExcludeSet := {UNSIGNED ExcludeLowBound,UNSIGNED ExcludeHighBound};

UNSIGNED db(RECORDOF(INData) rec,DATASET(ExcludeSet) Exclude) := FUNCTION
    DontUse := EXISTS(PROJECT(Exclude,TRANSFORM({BOOLEAN bad};SELF.Bad := rec.someDataRow BETWEEN LEFT.ExcludeLowBound AND LEFT.ExcludeHighBound))(bad));
    RETURN IF(DontUse,0,(rec.someDataRow % 5)+1);
END;

Smpl := BucketSample(INData,100000,[2,70,8,12,8],db,DATASET([{100,1000},{80000,90000}],ExcludeSet));

ExpectedResults := DATASET([{'Set ending 0 or 5',COUNT(Smpl(someDataRow % 5 = 0)) =  2000}
                           ,{'Set ending 1 or 6',COUNT(Smpl(someDataRow % 5 = 1)) = 70000}
                           ,{'Set ending 2 or 7',COUNT(Smpl(someDataRow % 5 = 2)) =  8000}
                           ,{'Set ending 3 or 8',COUNT(Smpl(someDataRow % 5 = 3)) = 12000}
                           ,{'Set ending 4 or 9',COUNT(Smpl(someDataRow % 5 = 4)) =  8000}
                           ,{'Contains invalid records 100 to 1000',NOT EXISTS(Smpl(someDataRow BETWEEN 100 AND 1000))}
                           ,{'Contains invalid records 80000 to 90000',NOT EXISTS(Smpl(someDataRow BETWEEN 80000 AND 90000))}
                           ],{STRING Id;BOOLEAN Pass});
// Empty dataset is a PASS
OUTPUT(ExpectedResults(NOT Pass),NAMED('FAILED'));

Note an empty dataset result from the unit tests is a PASS.
Yours
Allan

