Fri Aug 17, 2018 3:06 am
Login Register Lost Password? Contact Us


Yet another grouping problem

Questions around writing code and queries

Fri Nov 17, 2017 11:21 am Change Time Zone

Hi,

I have a flat dataset of dated address IDs tied to a person. e.g.
Code: Select all
addr      date       name    Some_data_on_person
1     19951015       Allan      X
1     19961111       Nina       A
1     19970213       Nina       B
1     19980314       Allan      Y
2     19930101       Allan      P

For every major key (that is addr and date) I need to pull forward any information on a person that does NOT exist at that date. If data for a person does exist at that date it is taken as an update and any historic information, for that person, must not be pulled forward.
So the result I require from the input above is:

Code: Select all
1     19951015       Allan      X
1     19961111       Allan      X
                     Nina       A
1     19970213       Allan      X
                     Nina       B
1     19980314       Allan      Y
                     Nina       B
2     19930101       Allan      P

Now I can produce this but I'm using child datasets in the computation, which completely slugs at some versions of HPCC (fortunately v6 of hpcc works fine, but this ECL must run on older versions)
Is there a way to generate this result without the use of child datasets?
e.g. generating some common sequence number that can then be used in a subsequent self JOIN?

Just to clarify: The final result is a dataset with child people dataset, I just need the computation to be flat, i.e. sweep through the input cleanly so THOR does not slug.
Yours
Allan
Allan
 
Posts: 306
Joined: Sat Oct 01, 2011 7:26 pm

Fri Nov 17, 2017 3:23 pm Change Time Zone

Allan,

OK, here's a solution that creates your nested child DATASET using two PROJECTs:
Code: Select all
//NOTICE that I started by adding a UID field to each record
r := {UNSIGNED1 UID, UNSIGNED1 addr, UNSIGNED4 date,
      STRING10 name, STRING1 Some_data_on_person};
ds := DATASET([{1,1,19951015,'Allan','X'},
               {2,1,19961111,'Nina','A'},
               {3,1,19970213,'Nina','B'},
               {4,1,19980314,'Allan','Y'},
               {5,2,19930101,'Allan','P'}],r);

//extract just the unique Addr/date combinations
AddrDate  := TABLE(ds,{addr,date},addr,date);

//organize the child records
Children  := SORT(TABLE(ds,{addr,date,name,UID}),addr,name,-UID);
//the descending UID sort allows DEDUP to just get the latest record for each name

//vertical slice for more efficient operation
ChildRecs := TABLE(ds,{UID,name,Some_data_on_person});

//a normal PROJECT to get parent data into the Nested Child dataset
//  and a nested PROJECT to get the kids
ChildRec := {STRING10 name, STRING1 Some_data_on_person};
OutRec := RECORD
  ds.addr;
  ds.date;
  DATASET(ChildRec) NameData;
END;   
ParentData := PROJECT(AddrDate,TRANSFORM(OutRec,
                                 Kids := DEDUP(Children(addr = LEFT.addr,
                                                        date <= LEFT.Date),
                                               name);
                                 SetKids := SET(Kids,UID);
                                 SELF.NameData := PROJECT(ChildRecs(UID IN SetKids),
                                                          ChildRec);
                                 SELF := LEFT));
ParentData;

I used the ChildRecs TABLE this way to easily allow for your "Some_data_on_person" to be extrapolated to multiple fields, not just a single one-character string.

Hopefully, this solution won't "slug" your Thor. :)

HTH,

Richard
rtaylor
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 1369
Joined: Wed Oct 26, 2011 7:40 pm

Fri Nov 17, 2017 8:28 pm Change Time Zone

Thanks Richard for this very prompt reply.

I'll try it out, since posting the question I've worked out a solution using two iterates but Your's looks neater, though I'm not sure how long the SET's will take to run.

I'll give your solution a try, it will take me a bit of time to alter it for my actual code plus its late Friday, so don't hold your breadth.

Once again, thanks very much.

Allan
Allan
 
Posts: 306
Joined: Sat Oct 01, 2011 7:26 pm

Fri Nov 17, 2017 8:48 pm Change Time Zone

Hum Richard,

Looking at your ECL, for every major key (addrid, date), and we've a lot of them, we're PROJECting the entire name information, admittedly filtered but still filtering the same child dataset a LOT of times. is there any way this entire process could all be made LOCAL? that would help.

Yours
Allan
Allan
 
Posts: 306
Joined: Sat Oct 01, 2011 7:26 pm

Mon Nov 20, 2017 7:16 pm Change Time Zone

Allan,

OK, here's the same process but with LOCAL added:
Code: Select all
//NOTICE that I started by adding a UID field to each record
r := {UNSIGNED1 UID, UNSIGNED1 addr, UNSIGNED4 date,
      STRING10 name, STRING1 Some_data_on_person};
ds := DATASET([{1,1,19951015,'Allan','X'},
               {2,1,19961111,'Nina','A'},
               {3,1,19970213,'Nina','B'},
               {4,1,19980314,'Allan','Y'},
               {5,2,19930101,'Allan','P'}],r);

//DISTRIBUTE sets up for the LOCAL Operations
dds  := DISTRIBUTE(ds,HASH32(addr));

//extract just the unique Addr/date combinations
AddrDate  := SORT(TABLE(dds,{addr,date},addr,date,LOCAL),addr,-date,LOCAL);

//organize the child records
Children  := SORT(TABLE(dds,{addr,date,name,UID},LOCAL),addr,name,-UID,LOCAL);
//the descending UID sort allows DEDUP to just get the latest record for each name

//vertical slice for more efficient operation
ChildRecs := TABLE(dds,{UID,name,Some_data_on_person},LOCAL);

//PROJECT just the parent data into the Nested Child dataset
ChildRec := {STRING10 name, STRING1 Some_data_on_person};
OutRec := RECORD
  ds.addr;
  ds.date;
  DATASET(ChildRec) NameData;
END;   
ParentData := PROJECT(AddrDate,TRANSFORM(OutRec,
                                 Kids := DEDUP(Children(addr = LEFT.addr,
                                                        date <= LEFT.Date),
                                               name);
                                 SetKids := SET(Kids,UID);
                                 SELF.NameData := PROJECT(ChildRecs(UID IN SetKids),
                                                          ChildRec);
                                 SELF := LEFT),LOCAL);
SORT(ParentData,addr,date);

HTH,

Richard
rtaylor
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 1369
Joined: Wed Oct 26, 2011 7:40 pm

Wed Nov 22, 2017 8:15 am Change Time Zone

Thanks Richard,

Got a solution of my own, will get round to comparing performance sometime.

Thanks very much

Allan
Allan
 
Posts: 306
Joined: Sat Oct 01, 2011 7:26 pm


Return to Programming

Who is online

Users browsing this forum: No registered users and 1 guest

cron