

A question about the GROUP function

Comments and questions related to the Enterprise Control Language

Wed Feb 27, 2013 4:51 pm

Hi, I'm new to ECL and have some questions about the GROUP function:

1. The reference says, "The GROUP function fragments a recordset into a set of sets." What does "set of sets" mean? What is the difference between a grouped dataset and an ungrouped dataset (such as a plain set of records)?

2. The reference also says, "This allows aggregations and other operations (such as ITERATE, DEDUP, ROLLUP, SORT and others)". I always use these functions (like DEDUP and ROLLUP) directly, so I don't understand when I need the GROUP function.

Thanks a lot if anyone can answer my question!
Leofei
 
Posts: 53
Joined: Mon Nov 26, 2012 5:13 pm

Wed Feb 27, 2013 5:37 pm

Leofei,

The GROUP function is meant to make processing huge datasets faster by allowing operations to work on smaller chunks of data.

For example, let's say you have a 10 BILLION record dataset that you need to SORT by lastname, firstname, middlename, and gender. You could just do it this way:
Code:
Rec := RECORD
  STRING30 lastname;
  STRING20 firstname;
  STRING20 middlename;
  STRING1  gender;
   //and a bunch of other fields
END;
ds := DATASET('MyTenBillionRecordFile',Rec,FLAT);

SortedRecs := SORT(ds,lastname,firstname,middlename,gender);
This code would work, but it would be a single 10-billion-record global sort, which could take quite a bit of time (depending on the size of your cluster).

So an alternative would be to do it this way:
Code:
Rec := RECORD
  STRING30 lastname;
  STRING20 firstname;
  STRING20 middlename;
  STRING1  gender;
   //and a bunch of other fields
END;
ds := DATASET('MyTenBillionRecordFile',Rec,FLAT);

SortedRecs := SORT(ds,lastname);
GrpRecs    := GROUP(SortedRecs,lastname);
FinalRecs  := SORT(GrpRecs,firstname,middlename,gender);
The difference here is that the initial global sort by lastname will go reasonably fast, then the GROUP by lastname creates a separate subgroup (each on a single node) of records for each unique lastname, so that the last SORT by firstname, middlename, and gender will happen separately and independently on each subgroup.

So, if you had exactly 10,000 last names and a completely even distribution of data, that last SORT would actually do 10,000 1-million-record sorts instead of a single 10-BILLION-record sort. And since each subgroup is contained on a single node, if you were running a 400-node cluster you would be doing at least 400 of those 10,000 1-million-record sorts simultaneously at all times until the entire sorting job is done.
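To try the pattern without a 10-billion-record file, here is a minimal sketch on a small inline dataset. The five sample records are made up for illustration, and the record layout is trimmed to the four sort fields. Assuming the final SORT runs within each lastname subgroup as described above, the output should list the Jones records ordered by firstname, followed by the Smith records ordered by firstname.

```ecl
// Hypothetical sample data standing in for the large file.
Rec := RECORD
  STRING30 lastname;
  STRING20 firstname;
  STRING20 middlename;
  STRING1  gender;
END;

ds := DATASET([{'Smith','John','A','M'},
               {'Jones','Mary','B','F'},
               {'Smith','Anna','C','F'},
               {'Jones','Adam','D','M'},
               {'Smith','Bob','E','M'}], Rec);

SortedRecs := SORT(ds, lastname);                            // global sort on the grouping field only
GrpRecs    := GROUP(SortedRecs, lastname);                   // one subgroup per unique lastname
FinalRecs  := SORT(GrpRecs, firstname, middlename, gender);  // sorts inside each subgroup

OUTPUT(FinalRecs);
```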

HTH,

Richard
rtaylor
Community Advisory Board Member
 
Posts: 1600
Joined: Wed Oct 26, 2011 7:40 pm

Wed Feb 27, 2013 6:34 pm

Hi Richard, thank you for your quick reply!

It sounds like GROUP does the same thing as DISTRIBUTE. Do they handle the data in the same way?

Also, is there any problem if I don’t UNGROUP the dataset after the GROUP operation?

Thanks and looking forward to your answer.

-Leo
Leofei
 
Posts: 53
Joined: Mon Nov 26, 2012 5:13 pm

Wed Feb 27, 2013 8:05 pm

Leo,

Yes, GROUP and DISTRIBUTE sound similar, but how similar are they? I don't know precisely (you can look at the source code if you really want that answer :) ).

The big difference between the two is that DISTRIBUTE does not create the subgroups that GROUP does, so subsequent operations will not operate the same if you just DISTRIBUTE instead of using GROUP.
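One way to see that difference is to number records with ITERATE on both versions of the same data. This is only a sketch: the sample records, the seq field, and the AddSeq transform are invented for the example. On the grouped dataset the numbering should restart for each lastname, while on the distributed-only dataset it keeps counting across every lastname that happens to land on the same node.

```ecl
Rec := RECORD
  STRING30  lastname;
  STRING20  firstname;
  UNSIGNED4 seq;   // hypothetical counter field, filled in by ITERATE
END;

ds := DATASET([{'Smith','John',0},
               {'Jones','Mary',0},
               {'Smith','Anna',0},
               {'Jones','Adam',0},
               {'Smith','Bob',0}], Rec);

// Hypothetical transform: each record gets the previous record's seq plus one.
Rec AddSeq(Rec L, Rec R) := TRANSFORM
  SELF.seq := L.seq + 1;
  SELF     := R;
END;

// DISTRIBUTE only: equal lastnames share a node, but no subgroups exist.
DistRecs := SORT(DISTRIBUTE(ds, HASH32(lastname)), lastname, firstname, LOCAL);
DistSeq  := ITERATE(DistRecs, AddSeq(LEFT, RIGHT), LOCAL);

// GROUP: ITERATE works on each lastname subgroup separately.
GrpRecs := GROUP(SORT(ds, lastname, firstname), lastname);
GrpSeq  := ITERATE(GrpRecs, AddSeq(LEFT, RIGHT));

OUTPUT(DistSeq, NAMED('DistributedOnly'));
OUTPUT(GrpSeq,  NAMED('Grouped'));
```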

UNGROUP is not always necessary, but it may solve some problems if they occur. In most cases a GROUPed dataset can be used just like a non-GROUPed dataset for operations that don't work on subgroups separately.
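A case where UNGROUP matters is when a later operation should see the whole dataset again rather than each subgroup. The sketch below uses made-up sample records: the first SORT stays inside each lastname subgroup, while the second removes the grouping first and so orders all records by firstname.

```ecl
Rec := RECORD
  STRING30 lastname;
  STRING20 firstname;
END;

ds := DATASET([{'Smith','John'},
               {'Jones','Mary'},
               {'Smith','Anna'},
               {'Jones','Adam'},
               {'Smith','Bob'}], Rec);

GrpRecs := GROUP(SORT(ds, lastname), lastname);

// Still grouped: records are ordered by firstname within each lastname.
ByFirstInGroup := SORT(GrpRecs, firstname);

// Grouping removed first: records are ordered by firstname across the whole dataset.
ByFirstOverall := SORT(UNGROUP(GrpRecs), firstname);

OUTPUT(ByFirstInGroup, NAMED('WithinGroups'));
OUTPUT(ByFirstOverall, NAMED('AfterUngroup'));
```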

HTH,

Richard
rtaylor
Community Advisory Board Member
 
Posts: 1600
Joined: Wed Oct 26, 2011 7:40 pm

Wed Feb 27, 2013 9:02 pm

Thanks, Richard. That helped me understand more of the details in the code.
Leofei
 
Posts: 53
Joined: Mon Nov 26, 2012 5:13 pm

Thu Mar 21, 2019 7:34 am

I am wondering whether UNGROUP will lose the distribution of the dataset if the GROUP was created locally on a distributed dataset. Based on what I can see, it does not, but I am looking for verification.

Thanks.
newportm
 
Posts: 22
Joined: Tue Nov 15, 2016 2:48 pm

Thu Mar 21, 2019 1:45 pm

newportm,

My understanding is that UNGROUP would remove the grouping but leave the data where it is at the point of the UNGROUP.
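If you want to check this on your own cluster, one possible approach is to tag each record with the node it sits on before the GROUP and again after the UNGROUP, then compare the two outputs. This is a sketch rather than a definitive test: the sample data, NodeRec, and TagNode are made up for the example, and it assumes Std.System.Thorlib.Node() reports the node the record is being processed on. If the understanding above is right, each lastname should show the same node value in both results.

```ecl
IMPORT Std;

Rec := RECORD
  STRING30 lastname;
  STRING20 firstname;
END;

// Hypothetical layout: the original fields plus the node number.
NodeRec := RECORD
  Rec;
  UNSIGNED4 node;
END;

ds := DATASET([{'Smith','John'},
               {'Jones','Mary'},
               {'Smith','Anna'},
               {'Jones','Adam'},
               {'Smith','Bob'}], Rec);

// Hypothetical transform that records which node processed the record.
NodeRec TagNode(Rec L) := TRANSFORM
  SELF.node := Std.System.Thorlib.Node();
  SELF      := L;
END;

DistRecs  := DISTRIBUTE(ds, HASH32(lastname));
LocalGrp  := GROUP(SORT(DistRecs, lastname, LOCAL), lastname, LOCAL);
Ungrouped := UNGROUP(LocalGrp);

OUTPUT(PROJECT(DistRecs,  TagNode(LEFT)), NAMED('BeforeGroup'));
OUTPUT(PROJECT(Ungrouped, TagNode(LEFT)), NAMED('AfterUngroup'));
```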

HTH,

Richard
rtaylor
Community Advisory Board Member
 
Posts: 1600
Joined: Wed Oct 26, 2011 7:40 pm

