Fri Dec 03, 2021 1:01 pm
Login Register Lost Password? Contact Us


JOIN options

Questions around writing code and queries

Tue Sep 07, 2021 8:54 pm Change Time Zone

I'm working with some code that is demonstrating some unusual behavior.
I have two datasets: ds_001 (built from a TABLE statement) with 25 records.
Primary field STRING06 KEYFIELD.
ds_002 DEDUP(SORT( sourceData, KEYFIELD), KEYFIELD); with 1,256 recs
Primary field STRING06 KEYFIELD.
ds_001 is distributed by KEYFIELD. ds_002 is not.

JOIN_DATA := JOIN( ds_001
, ds_002
, LEFT.KEYFIELD = RIGHT.KEYFIELD
, TRANSFORM( RECORDOF(LEFT)
, SELF.CO_NAME := RIGHT.CO_NAME;
SELF := LEFT;
)
, SMART
, LOCAL // because ds_001 is distributed
);
I end up with no records in JOIN_DATA. I have displayed both files and I have proven both files contain matching KEYFIELD data. I attempted changing the SMART to LOOKUP and to MANY - all produced an empty file (no matches). These options basically say the right recordset (if the count is low enough) will be copied to all the nodes before the JOIN.

I think ds_002 having 1,256 records, this would be low enough for that to be true.

I have come to the conclusion that because ds_001 has only 25 records, that SMART/LOOKYP/MANY is ignored and the JOIN is attempted as distributed and since ds_002 is not distributed, NOTHING will match. If I pull the DISTRIBUTE off ds_001 and JOIN / HASH, I get a 100% match in JOIN_DATA. Same thing if I DISTRIBUTE ds_002 by KEYFIELD: SMART/LOOKYP/MANY gets a 100% match in JOIN_DATA.

Am I correct in thinking the record size of ds_001 vs. ds_002 (left vs. right) is the reason?
John Meier
 
Posts: 18
Joined: Wed Jun 29, 2016 7:45 pm

Wed Sep 08, 2021 5:56 pm Change Time Zone

John,
Am I correct in thinking the record size of ds_001 vs. ds_002 (left vs. right) is the reason?
The first thing I notice is your use of the LOCAL option when every other option you try (SMART/LOOKUP/ALL) creates implicitly local operations anyway. So I would start by using SMART and losing the LOCAL and see if that works for you. Then go on to trying the LOOKUP and ALL options (also without LOCAL present).

One other thing: the typical use of JOIN would have the "large" dataset as the first parameter and the smaller as the second. So try switching the two around.

I would also eliminate the DISTRIBUTE, since you originally wanted SMART to pick your best option anyway.

Let me know what results you get with those changes.

HTH,

Richard
rtaylor
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 1606
Joined: Wed Oct 26, 2011 7:40 pm

Thu Sep 09, 2021 1:28 pm Change Time Zone

I had to use LOCAL because ds_001 was distributed (which I couldn't understand why since it was only 25 records in size). And despite it being smaller, it was on the LEFT side of the JOIN because it was attempting to pull a name from the RIGHT dataset when the KEY_FIELDs matched. I tried all of the JOIN options (with the forced LOCAL) and none matched. I distributed ds_002 (1,256 records), dropped the options, kept the LOCAL and it worked. I then dropped the DISTRIBUTEs, dropped the options, switched the LOCAL to HASH and it worked. I have since reworked the code so it doesn't DISTRIBUTE. I was looking to see if my assumptions about why the options weren't working made sense or not.
John Meier
 
Posts: 18
Joined: Wed Jun 29, 2016 7:45 pm

Thu Sep 09, 2021 1:53 pm Change Time Zone

BTW - the reworked JOIN uses the two non-DISTRIBUTEd files along with the SMART option.
John Meier
 
Posts: 18
Joined: Wed Jun 29, 2016 7:45 pm

Thu Sep 09, 2021 2:09 pm Change Time Zone

John,
I had to use LOCAL because ds_001 was distributed (which I couldn't understand why since it was only 25 records in size).
I assume you thought that because the compiler gave you a warning saying you "used DISTRIBUTE without LOCAL" but it was only a warning, so you did not have to use LOCAL.
And despite it being smaller, it was on the LEFT side of the JOIN because it was attempting to pull a name from the RIGHT dataset when the KEY_FIELDs matched.
Since you're just looking for matches between the two (an inner JOIN), it doesn't matter which order the two datasets are in for that logic. However, it does matter when you're talking about very large datasets where one is considerably larger than the other. In that case, it's typical to put the large one as the LEFT dataset (1st parm) and the small one as the RIGHT (2nd parm). Since your datasets are both very small, it doesn't matter which order they're in.
I tried all of the JOIN options (with the forced LOCAL) and none matched. I distributed ds_002 (1,256 records), dropped the options, kept the LOCAL and it worked. I then dropped the DISTRIBUTEs, dropped the options, switched the LOCAL to HASH and it worked. I have since reworked the code so it doesn't DISTRIBUTE.
Because the record set has only 25 records, DISTRIBUTE is unnecessary, and putting it in the second parm means that SMART has fewer records to copy to every node if it decides to make it an ALL JOIN for you.
I was looking to see if my assumptions about why the options weren't working made sense or not.
No, your assumption that it was the number of records in the two record sets was incorrect. Basically, you were just making a common mistake -- making the whole thing too complex (over-thinking the issue). Most of the options on JOIN are meant for working with huge datasets. With your data, a simple inner JOIN is all you really need, without any other options, because there's too little data to see any difference in performance with or without SMART or ALL or ...

HTH,

Richard
rtaylor
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 1606
Joined: Wed Oct 26, 2011 7:40 pm


Return to Programming

Who is online

Users browsing this forum: Bing [Bot] and 1 guest

cron