Skip to main content

JOIN

JOIN(leftrecset, rightrecset, joincondition [, transform] [, jointype] [, joinflags] )

JOIN(setofdatasets, joincondition, transform, SORTED( fields) [, jointype] )

leftrecsetThe left set of records to process.
rightrecsetThe right set of records to process. This may be an INDEX.
joinconditionAn expression specifying how to match records in the leftrecset and rightrecset or setofdatasets (see Matching Logic discussions below). In the expression, the keyword LEFT is the dataset qualifier for fields in the leftrecset and the keyword RIGHT is the dataset qualifier for fields in the rightrecset.
transformOptional. The TRANSFORM function to call for each pair of records to process. If omitted, JOIN returns all fields from both the leftrecset and rightrecset, with the second of any duplicate named fields removed.
jointypeOptional. An inner join if omitted, else one of the listed types in the JOIN Types section below.
joinflagsOptional. Any option (see the JOIN Options section below) to specify exactly how the JOIN operation executes.
setofdatasetsThe SET of recordsets to process ([idx1,idx2,idx3]), typically INDEXes, which all must have the same format.
SORTEDSpecifies the sort order of records in the input setofdatasets and also the output sort order of the result set.
fieldsA comma-delimited list of fields in the setofdatasets, which must be a subset of the input sort order. These fields must all be used in the joincondition as they define the order in which the fields are STEPPED.
Return:JOIN returns a record set.

The JOIN function produces a result set based on the intersection of two or more datasets or indexes (as determined by the joincondition).

JOIN Two Datasets

JOIN(leftrecset, rightrecset, joincondition [, transform] [, jointype] [, joinflags] )

The first form of JOIN processes through all pairs of records in the leftrecset and rightrecset and evaluates the condition to find matching records. If the condition and jointype specify the pair of records qualifies to be processed, the transform function executes, generating the result.

JOIN dynamically sorts/distributes the leftrecset and rightrecset as needed to perform its operation based on the condition specified, therefore the output record set is not guaranteed to be in the same order as the input record sets. If JOIN does do a dynamic sort of its input record sets, that new sort order cannot be relied upon to exist past the execution of the JOIN. This principle also applies to any GROUPing--the records are automatically "un-grouped" as needed except under the following circumstances:

* For LOOKUP and ALL joins, the GROUPing and sort order of the leftrecset are preserved.

* For KEYED joins the GROUPing (but not the sort order) of the leftrecset is preserved.

Matching Logic - JOIN

The record matching joincondition is processed internally as two parts:

"equality" (hard match)All the simple "LEFT.field = RIGHT.field" logic that defines matching records. For JOINs that use keys, all these must be fields in the key to qualify for inclusion in this part. If there is no "equality" part to the joincondition logic, then you get a "JOIN too complex" error.
"non-equality" (soft match)All other matching criteria in the joincondition logic, such as "LEFT.field > RIGHT.field" expressions or any OR logic that may be involved with the final determination of which leftrecset and rightrecset records actually match.

This internal logic split allows the JOIN code to be optimized for maximum efficiency--first the "equality" logic is evaluated to provide an interim result that is then evaluated against any "non-equality" in the matching joincondition.

Options

The following joinflags options may be specified to determine exactly how the JOIN executes.

[, PARTITION LEFT | PARTITION RIGHT | [MANY] LOOKUP [ FEW] ] | GROUPED | ALL | NOSORT [ ( which ) ] | KEYED [ (index) [, UNORDERED ] ] | LOCAL | HASH ]][, KEEP(n) ] [, ATMOST( [ condition, ] n ) ] [, LIMIT( value [, SKIP | transform | FAIL ]) ] [, SKEW(limit [, target] ) [, THRESHOLD( size ) ] ] [, SMART ] [, UNORDERED | ORDERED( bool ) ] [, STABLE | UNSTABLE ] [, PARALLEL [ ( numthreads ) ] ] [, ALGORITHM( name ) ]

PARTITION LEFT | RIGHTSpecifies which recordset provides the partition points that determine how the records are sorted and distributed amongst the supercomputer nodes. PARTITION RIGHT specifies the rightrecset while PARTITION LEFT specifies the leftrecset. If omitted, PARTITION LEFT is the default.
[MANY] LOOKUPSpecifies the rightrecset is a relatively small file of lookup records that can be fully copied to every node. If MANY is not present, the rightrecset records bear a Many to 0/1 relationship with the records in the leftrecset (for each record in the leftrecset there is at most 1 record in the rightrecset). If MANY is present, the rightrecset records bear a Many to 0/Many relationship with the records in the leftrecset. This option allows the optimizer to avoid unnecessary sorting of the leftrecset. Valid only for inner, LEFT OUTER, or LEFT ONLY jointypes. The ATMOST, LIMIT, and KEEP options are supported in conjunction with MANY LOOKUP.
SMARTSpecifies to use an in-memory lookup when possible, but use a distributed join if the right dataset is large.
FEWSpecifies the LOOKUP rightrecset has few records, so little memory is used, allowing multiple lookup joins to be included in the same Thor subgraph.
GROUPEDSpecifies the same action as MANY LOOKUP but preserves grouping. Primarily used in the rapid Data Delivery Engine. Valid only for inner, LEFT OUTER, or LEFT ONLY jointypes. The ATMOST, LIMIT, and KEEP options are supported in conjunction with GROUPED.
ALLSpecifies the rightrecset is a small file that can be fully copied to every node, which allows the compiler to ignore the lack of any "equality" portion to the condition, eliminating the "join too complex" error that the condition would normally produce. If an "equality" portion is present, the JOIN is internally executed as a MANY LOOKUP. The KEEP option is supported in conjunction with this option.
NOSORTPerforms the JOIN without dynamically sorting the tables. This implies that the leftrecset and/or rightrecset must have been previously sorted and partitioned based on the fields specified in the joincondition so that records can be easily matched.
whichOptional. The keywords LEFT or RIGHT to indicate the leftrecset or rightrecset has been previously sorted. If omitted, NOSORT assumes both the leftrecset and rightrecset have been previously sorted.
KEYEDSpecifies using indexed access into the rightrecset (see INDEX).
indexOptional. The name of an INDEX into the rightrecset for a full-keyed JOIN (see below). If omitted, indicates the rightrecset will always be an INDEX (useful when the rightrecset is passed in as a parameter to a function).
UNORDEREDOptional. Specifies the KEYED JOIN operation does not preserve the sort order of the leftrecset.
LOCALSpecifies the operation is performed on each supercomputer node independently, without requiring interaction with all other nodes to acquire data; the operation maintains the distribution of any previous DISTRIBUTE.
HASHSpecifies an implicit DISTRIBUTE of the leftrecset and rightrecset across the supercomputer nodes based on the joincondition so each node can do its job with local data.
KEEP(n)Specifies the maximum number of matching records (n) to generate into the result set. If omitted, all matches are kept. This is useful where there may be many matching pairs and you need to limit the number in the result set. KEEP is not supported for RIGHT OUTER, RIGHT ONLY, LEFT ONLY, or FULL ONLY jointypes.
ATMOSTSpecifies a maximum number of matching records which, if exceeded, eliminates all those matches from the result set. This is useful for situations where you need to eliminate all "too many matches" record pairs from the result set. ATMOST is not supported on RIGHT ONLY or RIGHT OUTER jointypes. There are two forms: ATMOST(condition, n) -- maximum is computed only for the condition. ATMOST(n) -- maximum is computed for the entire joincondition, unless KEYED is used in the joincondition, in which case only the KEYED expressions are used. When ATMOST is specified (and the JOIN is not full or half-keyed), the joincondition and condition may include string field comparisons that use string indexing with an asterisk as the upper bound, as in this example: J1 := JOIN(dsL,dsR, LEFT.name[1..*]=RIGHT.name[3..*] AND LEFT.val < RIGHT.val, T(LEFT,RIGHT), ATMOST(LEFT.name[1..*]=RIGHT.name[3..*],3)); The asterisk indicates matching as many characters as necessary to reduce the number of candidate matches to below the ATMOST number (n).
conditionA portion of the joincondition expression.
nSpecifies the maximum number of matches allowed.
LIMITSpecifies a maximum number of matching records which, if exceeded, either fails the job, or eliminates all those matches from the result set. This is useful for situations where you need to eliminate all "too many matches" record pairs from the result set. Typically used for KEYED and "half-keyed" joins (see below), LIMIT differs from ATMOST primarily by its affect on a LEFT OUTER join, in which a leftrecset record with too many matching records would be treated as a non-match by ATMOST (the leftrecset record would be in the output with no matching rightrecset records), whereas LIMIT would either fail the job entirely, or SKIP the record (eliminating the leftrecset record entirely from the output). If omitted, the default is LIMIT(10000). The LIMIT is applied to the set of records that meet the the hard match ("equality") portion of the joincondition but before the soft match ("non-equality") portion of the joincondition is evaluated.
valueThe maximum number of matches allowed; LIMIT(0) is unlimited.
SKIPOptional. Specifies eliminating the matching records that exceed the maximum value of the LIMIT result instead of failing the job.
transformOptional. Specifies outputting a single record produced by the transform instead of failing the workunit (similar to the ONFAIL option of the LIMIT function).
FAILOptional. Specifies using the FAIL action to configure the error message when the job fails.
SKEWIndicates that you know the data for this join will not be spread evenly across nodes (will be skewed after both files have been distributed based on the join condition) and you choose to override the default by specifying your own limit value to allow the job to continue despite the skewing. Only valid on non-keyed joins (the KEYED option is not present and the rightrecset is not an INDEX).
limitA value between zero (0) and one (1.0 = 100%) indicating the maximum percentage of skew to allow before the job fails (the default is 0.1 = 10%).
targetOptional. A value between zero (0) and one (1.0 = 100%) indicating the desired maximum percentage of skew to allow (the default is 0.1 = 10%).
THRESHOLDIndicates the minimum size for a single part of either the leftrecset or rightrecset before the SKEW limit is enforced. Only valid on non-keyed joins (the KEYED option is not present and the rightrecset is not an INDEX).
sizeAn integer value indicating the minimum number of bytes for a single part.
UNORDEREDOptional. Specifies the output record order is not significant.
ORDEREDSpecifies the significance of the output record order.
boolWhen False, specifies the output record order is not significant. When True, specifies the default output record order.
STABLEOptional. Specifies the input record order is significant.
UNSTABLEOptional. Specifies the input record order is not significant.
PARALLELOptional. Try to evaluate this activity in parallel.
numthreadsOptional. Try to evaluate this activity using numthreads threads.
ALGORITHMOptional. Override the algorithm used for this activity.
nameThe algorithm to use for this activity. Must be from the list of supported algorithms for the SORT function's STABLE and UNSTABLE options.

The following options are mutually exclusive and may only be used to the exclusion of the others in this list: PARTITION LEFT | PARTITION RIGHT | [MANY] LOOKUP | GROUPED | ALL | NOSORT | HASH

In addition to this list, the KEYED and LOCAL options are also mutually exclusive with the options listed above, but not to each other. When both KEYED and LOCAL options are specified, only the INDEX part(s) on each node are accessed by that node.

Typically, the leftrecset should be larger than the rightrecset to prevent skewing problems (because PARTITION LEFT is the default behavior). If the LOOKUP or ALL options are specified, the rightrecset must be small enough to be loaded into memory on every node, and the operation is then implicitly LOCAL. The ALL option is impractical if the rightrecset is larger than a few thousand records (due to the number of comparisons required). The size of the rightrecset is irrelevant in the case of "half-keyed" and "full-keyed" JOINs (see the Keyed Join discussion below).

Use SMART when the right side dataset is likely to be small enough to fit in memory, but is not guaranteed to fit.

If you get an error similar to this:

"error: 1301: Pool memory exhausted:..."

this means the rightrecset is too large and a LOOKUP JOIN should not be used. A SMART JOIN may be a good option in this case.

Keyed Joins

A "full-keyed" JOIN uses the KEYED option and the joincondition must be based on key fields in the index. The join is actually done between the leftrecset and the index into the rightrecset--the index needs the dataset's record pointer (virtual(fileposition)) field to properly fetch records from the rightrecset. The typical KEYED join passes only the rightrecset to the TRANSFORM.

If the rightrecset is an INDEX, the operation is a "half-keyed" JOIN. Usually, the INDEX in a "half-keyed" JOIN contains "payload" fields, which frequently eliminates the need to read the base dataset. If this is the case, the "payload" INDEX does not need to have the dataset's record pointer (virtual(fileposition)) field declared. For a "half-keyed" JOIN the joincondition may use the KEYED and WILD keywords that are available for use in INDEX filters, only.

For both types of keyed join, any GROUPing of the base record sets is left untouched. See KEYED and WILD for a discussion of INDEX filtering.

Join Logic

The JOIN operation follows this logic:

1. Record distribution/sorting to get match candidates on the same nodes.

The PARTITION LEFT, PARTITION RIGHT, LOOKUP, ALL, NOSORT, KEYED, HASH, and LOCAL options indicate how this happens. These options are mutually exclusive; only one may be specified, and PARTITION LEFT is the default. SKEW and THRESHOLD may modify the requested behaviour. LOOKUP also has the additional effect of deduping the rightrecset by the joincondition.

2. Record matching.

The joincondition, LIMIT, and ATMOST determine how this is done.

An implicit limit of 10000 is added when there is no LIMIT specified AND the following is true:

There is no ATMOST limit specified AND it is not a LEFT ONLY JOIN AND (there is either no KEEP limit specified OR the JOIN has a postfilter).

3. Determine what matches to pass to transform.

The jointype determines this.

4. Generate output records through the TRANSFORM function.

The implicit or explicit transform parameter determines this.

5. Filter output records with SKIP.

If the transform for a record pair results in a SKIP, then the output record is not counted towards any KEEP option totals.

6. Limit output records with KEEP.

Any output records for a given leftrecset record over and above the permitted KEEP value are discarded. In a FULL OUTER join, rightrecset records that match no record are treated as if they all matched different default leftrecset records (that is, the KEEP counter is reset for each one).

TRANSFORM Function Requirements - JOIN

The transform function must take at least one or two parameters: a LEFT record formatted like the leftrecset, and/or a RIGHT record formatted like the rightrecset (which may be of different formats). The format of the resulting record set need not be the same as either of the inputs.

Join Types: Two Datasets

The following jointypes produce the following types of results, based on the records matching produced by the joincondition:

inner (default)Only those records that exist in both the leftrecset and rightrecset.
LEFT OUTERAt least one record for every record in the leftrecset.
RIGHT OUTERAt least one record for every record in the rightrecset.
FULL OUTERAt least one record for every record in the leftrecset and rightrecset.
LEFT ONLYOne record for each leftrecset record with no match in the rightrecset.
RIGHT ONLYOne record for each rightrecset record with no match in the leftrecset.
FULL ONLYOne record for each leftrecset and rightrecset record with no match in the opposite record set.

Example:

outrec := RECORD
  people.id;
  people.firstname;
  people.lastname;
END;

RT_folk := JOIN(people(firstname[1] = 'R'),
                people(lastname[1] = 'T'),
                LEFT.id=RIGHT.id,
                TRANSFORM(outrec,SELF := LEFT));
OUTPUT(RT_folk);

//*********************** Half KEYED JOIN example:
peopleRecord := RECORD
  INTEGER8 id;
  STRING20 addr;
END;
peopleDataset := DATASET([{3000,'LONDON'},{3500,'SMITH'},
                          {30,'TAYLOR'}], peopleRecord);
PtblRec doHalfJoin(peopleRecord l) := TRANSFORM
  SELF := l;
END;
FilledRecs3 := JOIN(peopleDataset, SequenceKey,
                    LEFT.id=RIGHT.sequence,doHalfJoin(LEFT));
FilledRecs4 := JOIN(peopleDataset, AlphaKey,
                    LEFT.addr=RIGHT.Lname,doHalfJoin(LEFT));

//******************* Full KEYED JOIN example:
PtblRec := RECORD
  INTEGER8 seq;
  STRING2  State;
  STRING20 City;
  STRING25 Lname;
  STRING15 Fname;
END;
PtblRec Xform(person L, INTEGER C) := TRANSFORM
  SELF.seq      := C;
  SELF.State    := L.per_st;
  SELF.City     := L.per_full_city;
  SELF.Lname    := L.per_last_name;
  SELF.Fname    := L.per_first_name;
END;
Proj := PROJECT(Person(per_last_name[1]=per_first_name[1]),
                Xform(LEFT,COUNTER));
PtblOut := OUTPUT(Proj,,'~RTTEMP::TestKeyedJoin',OVERWRITE);

Ptbl := DATASET('RTTEMP::TestKeyedJoin',
                {PtblRec,UNSIGNED8 __fpos {virtual(fileposition)}},
                FLAT);
AlphaKey := INDEX(Ptbl,{lname,fname,__fpos},
                  '~RTTEMPkey::lname.fname');
SeqKey := INDEX(Ptbl,{seq,__fpos},'~RTTEMPkey::sequence');

Bld1 := BUILD(AlphaKey ,OVERWRITE);
Bld2 := BUILD(SeqKey,OVERWRITE);
peopleRecord := RECORD
  INTEGER8 id;
  STRING20 addr;
END;
peopleDataset := DATASET([{3000,'LONDON'},{3500,'SMITH'},
                          {30,'TAYLOR'}], peopleRecord);
joinedRecord := RECORD
  PtblRec;
  peopleRecord;
END;
joinedRecord doJoin(peopleRecord l, Ptbl r) := TRANSFORM
 SELF := l;
 SELF := r;
END;

FilledRecs1 := JOIN(peopleDataset, Ptbl,LEFT.id=RIGHT.seq,
                    doJoin(LEFT,RIGHT), KEYED(SeqKey));
FilledRecs2 := JOIN(peopleDataset, Ptbl,LEFT.addr=RIGHT.Lname,
                    doJoin(LEFT,RIGHT), KEYED(AlphaKey));
SEQUENTIAL(PtblOut,Bld1,Bld2,OUTPUT(FilledRecs1),OUTPUT(FilledRecs2))

JOIN Set of Datasets

JOIN(setofdatasets, joincondition, transform, SORTED( fields) [, jointype] [, UNORDERED | ORDERED( bool ) ] [, STABLE | UNSTABLE ] [, PARALLEL [ ( numthreads ) ] ] [, ALGORITHM( name ) ] )

The second form of JOIN is similar to the MERGEJOIN function in that it takes a SET OF DATASETs as its first parameter. This allows the possibility of joining more than two datasets in a single operation.

Record Matching Logic

The record matching joincondition may contain two parts: a STEPPED condition that may optionally be ANDed with non-STEPPED conditions. The STEPPED expression contains leading equality expressions of the fields from the SORTED option (trailing components may be range comparisons if the range values are independent of the LEFT and RIGHT rows), ANDed together, using LEFT and RIGHT as dataset qualifiers. If not present, the STEPPED condition is deduced from the fields specified by the SORTED option.

The order of the datasets within the setofdatasets can be significant to the way the joincondition is evaluated. The joincondition is duplicated between adjacent pairs of datasets, which means that this joincondition:

       LEFT.field = RIGHT.field

when applied against a setofdatasets containing three datasets, is logically equivalent to:

       ds1.field = ds2.field AND ds2.field = ds3.field

TRANSFORM Function Requirements - JOIN setofdatasets

The transform function must take at least one parameter which must take either of two forms:

LEFTformatted like any of the setofdatasets. This indicates the first dataset in the setofdatasets.
ROWS(LEFT)formatted like any of the setofdatasets. This indicates a record set made up of all records from any dataset in the setofdatasets that match the joincondition--this may not include all the datasets in the setofdatasets, depending on which jointype is specified.

The format of the resulting output record set must be the same as the input datasets.

Join Types: setofdatasets

The following jointypes produce the following types of results, based on the records matching produced by the joincondition:

INNERThis is the default if no jointype is specified. Only those records that exist in all datasets in the setofdatasets.
LEFT OUTERAt least one record for every record in the first dataset in the setofdatasets.
LEFT ONLYOne record for every record in the first dataset in the setofdatasets for which there is no match in any of the subsequent datasets.
MOFN(min [,max])One record for every record with matching records in min number of adjacent datasets within the setofdatasets. If max is specified, the record is not included if max number of dataset matches are exceeded.

Example:

Rec := RECORD,MAXLENGTH(4096)
  STRING1  Letter;
  UNSIGNED1    DS;
  UNSIGNED1    Matches   := 0;
  UNSIGNED1    LastMatch := 0;
  SET OF UNSIGNED1 MatchDSs  := [];
END;

ds1 := DATASET([{'A',1},{'B',1},{'C',1},{'D',1},{'E',1}],Rec);
ds2 := DATASET([{'A',2},{'B',2},{'H',2},{'I',2},{'J',2}],Rec);
ds3 := DATASET([{'B',3},{'C',3},{'M',3},{'N',3},{'O',3}],Rec);
ds4 := DATASET([{'A',4},{'B',4},{'R',4},{'S',4},{'T',4}],Rec);
ds5 := DATASET([{'B',5},{'V',5},{'W',5},{'X',5},{'Y',5}],Rec);
SetDS := [ds1,ds2,ds3,ds4,ds5];

Rec XF(Rec L,DATASET(Rec) Matches) := TRANSFORM
  SELF.Matches   := COUNT(Matches);
  SELF.LastMatch := MAX(Matches,DS);
  SELF.MatchDSs  := SET(Matches,DS);
  SELF := L;
END;
j1 := JOIN(SetDS,
           STEPPED(LEFT.Letter=RIGHT.Letter),
           XF(LEFT,ROWS(LEFT)),SORTED(Letter));
j2 := JOIN(SetDS,
           STEPPED(LEFT.Letter=RIGHT.Letter),
           XF(LEFT,ROWS(LEFT)),SORTED(Letter),LEFT OUTER);
j3 := JOIN(SetDS,
           STEPPED(LEFT.Letter=RIGHT.Letter),  
           XF(LEFT,ROWS(LEFT)),SORTED(Letter),LEFT ONLY);
j4 := JOIN(SetDS,
           STEPPED(LEFT.Letter=RIGHT.Letter),
           XF(LEFT,ROWS(LEFT)),SORTED(Letter),MOFN(3));
j5 := JOIN(SetDS,
           STEPPED(LEFT.Letter=RIGHT.Letter),
           XF(LEFT,ROWS(LEFT)),SORTED(Letter),MOFN(3,4));

OUTPUT(j1);
OUTPUT(j2);
OUTPUT(j3);
OUTPUT(j4);
OUTPUT(j5);

See Also: TRANSFORM Structure, RECORD Structure, SKIP, ROWDIFF, STEPPED, KEYED/WILD, MERGEJOIN