DEDUP

DEDUP
Prev	Built-in Functions and Actions	Next

DEDUP(recordset [, condition [[MANY], ALL[, HASH]] [,BEST (sort-list)[[, KEEP n ] [, keeper ] ] [, LOCAL] [, UNORDERED | ORDERED( bool ) ] [, STABLE | UNSTABLE ] [, PARALLEL [ ( numthreads ) ] ] [, ALGORITHM( name ) ] )

recordset	The set of records to process, typically sorted in the same order that the expression will test. This may be the name of a dataset or derived record set, or any expression that results in a derived record set.
condition	Optional. A comma-delimited list of expressions or key fields in the recordset that defines "duplicate" records. The keywords LEFT and RIGHT may be used as dataset qualifiers for fields in the recordset. If the condition is omitted, every recordset field becomes the match condition. You may use the keyword RECORD (or WHOLE RECORD) to indicate all fields in that structure, and/or you may use the keyword EXCEPT to list non-dedup fields in the structure.
MANY	Optional. Specifies or perform a local sort/dedup before finding duplicates globally. This is most useful when many duplicates are expected.
ALL	Optional. Matches the condition against all records, not just adjacent records. This option may change the output order of the resulting records.
HASH	Optional. Specifies the ALL operation is performed using hash tables.
BEST	Optional. Provides additional control over which records are retained from a set of "duplicate" records. The first in the <sort-list> order of records are retained. BEST cannot be used with a KEEP parameter greater than 1.
sort-list	A comma delimited list of fields defining the duplicate records to keep.. The fields may be prefixed with a minus sign to require a reverse sort on that field.
KEEP	Optional. Specifies keeping n number of duplicate records. If omitted, the default behavior is to KEEP 1. Not valid with the ALL option present.
n	The number of duplicate records to keep. If keeper is set to RIGHT, the only valid number of duplicate records to keep is 1.
keeper	Optional. The keywords LEFT or RIGHT. LEFT (the default, if omitted) keeps the first record encountered and RIGHT keeps the last.
LOCAL	Optional. Specifies the operation is performed on each supercomputer node independently, without requiring interaction with all other nodes to acquire data; the operation maintains the distribution of any previous DISTRIBUTE.
UNORDERED	Optional. Specifies the output record order is not significant.
ORDERED	Specifies the significance of the output record order.
bool	When False, specifies the output record order is not significant. When True, specifies the default output record order.
STABLE	Optional. Specifies the input record order is significant.
UNSTABLE	Optional. Specifies the input record order is not significant.
PARALLEL	Optional. Try to evaluate this activity in parallel.
numthreads	Optional. Try to evaluate this activity using numthreads threads.
ALGORITHM	Optional. Override the algorithm used for this activity.
name	The algorithm to use for this activity. Must be from the list of supported algorithms for the SORT function's STABLE and UNSTABLE options.
Return:	DEDUP returns a set of records.

The DEDUP function evaluates the recordset for duplicate records, as defined by the condition parameter, and returns a unique return set. This is similar to the DISTINCT statement in SQL. The recordset should be sorted, unless ALL is specified.

If a condition parameter is a single value (field), DEDUP does a simple field-level de-dupe equivalent to LEFT.field=RIGHT.field. The condition is evaluated for each pair of adjacent records in the record set. If the condition returns TRUE, the keeper record is kept and the other removed.

The ALL option means that every record pair is evaluated rather than only those pairs adjacent to each other, irrespective of sort order. The evaluation is such that, for records 1, 2, 3, 4, the record pairs that are compared to each other are:

(1,2),(1,3),(1,4),(2,1),(2,3),(2,4),(3,1),(3,2),(3,4),(4,1),(4,2),(4,3)

This means two compares happen for each pair, allowing the condition to be non-commutative.

KEEP n effectively means leaving n records of each duplicate type. This is useful for sampling. The LEFT keeper value (implicit if neither LEFT nor RIGHT are specified) means that if the left and right records meet the de-dupe criteria (that is, they "match"), the left record is kept. If the RIGHT keeper appears instead, the right is kept. In both cases, the next comparison involves the de-dupe survivor; in this way, many duplicate records can collapse into one.

The BEST option provides additional control over which records are retained from a set of "duplicate" records. The first in the sort-list order of records are retained. The sort-list is comma delimited list of fields. The fields may be prefixed with a minus sign to require a reverse sort on that field.

DEDUP(recordset, field1, BEST(field2) ) means that from set of duplicate records, the first record from the set duplicates sorted by field2 is retained. DEDUP(recordset, field1, BEST(-field2) ) produces the last record sorted by field2 from the set of duplicates.

The BEST option cannot be used with a KEEP parameter greater than 1.

Example:

SomeFile := DATASET([{'001','KC','G'}, 
                     {'002','KC','Z'}, 
                     {'003','KC','Z'}, 
                     {'004','KC','C'}, 
                     {'005','KA','X'}, 
                     {'006','KB','A'}, 
                     {'007','KB','G'}, 
                     {'008','KA','B'}],{STRING3 Id, String2 Value1, String1 Value2});

SomeFile1 := SORT(SomeFile, Value1);

DEDUP(SomeFile1, Value1, BEST(Value2));
// Output: 
// id value1 value2
// 008 KA B
// 006 KB A
// 004 KC C

DEDUP(SomeFile1, Value1, BEST(-Value2));
// Output:
// id value1 value2
// 005 KA X
// 007 KB G
// 002 KC Z

DEDUP(SomeFile1, Value1, HASH, BEST(Value2));
// Output: 
// id value1 value2
// 008 KA B
// 006 KB A
// 004 KC C

Complex Record Set Conditions

The DEDUP function with the ALL option is useful in determining complex recordset conditions between records in the same recordset. Although DEDUP is traditionally used to eliminate duplicate records next to each other in the recordset, the conditional expression combined with the ALL option extends this capability. The ALL option causes each record to be compared according to the conditional expression to every other record in the recordset. This capability is most effective with small recordsets; larger recordsets should also use the HASH option.

Example:

LastTbl := TABLE(Person,{per_last_name});
Lasts   := SORT(LastTbl,per_last_name);
MySet   := DEDUP(Lasts,per_last_name);
      // unique last names -- this is exactly equivalent to:
      //MySet := DEDUP(Lasts,LEFT.per_last_name=RIGHT.per_last_name);
      // also exactly equivalent to:
      //MySet := DEDUP(Lasts);
NamesTbl1 := TABLE(Person,{per_last_name,per_first_name});
Names1    := SORT(NamesTbl1,per_last_name,per_first_name);
MyNames1  := DEDUP(Names1,RECORD);
      //dedup by all fields -- this is exactly equivalent to:
      //MyNames1  := DEDUP(Names,per_last_name,per_first_name);
      // also exactly equivalent to:
      //MyNames1 := DEDUP(Names1);
NamesTbl2 := TABLE(Person,{per_last_name,per_first_name, per_sex});
Names2    := SORT(NamesTbl,per_last_name,per_first_name);
MyNames2  := DEDUP(Names,RECORD, EXCEPT per_sex);
      //dedup by all fields except per_sex
      // this is exactly equivalent to:
      //MyNames2 := DEDUP(Names, EXCEPT per_sex);

/* In the following example, we want to determine how many 'AN' or 'AU' type inquiries 
have occurred within 3 days of a 'BB' type inquiry. 
The COUNT of inquiries in the deduped recordset is subtracted from the COUNT 
of the inquiries in the original recordset to provide the result.*/
INTEGER abs(INTEGER i) := IF ( i < 0, -i, i );
WithinDays(ldrpt,lday,rdrpt,rday,days) :=
  abs(DaysAgo(ldrpt,lday)-DaysAgo(rdrpt,rday)) <= days;
DedupedInqs := DEDUP(inquiry, LEFT.inq_ind_code='BB' AND
     RIGHT.inq_ind_code IN ['AN','AU'] AND
                          WithinDays(LEFT.inq_drpt,
           LEFT.inq_drpt_day,
           RIGHT.inq_drpt,
           RIGHT.inq_drpt_day,3),
        ALL );
InqCount := COUNT(Inquiry) - COUNT(DedupedInqs);
OUTPUT(person(InqCount >0),{InqCount});

See Also: SORT, ROLLUP, TABLE, FUNCTION Structure

Prev	Up	Next
CRON	Home	DEFINE