Tue Dec 07, 2021 5:59 pm
Login Register Lost Password? Contact Us


Smart Join

Questions around writing code and queries

Mon Nov 15, 2021 5:41 pm Change Time Zone

checkCorrectionKey := join(InputFile, CorrectionIndex,
keyed(left.DID = right.DID),
transform(left), smart);

"InputFile" may have one record or 15billion records( as per one dataset we have today in prod)

"CorrectionIndex" has currently 300records.. this would eventually grow but not too fast or too big.

With this said, if smart option a good choice in the join or any other suggestions?

Also, when tested this join it gave warning:

Warning: Smart specified on an unfiltered keyed join - was this intended?
HPCC_KK
 
Posts: 2
Joined: Mon Nov 15, 2021 5:32 pm

Mon Nov 15, 2021 7:35 pm Change Time Zone

HPCC_KK,

Personally, given that your CorrectionIndex only has 300 records and is expected to grow slowly, I would not use JOIN at all.

I'd do it more like this:
Code: Select all
SetCorrectionKey := SET(CorrectionIndex,DID);
checkCorrectionKey := InputFile(DID IN SetCorrectionKey);

Your inner JOIN is basically accomplishing exactly the same thing, since your TRANSFORM(LEFT) is just going to give you the records from InputFile that match the set of DID fields in the CorrectionIndex.

HTH,

Richard
rtaylor
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 1606
Joined: Wed Oct 26, 2011 7:40 pm

Tue Nov 16, 2021 3:03 am Change Time Zone

CorrectionIndex has 300records as of last week, it will grow rapidly after few months and it could be 10k+ records. I dont think SET is not an option for 10+ records. I am thinking of a lookup join or hash or smart join. Please suggest.

Thanks Richard.
HPCC_KK
 
Posts: 2
Joined: Mon Nov 15, 2021 5:32 pm

Tue Nov 16, 2021 6:34 pm Change Time Zone

HPCC_KK,

Just to test it, I created a SET containing 84,140 elements (a 10% sampling of the unique ID field from an 841,400 record dataset) then used that SET to filter the original dataset from which it came. It ran successfully in a bit over a minute on my 1-node VM environment.

So, IMO if it works well with an 84K SET, then a 10K SET will also work nicely, so I would personally still use that method.

HTH,

Richard
rtaylor
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 1606
Joined: Wed Oct 26, 2011 7:40 pm


Return to Programming

Who is online

Users browsing this forum: Bing [Bot] and 1 guest