

Data cleansing: facing a challenge

Comments and questions related to the Enterprise Control Language

Wed Feb 27, 2019 10:38 am

Hi all,

Standardizing this data is proving very challenging.
In a first transform I did a find-and-replace that turns 'infoTech Pvt Ltd' into 'infoTech private Ltd'.
I don't think that is ideal. Is it possible to do everything in a single transform, so that the result is 'infoTech private Limited'?

My inline data looks like this:
IMPORT STD.Str;

Layout:=record
UNSIGNED1 cid;
STRING Company_Name;
end;

CompRec:=DATASET([{1,'infoTech private Ltd'},
{2,'infoTech Pvt Ltd'},
{3,'gate private'}],
Layout);


lookRecSet:=DATASET([{'pvt','private'},
{'Ltd','Limited'}],
{STRING findWord,
STRING replaceWord}
);


//THIS JOIN,will only do the FIND REPLACE FOR SINGLE WORD
JoinLookup := JOIN(CompRec,
lookRecSet,
//regexfind(RIGHT.findWord,LEFT.cname,nocase),
Str.FindWord(LEFT.Company_Name,RIGHT.findWord,TRUE),
TRANSFORM(RECORDOF(LEFT),
SELF.Company_Name:=Str.FindReplace(LEFT.Company_Name,RIGHT.findWord,RIGHT.replaceWord);
SELF:=LEFT;
),
ALL);


JOIN(JoinLookup,
lookRecSet,
Str.FindWord(LEFT.Company_Name,RIGHT.findWord,TRUE),
TRANSFORM(RECORDOF(LEFT),
SELF.Company_Name:=Str.FindReplace(LEFT.Company_Name,RIGHT.findWord,RIGHT.replaceWord);
SELF:=LEFT;
),
ALL
);
Can anyone suggest how to do this? Thanks.
suleman Shreef
 
Posts: 21
Joined: Wed Feb 27, 2019 9:15 am

Wed Feb 27, 2019 2:16 pm

suleman,

Here's one way to approach the problem:
Code:
IMPORT STD;

Layout:=record
  UNSIGNED1 cid;
  STRING Company_Name;
END;

CompRec:=DATASET([{1,'infoTech Private Ltd'},
                  {2,'infoTech Pvt Ltd'},
                  {3,'gate private'}],
                 Layout);

lookRecs:=DATASET([{'Pvt','Private'},{'Ltd','Limited'},{'gate','fred'}],
                   {STRING findWord,STRING replaceWord});

ReplaceCnt := COUNT(LookRecs);  //how many words to replace
LoopBody(DATASET(Layout) ds,INTEGER C) :=
  PROJECT(ds,
          TRANSFORM(Layout,
                    Rec := lookRecs[C];  //which replace words to use this time
                    SELF.Company_Name := Std.Str.FindReplace(LEFT.Company_Name,
                                                             rec.findWord,
                                                             rec.replaceWord),
                    SELF := LEFT));
LOOP(CompRec,ReplaceCnt,LoopBody(ROWS(LEFT),COUNTER));

The LOOP will run the PROJECT as many times as you have words to replace, always working with the result of the previous LOOP iteration. So it will replace one word at a time in each record.
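
To see what each pass contributes, here is a minimal sketch that reruns the same LOOP with a fixed iteration count of 1 and then 2, alongside the full count. It repeats the definitions from the code above so it can run on its own; the NAMED result labels are my own, chosen only for illustration.

```ecl
IMPORT STD;

Layout := RECORD
  UNSIGNED1 cid;
  STRING Company_Name;
END;

CompRec := DATASET([{1,'infoTech Private Ltd'},
                    {2,'infoTech Pvt Ltd'},
                    {3,'gate private'}],
                   Layout);

lookRecs := DATASET([{'Pvt','Private'},{'Ltd','Limited'},{'gate','fred'}],
                    {STRING findWord,STRING replaceWord});

// Same loop body as above: pass C applies the C-th find/replace pair
LoopBody(DATASET(Layout) ds,INTEGER C) :=
  PROJECT(ds,
          TRANSFORM(Layout,
                    Rec := lookRecs[C];
                    SELF.Company_Name := STD.Str.FindReplace(LEFT.Company_Name,
                                                             Rec.findWord,
                                                             Rec.replaceWord),
                    SELF := LEFT));

// Stop early to inspect the intermediate results (labels are illustrative)
OUTPUT(LOOP(CompRec,1,LoopBody(ROWS(LEFT),COUNTER)),NAMED('After_Pass_1'));
OUTPUT(LOOP(CompRec,2,LoopBody(ROWS(LEFT),COUNTER)),NAMED('After_Pass_2'));
OUTPUT(LOOP(CompRec,COUNT(lookRecs),LoopBody(ROWS(LEFT),COUNTER)),
       NAMED('After_All_Passes'));
```

With this sample data I would expect pass 1 to change only record 2 ('Pvt' to 'Private'), pass 2 to turn 'Ltd' into 'Limited' in records 1 and 2, and the final pass to turn 'gate' into 'fred' in record 3. Note that Std.Str.FindReplace matches substrings rather than whole words, so a findWord such as 'gate' would also be replaced inside a longer word like 'investigate'.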

HTH,

Richard
rtaylor
Community Advisory Board Member
 
Posts: 1594
Joined: Wed Oct 26, 2011 7:40 pm

Thu Feb 28, 2019 10:16 am

Thanks a lot, Richard.
Your code helped me achieve what I needed.

Regards,
Suleman Shreef
suleman Shreef
 
Posts: 21
Joined: Wed Feb 27, 2019 9:15 am

Thu Feb 28, 2019 1:00 pm

Suleman,

OK, I wasn't happy with the LOOP solution, because it PROJECTs through the same set of records once per replacement word and so could be relatively inefficient on large datasets.

So, here's an alternative that I would expect to be more efficient:
Code:
IMPORT STD;

Layout:=record
  UNSIGNED1 cid;
  STRING Company_Name;
END;

CompRec:=DATASET([{1,'infoTech Private Ltd'},
                  {2,'infoTech Pvt Ltd'},
                  {3,'gate private'}],
                 Layout);

lookRecs:=DATASET([{'Pvt','Private'},{'Ltd','Limited'},{'gate','fred'}],
                   {STRING findWord,STRING replaceWord});

//DICTIONARY Solution
ReplaceDCT := DICTIONARY(LookRecs,{findWord => replaceWord});

ReplaceFunc(STRING s) := FUNCTION
  rec := {STRING w};
  DSwords := DATASET(Std.Str.SplitWords(s,' '),rec);
  P := PROJECT(DSWords,
               TRANSFORM(rec,
                         SELF.w := IF(LEFT.w IN ReplaceDCT,
                                      ReplaceDCT[LEFT.w].replaceWord,
                                      LEFT.w)));
  RETURN ROLLUP(P,TRUE,
                TRANSFORM(rec,SELF.w := TRIM(LEFT.w + ' ' + RIGHT.w,LEFT)))[1].w;                                 
END;

ProjRecs := PROJECT(CompRec,
               TRANSFORM(layout,
                         SELF.Company_Name := ReplaceFunc(LEFT.Company_Name),
                         SELF := LEFT));
OUTPUT(ProjRecs,NAMED('DICTIONARY_Solution'));

This example uses a DICTIONARY for the replacement words, allowing a single pass through the CompRec dataset.
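
The two DICTIONARY operations that ReplaceFunc relies on, the IN membership test and the keyed lookup, can be tried in isolation with the sketch below. It also shows one possible way to make the match case-insensitive, since the lookup compares strings exactly, so 'ltd' should not match the key 'Ltd'. The names LookupRec, HasWord, MapWord, UpperDCT and MapWordNoCase are hypothetical, introduced only for this illustration, and upper-casing the keys is my assumption about how one might handle mixed-case input, not part of the solution above.

```ecl
IMPORT STD;

LookupRec := RECORD
  STRING findWord;
  STRING replaceWord;
END;

lookRecs := DATASET([{'Pvt','Private'},{'Ltd','Limited'}],LookupRec);

ReplaceDCT := DICTIONARY(lookRecs,{findWord => replaceWord});

// Membership test and keyed lookup on the dictionary
HasWord(STRING w) := w IN ReplaceDCT;
MapWord(STRING w) := IF(w IN ReplaceDCT,ReplaceDCT[w].replaceWord,w);

OUTPUT(HasWord('Ltd'),NAMED('Has_Ltd'));
OUTPUT(MapWord('Ltd'),NAMED('Map_Ltd'));
OUTPUT(MapWord('ltd'),NAMED('Map_lowercase_ltd'));  // key differs in case

// Assumption: upper-case the keys once, then upper-case each word before lookup
UpperRecs := PROJECT(lookRecs,
                     TRANSFORM(LookupRec,
                               SELF.findWord := STD.Str.ToUpperCase(LEFT.findWord),
                               SELF := LEFT));
UpperDCT := DICTIONARY(UpperRecs,{findWord => replaceWord});

MapWordNoCase(STRING w) := FUNCTION
  k := STD.Str.ToUpperCase(w);
  RETURN IF(k IN UpperDCT,UpperDCT[k].replaceWord,w);
END;

OUTPUT(MapWordNoCase('ltd'),NAMED('NoCase_ltd'));
```

Because the dictionary is keyed on whole words, this approach only replaces a word that matches a key in full, unlike the substring matching of Std.Str.FindReplace.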

HTH,

Richard
rtaylor
Community Advisory Board Member
 
Posts: 1594
Joined: Wed Oct 26, 2011 7:40 pm

Thu Feb 28, 2019 2:54 pm

Thank you, Richard. Superb thinking!
suleman Shreef
 
Posts: 21
Joined: Wed Feb 27, 2019 9:15 am

Thu Feb 28, 2019 3:22 pm

Hi Richard,
I have done a small project in HPCC. Could you take a look at it? If you send me your email address, I will forward my HPCC project to you.
suleman Shreef
 
Posts: 21
Joined: Wed Feb 27, 2019 9:15 am

