Data cleansing: facing a challenge
Hi all,
I am finding data standardization very challenging.
My first transform does a find-and-replace that turns 'infoTech Pvt Ltd' into 'infoTech private Ltd'.
That is not the full result I want. Is it possible to get all the way to 'infoTech private Limited' with a single transform?
My inline data looks like this:
IMPORT STD.Str;

Layout := RECORD
  UNSIGNED1 cid;
  STRING Company_Name;
END;

CompRec := DATASET([{1,'infoTech private Ltd'},
                    {2,'infoTech Pvt Ltd'},
                    {3,'gate private'}],
                   Layout);

lookRecSet := DATASET([{'pvt','private'},
                       {'Ltd','Limited'}],
                      {STRING findWord,
                       STRING replaceWord});

// This JOIN only does the find/replace for a single word per output record
JoinLookup := JOIN(CompRec,
                   lookRecSet,
                   //REGEXFIND(RIGHT.findWord,LEFT.Company_Name,NOCASE),
                   Str.FindWord(LEFT.Company_Name,RIGHT.findWord,TRUE),
                   TRANSFORM(RECORDOF(LEFT),
                     SELF.Company_Name:=Str.FindReplace(LEFT.Company_Name,RIGHT.findWord,RIGHT.replaceWord);
                     SELF:=LEFT;
                   ),
                   ALL);

// Second pass: the same JOIN again, applied to the output of the first
JOIN(JoinLookup,
     lookRecSet,
     Str.FindWord(LEFT.Company_Name,RIGHT.findWord,TRUE),
     TRANSFORM(RECORDOF(LEFT),
       SELF.Company_Name:=Str.FindReplace(LEFT.Company_Name,RIGHT.findWord,RIGHT.replaceWord);
       SELF:=LEFT;
     ),
     ALL
    );
Can anyone suggest how to do this? Thanks.
- suleman Shreef
- Posts: 21
- Joined: Wed Feb 27, 2019 9:15 am
suleman,
Here's one way to approach the problem:
IMPORT STD;

Layout := RECORD
  UNSIGNED1 cid;
  STRING Company_Name;
END;

CompRec := DATASET([{1,'infoTech Private Ltd'},
                    {2,'infoTech Pvt Ltd'},
                    {3,'gate private'}],
                   Layout);

lookRecs := DATASET([{'Pvt','Private'},{'Ltd','Limited'},{'gate','fred'}],
                    {STRING findWord,STRING replaceWord});

ReplaceCnt := COUNT(lookRecs); //how many words to replace

LoopBody(DATASET(Layout) ds,INTEGER C) :=
  PROJECT(ds,
          TRANSFORM(Layout,
                    Rec := lookRecs[C]; //which replace words to use this time
                    SELF.Company_Name := Std.Str.FindReplace(LEFT.Company_Name,
                                                             Rec.findWord,
                                                             Rec.replaceWord),
                    SELF := LEFT));

LOOP(CompRec,ReplaceCnt,LoopBody(ROWS(LEFT),COUNTER));
The LOOP will run the PROJECT as many times as you have words to replace, always working with the result of the previous LOOP iteration. So it will replace one word at a time in each record.
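As a rough illustration of what those iterations do to a single record, the three passes can be unrolled by hand. This is only a sketch: the names name0 through name3 are hypothetical and exist only for this example, and it uses the same three replacement pairs as lookRecs.

```ecl
IMPORT STD;

// Hypothetical unrolled version of the LOOP for one company name
name0 := 'infoTech Pvt Ltd';
name1 := STD.Str.FindReplace(name0, 'Pvt',  'Private'); // iteration 1
name2 := STD.Str.FindReplace(name1, 'Ltd',  'Limited'); // iteration 2
name3 := STD.Str.FindReplace(name2, 'gate', 'fred');    // iteration 3: no match, unchanged

OUTPUT(name3); // 'infoTech Private Limited'
```

Note that FindReplace works on substrings and is case sensitive, so 'private' in the third record is left alone while 'gate' becomes 'fred'.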
HTH,
Richard
- rtaylor
- Community Advisory Board Member
- Posts: 1619
- Joined: Wed Oct 26, 2011 7:40 pm
Thanks a lot, Richard,
Your code helped me achieve what I was after.
Regards,
Suleman Shreef
- suleman Shreef
- Posts: 21
- Joined: Wed Feb 27, 2019 9:15 am
Suleman,
OK, I wasn't happy with the LOOP solution, because it PROJECTs through the same set of records once per replacement word, and so could be relatively inefficient on large datasets.
So, here's an alternative that I would expect to be more efficient:
IMPORT STD;

Layout := RECORD
  UNSIGNED1 cid;
  STRING Company_Name;
END;

CompRec := DATASET([{1,'infoTech Private Ltd'},
                    {2,'infoTech Pvt Ltd'},
                    {3,'gate private'}],
                   Layout);

lookRecs := DATASET([{'Pvt','Private'},{'Ltd','Limited'},{'gate','fred'}],
                    {STRING findWord,STRING replaceWord});

//DICTIONARY Solution
ReplaceDCT := DICTIONARY(lookRecs,{findWord => replaceWord});

ReplaceFunc(STRING s) := FUNCTION
  rec := {STRING w};
  DSwords := DATASET(Std.Str.SplitWords(s,' '),rec);
  P := PROJECT(DSwords,
               TRANSFORM(rec,
                         SELF.w := IF(LEFT.w IN ReplaceDCT,
                                      ReplaceDCT[LEFT.w].replaceWord,
                                      LEFT.w)));
  RETURN ROLLUP(P,TRUE,
                TRANSFORM(rec,SELF.w := TRIM(LEFT.w + ' ' + RIGHT.w,LEFT)))[1].w;
END;

ProjRecs := PROJECT(CompRec,
                    TRANSFORM(Layout,
                              SELF.Company_Name := ReplaceFunc(LEFT.Company_Name),
                              SELF := LEFT));

OUTPUT(ProjRecs,NAMED('DICTIONARY_Solution'));
This example uses a DICTIONARY for the replacement words, allowing a single pass through the CompRec dataset.
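One difference from the LOOP version is worth seeing in isolation: the DICTIONARY is keyed on whole words, so a lookup only succeeds on an exact match. The sketch below rebuilds the same ReplaceDCT; the probe words 'gateway' and 'pvt' are hypothetical inputs chosen for illustration, and the comments assume the usual case-sensitive comparison of STRING keys.

```ecl
lookRecs := DATASET([{'Pvt','Private'},{'Ltd','Limited'},{'gate','fred'}],
                    {STRING findWord,STRING replaceWord});
ReplaceDCT := DICTIONARY(lookRecs,{findWord => replaceWord});

OUTPUT('gate' IN ReplaceDCT);          // exact key, so found
OUTPUT('gateway' IN ReplaceDCT);       // not a key: whole words only, no substring match
OUTPUT('pvt' IN ReplaceDCT);           // not a key: differs from 'Pvt' in case
OUTPUT(ReplaceDCT['Ltd'].replaceWord); // 'Limited'
```

If case-insensitive matching is needed, one option would be to store the keys in a single case and normalize each word before the lookup.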
HTH,
Richard
- rtaylor
- Community Advisory Board Member
- Posts: 1619
- Joined: Wed Oct 26, 2011 7:40 pm
Hi Richard,
I have done a small project in HPCC. Could you take a look at it? If you send me your email address, I will forward my HPCC project to you.
- suleman Shreef
- Posts: 21
- Joined: Wed Feb 27, 2019 9:15 am