Identifying positions of differences in two strings.
Hi. Given two strings, e.g.
The Fox jumped over the lazy Dog.
The Dog jumped over the lazy Fox.
I would like a function that returns, say, a dataset giving the start position and length of each difference, e.g.
- Code: Select all
Start Position   Length
5                3
30               3
There should be some criterion for merging differences within words. In the example above, Fox and Dog share a common letter, but it is the whole word that differs, not just the individual characters. In other words, merge differences that fall between the same white space.
The return format does not have to be exactly as shown above. Preferably it should be suitable for the 'data Visualizations' library, so that differences can be highlighted.
There is always the EMBED option to drop into other languages and library sets, but hey, this should be doable in pure ECL?
(P.S. Case sensitivity, including punctuation, collapsing white space, that kind of thing can just be options to the FUNCTION; they don't affect the basic approach much.)
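As a rough illustration of the options mentioned in the P.S. (a sketch only, not something posted in the thread): the helper name PrepText and its parameter names are hypothetical, and it assumes Std.Str.ToUpperCase, Std.Str.CleanSpaces and REGEXREPLACE behave as described in the ECL documentation. Each string would be passed through it before being compared.

```ecl
IMPORT Std;

// PrepText is a hypothetical helper name, not part of any library.
PrepText(STRING s,
         BOOLEAN pNoCase   = FALSE,
         BOOLEAN pNoPunct  = FALSE,
         BOOLEAN pCollapse = FALSE) := FUNCTION
  // Fold to upper case for a case-insensitive comparison
  cased := IF(pNoCase, Std.Str.ToUpperCase(s), s);
  // Swap punctuation for a space so the remaining characters keep their positions
  noPunct := IF(pNoPunct, REGEXREPLACE('[[:punct:]]', cased, ' '), cased);
  // Collapsing runs of spaces shortens the string, so reported positions
  // would then refer to the collapsed text rather than the original
  RETURN IF(pCollapse, Std.Str.CleanSpaces(noPunct), noPunct);
END;

OUTPUT(PrepText('The  Fox, jumped.', TRUE, TRUE, TRUE));
```

Note that only the case and punctuation options leave character offsets untouched; if white space is collapsed, the offsets apply to the cleaned string.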
Yours
Allan
- Allan
- Posts: 444
- Joined: Sat Oct 01, 2011 7:26 pm
Allan,
Here's my quick take on it:
- Code: Select all
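IMPORT Std;

WordDiffs(STRING s1,STRING s2,BOOLEAN pNoCase=FALSE) := FUNCTION
  // Optionally fold both inputs to upper case for a case-insensitive compare
  ins1 := IF(pNoCase=FALSE,s1,Std.Str.ToUpperCase(s1));
  ins2 := IF(pNoCase=FALSE,s2,Std.Str.ToUpperCase(s2));
  // Position of the first occurrence of w in s
  FindWord(STRING w,STRING s) := Std.Str.Find(s,w,1);
  // Split each string into a set of space-delimited words
  WordSet1 := Std.Str.SplitWords(ins1,' ');
  WordSet2 := Std.Str.SplitWords(ins2,' ');
  WordRec := {UNSIGNED WordNum,STRING word,UNSIGNED StartPos,UNSIGNED WordLen};
  WordRec WordXF(INTEGER C, STRING s, SET OF STRING ws) := TRANSFORM
    SELF.WordNum := C;
    SELF.word := ws[C];
    SELF.WordLen := LENGTH(ws[C]);
    // Search for the word plus its successor to tell repeated words apart.
    // This finds the first occurrence of the pair, so a repeated pair of
    // words would still report the earlier position.
    SELF.StartPos := FindWord(TRIM(ws[C] + ' ' + ws[C+1]),s);
  END;
  // One record per word in each string
  ds1 := DATASET(COUNT(WordSet1),WordXF(COUNTER, ins1, WordSet1));
  ds2 := DATASET(COUNT(WordSet2),WordXF(COUNTER, ins2, WordSet2));
  // RETURN ds1+ds2; //just to test positions
  // Match words by number and keep those where any field differs.
  // ROWDIFF also compares StartPos and WordLen, so an identical word
  // at a shifted position is reported as well.
  j := JOIN(ds1,ds2,
            LEFT.WordNum=RIGHT.WordNum,
            TRANSFORM({UNSIGNED WordNum,STRING diff},
                      SELF.WordNum := LEFT.WordNum,
                      SELF.diff := ROWDIFF(LEFT,RIGHT)))(diff<>'');
  SetDiffs := SET(j,WordNum);
  // Return the differing words from both strings, ordered by word number
  DiffWords := ds1(WordNum IN SetDiffs) + ds2(WordNum IN SetDiffs);
  RETURN SORT(DiffWords,WordNum);
END;

t1 := 'The Fox jumped over the lazy Dog.';
t2 := 'The Dog jumped over the lazy Fox.';
t3 := 'The fox jumped over the lazy Fox.';
WordDiffs(t1,t2);
WordDiffs(t1,t3,TRUE);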
HTH,
Richard
- rtaylor
- Community Advisory Board Member
- Posts: 1619
- Joined: Wed Oct 26, 2011 7:40 pm
Richard,
This is great.
Just one minor point: if consecutive words are different, like:
- Code: Select all
The Fox jumped over the lazy Dog.
The Dog jumped over the layy Fox.
Currently, the differences detected in 'lazy' and 'Dog' come out as distinct differences, but it would be nice if they were merged into one reference.
There is enough information in the output for users of this function to do their own merge (given offset and length), but this could be done for them.
Thanks Richard, all the best
Allan
- Allan
Allan,
I would post-process the result (using WordNum) to find any contiguous differences, if necessary. That keeps this a simpler tool, useful for both cases.
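A minimal sketch of what such a post-processing step might look like (an illustration, not code from the thread). It assumes the rows for one of the two strings have already been extracted into the WordRec layout; DiffRec, SpanRec and the sample rows are hypothetical names and data, with positions taken from the first example sentence. ROLLUP is used here because the result of each merge becomes the LEFT row for the next comparison, so runs of three or more adjacent words should collapse as well.

```ecl
// DiffRec mirrors the WordRec layout returned by WordDiffs (assumed here)
DiffRec := {UNSIGNED WordNum, STRING word, UNSIGNED StartPos, UNSIGNED WordLen};

// Hypothetical sample: differing words from 'The Fox jumped over the lazy Dog.'
diffs := DATASET([{2, 'Fox',  5,  3},
                  {6, 'lazy', 25, 4},
                  {7, 'Dog.', 30, 4}], DiffRec);

// One span per run of adjacent differing words
SpanRec := {UNSIGNED FirstWord, UNSIGNED LastWord, UNSIGNED StartPos, UNSIGNED SpanLen};

// Start with every differing word as its own single-word span
spans := PROJECT(SORT(diffs, WordNum),
                 TRANSFORM(SpanRec,
                           SELF.FirstWord := LEFT.WordNum,
                           SELF.LastWord  := LEFT.WordNum,
                           SELF.StartPos  := LEFT.StartPos,
                           SELF.SpanLen   := LEFT.WordLen));

// Merge a span into its predecessor when their word numbers are adjacent;
// the merged span runs from the start of LEFT to the end of RIGHT
merged := ROLLUP(spans,
                 RIGHT.FirstWord = LEFT.LastWord + 1,
                 TRANSFORM(SpanRec,
                           SELF.LastWord := RIGHT.LastWord,
                           SELF.SpanLen  := RIGHT.StartPos + RIGHT.SpanLen - LEFT.StartPos,
                           SELF := LEFT));

OUTPUT(merged);
```

With the sample rows, words 6 and 7 are adjacent, so they should collapse into one span starting at position 25 with length 9 ('lazy Dog.'), while word 2 stays a span of its own.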

Richard
- rtaylor
Allan,
So I had a thought that both scenarios would be easier if the function returned just one record for each mismatched word, so here's the one replacement definition required to do that:
- Code: Select all
DiffWords := JOIN(ds1(WordNum IN SetDiffs),ds2(WordNum IN SetDiffs),
                  LEFT.WordNum = RIGHT.WordNum,
                  TRANSFORM({UNSIGNED WordNum,
                             {WordRec AND NOT WordNum} Lword,
                             {WordRec AND NOT WordNum} Rword},
                            SELF.Lword := LEFT,
                            SELF.Rword := RIGHT,
                            SELF := LEFT));
HTH,
Richard
- rtaylor
Thanks again, Richard.
This is going to be most useful in comparing layouts between environments and highlighting differences.
Yours
Allan
- Allan