Identifying positions of differences in two strings.

Questions around writing code and queries

Fri Oct 23, 2020 12:01 pm

Hi, given two strings, e.g.:

The Fox jumped over the lazy Dog.
The Dog jumped over the lazy Fox.
I would like a function that returns, say, a dataset giving the start position and length of each difference, e.g.:
Code:
Start Position     Length
5                  3
30                 3

There should be some criterion for merging differences within words. In the example above, Fox and Dog share a common letter, but it is the whole word that differs, not just the individual characters. In other words, differences that fall between the same pair of white-space boundaries should be merged.
The return format does not have to be exactly as shown above. Preferably it should be suitable for the 'data Visualizations' library, so that differences can be highlighted.

There is always the EMBED option to drop into other languages and library sets, but surely this should be doable in pure ECL?
(P.S. Case sensitivity, including punctuation, collapsing white space, that kind of thing, can just be options to the FUNCTION; they don't affect the basic approach much.)

Yours

Allan
Allan
 
Posts: 442
Joined: Sat Oct 01, 2011 7:26 pm

Mon Oct 26, 2020 2:06 pm

Allan,

Here's my quick take on it:
Code:
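IMPORT Std;
// Returns the words that differ, position by position, between s1 and s2
WordDiffs(STRING s1,STRING s2,BOOLEAN pNoCase=FALSE) := FUNCTION
  // Optionally fold both inputs to upper case for a case-insensitive compare
  ins1 := IF(pNoCase=FALSE,s1,Std.Str.ToUpperCase(s1));
  ins2 := IF(pNoCase=FALSE,s2,Std.Str.ToUpperCase(s2));
  // Position of the first occurrence of w in s (0 if not found)
  FindWord(STRING w,STRING s) := Std.Str.Find(s,w,1);
  // Split each string into a set of space-delimited words
  WordSet1 := Std.Str.SplitWords(ins1,' ');
  WordSet2 := Std.Str.SplitWords(ins2,' ');
  WordRec := {UNSIGNED WordNum,STRING word,UNSIGNED StartPos,UNSIGNED WordLen};
  WordRec WordXF(INTEGER C, STRING s, SET OF STRING ws) := TRANSFORM
    SELF.WordNum := C;
    SELF.word := ws[C];
    SELF.WordLen := LENGTH(ws[C]);
    // Search for the word plus the word that follows it, so a repeated
    // word (e.g. a second "the") is located at its own position
    SELF.StartPos := FindWord(TRIM(ws[C] + ' ' + ws[C+1]),s);
  END;
  // One record per word in each string
  ds1 := DATASET(COUNT(WordSet1),WordXF(COUNTER, ins1, WordSet1));
  ds2 := DATASET(COUNT(WordSet2),WordXF(COUNTER, ins2, WordSet2));
  // RETURN ds1+ds2; //just to test positions
  // Pair words by ordinal position and keep the pairs that differ.
  // ROWDIFF compares every field, so a StartPos or WordLen mismatch
  // also flags the pair, not only a difference in the word itself.
  j := JOIN(ds1,ds2,
            LEFT.WordNum=RIGHT.WordNum,
            TRANSFORM({UNSIGNED WordNum,STRING diff},
                      SELF.WordNum := LEFT.WordNum,
                      SELF.diff := ROWDIFF(LEFT,RIGHT)))(diff<>'');
  SetDiffs  := SET(j,WordNum);
  // Return the differing words from both strings, ordered by word number
  DiffWords := ds1(WordNum IN SetDiffs) + ds2(WordNum IN SetDiffs);
  RETURN SORT(DiffWords,WordNum);
END;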

t1 := 'The Fox jumped over the lazy Dog.';
t2 := 'The Dog jumped over the lazy Fox.';
t3 := 'The fox jumped over the lazy Fox.';

WordDiffs(t1,t2);
WordDiffs(t1,t3,TRUE);
I solved the "possible duplicate words" issue by looking for the position of the word together with its following word (see the case-insensitive test, where "THE" appears twice). Let me know if you see any issues I missed.
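
To illustrate the duplicate-word point, here is a minimal sketch (not part of the function above) comparing a search for the bare word with a search for the word plus its follower. The test string is the upper-cased form of t3; the variable name s is only for illustration.

```ecl
IMPORT Std;

// Upper-cased form of t3, in which 'THE' occurs twice
s := 'THE FOX JUMPED OVER THE LAZY FOX.';

// Searching for the bare word always finds the first occurrence
OUTPUT(Std.Str.Find(s, 'THE', 1));       // 1

// Appending the following word pins down the second occurrence
OUTPUT(Std.Str.Find(s, 'THE LAZY', 1));  // 21
```

The word-pair search still assumes the pair itself is unique within the string and that words are separated by exactly one space.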

HTH,

Richard
rtaylor
Community Advisory Board Member
 
Posts: 1606
Joined: Wed Oct 26, 2011 7:40 pm

Mon Oct 26, 2020 3:55 pm

Richard,

This is great.

Just one minor point: if consecutive words are different, like:
Code:
The Fox jumped over the lazy Dog.
The Dog jumped over the layy Fox.

Currently, the differences detected in 'lazy' and 'Dog' come out as distinct differences, but it would be nice if they were merged into one reference.
There is enough information in the output for the user of this function to do their own merge (given offset and length), but this could be done for them.

Thanks Richard, all the best

Allan

Mon Oct 26, 2020 4:01 pm

Allan,

I would post-process the result (using WordNum) to find any contiguous differences, if necessary. That keeps this a simpler tool, useful for both cases. :)
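
A minimal sketch of that kind of post-processing, assuming the differing words of one string are already available in the WordRec layout. The sample rows, the SpanRec layout and the names diffs, spans and merged are hypothetical, chosen only for illustration. ROLLUP folds each record into the previous one when its word number follows on directly.

```ecl
WordRec := {UNSIGNED WordNum, STRING word, UNSIGNED StartPos, UNSIGNED WordLen};

// Hypothetical sample: differing words from 'The Fox jumped over the lazy Dog.'
diffs := DATASET([{2, 'Fox',  5,  3},
                  {6, 'lazy', 25, 4},
                  {7, 'Dog.', 30, 4}], WordRec);

// A span covers one or more consecutive differing words
SpanRec := {UNSIGNED FirstWord, UNSIGNED LastWord, UNSIGNED StartPos, UNSIGNED SpanLen};

// Start with one single-word span per difference, in word order
spans := PROJECT(SORT(diffs, WordNum),
                 TRANSFORM(SpanRec,
                           SELF.FirstWord := LEFT.WordNum,
                           SELF.LastWord  := LEFT.WordNum,
                           SELF.StartPos  := LEFT.StartPos,
                           SELF.SpanLen   := LEFT.WordLen));

// Merge a span into the previous one when its word number is adjacent
merged := ROLLUP(spans,
                 RIGHT.FirstWord = LEFT.LastWord + 1,
                 TRANSFORM(SpanRec,
                           SELF.FirstWord := LEFT.FirstWord,
                           SELF.LastWord  := RIGHT.LastWord,
                           SELF.StartPos  := LEFT.StartPos,
                           SELF.SpanLen   := RIGHT.StartPos + RIGHT.SpanLen - LEFT.StartPos));

OUTPUT(merged); // expected spans: words 2-2 at 5 length 3, words 6-7 at 25 length 9
```

The merged length is taken from the start of the first word to the end of the last, so the white space between adjacent words is included in the span.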

Richard

Mon Oct 26, 2020 6:38 pm

Allan,

So I had a thought that both scenarios would be easier if the function returned just one record for each mismatched word, so here's the one replacement definition required to do that:
Code:
  DiffWords := JOIN(ds1(WordNum IN SetDiffs),ds2(WordNum IN SetDiffs),
                    LEFT.WordNum = RIGHT.WordNum, 
                    TRANSFORM({UNSIGNED WordNum,
                               {WordRec AND NOT WordNum} Lword,
                               {WordRec AND NOT WordNum} Rword},
                              SELF.Lword := LEFT,               
                              SELF.Rword := RIGHT,               
                              SELF := LEFT));               
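
With one record per mismatched word, getting back to the start-position-and-length layout from the original question becomes a simple PROJECT. The following is only a sketch under assumptions: the inline sample rows stand in for ds1 and ds2 filtered by SetDiffs, and the names lhs, rhs, pairs, PosRec and DiffLen are hypothetical.

```ecl
WordRec := {UNSIGNED WordNum, STRING word, UNSIGNED StartPos, UNSIGNED WordLen};

// Hypothetical stand-ins for ds1(WordNum IN SetDiffs) and ds2(WordNum IN SetDiffs)
lhs := DATASET([{2, 'Fox', 5, 3}, {7, 'Dog.', 30, 4}], WordRec);
rhs := DATASET([{2, 'Dog', 5, 3}, {7, 'Fox.', 30, 4}], WordRec);

// Same shape as the replacement DiffWords definition above
pairs := JOIN(lhs, rhs,
              LEFT.WordNum = RIGHT.WordNum,
              TRANSFORM({UNSIGNED WordNum,
                         {WordRec AND NOT WordNum} Lword,
                         {WordRec AND NOT WordNum} Rword},
                        SELF.Lword := LEFT,
                        SELF.Rword := RIGHT,
                        SELF := LEFT));

// Keep only the position and length of each difference in the first string
PosRec := {UNSIGNED StartPosition, UNSIGNED DiffLen};
positions := PROJECT(pairs,
                     TRANSFORM(PosRec,
                               SELF.StartPosition := LEFT.Lword.StartPos,
                               SELF.DiffLen       := LEFT.Lword.WordLen));

OUTPUT(positions);
```

Swapping Lword for Rword gives the positions in the second string instead.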

HTH,

Richard

Mon Oct 26, 2020 7:17 pm

Thanks again, Richard.
This is going to be most useful in comparing layouts between environments and highlighting differences.

Yours

Allan