Mixing single(FIRST) and multiple Pattern matching (PARSE)

Questions around writing code and queries

Tue May 21, 2019 4:12 pm

Hi,

Bit of an esoteric question here, and there is an obvious workaround, but the exercise is to improve my understanding of pattern matching with PARSE (the non-Tomita variant).

I have a simple CSV structure, with field separator ',', where for most fields I just want the FIRST match. For dob, however, I want to match it two ways:
as a single string 'DD-MM-YYYY', and also as three UNSIGNEDs: DD, MM and YYYY.
Code:
rec := {STRING line};
datafile := DATASET([
{'1234567,Allan Wrobel,24-03-1958,6 Barkham Rd Woking Surry RG41 4DA.'}
,{'8976   ,Anna White,20-01-1961,55 Walton Rd Cambs PO87 4RT.'}
,{'45432  ,Nina Brown,28-04-1974,27 Alma Dr Chesham Bucks AM12 2WA.'}
],rec);

PATTERN content := ANY*;
PATTERN id      := FIRST content;
PATTERN name    := content;
PATTERN dob     := content;
PATTERN address := content;
PATTERN day     := PATTERN('[0-9]{2}');
PATTERN month   := day;
PATTERN year    := PATTERN('[0-9]{4}');
PATTERN exprA   := id ',' name ',' dob ',' address;
PATTERN exprB   := id ',' name ',' day '-' month '-' year ',' address;

PATTERN expr    := exprA OR exprB;

RDate := RECORD
    UNSIGNED1 day;
    UNSIGNED1 month;
    UNSIGNED2 year;
END;

results := RECORD
    UNSIGNED Id1      := (UNSIGNED) MATCHTEXT(id);
    STRING   name     := MATCHTEXT(name);
    STRING   dob      := MATCHTEXT(dob);
    RDate    Date     := ROW({MATCHTEXT(day),MATCHTEXT(month),MATCHTEXT(year)},RDate);
    STRING   address  := MATCHTEXT(address);
END;

PARSE(datafile,line,expr,results,FIRST);

Because of the OR in 'expr', if I use FIRST the parser just takes pattern 'dob'.
If I use ALL I get all the matches, 'dob' and 'day'..., but multiple times, because the other fields also match multiple times.
I started with the simpler:
Code:
PATTERN expr    := id ',' name ',' (dob OR day '-' month '-' year) ',' address;

To no avail; the same behaviour was observed.

So my question:
How does one match multiple times on a component of the entire pattern, when for other components, you only want to match once?

Yours

Allan
Allan
 

Wed May 22, 2019 2:50 pm

Allan,

The simple answer in this case is to change your RECORD structure, like this:
Code:
results := RECORD
  UNSIGNED Id1  := (UNSIGNED) MATCHTEXT(id);
  STRING   name := MATCHTEXT(name);
  STRING   dob  := MATCHTEXT(dob);
  // RDate    Date := ROW({MATCHTEXT(day),MATCHTEXT(month),MATCHTEXT(year)},RDate);
  RDate    Date := ROW({MATCHTEXT(dob)[1..2],
                        MATCHTEXT(dob)[4..5],
                        MATCHTEXT(dob)[7..]},RDate);
  STRING   address := MATCHTEXT(address);
END;
That will allow you to handle the same matching text the two ways you want.

The more generic answer would be to use a TRANSFORM instead of a RECORD structure to define your PARSE result. TRANSFORM provides much more flexibility in dealing with each and every bit of data extracted by PARSE.
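
For illustration, a TRANSFORM-based version of the original PARSE might look like the sketch below. It assumes Allan's original field patterns and RDate layout; the names res, ParseLine and dobText are made up for this example, and it is an untested sketch rather than a definitive implementation.

```ecl
rec := {STRING line};
datafile := DATASET([
   {'1234567,Allan Wrobel,24-03-1958,6 Barkham Rd Woking Surry RG41 4DA.'}
  ,{'8976   ,Anna White,20-01-1961,55 Walton Rd Cambs PO87 4RT.'}
  ,{'45432  ,Nina Brown,28-04-1974,27 Alma Dr Chesham Bucks AM12 2WA.'}
                    ],rec);

// Field patterns taken from the original question
PATTERN content := ANY*;
PATTERN id      := FIRST content;
PATTERN name    := content;
PATTERN dob     := content;
PATTERN address := content;
PATTERN expr    := id ',' name ',' dob ',' address;

RDate := RECORD
  UNSIGNED1 day;
  UNSIGNED1 month;
  UNSIGNED2 year;
END;

// Hypothetical result layout: plain fields, no default values
res := RECORD
  UNSIGNED Id1;
  STRING   name;
  STRING   dob;
  RDate    Date;
  STRING   address;
END;

// Hypothetical TRANSFORM: the date text is fetched once, then used
// both as a string and as three numbers
res ParseLine(rec L) := TRANSFORM
  dobText      := MATCHTEXT(dob);   // e.g. '24-03-1958'
  SELF.Id1     := (UNSIGNED)MATCHTEXT(id);
  SELF.name    := MATCHTEXT(name);
  SELF.dob     := dobText;
  SELF.Date    := ROW({(UNSIGNED1)dobText[1..2],
                       (UNSIGNED1)dobText[4..5],
                       (UNSIGNED2)dobText[7..]},RDate);
  SELF.address := MATCHTEXT(address);
END;

PARSE(datafile,line,expr,ParseLine(LEFT),FIRST);
```

The local definition dobText is the kind of thing a RECORD structure cannot hold, which is where the extra flexibility comes from.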

FWIW, here is the way I would have written this PARSE:
Code:
rec := {STRING line};
datafile := DATASET([
   {'1234567,Allan Wrobel,24-03-1958,6 Barkham Rd Woking Surry RG41 4DA.'}
  ,{'8976   ,Anna White,20-01-1961,55 Walton Rd Cambs PO87 4RT.'}
  ,{'45432  ,Nina Brown,28-04-1974,27 Alma Dr Chesham Bucks AM12 2WA.'}
                    ],rec);

PATTERN num     := PATTERN('[0-9]');
PATTERN num2    := REPEAT(num,2);
PATTERN num4    := REPEAT(num,4);
PATTERN txt     := ANY*;
PATTERN sep     := OPT(' '+) ',';

PATTERN id      := num+;
PATTERN dob     := num2 '-' num2 '-' num4 ;
PATTERN expr    := id sep txt sep dob sep txt;

RDate := RECORD
  UNSIGNED1 day;
  UNSIGNED1 month;
  UNSIGNED2 year;
END;

results := RECORD
  UNSIGNED Id1      := (UNSIGNED) MATCHTEXT(id);
  STRING   name     := MATCHTEXT(txt[1]);
  STRING   dob      := MATCHTEXT(dob);
  RDate    Date     := ROW({MATCHTEXT(dob/num2[1]),
                            MATCHTEXT(dob/num2[2]),
                            MATCHTEXT(dob/num4)},RDate);
  STRING   address  := MATCHTEXT(txt[2]);
END;

PARSE(datafile,line,expr,results,BEST);
Fewer and more generic PATTERNs make the code a bit simpler to my mind.

HTH,

Richard
rtaylor
Community Advisory Board Member

Wed May 22, 2019 3:54 pm

Thanks very much, Richard,

I'll go away and inwardly digest your examples.

Yours
Allan
Allan
 

Wed May 22, 2019 3:59 pm

Richard,

In your 2nd example, you did not need to anchor 'id' to the start of line using FIRST.
Can leaving this out cause problems?
(I usually like to anchor my patterns somewhere so I don't get spurious matches from unexpected places.)

I like the way your example makes use of instances of a match, e.g. num2[1] and num2[2].

Yours
Allan
Allan
 

Wed May 22, 2019 4:37 pm

Allan,

"In your 2nd example, you did not need to anchor 'id' to the start of line using FIRST.
Can leaving this out cause problems?"

Frankly, I've never used FIRST in any PATTERN definition, so I expect the answer would be, "No, unless it's truly necessary."

The fact that my parse pattern "expr" is explicitly looking for "id sep txt sep dob sep txt;" (an id followed by a sep followed by a txt followed by a sep followed by a dob followed by a sep followed by a txt) eliminates the need for FIRST.

I think FIRST is only useful if the pattern you're looking for could also be found somewhere in the middle of your search string and you really only want a match that starts at the beginning of that string. That can't happen in this case, because the search pattern maps the entire search string.

HTH,

Richard
rtaylor

Wed May 22, 2019 5:17 pm

Allan,

Just to "throw a spanner into the works" here's a totally different approach:
Code:
IMPORT Std;

rec := {STRING line};
datafile := DATASET([
   {'1234567,Allan Wrobel,24-03-1958,6 Barkham Rd Woking Surry RG41 4DA.'}
  ,{'8976   ,Anna White,20-01-1961,55 Walton Rd Cambs PO87 4RT.'}
  ,{'45432  ,Nina Brown,28-04-1974,27 Alma Dr Chesham Bucks AM12 2WA.'}
                    ],rec);

// Date layout, as defined in the earlier examples
RDate := RECORD
  UNSIGNED1 day;
  UNSIGNED1 month;
  UNSIGNED2 year;
END;

res := RECORD
  UNSIGNED Id1 ;
  STRING   name;
  STRING   dob;
  RDate    Date;
  STRING   address;
END;

PROJECT(datafile,
        TRANSFORM(res,
                  // Split the line into its four comma-separated fields
                  SetVals   := Std.Str.SplitWords(LEFT.line,',');
                  SELF.id1  := (UNSIGNED)SetVals[1];
                  SELF.name := SetVals[2];
                  SELF.dob  := SetVals[3];
                  SELF.address := SetVals[4];
                  // Split the date string into day, month and year
                  SetDate   := Std.Str.SplitWords(SELF.dob,'-');
                  SELF.Date := ROW({SetDate[1],SetDate[2],SetDate[3]},RDate);
                 ));

No need to use PARSE at all! :)

HTH,

Richard
rtaylor

