PARSE Pattern Value Types

There are three value types specifically designed and required to define parsing pattern attributes:

PATTERN patternid := parsepattern;

patternid	The attribute name of the pattern.
parsepattern	The pattern, very similar to regular expressions. This may contain other previously defined PATTERN attributes. See ParsePattern Definitions below.

The PATTERN value type defines a parsing expression very similar to regular expression patterns.

TOKEN tokenid := parsepattern;

tokenid	The attribute name of the token.
parsepattern	The token pattern, very similar to regular expressions. This may contain PATTERN attributes but no TOKEN or RULE attributes. See ParsePattern Definitions below.

The TOKEN value type defines a parsing expression very similar to a PATTERN, but once matched, the parser doesn't backtrack to find alternative matches as it would with PATTERN.
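
As a minimal sketch of the distinction described above, the same expression can be declared either as a PATTERN or as a TOKEN. The names rec, ds, digit, numPat, numTok, findNum, and outRec are hypothetical and chosen only for illustration, as is the sample input line.

```ecl
// Hypothetical input record and data, for illustration only
rec := RECORD
  STRING40 line;
END;
ds := DATASET([{'order 12345 shipped'}], rec);

PATTERN digit  := PATTERN('[0-9]');
PATTERN numPat := digit+;  // PATTERN: the parser may backtrack to find alternative matches
TOKEN   numTok := digit+;  // TOKEN: same expression, but no backtracking once matched

// A RULE may combine TOKENs; here it simply wraps the token
RULE findNum := numTok;

outRec := RECORD
  found := MATCHTEXT(numTok);
END;

OUTPUT(PARSE(ds, line, findNum, outRec));
```

Declaring a run of digits as a TOKEN is typically a reasonable choice when a partial match of the run is never wanted, since it stops the parser from revisiting shorter alternatives.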

RULE [ ( recstruct ) ] ruleid := rulePattern;

recstruct	Optional. The attribute name of a RECORD structure attribute (valid only when the PARSE option is used on the PARSE function).
ruleid	The attribute name of the rule.
rulePattern	The rule pattern, very similar to regular expressions. This may contain PATTERN attributes, TOKEN attributes, or RULE attributes. See ParsePattern Definitions below.

The RULE value type defines a parsing expression containing combinations of TOKENs. If a RULE definition contains a PATTERN, it is implicitly converted to a TOKEN. As with PATTERN, the parser backtracks after a match to find alternative RULE matches.

If the PARSE option is present on the PARSE function (thereby implementing Tomita parsing for the operation), each alternative RULE rulePattern may have an associated TRANSFORM function. The different input patterns can be referred to using $1, $2, etc. If the pattern has an associated recstruct then $1 is a row; otherwise it is a string. Default TRANSFORM functions are created in two circumstances:

1. If there are no patterns, the default transform clears the row. For example:
RULE(myRecord) := ; //empty expression = cleared row
2. If there is only a single pattern with an associated record, and that record matches the type of the rule being defined. For example:
RULE(myRecord) e0 := '(' USE(myRecord, 'expression') ')';

ParsePattern Definitions

A parsepattern may contain any combination of the following elements:

pattern-name	The name of any previously defined PATTERN attribute.
(pattern)	Parentheses may be used for grouping.
pattern1 pattern2	Pattern1 followed by pattern2.
'string'	A fixed text string, which may contain escaped octal string control characters (for example, CtrlZ is '\032').
FIRST	Matches the start of the string to search. This is similar to the regular expression ^ token, which is not supported.
LAST	Matches the end of the string to search. This is similar to the regular expression $ token, which is not supported.
ANY	Matches any character.
REPEAT(pattern)	Repeat the pattern any number of times. The regular expression syntax pattern* is supported as a shorthand for REPEAT(pattern).
REPEAT(pattern, expression)	Repeat the pattern expression times. The regular expression syntax pattern*<count> is supported as a shorthand for REPEAT(pattern,expression), but the regular expression bounded repeats syntax pattern{expression} is not.
REPEAT(pattern, low, ANY [,MIN])	Repeat the pattern low or more times (with the MIN option making it a minimal match). The regular expression syntax pattern+ is supported as a shorthand for REPEAT(pattern,low,ANY), but the regular expression bounded repeats syntax pattern{expression,} is not.
REPEAT(pattern, low, high)	Repeat the pattern from low to high times. The regular expression bounded repeats syntax pattern{low,high} is not supported.
OPT(pattern)	An optional pattern. The regular expression syntax pattern? is supported as a shorthand for OPT(pattern).
pattern1 OR pattern2	Either pattern1 or pattern2. The regular expression syntax pattern1 | pattern2 is supported as a shorthand for OR.
[list-of-patterns]	A comma-delimited list of alternative patterns, useful for string sets. This is the same as OR.
pattern1 [NOT] IN pattern2	Does the text matched with pattern1 also match pattern2? Pattern1 [NOT] = pattern2 and pattern1 != pattern2 are the same as using IN, but may make more sense in some situations.
pattern1 [NOT] BEFORE pattern2	Check if the given pattern2 does [not] follow pattern1. Pattern2 is not consumed from the input.
pattern1 [NOT] AFTER pattern2	Check if the given pattern2 does [not] precede pattern1. Pattern2 does not consume any input. It must also be a fixed length.
pattern LENGTH(range)	Check whether the length of a pattern is in the range. Range can have the form <value>, <min>..<max>, <min>.., or ..<max>. So "digit*3 NOT BEFORE digit" could be represented as "digit* LENGTH(3)". This is more efficient, and digit* can be defined as a token. "digit* LENGTH(4..6)" matches 4-, 5-, and 6-digit sequences.
VALIDATE(pattern, isValidExpression)	Evaluate isValidExpression to check whether the pattern is valid. isValidExpression should use MATCHTEXT or MATCHUNICODE to refer to the text that matched the pattern. For example, VALIDATE(alpha*, MATCHTEXT[4]='Q') is equivalent to alpha* = ANY*3 'Q' ANY*, or more usefully: VALIDATE(alpha*, isSurnameService(MATCHTEXT));
VALIDATE(pattern, isValidAsciiExpression, isValidUnicodeExpression)	A variant that takes two validation expressions. isValidAsciiExpression is used if the string being searched is ASCII; isValidUnicodeExpression is used if it is Unicode.
NOCASE(pattern)	Matches the pattern case insensitively, overriding the CASE option on the PARSE function. This may be nested within a CASE pattern.
CASE(pattern)	Matches the pattern case sensitively, overriding the NOCASE option on the PARSE function. This may be nested within a NOCASE pattern.
pattern PENALTY(cost)	Associate a penalty cost with this match of the pattern. This can be used to recover from grammars with unknown words. This requires use of the BEST option on the PARSE operation.
TOKEN(pattern)	Treat the pattern as a token.
PATTERN('regular expression')	Define a pattern using a regular expression built from the following supported syntax elements:
	(x)	Grouping (not used for matching)
	x|y	Alternatives x or y
	xy	Concatenation of x and y
	x* x*?	Zero or more. Greedy and minimal versions.
	x+ x+?	One or more. Greedy and minimal versions.
	x? x??	Zero or one. Greedy and minimal versions.
	x{m} x{m,} x{m,n}	Bounded repeats, also minimal versions
	[0-9abcdef]	A set of characters (may use ^ for exclusion list)
	(?=...) (?!...)	Look ahead assertion
	(?<=...) (?<!...)	Look behind assertion
	Escape sequences can be used to define UNICODE character ranges. The encoding is UTF-16 Big Endian. For example:
	PATTERN AnyChar := PATTERN(U'[\u0001-\u7fff]');
	The following character class expressions are supported (inside sets): [:alnum:] [:cntrl:] [:lower:] [:upper:] [:space:] [:alpha:] [:digit:] [:print:] [:blank:] [:graph:] [:punct:] [:xdigit:]
	Regular expressions do not support:
	^ $	Marking the beginning/end of the string
	[.ch.]	Collating symbols
	[=e=]	Equivalence class
USE( [ recstruct , ] 'symbolname' )	Specifies using a pattern defined later with the DEFINE( 'symbolname') function. This creates a forward reference, practical only in RULE patterns for Tomita parsing (the PARSE option is present on the PARSE function).
SELF	References the pattern being defined (recursive). This is practical only in RULE patterns for Tomita parsing (the PARSE option is present on the PARSE function).

Examples:

rs := RECORD
STRING100 line;
END;
ds := DATASET([{'the fox; and the hen'}], rs);

PATTERN ws := PATTERN('[ \t\r\n]');
PATTERN Alpha := PATTERN('[A-Za-z]');
PATTERN Word := Alpha+;
PATTERN Article := ['the', 'A'];
PATTERN JustAWord := Word PENALTY(1);
PATTERN notHen := VALIDATE(Word, MATCHTEXT != 'hen');
PATTERN NoHenWord := notHen PENALTY(1);
RULE NounPhraseComponent1 := JustAWord | Article ws Word;
RULE NounPhraseComponent2 := NoHenWord | Article ws Word;
ps1 := RECORD
          out1 := MATCHTEXT(NounPhraseComponent1);
END;

ps2 := RECORD
          out2 := MATCHTEXT(NounPhraseComponent2);
END;

p1 := PARSE(ds, line, NounPhraseComponent1, ps1, BEST, MANY, NOCASE);
p2 := PARSE(ds, line, NounPhraseComponent2, ps2, BEST, MANY, NOCASE);
OUTPUT(p1);
OUTPUT(p2);
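
A further sketch, combining the REPEAT, OPT, and LENGTH elements listed under ParsePattern Definitions. The names phoneRec, phoneDs, prefix, suffix, phone, and phoneOut, along with the sample text, are assumptions made for illustration and do not come from the reference itself.

```ecl
// Hypothetical input record and data, for illustration only
phoneRec := RECORD
  STRING60 line;
END;
phoneDs := DATASET([{'call 555-1234 or 5551234 today'}], phoneRec);

PATTERN digit  := PATTERN('[0-9]');
PATTERN prefix := REPEAT(digit, 3);        // the pattern repeated three times
PATTERN suffix := digit* LENGTH(4);        // a run of digits whose length is 4
PATTERN phone  := prefix OPT('-') suffix;  // the hyphen between the parts is optional

phoneOut := RECORD
  found := MATCHTEXT(phone);
END;

OUTPUT(PARSE(phoneDs, line, phone, phoneOut));
```

The LENGTH form is used for the suffix because, as noted in the table, it lets the repeated element be written as digit*, which could also be defined as a token.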