PARSE(dataset, data, pattern, result, flags [, MAXLENGTH( length ) ])

PARSE(dataset, data, result, XML( path ) [, UNORDERED | ORDERED( bool ) ] [, STABLE | UNSTABLE ] [, PARALLEL [ ( numthreads ) ] ] [, ALGORITHM( name ) ] )

dataset     The set of records to process.
data        An expression specifying the text to parse, typically the name of a field in the dataset.
pattern     The parsing pattern to match.
result      The name of either the RECORD structure attribute that specifies the format of the output record set (like the TABLE function), or the TRANSFORM function that produces the output record set (like PROJECT).
flags       One or more parsing options, listed below.
MAXLENGTH   Specifies the maximum length the pattern can match. If omitted, the default length is 4096.
length      An integer constant specifying the maximum number of matching characters.
XML         Specifies the dataset contains XML data.
path        A string constant containing the XPATH to the tag that delimits the XML data in the dataset.
UNORDERED   Optional. Specifies the output record order is not significant.
ORDERED     Specifies the significance of the output record order.
bool        When False, specifies the output record order is not significant. When True, specifies the default output record order.
STABLE      Optional. Specifies the input record order is significant.
UNSTABLE    Optional. Specifies the input record order is not significant.
PARALLEL    Optional. Try to evaluate this activity in parallel.
numthreads  Optional. Try to evaluate this activity using numthreads threads.
ALGORITHM   Optional. Override the algorithm used for this activity.
name        The algorithm to use for this activity. Must be from the list of supported algorithms for the SORT function's STABLE and UNSTABLE options.
Return:     PARSE returns a record set.

The PARSE function performs a text or XML parsing operation.
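
As a minimal sketch of the second (XML) form, the following assumes a small inline dataset in which each record holds one <person> element. The names lineRec, xmlIn, personRec, and getPerson, and the XML content itself, are hypothetical and chosen only for illustration. The transform reads values with XMLTEXT, using XPATHs relative to the tag named in XML( path ).

```ecl
// Hypothetical input: one XML fragment per record
lineRec := {STRING line};
xmlIn := DATASET([
    {'<person id="1"><fullname>Ann</fullname></person>'},
    {'<person id="2"><fullname>Bob</fullname></person>'}], lineRec);

personRec := RECORD
  STRING id;
  STRING fullName;
END;

// Intended to be called once for each <person> element located by the XPATH
personRec getPerson(lineRec L) := TRANSFORM
  SELF.id       := XMLTEXT('@id');       // attribute of the delimiting tag
  SELF.fullName := XMLTEXT('fullname');  // text of a child tag
END;

people := PARSE(xmlIn, line, getPerson(LEFT), XML('/person'));
OUTPUT(people);
```

Note that this form takes no pattern parameter: the XPATH in XML( path ) plays the role of selecting what each output record is built from.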

PARSE Text Data

The first form operates on the dataset, finding records whose data contains a match for the pattern, producing a result set of those matches in the result format. If the pattern finds multiple matches in the data, then a result record is generated for each match. Each match for a PARSE is effectively a single path through the pattern. If more than one path matches, the result transform is called once for each path, unless the BEST option is used, in which case only the path with the lowest penalty is selected.

If the result names a RECORD structure, then this form of PARSE operates like the TABLE function to generate the result set, but may also operate on variable length text. If the result names a TRANSFORM function, then the transform generates the result set. The TRANSFORM function must take at least one parameter: a LEFT record of the same format as the dataset. The format of the resulting record set does not need to be the same as the input.
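
When the result names a TRANSFORM function, a call might look like the sketch below. The dataset, the word token, and the names inRec, outRec, and getWord are assumptions made up for this illustration. The transform receives the input record as LEFT and may combine its fields with match information such as MATCHTEXT and MATCHPOSITION, so the output format differs from the input format.

```ecl
// Hypothetical input: one short line of text per record
inRec := {STRING line};
ds := DATASET([{'alpha beta'}, {'gamma'}], inRec);

TOKEN word := PATTERN('[a-z]+');

outRec := RECORD
  STRING    source;  // copied from the input record
  STRING20  found;   // text matched by the word token
  UNSIGNED4 pos;     // position of the match within the line
END;

// Takes the LEFT record, which has the same format as the input dataset
outRec getWord(inRec L) := TRANSFORM
  SELF.source := L.line;
  SELF.found  := MATCHTEXT(word);
  SELF.pos    := MATCHPOSITION(word);
END;

words := PARSE(ds, line, word, getWord(LEFT), SCAN, FIRST);
OUTPUT(words);
```

The SCAN and FIRST flags here are one possible choice, intended to produce a row for each successive word in a line.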

Flags can have the following values:

FIRST                          Only return a row for the first match starting at a particular position.
ALL                            Return a row for every possible match of the string at a particular position.
WHOLE                          Only match the whole string.
NOSCAN                         If a position matches, don't continue searching for other matches.
SCAN                           If a position matches, continue searching from the end of the match, otherwise continue from the next position.
SCAN ALL                       Return matches for every possible start position. Use the TRIM function to eliminate parsing extraneous trailing blanks.
NOCASE                         Perform a case insensitive comparison.
CASE                           Perform a case sensitive comparison (this is the default).
SKIP(separator-pattern)        Specify a pattern that can be inserted after each token in a search pattern. For example, SKIP([' ','\t']*) skips spaces and tabs between tokens.
KEEP(max)                      Only keep the first max matches.
ATMOST(max)                    Don't produce any matches if there are more than max matches.
MAX                            Return a row for the result that matches the longest sequence of the input. Only one match is returned unless the MANY option is also specified.
MIN                            Return a row for the result that matches the shortest sequence of the input. Only one match is returned unless the MANY option is also specified.
MATCHED( [ rule-reference ] )  Used when rule-reference is used in a user-matching function. If a rule-reference is not specified, the matching information may not be preserved.
MATCHED(ALL)                   Retain all rule-names -- if they are used by user match functions.
NOT MATCHED                    Generate a row if there were no matches on the input row. All calls to the MATCHED() function return false inside the result structure.
NOT MATCHED ONLY               Only generate a row if no matches were found.
BEST                           Pick the match with the highest score (lowest penalty). If the MAX or MIN flags are also present, they are applied first. Only one match is returned unless the MANY option is also specified.
MANY                           Return multiple matches for BEST, MAX, or MIN options.
PARSE                          Implements Tomita parsing instead of regular expression parsing technology.
USE([ struct, ] x)             Specifies using a RULE pattern attribute defined further on in the code with the DEFINE(x) function, introducing a recursive grammar (the only recursion allowed in ECL). If the optional struct RECORD structure is specified, USE specifies using a RULE pattern attribute defined further on in the code with the DEFINE(x) function that produces a row result in the struct RECORD structure format (valid only with the PARSE option also present). USE is required on PARSE when any patterns cannot be found by walking the rules from the root down without following any USEs.
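
Several flags can be combined in one call. The sketch below is an illustration only: the dataset, the num token, and the names inRec and outRec are hypothetical. It pairs NOSCAN and FIRST, to stop after the first match in each line, with NOT MATCHED, so that input rows containing no match should still produce an output row in which MATCHED() returns false.

```ecl
// Hypothetical input: some lines contain a number, some do not
inRec := {STRING line};
ds := DATASET([{'Order 1234 shipped'}, {'no number here'}], inRec);

TOKEN num := PATTERN('[0-9]+');

outRec := RECORD
  BOOLEAN  found  := MATCHED(num);    // false on rows generated by NOT MATCHED
  STRING10 digits := MATCHTEXT(num);  // text matched by num, if any
END;

res := PARSE(ds, line, num, outRec, NOSCAN, FIRST, NOT MATCHED);
OUTPUT(res);
```

Replacing NOT MATCHED with NOT MATCHED ONLY would, per the table above, restrict the output to rows where no match was found.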


// Input: one line of free text per record
rec := {STRING10000 line};
datafile := DATASET([
    {'Ge 34:2 And when Shechem the son of Hamor the Hivite, prince of the country, saw her,'+
     ' he took her, and lay with her, and defiled her.'},
    {'Ge 36:10 These are the names of Esaus sons; Eliphaz the son of Adah the wife of Esau,'+
     ' Reuel the son of Bashemath the wife of Esau.'}],rec);

// Separators: one or two spaces, tabs, or commas
PATTERN ws1 := [' ','\t',','];
PATTERN ws := ws1 ws1?;
PATTERN patStart := FIRST | ws;
PATTERN patEnd := LAST | ws;
PATTERN article := ['A','The','Thou','a','the','thou'];

TOKEN patWord := PATTERN('[a-zA-Z]+');
TOKEN Name := PATTERN('[A-Z][a-zA-Z]+');  // a capitalized word

// A name, optionally qualified by a second name, e.g. "Hamor the Hivite"
RULE Namet := name OPT(ws ['the','king of','prince of'] ws name);
PATTERN produced := OPT(article ws) ['begat','father of','mother of'];
PATTERN produced_by := OPT(article ws) ['son of','daughter of'];
PATTERN produces_with := OPT(article ws) ['wife of'];

// Two names joined by a relationship phrase
RULE relationtype := ( produced | produced_by | produces_with);
RULE progeny := namet ws relationtype ws namet;

// One output row per match: left name, right name, relationship phrase
results := RECORD
  STRING60 Le := MATCHTEXT(Namet[1]);
  STRING60 Ri := MATCHTEXT(Namet[2]);
  STRING30 RelationPhrase := MATCHTEXT(relationtype);
END;
outfile1 := PARSE(datafile,line,progeny,results,SCAN ALL);