Saturday, June 10, 2017

Parsing Expression Grammars

Parsing Expressions Grammar

PEG are a (no longer so) new approach to parsing.

Clibutl offers a set of functions to write parsers in C without having to use an external "parser generator".

Rather than a dry list of functions, I'll present here a short tutorial on how to build a simple parser.
The approach to PEG used by Clibutl is based on the article:
    R. R. Redziejowski. Parsing Expression Grammar as a Primitive
    Recursive-Descent Parser with Backtracking.
    Fundamenta Informaticae 79, 3-4 (2007), pp. 513 - 524. 

Define the grammar

First of all, let's define a grammar. I'll use the classical example of numerical expressions. A full example can be found here: . Let's start simple, just additions and subtractions:

   expr  ::= term (op term)*
   term  ::= num
   op    ::= '+' | '-'
   num   ::= digit+
   digit ::= '0' | '1' | ... | '9' 

Each non-terminal is represented by a pegrule. This is how term looks like:

   pegrule(expr) {
     pegstar {

The pegref() function allows referencing other non-terminals, the pegstar instruction represent unbounded optional repetitions. It is important to notice that the code above is valid C code. Of course after having included utl.h!
Let's write down the entire grammar:
pegrule(expr) {
  pegstar {

pegrule(term) {

pegrule(op) {
  pegchoice {
    pegeither { pegstring("+"); }
    pegor     { pegstring("-"); }

pegrule(num) {
  pegplus {

pegrule(digit) {
As you can see, this is an almost one-to-one translation of the original grammar. I hope this feature will make peg functions easier to use.
The rule for op shows how to express the PEG ordered choice. The pegchoice instruction reminds the switch statment where the either/or branches are similar to the the case clauses.
The pegplus is similar to pegstar but requires at least one match. You can guess there's also a pegopt instruction (not shown in this example) for optional match.
The pegstring() and pegoneof() functions are atomic recognizers that, respectively, match that specified string or one of the characters in the string.

Do the parse

With those simple instructions we have now defined a recognizing parser, i.e. a parser that can tell if a string abides to the specified grammar or not. Let's see how to invoke it:
#define MAX_EXPR_LEN 256
char expression[MAX_EXPR_LEN];

int main(int argc, char *argv [])
  peg_t parser;
  parser = pegnew();

  while (fgets(expression,MAX_EXPR_LEN,stdin)) {
    if (pegparse(parser,expr,expression))
      printf("(%.*s) is a valid expression\n",pegpos(parser)-expression,expression);
      printf("(%s) is not a valid expression\n",expression);
  parser = pegfree(parser);
The pegparse() does all the magic. It takes a parser created with pegnew(), a peg rule to be matched and a string and returns a true value if the initial portion of the string matches the rule or 0 if the string doesn't match at all.
Here some examples of the program output:
"5+4"     -> (5+4) is a valid expression
"1+3XX"   -> (1+3) is a valid expression
"YZX"     -> (YZX) is not a valid expression

"3 + 2"   -> (3) is a valid expression
The last example shows that we need to explicitly take care of white spaces. The recognizer pegwspace matches a possibly empty sequence of white spaces ('\t' and ' '), while pegvspace also matches vertical space like '\r' and '\n'. Look at the expr2.c listing to see how simply you can handle spaces.

Next Steps

Now is the time to make this parser evolve and do something more useful than just recognizing expressions, for example evaluating them. This will require some more detailed discussion and I will deal with it in a future post.