Boost C++ Libraries

...one of the most highly regarded and expertly designed C++ library projects in the world. Herb Sutter and Andrei Alexandrescu, C++ Coding Standards

This is the documentation for an old version of Boost. Click here to view this page for the latest version.
PrevUpHomeNext

Parsing Expression Grammar

Parsing Expression Grammars (PEG) [6] are a derivative of Extended Backus-Naur Form (EBNF) [7] with a different interpretation, designed to represent a recursive descent parser. A PEG can be directly represented as a recursive-descent parser.

Like EBNF, PEG is a formal grammar for describing a formal language in terms of a set of rules used to recognize strings of this language. Unlike EBNF, PEGs have an exact interpretation. There is only one valid parse tree (see Abstract Syntax Tree) for each PEG grammar.

Sequences

Sequences are represented by juxtaposition like in EBNF:

a b

The PEG expression above states that, in order for this to succeed, b must follow a. Here's the syntax diagram:

sequence

Here's a trivial example:

'x' digit

which means the character x must be followed by a digit.

[Note] Note

In Spirit.Qi, we use the >> for sequences since C++ does not allow juxtaposition, while in Spirit.Karma we use the << instead.

Alternatives

Alternatives are represented in PEG using the slash:

a / b
[Note] Note

In Spirit.Qi and Spirit.Karma, we use the | for alternatives just as in EBNF.

Alternatives allow for choices. The expression above reads: try to match a. If a succeeds, success, if not try to match b. This is a bit of a deviation from the usual EBNF interpretation where you simply match a or b. Here's the syntax diagram:

alternative

PEGs allow for ambiguity in the alternatives. In the expression above, both a or b can both match an input string. However, only the first matching alternative is valid. As noted, there can only be one valid parse tree.

Loops

Again, like EBNF, PEG uses the regular-expression Kleene star and the plus loops:

a*
a+
[Note] Note

Spirit.Qi and Spirit.Karma use the prefix star and plus since there is no postfix star or plus in C++.

Here are the syntax diagrams:

kleene

plus

The first, called the Kleene star, matches zero or more of its subject a. The second, plus, matches one ore more of its subject a.

Unlike EBNF, PEGs have greedy loops. It will match as much as it can until its subject fails to match without regard to what follows. The following is a classic example of a fairly common EBNF/regex expression failing to match in PEG:

alnum* digit

In PEG, alnum will eat as much alpha-numeric characters as it can leaving nothing more left behind. Thus, the trailing digit will get nothing. Loops are simply implemented in recursive descent code as for/while loops making them extremely efficient. That is a definite advantage. On the other hand, those who are familiar with EBNF and regex behavior might find the behavior a major gotcha. PEG provides a couple of other mechanisms to circumvent this. We will see more of these other mechanisms shortly.

Difference

In some cases, you may want to restrict a certain expression. You can think of a PEG expression as a match for a potentially infinite set of strings. The difference operator allows you to restrict this set:

a - b

The expression reads: match a but not b.

[Note] Note

There is no difference operator in Spirit.Karma, as the concept does not make sense in the context of output generation.



[6] Bryan Ford: Parsing Expression Grammars: A Recognition-Based Syntactic Foundation, http://pdos.csail.mit.edu/~baford/packrat/popl04/

[7] Richard E. Pattis: EBNF: A Notation to Describe Syntax, http://www.cs.cmu.edu/~pattis/misc/ebnf.pdf


PrevUpHomeNext