Boost C++ Libraries

...one of the most highly regarded and expertly designed C++ library projects in the world. Herb Sutter and Andrei Alexandrescu, C++ Coding Standards

This is the documentation for an old version of boost. Click here for the latest Boost documentation.
PrevUpHomeNext

Qi and Karma

Tutorials
Quick Start
Warming up
Semantic Actions
Complex - Our first complex parser
Sum - adding numbers
Number List - stuffing numbers into a std::vector
Number List Redux - list syntax
Number List Attribute - one more, with style
Roman Numerals
Employee - Parsing into structs
Mini XML - ASTs!
Mini XML - Error Handling
Abstracts
Parsing Expression Grammar
Parsing
Parsing and Generating
Primitives
Operators
Attributes
Semantic Actions
Directives
Rules
Grammars
Debugging
Mini XML - Error Handling
Parse Trees and ASTs
Quick Reference
Reference
Concepts
Char
String
Numeric
Binary
Directive
Action
Nonterminal
Operators
Stream
Auxiliary
Debug
Why would you want to use Spirit.Qi?

Spirit.Qi is designed to be a practical parsing tool. The ability to generate a fully-working parser from a formal EBNF specification inlined in C++ significantly reduces development time. Programmers typically approach parsing using ad hoc hacks with primitive tools such as scanf. Even regular-expression libraries (such as boost regex) or scanners (such as Boost tokenizer) do not scale well when we need to write more elaborate parsers. Attempting to write even a moderately-complex parser using these tools leads to code that is hard to understand and maintain.

One prime objective is to make the tool easy to use. When one thinks of a parser generator, the usual reaction is "it must be big and complex with a steep learning curve." Not so. Spirit is designed to be fully scalable. The framework is structured in layers. This permits learning on an as-needed basis, after only learning the minimal core and basic concepts.

For development simplicity and ease in deployment, the entire framework consists of only header files, with no libraries to link against or build. Just put the spirit distribution in your include path, compile and run. Code size? -very tight -essentially comparable to hand written recursive descent code.

Our tutorials will walk you through the simplest Spirit examples, incrementally building on top of the earlier examples as we expose more and more features and techniques. We will try to be as gentle as possible with the learning curve. We will present the tutorials in a cookbook style approach. This style of presentation is based on our BoostCon '07 and BoostCon '08 slides.

Have fun!

We'll start by showing examples of parser expressions to give you a feel on how to build parsers from the simplest parser, building up as we go.

Trivial Example #1 Parsing a number

Create a parser that will parse a floating-point number.

double_

(You've got to admit, that's trivial!) The above code actually generates a Spirit floating point parser (a built-in parser). Spirit has many pre-defined parsers and consistent naming conventions help you keep from going insane!

Trivial Example #2 Parsing two numbers

Create a parser that will accept a line consisting of two floating-point numbers.

double_ >> double_

Here you see the familiar floating-point numeric parser double_ used twice, once for each number. What's that >> operator doing in there? Well, they had to be separated by something, and this was chosen as the "followed by" sequence operator. The above program creates a parser from two simpler parsers, glueing them together with the sequence operator. The result is a parser that is a composition of smaller parsers. Whitespace between numbers can implicitly be consumed depending on how the parser is invoked (see below).

[Note] Note

When we combine parsers, we end up with a "bigger" parser, but it's still a parser. Parsers can get bigger and bigger, nesting more and more, but whenever you glue two parsers together, you end up with one bigger parser. This is an important concept.

Trivial Example #3 Parsing one or more numbers

Create a parser that will accept one or more floating-point numbers.

*double_

This is like a regular-expression Kleene Star, though the syntax might look a bit odd for a C++ programmer not used to seeing the * operator overloaded like this. Actually, if you know regular expressions it may look odd too since the star is before the expression it modifies. C'est la vie. Blame it on the fact that we must work with the syntax rules of C++.

Any expression that evaluates to a parser may be used with the Kleene Star. Keep in mind, though, that due to C++ operator precedence rules you may need to put the expression in parentheses for complex expressions. The Kleene Star is also known as a Kleene Closure, but we call it the Star in most places.

Trivial Example #4 Parsing a comma-delimited list of numbers

This example will create a parser that accepts a comma-delimited list of numbers.

double_ >> *(char_(',') >> double_)

Notice char_(','). It is a literal character parser that can recognize the comma ','. In this case, the Kleene Star is modifying a more complex parser, namely, the one generated by the expression:

(char_(',') >> double_)

Note that this is a case where the parentheses are necessary. The Kleene star encloses the complete expression above.

Let's Parse!

We're done with defining the parser. So the next step is now invoking this parser to do its work. There are a couple of ways to do this. For now, we will use the phrase_parse function. One overload of this function accepts four arguments:

  1. An iterator pointing to the start of the input
  2. An iterator pointing to one past the end of the input
  3. The parser object
  4. Another parser called the skip parser

In our example, we wish to skip spaces and tabs. Another parser named space is included in Spirit's repertoire of predefined parsers. It is a very simple parser that simply recognizes whitespace. We will use space as our skip parser. The skip parser is the one responsible for skipping characters in between parser elements such as the double_ and char_.

Ok, so now let's parse!

template <typename Iterator>
bool parse_numbers(Iterator first, Iterator last)
{
    bool r = phrase_parse(
        first,                          1
        last,                           2
        double_ >> *(',' >> double_),   3
        space                           4
    );
    if (first != last) // fail if we did not get a full match
        return false;
    return r;
}

1

start iterator

2

end iterator

3

the parser

4

the skip-parser

The parse function returns true or false depending on the result of the parse. The first iterator is passed by reference. On a successful parse, this iterator is repositioned to the rightmost position consumed by the parser. If this becomes equal to str.end(), then we have a full match. If not, then we have a partial match. A partial match happens when the parser is only able to parse a portion of the input.

Note that we inlined the parser directly in the call to parse. Upon calling parse, the expression evaluates into a temporary, unnamed parser which is passed into the parse() function, used, and then destroyed.

Here, we opted to make the parser generic by making it a template, parameterized by the iterator type. By doing so, it can take in data coming from any STL conforming sequence as long as the iterators conform to a forward iterator.

You can find the full cpp file here: ../../example/qi/num_list1.cpp

[Note] Note

char and wchar_t operands

The careful reader may notice that the parser expression has ',' instead of char_(',') as the previous examples did. This is ok due to C++ syntax rules of conversion. There are >> operators that are overloaded to accept a char or wchar_t argument on its left or right (but not both). An operator may be overloaded if at least one of its parameters is a user-defined type. In this case, the double_ is the 2nd argument to operator>>, and so the proper overload of >> is used, converting ',' into a character literal parser.

The problem with omiting the char_ should be obvious: 'a' >> 'b' is not a spirit parser, it is a numeric expression, right-shifting the ASCII (or another encoding) value of 'a' by the ASCII value of 'b'. However, both char_('a') >> 'b' and 'a' >> char_('b') are Spirit sequence parsers for the letter 'a' followed by 'b'. You'll get used to it, sooner or later.

Finally, take note that we test for a full match (i.e. the parser fully parsed the input) by checking if the first iterator, after parsing, is equal to the end iterator. You may strike out this part if partial matches are to be allowed.

Our parser above is really nothing but a recognizer. It answers the question "did the input match our grammar?", but it does not do anything other than that. It does not extract any information from what was parsed. For example, whenever we parse a real number, we wish to store the parsed number after a successful match.

Enter Semantic actions. Semantic actions may be attached to any point in the grammar specification. These actions are C++ functions or function objects that are called whenever a part of the parser successfully recognizes a portion of the input. Say you have a parser P, and a C++ function F, you can make the parser call F whenever it matches an input by attaching F:

P[F]

The expression above links F to the parser, P.

The function/function object signature depends on the type of the parser to which it is attached. The parser double_ passes the parsed number. Thus, if we were to attach a function F to double_, we need F to be declared as:

void F(double n);

There are actually 2 more arguments being passed (the parser context and a reference to a boolean 'hit' parameter). We don't need these, for now, but we'll see more on these other arguments later. Spirit.Qi allows us to bind a single argument function, like above. The other arguments are simply ignored.

Presented are various ways to attach semantic actions:

Given:

// A plain function
void write(int const& i)
{
    std::cout << i << std::endl;
}

// A member function
struct writer
{
    void print(int const& i) const
    {
        std::cout << i << std::endl;
    }
};

// A function object
struct write_action
{
    void operator()(int const& i, unused_type, unused_type) const
    {
        std::cout << i << std::endl;
    }
};

Take note that with function objects, we need to have an operator() with 3 arguments. Since we don't care about the other two, we can use unused_type for these. We'll see more of unused_type elsewhere. Get used to it. unused_type is a Spirit supplied support class.

All examples parse inputs of the form:

"{integer}"

An integer inside the curly braces.

The first example shows how to attach a plain function:

qi::parse(first, last, '{' >> int_[&write] >> '}');

What's new? Well int_ is the sibbling of double_. I'm sure you can guess what this parser does.

The next example shows how to attach a simple function object:

qi::parse(first, last, '{' >> int_[write_action()] >> '}');

We can use Boost.Bind to 'bind' member functions:

writer w;
qi::parse(first, last, '{' >> int_[boost::bind(&writer::print, &w, _1)] >> '}');

Likewise, we can also use Boost.Bind to 'bind' plain functions:

qi::parse(first, last, '{' >> int_[boost::bind(&write, _1)] >> '}');

Yep, we can also use Boost.Lambda:

qi::parse(first, last, '{' >> int_[std::cout << _1 << '\n'] >> '}');

There are more ways to bind semantic action functions, but the examples above are the most common. Attaching semantic actions is the first hurdle one has to tackle when getting started with parsing with Spirit. Familiarize yourself with this task and get intimate with the tools behind it such as Boost.Bind and Boost.Lambda.

The examples above can be found here: ../../example/qi/actions.cpp

Phoenix

Phoenix, a companion library bundled with Spirit, is specifically suited for binding semantic actions. It is like Boost.Lambda in steroids, with special custom features that make it easy to integrate semantic actions with Spirit. If your requirements go beyond simple to moderate parsing, I suggest you use this library. Examples presented henceforth shall be using the library exclusively

Well, not really a complex parser, but a parser that parses complex numbers. This time, we're using Phoenix to do the semantic actions.

Here's a simple parser expression for complex numbers:

    '(' >> double_ >> -(',' >> double_) >> ')'
|   double_

What's new? Well, we have:

  1. Alternates: e.g. a | b. Try a first. If it succeeds, good. If not, try the next alternative, b.
  2. Optionals: e.g. -p. Match the parser p zero or one time.

The complex parser presented above reads as:

  • One or two real number in parantheses, separated by comma (the second number is optional)
  • OR a single real number.

This parser can parse complex numbers of the form:

(123.45, 987.65)
(123.45)
123.45

Here goes, this time with actions:

template <typename Iterator>
bool parse_complex(Iterator first, Iterator last, std::complex<double>& c)
{
    double rN = 0.0;
    double iN = 0.0;
    bool r = phrase_parse(first, last,

        //  Begin grammar
        (
                '(' >> double_[ref(rN) = _1]
                    >> -(',' >> double_[ref(iN) = _1]) >> ')'
            |   double_[ref(rN) = _1]
        ),
        //  End grammar

        space);

    if (!r || first != last) // fail if we did not get a full match
        return false;
    c = std::complex<double>(rN, iN);
    return r;
}

The full cpp file for this example can be found here: ../../example/qi/complex_number.cpp

The double_ parser attaches this action:

ref(n) = _1

This assigns the parsed result (actually, the attribute of double_) to n. ref(n) tells Phoenix that n is a mutable reference. _1 is a Phoenix placeholder for the parsed result attribute.

Here's a parser that sums a comma-separated list of numbers.

Ok we've glossed over some details in our previous examples. First, our includes:

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix_core.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <iostream>
#include <string>

Then some using directives:

using namespace boost::phoenix;
using namespace boost::spirit;
using namespace boost::spirit::qi;
using namespace boost::spirit::ascii;
using namespace boost::spirit::arg_names;

Namespace

Description

boost::phoenix

All of phoenix

boost::spirit

All of spirit

boost::spirit::qi

All of spirit.qi

boost::spirit::ascii

ASCII version of char_ and all char related parsers. Other encodings are also provided (e.g. also an ISO8859.1)

boost::spirit::arg_names

Special phoenix placeholders for spirit

[Note] Note

If you feel uneasy with using whole namespaces, feel free to qualify your code, use namespace aliases, etc. For the purpose of this tutorial, we will be presenting unqualified names for both Spirit and Phoenix. No worries, we will always present the full working code, so you won't get lost. In fact, all examples in this tutorial have a corresponding cpp file that QuickBook (the documentation tool we are using) imports in here as code snippets.

Now the actual parser:

template <typename Iterator>
bool adder(Iterator first, Iterator last, double& n)
{
    bool r = phrase_parse(first, last,

        //  Begin grammar
        (
            double_[ref(n) = _1] >> *(',' >> double_[ref(n) += _1])
        )
        ,
        //  End grammar

        space);

    if (first != last) // fail if we did not get a full match
        return false;
    return r;
}

The full cpp file for this example can be found here: ../../example/qi/actions.cpp

This is almost like our original numbers list example. We're incrementally building on top of our examples. This time though, like in the complex number example, we'll be adding the smarts. There's an accumulator (`double& n) that adds the numbers parsed. On a successful parse, this number is the sum of all the parsed numbers.

The first double_ parser attaches this action:

ref(n) = _1

This assigns the parsed result (actually, the attribute of double_) to n. ref(n) tells Phoenix that n is a mutable reference. _1 is a Phoenix placeholder for the parsed result attribute.

The second double_ parser attaches this action:

ref(n) += _1

So, subsequent numbers add into n.

That wasn't too bad, was it :-) ?

This sample demontrates a parser for a comma separated list of numbers. The numbers are inserted in a vector using phoenix.

template <typename Iterator>
bool parse_numbers(Iterator first, Iterator last, std::vector<double>& v)
{
    bool r = phrase_parse(first, last,

        //  Begin grammar
        (
            double_[push_back(ref(v), _1)]
                >> *(',' >> double_[push_back(ref(v), _1)])
        )
        ,
        //  End grammar

        space);

    if (first != last) // fail if we did not get a full match
        return false;
    return r;
}

The full cpp file for this example can be found here: ../../example/qi/num_list2.cpp

This, again, is the same parser as before. This time, instead of summing up the numbers, we stuff them in a std::vector. push_back is supplied by Phoenix. The expression:

push_back(ref(v), _1)

appends the parsed number. Like before, _1 is a Phoenix placeholder for the parsed result attribute. Also, like before, ref(v) tells Phoenix that v, the std::vector, is a mutable reference.

So far, we've been using the syntax:

double_ >> *(',' >> double_)

to parse a comma-delimited list of numbers. Such lists are common in parsing and Spirit provides a simpler shortcut for them. The expression above can be simplified to:

double_ % ','

read as: a list of doubles separated by ','.

This sample, again a variation of our previous example, demonstrates just that:

template <typename Iterator>
bool parse_numbers(Iterator first, Iterator last, std::vector<double>& v)
{
    bool r = phrase_parse(first, last,

        //  Begin grammar
        (
            double_[push_back(ref(v), _1)] % ','
        )
        ,
        //  End grammar

        space);

    if (first != last) // fail if we did not get a full match
        return false;
    return r;
}

The full cpp file for this example can be found here: ../../example/qi/num_list3.cpp

You've seen that the double_ parser has a double attribute. All parsers have an attribute, even complex parsers, those that are composed from primitives using operators, like the list parser, also have an attribute. It so happens that the the attribute of a list parser:

p % d

is a std::vector of the attribute of p. So, for our parser:

double_ % ','

we'll have an attribute of:

std::vector<double>

So, what does this give us? Well, we can simply pass in a std::vector<double> to our number list parser and it will happily churn out our result in our vector. For that to happen, we'll use a variation of the phrase_parse with an additional argument: the parser's attribute:

  1. An iterator pointing to the start of the input
  2. An iterator pointing to one past the end of the input
  3. The parser object
  4. The parser's attribute
  5. Another parser called the skip parser

So, our parser now is further simplified to:

template <typename Iterator>
bool parse_numbers(Iterator first, Iterator last, std::vector<double>& v)
{
    bool r = phrase_parse(first, last,

        //  Begin grammar
        (
            double_ % ','
        )
        ,
        //  End grammar

        v, space);

    if (first != last) // fail if we did not get a full match
        return false;
    return r;
}

The full cpp file for this example can be found here: ../../example/qi/num_list4.cpp

Hey, no more actions!!! Now we're entering the realm of attribute grammars. Cool eh?

This example demonstrates:

  • symbol table
  • rule
  • grammar
Symbol Table

The symbol table holds a dictionary of symbols where each symbol is a sequence of characters (a char, wchar_t, int, enumeration etc.) . The template class, parameterized by the character type, can work efficiently with 8, 16, 32 and even 64 bit characters. Mutable data of type T is associated with each symbol.

Traditionally, symbol table management is maintained seperately outside the BNF grammar through semantic actions. Contrary to standard practice, the Spirit symbol table class symbols is-a parser. An object of which may be used anywhere in the EBNF grammar specification. It is an example of a dynamic parser. A dynamic parser is characterized by its ability to modify its behavior at run time. Initially, an empty symbols object matches nothing. At any time, symbols may be added or removed, thus, dynamically altering its behavior.

Each entry in a symbol table has an associated mutable data slot. In this regard, one can view the symbol table as an associative container (or map) of key-value pairs where the keys are strings.

The symbols class expects two template parameters. The first parameter specifies the character type of the symbols. The second specifies the data type associated with each symbol: its attribute.

Here's a parser for roman hundreds (100..900) using the symbol table. Keep in mind that the data associated with each slot is the parser's attribute (which is passed to attached semantic actions).

struct hundreds_ : symbols<char, unsigned>
{
    hundreds_()
    {
        add
            ("C"    , 100)
            ("CC"   , 200)
            ("CCC"  , 300)
            ("CD"   , 400)
            ("D"    , 500)
            ("DC"   , 600)
            ("DCC"  , 700)
            ("DCCC" , 800)
            ("CM"   , 900)
        ;
    }

} hundreds;

Here's a parser for roman tens (10..90):

struct tens_ : symbols<char, unsigned>
{
    tens_()
    {
        add
            ("X"    , 10)
            ("XX"   , 20)
            ("XXX"  , 30)
            ("XL"   , 40)
            ("L"    , 50)
            ("LX"   , 60)
            ("LXX"  , 70)
            ("LXXX" , 80)
            ("XC"   , 90)
        ;
    }

} tens;

and, finally, for ones (1..9):

struct ones_ : symbols<char, unsigned>
{
    ones_()
    {
        add
            ("I"    , 1)
            ("II"   , 2)
            ("III"  , 3)
            ("IV"   , 4)
            ("V"    , 5)
            ("VI"   , 6)
            ("VII"  , 7)
            ("VIII" , 8)
            ("IX"   , 9)
        ;
    }

} ones;

Now we can use hundreds, tens and ones anywhere in our parser expressions. They are all parsers.

Rules

Up until now, we've been inlining our parser expressions, passing them directly to the phrase_parse function. The expression evaluates into a temporary, unnamed parser which is passed into the phrase_parse function, used, and then destroyed. This is fine for small parsers. When the expressions get complicated, you'd want to break the expressions into smaller easier to understand pieces, name them, and refer to them from other parser expressions by name.

A parser expression can be assigned to, what is called, a "rule". There are various ways to declare rules. The simplest form is:

rule<Iterator> r;

At the very least, the rule needs to know the iterator type it will be working on. This rule cannot be used with phrase_parse. It can only be used with the parse function -- a version that does not do white space skipping (does not have the skipper argument). If you want to have it skip white spaces, you need to pass in the type skip parser, as in the next form:

rule<Iterator, Skipper> r;

Example:

rule<std::string::iterator, space_type> r;

This type of rule can be used for both phrase_parse and parse.

For our next example, there's one more rule form you should know about:

rule<Iterator, Signature> r;

or

rule<Iterator, Signature, Skipper> r;
[Tip] Tip

All rule template arguments after Iterator can be supplied in any order.

The Signature specifies the attributes of the rule. You've seen that our parsers can have an attribute. Recall that the double_ parser has an attribute of double. To be precise, these are synthesized attributes. The parser "synthesizes" the attribute value. Think of them as function return values.

There's another type of attribute called "inherited" attribute. We won't need them for now, but it's good that you be aware of such attributes. You can think of them as function arguments. And, rightly so, the rule signature is a function signature of the form:

result(argN, argN,..., argN)

After having declared a rule, you can now assign any parser expression to it. Example:

r = double_ >> *(',' >> double_);
Grammars

A grammar encapsulates one or more rules. It has the same template parameters as the rule. You declare a grammar by:

  1. deriving a struct (or class) from the grammar class template
  2. declare one or more rules as member variables
  3. initialize the base grammar class by giving it the start rule (its the first rule that gets called when the grammar starts parsing)
  4. initialize your rules in your constructor

The roman numeral grammar is a very nice and simple example of a grammar:

template <typename Iterator>
struct roman : grammar<Iterator, unsigned()>
{
    roman() : roman::base_type(start)
    {
        start = eps             [_val = 0] >>
            (
                +char_('M')     [_val += 1000]
                ||  hundreds    [_val += _1]
                ||  tens        [_val += _1]
                ||  ones        [_val += _1]
            )
        ;
    }

    rule<Iterator, unsigned()> start;
};

Things to take notice of:

  • The grammar and start rule signature is unsigned(). It has a synthesized attribute (return value) of type unsigned with no inherited attributes (arguments).
  • We did not specify a skip-parser. We don't want to skip in between the numerals.
  • roman::base_type is a typedef for grammar<Iterator, unsigned()>. If roman was not a template, you can simply write: base_type(start)
  • But it's best to make your grammar templates, so that they can be reused for different iterator types.
  • _val is another Phoenix placeholder representing the rule's synthesized attribute.
  • eps is a special spirit parser that consumes no input but is always successful. We use it to initialize _val, the rule's synthesized attribute, to zero before anything else. The actual parser starts at +char_('M'), parsing roman thousands. Using eps this way is good for doing pre and post initializations.
  • The expression a || b reads: match a or b and in sequence. That is, if both a and b match, it must be in sequence; this is equivalent to a >> !b | b, but more efficient.
Let's Parse!

bool r = parse(iter, end, roman_parser, result);

if (r && iter == end)
{
    std::cout << "-------------------------\n";
    std::cout << "Parsing succeeded\n";
    std::cout << "result = " << result << std::endl;
    std::cout << "-------------------------\n";
}
else
{
    std::string rest(iter, end);
    std::cout << "-------------------------\n";
    std::cout << "Parsing failed\n";
    std::cout << "stopped at: \": " << rest << "\"\n";
    std::cout << "-------------------------\n";
}

roman_parser is an object of type roman -our roman numeral parser. This time around, we are using the no-skipping version of the parse functions. We do not want to skip any spaces! We are also passing in an attribute, unsigned result, which will receive the parsed value.

The full cpp file for this example can be found here: ../../example/qi/roman.cpp

It's a common question in the Spirit General List: how do I parse and place the results into a C++ struct? Of course, at this point, you already know various ways to do it, using semantic actions. There are many ways to skin a cat. Spirit2, being fully attributed, makes it even easier. The next example demonstrates some features of Spirit2 that makes this easy. In the process, you'll learn about:

  • More about attributes
  • Auto rules
  • Some more built-in parsers
  • Directives

First, let's create a struct representing an employee:

struct employee
{
    int age;
    std::string surname;
    std::string forename;
    double salary;
};

Then, we need to tell Fusion about our employee struct to make it a first- class fusion citizen. If you don't know fusion yet, it is a Boost library for working with heterogenous collections of data, commonly referred to as tuples. Spirit uses fusion extensively as part of its infrastructure.

In fusion's view, a struct is just a form of a tuple. You can adapt any struct to be a fully conforming fusion tuple:

BOOST_FUSION_ADAPT_STRUCT(
    employee,
    (int, age)
    (std::string, surname)
    (std::string, forename)
    (double, salary)
)

Now we'll write a parser for our employee. Inputs will be of the form:

employee{ age, "surname", "forename", salary }

Here goes:

template <typename Iterator>
struct employee_parser : grammar<Iterator, employee(), space_type>
{
    employee_parser() : employee_parser::base_type(start)
    {
        quoted_string %= lexeme['"' >> +(char_ - '"') >> '"'];

        start %=
            lit("employee")
            >> '{'
            >>  int_ >> ','
            >>  quoted_string >> ','
            >>  quoted_string >> ','
            >>  double_
            >>  '}'
            ;
    }

    rule<Iterator, std::string(), space_type> quoted_string;
    rule<Iterator, employee(), space_type> start;
};

The full cpp file for this example can be found here: ../../example/qi/employee.cpp

Let's walk through this one step at a time (not necessarily from top to bottom).

template <typename Iterator>
struct employee_parser : grammar<Iterator, employee(), space_type>

employee_parser is a grammar. Like before, we make it a template so that we can reuse it for different iterator types. The grammar's signature is:

employee()

meaning, the parser generates employee structs. employee_parser skips white spaces using space_type as its skip parser.

employee_parser() : employee_parser::base_type(start)

Initializes the base class.

rule<Iterator, std::string(), space_type> quoted_string;
rule<Iterator, employee(), space_type> start;

Declares two rules: quoted_string and start. start has the same template parameters as the grammar itself. quoted_string has a std::string attribute.

Lexeme
lexeme['"' >> +(char_ - '"') >> '"'];

lexeme inhibits space skipping from the open brace to the closing brace. The expression parses quoted strings.

+(char_ - '"')

parses one or more chars, except the double quote. It stops when it sees a double quote.

Difference

The expression:

a - b

parses a but not b. Its attribute is just A, the attribute of a. b's attribute is ignored. Hence, the attribute of:

char_ - '"'

is just char.

Plus
+a

is the close kin of the kleene star we got so used to in our tutorial. Like it's kin, the kleene star, its attribute is a std::vector<A> where A is the attribute of a. So, putting all these together, the attribute of

+(char_ - '"')

is then:

std::vector<char>
Sequence Attribute

Now what's the attribute of

'"' >> +(char_ - '"') >> '"'

?

Well, typically, the attribute of:

a >> b >> c

is:

fusion::vector<A, B, C>

where A is the attribute of a, B is the attribute of b and C is the attribute of c. What is fusion::vector? - a tuple.

[Note] Note

If you don't know what I am talking about, see: Fusion Vector. It might be a good idea to have a look into Fusion at this point. You'll definitely see more of it in the coming pages.

Attribute Collapsing

Some parsers, especially those very little literal parsers you see, like '"', do not have attributes.

Nodes without attributes are disregarded. In a sequence, like above, all nodes with no attributes are filtered out of the fusion:vector. So, since '"' has no attribute, and +(char_ - '"') has a std::vector<char> attribute, the whole expression's attribute should have been:

fusion::vector<std::vector<char> >

But wait, there's one more collapsing rule: If after the attribute is a single element fusion::vector, The element is stripped naked from its container. So, to make a long story short, the attribute of the expression:

'"' >> +(char_ - '"') >> '"'

is:

std::vector<char>
Auto Rules

It is typical to see rules like:

r = p[_val = _1];

If you have a rule definition like above where the attribute of the RHS (right hand side) of the rule is compatibe with the attribute of the LHS (left hand side), then you can rewrite it as:

r %= p;

The attribute of p automatically uses the attribute of r.

So, going back to our quoted_string rule:

quoted_string %= lexeme['"' >> +(char_ - '"') >> '"'];

is a simplified version of:

quoted_string = lexeme['"' >> +(char_ - '"') >> '"'][val_ = _1];

The attribute of the quoted_string rule: std::string is compatible with the attribute of the RHS: std::vector<char>. The RHS extracts the parsed attribute directly into the rule's attribute, in-situ.

Finally

We're down to one rule, the start rule:

start %=
    lit("employee")
    >> '{'
    >>  int_ >> ','
    >>  quoted_string >> ','
    >>  quoted_string >> ','
    >>  double_
    >>  '}'
    ;

Applying our collapsing rules above, the RHS has an attribute of:

fusion::vector<int, std::string, std::string, double>

These nodes do not have an attribute:

  • lit("employee")
  • '{'
  • '}'
[Note] Note

In case you are wondering, lit("employee") is the same as "employee". We had to wrap it inside lit because immediately after it is >> '{'. You can't right-shift a char[] and a char - you know, C++ syntax rules.

Recall that the attribute of start is the employee struct:

struct employee
{
    int age;
    std::string surname;
    std::string forename;
    double salary;
};

Now everything is clear, right? The struct employee IS compatible with fusion::vector<int, std::string, std::string, double>. So, the RHS of start uses start's attribute (a struct employee) in-situ when it does its work.

Stop and think about it... We've come very close to generating an AST in our last example. We parsed a single structure and generated an in-memory representation of it in the form of a struct: the struct employee. If we changed the implementation to parse one or more employees, the result would be a std::vector<employee>. We can go on and add more hierarchy: teams, departments, corporations. Then we'll have an AST representation of it all.

In this example (actually two examples), we'll now explore parsers how to create ASTs. We will parse a minimalistic XML like language and compile the results into our data structures in the form of a tree.

Along the way, we'll see new features:

  • Inherited attributes
  • Variant attributes
  • Local Variables
  • Not Predicate
  • Lazy Lit

The full cpp files for these examples can be found here: ../../example/qi/mini_xml1.cpp and here: ../../example/qi/mini_xml2.cpp

There are a couple of sample toy-xml files in: ../../example/qi/mini_xml_samples for testing purposes. "4.toyxml" has an error in it.

First Cut

Without further delay, here's the first version of the XML grammar:

template <typename Iterator>
struct mini_xml_grammar : grammar<Iterator, mini_xml(), space_type>
{
    mini_xml_grammar() : mini_xml_grammar::base_type(xml)
    {
        text = lexeme[+(char_ - '<')        [_val += _1]];
        node = (xml | text)                 [_val = _1];

        start_tag =
                '<'
            >>  !char_('/')
            >>  lexeme[+(char_ - '>')       [_val += _1]]
            >>  '>'
        ;

        end_tag =
                "</"
            >>  lit(_r1)
            >>  '>'
        ;

        xml =
                start_tag                   [at_c<0>(_val) = _1]
            >>  *node                       [push_back(at_c<1>(_val), _1)]
            >>  end_tag(at_c<0>(_val))
        ;
    }

    rule<Iterator, mini_xml(), space_type> xml;
    rule<Iterator, mini_xml_node(), space_type> node;
    rule<Iterator, std::string(), space_type> text;
    rule<Iterator, std::string(), space_type> start_tag;
    rule<Iterator, void(std::string), space_type> end_tag;
};

Going bottom up, let's examine the text rule:

rule<Iterator, std::string(), space_type> text;

and its definition:

text = lexeme[+(char_ - '<')        [_val += _1]];

The semantic action collects the chars and appends them (via +=) to the std::string attribute of the rule (represented by the placeholder _val).

Alternates
rule<Iterator, mini_xml_node(), space_type> node;

and its definition:

node = (xml | text)                 [_val = _1];

We'll see what a mini_xml_node structure later. Looking at the rule definition, we see some alternation goiing on here. An xml node is either an xml OR text. Hmmm... hold on to that thought...

rule<Iterator, std::string(), space_type> start_tag;

Again, with an attribute of std::string. Then, it's definition:

start_tag =
        '<'
    >>  !char_('/')
    >>  lexeme[+(char_ - '>')       [_val += _1]]
    >>  '>'
;
Not Predicate

start_tag is similar to the text rule apart from the added '<' and '>'. But wait, to make sure that the start_tag does not parse end_tags too, we add: !char_('/'). This is a "Not Predicate":

!p

It will try the parser, p. If it is successful, fail, otherwise, pass. In other words, it negates the result of p. Like the eps, it does not consume any input though. It will always rewind the iterator position to where it was upon entry. So, the expression:

!char_('/')

basically says: we should not have a '/' at this point.

Inherited Attribute

The end_tag:

rule<Iterator, void(std::string), space_type> end_tag;

Ohh! Now we see an inherited attribute there: std::string. The end_tag does not have a synthesized attribute. Let's see its definition:

end_tag =
        "</"
    >>  lit(_r1)
    >>  '>'
;

_r1 is yet another Phoenix placeholder for the 1st inherited attribute (we have only one, use _r2, _r3, etc. if you have more).

A Lazy Lit

Check out how we used lit here, this time, not with a literal string, but with the value of the 1st inherited attribute, which is specified as std::string in our rule declaration.

Finally, our xml rule:

rule<Iterator, mini_xml(), space_type> xml;

mini_xml is our attribute here. We'll see later what it is. Let's see its definition:

xml =
        start_tag                   [at_c<0>(_val) = _1]
    >>  *node                       [push_back(at_c<1>(_val), _1)]
    >>  end_tag(at_c<0>(_val))
;

Those who know Fusion now will notice at_c<0> and at_c<1>. This gives us a hint that mini_xml is a sort of a tuple - a fusion sequence. at_c<N> here is a lazy version of the tuple accessors, provided by Phoenix.

How it all works

So, what's happening?

  1. Upon parsing start_tag, the parsed start-tag string is placed in at_c<0>(_val).
  2. Then we parse zero or more nodes. At each step, we push_back the result into at_c<1>(_val).
  3. Finally, we parse the end_tag giving it an inherited attribute: at_c<0>(_val). This is the string we obtained from the start_tag. Investigate end_tag above. It will fail to parse if it gets something different from what we got from the start_tag. This ensures that our tags are balanced.

To give the last item some more light, what happens is this:

end_tag(at_c<0>(_val))

calls:

end_tag =
        "</"
    >>  lit(_r1)
    >>  '>'
;

passing in at_c<0>(_val), the string from start tag. This is referred to in the end_tag body as _r1.

The Structures

Let's see our structures. It will definitely be hierarchical: xml is hierarchical. It will also be recursive: xml is recursive.

struct mini_xml;

typedef
    boost::variant<
        boost::recursive_wrapper<mini_xml>
      , std::string
    >
mini_xml_node;

struct mini_xml
{
    std::string name;                           // tag name
    std::vector<mini_xml_node> children;        // children
};

Of Alternates and Variants

So that's how a mini_xml_node looks like. We had a hint that it is either a string or a mini_xml. For this, we use boost.variant<>. boost::recursive_wrapper wraps mini_xml, making it a recursive data structure.

Yep, you got that right: the attribute of an alternate:

a | b

is a

boost::variant<A, B>

where A is the attribute of a and B is the attribute of b.

Adapting structs again

mini_xml is no brainier. It is a plain ol' struct. But as we've seen in our employee example, we can adapt that to be a Fusion sequence:

BOOST_FUSION_ADAPT_STRUCT(
    mini_xml,
    (std::string, name)
    (std::vector<mini_xml_node>, children)
)

One More Take

Here's another version. The AST structure remains the same, but this time, you'll see that we make use of auto-rules making the grammar semantic-action- less. Here it is:

template <typename Iterator>
struct mini_xml_grammar
  : grammar<Iterator, mini_xml(), locals<std::string>, space_type>
{
    mini_xml_grammar()
      : mini_xml_grammar::base_type(xml)
    {
        text %= lexeme[+(char_ - '<')];
        node %= xml | text;

        start_tag %=
                '<'
            >>  !char_('/')
            >>  lexeme[+(char_ - '>')]
            >>  '>'
        ;

        end_tag =
                "</"
            >>  lit(_r1)
            >>  '>'
        ;

        xml %=
                start_tag[_a = _1]
            >>  *node
            >>  end_tag(_a)
        ;
    }

    rule<Iterator, mini_xml(), locals<std::string>, space_type> xml;
    rule<Iterator, mini_xml_node(), space_type> node;
    rule<Iterator, std::string(), space_type> text;
    rule<Iterator, std::string(), space_type> start_tag;
    rule<Iterator, void(std::string), space_type> end_tag;
};

This one shouldn't be any more difficult to understand after going through the first xml parser example. The rules are almost the same, except that, we got rid of semantic actions and used auto-rules (see the employee example if you missed that). There are a couple of new stuff, though. It's all in the xml rule:

Local Variables
rule<Iterator, mini_xml(), locals<std::string>, space_type> xml;

Wow, we have four template parameters now. What's that locals guy doing there? Well, it declares that the rule xml will have one local variable: a string. Let's see how this is used in action:

xml %=
        start_tag[_a = _1]
    >>  *node
    >>  end_tag(_a)
;
  1. Upon parsing start_tag, the parsed start-tag string is placed in the local variable specified by (yet another) Phoenix placeholder: _a. We have only one local variable. If we had more, these are designated by _b.._z.
  2. Then we parse zero or more nodes.
  3. Finally, we parse the end_tag giving it an inherited attribute: _a, our local variable.

There are no actions involved in stuffing data into our xml attribute. It's all taken cared of thatnks to the auto-rule.

A parser will not be complete without error handling. Spirit2 provides some facilities to make it easy to adapt a grammar for error handling. We'll wrap up the Qi tutorial with another version of the mini xml parser, this time, with error handling.

../../example/qi/mini_xml1.cpp and here: ../../example/qi/mini_xml2.cpp

Here's the grammar:

template <typename Iterator>
struct mini_xml_grammar
  : grammar<Iterator, mini_xml(), locals<std::string>, space_type>
{
    mini_xml_grammar()
      : mini_xml_grammar::base_type(xml, "xml")
    {
        text %= lexeme[+(char_ - '<')];
        node %= xml | text;

        start_tag %=
                '<'
            >>  !char_('/')
            >   lexeme[+(char_ - '>')]
            >   '>'
        ;

        end_tag =
                "</"
            >   lit(_r1)
            >   '>'
        ;

        xml %=
                start_tag[_a = _1]
            >   *node
            >   end_tag(_a)
        ;

        xml.name("xml");
        node.name("node");
        text.name("text");
        start_tag.name("start_tag");
        end_tag.name("end_tag");

        on_error<fail>
        (
            xml
          , std::cout
                << val("Error! Expecting ")
                << _4                               // what failed?
                << val(" here: \"")
                << construct<std::string>(_3, _2)   // iterators to error-pos, end
                << val("\"")
                << std::endl
        );
    }

    rule<Iterator, mini_xml(), locals<std::string>, space_type> xml;
    rule<Iterator, mini_xml_node(), space_type> node;
    rule<Iterator, std::string(), space_type> text;
    rule<Iterator, std::string(), space_type> start_tag;
    rule<Iterator, void(std::string), space_type> end_tag;
};

What's new?

Readable Names

First, when we call the base class, we give the grammar a name:

: mini_xml_grammar::base_type(xml, "xml")

Then, we name all our rules:

xml.name("xml");
node.name("node");
text.name("text");
start_tag.name("start_tag");
end_tag.name("end_tag");
On Error

on_error declares our error handler:

on_error<Action>(rule, handler)

This will specify what we will do when we get an error. We will print out an error message using phoenix:

on_error<fail>
(
    xml
  , std::cout
        << val("Error! Expecting ")
        << _4                               // what failed?
        << val(" here: \"")
        << construct<std::string>(_3, _2)   // iterators to error-pos, end
        << val("\"")
        << std::endl
);

we choose to fail in our example for the Action: Quit and fail. Return a no_match (false). It can be one of:

Action

Description

fail

Quit and fail. Return a no_match.

retry

Attempt error recovery, possibly moving the iterator position.

accept

Force success, moving the iterator position appropriately.

rethrow

Rethrows the error.

rule is the rule we attach the handler to. In our case, we are attaching to the xml rule.

handler is the actual error handling function. It expects 4 arguments:

Arg

Description

first

The position of the iterator when the rule with the handler was entered.

last

The end of input.

error-pos

The actual position of the iterator where the error occurred.

what

What failed: a string decribing the failure.

Expectation Points

You might not have noticed it, but some of our expressions changed from using the >> to >. Look, for example:

end_tag =
        "</"
    >   lit(_r1)
    >   '>'
;

What is it? It's the expectation operator. You will have some "deterministic points" in the grammar. Those are the places where backtracking cannot occur. For our example above, when you get a "</", you definitely must see a valid end-tag label next. It should be the one you got from the start-tag. After that, you definitely must have a '>' next. Otherwise, there is no point in proceeding forward and trying other branches, regardless where they are. The input is definitely erroneous. When this happens, an expectation_failure exception is thrown. Somewhere outward, the error handler will catch the exception.

Try building the parser: ../../example/qi/mini_xml2.cpp. You can find some examples in: ../../example/qi/mini_xml_samples for testing purposes. "4.toyxml" has an error in it:

<foo><bar></foo></bar>

Running the example with this gives you:

Error! Expecting "bar" here: "foo></bar>"
Error! Expecting end_tag here: "<bar></foo></bar>"
-------------------------
Parsing failed
-------------------------

PrevUpHomeNext