Parsers in Depth

This section is not for the faint of heart. In here, are distilled the inner workings of Spirit.Qi parsers, using real code from the Spirit library as examples. On the other hand, here is no reason to fear reading on, though. We tried to explain things step by step while highlighting the important insights.

The Parser class is the base class for all parsers.

template <typename Derived>
struct parser
{
    struct parser_id;
    typedef Derived derived_type;
    typedef qi::domain domain;

    // Requirement: p.parse(f, l, context, skip, attr) -> bool
    //
    //  p:          a parser
    //  f, l:       first/last iterator pair
    //  context:    enclosing rule context (can be unused_type)
    //  skip:       skipper (can be unused_type)
    //  attr:       attribute (can be unused_type)

    // Requirement: p.what(context) -> info
    //
    //  p:          a parser
    //  context:    enclosing rule context (can be unused_type)

    // Requirement: P::template attribute<Ctx, Iter>::type
    //
    //  P:          a parser type
    //  Ctx:        A context type (can be unused_type)
    //  Iter:       An iterator type (can be unused_type)

    Derived const& derived() const
    {
        return *static_cast<Derived const*>(this);
    }
};

The Parser class does not really know how to parse anything but instead relies on the template parameter Derived to do the actual parsing. This technique is known as the "Curiously Recurring Template Pattern" in template meta-programming circles. This inheritance strategy gives us the power of polymorphism without the virtual function overhead. In essence this is a way to implement compile time polymorphism.

The Derived parsers, PrimitiveParser, UnaryParser, BinaryParser and NaryParser provide the necessary facilities for parser detection, introspection, transformation and visitation.

Derived parsers must support the following:

bool parse(f, l, context, skip, attr)

f, l: first/last iterator pair
context: enclosing rule context (can be unused_type)
skip: skipper (can be unused_type)
attr: attribute (can be unused_type)

The parse is the main parser entry point. skipper can be an unused_type. It's a type used every where in Spirit to signify "don't-care". There is an overload for skip for unused_type that is simply a no-op. That way, we do not have to write multiple parse functions for phrase and character level parsing.

Here are the basic rules for parsing:

The parser returns true if successful, false otherwise.
If successful, first is incremented N number of times, where N is the number of characters parsed. N can be zero --an empty (epsilon) match.
If successful, the parsed attribute is assigned to attr
If unsuccessful, first is reset to its position before entering the parser function. attr is untouched.

void what(context)

context: enclosing rule context (can be unused_type)

The what function should be obvious. It provides some information about “what” the parser is. It is used as a debugging aid, for example.

P::template attribute<context>::type

P: a parser type
context: A context type (can be unused_type)

The attribute metafunction returns the expected attribute type of the parser. In some cases, this is context dependent.

In this section, we will dissect two parser types:

Parsers

PrimitiveParser: A parser for primitive data (e.g. integer parsing).
UnaryParser: A parser that has single subject (e.g. kleene star).

Primitive Parsers

For our disection study, we will use a Spirit primitive, the int_parser in the boost::spirit::qi namespace.

template <
    typename T
  , unsigned Radix = 10
  , unsigned MinDigits = 1
  , int MaxDigits = -1>
struct int_parser_impl
  : primitive_parser<int_parser_impl<T, Radix, MinDigits, MaxDigits> >
{
    // check template parameter 'Radix' for validity
    BOOST_SPIRIT_ASSERT_MSG(
        Radix == 2 || Radix == 8 || Radix == 10 || Radix == 16,
        not_supported_radix, ());

    template <typename Context, typename Iterator>
    struct attribute
    {
        typedef T type;
    };

    template <typename Iterator, typename Context
      , typename Skipper, typename Attribute>
    bool parse(Iterator& first, Iterator const& last
      , Context& /*context*/, Skipper const& skipper
      , Attribute& attr) const
    {
        qi::skip_over(first, last, skipper);
        return extract_int<T, Radix, MinDigits, MaxDigits>
            ::call(first, last, attr);
    }

    template <typename Context>
    info what(Context& /*context*/) const
    {
        return info("integer");
    }
};

The int_parser is derived from a PrimitiveParser<Derived>, which in turn derives from parser<Derived>. Therefore, it supports the following requirements:

The parse member function
The what member function
The nested attribute metafunction

parse is the main entry point. For primitive parsers, our first thing to do is call:

qi::skip(first, last, skipper);

to do a pre-skip. After pre-skipping, the parser proceeds to do its thing. The actual parsing code is placed in extract_int<T, Radix, MinDigits, MaxDigits>::call(first, last, attr);

This simple no-frills protocol is one of the reasons why Spirit is fast. If you know the internals of Spirit.Classic and perhaps even wrote some parsers with it, this simple Spirit mechanism is a joy to work with. There are no scanners and all that crap.

The what function just tells us that it is an integer parser. Simple.

The attribute metafunction returns the T template parameter. We associate the int_parser to some placeholders for short_, int_, long_ and long_long types. But, first, we enable these placeholders in namespace boost::spirit:

template <>
struct use_terminal<qi::domain, tag::short_> // enables short_
  : mpl::true_ {};

template <>
struct use_terminal<qi::domain, tag::int_> // enables int_
  : mpl::true_ {};

template <>
struct use_terminal<qi::domain, tag::long_> // enables long_
  : mpl::true_ {};

template <>
struct use_terminal<qi::domain, tag::long_long> // enables long_long
  : mpl::true_ {};

Notice that int_parser is placed in the namespace boost::spirit::qi while these enablers are in namespace boost::spirit. The reason is that these placeholders are shared by other Spirit domains. Spirit.Qi, the parser is one domain. Spirit.Karma, the generator is another domain. Other parser technologies may be developed and placed in yet another domain. Yet, all these can potentially share the same placeholders for interoperability. The interpretation of these placeholders is domain-specific.

Now that we enabled the placeholders, we have to write generators for them. The make_xxx stuff (in boost::spirit::qi namespace):

template <typename T>
struct make_int
{
    typedef int_parser_impl<T> result_type;
    result_type operator()(unused_type, unused_type) const
    {
        return result_type();
    }
};

This one above is our main generator. It's a simple function object with 2 (unused) arguments. These arguments are

The actual terminal value obtained by proto. In this case, either a short, int, long_ or long_long. We don't care about this.
Modifiers. We also don't care about this. This allows directives such as no_case[p] to pass information to inner parser nodes. We'll see how that works later.

Now:

template <typename Modifiers>
struct make_primitive<tag::short_, Modifiers> : make_int<short> {};

template <typename Modifiers>
struct make_primitive<tag::int_, Modifiers> : make_int<int> {};

template <typename Modifiers>
struct make_primitive<tag::long_, Modifiers> : make_int<long> {};

template <typename Modifiers>
struct make_primitive<tag::long_long, Modifiers> 
  : make_int<boost::long_long_type> {};

These, specialize qi:make_primitive for specific tags. They all inherit from make_int which does the actual work.

Composite Parsers

Let me present the kleene star (also in namespace spirit::qi):

template <typename Subject>
struct kleene : unary_parser<kleene<Subject> >
{
    typedef Subject subject_type;

    template <typename Context, typename Iterator>
    struct attribute
    {
        // Build a std::vector from the subject's attribute. Note
        // that build_std_vector may return unused_type if the
        // subject's attribute is an unused_type.
        typedef typename
            traits::build_std_vector<
                typename traits::
                    attribute_of<Subject, Context, Iterator>::type
            >::type
        type;
    };

    kleene(Subject const& subject)
      : subject(subject) {}

    template <typename Iterator, typename Context
      , typename Skipper, typename Attribute>
    bool parse(Iterator& first, Iterator const& last
      , Context& context, Skipper const& skipper
      , Attribute& attr) const
    {
        // create a local value if Attribute is not unused_type
        typedef typename traits::container_value<Attribute>::type 
            value_type;
        value_type val = value_type();

        // Repeat while subject parses ok
        while (subject.parse(first, last, context, skipper, val))
        {
            // push the parsed value into our attribute
            traits::push_back(attr, val);
            traits::clear(val);
        }
        return true;
    }

    template <typename Context>
    info what(Context& context) const
    {
        return info("kleene", subject.what(context));
    }

    Subject subject;
};

Looks similar in form to its primitive cousin, the int_parser. And, again, it has the same basic ingredients required by Derived.

The nested attribute metafunction
The parse member function
The what member function

kleene is a composite parser. It is a parser that composes another parser, its “subject”. It is a UnaryParser and subclasses from it. Like PrimitiveParser, UnaryParser<Derived> derives from parser<Derived>.

unary_parser<Derived>, has these expression requirements on Derived:

p.subject -> subject parser ( p is a UnaryParser parser.)
P::subject_type -> subject parser type ( P is a UnaryParser type.)

parse is the main parser entry point. Since this is not a primitive parser, we do not need to call qi::skip(first, last, skipper). The subject, if it is a primitive, will do the pre-skip. If if it is another composite parser, it will eventually call a primitive parser somewhere down the line which will do the pre-skip. This makes it a lot more efficient than Spirit.Classic. Spirit.Classic puts the skipping business into the so-called "scanner" which blindly attempts a pre-skip everytime we increment the iterator.

What is the attribute of the kleene? In general, it is a std::vector<T> where T is the attribute of the subject. There is a special case though. If T is an unused_type, then the attribute of kleene is also unused_type. traits::build_std_vector takes care of that minor detail.

So, let's parse. First, we need to provide a local attribute of for the subject:

typename traits::attribute_of<Subject, Context>::type val;

traits::attribute_of<Subject, Context> simply calls the subject's struct attribute<Context> nested metafunction.

val starts out default initialized. This val is the one we'll pass to the subject's parse function.

The kleene repeats indefinitely while the subject parser is successful. On each successful parse, we push_back the parsed attribute to the kleen's attribute, which is expected to be, at the very least, compatible with a std::vector. In other words, although we say that we want our attribute to be a std::vector, we try to be more lenient than that. The caller of kleene's parse may pass a different attribute type. For as long as it is also a conforming STL container with push_back, we are ok. Here is the kleene loop:

while (subject.parse(first, last, context, skipper, val))
{
    // push the parsed value into our attribute
    traits::push_back(attr, val);
    traits::clear(val);
}
return true;

Take note that we didn't call attr.push_back(val). Instead, we called a Spirit provided function:

traits::push_back(attr, val);

This is a recurring pattern. The reason why we do it this way is because attr can be unused_type. traits::push_back takes care of that detail. The overload for unused_type is a no-op. Now, you can imagine why Spirit is fast! The parsers are so simple and the generated code is as efficient as a hand rolled loop. All these parser compositions and recursive parse invocations are extensively inlined by a modern C++ compiler. In the end, you get a tight loop when you use the kleene. No more excess baggage. If the attribute is unused, then there is no code generated for that. That's how Spirit is designed.

The what function simply wraps the output of the subject in a "kleene“... "”".

Ok, now, like the int_parser, we have to hook our parser to the qi engine. Here's how we do it:

First, we enable the prefix star operator. In proto, it's called the "dereference":

template <>
struct use_operator<qi::domain, proto::tag::dereference> // enables *p
  : mpl::true_ {};

This is done in namespace boost::spirit like its friend, the use_terminal specialization for our int_parser. Obviously, we use use_operator to enable the dereference for the qi::domain.

Then, we need to write our generator (in namespace qi):

template <typename Elements, typename Modifiers>
struct make_composite<proto::tag::dereference, Elements, Modifiers>
  : make_unary_composite<Elements, kleene>
{};

This essentially says; for all expressions of the form: *p, to build a kleene parser. Elements is a Boost.Fusion sequence. For the kleene, which is a unary operator, expect only one element in the sequence. That element is the subject of the kleene.

We still don't care about the Modifiers. We'll see how the modifiers is all about when we get to deep directives.

Boost C++ Libraries

Parsers in Depth

Primitive Parsers

Composite Parsers