Boost C++ Libraries

...one of the most highly regarded and expertly designed C++ library projects in the world. Herb Sutter and Andrei Alexandrescu, C++ Coding Standards

PrevUpHomeNext

Quickstart 3 - Counting Words Using a Parser

The whole purpose of integrating Spirit.Lex as part of the Spirit library was to add a library allowing the merger of lexical analysis with the parsing process as defined by a Spirit grammar. Spirit parsers read their input from an input sequence accessed by iterators. So naturally, we chose iterators to be used as the interface between the lexer and the parser. A second goal of the lexer/parser integration was to enable the usage of different lexical analyzer libraries. The utilization of iterators seemed to be the right choice from this standpoint as well, mainly because these can be used as an abstraction layer hiding implementation specifics of the used lexer library. The picture below shows the common flow control implemented while parsing combined with lexical analysis.

Figure 7. The common flow control implemented while parsing combined with lexical analysis

The common flow control implemented while parsing combined with lexical analysis


Another problem related to the integration of the lexical analyzer with the parser was to find a way how the defined tokens syntactically could be blended with the grammar definition syntax of Spirit. For tokens defined as instances of the token_def<> class the most natural way of integration was to allow to directly use these as parser components. Semantically these parser components succeed matching their input whenever the corresponding token type has been matched by the lexer. This quick start example will demonstrate this (and more) by counting words again, simply by adding up the numbers inside of semantic actions of a parser (for the full example code see here: word_count.cpp).

Prerequisites

This example uses two of the Spirit library components: Spirit.Lex and Spirit.Qi, consequently we have to #include the corresponding header files. Again, we need to include a couple of header files from the Boost.Phoenix library. This example shows how to attach functors to parser components, which could be done using any type of C++ technique resulting in a callable object. Using Boost.Phoenix for this task simplifies things and avoids adding dependencies to other libraries (Boost.Phoenix is already in use for Spirit anyway).

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <boost/spirit/include/phoenix_statement.hpp>
#include <boost/spirit/include/phoenix_container.hpp>

To make all the code below more readable we introduce the following namespaces.

using namespace boost::spirit;
using namespace boost::spirit::ascii;

Defining Tokens

If compared to the two previous quick start examples (Lex Quickstart 1 - A word counter using Spirit.Lex and Lex Quickstart 2 - A better word counter using Spirit.Lex) the token definition class for this example does not reveal any surprises. However, it uses lexer token definition macros to simplify the composition of the regular expressions, which will be described in more detail in the section FIXME. Generally, any token definition is usable without modification from either a stand alone lexical analyzer or in conjunction with a parser.

template <typename Lexer>
struct word_count_tokens : lex::lexer<Lexer>
{
    word_count_tokens()
    {
        // define patterns (lexer macros) to be used during token definition 
        // below
        this->self.add_pattern
            ("WORD", "[^ \t\n]+")
        ;

        // define tokens and associate them with the lexer
        word = "{WORD}";    // reference the pattern 'WORD' as defined above

        // this lexer will recognize 3 token types: words, newlines, and 
        // everything else
        this->self.add
            (word)          // no token id is needed here
            ('\n')          // characters are usable as tokens as well
            (".", IDANY)    // string literals will not be esacped by the library
        ;
    }

    // the token 'word' exposes the matched string as its parser attribute
    lex::token_def<std::string> word;
};

Using Token Definition Instances as Parsers

While the integration of lexer and parser in the control flow is achieved by using special iterators wrapping the lexical analyzer, we still need a means of expressing in the grammar what tokens to match and where. The token definition class above uses three different ways of defining a token:

All three token definition methods require a different method of grammar integration. But as you can see from the following code snippet, each of these methods are straightforward and blend the corresponding token instances naturally with the surrounding Spirit.Qi grammar syntax.

Token definition

Parser integration

token_def<>

The token_def<> instance is directly usable as a parser component. Parsing of this component will succeed if the regular expression used to define this has been matched successfully.

single character

The single character is directly usable in the grammar. However, under certain circumstances it needs to be wrapped by a char_() parser component. Parsing of this component will succeed if the single character has been matched.

explicit token id

To use an explicit token id in a Spirit.Qi grammar you are required to wrap it with the special token() parser component. Parsing of this component will succeed if the current token has the same token id as specified in the expression token(<id>).

The grammar definition below uses each of the three types demonstrating their usage.

template <typename Iterator>
struct word_count_grammar : qi::grammar<Iterator>
{
    template <typename TokenDef>
    word_count_grammar(TokenDef const& tok)
      : word_count_grammar::base_type(start)
      , c(0), w(0), l(0)
    {
        using boost::phoenix::ref;
        using boost::phoenix::size;

        start =  *(   tok.word          [++ref(w), ref(c) += size(_1)]
                  |   lit('\n')         [++ref(c), ++ref(l)]
                  |   qi::token(IDANY)  [++ref(c)]
                  )
              ;
    }

    std::size_t c, w, l;
    qi::rule<Iterator> start;
};

As already described (see: Attributes), the Spirit.Qi parser library builds upon a set of of fully attributed parser components. Consequently, all token definitions support this attribute model as well. The most natural way of implementing this was to use the token values as the attributes exposed by the parser component corresponding to the token definition (you can read more about this topic here: About Tokens and Token Values). The example above takes advantage of the full integration of the token values as the token_def<>'s parser attributes: the word token definition is declared as a token_def<std::string>, making every instance of a word token carry the string representation of the matched input sequence as its value. The semantic action attached to tok.word receives this string (represented by the _1 placeholder) and uses it to calculate the number of matched characters: ref(c) += size(_1).

Pulling Everything Together

The main function needs to implement a bit more logic now as we have to initialize and start not only the lexical analysis but the parsing process as well. The three type definitions (typedef statements) simplify the creation of the lexical analyzer and the grammar. After reading the contents of the given file into memory it calls the function tokenize_and_parse() to initialize the lexical analysis and parsing processes.

int main(int argc, char* argv[])
{
1  typedef lex::lexertl::token<
        char const*, boost::mpl::vector<std::string>
    > token_type;

2  typedef lex::lexertl::lexer<token_type> lexer_type;

3  typedef word_count_tokens<lexer_type>::iterator_type iterator_type;

    // now we use the types defined above to create the lexer and grammar
    // object instances needed to invoke the parsing process
    word_count_tokens<lexer_type> word_count;          // Our lexer
    word_count_grammar<iterator_type> g (word_count);  // Our parser 

    // read in the file int memory
    std::string str (read_from_file(1 == argc ? "word_count.input" : argv[1]));
    char const* first = str.c_str();
    char const* last = &first[str.size()];

4  bool r = lex::tokenize_and_parse(first, last, word_count, g);

    if (r) {
        std::cout << "lines: " << g.l << ", words: " << g.w
                  << ", characters: " << g.c << "\n";
    }
    else {
        std::string rest(first, last);
        std::cerr << "Parsing failed\n" << "stopped at: \""
                  << rest << "\"\n";
    }
    return 0;
}

1

Define the token type to be used: std::string is available as the type of the token attribute

2

Define the lexer type to use implementing the state machine

3

Define the iterator type exposed by the lexer type

4

Parsing is done based on the the token stream, not the character stream read from the input. The function tokenize_and_parse() wraps the passed iterator range [first, last) by the lexical analyzer and uses its exposed iterators to parse the toke stream.


PrevUpHomeNext