Quickstart 1 - A word counter using Spirit.Lex

Spirit.Lex is very modular, which follows the general building principle of the Spirit libraries. You never pay for features you don't use. It is nicely integrated with the other parts of Spirit but nevertheless can be used separately to build stand alone lexical analyzers. The first quick start example describes a stand alone application: counting characters, words, and lines in a file, very similar to what the well known Unix command wc is doing (for the full example code see here: word_count_functor.cpp).

Prerequisites

The only required #include specific to Spirit.Lex follows. It is a wrapper for all necessary definitions to use Spirit.Lex in a stand alone fashion, and on top of the Lexertl library. Additionally we #include two of the Boost headers to define boost::bind() and boost::ref().

#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/bind.hpp>
#include <boost/ref.hpp>

To make all the code below more readable we introduce the following namespaces.

namespace lex = boost::spirit::lex;

Defining Tokens

The most important step while creating a lexer using Spirit.Lex is to define the tokens to be recognized in the input sequence. This is normally done by defining the regular expressions describing the matching character sequences, and optionally their corresponding token ids. Additionally the defined tokens need to be associated with an instance of a lexer object as provided by the library. The following code snippet shows how this can be done using Spirit.Lex.

The template word_count_tokens defines three different tokens: ID_WORD, ID_EOL, and ID_CHAR, representing a word (anything except a whitespace or a newline), a newline character, and any other character (ID_WORD, ID_EOL, and ID_CHAR are enum values representing the token ids, but could be anything else convertible to an integer as well). The direct base class of any token definition class needs to be the template lex::lexer<>, where the corresponding template parameter (here: lex::lexertl::lexer<BaseIterator>) defines which underlying lexer engine has to be used to provide the required state machine functionality. In this example we use the Lexertl based lexer engine as the underlying lexer type.

template <typename Lexer>
struct word_count_tokens : lex::lexer<Lexer>
{
    word_count_tokens()
    {
        // define tokens (the regular expression to match and the corresponding
        // token id) and add them to the lexer 
        this->self.add
            ("[^ \t\n]+", ID_WORD) // words (anything except ' ', '\t' or '\n')
            ("\n", ID_EOL)         // newline characters
            (".", ID_CHAR)         // anything else is a plain character
        ;
    }
};

Doing the Useful Work

We will use a setup, where we want the Spirit.Lex library to invoke a given function after any of of the generated tokens is recognized. For this reason we need to implement a functor taking at least the generated token as an argument and returning a boolean value allowing to stop the tokenization process. The default token type used in this example carries a token value of the type boost::iterator_range<BaseIterator> pointing to the matched range in the underlying input sequence.

In this example the struct 'counter' is used as a functor counting the characters, words and lines in the analyzed input sequence by identifying the matched tokens as passed from the Spirit.Lex library.

struct counter
{
    // the function operator gets called for each of the matched tokens
    // c, l, w are references to the counters used to keep track of the numbers
    template <typename Token>
    bool operator()(Token const& t, std::size_t& c, std::size_t& w, std::size_t& l) const
    {
        switch (t.id()) {
        case ID_WORD:       // matched a word
        // since we're using a default token type in this example, every 
        // token instance contains a `iterator_range<BaseIterator>` as its token
        // attribute pointing to the matched character sequence in the input 
            ++w; c += t.value().size();
            break;
        case ID_EOL:        // matched a newline character
            ++l; ++c;
            break;
        case ID_CHAR:       // matched something else
            ++c;
            break;
        }
        return true;        // always continue to tokenize
    }
};

All what is left is to write some boilerplate code helping to tie together the pieces described so far. To simplify this example we call the lex::tokenize() function implemented in Spirit.Lex (for a more detailed description of this function see here: FIXME), even if we could have written a loop to iterate over the lexer iterators [first, last) as well.

Pulling Everything Together

The main function simply loads the given file into memory (as a std::string), instantiates an instance of the token definition template using the correct iterator type (word_count_tokens<char const*>), and finally calls lex::tokenize, passing an instance of the counter function object. The return value of lex::tokenize() will be true if the whole input sequence has been successfully tokenized, and false otherwise.

int main(int argc, char* argv[])
{
    // these variables are used to count characters, words and lines
    std::size_t c = 0, w = 0, l = 0;

    // read input from the given file
    std::string str (read_from_file(1 == argc ? "word_count.input" : argv[1]));

    // create the token definition instance needed to invoke the lexical analyzer
    word_count_tokens<lex::lexertl::lexer<> > word_count_functor;

    // tokenize the given string, the bound functor gets invoked for each of 
    // the matched tokens
    char const* first = str.c_str();
    char const* last = &first[str.size()];
    bool r = lex::tokenize(first, last, word_count_functor,
        boost::bind(counter(), _1, boost::ref(c), boost::ref(w), boost::ref(l)));

    // print results
    if (r) {
        std::cout << "lines: " << l << ", words: " << w
                  << ", characters: " << c << "\n";
    }
    else {
        std::string rest(first, last);
        std::cout << "Lexical analysis failed\n" << "stopped at: \""
                  << rest << "\"\n";
    }
    return 0;
}

Comparing Spirit.Lex with Flex

This example was deliberately chosen to be as much as possible similar to the equivalent Flex program (see below), which isn't too different from what has to be written when using Spirit.Lex.

	Note
	Interestingly enough, performance comparisons of lexical analyzers written using Spirit.Lex with equivalent programs generated by Flex show that both have comparable execution speeds! Generally, thanks to the highly optimized Lexertl library and due its carefully designed integration with Spirit the abstraction penalty to be paid for using Spirit.Lex is negligible.

The remaining examples in this tutorial will use more sophisticated features of Spirit.Lex, mainly to allow further simplification of the code to be written, while maintaining the similarity with corresponding features of Flex. Spirit.Lex has been designed to be as similar to Flex as possible. That is why this documentation will provide the corresponding Flex code for the shown Spirit.Lex examples almost everywhere. So consequently, here is the Flex code corresponding to the example as shown above.

%{
    #define ID_WORD 1000
    #define ID_EOL  1001
    #define ID_CHAR 1002
    int c = 0, w = 0, l = 0;
%}
%%
[^ \t\n]+  { return ID_WORD; }
\n         { return ID_EOL; }
.          { return ID_CHAR; }
%%
bool count(int tok)
{
    switch (tok) {
    case ID_WORD: ++w; c += yyleng; break;
    case ID_EOL:  ++l; ++c; break;
    case ID_CHAR: ++c; break;
    default:
        return false;
    }
    return true;
}
void main()
{
    int tok = EOF;
    do {
        tok = yylex();
        if (!count(tok))
            break;
    } while (EOF != tok);
    printf("%d %d %d\n", l, w, c);
}