Boost C++ Libraries

...one of the most highly regarded and expertly designed C++ library projects in the world. Herb Sutter and Andrei Alexandrescu, C++ Coding Standards


The Static Lexer Model

The documentation of Spirit.Lex so far has mostly described the features of the dynamic model, where the tables needed for lexical analysis are generated from the regular expressions at runtime. The big advantage of the dynamic model is its flexibility and its integration with the Spirit library and the C++ host language. Its big disadvantage is the additional runtime spent generating the tables, which can be a limitation especially for larger lexical analyzers. The static model builds upon the smooth integration with Spirit and C++, and reuses large parts of the Spirit.Lex library as described so far, while avoiding the additional runtime requirements by using pre-generated tables and tokenizer routines. To make the code generation as simple as possible, it is possible to reuse the token definition types developed using the dynamic model without any changes. As will be shown in this section, building a code generator based on an existing token definition type is a matter of writing three lines of code.

Assuming you already built a dynamic lexer for your problem, there are two more steps needed to create a static lexical analyzer using Spirit.Lex:

  1. generating the C++ code for the static analyzer (including the tokenization function and corresponding tables), and
  2. modifying the dynamic lexical analyzer to use the generated code.

Both steps are described in more detail in the two sections below (for the full source code used in this example see the code here: the common token definition, the code generator, the generated code, and the static lexical analyzer).

But first we provide the code snippets needed to understand the descriptions that follow. Both the definition of the token identifiers and the token definition class used in this example are placed into a separate header file to make them available to the code generator and to the static lexical analyzer.

enum tokenids 
{
    IDANY = boost::spirit::lex::min_token_id + 1,
};

The important point here is that the token definition class is no different from a similar class used for a dynamic lexical analyzer. The library has been designed so that all components (dynamic lexical analyzer, code generator, and static lexical analyzer) can reuse the very same token definition syntax.

// This token definition class can be used without any change for all three
// possible use cases: a dynamic lexical analyzer, a code generator, and a
// static lexical analyzer.
template <typename BaseLexer>
struct word_count_tokens : boost::spirit::lex::lexer_def<BaseLexer> 
{
    template <typename Self>
    void def (Self& self)
    {
        // define tokens and associate them with the lexer
        word = "[^ \t\n]+";
        self = word | '\n' | token_def<>(".", IDANY);
    }
    
    boost::spirit::lex::token_def<std::string> word;
};

The only thing that changes between the three use cases is the template parameter used to instantiate a concrete token definition. For the dynamic model and the code generator you will probably use the lexertl_lexer<> template, whereas for the static model you will use the lexertl_static_lexer<> type as the template parameter.

This example not only shows how to build a static lexer, it additionally demonstrates how such a lexer can be used for parsing in conjunction with a Spirit.Qi grammar. For completeness, we provide the simple grammar used in this example. As you can see, this grammar has no dependencies on the static lexical analyzer, and for this reason it is no different from a grammar used either without a lexer or with a dynamic lexical analyzer as described before.

//  This is an ordinary grammar definition following the rules defined by 
//  Spirit.Qi. There is nothing specific about it, except it gets the token
//  definition class instance passed to the constructor to allow accessing the
//  embedded token_def<> instances.
template <typename Iterator>
struct word_count_grammar : grammar<Iterator>
{
    template <typename TokenDef>
    word_count_grammar(TokenDef const& tok)
      : grammar<Iterator>(start), c(0), w(0), l(0)
    {
        using boost::spirit::arg_names::_1;
        using boost::phoenix::ref;
        using boost::phoenix::size;
        
        //  associate the defined tokens with the lexer, at the same time 
        //  defining the actions to be executed 
        start =  *(   tok.word      [++ref(w), ref(c) += size(_1)]
                  |   char_('\n')   [++ref(l), ++ref(c)] 
                  |   token(IDANY)  [++ref(c)]
                  )
              ;
    }

    std::size_t c, w, l;      // counter for characters, words, and lines
    rule<Iterator> start;
};

Generating the Static Analyzer

The first additional step in creating a static lexical analyzer is to write a small standalone program that generates the lexer tables and the corresponding tokenization function. For this purpose the Spirit.Lex library exposes a special API - the function generate_static(). It implements the whole code generator; no further code is needed. All it takes to invoke this function is a token definition instance, an output stream to generate the code to, and an optional string to be used as a prefix for the name of the generated function. All in all, just a couple of lines of code.

int main(int argc, char* argv[])
{
    // create the lexer object instance needed to invoke the generator
    word_count_tokens<lexertl_lexer<> > word_count; // the token definition

    // open the output file, where the generated tokenizer function will be 
    // written to
    std::ofstream out(argc < 2 ? "word_count_static.hpp" : argv[1]);

    // invoke the generator, passing the token definition, the output stream 
    // and the name prefix of the tokenizing function to be generated
    char const* function_name = (argc < 3 ? "" : argv[2]);
    return generate_static(make_lexer(word_count), out, function_name) ? 0 : -1;
}

The code generator shown above will produce output that should be stored in a file for later inclusion into the static lexical analyzer, as shown in the next topic (the full generated code can be viewed here).
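To make the pre-generated tables and tokenizer function available, the file written by the generator has to be included in the program building the static analyzer. Assuming the default output file name used by the generator above (the actual name is whatever you passed on the command line), this amounts to a single directive:

```cpp
// pull in the pre-generated lexer tables and tokenizer function
// (file name as produced by the code generator in the previous step)
#include "word_count_static.hpp"
```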

Modifying the Dynamic Analyzer

The second step in converting an existing dynamic lexer into a static one is to change your main program in two places. First, you need to change the type of the lexer used (that is, the template parameter used while instantiating your token definition class). While in the dynamic model we have been using the lexertl_lexer<> template, we now need to change that to the lexertl_static_lexer<> type. The second change is tightly related to the first one and involves changing the corresponding #include statement to:

#include <boost/spirit/include/lex_lexer_static_lexertl.hpp>

Otherwise the main program is no different from an equivalent program using the dynamic model. This makes it easy, for instance, to develop the lexer in dynamic mode and to switch to the static mode after the code has stabilized. The simple generator application shown above enables the integration of the code generator into any existing build process. The following code snippet provides the overall main function, highlighting the code to be changed.
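Because the generator is an ordinary program, integrating it into a build boils down to compiling and running it before compiling the static analyzer itself. A hypothetical command sequence (the compiler invocation, include path, and file names are placeholders, not part of the library) might look like:

```shell
# compile the standalone code generator (uses the dynamic model only)
c++ -I /path/to/boost -o word_count_generate word_count_generate.cpp

# run it to (re)create the pre-generated tables and tokenizer function;
# the first argument is the output file name (second, optional argument
# would be the prefix for the generated function name)
./word_count_generate word_count_static.hpp

# now compile the main program, which includes the generated header
c++ -I /path/to/boost -o word_count_static word_count_static.cpp
```

In a make-based build, the generated header would simply be declared as depending on the generator binary, so it is regenerated whenever the token definitions change.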

int main(int argc, char* argv[])
{
    // Define the token type to be used: 'std::string' is available as the type 
    // of the token value.
    typedef lexertl_token<
        char const*, boost::mpl::vector<std::string>
    > token_type;

    // Define the lexer type to be used as the base class for our token 
    // definition.
    //
    // This is the only place where the code is different from an equivalent
    // dynamic lexical analyzer. We use the `lexertl_static_lexer<>` instead of
    // the `lexertl_lexer<>` as the base class for our token definition type.
    //
    typedef lexertl_static_lexer<token_type> lexer_type;
    
    // Define the iterator type exposed by the lexer.
    typedef lexer_iterator<word_count_tokens<lexer_type> >::type iterator_type;

    // Now we use the types defined above to create the lexer and grammar
    // object instances needed to invoke the parsing process.
    word_count_tokens<lexer_type> word_count;           // Our token definition
    word_count_grammar<iterator_type> g (word_count);   // Our grammar definition

    // Read in the file into memory.
    std::string str (read_from_file(1 == argc ? "word_count.input" : argv[1]));
    char const* first = str.c_str();
    char const* last = &first[str.size()];
    
    // Parsing is done based on the token stream, not the character stream.
    bool r = tokenize_and_parse(first, last, make_lexer(word_count), g);

    if (r) {    // success
        std::cout << "lines: " << g.l << ", words: " << g.w 
                  << ", characters: " << g.c << "\n";
    }
    else {
        std::string rest(first, last);
        std::cerr << "Parsing failed\n" << "stopped at: \"" 
                  << rest << "\"\n";
    }
    return 0;
}

