Tokenizing Input Data

The tokenize function

The tokenize() function is a helper that simplifies using a lexer in a standalone fashion. For instance, you may have a standalone lexer where all functional requirements are implemented inside lexer semantic actions. A good example of this is the word_count_lexer described in more detail in the section Lex Quickstart 2 - A better word counter using Spirit.Lex.

template <typename Lexer>
struct word_count_tokens : lex::lexer<Lexer>
{
    word_count_tokens()
      : c(0), w(0), l(0)
      , word("[^ \t\n]+")     // define tokens
      , eol("\n")
      , any(".")
    {
        using boost::spirit::lex::_start;
        using boost::spirit::lex::_end;
        using boost::phoenix::ref;

        // associate tokens with the lexer
        this->self
            =   word  [++ref(w), ref(c) += distance(_start, _end)]
            |   eol   [++ref(c), ++ref(l)]
            |   any   [++ref(c)]
            ;
    }

    std::size_t c, w, l;
    lex::token_def<> word, eol, any;
};

Tokenizing the given input while discarding all generated tokens is a common application of the lexer. For this reason Spirit.Lex exposes an API function, tokenize(), that minimizes the code required:

// Read input from the given file
std::string str (read_from_file(1 == argc ? "word_count.input" : argv[1]));

word_count_tokens<lexer_type> word_count_lexer;
std::string::iterator first = str.begin();

// Tokenize all the input, while discarding all generated tokens
bool r = tokenize(first, str.end(), word_count_lexer);

This code is completely equivalent to the more verbose version shown in the section Lex Quickstart 2 - A better word counter using Spirit.Lex. The function tokenize() returns either when the end of the input has been reached (in this case the return value is true), or when the lexer could not match any of the token definitions in the input (in this case the return value is false and the iterator first points to the first unmatched character in the input sequence).
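The return value and the updated iterator together describe where tokenization stopped. The following is a minimal sketch of how a caller might package that information for diagnostics. The names tokenize_report and make_report are hypothetical helpers introduced only for illustration; they are not part of Spirit.Lex and use only the standard library.

```cpp
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical helper type (not part of Spirit.Lex): summarizes the
// outcome of a tokenize() call.
struct tokenize_report
{
    bool ok;             // the value returned by tokenize()
    std::size_t offset;  // distance from the start of the input to 'first'
    std::string rest;    // unmatched remainder, empty on success
};

// begin: start of the original input
// first: the iterator as updated by tokenize()
// last:  end of the input
template <typename Iterator>
tokenize_report make_report(bool ok, Iterator begin, Iterator first,
                            Iterator last)
{
    tokenize_report r;
    r.ok = ok;
    r.offset = static_cast<std::size_t>(std::distance(begin, first));
    r.rest = ok ? std::string() : std::string(first, last);
    return r;
}
```

With the snippet above, such a helper could be called as make_report(r, str.begin(), first, str.end()) once tokenize() has returned.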

The prototype of this function is:

template <typename Iterator, typename Lexer>
bool tokenize(Iterator& first, Iterator last, Lexer const& lex
  , typename Lexer::char_type const* initial_state = 0);

where:

Iterator& first: The beginning of the input sequence to tokenize. The lexer updates this iterator; after the function returns it points to the first unmatched character of the input.
Iterator last: The end of the input sequence to tokenize.
Lexer const& lex: The lexer instance to use for tokenization.
Lexer::char_type const* initial_state: This optional parameter can be used to specify the initial lexer state for tokenization.

A second overload of the tokenize() function allows specifying an arbitrary function or function object to be called for each of the generated tokens. For some applications this is very useful, as it may avoid the need for lexer semantic actions. For an example of how to use this function, please have a look at word_count_functor.cpp:

The main function simply loads the given file into memory (as a std::string), creates an instance of the token definition template for the lexer type in use (word_count_tokens<lex::lexertl::lexer<> >), and finally calls lex::tokenize(), passing an instance of the counter function object. The return value of lex::tokenize() will be true if the whole input sequence has been successfully tokenized, and false otherwise.

int main(int argc, char* argv[])
{
    // these variables are used to count characters, words and lines
    std::size_t c = 0, w = 0, l = 0;

    // read input from the given file
    std::string str (read_from_file(1 == argc ? "word_count.input" : argv[1]));

    // create the token definition instance needed to invoke the lexical analyzer
    word_count_tokens<lex::lexertl::lexer<> > word_count_functor;

    // tokenize the given string, the bound functor gets invoked for each of 
    // the matched tokens
    char const* first = str.c_str();
    char const* last = &first[str.size()];
    bool r = lex::tokenize(first, last, word_count_functor,
        boost::bind(counter(), _1, boost::ref(c), boost::ref(w), boost::ref(l)));

    // print results
    if (r) {
        std::cout << "lines: " << l << ", words: " << w
                  << ", characters: " << c << "\n";
    }
    else {
        std::string rest(first, last);
        std::cout << "Lexical analysis failed\n" << "stopped at: \""
                  << rest << "\"\n";
    }
    return 0;
}
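The counter function object bound in the call above is not shown in this section. A minimal sketch of what such a functor might look like follows. It assumes that the token definitions used with this overload assign the ids ID_WORD, ID_EOL and ID_CHAR (hypothetical names here), and that the token's value() yields the matched character range, as it would for a default token type.

```cpp
#include <cstddef>

// Hypothetical token ids; they must match whatever ids the token
// definitions passed to tokenize() actually assign.
enum token_ids
{
    ID_WORD = 1000,
    ID_EOL,
    ID_CHAR
};

struct counter
{
    // boost::bind needs to know the return type of a function object
    typedef bool result_type;

    // Called once for each matched token; c, w and l refer to the
    // character, word and line counters owned by the caller.
    template <typename Token>
    bool operator()(Token const& t, std::size_t& c, std::size_t& w,
                    std::size_t& l) const
    {
        switch (t.id()) {
        case ID_WORD:       // matched a word
            ++w;
            c += t.value().size();  // assumes value() is the matched range
            break;
        case ID_EOL:        // matched a newline character
            ++l;
            ++c;
            break;
        case ID_CHAR:       // matched any other single character
            ++c;
            break;
        }
        return true;        // always continue tokenizing
    }
};
```

Because the functor receives the counters by reference, the call site passes them wrapped in boost::ref, as the main function above does.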

Here is the prototype of this tokenize() function overload:

template <typename Iterator, typename Lexer, typename F>
bool tokenize(Iterator& first, Iterator last, Lexer const& lex, F f
  , typename Lexer::char_type const* initial_state = 0);

where:

Iterator& first: The beginning of the input sequence to tokenize. The lexer updates this iterator; after the function returns it points to the first unmatched character of the input.
Iterator last: The end of the input sequence to tokenize.
Lexer const& lex: The lexer instance to use for tokenization.
F f: A function or function object to be called for each matched token. This function is expected to have the prototype: bool f(Lexer::token_type);. The tokenize() function will return immediately if f returns false.
Lexer::char_type const* initial_state: This optional parameter can be used to specify the initial lexer state for tokenization.
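Since tokenize() stops as soon as f returns false, a functor can be used to end tokenization early. The sketch below shows one way this might be done: a function object that accepts a fixed number of tokens and then asks the lexer to stop. The name token_limit is hypothetical and the type is not part of Spirit.Lex. Because f is passed by value, the sketch keeps its running count in a variable owned by the caller.

```cpp
#include <cstddef>

// Hypothetical functor (not part of Spirit.Lex): lets at most 'limit'
// tokens through, then returns false to request that tokenization stop.
struct token_limit
{
    typedef bool result_type;

    // 'seen' is owned by the caller, so the count survives copies of
    // the functor.
    token_limit(std::size_t limit, std::size_t& seen)
      : limit_(limit), seen_(&seen)
    {}

    template <typename Token>
    bool operator()(Token const&) const
    {
        if (*seen_ >= limit_)
            return false;   // limit reached: stop tokenizing
        ++*seen_;
        return true;        // keep going
    }

    std::size_t limit_;
    std::size_t* seen_;
};
```

An instance such as token_limit(100, seen) could then be passed as the f argument in place of the bound counter shown earlier.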
