Lexer Primitives

As already discussed, lexical scanning is the process of analyzing the stream of input characters and separating it into strings called tokens, most of the time separated by whitespace. The different token types recognized by a lexical analyzer often get assigned unique integer token identifiers (token ids). These token ids are normally used by the parser to identify the current token without having to look at the matched string again. The Spirit.Lex library is no different in this respect, as it uses token ids as the main means of identifying the different token types defined for a particular lexical analyzer. However, it differs from commonly used lexical analyzers in that it returns (references to) instances of a (user defined) token class to the user. Consequently, the only real limitation posed on this token class is that it has to carry at least the token id of the token it represents. For more information about the interface a user defined token type has to expose, please see the Token Reference. The library provides a default token type based on the Lexertl library, the lexertl_token<> type, which should be sufficient in most use cases. This section focuses on the general features a token class may implement and how these integrate with the other parts of the Spirit.Lex library.
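As a point of reference, token ids are plain integers, typically defined by the user as an enumeration. The following is a minimal sketch, not taken from the library's documentation: the enumerator names are invented for illustration, and it assumes the Spirit.Lex headers are included and that lex::min_token_id denotes the first token id not reserved by the library for internal purposes.

namespace lex = boost::spirit::lex;

// Hypothetical token ids for the token types used throughout this section.
enum token_ids
{
    ID_IF = lex::min_token_id + 1,  // first id not used internally
    ID_ELSE,
    ID_WHILE,
    ID_IDENTIFIER,
    ID_CONSTANT
};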

The Anatomy of a Token

It is very important to understand the difference between a token definition (represented by the token_def<> template) and a token itself (for instance represented by the lexertl_token<> template).

The token definition is used to describe the main features of a particular token type, especially:

  • to simplify the definition of a token type using a regular expression pattern applied while matching this token type,
  • to associate a token type with a particular lexer state,
  • to optionally assign a token id to a token type,
  • to optionally associate some code to execute whenever an instance of this token type has been matched,
  • and to optionally specify the attribute type of the token value.
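To make these points more concrete, here is a hedged sketch of token definitions embedded in a lexer definition class. It is modeled on the lexer_def<> based examples shipped with this version of the library; the class name, the "WS" state, and ID_IF (from the sketch above) are assumptions for illustration only.

// a minimal lexer definition illustrating the features listed above
template <typename Lexer>
struct example_tokens : lexer_def<Lexer>
{
    template <typename Self>
    void def (Self& self)
    {
        // the regular expression pattern applied while matching this token
        identifier = "[a-zA-Z_][a-zA-Z0-9_]*";

        // associate the token definitions with the default lexer state; the
        // second token_def<> constructor argument assigns an explicit token id
        self = identifier | token_def<>("if", ID_IF);

        // associate a token definition with a different lexer state
        self("WS") = token_def<>("[ \t\n]+");
    }

    // the token value of 'identifier' is exposed as a std::string
    token_def<std::string> identifier;
};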

The token itself is a data structure returned by the lexer iterators. Dereferencing a lexer iterator returns a reference to the last matched token instance, which encapsulates the part of the underlying input sequence matched by the regular expression used during the definition of this token type. Incrementing the lexer iterator invokes the lexical analyzer to match the next token by advancing the underlying input stream. The token data structure contains at least the token id of the matched token type, making it possible to identify the matched character sequence. Optionally, the token instance may contain a token value and/or the lexer state this token instance was matched in. The following figure shows the schematic structure of a token.

Figure 5. The structure of a token

The token value and the token state may be omitted for optimization reasons, preventing the token from carrying more data than actually required. This configuration can be achieved by supplying appropriate template parameters for the lexertl_token<> template while defining the token type.

The lexer iterator returns the same token type for each of the different matched token definitions. To accommodate the possibly different token value types exposed by the various token types (token definitions), the general type of the token value is a boost::variant<>. As a minimum (for the default configuration), this token value variant will be configured to always hold a boost::iterator_range<> containing the pair of iterators pointing to the matched input sequence for this token instance.

[Note] Note

If the lexical analyzer is used in conjunction with a Spirit.Qi parser, the stored boost::iterator_range<> token value will be converted to the requested token type (parser attribute) exactly once. This happens at the time of the first access to the token value requiring the corresponding type conversion. The converted token value will be stored in the boost::variant<>, replacing the initially stored iterator range. This avoids converting the input sequence to the token value more than once, thus optimizing the integration of the lexer with Spirit.Qi, even during parser backtracking.
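For illustration, the stored value can also be inspected directly. The following hedged sketch assumes tok is a token instance in the default configuration (obtained by dereferencing a lexer iterator) whose value has not yet been converted, and that the token exposes its value via value() as described in the Token Reference.

// the default token value type is an iterator_range over the input
typedef boost::iterator_range<char const*> range_type;

// before the first conversion the variant holds the matched range ...
range_type r = boost::get<range_type>(tok.value());

// ... from which the matched text may be materialized if needed
std::string text (r.begin(), r.end());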

Here is the template prototype of the lexertl_token<> template:

template <
    typename Iterator = char const*, 
    typename AttributeTypes = mpl::vector0<>, 
    typename HasState = mpl::true_
>
struct lexertl_token;

where:

Iterator

This is the type of the iterator used to access the underlying input stream. It defaults to a plain char const*.

AttributeTypes

This is either an mpl sequence containing all attribute types used for the token definitions, or the type omitted. If the mpl sequence is empty (which is the default), all token instances will store a boost::iterator_range<Iterator> pointing to the start and the end of the matched section in the input stream. If the type omitted is given, the generated tokens will contain no token value (attribute) at all.

HasState

This is either mpl::true_ or mpl::false_, controlling whether the generated token instances will contain the lexer state they were generated in. The default is mpl::true_, so all token instances will contain the lexer state.
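To illustrate, the following hedged typedefs show the effect of the different template parameters; base_iterator_type stands for whatever iterator type is used to access the input, and the typedef names are assumptions.

// default configuration: char const* input, the token value is an
// iterator_range<char const*>, tokens carry the lexer state
typedef lexertl_token<> default_token_type;

// token values may be either an unsigned int or a std::string
typedef lexertl_token<
    base_iterator_type, boost::mpl::vector<unsigned int, std::string>
> valued_token_type;

// smallest possible token: no token value and no lexer state
typedef lexertl_token<
    base_iterator_type, omitted, boost::mpl::false_
> minimal_token_type;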

Normally, during its construction, a token instance always holds the boost::iterator_range<> as its token value (except if it has been defined using the omitted token value type). This iterator range is then converted in place to the requested token value type (attribute) when it is requested for the first time.

The Physiognomy of a Token Definition

The token definitions (represented by the token_def<> template) are normally used as part of the definition of the lexical analyzer. At the same time a token definition instance may be used as a parser component in Spirit.Qi.

The template prototype of this class is shown here:

template<
    typename Attribute = unused_type, 
    typename Char = char
>
class token_def;

where:

Attribute

This is the type of the token value (attribute) supported by token instances representing this token type. This attribute type is exposed to the Spirit.Qi library whenever this token definition is used as a parser component. The default attribute type is unused_type, which means the token instance holds a boost::iterator_range<> pointing to the start and the end of the matched section in the input stream. If the attribute is omitted, the token instance will expose no token value at all. Any other type will be used directly as the token value type.

Char

This is the value type of the iterator for the underlying input sequence. It defaults to char.
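A few hedged example declarations illustrating these two parameters (the member names are invented for this sketch):

// default: the token value is the matched iterator_range
token_def<> comma;

// the token instance will expose no token value at all
token_def<omitted> white_space;

// the token value is converted to a double on first access
token_def<double> real_number;

// a token definition matching a wide character input sequence
token_def<std::wstring, wchar_t> wide_identifier;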

The semantics of the template parameters for the token type and the token definition type are very similar and interdependent. As a rule of thumb, you can think of the token definition type as the means of specifying everything related to a single specific token type (such as identifier or integer). The token type, on the other hand, is used to define the general properties of all token instances generated by the Spirit.Lex library.

[Important] Important

If you don't list any token value types in the token type definition declaration (resulting in the usage of the default boost::iterator_range<> token type), everything will compile and work just fine, but it will be a bit less efficient. This is because the token value will be converted from the matched input sequence every time it is requested.

But as soon as you specify at least one token value type while defining the token type, you'll have to list all value types used for token_def<> declarations in the token definition class, otherwise compilation errors will occur.

Examples of using lexertl_token<>

Let's start with some examples. We refer to one of the Spirit.Lex examples (for the full source code of this example please see example4.cpp).

The first code snippet shows an excerpt of the token definition class: the definition of a couple of token types. Some of the token types do not expose a special token value (if_, else_, and while_); their token value will always hold only the iterator range of the matched input sequence. The token definitions for the identifier and the integer constant are specialized to expose an explicit token value type each: std::string and unsigned int, respectively.

// these tokens expose the iterator_range of the matched input sequence
token_def<> if_, else_, while_;

// The following two tokens have an associated value type, 'identifier'
// carries a string (the identifier name) and 'constant' carries the 
// matched integer value.
//
// Note: any token value type specified explicitly during a token_def<>
//       declaration needs to be listed during token type definition as 
//       well (see the typedef for the token_type below).
//
// The conversion of the matched input to an instance of this type occurs
// once (on first access), which makes token values as efficient as 
// possible. Moreover, token instances are constructed once by the lexer
// library. From this point on tokens are passed by reference only, 
// avoiding tokens being copied around.
token_def<std::string> identifier;
token_def<unsigned int> constant;

As the parsers generated by Spirit.Qi are fully attributed, any Spirit.Qi parser component needs to expose a certain type as its parser attribute. Naturally, the token_def<> exposes the token value type as its parser attribute, enabling a smooth integration with Spirit.Qi.
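For instance, a token definition can appear directly on the right hand side of a Qi rule. The following hedged sketch assumes tok is an instance of the token definition class from the excerpt above and Iterator is the corresponding lexer iterator type; the rule names are invented for illustration.

// 'identifier' exposes std::string as its attribute, so a rule with a
// std::string attribute can use it directly
qi::rule<Iterator, std::string()> name;
name = tok.identifier;

// 'constant' exposes unsigned int as its attribute
qi::rule<Iterator, unsigned int()> value;
value = tok.constant;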

The next code snippet demonstrates how the required token value types are specified while defining the token type to use. All of the token value types used for at least one of the token definitions have to be listed in the token type definition as well.

// This is the lexer token type to use. The second template parameter lists 
// all attribute types used for token_def's during token definition (see 
// calculator_tokens<> above). Here we use the predefined lexertl token 
// type, but any compatible token type may be used instead.
//
// If you don't list any token value types in the following declaration 
// (or just use the default token type: lexertl_token<base_iterator_type>)  
// it will compile and work just fine, just a bit less efficient. This is  
// because the token value will be generated from the matched input  
// sequence every time it is requested. But as soon as you specify at 
// least one token value type you'll have to list all value types used  
// for token_def<> declarations in the token definition class above,  
// otherwise compilation errors will occur.
typedef lexertl_token<
    base_iterator_type, boost::mpl::vector<unsigned int, std::string> 
> token_type;

To prevent a token from having a token value at all, the special tag omitted can be used: token_def<omitted> and lexertl_token<base_iterator_type, omitted>.

