Boost C++ Libraries

User's Guide

Introduction
Installing xpressive
Quick Start
Creating a Regex Object
Matching and Searching
Accessing Results
String Substitutions
String Splitting and Tokenization
Grammars and Nested Matches
Semantic Actions and User-Defined Assertions
Symbol Tables and Attributes
Localization and Regex Traits
Tips 'N Tricks
Concepts
Examples

This section describes how to use xpressive to accomplish text manipulation and parsing tasks. If you are looking for detailed information regarding specific components in xpressive, check the Reference section.

What is xpressive?

xpressive is a regular expression template library. Regular expressions (regexes) can be written as strings that are parsed dynamically at runtime (dynamic regexes), or as expression templates [4] that are parsed at compile-time (static regexes). Dynamic regexes have the advantage that they can be accepted from the user as input at runtime or read from an initialization file. Static regexes have several advantages. Since they are C++ expressions instead of strings, they can be syntax-checked at compile-time. Also, they can naturally refer to code and data elsewhere in your program, giving you the ability to call back into your code from within a regex match. Finally, since they are statically bound, the compiler can generate faster code for static regexes.

xpressive's dual nature is unique and powerful. Static xpressive is a bit like the Spirit Parser Framework. Like Spirit, you can build grammars with static regexes using expression templates. (Unlike Spirit, xpressive does exhaustive backtracking, trying every possibility to find a match for your pattern.) Dynamic xpressive is a bit like Boost.Regex. In fact, xpressive's interface should be familiar to anyone who has used Boost.Regex. xpressive's innovation comes from allowing you to mix and match static and dynamic regexes in the same program, and even in the same expression! You can embed a dynamic regex in a static regex, or vice versa, and the embedded regex will participate fully in the search, back-tracking as needed to make the match succeed.

Hello, world!

Enough theory. Let's have a look at Hello World, xpressive style:

#include <iostream>
#include <string>
#include <boost/xpressive/xpressive.hpp>

using namespace boost::xpressive;

int main()
{
    std::string hello( "hello world!" );

    sregex rex = sregex::compile( "(\\w+) (\\w+)!" );
    smatch what;

    if( regex_match( hello, what, rex ) )
    {
        std::cout << what[0] << '\n'; // whole match
        std::cout << what[1] << '\n'; // first capture
        std::cout << what[2] << '\n'; // second capture
    }

    return 0;
}

This program outputs the following:

hello world!
hello
world

The first thing you'll notice about the code is that all the types in xpressive live in the boost::xpressive namespace.

[Note] Note

Most of the rest of the examples in this document will leave off the using namespace boost::xpressive; directive. Just pretend it's there.

Next, you'll notice the type of the regular expression object is sregex. If you are familiar with Boost.Regex, this is different from what you are used to. The "s" in "sregex" stands for "string", indicating that this regex can be used to find patterns in std::string objects. I'll discuss this difference and its implications in detail later.

Notice how the regex object is initialized:

sregex rex = sregex::compile( "(\\w+) (\\w+)!" );

To create a regular expression object from a string, you must call a factory method such as basic_regex<>::compile(). This is another area in which xpressive differs from other object-oriented regular expression libraries. Other libraries encourage you to think of a regular expression as a kind of string on steroids. In xpressive, regular expressions are not strings; they are little programs in a domain-specific language. Strings are only one representation of that language. Another representation is an expression template. For example, the above line of code is equivalent to the following:

sregex rex = (s1= +_w) >> ' ' >> (s2= +_w) >> '!';

This describes the same regular expression, except it uses the domain-specific embedded language defined by static xpressive.

As you can see, static regexes have a syntax that is noticeably different from standard Perl syntax. That is because we are constrained by C++'s syntax. The biggest difference is the use of >> to mean "followed by". For instance, in Perl you can just put sub-expressions next to each other:

abc

But in C++, there must be an operator separating sub-expressions:

a >> b >> c

In Perl, parentheses () have special meaning. They group, but as a side-effect they also create back-references like $1 and $2. In C++, there is no way to overload parentheses to give them side-effects. To get the same effect, we use the special s1, s2, etc. tokens. Assign to one to create a back-reference (known as a sub-match in xpressive).

You'll also notice that the one-or-more repetition operator + has moved from postfix to prefix position. That's because C++ doesn't have a postfix + operator. So:

"\\w+"

is the same as:

+_w

We'll cover all the other differences later.

Getting xpressive

There are three ways to get xpressive. The first and simplest is to download the latest version of Boost. Just go to http://sf.net/projects/boost and follow the “Download” link.

The second way is by downloading xpressive.zip at the Boost File Vault in the “Strings - Text Processing” directory. In addition to the source code and the Boost license, this archive contains a copy of this documentation in PDF format. This version will always be stable and at least as current as the version in the latest Boost release. It may be more recent. The version in the File Vault is always guaranteed to work with the latest official Boost release.

The third way is by directly accessing the Boost Subversion repository. Just go to http://svn.boost.org/trac/boost/ and follow the instructions there for anonymous Subversion access. The version in Boost Subversion is unstable.

Building with xpressive

Xpressive is a header-only template library, which means you don't need to alter your build scripts or link to any separate lib file to use it. All you need to do is #include <boost/xpressive/xpressive.hpp>. If you are only using static regexes, you can improve compile times by only including xpressive_static.hpp. Likewise, you can include xpressive_dynamic.hpp if you only plan on using dynamic regexes.

If you would also like to use semantic actions or custom assertions with your static regexes, you will need to additionally include regex_actions.hpp.

Requirements

Xpressive requires Boost version 1.34.1 or higher.

Supported Compilers

Currently, Boost.Xpressive is known to work on the following compilers:

  • Visual C++ 7.1 and higher
  • GNU C++ 3.4 and higher
  • Intel for Linux 8.1 and higher
  • Intel for Windows 10 and higher
  • tru64cxx 71 and higher
  • MinGW 3.4 and higher
  • HP C/aC++ A.06.14

Check the latest test results at Boost's Regression Results Page.

[Note] Note

Please send any questions, comments and bug reports to eric <at> boost-consulting <dot> com.

You don't need to know much to start being productive with xpressive. Let's begin with the nickel tour of the types and algorithms xpressive provides.

Table 26.1. xpressive's Tool-Box

Tool

Description

basic_regex<>

Contains a compiled regular expression. basic_regex<> is the most important type in xpressive. Everything you do with xpressive will begin with creating an object of type basic_regex<>.

match_results<>, sub_match<>

match_results<> contains the results of a regex_match() or regex_search() operation. It acts like a vector of sub_match<> objects. A sub_match<> object contains a marked sub-expression (also known as a back-reference in Perl). It is basically just a pair of iterators representing the begin and end of the marked sub-expression.

regex_match()

Checks to see if a string matches a regex. For regex_match() to succeed, the whole string must match the regex, from beginning to end. If you give regex_match() a match_results<>, it will write into it any marked sub-expressions it finds.

regex_search()

Searches a string to find a sub-string that matches the regex. regex_search() will try to find a match at every position in the string, starting at the beginning, and stopping when it finds a match or when the string is exhausted. As with regex_match(), if you give regex_search() a match_results<>, it will write into it any marked sub-expressions it finds.

regex_replace()

Given an input string, a regex, and a substitution string, regex_replace() builds a new string by replacing those parts of the input string that match the regex with the substitution string. The substitution string can contain references to marked sub-expressions.

regex_iterator<>

An STL-compatible iterator that makes it easy to find all the places in a string that match a regex. Dereferencing a regex_iterator<> returns a match_results<>. Incrementing a regex_iterator<> finds the next match.

regex_token_iterator<>

Like regex_iterator<>, except dereferencing a regex_token_iterator<> returns a string. By default, it will return the whole sub-string that the regex matched, but it can be configured to return any or all of the marked sub-expressions one at a time, or even the parts of the string that didn't match the regex.

regex_compiler<>

A factory for basic_regex<> objects. It "compiles" a string into a regular expression. You will not usually have to deal directly with regex_compiler<> because the basic_regex<> class has a factory method that uses regex_compiler<> internally. But if you need to do anything fancy like create a basic_regex<> object with a different std::locale, you will need to use a regex_compiler<> explicitly.


Now that you know a bit about the tools xpressive provides, you can pick the right tool for you by answering the following two questions:

  1. What iterator type will you use to traverse your data?
  2. What do you want to do to your data?

Know Your Iterator Type

Most of the classes in xpressive are templates that are parameterized on the iterator type. xpressive defines some common typedefs to make the job of choosing the right types easier. You can use the table below to find the right types based on the type of your iterator.

Table 26.2. xpressive Typedefs vs. Iterator Types

Template                  std::string::const_iterator  char const *           std::wstring::const_iterator  wchar_t const *
basic_regex<>             sregex                       cregex                 wsregex                       wcregex
match_results<>           smatch                       cmatch                 wsmatch                       wcmatch
regex_compiler<>          sregex_compiler              cregex_compiler        wsregex_compiler              wcregex_compiler
regex_iterator<>          sregex_iterator              cregex_iterator        wsregex_iterator              wcregex_iterator
regex_token_iterator<>    sregex_token_iterator        cregex_token_iterator  wsregex_token_iterator        wcregex_token_iterator


You should notice the systematic naming convention. Many of these types are used together, so the naming convention helps you to use them consistently. For instance, if you have an sregex, you should also be using an smatch.

If you are not using one of those four iterator types, then you can use the templates directly and specify your iterator type.

Know Your Task

Do you want to find a pattern once? Many times? Search and replace? xpressive has tools for all that and more. Below is a quick reference:


These algorithms and classes are described in excruciating detail in the Reference section.


When using xpressive, the first thing you'll do is create a basic_regex<> object. This section goes over the nuts and bolts of building a regular expression in the two dialects xpressive supports: static and dynamic.

Overview

The feature that really sets xpressive apart from other C/C++ regular expression libraries is the ability to author a regular expression using C++ expressions. xpressive achieves this through operator overloading, using a technique called expression templates to embed a mini-language dedicated to pattern matching within C++. These "static regexes" have many advantages over their string-based brethren. In particular, static regexes:

  • are syntax-checked at compile-time; they will never fail at run-time due to a syntax error.
  • can naturally refer to other C++ data and code, including other regexes, making it simple to build grammars out of regular expressions and bind user-defined actions that execute when parts of your regex match.
  • are statically bound for better inlining and optimization. Static regexes require no state tables, virtual functions, byte-code or calls through function pointers that cannot be resolved at compile time.
  • are not limited to searching for patterns in strings. You can declare a static regex that finds patterns in an array of integers, for instance.

Since we compose static regexes using C++ expressions, we are constrained by the rules for legal C++ expressions. Unfortunately, that means that "classic" regular expression syntax cannot always be mapped cleanly into C++. Rather, we map the regex constructs, picking new syntax that is legal C++.

Construction and Assignment

You create a static regex by assigning one to an object of type basic_regex<>. For instance, the following defines a regex that can be used to find patterns in objects of type std::string:

sregex re = '$' >> +_d >> '.' >> _d >> _d;

Assignment works similarly.

Character and String Literals

In static regexes, character and string literals match themselves. For instance, in the regex above, '$' and '.' match the characters '$' and '.' respectively. Don't be confused by the fact that $ and . are meta-characters in Perl. In xpressive, literals always represent themselves.

When using literals in static regexes, you must take care that at least one operand is not a literal. For instance, the following are not valid regexes:

sregex re1 = 'a' >> 'b';         // ERROR!
sregex re2 = +'a';               // ERROR!

The two operands to the binary >> operator are both literals, and the operand of the unary + operator is also a literal, so these statements will call the native C++ binary right-shift and unary plus operators, respectively. That's not what we want. To get operator overloading to kick in, at least one operand must be a user-defined type. We can use xpressive's as_xpr() helper function to "taint" an expression with regex-ness, forcing operator overloading to find the correct operators. The two regexes above should be written as:

sregex re1 = as_xpr('a') >> 'b'; // OK
sregex re2 = +as_xpr('a');       // OK

Sequencing and Alternation

As you've probably already noticed, sub-expressions in static regexes must be separated by the sequencing operator, >>. You can read this operator as "followed by".

// Match an 'a' followed by a digit
sregex re = 'a' >> _d;

Alternation works just as it does in Perl with the | operator. You can read this operator as "or". For example:

// match a digit character or a word character one or more times
sregex re = +( _d | _w );

Grouping and Captures

In Perl, parentheses () have special meaning. They group, but as a side-effect they also create back-references like $1 and $2. In C++, parentheses only group -- there is no way to give them side-effects. To get the same effect, we use the special s1, s2, etc. tokens. Assigning to one creates a back-reference. You can then use the back-reference later in your expression, like using \1 and \2 in Perl. For example, consider the following regex, which finds matching HTML tags:

"<(\\w+)>.*?</\\1>"

In static xpressive, this would be:

'<' >> (s1= +_w) >> '>' >> -*_ >> "</" >> s1 >> '>'

Notice how you capture a back-reference by assigning to s1, and then you use s1 later in the pattern to find the matching end tag.

[Tip] Tip

Grouping without capturing a back-reference

In xpressive, if you just want grouping without capturing a back-reference, you can just use () without s1. That is the equivalent of Perl's (?:) non-capturing grouping construct.

Case-Insensitivity and Internationalization

Perl lets you make part of your regular expression case-insensitive by using the (?i:) pattern modifier. xpressive also has a case-insensitivity pattern modifier, called icase. You can use it as follows:

sregex re = "this" >> icase( "that" );

In this regular expression, "this" will be matched exactly, but "that" will be matched irrespective of case.

Case-insensitive regular expressions raise the issue of internationalization: how should case-insensitive character comparisons be evaluated? Also, many character classes are locale-specific. Which characters are matched by digit and which are matched by alpha? The answer depends on the std::locale object the regular expression object is using. By default, all regular expression objects use the global locale. You can override the default by using the imbue() pattern modifier, as follows:

std::locale my_locale = /* initialize a std::locale object */;
sregex re = imbue( my_locale )( +alpha >> +digit );

This regular expression will evaluate alpha and digit according to my_locale. See the section on Localization and Regex Traits for more information about how to customize the behavior of your regexes.

Static xpressive Syntax Cheat Sheet

The table below lists the familiar regex constructs and their equivalents in static xpressive.

Table 26.4. Perl syntax vs. Static xpressive syntax

Perl        Static xpressive                            Meaning
.           _                                           any character (assuming Perl's /s modifier).
ab          a >> b                                      sequencing of a and b sub-expressions.
a|b         a | b                                       alternation of a and b sub-expressions.
(a)         (s1= a)                                     group and capture a back-reference.
(?:a)       (a)                                         group and do not capture a back-reference.
\1          s1                                          a previously captured back-reference.
a*          *a                                          zero or more times, greedy.
a+          +a                                          one or more times, greedy.
a?          !a                                          zero or one time, greedy.
a{n,m}      repeat<n,m>(a)                              between n and m times, greedy.
a*?         -*a                                         zero or more times, non-greedy.
a+?         -+a                                         one or more times, non-greedy.
a??         -!a                                         zero or one time, non-greedy.
a{n,m}?     -repeat<n,m>(a)                             between n and m times, non-greedy.
^           bos                                         beginning of sequence assertion.
$           eos                                         end of sequence assertion.
\b          _b                                          word boundary assertion.
\B          ~_b                                         not word boundary assertion.
\n          _n                                          literal newline.
.           ~_n                                         any character except a literal newline (without Perl's /s modifier).
\r?\n|\r    _ln                                         logical newline.
[^\r\n]     ~_ln                                        any single character not a logical newline.
\w          _w                                          a word character, equivalent to set[alnum | '_'].
\W          ~_w                                         not a word character, equivalent to ~set[alnum | '_'].
\d          _d                                          a digit character.
\D          ~_d                                         not a digit character.
\s          _s                                          a space character.
\S          ~_s                                         not a space character.
[:alnum:]   alnum                                       an alpha-numeric character.
[:alpha:]   alpha                                       an alphabetic character.
[:blank:]   blank                                       a horizontal white-space character.
[:cntrl:]   cntrl                                       a control character.
[:digit:]   digit                                       a digit character.
[:graph:]   graph                                       a graphable character.
[:lower:]   lower                                       a lower-case character.
[:print:]   print                                       a printing character.
[:punct:]   punct                                       a punctuation character.
[:space:]   space                                       a white-space character.
[:upper:]   upper                                       an upper-case character.
[:xdigit:]  xdigit                                      a hexadecimal digit character.
[0-9]       range('0','9')                              characters in range '0' through '9'.
[abc]       as_xpr('a') | 'b' | 'c'                     characters 'a', 'b', or 'c'.
[abc]       (set= 'a','b','c')                          same as above.
[0-9abc]    set[ range('0','9') | 'a' | 'b' | 'c' ]     characters 'a', 'b', 'c' or in range '0' through '9'.
[0-9abc]    set[ range('0','9') | (set= 'a','b','c') ]  same as above.
[^abc]      ~(set= 'a','b','c')                         not characters 'a', 'b', or 'c'.
(?i:stuff)  icase(stuff)                                match stuff disregarding case.
(?>stuff)   keep(stuff)                                 independent sub-expression, match stuff and turn off backtracking.
(?=stuff)   before(stuff)                               positive look-ahead assertion, match if before stuff but don't include stuff in the match.
(?!stuff)   ~before(stuff)                              negative look-ahead assertion, match if not before stuff.
(?<=stuff)  after(stuff)                                positive look-behind assertion, match if after stuff but don't include stuff in the match. (stuff must be constant-width.)
(?<!stuff)  ~after(stuff)                               negative look-behind assertion, match if not after stuff. (stuff must be constant-width.)



Overview

Static regexes are dandy, but sometimes you need something a bit more ... dynamic. Imagine you are developing a text editor with a regex search/replace feature. You need to accept a regular expression from the end user as input at run-time. There should be a way to parse a string into a regular expression. That's what xpressive's dynamic regexes are for. They are built from the same core components as their static counterparts, but they are late-bound so you can specify them at run-time.

Construction and Assignment

There are two ways to create a dynamic regex: with the basic_regex<>::compile() function or with the regex_compiler<> class template. Use basic_regex<>::compile() if you want the default locale. Use regex_compiler<> if you need to specify a different locale. In the section on regex grammars, we'll see another use for regex_compiler<>.

Here is an example of using basic_regex<>::compile():

sregex re = sregex::compile( "this|that", regex_constants::icase );

Here is the same example using regex_compiler<>:

sregex_compiler compiler;
sregex re = compiler.compile( "this|that", regex_constants::icase );

basic_regex<>::compile() is implemented in terms of regex_compiler<>.

Dynamic xpressive Syntax

Since the dynamic syntax is not constrained by the rules for valid C++ expressions, we are free to use familiar syntax for dynamic regexes. For this reason, the syntax used by xpressive for dynamic regexes follows the lead set by John Maddock's proposal to add regular expressions to the Standard Library. It is essentially the syntax standardized by ECMAScript, with minor changes in support of internationalization.

Since the syntax is documented exhaustively elsewhere, I will simply refer you to the existing standards, rather than duplicate the specification here.

Internationalization

As with static regexes, dynamic regexes support internationalization by allowing you to specify a different std::locale. To do this, you must use regex_compiler<>. The regex_compiler<> class has an imbue() function. After you have imbued a regex_compiler<> object with a custom std::locale, all regex objects compiled by that regex_compiler<> will use that locale. For example:

std::locale my_locale = /* initialize your locale object here */;
sregex_compiler compiler;
compiler.imbue( my_locale );
sregex re = compiler.compile( "\\w+|\\d+" );

This regex will use my_locale when evaluating the intrinsic character sets "\\w" and "\\d".

Overview

Once you have created a regex object, you can use the regex_match() and regex_search() algorithms to find patterns in strings. This page covers the basics of regex matching and searching. In all cases, if you are familiar with how regex_match() and regex_search() in the Boost.Regex library work, xpressive's versions work the same way.

Seeing if a String Matches a Regex

The regex_match() algorithm checks to see if a regex matches a given input.

[Warning] Warning

The regex_match() algorithm will only report success if the regex matches the whole input, from beginning to end. If the regex matches only a part of the input, regex_match() will return false. If you want to search through the string looking for sub-strings that the regex matches, use the regex_search() algorithm.

The input can be a bidirectional range such as std::string, a C-style null-terminated string or a pair of iterators. In all cases, the type of the iterator used to traverse the input sequence must match the iterator type used to declare the regex object. (You can use the table in the Quick Start to find the correct regex type for your iterator.)

cregex cre = +_w;  // this regex can match C-style strings
sregex sre = +_w;  // this regex can match std::strings

if( regex_match( "hello", cre ) )              // OK
    { /*...*/ }

if( regex_match( std::string("hello"), sre ) ) // OK
    { /*...*/ }

if( regex_match( "hello", sre ) )              // ERROR! iterator mis-match!
    { /*...*/ }

The regex_match() algorithm optionally accepts a match_results<> struct as an out parameter. If given, the regex_match() algorithm fills in the match_results<> struct with information about which parts of the regex matched which parts of the input.

cmatch what;
cregex cre = +(s1= _w);

// store the results of the regex_match in "what"
if( regex_match( "hello", what, cre ) )
{
    std::cout << what[1] << '\n'; // prints "o"
}

The regex_match() algorithm also optionally accepts a match_flag_type bitmask. With match_flag_type, you can control certain aspects of how the match is evaluated. See the match_flag_type reference for a complete list of the flags and their meanings.

std::string str("hello");
sregex sre = bol >> +_w;

// match_not_bol means that "bol" should not match at [begin,begin)
if( regex_match( str.begin(), str.end(), sre, regex_constants::match_not_bol ) )
{
    // should never get here!!!
}

For a complete example program that shows how to use regex_match(), see the Examples section. The regex_match() reference has a complete list of the available overloads.

Searching for Matching Sub-Strings

Use regex_search() when you want to know if an input sequence contains a sub-sequence that a regex matches. regex_search() will try to match the regex at the beginning of the input sequence and scan forward in the sequence until it either finds a match or exhausts the sequence.

In all other regards, regex_search() behaves like regex_match() (see above). In particular, it can operate on a bidirectional range such as std::string, C-style null-terminated strings or iterator ranges. The same care must be taken to ensure that the iterator type of your regex matches the iterator type of your input sequence. As with regex_match(), you can optionally provide a match_results<> struct to receive the results of the search, and a match_flag_type bitmask to control how the match is evaluated.

For a complete example program that shows how to use regex_search(), see the Examples section. The regex_search() reference has a complete list of the available overloads.

Overview

Sometimes, it is not enough to know simply whether a regex_match() or regex_search() was successful or not. If you pass an object of type match_results<> to regex_match() or regex_search(), then after the algorithm has completed successfully the match_results<> will contain extra information about which parts of the regex matched which parts of the sequence. In Perl, these sub-sequences are called back-references, and they are stored in the variables $1, $2, etc. In xpressive, they are objects of type sub_match<>, and they are stored in the match_results<> structure, which acts as a vector of sub_match<> objects.

match_results

So, you've passed a match_results<> object to a regex algorithm, and the algorithm has succeeded. Now you want to examine the results. Most of what you'll be doing with the match_results<> object is indexing into it to access its internally stored sub_match<> objects, but there are a few other things you can do with a match_results<> object besides.

The table below shows how to access the information stored in a match_results<> object named what.

Table 26.5. match_results<> Accessors

Accessor

Effects

what.size()

Returns the number of sub-matches, which is always greater than zero after a successful match because the full match is stored in the zero-th sub-match.

what[n]

Returns the n-th sub-match.

what.length(n)

Returns the length of the n-th sub-match. Same as what[n].length().

what.position(n)

Returns the offset into the input sequence at which the n-th sub-match begins.

what.str(n)

Returns a std::basic_string<> constructed from the n-th sub-match. Same as what[n].str().

what.prefix()

Returns a sub_match<> object which represents the sub-sequence from the beginning of the input sequence to the start of the full match.

what.suffix()

Returns a sub_match<> object which represents the sub-sequence from the end of the full match to the end of the input sequence.

what.regex_id()

Returns the regex_id of the basic_regex<> object that was last used with this match_results<> object.


There is more you can do with the match_results<> object, but that will be covered when we talk about Grammars and Nested Matches.

sub_match

When you index into a match_results<> object, you get back a sub_match<> object. A sub_match<> is basically a pair of iterators. It is defined like this:

template< class BidirectionalIterator >
struct sub_match
    : std::pair< BidirectionalIterator, BidirectionalIterator >
{
    bool matched;
    // ...
};

Since it inherits publicly from