Boost C++ Libraries

...one of the most highly regarded and expertly designed C++ library projects in the world. Herb Sutter and Andrei Alexandrescu, C++ Coding Standards

Click here to view the latest version of this page.
PrevUpHomeNext

Appendices

Appendix 1: History
Appendix 2: Not Yet Implemented
Appendix 3: Differences from Boost.Regex
Appendix 4: Performance Comparison

Version 2.0.0, 10/12/2007

New Features:

  • Semantic actions
  • Custom assertions
  • Named captures
  • Dynamic regex grammars
  • Recursive dynamic regexes with (?R) construct
  • Support for searching non-character data
  • Better errors for invalid static regexes
  • Range-based regex algorithm interface
  • match_flag_type::format_perl, match_flag_type::format_sed, and match_flag_type::format_all
  • operator+(std::string, sub_match<>) and variants
  • Version 2 regex traits get tolower() and toupper()

Bugs Fixed:

  • Complementing single-character sets like ~(set='a') works.

Version 1.0.2, April 27, 2007

Bugs Fixed:

  • Back-references greater than nine work as advertized.

This is the version that shipped as part of Boost 1.34.

Version 1.0.1, October 2, 2006

Bugs Fixed:

  • match_results::position() works for nested results.

Version 1.0.0, March 16, 2006

Version 1.0!

Version 0.9.6, August 19, 2005

The version reviewed for acceptance into Boost. The review began September 8, 2005. Xpressive was accepted into Boost on September 28, 2005.

Version 0.9.3, June 30, 2005

New Features:

  • TR1-style regex_traits interface
  • Speed enhancements
  • syntax_option_type::ignore_white_space

Version 0.9.0, September 2, 2004

New Features:

  • It sort of works.

Version 0.0.1, November 16, 2003

Announcement of xpressive: http://lists.boost.org/Archives/boost/2003/11/56312.php

The following features are planned for xpressive 2.X:

  • syntax_option_type::collate
  • Collation sequences such as [.a.]
  • Equivalence classes like [=a=]
  • Control of nested results generation with syntax_option_type::nosubs, and a nosubs() modifier for static xpressive.

Here are some wish-list features. You or your company should consider hiring me to implement them!

  • Optimized DFA back-end for simple, fast regexing.
  • Different regex compiler front ends for basic, extended, awk, grep and egrep regex syntax.
  • Fine-grained control over the dynamic regex syntax
  • Optional integration with ICU for full Unicode support.
  • Improved localization support, possibly as a custom facet for std::locale.

Since many of xpressive's users are likely to be familiar with the Boost.Regex library, I would be remiss if I failed to point out some important differences between xpressive and Boost.Regex. In particular:

  • xpressive::basic_regex<> is a template on the iterator type, not the character type.
  • xpressive::basic_regex<> cannot be constructed directly from a string; rather, you must use basic_regex::compile() or regex_compiler<> to build a regex object from a string.
  • xpressive::basic_regex<> does not have an imbue() member function; rather, the imbue() member function is in the xpressive::regex_compiler<> factory.
  • boost::basic_regex<> has a subset of std::basic_string<>'s members. xpressive::basic_regex<> does not. The members lacking are: assign(), operator[](), max_size(), begin(), end(), size(), compare(), and operator=(std::basic_string<>).
  • Other member functions that exist in boost::basic_regex<> but do not exist in xpressive::basic_regex<> are: set_expression(), get_allocator(), imbue(), getloc(), getflags(), and str().
  • xpressive::basic_regex<> does not have a RegexTraits template parameter. Customization of regex syntax and localization behavior will be controlled by regex_compiler<> and a custom regex facet for std::locale.
  • xpressive::basic_regex<> and xpressive::match_results<> do not have an Allocator template parameter. This is by design.
  • match_not_dot_null and match_not_dot_newline have moved from the match_flag_type enum to the syntax_option_type enum, and they have changed names to not_dot_null and not_dot_newline.
  • The following syntax_option_type enumeration values are not supported: escape_in_lists, char_classes, intervals, limited_ops, newline_alt, bk_plus_qm, bk_braces, bk_parens, bk_refs, bk_vbar, use_except, failbit, literal, perlex, basic, extended, emacs, awk, grep ,egrep, sed, JavaScript, JScript.
  • The following match_flag_type enumeration values are not supported: match_not_bob, match_not_eob, match_perl, match_posix, and match_extra.

Also, in the current implementation, the regex algorithms in xpressive will not detect pathological behavior and abort by throwing an exception. It is up to you to write efficient patterns that do not behave pathologically.

The performance of xpressive is competitive with Boost.Regex. I have run performance benchmarks comparing static xpressive, dynamic xpressive and Boost.Regex on two platforms: gcc (Cygwin) and Visual C++. The tests include short matches and long searches. For both platforms, xpressive comes off well on short matches and roughly on par with Boost.Regex on long searches.

<disclaimer> As with all benchmarks, the true test is how xpressive performs with your patterns, your input, and your platform, so if performance matters in your application, it's best to run your own tests. </disclaimer>

Below are the results of a performance comparison between:

Test Specifications

Hardware:

hyper-threaded 3GHz Xeon with 1Gb RAM

Operating System:

Windows XP Pro + Cygwin

Compiler:

GNU C++ version 3.4.4 (Cygwin special)

C++ Standard Library:

GNU libstdc++ version 3.4.4

Boost.Regex Version:

1.33+, BOOST_REGEX_USE_CPP_LOCALE, BOOST_REGEX_RECURSIVE

xpressive Version:

0.9.6a

Comparison 1: Short Matches

The following tests evaluate the time taken to match the expression to the input string. For each result, the top number has been normalized relative to the fastest time, so 1.0 is as good as it gets. The bottom number (in parentheses) is the actual time in seconds. The best time has been marked in green.

Short Matches
static xpressive dynamic xpressive Boost Text Expression
1

(8.79e‑07s)
1.08

(9.54e‑07s)
2.51

(2.2e‑06s)
100- this is a line of ftp response which contains a message string ^([0-9]+)(\-| |$)(.*)$
1.06

(1.07e‑06s)
1

(1.01e‑06s)
4.01

(4.06e‑06s)
1234-5678-1234-456 ([[:digit:]]{4}[- ]){3}[[:digit:]]{3,4}
1

(1.4e‑06s)
1.13

(1.58e‑06s)
2.89

(4.05e‑06s)
john_maddock@compuserve.com ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
1

(1.28e‑06s)
1.16

(1.49e‑06s)
3.07

(3.94e‑06s)
foo12@foo.edu ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
1

(1.22e‑06s)
1.2

(1.46e‑06s)
3.22

(3.93e‑06s)
bob.smith@foo.tv ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
1.04

(8.64e‑07s)
1

(8.34e‑07s)
2.5

(2.09e‑06s)
EH10 2QQ ^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$
1.11

(9.09e‑07s)
1

(8.19e‑07s)
2.47

(2.03e‑06s)
G1 1AA ^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$
1.12

(9.38e‑07s)
1

(8.34e‑07s)
2.5

(2.08e‑06s)
SW1 1ZZ ^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$
1

(7.9e‑07s)
1.06

(8.34e‑07s)
2.49

(1.96e‑06s)
4/1/2001 ^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$
1

(8.19e‑07s)
1.04

(8.49e‑07s)
2.4

(1.97e‑06s)
12/12/2001 ^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$
1.09

(8.95e‑07s)
1

(8.19e‑07s)
2.4

(1.96e‑06s)
123 ^[-+]?[[:digit:]]*\.?[[:digit:]]*$
1.11

(8.79e‑07s)
1

(7.9e‑07s)
2.57

(2.03e‑06s)
+3.14159 ^[-+]?[[:digit:]]*\.?[[:digit:]]*$
1.09

(8.94e‑07s)
1

(8.19e‑07s)
2.47

(2.03e‑06s)
-3.14159 ^[-+]?[[:digit:]]*\.?[[:digit:]]*$

Comparison 2: Long Searches

The next test measures the time to find all matches in a long English text. The text is the complete works of Mark Twain, from Project Gutenberg. The text is 19Mb long. As above, the top number is the normalized time and the bottom number is the actual time. The best time is in green.

Long Searches
static xpressive dynamic xpressive Boost Expression
1

(0.0263s)
1

(0.0263s)
1.78

(0.0469s)
Twain
1

(0.0234s)
1

(0.0234s)
1.79

(0.042s)
Huck[[:alpha:]]+
1.84

(1.26s)
2.21

(1.51s)
1

(0.687s)
[[:alpha:]]+ing
1.09

(0.192s)
2

(0.351s)
1

(0.176s)
^[^ ]*?Twain
1.41

(0.08s)
1.21

(0.0684s)
1

(0.0566s)
Tom|Sawyer|Huckleberry|Finn
1.56

(0.195s)
1.12

(0.141s)
1

(0.125s)
(Tom|Sawyer|Huckleberry|Finn).{0,30}river|river.{0,30}(Tom|Sawyer|Huckleberry|Finn)

Below are the results of a performance comparison between:

Test Specifications

Hardware:

hyper-threaded 3GHz Xeon with 1Gb RAM

Operating System:

Windows XP Pro

Compiler:

Visual C++ .NET 2003 (7.1)

C++ Standard Library:

Dinkumware, version 313

Boost.Regex Version:

1.33+, BOOST_REGEX_USE_CPP_LOCALE, BOOST_REGEX_RECURSIVE

xpressive Version:

0.9.6a

Comparison 1: Short Matches

The following tests evaluate the time taken to match the expression to the input string. For each result, the top number has been normalized relative to the fastest time, so 1.0 is as good as it gets. The bottom number (in parentheses) is the actual time in seconds. The best time has been marked in green.

Short Matches
static xpressive dynamic xpressive Boost Text Expression
1

(3.2e‑007s)
1.37

(4.4e‑007s)
2.38

(7.6e‑007s)
100- this is a line of ftp response which contains a message string ^([0-9]+)(\-| |$)(.*)$
1

(6.4e‑007s)
1.12

(7.15e‑007s)
1.72

(1.1e‑006s)
1234-5678-1234-456 ([[:digit:]]{4}[- ]){3}[[:digit:]]{3,4}
1

(9.82e‑007s)
1.3

(1.28e‑006s)
1.61

(1.58e‑006s)
john_maddock@compuserve.com ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
1

(8.94e‑007s)
1.3

(1.16e‑006s)
1.7

(1.52e‑006s)
foo12@foo.edu ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
1

(9.09e‑007s)
1.28

(1.16e‑006s)
1.67

(1.52e‑006s)
bob.smith@foo.tv ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
1

(3.06e‑007s)
1.07

(3.28e‑007s)
1.95

(5.96e‑007s)
EH10 2QQ ^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$
1

(3.13e‑007s)
1.09

(3.42e‑007s)
1.86

(5.81e‑007s)
G1 1AA ^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$
1

(3.2e‑007s)
1.09

(3.5e‑007s)
1.86

(5.96e‑007s)
SW1 1ZZ ^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$
1

(2.68e‑007s)
1.22

(3.28e‑007s)
2

(5.36e‑007s)
4/1/2001 ^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$
1

(2.76e‑007s)
1.16

(3.2e‑007s)
1.94

(5.36e‑007s)
12/12/2001 ^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$
1

(2.98e‑007s)
1.03

(3.06e‑007s)
1.85

(5.51e‑007s)
123 ^[-+]?[[:digit:]]*\.?[[:digit:]]*$
1

(3.2e‑007s)
1.12

(3.58e‑007s)
1.81

(5.81e‑007s)
+3.14159 ^[-+]?[[:digit:]]*\.?[[:digit:]]*$
1

(3.28e‑007s)
1.11

(3.65e‑007s)
1.77

(5.81e‑007s)
-3.14159 ^[-+]?[[:digit:]]*\.?[[:digit:]]*$

Comparison 2: Long Searches

The next test measures the time to find all matches in a long English text. The text is the complete works of Mark Twain, from Project Gutenberg. The text is 19Mb long. As above, the top number is the normalized time and the bottom number is the actual time. The best time is in green.

Long Searches
static xpressive dynamic xpressive Boost Expression
1

(0.019s)
1

(0.019s)
2.98

(0.0566s)
Twain
1

(0.0176s)
1

(0.0176s)
3.17

(0.0556s)
Huck[[:alpha:]]+
3.62

(1.78s)
3.97

(1.95s)
1

(0.492s)
[[:alpha:]]+ing
2.32

(0.344s)
3.06

(0.453s)
1

(0.148s)
^[^ ]*?Twain
1

(0.0576s)
1.05

(0.0606s)
1.15

(0.0664s)
Tom|Sawyer|Huckleberry|Finn
1.24

(0.164s)
1.44

(0.191s)
1

(0.133s)
(Tom|Sawyer|Huckleberry|Finn).{0,30}river|river.{0,30}(Tom|Sawyer|Huckleberry|Finn)

PrevUpHomeNext