Boost C++ Libraries

...one of the most highly regarded and expertly designed C++ library projects in the world. — Herb Sutter and Andrei Alexandrescu, C++ Coding Standards

Unicode and Boost.Regex

There are two ways to use Boost.Regex with Unicode strings:

Rely on wchar_t

If your platform's wchar_t type can hold Unicode strings, and your platform's C/C++ runtime correctly handles wide character constants (when passed to std::iswspace, std::iswlower, etc.), then you can use boost::wregex to process Unicode. However, there are several disadvantages to this approach:

  • It's not portable: there's no guarantee on the width of wchar_t, or even that the runtime treats wide characters as Unicode at all. Most Windows compilers do so, but many Unix systems do not.
  • There's no support for Unicode-specific character classes: [[:Nd:]], [[:Po:]] etc.
  • You can only search strings that are encoded as sequences of wide characters; it is not possible to search UTF-8, or even UTF-16 on many platforms.
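
The portability concern in the first point can be checked directly on a given platform. The following is a minimal sketch using only the standard library; the helper names wchar_holds_any_code_point and code_units are hypothetical and are not part of Boost.Regex.

```cpp
#include <cstddef>
#include <cwchar>   // WCHAR_MAX
#include <string>

// Hypothetical helper: true when one wchar_t can represent every
// Unicode code point (the highest is U+10FFFF).
constexpr bool wchar_holds_any_code_point() {
    return static_cast<unsigned long long>(WCHAR_MAX) >= 0x10FFFFull;
}

// Hypothetical helper: number of wchar_t code units in a wide string.
// For text outside the Basic Multilingual Plane this differs from the
// number of code points when wchar_t is only 16 bits wide.
inline std::size_t code_units(const std::wstring& s) {
    return s.size();
}
```

For example, code_units(L"\U0001F600") is expected to be 1 where wchar_t is 32 bits wide and 2 where it is 16 bits wide (the character is then stored as a surrogate pair), so a boost::wregex pattern such as L"." may see a different number of "characters" for the same text on different platforms.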

Use a Unicode-Aware Regular Expression Type

If you have the ICU library, then Boost.Regex can be configured to make use of it and to provide a distinct regular expression type (boost::u32regex) that supports both Unicode-specific character properties and the searching of text encoded in UTF-8, UTF-16, or UTF-32. See: ICU string class support.

