Unicode Regular Expression Algorithms

The regular expression algorithms regex_match, regex_search and regex_replace all expect that the character sequence upon which they operate, is encoded in the same character encoding as the regular expression object with which they are used. For Unicode regular expressions that behavior is undesirable: while we may want to process the data in UTF-32 "chunks", the actual data is much more likely to encoded as either UTF-8 or UTF-16. Therefore the header <boost/regex/icu.hpp> provides a series of thin wrappers around these algorithms, called u32regex_match, u32regex_search, and u32regex_replace. These wrappers use iterator-adapters internally to make external UTF-8 or UTF-16 data look as though it's really a UTF-32 sequence, that can then be passed on to the "real" algorithm.

u32regex_match

For each regex_match algorithm defined by <boost/regex.hpp>, then <boost/regex/icu.hpp> defines an overloaded algorithm that takes the same arguments, but which is called u32regex_match, and which will accept UTF-8, UTF-16 or UTF-32 encoded data, as well as an ICU UnicodeString as input.

Example: match a password, encoded in a UTF-16 UnicodeString:

//
// Find out if *password* meets our password requirements,
// as defined by the regular expression *requirements*.
//
bool is_valid_password(const UnicodeString& password, const UnicodeString& requirements)
{
   return boost::u32regex_match(password, boost::make_u32regex(requirements));
}

Example: match a UTF-8 encoded filename:

//
// Extract filename part of a path from a UTF-8 encoded std::string and return the result
// as another std::string:
//
std::string get_filename(const std::string& path)
{
   boost::u32regex r = boost::make_u32regex("(?:\\A|.*\\\\)([^\\\\]+)");
   boost::smatch what;
   if(boost::u32regex_match(path, what, r))
   {
      // extract $1 as a std::string:
      return what.str(1);
   }
   else
   {
      throw std::runtime_error("Invalid pathname");
   }
}

u32regex_search

For each regex_search algorithm defined by <boost/regex.hpp>, then <boost/regex/icu.hpp> defines an overloaded algorithm that takes the same arguments, but which is called u32regex_search, and which will accept UTF-8, UTF-16 or UTF-32 encoded data, as well as an ICU UnicodeString as input.

Example: search for a character sequence in a specific language block:

UnicodeString extract_greek(const UnicodeString& text)
{
   // searches through some UTF-16 encoded text for a block encoded in Greek,
   // this expression is imperfect, but the best we can do for now - searching
   // for specific scripts is actually pretty hard to do right.
   //
   // Here we search for a character sequence that begins with a Greek letter,
   // and continues with characters that are either not-letters ( [^[:L*:]] )
   // or are characters in the Greek character block ( [\\x{370}-\\x{3FF}] ).
   //
   boost::u32regex r = boost::make_u32regex(
         L"[\\x{370}-\\x{3FF}](?:[^[:L*:]]|[\\x{370}-\\x{3FF}])*");
   boost::u16match what;
   if(boost::u32regex_search(text, what, r))
   {
      // extract $0 as a UnicodeString:
      return UnicodeString(what[0].first, what.length(0));
   }
   else
   {
      throw std::runtime_error("No Greek found!");
   }
}

u32regex_replace

For each regex_replace algorithm defined by <boost/regex.hpp>, then <boost/regex/icu.hpp> defines an overloaded algorithm that takes the same arguments, but which is called u32regex_replace, and which will accept UTF-8, UTF-16 or UTF-32 encoded data, as well as an ICU UnicodeString as input. The input sequence and the format string specifier passed to the algorithm, can be encoded independently (for example one can be UTF-8, the other in UTF-16), but the result string / output iterator argument must use the same character encoding as the text being searched.

Example: Credit card number reformatting:

//
// Take a credit card number as a string of digits, 
// and reformat it as a human readable string with "-"
// separating each group of four digit;, 
// note that we're mixing a UTF-32 regex, with a UTF-16
// string and a UTF-8 format specifier, and it still all 
// just works:
//
const boost::u32regex e = boost::make_u32regex(
      "\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z");
const char* human_format = "$1-$2-$3-$4";

UnicodeString human_readable_card_number(const UnicodeString& s)
{
   return boost::u32regex_replace(s, e, human_format);
}

Boost C++ Libraries

Unicode Regular Expression Algorithms

u32regex_match

u32regex_search

u32regex_replace