A character set represents a subset of low-ASCII characters, used as a building block for constructing rules. The library models them as callable predicates invocable with this equivalent signature:
/// Return true if ch is in the set
bool( char ch ) const noexcept;
The CharSet concept describes the requirements on syntax and semantics for these types. Here we declare a character set type that includes the horizontal and vertical whitespace characters:
struct ws_chars_t
{
    constexpr bool operator()( char c ) const noexcept
    {
        return c == '\t' || c == ' ' || c == '\r' || c == '\n';
    }
};
The type trait is_charset determines if a type meets the requirements:
static_assert( is_charset< ws_chars_t >::value, "CharSet requirements not met" );
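To illustrate what such a check involves, here is a minimal sketch of a detection trait written in standard C++. The name is_charset_sketch is hypothetical, and this is not the library's implementation of is_charset. It only tests that a const instance can be called with a char and that the result converts to bool:

```cpp
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a CharSet check; NOT the library's is_charset.
// Primary template: assume the requirements are not met.
template< class T, class = void >
struct is_charset_sketch : std::false_type {};

// Selected only when `t( ch )` is well-formed for a const T and a char,
// and the result is convertible to bool.
template< class T >
struct is_charset_sketch< T, typename std::enable_if<
    std::is_convertible<
        decltype( std::declval< T const& >()( std::declval< char >() ) ),
        bool >::value >::type >
    : std::true_type {};

// The whitespace set from the text
struct ws_chars_t
{
    constexpr bool operator()( char c ) const noexcept
    {
        return c == '\t' || c == ' ' || c == '\r' || c == '\n';
    }
};

// A type with no call operator, used as a negative example
struct not_a_charset {};

static_assert( is_charset_sketch< ws_chars_t >::value, "callable with char" );
static_assert( ! is_charset_sketch< not_a_charset >::value, "no call operator" );
```

The real concept may impose further requirements; the sketch only shows the general shape of a compile-time check.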
Character sets are always passed as values. As with rules, we declare an
instance of the type for notational convenience. The constexpr
designation is used to make it a zero-cost abstraction:
constexpr ws_chars_t ws_chars{};
For best results, ensure that user-defined character set types are constexpr
constructible.
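As a small illustration of passing a character set by value, the sketch below defines a generic helper that counts how many characters of a string belong to a set. The helper count_in_set and the function usage_example are hypothetical names introduced here and are not part of the library:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct ws_chars_t
{
    constexpr bool operator()( char c ) const noexcept
    {
        return c == '\t' || c == ' ' || c == '\r' || c == '\n';
    }
};

constexpr ws_chars_t ws_chars{};

// Hypothetical helper: counts the characters of s that belong to cs.
// The set is an empty, trivially copyable object, so taking it by
// value costs nothing.
template< class CharSet >
std::size_t count_in_set( std::string const& s, CharSet cs ) noexcept
{
    std::size_t n = 0;
    for( char c : s )
        if( cs( c ) )
            ++n;
    return n;
}

inline void usage_example()
{
    // ' ', '\t', '\r' and '\n' are in the set; 'a' and 'b' are not
    assert( count_in_set( " a\tb\r\n", ws_chars ) == 4 );
}
```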
The functions find_if and find_if_not are used to search a string for the first matching or the first non-matching character from a set. The example below skips any leading whitespace and then returns everything from the first non-whitespace character up to, but not including, the next whitespace character:
core::string_view get_token( core::string_view s ) noexcept
{
    auto it0 = s.data();
    auto const end = it0 + s.size();

    // find the first non-whitespace character
    it0 = find_if_not( it0, end, ws_chars );
    if( it0 == end )
    {
        // all whitespace or empty string
        return {};
    }

    // find the next whitespace character
    auto it1 = find_if( it0, end, ws_chars );

    // [it0, it1) is the part we want
    return core::string_view( it0, it1 - it0 );
}
The function can now be called like this:
assert( get_token( " \t john-doe\r\n \t jane-doe\r\n") == "john-doe" );
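For readers who want to see the search logic without the library, the following is a minimal sketch of generic linear-scan versions of the two searches. The names find_if_sketch, find_if_not_sketch, get_token_sketch and usage_example are hypothetical, std::string replaces core::string_view to keep the block standalone, and the library's own functions may be implemented quite differently (for example, with set-specific optimizations):

```cpp
#include <cassert>
#include <string>

struct ws_chars_t
{
    constexpr bool operator()( char c ) const noexcept
    {
        return c == '\t' || c == ' ' || c == '\r' || c == '\n';
    }
};

constexpr ws_chars_t ws_chars{};

// Hypothetical fallback: first character in [first, last) that IS in cs
template< class CharSet >
char const* find_if_sketch(
    char const* first, char const* last, CharSet cs ) noexcept
{
    while( first != last && ! cs( *first ) )
        ++first;
    return first;
}

// Hypothetical fallback: first character in [first, last) that is NOT in cs
template< class CharSet >
char const* find_if_not_sketch(
    char const* first, char const* last, CharSet cs ) noexcept
{
    while( first != last && cs( *first ) )
        ++first;
    return first;
}

// Same logic as get_token in the text, returning an owning string
std::string get_token_sketch( std::string const& s )
{
    char const* it0 = s.data();
    char const* const end = it0 + s.size();

    it0 = find_if_not_sketch( it0, end, ws_chars );
    if( it0 == end )
        return {}; // all whitespace or empty string

    char const* const it1 = find_if_sketch( it0, end, ws_chars );
    return std::string( it0, it1 ); // copy of [it0, it1)
}

inline void usage_example()
{
    assert( get_token_sketch( " \t john-doe\r\n \t jane-doe\r\n" ) == "john-doe" );
}
```

Both searches return the end pointer when nothing is found, which is why the token function can compare against end to detect an empty or all-whitespace input.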
The library provides these often-used character sets:
Table 1.33. Character Sets
| Value | Description |
|---|---|
| | Contains the uppercase and lowercase letters, and digits. |
| | Contains the uppercase and lowercase letters. |
| | Contains the decimal digit characters. |
| | Contains the uppercase and lowercase hexadecimal digit characters. |
| | Contains the visible characters (i.e., non-whitespace). |
Some of the character sets in the library have implementations optimized for the particular character set or optimized in general, often in ways that take advantage of opportunities not available to standard library facilities. For example, some searches use custom code based on Streaming SIMD Extensions 2 (SSE2), which is available on all x64 processors and on most x86 processors.
The lut_chars type satisfies the CharSet requirements and offers an optimized constexpr implementation which provides enhanced performance and notational convenience for specifying character sets. Compile-time instances can be constructed from strings:
constexpr lut_chars vowels = "AEIOU" "aeiou";
We can use operator+ and operator- notation to add and remove elements from the set at compile time. For example, sometimes the character 'y' sounds like a vowel:
constexpr auto vowels_and_y = vowels + 'y' + 'Y';
The type is named after its implementation, which is a lookup table ("lut") of packed bits. This allows for a variety of construction methods and flexible composition. Here we create the set of visible characters using a function object:
struct is_visible
{
    constexpr bool operator()( char ch ) const noexcept
    {
        return ch >= 33 && ch <= 126;
    }
};

constexpr lut_chars visible_chars( is_visible{} ); // (since C++11)
Alternatively, using a lambda:
constexpr lut_chars visible_chars( [](char ch) { return ch >= 33 && ch <= 126; } ); // (since C++17)
Differences can be calculated with operator-:
constexpr auto visible_non_vowels = visible_chars - vowels;
We can also remove individual characters:
constexpr auto visible_non_vowels_or_y = visible_chars - vowels - 'y';
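To make the idea of a lookup table of packed bits concrete, here is a minimal sketch of how such a type could be written. The class lut_sketch and the variables below are hypothetical; this is not the library's lut_chars, it assumes C++14 relaxed constexpr for brevity, and it supports only construction from a string plus the three operators discussed above:

```cpp
#include <cstdint>

// Hypothetical illustration of a packed-bit lookup table;
// NOT the library's lut_chars. Requires C++14 or later.
class lut_sketch
{
    // 256 bits, one per possible unsigned char value
    std::uint64_t mask_[4] = { 0, 0, 0, 0 };

    constexpr void set( unsigned char u, bool on ) noexcept
    {
        std::uint64_t const bit = std::uint64_t( 1 ) << ( u & 63 );
        if( on )
            mask_[ u >> 6 ] |= bit;
        else
            mask_[ u >> 6 ] &= ~bit;
    }

public:
    // Build the set from the characters of a null-terminated string
    constexpr lut_sketch( char const* s ) noexcept
    {
        for( ; *s; ++s )
            set( static_cast< unsigned char >( *s ), true );
    }

    // Membership test: one shift and one mask
    constexpr bool operator()( char ch ) const noexcept
    {
        unsigned char const u = static_cast< unsigned char >( ch );
        return ( ( mask_[ u >> 6 ] >> ( u & 63 ) ) & 1 ) != 0;
    }

    // Add a single character
    friend constexpr lut_sketch operator+( lut_sketch a, char ch ) noexcept
    {
        a.set( static_cast< unsigned char >( ch ), true );
        return a;
    }

    // Remove a single character
    friend constexpr lut_sketch operator-( lut_sketch a, char ch ) noexcept
    {
        a.set( static_cast< unsigned char >( ch ), false );
        return a;
    }

    // Set difference: clear every bit that is set in b
    friend constexpr lut_sketch operator-(
        lut_sketch a, lut_sketch const& b ) noexcept
    {
        for( int i = 0; i < 4; ++i )
            a.mask_[ i ] &= ~b.mask_[ i ];
        return a;
    }
};

constexpr lut_sketch vowels_sketch = "AEIOU" "aeiou";
constexpr auto vowels_and_y_sketch = vowels_sketch + 'y' + 'Y';
constexpr auto consonants_sketch =
    lut_sketch( "abcdefghijklmnopqrstuvwxyz" ) - vowels_sketch;

static_assert( vowels_sketch( 'a' ), "a is a vowel" );
static_assert( ! vowels_sketch( 'y' ), "y is not in the base set" );
static_assert( vowels_and_y_sketch( 'y' ), "y was added" );
static_assert( ! consonants_sketch( 'e' ), "vowels were removed" );
```

Because every operation is a handful of bitwise instructions on four 64-bit words, composing sets costs nothing at run time when the operands are compile-time constants.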