Boost C++ Libraries

...one of the most highly regarded and expertly designed C++ library projects in the world. Herb Sutter and Andrei Alexandrescu, C++ Coding Standards

libs/regex/doc/syntax_basic.qbk

[/ 
  Copyright 2006-2007 John Maddock.
  Distributed under the Boost Software License, Version 1.0.
  (See accompanying file LICENSE_1_0.txt or copy at
  http://www.boost.org/LICENSE_1_0.txt).
]


[section:basic_syntax POSIX Basic Regular Expression Syntax]

[h3 Synopsis]

The POSIX-Basic regular expression syntax is used by the Unix utility `sed`, 
and variations are used by `grep` and `emacs`.  You can construct POSIX 
basic regular expressions in Boost.Regex by passing the flag `basic` to the 
regex constructor (see [syntax_option_type]), for example:

   // e1 is a case sensitive POSIX-Basic expression:
   boost::regex e1(my_expression, boost::regex::basic);
   // e2 a case insensitive POSIX-Basic expression:
   boost::regex e2(my_expression, boost::regex::basic|boost::regex::icase);

[#boost_regex.posix_basic][h3 POSIX Basic Syntax]

In POSIX-Basic regular expressions, all characters are match themselves except 
for the following special characters:

[pre .\[\\*^$]

[h4 Wildcard:]

The single character '.' when used outside of a character set will match any 
single character except:

* The NULL character when the flag `match_no_dot_null` is passed to the 
matching algorithms.
* The newline character when the flag `match_not_dot_newline` is passed to 
the matching algorithms.

[h4 Anchors:]

A '^' character shall match the start of a line when used as the first 
character of an expression, or the first character of a sub-expression.

A '$' character shall match the end of a line when used as the last 
character of an expression, or the last character of a sub-expression.

[h4 Marked sub-expressions:]

A section beginning `\(` and ending `\)` acts as a marked sub-expression.  
Whatever matched the sub-expression is split out in a separate field by the 
matching algorithms.  Marked sub-expressions can also repeated, or 
referred-to by a back-reference.

[h4 Repeats:]

Any atom (a single character, a marked sub-expression, or a character class) 
can be repeated with the \* operator.

For example `a*` will match any number of letter a's repeated zero or more 
times (an atom repeated zero times matches an empty string), so the 
expression `a*b` will match any of the following:

[pre
b
ab
aaaaaaaab
]

An atom can also be repeated with a bounded repeat:

`a\{n\}`  Matches 'a' repeated exactly n times.

`a\{n,\}`  Matches 'a' repeated n or more times.

`a\{n, m\}`  Matches 'a' repeated between n and m times inclusive.

For example:

[pre ^a\{2,3\}$]

Will match either of:

[pre
aa
aaa
]

But neither of:

[pre
a
aaaa
]

It is an error to use a repeat operator, if the preceding construct can not be 
repeated, for example:

[pre a\(*\)]

Will raise an error, as there is nothing for the \* operator to be applied to.

[h4 Back references:]

An escape character followed by a digit /n/, where /n/ is in the range 1-9, 
matches the same string that was matched by sub-expression /n/.  For example 
the expression:

[pre ^\\(a\*\\)\[\^a\]\*\\1$]

Will match the string:

[pre aaabbaaa]

But not the string:

[pre aaabba]

[h4 Character sets:]

A character set is a bracket-expression starting with \[ and ending with \], 
it defines a set of characters, and matches any single character that is a 
member of that set.

A bracket expression may contain any combination of the following:

[h5 Single characters:]

For example `[abc]`, will match any of the characters 'a', 'b', or 'c'.

[h5 Character ranges:]

For example `[a-c]` will match any single character in the range 'a' to 'c'.  
By default, for POSIX-Basic regular expressions, a character /x/ is within the 
range /y/ to /z/, if it collates within that range; this results in 
locale specific behavior.  This behavior can be turned off by unsetting 
the `collate` option flag when constructing the regular expression 
- in which case whether a character appears within 
a range is determined by comparing the code points of the characters only.

[h5 Negation:]

If the bracket-expression begins with the ^ character, then it matches the 
complement of the characters it contains, for example `[^a-c]` matches 
any character that is not in the range a-c.

[h5 Character classes:]

An expression of the form `[[:name:]]` matches the named character class "name", 
for example `[[:lower:]]` matches any lower case character.  
See [link boost_regex.syntax.character_classes character class names].

[h5 Collating Elements:]

An expression of the form `[[.col.]` matches the collating element /col/.  
A collating element is any single character, or any sequence of 
characters that collates as a single unit.  Collating elements may also 
be used as the end point of a range, for example: `[[.ae.]-c]` matches 
the character sequence "ae", plus any single character in the range "ae"-c, 
assuming that "ae" is treated as a single collating element in the current locale.

Collating elements may be used in place of escapes (which are not 
normally allowed inside character sets), for example `[[.^.]abc]` would 
match either one of the characters 'abc^'.

As an extension, a collating element may also be specified via its 
symbolic name, for example:

[pre \[\[\.NUL\.\]\]]

matches a 'NUL' character.  
See [link boost_regex.syntax.collating_names collating element names].

[h5 Equivalence classes:]

An expression of the form `[[=col=]]`, matches any character or collating 
element whose primary sort key is the same as that for collating element 
/col/, as with collating elements the name /col/ may be a 
[link boost_regex.syntax.collating_names collating symbolic name].  
A primary sort key is one that ignores case, accentation, or 
locale-specific tailorings; so for example `[[=a=]]` matches any of 
the characters: a, '''À''', '''Á''', '''Â''', 
'''Ã''', '''Ä''', '''Å''', A, '''à''', '''á''', 
'''â''', '''ã''', '''ä''' and '''å'''.  
Unfortunately implementation of this is reliant on the platform's 
collation and localisation support; this feature can not be relied 
upon to work portably across all platforms, or even all locales on one platform.

[h5 Combinations:]

All of the above can be combined in one character set declaration, for 
example: `[[:digit:]a-c[.NUL.]].`

[h4 Escapes]

With the exception of the escape sequences \\{, \\}, \\(, and \\), 
which are documented above, an escape followed by any character matches 
that character.  This can be used to make the special characters 

[pre .\[\\\*^$]

"ordinary".  Note that the escape character loses its special meaning 
inside a character set, so `[\^]` will match either a literal '\\' or a '^'.

[h3 What Gets Matched]

When there is more that one way to match a regular expression, the 
"best" possible match is obtained using the 
[link boost_regex.syntax.leftmost_longest_rule leftmost-longest rule].

[h3 Variations]

[#boost_regex.grep_syntax][h4 Grep]

When an expression is compiled with the flag `grep` set, then the 
expression is treated as a newline separated list of 
[link boost_regex.posix_basic POSIX-Basic expressions], 
a match is found if any of the expressions in the list match, for example:

   boost::regex e("abc\ndef", boost::regex::grep);

will match either of the [link boost_regex.posix_basic POSIX-Basic expressions] 
"abc" or "def".

As its name suggests, this behavior is consistent with the Unix utility grep.

[h4 emacs]

In addition to the [link boost_regex.posix_basic POSIX-Basic features] 
the following characters are also special:

[table
[[Character][Description]]
[[+][repeats the preceding atom one or more times.]]
[[?][repeats the preceding atom zero or one times.]]
[[*?][A non-greedy version of *.]]
[[+?][A non-greedy version of +.]]
[[??][A non-greedy version of ?.]]
]

And the following escape sequences are also recognised:

[table
[[Escape][Description]]
[[\\|][specifies an alternative.]]
[[\\(?:  ...  \)][is a non-marking grouping construct - allows you to lexically group something without spitting out an extra sub-expression.]]
[[\\w][matches any word character.]]
[[\\W][matches any non-word character.]]
[[\\sx][matches any character in the syntax group x, the following 
   emacs groupings are supported: 's', ' ', '_', 'w', '.', ')', '(', '"', '\\'', '>' and '<'.  Refer to the emacs docs for details.]]
[[\\Sx][matches any character not in the syntax grouping x.]]
[[\\c and \\C][These are not supported.]]
[[\\`][matches zero characters only at the start of a buffer (or string being matched).]]
[[\\'][matches zero characters only at the end of a buffer (or string being matched).]]
[[\\b][matches zero characters at a word boundary.]]
[[\\B][matches zero characters, not at a word boundary.]]
[[\\<][matches zero characters only at the start of a word.]]
[[\\>][matches zero characters only at the end of a word.]]
]

Finally, you should note that emacs style regular expressions are matched 
according to the 
[link boost_regex.syntax.perl_syntax.what_gets_matched Perl "depth first search" rules].  
Emacs expressions are 
matched this way because they contain Perl-like extensions, that do not 
interact well with the 
[link boost_regex.syntax.leftmost_longest_rule POSIX-style leftmost-longest rule].

[h3 Options]

There are a [link boost_regex.ref.syntax_option_type.syntax_option_type_basic variety of flags] that may be combined with the `basic` and `grep` 
options when constructing the regular expression, in particular note 
that the 
[link boost_regex.ref.syntax_option_type.syntax_option_type_basic `newline_alt`, `no_char_classes`, `no-intervals`, `bk_plus_qm` 
and `bk_plus_vbar`] options all alter the syntax, while the 
[link boost_regex.ref.syntax_option_type.syntax_option_type_basic `collate` and `icase` options] modify how the case and locale sensitivity 
are to be applied.

[h3 References]

[@http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap09.html IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX ), Base Definitions and Headers, Section 9, Regular Expressions (FWD.1).]

[@http://www.opengroup.org/onlinepubs/000095399/utilities/grep.html IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX ), Shells and Utilities, Section 4, Utilities, grep (FWD.1).]

[@http://www.gnu.org/software/emacs/ Emacs Version 21.3.]

[endsect]