Boost.Locale
Boundary analysis

Basics

Boost.Locale provides a boundary analysis tool, allowing you to split text into characters, words, or sentences, and find appropriate places for line breaks.

Note:
This is not a trivial task:
  • A Unicode code point and a character are not equivalent. For example, the Hebrew word Shalom - "שָלוֹם" - consists of 4 characters but 6 code points (4 base letters and 2 diacritical marks).
  • Words are not separated by space characters in some languages, such as Japanese or Chinese.

Boost.Locale provides two major classes for boundary analysis:

  • segment_index - an index of text segments, such as words or sentences
  • boundary_point_index - an index of boundary points, i.e. positions between segments

Each of these classes takes an iterator type as a template parameter. Both classes accept in their constructor:

  • A boundary_type flag that selects the type of boundary analysis to perform
  • A pair of iterators that define the text range to be analyzed
  • A locale parameter (if not given, the global one is used)

For example:

namespace ba=boost::locale::boundary;
std::string text= ... ;
std::locale loc = ... ;
ba::segment_index<std::string::const_iterator> map(ba::word,text.begin(),text.end(),loc);

Each of them provides the members begin(), end() and find(), which allow iterating over the selected segments or boundaries in the text, or finding the segment or boundary that corresponds to a given iterator.

Convenience typedefs like ssegment_index or wcboundary_point_index are provided as well, where the "w", "u16" and "u32" prefixes select the character type wchar_t, char16_t or char32_t, and the "s" and "c" prefixes define whether std::basic_string<CharType>::const_iterator or CharType const * is used as the underlying iterator.

Iterating Over Segments

Basic Iteration

Text segment analysis is done using the segment_index class.

It provides a bidirectional iterator that returns a segment object. The segment represents a pair of iterators that define its range, together with the rule according to which it was selected. It can be automatically converted to a std::basic_string object.

To perform boundary analysis, we first create an index object and then iterate over it. For example:

using namespace boost::locale::boundary;
boost::locale::generator gen;
std::string text="To be or not to be, that is the question.";
// Create mapping of text for token iterator using the generated locale.
ssegment_index map(word,text.begin(),text.end(),gen("en_US.UTF-8")); 
// Print all "words" -- chunks between word boundaries
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
    std::cout <<"\""<< *it << "\", ";
std::cout << std::endl;

Would print:

"To", " ", "be", " ", "or", " ", "not", " ", "to", " ", "be", ",", " ", "that", " ", "is", " ", "the", " ", "question", ".",

The sentence "生きるか死ぬか、それが問題だ。" (from the Tatoeba database) would be split into the following segments in the ja_JP.UTF-8 (Japanese) locale:

"生", "きるか", "死", "ぬか", "、", "それが", "問題", "だ", "。", 

The boundary analysis done by Boost.Locale is much more sophisticated than simply splitting the text on white-space characters, even though it is not perfect.

Using Rules

The segments selection can be customized using rule() and full_select() member functions.

By default, segment_index's iterator returns every text segment defined by two boundary points, regardless of the way those points were selected. Thus, in the example above we saw text segments like "." or " " selected as words.

Using the rule() member function, we can specify a binary mask of rules we want to use for selecting boundary points, built from the word, line and sentence boundary rules.

For example, calling

map.rule(word_any);

before starting the iteration specifies a selection mask that fetches numbers, letters, Kana letters and ideographic characters, ignoring all non-word characters such as white space and punctuation marks.

So the code:

using namespace boost::locale::boundary;
std::string text="To be or not to be, that is the question.";
// Create mapping of text for token iterator using the global locale.
ssegment_index map(word,text.begin(),text.end()); 
// Define a rule
map.rule(word_any);
// Print all "words" -- chunks between word boundaries
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
    std::cout <<"\""<< *it << "\", ";
std::cout << std::endl;

Would print:

"To", "be", "or", "not", "to", "be", "that", "is", "the", "question",

And for the text text="生きるか死ぬか、それが問題だ。" and rule(word_ideo), the example above would print:

"生", "死", "問題",

You can check by which specific rules a segment was selected using the segment::rule() member function, comparing its result with a bit-mask of rules.

For example:

boost::locale::generator gen;
using namespace boost::locale::boundary;
std::string text="生きるか死ぬか、それが問題だ。";
ssegment_index map(word,text.begin(),text.end(),gen("ja_JP.UTF-8")); 
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it) {
    std::cout << "Segment " << *it << " contains: ";
    if(it->rule() & word_none)
        std::cout << "white space or punctuation marks ";
    if(it->rule() & word_kana)
        std::cout << "kana characters ";
    if(it->rule() & word_ideo)
        std::cout << "ideographic characters";
    std::cout<< std::endl;
}

Would print:

Segment 生 contains: ideographic characters
Segment きるか contains: kana characters 
Segment 死 contains: ideographic characters
Segment ぬか contains: kana characters 
Segment 、 contains: white space or punctuation marks 
Segment それが contains: kana characters 
Segment 問題 contains: ideographic characters
Segment だ contains: kana characters 
Segment 。 contains: white space or punctuation marks 

One important thing to note is that each segment is defined by a pair of boundary points, and the rule of its ending point determines whether the segment is selected or not.

In some cases this may not be what we actually want.

For example, consider the text:

Hello! How
are you?

And we want to fetch all sentences from the text.

The sentence rules have two options:

  • Split the text at points where a sentence terminator like ".!?" is detected: sentence_term
  • Split the text at points where a sentence separator like "line feed" is detected: sentence_sep

Naturally, to ignore sentence separators, we would call segment_index::rule(rule_type v) with the sentence_term parameter and then run the iterator:

boost::locale::generator gen;
using namespace boost::locale::boundary;
std::string text=   "Hello! How\n"
                    "are you?\n";
ssegment_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8")); 
map.rule(sentence_term);
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it) 
    std::cout << "Sentence [" << *it << "]" << std::endl;

However, we would not get the expected segments:

Sentence [Hello! ]
Sentence [are you?
]

The reason is that "How\n" is still considered a sentence, but one that was selected by a different rule.

This behavior can be changed by setting segment_index::full_select(bool) to true. It forces the iterator to join the current segment with all preceding segments that do not fit the required rule.

So we add this line:

map.full_select(true);

Right after the map.rule(sentence_term); call, and get the expected output:

Sentence [Hello! ]
Sentence [How
are you?
]

Locating Segments

Sometimes it is useful to find the segment that a specific iterator points into.

For example, when a user clicks at a specific point, we want to select the word at that location.

segment_index provides a find(base_iterator p) member function for this purpose.

This function returns an iterator to the segment that p points into.

For example:

std::string text="to be or ";
ssegment_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
ssegment_index::iterator  p = map.find(text.begin() + 4);
if(p!=map.end())
    std::cout << *p << std::endl;

Would print:

be
Note:

If the iterator lies inside a segment, that segment is returned. If the segment does not fit the selection rules, the first valid segment following the requested position is returned.

For example: For word boundary analysis with word_any rule:

  • "t|o be or " would point to "to" - the iterator is in the middle of the segment "to".
  • "to |be or " would point to "be" - the iterator is at the beginning of the segment "be".
  • "to| be or " would point to "be" - the iterator does not point to a segment with the required rule, so the next valid segment, "be", is selected.
  • "to be or| " would return the end iterator, as no valid segment was found.

Iterating Over Boundary Points

Basic Iteration

The boundary_point_index is similar to segment_index in its interface, but has a different role. Instead of returning text chunks (segments), it returns boundary_point objects that represent positions in the text - expressed as the base iterator used for iterating over the source text's C++ characters. The boundary_point object also provides a rule() member function that tells according to which rule the boundary point was selected.

Note:
The beginning and the ending of the text are considered boundary points, so even an empty text consists of at least one boundary point.

Let's look at an example that selects the first two sentences of a text:

using namespace boost::locale::boundary;
boost::locale::generator gen;

// our text sample
std::string const text="First sentence. Second sentence! Third one?";
// Create an index 
sboundary_point_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));

// Count two boundary points
sboundary_point_index::iterator p = map.begin(),e=map.end();
int count = 0;
while(p!=e && count < 2) {
    ++count;
    ++p;
}

if(p!=e) {
    std::cout   << "First two sentences are: " 
                << std::string(text.begin(),p->iterator()) 
                << std::endl;
}
else {
    std::cout   <<"There are fewer than two sentences in this "
                <<"text: " << text << std::endl;
}

Would print:

First two sentences are: First sentence. Second sentence!

Using Rules

Similarly to segment_index, boundary_point_index provides a rule(rule_type mask) member function to filter the boundary points that interest us.

It allows setting word, line and sentence rules for filtering boundary points.

Let's change the example above a little:

// our text sample
std::string const text= "First sentence. Second\n"
                        "sentence! Third one?";

If we run our program as-is on this sample, we get:

First two sentences are: First sentence. Second

This is not what we expected, as "Second\n" is considered an independent sentence, separated off by the line separator "Line Feed".

However, we can set the rule sentence_term, so that the iterator uses only boundary points created by sentence terminators like ".!?".

So by adding:

map.rule(sentence_term);

Right after the creation of the index, we get the desired output:

First two sentences are: First sentence. Second
sentence! 

You can also use the boundary_point::rule() member function to learn why a boundary point was created, by comparing it with an appropriate mask.

For example:

using namespace boost::locale::boundary;
boost::locale::generator gen;
// our text sample
std::string const text= "First sentence. Second\n"
                        "sentence! Third one?";
sboundary_point_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));

for(sboundary_point_index::iterator p = map.begin(),e=map.end();p!=e;++p) {
    if(p->rule() & sentence_term)
        std::cout << "There is a sentence terminator: ";
    else if(p->rule() & sentence_sep)
        std::cout << "There is a sentence separator: ";
    if(p->rule()!=0) // print if some rule exists
        std::cout   << "[" << std::string(text.begin(),p->iterator()) 
                    << "|" << std::string(p->iterator(),text.end()) 
                    << "]\n";
}

Would give the following output:

There is a sentence terminator: [First sentence. |Second
sentence! Third one?]
There is a sentence separator: [First sentence. Second
|sentence! Third one?]
There is a sentence terminator: [First sentence. Second
sentence! |Third one?]
There is a sentence terminator: [First sentence. Second
sentence! Third one?|]

Locating Boundary Points

Sometimes it is useful to find the boundary point that corresponds to a given iterator.

boundary_point_index provides an iterator find(base_iterator p) member function for this purpose.

It returns an iterator to the boundary point at p's location, or at the location following it if p does not point to an appropriate position.

For example, for word boundary analysis:

  • If the base iterator points to "to |be", then the returned boundary point is "to |be" (the same position).
  • If the base iterator points to "t|o be", then the returned boundary point is "to| be" (the next valid position).

For example, if we want to select six words around a specific boundary point, we can use the following code:

using namespace boost::locale::boundary;
boost::locale::generator gen;
// our text sample
std::string const text= "To be or not to be, that is the question.";

// Create a mapping
sboundary_point_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
// Ignore white space
map.rule(word_any);

// define our arbitrary point
std::string::const_iterator pos = text.begin() + 12; // "no|t";

// Get the search range
sboundary_point_index::iterator 
    begin =map.begin(),
    end = map.end(),
    it = map.find(pos); // find a boundary

// go 3 words backward
for(int count = 0;count <3 && it!=begin; count ++) 
    --it;

// Save the start
std::string::const_iterator start = *it;

// go 6 words forward
for(int count = 0;count < 6 && it!=end; count ++)
    ++it;

// make sure we are at a valid position
if(it==end)
    --it;

// print the text
std::cout << std::string(start,it->iterator()) << std::endl;

That would print:

 be or not to be, that