Searching Algorithms

Boyer-Moore Search

Overview

The header file 'boyer_moore.hpp' contains an implementation of the Boyer-Moore algorithm for searching sequences of values.

The Boyer–Moore string search algorithm is a particularly efficient string searching algorithm, and it has been the standard benchmark for the practical string search literature. The Boyer-Moore algorithm was invented by Bob Boyer and J. Strother Moore, and published in the October 1977 issue of the Communications of the ACM , and a copy of that article is available at http://www.cs.utexas.edu/~moore/publications/fstrpos.pdf.

The Boyer-Moore algorithm uses two precomputed tables to give better performance than a naive search. These tables depend on the pattern being searched for, and give the Boyer-Moore algorithm larger a memory footprint and startup costs than a simpler algorithm, but these costs are recovered quickly during the searching process, especially if the pattern is longer than a few elements.

However, the Boyer-Moore algorithm cannot be used with comparison predicates like std::search.

Nomenclature: I refer to the sequence being searched for as the "pattern", and the sequence being searched in as the "corpus".

Interface

For flexibility, the Boyer-Moore algorithm has two interfaces; an object-based interface and a procedural one. The object-based interface builds the tables in the constructor, and uses operator () to perform the search. The procedural interface builds the table and does the search all in one step. If you are going to be searching for the same pattern in multiple corpora, then you should use the object interface, and only build the tables once.

Here is the object interface:

template <typename patIter>
class boyer_moore {
public:
    boyer_moore ( patIter first, patIter last );
    ~boyer_moore ();

    template <typename corpusIter>
    corpusIter operator () ( corpusIter corpus_first, corpusIter corpus_last );
    };

and here is the corresponding procedural interface:

template <typename patIter, typename corpusIter>
corpusIter boyer_moore_search (
        corpusIter corpus_first, corpusIter corpus_last,
        patIter pat_first, patIter pat_last );

Each of the functions is passed two pairs of iterators. The first two define the corpus and the second two define the pattern. Note that the two pairs need not be of the same type, but they do need to "point" at the same type. In other words, patIter::value_type and curpusIter::value_type need to be the same type.

The return value of the function is an iterator pointing to the start of the pattern in the corpus. If the pattern is not found, it returns the end of the corpus (corpus_last).

Performance

The execution time of the Boyer-Moore algorithm, while still linear in the size of the string being searched, can have a significantly lower constant factor than many other search algorithms: it doesn't need to check every character of the string to be searched, but rather skips over some of them. Generally the algorithm gets faster as the pattern being searched for becomes longer. Its efficiency derives from the fact that with each unsuccessful attempt to find a match between the search string and the text it is searching, it uses the information gained from that attempt to rule out as many positions of the text as possible where the string cannot match.

Memory Use

The algorithm allocates two internal tables. The first one is proportional to the length of the pattern; the second one has one entry for each member of the "alphabet" in the pattern. For (8-bit) character types, this table contains 256 entries.

Complexity

The worst-case performance to find a pattern in the corpus is O(N) (linear) time; that is, proportional to the length of the corpus being searched. In general, the search is sub-linear; not every entry in the corpus need be checked.

Exception Safety

Both the object-oriented and procedural versions of the Boyer-Moore algorithm take their parameters by value and do not use any information other than what is passed in. Therefore, both interfaces provide the strong exception guarantee.

Notes

When using the object-based interface, the pattern must remain unchanged for during the searches; i.e, from the time the object is constructed until the final call to operator () returns.
The Boyer-Moore algorithm requires random-access iterators for both the pattern and the corpus.

Customization points

The Boyer-Moore object takes a traits template parameter which enables the caller to customize how one of the precomputed tables is stored. This table, called the skip table, contains (logically) one entry for every possible value that the pattern can contain. When searching 8-bit character data, this table contains 256 elements. The traits class defines the table to be used.

The default traits class uses a boost::array for small 'alphabets' and a tr1::unordered_map for larger ones. The array-based skip table gives excellent performance, but could be prohibitively large when the 'alphabet' of elements to be searched grows. The unordered_map based version only grows as the number of unique elements in the pattern, but makes many more heap allocations, and gives slower lookup performance.

To use a different skip table, you should define your own skip table object and your own traits class, and use them to instantiate the Boyer-Moore object. The interface to these objects is described TBD.

Boyer-Moore-Horspool Search

Overview

The header file 'boyer_moore_horspool.hpp' contains an implementation of the Boyer-Moore-Horspool algorithm for searching sequences of values.

The Boyer-Moore-Horspool search algorithm was published by Nigel Horspool in 1980. It is a refinement of the Boyer-Moore algorithm that trades space for time. It uses less space for internal tables than Boyer-Moore, and has poorer worst-case performance.

The Boyer-Moore-Horspool algorithm cannot be used with comparison predicates like std::search.

Interface

Nomenclature: I refer to the sequence being searched for as the "pattern", and the sequence being searched in as the "corpus".

For flexibility, the Boyer-Moore-Horspool algorithm has two interfaces; an object-based interface and a procedural one. The object-based interface builds the tables in the constructor, and uses operator () to perform the search. The procedural interface builds the table and does the search all in one step. If you are going to be searching for the same pattern in multiple corpora, then you should use the object interface, and only build the tables once.

Here is the object interface:

template <typename patIter>
class boyer_moore_horspool {
public:
    boyer_moore_horspool ( patIter first, patIter last );
    ~boyer_moore_horspool ();

    template <typename corpusIter>
    corpusIter operator () ( corpusIter corpus_first, corpusIter corpus_last );
    };

and here is the corresponding procedural interface:

template <typename patIter, typename corpusIter>
corpusIter boyer_moore_horspool_search (
        corpusIter corpus_first, corpusIter corpus_last,
        patIter pat_first, patIter pat_last );

Each of the functions is passed two pairs of iterators. The first two define the corpus and the second two define the pattern. Note that the two pairs need not be of the same type, but they do need to "point" at the same type. In other words, patIter::value_type and curpusIter::value_type need to be the same type.

The return value of the function is an iterator pointing to the start of the pattern in the corpus. If the pattern is not found, it returns the end of the corpus (corpus_last).

Performance

The execution time of the Boyer-Moore-Horspool algorithm is linear in the size of the string being searched; it can have a significantly lower constant factor than many other search algorithms: it doesn't need to check every character of the string to be searched, but rather skips over some of them. Generally the algorithm gets faster as the pattern being searched for becomes longer. Its efficiency derives from the fact that with each unsuccessful attempt to find a match between the search string and the text it is searching, it uses the information gained from that attempt to rule out as many positions of the text as possible where the string cannot match.

Memory Use

The algorithm an internal table that has one entry for each member of the "alphabet" in the pattern. For (8-bit) character types, this table contains 256 entries.

Complexity

The worst-case performance is O(m x n), where m is the length of the pattern and n is the length of the corpus. The average time is O(n). The best case performance is sub-linear, and is, in fact, identical to Boyer-Moore, but the initialization is quicker and the internal loop is simpler than Boyer-Moore.

Exception Safety

Both the object-oriented and procedural versions of the Boyer-Moore-Horspool algorithm take their parameters by value and do not use any information other than what is passed in. Therefore, both interfaces provide the strong exception guarantee.

Notes

When using the object-based interface, the pattern must remain unchanged for during the searches; i.e, from the time the object is constructed until the final call to operator () returns.
The Boyer-Moore-Horspool algorithm requires random-access iterators for both the pattern and the corpus.

Customization points

The Boyer-Moore-Horspool object takes a traits template parameter which enables the caller to customize how the precomputed table is stored. This table, called the skip table, contains (logically) one entry for every possible value that the pattern can contain. When searching 8-bit character data, this table contains 256 elements. The traits class defines the table to be used.

The default traits class uses a boost::array for small 'alphabets' and a tr1::unordered_map for larger ones. The array-based skip table gives excellent performance, but could be prohibitively large when the 'alphabet' of elements to be searched grows. The unordered_map based version only grows as the number of unique elements in the pattern, but makes many more heap allocations, and gives slower lookup performance.

To use a different skip table, you should define your own skip table object and your own traits class, and use them to instantiate the Boyer-Moore-Horspool object. The interface to these objects is described TBD.

Knuth-Morris-Pratt Search

Overview

The header file 'knuth_morris_pratt.hpp' contains an implementation of the Knuth-Morris-Pratt algorithm for searching sequences of values.

The basic premise of the Knuth-Morris-Pratt algorithm is that when a mismatch occurs, there is information in the pattern being searched for that can be used to determine where the next match could begin, enabling the skipping of some elements of the corpus that have already been examined.

It does this by building a table from the pattern being searched for, with one entry for each element in the pattern.

The algorithm was conceived in 1974 by Donald Knuth and Vaughan Pratt, and independently by James H. Morris. The three published it jointly in 1977 in the SIAM Journal on Computing http://citeseer.ist.psu.edu/context/23820/0

However, the Knuth-Morris-Pratt algorithm cannot be used with comparison predicates like std::search.

Interface

Nomenclature: I refer to the sequence being searched for as the "pattern", and the sequence being searched in as the "corpus".

For flexibility, the Knuth-Morris-Pratt algorithm has two interfaces; an object-based interface and a procedural one. The object-based interface builds the table in the constructor, and uses operator () to perform the search. The procedural interface builds the table and does the search all in one step. If you are going to be searching for the same pattern in multiple corpora, then you should use the object interface, and only build the tables once.

Here is the object interface:

template <typename patIter>
class knuth_morris_pratt {
public:
    knuth_morris_pratt ( patIter first, patIter last );
    ~knuth_morris_pratt ();

    template <typename corpusIter>
    corpusIter operator () ( corpusIter corpus_first, corpusIter corpus_last );
    };

and here is the corresponding procedural interface:

template <typename patIter, typename corpusIter>
corpusIter knuth_morris_pratt_search (
        corpusIter corpus_first, corpusIter corpus_last,
        patIter pat_first, patIter pat_last );

Each of the functions is passed two pairs of iterators. The first two define the corpus and the second two define the pattern. Note that the two pairs need not be of the same type, but they do need to "point" at the same type. In other words, patIter::value_type and curpusIter::value_type need to be the same type.

The return value of the function is an iterator pointing to the start of the pattern in the corpus. If the pattern is not found, it returns the end of the corpus (corpus_last).

Performance

The execution time of the Knuth-Morris-Pratt algorithm is linear in the size of the string being searched. Generally the algorithm gets faster as the pattern being searched for becomes longer. Its efficiency derives from the fact that with each unsuccessful attempt to find a match between the search string and the text it is searching, it uses the information gained from that attempt to rule out as many positions of the text as possible where the string cannot match.

When using the object-based interface, the pattern must remain unchanged for during the searches; i.e, from the time the object is constructed until the final call to operator () returns.
The Knuth-Morris-Pratt algorithm requires random-access iterators for both the pattern and the corpus. It should be possible to write this to use bidirectional iterators (or possibly even forward ones), but this implementation does not do that.