Boost C++ Libraries

...one of the most highly regarded and expertly designed C++ library projects in the world. Herb Sutter and Andrei Alexandrescu, C++ Coding Standards

PrevUpHomeNext

String Sort

string_sort is a hybrid radix-based/comparison-based algorithm that sorts strings of characters (or arrays of binary data) in ascending order.

The simplest version (no functors) sorts strings of items that can cast to an unsigned data type (such as an unsigned char), have a < operator, have a size function, and have a data() function that returns a pointer to an array of characters, such as a std::string. The functor version can sort any data type that has a strict weak ordering, via templating, but requires definitions of a get_char (acts like x[offset] on a string or a byte array), get_length (returns length of the string being sorted), and a comparison functor. Individual characters returned by get_char must support the != operator and have an unsigned value that defines their lexicographical order.

This algorithm is not efficient for character types larger than 2 bytes each, and is optimized for one-byte character strings. For this reason, std::sort will be called instead if the character type is of size > 2.

string_sort has a special optimization for identical substrings. This adds some overhead on random data, but identical substrings are common in real strings.

reverse_string_sort sorts strings in reverse (decending) order, but is otherwise identical. string_sort is sufficiently flexible that it should sort any data type that std::sort can, assuming the user provides appropriate functors that index into a key.

Windows String Sort

OSX String Sort

See stringfunctorsample.cpp for an example of how to sort structs using a string key and all functors:

struct lessthan {
  inline bool operator()(const DATA_TYPE &x, const DATA_TYPE &y) const {
    return x.a < y.a;
  }
};
struct bracket {
  inline unsigned char operator()(const DATA_TYPE &x, size_t offset) const {
    return x.a[offset];
  }
};
struct getsize {
  inline size_t operator()(const DATA_TYPE &x) const{ return x.a.size(); }
};

and these functors are used thus:

string_sort(array.begin(), array.end(), bracket(), getsize(), lessthan());

See generalizedstruct.cpp for a working example of a generalized approach to sort structs by a sequence of integer, float, and multiple string keys using string_sort:

struct DATA_TYPE {
  time_t birth;
  float net_worth;
  string first_name;
  string last_name;
};

static const int birth_size = sizeof(time_t);
static const int first_name_offset = birth_size + sizeof(float);
static const boost::uint64_t base_mask = 0xff;

struct lessthan {
  inline bool operator()(const DATA_TYPE &x, const DATA_TYPE &y) const {
    if (x.birth != y.birth) {
      return x.birth < y.birth;
    }
    if (x.net_worth != y.net_worth) {
      return x.net_worth < y.net_worth;
    }
    if (x.first_name != y.first_name) {
      return x.first_name < y.first_name;
    }
    return x.last_name < y.last_name;
  }
};

struct bracket {
  inline unsigned char operator()(const DATA_TYPE &x, size_t offset) const {
    // Sort date as a signed int, returning the appropriate byte.
    if (offset < birth_size) {
      const int bit_shift = 8 * (birth_size - offset - 1);
      unsigned char result = (x.birth & (base_mask << bit_shift)) >> bit_shift;
      // Handling the sign bit.  Unnecessary if the data is always positive.
      if (offset == 0) {
        return result ^ 128;
      }

      return result;
    }

    // Sort a signed float.  This requires reversing the order of negatives
    // because of the way floats are represented in bits.
    if (offset < first_name_offset) {
      const int bit_shift = 8 * (first_name_offset - offset - 1);
      unsigned key = float_mem_cast<float, unsigned>(x.net_worth);
      unsigned char result = (key & (base_mask << bit_shift)) >> bit_shift;
      // Handling the sign.
      if (x.net_worth < 0) {
        return 255 - result;
      }
      // Increasing positives so they are higher than negatives.
      if (offset == birth_size) {
        return 128 + result;
      }

      return result;
    }

    // Sort a string that is before the end.  This approach supports embedded
    // nulls.  If embedded nulls are not required, then just delete the "* 2"
    // and the inside of the following if just becomes:
    // return x.first_name[offset - first_name_offset];
    const unsigned first_name_end_offset =
      first_name_offset + x.first_name.size() * 2;
    if (offset < first_name_end_offset) {
      int char_offset = offset - first_name_offset;
      // This signals that the string continues.
      if (!(char_offset & 1)) {
        return 1;
      }
      return x.first_name[char_offset >> 1];
    }

    // This signals that the string has ended, so that shorter strings come
    // before longer ones.
    if (offset == first_name_end_offset) {
      return 0;
    }

    // The final string needs no special consideration.
    return x.last_name[offset - first_name_end_offset - 1];
  }
};

struct getsize {
  inline size_t operator()(const DATA_TYPE &x) const {
    return first_name_offset + x.first_name.size() * 2 + 1 +
      x.last_name.size();
  }
};
string_sort(array.begin(), array.end(), bracket(), getsize(), lessthan());

Other examples:

String sort.

Reverse string sort.

Wide character string sort.

Case insensitive string sort.

Sort structs using a string key and indexing functors.

Sort structs using a string keynd all functors in reverse order.


PrevUpHomeNext