Boost.Sort

Introduction

The Boost.Sort library provides a generic implementation of high-speed sorting algorithms that outperform those in the C++ standard in both average and worst case performance when there are over 1000 elements in the list to sort.

They fall back to STL std::sort on small data sets.

	Warning
	These algorithms all only work on random access iterators.

They are hybrids using both radix and comparison-based sorting, specialized to sorting common data types, such as integers, floats, and strings.

These algorithms are encoded in a generic fashion and accept functors, enabling them to sort any object that can be processed like these basic data types. In the case of string_sort, this includes anything with a defined strict-weak-ordering that std::sort can sort, but writing efficient functors for some complex key types may not be worth the additional effort relative to just using std::sort, depending on how important speed is to your application. Sample usages are available in the example directory.

Unlike many radix-based algorithms, the underlying spreadsort algorithm is designed around worst-case performance. It performs better on chunky data (where it is not widely distributed), so that on real data it can perform substantially better than on random data. Conceptually, spreadsort can sort any data for which an absolute ordering can be determined, and string_sort is sufficiently flexible that this should be possible.

Situations where spreadsort is fastest relative to std::sort:

Large number of elements to sort (N >= 10000).
Slow comparison function (such as floating-point numbers on x86 processors or strings).
Large data elements (such as key + data sorted on a key).
Completely sorted data when spreadsort has an optimization to quit early in this case.

Situations where spreadsort is slower than std::sort:

Data sorted in reverse order. Both std::sort and spreadsort are faster on reverse-ordered data than randomized data, but std::sort speeds up more in this special case.
Very small amounts of data (< 1000 elements). For this reason there is a fallback in spreadsort to std::sort if the input size is less than 1000, so performance is identical for small amounts of data in practice.

These functions are defined in namespace boost::sort::spreadsort.

Overloading

	Tip
	In the Boost.Sort C++ Reference section, click on the appropriate overload, for example `float_sort(RandomAccessIter, RandomAccessIter, Right_shift, Compare);` to get full details of that overload.

Each of integer_sort, float_sort, and string_sort have 3 main versions: The base version, which takes a first iterator and a last iterator, just like std::sort:

integer_sort(array.begin(), array.end());
float_sort(array.begin(), array.end());
string_sort(array.begin(), array.end());

The version with an overridden shift functor, providing flexibility in case the operator>> already does something other than a bitshift. The rightshift functor takes two args, first the data type, and second a natural number of bits to shift right.

For string_sort this variant is slightly different; it needs a bracket functor equivalent to operator[], taking a number corresponding to the character offset, along with a second getlength functor to get the length of the string in characters. In all cases, this operator must return an integer type that compares with the operator< to provide the intended order (integers can be negated to reverse their order).

In other words:

rightshift(A, n) < rightshift(B, n) -> A < B

integer_sort(array.begin(), array.end(), rightshift());

string_sort(array.begin(), array.end(), bracket(), getsize());

See rightshiftsample.cpp for a working example of integer sorting with a rightshift functor.

And a version with a comparison functor for maximum flexibility. This functor must provide the same sorting order as the integers returned by the rightshift:

rightshift(A, n) < rightshift(B, n) -> compare(A, B)

integer_sort(array.begin(), array.end(), negrightshift(), std::greater<DATA_TYPE>());

Examples of functors are:

struct lessthan {
  inline bool operator()(const DATA_TYPE &x, const DATA_TYPE &y) const {
    return x.a < y.a;
  }
};

struct bracket {
  inline unsigned char operator()(const DATA_TYPE &x, size_t offset) const {
    return x.a[offset];
  }
};

struct getsize {
  inline size_t operator()(const DATA_TYPE &x) const{ return x.a.size(); }
};

and these functors are used thus:

string_sort(array.begin(), array.end(), bracket(), getsize(), lessthan());

See stringfunctorsample.cpp for a working example of sorting strings with all functors.

Performance

The spreadsort algorithm is a hybrid algorithm; when the number of elements being sorted is below a certain number, comparison-based sorting is used. Above it, radix sorting is used. The radix-based algorithm will thus cut up the problem into small pieces, and either completely sort the data based upon its radix if the data is clustered, or finish sorting the cut-down pieces with comparison-based sorting.

The Spreadsort algorithm dynamically chooses either comparison-based or radix-based sorting when recursing, whichever provides better worst-case performance. This way worst-case performance is guaranteed to be the better of 𝑶(N⋅log2(N)) comparisons and 𝑶(N⋅log2(K/S + S)) operations where

N is the number of elements being sorted,
K is the length in bits of the key, and
S is a constant.

This results in substantially improved performance for large N; integer_sort tends to be 50% to 2X faster than std::sort, while float_sort and _string_sort are roughly 2X faster than std::sort.

Performance graphs are provided for integer_sort, float_sort, and string_sort in their description.

Runtime Performance comparisons and graphs were made on a Core 2 Duo laptop running Windows Vista 64 with MSVC 8.0, and an old G4 laptop running Mac OSX with gcc. Boost bjam/b2 was used to control compilation.

Direct performance comparisons on a newer x86 system running Ubuntu, with the fallback to std::sort at lower input sizes disabled are below.

	Note
	The fallback to std::sort for smaller input sizes prevents the worse performance seen on the left sides of the first two graphs.

integer_sort starts to become faster than std::sort at about 1000 integers (4000 bytes), and string_sort becomes faster than std::sort at slightly fewer bytes (as few as 30 strings).

	Note
	The 4-threaded graph has 4 threads doing separate sorts simultaneously (not splitting up a single sort) as a test for thread cache collision and other multi-threaded performance issues.

float_sort times are very similar to integer_sort times.

Histogramming with a fixed maximum number of splits is used because it reduces the number of cache misses, thus improving performance relative to the approach described in detail in the original SpreadSort publication.

The importance of cache-friendly histogramming is described in Arne Maus, Adaptive Left Reflex, though without the worst-case handling described below.

The time taken per radix iteration is:

𝑶(N) iterations over the data

𝑶(N) integer-type comparisons (even for _float_sort and string_sort)

𝑶(N) swaps

𝑶(2^S) bin operations.

To obtain 𝑶(N) worst-case performance per iteration, the restriction S <= log2(N) is applied, and 𝑶(2^S) becomes 𝑶(N). For each such iteration, the number of unsorted bits log2(range) (referred to as K) per element is reduced by S. As S decreases depending upon the amount of elements being sorted, it can drop from a maximum of S_max to the minimum of S_min.

Assumption: std::sort is assumed to be 𝑶(N*log2(N)), as introsort exists and is commonly used. (If you have a quibble with this please take it up with the implementor of your std::sort; you're welcome to replace the recursive calls to std::sort to calls to introsort if your std::sort library call is poorly implemented).

Introsort is not included with this algorithm for simplicity and because the implementor of the std::sort call is assumed to know what they're doing.

To maintain a minimum value for S (S_min), comparison-based sorting has to be used to sort when n <= log2(meanbinsize), where log2(meanbinsize) (lbs) is a small constant, usually between 0 and 4, used to minimize bin overhead per element. There is a small corner-case where if K < S_min and n >= 2^K, then the data can be sorted in a single radix-based iteration with an S = K (this bucketsorting special case is by default only applied to float_sort). So for the final recursion, worst-case performance is:

1 radix-based iteration if K <= S_min,

or S_min + lbs comparison-based iterations if K > S_min but n <= 2^{(S_min + lbs)}.

So for the final iteration, worst-case runtime is 𝑶(N*(S_min + lbs)) but if K > S_min and N > 2^{(S_min + lbs)} then more than 1 radix recursion will be required.

For the second to last iteration, K <= S_min * 2 + 1 can be handled, (if the data is divided into 2^{(S_min + 1)} pieces) or if N < 2^{(S_min + lbs + 1)}, then it is faster to fallback to std::sort.

In the case of a radix-based sort plus recursion, it will take 𝑶(N*(S_min + lbs)) + 𝑶(N) = 𝑶(N*(S_min + lbs + 1)) worst-case time, as K_remaining = K_start - (S_min + 1), and K_start <= S_min * 2 + 1.

Alternatively, comparison-based sorting is used if N < 2^{(S_min +
lbs + 1)}, which will take 𝑶(N*(S_min + lbs + 1)) time.

So either way 𝑶(N*(S_min + lbs + 1)) is the worst-case time for the second to last iteration, which occurs if K <= S_min * 2 + 1 or N < 2^{(S_min + lbs + 1)}.

This continues as long as S_min <= S <= S_max, so that for K_m <= K_(m-1) + S_min + m where m is the maximum number of iterations after this one has finished, or where N < 2^{(S_min + lbs + m)}, then the worst-case runtime is 𝑶(N*(S_min + lbs + m)).

K_m at m <= (S_max - S_min) works out to:

K_1 <= (S_min) + S_min + 1 <= 2S_min + 1

K_2 <= (2S_min + 1) + S_min + 2

as the sum from 0 to m is m(m + 1)/2

K_m <= (m + 1)S_min + m(m + 1)/2 <= (S_min + m/2)(m + 1)

substituting in S_max - S_min for m

K_(S_max - S_min) <= (S_min + (S_max - S_min)/2)*(S_max - S_min + 1)

K_(S_max - S_min) <= (S_min + S_max) * (S_max - S_min + 1)/2

Since this involves S_max - S_min + 1 iterations, this works out to dividing K into an average (S_min + S_max)/2 pieces per iteration.

To finish the problem from this point takes 𝑶(N * (S_max - S_min)) for m iterations, plus the worst-case of 𝑶(N*(S_min + lbs)) for the last iteration, for a total of 𝑶(N *(S_max + lbs)) time.

When m > S_max - S_min, the problem is divided into S_max pieces per iteration, or std::sort is called if N < 2^(m + S_min + lbs). For this range:

K_m <= K_(m - 1) + S_max, providing runtime of

𝑶(N *((K - K_(S_max - S_min))/S_max + S_max + lbs)) if recursive,

or 𝑶(N * log(2^(m + S_min + lbs))) if comparison-based,

which simplifies to 𝑶(N * (m + S_min + lbs)), which substitutes to 𝑶(N * ((m - (S_max - S_min)) + S_max + lbs)), which given that m - (S_max - S_min) <= (K - K_(S_max - S_min))/S_max (otherwise a lesser number of radix-based iterations would be used)

also comes out to 𝑶(N *((K - K_(S_max - S_min))/S_max + S_max + lbs)).

Asymptotically, for large N and large K, this simplifies to:

𝑶(N * (K/S_max + S_max + lbs)),

simplifying out the constants related to the S_max - S_min range, providing an additional 𝑶(N * (S_max + lbs)) runtime on top of the 𝑶(N * (K/S)) performance of LSD radix sort, but without the 𝑶(N) memory overhead. For simplicity, because lbs is a small constant (0 can be used, and performs reasonably), it is ignored when summarizing the performance in further discussions. By checking whether comparison-based sorting is better, Spreadsort is also 𝑶(N*log(N)), whichever is better, and unlike LSD radix sort, can perform much better than the worst-case if the data is either evenly distributed or highly clustered.

This analysis was for integer_sort and float_sort. string_sort differs in that S_min = S_max = sizeof(Char_type) * 8, lbs is 0, and that std::sort's comparison is not a constant-time operation, so strictly speaking string_sort runtime is

𝑶(N * (K/S_max + (S_max comparisons))).

Worst-case, this ends up being 𝑶(N * K) (where K is the mean string length in bytes), as described for American flag sort, which is better than the

𝑶(N * K * log(N))

worst-case for comparison-based sorting.

Tuning

integer_sort and float_sort have tuning constants that control how the radix-sorting portion of those algorithms work. The ideal constant values for integer_sort and float_sort vary depending on the platform, compiler, and data being sorted. By far the most important constant is max_splits, which defines how many pieces the radix-sorting portion splits the data into per iteration.

The ideal value of max_splits depends upon the size of the L1 processor cache, and is between 10 and 13 on many systems. A default value of 11 is used. For mostly-sorted data, a much larger value is better, as swaps (and thus cache misses) are rare, but this hurts runtime severely for unsorted data, so is not recommended.

On some x86 systems, when the total number of elements being sorted is small ( less than 1 million or so), the ideal max_splits can be substantially larger, such as 17. This is suspected to be because all the data fits into the L2 cache, and misses from L1 cache to L2 cache do not impact performance as severely as misses to main memory. Modifying tuning constants other than max_splits is not recommended, as the performance improvement for changing other constants is usually minor.

If you can afford to let it run for a day, and have at least 1GB of free memory, the perl command: ./tune.pl -large -tune (UNIX) or perl tune.pl -large -tune -windows (Windows) can be used to automatically tune these constants. This should be run from the libs/sort directory inside the boost home directory. This will work to identify the ideal constants.hpp settings for your system, testing on various distributions in a 20 million element (80MB) file, and additionally verifies that all sorting routines sort correctly across various data distributions. Alternatively, you can test with the file size you're most concerned with ./tune.pl number -tune (UNIX) or perl tune.pl number -tune -windows (Windows). Substitute the number of elements you want to test with for number. Otherwise, just use the options it comes with, they're decent. With default settings ./tune.pl -tune (UNIX) perl tune.pl -tune -windows (Windows), the script will take hours to run (less than a day), but may not pick the correct max_splits if it is over 10. Alternatively, you can add the -small option to make it take just a few minutes, tuning for smaller vector sizes (one hundred thousand elements), but the resulting constants may not be good for large files (see above note about max_splits on Windows).

The tuning script can also be used just to verify that sorting works correctly on your system, and see how much of a speedup it gets, by omiting the "-tune" option. This runs at the end of tuning runs. Default args will take about an hour to run and give accurate results on decent-sized test vectors. ./tune.pl -small (UNIX) perl tune.pl -small -windows (Windows) is a faster option, that tests on smaller vectors and isn't as accurate.

If any differences are encountered during tuning, please call tune.pl with -debug > log_file_name. If the resulting log file contains compilation or permissions issues, it is likely an issue with your setup. If some other type of error is encountered (or result differences), please send them to the library author at spreadsort@gmail.com. Including the zipped input.txt that was being used is also helpful.

Boost C++ Libraries

Steven Ross

Overview