 Boost C++ Libraries

...one of the most highly regarded and expertly designed C++ library projects in the world.

3.- Parallel Algorithms

3.1- block_indirect_sort
Description
Benchmark
Programming
Internal Description
Design Process
3.2.- Sample_Sort
Description
Programming
3.3.- Parallel_stable_sort
Description
Programming
3.4- Linux Benchmarks
3.5- Windows Benchmark
Parallel Algorithms

This table provides a brief description of the parallel algorithms of the library.

|       |                        |                              |
Algorithm             |Stable |   Additional memory    |Best, average, and worst case |
----------------------+-------+------------------------+------------------------------+
block_indirect_sort   |  no   |block_size * num_threads| N, N LogN , N LogN           |
sample_sort           |  yes  |        N               | N, N LogN , N LogN           |
parallel_stable_sort  |  yes  |      N / 2             | N, N LogN , N LogN           |
|       |                        |                              |

• Sample_sort is a implementation of the Samplesort algorithm done by Francisco Tapia.
• Parallel_stable_sort is based on the samplesort algorithm, but using a half of the memory used by sample_sort, conceived and implemented by Francisco Tapia.
• Block_indirect_sort is a novel parallel sort algorithm, very fast, with low additional memory consumption, conceived and implemented by Francisco Tapia.

The block_size is an internal parameter of the algorithm, which in order to achieve the highest speed, changes according to the size of the objects to sort according to the next table. The strings use a block_size of 128.

|        |         |         |         |         |         |          |
object size (bytes)             | 1 - 15 | 16 - 31 | 32 - 63 | 64 - 127|128 - 255|256 - 511| 512 -    |
--------------------------------+--------+---------+---------+---------+---------+---------+----------+
block_size (number of elements) |  4096  |  2048   |   1024  |   768   |   512   |   256   |  128     |
|        |         |         |         |         |         |          |

The parallel algorithms have a integer parameter indicating the number of threads to use in the sorting process, which always is the last value in the call.

The default value (if left unspecified) is the number of HW threads of the machine where the program is running provided by std::thread::hardware_concurrency(). If the number is 1 or 0, the algorithm runs with only 1 thread.

The number of threads is not a fixed number, but is calculated in each execution. The number of threads passed can be greater than the number of hardware threads on the machine. We can pass 100 threads in a machine with 4 HW threads, and in the same mode we can pass a function as (std::thread::hardware_concurrency() / 4 ). If this value is 0, the program is executed with 1 thread.

Programming

You only need to include the file boost/sort/sort.hpp to use these algorithms.

All the algorithms run in the namespace boost::sort

#include <boost/sort/sort.hpp>

The parallel algorithms have 4 invocation formats:

algorithm ( first iterator, last iterator, comparison object, number of threads )
algorithm ( first iterator, last iterator, comparison object )
algorithm ( first iterator, last iterator, number of threads )
algorithm ( first iterator, last iterator )

These algorithms need a C++11 compliant compiler, but don't need any other code or library. With older compilers correct operation isn't guaranteed.

If the number of threads is unspecified, use the result of std::thread::hardware_concurrency()

These algorithms use a comparison object, in the same way as the standard library sort algorithms. If you don't define it, the comparison object defaults to std::less, which uses the < operator internally for comparisons.

These algorithms are exception safe, meaning that, the exceptions generated by the algorithms guarantee the integrity of the objects to sort, but not their relative order. If the exception is generated inside the objects (in the move or copy constructors) the results are undefined.

3.1- block_indirect_sort

Description

BLOCK_INDIRECT_SORT is a new unstable parallel sort, conceived and implemented by Francisco Jose Tapia for the Boost Library.

The most important characteristics of this algorithm are the speed and the low memory consumption.

|       |                        |                              |
Algorithm             |Stable |   Additional memory    |Best, average, and worst case |
----------------------+-------+------------------------+------------------------------+
block_indirect_sort   |  no   |block_size * num_threads|     N, N LogN, N LogN        |
|       |                        |                              |

The block_size is an internal parameter of the algorithm, which in order to achieve the highest speed, changes according with the size of the objects to sort according to the next table.The strings use a block_size of 128.

|        |         |         |         |         |         |          |
object size (bytes)             | 1 - 15 | 16 - 31 | 32 - 63 | 64 - 127|128 - 255|256 - 511| 512 -    |
--------------------------------+--------+---------+---------+---------+---------+---------+----------+
block_size (number of elements) |  4096  |  2048   |   1024  |   768   |   512   |   256   |  128     |
|        |         |         |         |         |         |          |

Benchmark

Sorting 100 000 000 64 bits numbers, the measured memory used was:

|             |             |
Algorithm                         | Time (secs) | Memory used |
----------------------------------+-------------+-------------+
Open MP Parallel Sort             |   1.1990    |  1564 MB    |
Threading Building Blocks (TBB)   |   1.6411    |   789 MB    |
Block Indirect Sort               |   0.9270    |   790 MB    |
|             |             |

Programming

This algorithm has an integer parameter indicating the number of threads to use in the sorting process, which always is the last value in the call. The default value (if left unspecified) is the number of HW threads of the machine where the program is running provided by std::thread::hardware_concurrency().

If the number is 1 or 0, the algorithm runs with only 1 thread.

The number of threads is not a fixed number, but is calculated in each execution. The number of threads passed can be greater than the number of hardware threads on the machine. We can pass 100 threads in a machine with 4 HW threads, and in the same mode we can pass a function as (std::thread::hardware_concurrency() / 4 ). If this value is 0, the program is executed with 1 thread.

Programming

You only need to include the file boost/sort/sort.hpp to use this algorithm

The algorithm runs in the namespace boost::sort

#include <boost/sort/sort.hpp>

template <class iter_t>
void block_indirect_sort (iter_t first, iter_t last);

template <class iter_t, typename compare>
void block_indirect_sort (iter_t first, iter_t last, compare comp);

template <class iter_t>
void block_indirect_sort (iter_t first, iter_t last, uint32_t num_thread);

template <class iter_t, typename compare>
void block_indirect_sort (iter_t first, iter_t last, compare comp, uint32_t num_thread);

This algorithm needs a C++11 compliant compiler. You don't need any other code or library. With older compilers correct operation is not guaranteed.

If the number of threads is unspecified, use the result of std::thread::hardware_concurrency()

This algorithm uses a comparison object, in the same way as the standard library sort algorithms. If not defined, the comparison object is std::less, which uses the < operator internally.

This algorithm is exception safe, meaning that, the exceptions generated by the algorithm guarantee the integrity of the objects to sort, but not their relative order. If the exception is generated inside the objects (in the move or in the copy constructor.. ) the results can be unpredictable.

Internal Description

There are two primary categories of parallelization in sorting algorithms.

Subdivision Algorithms

Filter the data and generate two or more parts. Each part obtained is filtered and divided by other threads. This filter and division process is done until the size of the part is smaller than a predefined size, then is sorted by a single thread.

The algorithm most frequently used in the filter and sorting is quick sort

These algorithms are fast with a small number of threads, but are inefficient with a great number of threads. Examples of this category are

• Intel Threading Building Blocks (TBB)
• Microsoft PPL Parallel Sort.
Merging Algorithms

Divide the data in parts, and each part is sorted by a thread. When the parts are sorted, they are merged to obtain the final results. These algorithms need additional memory for the merge, usually the same size as the data.

With a small number of threads, these algorithms have similar speed to the subdivision algorithms, but with many threads are much faster.

Examples of this category are

• GCC Parallel Sort (based on OpenMP)
• Microsoft PPL Parallel Buffered Sort

This generates an undesirable duality. With a small number of threads the optimal algorithm is not the optimal for a big number of threads.

For this reason, the SW designed for a small machine is inadequate for a big machine and vice versa.

Using only merging algorithms, has the problem of the additional memory used, usually of the same size as the data.

New Parallel Sort Algorithm (Block Indirect Sort)

This algorithm, named Block Indirect Sort, created for processors connected with shared memory, is a hybrid algorithm.

• With small number of threads, it is a subdivision algorithm.
• With many threads it is a merging algorithm, with a small auxiliary memory ( block_size * number of threads).

This algorithm eliminates the duality. The same code has optimal performance with a small and a big number of threads.

The number of threads to use is evaluated in each execution. When the program runs with a small number of threads the algorithm internally uses a subdivision algorithm and has similar performance to TBB, and when run with many threads, it internally uses the new algorithm and has the performance of GCC Parallel Sort, with the additional advantage of reduced memory consumption.

Design Process

Initial Idea

The initial idea, was to build a merge algorithm, to be fast with many threads, with a low additional memory.

This algortihm is not based in any other idea or algorithm. The technique used in the algorithm (indirect blocks) is new, and had been conceived and implemented for this algorithm.

As sample of their results, we can see the the sorting 100 000 000 64 bits numbers, ramdomly generated, in a machine with 12 threads.

|             |             |
Algorithm                         | Time (secs) | Memory used |
----------------------------------+-------------+-------------+
Open MP Parallel Sort             |   1.1990    |  1564 MB    |
Threading Building Blocks (TBB)   |   1.6411    |   789 MB    |
Block Indirect Sort               |   0.9270    |   790 MB    |
|             |             |