3.- Parallel Algorithms

3.1- block_indirect_sort

Description
Benchmark
Programming
Internal Description
Design Process

3.2.- Sample_Sort

Description
Programming

3.3.- Parallel_stable_sort

Description
Programming

3.4- Linux Benchmarks

3.5- Windows Benchmark

Parallel Algorithms
This table provides a brief description of the parallel algorithms of the library.
                      |       |                        |                              |
Algorithm             |Stable |   Additional memory    |Best, average, and worst case |
----------------------+-------+------------------------+------------------------------+
block_indirect_sort   |  no   |block_size * num_threads| N, N LogN , N LogN           |
sample_sort           |  yes  |        N               | N, N LogN , N LogN           |
parallel_stable_sort  |  yes  |      N / 2             | N, N LogN , N LogN           |
                      |       |                        |                              |
Sample_sort is a implementation of the Samplesort algorithm done by Francisco Tapia.

Parallel_stable_sort is based on the samplesort algorithm, but using a half of the memory used by sample_sort, conceived and implemented by Francisco Tapia.

Block_indirect_sort is a novel parallel sort algorithm, very fast, with low additional memory consumption, conceived and implemented by Francisco Tapia.

The block_size is an internal parameter of the algorithm, which in order to achieve the highest speed, changes according to the size of the objects to sort according to the next table. The strings use a block_size of 128.
                                |        |         |         |         |         |         |          |
object size (bytes)             | 1 - 15 | 16 - 31 | 32 - 63 | 64 - 127|128 - 255|256 - 511| 512 -    |
--------------------------------+--------+---------+---------+---------+---------+---------+----------+
block_size (number of elements) |  4096  |  2048   |   1024  |   768   |   512   |   256   |  128     |
                                |        |         |         |         |         |         |          |
Thread Specification

The parallel algorithms have a integer parameter indicating the number of threads to use in the sorting process, which always is the last value in the call.

The default value (if left unspecified) is the number of HW threads of the machine where the program is running provided by std::thread::hardware_concurrency(). If the number is 1 or 0, the algorithm runs with only 1 thread.

The number of threads is not a fixed number, but is calculated in each execution. The number of threads passed can be greater than the number of hardware threads on the machine. We can pass 100 threads in a machine with 4 HW threads, and in the same mode we can pass a function as (std::thread::hardware_concurrency() / 4 ). If this value is 0, the program is executed with 1 thread.

Programming
You only need to include the file boost/sort/sort.hpp to use these algorithms.

All the algorithms run in the namespace boost::sort
#include <boost/sort/sort.hpp>
The parallel algorithms have 4 invocation formats:
algorithm ( first iterator, last iterator, comparison object, number of threads )
algorithm ( first iterator, last iterator, comparison object )
algorithm ( first iterator, last iterator, number of threads )
algorithm ( first iterator, last iterator )
These algorithms need a C++11 compliant compiler, but don't need any other code or library. With older compilers correct operation isn't guaranteed.

If the number of threads is unspecified, use the result of std::thread::hardware_concurrency()

These algorithms use a comparison object, in the same way as the standard library sort algorithms. If you don't define it, the comparison object defaults to std::less, which uses the < operator internally for comparisons.

These algorithms are exception safe, meaning that, the exceptions generated by the algorithms guarantee the integrity of the objects to sort, but not their relative order. If the exception is generated inside the objects (in the move or copy constructors) the results are undefined.

3.1- block_indirect_sort

Description

BLOCK_INDIRECT_SORT is a new unstable parallel sort, conceived and implemented by Francisco Jose Tapia for the Boost Library.

The most important characteristics of this algorithm are the speed and the low memory consumption.
                      |       |                        |                              |
Algorithm             |Stable |   Additional memory    |Best, average, and worst case |
----------------------+-------+------------------------+------------------------------+
block_indirect_sort   |  no   |block_size * num_threads|     N, N LogN, N LogN        |
                      |       |                        |                              |
The block_size is an internal parameter of the algorithm, which in order to achieve the highest speed, changes according with the size of the objects to sort according to the next table.The strings use a block_size of 128.
                                |        |         |         |         |         |         |          |
object size (bytes)             | 1 - 15 | 16 - 31 | 32 - 63 | 64 - 127|128 - 255|256 - 511| 512 -    |
--------------------------------+--------+---------+---------+---------+---------+---------+----------+
block_size (number of elements) |  4096  |  2048   |   1024  |   768   |   512   |   256   |  128     |
                                |        |         |         |         |         |         |          |

Benchmark

Sorting 100 000 000 64 bits numbers, the measured memory used was:

                                  |             |             |
Algorithm                         | Time (secs) | Memory used |
----------------------------------+-------------+-------------+
Open MP Parallel Sort             |   1.1990    |  1564 MB    |
Threading Building Blocks (TBB)   |   1.6411    |   789 MB    |
Block Indirect Sort               |   0.9270    |   790 MB    |
                                  |             |             |

Programming

Thread specification

This algorithm has an integer parameter indicating the number of threads to use in the sorting process, which always is the last value in the call. The default value (if left unspecified) is the number of HW threads of the machine where the program is running provided by std::thread::hardware_concurrency().

If the number is 1 or 0, the algorithm runs with only 1 thread.

The number of threads is not a fixed number, but is calculated in each execution. The number of threads passed can be greater than the number of hardware threads on the machine. We can pass 100 threads in a machine with 4 HW threads, and in the same mode we can pass a function as (std::thread::hardware_concurrency() / 4 ). If this value is 0, the program is executed with 1 thread.

Programming
You only need to include the file boost/sort/sort.hpp to use this algorithm

The algorithm runs in the namespace boost::sort
#include <boost/sort/sort.hpp>



template <class iter_t>
void block_indirect_sort (iter_t first, iter_t last);

template <class iter_t, typename compare>
void block_indirect_sort (iter_t first, iter_t last, compare comp);

template <class iter_t>
void block_indirect_sort (iter_t first, iter_t last, uint32_t num_thread);

template <class iter_t, typename compare>
void block_indirect_sort (iter_t first, iter_t last, compare comp, uint32_t num_thread);
This algorithm needs a C++11 compliant compiler. You don't need any other code or library. With older compilers correct operation is not guaranteed.

If the number of threads is unspecified, use the result of std::thread::hardware_concurrency()

This algorithm uses a comparison object, in the same way as the standard library sort algorithms. If not defined, the comparison object is std::less, which uses the < operator internally.

This algorithm is exception safe, meaning that, the exceptions generated by the algorithm guarantee the integrity of the objects to sort, but not their relative order. If the exception is generated inside the objects (in the move or in the copy constructor.. ) the results can be unpredictable.

Internal Description

There are two primary categories of parallelization in sorting algorithms.

Subdivision Algorithms

Filter the data and generate two or more parts. Each part obtained is filtered and divided by other threads. This filter and division process is done until the size of the part is smaller than a predefined size, then is sorted by a single thread.

The algorithm most frequently used in the filter and sorting is quick sort

These algorithms are fast with a small number of threads, but are inefficient with a great number of threads. Examples of this category are

Intel Threading Building Blocks (TBB)

Microsoft PPL Parallel Sort.

Merging Algorithms

Divide the data in parts, and each part is sorted by a thread. When the parts are sorted, they are merged to obtain the final results. These algorithms need additional memory for the merge, usually the same size as the data.

With a small number of threads, these algorithms have similar speed to the subdivision algorithms, but with many threads are much faster.

Examples of this category are

GCC Parallel Sort (based on OpenMP)

Microsoft PPL Parallel Buffered Sort

This generates an undesirable duality. With a small number of threads the optimal algorithm is not the optimal for a big number of threads.

For this reason, the SW designed for a small machine is inadequate for a big machine and vice versa.

Using only merging algorithms, has the problem of the additional memory used, usually of the same size as the data.

New Parallel Sort Algorithm (Block Indirect Sort)

This algorithm, named Block Indirect Sort, created for processors connected with shared memory, is a hybrid algorithm.

With small number of threads, it is a subdivision algorithm.

With many threads it is a merging algorithm, with a small auxiliary memory ( block_size * number of threads).

This algorithm eliminates the duality. The same code has optimal performance with a small and a big number of threads.

The number of threads to use is evaluated in each execution. When the program runs with a small number of threads the algorithm internally uses a subdivision algorithm and has similar performance to TBB, and when run with many threads, it internally uses the new algorithm and has the performance of GCC Parallel Sort, with the additional advantage of reduced memory consumption.

Design Process

Initial Idea
The initial idea, was to build a merge algorithm, to be fast with many threads, with a low additional memory.

This algortihm is not based in any other idea or algorithm. The technique used in the algorithm (indirect blocks) is new, and had been conceived and implemented for this algorithm.

As sample of their results, we can see the the sorting 100 000 000 64 bits numbers, ramdomly generated, in a machine with 12 threads.
                                  |             |             |
Algorithm                         | Time (secs) | Memory used |
----------------------------------+-------------+-------------+
Open MP Parallel Sort             |   1.1990    |  1564 MB    |
Threading Building Blocks (TBB)   |   1.6411    |   789 MB    |
Block Indirect Sort               |   0.9270    |   790 MB    |
                                  |             |             |
The best words about this algorithm are expressed by their Benchmarks results
Design process

The process had been long and very hard, mainly, by the uncertainty about if the ideas are correct and run so fast as need for to be useful. This is complicated by the fact that we can’t be sure of the efficiency until the last part of the code is done and the first benchmark has run.

But it had been a very exciting process, each time a problem is resolved, a new internal algorithm is designed, tested …, and see, step by step, the advance of the process.

I discovered many new problems during this process, unknown until now, which forced me to design new internal algorithms to resolve them, and divide the work in many parts to execute in parallel mode. Due to this, you can find many nice algorithms inside the sorting algorithm written to resolve and parallelize the internal problems.

If you are interested in a detailed description of the algorithm, you can find it here : block_indirect_sort_en.pdf.