
Parallel computation with OpenMP and MPI

  • OpenMP
  • MPI
  • Concepts

Parallelization is a key feature of modern numerical libraries, since machines with many cores are widely available nowadays, even in laptops. odeint currently supports parallelization with OpenMP and MPI, as described in the following sections. However, it should be made clear from the beginning that the difficulty of efficiently distributing ODE integration over many cores/machines lies in the parallelization of the system function, which remains the user's responsibility. Simply using a parallel odeint backend without parallelizing the system function will bring almost no performance gains.

OpenMP

odeint's OpenMP support is implemented as an external backend, which needs to be included manually. Depending on the compiler, some additional flags may be needed, e.g. -fopenmp for GCC.

#include <omp.h>
#include <boost/numeric/odeint.hpp>
#include <boost/numeric/odeint/external/openmp/openmp.hpp>

In the simplest parallelization approach with OpenMP, we use a standard vector as the state type:

typedef std::vector< double > state_type;

We initialize the state with some random data:

size_t N = 131101;
state_type x( N );
boost::random::uniform_real_distribution<double> distribution( 0.0 , 2.0*pi );
boost::random::mt19937 engine( 0 );
generate( x.begin() , x.end() , boost::bind( distribution , engine ) );

Now we have to configure the stepper to use the OpenMP backend. This is done by explicitly providing the openmp_range_algebra as a template parameter to the stepper. This algebra requires the state type to be a model of Random Access Range and accesses it from multiple threads.

typedef runge_kutta4<
                  state_type , double ,
                  state_type , double ,
                  openmp_range_algebra
                > stepper_type;

In addition to providing the stepper with OpenMP parallelization, we also need a parallelized system function to exploit the available cores. Here this is shown for a simple one-dimensional chain of phase oscillators with nearest neighbor coupling:

struct phase_chain
{
    phase_chain( double gamma = 0.5 )
    : m_gamma( gamma ) { }

    void operator()( const state_type &x , state_type &dxdt , double /* t */ ) const
    {
        const size_t N = x.size();
        #pragma omp parallel for schedule(runtime)
        for(size_t i = 1 ; i < N - 1 ; ++i)
        {
            dxdt[i] = coupling_func( x[i+1] - x[i] ) +
                      coupling_func( x[i-1] - x[i] );
        }
        dxdt[0  ] = coupling_func( x[1  ] - x[0  ] );
        dxdt[N-1] = coupling_func( x[N-2] - x[N-1] );
    }

    double coupling_func( double x ) const
    {
        return sin( x ) - m_gamma * ( 1.0 - cos( x ) );
    }

    double m_gamma;
};

Note

In the OpenMP backends the system function will always be called sequentially from the thread used to start the integration.

Finally, we perform the integration by using one of the integrate functions from odeint. As you can see, the parallelization is completely hidden in the stepper and the system function. OpenMP will take care of distributing the work among the threads and joining them automatically.

integrate_n_steps( stepper_type() , phase_chain( 1.2 ) ,
                   x , 0.0 , 0.01 , 100 );

After integrating, the data can be accessed immediately and processed further. Note that you can specify the OpenMP scheduling by calling omp_set_schedule at the beginning of your program:

int chunk_size = N/omp_get_max_threads();
omp_set_schedule( omp_sched_static , chunk_size );

See openmp/phase_chain.cpp for the complete example.

Split state

For advanced cases odeint offers another approach to using OpenMP that allows more exact control of the parallelization. For example, for odd-sized data where OpenMP's thread boundaries don't match cache lines and hurt performance, it might be advisable to copy the data from the contiguous vector<T> into separate, individually aligned vectors. For this, odeint provides the openmp_state<T> type, essentially an alias for vector<vector<T>>.

Here, the initialization is done with a vector<double>, but then we use odeint's split function to fill an openmp_state. The splitting is done such that the sizes of the individual regions differ by at most 1, making the computation as uniform as possible.
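
In the snippets below, state_type refers to this split state. A minimal sketch of the corresponding typedef (an assumption consistent with the complete example file, not shown in the excerpt here) is:

typedef openmp_state< double > state_type;  // essentially std::vector< std::vector<double> >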

const size_t N = 131101;
vector<double> x( N );
boost::random::uniform_real_distribution<double> distribution( 0.0 , 2.0*pi );
boost::random::mt19937 engine( 0 );
generate( x.begin() , x.end() , boost::bind( distribution , engine ) );
const size_t blocks = omp_get_max_threads();
state_type x_split( blocks );
split( x , x_split );

Of course, the system function has to be changed to deal with the openmp_state. Note that each sub-region of the state is computed in a single task, but at the borders read access to the neighbouring regions is required.

struct phase_chain_omp_state
{
    phase_chain_omp_state( double gamma = 0.5 )
    : m_gamma( gamma ) { }

    void operator()( const state_type &x , state_type &dxdt , double /* t */ ) const
    {
        const size_t N = x.size();
        #pragma omp parallel for schedule(runtime)
        for(size_t n = 0 ; n < N ; ++n)
        {
            const size_t M = x[n].size();
            for(size_t m = 1 ; m < M-1 ; ++m)
            {
                dxdt[n][m] = coupling_func( x[n][m+1] - x[n][m] ) +
                             coupling_func( x[n][m-1] - x[n][m] );
            }
            dxdt[n][0] = coupling_func( x[n][1] - x[n][0] );
            if( n > 0 )
            {
                dxdt[n][0] += coupling_func( x[n-1].back() - x[n].front() );
            }
            dxdt[n][M-1] = coupling_func( x[n][M-2] - x[n][M-1] );
            if( n < N-1 )
            {
                dxdt[n][M-1] += coupling_func( x[n+1].front() - x[n].back() );
            }
        }
    }

    double coupling_func( double x ) const
    {
        return sin( x ) - m_gamma * ( 1.0 - cos( x ) );
    }

    double m_gamma;
};

Using the openmp_state<T> state type automatically selects openmp_algebra, which executes odeint's internal computations in parallel regions. Hence, no manual configuration of the stepper is necessary. At the end of the integration, we use unsplit to concatenate the sub-regions back into a single vector.

integrate_n_steps( runge_kutta4<state_type>() , phase_chain_omp_state( 1.2 ) ,
                   x_split , 0.0 , 0.01 , 100 );
unsplit( x_split , x );

Note

You don't actually need to use openmp_state<T> for advanced use cases; openmp_algebra is simply an alias for openmp_nested_algebra<range_algebra>, which supports any model of Random Access Range as the outer, parallel state type and applies the given algebra to its elements.
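
For illustration, a minimal sketch (not part of the example; the typedef names chunked_state_type and chunked_stepper_type are hypothetical) that spells out explicitly the algebra which runge_kutta4<openmp_state<double>> would otherwise select automatically; a different inner algebra could be substituted for range_algebra here:

typedef openmp_state< double > chunked_state_type;

typedef runge_kutta4<
                  chunked_state_type , double ,
                  chunked_state_type , double ,
                  openmp_nested_algebra< range_algebra >  // equivalent to the default openmp_algebra
                > chunked_stepper_type;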

See openmp/phase_chain_omp_state.cpp for the complete example.

MPI

To expand the parallel computation across multiple machines we can use MPI.

The system function implementation is similar to the OpenMP variant with split data. The main difference is that while OpenMP uses a fork/join model in which everything not explicitly parallelized is executed only in the main thread, in MPI's model each node enters the main() method independently, diverging based on its rank and synchronizing through message passing and explicit barriers.
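
As a minimal sketch of this execution model (not part of the odeint example; the output is purely illustrative), every process runs main() and branches on its rank:

#include <iostream>
#include <boost/mpi.hpp>

int main( int argc , char **argv )
{
    boost::mpi::environment env( argc , argv );  // initializes (and later finalizes) MPI
    boost::mpi::communicator world;              // default communicator spanning all nodes

    if( world.rank() == 0 )
        std::cout << "running on " << world.size() << " nodes" << std::endl;

    world.barrier();  // explicit synchronization point across all nodes
    return 0;
}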

odeint's MPI support is implemented as an external backend, too. Depending on the MPI implementation, the code might need to be compiled with, e.g., mpic++.

#include <boost/numeric/odeint.hpp>
#include <boost/numeric/odeint/external/mpi/mpi.hpp>

Instead of reading another thread's data, we asynchronously send and receive the relevant data from neighbouring nodes, performing some computation in the interim to hide the latency.

struct phase_chain_mpi_state
{
    phase_chain_mpi_state( double gamma = 0.5 )
    : m_gamma( gamma ) { }

    void operator()( const state_type &x , state_type &dxdt , double /* t */ ) const
    {
        const size_t M = x().size();
        const bool have_left = x.world.rank() > 0,
                   have_right = x.world.rank() < x.world.size()-1;
        double x_left, x_right;
        boost::mpi::request r_left, r_right;
        if( have_left )
        {
            x.world.isend( x.world.rank()-1 , 0 , x().front() ); // send to x_right
            r_left = x.world.irecv( x.world.rank()-1 , 0 , x_left ); // receive from x().back()
        }
        if( have_right )
        {
            x.world.isend( x.world.rank()+1 , 0 , x().back() ); // send to x_left
            r_right = x.world.irecv( x.world.rank()+1 , 0 , x_right ); // receive from x().front()
        }
        for(size_t m = 1 ; m < M-1 ; ++m)
        {
            dxdt()[m] = coupling_func( x()[m+1] - x()[m] ) +
                        coupling_func( x()[m-1] - x()[m] );
        }
        dxdt()[0] = coupling_func( x()[1] - x()[0] );
        if( have_left )
        {
            r_left.wait();
            dxdt()[0] += coupling_func( x_left - x().front() );
        }
        dxdt()[M-1] = coupling_func( x()[M-2] - x()[M-1] );
        if( have_right )
        {
            r_right.wait();
            dxdt()[M-1] += coupling_func( x_right - x().back() );
        }
    }

    double coupling_func( double x ) const
    {
        return sin( x ) - m_gamma * ( 1.0 - cos( x ) );
    }

    double m_gamma;
};

Analogous to openmp_state<T> we use mpi_state< InnerState<T> >, which automatically selects mpi_nested_algebra and the appropriate MPI-oblivious inner algebra (since our inner state is a vector, the inner algebra will be range_algebra as in the OpenMP example).

typedef mpi_state< vector<double> > state_type;

In the main program we construct a communicator, which tells us the size of the cluster and the current node's rank within it. We generate the input data on the master node only, avoiding unnecessary work on the other nodes. Instead of simply copying chunks, split acts as an MPI collective function here and sends/receives regions from the master to each slave. The input argument is ignored on the slaves, but the master node receives a region in its output and will participate in the computation.

boost::mpi::environment env( argc , argv );
boost::mpi::communicator world;

const size_t N = 131101;
vector<double> x;
if( world.rank() == 0 )
{
    x.resize( N );
    boost::random::uniform_real_distribution<double> distribution( 0.0 , 2.0*pi );
    boost::random::mt19937 engine( 0 );
    generate( x.begin() , x.end() , boost::bind( distribution , engine ) );
}

state_type x_split( world );
split( x , x_split );

Now that x_split contains (only) the local chunk for each node, we start the integration.

To print the result on the master node, we send the processed data back using unsplit.

integrate_n_steps( runge_kutta4<state_type>() , phase_chain_mpi_state( 1.2 ) ,
                   x_split , 0.0 , 0.01 , 100 );
unsplit( x_split , x );

Note

mpi_nested_algebra::for_eachN doesn't use any MPI constructs; it simply calls the inner algebra on the local chunk. The system function is not guarded by any barriers either, so if you don't place any manually (for example in parameter studies where the elements are completely independent), you might see the nodes diverge and return from this call at different times.
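
If lock-step behaviour is required, a minimal sketch (a hypothetical addition, not part of the example) is to place an explicit barrier on the state's communicator, for instance right after the integration:

integrate_n_steps( runge_kutta4<state_type>() , phase_chain_mpi_state( 1.2 ) ,
                   x_split , 0.0 , 0.01 , 100 );
x_split.world.barrier();  // wait until every node has finished its local integration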

See mpi/phase_chain.cpp for the complete example.

Concepts

MPI State

As used by mpi_nested_algebra.

Notation

  • InnerState: the inner state type
  • State: the MPI-state type
  • state: object of type State
  • world: object of type boost::mpi::communicator

Valid Expressions

  • Construct a state with a communicator: State(world), type State; constructs the State.
  • Construct a state with the default communicator: State(), type State; constructs the State.
  • Get the current node's inner state: state(), type InnerState; returns a (const) reference.
  • Get the communicator: state.world, type boost::mpi::communicator; see Boost.MPI.

Models
  • mpi_state<InnerState>
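
A minimal usage sketch of these expressions (assuming the typedef mpi_state< vector<double> > state_type from above and a boost::mpi::communicator world; the variable s is illustrative):

state_type s( world );   // State(world): construct with a communicator
s().resize( 10 );        // state(): reference to this node's inner vector<double>
s.world.barrier();       // state.world: the communicator, see Boost.MPI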

OpenMP Split State

As used by openmp_nested_algebra; essentially a Random Access Container with ValueType = InnerState.

Notation

  • InnerState: the inner state type
  • State: the split state type
  • state: object of type State

Valid Expressions

  • Construct a state for n chunks: State(n), type State; constructs the underlying vector.
  • Get a chunk: state[i], type InnerState; accesses the underlying vector.
  • Get the number of chunks: state.size(), type size_type; returns the size of the underlying vector.

Models
  • openmp_state<ValueType> with InnerState = vector<ValueType>
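
A minimal usage sketch of these expressions (assuming openmp_state<double> as the State and the OpenMP headers shown earlier; the variable y is illustrative):

openmp_state<double> y( omp_get_max_threads() );   // State(n): one chunk per thread
y[0].push_back( 1.0 );                             // state[i]: access chunk i
const size_t chunks = y.size();                    // state.size(): number of chunks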

Splitter

Notation

  • Container1: the continuous-data container type
  • x: object of type Container1
  • Container2: the chunked-data container type
  • y: object of type Container2

Valid Expressions

  • Copy chunks of the input to the output elements: split(x, y), type void; calls split_impl<Container1, Container2>::split(x, y), splitting x into y.size() chunks.
  • Join chunks of the input elements into the output: unsplit(y, x), type void; calls unsplit_impl<Container2, Container1>::unsplit(y, x); assumes x already has the correct size Σ y[i].size() and does not resize x.

Models
  • defined for Container1 = Boost.Range and Container2 = openmp_state
  • defined for Container1 = Boost.Range and Container2 = mpi_state

To implement splitters for containers incompatible with Boost.Range, specialize the split_impl and unsplit_impl types:

template< class Container1, class Container2 , class Enabler = void >
struct split_impl {
    static void split( const Container1 &from , Container2 &to );
};

template< class Container2, class Container1 , class Enabler = void >
struct unsplit_impl {
    static void unsplit( const Container2 &from , Container1 &to );
};
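
For illustration, a minimal sketch of such a specialization, assuming a hypothetical flat container my_buffer (defined below purely for this example), targeting openmp_state<double>, and assuming split_impl lives in namespace boost::numeric::odeint like the other odeint customization points:

#include <vector>
#include <cstddef>
#include <boost/numeric/odeint.hpp>
#include <boost/numeric/odeint/external/openmp/openmp.hpp>

// hypothetical flat container, used only for this illustration
struct my_buffer
{
    std::vector<double> values;
    std::size_t size() const { return values.size(); }
    const double* data() const { return values.data(); }
};

namespace boost { namespace numeric { namespace odeint {

template<>
struct split_impl< my_buffer , openmp_state<double> >
{
    static void split( const my_buffer &from , openmp_state<double> &to )
    {
        // distribute the flat data into to.size() chunks whose sizes differ by at most 1
        const std::size_t chunks = to.size();
        const std::size_t total  = from.size();
        for( std::size_t i = 0 ; i < chunks ; ++i )
        {
            const std::size_t begin = ( i * total ) / chunks;
            const std::size_t end   = ( ( i + 1 ) * total ) / chunks;
            to[i].assign( from.data() + begin , from.data() + end );
        }
    }
};

} } } // namespace boost::numeric::odeint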

