...one of the most highly
regarded and expertly designed C++ library projects in the
world.
— Herb Sutter and Andrei
Alexandrescu, C++
Coding Standards
Home | Libraries | People | FAQ | More |
Modern graphic cards (graphic processing units - GPUs) can be used to speed up the performance of time consuming algorithms by means of massive parallelization. They are designed to execute many operations in parallel. odeint can utilize the power of GPUs by means of CUDA and Thrust, which is a STL-like interface for the native CUDA API.
Important | |
---|---|
Thrust also supports parallelization using OpenMP and Intel Threading Building Blocks (TBB). You can switch between CUDA, OpenMP and TBB parallelizations by a simple compiler switch. Hence, this also provides an easy way to get basic OpenMP parallelization into odeint. The examples discussed below are focused on GPU parallelization, though. |
To use odeint with CUDA a few points have to be taken into account. First of all, the problem has to be well chosen. It makes absolutely no sense to try to parallelize the code for a three dimensional system, it is simply too small and not worth the effort. One single function call (kernel execution) on the GPU is slow but you can do the operation on a huge set of data with only one call. We have experienced that the vector size over which is parallelized should be of the order of 10^{6} to make full use of the GPU. Secondly, you have to use Thrust's algorithms and functors when implementing the rhs the ODE. This might be tricky since it involves some kind of functional programming knowledge.
Typical applications for CUDA and odeint are large systems, like lattices or discretizations of PDE, and parameter studies. We introduce now three examples which show how the power of GPUs can be used in combination with odeint.
Important | |
---|---|
The full power of CUDA is only available for really large systems where the number of coupled ordinary differential equations is of order N=10^{6} or larger. For smaller systems the CPU is usually much faster. You can also integrate an ensemble of different uncoupled ODEs in parallel as shown in the last example. |
The first example is the phase oscillator ensemble from the previous section:
dφ_{k} / dt = ω_{k} + ε / N Σ_{j} sin( φ_{j} - φ_{k} ).
It has a phase transition at ε = 2 in the limit of infinite numbers of oscillators N. In the case of finite N this transition is smeared out but still clearly visible.
Thrust and CUDA are
perfectly suited for such kinds of problems where one needs a large number
of particles (oscillators). We start by defining the state type which is
a thrust::device_vector
. The content of this vector
lives on the GPU. If you are not familiar with this we recommend reading
the Getting started section on the Thrust
website.
//change this to float if your device does not support double computation typedef double value_type; //change this to host_vector< ... > of you want to run on CPU typedef thrust::device_vector< value_type > state_type; // typedef thrust::host_vector< value_type > state_type;
Thrust follows a functional programming approach. If you want to perform
a calculation on the GPU you usually have to call a global function like
thrust::for_each
, thrust::reduce
,
... with an appropriate local functor which performs the basic operation.
An example is
struct add_two { template< class T > __host__ __device__ void operator()( T &t ) const { t += T( 2 ); } }; // ... thrust::for_each( x.begin() , x.end() , add_two() );
This code generically adds two to every element in the container x
.
For the purpose of integrating the phase oscillator ensemble we need
The mean field is calculated in a class mean_field_calculator
struct mean_field_calculator { struct sin_functor : public thrust::unary_function< value_type , value_type > { __host__ __device__ value_type operator()( value_type x) const { return sin( x ); } }; struct cos_functor : public thrust::unary_function< value_type , value_type > { __host__ __device__ value_type operator()( value_type x) const { return cos( x ); } }; static std::pair< value_type , value_type > get_mean( const state_type &x ) { value_type sin_sum = thrust::reduce( thrust::make_transform_iterator( x.begin() , sin_functor() ) , thrust::make_transform_iterator( x.end() , sin_functor() ) ); value_type cos_sum = thrust::reduce( thrust::make_transform_iterator( x.begin() , cos_functor() ) , thrust::make_transform_iterator( x.end() , cos_functor() ) ); cos_sum /= value_type( x.size() ); sin_sum /= value_type( x.size() ); value_type K = sqrt( cos_sum * cos_sum + sin_sum * sin_sum ); value_type Theta = atan2( sin_sum , cos_sum ); return std::make_pair( K , Theta ); } };
Inside this class two member structures sin_functor
and cos_functor
are defined.
They compute the sine and the cosine of a value and they are used within
a transform iterator to calculate the sum of sin(φ_{k})
and cos(φ_{k}). The classifiers __host__
and __device__
are CUDA
specific and define a function or operator which can be executed on the
GPU as well as on the CPU. The line
value_type sin_sum = thrust::reduce( thrust::make_transform_iterator( x.begin() , sin_functor() ) , thrust::make_transform_iterator( x.end() , sin_functor() ) );
performs the calculation of this sine-sum on the GPU (or on the CPU, depending on your thrust configuration).
The system function is defined via
class phase_oscillator_ensemble { public: struct sys_functor { value_type m_K , m_Theta , m_epsilon; sys_functor( value_type K , value_type Theta , value_type epsilon ) : m_K( K ) , m_Theta( Theta ) , m_epsilon( epsilon ) { } template< class Tuple > __host__ __device__ void operator()( Tuple t ) { thrust::get<2>(t) = thrust::get<1>(t) + m_epsilon * m_K * sin( m_Theta - thrust::get<0>(t) ); } }; // ... void operator() ( const state_type &x , state_type &dxdt , const value_type dt ) const { std::pair< value_type , value_type > mean_field = mean_field_calculator::get_mean( x ); thrust::for_each( thrust::make_zip_iterator( thrust::make_tuple( x.begin() , m_omega.begin() , dxdt.begin() ) ), thrust::make_zip_iterator( thrust::make_tuple( x.end() , m_omega.end() , dxdt.end()) ) , sys_functor( mean_field.first , mean_field.second , m_epsilon ) ); } // ... };
This class is used within the do_step
and integrate
method. It
defines a member structure sys_functor
for the r.h.s. of each individual oscillator and the operator()
for the use in the steppers and integrators
of odeint. The functor computes first the mean field of φ_{k}
and secondly calculates the whole r.h.s. of the ODE using this mean field.
Note, how nicely thrust::tuple
and thrust::zip_iterator
play together.
Now, we are ready to put everything together. All we have to do for making odeint ready for using the GPU is to parametrize the stepper with the appropriate thrust algebra/operations:
typedef runge_kutta4< state_type , value_type , state_type , value_type , thrust_algebra , thrust_operations > stepper_type;
You can also use a controlled or dense output stepper, e.g.
typedef runge_kutta_dopri5< state_type , value_type , state_type , value_type , thrust_algebra , thrust_operations > stepper_type;
Then, it is straightforward to integrate the phase ensemble by creating an instance of the rhs class and using an integration function:
phase_oscillator_ensemble ensemble( N , 1.0 );
size_t steps1 = integrate_const( make_controlled( 1.0e-6 , 1.0e-6 , stepper_type() ) , boost::ref( ensemble ) , x , 0.0 , t_transients , dt );
We have to use boost::ref
here in order to pass the rhs class
as reference and not by value. This ensures that the natural frequencies
of each oscillator are not copied when calling integrate_const
.
In the full example the performance and results of the Runge-Kutta-4 and
the Dopri5 solver are compared.
The full example can be found at phase_oscillator_example.cu.
The next example is a large, one-dimensional chain of nearest-neighbor coupled phase oscillators with the following equations of motion:
d φ_{k} / dt = ω_{k} + sin( φ_{k+1} - φ_{k} ) + sin( φ_{k} - φ_{k-1})
In principle we can use all the techniques from the previous phase oscillator ensemble example, but we have to take special care about the coupling of the oscillators. To efficiently implement the coupling you can use a very elegant way employing Thrust's permutation iterator. A permutation iterator behaves like a normal iterator on a vector but it does not iterate along the usual order of the elements. It rather iterates along some permutation of the elements defined by some index map. To realize the nearest neighbor coupling we create one permutation iterator which travels one step behind a usual iterator and another permutation iterator which travels one step in front. The full system class is:
//change this to host_vector< ... > if you want to run on CPU typedef thrust::device_vector< value_type > state_type; typedef thrust::device_vector< size_t > index_vector_type; //typedef thrust::host_vector< value_type > state_type; //typedef thrust::host_vector< size_t > index_vector_type; class phase_oscillators { public: struct sys_functor { template< class Tuple > __host__ __device__ void operator()( Tuple t ) // this functor works on tuples of values { // first, unpack the tuple into value, neighbors and omega const value_type phi = thrust::get<0>(t); const value_type phi_left = thrust::get<1>(t); // left neighbor const value_type phi_right = thrust::get<2>(t); // right neighbor const value_type omega = thrust::get<3>(t); // the dynamical equation thrust::get<4>(t) = omega + sin( phi_right - phi ) + sin( phi - phi_left ); } }; phase_oscillators( const state_type &omega ) : m_omega( omega ) , m_N( omega.size() ) , m_prev( omega.size() ) , m_next( omega.size() ) { // build indices pointing to left and right neighbours thrust::counting_iterator<size_t> c( 0 ); thrust::copy( c , c+m_N-1 , m_prev.begin()+1 ); m_prev[0] = 0; // m_prev = { 0 , 0 , 1 , 2 , 3 , ... , N-1 } thrust::copy( c+1 , c+m_N , m_next.begin() ); m_next[m_N-1] = m_N-1; // m_next = { 1 , 2 , 3 , ... , N-1 , N-1 } } void operator() ( const state_type &x , state_type &dxdt , const value_type dt ) { thrust::for_each( thrust::make_zip_iterator( thrust::make_tuple( x.begin() , thrust::make_permutation_iterator( x.begin() , m_prev.begin() ) , thrust::make_permutation_iterator( x.begin() , m_next.begin() ) , m_omega.begin() , dxdt.begin() ) ), thrust::make_zip_iterator( thrust::make_tuple( x.end() , thrust::make_permutation_iterator( x.begin() , m_prev.end() ) , thrust::make_permutation_iterator( x.begin() , m_next.end() ) , m_omega.end() , dxdt.end()) ) , sys_functor() ); } private: const state_type &m_omega; const size_t m_N; index_vector_type m_prev; index_vector_type m_next; };
Note, how easy you can obtain the value for the left and right neighboring
oscillator in the system functor using the permutation iterators. But,
the call of the thrust::for_each
function looks relatively complicated. Every term of the r.h.s. of the
ODE is resembled by one iterator packed in exactly the same way as it is
unpacked in the functor above.
Now we put everything together. We create random initial conditions and decreasing frequencies such that we should get synchronization. We copy the frequencies and the initial conditions onto the device and finally initialize and perform the integration. As result we simply write out the current state, hence the phase of each oscillator.
// create initial conditions and omegas on host: vector< value_type > x_host( N ); vector< value_type > omega_host( N ); for( size_t i=0 ; i<N ; ++i ) { x_host[i] = 2.0 * pi * drand48(); omega_host[i] = ( N - i ) * epsilon; // decreasing frequencies } // copy to device state_type x = x_host; state_type omega = omega_host; // create stepper runge_kutta4< state_type , value_type , state_type , value_type , thrust_algebra , thrust_operations > stepper; // create phase oscillator system function phase_oscillators sys( omega ); // integrate integrate_const( stepper , sys , x , 0.0 , 10.0 , dt ); thrust::copy( x.begin() , x.end() , std::ostream_iterator< value_type >( std::cout , "\n" ) ); std::cout << std::endl;
The full example can be found at phase_oscillator_chain.cu.
Another important use case for Thrust and CUDA are parameter studies of relatively small systems. Consider for example the three-dimensional Lorenz system from the chaotic systems example in the previous section which has three parameters. If you want to study the behavior of this system for different parameters you usually have to integrate the system for many parameter values. Using thrust and odeint you can do this integration in parallel, hence you integrate a whole ensemble of Lorenz systems where each individual realization has a different parameter value.
In the following we will show how you can use Thrust to integrate the above mentioned ensemble of Lorenz systems. We will vary only the parameter β but it is straightforward to vary other parameters or even two or all three parameters. Furthermore, we will use the largest Lyapunov exponent to quantify the behavior of the system (chaoticity).
We start by defining the range of the parameters we want to study. The
state_type is again a thrust::device_vector< value_type
>
.
vector< value_type > beta_host( N ); const value_type beta_min = 0.0 , beta_max = 56.0; for( size_t i=0 ; i<N ; ++i ) beta_host[i] = beta_min + value_type( i ) * ( beta_max - beta_min ) / value_type( N - 1 ); state_type beta = beta_host;
The next thing we have to implement is the Lorenz system without perturbations. Later, a system with perturbations is also implemented in order to calculate the Lyapunov exponent. We will use an ansatz where each device function calculates one particular realization of the Lorenz ensemble
struct lorenz_system { struct lorenz_functor { template< class T > __host__ __device__ void operator()( T t ) const { // unpack the parameter we want to vary and the Lorenz variables value_type R = thrust::get< 3 >( t ); value_type x = thrust::get< 0 >( t ); value_type y = thrust::get< 1 >( t ); value_type z = thrust::get< 2 >( t ); thrust::get< 4 >( t ) = sigma * ( y - x ); thrust::get< 5 >( t ) = R * x - y - x * z; thrust::get< 6 >( t ) = -b * z + x * y ; } }; lorenz_system( size_t N , const state_type &beta ) : m_N( N ) , m_beta( beta ) { } template< class State , class Deriv > void operator()( const State &x , Deriv &dxdt , value_type t ) const { thrust::for_each( thrust::make_zip_iterator( thrust::make_tuple( boost::begin( x ) , boost::begin( x ) + m_N , boost::begin( x ) + 2 * m_N , m_beta.begin() , boost::begin( dxdt ) , boost::begin( dxdt ) + m_N , boost::begin( dxdt ) + 2 * m_N ) ) , thrust::make_zip_iterator( thrust::make_tuple( boost::begin( x ) + m_N , boost::begin( x ) + 2 * m_N , boost::begin( x ) + 3 * m_N , m_beta.begin() , boost::begin( dxdt ) + m_N , boost::begin( dxdt ) + 2 * m_N , boost::begin( dxdt ) + 3 * m_N ) ) , lorenz_functor() ); } size_t m_N; const state_type &m_beta; };
As state_type
a thrust::device_vector
or a Boost.Range
of a device_vector
is used.
The length of the state is 3N where N
is the number of systems. The system is encoded into this vector such that
all x components come first, then every y
components and finally every z components. Implementing
the device function is then a simple task, you only have to decompose the
tuple originating from the zip iterators.
Besides the system without perturbations we furthermore need to calculate
the system including linearized equations governing the time evolution
of small perturbations. Using the method from above this is straightforward,
with a small difficulty that Thrust's tuples have a maximal arity of 10.
But this is only a small problem since we can create a zip iterator packed
with zip iterators. So the top level zip iterator contains one zip iterator
for the state, one normal iterator for the parameter, and one zip iterator
for the derivative. Accessing the elements of this tuple in the system
function is then straightforward, you unpack the tuple with thrust::get<>()
.
We will not show the code here, it is to large. It can be found here and
is easy to understand.
Furthermore, we need an observer which determines the norm of the perturbations, normalizes them and averages the logarithm of the norm. The device functor which is used within this observer is defined
struct lyap_functor { template< class T > __host__ __device__ void operator()( T t ) const { value_type &dx = thrust::get< 0 >( t ); value_type &dy = thrust::get< 1 >( t ); value_type &dz = thrust::get< 2 >( t ); value_type norm = sqrt( dx * dx + dy * dy + dz * dz ); dx /= norm; dy /= norm; dz /= norm; thrust::get< 3 >( t ) += log( norm ); } };
Note, that this functor manipulates the state, i.e. the perturbations.
Now we complete the whole code to calculate the Lyapunov exponents. First,
we have to define a state vector. This vector contains 6N
entries, the state x,y,z and its perturbations dx,dy,dz.
We initialize them such that x=y=z=10, dx=1,
and dy=dz=0. We define a stepper type, a controlled
Runge-Kutta Dormand-Prince 5 stepper. We start with some integration to
overcome the transient behavior. For this, we do not involve the perturbation
and run the algorithm only on the state x,y,z without
any observer. Note, how Boost.Range
is used for partial integration of the state vector without perturbations
(the first half of the whole state). After the transient, the full system
with perturbations is integrated and the Lyapunov exponents are calculated
and written to stdout
.
state_type x( 6 * N ); // initialize x,y,z thrust::fill( x.begin() , x.begin() + 3 * N , 10.0 ); // initial dx thrust::fill( x.begin() + 3 * N , x.begin() + 4 * N , 1.0 ); // initialize dy,dz thrust::fill( x.begin() + 4 * N , x.end() , 0.0 ); // create error stepper, can be used with make_controlled or make_dense_output typedef runge_kutta_dopri5< state_type , value_type , state_type , value_type , thrust_algebra , thrust_operations > stepper_type; lorenz_system lorenz( N , beta ); lorenz_perturbation_system lorenz_perturbation( N , beta ); lyap_observer obs( N , 1 ); // calculate transients integrate_adaptive( make_controlled( 1.0e-6 , 1.0e-6 , stepper_type() ) , lorenz , std::make_pair( x.begin() , x.begin() + 3 * N ) , 0.0 , 10.0 , dt ); // calculate the Lyapunov exponents -- the main loop double t = 0.0; while( t < 10000.0 ) { integrate_adaptive( make_controlled( 1.0e-6 , 1.0e-6 , stepper_type() ) , lorenz_perturbation , x , t , t + 1.0 , 0.1 ); t += 1.0; obs( x , t ); } vector< value_type > lyap( N ); obs.fill_lyap( lyap ); for( size_t i=0 ; i<N ; ++i ) cout << beta_host[i] << "\t" << lyap[i] << "\n";
The full example can be found at lorenz_parameters.cu.