Advanced Topics

Vector Data Types
Custom Functions
Custom Types
Complex Values
Lambda Expressions
Asynchronous Operations
Performance Timing
OpenCL API Interoperability

The following topics show advanced features of the Boost Compute library.

Vector Data Types

In addition to the built-in scalar types (e.g. int and float), OpenCL also provides vector data types (e.g. int2 and float4). These can be used with the Boost Compute library on both the host and the device.

Boost.Compute provides typedefs for these types of the form boost::compute::scalarN_, where scalar is a scalar data type (e.g. int, float, char) and N is the size of the vector; for example, boost::compute::float4_ is a vector of four floats. Supported vector sizes are 2, 4, 8, and 16.

The following example shows how to transfer a set of 3D points, stored as an array of floats on the host, to the device and then calculate the sum of the point coordinates using the accumulate() function. The sum is transferred back to the host, and the centroid is computed by dividing by the total number of points.

Note that even though the points are 3D, they are stored as float4 due to OpenCL's alignment requirements; the fourth component of each point is padding and is set to zero.

#include <iostream>

#include <boost/compute/system.hpp>
#include <boost/compute/algorithm/copy.hpp>
#include <boost/compute/algorithm/accumulate.hpp>
#include <boost/compute/container/vector.hpp>
#include <boost/compute/types/fundamental.hpp>

namespace compute = boost::compute;

// the point centroid example calculates and displays the
// centroid of a set of 3D points stored as float4's
int main()
{
    using compute::float4_;

    // get default device and setup context
    compute::device device = compute::system::default_device();
    compute::context context(device);
    compute::command_queue queue(context, device);

    // point coordinates
    float points[] = { 1.0f, 2.0f, 3.0f, 0.0f,
                       -2.0f, -3.0f, 4.0f, 0.0f,
                       1.0f, -2.0f, 2.5f, 0.0f,
                       -7.0f, -3.0f, -2.0f, 0.0f,
                       3.0f, 4.0f, -5.0f, 0.0f };

    // create vector for five points
    compute::vector<float4_> vector(5, context);

    // copy point data to the device
    compute::copy(
        reinterpret_cast<float4_ *>(points),
        reinterpret_cast<float4_ *>(points) + 5,
        vector.begin(),
        queue
    );

    // calculate sum
    float4_ sum = compute::accumulate(
        vector.begin(), vector.end(), float4_(0, 0, 0, 0), queue
    );

    // calculate centroid
    float4_ centroid;
    for(size_t i = 0; i < 3; i++){
        centroid[i] = sum[i] / 5.0f;
    }

    // print centroid
    std::cout << "centroid: " << centroid << std::endl;

    return 0;
}
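
When experimenting with the example above, it can help to have a host-only reference to compare the device result against. The following is a minimal sketch in standard C++ (no Boost.Compute involved); the function name host_centroid is hypothetical and is not part of the library. It assumes the same padded layout as above, with four floats per point and the fourth one unused.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical host-side reference (not part of Boost.Compute):
// computes the centroid of `count` 3D points stored as padded
// groups of four floats (x, y, z, padding).
std::array<float, 3> host_centroid(const float *points, std::size_t count)
{
    assert(count > 0); // avoid dividing by zero below

    std::array<float, 3> sum = {{ 0.0f, 0.0f, 0.0f }};
    for(std::size_t p = 0; p < count; p++){
        for(std::size_t i = 0; i < 3; i++){
            // a stride of 4 skips the padding component of each point
            sum[i] += points[4 * p + i];
        }
    }

    // divide the coordinate sums by the number of points
    for(std::size_t i = 0; i < 3; i++){
        sum[i] /= static_cast<float>(count);
    }
    return sum;
}
```

Calling host_centroid(points, 5) with the points array from the example gives approximately (-0.8, -0.4, 0.5), which is the value the first three components of the device-computed centroid should match.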

Custom Functions

The OpenCL runtime and the Boost Compute library provide a number of built-in functions such as sqrt() and dot(), but these are often not sufficient for solving the problem at hand.

The Boost Compute library provides a few different ways to create custom functions that can be passed to the provided algorithms such as transform() and reduce().

The most basic method is to provide the raw source code for a function:

boost::compute::function<int (int)> add_four =
    boost::compute::make_function_from_source<int (int)>(
        "add_four",
        "int add_four(int x) { return x + 4; }"
    );

boost::compute::transform(input.begin(), input.end(), output.begin(), add_four, queue);

This can also be done more succinctly using the BOOST_COMPUTE_FUNCTION macro:

BOOST_COMPUTE_FUNCTION(int, add_four, (int x),
{
    return x + 4;
});

boost::compute::transform(input.begin(), input.end(), output.begin(), add_four, queue);

Also see "Custom OpenCL functions in C++ with Boost.Compute" for more details.
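
Because the raw-source method takes the function body as an ordinary string, that string can also be assembled at run time. The following is a minimal sketch of the idea in standard C++; the helper make_add_n_source is hypothetical and is not provided by Boost.Compute.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper (not part of Boost.Compute): builds the OpenCL C
// source for a function called `name` that adds the constant `n` to its
// int argument.
std::string make_add_n_source(const std::string &name, int n)
{
    assert(!name.empty()); // the function needs a name

    std::ostringstream source;
    source << "int " << name << "(int x) { return x + " << n << "; }";
    return source.str();
}
```

For instance, make_add_n_source("add_four", 4) produces the same source string as the add_four example above, and the result could be passed as the second argument to make_function_from_source<int (int)>() together with the matching function name.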

Custom Types

Boost.Compute provides the BOOST_COMPUTE_ADAPT_STRUCT macro which allows a C++ struct/class to be wrapped and used in OpenCL.

Complex Values

While OpenCL itself doesn't natively support complex data types, the Boost Compute library provides them.

To use complex values, first include the following header:

#include <boost/compute/types/complex.hpp>

A vector of complex values can be created like so:

// create vector on device
boost::compute::vector<std::complex<float> > vector;

// insert two complex values
vector.push_back(std::complex<float>(1.0f, 3.0f));
vector.push_back(std::complex<float>(2.0f, 4.0f));

Lambda Expressions

The lambda expression framework allows for functions and predicates to be defined at the call-site of an algorithm.

Lambda expressions use the placeholders _1 and _2 to indicate the arguments. The following declarations will bring the lambda placeholders into the current scope:

using boost::compute::lambda::_1;
using boost::compute::lambda::_2;

The following examples show how to use lambda expressions along with the Boost.Compute algorithms to perform more complex operations on the device.

To count the number of odd values in a vector:

boost::compute::count_if(vector.begin(), vector.end(), _1 % 2 == 1, queue);

To multiply each value in a vector by three and subtract four:

boost::compute::transform(vector.begin(), vector.end(), vector.begin(), _1 * 3 - 4, queue);

Lambda expressions can also be used to create function<> objects:

boost::compute::function<int(int)> add_four = _1 + 4;

Asynchronous Operations

A major performance bottleneck in GPGPU applications is memory transfer. This can be alleviated by overlapping memory transfer with computation. The Boost Compute library provides the copy_async() function, which performs asynchronous memory transfers between the host and the device.

For example, to initiate a copy from the host to the device and then perform other actions:

// data on the host
std::vector<float> host_vector = ...

// create a vector on the device
boost::compute::vector<float> device_vector(host_vector.size(), context);

// copy data to the device asynchronously
boost::compute::future<void> f = boost::compute::copy_async(
    host_vector.begin(), host_vector.end(), device_vector.begin(), queue
);

// perform other work on the host or device
// ...

// ensure the copy is completed
f.wait();

// use data on the device (e.g. sort)
boost::compute::sort(device_vector.begin(), device_vector.end(), queue);

Performance Timing

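The time taken by an operation on the device can be measured using event profiling: create the command queue with profiling enabled, then query the duration of the event associated with the operation.

For example, to measure the time to copy a vector of data from the host to the device: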

#include <algorithm>
#include <vector>
#include <cstdlib>
#include <iostream>

#include <boost/compute/event.hpp>
#include <boost/compute/system.hpp>
#include <boost/compute/algorithm/copy.hpp>
#include <boost/compute/async/future.hpp>
#include <boost/compute/container/vector.hpp>

namespace compute = boost::compute;

int main()
{
    // get the default device
    compute::device gpu = compute::system::default_device();

    // create context for default device
    compute::context context(gpu);

    // create command queue with profiling enabled
    compute::command_queue queue(
        context, gpu, compute::command_queue::enable_profiling
    );

    // generate random data on the host
    std::vector<int> host_vector(16000000);
    std::generate(host_vector.begin(), host_vector.end(), rand);

    // create a vector on the device
    compute::vector<int> device_vector(host_vector.size(), context);

    // copy data from the host to the device
    compute::future<void> future = compute::copy_async(
        host_vector.begin(), host_vector.end(), device_vector.begin(), queue
    );

    // wait for copy to finish
    future.wait();

    // get elapsed time from event profiling information
    boost::chrono::milliseconds duration =
        future.get_event().duration<boost::chrono::milliseconds>();

    // print elapsed time in milliseconds
    std::cout << "time: " << duration.count() << " ms" << std::endl;

    return 0;
}
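
A measured time is often easier to interpret as a transfer rate. The following is a minimal sketch in standard C++; the helper throughput_mb_per_s is hypothetical and is not part of Boost.Compute. For the example above, and assuming a 4-byte int, the transfer size would be 16000000 * 4 = 64000000 bytes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of Boost.Compute): converts a transfer
// size in bytes and an elapsed time in milliseconds into a throughput
// in megabytes (10^6 bytes) per second.
double throughput_mb_per_s(std::size_t bytes, double milliseconds)
{
    assert(milliseconds > 0.0); // a zero duration gives no meaningful rate

    const double megabytes = static_cast<double>(bytes) / 1.0e6;
    const double seconds = milliseconds / 1.0e3;
    return megabytes / seconds;
}
```

Note that a duration expressed in whole milliseconds can be zero for small transfers, so a finer unit may be needed before computing a rate this way.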

OpenCL API Interoperability

The Boost Compute library is designed to easily interoperate with the OpenCL API. All of the wrapped classes have conversion operators to their underlying OpenCL types which allows them to be passed directly to the OpenCL functions.

For example, a context object can be passed directly to clGetContextInfo():

// create context object
boost::compute::context ctx = boost::compute::system::default_context();

// query number of devices using the OpenCL API
cl_uint num_devices;
clGetContextInfo(ctx, CL_CONTEXT_NUM_DEVICES, sizeof(cl_uint), &num_devices, 0);
std::cout << "num_devices: " << num_devices << std::endl;