Boost C++ Libraries

...one of the most highly regarded and expertly designed C++ library projects in the world. Herb Sutter and Andrei Alexandrescu, C++ Coding Standards

PrevUpHomeNext

NUMA

Modern micro-processors contain integrated memory controllers that are connected via channels to the memory. Accessing the memory can be organized in two kinds:
Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA).

In contrast to UMA, that provides a centralized pool of memory (and thus does not scale after a certain number of processors), a NUMA architecture divides the memory into local and remote memory relative to the micro-processor.
Local memory is directly attached to the processor's integrated memory controller. Memory connected to the memory controller of another micro-processor (multi-socket systems) is considered as remote memory. If a memory controller access remote memory it has to traverse the interconnect[8] and connect to the remote memory controller.
Thus accessing remote memory adds additional latency overhead to local memory access. Because of the different memory locations, a NUMA-system experiences non-uniform memory access time.
As a consequence the best performance is achieved by keeping the memory access local.

NUMA

NUMA support in Boost.Fiber

Because only a subset of the NUMA-functionality is exposed by several operating systems, Boost.Fiber provides only a minimalistic NUMA API.

Table 1.1. Supported functionality/operating systems

AIX

FreeBSD

HP/UX

Linux

Solaris

Windows

pin thread

+

+

+

+

+

+

logical CPUs/NUMA nodes

+

+

+

+

+

+[a]

NUMA node distance

-

-

-

+

-

-

tested on

AIX 7.2

FreeBSD 11

-

Arch Linux (4.10.13)

OpenIndiana HIPSTER

Windows 10

[a] Windows organizes logical cpus in groups of 64; boost.fiber maps {group-id,cpud-id} to a scalar equivalent to cpu ID of Linux (64 * group ID + cpu ID).


In order to keep the memory access local as possible, the NUMA topology must be evaluated.

std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
for ( auto n : topo) {
    std::cout << "node: " << n.id << " | ";
    std::cout << "cpus: ";
    for ( auto cpu_id : n.logical_cpus) {
        std::cout << cpu_id << " ";
    }
    std::cout << "| distance: ";
    for ( auto d : n.distance) {
        std::cout << d << " ";
    }
    std::cout << std::endl;
}
std::cout << "done" << std::endl;

output:
    node: 0 | cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23 | distance: 10 21
    node: 1 | cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31 | distance: 21 10
    done

The example shows that the systems consits out of 2 NUMA-nodes, to each NUMA-node belong 16 logical cpus. The distance measures the costs to access the memory of another NUMA-node. A NUMA-node has always a distance 10 to itself (lowest possible value).
The position in the array corresponds with the NUMA-node ID.

Some work-loads benefit from pinning threads to a logical cpus. For instance scheduling algorithm numa::work_stealing pins the thread that runs the fiber scheduler to a logical cpu. This prevents the operating system scheduler to move the thread to another logical cpu that might run other fiber scheduler(s) or migrating the thread to a logical cpu part of another NUMA-node.

void thread( std::uint32_t cpu_id, std::uint32_t node_id, std::vector< boost::fibers::numa::node > const& topo) {
    // thread registers itself at work-stealing scheduler
    boost::fibers::use_scheduling_algorithm< boost::fibers::algo::numa::work_stealing >( cpu_id, node_id, topo);
    ...
}

// evaluate the NUMA topology
std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
// start-thread runs on NUMA-node `0`
auto node = topo[0];
// start-thread is pinnded to first cpu ID in the list of logical cpus of NUMA-node `0`
auto start_cpu_id = * node.logical_cpus.begin();
// start-thread registers itself on work-stealing scheduler
boost::fibers::use_scheduling_algorithm< boost::fibers::algo::numa::work_stealing >( start_cpu_id, node.id, topo);
std::vector< std::thread > threads;
for ( auto & node : topo) {
    for ( std::uint32_t cpu_id : node.logical_cpus) {
        // exclude start-thread
        if ( start_cpu_id != cpu_id) {
            // spawn thread
            threads.emplace_back( thread, cpu_id, node.id, std::cref( topo) );
        }
    }
}
...

The example evaluates the NUMA topology with boost::fibers::numa::topology() and spawns for each logical cpu a thread. Each spawned thread installs the NUMA-aware work-stealing scheduler. The scheduler pins the thread to the logical cpu that was specified at construction.
If the local queue of one thread runs out of ready fibers, the thread tries to steal a ready fiber from another thread running at logical cpu that belong to the same NUMA-node (local memory access). If no fiber could be stolen, the thread tries to steal fibers from logical cpus part of other NUMA-nodes (remote memory access).

Synopsis

#include <boost/fiber/numa/pin_thread.hpp>
#include <boost/fiber/numa/topology.hpp>

namespace boost {
namespace fibers {
namespace numa {

struct node {
    std::uint32_t                   id;
    std::set< std::uint32_t >       logical_cpus;
    std::vector< std::uint32_t >    distance;
};
bool operator<( node const&, node const&) noexcept;

std::vector< node > topology();

void pin_thread( std::uint32_t);

}}}

#include <boost/fiber/algo/numa/work_stealing.hpp>

namespace boost {
namespace fibers {
namespace algo {
namespace numa {

class work_stealing;

}}}

Class numa::node

#include <boost/fiber/numa/topology.hpp>

namespace boost {
namespace fibers {
namespace numa {

struct node {
    std::uint32_t                   id;
    std::set< std::uint32_t >       logical_cpus;
    std::vector< std::uint32_t >    distance;
};
bool operator<( node const&, node const&) noexcept;

}}}

Data member id

std::uint32_t id;

Effects:

ID of the NUMA-node

Data member logical_cpus

std::set< std::uint32_t > logical_cpus;

Effects:

set of logical cpu IDs belonging to the NUMA-node

Data member distance

std::vector< std::uint32_t > distance;

Effects:

The distance between NUMA-nodes describe the cots of accessing the remote memory.

Note:

A NUMA-node has a distance of 10 to itself, remote NUMA-nodes have a distance > 10. The index in the array corresponds to the ID id of the NUMA-node. At the moment only Linux returns the correct distances, for all other operating systems remote NUMA-nodes get a default value of 20.

Member function operator<()

bool operator<( node const& lhs, node const& rhs) const noexcept;

Returns:

true if lhs != rhs is true and the implementation-defined total order of node::id values places lhs before rhs, false otherwise.

Throws:

Nothing.

Non-member function numa::topology()

#include <boost/fiber/numa/topology.hpp>

namespace boost {
namespace fibers {
namespace numa {

std::vector< node > topology();

}}}

Effects:

Evaluates the NUMA topology.

Returns:

a vector of NUMA-nodes describing the NUMA architecture of the system (each element represents a NUMA-node).

Throws:

system_error

Non-member function numa::pin_thread()

#include <boost/fiber/numa/pin_thread.hpp>

namespace boost {
namespace fibers {
namespace numa {

void pin_thread( std::uint32_t);

}}}

Effects:

Pins this thread to the logical cpu with ID cpu_id, e.g. the operating system scheduler will not migrate the thread to another logical cpu.

Throws:

system_error

Class numa::work_stealing

This class implements algorithm; the thread running this scheduler is pinned to the given logical cpu. If the local ready-queue runs out of ready fibers, ready fibers are stolen from other schedulers that run on logical cpus that belong to the same NUMA-node (local memory access).
If no ready fibers can be stolen from the local NUMA-node, the algorithm selects schedulers running on other NUMA-nodes (remote memory access).
The victim scheduler (from which a ready fiber is stolen) is selected at random.

#include <boost/fiber/algo/numa/work_stealing.hpp>

namespace boost {
namespace fibers {
namespace algo {
namespace numa {

class work_stealing : public algorithm {
public:
    work_stealing( std::uint32_t cpu_id,
                   std::uint32_t node_id,
                   std::vector< boost::fibers::numa::node > const& topo,
                   bool suspend = false);

    work_stealing( work_stealing const&) = delete;
    work_stealing( work_stealing &&) = delete;

    work_stealing & operator=( work_stealing const&) = delete;
    work_stealing & operator=( work_stealing &&) = delete;

    virtual void awakened( context *) noexcept;

    virtual context * pick_next() noexcept;

    virtual bool has_ready_fibers() const noexcept;

    virtual void suspend_until( std::chrono::steady_clock::time_point const&) noexcept;

    virtual void notify() noexcept;
};

}}}}

Constructor

work_stealing( std::uint32_t cpu_id, std::uint32_t node_id,
               std::vector< boost::fibers::numa::node > const& topo,
               bool suspend = false);

Effects:

Constructs work-stealing scheduling algorithm. The thread is pinned to logical cpu with ID cpu_id. If local ready-queue runs out of ready fibers, ready fibers are stolen from other schedulers using topology (represents the NUMA-topology of the system).

Throws:

system_error

Note:

If suspend is set to true, then the scheduler suspends if no ready fiber could be stolen. The scheduler will by woken up if a sleeping fiber times out or it was notified from remote (other thread or fiber scheduler).

Member function awakened()

virtual void awakened( context * f) noexcept;

Effects:

Enqueues fiber f onto the shared ready queue.

Throws:

Nothing.

Member function pick_next()

virtual context * pick_next() noexcept;

Returns:

the fiber at the head of the ready queue, or nullptr if the queue is empty.

Throws:

Nothing.

Note:

Placing ready fibers onto the tail of the sahred queue, and returning them from the head of that queue, shares the thread between ready fibers in round-robin fashion.

Member function has_ready_fibers()

virtual bool has_ready_fibers() const noexcept;

Returns:

true if scheduler has fibers ready to run.

Throws:

Nothing.

Member function suspend_until()

virtual void suspend_until( std::chrono::steady_clock::time_point const& abs_time) noexcept;

Effects:

Informs work_stealing that no ready fiber will be available until time-point abs_time. This implementation blocks in std::condition_variable::wait_until().

Throws:

Nothing.

Member function notify()

virtual void notify() noexcept = 0;

Effects:

Wake up a pending call to work_stealing::suspend_until(), some fibers might be ready. This implementation wakes suspend_until() via std::condition_variable::notify_all().

Throws:

Nothing.



[8] On x86 the interconnection is implemented by Intel's Quick Path Interconnect (QPI) and AMD's HyperTransport.


PrevUpHomeNext