Boost C++ Libraries

...one of the most highly regarded and expertly designed C++ library projects in the world. Herb Sutter and Andrei Alexandrescu, C++ Coding Standards

Concurrent Containers :: Boost.Unordered

Concurrent Containers

Boost.Unordered provides boost::concurrent_node_set, boost::concurrent_node_map, boost::concurrent_flat_set and boost::concurrent_flat_map, hash tables that allow concurrent write/read access from different threads without having to implement any synchronzation mechanism on the user’s side.

std::vector<int>                    input;
boost::concurrent_flat_map<int,int> m;

...

// process input in parallel
const int                 num_threads = 8;
std::vector<std::jthread> threads;
std::size_t               chunk = input.size() / num_threads; // how many elements per thread

for (int i = 0; i < num_threads; ++i) {
  threads.emplace_back([&,i] {
    // calculate the portion of input this thread takes care of
    std::size_t start = i * chunk;
    std::size_t end = (i == num_threads - 1)? input.size(): (i + 1) * chunk;

    for (std::size_t n = start; n < end; ++n) {
      m.emplace(input[n], calculation(input[n]));
    }
  });
}
c++

In the example above, threads access m without synchronization, just as we’d do in a single-threaded scenario. In an ideal setting, if a given workload is distributed among N threads, execution is N times faster than with one thread —this limit is never attained in practice due to synchronization overheads and contention (one thread waiting for another to leave a locked portion of the map), but Boost.Unordered concurrent containers are designed to perform with very little overhead and typically achieve linear scaling (that is, performance is proportional to the number of threads up to the number of logical cores in the CPU).

Visitation-based API

The first thing a new user of Boost.Unordered concurrent containers will notice is that these classes do not provide iterators (which makes them technically not Containers in the C++ standard sense). The reason for this is that iterators are inherently thread-unsafe. Consider this hypothetical code:

auto it = m.find(k);  // A: get an iterator pointing to the element with key k
if (it != m.end() ) {
  some_function(*it); // B: use the value of the element
}
c++

In a multithreaded scenario, the iterator it may be invalid at point B if some other thread issues an m.erase(k) operation between A and B. There are designs that can remedy this by making iterators lock the element they point to, but this approach lends itself to high contention and can easily produce deadlocks in a program. operator[] has similar concurrency issues, and is not provided by boost::concurrent_flat_map/boost::concurrent_node_map either. Instead, element access is done through so-called visitation functions:

m.visit(k, [](const auto& x) { // x is the element with key k (if it exists)
  some_function(x);            // use it
});
c++

The visitation function passed by the user (in this case, a lambda function) is executed internally by Boost.Unordered in a thread-safe manner, so it can access the element without worrying about other threads interfering in the process.

On the other hand, a visitation function can not access the container itself:

m.visit(k, [&](const auto& x) {
  some_function(x, m.size()); // forbidden: m can't be accessed inside visitation
});
c++

Access to a different container is allowed, though:

m.visit(k, [&](const auto& x) {
  if (some_function(x)) {
    m2.insert(x); // OK, m2 is a different boost::concurrent_flat_map
  }
});
c++

But, in general, visitation functions should be as lightweight as possible to reduce contention and increase parallelization. In some cases, moving heavy work outside of visitation may be beneficial:

std::optional<value_type> o;
bool found = m.visit(k, [&](const auto& x) {
  o = x;
});
if (found) {
  some_heavy_duty_function(*o);
}
c++

Visitation is prominent in the API provided by concurrent containers, and many classical operations have visitation-enabled variations:

m.insert_or_visit(x, [](auto& y) {
  // if insertion failed because of an equivalent element y,
  // do something with it, for instance:
  ++y.second; // increment the mapped part of the element
});
c++

Note that in this last example the visitation function could actually modify the element: as a general rule, operations on a concurrent map m will grant visitation functions const/non-const access to the element depending on whether m is const/non-const. Const access can be always be explicitly requested by using cvisit overloads (for instance, insert_or_cvisit) and may result in higher parallelization. For concurrent sets, on the other hand, visitation is always const access.

Although expected to be used much less frequently, concurrent containers also provide insertion operations where an element can be visited right after element creation (in addition to the usual visitation when an equivalent element already exists):

  m.insert_and_cvisit(x,
    [](const auto& y) {
      std::cout<< "(" << y.first << ", " << y.second <<") inserted\n";
    },
    [](const auto& y) {
      std::cout<< "(" << y.first << ", " << y.second << ") already exists\n";
    });
c++

Consult the references of boost::concurrent_node_set, boost::concurrent_node_map, boost::concurrent_flat_set and boost::concurrent_flat_map for the complete list of visitation-enabled operations.

Whole-Table Visitation

In the absence of iterators, visit_all is provided as an alternative way to process all the elements in the container:

m.visit_all([](auto& x) {
  x.second = 0; // reset the mapped part of the element
});
c++

In C++17 compilers implementing standard parallel algorithms, whole-table visitation can be parallelized:

m.visit_all(std::execution::par, [](auto& x) { // run in parallel
  x.second = 0; // reset the mapped part of the element
});
c++

Traversal can be interrupted midway:

// finds the key to a given (unique) value

int  key = 0;
int  value = ...;
bool found = !m.visit_while([&](const auto& x) {
  if(x.second == value) {
    key = x.first;
    return false; // finish
  }
  else {
    return true;  // keep on visiting
  }
});

if(found) { ... }
c++

There is one last whole-table visitation operation, erase_if:

m.erase_if([](auto& x) {
  return x.second == 0; // erase the elements whose mapped value is zero
});
c++

visit_while and erase_if can also be parallelized. Note that, in order to increase efficiency, whole-table visitation operations do not block the table during execution: this implies that elements may be inserted, modified or erased by other threads during visitation. It is advisable not to assume too much about the exact global state of a concurrent container at any point in your program.

Bulk visitation

Suppose you have an std::array of keys you want to look up for in a concurrent map:

std::array<int, N> keys;
...
for(const auto& key: keys) {
  m.visit(key, [](auto& x) { ++x.second; });
}
c++

Bulk visitation allows us to pass all the keys in one operation:

m.visit(keys.begin(), keys.end(), [](auto& x) { ++x.second; });
c++

This functionality is not provided for mere syntactic convenience, though: by processing all the keys at once, some internal optimizations can be applied that increase performance over the regular, one-at-a-time case (consult the benchmarks). In fact, it may be beneficial to buffer incoming keys so that they can be bulk visited in chunks:

static constexpr auto bulk_visit_size = boost::concurrent_flat_map<int,int>::bulk_visit_size;
std::array<int, bulk_visit_size> buffer;
std::size_t                      i=0;
while(...) { // processing loop
  ...
  buffer[i++] = k;
  if(i == bulk_visit_size) {
    map.visit(buffer.begin(), buffer.end(), [](auto& x) { ++x.second; });
    i = 0;
  }
  ...
}
// flush remaining keys
map.visit(buffer.begin(), buffer.begin() + i, [](auto& x) { ++x.second; });
c++

There’s a latency/throughput tradeoff here: it will take longer for incoming keys to be processed (since they are buffered), but the number of processed keys per second is higher. bulk_visit_size is the recommended chunk size —smaller buffers may yield worse performance.

Blocking Operations

Concurrent containers can be copied, assigned, cleared and merged just like any other Boost.Unordered container. Unlike most other operations, these are blocking, that is, all other threads are prevented from accesing the tables involved while a copy, assignment, clear or merge operation is in progress. Blocking is taken care of automatically by the library and the user need not take any special precaution, but overall performance may be affected.

Another blocking operation is rehashing, which happens explicitly via rehash/reserve or during insertion when the table’s load hits max_load(). As with non-concurrent containers, reserving space in advance of bulk insertions will generally speed up the process.

Interoperability with non-concurrent containers

As open-addressing and concurrent containers are based on the same internal data structure, they can be efficiently move-constructed from their non-concurrent counterpart, and vice versa.

Table 1. Concurrent/non-concurrent interoperatibility

boost::concurrent_node_set

boost::unordered_node_set

boost::concurrent_node_map

boost::unordered_node_map

boost::concurrent_flat_set

boost::unordered_flat_set

boost::concurrent_flat_map

boost::unordered_flat_map

This interoperability comes handy in multistage scenarios where parts of the data processing happen in parallel whereas other steps are non-concurrent (or non-modifying). In the following example, we want to construct a histogram from a huge input vector of words: the population phase can be done in parallel with boost::concurrent_flat_map and results then transferred to the final container.

std::vector<std::string> words = ...;

// Insert words in parallel
boost::concurrent_flat_map<std::string_view, std::size_t> m0;
std::for_each(
  std::execution::par, words.begin(), words.end(),
  [&](const auto& word) {
    m0.try_emplace_or_visit(word, 1, [](auto& x) { ++x.second; });
  });

// Transfer to a regular unordered_flat_map
boost::unordered_flat_map m=std::move(m0);
c++