...one of the most highly
regarded and expertly designed C++ library projects in the
world.
— Herb Sutter and Andrei
Alexandrescu, C++
Coding Standards
Performance measurements were taken using std::chrono::highresolution_clock
,
with overhead corrections. The code was compiled with gcc-6.20, using build
options: variant = release, optimization = speed. Tests were executed on Intel
Core i7-4770S 3.10GHz, 4 cores with 8 hyperthreads (4C/8T), running Linux (4.7.0/x86_64).
Measurements headed 1C/1T were run in a single-threaded process.
The microbenchmark syknet from Alexander Temerev was ported and used for performance measurements. At the root the test spawns 10 threads-of-execution (ToE), e.g. actor/goroutine/fiber etc.. Each spawned ToE spawns additional 10 ToEs ... until 100000 ToEs are created. ToEs return back ther ordinal numbers (0 ... 99999), which are summed on the previous level and sent back upstream, until reaching the root. The test was run 10-20 times, producing a range of values for each measurement.
Table 1.1. time per actor/erlang process/goroutine (other languages) (average over 100,000)
Haskell | stack-1.0.4 |
Erlang | erts-7.0 |
Go | go1.6.1 (GOMAXPROCS == default) |
Go | go1.6.1 (GOMAXPROCS == 8) |
---|---|---|---|
0.32 µs |
0.64 µs - 1.21 µs |
1.52 µs - 1.64 µs |
0.70 µs - 0.98 µs |
std::thread
can not be tested at this time (C++14)
because the API does not allow to set thread stack size (idefault on Linux
2MB, on Windows 1MB). An out-of-memory error is likely. With pthreads the stack
size is set 8kBC.
Table 1.2. time per thread (average over *10,000* - unable to spawn 100,000 threads)
pthread |
|
---|---|
14.4 µs - 20.8 µs |
18.8 µs - 28.1 µs |
The test utilizes 4 cores with Symmetric MultiThreading enabled (8 logical
CPUs). The fiber stacks are allocated by fixedsize_stack
.
As the benchmark shows, the memory allocation algorithm is significant for performance in a multithreaded environment. The tests use glibc’s memory allocation algorithm (based on ptmalloc2) as well as Google’s TCmalloc (via linkflags="-ltcmalloc").[9]
The shared_work
scheduling algorithm uses one global queue,
containing fibers ready to run, shared between all threads. The work is distributed
equally over all threads. In the work_stealing
scheduling
algorithm, each thread has its own local queue. Fibers that are ready to run
are pushed to and popped from the local queue. If the queue runs out of ready
fibers, fibers are stolen from the local queues of other participating threads.
Table 1.3. time per fiber (average over 100,000)
fiber (4C/8T, work stealing, tcmalloc) |
fiber (4C/8T, work stealing) |
fiber (4C/8T, work sharing, tcmalloc) |
fiber (4C/8T, work sharing) |
fiber (1C/1T, round robin, tcmalloc) |
fiber (1C/1T, round robin) |
---|---|---|---|---|---|
0.13 µs - 0.26 µs |
0.35 µs - 0.66 µs |
0.62 µs - 0.80 µs |
0.90 µs - 1.11 µs |
0.90 µs - 1.03 µs |
0.91 µs - 1.28 µs |
[9] Tais B. Ferreira, Rivalino Matias, Autran Macedo, Lucio B. Araujo “An Experimental Study on Memory Allocators in Multicore and Multithreaded Applications”, PDCAT ’11 Proceedings of the 2011 12th International Conference on Parallel and Distributed Computing, Applications and Technologies, pages 92-98