Performance

Performance measurements were taken using std::chrono::high_resolution_clock, with overhead corrections. The code was compiled with gcc-6.3.1, using build options: variant = release, optimization = speed. Tests were executed on a dual Intel Xeon E5-2620 v4 system (2.2 GHz, 16C/32T, 64 GB RAM) running Linux (x86_64).
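The overhead correction mentioned above can be illustrated with a minimal sketch. It is not the benchmark's actual harness: the names bench_clock, nanos, clock_overhead and measure are hypothetical, and the correction shown (subtracting the average cost of two back-to-back clock reads) is only one plausible way to do it.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical aliases, introduced only for this sketch.
using bench_clock = std::chrono::high_resolution_clock;
using nanos = std::chrono::nanoseconds;

// Estimate the average length of a timed interval that contains no work,
// i.e. the overhead the clock reads themselves add to a measurement.
inline nanos clock_overhead(std::size_t samples = 1000) {
    if (samples == 0) {
        return nanos::zero();
    }
    nanos total = nanos::zero();
    for (std::size_t i = 0; i < samples; ++i) {
        const auto t0 = bench_clock::now();
        const auto t1 = bench_clock::now();
        total += std::chrono::duration_cast<nanos>(t1 - t0);
    }
    return total / static_cast<nanos::rep>(samples);
}

// Time one call of fn and subtract the previously estimated overhead.
// The result is clamped at zero so that a short run never reports a
// negative duration.
template <typename Fn>
nanos measure(Fn&& fn, nanos overhead) {
    const auto t0 = bench_clock::now();
    fn();
    const auto t1 = bench_clock::now();
    const nanos elapsed = std::chrono::duration_cast<nanos>(t1 - t0) - overhead;
    return elapsed < nanos::zero() ? nanos::zero() : elapsed;
}
```

A caller would compute clock_overhead() once, then pass it to measure() for every timed run and divide the result by the number of ToEs to obtain a per-ToE figure.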

Measurements headed 1C/1T were run in a single-threaded process.

The microbenchmark skynet from Alexander Temerev was ported and used for performance measurements. At the root, the test spawns 10 threads-of-execution (ToEs, e.g. actors, goroutines or fibers). Each spawned ToE spawns 10 additional ToEs, and so on, until 1,000,000 ToEs are created. The ToEs return their ordinal numbers (0 ... 999,999), which are summed on the previous level and sent back upstream until they reach the root. The test was run 10-20 times, producing a range of values for each measurement.
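The shape of the computation can be sketched with plain recursion. This is an illustrative stand-in, not the ported benchmark: it spawns no ToEs at all, and the function name and parameters are assumptions. Each recursive call marks the point where the real test would launch a ToE and wait for its result.

```cpp
#include <cstdint>

// Sequential sketch of the skynet tree.
//   num  : ordinal number of the first leaf below this node
//   size : number of leaves below this node
//   div  : fan-out per level (10 in the benchmark)
// A leaf (size == 1) returns its own ordinal number; an inner node
// returns the sum of its children's results.
std::uint64_t skynet(std::uint64_t num, std::uint64_t size, std::uint64_t div) {
    if (size == 1) {
        return num;
    }
    const std::uint64_t sub_size = size / div;
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < div; ++i) {
        // In the benchmark this call would be a newly spawned ToE.
        sum += skynet(num + i * sub_size, sub_size, div);
    }
    return sum;
}
```

Calling skynet(0, 1000000, 10) visits 1,000,000 leaves and returns the sum of 0 ... 999,999, which is 499,999,500,000. That value gives a ported benchmark a simple correctness check.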

Table 1.2. time per actor/Erlang process/goroutine (other languages) (average over 1,000,000)

Haskell | stack-1.4.0/ghc-8.0.1	Go | go1.8.1	Erlang | erts-8.3
0.05 µs - 0.06 µs	0.42 µs - 0.49 µs	0.63 µs - 0.73 µs

Pthreads are created with a stack size of 8 kB, while std::thread instances use the system default (1 MB - 2 MB). The microbenchmark could not be run with 1,000,000 threads because of resource exhaustion (with both pthread and std::thread), so the test was run with only 10,000 threads.

Table 1.3. time per thread (average over 10,000 - unable to spawn 1,000,000 threads)

pthread	std::thread	std::async
54 µs - 73 µs	52 µs - 73 µs	106 µs - 122 µs

The test utilizes 16 cores with simultaneous multithreading (SMT) enabled (32 logical CPUs). The fiber stacks are allocated by fixedsize_stack.

As the benchmark shows, the choice of memory allocator has a significant effect on performance in a multithreaded environment. The tests use glibc’s memory allocator (based on ptmalloc2) as well as Google’s TCMalloc (via linkflags="-ltcmalloc").^[9]

In the work_stealing scheduling algorithm, each thread has its own local queue. Fibers that are ready to run are pushed to and popped from the local queue. If the queue runs out of ready fibers, fibers are stolen from the local queues of other participating threads.
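The scheduling rule described above can be sketched as follows. This is a simplified illustration, not the library's implementation: WorkerQueue, Task and next_task are hypothetical names, a mutex-protected std::deque stands in for whatever queue the scheduler really uses, and the choice of which end the owner pops from and which end a thief steals from is an assumption.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a ready fiber; a real scheduler would store
// a fiber context here rather than an int.
using Task = int;

// One local ready queue per participating thread.
class WorkerQueue {
public:
    void push(Task t) {
        std::lock_guard<std::mutex> lock(mtx_);
        tasks_.push_back(t);
    }

    // The owning thread takes the most recently pushed task.
    bool pop(Task& out) {
        std::lock_guard<std::mutex> lock(mtx_);
        if (tasks_.empty()) return false;
        out = tasks_.back();
        tasks_.pop_back();
        return true;
    }

    // Another thread steals the oldest task from the opposite end.
    bool steal(Task& out) {
        std::lock_guard<std::mutex> lock(mtx_);
        if (tasks_.empty()) return false;
        out = tasks_.front();
        tasks_.pop_front();
        return true;
    }

private:
    std::mutex mtx_;
    std::deque<Task> tasks_;
};

// Pick the next task for worker `self`: try the local queue first and
// fall back to stealing from the other workers in round-robin order.
// Returns false when no queue holds a ready task.
bool next_task(std::vector<WorkerQueue>& queues, std::size_t self, Task& out) {
    if (queues[self].pop(out)) return true;
    for (std::size_t i = 1; i < queues.size(); ++i) {
        const std::size_t victim = (self + i) % queues.size();
        if (queues[victim].steal(out)) return true;
    }
    return false;
}
```

With two queues, where queue 0 holds tasks 1 and 2 and queue 1 is empty, worker 1 finds nothing locally and steals task 1 from queue 0, after which worker 0 still pops task 2 from its own queue. Stealing from the opposite end is a common way to reduce contention between owner and thief, but the text above does not say which ends are used.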

Table 1.4. time per fiber (average over 1,000,000)

fiber (16C/32T, work stealing, tcmalloc)	fiber (1C/1T, round robin, tcmalloc)
0.05 µs - 0.09 µs	1.69 µs - 1.79 µs

^[9] Tais B. Ferreira, Rivalino Matias, Autran Macedo, Lucio B. Araujo, “An Experimental Study on Memory Allocators in Multicore and Multithreaded Applications”, PDCAT ’11: Proceedings of the 2011 12th International Conference on Parallel and Distributed Computing, Applications and Technologies, pages 92-98
