tmc::ex_cpu#

ex_cpu is the primary executor of TooManyCooks. It provides a high-performance work-stealing thread pool with the option for multiple priority levels.

It is recommended to use a single instance of ex_cpu for general work submission across your entire application. This maximizes the efficiency of the work-stealing architecture. However, you can create additional ex_cpu instances if your application requires it. For example, a separate single-threaded executor may be useful to process calls to certain external libraries.

You must call .init() on each executor instance before you use it. Before calling init(), you may configure it by calling any of the member functions that begin with set_.

A global instance of ex_cpu is available from tmc::cpu_executor(). This is provided as a convenience so that you can access the executor without needing to explicitly inject it. However, if you don’t want to use the global instance, just don’t init() it, and it won’t consume any resources.

Usage Example#

#define TMC_IMPL
#include "tmc/all_headers.hpp"

// A minimal root task; any tmc::task<int> works here
static tmc::task<int> main_task() { co_return 0; }

int main() {
  // Configure and init the global CPU executor
  tmc::cpu_executor().set_thread_count(8).set_priority_count(1).init();

  return tmc::async_main(main_task());
}
Alternatively, you can construct and manage a standalone instance:

#define TMC_IMPL
#include "tmc/all_headers.hpp"

// A minimal root task; any tmc::task<int> works here
static tmc::task<int> main_task() { co_return 0; }

int main() {
  // Create a standalone instance of the CPU executor
  tmc::ex_cpu my_executor;
  my_executor.set_thread_count(8).set_priority_count(1).init();

  tmc::post_waitable(my_executor, main_task()).wait();
}

TMC_USE_HWLOC additional features#

When TMC_USE_HWLOC is enabled, tmc::ex_cpu gains many features:

Hardware-Optimized Defaults#

By default, tmc::ex_cpu will create 1 thread per physical core reported by hwloc. These threads are organized into work-stealing groups: cores of the same CPU kind that share a last-level cache form a group. Each thread is pinned so that it may occupy any core in its cache group, but cannot be migrated to a different group. Threads prefer to steal work from other threads in their own cache group before checking other groups for work.

P-cores and E-cores will be included, but LP E-cores (as on Intel Meteor Lake laptops) will be excluded.

// Automagically configures itself for optimal performance
tmc::cpu_executor().init();

Thread Occupancy#

fill_thread_occupancy() automatically fills threads up to the SMT level of each CPU kind. This handles systems with differing SMT levels:

  • AMD/Intel homogeneous: 2.0 for all cores

  • Intel hybrid: 2.0 for P-cores, 1.0 for E-cores

  • Apple M-series: 1.0 for all cores

tmc::cpu_executor()
  .fill_thread_occupancy()
  .init();

Or manually control how many threads are created per physical core:

tmc::cpu_executor()
  // 1.0 = 1 thread per core (default)
  // 2.0 = full SMT (2 threads per core on most consumer CPUs)
  .set_thread_occupancy(2.0f, tmc::topology::cpu_kind::PERFORMANCE)
  .set_thread_occupancy(1.0f, tmc::topology::cpu_kind::EFFICIENCY1)
  .init();

CPU Partitioning#

Create separate executors for different cache groups or NUMA nodes:

auto topo = tmc::topology::query();

// Construct an independent ex_cpu for each NUMA node
std::vector<tmc::ex_cpu> exCpus(topo.numa_count());
for (size_t i = 0; i < topo.numa_count(); ++i) {
  tmc::topology::topology_filter f{};
  f.set_numa_indexes({i});
  exCpus[i].add_partition(f).init();
}

Hybrid Work Steering#

When using a machine with multiple CPU kinds (P- and E-cores), care should be taken to steer latency-sensitive work to P-cores, while reserving E-cores for high-parallelism jobs or background tasks. There are several ways to do this in TooManyCooks.

First, set up your filters:

tmc::topology::topology_filter p_cores;
p_cores.set_cpu_kinds(tmc::topology::cpu_kind::PERFORMANCE);

tmc::topology::topology_filter e_cores;
e_cores.set_cpu_kinds(tmc::topology::cpu_kind::EFFICIENCY1);

Option 1: Create independent executors for the different CPU kinds:

tmc::ex_cpu ex_p_core;
ex_p_core.add_partition(p_cores).init();

tmc::ex_cpu ex_e_core;
ex_e_core.add_partition(e_cores).init();

Option 2: Create a single executor with independent priority partitions:

// P-cores handle high (priority 0) work
// E-cores handle low (priority 1) work
// They exist in the same executor but no work stealing can occur between core types
tmc::cpu_executor()
  .set_priority_count(2)
  .add_partition(p_cores, 0, 1)
  .add_partition(e_cores, 1, 2)
  .init();

Option 3: Create a single executor with overlapping priority partitions:

// P-cores handle high (priority 0) and medium (priority 1) work
// E-cores handle medium (priority 1) and low (priority 2) work
// Work stealing between core types can happen for priority 1 work
tmc::cpu_executor()
  .set_priority_count(3)
  .add_partition(p_cores, 0, 2)
  .add_partition(e_cores, 1, 3)
  .init();

For a complete implementation, including fallback behavior for homogeneous processors, see the hwloc_hybrid_executor example.

…and more#

The additional functions that become available with TMC_USE_HWLOC are marked "Requires TMC_USE_HWLOC" in the API Reference below.

API Reference#

constexpr ex_cpu &tmc::cpu_executor()#

Returns a reference to the global instance of tmc::ex_cpu.

class ex_cpu#

The default multi-threaded executor of TooManyCooks.

Public Functions

ex_cpu &set_thread_occupancy(float ThreadOccupancy, tmc::topology::cpu_kind::value CpuKinds = tmc::topology::cpu_kind::PERFORMANCE)#

Requires TMC_USE_HWLOC. Builder func to set the number of threads per core before calling init(). The default is 1.0f, which will cause init() to automatically create threads equal to the number of physical cores. If you want full SMT, set it to 2.0f. Intermediate values (1.5f, 1.75f) are also valid, to increase thread occupancy without full SMT saturation. If the input is less than 1.0f, the minimum number of threads this can reduce a group to is 1.

This only applies to CPU kinds specified in the 2nd parameter (defaults to P-cores). It can be called multiple times to set different occupancies for different CPU kinds.
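
For example, a minimal sketch that uses a fractional occupancy (with the default P-core parameter) to add threads without fully saturating SMT:

// 1.5 threads per P-core: more threads than physical cores,
// but without full SMT saturation
tmc::cpu_executor().set_thread_occupancy(1.5f).init();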

ex_cpu &fill_thread_occupancy()#

Requires TMC_USE_HWLOC. Builder func to fill the SMT level of each core. On systems with multiple CPU kinds, the occupancy will be set separately for each CPU kind, based on its SMT level. (e.g. on Intel Hybrid, only P-cores have SMT, but on Apple M, neither P-cores nor E-cores have SMT)

ex_cpu &add_partition(tmc::topology::topology_filter Filter, size_t PriorityRangeBegin = 0, size_t PriorityRangeEnd = TMC_MAX_PRIORITY_COUNT)#

Requires TMC_USE_HWLOC. Builder func to limit threads to a subset of the available CPUs. This affects both the thread count and thread affinities.

If called multiple times, this can be used to create multiple subsets in the same executor, which can take tasks of different priorities. This can be used to steer work to different partitions based on priority, e.g. between P-cores and E-cores on hybrid CPUs. See the hybrid_executor.cpp example.

ex_cpu &set_thread_pinning_level(tmc::topology::thread_pinning_level Level)#

Requires TMC_USE_HWLOC. Builder func to specify whether threads should be pinned/bound to specific cores, groups, or NUMA nodes. The default is GROUP.
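
For example, to pin each worker thread to its own physical core (see thread_pinning_level under Supporting Data Types):

tmc::cpu_executor()
  .set_thread_pinning_level(tmc::topology::thread_pinning_level::CORE)
  .init();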

ex_cpu &set_thread_packing_strategy(tmc::topology::thread_packing_strategy Strategy)#

Requires TMC_USE_HWLOC. Builder func to configure how threads should be allocated when the thread occupancy is less than the full system. This will only have any effect if set_thread_count() is called with a number less than the count of physical cores in the system.
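
For example, a sketch that spreads a reduced thread count across the available groups; the strategy takes effect here because set_thread_count() is below the physical core count:

// With fewer threads than cores, FAN spreads them across cache groups
tmc::cpu_executor()
  .set_thread_count(4)
  .set_thread_packing_strategy(tmc::topology::thread_packing_strategy::FAN)
  .init();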

ex_cpu &set_thread_init_hook(std::function<void(tmc::topology::thread_info)> Hook)#

Builder func to set a hook that will be invoked at the startup of each thread owned by this executor, and passed information about this thread. This overload requires TMC_USE_HWLOC.

ex_cpu &set_thread_teardown_hook(std::function<void(tmc::topology::thread_info)> Hook)#

Builder func to set a hook that will be invoked before destruction of each thread owned by this executor, and passed information about this thread. This overload requires TMC_USE_HWLOC.
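
A minimal sketch of the topology-aware hook, printing the documented thread_info fields (assumes <cstdio> has been included):

tmc::cpu_executor()
  .set_thread_init_hook([](tmc::topology::thread_info Info) {
    // index is executor-wide; index_within_group restarts at 0 per group
    std::printf("worker %zu started (slot %zu in its group)\n",
                Info.index, Info.index_within_group);
  })
  .init();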

ex_cpu &set_thread_count(size_t ThreadCount)#

Builder func to set the number of threads before calling init(). The maximum allowed value is equal to the number of bits on your platform (32 or 64 bit), unless TMC_MORE_THREADS is defined, in which case the number of threads is unlimited. If this is not called, the default behavior is:

  • If Linux cgroups CPU quota is detected, that will be used to set the number of threads (rounded down, to a minimum of 1).

  • Otherwise, if TMC_USE_HWLOC is enabled, 1 thread per physical core will be created.

  • Otherwise, std::thread::hardware_concurrency() threads will be created.

size_t thread_count()#

Gets the number of worker threads. Only useful after init() has been called.

ex_cpu &set_priority_count(size_t PriorityCount)#

Builder func to set the number of priority levels before calling init(). The value must be in the range [1, 16]. The default is 1.

size_t priority_count()#

Gets the number of priority levels. Only useful after init() has been called.

ex_cpu &set_thread_init_hook(std::function<void(size_t)> Hook)#

Builder func to set a hook that will be invoked at the startup of each thread owned by this executor, and passed the ordinal index [0..thread_count()-1] of the thread.

ex_cpu &set_thread_teardown_hook(std::function<void(size_t)> Hook)#

Builder func to set a hook that will be invoked before destruction of each thread owned by this executor, and passed the ordinal index [0..thread_count()-1] of the thread.
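
A minimal sketch using the ordinal-index overloads, which do not require TMC_USE_HWLOC (assumes <cstdio> has been included):

tmc::cpu_executor()
  .set_thread_init_hook([](size_t I) { std::printf("worker %zu up\n", I); })
  .set_thread_teardown_hook([](size_t I) { std::printf("worker %zu down\n", I); })
  .init();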

ex_cpu &set_spins(size_t Spins)#

Builder func to set the number of times that a thread worker will spin looking for new work when all queues appear to be empty before suspending the thread. Each spin is an asm("pause") followed by re-checking all queues. The default is 4.

ex_cpu &set_work_stealing_strategy(tmc::work_stealing_strategy Strategy)#

Builder func to configure the work-stealing strategy used internally by this executor. The default is HIERARCHY_MATRIX.

void init()#

Initializes the executor. If you want to customize the behavior, call the set_X() functions before calling init(). By default, uses hwloc to automatically generate threads, and creates 1 (or TMC_PRIORITY_COUNT) priority levels.

If the executor is already initialized, calling init() will do nothing.

void teardown()#

Stops the executor, joins the worker threads, and destroys resources. Does not wait for any queued work to complete. teardown() must not be called from one of this executor’s threads.

Restores the executor to an uninitialized state. After calling teardown(), you may call set_X() to reconfigure the executor and call init() again.

If the executor is not initialized, calling teardown() will do nothing.
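
A sketch of the reconfigure-and-restart cycle described above:

tmc::ex_cpu ex;
ex.set_thread_count(2).init();
// ... submit and await work ...
ex.teardown();                  // joins workers; queued work is not drained
ex.set_thread_count(4).init();  // reconfigure, then start again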

ex_cpu()#

After constructing, you must call init() before use.

~ex_cpu()#

Invokes teardown(). Must not be called from one of this executor’s threads.

void post(work_item &&Item, size_t Priority = 0, size_t ThreadHint = NO_HINT)#

Submits a single work_item to the executor. If Priority is out of range, it will be clamped to an in-range value.

Rather than calling this directly, it is recommended to use the tmc::post() free function template.
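
A minimal sketch of the recommended free-function form; assumes some_task() yields something convertible to a work_item:

// Fire-and-forget submission at priority 0
tmc::post(tmc::cpu_executor(), some_task(), 0);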

tmc::ex_any *type_erased()#

Returns a pointer to the type erased ex_any version of this executor. This object shares a lifetime with this executor, and can be used for pointer-based equality comparison against the thread-local tmc::current_executor().
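
For example, a sketch that checks whether the calling thread is one of this executor's workers:

// true if the current thread belongs to my_executor
bool on_my_executor =
  tmc::current_executor() == my_executor.type_erased();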

template<typename It>
inline void post_bulk(It &&Items, size_t Count, size_t Priority = 0, size_t ThreadHint = NO_HINT)#

Submits Count items to the executor. It is expected to be an iterator type that implements operator*() and It& operator++(). If Priority is out of range, it will be clamped to an in-range value.

Rather than calling this directly, it is recommended to use the tmc::post_bulk() free function template.
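
A minimal sketch of a bulk submission through the member function; make_item() is a hypothetical factory that produces tmc::work_item values, and <vector> is assumed to be included:

std::vector<tmc::work_item> items;
for (size_t i = 0; i < 4; ++i) {
  items.push_back(make_item()); // hypothetical work_item source
}
my_executor.post_bulk(items.begin(), items.size());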

Supporting Data Types#

enum class tmc::topology::thread_pinning_level#

Specifies whether threads should be pinned/bound to specific cores, groups, or NUMA nodes.

Values:

enumerator CORE#

Threads will be pinned to individual physical cores. This is useful for applications where threads have exclusive access to cores.

enumerator GROUP#

Threads may run on any core in their group. This prevents threads from being migrated across last-level caches, but allows flexibility in placement within that cache. This is optimal for interactive applications that run in the presence of external threads that may compete for the same execution resources.

enumerator NUMA#

Threads may run on any core in their NUMA node.

enumerator NONE#

Threads may be moved freely by the OS.

enum class tmc::topology::thread_packing_strategy#

Specifies how threads should be allocated when the thread occupancy is less than the full system. This will only have any effect if set_thread_count() is called with a number less than the count of physical cores in the system.

Values:

enumerator PACK#

Threads will be packed next to each other to maximize locality. Threads will be allocated at the low core indexes of the executor (core 0,1,2…). This optimizes for inter-thread work-stealing efficiency, at the expense of individual thread last-level cache space.

enumerator FAN#

Threads will be spread equally among the available thread groups in the executor. This will negatively impact work-stealing latency between groups, but allows individual threads to have more exclusive access to their own last-level cache.

struct thread_info#

Data passed into the callback that was provided to set_thread_init_hook() and set_thread_teardown_hook(). Contains information about this thread, and the thread group that it runs on.

Public Members

core_group group#

The core group that this thread is part of.

size_t index#

The index of this thread among all threads in its executor. Ranges from 0 to thread_count() - 1.

size_t index_within_group#

The index of this thread among all threads in its group. Starts from 0 for each group.