CPU Topology & Thread Configuration#
TooManyCooks can use the hwloc library to query CPU topology and automatically configure executor threads for optimal performance.
This functionality is enabled by defining TMC_USE_HWLOC and linking to libhwloc.
When hwloc is enabled:
- You can query the system topology using tmc::topology::query(). This includes information about cache groups, physical cores, SMT levels, and CPU kinds (P-cores vs E-cores).
- You can call .add_partition() to restrict executor threads to specific cores, groups, or NUMA nodes.
- You can call a 2nd overload of .set_thread_init_hook() which receives detailed information about the thread, group, and CPU kind that an executor thread runs on.
- tmc::ex_cpu also gains some additional capabilities, which you can read about here.
Querying CPU Topology#
Use tmc::topology::query() to get a snapshot of the system’s CPU topology. For full info, try running the hwloc_topo example.
#define TMC_IMPL
#include "tmc/all_headers.hpp"
#include <iostream>

int main() {
  tmc::topology::cpu_topology topo = tmc::topology::query();
  std::cout << "Logical processors: " << topo.pu_count() << std::endl;
  std::cout << "Physical cores: " << topo.core_count() << std::endl;
  std::cout << "Core groups: " << topo.group_count() << std::endl;
  std::cout << "NUMA nodes: " << topo.numa_count() << std::endl;
  std::cout << "Hybrid architecture: " << (topo.is_hybrid() ? "yes" : "no") << std::endl;
}
CPU Kinds#
On hybrid architectures (such as Intel hybrid or Apple M-series), cores are classified by their cpu_kind:
| cpu_kind | Description |
|---|---|
| PERFORMANCE | P-Cores, or regular cores on non-hybrid systems |
| EFFICIENCY1 | E-Cores, Compact Cores, or Dense Cores |
| EFFICIENCY2 | Low Power E-Cores (e.g. Intel Meteor Lake LP E-cores) |
| | Matches all CPU kinds (convenience value for filtering) |
cpu_kind is a flags bitmap, so you can OR together multiple values when constructing a filter:
using tmc::topology::cpu_kind;
// Match both P-cores and E-cores
size_t kinds = cpu_kind::PERFORMANCE | cpu_kind::EFFICIENCY1;
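This combined mask can then be applied to a filter. A minimal sketch (it assumes set_cpu_kinds() accepts the OR'd value; see the Topology Filter section below for the single-kind form):

tmc::topology::topology_filter filter;
// Allow executor threads on both P-cores and E-cores.
filter.set_cpu_kinds(kinds);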
Core Groups#
TMC groups cores together based on shared cache and CPU kind.
The topology query exposes a core_group data structure with the following info:
- numa_index - Index of the NUMA node this group belongs to
- index - Unique index among all groups on the machine
- core_indexes - Indexes of cores in this group (global across all groups)
- cpu_kind - The CPU kind of all cores in this group
- smt_level - SMT/hyperthreading level (1 if no SMT, typically 2 for x86 monolithic/P-cores)
Groups are sorted so that Performance cores always come first, followed by Efficiency cores. On multi-NUMA systems with multiple CPU kinds, NUMA node is the major sort dimension.
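As a minimal sketch using only the fields listed above (built with TMC_USE_HWLOC, as in the earlier example), you can walk topo.groups to see how TMC has grouped the machine:

#define TMC_IMPL
#include "tmc/all_headers.hpp"
#include <iostream>

int main() {
  tmc::topology::cpu_topology topo = tmc::topology::query();
  for (auto const& g : topo.groups) {
    // Each group shares a cache and a CPU kind; cores are listed by global index.
    std::cout << "group " << g.index << ": NUMA node " << g.numa_index
              << ", " << g.core_indexes.size() << " cores"
              << ", SMT level " << g.smt_level << std::endl;
  }
}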
Topology Filter#
topology_filter is an input to an executor's add_partition() function; it lets you specify which physical cores that executor may use.
If multiple set_*() operations are combined on the same filter, the criteria are intersected: only cores that match all of them will be used.
tmc::topology::topology_filter filter;
// Use only P-cores
filter.set_cpu_kinds(tmc::topology::cpu_kind::PERFORMANCE);
// Use only specific NUMA nodes
filter.set_numa_indexes({0, 1});
// Use only specific core groups
filter.set_group_indexes({0, 2, 4});
// Use only specific cores
filter.set_core_indexes({0, 1, 2, 3});
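Because combining set_*() calls on one filter intersects the criteria, use operator| (documented in the API reference below) when you want the union of two filters instead. A small sketch:

tmc::topology::topology_filter numa0, numa1;
numa0.set_numa_indexes({0});
numa1.set_numa_indexes({1});
// Union: allows cores on NUMA node 0 or NUMA node 1.
tmc::topology::topology_filter either = numa0 | numa1;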
Pinning External Threads#
Use tmc::topology::pin_thread() to pin a non-executor thread to specific hardware resources:
tmc::topology::topology_filter numa_filter;
numa_filter.set_numa_indexes({0});

auto external_thread = std::thread([&numa_filter]() {
  // This thread will only run on NUMA node 0.
  tmc::topology::pin_thread(numa_filter);
});

// This executor will also only run on NUMA node 0.
// This prevents cross-NUMA latency between the executor and the external thread.
tmc::ex_cpu ex;
ex.add_partition(numa_filter).init();
Apple platforms do not allow thread pinning. Instead, this sets the QoS class based on the CPU kind of the allowed resources.
API Reference#
Topology Types#
-
struct cpu_kind#
CPU kind types for hybrid architectures (P-cores vs E-cores). cpu_kind is a flags bitmap; you can OR together multiple flags to combine them in a filter.
-
struct core_group#
Public Members
-
size_t numa_index#
Index of this group’s NUMA node. Indexes start at 0 and count up.
-
size_t index#
Index among all groups on this machine. Indexes start at 0 and count up.
-
std::vector<size_t> core_indexes#
Indexes of cores that are in this group. Indexes start at 0 in the first group, and count up. The index is global across all groups:
groups[0].core_indexes.back() + 1 == groups[1].core_indexes[0]
-
size_t smt_level#
SMT (hyperthreading) level of this group’s CPU kind. If a core does not support SMT, this will be 1. Most consumer CPUs have SMT == 2.
-
struct cpu_topology#
The public API for the TMC CPU topology. It exposes a view of “core groups”, which are used internally by TMC to construct the work-stealing matrix. Cores are partitioned into groups based on shared cache and CPU kind.
This is a “plain old data” type with no internal or external references.
Public Functions
-
bool is_hybrid() const#
Returns true if this machine has more than one CPU kind.
-
size_t pu_count() const#
The total number of logical processors (including SMT/hyperthreading).
-
size_t core_count() const#
The total number of physical processors (not including SMT/hyperthreading).
-
size_t group_count() const#
The total number of core groups that TMC sees. These groups are based on shared caches and CPU kinds. For more detail on the group construction rules, see the Core Groups section above.
-
size_t numa_count() const#
The total number of NUMA nodes.
Public Members
-
std::vector<core_group> groups#
Groups are sorted so that all fields are in strictly increasing order. That is,
groups[i].field < groups[i+1].field, for any field. This means that Performance cores always come first in this ordering. This may differ from your OS ordering (some OSes put Efficiency cores first).
There is one exception: if your system has multiple NUMA nodes and multiple CPU kinds, the NUMA node will be the major sort dimension.
-
std::vector<size_t> cpu_kind_counts#
Core counts, grouped by CPU kind. Index 0 is the number of P-cores, or homogeneous cores. Index 1 (if it exists) is the number of E-cores. Index 2 (if it exists) is the number of LP E-cores.
-
float container_cpu_quota#
Container CPU quota detection result. If running in a container with CPU limits, this will contain the effective number of allowed CPUs. This only detects limits from Linux cgroups (v1 or v2) based containerization.
If container CPU quota is detected, it will become the default number of threads (rounded down, to a minimum of 1) for tmc::ex_cpu. If .set_thread_count() is called explicitly, that will override the quota.
This will be populated if running with docker run --cpus=2.
It will not be populated if running with docker run --cpuset-cpus=0,1, which doesn’t appear as a cgroups limit, and will instead be detected by hwloc as a change in the topology that only exposes 2 cores.
If no limit is detected, this will be 0.0f.
-
class topology_filter#
Constructs a filter to limit the allowed CPU resources for an executor. The default filter allows everything except EFFICIENCY2 cores (LP E-cores). Calling the same set_* function twice will override the previous set. Calling different set_* functions will produce an allowed set that is the intersection of the two sets. Be careful as you can easily create an empty set this way.
Public Functions
-
void set_core_indexes(std::vector<size_t> Indexes)#
Set the allowed core indexes.
-
void set_group_indexes(std::vector<size_t> Indexes)#
Set the allowed group indexes.
-
void set_numa_indexes(std::vector<size_t> Indexes)#
Set the allowed NUMA indexes.
-
topology_filter operator|(topology_filter const &rhs) const#
OR together two filters to produce a filter that allows elements that match any filter. This is a union, not an intersection.
-
std::vector<size_t> const &core_indexes() const#
Gets the allowed core indexes.
-
std::vector<size_t> const &group_indexes() const#
Gets the allowed group indexes.
-
std::vector<size_t> const &numa_indexes() const#
Gets the allowed NUMA indexes.
Topology Functions#
-
cpu_topology tmc::topology::query()#
Query the system CPU topology. Returns a copy of the topology; modifications to this copy have no effect on anything else.
-
void tmc::topology::pin_thread(topology_filter const &Allowed)#
Pins the current thread to the set of hardware resources defined by the provided filter. You don’t need to call this on any TMC executor threads, but you can call it on an external thread so that it will reside in the same portion of the processor as an executor that it communicates with.
On Apple platforms, direct thread pinning is not allowed. This will set the QoS class based on the cpu_kind of the allowed resources instead. If the allowed resources span multiple cpu_kinds, QoS will not be set.
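As a combined usage sketch (a hybrid machine is assumed, otherwise the E-core filter matches nothing; the filters, executor setup, and pin_thread() call all use the APIs described above), a background thread can be kept on Efficiency cores while the executor is partitioned onto the Performance cores:

#define TMC_IMPL
#include "tmc/all_headers.hpp"
#include <thread>

int main() {
  tmc::topology::topology_filter p_cores, e_cores;
  p_cores.set_cpu_kinds(tmc::topology::cpu_kind::PERFORMANCE);
  e_cores.set_cpu_kinds(tmc::topology::cpu_kind::EFFICIENCY1);

  // Executor threads run only on P-cores.
  tmc::ex_cpu ex;
  ex.add_partition(p_cores).init();

  // Background thread stays off the executor's P-cores.
  std::thread background([&e_cores]() {
    tmc::topology::pin_thread(e_cores);
    // ... background work ...
  });

  background.join();
}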