.. _ex_cpu:

tmc::ex_cpu
-----------------------------------------------------------------------------------
``ex_cpu`` is the primary executor of TooManyCooks. It provides a high-performance work-stealing thread pool with optional support for multiple priority levels.

It is recommended to use a single instance of ``ex_cpu`` for general work submission across your entire application. This maximizes the efficiency of the work-stealing architecture.
However, you can create additional ``ex_cpu`` instances if your application requires it. For example, a separate single-threaded executor may be useful to process calls to certain external libraries.

You must call ``.init()`` on each executor instance before you use it. Before calling ``init()``, you may configure it by calling any of the member functions that begin with ``set_``.

A global instance of ``ex_cpu`` is available from ``tmc::cpu_executor()``.
This is provided as a convenience so that you can access the executor without needing to explicitly inject it.
However, if you don't want to use the global instance, just don't ``init()`` it, and it won't consume any resources.

Usage Example
-----------------------------------------------------------------------------------

.. code-block:: cpp

  #define TMC_IMPL
  #include "tmc/all_headers.hpp"

  int main() {
    // Configure and init the global CPU executor
    tmc::cpu_executor().set_thread_count(8).set_priority_count(1).init();

    // main_task() is your application's top-level task, defined elsewhere
    return tmc::async_main(main_task());
  }

.. code-block:: cpp

  #define TMC_IMPL
  #include "tmc/all_headers.hpp"

  int main() {
    // Create a standalone instance of the CPU executor
    tmc::ex_cpu my_executor;
    my_executor.set_thread_count(8).set_priority_count(1).init();

    // main_task() is your application's top-level task, defined elsewhere
    tmc::post_waitable(my_executor, main_task()).wait();
  }

.. _ex_cpu_hwloc:

TMC_USE_HWLOC additional features
-------------------------------------------
When :literal_ref:`TMC_USE_HWLOC<tmc_use_hwloc>` is enabled, :literal_ref:`tmc::ex_cpu<ex_cpu>` gains the following features:

Hardware-Optimized Defaults
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
By default, :literal_ref:`tmc::ex_cpu<ex_cpu>` will create 1 thread per physical core reported by hwloc. These threads are organized
into work-stealing groups, one for each set of cores of the same :ref:`CPU kind<cpu_kinds>` that share a cache. Each thread is pinned so that it may run on any core
in its cache group, but cannot be migrated to a different cache. Threads will prefer to steal work from other
threads in their cache group before checking other groups for work.

P-cores and E-cores will be included, but LP E-cores (as on Intel Meteor Lake laptops) will be excluded.

.. code-block:: cpp

  // Automagically configures itself for optimal performance
  tmc::cpu_executor().init();

Thread Occupancy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Call ``fill_thread_occupancy()`` to automatically fill each :ref:`CPU kind<cpu_kinds>` to its SMT level. This handles systems with differing SMT levels:

* AMD/Intel homogeneous: 2.0 for all cores
* Intel hybrid: 2.0 for P-cores, 1.0 for E-cores
* Apple M-series: 1.0 for all cores

.. code-block:: cpp

  tmc::cpu_executor()
    .fill_thread_occupancy()
    .init();

Or manually control how many threads are created per physical core:

.. code-block:: cpp

  tmc::cpu_executor()
    // 1.0 = 1 thread per core (default)
    // 2.0 = full SMT (2 threads per core on most consumer CPUs)
    .set_thread_occupancy(2.0f, tmc::topology::cpu_kind::PERFORMANCE)
    .set_thread_occupancy(1.0f, tmc::topology::cpu_kind::EFFICIENCY1)
    .init();
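
As a rough illustration of how occupancy translates into a thread count, the sketch below multiplies the physical core count of each CPU kind by its occupancy.
It does not use TooManyCooks: ``core_kind_sketch`` and ``total_threads_sketch`` are hypothetical names, the core counts are made up, and truncating fractional results is an assumption rather than documented behavior.

.. code-block:: cpp

  #include <cassert>
  #include <cstddef>
  #include <vector>

  // Hypothetical description of one CPU kind; not a TooManyCooks type.
  struct core_kind_sketch {
    std::size_t physical_cores; // number of physical cores of this kind
    float occupancy;            // requested threads per physical core
  };

  // Sums cores * occupancy over all CPU kinds.
  // Assumption: fractional thread counts are truncated toward zero.
  inline std::size_t
  total_threads_sketch(const std::vector<core_kind_sketch>& kinds) {
    std::size_t total = 0;
    for (const auto& k : kinds) {
      assert(k.occupancy > 0.0f);
      total += static_cast<std::size_t>(
        static_cast<float>(k.physical_cores) * k.occupancy
      );
    }
    return total;
  }

For a hypothetical hybrid CPU with 8 P-cores at occupancy 2.0 and 16 E-cores at occupancy 1.0, ``total_threads_sketch({{8, 2.0f}, {16, 1.0f}})`` gives 32.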

CPU Partitioning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Create separate executors for different cache groups or NUMA nodes:

.. code-block:: cpp

  auto topo = tmc::topology::query();

  // Construct an independent ex_cpu for each NUMA node
  std::vector<tmc::ex_cpu> exCpus(topo.numa_count());
  for (size_t i = 0; i < topo.numa_count(); ++i) {
    tmc::topology::topology_filter f{};
    f.set_numa_indexes({i});
    exCpus[i].add_partition(f).init();
  }

Hybrid Work Steering
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When using a machine with multiple CPU kinds (P- and E-cores), take care to direct
latency-sensitive work to P-cores, while using E-cores for high-parallelism jobs or background tasks.
There are several ways to do this in TooManyCooks.

First, set up your filters:

.. code-block:: cpp

  tmc::topology::topology_filter p_cores;
  p_cores.set_cpu_kinds(tmc::topology::cpu_kind::PERFORMANCE);
  
  tmc::topology::topology_filter e_cores;
  e_cores.set_cpu_kinds(tmc::topology::cpu_kind::EFFICIENCY1);

Option 1: Create independent executors for the different CPU kinds:

.. code-block:: cpp

  tmc::ex_cpu ex_p_core;
  ex_p_core.add_partition(p_cores).init();

  tmc::ex_cpu ex_e_core;
  ex_e_core.add_partition(e_cores).init();

Option 2: Create a single executor with independent priority partitions:

.. code-block:: cpp

  // P-cores handle high (priority 0) work
  // E-cores handle low (priority 1) work
  // They exist in the same executor but no work stealing can occur between core types
  tmc::cpu_executor()
    .set_priority_count(2)
    .add_partition(p_cores, 0, 1)
    .add_partition(e_cores, 1, 2)
    .init();

Option 3: Create a single executor with overlapping priority partitions:

.. code-block:: cpp

  // P-cores handle high (priority 0) and medium (priority 1) work
  // E-cores handle medium (priority 1) and low (priority 2) work
  // Work stealing between core types can happen for priority 1 work
  tmc::cpu_executor()
    .set_priority_count(3)
    .add_partition(p_cores, 0, 2)
    .add_partition(e_cores, 1, 3)
    .init();

For a complete implementation, including fallback behavior for homogeneous processors, see the `hwloc_hybrid_executor example <https://github.com/tzcnt/tmc-examples/blob/main/examples/hwloc/hybrid_executor.cpp>`_.
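
Judging by the comments in Options 2 and 3, the two priority arguments to ``add_partition()`` appear to form a half-open range: the first priority served, and one past the last.
The sketch below models that reading without using TooManyCooks. ``partition_sketch`` and ``partitions_serving`` are hypothetical names, and the half-open interpretation is an inference from the examples above, not a statement of the library's implementation.

.. code-block:: cpp

  #include <cstddef>
  #include <string>
  #include <vector>

  // Hypothetical stand-in for one add_partition() call; not a TooManyCooks type.
  struct partition_sketch {
    std::string name;
    std::size_t priority_begin; // first priority served (inclusive)
    std::size_t priority_end;   // one past the last priority served (exclusive)
  };

  // Returns the names of all partitions whose range contains `priority`.
  inline std::vector<std::string> partitions_serving(
    const std::vector<partition_sketch>& parts, std::size_t priority
  ) {
    std::vector<std::string> names;
    for (const auto& p : parts) {
      if (priority >= p.priority_begin && priority < p.priority_end) {
        names.push_back(p.name);
      }
    }
    return names;
  }

With the Option 3 ranges, ``{"p_cores", 0, 2}`` and ``{"e_cores", 1, 3}``, only priority 1 is served by both partitions, which matches the comment that work stealing between core types can happen for priority 1 work.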

...and more
^^^^^^^^^^^^^^^^^^^
The following additional functions become available with :literal_ref:`TMC_USE_HWLOC<tmc_use_hwloc>`:

* :cpp:func:`tmc::ex_cpu::set_thread_pinning_level()` - controls how tightly threads are pinned to cores
* :cpp:func:`tmc::ex_cpu::set_thread_packing_strategy()` - controls how threads are allocated when the number of requested threads is less than the number of cores
* :cpp:func:`void tmc::ex_cpu::set_thread_init_hook(std::function\<void(tmc::topology::thread_info)\>)` - overload that receives info about this thread's group and CPU kind
* :cpp:func:`void tmc::ex_cpu::set_thread_teardown_hook(std::function\<void(tmc::topology::thread_info)\>)` - overload that receives info about this thread's group and CPU kind

API Reference
-----------------------------------------------------------------------------------
.. doxygenfunction:: tmc::cpu_executor

.. doxygenclass:: tmc::ex_cpu
   :members:

Supporting Data Types
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. doxygenenum:: tmc::topology::thread_pinning_level

.. doxygenenum:: tmc::topology::thread_packing_strategy

.. doxygenstruct:: tmc::topology::thread_info
   :members:
