.. _build_flags:

Build-Time Options
----------------------------------------
TooManyCooks will work out-of-the-box without any configuration.
However, several build-time options are recommended for best performance.
They mostly take the form of a preprocessor definition that should be defined globally for your application.
The best place to do this is in your build script (by passing ``-DTMC_FLAG_NAME`` to the compiler).
For an example of a build script that contains all of these configurations,
see the `tmc-examples CMakeLists.txt <https://github.com/tzcnt/tmc-examples/blob/main/CMakeLists.txt>`_.

These options are listed roughly in descending order of performance impact,
with the most important options first.
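
Because these definitions must be applied globally, it can be useful to confirm that they actually reached the compiler.
The following is a minimal sketch and not part of TMC: ``build_flags_summary`` is a hypothetical helper, and only the macro names are taken from this page.

```cpp
#include <string>

// Hypothetical helper (not part of TMC): lists which of the options described
// on this page were defined when this translation unit was compiled.
inline std::string build_flags_summary() {
  std::string out;
#ifdef TMC_USE_HWLOC
  out += "TMC_USE_HWLOC ";
#endif
#ifdef TMC_PRIORITY_COUNT
  out += "TMC_PRIORITY_COUNT ";
#endif
#ifdef TMC_TRIVIAL_TASK
  out += "TMC_TRIVIAL_TASK ";
#endif
#ifdef TMC_MORE_THREADS
  out += "TMC_MORE_THREADS ";
#endif
#ifdef TMC_WORK_ITEM
  out += "TMC_WORK_ITEM ";
#endif
  // No option defined: the library runs with its defaults.
  return out.empty() ? std::string("(defaults)") : out;
}
```

Printing the result once at startup, after compiling with for example ``-DTMC_USE_HWLOC -DTMC_PRIORITY_COUNT=1``, shows whether the build script passed the definitions through to every target.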

.. rubric:: Recommended Build-Time Options
   :heading-level: 1

Link to tcmalloc / mimalloc / jemalloc
----------------------------------------
Since each new coroutine requires an allocation, coroutines are sensitive to allocator performance.
Any of tcmalloc, mimalloc, or jemalloc provides greatly superior performance to the default system allocator.
For synthetic benchmarks on my machines, tcmalloc appears to perform the best, but you should benchmark in your own application.

This doesn't require a TMC compilation flag; you only need to link your application to one of these libraries
to replace your global allocator.

.. _tmc_use_hwloc:

TMC_USE_HWLOC
----------------------------------------
Enable TMC to use the `Portable Hardware Locality (hwloc) library <https://www.open-mpi.org/projects/hwloc/>`_
to optimize thread layout and work-stealing groups of :literal_ref:`tmc::ex_cpu<ex_cpu>` according to your processor architecture. This will yield noticeable
improvements on systems with non-uniform cache architecture, such as modern AMD Ryzen/Epyc or Intel Hybrid Core architectures. It also
lets you query information about the CPU topology and gives you fine-grained control over thread affinity.
See :ref:`CPU Topology & Thread Configuration<topology>` for more information.

In addition to defining ``TMC_USE_HWLOC``, you must also make ``<hwloc.h>`` available on your include path
and link ``libhwloc`` to your application.

.. rubric:: Default Behavior (off)
   :heading-level: 6

:literal_ref:`tmc::ex_cpu<ex_cpu>` will create threads according to ``std::thread::hardware_concurrency()``
and these threads will not be assigned to any particular core. All threads will be part of the same work-stealing group.

.. rubric:: TMC_USE_HWLOC (on)
   :heading-level: 6

:literal_ref:`tmc::ex_cpu<ex_cpu>` will create 1 thread per physical core reported by hwloc. These threads will be organized
into work-stealing groups, with one group for each set of cores that share an L3 cache. Each thread will be pinned to its L3 cache group
(not to a specific core, but to any core in that group). Threads will prefer to steal work from other threads in their
L3 cache group before checking other groups for work.
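
The group-first stealing preference described above can be sketched in a few lines.
This is an illustration of the idea only, not TMC's implementation: ``steal_order`` and the ``group_of`` table are hypothetical names, and the real executor derives its groups from hwloc.

```cpp
#include <cstddef>
#include <vector>

// Illustrative only; not TMC's implementation.
// group_of[i] is the (hypothetical) L3 cache group id of thread i.
// Returns the order in which thread `self` would check other threads for
// work: threads in its own group first, then threads in all other groups.
inline std::vector<std::size_t>
steal_order(std::size_t self, const std::vector<int>& group_of) {
  std::vector<std::size_t> order;
  // Pass 1: neighbors that share the same L3 cache group.
  for (std::size_t i = 0; i < group_of.size(); ++i) {
    if (i != self && group_of[i] == group_of[self]) {
      order.push_back(i);
    }
  }
  // Pass 2: threads in every other group.
  for (std::size_t i = 0; i < group_of.size(); ++i) {
    if (group_of[i] != group_of[self]) {
      order.push_back(i);
    }
  }
  return order;
}
```

With four threads split into two groups (``{0, 0, 1, 1}``), thread 2 checks thread 3 before it looks at threads 0 and 1, which keeps stolen work close to the cache that already holds its data.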

.. rubric:: Other Build-Time Options
   :heading-level: 1

.. _priority_count:

TMC_PRIORITY_COUNT
----------------------------------------
Specifies the number of priority levels at compile time, which allows certain runtime checks to be optimized away.

.. rubric:: Default Behavior
   :heading-level: 6

The number of priority levels must be specified at runtime with :cpp:func:`tmc::ex_cpu::set_priority_count()`. If unspecified, the default priority count is 1.

.. rubric:: TMC_PRIORITY_COUNT=1
   :heading-level: 6

All operations will run at the same priority, and some priority checking and tracking logic is completely removed from the code.

.. rubric:: TMC_PRIORITY_COUNT=<N between 2 and 16>
   :heading-level: 6

Observable behavior will be the same as if you called :cpp:func:`tmc::ex_cpu::set_priority_count()` but the compiler is able to inline / unroll certain checks.
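
To illustrate why a compile-time count helps, here is a minimal sketch of a loop whose bound is either a constant or a runtime value.
It does not reflect TMC's internals: ``DEMO_PRIORITY_COUNT``, ``priority_count`` and ``first_nonempty`` are hypothetical names chosen for this example.

```cpp
#include <cstddef>

// DEMO_PRIORITY_COUNT is a hypothetical stand-in for a build flag; it is not
// a TMC macro.
#ifdef DEMO_PRIORITY_COUNT
// Known at compile time: the compiler can unroll the loop below, and for a
// count of 1 reduce it to a single comparison.
constexpr std::size_t priority_count() { return DEMO_PRIORITY_COUNT; }
#else
// Known only at runtime: the loop bound must be loaded on every call.
inline std::size_t runtime_priority_count = 1;
inline std::size_t priority_count() { return runtime_priority_count; }
#endif

// Returns the index of the first non-empty queue, or priority_count() if all
// queues are empty. queue_sizes must hold at least priority_count() entries.
inline std::size_t first_nonempty(const std::size_t* queue_sizes) {
  for (std::size_t p = 0; p < priority_count(); ++p) {
    if (queue_sizes[p] != 0) {
      return p;
    }
  }
  return priority_count();
}
```

The function returns the same result in both configurations; only the amount of work the optimizer can remove differs.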

.. _tmc_trivial_task:

TMC_TRIVIAL_TASK
----------------------------------------
By default, tasks are rvalue-only awaitables / linear types, like most other TMC awaitables; they must be passed around by move operations
and then consumed by awaiting them exactly once. This prevents accidental memory leaks or use-after-free issues. However, since
:literal_ref:`tmc::task<task>` is the size of a single pointer, the linear type checks (move constructor and destructor)
prevent optimizations that could occur if it were a trivial type (such as passing it in a register).
This flag lets you disable the linear type checks in order to improve performance.

Enabling this flag does not change any behavior within TMC; it simply replaces the copy and move constructors and the destructor of :literal_ref:`tmc::task<task>` with defaulted ones.
In doing so, it removes the guardrails that would alert you if you have violated
the linear type rules (and leaked a coroutine or awaited it twice). If you enable this flag, it is best to
do so only in final release builds. You should always build and run your application at least once without it.


.. rubric:: Default Behavior (off)
   :heading-level: 6

:literal_ref:`tmc::task<task>` is a move-only type. There is an assert in the destructor that checks that the task was executed to completion.

.. rubric:: TMC_TRIVIAL_TASK (on)
   :heading-level: 6

:literal_ref:`tmc::task<task>` is a trivial type. It can be freely copied and has no runtime checks.
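
The trade-off can be illustrated with two pointer-sized handle types.
This is a simplified sketch under stated assumptions, not the definition of ``tmc::task``: ``linear_handle`` and ``trivial_handle`` are hypothetical, and the destructor check here only verifies that the handle was consumed.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical move-only handle with a destructor check (not tmc::task).
struct linear_handle {
  void* frame = nullptr;
  explicit linear_handle(void* f) : frame(f) {}
  linear_handle(const linear_handle&) = delete;
  linear_handle(linear_handle&& other) noexcept
      : frame(std::exchange(other.frame, nullptr)) {}
  // Guardrail: fires in debug builds if the handle was dropped unconsumed.
  ~linear_handle() { assert(frame == nullptr && "handle was never consumed"); }
  // Hands the frame over exactly once and leaves the handle empty.
  void* consume() { return std::exchange(frame, nullptr); }
};

// Hypothetical trivial handle: freely copyable, no checks at all.
struct trivial_handle {
  void* frame;
};

// Both are one pointer wide, but only the trivial one qualifies for the
// cheaper calling conventions available to trivially copyable types.
static_assert(sizeof(linear_handle) == sizeof(trivial_handle));
static_assert(!std::is_copy_constructible_v<linear_handle>);
static_assert(!std::is_trivially_copyable_v<linear_handle>);
static_assert(std::is_trivially_copyable_v<trivial_handle>);
static_assert(std::is_trivially_destructible_v<trivial_handle>);
```

Copying a ``trivial_handle`` silently produces two owners of the same frame, which is exactly the kind of mistake the default configuration is designed to catch.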

.. _tmc_more_threads:

TMC_MORE_THREADS
----------------------------------------
By default, :literal_ref:`tmc::ex_cpu<ex_cpu>` uses a machine-word-sized bitmap to track thread states. This is highly efficient, but limits the number of threads.
Enabling this configuration allows an unlimited number of threads, but this requires a dynamic bitmap, which has a small negative performance impact.

.. rubric:: Default Behavior (off)
   :heading-level: 6

:literal_ref:`tmc::ex_cpu<ex_cpu>` is limited to a maximum of 64 threads per executor on a 64-bit system, or 32 threads on a 32-bit system.

.. rubric:: TMC_MORE_THREADS (on)
   :heading-level: 6

:literal_ref:`tmc::ex_cpu<ex_cpu>` can use an unlimited number of threads in a single ex_cpu.
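
The thread limit follows directly from storing one bit per thread in a single machine word.
The sketch below shows the general technique only; ``word_bitmap`` and ``max_tracked_threads`` are hypothetical names, and TMC's actual data structure may differ.

```cpp
#include <atomic>
#include <climits>
#include <cstddef>

// Illustrative only; this is not TMC's data structure.
// One bit per thread, packed into a single machine word.
constexpr std::size_t max_tracked_threads = sizeof(std::size_t) * CHAR_BIT;

struct word_bitmap {
  std::atomic<std::size_t> bits{0};

  // idx must be less than max_tracked_threads.
  void set(std::size_t idx) {
    bits.fetch_or(std::size_t{1} << idx, std::memory_order_acq_rel);
  }
  void clear(std::size_t idx) {
    bits.fetch_and(~(std::size_t{1} << idx), std::memory_order_acq_rel);
  }
  bool test(std::size_t idx) const {
    return ((bits.load(std::memory_order_acquire) >> idx) & 1u) != 0;
  }
  // Number of set bits, computed from one atomic snapshot of the word.
  std::size_t count() const {
    std::size_t word = bits.load(std::memory_order_acquire);
    std::size_t n = 0;
    for (; word != 0; word &= word - 1) {  // clears the lowest set bit
      ++n;
    }
    return n;
  }
};
```

Every operation is a single atomic instruction on one word, which is why this layout is fast and also why it cannot grow beyond the word size without switching to a dynamically sized structure.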

.. _work_item:

TMC_WORK_ITEM
----------------------------------------
Controls the type used to submit work items to TMC executors and store them in their internal work queues. This type alias is known as :literal:`tmc::work_item` and
is not directly exposed in most cases, as TMC public APIs are templated to transform inputs into this type internally.
Each of the types below can store either a coroutine or a functor, but their performance characteristics differ.

.. rubric:: Default Behavior (TMC_WORK_ITEM=CORO)
   :heading-level: 6

:literal:`tmc::work_item` is an alias for :literal:`std::coroutine_handle<>`. This type is 1 pointer in size.
Coroutines can be stored directly in it, but functors will be wrapped in a coroutine trampoline.
This option yields the best performance for coroutines and the worst performance for functors.

.. rubric:: TMC_WORK_ITEM=FUNCORO
   :heading-level: 6
   
:literal:`tmc::work_item` is an alias for :literal:`tmc::coro_functor`. This type is 2 pointers in size.
This is a custom type provided by this library that can directly store both coroutines and functors.
It provides excellent performance for both coroutines and functors.
It supports move-only functors, and has pointer/lvalue/rvalue reference constructor overloads that make the ownership semantics clear.
It does not support small-object optimization.

.. rubric:: TMC_WORK_ITEM=FUNC
   :heading-level: 6
   
:literal:`tmc::work_item` is an alias for :literal:`std::function<void()>`. This type is 4 pointers in size on most systems.
Both coroutines and functors can be stored directly in it.
This option yields the best performance for functors, provided the functor is small enough to benefit from small-object optimization, and the worst performance for coroutines.
This type doesn't support move-only functors and always makes a copy of your functor, which may block certain use cases.
Because this type requires its parameter to be copyable, it also requires you to define :ref:`TMC_TRIVIAL_TASK <tmc_trivial_task>`.
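
The copy requirement of ``std::function`` can be observed with standard C++ alone.
The following is a small sketch that does not use TMC: ``counting_functor``, ``move_only_functor`` and ``copies_made_by_std_function`` are hypothetical names introduced for this example.

```cpp
#include <functional>
#include <memory>
#include <type_traits>

// Hypothetical functor that counts how often it is copied.
struct counting_functor {
  int* copies;
  explicit counting_functor(int* c) : copies(c) {}
  counting_functor(const counting_functor& other) : copies(other.copies) {
    ++*copies;
  }
  counting_functor(counting_functor&& other) noexcept : copies(other.copies) {}
  void operator()() const {}
};

// Hypothetical move-only functor. std::function<void()> requires a
// copy-constructible target, so it cannot hold this type.
struct move_only_functor {
  std::unique_ptr<int> payload;
  void operator()() const {}
};
static_assert(!std::is_copy_constructible_v<move_only_functor>);

// Returns how many times the functor was copied while being stored in a
// std::function and while that std::function was itself copied.
inline int copies_made_by_std_function() {
  int copies = 0;
  counting_functor fn(&copies);
  std::function<void()> item = fn;         // stores a copy of fn
  std::function<void()> duplicate = item;  // copies the stored functor again
  duplicate();
  return copies;
}
```

If your functors own resources that are expensive or impossible to copy, this behavior is a reason to prefer one of the other two options.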

.. doxygenclass:: tmc::coro_functor
  :members:

Debug Options
-------------------------

TMC_DEBUG_TASK_ALLOC_COUNT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
See the :ref:`HALO<tmc_debug_task_alloc_count>` section.

TMC_DEBUG_THREAD_CREATION
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Prints information to stdout about thread groups, affinities, and work-stealing matrices at executor ``init()`` time.

TMC_NO_UNKNOWN_AWAITABLES
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
By default, TMC allows you to ``co_await`` any awaitable type from within a TMC coroutine, even if TMC doesn't know about it, by using an
``await_transform()`` wrapper that makes those awaitables affinity-aware. This carries a small performance cost for those awaitables.

Defining this macro disables that behavior, so that only known TMC awaitables can be directly ``co_await``-ed. Attempting to ``co_await`` an unknown awaitable will cause a compilation error.

This can be useful as a diagnostic tool to ensure that custom awaitables are properly integrated via a :literal_ref:`tmc::detail::awaitable_traits<external_awaitables>` specialization.

