Update the idle topology selection order, the current logic is:
core architecture (big/little) -> LLC -> NUMA -> Machine
It's probably better to try to keep cache lines clean and do:
LLC -> core architecture (big/little) -> NUMA -> Machine
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Improve the performance on non topology aware paths by skipping some map
lookups and uneccessary initializations.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add support for layer configuration for idle CPU selection. This allows
layers to choose whether or not to restrict idle CPU selection to SMT
idle CPUs.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
In the non topology aware code the idle smt mask is used for finding
idle cpus. Update topology aware idle selection to also use the idle
smt mask. In certain benchmarks this can improve performance.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add big cpumask to scx_layered and prefer selecting big idle cores when
using the BigLittle growth algo.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
In lowlatency mode (option --lowlatency) tasks are ordered using a
deadline that is evaluated as the vruntime minus a certain "bonus",
determined in function of the max time slice and the average amount of
voluntary context switches, to amplify the priority boost of the tasks
that are voluntarily releasing the CPU (which are typically
interactive).
However, this method can be extremely unfair in some cases: tasks with
short bursts of voluntary context switches may receive a huge priority
boost, making the rest of the system almost unresponsive (see massive
hackbench stress tests for example).
To prevent this rework the task's deadline logic to use the vruntime and
a "deadline component" that is a function of the average used time
slice, scaled using a dynamic task priority (evaluated as the static
task priority and the its average amount of voluntary context switches).
This logic seems to prevent excessive prioritization of tasks performing
short intensive bursts of voluntary context switches.
It also makes lowlatency mode in scx_bpfland (somehow) more similar to
the deadline logic used by scx_rusty.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Add a flag to control DSQ iteration across layers by layer weight. This
helps prevent starvation by iterating over layers with the lowest weight
first.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add two new flags `layer_preempt_weight_disable` and
`layer_growth_weight_disable` to disabled preemption and layer growth
when weighted layer load exceeds the configured threshold.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add weights to layers and use the infeasible weights crate to properly
apply weights during contention to prevent starvation.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
`layer_core_order` provided multiple core growth implementation
Break it up into smaller function. Also, attach the method to
LayerGrowthAlgo. And `LayerCoreOrderGenerator` is added to make future
growth algo extension easy.
Signed-off-by: Ming Yang <minos.future@gmail.com>
As the main.bpf.c file grows, it gets hard to maintain.
So, split it into multiple logical files. There is no
functional change.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Remove a short-circuit in cpu_to_dom_id that will return domain id 0 for
any input.
This fixes a crash of scx_rusty when running with a single domain and
any CPU is offline.
Signed-off-by: Fredrik Lönnegren <fredrik@frelon.se>
"struct task_struct *p" isn't used within the function
"task_load_adj()". Delete the function parameter for cleaner code.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Use scx_utils::NR_CPU_IDS to iterate whole CPUs and separately count the
number of online CPUs to support CPU hotplug correctly.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
`#stat_doc` extends the document from stat desc property.
Add this attribute macro to the remaining Stats structs.
Signed-off-by: Ming Yang <minos.future@gmail.com>
task_avg_nvcsw() was incorrectly returning a bool instead of u64,
limiting the impact of the lowlatency boost.
Fix it by returning the proper type (u64).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When a task is the last one running on a CPU and still wants to
continue, allow it to run and replenish its time only if the used CPU is
part a fully idle SMT core.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
During ttwu, the kernel may decide to skip ->select_task_rq() (e.g.,
when only one CPU is allowed or migration is disabled). This causes to
call ops.enqueue() directly without having a chance to call
ops.select_cpu().
Therefore, introduce a new flag (select_cpu_done) in the local task
context to determine if ops.select_cpu() was bypassed and, in that case,
attempt to find an idle CPU directly from ops.enqueue().
In the future this information will be supplied by the kernel through a
special enqueue flag (SCX_ENQ_CPU_SELECTED) [1]. However, the custom
flag in the local task context ensures to reliably determine the same
information, even on older kernels where this flag is not available.
[1] https://lore.kernel.org/lkml/20240928003840.GA2717@maniforge/T
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Fix a bug in cache initialization where the first node would repeated
get all CPUs added to the mask. Refactor some consts to be more clear.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
When finding a victim candidate for preemption, a randomly chosen
candidate could be out of valid CPU range due to CPU offline, etc. In
this case, try another CPU randomly.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The doc of scx_layered `Opt` is out of sync.
Implement attribute macro #stat_doc to generate doc from the `desc`
property.
Apply #stat_doc to `LayerStats` and `SysStats in scx_layered.
Signed-off-by : Ming Yang <minos.future@gmail.com>