`--disable-topology` currently defaults to `false` (topology enabled). Change
this so that topology awareness is enabled by default on hardware that may
benefit from it (multiple NUMA nodes or LLCs) and disabled on hardware that
does not benefit from it.
This is a slightly noisy change as we have to move ownership of the newly
mutable layer specs into the `Scheduler` object (previously they were a
borrow). We don't have a `Topology` object to make the default decision from
until `Scheduler::init`, which I think is because of the possibility of
hotplug. We therefore have to clone the `Vec<LayerSpec>` each time, as it is
potentially mutable.
Test plan:
- CI. Updated to be explicit about topology in both cases.
Single NUMA multi-LLC machine:
```
$ scx_layered --run-example
...
13:34:01 [INFO] Topology awareness not specified, selecting enabled based on
hardware
...
$ scx_layered --run-example --disable-topology=true
...
13:33:41 [INFO] Disabling topology awareness
...
$ scx_layered --run-example -t
...
13:33:15 [INFO] Disabling topology awareness
...
$ scx_layered --run-example --disable-topology=false
# none of the above messages present
```
Single NUMA single LLC machine:
```
$ scx_layered --run-example
15:33:10 [INFO] Topology awareness not specified, selecting disabled based on
hardware
```
In the WAKE_SYNC path, if L3 cache awareness is disabled (--disable-l3)
we may hit the following error:
Error: EXIT: scx_bpf_error (CPU L3 cpumask not initialized)
Fix this by setting the L3 cpumask to the whole primary domain if L3
cache awareness is disabled.
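A minimal sketch of the shape of the fix, assuming a per-CPU context that
holds a `bpf_cpumask` for the L3 domain (the field and helper names here are
illustrative, not the actual scx_bpfland code):
```c
/*
 * Illustrative only: when L3 awareness is disabled, reuse the whole
 * primary domain as the per-CPU L3 cpumask, so the WAKE_SYNC path never
 * touches an uninitialized mask.
 */
static void init_l3_cpumask(struct cpu_ctx *cctx,
			    const struct cpumask *l3_mask,
			    const struct cpumask *primary_mask,
			    bool l3_enabled)
{
	struct bpf_cpumask *mask = cctx->l3_cpumask;

	if (!mask)
		return;

	/* With --disable-l3, treat the whole primary domain as one L3. */
	bpf_cpumask_copy(mask, l3_enabled ? l3_mask : primary_mask);
}
```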
Tested-by: Eric Naim <dnaim@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Refactor the topology preemption logic so that the non-topology-aware code
is contained in a separate function. This should make maintaining the
non-topology-aware code path far easier.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Rename the `load_adj` statistic to `load_frac_adj`, which is a more
accurate representation of what the statistic is calculating. The
statistic is a fractional representation of the load of a layer adjusted
for infeasible weights.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Refactor layered_dispatch into two functions: layered_dispatch_no_topo and
layered_dispatch. layered_dispatch will delegate to layered_dispatch_no_topo in
the disable_topology case.
Although the dead code doesn't run when loaded by BPF, because the global
constant bool gates it out, it makes the functions really hard to parse as a
human. As they diverge more and more it makes sense to split them into
separate, manageable functions.
This is basically a mechanical change. I duplicated the existing function,
replaced all `disable_topology` with true in `no_topo` and false in the
existing function, then removed all branches which can't be hit.
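The resulting shape is roughly the following (a sketch only; the real
functions carry the full dispatch logic):
```c
/*
 * Sketch: the topology-unaware logic lives in its own function and the
 * main dispatch callback delegates to it when topology is disabled.
 */
static void layered_dispatch_no_topo(s32 cpu, struct task_struct *prev)
{
	/* ... flat, single-LLC dispatch path ... */
}

void BPF_STRUCT_OPS(layered_dispatch, s32 cpu, struct task_struct *prev)
{
	if (disable_topology) {
		layered_dispatch_no_topo(cpu, prev);
		return;
	}

	/* ... topology-aware dispatch path ... */
}
```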
Test plan:
- Runs on my dev box (6.9.0 fbkernel) with `scx_layered --run-example -n`.
- As above with `-t`.
- CI.
Clang is correctly warning that we use various uninitialised variables. Clean
these up so real errors are easier to read.
The largest change here is to the non-topological layered_dispatch. The
matching_dsq logic seems to be incorrect: it checks whether an uninitialised
variable is 0, sets it if so, then only uses the variable if the value is 0.
I have changed this to default to -1, and to use the value only once it is no
longer -1.
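The pattern of the fix, in the abstract (`matching_dsq` is from the
description above; the iteration and the predicate are purely illustrative):
```c
/*
 * Sketch: default the DSQ id to -1 instead of relying on an
 * uninitialised value, and only consume from it once something has
 * actually set it.
 */
s64 matching_dsq = -1;
u32 idx;

bpf_for(idx, 0, nr_layers) {
	if (!dsq_matches(idx))	/* hypothetical predicate */
		continue;
	matching_dsq = idx;
	break;
}

if (matching_dsq >= 0)
	scx_bpf_consume(matching_dsq);
```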
Since per-CPU kthreads may show an inconsistent prev_cpu and/or cpumask,
dispatch them directly to the local DSQ and allow them to preempt the
currently running task.
This helps prevent per-CPU kthread stalls and also helps prioritize them, as
they are usually important for system performance and responsiveness.
Moreover, change the behavior of --local-kthreads to prioritize all
kthreads when this option is used.
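The core of the idea looks roughly like the following (a sketch assuming the
`scx_bpf_dispatch()` kfunc naming and a `local_kthreads` BPF-side variable
mirroring the --local-kthreads option; the real enqueue path carries more
logic):
```c
/*
 * Sketch: per-CPU kthreads (and all kthreads when --local-kthreads is
 * set) go straight to the local DSQ and may preempt the running task.
 */
void BPF_STRUCT_OPS(bpfland_enqueue, struct task_struct *p, u64 enq_flags)
{
	if ((p->flags & PF_KTHREAD) &&
	    (local_kthreads || p->nr_cpus_allowed == 1)) {
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL,
				 enq_flags | SCX_ENQ_PREEMPT);
		return;
	}

	/* ... regular enqueue path ... */
}
```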
This addresses issue #728.
NOTE: ideally we may want to fix this in the kernel by making sure to
always expose a consistent prev_cpu and cpumask also for kthreads, but
at the moment this change helps prevent some annoying stalls and,
performance-wise, it doesn't seem to introduce any regression. In fact,
the usual gaming/fps benchmarks show even a slight improvement in
responsiveness with this change applied.
Thanks to YUBY from the CachyOS community for all the extremely valuable
help with the intensive stress tests.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Add doc comment to `CpuPool` as a quick reference for each member.
Most importantly, differentiate "cpu" and "core", as logical core and
physical core, respectively.
Signed-off-by: Ming Yang <minos.future@gmail.com>
When hotplugging CPUs in rapid succession, scx_rusty would crash with:
```
scx_bpf_error (Failed to lookup dom[4294967295]
```
The root cause is if the scheduler is restarted fast enough, a task
on a previously hotplugged CPU may not have moved off that CPU yet.
Thus, the CPU -> domain map would contain an invalid domain (u32::max)
and we would fail to lookup the domain correctly in rusty_select_cpu
for prev_cpu.
To fix this, if the CPU is offline, we do not try to allocate within the
same NUMA node (assuming hotplug is a rare operation) beyond the domestic
domain. Instead we use greedy allocation: first idle, then busy, then any
CPU.
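A sketch of the fallback in rusty_select_cpu (the real code works at the
domain level, and the `online_cpumask` name is an assumption):
```c
/*
 * Sketch: if prev_cpu was hotplugged out, its domain id may be invalid
 * (u32::MAX), so skip the NUMA-local search and fall back to a greedy
 * pick: an idle CPU first, then any allowed CPU.
 */
if (!bpf_cpumask_test_cpu(prev_cpu, online_cpumask)) {
	s32 cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);

	if (cpu < 0)
		cpu = scx_bpf_pick_any_cpu(p->cpus_ptr, 0);
	return cpu;
}
```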
Update the idle topology selection order. The current logic is:
core architecture (big/little) -> LLC -> NUMA -> Machine
It's probably better to try to keep cache lines clean and do:
LLC -> core architecture (big/little) -> NUMA -> Machine
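As a sketch (the helpers and masks below are illustrative, not the actual
scx_layered growth code), the new order tries the shared-LLC mask first and
widens out from there:
```c
/* Sketch: widen the idle search from LLC to core type, node, machine. */
static s32 pick_idle_ordered(const struct cpumask *llc_mask,
			     const struct cpumask *big_mask,
			     const struct cpumask *node_mask,
			     const struct cpumask *machine_mask)
{
	s32 cpu;

	if ((cpu = scx_bpf_pick_idle_cpu(llc_mask, 0)) >= 0)
		return cpu;
	if ((cpu = scx_bpf_pick_idle_cpu(big_mask, 0)) >= 0)
		return cpu;
	if ((cpu = scx_bpf_pick_idle_cpu(node_mask, 0)) >= 0)
		return cpu;
	return scx_bpf_pick_idle_cpu(machine_mask, 0);
}
```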
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Improve the performance of the non-topology-aware paths by skipping some map
lookups and unnecessary initializations.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add support for layer configuration for idle CPU selection. This allows
layers to choose whether or not to restrict idle CPU selection to SMT
idle CPUs.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
In the non-topology-aware code the idle SMT mask is used for finding idle
CPUs. Update topology-aware idle selection to also use the idle SMT mask. In
certain benchmarks this can improve performance.
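For reference, the idle SMT mask can be consulted as in the sketch below
(not the actual scx_layered selection code): a CPU only counts as a
preferred pick when its whole core is idle.
```c
/* Sketch: report whether all SMT siblings of @cpu are currently idle. */
static bool core_fully_idle(s32 cpu)
{
	const struct cpumask *idle_smtmask = scx_bpf_get_idle_smtmask();
	bool idle = bpf_cpumask_test_cpu(cpu, idle_smtmask);

	scx_bpf_put_idle_cpumask(idle_smtmask);
	return idle;
}
```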
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add big cpumask to scx_layered and prefer selecting big idle cores when
using the BigLittle growth algo.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
In lowlatency mode (option --lowlatency) tasks are ordered using a
deadline that is evaluated as the vruntime minus a certain "bonus",
determined as a function of the max time slice and the average amount of
voluntary context switches, to amplify the priority boost of the tasks
that are voluntarily releasing the CPU (which are typically
interactive).
However, this method can be extremely unfair in some cases: tasks with
short bursts of voluntary context switches may receive a huge priority
boost, making the rest of the system almost unresponsive (see massive
hackbench stress tests for example).
To prevent this, rework the task's deadline logic to use the vruntime and
a "deadline component" that is a function of the average used time
slice, scaled using a dynamic task priority (evaluated from the static
task priority and its average amount of voluntary context switches).
This logic seems to prevent excessive prioritization of tasks performing
short intensive bursts of voluntary context switches.
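In rough terms (the field names and the scaling below are illustrative, not
the exact scx_bpfland implementation), the new ordering behaves like:
```c
/*
 * Sketch: the deadline grows with the task's average used time slice
 * and shrinks as its dynamic priority (static weight boosted by the
 * average number of voluntary context switches) increases. No unbounded
 * nvcsw-based bonus is subtracted from the vruntime anymore.
 */
static u64 task_deadline(struct task_struct *p, struct task_ctx *tctx)
{
	/* Dynamic priority: static weight plus the voluntary switch rate. */
	u64 lat_weight = p->scx.weight + tctx->avg_nvcsw;

	/* Deadline component: average used slice scaled by dynamic priority. */
	u64 deadline_comp = tctx->avg_runtime * 100 / lat_weight;

	return tctx->vruntime + deadline_comp;
}
```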
This also makes lowlatency mode in scx_bpfland (somewhat) more similar to
the deadline logic used by scx_rusty.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>