If we're not on the wakeup path, we may see enqueue() invoked without
select_cpu(), in which case enqueue() needs its own idle CPU lookup.
To fix this, refactor the idle CPU lookup in select_cpu() so it can
also be invoked from enqueue().
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Add an integration test for testing that the `llcs` field on the layer
config works properly.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add a bpftrace script that does a topology aware test. The test script
runs a bpftrace script that asserts that stress-ng processes are
scheduled on NUMA node 0 only.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
u32 is not big enough to hold the sum of lat_cri in a period,
so sum_lat_cri (u32) overflowed, resulting in an incorrect
avg_lat_cri. Change the type from u32 to u64 to avoid the
integer overflow. Note that {sum/avg}_lat_cri is only for
debugging, so it is irrelevant in making scheduling decisions.
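The overflow can be reproduced in isolation. The following is a minimal
sketch, not the scx_lavd code: the helper names and the sample values are
hypothetical, and the accumulation loop only stands in for summing lat_cri
over one period.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: accumulate nr_samples copies of one lat_cri value
 * in a 32-bit sum and return the average.
 */
uint32_t avg_lat_cri_u32(uint32_t lat_cri, uint32_t nr_samples)
{
	uint32_t sum = 0;

	assert(nr_samples > 0);
	for (uint32_t i = 0; i < nr_samples; i++)
		sum += lat_cri;		/* wraps modulo 2^32 on overflow */
	return sum / nr_samples;
}

/* Same computation with a 64-bit sum, which cannot wrap for these inputs. */
uint64_t avg_lat_cri_u64(uint32_t lat_cri, uint32_t nr_samples)
{
	uint64_t sum = 0;

	assert(nr_samples > 0);
	for (uint32_t i = 0; i < nr_samples; i++)
		sum += lat_cri;
	return sum / nr_samples;
}
```

With lat_cri = 1000000 and 5000 samples the true sum is 5,000,000,000,
which exceeds the u32 maximum of 4,294,967,295. The 32-bit version wraps
to 705,032,704 and reports an average of 141006 instead of 1000000.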
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Downscaling is not necessary when calculating a task's virtual
deadline because the virtual deadline represents only the relative
order in task scheduling. Hence downscaling only introduces
inaccuracy caused by truncation.
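To illustrate the truncation effect, here is a small sketch. It assumes
the downscaling is a right shift, which the message above does not state;
the function names and the shift width are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical downscale: drop the low `shift` bits of a deadline. */
uint64_t downscale(uint64_t vdeadline, unsigned int shift)
{
	assert(shift < 64);
	return vdeadline >> shift;
}

/* A task with the smaller virtual deadline is ordered first. */
bool runs_before(uint64_t a, uint64_t b)
{
	return a < b;
}
```

Two deadlines of 1000 and 1023 are strictly ordered as-is, but both
become 0 after a 10-bit downscale, so the ordering between them is lost.
Since only the relative order matters, comparing the unscaled values
keeps that information at no extra cost.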
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Using per-CPU DSQs seems to introduce more issues than benefits
(potential stalls, etc.). Therefore, let's get rid of the per-CPU DSQs
and use SCX_DSQ_LOCAL for tasks directly dispatched to specific CPUs.
This change seems to also improve performance on 6.12 and it makes the
scheduler a lot more stable and consistent.
The issues will be investigated separately, providing a separate stress
test scheduler, designed to stress test per-CPU DSQs.
Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Tested-by: Eric Naim <dnaim@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Return more meaningful error codes from pick_idle_cpu(). No functional
change, just improved code readability.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When a task exhausts its timeslice and no other tasks are ready to run,
we automatically refill its timeslice, but only if the current CPU is a
fully idle SMT core.
If we don’t handle the refill, the sched_ext core will default to
refilling using SCX_SLICE_DFL, which may not be optimal.
To ensure better control over the task’s timeslice, always refill it
when no other tasks are available to run.
Fixes: 6e24fcc ("scx_bpfland: keep tasks running on full-idle SMT cores")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Pick any random idle CPU when the previous CPU isn't valid anymore
according to the task's cpumask.
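A rough model of this fallback, using plain 64-bit masks instead of BPF
cpumasks. The function name, the mask representation and the -1 return
value are all assumptions for illustration; the sketch also returns the
lowest-numbered candidate rather than an arbitrary one, and it ignores
whatever else the scheduler does when the previous CPU is still valid.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: bit N of `allowed` means the task may run on CPU N,
 * bit N of `idle` means CPU N is currently idle.
 */
int pick_cpu_model(int prev_cpu, uint64_t allowed, uint64_t idle)
{
	uint64_t candidates;

	assert(prev_cpu >= 0 && prev_cpu < 64);

	/* The previous CPU is still usable: keep it. */
	if (allowed & (1ULL << prev_cpu))
		return prev_cpu;

	/* Otherwise fall back to any idle CPU the task is allowed to use. */
	candidates = allowed & idle;
	for (int cpu = 0; cpu < 64; cpu++)
		if (candidates & (1ULL << cpu))
			return cpu;

	return -1;	/* no idle CPU in the task's cpumask */
}
```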
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Disable topology currently defaults to `false` (topology enabled...). Change
this so that topology is enabled by default on hardware that may benefit from
it (multiple NUMA nodes or LLCs) and disabled on hardware that does not benefit
from it.
This is a slightly noisy change as we have to move ownership of the newly
mutable layer specs into the `Scheduler` object (previously they were a
borrow). We don't have a `Topology` object to make the default decision from
until `Scheduler::init`, and I think this is because of the possibility of hot
plugs. We therefore have to clone the `Vec<LayerSpec>` each time as it is
potentially mutable.
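The default selection described above can be sketched as a small
decision function. This is an illustration only: scx_layered implements
this in Rust, and the enum, function name and parameters below are
hypothetical. It is written in C to stay independent of the Rust types.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tri-state for the --disable-topology option. */
enum topo_opt {
	TOPO_UNSPECIFIED,	/* option not given on the command line */
	TOPO_DISABLE_TRUE,	/* --disable-topology=true or -t */
	TOPO_DISABLE_FALSE,	/* --disable-topology=false */
};

/* Returns true when topology awareness should be disabled. */
bool resolve_disable_topology(enum topo_opt opt, unsigned int nr_nodes,
			      unsigned int nr_llcs)
{
	assert(nr_nodes >= 1 && nr_llcs >= 1);

	switch (opt) {
	case TOPO_DISABLE_TRUE:
		return true;
	case TOPO_DISABLE_FALSE:
		return false;
	default:
		/* Enable only where it may help: several NUMA nodes or LLCs. */
		return !(nr_nodes > 1 || nr_llcs > 1);
	}
}
```

An explicit setting always wins; the hardware is consulted only when the
option is left unspecified.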
Test plan:
- CI. Updated to be explicit about topology in both cases.
Single NUMA multi-LLC machine:
```
$ scx_layered --run-example
...
13:34:01 [INFO] Topology awareness not specified, selecting enabled based on
hardware
...
$ scx_layered --run-example --disable-topology=true
...
13:33:41 [INFO] Disabling topology awareness
...
$ scx_layered --run-example -t
...
13:33:15 [INFO] Disabling topology awareness
...
$ scx_layered --run-example --disable-topology=false
# none of the above messages present
```
Single NUMA single LLC machine:
```
$ scx_layered --run-example
15:33:10 [INFO] Topology awareness not specified, selecting disabled based on
hardware
```
Move the LayerConfig and its children from `main.rs` into `lib.rs`. This allows
other tooling, such as config managers or test executors, to modify layered
configs programmatically.
The end goal is to move everything in `layered` except for the argument parsing
into a `run_layered` function, but I haven't done it in this diff because it's
a larger change. This is a common pattern in Rust projects to do as little as
possible in `main.rs` for extensibility.
The only change here, other than publicity and where things are located, is the
signature of `CpuPool::alloc_cpus`. It previously relied on `&Layer`, and this
changes it to the two elements of `Layer` it uses. This allows `Layer` to stay
confined to `main.rs` (for now) to prevent scope creep in this PR.
This may be inconvenient in the short term for WIPs and anyone doing non-Cargo
builds (cough me), but having things split into more files should make
rebases/merges easier in the long run.
Test plan:
- `cargo build --release`
- CI.
When a task holds a lock, it should neither yield its time slice nor
be preempted. In this way, we can mitigate harmful preemption of lock
holders and reduce the total preemption count.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a lock holder exhausts its time slice, it will be re-enqueued
to a DSQ, waiting for scheduling while holding a lock. In this case,
prioritize its latency criticality proportionally, so a lock holder
is not stuck in a DSQ for a long time, improving system-wide
progress.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Trace the acquisition and release of blocking locks in the kernel and
futexes in user space. This is necessary to boost a lock holder
task in terms of latency and time slice. We do not boost shared
lock holders (e.g., read lock in rw_semaphore) since the kernel
already prioritizes readers over writers.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
In the WAKE_SYNC path, if L3 cache awareness is disabled (--disable-l3),
we may hit the following error:
Error: EXIT: scx_bpf_error (CPU L3 cpumask not initialized)
Fix this by setting the L3 cpumask to the whole primary domain if L3
cache awareness is disabled.
Tested-by: Eric Naim <dnaim@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Refactor topology preemption logic so the non-topology-aware code is
contained in a separate function. This should make maintaining the
non-topology-aware code path far easier.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Rename the `load_adj` statistic to `load_frac_adj`, which is a more
accurate representation of what the statistic is calculating. The
statistic is a fractional representation of the load of a layer adjusted
for infeasible weights.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Refactor layered_dispatch into two functions: layered_dispatch_no_topo and
layered_dispatch. layered_dispatch will delegate to layered_dispatch_no_topo in
the disable_topology case.
Although this code doesn't run when loaded by BPF due to the global constant
bool blocking it, it makes the functions really hard to parse as a human. As
they diverge more and more it makes sense to split them into separate
manageable functions.
This is basically a mechanical change. I duplicated the existing function,
replaced all `disable_topology` with true in `no_topo` and false in the
existing function, then removed all branches which can't be hit.
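The mechanical transformation can be pictured with a toy function. The
sketch below is not the layered_dispatch code: the DSQ numbering, the
NR_LAYERS constant and all function names are hypothetical and exist only
to show how folding the constant into each copy removes branches.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_LAYERS 4	/* hypothetical layer count */

/* Original shape: one function with the topology branch interleaved. */
int pick_dsq_combined(bool disable_topology, int layer, int llc)
{
	int dsq = layer;

	assert(layer >= 0 && layer < NR_LAYERS && llc >= 0);
	if (!disable_topology)
		dsq += llc * NR_LAYERS;
	return dsq;
}

/* Copy with disable_topology folded to true; the dead branch is gone. */
int pick_dsq_no_topo(int layer)
{
	assert(layer >= 0 && layer < NR_LAYERS);
	return layer;
}

/* Existing function with the constant folded to false, plus delegation. */
int pick_dsq(bool disable_topology, int layer, int llc)
{
	if (disable_topology)
		return pick_dsq_no_topo(layer);

	assert(layer >= 0 && layer < NR_LAYERS && llc >= 0);
	return layer + llc * NR_LAYERS;
}
```

Both shapes return the same value for every input, which is what makes
the split a purely mechanical change.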
Test plan:
- Runs on my dev box (6.9.0 fbkernel) with `scx_layered --run-example -n`.
- As above with `-t`.
- CI.