With user-space scheduling we don't usually dispatch a task immediately
after selecting an idle CPU, so there is little benefit in trying to
optimize for the WAKE_SYNC scenario (a task waking up another task and
releasing the CPU) when picking an idle CPU.
Therefore, get rid of the WAKE_SYNC logic in select_cpu() and rely on
the user-space logic (which has access to the WAKE_SYNC information) to
handle this particular case.
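A minimal sketch of what select_cpu() reduces to without the WAKE_SYNC
fast path (illustrative code assuming the usual scx BPF boilerplate,
not the actual scx_rustland_core implementation):

```
s32 BPF_STRUCT_OPS(rustland_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	s32 cpu;

	/*
	 * No special handling of (wake_flags & SCX_WAKE_SYNC) here:
	 * the user-space scheduler receives the sync-wakeup
	 * information and can apply its own policy.
	 */
	cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);

	return cpu >= 0 ? cpu : prev_cpu;
}
```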
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Do not kick a CPU from rs_select_cpu() (called by the user-space
scheduler), since we may not immediately dispatch the task.
Instead, always try to wake up the task's assigned CPU after dispatching
to a global DSQ, ensuring it can be consumed immediately.
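A sketch of the idea (helper name and slice variable are illustrative):
dispatch to the global DSQ first, then kick the assigned CPU only if it
is idle:

```
/* Illustrative helper: queue @p globally, then wake its assigned CPU. */
static void dispatch_task(struct task_struct *p, s32 cpu, u64 slice_ns)
{
	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, slice_ns, 0);

	/* SCX_KICK_IDLE only wakes the CPU if it is actually idle. */
	scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
}
```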
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Prevent CPUs from going idle when the user-space scheduler has some
pending activities to complete.
Keeping the CPU alive allows tasks to be consumed from the user-space
scheduler more efficiently, preventing bubbles in the scheduling
pipeline.
To achieve this, trigger a CPU kick from ops.update_idle() and set a
flag in the CPU context to prevent it from going idle. Then keep kicking
the CPU from ops.dispatch() until the flag is cleared, which occurs when
no more tasks are pending or when the CPU exits idle as a task starts
running on it.
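A sketch of the keep-alive logic; cpu_ctx, try_lookup_cpu_ctx() and
usersched_has_pending_tasks() are illustrative names, not the actual
scx_rustland_core code:

```
struct cpu_ctx {
	bool prevent_idle;	/* don't let this CPU go idle */
};

void BPF_STRUCT_OPS(rustland_update_idle, s32 cpu, bool idle)
{
	struct cpu_ctx *cctx = try_lookup_cpu_ctx(cpu);

	if (!cctx)
		return;

	if (idle && usersched_has_pending_tasks()) {
		/* Flag the CPU and kick it out of idle. */
		cctx->prevent_idle = true;
		scx_bpf_kick_cpu(cpu, 0);
	} else {
		/* CPU exits idle: a task is starting to run on it. */
		cctx->prevent_idle = false;
	}
}

void BPF_STRUCT_OPS(rustland_dispatch, s32 cpu, struct task_struct *prev)
{
	struct cpu_ctx *cctx = try_lookup_cpu_ctx(cpu);

	/* ... consume tasks queued by the user-space scheduler ... */

	/* Keep kicking the CPU until the flag is cleared. */
	if (cctx && cctx->prevent_idle)
		scx_bpf_kick_cpu(cpu, 0);
}
```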
This fixes the performance regression introduced by the
put_prev_task_scx() behavior change in Linux 6.12 (see #788).
Link: https://lore.kernel.org/lkml/20241015111539.12136-1-andrea.righi@linux.dev/
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
User-space schedulers may still hit some stalls during CPU hotplug
events.
There is no reason to overcomplicate the code by trying to handle
hotplug events within the scx_rustland_core framework when we can simply
rely on a scheduler restart performed by the scx core.
This makes CPU hotplugging more reliable with scx_rustland_core-based
schedulers.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Assign an infinite time slice to the user-space scheduler itself, so
that it can completely drain all the pending tasks and voluntarily
release the CPU when it's done.
This achieves more consistent performance and also allows us to remove
the speculative user-space scheduler wakeup from ops.stopping().
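A minimal sketch (assumed helper name) of dispatching the user-space
scheduler task with an infinite slice:

```
/* Illustrative: let the scheduler task run until it yields on its own. */
static void dispatch_user_scheduler(struct task_struct *usersched)
{
	scx_bpf_dispatch(usersched, SCX_DSQ_GLOBAL, SCX_SLICE_INF, 0);
}
```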
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Provide additional task metrics to user-space schedulers via QueuedTask:
- nvcsw: total number of voluntary context switches
- slice: task time slice "budget" (from p->scx.slice)
- dsq_vtime: current task vtime (from p->scx.dsq_vtime)
In this way user-space schedulers can quickly access these metrics to
implement better scheduling policies.
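A sketch of how these fields might look on the BPF-side task descriptor
(struct name and the pre-existing fields are illustrative, not the
exact scx_rustland_core definition):

```
struct queued_task_ctx {
	s32 pid;
	/* ... existing fields ... */
	u64 nvcsw;	/* total number of voluntary context switches */
	u64 slice;	/* time slice budget, from p->scx.slice */
	u64 dsq_vtime;	/* current task vtime, from p->scx.dsq_vtime */
};
```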
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Pinned tasks should just be routed to a fallback DSQ. Since kthreads
are given a higher priority than non-kthreads, use two fallback DSQs.
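A sketch of the routing logic; the DSQ ids and the enqueue callback
body are illustrative, not the actual scx_layered code:

```
#define LO_FALLBACK_DSQ	0	/* pinned non-kthreads */
#define HI_FALLBACK_DSQ	1	/* pinned kthreads, consumed first */

void BPF_STRUCT_OPS(layered_enqueue, struct task_struct *p, u64 enq_flags)
{
	if (p->nr_cpus_allowed == 1) {	/* pinned task */
		u64 dsq_id = (p->flags & PF_KTHREAD) ?
			     HI_FALLBACK_DSQ : LO_FALLBACK_DSQ;

		scx_bpf_dispatch(p, dsq_id, SCX_SLICE_DFL, enq_flags);
		return;
	}

	/* ... normal layer placement ... */
}
```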
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
If we're not on the wakeup path, we may see enqueue() invoked without
select_cpu(), which will require an idle CPU lookup. To fix this,
refactor the idle-CPU lookup in select_cpu() so it can also be invoked
from enqueue().
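Roughly, the refactor looks like this (names and bodies illustrative):

```
/* Shared idle-CPU lookup, callable from both paths. */
static s32 pick_idle_cpu(struct task_struct *p, s32 prev_cpu)
{
	/* ... topology-aware idle search ... */
	return scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
}

s32 BPF_STRUCT_OPS(layered_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	s32 cpu = pick_idle_cpu(p, prev_cpu);

	return cpu >= 0 ? cpu : prev_cpu;
}

void BPF_STRUCT_OPS(layered_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* select_cpu() may have been skipped (non-wakeup path). */
	s32 cpu = pick_idle_cpu(p, scx_bpf_task_cpu(p));

	/* ... dispatch to @cpu or to a DSQ ... */
}
```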
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Add an integration test verifying that the `llcs` field on the layer
config works properly.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Merge the sched_switch ftrace helper scripts into a single python script
that prints the result to stdout.
This makes it possible to generate a perfetto-compatible trace by
running:
$ sudo ./scripts/sched_ftrace.py > sched.ftrace
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Add a bpftrace script that performs a topology-aware test: it asserts
that stress-ng processes are scheduled on NUMA node 0 only.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
vmlinux.h is not compatible across archs.
Handle this compatibility issue by:
* adding the arch info to the real vmlinux.h file name
* linking vmlinux.h to the target-arch real file at build time
* using the target-arch real file for scx_utils bindgen
Also refactor the clang-related logic into a new clang_info mod, which
is shared by bpf_builder.rs and builder.rs.
Signed-off-by: Ming Yang <minos.future@gmail.com>
Add a set of ftrace helper scripts for making perfetto compatible ftrace
scheduler profiles.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
u32 is not big enough to hold the sum of lat_cri in a period, so
sum_lat_cri (u32) overflowed, resulting in an incorrect avg_lat_cri.
Change the type from u32 to u64 to avoid the integer overflow. Note
that {sum/avg}_lat_cri is only for debugging, so it is irrelevant to
scheduling decisions.
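The fix, roughly (field names follow the commit; the surrounding struct
is a sketch, not the exact scx_lavd definition):

```
struct sys_stat {
	u64 sum_lat_cri;	/* was u32: wrapped when summing many
				 * 32-bit lat_cri values in one period */
	u32 avg_lat_cri;	/* sum_lat_cri / number of samples */
};
```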
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The downscaling is not necessary in calculating a task's virtual
deadline, because the virtual deadline represents only the relative
order of task scheduling. Hence, downscaling only introduces inaccuracy
caused by truncation.
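A tiny standalone example of the truncation error (illustrative
values): two distinct deadlines become indistinguishable after a
downscale, even though only their relative order matters:

```
#include <assert.h>
#include <stdint.h>

int main(void)
{
	/* vd_a should sort before vd_b. */
	uint64_t vd_a = 1000, vd_b = 1001;

	/* After a 3-bit downscale the ordering information is lost. */
	assert((vd_a >> 3) == (vd_b >> 3));	/* both become 125 */
	return 0;
}
```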
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Using per-CPU DSQs seems to introduce more issues than benefits
(potential stalls, etc.). Therefore, let's get rid of the per-CPU DSQs
and use SCX_DSQ_LOCAL for tasks directly dispatched to specific CPUs.
This change seems to also improve performance on 6.12 and it makes the
scheduler a lot more stable and consistent.
These issues will be investigated separately, using a dedicated
stress-test scheduler designed to exercise per-CPU DSQs.
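A sketch of the direct dispatch (helper name illustrative):

```
/* Dispatch @p straight to @cpu's local DSQ, no per-CPU DSQ needed. */
static void dispatch_on_cpu(struct task_struct *p, s32 cpu, u64 slice_ns)
{
	scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, slice_ns, 0);
}
```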
Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Tested-by: Eric Naim <dnaim@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Return more meaningful error codes from pick_idle_cpu(). No functional
change, just improved code readability.
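For example (the error codes here illustrate the idea and are not
necessarily the ones used):

```
static s32 pick_idle_cpu(struct task_struct *p, s32 prev_cpu)
{
	s32 cpu;

	if (!bpf_cpumask_test_cpu(prev_cpu, p->cpus_ptr))
		return -EINVAL;	/* previous CPU no longer usable */

	cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
	if (cpu < 0)
		return -EBUSY;	/* no idle CPU available right now */

	return cpu;
}
```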
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When a task exhausts its timeslice and no other tasks are ready to run,
we currently refill its timeslice automatically, but only if the current
CPU is a fully idle SMT core.
If we don’t handle the refill, the sched_ext core will default to
refilling using SCX_SLICE_DFL, which may not be optimal.
To ensure better control over the task’s timeslice, always refill it
when no other tasks are available to run.
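A sketch of the unconditional refill (slice_ns and the surrounding
dispatch logic are illustrative):

```
void BPF_STRUCT_OPS(bpfland_dispatch, s32 cpu, struct task_struct *prev)
{
	/* ... try to consume queued tasks first ... */

	/*
	 * Nothing else to run: refill @prev's slice ourselves, so the
	 * sched_ext core doesn't fall back to SCX_SLICE_DFL.
	 */
	if (prev && (prev->scx.flags & SCX_TASK_QUEUED))
		prev->scx.slice = slice_ns;
}
```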
Fixes: 6e24fcc ("scx_bpfland: keep tasks running on full-idle SMT cores")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Pick any random idle CPU when the previous CPU isn't valid anymore
according to the task's cpumask.
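Sketch of the fallback (names illustrative):

```
static s32 pick_cpu(struct task_struct *p, s32 prev_cpu)
{
	/* prev_cpu may have left the task's allowed cpumask. */
	if (!bpf_cpumask_test_cpu(prev_cpu, p->cpus_ptr))
		return scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);

	/* ... usual prev_cpu-first idle selection ... */
	return prev_cpu;
}
```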
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
`--disable-topology` currently defaults to `false` (topology awareness
enabled). Change this so that topology awareness is enabled by default
on hardware that may benefit from it (multiple NUMA nodes or LLCs) and
disabled on hardware that does not.
This is a slightly noisy change as we have to move ownership of the newly
mutable layer specs into the `Scheduler` object (previously they were a
borrow). We don't have a `Topology` object to make the default decision from
until `Scheduler::init`, and I think this is because of the possibility
of CPU hotplug. We therefore have to clone the `Vec<LayerSpec>` each
time, as it is potentially mutable.
Test plan:
- CI. Updated to be explicit about topology in both cases.
Single NUMA multi-LLC machine:
```
$ scx_layered --run-example
...
13:34:01 [INFO] Topology awareness not specified, selecting enabled based on
hardware
...
$ scx_layered --run-example --disable-topology=true
...
13:33:41 [INFO] Disabling topology awareness
...
$ scx_layered --run-example -t
...
13:33:15 [INFO] Disabling topology awareness
...
$ scx_layered --run-example --disable-topology=false
# none of the above messages present
```
Single NUMA single LLC machine:
```
$ scx_layered --run-example
15:33:10 [INFO] Topology awareness not specified, selecting disabled based on
hardware
```
Move the LayerConfig and its children from `main.rs` into `lib.rs`. This allows
other tooling, such as config managers or test executors, to modify layered
configs programmatically.
The end goal is to move everything in `layered` except for the argument parsing
into a `run_layered` function, but I haven't done it in this diff because it's
a larger change. This is a common pattern in Rust projects to do as little as
possible in `main.rs` for extensibility.
The only change here, other than visibility and code location, is the
signature of `CpuPool::alloc_cpus`. It previously relied on `&Layer`,
and this changes it to take the two elements of `Layer` it uses. This
allows `Layer` to stay confined to `main.rs` (for now) to prevent scope
creep in this PR.
This may be inconvenient in the short term for WIPs and anyone doing non-Cargo
builds (cough me), but having things split into more files should make
rebases/merges easier in the long run.
Test plan:
- `cargo build --release`
- CI.
The symbol __handle_mm_fault isn't available anymore in 6.12; rely on
handle_mm_fault, which is available on both 6.12 and older kernels.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>