Using per-CPU DSQs seems to introduce more issues than benefits
(potential stalls, etc.). Therefore, let's get rid of the per-CPU DSQs
and use SCX_DSQ_LOCAL for tasks directly dispatched to specific CPUs.
This change also seems to improve performance on 6.12 and it makes the
scheduler a lot more stable and consistent.
The issues will be investigated separately, using a dedicated stress-test
scheduler designed to exercise per-CPU DSQs.
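As a rough illustration of the pattern (not the actual scx_bpfland code; the
callback and helper names below are just the sched_ext defaults), a task that
already has an idle CPU picked in ops.select_cpu() can be queued straight to
that CPU's local DSQ:
```
s32 BPF_STRUCT_OPS(sched_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu;

	/* Default idle-CPU selection provided by the sched_ext core */
	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
	if (is_idle)
		/*
		 * The task will run on @cpu: queue it directly to that
		 * CPU's local DSQ instead of a dedicated per-CPU DSQ.
		 */
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

	return cpu;
}
```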
Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Tested-by: Eric Naim <dnaim@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Return more meaningful error codes from pick_idle_cpu(). No functional
change, just improved code readability.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When a task exhausts its timeslice and no other tasks are ready to run,
we currently refill its timeslice automatically, but only if the current
CPU is a fully idle SMT core.
If we don’t handle the refill, the sched_ext core will default to
refilling using SCX_SLICE_DFL, which may not be optimal.
To ensure better control over the task’s timeslice, always refill it
when no other tasks are available to run.
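A minimal sketch of the dispatch-side behavior (illustrative names: SHARED_DSQ
and slice_ns stand in for the scheduler's shared queue and configured time
slice; this is not the exact scx_bpfland code):
```
void BPF_STRUCT_OPS(sched_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Consume the next queued task, if any */
	if (scx_bpf_consume(SHARED_DSQ))
		return;

	/*
	 * Nothing else to run: keep @prev going by refilling its slice
	 * ourselves, instead of letting the core fall back to SCX_SLICE_DFL.
	 */
	if (prev)
		prev->scx.slice = slice_ns;
}
```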
Fixes: 6e24fcc ("scx_bpfland: keep tasks running on full-idle SMT cores")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Pick any random idle CPU when the previous CPU isn't valid anymore
according to the task's cpumask.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
`disable_topology` currently defaults to `false` (i.e., topology awareness
enabled). Change this so that topology awareness is enabled by default on
hardware that may benefit from it (multiple NUMA nodes or LLCs) and disabled
on hardware that does not benefit from it.
This is a slightly noisy change as we have to move ownership of the newly
mutable layer specs into the `Scheduler` object (previously they were a
borrow). We don't have a `Topology` object to make the default decision from
until `Scheduler::init`, and I think this is because of the possibility of hot
plugs. We therefore have to clone the `Vec<LayerSpec>` each time as it is
potentially mutable.
Test plan:
- CI. Updated to be explicit about topology in both cases.
Single NUMA multi-LLC machine:
```
$ scx_layered --run-example
...
13:34:01 [INFO] Topology awareness not specified, selecting enabled based on
hardware
...
$ scx_layered --run-example --disable-topology=true
...
13:33:41 [INFO] Disabling topology awareness
...
$ scx_layered --run-example -t
...
13:33:15 [INFO] Disabling topology awareness
...
$ scx_layered --run-example --disable-topology=false
# none of the above messages present
```
Single NUMA single LLC machine:
```
$ scx_layered --run-example
15:33:10 [INFO] Topology awareness not specified, selecting disabled based on
hardware
```
Move the LayerConfig and its children from `main.rs` into `lib.rs`. This allows
other tooling, such as config managers or test executors, to modify layered
configs programmatically.
The end goal is to move everything in `layered` except for the argument parsing
into a `run_layered` function, but I haven't done it in this diff because it's
a larger change. This is a common pattern in Rust projects to do as little as
possible in `main.rs` for extensibility.
The only change here, other than item visibility (`pub`) and where things are
located, is the signature of `CpuPool::alloc_cpus`. It previously relied on
`&Layer`, and this
changes it to the two elements of `Layer` it uses. This allows `Layer` to stay
confined to `main.rs` (for now) to prevent scope creep in this PR.
This may be inconvenient in the short term for WIPs and anyone doing non-Cargo
builds (cough me), but having things split into more files should make
rebases/merges easier in the long run.
Test plan:
- `cargo build --release`
- CI.
When a task holds a lock, it should neither yield its time slice nor be
preempted. In this way, we can mitigate harmful preemption of lock
holders and reduce the total preemption count.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a lock holder exhausts its time slice, it will be re-enqueued
to a DSQ waiting for scheduling while still holding the lock. In this case,
boost its latency criticality proportionally, so a lock holder
does not get stuck in a DSQ for a long time, improving system-wide
progress.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Trace the acquisition and release of blocking locks in the kernel and of
futexes in user space. This is necessary to boost a lock holder
task in terms of latency and time slice. We do not boost shared
lock holders (e.g., read lock in rw_semaphore) since the kernel
already prioritizes readers over writers.
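A minimal sketch of how such tracking can look with fentry/fexit probes (the
attach points, get_task_ctx() helper and lock_holder flag are illustrative,
not the exact scx_lavd code):
```
SEC("fexit/mutex_lock")
int BPF_PROG(trace_mutex_lock, struct mutex *lock)
{
	/* Mark the current task as a lock holder */
	struct task_ctx *taskc = get_task_ctx(bpf_get_current_task_btf());

	if (taskc)
		taskc->lock_holder = true;
	return 0;
}

SEC("fentry/mutex_unlock")
int BPF_PROG(trace_mutex_unlock, struct mutex *lock)
{
	/* The lock is being released: clear the boost */
	struct task_ctx *taskc = get_task_ctx(bpf_get_current_task_btf());

	if (taskc)
		taskc->lock_holder = false;
	return 0;
}
```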
Signed-off-by: Changwoo Min <changwoo@igalia.com>
In the WAKE_SYNC path, if L3 cache awareness is disabled (--disable-l3)
we may hit the following error:
Error: EXIT: scx_bpf_error (CPU L3 cpumask not initialized)
Fix this by setting the L3 cpumask to the whole primary domain if L3
cache awareness is disabled.
Tested-by: Eric Naim <dnaim@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Refactor the topology preemption logic so the non-topology-aware code is
contained in a separate function. This should make maintaining that code
path far easier.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Rename the `load_adj` statistic to `load_frac_adj`, which is a more
accurate representation of what the statistic is calculating. The
statistic is a fractional representation of the load of a layer adjusted
for infeasible weights.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Refactor layered_dispatch into two functions: layered_dispatch_no_topo and
layered_dispatch. layered_dispatch will delegate to layered_dispatch_no_topo in
the disable_topology case.
Although this code doesn't run when loaded by BPF due to the global constant
bool blocking it, it makes the functions really hard to parse as a human. As
they diverge more and more it makes sense to split them into separate
manageable functions.
This is basically a mechanical change. I duplicated the existing function,
replaced all `disable_topology` with true in `no_topo` and false in the
existing function, then removed all branches which can't be hit.
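The resulting structure is roughly the following (simplified sketch):
```
void BPF_STRUCT_OPS(layered_dispatch, s32 cpu, struct task_struct *prev)
{
	/* The non-topology-aware path now lives in its own function */
	if (disable_topology) {
		layered_dispatch_no_topo(cpu, prev);
		return;
	}

	/* ... topology-aware dispatch path ... */
}
```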
Test plan:
- Runs on my dev box (6.9.0 fbkernel) with `scx_layered --run-example -n`.
- As above with `-t`.
- CI.
Clang is correctly warning that we use various uninitialised variables. Clean
these up so real errors are easier to read.
The largest change here is to the non-topological layered_dispatch. The
matching_dsq logic seems to be incorrect: it checks whether an uninitialised
variable is 0, sets it if so, then only uses the variable if the value is 0.
I have changed this to default to -1, then use the value if it is no longer -1.
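A sketch of the sentinel pattern used for the fix (identifiers such as
layer_dsq_id() are illustrative, not the exact scx_layered code):
```
s64 matching_dsq = -1;	/* previously left uninitialised */
int idx;

bpf_for(idx, 0, nr_layers) {
	/* ... */

	/* Remember only the first matching layer's DSQ */
	if (matching_dsq < 0)
		matching_dsq = layer_dsq_id(idx);
}

/* Only consume if a matching DSQ was actually found */
if (matching_dsq >= 0)
	scx_bpf_consume(matching_dsq);
```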
Since per-CPU kthreads may show an inconsistent prev_cpu and/or cpumask,
dispatch them directly to the local DSQ and allow them to preempt the
currently running task.
This prevents per-CPU kthread stalls and also helps to prioritize them,
as they are usually important for system performance and
responsiveness.
Moreover, change the behavior of --local-kthreads to prioritize all
kthreads when this option is used.
This addresses issue #728.
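A minimal sketch of the enqueue-side change (names are illustrative rather
than the exact scx_bpfland code):
```
void BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags)
{
	/*
	 * Per-CPU kthreads: bypass the regular queues, dispatch to the
	 * local DSQ and allow them to preempt the running task.
	 */
	if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1) {
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL,
				 enq_flags | SCX_ENQ_PREEMPT);
		return;
	}

	/* ... regular enqueue path ... */
}
```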
NOTE: ideally we may want to fix this in the kernel by making sure to
always expose a consistent prev_cpu and cpumask also for kthreads, but
at the moment this change prevents some annoying stalls and,
performance-wise, it doesn't seem to introduce any regression. In fact,
the usual gaming/fps benchmarks show even a slight improvement in
responsiveness with this change applied.
Thanks to YUBY from the CachyOS community for all the extremely valuable
help with the intensive stress tests.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Add doc comment to `CpuPool` as a quick reference for each member.
Most importantly, differentiate "cpu" and "core", as logical core and
physical core, respectively.
Signed-off-by: Ming Yang <minos.future@gmail.com>
When hotplugging CPUs in rapid succession, scx_rusty would crash with:
```
scx_bpf_error (Failed to lookup dom[4294967295]
```
The root cause is that, if the scheduler is restarted fast enough, a task
on a previously hotplugged CPU may not have moved off that CPU yet.
Thus, the CPU -> domain map would contain an invalid domain (u32::max)
and we would fail to lookup the domain correctly in rusty_select_cpu
for prev_cpu.
To fix this, if the CPU is offline we do not try to allocate within the
same NUMA node beyond the domestic domain (assuming hotplug is a rare
operation). Instead we fall back to greedy allocation: first idle, then
busy, then any CPU.
Update the idle topology selection order. The current logic is:
core architecture (big/little) -> LLC -> NUMA -> Machine
It's probably better to preserve cache locality and do:
LLC -> core architecture (big/little) -> NUMA -> Machine
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Improve the performance of non-topology-aware paths by skipping some map
lookups and unnecessary initializations.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add support for layer configuration for idle CPU selection. This allows
layers to choose whether or not to restrict idle CPU selection to SMT
idle CPUs.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
In the non topology aware code the idle smt mask is used for finding
idle cpus. Update topology aware idle selection to also use the idle
smt mask. In certain benchmarks this can improve performance.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add big cpumask to scx_layered and prefer selecting big idle cores when
using the BigLittle growth algo.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
In lowlatency mode (option --lowlatency) tasks are ordered using a
deadline that is evaluated as the vruntime minus a certain "bonus",
determined as a function of the max time slice and the average amount of
voluntary context switches, to amplify the priority boost of the tasks
that are voluntarily releasing the CPU (which are typically
interactive).
However, this method can be extremely unfair in some cases: tasks with
short bursts of voluntary context switches may receive a huge priority
boost, making the rest of the system almost unresponsive (see massive
hackbench stress tests for example).
To prevent this, rework the task's deadline logic to use the vruntime and
a "deadline component" that is a function of the average used time
slice, scaled using a dynamic task priority (evaluated from the static
task priority and its average amount of voluntary context switches).
This logic seems to prevent excessive prioritization of tasks performing
short intensive bursts of voluntary context switches.
It also makes lowlatency mode in scx_bpfland (somehow) more similar to
the deadline logic used by scx_rusty.
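In pseudo-C the new deadline evaluation looks roughly like this (a sketch of
the approach described above; helper names, fields and the scaling factor are
illustrative, not the exact scx_bpfland implementation):
```
static u64 task_deadline(struct task_struct *p, struct task_ctx *tctx)
{
	/*
	 * Dynamic priority: the static weight boosted by the average
	 * rate of voluntary context switches (capped to avoid runaway
	 * boosts from short bursts).
	 */
	u64 prio = p->scx.weight + MIN(tctx->avg_nvcsw, nvcsw_max);

	/*
	 * Deadline component: the average used time slice scaled down
	 * by the dynamic priority, so CPU-hungry tasks are pushed
	 * further out and frequently-yielding tasks less so.
	 */
	return tctx->vruntime + tctx->avg_used_slice * 100 / prio;
}
```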
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Add a flag to control DSQ iteration across layers by layer weight. This
helps prevent starvation by iterating over layers with the lowest weight
first.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add two new flags `layer_preempt_weight_disable` and
`layer_growth_weight_disable` to disable preemption and layer growth
when the weighted layer load exceeds the configured threshold.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add weights to layers and use the infeasible weights crate to properly
apply weights during contention to prevent starvation.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
`layer_core_order` provided multiple core growth implementations.
Break it up into smaller functions and attach the methods to
`LayerGrowthAlgo`. Also add `LayerCoreOrderGenerator` to make future
growth algo extensions easy.
Signed-off-by: Ming Yang <minos.future@gmail.com>
As the main.bpf.c file grows, it gets hard to maintain.
So, split it into multiple logical files. There is no
functional change.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Remove a short-circuit in cpu_to_dom_id that will return domain id 0 for
any input.
This fixes a crash of scx_rusty when running with a single domain and
any CPU is offline.
Signed-off-by: Fredrik Lönnegren <fredrik@frelon.se>
"struct task_struct *p" isn't used within the function
"task_load_adj()". Delete the function parameter for cleaner code.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Use scx_utils::NR_CPU_IDS to iterate over all CPUs and separately count the
number of online CPUs to support CPU hotplug correctly.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
`#stat_doc` generates the doc comment from the stat `desc` property.
Add this attribute macro to the remaining Stats structs.
Signed-off-by: Ming Yang <minos.future@gmail.com>
task_avg_nvcsw() was incorrectly returning a bool instead of u64,
limiting the impact of the lowlatency boost.
Fix it by returning the proper type (u64).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When a task is the last one running on a CPU and still wants to
continue, allow it to run and replenish its time slice only if the CPU it
is using is part of a fully idle SMT core.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
During ttwu, the kernel may decide to skip ->select_task_rq() (e.g.,
when only one CPU is allowed or migration is disabled). This causes
ops.enqueue() to be called directly, without having a chance to call
ops.select_cpu().
Therefore, introduce a new flag (select_cpu_done) in the local task
context to determine if ops.select_cpu() was bypassed and, in that case,
attempt to find an idle CPU directly from ops.enqueue().
In the future this information will be supplied by the kernel through a
special enqueue flag (SCX_ENQ_CPU_SELECTED) [1]. However, the custom
flag in the local task context ensures that the same information can be
reliably determined, even on older kernels where this flag is not available.
[1] https://lore.kernel.org/lkml/20240928003840.GA2717@maniforge/T
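A sketch of the fallback (select_cpu_done is the flag described above;
lookup_task_ctx() and pick_idle_cpu() are shown in simplified, illustrative
form):
```
s32 BPF_STRUCT_OPS(sched_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	struct task_ctx *tctx = lookup_task_ctx(p);
	s32 cpu = pick_idle_cpu(p, prev_cpu, wake_flags);

	/* Remember that ops.select_cpu() actually ran for this wakeup */
	if (tctx)
		tctx->select_cpu_done = true;

	return cpu >= 0 ? cpu : prev_cpu;
}

void BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags)
{
	struct task_ctx *tctx = lookup_task_ctx(p);

	/* ops.select_cpu() was bypassed: look for an idle CPU here */
	if (tctx && !tctx->select_cpu_done) {
		s32 cpu = pick_idle_cpu(p, scx_bpf_task_cpu(p), 0);

		if (cpu >= 0) {
			scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu,
					 SCX_SLICE_DFL, enq_flags);
			return;
		}
	}

	/* ... regular enqueue path (clearing select_cpu_done for later) ... */
}
```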
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Fix a bug in cache initialization where the first node would repeatedly
get all CPUs added to the mask. Refactor some consts to be more clear.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
When finding a victim candidate for preemption, a randomly chosen
candidate could be out of the valid CPU range due to CPU offlining, etc.
In this case, try another CPU randomly.
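A sketch of the retry loop (symbols such as NR_VICTIM_RETRIES and the domain
cpumask are illustrative, not the exact scx_lavd code):
```
static s32 pick_random_victim(const struct cpumask *cpdom_mask)
{
	int i;

	bpf_for(i, 0, NR_VICTIM_RETRIES) {
		s32 cpu = bpf_get_prandom_u32() % nr_cpu_ids;

		/* Retry if the candidate is offline or outside this domain */
		if (bpf_cpumask_test_cpu(cpu, cpdom_mask))
			return cpu;
	}

	return -ENOENT;
}
```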
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The doc of scx_layered `Opt` is out of sync.
Implement attribute macro #stat_doc to generate doc from the `desc`
property.
Apply #stat_doc to `LayerStats` and `SysStats` in scx_layered.
Signed-off-by: Ming Yang <minos.future@gmail.com>
We used the average performance criticality of tasks as a threshold to
determine the proper core type (big or little). However, if the big
cores' compute capacity is not half of the total compute capacity, such
an average-based determination becomes suboptimal. If too few tasks are
classified as performance-critical and requested to run on big
cores, the big cores would be underutilized, ending up stealing arbitrary
non-performance-critical tasks. That could result in performance
instability.
Hence, determine the threshold more accurately by considering (active)
big cores' compute capacity and the (approximated) distribution of
performance criticality of tasks.
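A sketch of the idea (data structures and names are illustrative; the real
code differs in details): pick the threshold so that the fraction of tasks
classified as performance-critical roughly matches the share of compute
capacity provided by the active big cores, using a small histogram as an
approximation of the distribution:
```
static u64 calc_thr_perf_cri(void)
{
	/* Share of total compute capacity provided by active big cores */
	u64 big_ratio = big_core_capacity * 1000 / total_capacity;
	u64 target = nr_tasks * big_ratio / 1000;
	u64 count = 0;
	int bucket;

	/*
	 * Walk the (approximated) performance-criticality histogram from
	 * the most critical bucket down, until ~target tasks are covered.
	 */
	for (bucket = PERF_CRI_BUCKETS - 1; bucket >= 0; bucket--) {
		count += perf_cri_histogram[bucket];
		if (count >= target)
			break;
	}
	if (bucket < 0)
		bucket = 0;

	return bucket_to_perf_cri(bucket);
}
```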
Signed-off-by: Changwoo Min <changwoo@igalia.com>
As a preparation to improve the performance criticality logic, we first
rename "avg_perf_cri" to "thr_perf_cri" since average is no longer the
threshold.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Add an enum for the layer growth algo to the bpf layer config. This will
be useful for implementing topology aware layer growth algorithms.
When selecting an idle CPU the current logic tries to keep tasks
local to LLC/NUMA node. However, for certain growth algorithms (ex:
RoundRobin) this is suboptimal. Adding the layer growth algorithm
will allow for different paths for CPU selection in the idle/preemption
paths.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
* Enable IDEs etc. to work on the bpf.c files.
This makes it so that clangd and IDE tools which use clangd
can work on the bpf.c code.
Nothing should actually be changed outside of that IDE/editor
environment; all the changes are ifdef'ed on LSP, which is set
in the added .clangd file.
* Move the intf include out of both sides of the ifdef toggle.
When preempting, restrict preemption to the current layer's cpumask. This
may reduce the amount of preemption, but should improve cache locality
of preempted tasks.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Remove cast_mask() function distributed throughout different schedulers
and add it in common.bpf.h so every scheduler can reference it once they
need to.
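The shared helper boils down to a simple type cast along these lines (sketch
of what common.bpf.h provides):
```
static __always_inline const struct cpumask *cast_mask(struct bpf_cpumask *mask)
{
	return (const struct cpumask *)mask;
}
```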
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
If a waker is more latency-critical than a wakee, the wakee inherits the
waker's latency criticality. This allows the wakee to consider the
context of who woke it up. For now, we limit such inheritance to one
hop and one schedule.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Use the cast_mask helper to clean up some of the bpf cpumask conversion
code for preemption.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add topology aware preemption that begins in the local LLC and attempts
to preempt from cpus nearest in the topology.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Previously, we searched for a victim among all CPUs, including remote
or non-compatible ones. Now we limit the victim search to the task's
compute domain.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Add core growth algos for Big/Little core support. The algos allow
layers to grow by preferring either big or little cores first.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
The usage of cast_mask() within bpfland_enqueue aims to cast
"p->cpus_ptr" from "struct bpf_cpumask *" to "const struct cpumask *".
However, the type of "p->cpus_ptr" is already "const cpumask_t *", aka
"const struct cpumask *", so no conversion is needed.
Passing a value of type "struct cpumask *" as a "struct bpf_cpumask *"
also leads to a compile error.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Refactor match_layer() to prevent the compile error caused by the
variable "nr_match_ors" being used before initialization.
Move the check of "nr_match_ors" after the point where it gets the value
from "layer->nr_match_ors" to make sure it's initialized successfully.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Pass enqueue flags to user-space: flags will be passed via
QueuedTask.flags and can be forwarded back to BPF via
DispatchedTask.flags.
These flags can also be passed to BpfScheduler.select_cpu() to apply a
more refined CPU selection policy.
Moreover, avoid prioritizing the user-space scheduler too much and
dispatch it only if there are no other tasks that need to be dispatched
in ops.dispatch().
This improves CPU utilization and enhances the fairness, robustness, and
resilience of schedulers based on scx_rustland_core, particularly under
stress test conditions.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
On WAKE_SYNC attempt to migrate the wakee to the same CPU as the waker
if the waker is not exiting, the wakee can use the waker's CPU, the
waker's L3 domain is not saturated and there are no other tasks queued
to the local DSQ of the waker's CPU.
This is the same logic used in scx_rusty.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Using the turbo-boosted CPUs as the preferred scheduling domain seems to
be beneficial only in a very few corner cases, for example on
battery-powered devices with an aggressive cpufreq governor that
constantly tries to scale down the frequency (and even in this case it's
probably better not to force the tasks to run on the fast CPUs, to save
power).
In practice the preferred domain seems to introduce more overhead than
benefits overall, so let's get rid of it.
This can be improved in the future by adding multiple user-configurable
scheduling domains.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Many kernel threads perform latency-critical work (e.g., net, GPU). In
particular, the AMD GPU driver runs mostly in kernel space using
kworkers. Hence, treat kernel threads as if they were woken-up tasks.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Initialize the node cpumask, which was previously left uninitialized,
causing metric calculations to be wrong when attempting to look up CPUs
in the node cpumask.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Use `cargo fmt` with a specific nightly branch in the CI to enforce formatting. Globally format these files while the diff is still small so we can stay on top of it.
Test plan:
- CI lint check passes.
The domains are added to the aggregator when load is added (and
duty_cycle is not 0.0f64).
This commit makes sure that all domains are added to the aggregator even
when the calculated duty_cycle is 0.
Signed-off-by: Fredrik Lönnegren <fredrik@frelon.se>
Pass in the layer spec when determining the layer core growth algo. This
should make it easier to implement layer growth algos that are spec
specific.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Using p->scx.slice to evaluate the consumed time slice can be a bit
imprecise, because the sched_ext core implements yielding by setting
p->scx.slice to 0.
When the task's vruntime is evaluated, this is treated as if the task had
exhausted its entire allocated time slice, even though it voluntarily
released the CPU before the slice fully expired.
To avoid this inaccuracy and prevent penalizing tasks that voluntarily
release the CPU, always evaluate the used time slice based on the
difference in the task's total execution time (p->se.sum_exec_runtime).
This method provides a more precise calculation of vruntime and results
in a fairer evaluation of the task's deadline.
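A sketch of the accounting (illustrative callback bodies, assuming a per-task
context with last_run_at and vruntime fields; the weight scaling is
simplified):
```
void BPF_STRUCT_OPS(sched_running, struct task_struct *p)
{
	struct task_ctx *tctx = lookup_task_ctx(p);

	/* Snapshot total execution time when the task starts running */
	if (tctx)
		tctx->last_run_at = p->se.sum_exec_runtime;
}

void BPF_STRUCT_OPS(sched_stopping, struct task_struct *p, bool runnable)
{
	struct task_ctx *tctx = lookup_task_ctx(p);

	if (tctx) {
		/* Charge only the time actually executed, not the slice left */
		u64 used = p->se.sum_exec_runtime - tctx->last_run_at;

		tctx->vruntime += used * 100 / p->scx.weight;
	}
}
```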
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Rust build was using two separate workspaces - rust/ and scheds/rust.
There's no reason to separate them and it makes doc generation tricky. Use
single top level workspace so that we can drive all rust building from
cargo.
Split build and test jobs to reduce CI turnaround time
and make it clear what is failing when something fails.
Also add virtiofsd to deps to make test compilation faster
(most test time is compilation) and remove all forced 9p usage.
Simplify the scx_rlfifo code, add detailed documentation of the
scx_rustland_core API and get rid of the additional task queue, since it
just makes the code bigger and slower without really providing any
benefit (considering that we are dispatching the tasks in FIFO order
anyway).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Pass the enqueue flags to the user-space scheduler through the
QueuedTask struct.
These flags allow the user-space scheduler to make more informed
scheduling decisions.
Also bump up scx_rustland_core minor version to reflect the new API (we
are not breaking the old API, so we don't need to bump the major version
in this case).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Unexpectedly, little cores, which have relatively short time slices, have
a higher chance of scheduling performance-critical tasks. Hence it is
better to keep the time slice the same regardless of core type.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When selecting an idle CPU for a task that has been woken up, prioritize
reusing the same CPU if the waker and wakee share the same L3 cache.
Otherwise, attempt to migrate the wakee to the waker's CPU, provided it
is allowed by the wakee's scheduling domain.
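In ops.select_cpu() terms the logic is roughly the following (a sketch;
cpus_share_l3() is an illustrative helper, not the exact scx_bpfland code):
```
if (wake_flags & SCX_WAKE_SYNC) {
	s32 waker_cpu = bpf_get_smp_processor_id();

	/* Waker and wakee share the L3: keep the wakee where it was */
	if (cpus_share_l3(prev_cpu, waker_cpu) &&
	    scx_bpf_test_and_clear_cpu_idle(prev_cpu))
		return prev_cpu;

	/* Otherwise migrate to the waker's CPU, if the cpumask allows it */
	if (bpf_cpumask_test_cpu(waker_cpu, p->cpus_ptr))
		return waker_cpu;
}
```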
This seems to consistently improve FPS performance when the system is
not operating over its full capacity.
Example:
$ __GL_SYNC_TO_VBLANK=0 vblank_mode=0 glxgears -geometry 800x600
- before: ~18305.77 FPS
- after: ~19060.62 FPS
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Rename "turbo domain" to "preferred domain", that conceptually is more
generic and introduce the new option `--preferred-domain CPUMASK`, which
allows users to define the preferred domain, specifying a cpumask as a
hex number. By default ("auto") the scheduler will always try to detect
and use the fastest CPUs in the system.
Moreover, adjust the cpufreq logic to use "auto" both with the
"balance_power" and "balance_performance" EPP profiles.
Then, enable "auto" mode by default: the scheduler will try to
automatically determine the optimal primary domain, preferred domain and
cpufreq level, based on the selected scheduler and energy profiles.
Tested-by: Piotr Gorski <piotr.gorski@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Reduce the formatting precision of stats for readability. The existing
formatting is hard to read:
tot= 1538 local=31.27 open_idle= 2.73 affn_viol=23.80 proc=4ms
busy= 1.1 util= 16.6 load= 32.7 fallback_cpu= 6
excl_coll=0.06501950585175553 excl_preempt=0.26007802340702213 excl_idle=0.16384915474642392 excl_wakeup=0.25097529258777634
With this fix the stats formatting is far more readable:
tot= 441 local=33.56 open_idle= 0.00 affn_viol=20.63 proc=3ms
busy= 0.4 util= 6.3 load= 33.6 fallback_cpu= 6
excl_coll=0.454 excl_preempt=0.000 excl_idle=0.132 excl_wakeup=0.200
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
When a pinned task cannot run on either the active or overflow sets, try
to keep it on the previous CPU if that CPU is still okay to run on.
Signed-off-by: Changwoo Min <changwoo@igalia.com>