Schedule all tasks using a single global DSQ. This gives a better
control to prevent potential starvation conditions.
With this change, scx_bpfland adopts a logic similar to scx_rusty and
scx_lavd, prioritizing tasks based on the frequency of their wait and
wake-up events, rather than relying exclusively on the average amount of
voluntary context switches.
Tasks are still classified as interactive / non-interactive based on the
amount of voluntary context switches, but this is only affecting the
cpufreq logic.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Rather then always migrating tasks across LLC domains when no idle CPU
is available in their current LLC domain, allow migration but attempt to
bring tasks back to their original LLC domain whenever possible.
To do so, define the task's scheduling domain upon task creation or when
its affinity changes, and ensure the task remains within this domain
throughout its lifetime.
In the future we will add a proper load balancing logic, but for now
this change seems to provide consistent performance improvement in
certain server workloads.
For example, simple CUDA benchmarks show a performance boost of about
+10-20% with this change applied (on multi-LLC / NUMA machines).
Signed-off-by: Andrea Righi <arighi@nvidia.com>
This allows to prevent excessive starvation of regular tasks in presence
of high amount of interactive tasks (e.g., when running stress tests,
such as hackbench).
Signed-off-by: Andrea Righi <arighi@nvidia.com>
This can lead to stalls when a high number of interactive tasks are
running in the system (i.e.., hackbench or similar stress tests).
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Add SCX_OPS_ENQ_EXITING to the scheduler flags, since we are not using
bpf_task_from_pid() and the scheduler can handle exiting tasks.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Ensure that task vruntime is always updated in ops.running() to maintain
consistency with other schedulers.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
The dynamic nvcsw threshold is not used anymore in the scheduler and it
doesn't make sense to report it in the scheduler's statistics, so let's
just drop it.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Get rid of the static MAX_LATENCY_WEIGHT and always rely on the value
specified by --nvcsw-max-thresh.
This allows to tune the maximum latency weight when running in
lowlatency mode (via --nvcsw-max-thresh) and it also restores the
maximum nvcsw limit in non-lowlatency mode, that was incorrectly changed
during the lowlatency refactoring.
Fixes: 4d68133 ("scx_bpfland: rework lowlatency mode to adjust tasks priority")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Evalute the amount of voluntary context switches directly in the BPF
code, without relying on the kernel p->nvcsw metric.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Adjust some default settings after the rework done with commit 112a5d4
("scx_bpfland: rework lowlatency mode to adjust tasks priority").
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Rework lowlatency mode as following:
- introduce task dynamic priority: task weight multiplied by the
average amount of voluntary context switches
- use dynamic priority to determine task's vruntime (instead of the
static task's weight)
- task's minimum vruntime is evaluated in function of the dynamic
priority (tasks with a higher dynamic priority can have a smaller
vruntime compared to tasks with a lower dynamic priority)
The dynamic priority allows to maintain a good system responsiveness
also without applying the classification of tasks in "interactive" and
"regular", therefore in lowlatency mode only the shared DSQ will be
used (priority DSQ is disabled).
Using a separate priority queue to dispatch "interactive" tasks makes
the scheduler less fair, allowing latency-sensitive tasks to be
prioritized even when there is a high number of tasks in the system
(e.g., `stress-ng -c 1024` or similar scenarios), where relying solely
on dynamic priority may not be sufficient.
On the other hand, disabling the classification of "interactive" tasks
results in a fairer scheduler and more predictable performance, making
it better suited for soft real-time applications (e.g, audio and
multimedia).
Therefore, the --lowlatency option is retained to allow users to choose
between more predictable performance (by disabling the interactive task
classification) or a more responsive system (default).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Using per-CPU DSQs seems to introduce more issues than benefits
(potential stalls, etc.). Therefore, let's get rid of the per-CPU DSQs
and use SCX_DSQ_LOCAL for tasks directly dispatched to specific CPUs.
This change seems to also improve performance on 6.12 and it makes the
scheduler a lot more stable and consistent.
The issues will be investigated separately, providing a separate stress
test scheduler, designed to stress test per-CPU DSQs.
Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Tested-by: Eric Naim <dnaim@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Return more meaningful error codes from pick_idle_cpu(). No functional
change, just improved code readability.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When a task exhausts its timeslice and no other tasks are ready to run,
we automatically refill its timeslice, but only if the current CPU is a
fully idle SMT core.
If we don’t handle the refill, the sched_ext core will default to
refilling using SCX_SLICE_DFL, which may not be optimal.
To ensure better control over the task’s timeslice, always refill it
when no other tasks are available to run.
Fixes: 6e24fcc ("scx_bpfland: keep tasks running on full-idle SMT cores")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Pick any random idle CPU when the previous CPU isn't valid anymore
according to the task's cpumask.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
In the WAKE_SYNC path lf L3 cache awareness is disabled (--disable-l3)
we may hit the following error:
Error: EXIT: scx_bpf_error (CPU L3 cpumask not initialized)
Fix this by setting the L3 cpumask to the whole primary domain if L3
cache awareness is disabled.
Tested-by: Eric Naim <dnaim@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Since per-CPU kthreads may show an inconsistent prev_cpu and/or cpumask,
dispatch them directly to local DSQ and allow to preempt the current
running task.
This allows to prevent per-CPU kthread stalls and it also helps to
prioritize them, as are usually important for system performance and
responsiveness.
Moreover, change the behavior of --local-kthreads to prioritize all
kthreads when this option is used.
This addresses issue #728.
NOTE: ideally we may want to fix this in the kernel by making sure to
always expose a consistent prev_cpu and cpumask also for kthreads, but
at the moment this change allows to prevent some annoying stalls and
performance-wise it doesn't seem to introduce any regression. In fact,
the usual gaming/fps benchmarks show even a slight improvement in
responsiveness with this change applied.
Thanks to YUBY from the CachyOS community for all the extremely valuable
help with the intensive stress tests.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
In lowlatency mode (option --lowlatency) tasks are ordered using a
deadline that is evaluated as the vruntime minus a certain "bonus",
determined in function of the max time slice and the average amount of
voluntary context switches, to amplify the priority boost of the tasks
that are voluntarily releasing the CPU (which are typically
interactive).
However, this method can be extremely unfair in some cases: tasks with
short bursts of voluntary context switches may receive a huge priority
boost, making the rest of the system almost unresponsive (see massive
hackbench stress tests for example).
To prevent this rework the task's deadline logic to use the vruntime and
a "deadline component" that is a function of the average used time
slice, scaled using a dynamic task priority (evaluated as the static
task priority and the its average amount of voluntary context switches).
This logic seems to prevent excessive prioritization of tasks performing
short intensive bursts of voluntary context switches.
It also makes lowlatency mode in scx_bpfland (somehow) more similar to
the deadline logic used by scx_rusty.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
`#stat_doc` extends the document from stat desc property.
Add this attribute macro to the remaining Stats structs.
Signed-off-by: Ming Yang <minos.future@gmail.com>
task_avg_nvcsw() was incorrectly returning a bool instead of u64,
limiting the impact of the lowlatency boost.
Fix it by returning the proper type (u64).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When a task is the last one running on a CPU and still wants to
continue, allow it to run and replenish its time only if the used CPU is
part a fully idle SMT core.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
During ttwu, the kernel may decide to skip ->select_task_rq() (e.g.,
when only one CPU is allowed or migration is disabled). This causes to
call ops.enqueue() directly without having a chance to call
ops.select_cpu().
Therefore, introduce a new flag (select_cpu_done) in the local task
context to determine if ops.select_cpu() was bypassed and, in that case,
attempt to find an idle CPU directly from ops.enqueue().
In the future this information will be supplied by the kernel through a
special enqueue flag (SCX_ENQ_CPU_SELECTED) [1]. However, the custom
flag in the local task context ensures to reliably determine the same
information, even on older kernels where this flag is not available.
[1] https://lore.kernel.org/lkml/20240928003840.GA2717@maniforge/T
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Remove cast_mask() function distributed throughout different schedulers
and add it in common.bpf.h so every scheduler can reference it once they
need to.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
The usage of cast_mask() within bpfland_enqueue aims to cast the type of
"p->cpus_ptr" from "struct bpf_cpumask *" to "const struct cpumask *".
However, the type of "p->cpus_ptr" is already "const cpumask_t *" aka
"const struct cpumask *", so no conversion is needed.
Passing a value of type "struct cpumask *" into "struct bpf_cpumask *"
also leads to compiling error.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
On WAKE_SYNC attempt to migrate the wakee on the same CPU as the waker
if the waker is not exiting, the wakee can use the waker's CPU, the
waker's L3 domain is not saturated and there are not other tasks queued
to the local DSQ of the waker's CPU.
This is the same logic used in scx_rusty.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Using the turbo boosted CPUs as preferred scheduling seems to be
beneficial only a very few corner cases, for example on battery-powered
devices with an aggressive cpufreq governor that constantly tries to
scale down the frequency (and even in this case it's probably better to
not force the tasks to run on the fast CPUs, to save power).
In practive the preferred domain seems to introduce more overhead than
benefits overall, so let's get rid of it.
This can be improved in the future adding multiple user-configurable
scheduling domains.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Use `cargo fmt` with a specific nightly branch in the CI to enforce formatting. Globally format these files while the diff is still small so we can stay on top of it.
Test plan:
- CI lint check passes.
Using p->scx.slice to evaluate the consumed time slice can be a bit
imprecise, because the sched_ext core implements yielding by setting
p->scx.slice to 0.
When the task's vruntime is evaluated this is considered as the task has
exhausted its entire allocated time slice, even though it voluntarily
released the CPU before the slice fully expired.
To avoid this inaccuracy and prevent penalizing tasks that voluntarily
release the CPU, always evaluate the used time slice based on the
difference in the task's total execution time (p->se.sum_exec_runtime).
This method provides a more precise calculation of vruntime and results
in a fairer task's deadline evaluation.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When selecting an idle CPU for a task that has been woken up, prioritize
reusing the same CPU if the waker and wakee share the same L3 cache.
Otherwise, attempt to migrate the wakee to the waker's CPU, provided it
is allowed by the wakee's scheduling domain.
This seems to consistently improve FPS performance when the system is
not operating over its full capacity.
Example:
$ __GL_SYNC_TO_VBLANK=0 vblank_mode=0 glxgears -geometry 800x600
- before: ~18305.77 FPS
- after: ~19060.62 FPS
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Rename "turbo domain" to "preferred domain", that conceptually is more
generic and introduce the new option `--preferred-domain CPUMASK`, which
allows users to define the preferred domain, specifying a cpumask as a
hex number. By default ("auto") the scheduler will always try to detect
and use the fastest CPUs in the system.
Moreover, adjust the cpufreq logic to use "auto" both with the
"balance_power" and "balance_performance" EPP profiles.
Then, enable "auto" mode by default: the scheduler will try to
automatically determine the optimal primary domain, preferred domain and
cpufreq level, based on the selected scheduler and energy profiles.
Tested-by: Piotr Gorski < piotr.gorski@cachyos.org >
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
In auto mode, rather than keeping the previous fixed cpuperf factor,
dynamically calculate it based on CPU utilization and apply it before a
task runs within its allocated time slot.
Interactive tasks consistently receive the maximum scaling factor to
ensure optimal performance.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Always consider the turbo domain when running in "auto" mode.
Additionally, when the turbo domain is used, split the CPU idle
selection logic into two stages:
1) in ops.select_cpu(), provide the task with a second opportunity to
remain within the same LLC
2) in ops.enqueue(), perform another check for an idle CPU, allowing
the task to move to a different LLC if an idle CPU within the same
LLC is not available.
This allows tasks to stick more on turbo-boosted CPUs and CPUs within
the same LLC.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When tasks are changing CPU affinity it is pointless to try to find an
optimal idle CPU. In this case just skip the the idle CPU selection step
and let the task being dispatched to a global DSQ if needed.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Add hints for the cpufreq governor based on the selected scheduler's
performance profile and the current energy performance preference (EPP).
With this change applied the scheduler works as following:
scheduler profile (--primary-domain option):
- default:
- use all cores
- cpufreq: use default scaling factor
- powersave:
- use E-cores
- cpufreq: use min scaling factor
- performance:
- use P-cores
- cpufreq: use max scaling factor
- auto:
- EPP: power, powersave
- use E-cores
- cpufreq: use min scaling factor
- EPP: balance_power (typically battery-powered systems)
- use E-cores
- cpufreq: use default scaling factor
- EPP: balance_performance, performance
- use P-cores
- cpufreq: use max scaling factor
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
We want to directly dispatch only kthreads when local_kthreads is
enabled, not all tasks that can run on a single CPU.
Fixes: 7cc1846 ("scx_bpfland: always rely on prev_cpu with single-CPU tasks")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When selecting an idle for tasks that can only run on a single CPU,
always check if the previously used CPU is sill usable, instead of
trying to figure out the single allowed CPU looking at the task's
cpumask.
Apparently, single-CPU tasks can report a prev_cpu that is not in the
allowed cpumask when they rapidly change affinity.
This could lead to stalls, because we may end up dispatching the kthread
to a per-CPU DSQ that is not compatible with its allowed cpumask.
Example:
kworker/u32:2[173797] triggered exit kind 1026:
runnable task stall (kworker/2:1[70] failed to run for 7.552s)
...
R kworker/2:1[70] -7552ms
scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
sticky/holding_cpu=-1/-1 dsq_id=0x8 dsq_vtime=234483011369
cpus=04
In this case kworker/2 can only run on CPU #2 (cpus=0x4), but it's
dispatched to dsq_id=0x8, that can only be consumed by CPU 8 => stall.
To prevent this, do not try to figure out the best idle CPU for tasks
that are changing affinity and just dispatch them to a global DSQ
(either priority or regular, depending on its interactive state).
Moreover, introduce an explicit error check in dispatch_direct_cpu() to
improve detection of similar issues in the future, and drop
lookup_task_ctx() in favor of try_lookup_task_ctx(), since we can now
safely handle all the cases where the task context is not found.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Aggressively try to keep tasks running on the same CPU / cache / domain,
to achieve higher performance when the system is not over commissioned.
This is done by giving a second chance in ops.enqueue(), in addition to
ops.select_cpu(), to find an idle CPU close to the previously used CPU.
Moreover, even if the task is dispatched to the global DSQs, always try
to check if there is an idle CPU in the primary domain that can
immediately consume the task.
= Results =
This change seems to provide a minor, but consistent, boost of
performance with the CPU-intensive benchmarks from the CachyOS
benchmarks selection [1].
Similar results can also be noticed with some WebGL benchmarks [2], when
system usage is close to its maximum capacity.
Test:
- cachyos-benchmarker
System:
- AMD Ryzen 7 5800X 8-Core Processor
Metrics:
- total time: elapsed time of all benchmarks
- total score: geometric mean of all benchmarks
NOTE: total time is the most relevant, since it gives a measure of the
aggregate performance, while the total score emphasizes more on
performance consistency across all benchmarks.
== Results: summary ==
+-------------------------+---------------------+---------------------+
| Scheduler | Total Time | Total Score |
| | (less = better) | (less = better) |
+-------------------------+---------------------+---------------------+
| EEVDF | 624.44 sec | 123.68 |
| bpfland | 625.34 sec | 122.21 |
| bpfland-task-affinity | 623.67 sec | 122.27 |
+-------------------------+---------------------+---------------------+
== Conclusion ==
With this patch applied, bpfland shows both a better performance and
consistency. Although the gains are small (less than 1%), they are still
significant for this type of benchmark and consistently appear across
multiple runs.
[1] https://github.com/CachyOS/cachyos-benchmarker
[2] https://webglsamples.org/aquarium/aquarium.html
Tested-by: Piotr Gorski < piotr.gorski@cachyos.org >
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Rely on scx_utils::Topology to classify Big, Little and Turbo CPUs.
Moreover, support the special keyword "all" with --primary-domain to
include all the CPUs in the system (default).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>