Make `--primar-domain auto` aware of turbo boosted CPUs and prioritize
them over the primary scheduling domain when the energy model
`balance_power` is used (typically when running on battery power with
the "balanced" profile).
With this change the scheduling hierarchy becomes the following:
1) CPUs in the turbo scheduling domain
2) CPUs in the primary scheduling domain
3) full-idle SMT CPUs
4) CPUs in the same L2 cache
5) CPUs in the same L3 cache
6) CPUs in the task's allowed domain
And the idle selection logic is modified as following:
- In the turbo scheduling domain:
- pick same full-idle SMT CPU
- pick any other full-idle SMT CPU sharing the same L2 cache
- pick any other full-idle SMT CPU sharing the same L3 cache
- pick any other full-idle SMT CPU
- pick same idle CPU
- pick any other idle CPU sharing the same L2 cache
- pick any other idle CPU sharing the same L3 cache
- pick any other idle SMT CPU
- In the primary scheduling domain:
- pick same full-idle SMT CPU
- pick any other full-idle SMT CPU sharing the same L2 cache
- pick any other full-idle SMT CPU sharing the same L3 cache
- pick any other full-idle SMT CPU
- pick same idle CPU
- pick any other idle CPU sharing the same L2 cache
- pick any other idle CPU sharing the same L3 cache
- pick any other idle SMT CPU
- In the entire task domain:
- pick any other idle CPU
Keep in mind that the turbo domain will be evaluated only when the
scheduler is started with `--primary-domain auto` and only when the
`balance_power` energy profile is used.
The turbo domain is always made using the subset of CPUs in the system
with the highest max frequency. If such subset can't be determined (for
example if all the CPUs in the primary domain have all the same
frequency), the turbo domain will be ignored.
Prioritizing turbo boosted CPUs can help to improve performance by
forcing the governor to scale up their frequency, without increasing too
much power consumption, due to the fact that tasks will be preferably
confined into a reduced amount of cores.
This change seems to improve performance, without increasing much
power consuption, on Intel laptops while using the `balanced_power`
energy profile.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Introduce the new option `--primary-domain auto`. With this option the
scheduler will dynamically adjusts the primary scheduling domain at
run-time, in function of the current energy profile reported in
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference.
When the `power` energy profile is selected, the primary scheduling
domain will prioritize E-cores. Alternatively, when the `performance`
profile is selected, it will prioritize P-cores. For all the other
energy profiles, all the CPUs in the system will be used.
Note that this option is only relevant on hybrid architectures with
P-cores and E-cores.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Introduce the new `--lowlatency` option, which enables switching between
the default pure vruntime-based scheduling (more optimized for server
workloads) and a deadline-based scheduling (better suited for
low-latency workloads).
When the low-latency mode is activated, a task's deadline is calculated
as its vruntime, adjusted by a bonus proportional to the task's average
number of voluntary context switches (the more voluntary context
switches, the shorter the deadline).
This feature enhances the prioritization of interactive tasks even more,
proportionally to their average voluntary context switches, also within
the two main global queues (priority / shared) and it helps to maintain
interactive workloads always responsive, even in presence of heavy
non-interactive background work.
Low-latency mode allows to prevent audio cracking even in presence of a
large amount of short-lived tasks with pseudo-interactive behavior (i.e,
hackbench) and it enables achieving approximately a +33% average
frames-per-second (FPS) in the typical "gaming while building the
kernel" benchmark.
However, it can also amplify the de-prioritization of CPU-intensive
tasks, making this option more suitable for specific low-latency
scenarios. Therefore the low-latency mode is disabled by default and it
can only be enabled via the `--lowlatency` option.
Tested-by: Piotr Gorski (piotrgorski@cachyos.org)
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Explicitly replenish the task's time slice from ops.dispatch() if the
task still wants to run and no other task is selected. In this way the
sched_ext core won't automatically re-schedule the task on the same CPU,
implicitly assigning a time slice of SCX_SLICE_DFL.
Moreover, instead of determining the task time slice in ops.enqueue(),
refresh the time slice immediately before the task is started on its
assigned CPU in ops.running().
This allows to use a more precise time slice, adjusted based on the
actual amount of tasks that are currently waiting to be scheduled.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
With the global scx_utils::NR_CPU_IDS we don't need Topology anymore in
init_primary_domain(), so drop the variable to fix the following build
warning:
warning: unused variable: `topo`
--> src/main.rs:385:9
|
385 | topo: &Topology,
| ^^^^ help: if this is intentional, prefix it with an underscore: `_topo`
|
= note: `#[warn(unused_variables)]` on by default
Fixes: 1da249f ("scx_utils::topology: Always use NR_CPU_IDS and NR_CPUS_POSSIBLE")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Use the base frequency, instead of maximum frequency, to classify fast
and slow CPUs. This ensures accurate distinction between Intel Turbo
Boost CPUs and genuinely faster CPUs when auto-detecting the primary
scheduling domain.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Tasks enqueued with SCX_ENQ_WAKEUP are immediately classified as
interactive. However, if interactive tasks classification is disabled
(via `-c 0`), we should avoid promoting them as interactive.
This is particularly important because, with the nvcsw logic disabled,
tasks can remain classified as interactive indefinitely and they will
never be demoted to regular tasks.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Rely on scx_utils::Cpumask instead of re-implementing a custom struct to
parse and manage CPU masks.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Rely on scx_utils::Topology to get CPU and cache information, instead of
re-implementing custom methods.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The primary scheduling domain represents a group of CPUs in the system
where the scheduler will initially attempt to assign tasks. Tasks will
only be dispatched to CPUs within this primary domain until they are
fully utilized, after which tasks may overflow to other available CPUs.
The primary scheduling domain can defined using the option
`--primary-domain CPUMASK` (by default all the CPUs in the system are
used as primary domain).
This change introduces two new special values for the CPUMASK argument:
- `performance`: automatically detect the fastest CPUs in the system
and use them as primary scheduling domain,
- `powersave`: automatically detect the slowest CPUs in the system and
use them as primary scheduling domain.
The current logic only supports creating two groups: fast and slow CPUs.
The fast CPU group is created by excluding CPUs with the lowest
frequency from the overall set, which means that within the fast CPU
group, CPUs may have different maximum frequencies.
When using the `performance` mode the fast CPUs will be used as primary
domain, whereas in `powersave` mode, the slow CPUs will be used instead.
This option is particularly useful in hybrid architectures (with P-cores
and E-cores), as it allows the use of bpfland to prioritize task
scheduling on either P-cores or E-cores, depending on the desired
performance profile.
Example:
- Dell Precision 5480
- CPU: 13th Gen Intel(R) Core(TM) i7-13800H
- P-cores: 0-11 / max freq: 5.2GHz
- E-cores: 12-19 / max freq: 4.0GHz
$ scx_bpfland --primary-domain performance
0[||||||||| 24.5%] 10[|||||||| 22.8%]
1[|||||| 14.9%] 11[||||||||||||| 36.9%]
2[|||||| 16.2%] 12[ 0.0%]
3[||||||||| 25.3%] 13[ 0.0%]
4[||||||||||| 33.3%] 14[ 0.0%]
5[|||| 9.9%] 15[ 0.0%]
6[||||||||||| 31.5%] 16[ 0.0%]
7[||||||| 17.4%] 17[ 0.0%]
8[|||||||| 23.4%] 18[ 0.0%]
9[||||||||| 26.1%] 19[ 0.0%]
Avg power consumption: 3.29W
$ scx_bpfland --primary-domain powersave
0[| 2.5%] 10[ 0.0%]
1[ 0.0%] 11[ 0.0%]
2[ 0.0%] 12[|||| 8.0%]
3[ 0.0%] 13[||||||||||||||||||||| 64.2%]
4[ 0.0%] 14[|||||||||| 29.6%]
5[ 0.0%] 15[||||||||||||||||| 52.5%]
6[ 0.0%] 16[||||||||| 24.7%]
7[ 0.0%] 17[|||||||||| 30.4%]
8[ 0.0%] 18[||||||| 22.4%]
9[ 0.0%] 19[||||| 12.4%]
Avg power consumption: 2.17W
(Info collected from htop and turbostat)
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
While the system is not saturated the scheduler will use the following
strategy to select the next CPU for a task:
- pick the same CPU if it's a full-idle SMT core
- pick any full-idle SMT core in the primary scheduling group that
shares the same L2 cache
- pick any full-idle SMT core in the primary scheduling grouop that
shares the same L3 cache
- pick the same CPU (ignoring SMT)
- pick any idle CPU in the primary scheduling group that shares the
same L2 cache
- pick any idle CPU in the primary scheduling group that shares the
same L3 cache
- pick any idle CPU in the system
While the system is completely saturated (no idle CPUs available), tasks
will be dispatched on the first CPU that becomes available.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Allow to specify a primary scheduling domain via the new command line
option `--primary-domain CPUMASK`, where CPUMASK can be a hex number of
arbitrary length, representing the CPUs assigned to the domain.
If this option is not specified the scheduler will use all the available
CPUs in the system as primary domain (no behavior change).
Otherwise, if a primary scheduling domain is defined, the scheduler will
try to dispatch tasks only to the CPUs assigned to the primary domain,
until these CPUs are saturated, at which point tasks may overflow to
other available CPUs.
This feature can be used to prioritize certain cores over others and it
can be really effective in systems with heterogeneous cores (e.g.,
hybrid systems with P-cores and E-cores).
== Example (hybrid architecture) ==
Hardware:
- Dell Precision 5480 with 13th Gen Intel(R) Core(TM) i7-13800H
- 6 P-cores 0..5 with 2 CPUs each (CPU from 0..11)
- 8 E-cores 6..13 with 1 CPU each (CPU from 12..19)
== Test ==
WebGL application (https://webglsamples.org/aquarium/aquarium.html):
this allows to generate a steady workload in the system without
over-saturating the CPUs.
Use different scheduler configurations:
- EEVDF (default)
- scx_bpfland using P-cores only (--primary-domain 0x00fff)
- scx_bpfland using E-cores only (--primary-domain 0xff000)
Measure performance (fps) and power consumption (W).
== Result ==
+-----+-----+------+-----+----------+
| min | max | avg | | |
| fps | fps | fps | stdev | power |
+-----------------+-----+-----+------+-------+--------+
| EEVDF | 28 | 34 | 31.0 | 1.73 | 3.5W |
| bpfland-p-cores | 33 | 34 | 33.5 | 0.29 | 3.5W |
| bpfland-e-cores | 25 | 26 | 25.5 | 0.29 | 2.2W |
+-----------------+-----+-----+------+-------+--------+
Using a primary scheduling domain of only P-cores with scx_bpfland
allows to achieve a more stable and predictable level of performance,
with an average of 33.5 fps and an error of ±0.5 fps.
In contrast, using EEVDF results in an average frame rate of 31.0 fps
with an error of ±3.0 fps, indicating slightly less consistency, due to
the fact that tasks are evenly distributed across all the cores in the
system (both slow and fast cores).
On the other hand, using a scheduling domain solely of E-cores with
scx_bpfland results in a lower average frame rate (25.5 fps), though it
maintains a stable performance (error of ±0.5 fps), but the power
consumption is also reduced, averaging 2.2W, compared to 3.5W with
either of the other configurations.
== Conclusion ==
In summary, with this change users have the flexibility to prioritize
scheduling on performance cores for better performance and consistency,
or prioritize energy efficient cores for reduced power consumption, on
hybrid architectures.
Moreover, this feature can also be used to minimize the number of cores
used by the scheduler, until they reach full capacity. This capability
can be useful for reducing power consumption even in homogeneous systems
or for conducting scheduling experiments with smaller sets of cores,
provided the system is not overcommitted.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Abbreviate the statistics reported to stdout and remove the slice_ms
metric: this metric can be easily derived from slice_ns, slice_ns_min
and nr_wait, which is already reported to stdout.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Use the same idle selection logic used in scx_bpfland also in
scx_rustland_core.
Also drop fifo_mode and always use the BPF idle selection logic by
default as long as the system is not saturated, unless full_user is
specified.
This approach allows user-space schedulers aiming for maximum
performance to leverage the BPF idle selection logic (bypassing
user-space), while those seeking full control can enable full_user to
bypass the BPF CPU idle selection logic and choose the target CPU for
each task from user-space.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Immediately re-align p->scx.dsq_vtime to the global vruntime (+/- slice
lag) as soon as we are evaluating the task's vruntime.
This allows rapidly chase the minimum global vruntime, ensuring to not
over prioritize tasks tasks with a predominantly sleeping behavior
pattern.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Using negative values with --slice-us-lag can be useful to make
performance more consistent and prioritize newly created tasks over the
running tasks.
Therefore, allow to specify negative values from the command line and
also update the documentation of this option.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Periodically report to stdout samples of the effective time slice
applied to tasks.
While one could determine this metric by examining the max slice_ns and
nr_waiting metrics, directly reporting it to stdout allows users to
quickly identify what is happening and it provides a clearer overview of
the scheduling behavior.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Dispatching per-CPU kthreads directly is disabled by default, reporting
this metric can generate some confusion (since it is always 0), and even
if local kthread dispatches are enabled, they should be still considered
as regular direct dispatches (there is no difference in practice).
Therefore, merge direct kthread dispatches into direct dispatches and
drop the separate nr_kthread_dispatches metric.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Scale the task's time slice based on the average amount of tasks that
are currently waiting to be dispatched.
Use a moving average for the amount of waiting tasks to smooth out
potential spikes caused by temporary bursts of tasks piling in the wait
queues.
This was initially modeled in scx_rustland and it seems to work pretty
well also in scx_bpfland now.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Instead of using a static value to classify tasks based on their average
amount of voluntary context switches, try to periodically evaluate an
optimal threshold, based on a global average of voluntary context
switches among of all the running tasks.
Tasks with an average amount of voluntary context switches greater than
the global average will be classified as interactive.
The global average is evaluated as an exponentially weighted moving
average (EWMA), as:
avg(t) = avg(t - 1) * 0.75 - task_avg(t) * 0.25
This approach is more efficient than iterating through all tasks and it
helps to prevent rapid fluctuations that may be caused by bursts of
voluntary context switch events.
The dynamic nvcsw threshold enables a more precise adjustment of
the classification criteria to swiftly respond to global system changes:
tasks can be quickly classified as interactive, but if the system
experiences too many interactive events, the criteria for maintaining
interactive status become stricter. This creates a natural selection
process where only the most deserving tasks remain interactive.
Additionally, introduce the new option `--nvcsw-max-thresh N`, which
allows to extend or restrict the fluctuation range of the global average
threshold for voluntary context switches.
Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Update libbpf-rs & libbpf-cargo to 0.24. Among other things, generated
skeletons now contain directly accessible map and program objects, no
longer necessitating the use of accessor methods. As a result, the risk
for mutability conflicts is reduced greatly.
Signed-off-by: Daniel Müller <deso@posteo.net>
We always use nr_cpu_ids to represent the maximum CPU id returned by
scx_bpf_nr_cpu_ids().
Replace cpu_max with nr_cpu_ids to be more consistent with the rest of
the code.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
We can rely on scx_bpf_nr_cpu_ids() to create all the possible per-CPU
DSQs, eliminating the need for the hard-coded limit MAX_CPUS.
In this way scx_bpfland can support the same amount of CPUs that the
kernel can handle.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Instead of constantly checking the need to drain tasks from the DSQs of
the offline CPUs, provide an atomic flag to notify when there are tasks
to be drained from the offline CPUs.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Refine the safeguard mechanism to avoid generating too many interactive
tasks in the system, which could nullify the effect of the
interactive/regular task classification.
The safeguard mechanism operates by pausing the promotion of new tasks
to interactive status during the task wake-up process, whenever the
number of interactive tasks in the priority queue exceeds a specific
limit (set to 4x the number of online CPUs).
Halting the promotion of additional interactive tasks allows to
prioritize those already classified as interactive, thereby preventing
potential "bursts" of excessive interactive tasks in the system.
This refines the mitigation already provided by commit 640bd562
("scx_bpfland: prevent tasks from abusing interactive priority boost").
Fixes: 640bd562 ("scx_bpfland: prevent tasks from abusing interactive priority boost")
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Always assign the maximum time slice if there are idle CPUs in the
system.
Otherwise, double the task's unused time slice to reward tasks that use
less CPU time and at the same time refill the time slice of the tasks
every time they're dispatched.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
sched_ext is about to be merged upstream. There are some compatibility
breaking changes and we're making the current sched_ext/for-6.11
1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on
!wakeup enqueues") the baseline.
Tag everything except scx_mitosis as 1.0.0. As scx_mitosis is still in early
development and is currently temporarily disabled, only the patchlevel is
bumped.
The priority boost for interactive tasks can be exploited to render the
system nearly unresponsive by creating numerous tasks that constantly
switch between wait/wakeup states.
For example, stress tests like `hackbench -l 10000` can significantly
degrade system responsiveness.
To mitigate this, limit the number of interactive tasks added to the
priority queue to 4x the number of online CPUs.
This simple approach appears to be a quite effective at identifying
potential spam of "fake" interactive tasks, while still prioritizing
legitimate interactive tasks.
Additionally, periodically refresh the interactive status of the tasks
based on their most recent average of voluntary context switches,
preventing the interactive status from being too "sticky".
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Avoid dispatching per-CPU kthreads directly, since this may cause
interactivity problems or unfairness, for example if there are too many
softirqs being scheduled (e.g., in presence of high RX network traffic
or when running certain stress tests, like hackbench).
Moreover, in order to help with testing and benchmarks, introduce the
option --local-kthread, that allows to restore the old behavior if
enabled.
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
When updating the task vruntime, ensure the time slice delta is always a
positive value. Failing to do so may cause the global vruntime to
increase excessively due to overflows.
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Periodically report the amount of online CPUs to stdout.
The online CPUs are initially evaluated looking at the online cpumask,
then the value is updated in the .cpu_offline() / .cpu_online()
callbacks.
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Keep track of the CPUs that are running interactive tasks and report
their amount to stdout.
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
The correct default value of slice_ns 5ms, not 5s.
This change doesn't really make any difference in practice, since these
values are changed by the Rust part when the scheduler is started, but
it's good to keep this aligned to the proper values for consistency.
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Every time we need to dispatch a task re-evalate its time slice as:
(unused_time_slice + min_time_slice) / 2
This allows to refill the time slice for tasks that haven't used much of
their previously assigned time, improving fairness.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Make sure to always classify interactive tasks, even when the system is
not fully utilized. This ensures that if the system suddenly becomes
overloaded, we already know which tasks need to be dispatched to the
priority DSQ.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>