Scheduling sample reporting is switched to use scx_stats. This makes the
scheduler run without making too much noise while still allowing monitoring
on demand. It can also make introspection more dynamic - e.g. it shouldn't
be difficult to add other monitoring commands which take scheduling samples
based on different criteria or add other types of staisitcs.
--nr_sched-samples is replaced with --monitor-nr-samples.
Keep evaluating the average number of voluntary context switches for
each task when lowlatency mode is enabled, even when interactive tasks
classification is disabled (via `-c 0`).
The average nvcsw is also used in lowlatency mode to evaluate the
proportional bonus to the tasks' deadline and it shouldn't be ignored
when interactive tasks classification is disabled. Moreover, make sure
that such bonus never exceeds the starvation threshold.
Keep in mind that it is still possible to disable the periodic average
nvcsw evaluation with `-c 0`, without specifying `--lowlatency`.
Fixes: 6a22853 ("scx_bpfland: introduce --lowlatency option")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
A lot of scx_lavd's options do not clearly explain what they do. Add
some short explanations, clean up the existing ones, and direct the user
to read the in-code documentation for more info.
And move related ops into it. This is a bit more natural and will also allow
doing other operaitons (e.g. describing stats) without launching the server.
Make `--primar-domain auto` aware of turbo boosted CPUs and prioritize
them over the primary scheduling domain when the energy model
`balance_power` is used (typically when running on battery power with
the "balanced" profile).
With this change the scheduling hierarchy becomes the following:
1) CPUs in the turbo scheduling domain
2) CPUs in the primary scheduling domain
3) full-idle SMT CPUs
4) CPUs in the same L2 cache
5) CPUs in the same L3 cache
6) CPUs in the task's allowed domain
And the idle selection logic is modified as following:
- In the turbo scheduling domain:
- pick same full-idle SMT CPU
- pick any other full-idle SMT CPU sharing the same L2 cache
- pick any other full-idle SMT CPU sharing the same L3 cache
- pick any other full-idle SMT CPU
- pick same idle CPU
- pick any other idle CPU sharing the same L2 cache
- pick any other idle CPU sharing the same L3 cache
- pick any other idle SMT CPU
- In the primary scheduling domain:
- pick same full-idle SMT CPU
- pick any other full-idle SMT CPU sharing the same L2 cache
- pick any other full-idle SMT CPU sharing the same L3 cache
- pick any other full-idle SMT CPU
- pick same idle CPU
- pick any other idle CPU sharing the same L2 cache
- pick any other idle CPU sharing the same L3 cache
- pick any other idle SMT CPU
- In the entire task domain:
- pick any other idle CPU
Keep in mind that the turbo domain will be evaluated only when the
scheduler is started with `--primary-domain auto` and only when the
`balance_power` energy profile is used.
The turbo domain is always made using the subset of CPUs in the system
with the highest max frequency. If such subset can't be determined (for
example if all the CPUs in the primary domain have all the same
frequency), the turbo domain will be ignored.
Prioritizing turbo boosted CPUs can help to improve performance by
forcing the governor to scale up their frequency, without increasing too
much power consumption, due to the fact that tasks will be preferably
confined into a reduced amount of cores.
This change seems to improve performance, without increasing much
power consuption, on Intel laptops while using the `balanced_power`
energy profile.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Introduce the new option `--primary-domain auto`. With this option the
scheduler will dynamically adjusts the primary scheduling domain at
run-time, in function of the current energy profile reported in
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference.
When the `power` energy profile is selected, the primary scheduling
domain will prioritize E-cores. Alternatively, when the `performance`
profile is selected, it will prioritize P-cores. For all the other
energy profiles, all the CPUs in the system will be used.
Note that this option is only relevant on hybrid architectures with
P-cores and E-cores.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Introduce the new `--lowlatency` option, which enables switching between
the default pure vruntime-based scheduling (more optimized for server
workloads) and a deadline-based scheduling (better suited for
low-latency workloads).
When the low-latency mode is activated, a task's deadline is calculated
as its vruntime, adjusted by a bonus proportional to the task's average
number of voluntary context switches (the more voluntary context
switches, the shorter the deadline).
This feature enhances the prioritization of interactive tasks even more,
proportionally to their average voluntary context switches, also within
the two main global queues (priority / shared) and it helps to maintain
interactive workloads always responsive, even in presence of heavy
non-interactive background work.
Low-latency mode allows to prevent audio cracking even in presence of a
large amount of short-lived tasks with pseudo-interactive behavior (i.e,
hackbench) and it enables achieving approximately a +33% average
frames-per-second (FPS) in the typical "gaming while building the
kernel" benchmark.
However, it can also amplify the de-prioritization of CPU-intensive
tasks, making this option more suitable for specific low-latency
scenarios. Therefore the low-latency mode is disabled by default and it
can only be enabled via the `--lowlatency` option.
Tested-by: Piotr Gorski (piotrgorski@cachyos.org)
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Explicitly replenish the task's time slice from ops.dispatch() if the
task still wants to run and no other task is selected. In this way the
sched_ext core won't automatically re-schedule the task on the same CPU,
implicitly assigning a time slice of SCX_SLICE_DFL.
Moreover, instead of determining the task time slice in ops.enqueue(),
refresh the time slice immediately before the task is started on its
assigned CPU in ops.running().
This allows to use a more precise time slice, adjusted based on the
actual amount of tasks that are currently waiting to be scheduled.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The meaning of SCX_OPS_ENQ_LAST will change with future kernel updates and
enqueueing on local DSQ will no longer be sufficient to avoid stalls. No
reason to do it anyway. Just drop it.
With the global scx_utils::NR_CPU_IDS we don't need Topology anymore in
init_primary_domain(), so drop the variable to fix the following build
warning:
warning: unused variable: `topo`
--> src/main.rs:385:9
|
385 | topo: &Topology,
| ^^^^ help: if this is intentional, prefix it with an underscore: `_topo`
|
= note: `#[warn(unused_variables)]` on by default
Fixes: 1da249f ("scx_utils::topology: Always use NR_CPU_IDS and NR_CPUS_POSSIBLE")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Use the base frequency, instead of maximum frequency, to classify fast
and slow CPUs. This ensures accurate distinction between Intel Turbo
Boost CPUs and genuinely faster CPUs when auto-detecting the primary
scheduling domain.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>