The update_tasks() API is somewhat confusing, so replace it with a
clearer API, notify_complete().
This new API will return control to the BPF component and inform it
about the number of tasks still pending in the user-space scheduler.
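A minimal sketch of how the user-space dispatch loop might use such an
API (BpfShim and the exact call shape are illustrative, not the real
bindings):

  use std::collections::VecDeque;

  // Stand-in for the real BPF connector; only the shape of the call matters.
  struct BpfShim;

  impl BpfShim {
      fn notify_complete(&mut self, nr_pending: u64) {
          // The real API returns control to the BPF component and reports
          // how many tasks are still queued in the user-space scheduler.
          println!("notify_complete: {nr_pending} task(s) still pending");
      }
  }

  fn main() {
      let mut bpf = BpfShim;
      let mut queue: VecDeque<i32> = VecDeque::from([101, 102, 103]);

      // Dispatch one task, then hand control back to BPF with the backlog size.
      let _task = queue.pop_front();
      bpf.notify_complete(queue.len() as u64);
  }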
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The low-power API is a bit of a hack implemented purely in the BPF
layer; it should be re-implemented with some concept of topology
awareness.
Therefore, get rid of this API for now.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The current API used to notify the user-space scheduler when a task
exits is really confusing (setting a negative value in
queued_task_ctx.cpu), and it's also possible to detect task exit
events from user-space (or by checking procfs, even if that's slower).
In any case, a better API should be provided for this, so drop the
current one for now.
NOTE: this will cause additional memory usage for scx_rustland, but it
can be fixed/addressed later in a separate commit (e.g., by providing
a periodic garbage collector for the unused task entries).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Allow the user-space scheduler to pick an idle CPU via
self.bpf.select_cpu(pid, prev_task, flags), mimicking the BPF
select_cpu() interface.
Also remove the full_user option and always rely on the idle selection
logic from user-space.
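A rough usage sketch (BpfShim stands in for the real binding; only the
call shape mirrors self.bpf.select_cpu()):

  // Stand-in for the real binding; only the call shape matters here.
  struct BpfShim;

  impl BpfShim {
      fn select_cpu(&mut self, _pid: i32, prev_cpu: i32, _flags: u64) -> i32 {
          // The real call asks the BPF side for an idle CPU (mimicking the
          // kernel select_cpu() semantics) and returns a negative value
          // when no idle CPU is available.
          prev_cpu // pretend the previously used CPU is idle
      }
  }

  fn main() {
      let mut bpf = BpfShim;
      let cpu = bpf.select_cpu(1234, 2, 0);
      if cpu >= 0 {
          println!("dispatch pid 1234 on idle CPU {cpu}");
      } else {
          println!("no idle CPU: keep the task queued in user space");
      }
  }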
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Keep evaluating the average number of voluntary context switches for
each task when lowlatency mode is enabled, even when interactive task
classification is disabled (via `-c 0`).
The average nvcsw is also used in lowlatency mode to evaluate the
proportional bonus applied to a task's deadline, so it shouldn't be
ignored when interactive task classification is disabled. Moreover,
make sure that such a bonus never exceeds the starvation threshold.
Keep in mind that it is still possible to disable the periodic average
nvcsw evaluation with `-c 0`, without specifying `--lowlatency`.
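A Rust-flavored sketch of the intended condition (names such as
nvcsw_max_thresh are illustrative; the real check lives in the
scheduler code):

  fn nvcsw_enabled(lowlatency: bool, nvcsw_max_thresh: u64) -> bool {
      // Keep evaluating nvcsw if lowlatency is on, even with `-c 0`;
      // only `-c 0` *without* --lowlatency disables it.
      lowlatency || nvcsw_max_thresh > 0
  }

  fn main() {
      assert!(nvcsw_enabled(true, 0));   // --lowlatency -c 0: still evaluated
      assert!(!nvcsw_enabled(false, 0)); // -c 0 alone: disabled
      println!("ok");
  }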
Fixes: 6a22853 ("scx_bpfland: introduce --lowlatency option")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
A lot of scx_lavd's options do not clearly explain what they do. Add
some short explanations, clean up the existing ones, and direct the user
to read the in-code documentation for more info.
Move related ops into it. This is a bit more natural and will also allow
doing other operations (e.g., describing stats) without launching the
server.
Make `--primary-domain auto` aware of turbo boosted CPUs and prioritize
them over the primary scheduling domain when the energy model
`balance_power` is used (typically when running on battery power with
the "balanced" profile).
With this change the scheduling hierarchy becomes the following:
1) CPUs in the turbo scheduling domain
2) CPUs in the primary scheduling domain
3) full-idle SMT CPUs
4) CPUs in the same L2 cache
5) CPUs in the same L3 cache
6) CPUs in the task's allowed domain
The idle selection logic is modified as follows (a simplified sketch
follows the list):
- In the turbo scheduling domain:
- pick same full-idle SMT CPU
- pick any other full-idle SMT CPU sharing the same L2 cache
- pick any other full-idle SMT CPU sharing the same L3 cache
- pick any other full-idle SMT CPU
- pick same idle CPU
- pick any other idle CPU sharing the same L2 cache
- pick any other idle CPU sharing the same L3 cache
- pick any other idle SMT CPU
- In the primary scheduling domain:
- pick same full-idle SMT CPU
- pick any other full-idle SMT CPU sharing the same L2 cache
- pick any other full-idle SMT CPU sharing the same L3 cache
- pick any other full-idle SMT CPU
- pick same idle CPU
- pick any other idle CPU sharing the same L2 cache
- pick any other idle CPU sharing the same L3 cache
- pick any other idle SMT CPU
- In the entire task domain:
- pick any other idle CPU
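A simplified sketch of the resulting ordering, assuming a plain ordered
scan over the scheduling domains (the real logic is in the BPF code and
also distinguishes SMT siblings and cache levels):

  fn pick_idle_cpu(domains: &[&[i32]], idle: &dyn Fn(i32) -> bool) -> Option<i32> {
      for dom in domains {
          if let Some(&cpu) = dom.iter().find(|&&cpu| idle(cpu)) {
              return Some(cpu);
          }
      }
      None // no idle CPU: the task waits for the first CPU that frees up
  }

  fn main() {
      let turbo = [4, 5];
      let primary = [0, 1, 2, 3, 4, 5];
      let all_cpus = [0, 1, 2, 3, 4, 5, 6, 7];
      // Domains in priority order: turbo, primary, task's allowed CPUs.
      let domains: [&[i32]; 3] = [&turbo, &primary, &all_cpus];
      let idle = |cpu: i32| cpu == 2 || cpu == 7;
      println!("{:?}", pick_idle_cpu(&domains, &idle));
  }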
Keep in mind that the turbo domain will be evaluated only when the
scheduler is started with `--primary-domain auto` and only when the
`balance_power` energy profile is used.
The turbo domain is always made using the subset of CPUs in the system
with the highest max frequency. If such a subset can't be determined
(for example, if all the CPUs in the primary domain have the same
frequency), the turbo domain will be ignored.
Prioritizing turbo boosted CPUs can help improve performance by
forcing the governor to scale up their frequency, without increasing
power consumption too much, since tasks will preferably be confined
to a reduced number of cores.
This change seems to improve performance, without much increase in
power consumption, on Intel laptops using the `balance_power` energy
profile.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Introduce the new option `--primary-domain auto`. With this option the
scheduler dynamically adjusts the primary scheduling domain at
run-time, based on the current energy profile reported in
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference.
When the `power` energy profile is selected, the primary scheduling
domain will prioritize E-cores. Alternatively, when the `performance`
profile is selected, it will prioritize P-cores. For all the other
energy profiles, all the CPUs in the system will be used.
Note that this option is only relevant on hybrid architectures with
P-cores and E-cores.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Introduce the new `--lowlatency` option, which enables switching between
the default pure vruntime-based scheduling (more optimized for server
workloads) and deadline-based scheduling (better suited for
low-latency workloads).
When the low-latency mode is activated, a task's deadline is calculated
as its vruntime, adjusted by a bonus proportional to the task's average
number of voluntary context switches (the more voluntary context
switches, the shorter the deadline).
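A minimal sketch of the idea, assuming a deadline of the form
vruntime - bonus(avg_nvcsw); constants and scaling are illustrative,
not the actual implementation:

  const SLICE_NS: u64 = 5_000_000; // assumed base slice

  fn task_deadline(vruntime: u64, avg_nvcsw: u64) -> u64 {
      // Proportional bonus (assumed scale): more voluntary context
      // switches => larger bonus => earlier deadline.
      let bonus = avg_nvcsw * SLICE_NS / 128;
      vruntime.saturating_sub(bonus)
  }

  fn main() {
      // An interactive task (many nvcsw) gets an earlier deadline than a
      // CPU-bound task with the same vruntime.
      println!("interactive: {}", task_deadline(1_000_000_000, 100));
      println!("cpu-bound:   {}", task_deadline(1_000_000_000, 1));
  }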
This feature further enhances the prioritization of interactive tasks,
proportionally to their average voluntary context switches, also within
the two main global queues (priority / shared), and it helps keep
interactive workloads responsive even in the presence of heavy
non-interactive background work.
Low-latency mode helps prevent audio crackling even in the presence of
a large number of short-lived tasks with pseudo-interactive behavior
(e.g., hackbench) and it achieves approximately +33% average
frames per second (FPS) in the typical "gaming while building the
kernel" benchmark.
However, it can also amplify the de-prioritization of CPU-intensive
tasks, making this option more suitable for specific low-latency
scenarios. Therefore the low-latency mode is disabled by default and it
can only be enabled via the `--lowlatency` option.
Tested-by: Piotr Gorski (piotrgorski@cachyos.org)
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Explicitly replenish the task's time slice from ops.dispatch() if the
task still wants to run and no other task is selected. In this way the
sched_ext core won't automatically re-schedule the task on the same CPU,
implicitly assigning a time slice of SCX_SLICE_DFL.
Moreover, instead of determining the task time slice in ops.enqueue(),
refresh the time slice immediately before the task is started on its
assigned CPU in ops.running().
This allows using a more precise time slice, adjusted based on the
actual number of tasks that are currently waiting to be scheduled.
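A Rust-flavored sketch of the slice scaling idea (the real refresh
happens in the BPF callbacks; constants and the exact formula are
assumptions):

  const SLICE_NS: u64 = 5_000_000;   // assumed maximum slice
  const SLICE_NS_MIN: u64 = 500_000; // assumed minimum slice

  fn task_slice(nr_waiting: u64) -> u64 {
      // The more tasks are waiting, the shorter the slice, never below
      // the configured minimum.
      (SLICE_NS / (nr_waiting + 1)).max(SLICE_NS_MIN)
  }

  fn main() {
      for nr in [0, 3, 100] {
          println!("{nr} waiting -> slice {} ns", task_slice(nr));
      }
  }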
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The meaning of SCX_OPS_ENQ_LAST will change with future kernel updates and
enqueueing on local DSQ will no longer be sufficient to avoid stalls. No
reason to do it anyway. Just drop it.
With the global scx_utils::NR_CPU_IDS we don't need Topology anymore in
init_primary_domain(), so drop the variable to fix the following build
warning:
warning: unused variable: `topo`
--> src/main.rs:385:9
|
385 | topo: &Topology,
| ^^^^ help: if this is intentional, prefix it with an underscore: `_topo`
|
= note: `#[warn(unused_variables)]` on by default
Fixes: 1da249f ("scx_utils::topology: Always use NR_CPU_IDS and NR_CPUS_POSSIBLE")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Use the base frequency, instead of maximum frequency, to classify fast
and slow CPUs. This ensures accurate distinction between Intel Turbo
Boost CPUs and genuinely faster CPUs when auto-detecting the primary
scheduling domain.
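An illustrative sketch of the classification, assuming a simple rule
where CPUs reporting the lowest base frequency form the slow group and
everything else is considered fast:

  fn classify_by_base_freq(base_freq_khz: &[u64]) -> (Vec<usize>, Vec<usize>) {
      let lowest = base_freq_khz.iter().copied().min().unwrap_or(0);
      let mut fast = Vec::new();
      let mut slow = Vec::new();
      for (cpu, &freq) in base_freq_khz.iter().enumerate() {
          if freq > lowest { fast.push(cpu) } else { slow.push(cpu) }
      }
      (fast, slow)
  }

  fn main() {
      // Using the base frequency avoids treating a turbo-boosted CPU as if
      // it were a genuinely faster core.
      let (fast, slow) = classify_by_base_freq(&[2900, 2900, 2200, 2200]);
      println!("fast={fast:?} slow={slow:?}");
  }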
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
- Update scx_utils/build.rs so that a 12-character SHA1 is generated
instead of the full one.
- Add --version to scx_rusty. Use a custom one as we don't want to use
the default cargo version.
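A hypothetical build.rs sketch of the short-SHA generation described
above (the git invocation and the environment variable name are
illustrative):

  use std::process::Command;

  fn main() {
      let sha = Command::new("git")
          .args(["rev-parse", "--short=12", "HEAD"])
          .output()
          .ok()
          .and_then(|o| String::from_utf8(o.stdout).ok())
          .map(|s| s.trim().to_string())
          .unwrap_or_else(|| "unknown".to_string());
      // Exposed to the crate at compile time as env!("SCX_GIT_SHA")
      // (the variable name is made up for this example).
      println!("cargo:rustc-env=SCX_GIT_SHA={sha}");
  }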
Tasks enqueued with SCX_ENQ_WAKEUP are immediately classified as
interactive. However, if interactive task classification is disabled
(via `-c 0`), we should avoid promoting them to interactive.
This is particularly important because, with the nvcsw logic disabled,
tasks can remain classified as interactive indefinitely and they will
never be demoted to regular tasks.
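A Rust-flavored sketch of the intended check (the real logic is in the
BPF enqueue path; the flag value and names are illustrative):

  const SCX_ENQ_WAKEUP: u64 = 1; // illustrative flag value

  fn is_interactive(enq_flags: u64, nvcsw_max_thresh: u64) -> bool {
      // Only promote wakeup tasks when interactive classification is
      // enabled (i.e. not `-c 0`).
      nvcsw_max_thresh > 0 && (enq_flags & SCX_ENQ_WAKEUP) != 0
  }

  fn main() {
      assert!(is_interactive(SCX_ENQ_WAKEUP, 10));
      assert!(!is_interactive(SCX_ENQ_WAKEUP, 0)); // `-c 0`: never promote
      println!("ok");
  }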
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Rely on scx_utils::Cpumask instead of re-implementing a custom struct to
parse and manage CPU masks.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Fix a bug introduced in #510 where it was assumed that core ids are
incremental. This refactors the core ordering for layers to be far
simpler and provides some space for layer core isolation at low
utilization.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Rely on scx_utils::Topology to get CPU and cache information, instead of
re-implementing custom methods.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Currently the core selection logic in scx_layered uses the first
available core in the bitmask. This is suboptimal when the scheduler is
configured with specific NUMA/LLC restrictions. The ideal core selection
logic should try to find the least used cores within the preferred
scheduling domain and allocate new CPUs from shared cores within that
domain.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
- If --monitor is specified with layer specs, the scheduler also starts
stats monitoring on a thread.
- Standalone monitoring mode no longer exits when the scheduler isn't there.
Instead of keeping one copy of sched_stats, each stats server session
carries its own so that stats can be generated independently by each
client at any interval. CPU allocation min/max tracking is broken for now.
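A hypothetical sketch of the per-session model: each client keeps its
own copy of the counters and computes deltas at its own interval:

  #[derive(Clone, Default)]
  struct SchedStats { nr_dispatches: u64 }

  struct Session { prev: SchedStats }

  impl Session {
      fn report(&mut self, current: &SchedStats) -> u64 {
          // Delta against this session's own snapshot, independent of
          // what other clients have read.
          let delta = current.nr_dispatches - self.prev.nr_dispatches;
          self.prev = current.clone();
          delta
      }
  }

  fn main() {
      let mut a = Session { prev: SchedStats::default() };
      let mut b = Session { prev: SchedStats::default() };
      let now = SchedStats { nr_dispatches: 42 };
      // Two clients reading at different times see independent deltas.
      println!("a={} b={}", a.report(&now), b.report(&now));
  }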
The primary scheduling domain represents a group of CPUs in the system
where the scheduler will initially attempt to assign tasks. Tasks will
only be dispatched to CPUs within this primary domain until they are
fully utilized, after which tasks may overflow to other available CPUs.
The primary scheduling domain can be defined using the option
`--primary-domain CPUMASK` (by default all the CPUs in the system are
used as the primary domain).
This change introduces two new special values for the CPUMASK argument:
- `performance`: automatically detect the fastest CPUs in the system
and use them as primary scheduling domain,
- `powersave`: automatically detect the slowest CPUs in the system and
use them as primary scheduling domain.
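A hypothetical sketch of how the special values could be resolved
(names and the u64 mask are illustrative; the real mask can be of
arbitrary length):

  fn resolve_primary_domain(arg: &str, fast_cpus: u64, slow_cpus: u64) -> Option<u64> {
      match arg {
          "performance" => Some(fast_cpus), // fastest CPUs only
          "powersave" => Some(slow_cpus),   // slowest CPUs only
          _ => None, // fall back to parsing the argument as a hex CPUMASK
      }
  }

  fn main() {
      // Detected masks would come from the CPU frequency classification.
      let (fast, slow) = (0x00fffu64, 0xff000u64);
      assert_eq!(resolve_primary_domain("performance", fast, slow), Some(0x00fff));
      assert_eq!(resolve_primary_domain("powersave", fast, slow), Some(0xff000));
      println!("ok");
  }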
The current logic only supports creating two groups: fast and slow CPUs.
The fast CPU group is created by excluding CPUs with the lowest
frequency from the overall set, which means that within the fast CPU
group, CPUs may have different maximum frequencies.
When using the `performance` mode the fast CPUs will be used as primary
domain, whereas in `powersave` mode, the slow CPUs will be used instead.
This option is particularly useful in hybrid architectures (with P-cores
and E-cores), as it allows the use of bpfland to prioritize task
scheduling on either P-cores or E-cores, depending on the desired
performance profile.
Example:
- Dell Precision 5480
- CPU: 13th Gen Intel(R) Core(TM) i7-13800H
- P-cores: 0-11 / max freq: 5.2GHz
- E-cores: 12-19 / max freq: 4.0GHz
$ scx_bpfland --primary-domain performance
0[||||||||| 24.5%] 10[|||||||| 22.8%]
1[|||||| 14.9%] 11[||||||||||||| 36.9%]
2[|||||| 16.2%] 12[ 0.0%]
3[||||||||| 25.3%] 13[ 0.0%]
4[||||||||||| 33.3%] 14[ 0.0%]
5[|||| 9.9%] 15[ 0.0%]
6[||||||||||| 31.5%] 16[ 0.0%]
7[||||||| 17.4%] 17[ 0.0%]
8[|||||||| 23.4%] 18[ 0.0%]
9[||||||||| 26.1%] 19[ 0.0%]
Avg power consumption: 3.29W
$ scx_bpfland --primary-domain powersave
0[| 2.5%] 10[ 0.0%]
1[ 0.0%] 11[ 0.0%]
2[ 0.0%] 12[|||| 8.0%]
3[ 0.0%] 13[||||||||||||||||||||| 64.2%]
4[ 0.0%] 14[|||||||||| 29.6%]
5[ 0.0%] 15[||||||||||||||||| 52.5%]
6[ 0.0%] 16[||||||||| 24.7%]
7[ 0.0%] 17[|||||||||| 30.4%]
8[ 0.0%] 18[||||||| 22.4%]
9[ 0.0%] 19[||||| 12.4%]
Avg power consumption: 2.17W
(Info collected from htop and turbostat)
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
While the system is not saturated the scheduler will use the following
strategy to select the next CPU for a task:
- pick the same CPU if it's a full-idle SMT core
- pick any full-idle SMT core in the primary scheduling group that
shares the same L2 cache
- pick any full-idle SMT core in the primary scheduling group that
shares the same L3 cache
- pick the same CPU (ignoring SMT)
- pick any idle CPU in the primary scheduling group that shares the
same L2 cache
- pick any idle CPU in the primary scheduling group that shares the
same L3 cache
- pick any idle CPU in the system
When the system is completely saturated (no idle CPUs available), tasks
will be dispatched on the first CPU that becomes available.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
This option chooses little (efficiency) cores over big (performance)
cores to save power consumption for core compaction.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The changes include 1) chopping down a big function into smaller ones
for readability and maintainability and 2) using the interior mutability
pattern (Cell and RefCell) to avoid unnecessary clone() calls. There
are no functional changes.
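A tiny, generic illustration of the interior-mutability pattern (not
the actual scx_lavd code): a RefCell lets shared state be mutated
through a shared reference, avoiding defensive clone() calls:

  use std::cell::RefCell;

  struct Scheduler {
      run_queue: RefCell<Vec<i32>>,
  }

  impl Scheduler {
      fn enqueue(&self, pid: i32) {
          // &self is enough; no need to clone the queue or take &mut self.
          self.run_queue.borrow_mut().push(pid);
      }
  }

  fn main() {
      let sched = Scheduler { run_queue: RefCell::new(Vec::new()) };
      sched.enqueue(42);
      println!("{:?}", sched.run_queue.borrow());
  }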
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Fix the uninitialized variable "layer" in the function match_layer,
which caused compilation to fail. "layer" is supposed to be the same
as "&layers[layer_id]".
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
- Allow no-value user attributes which are automatically assigned "true"
when specified.
- Make "top" attribute string "true" instead of bool true for consistency.
Testing for existence is always enough for value-less attributes.
- Don't drop leading "_" from user attribute names when storing in dicts.
Dropping makes things more confusing.
- Add "_om_skip" to scx_layered fields which don't jive well with OM.
scxstats_to_openmetrics.py is updated accordingly and no longer generates
warnings on those fields.
- Examples and README updated accordingly.
After updating scx_layered to be topology aware the nr_cpus field on the
layer was not being updated properly. Update layer growing/shrinking
logic to correctly update the nr_cpus count.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
- This makes the scheduler side simpler and allows on-demand monitoring.
- OpenMetrics support is dropped for now. Will add a generic tool for it.
- This is a naive conversion. Will be further refined.
scx_layered no longer prints statistics by default. To watch statistics, run
`scx_layered --monitor` while the scheduler is running.
Allow specifying a primary scheduling domain via the new command line
option `--primary-domain CPUMASK`, where CPUMASK can be a hex number of
arbitrary length, representing the CPUs assigned to the domain.
If this option is not specified the scheduler will use all the available
CPUs in the system as primary domain (no behavior change).
Otherwise, if a primary scheduling domain is defined, the scheduler will
try to dispatch tasks only to the CPUs assigned to the primary domain,
until these CPUs are saturated, at which point tasks may overflow to
other available CPUs.
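An illustrative parser for a hex CPUMASK of arbitrary length (not the
actual implementation), returning the set of CPU ids whose bit is set
(bit 0 = CPU 0):

  fn parse_cpumask(s: &str) -> Result<Vec<usize>, String> {
      let hex = s.trim_start_matches("0x");
      let mut cpus = Vec::new();
      // Walk the hex digits from least to most significant.
      for (i, c) in hex.chars().rev().enumerate() {
          let nibble = c.to_digit(16).ok_or_else(|| format!("invalid digit: {c}"))?;
          for bit in 0..4usize {
              if nibble & (1 << bit) != 0 {
                  cpus.push(i * 4 + bit);
              }
          }
      }
      Ok(cpus)
  }

  fn main() {
      // 0xff000 selects CPUs 12..=19 (e.g. the E-cores in the example below).
      println!("{:?}", parse_cpumask("0xff000").unwrap());
  }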
This feature can be used to prioritize certain cores over others and it
can be really effective in systems with heterogeneous cores (e.g.,
hybrid systems with P-cores and E-cores).
== Example (hybrid architecture) ==
Hardware:
- Dell Precision 5480 with 13th Gen Intel(R) Core(TM) i7-13800H
- 6 P-cores 0..5 with 2 CPUs each (CPU from 0..11)
- 8 E-cores 6..13 with 1 CPU each (CPU from 12..19)
== Test ==
WebGL application (https://webglsamples.org/aquarium/aquarium.html):
this generates a steady workload in the system without
over-saturating the CPUs.
Use different scheduler configurations:
- EEVDF (default)
- scx_bpfland using P-cores only (--primary-domain 0x00fff)
- scx_bpfland using E-cores only (--primary-domain 0xff000)
Measure performance (fps) and power consumption (W).
== Result ==
+-----------------+---------+---------+---------+-------+-------+
|                 | min fps | max fps | avg fps | stdev | power |
+-----------------+---------+---------+---------+-------+-------+
| EEVDF           |      28 |      34 |    31.0 |  1.73 |  3.5W |
| bpfland-p-cores |      33 |      34 |    33.5 |  0.29 |  3.5W |
| bpfland-e-cores |      25 |      26 |    25.5 |  0.29 |  2.2W |
+-----------------+---------+---------+---------+-------+-------+
Using a primary scheduling domain of only P-cores with scx_bpfland
achieves a more stable and predictable level of performance, with an
average of 33.5 fps and an error of ±0.5 fps.
In contrast, using EEVDF results in an average frame rate of 31.0 fps
with an error of ±3.0 fps, indicating slightly less consistency, due to
the fact that tasks are evenly distributed across all the cores in the
system (both slow and fast cores).
On the other hand, using a scheduling domain solely of E-cores with
scx_bpfland results in a lower average frame rate (25.5 fps), though
performance remains stable (error of ±0.5 fps), and power consumption
is also reduced, averaging 2.2W, compared to 3.5W with either of the
other configurations.
== Conclusion ==
In summary, with this change users have the flexibility to prioritize
scheduling on performance cores for better performance and consistency,
or prioritize energy efficient cores for reduced power consumption, on
hybrid architectures.
Moreover, this feature can also be used to minimize the number of cores
used by the scheduler, until they reach full capacity. This capability
can be useful for reducing power consumption even in homogeneous systems
or for conducting scheduling experiments with smaller sets of cores,
provided the system is not overcommitted.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Abbreviate the statistics reported to stdout and remove the slice_ms
metric: this metric can be easily derived from slice_ns, slice_ns_min
and nr_wait, which are already reported to stdout.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Layer matching currently takes a large number of bpf instructions.
Moving layer matching to a global function will reduce the overall
instruction count and allow for other layer matching methods such as
glob.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Put a performance-critical task into a performance-critical queue and
a regular task into a regular queue.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The task filtering logic was moved from find_first_candidate() into a
vector filter operation in commit 1c3b563. However, the "NOT" in the
condition was not carried over: .filter() now keeps the tasks we want,
whereas .skip_while() was throwing unwanted tasks away.
That's why the logic here should be reversed, so we won't take kworkers
or already-migrated tasks into consideration.
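A minimal illustration of the inversion (types are made up):
.skip_while() dropped unwanted tasks, while .filter() keeps wanted
ones, so the predicate must be negated:

  #[derive(Debug)]
  struct Task { pid: i32, is_kworker: bool, migrated: bool }

  fn main() {
      let tasks = vec![
          Task { pid: 1, is_kworker: true,  migrated: false },
          Task { pid: 2, is_kworker: false, migrated: false },
          Task { pid: 3, is_kworker: false, migrated: true  },
      ];
      // Keep only tasks that are NOT kworkers and NOT already migrated.
      let candidates: Vec<_> = tasks
          .iter()
          .filter(|t| !(t.is_kworker || t.migrated))
          .collect();
      println!("{candidates:?}");
  }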
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
The member "topo_map" in Scheduler is never used and thus should be
removed, the related imports are removed as well.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Re-add the partial mode option that was dropped during the refactoring.
The partial option allows applying the scheduler only to the tasks that
have their scheduling policy set to SCHED_EXT via sched_setscheduler().
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The API for determining which PID is running on a specific CPU is racy
and is unnecessary since this information can be obtained from user
space.
Additionally, it's not reliable for identifying idle CPUs. Therefore,
it's better to remove this API and, in the future, provide a cpumask
alternative that can export the idle state of the CPUs to user space.
As a consequence, also change scx_rustland to dispatch one task at a
time, instead of dispatching tasks in batches of idle cores (which are
usually not accurate due to the racy nature of the CPU ownership
interface).
Dispatching one task at a time even makes the scheduler more performant,
because vruntime scheduling is applied to more tasks sitting in the
scheduler's queue.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Drop the slice boost logic and apply a vruntime and task time slice
evaluation approach similar to scx_bpfland (but implement this in the
user-space component instead of the BPF part).
Additionally, introduce a slice_us_min parameter to define the minimum
time slice that can be assigned to a task, also similar to scx_bpfland.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>