When a pinned task cannot run on either active or overflow sets, we try
to stay on the previous CPU which is still okay to run on.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Bump up scx_rustland_core version to include this critical fix that
allows to prevent scheduler stalls:
94a3594 ("scx_rustland_core: always dispatch per-cpu kthreads directly")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
In auto mode, rather than keeping the previous fixed cpuperf factor,
dynamically calculate it based on CPU utilization and apply it before a
task runs within its allocated time slot.
Interactive tasks consistently receive the maximum scaling factor to
ensure optimal performance.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Always consider the turbo domain when running in "auto" mode.
Additionally, when the turbo domain is used, split the CPU idle
selection logic into two stages:
1) in ops.select_cpu(), provide the task with a second opportunity to
remain within the same LLC
2) in ops.enqueue(), perform another check for an idle CPU, allowing
the task to move to a different LLC if an idle CPU within the same
LLC is not available.
This allows tasks to stick more on turbo-boosted CPUs and CPUs within
the same LLC.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When tasks are changing CPU affinity it is pointless to try to find an
optimal idle CPU. In this case just skip the the idle CPU selection step
and let the task being dispatched to a global DSQ if needed.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Add hints for the cpufreq governor based on the selected scheduler's
performance profile and the current energy performance preference (EPP).
With this change applied the scheduler works as following:
scheduler profile (--primary-domain option):
- default:
- use all cores
- cpufreq: use default scaling factor
- powersave:
- use E-cores
- cpufreq: use min scaling factor
- performance:
- use P-cores
- cpufreq: use max scaling factor
- auto:
- EPP: power, powersave
- use E-cores
- cpufreq: use min scaling factor
- EPP: balance_power (typically battery-powered systems)
- use E-cores
- cpufreq: use default scaling factor
- EPP: balance_performance, performance
- use P-cores
- cpufreq: use max scaling factor
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
scx_rustland was originally designed as a PoC to showcase the benefits
of implementing specialized schedulers via sched_ext, focusing on a very
specific use case: prioritize game responsiveness regardless of what
runs in the background.
Its original design was subsequently modified to better serve as a
general-purpose scheduler, balancing the prioritization of interactive
tasks with CPU-intensive ones to prevent over-prioritization.
With scx_bpfland serving as a more "general-purpose" scheduler, it makes
sense to revisit scx_rustland's original goal and make it much more
aggressive at prioritizing interactive tasks, determined in function of
their average amount of context switches.
This change makes scx_rustland again a really good PoC to showcase the
benefits of having specialized schedulers, by focusing only at a very
specific use case: provide a high and stable frames-per-second (fps)
while a kernel build is running in the background.
= Results =
- Test: Run a WebGL application [1] while building the kernel (make -j32)
- Hardware: 8-cores Intel 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz
+----------------------+--------+--------+
| Scheduler | avg fps| stdev |
+----------------------+--------+--------+
| EEVDF | 28 | 4.00 |
| scx_rustland-before | 43 | 1.25 |
| scx_rustland-after | 60 | 0.25 |
+----------------------+--------+--------+
[1] https://webglsamples.org/aquarium/aquarium.html
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
An old BPF verifier does not allow calling bpf_cpumask_set_cpu() in the
BPF syscall context, so we defer actual bpf_cpumask_set_cpu() to the
timer handler, update_sys_stat(), to workaround the problem.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
If a task is performance-critical, pick_idle_cpu() checks if the
previous core is a big core or not. If not, don't try to run on previous
core since a performance-critical task is better to run on a big core.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
A single threshold for a low watermark does not work well across systems
with various numbers of cores and core types. Instead of using a single
low watermark value, we dynamically decide the low watermark: 1) until
one little core is fully utilized or 2) until two big cores are fully
utilized. This works better across systems.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The scx_rustland_core API has been redesigned recently, breaking the
compatibility with the past.
Considering that Rust crates should update their major version when the
previous API becomes incompatible [1], bump up the version to 2.0.0.
[1] https://semver.org/
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
We want to directly dispatch only kthreads when local_kthreads is
enabled, not all tasks that can run on a single CPU.
Fixes: 7cc1846 ("scx_bpfland: always rely on prev_cpu with single-CPU tasks")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When the no_freq_scaling changes during runtime in the autopilot mode,
the last target freq set would not be 1024. So the performance mode
enabled by the autopilot mode would not run in the best profile. Hence,
we set the target freq to 1024 always when no_freq_scaling is set.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
This allows scx_rustland to avoid generating excessive logs for
statistics while still allowing detailed monitoring on demand.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Add "--autopilot" option and mode. In the autopilot mode, the scheduler
dynamically changes its power mode according to system's load (cpu
utilization). When the cpu utilization is low enough (say <=5%), it
switches to the powersave mode since there is nothing to process fast so
powersaving is the primary goal. When the utilization is moderate (say
>5%, <=30%), it runs in balanced mode. When the utilization is high
enough (say >30%), it runs in performance mode.
Note that it only changes scheduler's power mode but it does not change
system's energy profile.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a cpu is idle for a whole interval, its idle time does not
correctlyh adds up so the utilization of such cpu tends to be higher
than the actual utilization. Now it is fixedk, so cpu utilization
becomes more accurate.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When the power mode changes back to performance mode, we should
active/overflow cpumask to its initial state -- all big cores are in
active cpumask and all little cores are in overflow cpumask. Otherwise,
the active/overflow cpumasks will be used in the perfformance mode.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
If a task can be run only on a single cpu, we don't need to go through
all the steps in ops.select_cpu(). Instread, we simply check if a task
is still pinned on the prev_cpu and go.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When selecting an idle for tasks that can only run on a single CPU,
always check if the previously used CPU is sill usable, instead of
trying to figure out the single allowed CPU looking at the task's
cpumask.
Apparently, single-CPU tasks can report a prev_cpu that is not in the
allowed cpumask when they rapidly change affinity.
This could lead to stalls, because we may end up dispatching the kthread
to a per-CPU DSQ that is not compatible with its allowed cpumask.
Example:
kworker/u32:2[173797] triggered exit kind 1026:
runnable task stall (kworker/2:1[70] failed to run for 7.552s)
...
R kworker/2:1[70] -7552ms
scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
sticky/holding_cpu=-1/-1 dsq_id=0x8 dsq_vtime=234483011369
cpus=04
In this case kworker/2 can only run on CPU #2 (cpus=0x4), but it's
dispatched to dsq_id=0x8, that can only be consumed by CPU 8 => stall.
To prevent this, do not try to figure out the best idle CPU for tasks
that are changing affinity and just dispatch them to a global DSQ
(either priority or regular, depending on its interactive state).
Moreover, introduce an explicit error check in dispatch_direct_cpu() to
improve detection of similar issues in the future, and drop
lookup_task_ctx() in favor of try_lookup_task_ctx(), since we can now
safely handle all the cases where the task context is not found.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
It checkes the EPP (energy performance preference) peirodically and sets
the power profile of the scheduler during runtiime as a user changes its
EPP profile (from her desktop UI).
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Add a per layer config for different implementations of layer growth
algorithms. Convert the existing default logic into a default layer
growth algorithm and add a linear implementation.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Refactor some BPF code to make verification easier on older kernels.
This is to make it easier to maintain backports.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Aggressively try to keep tasks running on the same CPU / cache / domain,
to achieve higher performance when the system is not over commissioned.
This is done by giving a second chance in ops.enqueue(), in addition to
ops.select_cpu(), to find an idle CPU close to the previously used CPU.
Moreover, even if the task is dispatched to the global DSQs, always try
to check if there is an idle CPU in the primary domain that can
immediately consume the task.
= Results =
This change seems to provide a minor, but consistent, boost of
performance with the CPU-intensive benchmarks from the CachyOS
benchmarks selection [1].
Similar results can also be noticed with some WebGL benchmarks [2], when
system usage is close to its maximum capacity.
Test:
- cachyos-benchmarker
System:
- AMD Ryzen 7 5800X 8-Core Processor
Metrics:
- total time: elapsed time of all benchmarks
- total score: geometric mean of all benchmarks
NOTE: total time is the most relevant, since it gives a measure of the
aggregate performance, while the total score emphasizes more on
performance consistency across all benchmarks.
== Results: summary ==
+-------------------------+---------------------+---------------------+
| Scheduler | Total Time | Total Score |
| | (less = better) | (less = better) |
+-------------------------+---------------------+---------------------+
| EEVDF | 624.44 sec | 123.68 |
| bpfland | 625.34 sec | 122.21 |
| bpfland-task-affinity | 623.67 sec | 122.27 |
+-------------------------+---------------------+---------------------+
== Conclusion ==
With this patch applied, bpfland shows both a better performance and
consistency. Although the gains are small (less than 1%), they are still
significant for this type of benchmark and consistently appear across
multiple runs.
[1] https://github.com/CachyOS/cachyos-benchmarker
[2] https://webglsamples.org/aquarium/aquarium.html
Tested-by: Piotr Gorski < piotr.gorski@cachyos.org >
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When iterating neighbors, the existing code unnecessarily iterates all
the neighbors to the maximum even if there is no neighors. So the fix
escapes early when there is no neighbors.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Ctrl-c wasn't properly handled in the monitoring mode
(`--monitor-sched-samples`), so the scheduler could not be terminated by
pressing ctrl-c. The missing ctrl-c handling is added to the monitor
thread.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Rely on scx_utils::Topology to classify Big, Little and Turbo CPUs.
Moreover, support the special keyword "all" with --primary-domain to
include all the CPUs in the system (default).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Integrate the logic used by scx_bpfland to detect turbo-boosted cores in
Topology.
Also change the logic to detect Big/Little cores in function of
base_frequency, instead of scaling_max_freq, otherwise turbo-boosted
cores in homogeneous systems may be incorrectly classified as Big.
Moreover, introduce the following new methods to Cpu to check for the
core type:
- is_turbo(): return true if the CPU is Turbo, false otherwise
- is_big(): return true if the CPU is either Turbo or Big
- is_little(): return true if the CPU is Little
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When creating the turbo boost scheduling domain, we might use a full CPU
mask (selecting all possible CPUs) to indicate "do not prioritize turbo
boost CPUs" or when all CPUs have the same maximum frequency.
This approach works when the primary domain also contains all the CPUs,
as the complete overlap allows the CPU selection logic to ignore the
turbo boost domain and start picking CPUs directly from the primary
domain.
However, if the primary domain doesn't include all CPUs, the two domains
won't fully overlap, which can lead to the turbo boost domain
incorrectly including all CPUs, thereby negating the restrictions set by
the primary scheduling domain.
To resolve this, an empty CPU mask should be used for the turbo boost
domain when turbo boost CPUs aren't prioritized. If the turbo boost
domain is empty, it should be entirely bypassed, and the selection
should proceed directly to the primary domain.
Reported-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
With an ill combination of old kernel and old LLVM, the BPF verifier
incorrectly detects an infinite loop. After changing the loop with a
constant end, the old verifier can pass the code.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Refactor the code design to make it more suitable as a template for
implementing advanced scheduling policies.
In particular, create separate loops for task consumption and task
dispatching. This will make the scheduler easier to adapt as a
foundation for implementing more complex scheduling policies.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Now it checks an active cpumask within a previous core's compute domain
before checking the full active CPUs.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The BPF verifier in the old kernel gives up to analysis the nested loop
in the consume_task(). We reduce the loop less complex by reducing
LAVD_CPDOM_MAX_DIST from 6 to 4 in order to make the verifier happy.
Note that the theoretical maximum distance is 6 (numa > llc > core type)
but there is no such hardware today, hence reducing it to 6 should be
okay in next few years, when hopefully the verifier becomes smarter.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Updating nr_queue_task every runqueue operation is expensive and
unnecessary. So we do update every system state update interval and use
moving average, which is accurate enough.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Avoid to periodically read the current performance profile from
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference if
it's not available (i.e., with older CPUs or kernels without cpufreq).
This fixes issue #560.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
meson build script was building each rust sub-project under rust/ and
scheds/rust/ separately. This means that each rust project is built
independently which leads to a couple problems - 1. There are a lot of
shared dependencies but they have to be built over and over again for each
proejct. 2. Concurrency management becomes sad - we either have to unleash
multiple cargo builds at the same time possibly thrashing the system or
build one by one.
We've been trying to solve this from meson side in vain. Thankfully, in
issue #546, @vimproved suggested using cargo workspace which makes the
sub-projects share the same target directory and built together by the same
cargo instance while still allowing each project to behave independently for
development and publishing purposes.
Make the following changes:
- Create two cargo workspaces - one under rust/, the other under
scheds/rust/. Each contains all rust projects underneath it.
- Don't let meson descend into rust/. These are libraries used by the rust
schedulers. No need to build them from meson. Cargo will build them as
needed.
- Change the rust_scheds build target to invoke `cargo build` in
scheds/rust/ and let cargo do its thing.
- Remove per-scheduler meson.build files and instead generate custom_targets
in scheds/rust/meson.build which invokes `cargo build -p $SCHED`.
- This changes rust binary directory. Update README and
meson-scripts/install_rust_user_scheds accordingly.
- Remove per-scheduler Cargo.lock as scheds/rust/Cargo.lock is shared by all
schedulers now.
- Unify .gitignore handling.
The followings are build times on Ryzen 3975W:
Before:
________________________________________________________
Executed in 165.93 secs fish external
usr time 40.55 mins 2.71 millis 40.55 mins
sys time 3.34 mins 36.40 millis 3.34 mins
After:
________________________________________________________
Executed in 36.04 secs fish external
usr time 336.42 secs 0.00 millis 336.42 secs
sys time 36.65 secs 43.95 millis 36.61 secs
Wallclock time is reduced 5x and CPU time 7x.
Refactor the code to hide the shutdown handling inside BpfScheduler and
simply use the exited() method to check when the scheduler is stopped.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Three of the reported stats are cumulative. While they obviously can be
processed into delta values, that holds for the other direction too and the
cumulative values are difficult to make intutive sense of. Report interval
delta values instead.
Note that a stats client can reliably build back cumulative values even
under heavy system contention - the delta values reported between two
consecutive reads are guaranteed to be correct regardless of the duration of
the interval.
Use scx_stats instead of prometheus for stats reporting. This has a few
advantages:
- Stats metadata can be defined more succinctly.
- Natural support for nesting statistics which will be useful in making
scheduler components composable.
- Support for multiple programmable readers where each reader can use their
own reading interval.
- Built-in stats help message generation.
- Openmetrics integration is still available through
scx_stats/scripts/scxstats_to_openmetrics.py.
Let's make it a bit easier to use:
- Shorten exported names by changing the prefix from ScxStats to Stats. This
should be distinctive enough and more inline with how most libraries name
their exports.
- Importing the right set of traits can be tricky. Introduce prelude module
so that importing is a bit less painful.
There is no reason to have two separate options for "verbose" and
"debug" mode. Just merge the two and always use "debug". If enabled,
increase verbosity to stdout and enable reporting BPF scheduling events
in debugfs (e.g., /sys/kernel/debug/tracing/trace_pipe).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Since scx_rustland_core enables setting a time slice on a per-task basis
during task dispatch, there's no need to maintain a global time slice in
the BPF component. Instead, a global time slice can simply be managed in
user-space, achieving the same outcome.
Therefore, drop the global slice_us property from BpfScheduler to
simplify the API.
NOTE: if a time slice is not specified for a task, SCX_SLICE_DFL will be
used by default.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Add more comments to make the source code more understandable, so that
it can be easily used as a template for implementing more complex
scheduling policies.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Scheduling sample reporting is switched to use scx_stats. This makes the
scheduler run without making too much noise while still allowing monitoring
on demand. It can also make introspection more dynamic - e.g. it shouldn't
be difficult to add other monitoring commands which take scheduling samples
based on different criteria or add other types of staisitcs.
--nr_sched-samples is replaced with --monitor-nr-samples.