Refactor match_layer() to prevent the compiling error caused by
uninitialization of the variable "nr_match_ors" before usage.
Move the checking of "nr_match_ors" after it access the value within
"layer->nr_match_ors" to make sure it's initiailized successfully.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Pass enqueue flags to user-space: flags will be passed via
QueuedTask.flags and can be forwarded back to BPF via
DispatchedTask.flags.
These flags can be also passed to BpfScheduler.select_cpu() to apply a
more refined CPU selection policy.
Moreover, avoid to prioritize the user-space scheduler too much and
dispatch it only if there are no other tasks that needs to be dispatched
in ops.dispatch().
This improves CPU utilization and enhances the fairness, robustness, and
resilience of schedulers based on scx_rustland_core, particularly under
stress test conditions.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
On WAKE_SYNC attempt to migrate the wakee on the same CPU as the waker
if the waker is not exiting, the wakee can use the waker's CPU, the
waker's L3 domain is not saturated and there are not other tasks queued
to the local DSQ of the waker's CPU.
This is the same logic used in scx_rusty.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Using the turbo boosted CPUs as preferred scheduling seems to be
beneficial only a very few corner cases, for example on battery-powered
devices with an aggressive cpufreq governor that constantly tries to
scale down the frequency (and even in this case it's probably better to
not force the tasks to run on the fast CPUs, to save power).
In practive the preferred domain seems to introduce more overhead than
benefits overall, so let's get rid of it.
This can be improved in the future adding multiple user-configurable
scheduling domains.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Many kernel threads performs latency critical tasks (e.g., net, gpu). In
particular, AMD GPU driver runs the most part in the kernel space using
kworker. Hence, treat kernel threads as if a woken up task.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Initialize the node cpumask, which was previously uninitialized causing
metric calculations to be wrong when attempting to lookup CPUs in the
node cpumask.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Use `cargo fmt` with a specific nightly branch in the CI to enforce formatting. Globally format these files while the diff is still small so we can stay on top of it.
Test plan:
- CI lint check passes.
The domains are added to the aggregator when load is added (and
duty_cycle is not 0.0f64).
This commit makes sure that all domains are added to the aggregator even
when the calculated duty_cycle is 0.
Signed-off-by: Fredrik Lönnegren <fredrik@frelon.se>
Pass in the layer spec when determining the layer core growth algo. This
should make it easier to implement layer growth algos that are spec
specific.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Using p->scx.slice to evaluate the consumed time slice can be a bit
imprecise, because the sched_ext core implements yielding by setting
p->scx.slice to 0.
When the task's vruntime is evaluated this is considered as the task has
exhausted its entire allocated time slice, even though it voluntarily
released the CPU before the slice fully expired.
To avoid this inaccuracy and prevent penalizing tasks that voluntarily
release the CPU, always evaluate the used time slice based on the
difference in the task's total execution time (p->se.sum_exec_runtime).
This method provides a more precise calculation of vruntime and results
in a fairer task's deadline evaluation.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Rust build was using two separate workspaces - rust/ and scheds/rust.
There's no reason to separate them and it makes doc generation tricky. Use
single top level workspace so that we can drive all rust building from
cargo.
split build and test jobs to reduce ci turnaround time
and make it clear what is failing when something fails.
also add virtiofsd to deps to make test compilation faster
(most test time is compliation) and remove all force 9ps.
Simplify scx_rlfifo code, add detailed documentation of the
scx_rustland_core API and get rid of the additional task queue, since it
just makes the code bigger, slower and it doesn't really provide any
benefit (considering that we are dispatching the tasks in FIFO order
anyway).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Pass the enqueue flags to the user-space scheduler through the
QueuedTask struct.
These flags allow the user-space scheduler to make more informed
scheduling decisions.
Also bump up scx_rustland_core minor version to reflect the new API (we
are not breaking the old API, so we don't need to bump the major version
in this case).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Unexpectedly, little cores, which have relative short time slices, have
more chance to schedule performance-critical tasks. Hence it is better
to keep the time slice same regardless the core types.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When selecting an idle CPU for a task that has been woken up, prioritize
reusing the same CPU if the waker and wakee share the same L3 cache.
Otherwise, attempt to migrate the wakee to the waker's CPU, provided it
is allowed by the wakee's scheduling domain.
This seems to consistently improve FPS performance when the system is
not operating over its full capacity.
Example:
$ __GL_SYNC_TO_VBLANK=0 vblank_mode=0 glxgears -geometry 800x600
- before: ~18305.77 FPS
- after: ~19060.62 FPS
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Rename "turbo domain" to "preferred domain", that conceptually is more
generic and introduce the new option `--preferred-domain CPUMASK`, which
allows users to define the preferred domain, specifying a cpumask as a
hex number. By default ("auto") the scheduler will always try to detect
and use the fastest CPUs in the system.
Moreover, adjust the cpufreq logic to use "auto" both with the
"balance_power" and "balance_performance" EPP profiles.
Then, enable "auto" mode by default: the scheduler will try to
automatically determine the optimal primary domain, preferred domain and
cpufreq level, based on the selected scheduler and energy profiles.
Tested-by: Piotr Gorski < piotr.gorski@cachyos.org >
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Fix formatting precision of stats to have lower precision for
readability. The existing formatting is hard to read:
tot= 1538 local=31.27 open_idle= 2.73 affn_viol=23.80 proc=4ms
busy= 1.1 util= 16.6 load= 32.7 fallback_cpu= 6
excl_coll=0.06501950585175553 excl_preempt=0.26007802340702213 excl_idle=0.16384915474642392 excl_wakeup=0.25097529258777634
With this fix stats are far more readable formatting:
tot= 441 local=33.56 open_idle= 0.00 affn_viol=20.63 proc=3ms
busy= 0.4 util= 6.3 load= 33.6 fallback_cpu= 6
excl_coll=0.454 excl_preempt=0.000 excl_idle=0.132 excl_wakeup=0.200
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
When a pinned task cannot run on either active or overflow sets, we try
to stay on the previous CPU which is still okay to run on.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Bump up scx_rustland_core version to include this critical fix that
allows to prevent scheduler stalls:
94a3594 ("scx_rustland_core: always dispatch per-cpu kthreads directly")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
In auto mode, rather than keeping the previous fixed cpuperf factor,
dynamically calculate it based on CPU utilization and apply it before a
task runs within its allocated time slot.
Interactive tasks consistently receive the maximum scaling factor to
ensure optimal performance.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Always consider the turbo domain when running in "auto" mode.
Additionally, when the turbo domain is used, split the CPU idle
selection logic into two stages:
1) in ops.select_cpu(), provide the task with a second opportunity to
remain within the same LLC
2) in ops.enqueue(), perform another check for an idle CPU, allowing
the task to move to a different LLC if an idle CPU within the same
LLC is not available.
This allows tasks to stick more on turbo-boosted CPUs and CPUs within
the same LLC.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When tasks are changing CPU affinity it is pointless to try to find an
optimal idle CPU. In this case just skip the the idle CPU selection step
and let the task being dispatched to a global DSQ if needed.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Add hints for the cpufreq governor based on the selected scheduler's
performance profile and the current energy performance preference (EPP).
With this change applied the scheduler works as following:
scheduler profile (--primary-domain option):
- default:
- use all cores
- cpufreq: use default scaling factor
- powersave:
- use E-cores
- cpufreq: use min scaling factor
- performance:
- use P-cores
- cpufreq: use max scaling factor
- auto:
- EPP: power, powersave
- use E-cores
- cpufreq: use min scaling factor
- EPP: balance_power (typically battery-powered systems)
- use E-cores
- cpufreq: use default scaling factor
- EPP: balance_performance, performance
- use P-cores
- cpufreq: use max scaling factor
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
scx_rustland was originally designed as a PoC to showcase the benefits
of implementing specialized schedulers via sched_ext, focusing on a very
specific use case: prioritize game responsiveness regardless of what
runs in the background.
Its original design was subsequently modified to better serve as a
general-purpose scheduler, balancing the prioritization of interactive
tasks with CPU-intensive ones to prevent over-prioritization.
With scx_bpfland serving as a more "general-purpose" scheduler, it makes
sense to revisit scx_rustland's original goal and make it much more
aggressive at prioritizing interactive tasks, determined in function of
their average amount of context switches.
This change makes scx_rustland again a really good PoC to showcase the
benefits of having specialized schedulers, by focusing only at a very
specific use case: provide a high and stable frames-per-second (fps)
while a kernel build is running in the background.
= Results =
- Test: Run a WebGL application [1] while building the kernel (make -j32)
- Hardware: 8-cores Intel 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz
+----------------------+--------+--------+
| Scheduler | avg fps| stdev |
+----------------------+--------+--------+
| EEVDF | 28 | 4.00 |
| scx_rustland-before | 43 | 1.25 |
| scx_rustland-after | 60 | 0.25 |
+----------------------+--------+--------+
[1] https://webglsamples.org/aquarium/aquarium.html
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
An old BPF verifier does not allow calling bpf_cpumask_set_cpu() in the
BPF syscall context, so we defer actual bpf_cpumask_set_cpu() to the
timer handler, update_sys_stat(), to workaround the problem.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
If a task is performance-critical, pick_idle_cpu() checks if the
previous core is a big core or not. If not, don't try to run on previous
core since a performance-critical task is better to run on a big core.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
A single threshold for a low watermark does not work well across systems
with various numbers of cores and core types. Instead of using a single
low watermark value, we dynamically decide the low watermark: 1) until
one little core is fully utilized or 2) until two big cores are fully
utilized. This works better across systems.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The scx_rustland_core API has been redesigned recently, breaking the
compatibility with the past.
Considering that Rust crates should update their major version when the
previous API becomes incompatible [1], bump up the version to 2.0.0.
[1] https://semver.org/
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
We want to directly dispatch only kthreads when local_kthreads is
enabled, not all tasks that can run on a single CPU.
Fixes: 7cc1846 ("scx_bpfland: always rely on prev_cpu with single-CPU tasks")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When the no_freq_scaling changes during runtime in the autopilot mode,
the last target freq set would not be 1024. So the performance mode
enabled by the autopilot mode would not run in the best profile. Hence,
we set the target freq to 1024 always when no_freq_scaling is set.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
This allows scx_rustland to avoid generating excessive logs for
statistics while still allowing detailed monitoring on demand.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Add "--autopilot" option and mode. In the autopilot mode, the scheduler
dynamically changes its power mode according to system's load (cpu
utilization). When the cpu utilization is low enough (say <=5%), it
switches to the powersave mode since there is nothing to process fast so
powersaving is the primary goal. When the utilization is moderate (say
>5%, <=30%), it runs in balanced mode. When the utilization is high
enough (say >30%), it runs in performance mode.
Note that it only changes scheduler's power mode but it does not change
system's energy profile.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a cpu is idle for a whole interval, its idle time does not
correctlyh adds up so the utilization of such cpu tends to be higher
than the actual utilization. Now it is fixedk, so cpu utilization
becomes more accurate.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When the power mode changes back to performance mode, we should
active/overflow cpumask to its initial state -- all big cores are in
active cpumask and all little cores are in overflow cpumask. Otherwise,
the active/overflow cpumasks will be used in the perfformance mode.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
If a task can be run only on a single cpu, we don't need to go through
all the steps in ops.select_cpu(). Instread, we simply check if a task
is still pinned on the prev_cpu and go.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When selecting an idle for tasks that can only run on a single CPU,
always check if the previously used CPU is sill usable, instead of
trying to figure out the single allowed CPU looking at the task's
cpumask.
Apparently, single-CPU tasks can report a prev_cpu that is not in the
allowed cpumask when they rapidly change affinity.
This could lead to stalls, because we may end up dispatching the kthread
to a per-CPU DSQ that is not compatible with its allowed cpumask.
Example:
kworker/u32:2[173797] triggered exit kind 1026:
runnable task stall (kworker/2:1[70] failed to run for 7.552s)
...
R kworker/2:1[70] -7552ms
scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
sticky/holding_cpu=-1/-1 dsq_id=0x8 dsq_vtime=234483011369
cpus=04
In this case kworker/2 can only run on CPU #2 (cpus=0x4), but it's
dispatched to dsq_id=0x8, that can only be consumed by CPU 8 => stall.
To prevent this, do not try to figure out the best idle CPU for tasks
that are changing affinity and just dispatch them to a global DSQ
(either priority or regular, depending on its interactive state).
Moreover, introduce an explicit error check in dispatch_direct_cpu() to
improve detection of similar issues in the future, and drop
lookup_task_ctx() in favor of try_lookup_task_ctx(), since we can now
safely handle all the cases where the task context is not found.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
It checkes the EPP (energy performance preference) peirodically and sets
the power profile of the scheduler during runtiime as a user changes its
EPP profile (from her desktop UI).
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Add a per layer config for different implementations of layer growth
algorithms. Convert the existing default logic into a default layer
growth algorithm and add a linear implementation.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Refactor some BPF code to make verification easier on older kernels.
This is to make it easier to maintain backports.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Aggressively try to keep tasks running on the same CPU / cache / domain,
to achieve higher performance when the system is not over commissioned.
This is done by giving a second chance in ops.enqueue(), in addition to
ops.select_cpu(), to find an idle CPU close to the previously used CPU.
Moreover, even if the task is dispatched to the global DSQs, always try
to check if there is an idle CPU in the primary domain that can
immediately consume the task.
= Results =
This change seems to provide a minor, but consistent, boost of
performance with the CPU-intensive benchmarks from the CachyOS
benchmarks selection [1].
Similar results can also be noticed with some WebGL benchmarks [2], when
system usage is close to its maximum capacity.
Test:
- cachyos-benchmarker
System:
- AMD Ryzen 7 5800X 8-Core Processor
Metrics:
- total time: elapsed time of all benchmarks
- total score: geometric mean of all benchmarks
NOTE: total time is the most relevant, since it gives a measure of the
aggregate performance, while the total score emphasizes more on
performance consistency across all benchmarks.
== Results: summary ==
+-------------------------+---------------------+---------------------+
| Scheduler | Total Time | Total Score |
| | (less = better) | (less = better) |
+-------------------------+---------------------+---------------------+
| EEVDF | 624.44 sec | 123.68 |
| bpfland | 625.34 sec | 122.21 |
| bpfland-task-affinity | 623.67 sec | 122.27 |
+-------------------------+---------------------+---------------------+
== Conclusion ==
With this patch applied, bpfland shows both a better performance and
consistency. Although the gains are small (less than 1%), they are still
significant for this type of benchmark and consistently appear across
multiple runs.
[1] https://github.com/CachyOS/cachyos-benchmarker
[2] https://webglsamples.org/aquarium/aquarium.html
Tested-by: Piotr Gorski < piotr.gorski@cachyos.org >
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When iterating neighbors, the existing code unnecessarily iterates all
the neighbors to the maximum even if there is no neighors. So the fix
escapes early when there is no neighbors.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Ctrl-c wasn't properly handled in the monitoring mode
(`--monitor-sched-samples`), so the scheduler could not be terminated by
pressing ctrl-c. The missing ctrl-c handling is added to the monitor
thread.
Signed-off-by: Changwoo Min <changwoo@igalia.com>