When selecting an idle CPU for a task that has been woken up, prioritize
reusing the same CPU if the waker and wakee share the same L3 cache.
Otherwise, attempt to migrate the wakee to the waker's CPU, provided it
is allowed by the wakee's scheduling domain.
This seems to consistently improve FPS performance when the system is
not operating over its full capacity.
Example:
$ __GL_SYNC_TO_VBLANK=0 vblank_mode=0 glxgears -geometry 800x600
- before: ~18305.77 FPS
- after: ~19060.62 FPS
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Rename "turbo domain" to "preferred domain", that conceptually is more
generic and introduce the new option `--preferred-domain CPUMASK`, which
allows users to define the preferred domain, specifying a cpumask as a
hex number. By default ("auto") the scheduler will always try to detect
and use the fastest CPUs in the system.
Moreover, adjust the cpufreq logic to use "auto" both with the
"balance_power" and "balance_performance" EPP profiles.
Then, enable "auto" mode by default: the scheduler will try to
automatically determine the optimal primary domain, preferred domain and
cpufreq level, based on the selected scheduler and energy profiles.
Tested-by: Piotr Gorski < piotr.gorski@cachyos.org >
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Fix formatting precision of stats to have lower precision for
readability. The existing formatting is hard to read:
tot= 1538 local=31.27 open_idle= 2.73 affn_viol=23.80 proc=4ms
busy= 1.1 util= 16.6 load= 32.7 fallback_cpu= 6
excl_coll=0.06501950585175553 excl_preempt=0.26007802340702213 excl_idle=0.16384915474642392 excl_wakeup=0.25097529258777634
With this fix stats are far more readable formatting:
tot= 441 local=33.56 open_idle= 0.00 affn_viol=20.63 proc=3ms
busy= 0.4 util= 6.3 load= 33.6 fallback_cpu= 6
excl_coll=0.454 excl_preempt=0.000 excl_idle=0.132 excl_wakeup=0.200
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Prevent triggering a critical error when a local context for a task
can't be found.
Instead, handle the error gracefully (reporting a warning in debugfs) to
enhance the robustness of the schedulers based on scx_rustland_core.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When a pinned task cannot run on either active or overflow sets, we try
to stay on the previous CPU which is still okay to run on.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Now that we have a sane build environment with meson we don't need to
force sequential builds anymore (--jobs=1) and we can just run regular
builds without having to worry about excessive parallelization.
See commit 43950c6 ("build: Use workspace to group rust sub-projects").
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Bump up scx_rustland_core version to include this critical fix that
allows to prevent scheduler stalls:
94a3594 ("scx_rustland_core: always dispatch per-cpu kthreads directly")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Do not send per-CPU kthreads to the user-space scheduler, but always
dispatch them directly from BPF.
In specific environments, sending critical per-CPU kthreads to the
user-space scheduler can lead to potential stalls. This occurs because
the user-space scheduler might be blocked by an action that these
per-CPU kthreads need to perform, but they cannot complete their action
if the scheduler needs to schedule them, hence the deadlock.
To prevent this deadlock, always dispatch the per-CPU kthreads directly
from the BPF component, ensuring that the user-space scheduler does not
get blocked by these events.
Fixes: c0a2cfb ("scx_rustland_core: always schedule per-CPU kthreads to user-space")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Recently, we have triggered some OOM conditions during stress tests,
particularly with the user-space schedulers. To avoid this issue and
prevent false positives, increase the memory size of the virtme-ng
instance from the default 1GB to 2GB.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
In auto mode, rather than keeping the previous fixed cpuperf factor,
dynamically calculate it based on CPU utilization and apply it before a
task runs within its allocated time slot.
Interactive tasks consistently receive the maximum scaling factor to
ensure optimal performance.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Always consider the turbo domain when running in "auto" mode.
Additionally, when the turbo domain is used, split the CPU idle
selection logic into two stages:
1) in ops.select_cpu(), provide the task with a second opportunity to
remain within the same LLC
2) in ops.enqueue(), perform another check for an idle CPU, allowing
the task to move to a different LLC if an idle CPU within the same
LLC is not available.
This allows tasks to stick more on turbo-boosted CPUs and CPUs within
the same LLC.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When tasks are changing CPU affinity it is pointless to try to find an
optimal idle CPU. In this case just skip the the idle CPU selection step
and let the task being dispatched to a global DSQ if needed.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Add hints for the cpufreq governor based on the selected scheduler's
performance profile and the current energy performance preference (EPP).
With this change applied the scheduler works as following:
scheduler profile (--primary-domain option):
- default:
- use all cores
- cpufreq: use default scaling factor
- powersave:
- use E-cores
- cpufreq: use min scaling factor
- performance:
- use P-cores
- cpufreq: use max scaling factor
- auto:
- EPP: power, powersave
- use E-cores
- cpufreq: use min scaling factor
- EPP: balance_power (typically battery-powered systems)
- use E-cores
- cpufreq: use default scaling factor
- EPP: balance_performance, performance
- use P-cores
- cpufreq: use max scaling factor
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
scx_rustland was originally designed as a PoC to showcase the benefits
of implementing specialized schedulers via sched_ext, focusing on a very
specific use case: prioritize game responsiveness regardless of what
runs in the background.
Its original design was subsequently modified to better serve as a
general-purpose scheduler, balancing the prioritization of interactive
tasks with CPU-intensive ones to prevent over-prioritization.
With scx_bpfland serving as a more "general-purpose" scheduler, it makes
sense to revisit scx_rustland's original goal and make it much more
aggressive at prioritizing interactive tasks, determined in function of
their average amount of context switches.
This change makes scx_rustland again a really good PoC to showcase the
benefits of having specialized schedulers, by focusing only at a very
specific use case: provide a high and stable frames-per-second (fps)
while a kernel build is running in the background.
= Results =
- Test: Run a WebGL application [1] while building the kernel (make -j32)
- Hardware: 8-cores Intel 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz
+----------------------+--------+--------+
| Scheduler | avg fps| stdev |
+----------------------+--------+--------+
| EEVDF | 28 | 4.00 |
| scx_rustland-before | 43 | 1.25 |
| scx_rustland-after | 60 | 0.25 |
+----------------------+--------+--------+
[1] https://webglsamples.org/aquarium/aquarium.html
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
An old BPF verifier does not allow calling bpf_cpumask_set_cpu() in the
BPF syscall context, so we defer actual bpf_cpumask_set_cpu() to the
timer handler, update_sys_stat(), to workaround the problem.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
If a task is performance-critical, pick_idle_cpu() checks if the
previous core is a big core or not. If not, don't try to run on previous
core since a performance-critical task is better to run on a big core.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
A single threshold for a low watermark does not work well across systems
with various numbers of cores and core types. Instead of using a single
low watermark value, we dynamically decide the low watermark: 1) until
one little core is fully utilized or 2) until two big cores are fully
utilized. This works better across systems.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The scx_rustland_core API has been redesigned recently, breaking the
compatibility with the past.
Considering that Rust crates should update their major version when the
previous API becomes incompatible [1], bump up the version to 2.0.0.
[1] https://semver.org/
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
We want to directly dispatch only kthreads when local_kthreads is
enabled, not all tasks that can run on a single CPU.
Fixes: 7cc1846 ("scx_bpfland: always rely on prev_cpu with single-CPU tasks")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When the no_freq_scaling changes during runtime in the autopilot mode,
the last target freq set would not be 1024. So the performance mode
enabled by the autopilot mode would not run in the best profile. Hence,
we set the target freq to 1024 always when no_freq_scaling is set.
Signed-off-by: Changwoo Min <changwoo@igalia.com>