scx_rustland was originally designed as a PoC to showcase the benefits
of implementing specialized schedulers via sched_ext, focusing on a very
specific use case: prioritize game responsiveness regardless of what
runs in the background.
Its original design was subsequently modified to better serve as a
general-purpose scheduler, balancing the prioritization of interactive
tasks with CPU-intensive ones to prevent over-prioritization.
With scx_bpfland serving as a more "general-purpose" scheduler, it makes
sense to revisit scx_rustland's original goal and make it much more
aggressive at prioritizing interactive tasks, determined in function of
their average amount of context switches.
This change makes scx_rustland again a really good PoC to showcase the
benefits of having specialized schedulers, by focusing only at a very
specific use case: provide a high and stable frames-per-second (fps)
while a kernel build is running in the background.
= Results =
- Test: Run a WebGL application [1] while building the kernel (make -j32)
- Hardware: 8-cores Intel 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz
+----------------------+--------+--------+
| Scheduler | avg fps| stdev |
+----------------------+--------+--------+
| EEVDF | 28 | 4.00 |
| scx_rustland-before | 43 | 1.25 |
| scx_rustland-after | 60 | 0.25 |
+----------------------+--------+--------+
[1] https://webglsamples.org/aquarium/aquarium.html
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
An old BPF verifier does not allow calling bpf_cpumask_set_cpu() in the
BPF syscall context, so we defer actual bpf_cpumask_set_cpu() to the
timer handler, update_sys_stat(), to workaround the problem.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
If a task is performance-critical, pick_idle_cpu() checks if the
previous core is a big core or not. If not, don't try to run on previous
core since a performance-critical task is better to run on a big core.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
A single threshold for a low watermark does not work well across systems
with various numbers of cores and core types. Instead of using a single
low watermark value, we dynamically decide the low watermark: 1) until
one little core is fully utilized or 2) until two big cores are fully
utilized. This works better across systems.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The scx_rustland_core API has been redesigned recently, breaking the
compatibility with the past.
Considering that Rust crates should update their major version when the
previous API becomes incompatible [1], bump up the version to 2.0.0.
[1] https://semver.org/
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
We want to directly dispatch only kthreads when local_kthreads is
enabled, not all tasks that can run on a single CPU.
Fixes: 7cc1846 ("scx_bpfland: always rely on prev_cpu with single-CPU tasks")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When the no_freq_scaling changes during runtime in the autopilot mode,
the last target freq set would not be 1024. So the performance mode
enabled by the autopilot mode would not run in the best profile. Hence,
we set the target freq to 1024 always when no_freq_scaling is set.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
This allows scx_rustland to avoid generating excessive logs for
statistics while still allowing detailed monitoring on demand.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Add "--autopilot" option and mode. In the autopilot mode, the scheduler
dynamically changes its power mode according to system's load (cpu
utilization). When the cpu utilization is low enough (say <=5%), it
switches to the powersave mode since there is nothing to process fast so
powersaving is the primary goal. When the utilization is moderate (say
>5%, <=30%), it runs in balanced mode. When the utilization is high
enough (say >30%), it runs in performance mode.
Note that it only changes scheduler's power mode but it does not change
system's energy profile.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a cpu is idle for a whole interval, its idle time does not
correctlyh adds up so the utilization of such cpu tends to be higher
than the actual utilization. Now it is fixedk, so cpu utilization
becomes more accurate.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When the power mode changes back to performance mode, we should
active/overflow cpumask to its initial state -- all big cores are in
active cpumask and all little cores are in overflow cpumask. Otherwise,
the active/overflow cpumasks will be used in the perfformance mode.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
If a task can be run only on a single cpu, we don't need to go through
all the steps in ops.select_cpu(). Instread, we simply check if a task
is still pinned on the prev_cpu and go.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When selecting an idle for tasks that can only run on a single CPU,
always check if the previously used CPU is sill usable, instead of
trying to figure out the single allowed CPU looking at the task's
cpumask.
Apparently, single-CPU tasks can report a prev_cpu that is not in the
allowed cpumask when they rapidly change affinity.
This could lead to stalls, because we may end up dispatching the kthread
to a per-CPU DSQ that is not compatible with its allowed cpumask.
Example:
kworker/u32:2[173797] triggered exit kind 1026:
runnable task stall (kworker/2:1[70] failed to run for 7.552s)
...
R kworker/2:1[70] -7552ms
scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
sticky/holding_cpu=-1/-1 dsq_id=0x8 dsq_vtime=234483011369
cpus=04
In this case kworker/2 can only run on CPU #2 (cpus=0x4), but it's
dispatched to dsq_id=0x8, that can only be consumed by CPU 8 => stall.
To prevent this, do not try to figure out the best idle CPU for tasks
that are changing affinity and just dispatch them to a global DSQ
(either priority or regular, depending on its interactive state).
Moreover, introduce an explicit error check in dispatch_direct_cpu() to
improve detection of similar issues in the future, and drop
lookup_task_ctx() in favor of try_lookup_task_ctx(), since we can now
safely handle all the cases where the task context is not found.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The veristat diff action isn't working properly so remove it for now.
Instead, run veristat (non diff mode) as part of the build-scheds
workflow. This should still provide some useful data for BPF program
related stats.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
It checkes the EPP (energy performance preference) peirodically and sets
the power profile of the scheduler during runtiime as a user changes its
EPP profile (from her desktop UI).
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Add a per layer config for different implementations of layer growth
algorithms. Convert the existing default logic into a default layer
growth algorithm and add a linear implementation.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Refactor some BPF code to make verification easier on older kernels.
This is to make it easier to maintain backports.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>