- scx_utils: Replace kfunc_exists() with ksym_exists() which doesn't care
about the type of the symbol.
- scx_layered: Fix load failure on kernels >= v6.10-rc due to
scheduler_tick() -> sched_tick rename. Attach the tick fentry function to
either scheduler_tick() or sched_tick().
Make sure to never assign a time slice longer than the default time
slice, that can be used as an upper limit.
This seems to prevent potential stall conditions (reported by the
CachyOS community) when running CPU-intensive workloads, such as:
[ 68.062813] sched_ext: BPF scheduler "rustland" errored, disabling
[ 68.062831] sched_ext: runnable task stall (ollama_llama_se[3312] failed to run for 5.180s)
[ 68.062832] scx_watchdog_workfn+0x154/0x1e0
[ 68.062837] process_one_work+0x18e/0x350
[ 68.062839] worker_thread+0x2fa/0x490
[ 68.062841] kthread+0xd2/0x100
[ 68.062842] ret_from_fork+0x34/0x50
[ 68.062844] ret_from_fork_asm+0x1a/0x30
Fixes: 6f4cd853 ("scx_rustland: introduce virtual time slice")
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Overview
========
Currently, a task's time slice is determined based on the total number
of tasks waiting to be scheduled: the more overloaded the system, the
shorter the time slice.
This approach can help to reduce the average wait time of all tasks,
allowing them to progress more slowly, but uniformly, thus providing a
smoother overall system performance.
However, under heavy system load, this approach can lead to very short
time slices distributed among all tasks, causing excessive context
switches that can badly affect soft real-time workloads.
Moreover, the scheduler tends to operate in a bursty manner (tasks are
queued and dispatched in bursts). This can also result in fluctuations
of longer and shorter time slices, depending on the number of tasks
still waiting in the scheduler's queue.
Such behavior can also negatively impact on soft real-time workloads,
such as real-time audio processing.
Virtual time slice
==================
To mitigate this problem, introduce the concept of virtual time slice:
the idea is to evaluate the optimal time slice of a task, considering
the vruntime as a deadline for the task to complete its work before
releasing the CPU.
This is accomplished by calculating the difference between the task's
vruntime and the global current vruntime and use this value as the task
time slice:
task_slice = task_vruntime - min_vruntime
In this way, tasks that "promise" to release the CPU quickly (based on
their previous work pattern) get a much higher priority (due to
vruntime-based scheduling and the additional priority boost for being
classified as interactive), but they are also given a shorter time slice
to complete their work and fulfill their promise of rapidity.
At the same time tasks that are more CPU-intensive get de-prioritized,
but they will tend to have a longer time slice available, reducing in
this way the amount of context switches that can negatively affect their
performance.
In conclusion, latency-sensitive tasks get a high priority and a short
time slice (and they can preempt other tasks), CPU-intensive tasks get
low priority and a long time slice.
Example
=======
Let's consider the following theoretical scenario:
task | time
-----+-----
A | 1
B | 3
C | 6
D | 6
In this case task A represents a short interactive task, task C and D
are CPU-intensive tasks and task B is mainly interactive, but it also
requires some CPU time.
With a uniform time slice, scaled based on the amount of tasks, the
scheduling looks like this (assuming the time slice is 2):
A B B C C D D A B C C D D C C D D
| | | | | | | | |
`---`---`---`-`-`---`---`---`----> 9 context switches
With the virtual time slice the scheduling changes to this:
A B B C C C D A B C C C D D D D D
| | | | | | |
`---`-----`-`-`-`-----`----------> 7 context switches
In the latter scenario, tasks do not receive the same time slice scaled
by the total number of tasks waiting to be scheduled. Instead, their
time slice is adjusted based on their previous CPU usage. Tasks that
used more CPU time are given longer slices and their processing time
tends to be packed together, reducing the amount of context switches.
Meanwhile, latency-sensitive tasks can still be processed as soon as
they need to, because they get a higher priority and they can preempt
other tasks. However, they will get a short time slice, so tasks that
were incorrectly classified as interactive will still be forced to
release the CPU quickly.
Experimental results
====================
This patch has been tested on a on a 8-cores AMD Ryzen 7 5800X 8-Core
Processor (16 threads with SMT), 16GB RAM, NVIDIA GeForce RTX 3070.
The test case involves the usual benchmark of playing a video game while
simultaneously overloading the system with a parallel kernel build
(`make -j32`).
The average frames per second (fps) reported by Steam is used as a
metric for measuring system responsiveness (the higher the better):
Game | before | after | delta |
---------------------------+---------+---------+--------+
Baldur's Gate 3 | 40 fps | 48 fps | +20.0% |
Counter-Strike 2 | 8 fps | 15 fps | +87.5% |
Cyberpunk 2077 | 41 fps | 46 fps | +12.2% |
Terraria | 98 fps | 108 fps | +10.2% |
Team Fortress 2 | 81 fps | 92 fps | +13.6% |
WebGL demo (firefox) [1] | 32 fps | 42 fps | +31.2% |
---------------------------+---------+---------+--------+
Apart from the massive boost with Counter-Strike 2 (that should be taken
with a grain of salt, considering the overall poor performance in both
cases), the virtual time slice seems to systematically provide a boost
in responsiveness of around +10-20% fps.
It also seems to significantly prevent potential audio cracking issues
when the system is massively overloaded: no audio cracking was detected
during the entire run of these tests with the virtual deadline change
applied.
[1] https://webglsamples.org/aquarium/aquarium.html
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Make restart handling with user_exit_info simpler and consistently use the
load and report macros consistently across the rust schedulers. This makes
all schedulers automatically handle auto restarts from CPU hotplug events.
Note that this is necessary even for scx_lavd which has CPU hotplug
operations as CPU hotplug operations which took place between skel open and
scheduler init can still trigger restart.
In cpumask_intersects_domain(), we check whether a given cpumask has any
CPUs in common with the specified domain by looking at the const, static
dom_cpumasks map. This map is only really necessary when creating the
domain struct bpf_cpumask objects at scheduler load time. After that, we
can just use the actual struct bpf_cpumask object embedded in the domain
context. Let's use that and cpumask kfuncs instead.
This allows rusty to load with
https://github.com/sched-ext/sched_ext/pull/216.
Signed-off-by: David Vernet <void@manifault.com>
Commit 23b0bb5f ("scx_rustland: dispatch interactive tasks on any CPU")
allows only interactive tasks to be dispatched on any CPU, enabling them
to quickly use the first idle CPU available. Non-interactive tasks, on
the other hand, are kept on the same CPU as much as possible.
This change deprioritizes CPU-intensive tasks further, but it also helps
to exploit cache locality, while latency-sensitive tasks are dispatched
sooner, improving overall responsiveness, despite the potential
migration cost.
Given this new logic, the built-idle option, which forces all tasks to
be dispatched on the CPU assigned during select_cpu(), no longer offers
significant benefits. It would merely reduce the responsiveness of
interactive tasks.
Therefore, simply remove this option, allowing the scheduler to
determine the target CPU(s) for all tasks based on their nature.
Fixes: 23b0bb5f ("scx_rustland: dispatch interactive tasks on any CPU")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
In order to prevent compiler from merging or refetching load/store
operations or unwanted reordering, we take the implemetation of
READ_ONCE()/WRITE_ONCE() from kernel sources under
"/include/asm-generic/rwonce.h".
Use WRITE_ONCE() in function flip_sys_cpu_util() to ensure the compiler
doesn't perform unnecessary optimization so the compiler won't make
incorrect assumptions when performing the operation of modifying of bit
flipping.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
layered_dispatch() was incorrectly continuing down to the lower priority
DSQs after successfully consuming from HI_FALLBACK_DSQ which can lead to
latency issues. Fix it.
As reported in #319, we may get a build failure in presence of musl,
that requires additional parameters in sched_param.
Fix by adding a proper conditional to support both gnu libc and musl
libc.
This fixes#319.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Use the GNU built-in __sync_fetch_and_xor() to perform the XOR operation
on global variable "__sys_cpu_util_idx" to ensure the operations
visibility.
The built-in function "__sync_fetch_and_xor()" can provide both atomic
operation and full memory barrier which is needed by every operation
(especially store operation) on global variables.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Newer sched_ext kernel versions sets the scheduler to schedule all tasks
within the system by default. However, some users are using the old
versions of kernel.
Therefore we call "__COMPAT_scx_bpf_switch_all()" to move all tasks to
"SCHED_EXT" class so scx_central would schedule all tasks by default in
older kernels.
The main reason why custom affinities are tricky for scx_layered is because
if we put a task which doesn't allow all CPUs into a layer's DSQ, it may not
get consumed for an indefinite amount of time. However, this is only true
for confined layers. Both open and grouped layers always consumed from all
CPUs and thus don't have this risk.
Let's allow tasks with custom affinities in open and grouped layers.
- In select_cpu(), don't consider direct dispatching to a local DSQ as
affinity violation even if the target CPU is outside the layer's cpumask
if the layer is open.
- In enqueue(), separate out per-cpu kthread special case into its own
block. Note that this is only applied if the layer is not preempting as a
preempting layer has a higher priority than HI_FALLBACK_DSQ anyway.
- Trigger the LO_FALLBACK_DSQ path for other threads only if the layer is
confined.
- The preemption path now also runs for tasks with a custom affinity in open
and grouped layers. Update it so that it only considers the CPUs in the
preempting task's allowed cpumask.
(cherry picked from commit 82d2f887a4608de61ddf5e15643c10e504a88f7b)
- AFFN_VIOL for per-cpu tasks could be double counted. Once in select_cpu()
and again in enqueue(). Count in select_cpu() only when direct
dispatching.
- Violating tasks were prioritized over non-violating ones because they were
queued on SCX_DSQ_GLOBAL which has priority over all user DSQs. This
doesn't make sense. Let's introduce two fallback DSQs - HI_FALLBACK_DSQ
and LO_FALLBACK_DSQ. HI is used for violating kthreads and LO for
violating user threads. HI is dispatched after preempting layers and LO
after all other layers. This shouldn't change the behavior too much for
kthreads while punshing, rather than rewarding, violating user threads.
(cherry picked from commit 67f69645667ba8a155cae9a9b7e90c055d39e23c)
The setting of ops->dispatch_max_batch leads to a too large allocated
size for percpu allocator, and it will be unhappy if we want a size
larger than PCPU_MIN_UNIT_SIZE. Reduce MAX_ENQUEUED_TASKS for fix.
Implement a second-chance migration in select_cpu(): after a task has
been dispatched directly do not try to migrate it immediately on a
different CPU, but force it to stay on prev_cpu for another round.
This seems quite effective on certain architectures (such as on a system
with 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz), and it can provide
noticeable benefits with gaming or WebGL applications (such as
https://webglsamples.org/aquarium/aquarium.html) under regular workload
conditions (around +5% fps).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Dispatch non-interactive tasks on the CPU selected by the built-in idle
selection logic and allow interactive tasks to be dispatched on any CPU.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>