Bump up scx_rustland_core version to include this critical fix that
allows to prevent scheduler stalls:
94a3594 ("scx_rustland_core: always dispatch per-cpu kthreads directly")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
scx_rustland was originally designed as a PoC to showcase the benefits
of implementing specialized schedulers via sched_ext, focusing on a very
specific use case: prioritize game responsiveness regardless of what
runs in the background.
Its original design was subsequently modified to better serve as a
general-purpose scheduler, balancing the prioritization of interactive
tasks with CPU-intensive ones to prevent over-prioritization.
With scx_bpfland serving as a more "general-purpose" scheduler, it makes
sense to revisit scx_rustland's original goal and make it much more
aggressive at prioritizing interactive tasks, determined in function of
their average amount of context switches.
This change makes scx_rustland again a really good PoC to showcase the
benefits of having specialized schedulers, by focusing only at a very
specific use case: provide a high and stable frames-per-second (fps)
while a kernel build is running in the background.
= Results =
- Test: Run a WebGL application [1] while building the kernel (make -j32)
- Hardware: 8-cores Intel 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz
+----------------------+--------+--------+
| Scheduler | avg fps| stdev |
+----------------------+--------+--------+
| EEVDF | 28 | 4.00 |
| scx_rustland-before | 43 | 1.25 |
| scx_rustland-after | 60 | 0.25 |
+----------------------+--------+--------+
[1] https://webglsamples.org/aquarium/aquarium.html
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The scx_rustland_core API has been redesigned recently, breaking the
compatibility with the past.
Considering that Rust crates should update their major version when the
previous API becomes incompatible [1], bump up the version to 2.0.0.
[1] https://semver.org/
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
This allows scx_rustland to avoid generating excessive logs for
statistics while still allowing detailed monitoring on demand.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
meson build script was building each rust sub-project under rust/ and
scheds/rust/ separately. This means that each rust project is built
independently which leads to a couple problems - 1. There are a lot of
shared dependencies but they have to be built over and over again for each
proejct. 2. Concurrency management becomes sad - we either have to unleash
multiple cargo builds at the same time possibly thrashing the system or
build one by one.
We've been trying to solve this from meson side in vain. Thankfully, in
issue #546, @vimproved suggested using cargo workspace which makes the
sub-projects share the same target directory and built together by the same
cargo instance while still allowing each project to behave independently for
development and publishing purposes.
Make the following changes:
- Create two cargo workspaces - one under rust/, the other under
scheds/rust/. Each contains all rust projects underneath it.
- Don't let meson descend into rust/. These are libraries used by the rust
schedulers. No need to build them from meson. Cargo will build them as
needed.
- Change the rust_scheds build target to invoke `cargo build` in
scheds/rust/ and let cargo do its thing.
- Remove per-scheduler meson.build files and instead generate custom_targets
in scheds/rust/meson.build which invokes `cargo build -p $SCHED`.
- This changes rust binary directory. Update README and
meson-scripts/install_rust_user_scheds accordingly.
- Remove per-scheduler Cargo.lock as scheds/rust/Cargo.lock is shared by all
schedulers now.
- Unify .gitignore handling.
The followings are build times on Ryzen 3975W:
Before:
________________________________________________________
Executed in 165.93 secs fish external
usr time 40.55 mins 2.71 millis 40.55 mins
sys time 3.34 mins 36.40 millis 3.34 mins
After:
________________________________________________________
Executed in 36.04 secs fish external
usr time 336.42 secs 0.00 millis 336.42 secs
sys time 36.65 secs 43.95 millis 36.61 secs
Wallclock time is reduced 5x and CPU time 7x.
Refactor the code to hide the shutdown handling inside BpfScheduler and
simply use the exited() method to check when the scheduler is stopped.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
There is no reason to have two separate options for "verbose" and
"debug" mode. Just merge the two and always use "debug". If enabled,
increase verbosity to stdout and enable reporting BPF scheduling events
in debugfs (e.g., /sys/kernel/debug/tracing/trace_pipe).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Since scx_rustland_core enables setting a time slice on a per-task basis
during task dispatch, there's no need to maintain a global time slice in
the BPF component. Instead, a global time slice can simply be managed in
user-space, achieving the same outcome.
Therefore, drop the global slice_us property from BpfScheduler to
simplify the API.
NOTE: if a time slice is not specified for a task, SCX_SLICE_DFL will be
used by default.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The update_tasks() API is somewhat confusing, so replace it with a
clearer API, notify_complete().
This new API will return control to the BPF component and inform it
about the number of tasks still pending in the user-space scheduler.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The low-power API is a bit of a hack implemented purely in the BPF
layer, this should be better re-implemented with some concepts of
topology awareness.
Therefore, get rid of this API for now.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The current API used to notify the user-space scheduler when a task
exits is really confusing (setting a negative value in
queued_task_ctx.cpu), and it's also possible to detect task exiting
events from user-space (or check in procfs, even if it's slower).
In any case, a better API should be provided for this, so drop the
current one for now.
NOTE: this will cause additional memory usage for scx_rustland, but it
can be fixed/addressed later in a separate commit (i.e., providing a
periodic garbage collector for the unused task entries).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Allow user-space scheduler to pick an idle CPU via
self.bpf.select_cpu(pid, prev_task, flags), mimicking the BPF's
select_cpu() iterface.
Also remove the full_user option and always rely on the idle selection
logic from user-space.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The member "topo_map" in Scheduler is never used and thus should be
removed, the related imports are removed as well.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Re-add the partial mode option that was dropped during the refactoring.
The partial option allows to apply the scheduler only to the tasks which
have their scheduling policy set to SCHED_EXT via sched_setscheduler().
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The API for determining which PID is running on a specific CPU is racy
and is unnecessary since this information can be obtained from user
space.
Additionally, it's not reliable for identifying idle CPUs. Therefore,
it's better to remove this API and, in the future, provide a cpumask
alternative that can export the idle state of the CPUs to user space.
As a consequence also change scx_rustland to dispatch one task a time,
instead of dispatching tasks in batches of idle cores (that are usually
not accurate due to the racy nature of the CPU ownership interaface).
Dispatching one task at a time even makes the scheduler more performant,
due to the vruntime scheduling being applied to more tasks sitting in
the scheduler's queue.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Drop the slice boost logic and apply a vruntime and task time slice
evaluation approach similar to scx_bpfland (but implement this in the
user-space component instead of the BPF part).
Additionally, introduce a slice_us_min parameter to define the minimum
time slice that can be assigned to a task, also similar to scx_bpfland.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Use the same idle selection logic used in scx_bpfland also in
scx_rustland_core.
Also drop fifo_mode and always use the BPF idle selection logic by
default as long as the system is not saturated, unless full_user is
specified.
This approach allows user-space schedulers aiming for maximum
performance to leverage the BPF idle selection logic (bypassing
user-space), while those seeking full control can enable full_user to
bypass the BPF CPU idle selection logic and choose the target CPU for
each task from user-space.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
We don't need to send the number of voluntary context switches (nvcsw)
from BPF to user-space, as this information is already accessible in
user-space via procfs. Sending this data would only create unnecessary
overhead for schedulers that don't require it, and those that do can
easily retrieve it through procfs.
Therefore, drop this metric from scx_rustland_core and change
scx_rustland implementing an interactive task classifier fully in the
user-space part of the scheduler.
Also drop some options that are not provide any significant benefit
(also in preparation of a bigger refactoring to define a better API for
the user-space framework).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Update libbpf-rs & libbpf-cargo to 0.24. Among other things, generated
skeletons now contain directly accessible map and program objects, no
longer necessitating the use of accessor methods. As a result, the risk
for mutability conflicts is reduced greatly.
Signed-off-by: Daniel Müller <deso@posteo.net>
sched_ext is about to be merged upstream. There are some compatibility
breaking changes and we're making the current sched_ext/for-6.11
1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on
!wakeup enqueues") the baseline.
Tag everything except scx_mitosis as 1.0.0. As scx_mitosis is still in early
development and is currently temporarily disabled, only the patchlevel is
bumped.
With commit 5d20f89a ("scheds-rust: build rust schedulers in sequence"),
schedulers are now built serially one after the other to prevent meson
and cargo from forking NxN parallel tasks.
However, this change has made building a single scheduler much more
cumbersome, due to the chain of dependencies.
For example, building scx_rusty using the specific meson target would
still result in all schedulers being built, because they all depend on
each other.
To address this issue, introduce the new meson build option
`serialize=true|false` (default is false).
This option allows to disable the schedulers' build chain, restoring the
old behavior.
With this option enabled, it is now possible to build just a single
scheduler, parallelizing the cargo build properly, without triggering
the build of the others. Example:
$ meson setup build -Dbuildtype=release -Dserialize=false
$ meson compile -C build scx_rusty
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
In some cases, a host may have an odd topology where there are gaps in
CPU IDs (including between possible CPUs). A common pattern in
schedulers is to perform allocations for every possible CPU ID, such as
creating a per-cpu DSQ. In order to avoid confusing schedulers, let's
track the maximum CPU ID on a system so that we can return the number of
CPU IDs on the system which is inclusive of gaps.
We also update scx_rustland in this change to accommodate the fact that
we no longer export nr_cpus_possible() from TopologyMap.
Signed-off-by: David Vernet <void@manifault.com>
Keep track of the maximum vruntime among all tasks and flush them if the
difference between the maximum and minimum vruntime exceeds slice_ns.
This helps to prevent excessive starvation, as every task is guaranteed
to be dispatched within the slice_ns time limit.
Tested-by: Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Make sure to never assign a time slice longer than the default time
slice, that can be used as an upper limit.
This seems to prevent potential stall conditions (reported by the
CachyOS community) when running CPU-intensive workloads, such as:
[ 68.062813] sched_ext: BPF scheduler "rustland" errored, disabling
[ 68.062831] sched_ext: runnable task stall (ollama_llama_se[3312] failed to run for 5.180s)
[ 68.062832] scx_watchdog_workfn+0x154/0x1e0
[ 68.062837] process_one_work+0x18e/0x350
[ 68.062839] worker_thread+0x2fa/0x490
[ 68.062841] kthread+0xd2/0x100
[ 68.062842] ret_from_fork+0x34/0x50
[ 68.062844] ret_from_fork_asm+0x1a/0x30
Fixes: 6f4cd853 ("scx_rustland: introduce virtual time slice")
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Overview
========
Currently, a task's time slice is determined based on the total number
of tasks waiting to be scheduled: the more overloaded the system, the
shorter the time slice.
This approach can help to reduce the average wait time of all tasks,
allowing them to progress more slowly, but uniformly, thus providing a
smoother overall system performance.
However, under heavy system load, this approach can lead to very short
time slices distributed among all tasks, causing excessive context
switches that can badly affect soft real-time workloads.
Moreover, the scheduler tends to operate in a bursty manner (tasks are
queued and dispatched in bursts). This can also result in fluctuations
of longer and shorter time slices, depending on the number of tasks
still waiting in the scheduler's queue.
Such behavior can also negatively impact on soft real-time workloads,
such as real-time audio processing.
Virtual time slice
==================
To mitigate this problem, introduce the concept of virtual time slice:
the idea is to evaluate the optimal time slice of a task, considering
the vruntime as a deadline for the task to complete its work before
releasing the CPU.
This is accomplished by calculating the difference between the task's
vruntime and the global current vruntime and use this value as the task
time slice:
task_slice = task_vruntime - min_vruntime
In this way, tasks that "promise" to release the CPU quickly (based on
their previous work pattern) get a much higher priority (due to
vruntime-based scheduling and the additional priority boost for being
classified as interactive), but they are also given a shorter time slice
to complete their work and fulfill their promise of rapidity.
At the same time tasks that are more CPU-intensive get de-prioritized,
but they will tend to have a longer time slice available, reducing in
this way the amount of context switches that can negatively affect their
performance.
In conclusion, latency-sensitive tasks get a high priority and a short
time slice (and they can preempt other tasks), CPU-intensive tasks get
low priority and a long time slice.
Example
=======
Let's consider the following theoretical scenario:
task | time
-----+-----
A | 1
B | 3
C | 6
D | 6
In this case task A represents a short interactive task, task C and D
are CPU-intensive tasks and task B is mainly interactive, but it also
requires some CPU time.
With a uniform time slice, scaled based on the amount of tasks, the
scheduling looks like this (assuming the time slice is 2):
A B B C C D D A B C C D D C C D D
| | | | | | | | |
`---`---`---`-`-`---`---`---`----> 9 context switches
With the virtual time slice the scheduling changes to this:
A B B C C C D A B C C C D D D D D
| | | | | | |
`---`-----`-`-`-`-----`----------> 7 context switches
In the latter scenario, tasks do not receive the same time slice scaled
by the total number of tasks waiting to be scheduled. Instead, their
time slice is adjusted based on their previous CPU usage. Tasks that
used more CPU time are given longer slices and their processing time
tends to be packed together, reducing the amount of context switches.
Meanwhile, latency-sensitive tasks can still be processed as soon as
they need to, because they get a higher priority and they can preempt
other tasks. However, they will get a short time slice, so tasks that
were incorrectly classified as interactive will still be forced to
release the CPU quickly.
Experimental results
====================
This patch has been tested on a on a 8-cores AMD Ryzen 7 5800X 8-Core
Processor (16 threads with SMT), 16GB RAM, NVIDIA GeForce RTX 3070.
The test case involves the usual benchmark of playing a video game while
simultaneously overloading the system with a parallel kernel build
(`make -j32`).
The average frames per second (fps) reported by Steam is used as a
metric for measuring system responsiveness (the higher the better):
Game | before | after | delta |
---------------------------+---------+---------+--------+
Baldur's Gate 3 | 40 fps | 48 fps | +20.0% |
Counter-Strike 2 | 8 fps | 15 fps | +87.5% |
Cyberpunk 2077 | 41 fps | 46 fps | +12.2% |
Terraria | 98 fps | 108 fps | +10.2% |
Team Fortress 2 | 81 fps | 92 fps | +13.6% |
WebGL demo (firefox) [1] | 32 fps | 42 fps | +31.2% |
---------------------------+---------+---------+--------+
Apart from the massive boost with Counter-Strike 2 (that should be taken
with a grain of salt, considering the overall poor performance in both
cases), the virtual time slice seems to systematically provide a boost
in responsiveness of around +10-20% fps.
It also seems to significantly prevent potential audio cracking issues
when the system is massively overloaded: no audio cracking was detected
during the entire run of these tests with the virtual deadline change
applied.
[1] https://webglsamples.org/aquarium/aquarium.html
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Make restart handling with user_exit_info simpler and consistently use the
load and report macros consistently across the rust schedulers. This makes
all schedulers automatically handle auto restarts from CPU hotplug events.
Note that this is necessary even for scx_lavd which has CPU hotplug
operations as CPU hotplug operations which took place between skel open and
scheduler init can still trigger restart.
Commit 23b0bb5f ("scx_rustland: dispatch interactive tasks on any CPU")
allows only interactive tasks to be dispatched on any CPU, enabling them
to quickly use the first idle CPU available. Non-interactive tasks, on
the other hand, are kept on the same CPU as much as possible.
This change deprioritizes CPU-intensive tasks further, but it also helps
to exploit cache locality, while latency-sensitive tasks are dispatched
sooner, improving overall responsiveness, despite the potential
migration cost.
Given this new logic, the built-idle option, which forces all tasks to
be dispatched on the CPU assigned during select_cpu(), no longer offers
significant benefits. It would merely reduce the responsiveness of
interactive tasks.
Therefore, simply remove this option, allowing the scheduler to
determine the target CPU(s) for all tasks based on their nature.
Fixes: 23b0bb5f ("scx_rustland: dispatch interactive tasks on any CPU")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Dispatch non-interactive tasks on the CPU selected by the built-in idle
selection logic and allow interactive tasks to be dispatched on any CPU.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Do not always assign the maximum time slice to interactive tasks, but
use the same value of the dynamic time slice for everyone.
This seems to prevent potential audio cracking when the system is over
commissioned.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The option --full-user is provided to delegate *all* scheduling
decisions to the user-space scheduler with no exception, including the
idle selection logic.
Therefore, make this option incompatible with --builtin-idle and
completely bypass the built-in idle selection logic when running in
full-user mode.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Provide a knob in scx_rustland_core to automatically turn the scheduler
into a simple FIFO when the system is underutilized.
This choice is based on the assumption that, in the case of system
underutilization (less tasks running than the amount of available CPUs),
the best scheduling policy is FIFO.
With this option enabled the scheduler starts in FIFO mode. If most of
the CPUs are busy (nr_running >= num_cpus - 1), the scheduler
immediately exits from FIFO mode and starts to apply the logic
implemented by the user-space component. Then the scheduler can switch
back to FIFO if there are no tasks waiting to be scheduled (evaluated
using a moving average).
This option can be enabled/disabled by the user-space scheduler using
the fifo_sched parameter in BpfScheduler: if set, the BPF component will
periodically check for system utilization and switch back and forth to
FIFO mode based on that.
This allows to improve performance of workloads that are using a small
amount of the available CPUs in the system, while still maintaining the
same good level of performance for interactive tasks when the system is
over commissioned.
In certain video games, such as Baldur's Gate 3 or Counter-Strike 2,
running in "normal" system conditions, we can experience a boost in fps
of approximately 4-8% with this change applied.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
This merge included additional commits that were supposed to be included
in a separate pull request and have nothing to do with the fifo-mode
changes.
Therefore, revert the whole pull request and create a separate one with
the correct list of commits required to implement this feature.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Dispatch non-interactive tasks on the CPU selected by the built-in idle
selection logic and allow interactive tasks to be dispatched on any CPU.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Do not always assign the maximum time slice to interactive tasks, but
use the same value of the dynamic time slice for everyone.
This seems to prevent potential audio cracking when the system is over
commissioned.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>