Bump up the scx_rustland_core version to include this critical fix that
prevents scheduler stalls:
94a3594 ("scx_rustland_core: always dispatch per-cpu kthreads directly")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Do not send per-CPU kthreads to the user-space scheduler, but always
dispatch them directly from BPF.
In specific environments, sending critical per-CPU kthreads to the
user-space scheduler can lead to stalls. This occurs because the
user-space scheduler may be blocked waiting for an action that these
per-CPU kthreads need to perform, while the kthreads cannot run until the
scheduler dispatches them, hence the deadlock.
To prevent this deadlock, always dispatch the per-CPU kthreads directly
from the BPF component, ensuring that the user-space scheduler does not
get blocked by these events.
Fixes: c0a2cfb ("scx_rustland_core: always schedule per-CPU kthreads to user-space")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The scx_rustland_core API has been redesigned recently, breaking the
compatibility with the past.
Considering that Rust crates should update their major version when the
previous API becomes incompatible [1], bump up the version to 2.0.0.
[1] https://semver.org/
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Include the FIFO example directly in the README.md, instead of linking
scx_rlfifo.
Including the example directly in the README can be more useful and
practical in cases where internet access is not available or when we
need to distribute more "standalone" documentation.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Introduce a vtime attribute to struct DispatchedTask that can be set by
the user-space scheduler and will be used by the BPF component to
dispatch the task via scx_bpf_dispatch_vtime().
In this way a user-space scheduler can decide to apply its own internal
task ordering or rely on the BPF vtime priority DSQs (or both).
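As an illustration, a user-space scheduler could set the new attribute
roughly as follows (a minimal sketch, assuming the QueuedTask /
DispatchedTask / dispatch_task() names used by the bundled examples such
as scx_rlfifo; exact names may differ):

  // Module layout assumed from scx_rlfifo, where bpf.rs is provided by
  // scx_rustland_core.
  use crate::bpf::{BpfScheduler, DispatchedTask, QueuedTask};

  // Hypothetical helper: dispatch a task and let the BPF vtime priority
  // DSQ order it by the vruntime tracked by the user-space scheduler.
  fn dispatch_with_vtime(bpf: &mut BpfScheduler, task: &QueuedTask, vruntime: u64) {
      let mut dispatched = DispatchedTask::new(task);

      // Handled by the BPF component via scx_bpf_dispatch_vtime().
      dispatched.vtime = vruntime;

      bpf.dispatch_task(&dispatched).unwrap();
  }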
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Temporarily drop the RL_PREEMPT_CPU flag: we need a better way to
implement preemption in scx_rustland_core and it's not very effective at
the moment, so simply drop it for now (it will be re-added properly in
the future).
This change does not affect any scheduler, since RL_PREEMPT_CPU is
currently unused.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The meson build script was building each rust sub-project under rust/ and
scheds/rust/ separately. This means that each rust project is built
independently, which leads to a couple of problems: 1. There are a lot of
shared dependencies, but they have to be built over and over again for each
project. 2. Concurrency management becomes sad - we either have to unleash
multiple cargo builds at the same time, possibly thrashing the system, or
build one by one.
We've been trying to solve this from the meson side in vain. Thankfully, in
issue #546, @vimproved suggested using a cargo workspace, which makes the
sub-projects share the same target directory and be built together by the
same cargo instance, while still allowing each project to behave
independently for development and publishing purposes.
Make the following changes:
- Create two cargo workspaces - one under rust/, the other under
scheds/rust/. Each contains all rust projects underneath it.
- Don't let meson descend into rust/. These are libraries used by the rust
schedulers. No need to build them from meson. Cargo will build them as
needed.
- Change the rust_scheds build target to invoke `cargo build` in
scheds/rust/ and let cargo do its thing.
- Remove per-scheduler meson.build files and instead generate custom_targets
in scheds/rust/meson.build which invoke `cargo build -p $SCHED`.
- This changes the rust binary directory. Update README and
meson-scripts/install_rust_user_scheds accordingly.
- Remove per-scheduler Cargo.lock as scheds/rust/Cargo.lock is shared by all
schedulers now.
- Unify .gitignore handling.
The following are build times on a Ryzen 3975W:
Before:
________________________________________________________
Executed in 165.93 secs fish external
usr time 40.55 mins 2.71 millis 40.55 mins
sys time 3.34 mins 36.40 millis 3.34 mins
After:
________________________________________________________
Executed in 36.04 secs fish external
usr time 336.42 secs 0.00 millis 336.42 secs
sys time 36.65 secs 43.95 millis 36.61 secs
Wallclock time is reduced 5x and CPU time 7x.
Refactor the code to hide the shutdown handling inside BpfScheduler and
simply use the exited() method to check when the scheduler is stopped.
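A rough sketch of what the resulting main loop might look like (method
names assumed from the bundled examples; not the exact scx_rustland code):

  // Module layout assumed from scx_rlfifo.
  use crate::bpf::BpfScheduler;

  // Hypothetical main loop: shutdown handling now lives inside
  // BpfScheduler, so the scheduler only has to poll exited() to know
  // when to stop.
  fn run(bpf: &mut BpfScheduler) {
      while !bpf.exited() {
          // ...consume queued tasks and dispatch them here...
      }
      // Report the BPF exit reason once the scheduler is stopped
      // (method name assumed from the examples).
      bpf.shutdown_and_report().unwrap();
  }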
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
There is no reason to have two separate options for "verbose" and
"debug" mode. Just merge the two and always use "debug". If enabled,
increase verbosity to stdout and enable reporting BPF scheduling events
in debugfs (e.g., /sys/kernel/debug/tracing/trace_pipe).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Since scx_rustland_core enables setting a time slice on a per-task basis
during task dispatch, there's no need to maintain a global time slice in
the BPF component. Instead, a global time slice can simply be managed in
user-space, achieving the same outcome.
Therefore, drop the global slice_us property from BpfScheduler to
simplify the API.
NOTE: if a time slice is not specified for a task, SCX_SLICE_DFL will be
used by default.
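In practice, a "global" time slice becomes just a user-space constant
applied to each dispatched task; a minimal sketch (field and method names
assumed from the bundled examples):

  use crate::bpf::{BpfScheduler, DispatchedTask, QueuedTask};

  // User-space default time slice, replacing the old global slice_us.
  const SLICE_NS_DEFAULT: u64 = 5_000_000; // 5ms (arbitrary value)

  // Hypothetical dispatch helper: set the slice per task; leaving
  // slice_ns unset would make the BPF side fall back to SCX_SLICE_DFL.
  fn dispatch(bpf: &mut BpfScheduler, task: &QueuedTask) {
      let mut dispatched = DispatchedTask::new(task);
      dispatched.slice_ns = SLICE_NS_DEFAULT;
      bpf.dispatch_task(&dispatched).unwrap();
  }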
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The update_tasks() API is somewhat confusing, so replace it with a
clearer API, notify_complete().
This new API will return control to the BPF component and inform it
about the number of tasks still pending in the user-space scheduler.
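For example, a dispatch cycle could end along these lines (an
illustrative sketch; the pending-task queue is whatever the scheduler
uses internally):

  use std::collections::VecDeque;

  use crate::bpf::{BpfScheduler, QueuedTask};

  // Hypothetical end of a dispatch cycle: hand control back to the BPF
  // component, telling it how many tasks are still pending in user-space
  // so it knows whether to call the scheduler back soon.
  fn finish_dispatch(bpf: &mut BpfScheduler, pending: &VecDeque<QueuedTask>) {
      bpf.notify_complete(pending.len() as u64);
  }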
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The low-power API is a bit of a hack implemented purely in the BPF
layer; it should be re-implemented with some concept of topology
awareness.
Therefore, get rid of this API for now.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The current API used to notify the user-space scheduler when a task
exits is really confusing (setting a negative value in
queued_task_ctx.cpu), and it's also possible to detect task exit events
from user-space (or check procfs, even if it's slower).
In any case, a better API should be provided for this, so drop the
current one for now.
NOTE: this will cause additional memory usage for scx_rustland, but it
can be fixed/addressed later in a separate commit (i.e., providing a
periodic garbage collector for the unused task entries).
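Such a garbage collector could look roughly like this (purely
illustrative; TaskInfo is a hypothetical per-task state kept by the
scheduler):

  use std::collections::HashMap;
  use std::path::Path;

  // Hypothetical per-task state kept by the user-space scheduler.
  struct TaskInfo {
      vruntime: u64,
  }

  // Illustrative garbage collector: periodically drop state for tasks
  // whose pid no longer exists in procfs, replacing the old "negative
  // cpu" exit notification coming from BPF.
  fn drop_exited_tasks(tasks: &mut HashMap<i32, TaskInfo>) {
      tasks.retain(|pid, _| Path::new(&format!("/proc/{}", pid)).exists());
  }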
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Instead of determining the task time slice in ops.enqueue(), refresh the
time slice immediately before the task is started on its assigned CPU in
ops.running().
This ensures that the exact time slice specified by the user-space
scheduler is applied and that the sched_ext core never implicitly
dispatches tasks using SCX_SLICE_DFL.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Allow the user-space scheduler to pick an idle CPU via
self.bpf.select_cpu(pid, prev_task, flags), mimicking the BPF
select_cpu() interface.
Also remove the full_user option and always rely on the idle selection
logic from user-space.
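The intended usage from user-space could look roughly like this (a
sketch with simplified error handling; the names and the negative-return
convention are assumptions based on the bundled examples):

  use crate::bpf::{BpfScheduler, DispatchedTask, QueuedTask};

  // Hypothetical dispatch path: ask the BPF component for an idle CPU,
  // mimicking select_cpu(), and keep the previously used CPU if none is
  // found (negative return assumed to mean "no idle CPU").
  fn pick_cpu_and_dispatch(bpf: &mut BpfScheduler, task: &QueuedTask) {
      let mut dispatched = DispatchedTask::new(task);

      let cpu = bpf.select_cpu(task.pid, task.cpu, 0);
      dispatched.cpu = if cpu >= 0 { cpu } else { task.cpu };

      bpf.dispatch_task(&dispatched).unwrap();
  }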
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Re-add the partial mode option that was dropped during the refactoring.
The partial option allows applying the scheduler only to tasks that
have their scheduling policy set to SCHED_EXT via sched_setscheduler().
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The API for determining which PID is running on a specific CPU is racy
and is unnecessary since this information can be obtained from user
space.
Additionally, it's not reliable for identifying idle CPUs. Therefore,
it's better to remove this API and, in the future, provide a cpumask
alternative that can export the idle state of the CPUs to user space.
As a consequence, also change scx_rustland to dispatch one task at a
time, instead of dispatching tasks in batches sized to the number of
idle cores (which is usually not accurate due to the racy nature of the
CPU ownership interface).
Dispatching one task at a time also makes the scheduler more performant,
because vruntime scheduling is applied to more tasks sitting in the
scheduler's queue.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Drop the slice boost logic and apply a vruntime and task time slice
evaluation approach similar to scx_bpfland (but implement this in the
user-space component instead of the BPF part).
Additionally, introduce a slice_us_min parameter to define the minimum
time slice that can be assigned to a task, also similar to scx_bpfland.
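Roughly, the idea looks like the following user-space sketch (made-up
helper names, mirroring the spirit of scx_bpfland rather than its exact
code):

  // Illustrative time-slice scaling: shrink the default slice as the
  // amount of queued work grows, but never below a configurable minimum
  // (slice_us_min converted to nanoseconds).
  fn task_slice_ns(slice_ns: u64, slice_ns_min: u64, nr_waiting: u64) -> u64 {
      (slice_ns / (nr_waiting + 1)).max(slice_ns_min)
  }

  // Illustrative vruntime update: charge the used time inversely to the
  // task's weight, so higher-priority tasks accumulate vruntime more
  // slowly.
  fn update_vruntime(vruntime: u64, used_ns: u64, weight: u64) -> u64 {
      vruntime + used_ns * 100 / weight.max(1)
  }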
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Use the same idle selection logic used in scx_bpfland also in
scx_rustland_core.
Also drop fifo_mode and always use the BPF idle selection logic by
default as long as the system is not saturated, unless full_user is
specified.
This approach allows user-space schedulers aiming for maximum
performance to leverage the BPF idle selection logic (bypassing
user-space), while those seeking full control can enable full_user to
bypass the BPF CPU idle selection logic and choose the target CPU for
each task from user-space.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
We don't need to send the number of voluntary context switches (nvcsw)
from BPF to user-space, as this information is already accessible in
user-space via procfs. Sending this data would only create unnecessary
overhead for schedulers that don't require it, and those that do can
easily retrieve it through procfs.
Therefore, drop this metric from scx_rustland_core and change
scx_rustland to implement an interactive task classifier fully in the
user-space part of the scheduler.
Also drop some options that do not provide any significant benefit (also
in preparation for a bigger refactoring to define a better API for the
user-space framework).
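For reference, the voluntary context switch count can be read from
procfs along these lines (an illustrative sketch, not the exact
scx_rustland code):

  use std::fs;

  // Read the number of voluntary context switches of a task directly
  // from /proc/<pid>/status instead of receiving it from the BPF
  // component.
  fn nvcsw(pid: i32) -> Option<u64> {
      let status = fs::read_to_string(format!("/proc/{}/status", pid)).ok()?;
      status
          .lines()
          .find(|l| l.starts_with("voluntary_ctxt_switches:"))
          .and_then(|l| l.split_whitespace().nth(1))
          .and_then(|v| v.parse().ok())
  }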
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Explicitly import timespec to fix the following potential build error:
error[E0422]: cannot find struct, variant or union type `timespec` in this scope
--> src/bpf.rs:365:35
|
365 | sched_ss_repl_period: timespec {
| ^^^^^^^^ not found in this scope
|
help: consider importing this struct
|
6 + use libc::timespec;
|
This fixes issue #469.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Update libbpf-rs & libbpf-cargo to 0.24. Among other things, generated
skeletons now contain directly accessible map and program objects, no
longer necessitating the use of accessor methods. As a result, the risk
for mutability conflicts is reduced greatly.
Signed-off-by: Daniel Müller <deso@posteo.net>
sched_ext is about to be merged upstream. There are some
compatibility-breaking changes and we're making the current
sched_ext/for-6.11 commit 1edab907b57d ("sched_ext/scx_qmap: Pick idle
CPU for direct dispatch on !wakeup enqueues") the baseline.
Tag everything except scx_mitosis as 1.0.0. As scx_mitosis is still in early
development and is currently temporarily disabled, only the patchlevel is
bumped.
We may end up selecting an invalid CPU (according to the task's cpumask)
when dispatching the task via dispatch_direct_cpu().
When this happens, simply return an error without dispatching the task,
and let the caller handle the error: in the context of select_cpu() we
can simply ignore the dispatch and return the target CPU; in the context
of the FIFO mode dispatch we can fall back to SCX_DSQ_LOCAL if the target
CPU is not valid.
This fixes issue #353.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Kick CPUs in the dispatch path only when needed (typically when tasks
are bounced to other CPUs).
Moreover, avoid consuming all the dispatched tasks at once.
This seems to reduce the BPF overhead (according to bpftop), going from
~10% CPU usage down to ~6% CPU usage of rustland_dispatch() on an
overcommitted system, without introducing any measurable performance
regression.
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Allow tasks to be dispatched directly (bypassing the user-space
scheduler) only when the scheduler is operating in FIFO mode.
On an overcommitted system, directly dispatching tasks can only
increase OS noise. These tasks can get a brief priority boost and an
extended time slice just because they found an idle CPU, which can lead
to erratic behavior.
This is particularly problematic when measuring performance stability,
such as evaluating the frames-per-second (fps) of a video game on an
overloaded system.
In such cases, it's better to bounce all tasks to the user-space
scheduler, which will ensure a better level of fairness and smoother
performance.
Moreover, get rid of the second chance dispatch logic introduced in
commit 4791d862 ("scx_rustland_core: second chance CPU migration"). This
seems to provide benefits only on certain architectures (Intel), but it
can introduce lags on others (AMD).
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Report dispatch_direct_cpu() events in the trace, like any other
dispatch-related event.
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The dependency on the buddy-alloc crate [1] seems to cause some trouble
with packaging, mostly because the crate's selftests fail when it is
compiled in release mode.
For example:
$ cargo test --release -- --nocapture
thread 'tests::fast_alloc::test_basic_malloc' panicked at src/tests/fast_alloc.rs:25:13:
assertion `left == right` failed
left: 0
right: 42
Some of these failures with BuddyAlloc can be fixed by using a memory
arena buffer aligned to page size.
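For context, forcing page alignment on the pre-allocated arena can be
expressed along these lines (a sketch assuming a 4 KiB page size and an
arbitrary arena size):

  // Memory arena aligned to the page size (4 KiB assumed); the alignment
  // is what fixes some of the BuddyAlloc selftest failures mentioned
  // above.
  #[repr(align(4096))]
  struct AlignedArena([u8; 1024 * 1024]);

  // The allocator is then handed this buffer, whose base address is now
  // guaranteed to be page aligned.
  static mut ARENA: AlignedArena = AlignedArena([0; 1024 * 1024]);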
However, some test failures with FastAlloc persist that cannot be
resolved merely by aligning the pre-allocated memory arena to the page
size, as mentioned in [2].
The concern is that this may potentially lead to actual memory bugs.
Therefore, it seems safer to refactor the custom allocator code to
simply use BuddyAlloc, dropping FastAlloc completely.
To achieve this, the entire BuddyAlloc code has been directly included
in scx_rustland_core, referencing the original project and its MIT
licensing information (with the entire code still distributed under the
GPLv2 license).
Then the code has been slightly modified to remove FastAlloc and the
external dependency on the buddy-alloc crate has been dropped.
From a performance perspective this change doesn't seem to introduce any
measurable regression.
[1] https://github.com/jjyr/buddy-alloc
[2] https://github.com/jjyr/buddy-alloc/issues/16
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Fix a potential race condition that might lead to a task being
dispatched without kicking the target CPU, which could result in a
potential stall.
With this applied, scx_rustland has been running without any stall for
about 18 hours on a system where the issue was previously quite easy to
reproduce.
Moreover, clarify a couple of comments in the dispatch path.
This fixes issue #353.
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
In preparation for upstreaming, let's set the min version requirement at
the released v6.9 kernels. Drop __COMPAT_SCX_KICK_IDLE. The open helper
macros now check the existence of SCX_KICK_IDLE and abort if not.
In preparation for upstreaming, let's set the min version requirement at
the released v6.9 kernels. Drop __COMPAT_scx_bpf_switch_call(). The open
helper macros now check the existence of SCX_OPS_SWITCH_PARTIAL and abort
if not.
The same change was previously applied with commit 8820af8d
("scx_rustland: enable user-space scheduler to preempt other tasks").
However, it was reverted in commit 732ba490 ("scx_rustland: avoid using
SCX_ENQ_PREEMPT") with the introduction of the global dynamic time
slice (inversely proportional to the number of running tasks), because
it already provided sufficient context switch opportunities, negating
any advantage of the preemption.
With the introduction of the virtual time slice in commit 6f4cd853
("scx_rustland: introduce virtual time slice"), re-introducing the
ability for the user-space scheduler to preempt other tasks now appears
beneficial again.
As confirmed by experimental results, this change helps prevent
potential audio crackling issues and enhances overall system
responsiveness, resulting in a 4-5% fps increase in the usual benchmark
of gaming while recompiling the kernel.
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Tasks dispatched directly using SCX_DSQ_LOCAL may receive excessively
high priority compared to those dispatched by the user-space scheduler.
To avoid this priority disparity, dispatch tasks to the per-CPU DSQs
using direct dispatch and reserve SCX_DSQ_LOCAL for per-CPU kthreads
only.
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>