Considering the CPU where the user-space scheduler is running as busy
doesn't really provide any benefit, since the user-space scheduler is
constantly dispatching an amount of tasks equal to the amount of idle
CPUs and then yields (therefore its own CPU should be considered idle).
Considering the CPU where the user-space scheduler is running as busy
doesn't provide any benefit, as the scheduler consistently dispatches
tasks equal to the number of idle CPUs and then yields (therefore its
own CPU should be considered idle).
This also allows to reduce the overall user-space scheduler CPU
utilization, especially when the system is mostly idle, without
introducing any measurable performance regression.
Measuring the average CPU utilization of a (mostly) idle system over a
time period of 60 sec:
- wihout this patch: 5.41% avg cpu util
- with this patch: 2.26% avg cpu util
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Use virtme-ng to run the schedulers after they're built; virtme-ng
allows to pick an arbitrary sched-ext enabled kernel and run it
virtualizing the entire user-space root filesystem, so we can basically
exceute the recompiled schedulers inside such kernel.
This should allow to catch potential run-time issue in advance (both in
the kernel and the schedulers).
The sched-ext kernel is taken from the Ubuntu ppa (ppa:arighi/sched-ext)
at the moment, since it is the easiest / fastest way to get a
precompiled sched-ext kernel to run inside the Ubuntu 22.04 testing
environment.
The schedulers are tested using the new meson target "test_sched", the
specific actions are defined in meson-scripts/test_sched.
By default each test has a timeout of 30 sec, after the virtme-ng
completes the boot (that should be enough to initialize the scheduler
and run the scheduler for some seconds), while the total lifetime of the
virtme-ng guest is set to 60 sec, after this time the guest will be
killed (this allows to catch potential kernel crashes / hangs).
If a single scheduler fails the test, the entire "test_sched" action
will be interrupted and the overall test result will be considered a
failure.
At the moment scx_layered is excluded from the tests, because it
requires a special configuration (we should probably pre-generate a
default config in the workflow actions and change the scheduler to use
the default config if it's executed without any argument).
Moreover, scx_flatcg is also temporarily excluded from the tests,
because of these known issues:
- https://github.com/sched-ext/scx/issues/49
- https://github.com/sched-ext/sched_ext/pull/101
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Move the logic to activate the userspace scheduler to an update_idle()
callback, which is called when the CPU is about to go idle.
This disables the built-in idle tracking mechanism, so it allows to rely
completely on the internal CPU ownership logic (via get_cpu_owner() and
set_cpu_owner()) and it also allows to share the idle state with the
user-space scheduler via the BPF_MAP_TYPE_ARRAY cpu_map.
Moreover, when the user-space scheduler is activated, kick the idle cpu
to trigger immediate dispatch and avoid bubbles in the scheduling
pipeline.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
This reverts commit 9237e1d ("scx_rustland: always dispatch kthreads on
the local CPU").
Do not always prioritize all kthreads, we may have unbound workqueue
workers that can consume a lot of CPU cycles (e.g., encryption workers),
so we definitely want to apply the scheduling for those.
Therefore, restore the old behavior to prioritize only per-CPU kthreads.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Adding extra overhead to any kthread can potentially slow down the
entire system, so make sure this never happens by dispatching all
kthreads directly on the same local CPU (not just the per-CPU kthreads),
bypassing the user-space scheduler.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Trigger the user-space scheduler only upon a task's CPU release event
(avoiding its activation during each enqueue event) and only if there
are tasks waiting to be processed by the user-space scheduler.
This should save unnecessary calls to the user-space scheduler, reducing
the overall overhead of the scheduler.
Moreover, rename nr_enqueues to nr_queued and store the amount of tasks
currently queued to the user-space scheduler (that are waiting to be
dispatched).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Provide the following primitives to get and set CPU ownership in the BPF
part. This improves code readability and these primitives can be used by
the BPF part as a baseline to implement a better CPU idle tracking in
the future.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
BPF doesn't have full memory model yet, and while strict atomicity might
not be necessary in this context, it is advisable to enhance clarity in
the interlocking model.
To achieve this, provide the following primitives to operate on
usersched_needed:
static void set_usersched_needed(void)
static bool test_and_clear_usersched_needed(void)
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Dispatch tasks in a batch equal to the amount of idle CPUs in the
system.
This allows to reduce the pressure on the dispatcher queues, improving
the effectiveness of the scheduler (by having more tasks sitting in the
scheduler task pool) and mitigating potential priority inversion issues.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Provide an interface for the BPF dispatcher and user-space scheduler to
share CPU information. This information can empower the user-space
scheduler to make more informed decisions and enable the implementation
of a broader range of scheduling policies.
With this change the BPF dispatcher provides a CPU map (one entry per
CPU) that stores the pid that is running on each CPU (0 if the CPU is
idle). The CPU map is updated by the BPF dispatcher in the .running()
and .stopping() callbacks.
The dispatcher then sends to the user-space scheduler a suggestion of
the candidate CPU for each task that needs to run (that is always the
previously used CPU), along with all the task's information.
The user-space scheduler can decide to confirm the selected CPU or to
choose a different one, using all the shared CPU information.
Lastly, the selected CPU is communicated back to the dispatcher along
with all the task's information and the BPF dispatcher takes care of
executing the task on the selected CPU, eventually triggering a
migration.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Do not report an exit error message if it's empty. Moreover, distinguish
between a graceful exit vs a non-graceful exit.
In general, try to follow the behavior of user_exit_info.h for the C
schedulers.
NOTE: in the future the whole exit handling probably can be moved to a
more generic place (scx_utils) to prevent code duplication across
schedulers and also to prevent small inconsistencies like this one.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Rename scx_rustlite to scx_rustland to better represent the mirroring of
scx_userland (in C), but implemented in Rust.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
This scheduler is made of a BPF component (dispatcher) that implements
the low level sched-ext functionalities and a user-space counterpart
(scheduler), written in Rust, that implements the actual scheduling
policy.
The main goal of this scheduler is to be easy to read and well
documented, so that newcomers (i.e., students, researchers, junior devs,
etc.) can use this as a template to quickly experiment scheduling
theory.
For this reason the design of this scheduler is mostly focused on
simplicity and code readability.
Moreover, the BPF dispatcher is completely agnostic of the particular
scheduling policy implemented by the user-space scheduler. For this
reason developers that are willing to use this scheduler to experiment
scheduling policies should be able to simply modify the Rust component,
without having to deal with any internal kernel / BPF details.
Future improvements:
- Transfer the responsibility of determining the CPU for executing a
particular task to the user-space scheduler.
Right now this logic is still fully implemented in the BPF part and
the user-space scheduler can only decide the order of execution of
the tasks, that significantly restricts the scheduling policies that
can be implemented in the user-space scheduler.
- Experiment the possibility to send tasks from the user-space
scheduler to the BPF dispatcher using a batch size, instead of
draining the task queue completely and sending all the tasks at once
every single time.
A batch size should help to reduce the overhead and it should also
help to reduce the wakeups of the user-space scheduler.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Andrea pointed out that we can and should be using Ubuntu 22.04.
Unfortunately it still doesn't ship some of the deps we need like
clang-17, but it does at least ship virtme-ng, so it's good for us to
use this so that we can actually test running the schedulers in a
virtme-ng VM when it supports being run in docker.
Also, update the job to run on pushes, and not just when a PR is opened
Suggested-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: David Vernet <void@manifault.com>
When Ubuntu ships with sched_ext, we can also maybe test loading the
schedulers (not sure if the runners can run as root though). For now, we
should at least have a CI job that lets us verify that the schedulers
can _build_. To that end, this patch adds a basic CI action that builds
the schedulers.
As is, this is a bit brittle in that we're having to manually download
and install a few dependencies. I don't see a better way for now without
hosting our own runners with our own containers, but that's a bigger
investment. For now, hopefully this will get us _some_ coverage.
Signed-off-by: David Vernet <void@manifault.com>
The core sched code calls select_task_rq() in a few places: the task
wakeup path (typical path), the fork() path, and the exec() path. For
nest scheduling, we don't want to select a core from the nest on the
exec() path. If we were previously able to find an idle core, we would
have found it on the fork() path, so we don't gain much by checking on
the exec() path. In fact, it's actually harmful, because we could
incorrectly blow up the primary nest unnecessarily by bumping the same
task between multiple cores for no reason. Let's just opt-out of
select_task_rq calls on the exec() path.
Suggested-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: David Vernet <void@manifault.com>
Julia pointed out that our current implementation of r_impatient is
incorrect. r_impatient is meant to be a mechanism for more aggressively
growing the primary nest if a task repeatedly isn't able to find a core.
Right now, we trigger r_impatient if we're not able to find an attached
or previous core in the primary nest, but we _should_ be triggering it
only if we're unable to find _any_ core in the primary nest. Fixing the
implementation to do this drastically decreases how aggressively we grow
the primary nest when r_impatient is in effect.
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: David Vernet <void@manifault.com>
- combine c and kernel-examples as it's confusing to have both
- rename 'rust-user' and 'c-user' to just 'rust' and 'c', which is simpler
- update and fix sync-to-kernel.sh
This is a follow on to #32, which got reverted. I wrongly assumed that
scx_rusty resides in the sched_ext tree and consumes published version
of scx_utils.
With this change we update the other in-tree dependencies. I built
scx_layered & scx_rusty. I bumped scx-utils to 0.4, because the
libbpf-cargo seems to be part of the public API surface and libbpf-cargo
0.21 and 0.22 are not compatible with each other.
Signed-off-by: Daniel Müller <deso@posteo.net>