Make sure the scheduler is not activated if we are dealing with the
last task running.
This consistently reduces scx_rustland CPU usage on systems that are
mostly idle (and avoids unnecessary power consumption).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
When a task is dispatched we always try to pick the previously used CPU
(if idle) to minimize the migration overhead. Alternatively, if that CPU
is not available, we pick any other idle CPU in the system.
However, we don't update the list of idle CPUs as we dispatch tasks,
therefore we may end up sending multiple tasks to the same idle CPU (if
their previously used CPU is the same) and we may even skip some idle
CPUs completely.
Change this logic to make sure that we never dispatch multiple tasks to
the same idle CPU, by updating the list of idle CPUs as we send tasks to
the BPF dispatcher.
This also avoids dispatching tasks with closely matched vruntimes to
the same CPU, which would negate the advantages of the vruntime ordering.
With this change in place, we ensure that tasks with a similar vruntime
are dispatched to different CPUs, leading to significant improvements in
latency performance.
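Below is a conceptual C sketch of the idea (the actual selection logic
lives in the Rust user-space scheduler; the bitmask representation and
helper name are purely illustrative):

/*
 * Conceptual sketch: keep an idle-CPU bitmask and clear a CPU's bit as
 * soon as a task is dispatched to it, preferring the previously used
 * CPU, so the next task in the batch picks a different idle CPU.
 */
typedef unsigned long long idle_mask_t; /* one bit per CPU (<= 64 CPUs) */

static int pick_idle_cpu(idle_mask_t *idle_mask, int prev_cpu)
{
    int cpu = -1;

    if (*idle_mask & (1ULL << prev_cpu))
        cpu = prev_cpu;                    /* prefer the previous CPU */
    else if (*idle_mask)
        cpu = __builtin_ctzll(*idle_mask); /* any other idle CPU */

    if (cpu >= 0)
        *idle_mask &= ~(1ULL << cpu);      /* CPU is no longer idle */

    return cpu; /* -1: no idle CPU, the caller keeps the task queued */
}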
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
When the scheduler decides to assign a different CPU to the task, always
make sure the assignment is valid according to the task's cpumask. If
it's not valid, simply dispatch the task to the global DSQ.
This prevents the scheduler from exiting with errors like this:
09:11:02 [WARN] EXIT: SCX_DSQ_LOCAL[_ON] verdict target cpu 7 not allowed for gcc[440718]
In the future we may want to move this check directly into the
user-space scheduler, but for now let's keep it in the BPF dispatcher as
a quick fix.
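A minimal sketch of the check, using sched_ext BPF helpers (the exact
dispatch site, slice value and flags in scx_rustland may differ):

static void dispatch_task(struct task_struct *p, s32 cpu, u64 slice_ns)
{
    if (bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
        /* The selected CPU is allowed: use its local DSQ. */
        scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, slice_ns, 0);
    else
        /* Invalid selection: fall back to the global DSQ. */
        scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, slice_ns, 0);
}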
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The current CPU selection logic in the scheduler presents some
inefficiencies.
When a task is drained from the BPF queue, the scheduler immediately
checks whether the CPU previously assigned to the task is still idle,
assigning it if it is. Otherwise, it iterates through available CPUs,
always starting from CPU #0, and selects the first idle one without
updating its state. This approach is consistently applied to the entire
batch of tasks drained from the BPF queue, resulting in all of them
being assigned to the same idle CPU (also with a higher likelihood of
allocation to lower CPU ids rather than higher ones).
While dispatching a batch of tasks to the same idle CPU is not
necessarily problematic, a fairer distribution among the list of idle
CPUs would be preferable.
Therefore change the CPU selection logic to distribute tasks equally
among the idle CPUs, still maintaining the preference for the previously
used one. Additionally, apply the CPU selection logic just before tasks
are dispatched, rather than assigning a CPU when tasks are drained from
the BPF queue. This adjustment is important, because tasks may linger in
the scheduler's internal structures for a bit and the idle state of the
CPUs in the system may change during that period.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
We want to activate the user-space scheduler only when there are pending
tasks that require scheduling actions.
To do so we keep track of the queued tasks via nr_queued, which is
incremented in .enqueue() when a task is sent to the user-space
scheduler and decremented in .dispatch() when a task is dispatched.
However, we may trigger an imbalance if the same pid is sent to the
scheduler multiple times (because the scheduler stores all the tasks by
their unique pid).
When this happens nr_queued is never decremented back to 0, leading the
user-space scheduler to constantly spin, even if there's no activity to
do.
To prevent this from happening, split nr_queued into nr_queued and
nr_scheduled. The former is updated by the BPF component every time a
task is sent to the scheduler, and it's up to the user-space scheduler
to reset the counter when the queue is fully drained. The latter is
maintained by the user-space scheduler and represents the number of
tasks that are still being processed by the scheduler and are waiting to
be dispatched.
The sum of nr_queued + nr_scheduled will be called nr_waiting and we can
rely on this metric to determine if the user-space scheduler has some
pending work to do or not.
This change makes scx_rustland more reliable and strongly reduces the
CPU usage of the user-space scheduler by eliminating a lot of
unnecessary activations.
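A rough sketch of the bookkeeping described above (variable placement
and types are simplified; the split between the BPF side and the Rust
side is not shown):

/* nr_queued is bumped by BPF on .enqueue() and cleared by user space
 * after draining the queue; nr_scheduled is maintained by user space. */
volatile u64 nr_queued;
volatile u64 nr_scheduled;

/* The user-space scheduler only needs to run when there is pending work. */
static bool usersched_has_pending_tasks(void)
{
    return nr_queued + nr_scheduled;
}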
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Considering the CPU where the user-space scheduler is running as busy
doesn't provide any benefit, as the scheduler consistently dispatches
tasks equal to the number of idle CPUs and then yields (therefore its
own CPU should be considered idle).
This also reduces the overall user-space scheduler CPU utilization,
especially when the system is mostly idle, without introducing any
measurable performance regression.
Measuring the average CPU utilization of a (mostly) idle system over a
time period of 60 sec:
- without this patch: 5.41% avg cpu util
- with this patch: 2.26% avg cpu util
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Move the logic to activate the userspace scheduler to an update_idle()
callback, which is called when the CPU is about to go idle.
This disables the built-in idle tracking mechanism, allowing us to rely
entirely on the internal CPU ownership logic (via get_cpu_owner() and
set_cpu_owner()) and to share the idle state with the user-space
scheduler via the BPF_MAP_TYPE_ARRAY cpu_map.
Moreover, when the user-space scheduler is activated, kick the idle CPU
to trigger an immediate dispatch and avoid bubbles in the scheduling
pipeline.
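A simplified sketch of the callback (set_cpu_owner() and
set_usersched_needed() are the primitives mentioned in this series;
the pending-task check is an assumed helper built on the
nr_queued/nr_scheduled counters):

void BPF_STRUCT_OPS(rustland_update_idle, s32 cpu, bool idle)
{
    if (!idle)
        return;
    /* Share the idle state with the user-space scheduler. */
    set_cpu_owner(cpu, 0);
    if (usersched_has_pending_tasks()) {
        set_usersched_needed();
        /* Kick the CPU to avoid bubbles in the scheduling pipeline. */
        scx_bpf_kick_cpu(cpu, 0);
    }
}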
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
This reverts commit 9237e1d ("scx_rustland: always dispatch kthreads on
the local CPU").
Do not always prioritize all kthreads: we may have unbound workqueue
workers that can consume a lot of CPU cycles (e.g., encryption workers),
so we definitely want to apply the scheduling policy to those.
Therefore, restore the old behavior of prioritizing only per-CPU
kthreads.
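For illustration, a sketch of the restored filter (PF_KTHREAD is
redefined locally because kernel macros are not exported via BTF; the
helper name is illustrative):

#define PF_KTHREAD 0x00200000 /* from include/linux/sched.h */

/*
 * Only per-CPU kthreads bypass the user-space scheduler; unbound
 * kthreads (e.g., unbound workqueue workers) take the normal path.
 */
static bool is_percpu_kthread(const struct task_struct *p)
{
    return (p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1;
}

Tasks matching this check can then be dispatched directly to the local
DSQ as before, while all other kthreads go through the user-space
scheduler.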
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Adding extra overhead to any kthread can potentially slow down the
entire system, so make sure this never happens by dispatching all
kthreads directly on the same local CPU (not just the per-CPU kthreads),
bypassing the user-space scheduler.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Trigger the user-space scheduler only upon a task's CPU release event
(avoiding its activation during each enqueue event) and only if there
are tasks waiting to be processed by the user-space scheduler.
This should avoid unnecessary calls to the user-space scheduler,
reducing its overall overhead.
Moreover, rename nr_enqueues to nr_queued and use it to store the number
of tasks currently queued to the user-space scheduler (i.e., waiting to
be dispatched).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Provide primitives to get and set CPU ownership in the BPF part
(get_cpu_owner() and set_cpu_owner()). This improves code readability,
and these primitives can be used by the BPF part as a baseline to
implement better CPU idle tracking in the future.
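A minimal sketch of what these primitives could look like on top of a
per-CPU ownership array (cpu_map holds the pid owning each CPU, 0
meaning idle; the real implementation may differ):

static u32 get_cpu_owner(u32 cpu)
{
    u32 *owner = bpf_map_lookup_elem(&cpu_map, &cpu);

    return owner ? *owner : 0;
}

static void set_cpu_owner(u32 cpu, u32 pid)
{
    u32 *owner = bpf_map_lookup_elem(&cpu_map, &cpu);

    if (owner)
        *owner = pid;
}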
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
BPF doesn't have a full memory model yet, and while strict atomicity
might not be necessary in this context, it is advisable to make the
interlocking model clearer.
To achieve this, provide the following primitives to operate on
usersched_needed:
static void set_usersched_needed(void)
static bool test_and_clear_usersched_needed(void)
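A possible implementation sketch using the GCC/clang atomic built-ins
supported by BPF (details may differ from the actual code):

/* Global flag shared between the enqueue and dispatch paths (sketch). */
static volatile u32 usersched_needed;

static void set_usersched_needed(void)
{
    __sync_fetch_and_or(&usersched_needed, 1);
}

static bool test_and_clear_usersched_needed(void)
{
    return __sync_fetch_and_and(&usersched_needed, 0);
}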
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Dispatch tasks in batches whose size equals the number of idle CPUs in
the system.
This reduces the pressure on the dispatcher queues, improving the
effectiveness of the scheduler (by keeping more tasks in the scheduler's
task pool) and mitigating potential priority inversion issues.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Provide an interface for the BPF dispatcher and user-space scheduler to
share CPU information. This information can empower the user-space
scheduler to make more informed decisions and enable the implementation
of a broader range of scheduling policies.
With this change the BPF dispatcher provides a CPU map (one entry per
CPU) that stores the pid that is running on each CPU (0 if the CPU is
idle). The CPU map is updated by the BPF dispatcher in the .running()
and .stopping() callbacks.
For each task that needs to run, the dispatcher then sends the
user-space scheduler a suggested candidate CPU (always the previously
used CPU), along with all the task's information.
The user-space scheduler can decide to confirm the selected CPU or to
choose a different one, using all the shared CPU information.
Lastly, the selected CPU is communicated back to the dispatcher along
with all the task's information, and the BPF dispatcher takes care of
executing the task on the selected CPU, triggering a migration if
needed.
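An illustrative sketch of the shared map and the callbacks that keep it
up to date (MAX_CPUS and the callback names are placeholders):

#define MAX_CPUS 1024 /* placeholder */

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, MAX_CPUS);
    __type(key, u32);
    __type(value, u32);
} cpu_map SEC(".maps");

void BPF_STRUCT_OPS(rustland_running, struct task_struct *p)
{
    u32 cpu = bpf_get_smp_processor_id(), pid = p->pid;

    /* Record that this CPU is now running @p. */
    bpf_map_update_elem(&cpu_map, &cpu, &pid, BPF_ANY);
}

void BPF_STRUCT_OPS(rustland_stopping, struct task_struct *p, bool runnable)
{
    u32 cpu = bpf_get_smp_processor_id(), idle = 0;

    /* Mark this CPU as idle (0 = no owner). */
    bpf_map_update_elem(&cpu_map, &cpu, &idle, BPF_ANY);
}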
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Do not report an exit error message if it's empty. Moreover, distinguish
between a graceful and a non-graceful exit.
In general, try to follow the behavior of user_exit_info.h for the C
schedulers.
NOTE: in the future the whole exit handling can probably be moved to a
more generic place (scx_utils) to prevent code duplication across
schedulers and to prevent small inconsistencies like this one.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Rename scx_rustlite to scx_rustland to better represent the mirroring of
scx_userland (in C), but implemented in Rust.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
This scheduler is made of a BPF component (dispatcher) that implements
the low level sched-ext functionalities and a user-space counterpart
(scheduler), written in Rust, that implements the actual scheduling
policy.
The main goal of this scheduler is to be easy to read and well
documented, so that newcomers (i.e., students, researchers, junior devs,
etc.) can use it as a template to quickly experiment with scheduling
theory.
For this reason the design of this scheduler is mostly focused on
simplicity and code readability.
Moreover, the BPF dispatcher is completely agnostic of the particular
scheduling policy implemented by the user-space scheduler. For this
reason, developers who want to use this scheduler to experiment with
scheduling policies should be able to simply modify the Rust component,
without having to deal with any internal kernel / BPF details.
Future improvements:
- Transfer the responsibility of determining the CPU for executing a
particular task to the user-space scheduler.
Right now this logic is still fully implemented in the BPF part and
the user-space scheduler can only decide the order of execution of
the tasks, which significantly restricts the scheduling policies that
can be implemented in the user-space scheduler.
- Experiment with sending tasks from the user-space scheduler to the
BPF dispatcher in batches of a given size, instead of draining the
task queue completely and sending all the tasks at once every time.
A batch size should help reduce the overhead and also reduce the
number of wakeups of the user-space scheduler.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
When Ubuntu ships with sched_ext, we can also maybe test loading the
schedulers (not sure if the runners can run as root though). For now, we
should at least have a CI job that lets us verify that the schedulers
can _build_. To that end, this patch adds a basic CI action that builds
the schedulers.
As is, this is a bit brittle in that we're having to manually download
and install a few dependencies. I don't see a better way for now without
hosting our own runners with our own containers, but that's a bigger
investment. For now, hopefully this will get us _some_ coverage.
Signed-off-by: David Vernet <void@manifault.com>
The core sched code calls select_task_rq() in a few places: the task
wakeup path (typical path), the fork() path, and the exec() path. For
nest scheduling, we don't want to select a core from the nest on the
exec() path. If we were previously able to find an idle core, we would
have found it on the fork() path, so we don't gain much by checking on
the exec() path. In fact, it's actually harmful, because we could
incorrectly blow up the primary nest unnecessarily by bumping the same
task between multiple cores for no reason. Let's just opt-out of
select_task_rq calls on the exec() path.
Suggested-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: David Vernet <void@manifault.com>
Julia pointed out that our current implementation of r_impatient is
incorrect. r_impatient is meant to be a mechanism for more aggressively
growing the primary nest if a task repeatedly isn't able to find a core.
Right now, we trigger r_impatient if we're not able to find an attached
or previous core in the primary nest, but we _should_ be triggering it
only if we're unable to find _any_ core in the primary nest. Fixing the
implementation to do this drastically decreases how aggressively we grow
the primary nest when r_impatient is in effect.
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: David Vernet <void@manifault.com>
- combine c and kernel-examples as it's confusing to have both
- rename 'rust-user' and 'c-user' to just 'rust' and 'c', which is simpler
- update and fix sync-to-kernel.sh
This is a follow-on to #32, which got reverted. I wrongly assumed that
scx_rusty resides in the sched_ext tree and consumes the published
version of scx_utils.
With this change we update the other in-tree dependencies. I built
scx_layered & scx_rusty. I bumped scx-utils to 0.4, because libbpf-cargo
seems to be part of the public API surface, and libbpf-cargo 0.21 and
0.22 are not compatible with each other.
Signed-off-by: Daniel Müller <deso@posteo.net>
With commit 48bba8e ("scx_userland: survive to dispatch failures")
scx_userland can better tolerate dispatch failures, so we can reduce
MAX_ENQUEUED_TASKS a bit and align it with the size used in
bpf_repeat(), when tasks are actually dispatched in the BPF counterpart.
This allows reducing the memory footprint of the scheduler and makes it
more consistent between enqueue and dispatch events.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
If the scheduler fails to dispatch a task we immediately give up,
exiting with an error like the following:
Failed to dispatch task 251 in 1
EXIT: BPF scheduler unregistered
This scenario can be simulated by dramatically decreasing the value of
MAX_ENQUEUED_TASKS.
We can make the scheduler a little more robust simply by re-adding the
task that cannot be dispatched to vruntime_head and stopping the
dispatch of additional tasks in the same batch.
This gives enough room, under such a "dispatch overload" condition, to
catch up and resume normal execution without crashing.
Moreover, introduce nr_vruntime_failed to report failed dispatch events
in the scheduler's statistics.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Currently the array of enqueued tasks is statically allocated to a fixed
size of USERLAND_MAX_TASKS to avoid potential deadlocks that could be
introduced by performing dynamic allocations in the enqueue path.
However, this also adds a limit on the maximum pid that the scheduler
can handle, since the pid is used as the index to access the array.
In fact, it is quite easy to trigger the following failure on an average
desktop system (making this scheduler pretty much unusable in such a
scenario):
$ sudo scx_userland
...
Failed to enqueue task 33258: No such file or directory
EXIT: BPF scheduler unregistered
Prevent this by using sysctl's kernel.pid_max as the size of the tasks
array (and still allocate it all at once during initialization).
The downside of this change is that scx_userland may require additional
memory to start, and on small systems it could even trigger OOMs. For
this reason, add an explicit message to the command help, suggesting to
reduce kernel.pid_max in case of OOM conditions.
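A sketch of the idea (the struct layout and error handling are
simplified compared to scx_userland):

#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for scx_userland's per-task bookkeeping. */
struct enqueued_task {
    long pid;
    unsigned long long sum_exec_runtime;
};

static struct enqueued_task *tasks;
static long nr_tasks;

/* Size the task array from kernel.pid_max, allocated once at startup. */
static int init_task_array(void)
{
    FILE *f = fopen("/proc/sys/kernel/pid_max", "r");

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &nr_tasks) != 1) {
        fclose(f);
        return -1;
    }
    fclose(f);

    tasks = calloc(nr_tasks, sizeof(*tasks));
    return tasks ? 0 : -1;
}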
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
For the case where many tasks popped from the central queue cannot be
dispatched to the local DSQ of the target CPU, we keep bouncing them to
the fallback DSQ and continue the dispatch_to_cpu loop until we find one
that can be dispatched to the local DSQ of the target CPU.
In a contrived case, it may be that all tasks are pinned to CPUs !=
target CPU and, due to their affinity, cannot be dispatched to that
CPU's local DSQ. If all of them are filling up the central queue, then
we keep looping in the dispatch_to_cpu loop and eventually run out of
slots in the dispatch buffer. The nr_mismatched counter will quickly
rise and sched-ext will notice the error and unload the BPF scheduler.
To remedy this, ensure that we break out of the dispatch_to_cpu loop
when we can no longer perform a dispatch operation. The outer loop in
central_dispatch for the central CPU should ensure that the loop breaks
when we run out of these slots, schedule a self-IPI to the central core,
and allow sched-ext to consume the dispatch buffer before restarting the
dispatch loop.
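An illustrative, simplified version of the loop with the added bail-out
(map and constant names loosely follow scx_central but are not copied
verbatim):

#define FALLBACK_DSQ_ID 1 /* illustrative */

struct {
    __uint(type, BPF_MAP_TYPE_QUEUE);
    __uint(max_entries, 4096);
    __type(value, s32);
} central_q SEC(".maps");

static bool dispatch_to_cpu(s32 cpu)
{
    struct task_struct *p;
    s32 pid;

    bpf_repeat(4096) {
        if (bpf_map_pop_elem(&central_q, &pid))
            break;

        p = bpf_task_from_pid(pid);
        if (!p)
            continue;

        if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
            /* Can't run here: bounce to the fallback DSQ. */
            scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_DFL, 0);
            bpf_task_release(p);
            /*
             * Stop once the dispatch buffer is exhausted; the caller
             * kicks the central CPU so we can retry after sched-ext
             * has consumed the buffer.
             */
            if (!scx_bpf_dispatch_nr_slots())
                return false;
            continue;
        }

        scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, 0);
        bpf_task_release(p);
        return true;
    }
    return false;
}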
A basic way to reproduce this scenario is to do:
taskset -c 0 perf bench sched messaging
The error in the kernel will be:
sched_ext: BPF scheduler "central" errored, disabling
sched_ext: runtime error (dispatch buffer overflow)
bpf_prog_6a473147db3cec67_dispatch_to_cpu+0xc2/0x19a
bpf_prog_c9e51ba75372a829_central_dispatch+0x103/0x1a5
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Introduce an option to enable/disable the build of all the Rust
sub-projects.
This can be useful for building scx on systems where Rust is not fully
supported (e.g., armhf).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
We should explicitly use u64 for hweight_gen to prevent the following
build failures on 32-bit architectures:
scheds/kernel-examples/scx_flatcg.p/scx_flatcg.bpf.skel.h: In function ‘scx_flatcg__assert’:
scheds/kernel-examples/scx_flatcg.p/scx_flatcg.bpf.skel.h:3523:9: error: static assertion failed: "unexpected size of \'hweight_gen\'"
3523 | _Static_assert(sizeof(s->data->hweight_gen) == 8, "unexpected size of 'hweight_gen'");
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
When printing scheduler statistics we use %lu to print u64 values, which
works well on 64-bit architectures, but on 32-bit architectures we get
errors like the following:
106 | printf("total :%10lu local:%10lu queued:%10lu lost:%10lu\n",
| ~~~~^
| |
| long unsigned int
| %10llu
107 | skel->bss->nr_total,
| ~~~~~~~~~~~~~~~~~~~
| |
| u64 {aka long long unsigned int}
Fix this by using the proper format %llu.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Use the compiler's built-in stack initialization instead of memset().
In this way we can get rid of the string.h include and make
cross-compilation easier in certain small environments (e.g., arm).
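For example (the struct name is hypothetical; the empty initializer is a
GCC/clang extension, also standardized in C23):

struct task_info {
    int pid;
    unsigned long long vruntime;
};

static void example(void)
{
    /* Compiler zero-initializes the struct; no memset()/<string.h> needed. */
    struct task_info info = {};

    (void)info;
}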
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
It seems that under certain conditions, the difference between the
current and the previous procfs::CpuStat values may become negative,
triggering the following crash/trace:
thread 'main' panicked at /build/rustc-VvCkKl/rustc-1.73.0+dfsg0ubuntu1/library/core/src/ops/arith.rs:217:1:
attempt to subtract with overflow
stack backtrace:
...
19: 0x590d8481909e - scx_rusty::calc_util::h46f2af9c512c2ecd
at /home/arighi/src/scx/scheds/rust-user/scx_rusty/src/main.rs:217:31
20: 0x590d8481c794 - scx_rusty::Tuner::step::h2e51076f043a8593
at /home/arighi/src/scx/scheds/rust-user/scx_rusty/src/main.rs:444:38
21: 0x590d84828270 - scx_rusty::Scheduler::run::hb5483f1e585f52fe
at /home/arighi/src/scx/scheds/rust-user/scx_rusty/src/main.rs:1198:17
22: 0x590d848289e9 - scx_rusty::main::h9ba8c62ad33aeee1
...
Prevent this by introducing a sub_or_zero() helper function that returns
zero if the difference is negative.
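The helper itself lives in the Rust code; conceptually it is just a
saturating subtraction, e.g. in C:

/* Saturate at zero instead of wrapping on underflow. */
static inline unsigned long long sub_or_zero(unsigned long long curr,
                                             unsigned long long prev)
{
    return curr > prev ? curr - prev : 0;
}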
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
In scx_nest, we currently count the number of times that a core is
scheduled for compaction before we eventually just eagerly compact the
core. The idea is that the core could thrash between being scheduled and
then "de-scheduled" for compaction if there are a couple of tasks that
are bouncing between cores in the primary nest often enough to kick them
out of being compacted.
We're currently resetting schedulings when a core is eagerly compacted,
but to be precise we should probably also reset the count when a core
consumes a task from the fallback DSQ, as this indicates that the system
is overcommitted and that we likely won't benefit from compacting the
primary nest.
Signed-off-by: David Vernet <void@manifault.com>
The scx_nest scheduler seems to be behaving well. Let's merge it to the
scx repo so that CachyOS can package and use it more easily.
Signed-off-by: David Vernet <void@manifault.com>
We were assigning curr to prev stats, and vice versa, in calc_util().
This was causing the following crash on debug builds:
[void@maniforge scheds]$ sudo RUST_BACKTRACE=1 scx_rusty
00:00:56 [INFO] CPUs: online/possible = 32/32
00:00:56 [INFO] DOM[00] cpumask 0000000000FF00FF (16 cpus)
00:00:56 [INFO] DOM[01] cpumask 00000000FF00FF00 (16 cpus)
00:00:56 [INFO] Rusty Scheduler Attached
thread 'main' panicked at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/core/src/ops/arith.rs:217:1:
attempt to subtract with overflow
stack backtrace:
0: rust_begin_unwind
at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/std/src/panicking.rs:597:5
1: core::panicking::panic_fmt
at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/core/src/panicking.rs:72:14
2: core::panicking::panic
at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/core/src/panicking.rs:127:5
3: <u64 as core::ops::arith::Sub>::sub
at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/core/src/ops/arith.rs:217:1
4: <&u64 as core::ops::arith::Sub<&u64>>::sub
at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/core/src/internal_macros.rs:55:17
5: scx_rusty::calc_util
at ./rust-user/scx_rusty/src/main.rs:216:29
6: scx_rusty::Tuner::step
at ./rust-user/scx_rusty/src/main.rs:444:38
7: scx_rusty::Scheduler::run
at ./rust-user/scx_rusty/src/main.rs:1198:17
8: scx_rusty::main
at ./rust-user/scx_rusty/src/main.rs:1261:5
9: core::ops::function::FnOnce::call_once
at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Flip them to avoid the crash. Rusty now runs fine.
Signed-off-by: David Vernet <void@manifault.com>
There's a fairly comprehensive README in the kernel's tools/sched_ext
directory which describes each of the example schedulers. Let's pull it
into this repository, and split it across the various subdirectories
containing the kernel-examples/ schedulers and the rust-user/
schedulers.
Signed-off-by: David Vernet <void@manifault.com>
SCX_DSQ_GLOBAL no longer supports vtime dispatching. scx_simple uses
it to do vtime scheduling, so let's update it to create and use a
separate DSQ that it can both FIFO and PRIQ dispatch to.
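A rough sketch of the resulting structure (the DSQ id, the FIFO toggle
and the vtime bookkeeping are simplified):

#define SHARED_DSQ 0

static bool fifo_sched; /* illustrative FIFO/vtime toggle */

s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
{
    /* Create the shared DSQ at init time. */
    return scx_bpf_create_dsq(SHARED_DSQ, -1);
}

void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
{
    if (fifo_sched)
        scx_bpf_dispatch(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
    else
        scx_bpf_dispatch_vtime(p, SHARED_DSQ, SCX_SLICE_DFL,
                               p->scx.dsq_vtime, enq_flags);
}

void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev)
{
    /* Consume from the shared DSQ instead of SCX_DSQ_GLOBAL. */
    scx_bpf_consume(SHARED_DSQ);
}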
Signed-off-by: David Vernet <void@manifault.com>
tp_cgroup_attach_task() walks p->thread_group to visit all member threads
and set tctx->refresh_layer. However, the upstream kernel recently removed
p->thread_group in 8e1f385104ac ("kill task_struct->thread_group"), as it
was mostly a duplicate of the p->signal->thread_head list, which goes
through p->thread_node.
Switch to iterating via p->thread_node instead, add a comment explaining
why the cgroup TP is used instead of scx_ops.cgroup_move(), and make
iteration failure non-fatal as the iteration is racy.
As in scx_layered, bpf_map_delete_elem() can fail due to recursion
protection triggering spuriously, which can then lead to task_ctx
creation failure after PIDs wrap. Work around this by dropping
BPF_NOEXIST.
The scx repo is going to serve as the source of truth for sched_ext
schedulers. Reverse the sync direction and include syncing rust-user
schedulers too.