If the scheduler fails to dispatch a task we immediately give up,
exiting with an error like the following:
Failed to dispatch task 251 in 1
EXIT: BPF scheduler unregistered
This scenario can be simulated decreasing dramatically the value of
MAX_ENQUEUED_TASKS.
We can make the scheduler a little more robust simply by re-adding the
task that cannot be dispatched to vruntime_head and stop dispatching
additional tasks in the same batch.
This can give enough room, under such "dispatch overload" condition, to
catch up and resume the normal execution without crashing.
Moreover, introduce nr_vruntime_failed to report failed dispatch events
in the scheduler's statistics.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Currently the array of enqueued tasks is statically allocated to a fixed
size of USERLAND_MAX_TASKS to avoid potential deadlocks that could be
introduced by performing dynamic allocations in the enqueue path.
However, this also adds a limit on the maximum pid that the scheduler
can handle, since the pid is used as the index to access the array.
In fact, it is quite easy to trigger the following failure on an average
desktop system (making this scheduler pretty much unusable in such
scenario):
$ sudo scx_userland
...
Failed to enqueue task 33258: No such file or directory
EXIT: BPF scheduler unregistered
Prevent this by using sysctl's kernel.pid_max as the size of the tasks
array (and still allocate it all at once during initialization).
The downside of this change is that scx_userland may require additional
memory to start and in small systems it could even trigger OOMs. For
this reason add an explicit message to the command help, suggesting to
reduce kernel.pid_max in case of OOM conditions.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
For the case where many tasks being popped from the central queue cannot
be dispatched to the local DSQ of the target CPU, we will keep bouncing
them to the fallback DSQ and continue the dispatch_to_cpu loop until we
find one which can be dispatch to the local DSQ of the target CPU.
In a contrived case, it might be so that all tasks pin themselves to
CPUs != target CPU, and due to their affinity cannot be dispatched to
that CPU's local DSQ. If all of them are filling up the central queue,
then we will keep looping in the dispatch_to_cpu loop and eventually run
out of slots for the dispatch buffer. The nr_mismatched counter will
quickly rise and sched-ext will notice the error and unload the BPF
scheduler.
To remedy this, ensure that we break the dispatch_to_cpu loop when we
can no longer perform a dispatch operation. The outer loop in
central_dispatch for the central CPU should ensure the loop breaks when
we run out of these slots and schedule a self-IPI to the central core,
and allow sched-ext to consume the dispatch buffer before restarting the
dispatch loop again.
A basic way to reproduce this scenario is to do:
taskset -c 0 perf bench sched messaging
The error in the kernel will be:
sched_ext: BPF scheduler "central" errored, disabling
sched_ext: runtime error (dispatch buffer overflow)
bpf_prog_6a473147db3cec67_dispatch_to_cpu+0xc2/0x19a
bpf_prog_c9e51ba75372a829_central_dispatch+0x103/0x1a5
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Introduce an option to enable/disable the build of all the Rust
sub-projects.
This can be useful to build scx on those systems where Rust is not
fully supported (e.g., armhf).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
We should explicitly use u64 for hweight_gen to prevent the following
build failures on 32-bit architectures:
scheds/kernel-examples/scx_flatcg.p/scx_flatcg.bpf.skel.h: In function ‘scx_flatcg__assert’:
scheds/kernel-examples/scx_flatcg.p/scx_flatcg.bpf.skel.h:3523:9: error: static assertion failed: "unexpected size of \'hweight_gen\'"
3523 | _Static_assert(sizeof(s->data->hweight_gen) == 8, "unexpected size of 'hweight_gen'");
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
When printing scheduler statistics we use %lu to print u64 values, that
works well on 64-bit architectures, but on 32-bit architectures we get
errors like the following:
106 | printf("total :%10lu local:%10lu queued:%10lu lost:%10lu\n",
| ~~~~^
| |
| long unsigned int
| %10llu
107 | skel->bss->nr_total,
| ~~~~~~~~~~~~~~~~~~~
| |
| u64 {aka long long unsigned int}
Fix this by using the proper format %llu.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Use compiler's built-in stack initialization instead of memset().
In this way we can get rid of the string.h include and make
cross-compilation easier in certain small environments (i.e., arm).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
It seems that under certain conditions, the difference between the
current and the previous procfs::CpuStat values may become negative,
triggering the following crash/trace:
thread 'main' panicked at /build/rustc-VvCkKl/rustc-1.73.0+dfsg0ubuntu1/library/core/src/ops/arith.rs:217:1:
attempt to subtract with overflow
stack backtrace:
...
19: 0x590d8481909e - scx_rusty::calc_util::h46f2af9c512c2ecd
at /home/arighi/src/scx/scheds/rust-user/scx_rusty/src/main.rs:217:31
20: 0x590d8481c794 - scx_rusty::Tuner::step::h2e51076f043a8593
at /home/arighi/src/scx/scheds/rust-user/scx_rusty/src/main.rs:444:38
21: 0x590d84828270 - scx_rusty::Scheduler::run::hb5483f1e585f52fe
at /home/arighi/src/scx/scheds/rust-user/scx_rusty/src/main.rs:1198:17
22: 0x590d848289e9 - scx_rusty::main::h9ba8c62ad33aeee1
...
Prevent this by introducing a sub_or_zero() helper function that returns
zero if the difference is negative.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
In scx_nest, we currently count the number of times that a core is
scheduled for compaction before we eventually just eagerly compact the
core. The idea is that the core could thrash between being scheduled and
then "de-scheduled" for compaction if there are a couple of tasks that
are bouncing between cores in the primary nest often enough to kick them
out of being compacted.
We're currently resetting schedulings when a core is eagerly compacted,
but to be precise we should probably also reset the count when a core
consumes a task from the fallback DSQ, at this indicates that the system
is overcommitted and that we likely won't benefit from compacting the
primary nest.
Signed-off-by: David Vernet <void@manifault.com>
The scx_nest scheduler seems to be behaving well. Let's merge it to the
scx repo so that CachyOS can package and use it more easily.
Signed-off-by: David Vernet <void@manifault.com>
We were assigning curr to prev stats, and vice versa, in calc_util().
This was causing the following crash on debug builds:
[void@maniforge scheds]$ sudo RUST_BACKTRACE=1 scx_rusty
00:00:56 [INFO] CPUs: online/possible = 32/32
00:00:56 [INFO] DOM[00] cpumask 0000000000FF00FF (16 cpus)
00:00:56 [INFO] DOM[01] cpumask 00000000FF00FF00 (16 cpus)
00:00:56 [INFO] Rusty Scheduler Attached
thread 'main' panicked at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/core/src/ops/arith.rs:217:1:
attempt to subtract with overflow
stack backtrace:
0: rust_begin_unwind
at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/std/src/panicking.rs:597:5
1: core::panicking::panic_fmt
at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/core/src/panicking.rs:72:14
2: core::panicking::panic
at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/core/src/panicking.rs:127:5
3: <u64 as core::ops::arith::Sub>::sub
at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/core/src/ops/arith.rs:217:1
4: <&u64 as core::ops::arith::Sub<&u64>>::sub
at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/core/src/internal_macros.rs:55:17
5: scx_rusty::calc_util
at ./rust-user/scx_rusty/src/main.rs:216:29
6: scx_rusty::Tuner::step
at ./rust-user/scx_rusty/src/main.rs:444:38
7: scx_rusty::Scheduler::run
at ./rust-user/scx_rusty/src/main.rs:1198:17
8: scx_rusty::main
at ./rust-user/scx_rusty/src/main.rs:1261:5
9: core::ops::function::FnOnce::call_once
at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Flip them to avoid the crash. Rusty now runs fine.
Signed-off-by: David Vernet <void@manifault.com>
There's a fairly comprehensive README in the kernel's tools/sched_ext
directory which describes each of the example schedulers. Let's pull it
into this repository, and split it into the various subdirectories
containing the kernele-examples/ schedulers, and the rust-user/
schedulers.
Signed-off-by: David Vernet <void@manifault.com>
SCX_DSQ_GLOBAL now does not support vtime dispatching. scx_simple uses
it to do vtime scheduling, so let's update it to create and use a
separate DSQ that it can both FIFO and PRIQ dispatch to.
Signed-off-by: David Vernet <void@manifault.com>
tp_cgroup_attach_task() walks p->thread_group to visit all member threads
and set tctx->refresh_layer. However, the upstream kernel has removed
p->thread_group recently in 8e1f385104ac ("kill task_struct->thread_group")
as it was mostly a duplicate of p->signal->thread_head list which goes
through p->thread_node.
Switch to iterate via p->thread_node instead, add a comment explaining why
it's using the cgroup TP instead of scx_ops.cgroup_move(), and make
iteration failure non-fatal as the iteration is racy.
As in scx_layered, bpf_map_delete_elem() can fail due to recursion
protection triggering spuriously which can then lead to task_ctx creation
failure after PIDs wrap. Work around by dropping BPF_NOEXIST.
The scx repo is going to serve as the source of truth for sched_ext
schedulers. Reverse the sync direction and include syncing rust-user
schedulers too.