The RESIZE_ARRAY() macro assumes the presence of an in-scope "skel" variable.
This is bad practice and can cause issues in other macros that use it. Let's
update it to explicitly take a skel argument.
Signed-off-by: David Vernet <void@manifault.com>
Newer sched_ext kernel versions sets the scheduler to schedule all tasks
within the system by default. However, some users are using the old
versions of kernel.
Therefore we call "__COMPAT_scx_bpf_switch_all()" to move all tasks to
"SCHED_EXT" class so scx_central would schedule all tasks by default in
older kernels.
scx_simple is a basic scheduler that does either basic vtime, or global
FIFO, scheduling. At first glance, it may be confusing why we create a
separate DSQ rather than just using SCX_DSQ_GLOBAL. Let's add a comment
explaining the reason for this, so that users that are going over
scx_simple as an example scheduler don't get confused.
Signed-off-by: David Vernet <void@manifault.com>
Although newer kernels default to switching-all, some users might still
be using the scheduler with older kernels.
Therefore, ensure all tasks are moved to the SCHED_EXT class by calling
__COMPAT_scx_bpf_switch_all() during init, so that scx_simple can still
operate on these older kernels as well.
Fixes: cf66e58 ("Sync from kernel (670bdab6073)")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
C SCX_OPS_ATTACH() and rust scx_ops_attach() macros were not calling
.attach() and were only attaching the struct_ops. This meant that all
non-struct_ops BPF programs contained in the skels were never attached which
breaks e.g. scx_layered.
Let's fix it by adding .attach() invocation the the attach macros.
scx_simple no longer supports running in "partial" mode, with only
certain tasks usig scx_simple. When this option was removed, we also
removed the call to scx_bpf_switch_all();
While switching-all is the default behavior for newer kernels, let's add
__COMPAT_scx_bpf_switch_all() so that scx_simple can work on older
kernels as well.
Signed-off-by: David Vernet <void@manifault.com>
If we try to cross-build scx on builders with older versions of system's
linux headers (such as those provided by linux-libc-headers in older
releases of Ubuntu), we may hit build failures, due to the different
kernel ABI, such as:
error: invalid use of undefined type ‘struct btf_enum64’
To address this, introduce a new build option called "kernel_headers"
that allows to specify a custom path for the kernel headers required
during the build process.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Synchronize stragglers.
- Bug fix in __COMPAT_read_enum().
- A cosmetic difference in scx_qmap.bpf.c.
- Stray 'p' when calling getopt() in scx_simple.c.
After this the kernel tree and scx repo are in sync.
Sync from kernel to receive new vmlinux.h and the updates to common headers.
This includes the following updates:
- scx_bpf_switch_all() is replaced by SCX_OPS_SWITCH_PARTIAL flag.
- sched_ext_ops.exit_dump_len added to allow customizing dump buffer size.
- scx_bpf_exit() added.
- Common headers updated to provide backward compatibility in a way which
hides most complexities from scheduler implementations.
scx_simple, qmap, central and flatcg are updated accordingly. Other
schedulers are broken for the moment.
The flatcg scheduler uses a rb_node type - struct cgv_node - to keep
track of vtime. On cgroup init, a cgv_node is created and stashed in a
hashmap - cgv_node_stash - for later use. In cgroup_enqueued and
try_pick_next_cgroup, the node is inserted into the rbtree, which
required removing it from the stash before this patch's changes.
This patch makes cgv_node refcounted, which allows keeping it in the
stash for the entirety of the cgroup's lifetime. Unnecessary
bpf_kptr_xchg's and other boilerplate can be removed as a result.
Note that in addition to bpf_refcount patches, which have been upstream
for quite some time, this change depends on a more recent series [0].
[0]: https://lore.kernel.org/bpf/20231107085639.3016113-1-davemarchevsky@fb.com/
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
SCX_KICK_IDLE is a new feature which isn't defined in older kernels. Add
compat wrapper and use it for idle CPU wakeups.
Signed-off-by: Tejun Heo <tj@kernel.org>
This is meant to be an example scheduler that won't necessarily run well
in production. Let's remove the 3 second timeout and use the system
default of 30.
Signed-off-by: David Vernet <void@manifault.com>
Let's make it clear that this scheduler isn't expected to perform well,
and instead point people to scx_rustland.
Signed-off-by: David Vernet <void@manifault.com>
Now that scheduling BPF timers works correctly, we don't need this extra
logic to eagerly compact if a scheduling for compaction has happened a
few times in a row. Let's remove it.
Signed-off-by: David Vernet <void@manifault.com>
In scx_nest, we use a per-cpu BPF timer to schedule compaction for a
primary core before it goes idle. If a task comes along that could use
that core, we cancel the callback with bpf_timer_cancel().
bpf_timer_cancel() drops a refcnt on the prog and nullifies the
callback, so if we want to schedule the callback again, we must use
bpf_timer_set_callback() to reset the prog. This patch does that.
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: David Vernet <void@manifault.com>
552b75a9c7 ("scx: Build fix after kernel update") updated scx_flatcg along
with other schedulers to use the new direct dispatching from
ops.select_cpu() mechanism. However, this was buggy for flatcg.
flatcg uses direct dispatch for two purposes - as an optimization when there
are idle cpus and to avoid dealing with custom CPU affinities in the
dispatch logic. While the former can be moved to ops.select_cpu(), the
latter can't as it should also apply to tasks which get enqueued without
preceding ops.select_cpu(), e.g., when the task gets requeued after an
attribute change or runs out of time slice. The API update incorrectly moved
both to ops.select_cpu() leading to futile retries of try_pick_next_cgroup()
and scheduling misbheaviors.
Fix it by separating out the two cases and only keeping the idle
optimization case in ops.select_cpu().
Signed-off-by: Tejun Heo <tj@kernel.org>
We may end up stalling for too long in fcg_dispatch() if
try_pick_next_cgroup() doesn't find another valid cgroup to pick. This
can be quite risky, considering that we are holding the rq lock in
dispatch().
This condition can be reproduced easily in our CI, where we can trigger
stalling softirq works:
[ 4.972926] NOHZ tick-stop error: local softirq work is pending, handler #200!!!
Or rcu stalls:
[ 47.731900] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 47.731900] rcu: 1-...!: (0 ticks this GP) idle=b29c/1/0x4000000000000000 softirq=2204/2204 fqs=0
[ 47.731900] rcu: 3-...!: (0 ticks this GP) idle=db74/1/0x4000000000000000 softirq=2286/2286 fqs=0
[ 47.731900] rcu: (detected by 0, t=26002 jiffies, g=6029, q=54 ncpus=4)
[ 47.731900] Sending NMI from CPU 0 to CPUs 1:
To mitigate this issue reduce the amount of try_pick_next_cgroup()
retries from BPF_MAX_LOOPS (8M) to CGROUP_MAX_RETRIES (1024).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
In the latest kernel, sched_ext API has changed in two areas:
- ops.prep_enable/cancel_enable/enable/disable() replaced with
ops.init_task/enable/disable/exit_task().
- scx_bpf_dispatch() can now be called from ops.select_cpu(). Also,
SCX_ENQ_LOCAL flag is removed. Instead, users can call
scx_bpf_select_cpu_dfl() from ops.select_cpu() and use the @is_idle out
param value to determine whether to dispatch directly.
This commit updates all schedules so that they build.
- Init functions renamed / merged / split.
- ops.select_cpu() is added to several schedulers and local direct
disptching logic is moved there.
This is the minimum update which is need to make the schedulers build and
work. It needs further update to e.g. move vtime udpates to ops.enable().
We can sometimes hit scenarios in the scx_userland scheduler where there
is work to be done in user space, but we incorrectly fail to run the
user space scheduler. In order to avoid this, we can use global
variables that are set from both BPF and user space. The BPF-side
variable reflects when one or more tasks have been enqueued, and the
user space-side variable reflects when user space has received tasks but
has not yet dispatched them.
In the ops.update_idle() callback, we can check these variables and send
a resched IPI to a core to ensure that the user-space scheduler is
always scheduled when there's work to be done.
Signed-off-by: David Vernet <void@manifault.com>
When Ubuntu ships with sched_ext, we can also maybe test loading the
schedulers (not sure if the runners can run as root though). For now, we
should at least have a CI job that lets us verify that the schedulers
can _build_. To that end, this patch adds a basic CI action that builds
the schedulers.
As is, this is a bit brittle in that we're having to manually download
and install a few dependencies. I don't see a better way for now without
hosting our own runners with our own containers, but that's a bigger
investment. For now, hopefully this will get us _some_ coverage.
Signed-off-by: David Vernet <void@manifault.com>
The core sched code calls select_task_rq() in a few places: the task
wakeup path (typical path), the fork() path, and the exec() path. For
nest scheduling, we don't want to select a core from the nest on the
exec() path. If we were previously able to find an idle core, we would
have found it on the fork() path, so we don't gain much by checking on
the exec() path. In fact, it's actually harmful, because we could
incorrectly blow up the primary nest unnecessarily by bumping the same
task between multiple cores for no reason. Let's just opt-out of
select_task_rq calls on the exec() path.
Suggested-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: David Vernet <void@manifault.com>
Julia pointed out that our current implementation of r_impatient is
incorrect. r_impatient is meant to be a mechanism for more aggressively
growing the primary nest if a task repeatedly isn't able to find a core.
Right now, we trigger r_impatient if we're not able to find an attached
or previous core in the primary nest, but we _should_ be triggering it
only if we're unable to find _any_ core in the primary nest. Fixing the
implementation to do this drastically decreases how aggressively we grow
the primary nest when r_impatient is in effect.
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: David Vernet <void@manifault.com>
- combine c and kernel-examples as it's confusing to have both
- rename 'rust-user' and 'c-user' to just 'rust' and 'c', which is simpler
- update and fix sync-to-kernel.sh