Sync from kernel to receive new vmlinux.h and the updates to common headers.
This includes the following updates:
- scx_bpf_switch_all() is replaced by SCX_OPS_SWITCH_PARTIAL flag.
- sched_ext_ops.exit_dump_len added to allow customizing dump buffer size.
- scx_bpf_exit() added.
- Common headers updated to provide backward compatibility in a way which
hides most complexities from scheduler implementations.
scx_simple, qmap, central and flatcg are updated accordingly. Other
schedulers are broken for the moment.
The flatcg scheduler uses a rb_node type - struct cgv_node - to keep
track of vtime. On cgroup init, a cgv_node is created and stashed in a
hashmap - cgv_node_stash - for later use. In cgroup_enqueued and
try_pick_next_cgroup, the node is inserted into the rbtree, which
required removing it from the stash before this patch's changes.
This patch makes cgv_node refcounted, which allows keeping it in the
stash for the entirety of the cgroup's lifetime. Unnecessary
bpf_kptr_xchg's and other boilerplate can be removed as a result.
Note that in addition to bpf_refcount patches, which have been upstream
for quite some time, this change depends on a more recent series [0].
[0]: https://lore.kernel.org/bpf/20231107085639.3016113-1-davemarchevsky@fb.com/
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
552b75a9c7 ("scx: Build fix after kernel update") updated scx_flatcg along
with other schedulers to use the new direct dispatching from
ops.select_cpu() mechanism. However, this was buggy for flatcg.
flatcg uses direct dispatch for two purposes - as an optimization when there
are idle cpus and to avoid dealing with custom CPU affinities in the
dispatch logic. While the former can be moved to ops.select_cpu(), the
latter can't as it should also apply to tasks which get enqueued without
preceding ops.select_cpu(), e.g., when the task gets requeued after an
attribute change or runs out of time slice. The API update incorrectly moved
both to ops.select_cpu() leading to futile retries of try_pick_next_cgroup()
and scheduling misbheaviors.
Fix it by separating out the two cases and only keeping the idle
optimization case in ops.select_cpu().
Signed-off-by: Tejun Heo <tj@kernel.org>
We may end up stalling for too long in fcg_dispatch() if
try_pick_next_cgroup() doesn't find another valid cgroup to pick. This
can be quite risky, considering that we are holding the rq lock in
dispatch().
This condition can be reproduced easily in our CI, where we can trigger
stalling softirq works:
[ 4.972926] NOHZ tick-stop error: local softirq work is pending, handler #200!!!
Or rcu stalls:
[ 47.731900] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 47.731900] rcu: 1-...!: (0 ticks this GP) idle=b29c/1/0x4000000000000000 softirq=2204/2204 fqs=0
[ 47.731900] rcu: 3-...!: (0 ticks this GP) idle=db74/1/0x4000000000000000 softirq=2286/2286 fqs=0
[ 47.731900] rcu: (detected by 0, t=26002 jiffies, g=6029, q=54 ncpus=4)
[ 47.731900] Sending NMI from CPU 0 to CPUs 1:
To mitigate this issue reduce the amount of try_pick_next_cgroup()
retries from BPF_MAX_LOOPS (8M) to CGROUP_MAX_RETRIES (1024).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
In the latest kernel, sched_ext API has changed in two areas:
- ops.prep_enable/cancel_enable/enable/disable() replaced with
ops.init_task/enable/disable/exit_task().
- scx_bpf_dispatch() can now be called from ops.select_cpu(). Also,
SCX_ENQ_LOCAL flag is removed. Instead, users can call
scx_bpf_select_cpu_dfl() from ops.select_cpu() and use the @is_idle out
param value to determine whether to dispatch directly.
This commit updates all schedules so that they build.
- Init functions renamed / merged / split.
- ops.select_cpu() is added to several schedulers and local direct
disptching logic is moved there.
This is the minimum update which is need to make the schedulers build and
work. It needs further update to e.g. move vtime udpates to ops.enable().
- combine c and kernel-examples as it's confusing to have both
- rename 'rust-user' and 'c-user' to just 'rust' and 'c', which is simpler
- update and fix sync-to-kernel.sh