Commit Graph

7 Commits

Author SHA1 Message Date
Tejun Heo
9447cb27b2 scx: Sync from kernel, some schedulers are broken
Sync from kernel to receive new vmlinux.h and the updates to common headers.
This includes the following updates:

- scx_bpf_switch_all() is replaced by SCX_OPS_SWITCH_PARTIAL flag.

- sched_ext_ops.exit_dump_len added to allow customizing dump buffer size.

- scx_bpf_exit() added.

- Common headers updated to provide backward compatibility in a way which
  hides most complexities from scheduler implementations.

scx_simple, qmap, central and flatcg are updated accordingly. Other
schedulers are broken for the moment.
2024-03-29 12:14:26 -10:00
Dave Marchevsky
3b7f33ea1b scx_flatcg: Keep cgroup rb nodes stashed
The flatcg scheduler uses a rb_node type - struct cgv_node - to keep
track of vtime. On cgroup init, a cgv_node is created and stashed in a
hashmap - cgv_node_stash - for later use. In cgroup_enqueued and
try_pick_next_cgroup, the node is inserted into the rbtree, which
required removing it from the stash before this patch's changes.

This patch makes cgv_node refcounted, which allows keeping it in the
stash for the entirety of the cgroup's lifetime. Unnecessary
bpf_kptr_xchg's and other boilerplate can be removed as a result.

Note that in addition to bpf_refcount patches, which have been upstream
for quite some time, this change depends on a more recent series [0].

  [0]: https://lore.kernel.org/bpf/20231107085639.3016113-1-davemarchevsky@fb.com/

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
2024-03-11 08:35:25 -07:00
Tejun Heo
18f7fe8477 scx_flatcg: Fix fallout from direct dispatch API update
552b75a9c7 ("scx: Build fix after kernel update") updated scx_flatcg along
with other schedulers to use the new direct dispatching from
ops.select_cpu() mechanism. However, this was buggy for flatcg.

flatcg uses direct dispatch for two purposes - as an optimization when there
are idle cpus and to avoid dealing with custom CPU affinities in the
dispatch logic. While the former can be moved to ops.select_cpu(), the
latter can't as it should also apply to tasks which get enqueued without
preceding ops.select_cpu(), e.g., when the task gets requeued after an
attribute change or runs out of time slice. The API update incorrectly moved
both to ops.select_cpu() leading to futile retries of try_pick_next_cgroup()
and scheduling misbheaviors.

Fix it by separating out the two cases and only keeping the idle
optimization case in ops.select_cpu().

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-10 10:57:50 -10:00
Tejun Heo
c1f22ea073 scx_flatcg: Report pick_next_cgroup() race and fail counts
To improve visibility into failure mode. While at it, improve output
formatting.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-10 10:52:24 -10:00
Andrea Righi
0609abdca6 scx_flatcg: introduce CGROUP_MAX_RETRIES
We may end up stalling for too long in fcg_dispatch() if
try_pick_next_cgroup() doesn't find another valid cgroup to pick. This
can be quite risky, considering that we are holding the rq lock in
dispatch().

This condition can be reproduced easily in our CI, where we can trigger
stalling softirq works:

[    4.972926] NOHZ tick-stop error: local softirq work is pending, handler #200!!!

Or rcu stalls:

[   47.731900] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   47.731900] rcu:     1-...!: (0 ticks this GP) idle=b29c/1/0x4000000000000000 softirq=2204/2204 fqs=0
[   47.731900] rcu:     3-...!: (0 ticks this GP) idle=db74/1/0x4000000000000000 softirq=2286/2286 fqs=0
[   47.731900] rcu:     (detected by 0, t=26002 jiffies, g=6029, q=54 ncpus=4)
[   47.731900] Sending NMI from CPU 0 to CPUs 1:

To mitigate this issue reduce the amount of try_pick_next_cgroup()
retries from BPF_MAX_LOOPS (8M) to CGROUP_MAX_RETRIES (1024).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-10 17:36:17 +01:00
Tejun Heo
552b75a9c7 scx: Build fix after kernel update
In the latest kernel, sched_ext API has changed in two areas:

- ops.prep_enable/cancel_enable/enable/disable() replaced with
  ops.init_task/enable/disable/exit_task().

- scx_bpf_dispatch() can now be called from ops.select_cpu(). Also,
  SCX_ENQ_LOCAL flag is removed. Instead, users can call
  scx_bpf_select_cpu_dfl() from ops.select_cpu() and use the @is_idle out
  param value to determine whether to dispatch directly.

This commit updates all schedules so that they build.

- Init functions renamed / merged / split.

- ops.select_cpu() is added to several schedulers and local direct
  disptching logic is moved there.

This is the minimum update which is need to make the schedulers build and
work. It needs further update to e.g. move vtime udpates to ops.enable().
2024-01-08 14:48:24 -10:00
Jordan Rome
e9a9d32ab6 Restructure scheds folder names
- combine c and kernel-examples as it's confusing to have both
- rename 'rust-user' and 'c-user' to just 'rust' and 'c', which is simpler
- update and fix sync-to-kernel.sh
2023-12-17 13:14:31 -08:00