Commit Graph

51 Commits

Tejun Heo
540576ac30 scheds/c: Re-enable scx_flatcg and scx_pair
Cgroup support is available again; re-enable scx_flatcg and scx_pair.
2024-09-25 12:38:45 -10:00
Tejun Heo
466b7984c2 Sync from kernel - a748db0c8c6a ("tools/sched_ext: Receive misc updates from SCX repo")
Sync from sched_ext/for-6.12-fixes a748db0c8c6a ("tools/sched_ext: Receive
misc updates from SCX repo")

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-6.12-fixes

- cgroup support + scx_flatcg updates.

- scx_bpf_dispatch[_vtime]_from_dsq() + scx_qmap updates.

- COMPAT macros.
2024-09-25 12:34:03 -10:00
I Hsin Cheng
61cb3f7fc5 scx_common_bpf: Append cast_mask()
Remove the cast_mask() function duplicated across different schedulers
and add it to common.bpf.h so that every scheduler can reference it
when needed.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-09-24 16:01:19 +08:00
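
For reference, the consolidated helper is tiny; a sketch of what common.bpf.h now carries (this matches the usual definition of cast_mask()):

  /*
   * Cast a struct bpf_cpumask kptr to a const struct cpumask * so it
   * can be passed to kfuncs and helpers that expect a plain cpumask.
   */
  static __always_inline const struct cpumask *cast_mask(struct bpf_cpumask *mask)
  {
      return (const struct cpumask *)mask;
  }
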
Tejun Heo
54c487731a Update to vmlinux-v6.10-rc2-g1edab907b57d.h
Sync vmlinux.h to sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap:
Pick idle CPU for direct dispatch on !wakeup enqueues"). This is most
likely the commit that will be merged during the upcoming kernel v6.11
merge window.

Unfortunately, this is a compatibility breaking change. As the size of
bpf_iter_scx_dsq is reduced, schedulers that use the iterator - scx_lavd and
scx_layered - won't be able to run on older kernels. Likewise, older
binaries from before this commit won't be able to run on newer kernels.
2024-07-12 11:13:34 -10:00
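
For context, a minimal usage sketch of the iterator whose on-stack size changed (dsq_id is a placeholder); schedulers compiled against the old struct bpf_iter_scx_dsq layout will fail to load:

  struct bpf_iter_scx_dsq it;
  struct task_struct *p;

  /* The iterator struct embedded on the BPF stack is what shrank. */
  bpf_iter_scx_dsq_new(&it, dsq_id, 0);
  while ((p = bpf_iter_scx_dsq_next(&it))) {
      /* inspect the queued task @p */
  }
  bpf_iter_scx_dsq_destroy(&it);
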
Tejun Heo
f261d0f037 Sync from kernel - 1edab907b57d
Sync from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap: Pick idle
CPU for direct dispatch on !wakeup enqueues")

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-6.11

- cgroup support hasn't landed in the upstream kernel yet. This most likely
  will happen in a few weeks. For the time being, disable scx_flatcg,
  scx_pair and scx_mitosis.

- Compat macro for DSQ task iterator dropped. This is now a part of
  the baseline.

- scx_bpf_consume() isn't upstream yet. BPF interfacing side is still being
  discussed. Dropped example usage from tools/sched_ext. None of the
  practical schedulers use it, so this should be fine for now.

- scx_bpf_cpu_rq() added.

- AUTOATTACH workaround for newer libbpf versions added.
2024-07-12 11:08:41 -10:00
Tejun Heo
af5e89e73c
Merge pull request #412 from vax-r/flatcg_delta_fetch
scx_flatcg: Make good use of __sync_fetch_and_sub()
2024-07-05 07:39:22 -10:00
I Hsin Cheng
aae826b1b3 scx_flatcg: Make good use of __sync_fetch_and_sub()
Fetch the value of "delta" directly from the return value of
__sync_fetch_and_sub(), as it returns the original value of
cgc->cvtime_delta.

An additional load of cgc->cvtime_delta would be redundant here.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-05 01:03:20 +08:00
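
The builtin's semantics, for reference (a generic sketch rather than the exact flatcg diff):

  u64 old;

  /*
   * __sync_fetch_and_sub() atomically subtracts and returns the value
   * the field held before the subtraction, so a separate load of
   * cgc->cvtime_delta just to learn that value is redundant.
   */
  old = __sync_fetch_and_sub(&cgc->cvtime_delta, delta);
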
I Hsin Cheng
3e52761487 scx_flatcg: Fix typo
Fix "oppotunistic" to "opportunistic".

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-04 22:04:40 +08:00
Tejun Heo
7a40059b55 Revert "scx_flatcg: Keep cgroup rb nodes stashed"
This reverts commit 3b7f33ea1b.

I haven't root-caused it yet, but it's easy to reproduce stalls and
trigger the watchdog after the commit: just running stress in multiple
cgroups triggers stalls within a couple tens of seconds. Let's revert it
for now.
2024-06-19 14:44:26 -10:00
Tejun Heo
1012e3a6db scx/compat.bpf.h: Fix __COMPAT_scx_bpf_consume_task() and improve scx_qmap example
__COMPAT_scx_bpf_consume_task() wasn't calling scx_bpf_consume_task() at all
and was always returning false. Fix it.

Also, update scx_qmap usage example so that it matches cgroup ID rather than
comm prefix. This should make testing with multiple processes a bit easier.
2024-06-17 10:11:06 -10:00
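
A hedged sketch of the shape of the fix; the kfunc's exact signature is an assumption here, but the point is that the macro now actually forwards to the call instead of unconditionally evaluating to false:

  /* Sketch only: gate on the kfunc existing in the running kernel
   * (declared __weak __ksym elsewhere) and forward to it. */
  #define __COMPAT_scx_bpf_consume_task(it, p)                        \
      (bpf_ksym_exists(scx_bpf_consume_task) ?                        \
       scx_bpf_consume_task((it), (p)) : false)
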
Tejun Heo
13e8388e1e compat: Drop __COMPAT_HAS_CPUMASKS
In preparation for upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_HAS_CPUMASKS(). The open helper macros
now check the existence of scx_bpf_nr_cpu_ids() and abort if not.
2024-06-16 06:12:06 -10:00
Tejun Heo
66901e2b44 compat: Drop __COMPAT_scx_bpf_dump()
In preparation for upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_dump(). The open helper macros
now check the existence of scx_bpf_dump_bstr() and abort if not.

While at it, reorder the min requirement checks so that newly added ones are
up top to make testing easier.
2024-06-16 06:02:47 -10:00
Tejun Heo
5b5e5be906 compat: Drop __COMPAT_SCX_KICK_IDLE
In preparation for upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_SCX_KICK_IDLE. The open helper macros
now check the existence of SCX_KICK_IDLE and abort if not.
2024-06-15 20:24:15 -10:00
Tejun Heo
7c9aedaefe compat: Drop __COMPAT_scx_bpf_switch_all()
In preparation for upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_switch_all(). The open helper
macros now check the existence of SCX_OPS_SWITCH_PARTIAL and abort if not.
2024-06-15 20:03:37 -10:00
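
The existence checks referred to in the __COMPAT drops above can be probed from userspace via vmlinux BTF. A minimal sketch under assumed naming (kfunc_exists() is hypothetical; the libbpf BTF calls are real):

  #include <stdbool.h>
  #include <bpf/btf.h>

  /* Return true if the running kernel's BTF contains @name, e.g. the
   * "scx_bpf_nr_cpu_ids" kfunc, so open helpers can abort early. */
  static bool kfunc_exists(const char *name)
  {
      struct btf *btf = btf__load_vmlinux_btf();
      bool found = btf && btf__find_by_name(btf, name) >= 0;

      btf__free(btf);
      return found;
  }
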
Tejun Heo
d3b34d1df7 scx_qmap: Rename central_timer to monitor_timer
The name was copied from scx_central.bpf.c and doesn't match what the timer
is used for in scx_qmap.bpf.c.
2024-06-14 16:07:20 -10:00
David Vernet
b50ba626cc
uei: Pass skel to RESIZE_ARRAY()
The RESIZE_ARRAY() macro assumes the presence of an in-scope "skel" variable.
This is bad practice and can cause issues in other macros that use it. Let's
update it to explicitly take a skel argument.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-11 10:15:26 -05:00
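
A sketch of the updated macro shape (close to, though not necessarily identical to, the repo's version; the libbpf calls are real):

  /* Resize a skeleton datasec array: grow the backing map's value
   * size and re-fetch the mmap'd pointer. @skel is now an explicit
   * parameter rather than an implicit in-scope variable. */
  #define RESIZE_ARRAY(skel, elfsec, arr, n)                             \
      do {                                                               \
          size_t __sz;                                                   \
          bpf_map__set_value_size((skel)->maps.elfsec##_##arr,           \
                                  sizeof((skel)->elfsec->arr[0]) * (n)); \
          (skel)->elfsec->arr =                                          \
              bpf_map__initial_value((skel)->maps.elfsec##_##arr, &__sz);\
      } while (0)
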
Tejun Heo
92317aa2f9 Use __always_inline uniformly
Instead of using __attribute__((always_inline)), use the __always_inline
macro provided by the BPF headers.
2024-06-10 11:23:26 -10:00
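
The change is mechanical; for illustration (vtime_before() stands in for the kind of small helper these schedulers carry):

  /* Before: static inline __attribute__((always_inline)) bool vtime_before(...) */
  static __always_inline bool vtime_before(u64 a, u64 b)
  {
      return (s64)(a - b) < 0;
  }
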
I Hsin Cheng
5881c61a5e scx_central: Provide backward compatibility
Newer sched_ext kernels schedule all tasks in the system by default.
However, some users are still running older kernel versions.

Therefore, call "__COMPAT_scx_bpf_switch_all()" to move all tasks to the
SCHED_EXT class so that scx_central schedules all tasks by default on
older kernels as well.
2024-05-24 15:12:34 +08:00
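
A minimal sketch of the init path, assuming scx_central's usual fallback-DSQ setup:

  s32 BPF_STRUCT_OPS_SLEEPABLE(central_init)
  {
      /* No-op on kernels that already switch all tasks by default;
       * on older kernels this moves every task to SCHED_EXT. */
      __COMPAT_scx_bpf_switch_all();
      return scx_bpf_create_dsq(FALLBACK_DSQ_ID, -1);
  }
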
I Hsin Cheng
e605b067c6 scx_flatcg: Correct content error in comment
A's share in the hierarchy should be 100/(200+100); besides,
200/(200+100) doesn't equal 1/3. Correct the mistake by changing "200"
to "100".
2024-05-21 13:27:26 +08:00
Tejun Heo
0181df54b5
Merge pull request #303 from sched-ext/simple_comment
simple: Add comment explaining use of SHARED_DSQ
2024-05-20 06:45:13 -10:00
David Vernet
0dda4badd5
simple: Add comment explaining use of SHARED_DSQ
scx_simple is a basic scheduler that does either basic vtime or global
FIFO scheduling. At first glance, it may be confusing why we create a
separate DSQ rather than just using SCX_DSQ_GLOBAL. Let's add a comment
explaining the reason for this, so that users that are going over
scx_simple as an example scheduler don't get confused.

Signed-off-by: David Vernet <void@manifault.com>
2024-05-20 08:48:31 -05:00
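
The gist of the added comment, sketched (the SHARED_DSQ value is illustrative): built-in DSQs such as SCX_DSQ_GLOBAL can't be used as priority queues, which the vtime mode needs, hence the custom DSQ.

  #define SHARED_DSQ 0

  s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
  {
      /* Create a custom DSQ instead of using SCX_DSQ_GLOBAL, since
       * built-in DSQs can't be dispatched to by vtime. */
      return scx_bpf_create_dsq(SHARED_DSQ, -1);
  }
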
Andrea Righi
c5a4a01994 scx_simple: re-add __COMPAT_scx_bpf_switch_all()
Although newer kernels default to switching-all, some users might still
be using the scheduler with older kernels.

Therefore, ensure all tasks are moved to the SCHED_EXT class by calling
__COMPAT_scx_bpf_switch_all() during init, so that scx_simple can still
operate on these older kernels as well.

Fixes: cf66e58 ("Sync from kernel (670bdab6073)")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-05-20 04:50:16 +02:00
David Vernet
b421cee59e
Merge pull request #291 from sched-ext/htejun/sync-kernel
Sync from kernel (73f4013eb1eb)
2024-05-17 20:43:00 -05:00
Tejun Heo
ab25992416 Add missing skel.attach() calls
The C SCX_OPS_ATTACH() and Rust scx_ops_attach() macros were not calling
.attach() and were only attaching the struct_ops. This meant that all
non-struct_ops BPF programs contained in the skels were never attached,
which breaks e.g. scx_layered.

Let's fix it by adding an .attach() invocation to the attach macros.
2024-05-17 14:33:04 -10:00
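
A sketch of the C side of the fix, using scx_simple's skeleton as a stand-in; the generated <name>__attach() call is what was missing:

  /* Attach the struct_ops map, then the skeleton's remaining BPF
   * programs; previously only the first step was performed. */
  struct bpf_link *link =
      bpf_map__attach_struct_ops(skel->maps.simple_ops);

  SCX_BUG_ON(!link, "Failed to attach struct_ops");
  SCX_BUG_ON(scx_simple__attach(skel), "Failed to attach skel");
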
Tejun Heo
e26fba9255 Sync from kernel (73f4013eb1eb)
This pulls in the support for dump ops.
2024-05-17 01:57:36 -10:00
vax-r
093a08356e Fix typo
Fix "expermentation" to "experimentation".
2024-05-09 12:10:55 +08:00
Tejun Heo
c77d101655 scheds/c: Sync to the new conventions
Sync with the in-kernel-tree example schedulers.
2024-04-29 10:13:46 -10:00
Tejun Heo
cf66e58118 Sync from kernel (670bdab6073)
And fix build breakage in scx_utils due to an enum type rename.
2024-04-29 09:58:19 -10:00
David Vernet
eed338ef25
simple: Invoke __COMPAT_scx_bpf_switch_all();
scx_simple no longer supports running in "partial" mode, with only
certain tasks using scx_simple. When this option was removed, we also
removed the call to scx_bpf_switch_all().

While switching-all is the default behavior for newer kernels, let's add
__COMPAT_scx_bpf_switch_all() so that scx_simple can work on older
kernels as well.

Signed-off-by: David Vernet <void@manifault.com>
2024-04-16 11:09:44 -05:00
Andrea Righi
eca7ecd24e build: introduce kernel_headers build option
If we try to cross-build scx on builders with older versions of the
system's Linux headers (such as those provided by linux-libc-headers in
older releases of Ubuntu), we may hit build failures due to kernel ABI
differences, such as:

 error: invalid use of undefined type ‘struct btf_enum64’

To address this, introduce a new build option called "kernel_headers"
that allows specifying a custom path for the kernel headers required
during the build process.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-04-04 10:53:36 +02:00
Tejun Heo
348fe53256 Sync from kernel
Synchronize stragglers.

- Bug fix in __COMPAT_read_enum().
- A cosmetic difference in scx_qmap.bpf.c.
- Stray 'p' when calling getopt() in scx_simple.c.

After this the kernel tree and scx repo are in sync.
2024-04-02 11:29:50 -10:00
Tejun Heo
891df57b98 scheds/c: Fix up C schedulers
Fix up the remaining C schedulers after the recent API and header updates.
Also drop a stray -p from the usage help message of some schedulers.
2024-03-29 12:15:45 -10:00
Tejun Heo
9447cb27b2 scx: Sync from kernel, some schedulers are broken
Sync from kernel to receive new vmlinux.h and the updates to common headers.
This includes the following updates:

- scx_bpf_switch_all() is replaced by SCX_OPS_SWITCH_PARTIAL flag.

- sched_ext_ops.exit_dump_len added to allow customizing dump buffer size.

- scx_bpf_exit() added.

- Common headers updated to provide backward compatibility in a way which
  hides most complexities from scheduler implementations.

scx_simple, qmap, central and flatcg are updated accordingly. Other
schedulers are broken for the moment.
2024-03-29 12:14:26 -10:00
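
For illustration, the replacement direction (ops callbacks elided): full switching is now the default, and partial mode is opted into via an ops flag instead of calling scx_bpf_switch_all() from init.

  struct sched_ext_ops simple_ops SEC(".struct_ops.link") = {
      /* ... callbacks elided ... */
      /* Only tasks whose policy is SCHED_EXT are scheduled by BPF;
       * without this flag, all tasks are switched automatically. */
      .flags = SCX_OPS_SWITCH_PARTIAL,
      .name  = "simple",
  };
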
Dave Marchevsky
3b7f33ea1b scx_flatcg: Keep cgroup rb nodes stashed
The flatcg scheduler uses a rb_node type - struct cgv_node - to keep
track of vtime. On cgroup init, a cgv_node is created and stashed in a
hashmap - cgv_node_stash - for later use. In cgroup_enqueued and
try_pick_next_cgroup, the node is inserted into the rbtree, which
required removing it from the stash before this patch's changes.

This patch makes cgv_node refcounted, which allows keeping it in the
stash for the entirety of the cgroup's lifetime. Unnecessary
bpf_kptr_xchg's and other boilerplate can be removed as a result.

Note that in addition to bpf_refcount patches, which have been upstream
for quite some time, this change depends on a more recent series [0].

  [0]: https://lore.kernel.org/bpf/20231107085639.3016113-1-davemarchevsky@fb.com/

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
2024-03-11 08:35:25 -07:00
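
A sketch of the shape of the change (field, lock, and comparator names assumed; bpf_refcount_acquire() and the rbtree kfuncs are real):

  struct cgv_node {
      struct bpf_rb_node  rb_node;
      struct bpf_refcount refcount;   /* added by this patch */
      __u64               cvtime;
      __u64               cgid;
  };

  /* Insert a second reference into the rbtree while the stash keeps
   * its own, instead of bpf_kptr_xchg()'ing the node in and out. */
  struct cgv_node *n = bpf_refcount_acquire(stashed);

  bpf_spin_lock(&cgv_tree_lock);
  bpf_rbtree_add(&cgv_tree, &n->rb_node, cgv_node_less);
  bpf_spin_unlock(&cgv_tree_lock);
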
Tejun Heo
2062d1ad1f scx: Add compat support for SCX_KICK_IDLE and use it for idle CPU wakeups
SCX_KICK_IDLE is a new feature which isn't defined in older kernels. Add
compat wrapper and use it for idle CPU wakeups.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-06 15:28:40 -10:00
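
A sketch of such a wrapper; bpf_core_enum_value_exists() is the standard CO-RE probe, though the repo's actual macro may differ:

  /* Kick @cpu only if it's idle on kernels that support SCX_KICK_IDLE;
   * fall back to an unconditional kick otherwise. */
  #define __COMPAT_scx_bpf_kick_cpu_IDLE(cpu)                           \
      (bpf_core_enum_value_exists(enum scx_kick_flags, SCX_KICK_IDLE) ? \
       scx_bpf_kick_cpu((cpu), SCX_KICK_IDLE) :                         \
       scx_bpf_kick_cpu((cpu), 0))
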
David Vernet
4108ece204
scx_userland: Increase scx_userland timeout
This is meant to be an example scheduler that won't necessarily run well
in production. Let's remove the 3-second timeout and use the system
default of 30 seconds.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-04 16:23:18 -06:00
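
The knob in question is sched_ext_ops.timeout_ms; a sketch of the change (callbacks elided):

  struct sched_ext_ops userland_ops SEC(".struct_ops.link") = {
      /* ... callbacks elided ... */
      /* .timeout_ms = 3000 was dropped; leaving the field unset falls
       * back to the kernel's 30-second watchdog default. */
      .name = "userland",
  };
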
David Vernet
28a0b82be6
scx_userland: Print warning about poor performance
Let's make it clear that this scheduler isn't expected to perform well,
and instead point people to scx_rustland.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-04 16:02:57 -06:00
David Vernet
7a3fe759f2
scx_nest: Remove -D option for eager compaction
Now that scheduling BPF timers works correctly, we don't need this extra
logic to eagerly compact if compaction has been scheduled a few times in
a row. Let's remove it.

Signed-off-by: David Vernet <void@manifault.com>
2024-01-16 14:08:36 -06:00
David Vernet
607119d8a4
scx_nest: Set timer callback after cancelling
In scx_nest, we use a per-cpu BPF timer to schedule compaction for a
primary core before it goes idle. If a task comes along that could use
that core, we cancel the callback with bpf_timer_cancel().
bpf_timer_cancel() drops a refcnt on the prog and nullifies the
callback, so if we want to schedule the callback again, we must use
bpf_timer_set_callback() to reset the prog. This patch does that.

Reported-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: David Vernet <void@manifault.com>
2024-01-16 14:01:39 -06:00
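
A sketch of the resulting pattern (timer-field and callback names assumed):

  /* Cancelling drops the prog reference and nullifies the callback... */
  bpf_timer_cancel(&pcpu_ctx->timer);

  /* ...so the callback must be re-set before arming the timer again. */
  bpf_timer_set_callback(&pcpu_ctx->timer, compact_primary_core);
  bpf_timer_start(&pcpu_ctx->timer, compaction_delay_ns, 0);
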
Tejun Heo
18f7fe8477 scx_flatcg: Fix fallout from direct dispatch API update
552b75a9c7 ("scx: Build fix after kernel update") updated scx_flatcg along
with other schedulers to use the new direct dispatching from
ops.select_cpu() mechanism. However, this was buggy for flatcg.

flatcg uses direct dispatch for two purposes - as an optimization when there
are idle cpus and to avoid dealing with custom CPU affinities in the
dispatch logic. While the former can be moved to ops.select_cpu(), the
latter can't as it should also apply to tasks which get enqueued without
preceding ops.select_cpu(), e.g., when the task gets requeued after an
attribute change or runs out of time slice. The API update incorrectly moved
both to ops.select_cpu(), leading to futile retries of try_pick_next_cgroup()
and scheduling misbehaviors.

Fix it by separating out the two cases and only keeping the idle
optimization case in ops.select_cpu().

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-10 10:57:50 -10:00
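
A simplified sketch of the enqueue-side case that has to stay out of ops.select_cpu() (variable names assumed):

  void BPF_STRUCT_OPS(fcg_enqueue, struct task_struct *p, u64 enq_flags)
  {
      /* Tasks with custom affinities bypass the cgroup DSQs. This must
       * live here, not in ops.select_cpu(), because tasks can be
       * enqueued without a preceding select_cpu(), e.g. after an
       * attribute change or when the slice runs out. */
      if (p->nr_cpus_allowed != nr_possible_cpus) {
          scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
          return;
      }
      /* ... normal per-cgroup DSQ enqueueing ... */
  }
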
Tejun Heo
c1f22ea073 scx_flatcg: Report pick_next_cgroup() race and fail counts
To improve visibility into failure modes. While at it, improve the
output formatting.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-10 10:52:24 -10:00
Andrea Righi
0609abdca6 scx_flatcg: introduce CGROUP_MAX_RETRIES
We may end up stalling for too long in fcg_dispatch() if
try_pick_next_cgroup() doesn't find another valid cgroup to pick. This
can be quite risky, considering that we are holding the rq lock in
dispatch().

This condition can be reproduced easily in our CI, where we can trigger
stalling softirq works:

[    4.972926] NOHZ tick-stop error: local softirq work is pending, handler #200!!!

Or rcu stalls:

[   47.731900] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   47.731900] rcu:     1-...!: (0 ticks this GP) idle=b29c/1/0x4000000000000000 softirq=2204/2204 fqs=0
[   47.731900] rcu:     3-...!: (0 ticks this GP) idle=db74/1/0x4000000000000000 softirq=2286/2286 fqs=0
[   47.731900] rcu:     (detected by 0, t=26002 jiffies, g=6029, q=54 ncpus=4)
[   47.731900] Sending NMI from CPU 0 to CPUs 1:

To mitigate this issue, reduce the number of try_pick_next_cgroup()
retries from BPF_MAX_LOOPS (8M) to CGROUP_MAX_RETRIES (1024).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-10 17:36:17 +01:00
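
A sketch of the bound; bpf_repeat() is the open-coded numeric-iterator loop helper, and the try_pick_next_cgroup() call shape is assumed:

  #define CGROUP_MAX_RETRIES 1024

  void BPF_STRUCT_OPS(fcg_dispatch, s32 cpu, struct task_struct *prev)
  {
      /* Retrying up to BPF_MAX_LOOPS (~8M) times while holding the rq
       * lock can stall the CPU; cap the retries instead. */
      bpf_repeat(CGROUP_MAX_RETRIES) {
          if (try_pick_next_cgroup())
              return;
      }
  }
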
Tejun Heo
552b75a9c7 scx: Build fix after kernel update
In the latest kernel, sched_ext API has changed in two areas:

- ops.prep_enable/cancel_enable/enable/disable() replaced with
  ops.init_task/enable/disable/exit_task().

- scx_bpf_dispatch() can now be called from ops.select_cpu(). Also,
  SCX_ENQ_LOCAL flag is removed. Instead, users can call
  scx_bpf_select_cpu_dfl() from ops.select_cpu() and use the @is_idle out
  param value to determine whether to dispatch directly.

This commit updates all schedulers so that they build.

- Init functions renamed / merged / split.

- ops.select_cpu() is added to several schedulers and local direct
  dispatching logic is moved there.

This is the minimum update needed to make the schedulers build and work.
Further updates are needed, e.g. to move vtime updates to ops.enable().
2024-01-08 14:48:24 -10:00
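
A sketch of the new direct-dispatch idiom from ops.select_cpu(), replacing the removed SCX_ENQ_LOCAL flag (callback naming assumed):

  s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
                     s32 prev_cpu, u64 wake_flags)
  {
      bool is_idle = false;
      s32 cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);

      /* Dispatch directly from select_cpu() when an idle CPU was found. */
      if (is_idle)
          scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
      return cpu;
  }
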
David Vernet
e8978ebe23
scx_userland: Introduce ops.update_idle() callback
We can sometimes hit scenarios in the scx_userland scheduler where there
is work to be done in user space, but we incorrectly fail to run the
user space scheduler. In order to avoid this, we can use global
variables that are set from both BPF and user space. The BPF-side
variable reflects when one or more tasks have been enqueued, and the
user space-side variable reflects when user space has received tasks but
has not yet dispatched them.

In the ops.update_idle() callback, we can check these variables and send
a resched IPI to a core to ensure that the user-space scheduler is
always scheduled when there's work to be done.

Signed-off-by: David Vernet <void@manifault.com>
2024-01-02 16:29:19 -06:00
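
A sketch of the mechanism (variable names assumed):

  u64 nr_queued;      /* bumped by BPF when tasks are enqueued */
  u64 nr_scheduled;   /* set by userspace while tasks await dispatch */

  void BPF_STRUCT_OPS(userland_update_idle, s32 cpu, bool idle)
  {
      /* If work is pending, kick the CPU so the userspace scheduler
       * task gets to run instead of the core going idle. */
      if (idle && (nr_queued || nr_scheduled))
          scx_bpf_kick_cpu(cpu, 0);
  }
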
Andrea Righi
dbc8e23980 scx_userland: flush stdout when printing stats
Periodically flush stdout to help following the scheduler progress
during testing.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-29 15:53:12 +01:00
Andrea Righi
614a1ff901 scx_flatcg: flush stdout when printing stats
Periodically flush stdout to help following the scheduler progress
during testing.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-29 15:53:12 +01:00
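
The change in both commits above is a one-liner after each stats dump; stdout is only line-buffered on a TTY, so piped or redirected output needs an explicit flush (stat names illustrative):

  printf("enq=%llu dispatched=%llu\n", nr_enqueued, nr_dispatched);
  fflush(stdout);  /* make progress visible when output is piped */
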
David Vernet
eb7b3c99f0
Merge pull request #40 from sched-ext/ci
scx: Add CI action that builds schedulers for PRs
2023-12-18 21:17:47 -06:00
David Vernet
4523b10e45
scx: Add CI action that builds schedulers for PRs
When Ubuntu ships with sched_ext, we can also maybe test loading the
schedulers (not sure if the runners can run as root though). For now, we
should at least have a CI job that lets us verify that the schedulers
can _build_. To that end, this patch adds a basic CI action that builds
the schedulers.

As is, this is a bit brittle in that we're having to manually download
and install a few dependencies. I don't see a better way for now without
hosting our own runners with our own containers, but that's a bigger
investment. For now, hopefully this will get us _some_ coverage.

Signed-off-by: David Vernet <void@manifault.com>
2023-12-18 21:12:50 -06:00
David Vernet
318c06fa9c
nest: Skip out of idle cpu selection on exec() path
The core sched code calls select_task_rq() in a few places: the task
wakeup path (typical path), the fork() path, and the exec() path. For
nest scheduling, we don't want to select a core from the nest on the
exec() path. If we were previously able to find an idle core, we would
have found it on the fork() path, so we don't gain much by checking on
the exec() path. In fact, it's actually harmful, because we could
needlessly blow up the primary nest by bumping the same task between
multiple cores for no reason. Let's just opt out of select_task_rq()
calls on the exec() path.

Suggested-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: David Vernet <void@manifault.com>
2023-12-18 13:51:15 -06:00
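
A sketch of the opt-out; SCX_WAKE_EXEC as the exec-path indicator and pick_nest_core() are assumptions here:

  s32 BPF_STRUCT_OPS(nest_select_cpu, struct task_struct *p,
                     s32 prev_cpu, u64 wake_flags)
  {
      /* On the exec() path, keep the task where it is rather than
       * possibly blowing up the primary nest. */
      if (wake_flags & SCX_WAKE_EXEC)
          return prev_cpu;

      return pick_nest_core(p, prev_cpu);  /* hypothetical helper */
  }
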
David Vernet
ab0e36f9ce
scx_nest: Apply r_impatient if no task is found in primary nest
Julia pointed out that our current implementation of r_impatient is
incorrect. r_impatient is meant to be a mechanism for more aggressively
growing the primary nest if a task repeatedly isn't able to find a core.
Right now, we trigger r_impatient if we're not able to find an attached
or previous core in the primary nest, but we _should_ be triggering it
only if we're unable to find _any_ core in the primary nest. Fixing the
implementation to do this drastically decreases how aggressively we grow
the primary nest when r_impatient is in effect.

Reported-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: David Vernet <void@manifault.com>
2023-12-18 11:05:36 -06:00
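
A sketch of the corrected condition (field and helper names assumed from the description):

  /* Count a miss toward r_impatient only when no core in the primary
   * nest was usable, not merely the attached/previous core. */
  if (!found_in_primary && r_impatient &&
      ++tctx->prev_misses >= r_impatient) {
      tctx->prev_misses = 0;
      promote_reserve_core(p);  /* hypothetical: grow the primary nest */
  }
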