81 Commits

Author SHA1 Message Date
Tejun Heo
51334b5c4d Bump versions for 1.0.1 release 2024-07-15 13:21:52 -10:00
Tejun Heo
761ec142ce Bump most versions to 1.0.0
sched_ext is about to be merged upstream. There are some compatibility
breaking changes and we're making the current sched_ext/for-6.11
1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on
!wakeup enqueues") the baseline.

Tag everything except scx_mitosis as 1.0.0. As scx_mitosis is still in early
development and is currently temporarily disabled, only the patchlevel is
bumped.
2024-07-12 11:34:14 -10:00
Tejun Heo
f261d0f037 Sync from kernel - 1edab907b57d
Sync from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap: Pick idle
CPU for direct dispatch on !wakeup enqueues")

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-6.11

- cgroup support hasn't landed in the upstream kernel yet. This most likely
  will happen in a few weeks. For the time being, disable scx_flatcg,
  scx_pair and scx_mitosis.

- Compat macro for DSQ task iterator dropped. This is now a part of
  the baseline.

- scx_bpf_consume() isn't upstream yet. BPF interfacing side is still being
  discussed. Dropped example usage from tools/sched_ext. None of the
  practical schedulers use it, so this should be fine for now.

- scx_bpf_cpu_rq() added.

- AUTOATTACH workaround for newer libbpf versions added.
2024-07-12 11:08:41 -10:00
Andrea Righi
cf4883fbf8 meson: introduce serialize build option
With commit 5d20f89a ("scheds-rust: build rust schedulers in sequence"),
schedulers are now built serially one after the other to prevent meson
and cargo from forking NxN parallel tasks.

However, this change has made building a single scheduler much more
cumbersome, due to the chain of dependencies.

For example, building scx_rusty using the specific meson target would
still result in all schedulers being built, because they all depend on
each other.

To address this issue, introduce the new meson build option
`serialize=true|false` (default is false).

This option makes it possible to disable the schedulers' build chain,
restoring the old behavior.

With serialization disabled, it is possible to build just a single
scheduler, parallelizing the cargo build properly, without triggering
the build of the others. Example:

  $ meson setup build -Dbuildtype=release -Dserialize=false
  $ meson compile -C build scx_rusty

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-28 10:17:37 +02:00
Tejun Heo
dde2942125 compat: Drop __COMPAT_scx_bpf_cpuperf_*()
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_cpuperf_*(). The open helper
macros now check the existence of scx_bpf_cpuperf_cap() and abort if not.
2024-06-16 06:16:53 -10:00
Tejun Heo
13e8388e1e compat: Drop __COMPAT_HAS_CPUMASKS
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_HAS_CPUMASKS(). The open helper macros
now check the existence of scx_bpf_nr_cpu_ids() and abort if not.
2024-06-16 06:12:06 -10:00
Tejun Heo
5b5e5be906 compat: Drop __COMPAT_SCX_KICK_IDLE
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_SCX_KICK_IDLE. The open helper macros
now check the existence of SCX_KICK_IDLE and abort if not.
2024-06-15 20:24:15 -10:00
Tejun Heo
7c9aedaefe compat: Drop __COMPAT_scx_bpf_switch_all()
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_switch_all(). The open helper
macros now check the existence of SCX_OPS_SWITCH_PARTIAL and abort if not.
2024-06-15 20:03:37 -10:00
Tejun Heo
9ec3594b4f scx_layered: Several fixes to address David's review
- pick_idle_cpu() was putting idle_smtmask that it didn't acquire.

- layered_enqueue() was unnecessarily entering preemption path after finding
  an idle CPU.

- No need to test whether scx_bpf_get_idle_cpu/smtmask() return NULL. They
  never do.

- Relocate cctx->yielding test into keep_running() from its caller.
2024-06-10 11:23:37 -10:00
Tejun Heo
92317aa2f9 Use __always_inline uniformly
Instead of using __attribute__((always_inline)) use the __always_inline
macro provided by BPF.
2024-06-10 11:23:26 -10:00
Tejun Heo
a165970ab9 scx_layered: Add migration statistic
Keep track of how frequent migrations are.
2024-06-07 11:49:39 -10:00
Tejun Heo
5b31d96c3d scx_layered: Implement "preempt_first" layer property
If set, tasks in the layer will try to preempt tasks on their previous CPUs
before trying to find idle CPUs.
2024-06-07 11:49:39 -10:00
Tejun Heo
ece3638664 scx_layered: Allow confined layers to preempt
There's no reason to restrict confined layers from preempting on the CPUs
that they are entitled to. Allow preemption for confined layers.
2024-06-07 11:49:39 -10:00
Tejun Heo
7c48814ed0 scx_layered: Prefer preempting the CPU the task was previously on
Currently, when preempting, searching for the candidate CPU always starts
from the RR preemption cursor. Let's first try the previous CPU the
preempting task was on as that may have some locality benefits.
2024-06-07 11:49:38 -10:00
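As a rough illustration of the search order described in 7c48814ed0, the following C sketch tries the task's previous CPU first and only then falls back to a round-robin cursor. This is a plain userspace model, not the scx_layered BPF code: `pick_preempt_cpu()`, the `preemptible` array, and the way the cursor advances are hypothetical names and assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the candidate search. preemptible[cpu] says whether
 * the task currently on that CPU may be preempted. Returns the chosen CPU,
 * or -1 if no CPU qualifies.
 */
static int pick_preempt_cpu(const bool *preemptible, int nr_cpus,
			    int prev_cpu, int *cursor)
{
	assert(preemptible != NULL && cursor != NULL && nr_cpus > 0);

	/* Try the previous CPU first for potential locality benefits. */
	if (prev_cpu >= 0 && prev_cpu < nr_cpus && preemptible[prev_cpu])
		return prev_cpu;

	/* Otherwise scan round-robin, starting at the shared cursor. */
	for (int i = 0; i < nr_cpus; i++) {
		int cpu = (*cursor + i) % nr_cpus;

		if (preemptible[cpu]) {
			/* Assumption: the next search starts after this CPU. */
			*cursor = (cpu + 1) % nr_cpus;
			return cpu;
		}
	}
	return -1;
}
```

In this model the cursor is left untouched when the previous CPU is chosen, so the round-robin fairness of later searches is not disturbed by locality hits.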
Tejun Heo
3db3257911 scx_layered: Find and kick an idle CPU from enqueue path
When a task is being enqueued outside wakeup path, ops.select_cpu() isn't
called, so we can end up in a situation where a newly enqueued task keeps
waiting in one of the DSQs while there are idle CPUs. Factor out idle CPU
selection path into pick_idle_cpu() and call it from the enqueue path in
such cases. This problem is shared across schedulers and likely needs a more
generic solution in the future.
2024-06-07 11:49:38 -10:00
Tejun Heo
0f2d1ad2fa scx_layered: Implement a new layer parameter "yield_ignore"
yield(2) currently gives up the entire slice. Add "yield_ignore" layer
parameter which can modulate the magnitude of yielding. When 1.0, yields are
completely ignored. When 0.5, only half of the full slice is given up, and
so on.
2024-06-07 11:49:38 -10:00
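The scaling described in 0f2d1ad2fa can be written out as a small piece of arithmetic. The sketch below assumes the slice is measured in nanoseconds and that the amount given up is linear in `yield_ignore`; the helper name `yield_giveup_ns()` is hypothetical and is not taken from scx_layered.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: how much of a slice a yielding task gives up.
 * yield_ignore == 1.0 ignores the yield entirely, 0.0 gives up the
 * whole slice, and values in between scale linearly (an assumption).
 */
static uint64_t yield_giveup_ns(uint64_t slice_ns, double yield_ignore)
{
	assert(yield_ignore >= 0.0 && yield_ignore <= 1.0);
	return (uint64_t)((double)slice_ns * (1.0 - yield_ignore));
}
```

For a 20 ms slice, a `yield_ignore` of 0.5 gives up 10 ms under this model.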
Tejun Heo
4aa8124b9c scx_layered: Add explicit yield() support
Currently, a task which yields is treated the same as a task which has run
out its slice. As the budget charged to a task is calculated from wall clock
time, a repeatedly yielding task can stay at the top of the queue for quite
a while hogging the CPU and spiking the number of scheduling events.

Let's add explicit yield support. A yielding task is now always charged the
full slice and not allowed to keep running on the same CPU.
2024-06-07 11:49:38 -10:00
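A minimal model of the accounting change in 4aa8124b9c, assuming the budget is tracked in nanoseconds: a task that yields is charged the full slice, while any other task is charged the wall-clock time it actually ran. The function name `charged_ns()` and the cap at one slice for non-yielding tasks are assumptions of this sketch, not details taken from the commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical charge calculation. ran_ns is the wall-clock time the task
 * spent on the CPU, slice_ns the slice it was given.
 */
static uint64_t charged_ns(uint64_t ran_ns, uint64_t slice_ns, bool yielded)
{
	assert(slice_ns > 0);

	/* An explicit yield always costs the full slice. */
	if (yielded)
		return slice_ns;

	/* Assumption: a non-yielding task is charged at most one slice. */
	return ran_ns < slice_ns ? ran_ns : slice_ns;
}
```

Without the first branch, a task that yields after a few microseconds would be charged almost nothing and could stay at the top of the queue, which is the problem the commit message describes.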
Tejun Heo
436cd7ba9e scx_layered: Make enqueue path comprehensive and handle CPU preemptions
The keep_running path relies on the implicit last task enqueue which makes
the statistics a bit difficult to track. Let's make the enqueue path
comprehensive:

- Set SCX_OPS_ENQ_LAST and handle the last runnable task enqueue explicitly.

- Implement layered_cpu_release() to re-enqueue tasks from a CPU preempted
  by a higher pri sched class and handle the re-enqueued tasks explicitly in
  layered_enqueue().

- Add more statistics to track all enqueue operations.
2024-06-07 11:49:38 -10:00
Tejun Heo
4a0993ceab scx_layered: Allow long-running tasks to keep running on the same CPU
When a task exhausts its slice, layered currently doesn't make any effort to
keep it on the same CPU. It dispatches the next task to run and then
enqueues the running one. This leads to suboptimal behaviors. e.g. When this
happens to a task in a preempting layer, the task will most likely find an
idle CPU or a task to preempt and then migrate there causing a completely
unnecessary migration.

This patch makes layered_dispatch() test whether the current task should keep
running on the CPU and then skip dispatching to keep the task running. This
behavior depends on the implicit local DSQ enqueue mechanism which triggers
when there are no other tasks to run.
2024-06-07 11:49:38 -10:00
Tejun Heo
200af60f2a scx_layered: Fix load failure due to scheduler_tick() -> sched_tick() rename
- scx_utils: Replace kfunc_exists() with ksym_exists() which doesn't care
  about the type of the symbol.

- scx_layered: Fix load failure on kernels >= v6.10-rc due to
  scheduler_tick() -> sched_tick rename. Attach the tick fentry function to
  either scheduler_tick() or sched_tick().
2024-06-06 12:54:59 -10:00
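The fallback between the two tick function names in 200af60f2a can be sketched as a lookup by name only, ignoring the symbol's type. The code below is a userspace model under stated assumptions: the symbol list stands in for whatever the kernel exports, `ksym_listed()` and `pick_tick_symbol()` are hypothetical names, and preferring the newer `sched_tick` name is a choice made for the example. It is not the scx_utils implementation of ksym_exists().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if name appears in the list, regardless of symbol type. */
static bool ksym_listed(const char *const *syms, size_t nr, const char *name)
{
	for (size_t i = 0; i < nr; i++)
		if (strcmp(syms[i], name) == 0)
			return true;
	return false;
}

/*
 * Pick the function to attach the tick fentry program to. Returns NULL
 * when neither name exists, in which case loading should fail.
 */
static const char *pick_tick_symbol(const char *const *syms, size_t nr)
{
	assert(syms != NULL || nr == 0);

	if (ksym_listed(syms, nr, "sched_tick"))	/* renamed in v6.10-rc */
		return "sched_tick";
	if (ksym_listed(syms, nr, "scheduler_tick"))	/* older kernels */
		return "scheduler_tick";
	return NULL;
}
```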
Tejun Heo
e556dd375d scx: Unify loading and running boilerplate across rust schedulers
Make restart handling with user_exit_info simpler and use the load and
report macros consistently across the rust schedulers. This makes all
schedulers automatically handle auto restarts from CPU hotplug events.
Note that this is necessary even for scx_lavd, which has CPU hotplug
operations, as CPU hotplug operations which took place between skel open and
scheduler init can still trigger restart.
2024-06-03 12:25:41 -10:00
Tejun Heo
a2d5310cb6 Bump versions for a release 2024-06-03 08:35:21 -10:00
Tejun Heo
d3ed4cb5c7 scx_layered: Successfully consuming from HI_FALLBACK_DSQ should terminate dispatching
layered_dispatch() was incorrectly continuing down to the lower priority
DSQs after successfully consuming from HI_FALLBACK_DSQ which can lead to
latency issues. Fix it.
2024-05-28 10:20:55 -10:00
Tejun Heo
99eb56b6b5 scx_layered: Implement layered_dump()
which dumps layer states.
2024-05-23 12:54:17 -10:00
Tejun Heo
a576242b69 scx_layered: Open and grouped layers can handle tasks with custom affinities
The main reason why custom affinities are tricky for scx_layered is because
if we put a task which doesn't allow all CPUs into a layer's DSQ, it may not
get consumed for an indefinite amount of time. However, this is only true
for confined layers. Both open and grouped layers are always consumed from all
CPUs and thus don't have this risk.

Let's allow tasks with custom affinities in open and grouped layers.

- In select_cpu(), don't consider direct dispatching to a local DSQ as
  affinity violation even if the target CPU is outside the layer's cpumask
  if the layer is open.

- In enqueue(), separate out per-cpu kthread special case into its own
  block. Note that this is only applied if the layer is not preempting as a
  preempting layer has a higher priority than HI_FALLBACK_DSQ anyway.

- Trigger the LO_FALLBACK_DSQ path for other threads only if the layer is
  confined.

- The preemption path now also runs for tasks with a custom affinity in open
  and grouped layers. Update it so that it only considers the CPUs in the
  preempting task's allowed cpumask.

(cherry picked from commit 82d2f887a4608de61ddf5e15643c10e504a88f7b)
2024-05-23 12:54:17 -10:00
Tejun Heo
1ce23760b5 scx_layered: Improve affinity violation handling
- AFFN_VIOL for per-cpu tasks could be double counted. Once in select_cpu()
  and again in enqueue(). Count in select_cpu() only when direct
  dispatching.

- Violating tasks were prioritized over non-violating ones because they were
  queued on SCX_DSQ_GLOBAL which has priority over all user DSQs. This
  doesn't make sense. Let's introduce two fallback DSQs - HI_FALLBACK_DSQ
  and LO_FALLBACK_DSQ. HI is used for violating kthreads and LO for
  violating user threads. HI is dispatched after preempting layers and LO
  after all other layers. This shouldn't change the behavior too much for
  kthreads while punishing, rather than rewarding, violating user threads.

(cherry picked from commit 67f69645667ba8a155cae9a9b7e90c055d39e23c)
2024-05-23 12:54:17 -10:00
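The dispatch priority described in 1ce23760b5 (preempting layers, then HI_FALLBACK_DSQ, then all other layers, then LO_FALLBACK_DSQ) can be modelled as an ordered scan that stops at the first non-empty queue class. This is a simplified sketch: the `dsq_class` enum, the per-class counters, and `dispatch_one()` are hypothetical, and real layers each have their own DSQ rather than a single counter per class.

```c
#include <assert.h>
#include <stddef.h>

/* Queue classes in the priority order given in the commit message. */
enum dsq_class {
	DSQ_PREEMPT_LAYERS,	/* layers allowed to preempt */
	DSQ_HI_FALLBACK,	/* affinity-violating kthreads */
	DSQ_OTHER_LAYERS,	/* all remaining layers */
	DSQ_LO_FALLBACK,	/* affinity-violating user threads */
	DSQ_NR,
};

/*
 * Consume one task from the highest-priority non-empty class and stop.
 * queued[c] is the number of tasks waiting in class c. Returns the class
 * consumed from, or -1 if everything is empty.
 */
static int dispatch_one(unsigned int queued[DSQ_NR])
{
	assert(queued != NULL);

	for (int c = 0; c < DSQ_NR; c++) {
		if (queued[c] > 0) {
			queued[c]--;
			return c;	/* do not fall through to lower classes */
		}
	}
	return -1;
}
```

Returning as soon as a task is consumed also models the fix in d3ed4cb5c7, where a successful consume from HI_FALLBACK_DSQ ends dispatching instead of continuing down to lower-priority DSQs.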
David Vernet
61cbfdf912
layered: Remove unused variables
There are some unused variables in scx_layered. Remove them.

Signed-off-by: David Vernet <void@manifault.com>
2024-05-18 07:51:20 -05:00
Tejun Heo
ab25992416 Add missing skel.attach() calls
C SCX_OPS_ATTACH() and rust scx_ops_attach() macros were not calling
.attach() and were only attaching the struct_ops. This meant that all
non-struct_ops BPF programs contained in the skels were never attached which
breaks e.g. scx_layered.

Let's fix it by adding .attach() invocation to the attach macros.
2024-05-17 14:33:04 -10:00
Tejun Heo
71d5e60093 scheds/rust: Use __COMPAT helpers instead of open coding feature tests 2024-04-29 09:58:34 -10:00
Tejun Heo
e5e88b7e18 Bump versions to prepare for a release 2024-04-29 09:07:27 -10:00
Andrea Righi
cabde30736 scx_utils: bump up version to 0.8.0
Bump up scx-utils version to provide the new scx_utils::TopologyMap.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-04-28 21:01:16 +02:00
David Vernet
5ba137e8c9
layered: Make layered backwards compat with cpufreq
Only the very newest kernels support scx_bpf_cpuperf_set(). Let's update
scx_layered to accommodate older kernels as well.

Signed-off-by: David Vernet <void@manifault.com>
2024-04-24 14:01:51 -05:00
Tejun Heo
9a9b4dd23e
Merge pull request #239 from hodgesds/cpufreq_helpers
Add CPU frequency related helpers and extend scx_layered
2024-04-24 07:22:15 -10:00
Daniel Hodges
32e97bf4d5 Adds CPU frequency related helpers and extend scx_layered
This change adds `scx_bpf_cpuperf_cap`, `scx_bpf_cpuperf_cur` and
`scx_bpf_cpuperf_set` definitions that were recently introduced into
[`sched_ext`](https://github.com/sched-ext/sched_ext/pull/180). It adds
a `perf` field to `scx_layered` to allow for controlling performance per
layer.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-04-24 07:27:52 -07:00
David Vernet
24c248eebb
layered: Add support for filtering on process name
If a library creates threads, those threads will often have the same
name. If two different processes of different priority both use a
library, it may be that we want the library's threads in each process to
be put into different layers.

To support this, let's add the ability to filter not only by task name,
but also by process name via the task thread group leader's comm.

Tested by creating two executables named "foo" and "bar", which both
spawn a bunch of tasks named "exp_worker" that spin until being
interrupted. With this config: https://pastebin.com/Uz2phzxQ, the tasks
were correctly matched to the expected layers.

Signed-off-by: David Vernet <void@manifault.com>
2024-04-23 23:12:37 -05:00
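The matching idea in 24c248eebb can be sketched with the "foo"/"bar"/"exp_worker" test case from the commit message. The sketch assumes prefix matching on both names and treats an unset pattern as a wildcard; `struct task_names`, `struct layer_match`, and `layer_matches()` are hypothetical and do not mirror the scx_layered config format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* The two names a task can be filtered on. */
struct task_names {
	const char *comm;		/* the task's own name */
	const char *leader_comm;	/* thread group leader's comm */
};

/* A layer's match rule; NULL means "do not filter on this name". */
struct layer_match {
	const char *comm_prefix;
	const char *pcomm_prefix;
};

static bool has_prefix(const char *s, const char *prefix)
{
	return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* A task matches only if every pattern that is set matches. */
static bool layer_matches(const struct layer_match *m,
			  const struct task_names *t)
{
	assert(m != NULL && t != NULL);

	if (m->comm_prefix && !has_prefix(t->comm, m->comm_prefix))
		return false;
	if (m->pcomm_prefix && !has_prefix(t->leader_comm, m->pcomm_prefix))
		return false;
	return true;
}
```

With a rule of `{ "exp_worker", "foo" }`, the worker threads spawned by "foo" match while identically named threads spawned by "bar" do not, which is the separation the commit is after.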
David Vernet
a998fb7d01
layered: Clarify f: and file: prefix behavior
Some people have expressed confusion at this behavior. Let's be a bit
more explicit in the documentation.

Signed-off-by: David Vernet <void@manifault.com>
2024-04-23 20:39:28 -05:00
takase1121
5d20f89a87
scheds-rust: build rust schedulers in sequence 2024-04-23 08:06:27 +08:00
David Vernet
5f1eac85ff
layered: Fix init_task
When I transitioned layered to using task local storage, I messed up
initializing the task ctx, not realizing we previously had a separate
variable that was initializing the hashmap entry. We need to initialize
the task's layer to -1, and also set refresh_layer to 1.

Signed-off-by: David Vernet <void@manifault.com>
2024-04-18 09:44:32 -05:00
Dan Schatzberg
6eefc8c27f
Fix error typo
ENONET means "Machine is not on the network" - this was supposed to be ENOENT "No such file or directory"
2024-04-10 15:28:05 -04:00
Tejun Heo
b925bdf94d Cargo.toml: Update libbpf-rs/cargo dependencies to 0.23 and drop patch.crates-io sections
New versions of libbpf-rs and libbpf-cargo are now available with all the
needed features. Update the dependencies and drop the patch sections.
2024-04-02 11:19:39 -10:00
Tejun Heo
6f81409df4 Bump versions
- scx_utils bumped from 0.6.0 to 0.7.0.

- Repo and rust schedulers get a PATCH level bump.
2024-04-02 10:58:50 -10:00
Tejun Heo
59bbd800c1 compat: Implement scx_utils::compat and fix up scx_layered
Implement scx_utils::compat to match C's scx/compat.h and update
scx_layered. Other rust scheds are still broken.
2024-04-02 07:08:56 -10:00
David Vernet
e857dd90ab
layered: Use TLS map instead of hash map
In scx_layered, we're using a BPF_MAP_TYPE_HASH map (indexed by pid)
rather than a BPF_MAP_TYPE_TASK_STORAGE, to track local storage for a
task. As far as I can tell, there's no reason we need to be doing this.
We never access the map from user space, and we're even passing a
struct task_struct * to a helper subprog to look up the task context
rather than only doing it by pid.

Using a hashmap is error prone for this because we end up having to
manually track lifecycles for entries in the map rather than relying on
BPF to do it for us. For example, BPF will automatically free a task's
entry from the map when it exits. Let's just use TLS here rather than a
hashmap to avoid issues from this (e.g. we've observed the scheduler
getting evicted because we're accessing a stale map entry after a task
has been destroyed).

Reported-by: Valentin Andrei <vandrei@meta.com>
Signed-off-by: David Vernet <void@manifault.com>
2024-03-27 20:14:27 -05:00
David Vernet
602ec5ada3
layered: Make helper functions static
lookup_task_ctx(), lookup_task_ctx_may_fail(), and lookup_layer()
currently don't have the static keyword, so BPF may treat them as a
global function. We don't actually want these to be global, so let's
make them static to avoid confusing the verifier.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-26 15:08:32 -05:00
David Vernet
3cda1bc690
Merge pull request #187 from sched-ext/layered-updates
scx_layered: Make config json assume default values for unspecified fields
2024-03-13 17:15:18 -05:00
Tejun Heo
76fb0fdd8f scx_layered: Make config json assume default values for unspecified fields
This makes writing configs easier and allows introducing new fields without
breaking existing configs.
2024-03-13 11:10:38 -10:00
Tejun Heo
6048992ca7
Merge pull request #185 from sched-ext/layered-updates
scx_layered: Implement layer properties `exclusive` and `min_exec_us`
2024-03-13 09:59:37 -10:00
Tejun Heo
60b346c1fc scx_layered: Add more comments 2024-03-13 09:56:28 -10:00
Tejun Heo
a9457a408e scx_layered: stat reporting updates 2024-03-12 10:48:21 -10:00
Tejun Heo
a642fc873b scx_layered: Fix stat reporting
GSTAT_TASK_CTX_FREE_FAILED should report total while EXCL_* should report
delta pct. Fix them.
2024-03-12 10:25:51 -10:00