Commit Graph

2182 Commits

Author SHA1 Message Date
Andrea Righi
c4b6408e92 scx_rustland: smooth vruntime update
Update vruntime by adding the used virtual time slice of each task as
soon as it is scheduled.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-16 14:06:00 +02:00
Andrea Righi
0b2de2c10c scx_rustland: use built-in nvcsw metrics
Use the nvcsw metric from the scx_rustland_core backend, instead of
retrieving this metric in user-space via procfs.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-16 14:06:00 +02:00
Andrea Righi
97629178e2 scx_rustland_core: bump up version to 2.2.2
Bump up the minor version to reflect the new backward-compatible
functionality added.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-16 14:06:00 +02:00
Andrea Righi
704fe95f51 scx_rustland_core: get rid of the SCX_ENQ_WAKEUP logic
With user-space scheduling we don't usually dispatch a task immediately
after selecting an idle CPU, so there's not much benefit in trying to
optimize the WAKE_SYNC scenario (when a task is waking up another task
and releasing the CPU) when picking an idle CPU.

Therefore, get rid of the WAKE_SYNC logic in select_cpu() and rely on
the user-space logic (that has access to the WAKE_SYNC information) to
handle this particular case.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-16 14:05:58 +02:00
Andrea Righi
67ec1af5cf scx_rustland_core: kick an idle CPU after global dispatch
Do not kick a CPU from rs_select_cpu() (called by the user-space
scheduler), since we may not immediately dispatch the task.

Instead, always try to wake up the task's assigned CPU after dispatching
to a global DSQ, ensuring it can be consumed immediately.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-16 14:05:33 +02:00
Andrea Righi
0a05f1f193 scx_rustland_core: keep CPUs alive with pending tasks
Prevent CPUs from going idle when the user-space scheduler has some
pending activities to complete.

Keeping the CPU alive allows tasks to be consumed from the user-space
scheduler more efficiently, preventing bubbles in the scheduling
pipeline.

To achieve this, trigger a CPU kick from ops.update_idle() and set a
flag in the CPU context to prevent it from going idle. Then keep kicking
the CPU from ops.dispatch() until the flag is cleared, which occurs when
no more tasks are pending or when the CPU exits idle as a task starts
running on it.

This fixes the performance regression introduced by the
put_prev_task_scx() behavior change in Linux 6.12 (see #788).

Link: https://lore.kernel.org/lkml/20241015111539.12136-1-andrea.righi@linux.dev/
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-16 10:43:43 +02:00
Daniel Hodges
907746745e scx_layered: Refactor refresh cpumasks
Refactor the cpumask refresh logic to be easier to read and verify.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-15 17:58:10 -07:00
Andrea Righi
abfb4c53f5 scx_rustland_core: restart scheduler on hotplug events
User-space schedulers may still hit some stalls during CPU hotplugging
events.

There is no reason to overcomplicate the code by trying to handle
hotplug events within the scx_rustland_core framework; we can simply
handle a scheduler restart performed by the scx core.

This makes CPU hotplugging more reliable with scx_rustland_core-based
schedulers.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-15 23:11:43 +02:00
Andrea Righi
4432e64d85 scx_rustland_core: allow user-space scheduler to run indefinitely
Assign an infinite time slice to the user-space scheduler itself, so
that it can completely drain all the pending tasks and voluntarily
release the CPU when it's done.

This achieves more consistent performance and also lets us remove the
speculative user-space scheduler wakeup from ops.stopping().

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-15 23:11:43 +02:00
Andrea Righi
be681c731a scx_rustland_core: pass nvcsw, slice and dsq_vtime to user-space
Provide additional task metrics to user-space schedulers via QueuedTask:
 - nvcsw: total number of voluntary context switches
 - slice: task time slice "budget" (from p->scx.slice)
 - dsq_vtime: current task vtime (from p->scx.dsq_vtime)

In this way user-space schedulers can quickly access these metrics to
implement better scheduling policies.
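
As a rough illustration of how a user-space policy might consume these
fields, here is a minimal sketch. The Python `QueuedTask` class only mirrors
the three field names listed above (plus a hypothetical `pid`), and
`pick_next` with its tie-breaking rule is an assumption made for the example,
not the policy used by any scx scheduler.

```python
from dataclasses import dataclass


@dataclass
class QueuedTask:
    # Hypothetical Python mirror of the fields named in the commit message.
    pid: int
    nvcsw: int      # total number of voluntary context switches
    slice: int      # task time slice "budget"
    dsq_vtime: int  # current task vtime


def pick_next(tasks):
    # Illustrative policy only: lowest vtime first; on a tie, prefer the
    # task with more voluntary context switches (likely more interactive).
    return min(tasks, key=lambda t: (t.dsq_vtime, -t.nvcsw))


tasks = [
    QueuedTask(pid=101, nvcsw=5, slice=2_000_000, dsq_vtime=900),
    QueuedTask(pid=102, nvcsw=40, slice=1_000_000, dsq_vtime=900),
    QueuedTask(pid=103, nvcsw=1, slice=5_000_000, dsq_vtime=1500),
]
chosen = pick_next(tasks)
```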

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-15 23:11:43 +02:00
Andrea Righi
1bbae64dc7 scx_rustland_core: update CPU idle selection logic
Re-align idle selection logic with some of the latest improvements done
in scx_bpfland.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-15 23:11:42 +02:00
Tejun Heo
4841df8138
Merge pull request #793 from minosfuture/vmlinux_per_arch
Use per-arch vmlinux.h
2024-10-15 19:52:42 +00:00
Dan Schatzberg
96ebe6b84a scx_mitosis: Handle pinned tasks
Pinned tasks should just be routed to a fallback DSQ. kthreads are given
a higher priority than non-kthreads so use two fallback DSQs.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-10-15 09:09:01 -07:00
Dan Schatzberg
902f41adf0
Merge pull request #799 from dschatzberg/mitosis_dispatch_no_wakeup
scx_mitosis: handle enqueue() on !wakeup
2024-10-15 13:46:07 +00:00
Daniel Hodges
e017692697
Merge pull request #801 from hodgesds/layered-iter-fix
scx_layered: Remove layer iteration
2024-10-14 23:39:48 +00:00
Daniel Hodges
71d63010af scx_layered: Refactor layer iteration
Remove DSQ iter algos.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-14 13:13:53 -07:00
Dan Schatzberg
a17f16e4b9 scx_mitosis: handle enqueue() on !wakeup
If we're not on the wakeup path, we may see enqueue() invoked without
select_cpu() which will require an idle cpu lookup. In order to fix
this, we refactor the idle_cpu lookup in select_cpu so it can be invoked
from enqueue().

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-10-14 10:13:07 -07:00
Daniel Hodges
7bfbc71012
Merge pull request #798 from sched-ext/hodgesds-perfetto-docs
Update developer guide with Perfetto info
2024-10-14 15:09:15 +00:00
Daniel Hodges
43615107f9
Merge pull request #797 from hodgesds/layered-llc-integration
scx_layered: Add LLC integration test
2024-10-14 15:06:25 +00:00
Daniel Hodges
c76b87f62f
Update developer guide with Perfetto info
Update the developer guide with info on how to run the helper script to generate Perfetto compatible traces.
2024-10-14 10:51:03 -04:00
Daniel Hodges
912d6e01c1 scx_layered: Add LLC integration test
Add an integration test for testing that the `llcs` field on the layer
config works properly.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-14 07:27:29 -07:00
Daniel Hodges
ed18e43612
Merge pull request #795 from hodgesds/bpftrace-tests
scx_layered: Add topology integration test
2024-10-14 12:54:54 +00:00
Andrea Righi
5408cd3f78
Merge pull request #796 from sched-ext/sched-ftrace-script
scripts: Convert sched ftrace helper scripts to python
2024-10-14 12:39:57 +00:00
Andrea Righi
8b7f9cde0f scripts: Convert sched ftrace helper scripts to python
Merge the sched_switch ftrace helper scripts into a single python script
that prints the result to stdout.

In this way it's possible to generate a perfetto-compatible trace
running:

 $ sudo ./scripts/sched_ftrace.py > sched.ftrace

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-14 08:44:14 +02:00
Daniel Hodges
e456c83536 scx_layered: Add topology integration test
Add a bpftrace script that does a topology aware test. The test script
runs a bpftrace script that asserts that stress-ng processes are
scheduled on NUMA node 0 only.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-13 20:23:11 -07:00
Ming Yang
f7cdf08754 scx_mitosis: Fix static assertion of scx_bpf_task_cgroup failing __weak check
The __weak check failed the static assertion in the bpf_ksym_exists macro.

Signed-off-by: Ming Yang <minos.future@gmail.com>
2024-10-13 07:57:12 -07:00
Ming Yang
7067092555 Add vmlinux.h for multiple arch
Following the switch to per-arch vmlinux.h, add it for the remaining
archs.

Signed-off-by: Ming Yang <minos.future@gmail.com>
2024-10-13 07:57:12 -07:00
Ming Yang
a23f3566e3 Use per-arch vmlinux.h
vmlinux.h is not compatible across archs.

Handle this compatibility issue by:
* Adding arch info to the vmlinux.h real file name
* Linking vmlinux.h to the target-arch real file at build time
* Using the target-arch real file for scx_utils bindgen.
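
The linking step can be pictured with a small sketch. The file naming scheme
(`vmlinux-<arch>.h`), the `link_vmlinux` helper, and the temporary directory
are all assumptions for illustration; the actual build logic lives in the
project's build scripts and may differ.

```python
import os
import tempfile


def link_vmlinux(include_dir, arch):
    # Hypothetical naming: one real header per architecture.
    real = f"vmlinux-{arch}.h"
    link = os.path.join(include_dir, "vmlinux.h")
    # Replace any stale link left over from a previous build.
    if os.path.lexists(link):
        os.remove(link)
    # Relative symlink, so the include directory stays relocatable.
    os.symlink(real, link)
    return link


with tempfile.TemporaryDirectory() as d:
    for arch in ("x86", "arm64"):
        with open(os.path.join(d, f"vmlinux-{arch}.h"), "w") as f:
            f.write(f"/* {arch} */\n")
    link = link_vmlinux(d, "arm64")
    resolved = os.readlink(link)
    with open(link) as f:
        content = f.read()
```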

Also refactored clang related logic into a new clang_info mod, which is
shared by bpf_builder.rs and builder.rs.

Signed-off-by: Ming Yang <minos.future@gmail.com>
2024-10-13 07:57:12 -07:00
Daniel Hodges
03f078ac74
Merge pull request #792 from hodgesds/ftrace-perfetto
scripts: Add ftrace perfetto helper scripts
2024-10-13 13:03:47 +00:00
Daniel Hodges
cc3fede8e0
scripts: Add ftrace helper scripts
Add a set of ftrace helper scripts for making perfetto compatible ftrace
scheduler profiles.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-13 09:00:07 -04:00
Changwoo Min
ba9d75a6ab
Merge pull request #791 from multics69/lavd-sched-sample
scx_lavd: misc updates
2024-10-13 08:52:08 +00:00
Changwoo Min
c1f4051a14 scx_lavd: fix int overflow in calculating avg_lat_cri
u32 is not big enough to hold the sum of lat_cri in a period,
so sum_lat_cri (u32) overflowed, resulting in an incorrect
avg_lat_cri. Change the type from u32 to u64, avoiding the
integer overflow. Note that {sum/avg}_lat_cri is only for
debugging, so it is irrelevant in making scheduling decisions.
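
The effect can be reproduced with a short simulation. The sample values and
the `accumulate` helper are made up for the example; the masks simply emulate
fixed-width unsigned arithmetic.

```python
U32_MASK = (1 << 32) - 1
U64_MASK = (1 << 64) - 1


def accumulate(values, mask):
    # Emulate an unsigned accumulator that wraps at the given width.
    total = 0
    for v in values:
        total = (total + v) & mask
    return total


# Hypothetical period: 50,000 samples, each with lat_cri = 100,000.
samples = [100_000] * 50_000

avg32 = accumulate(samples, U32_MASK) // len(samples)  # wrapped sum
avg64 = accumulate(samples, U64_MASK) // len(samples)  # exact sum
```

With these inputs the true sum (5,000,000,000) exceeds the u32 range, so the
32-bit average comes out far below the real per-sample value, while the
64-bit average matches it.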

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-13 00:58:36 +09:00
Changwoo Min
6c9bbe66dc scx_lavd: remove unnecessary downscaling in deadline calculation
Downscaling is not necessary when calculating a task's virtual
deadline because the virtual deadline represents only the relative
order in task scheduling. Hence downscaling only introduces
inaccuracy caused by truncation.
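
A small sketch of the truncation problem follows. The deadline values and the
shift amount are assumptions chosen for illustration, not the constants used
in scx_lavd.

```python
SHIFT = 10  # hypothetical downscaling factor (divide by 1024)

# Hypothetical virtual deadlines for three tasks.
deadlines = {"a": 10_300, "b": 10_500, "c": 12_000}
scaled = {name: d >> SHIFT for name, d in deadlines.items()}

# At full resolution "a" is strictly ahead of "b"; after downscaling
# both collapse to the same value and their relative order is lost.
full_order = sorted(deadlines, key=deadlines.get)
tie_after_scaling = scaled["a"] == scaled["b"]
```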

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-13 00:41:23 +09:00
Changwoo Min
dce67296bf
Merge pull request #790 from multics69/lavd-sched-sample
scx_lavd: do not inspect scx_lavd process itself
2024-10-12 14:18:59 +00:00
Daniel Hodges
a846fb3be6
Merge pull request #789 from hodgesds/vtime-dist
Add bpftrace script to print vtime distributions across DSQs
2024-10-12 12:46:59 +00:00
Daniel Hodges
f59b73b97c
scripts: Add vtime distribution script
Add bpftrace script to print the distribution of vtime across DSQs.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-12 08:40:16 -04:00
Changwoo Min
6ddc3f0a2b scx_lavd: do not inspect scx_lavd process itself
Printing the task status of scx_lavd itself is not useful,
so filter it out.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-12 17:21:08 +09:00
Andrea Righi
7f3b0cb739
Merge pull request #780 from sched-ext/bpfland-remove-pcpu-dsq
scx_bpfland: drop per-cpu DSQs
2024-10-12 06:25:35 +00:00
Andrea Righi
09fee68a6b
Merge pull request #785 from sched-ext/rustland-core-fix-mm-kprobe
scx_rustland_core: use handle_mm_fault kprobe
2024-10-12 06:20:15 +00:00
Andrea Righi
197dee93f4 scx_bpfland: get rid of per-CPU DSQs
Using per-CPU DSQs seems to introduce more issues than benefits
(potential stalls, etc.). Therefore, let's get rid of the per-CPU DSQs
and use SCX_DSQ_LOCAL for tasks directly dispatched to specific CPUs.

This change seems to also improve performance on 6.12 and it makes the
scheduler a lot more stable and consistent.

The issues will be investigated separately, providing a separate stress
test scheduler, designed to stress test per-CPU DSQs.

Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Tested-by: Eric Naim <dnaim@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-12 08:15:51 +02:00
Andrea Righi
198f22656c scx_bpfland: clarify error code returned by pick_idle_cpu()
Return more meaningful error codes from pick_idle_cpu(). No functional
change, just improved code readability.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-12 08:08:48 +02:00
Andrea Righi
ceb4f1755f scx_bpfland: always refill task timeslice in ops.dispatch()
When a task exhausts its timeslice and no other tasks are ready to run,
we automatically refill its timeslice, but only if the current CPU is a
fully idle SMT core.

If we don’t handle the refill, the sched_ext core will default to
refilling using SCX_SLICE_DFL, which may not be optimal.

To ensure better control over the task’s timeslice, always refill it
when no other tasks are available to run.

Fixes: 6e24fcc ("scx_bpfland: keep tasks running on full-idle SMT cores")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-12 08:08:48 +02:00
Andrea Righi
54d704ceda scx_bpfland: pick a random idle CPU when prev_cpu is not valid
Pick any random idle CPU when the previous CPU isn't valid anymore
according to the task's cpumask.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-12 08:08:48 +02:00
Changwoo Min
836cf9faa4
Merge pull request #779 from multics69/lavd-futex-v2
scx_lavd: mitigate the lock holder preemption problem
2024-10-12 02:42:33 +00:00
Daniel Hodges
a950f96353
Merge pull request #787 from hodgesds/layered-clean
scx_layered: Cleanup non topology path
2024-10-11 18:26:23 +00:00
Daniel Hodges
a08a76ccd6 scx_layered: Cleanup non topology path
More cleanup in the non topology path to remove copy/pasta declarations.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-11 10:18:34 -07:00
eb59085e61
Merge pull request #781 from JakeHillion/pr781
layered: move configuration into library component
2024-10-11 16:39:23 +00:00
caff46e864
Merge pull request #786 from JakeHillion/pr786
layered: make default value for disable_topology dynamic
2024-10-11 16:25:45 +00:00
Jake Hillion
52c279a469 layered: make default value for disable_topology dynamic
Disable topology currently defaults to `false` (topology enabled...). Change
this so that topology is enabled by default on hardware that may benefit from
it (multiple NUMA nodes or LLCs) and disabled on hardware that does not benefit
from it.
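
The decision described above can be sketched as a three-state option: an
explicit user choice wins, otherwise the hardware decides. The function name
and parameters here are hypothetical and the sketch is in Python for brevity;
the real implementation lives in the Rust `Scheduler` code.

```python
from typing import Optional


def resolve_disable_topology(flag: Optional[bool],
                             numa_nodes: int,
                             llcs: int) -> bool:
    # An explicit user setting (True or False) is always honored.
    if flag is not None:
        return flag
    # Not specified: enable topology awareness only on hardware with
    # multiple NUMA nodes or multiple LLCs.
    may_benefit = numa_nodes > 1 or llcs > 1
    return not may_benefit


# Single NUMA, multi-LLC machine, flag not given: topology stays enabled.
multi_llc = resolve_disable_topology(None, numa_nodes=1, llcs=2)
# Single NUMA, single LLC machine, flag not given: topology is disabled.
single_llc = resolve_disable_topology(None, numa_nodes=1, llcs=1)
```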

This is a slightly noisy change as we have to move ownership of the newly
mutable layer specs into the `Scheduler` object (previously they were a
borrow). We don't have a `Topology` object to make the default decision from
until `Scheduler::init`, and I think this is because of the possibility of hot
plugs. We therefore have to clone the `Vec<LayerSpec>` each time as it is
potentially mutable.

Test plan:
- CI. Updated to be explicit about topology in both cases.

Single NUMA multi-LLC machine:
```
$ scx_layered --run-example
...
13:34:01 [INFO] Topology awareness not specified, selecting enabled based on
hardware
...
$ scx_layered --run-example --disable-topology=true
...
13:33:41 [INFO] Disabling topology awareness
...
$ scx_layered --run-example -t
...
13:33:15 [INFO] Disabling topology awareness
...
$ scx_layered --run-example --disable-topology=false
# none of the above messages present
```

Single NUMA single LLC machine:
```
$ scx_layered --run-example
15:33:10 [INFO] Topology awareness not specified, selecting disabled based on
hardware
```
2024-10-11 17:09:07 +01:00
Jake Hillion
143a55cda1 layered: move configuration into library component
Move the LayerConfig and its children from `main.rs` into `lib.rs`. This allows
other tooling, such as config managers or test executors, to modify layered
configs programmatically.

The end goal is to move everything in `layered` except for the argument parsing
into a `run_layered` function, but I haven't done it in this diff because it's
a larger change. This is a common pattern in Rust projects to do as little as
possible in `main.rs` for extensibility.

The only change here, other than publicity and where things are located, is the
signature of `CpuPool::alloc_cpus`. It previously relied on `&Layer`, and this
changes it to the two elements of `Layer` it uses. This allows `Layer` to stay
confined to `main.rs` (for now) to prevent scope creep in this PR.

This may be inconvenient in the short term for WIPs and anyone doing non-Cargo
builds (cough me), but having things split into more files should make
rebases/merges easier in the long run.

Test plan:
- `cargo build --release`
- CI.
2024-10-11 15:55:29 +01:00