Commit Graph

1167 Commits

Author SHA1 Message Date
Tejun Heo
6ea15f9f9f
Merge pull request #819 from minosfuture/vmlinux_per_arch
Use per-arch vmlinux.h v2
2024-10-21 19:36:52 +00:00
likewhatevs
303c6d09a0
Merge pull request #824 from likewhatevs/layered-exit-task-no-missing-ctx
scx_layered: fix exit_task ctx lookup err
2024-10-21 14:52:07 +00:00
Jake Hillion
55c9636f78 layered: bpf: add layer kind to layer
Currently we have an approximation of LayerKind in the BPF code with `open` on
the layer, but it is difficult/impossible to tell the difference between an
Open and a Grouped layer. Add a `kind` field to the BPF `layer` and plumb
through an enum from the Rust side.
2024-10-21 11:32:17 +01:00
Pat Somaru
d89c571593
scx_layered: do not attempt ctx lookup on tasks exited before running on scx
2024-10-20 17:47:24 -04:00
Andrea Righi
fb3f1d0b43
Merge pull request #821 from sched-ext/rustland-min-vtime-budget
scx_rustland: Adjust task's vruntime budget based on latency weight
2024-10-20 07:44:35 +00:00
Changwoo Min
bf1b014d63
Merge pull request #818 from multics69/lavd-tuning
scx_lavd: add missing reset_lock_futex_boost()
2024-10-20 01:41:54 +00:00
Daniel Hodges
e72e5ce0f4
Merge pull request #744 from minosfuture/main
scx_layered: Fix crash on aarch64 due to unavailable cache id file
2024-10-19 22:33:53 +00:00
Ming Yang
1b5359ef4a Use per-arch vmlinux.h v2
Rework per-arch vmlinux solution
* have per-arch directory under sched/include/arch/, in which we
  maintain vmlinux.h symlink and real file
  vmlinux-{kernel_ver}-g{sha1}.h. The original sched/include/vmlinux/
  folder is removed.
* update meson build `-I` option to find the new vmlinux.h position
* update cargo build scripts to use the per-arch vmlinux.h for
  generating bindings
* keep the original ClangInfo refactoring changes

Signed-off-by: Ming Yang <minos.future@gmail.com>
2024-10-19 10:50:59 -07:00
Andrea Righi
30a2a2013c scx_rustland: Adjust task's vruntime budget based on latency weight
Adjust the amount of vruntime budget an idle task can accumulate as a
function of its latency weight, which is derived from the average number
of voluntary context switches.

This ensures that latency-sensitive tasks naturally receive an
additional priority boost, and we can avoid scaling down the vruntime
to determine the task's deadline, making the scheduler more fair.

It also makes the scheduler more robust: rustland can now survive
intensive stress tests, such as `stress-ng --cpu-sched 64` or hackbench.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-19 19:32:14 +02:00
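A minimal sketch of the idea in the commit above, not the scx_rustland code itself. It assumes the budget scales linearly with the latency weight and is applied as a lower bound on a waking task's vruntime. The function name, the constants, and the linear scaling are all hypothetical.

```python
def wakeup_vruntime(task_vruntime, min_vruntime, slice_ns, latency_weight):
    """Clamp a waking task's vruntime so its accumulated credit is bounded."""
    # Assumption: the budget grows linearly with the latency weight.
    budget = slice_ns * latency_weight
    floor = max(min_vruntime - budget, 0)
    return max(task_vruntime, floor)


MIN_VRUNTIME = 100_000_000  # hypothetical global minimum vruntime (ns)
SLICE_NS = 5_000_000        # hypothetical base time slice (ns)

# Both tasks slept for a long time, so their stored vruntime is far behind.
batch_task = wakeup_vruntime(0, MIN_VRUNTIME, SLICE_NS, latency_weight=1)
interactive_task = wakeup_vruntime(0, MIN_VRUNTIME, SLICE_NS, latency_weight=10)
```

Under these assumptions the task with the higher latency weight wakes up with a smaller vruntime, so it is ordered ahead of the batch task without any separate deadline scaling.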
Daniel Hodges
b1b76ee72a
scx_rusty: Cleanup cpumask casting
Use the cast_mask helper function to clean up scx_rusty.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-19 12:01:36 -04:00
Changwoo Min
2fd395bbbf scx_lavd: remove unnecessary load tracking
The algorithm has evolved to decide the time slice without
tracking the system-wide load. So remove the obsolete load tracking
code.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-19 15:39:24 +09:00
Changwoo Min
8d63024be7 scx_lavd: add missing reset_lock_futex_boost()
reset_lock_futex_boost() should be called on every context switch of a
task. Otherwise, in the worst case, a task and its CPU could block
preemption. To avoid this, add the missing
reset_lock_futex_boost() calls.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-19 15:39:18 +09:00
Ming Yang
f3f4726c09 scx_layered: Read CPU topology for building CpuPool
Building CpuPool from the cache-CPU topology did not work on arm, because
the `/sys/devices/system/cpu/cpu{}/cache/index{}/id` file is unavailable there.

Read CPU topology instead.

Signed-off-by: Ming Yang <minos.future@gmail.com>
2024-10-17 23:41:08 -07:00
Andrea Righi
48bbcd24dd scx_bpfland: tune default settings
Adjust some default settings after the rework done with commit 112a5d4
("scx_bpfland: rework lowlatency mode to adjust tasks priority").

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-17 21:46:51 +02:00
Andrea Righi
4d68133f3b scx_bpfland: rework lowlatency mode to adjust tasks priority
Rework lowlatency mode as follows:
 - introduce task dynamic priority: task weight multiplied by the
   average amount of voluntary context switches
 - use dynamic priority to determine task's vruntime (instead of the
   static task's weight)
 - task's minimum vruntime is evaluated as a function of the dynamic
   priority (tasks with a higher dynamic priority can have a smaller
   vruntime compared to tasks with a lower dynamic priority)

The dynamic priority maintains good system responsiveness even without
classifying tasks as "interactive" or "regular"; therefore, in
lowlatency mode only the shared DSQ is used (the priority DSQ is
disabled).

Using a separate priority queue to dispatch "interactive" tasks makes
the scheduler less fair, allowing latency-sensitive tasks to be
prioritized even when there is a high number of tasks in the system
(e.g., `stress-ng -c 1024` or similar scenarios), where relying solely
on dynamic priority may not be sufficient.

On the other hand, disabling the classification of "interactive" tasks
results in a fairer scheduler and more predictable performance, making
it better suited for soft real-time applications (e.g., audio and
multimedia).

Therefore, the --lowlatency option is retained to allow users to choose
between more predictable performance (by disabling the interactive task
classification) or a more responsive system (default).

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-17 21:46:51 +02:00
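The dynamic priority described in the commit above can be sketched roughly as follows. This is an illustration only, not the scx_bpfland BPF code. The base weight of 100, the clamp that keeps tasks with no voluntary context switches at their static weight, and the function names are assumptions.

```python
BASE_WEIGHT = 100  # assumed weight of a default-priority task


def dynamic_priority(weight, avg_nvcsw):
    """Task weight multiplied by the average voluntary context switches."""
    # Assumption: clamp to 1 so CPU-bound tasks keep their static weight.
    return weight * max(avg_nvcsw, 1)


def charge_vruntime(vruntime, used_ns, weight, avg_nvcsw):
    """Advance vruntime inversely to the dynamic priority."""
    return vruntime + used_ns * BASE_WEIGHT // dynamic_priority(weight, avg_nvcsw)


# Two tasks with the same static weight run for 1 ms each.
cpu_hog = charge_vruntime(0, 1_000_000, weight=100, avg_nvcsw=0)
interactive = charge_vruntime(0, 1_000_000, weight=100, avg_nvcsw=10)
```

In this sketch the task that blocks voluntarily more often accrues vruntime more slowly, so a single shared queue ordered by vruntime already favours it, without a separate priority queue.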
Andrea Righi
d336892c71
Merge pull request #816 from sched-ext/rustland-core-update-doc
scx_rustland_core: update documentation about the new API
2024-10-17 19:18:16 +00:00
Andrea Righi
a155ff2ada scx_rustland_core: update documentation about the new API
Update the documentation adding the new task statistics provided by
scx_rustland_core.

Fixes: be681c7 ("scx_rustland_core: pass nvcsw, slice and dsq_vtime to user-space")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-17 19:07:51 +02:00
f1b1830512
Merge pull request #814 from JakeHillion/pr814
layered: add RandomTopo layer growth algorithm
2024-10-17 17:05:53 +00:00
Jake Hillion
1415b4a454 layered: make disable_topology arg require equals
The recent change that made the `disable_topology` arg an `Option<bool>`
instead of a `bool` caused it to incorrectly consume the following argument
as its value. Make the argument `require_equals` to fix this case.

This is a behaviour change for anybody previously relying on `-t true`,
`-t false`, `--disable-topology true`, or `--disable-topology false`. The
equals syntax worked before and continues to work after, as demonstrated in the
CI.

Test plan:

Before:
```sh
$ sudo target/release/scx_layered -t f:/tmp/test.json
error: invalid value 'f:/tmp/test.json' for '--disable-topology
[<DISABLE_TOPOLOGY>]'
  [possible values: true, false]

  For more information, try '--help'.
```

After:
```sh
$ sudo target/release/scx_layered -t f:/tmp/test.json
14:44:00 [INFO] CPUs: online/possible=176/176 nr_cores=88
14:44:00 [INFO] Disabling topology awareness
...
^CEXIT: Scheduler unregistered from user space
```
2024-10-17 15:46:30 +01:00
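To illustrate the parsing ambiguity fixed above, here is a toy hand-rolled parser. It is not clap and not the scx_layered argument handling; the `parse` function and its `require_equals` switch are hypothetical stand-ins for the two behaviours shown in the test plan.

```python
def parse(argv, require_equals):
    """Return (disable_topology, positionals) for a toy `-t[=BOOL]` flag."""
    disable_topology = None
    positionals = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "-t":
            nxt = argv[i + 1] if i + 1 < len(argv) else None
            if not require_equals and nxt is not None:
                # Old behaviour: the next token is taken as the flag's value.
                if nxt not in ("true", "false"):
                    raise ValueError(f"invalid value {nxt!r} for '-t'")
                disable_topology = nxt == "true"
                i += 1
            else:
                # A bare flag means "true", matching the "After" log output.
                disable_topology = True
        elif arg.startswith("-t="):
            value = arg[len("-t="):]
            if value not in ("true", "false"):
                raise ValueError(f"invalid value {value!r} for '-t'")
            disable_topology = value == "true"
        else:
            positionals.append(arg)
        i += 1
    return disable_topology, positionals


# With require_equals, the config spec stays a positional argument.
after = parse(["-t", "f:/tmp/test.json"], require_equals=True)
```

Calling `parse(["-t", "f:/tmp/test.json"], require_equals=False)` raises `ValueError` in this sketch, which mirrors the "Before" failure.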
Jake Hillion
a0fe303b61 layered: add RandomTopo layer growth algorithm
Add an additional layer growth algorithm, named 'RandomTopo'. It follows these
rules:
- Randomise NUMA nodes. List each core in each NUMA node before a core from
  another NUMA node.
- Randomise LLCs within each NUMA node. List each core in each LLC before a
  core in a different LLC.
- Randomise the core order within each LLC.

This attempts to provide a relatively evenly distributed set of cores while
considering topology. Unlike `Topo`, it does not require you to specify the
ordering and instead generates it from the hardware, making desyncs between the
config and the hardware less likely.

Currently `RandomTopo` considers topology even with `--disable-topology=true`.
I can see the arguments for this going both ways. On one hand requesting
disable topology suggests you want no consideration of machine topology, and
`RandomTopo` should decay to `Random` (which it does on single node/LLC machines
anyway). On the other hand, the config explicitly specifies `RandomTopo` and
should consider the topology. If anyone feels strongly I can change this to
respect `disable_topology`.

Test plan:
```sh
$ sudo target/release/scx_layered -v f:/tmp/test.json
...
14:31:19 [DEBUG] layer: batch algo: RandomTopo core order: [47, 44, 43, 42, 40, 45, 46, 41, 38, 37, 36, 39, 34, 32, 35, 33, 54, 49, 50, 52, 51, 48, 55, 53, 68, 64, 66, 67, 70, 69, 71, 65, 9, 10, 12, 15, 14, 11, 8, 13, 59, 60, 57, 63, 62, 56, 58, 61, 2, 3, 5, 4, 0, 6, 7, 1, 86, 83, 85, 87, 84, 81, 80, 82, 20, 22, 19, 23, 21, 18, 17, 16, 30, 25, 26, 31, 28, 27, 29, 24, 78, 73, 74, 79, 75, 77, 76, 72]
14:31:19 [DEBUG] layer: immediate algo: RandomTopo core order: [45, 40, 46, 42, 47, 43, 41, 44, 80, 82, 83, 84, 85, 86, 81, 87, 13, 10, 9, 15, 14, 12, 11, 8, 36, 38, 39, 32, 34, 35, 33, 37, 7, 3, 1, 0, 2, 5, 4, 6, 53, 52, 54, 48, 50, 49, 55, 51, 76, 77, 79, 78, 73, 74, 72, 75, 71, 66, 64, 67, 70, 69, 65, 68, 24, 26, 31, 25, 28, 30, 27, 29, 58, 56, 59, 61, 57, 62, 60, 63, 16, 19, 17, 23, 22, 20, 18, 21]
...
```

This is a machine with 1 NUMA/11 LLCs with 8 cores per LLC and you can see the
results are grouped by LLC but random within.
2024-10-17 15:36:00 +01:00
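The three RandomTopo rules above can be sketched as a nested shuffle. This is a simplified illustration, not the scx_layered implementation. The nested-dict topology layout, the function name, and the assumption that each LLC holds eight consecutive core IDs are hypothetical.

```python
import random


def random_topo_order(topology, seed=None):
    """topology maps node_id -> {llc_id: [core_ids]} (hypothetical layout)."""
    rng = random.Random(seed)
    order = []
    nodes = list(topology)
    rng.shuffle(nodes)  # randomise NUMA nodes
    for node in nodes:
        llcs = list(topology[node])
        rng.shuffle(llcs)  # randomise LLCs within the NUMA node
        for llc in llcs:
            cores = list(topology[node][llc])
            rng.shuffle(cores)  # randomise cores within the LLC
            order.extend(cores)
    return order


# 1 NUMA node, 11 LLCs, 8 cores per LLC, like the machine in the test plan.
topology = {0: {llc: list(range(llc * 8, llc * 8 + 8)) for llc in range(11)}}
order = random_topo_order(topology, seed=1)
```

Because every core of one LLC is emitted before moving to the next LLC, each run of eight entries in `order` belongs to a single LLC, which is the grouping visible in the debug output above.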
Daniel Hodges
b01ff79080
Merge pull request #805 from hodgesds/layered-refresh-cleanup
scx_layered: Refactor refresh cpumasks
2024-10-16 19:06:15 +00:00
Andrea Righi
2ea47af4bc
Merge pull request #804 from sched-ext/rustland-fixes
scx_rustland fixes and improvements
2024-10-16 18:26:03 +00:00
Tejun Heo
84d8abf913 Revert "Use per-arch vmlinux.h"
This reverts commit a23f3566e3.
2024-10-16 06:42:28 -10:00
Tejun Heo
bd79059f1a Revert "Add vmlinux.h for multiple arch"
This reverts commit 7067092555.
2024-10-16 06:42:18 -10:00
Dan Schatzberg
730052a0c4
Merge pull request #803 from dschatzberg/mitosis_fallback_dsq
scx_mitosis: Handle pinned tasks
2024-10-16 13:26:23 +00:00
Andrea Righi
763da6ab55 scx_rlfifo: operate in a more work-conserving way
Make scx_rlfifo even simpler and keep dispatching tasks even if the CPUs
are all busy.

This allows better stress testing of the scx_rustland_core backend, by
using both the per-CPU DSQs and the global shared DSQ.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-16 14:06:00 +02:00
Andrea Righi
b07de1d7d5 scx_rustland: clarify EDF scheduling
scx_rustland is now effectively a deadline-based scheduler and not a
pure vruntime-based scheduler.

Clarify this in the source code. No functional change.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-16 14:06:00 +02:00
Andrea Righi
c4b6408e92 scx_rustland: smooth vruntime update
Update vruntime by adding the used virtual time slice of each task as soon
as they are scheduled.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-16 14:06:00 +02:00
Andrea Righi
0b2de2c10c scx_rustland: use built-in nvcsw metrics
Use the nvcsw metric from the scx_rustland_core backend, instead of
retrieving this metric in user-space via procfs.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-16 14:06:00 +02:00
Andrea Righi
97629178e2 scx_rustland_core: bump up version to 2.2.2
Bump up the minor version to reflect the new backward-compatible
functionality added.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-16 14:06:00 +02:00
Daniel Hodges
907746745e scx_layered: Refactor refresh cpumasks
Refactor the logic for refresh cpumasks to be easy to read and verify.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-15 17:58:10 -07:00
Tejun Heo
4841df8138
Merge pull request #793 from minosfuture/vmlinux_per_arch
Use per-arch vmlinux.h
2024-10-15 19:52:42 +00:00
Dan Schatzberg
96ebe6b84a scx_mitosis: Handle pinned tasks
Pinned tasks should just be routed to a fallback DSQ. kthreads are given
a higher priority than non-kthreads so use two fallback DSQs.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-10-15 09:09:01 -07:00
Dan Schatzberg
902f41adf0
Merge pull request #799 from dschatzberg/mitosis_dispatch_no_wakeup
scx_mitosis: handle enqueue() on !wakeup
2024-10-15 13:46:07 +00:00
Daniel Hodges
71d63010af scx_layered: Refactor layer iteration
Remove DSQ iter algos.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-14 13:13:53 -07:00
Dan Schatzberg
a17f16e4b9 scx_mitosis: handle enqueue() on !wakeup
If we're not on the wakeup path, we may see enqueue() invoked without
select_cpu() which will require an idle cpu lookup. In order to fix
this, we refactor the idle_cpu lookup in select_cpu so it can be invoked
from enqueue().

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-10-14 10:13:07 -07:00
Daniel Hodges
912d6e01c1 scx_layered: Add LLC integration test
Add an integration test for testing that the `llcs` field on the layer
config works properly.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-14 07:27:29 -07:00
Daniel Hodges
ed18e43612
Merge pull request #795 from hodgesds/bpftrace-tests
scx_layered: Add topology integration test
2024-10-14 12:54:54 +00:00
Daniel Hodges
e456c83536 scx_layered: Add topology integration test
Add a bpftrace script that does a topology aware test. The test script
runs a bpftrace script that asserts that stress-ng processes are
scheduled on NUMA node 0 only.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-13 20:23:11 -07:00
Ming Yang
f7cdf08754 scx_mitosis: Fix static assertion of scx_bpf_task_cgroup failing __weak check
scx_bpf_task_cgroup failed the static assertion in the bpf_ksym_exists macro.

Signed-off-by: Ming Yang <minos.future@gmail.com>
2024-10-13 07:57:12 -07:00
Ming Yang
7067092555 Add vmlinux.h for multiple arch
Following the switch to per-arch vmlinux.h, add it for the
remaining archs.

Signed-off-by: Ming Yang <minos.future@gmail.com>
2024-10-13 07:57:12 -07:00
Ming Yang
a23f3566e3 Use per-arch vmlinux.h
vmlinux.h is not compatible across archs.

Handle this compatibility issue by:
* Adding arch info to the vmlinux.h real file name
* Linking vmlinux.h to the target-arch real file at build time
* Using the target-arch real file for scx_utils bindgen.

Also refactored clang related logic into a new clang_info mod, which is
shared by bpf_builder.rs and builder.rs.

Signed-off-by: Ming Yang <minos.future@gmail.com>
2024-10-13 07:57:12 -07:00
Changwoo Min
c1f4051a14 scx_lavd: fix int overflow in calculating avg_lat_cri
u32 is not big enough to hold the sum of lat_cri in a period,
so sum_lat_cri (u32) overflowed, resulting in an incorrect
avg_lat_cri. Change the type from u32 to u64, avoiding the
integer overflow. Note that {sum/avg}_lat_cri is only for
debugging, so it is irrelevant to scheduling decisions.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-13 00:58:36 +09:00
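The overflow described above can be reproduced with a small simulation. Python integers do not wrap, so fixed-width unsigned addition is emulated with a bit mask. The sample values and the `accumulate` helper are hypothetical and only chosen so that the sum exceeds 2^32.

```python
U32_MASK = 0xFFFF_FFFF
U64_MASK = 0xFFFF_FFFF_FFFF_FFFF


def accumulate(samples, mask):
    """Sum samples with wraparound, like a fixed-width unsigned counter."""
    total = 0
    for sample in samples:
        total = (total + sample) & mask
    return total


samples = [3_000_000] * 2_000  # hypothetical lat_cri values in one period

sum32 = accumulate(samples, U32_MASK)
sum64 = accumulate(samples, U64_MASK)
avg32 = sum32 // len(samples)
avg64 = sum64 // len(samples)
```

Here the 32-bit sum wraps once, so `avg32` ends up far below the true average, while the 64-bit accumulator yields the expected value.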
Changwoo Min
6c9bbe66dc scx_lavd: remove unnecessary downscaling in deadline calculation
Downscaling is not necessary when calculating a task's virtual
deadline because the virtual deadline represents only the relative
order in task scheduling. Hence downscaling only introduces
inaccuracy caused by truncation.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-13 00:41:23 +09:00
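A small sketch of why truncation hurts when deadlines are only used for ordering. The shift amount, the deadline values, and the task names are hypothetical and do not come from scx_lavd.

```python
SHIFT = 10  # hypothetical downscaling factor (divide by 1024)


def downscale(deadline_ns, shift=SHIFT):
    """Drop the low bits of a deadline, as a right shift would."""
    return deadline_ns >> shift


# Two hypothetical virtual deadlines that differ by 700 ns.
deadlines = {"task_b": 5_000_900, "task_a": 5_000_200}

# Ordering by the full-resolution deadline puts task_a ahead of task_b.
full_order = sorted(deadlines, key=lambda name: deadlines[name])

# After downscaling, both deadlines collapse to the same value, so the
# stable sort falls back to insertion order and the distinction is lost.
scaled_order = sorted(deadlines, key=lambda name: downscale(deadlines[name]))
```

Since only the relative order matters, keeping the full-resolution value preserves the distinction between the two tasks at no extra cost.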
Changwoo Min
6ddc3f0a2b scx_lavd: do not inspect scx_lavd process itself
Printing the task status of scx_lavd itself is not useful,
so filter it out.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-12 17:21:08 +09:00
Andrea Righi
197dee93f4 scx_bpfland: get rid of per-CPU DSQs
Using per-CPU DSQs seems to introduce more issues than benefits
(potential stalls, etc.). Therefore, let's get rid of the per-CPU DSQs
and use SCX_DSQ_LOCAL for tasks directly dispatched to specific CPUs.

This change seems to also improve performance on 6.12 and it makes the
scheduler a lot more stable and consistent.

The issues will be investigated separately, providing a separate stress
test scheduler, designed to stress test per-CPU DSQs.

Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Tested-by: Eric Naim <dnaim@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-12 08:15:51 +02:00
Andrea Righi
198f22656c scx_bpfland: clarify error code returned by pick_idle_cpu()
Return more meaningful error codes from pick_idle_cpu(). No functional
change, just improved code readability.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-12 08:08:48 +02:00
Andrea Righi
ceb4f1755f scx_bpfland: always refill task timeslice in ops.dispatch()
When a task exhausts its timeslice and no other tasks are ready to run,
we automatically refill its timeslice, but only if the current CPU is a
fully idle SMT core.

If we don’t handle the refill, the sched_ext core will default to
refilling using SCX_SLICE_DFL, which may not be optimal.

To ensure better control over the task’s timeslice, always refill it
when no other tasks are available to run.

Fixes: 6e24fcc ("scx_bpfland: keep tasks running on full-idle SMT cores")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-12 08:08:48 +02:00
Andrea Righi
54d704ceda scx_bpfland: pick a random idle CPU when prev_cpu is not valid
Pick any random idle CPU when the previous CPU isn't valid anymore
according to the task's cpumask.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-12 08:08:48 +02:00
Changwoo Min
836cf9faa4
Merge pull request #779 from multics69/lavd-futex-v2
scx_lavd: mitigate the lock holder preemption problem
2024-10-12 02:42:33 +00:00