Commit Graph

1843 Commits

Author SHA1 Message Date
I Hsin Cheng
61cb3f7fc5 scx_common_bpf: Append cast_mask()
Remove the copies of cast_mask() scattered across different schedulers
and add a single definition to common.bpf.h so every scheduler can use
it when needed.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-09-24 16:01:19 +08:00
Andrea Righi
0a57b93846 scx_rustland_core: prevent mm stall
Bypass user-space scheduling for tasks currently handling a page fault,
preventing potential deadlock conditions involving VMA lock / mmap_lock
during user-space scheduling.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-24 08:46:14 +02:00
Andrea Righi
34bc6a2b64 Revert "scx_rustland_core: dispatch all kthreads directly from BPF"
This reverts commit 809d39aa7f.

Dispatching all kthreads directly doesn't really help much in preventing
stalls with the stress-ng fork stressor, so revert this commit. A better
workaround will be provided in the next commit.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-24 08:24:01 +02:00
Changwoo Min
b1bc4033b4
Merge pull request #673 from multics69/lavd-prop-lat-cri
scx_lavd: propagate waker's latency criticality to its wakee
2024-09-24 07:34:07 +09:00
Mitchell Augustin
ab1c737e9e
Merge branch 'sched-ext:main' into scx_loader_automatic
2024-09-23 17:19:13 -05:00
Mitchell Augustin
d434ab4266 scx_loader: Add initial automatic scheduler switching via --monitor-no-dbus
Exposes an option --monitor-no-dbus in scx_loader that will monitor CPU
utilization and start scx_lavd when any CPU exceeds 90% for more than 5
seconds. scx_lavd will be terminated if all CPUs are below 90% for
more than 30 seconds. When this flag is specified, scx_loader's
dbus functionality is not utilized.
2024-09-23 17:07:43 -05:00
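The start/stop rule described in this commit can be modelled as a small state machine. The sketch below is illustrative only and is not the scx_loader source: the `Monitor` type, its fields, and the one-sample-per-second cadence are assumptions, and a reading of exactly 90% is treated as "not above" the threshold.

```rust
/// Hypothetical model of the switching rule; not the scx_loader source.
#[derive(Debug, Default)]
pub struct Monitor {
    /// Whether scx_lavd is considered running.
    pub running: bool,
    /// Consecutive samples with at least one CPU above the threshold.
    hot_secs: u32,
    /// Consecutive samples with no CPU above the threshold.
    cool_secs: u32,
}

const THRESHOLD: f64 = 90.0; // percent, from the commit message
const START_AFTER_SECS: u32 = 5; // "more than 5 seconds"
const STOP_AFTER_SECS: u32 = 30; // "more than 30 seconds"

impl Monitor {
    /// Feed one sample per second (assumed cadence).
    /// Returns true if the scheduler state changed on this sample.
    pub fn tick(&mut self, per_cpu_util: &[f64]) -> bool {
        let any_hot = per_cpu_util.iter().any(|&u| u > THRESHOLD);
        if any_hot {
            self.hot_secs = self.hot_secs.saturating_add(1);
            self.cool_secs = 0;
        } else {
            self.cool_secs = self.cool_secs.saturating_add(1);
            self.hot_secs = 0;
        }
        if !self.running && self.hot_secs > START_AFTER_SECS {
            self.running = true; // would start scx_lavd here
            return true;
        }
        if self.running && self.cool_secs > STOP_AFTER_SECS {
            self.running = false; // would terminate scx_lavd here
            return true;
        }
        false
    }
}
```

The asymmetric windows (5 seconds to start, 30 seconds to stop) act as hysteresis, so a brief dip in load does not immediately tear the scheduler down.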
Daniel Hodges
29fb647c93 scx_layered: Refactor idle core selection
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-23 12:01:42 -07:00
Daniel Hodges
380fd1f3b3 scx_layered: Make idle select topology aware
Make idle CPU selection topology aware.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-23 10:10:43 -07:00
Daniel Hodges
1b5d23dfe1
Merge pull request #675 from hodgesds/layered-dsq-dump-cleanup
scx_layered: Cleanup dump format
2024-09-23 13:09:21 -04:00
Daniel Hodges
35477970bd scx_layered: Cleanup dump format
Cleanup the dump format for topology aware dumps in scx_layered.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-23 10:02:49 -07:00
Daniel Hodges
8b14e48994
Merge pull request #671 from hodgesds/layered-last-waker
scx_layered: Add waker stats per layer
2024-09-23 10:58:54 -04:00
Changwoo Min
71fa92cf1c scx_lavd: propagate waker's latency criticality to its wakee
If a waker is more latency critical than its wakee, the wakee inherits
the waker's latency criticality. This lets the wakee take into account
the context of the task that woke it up. For now, we limit such
inheritance to one hop and one schedule.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-23 12:56:16 +09:00
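A rough user-space model of the "one hop, one schedule" inheritance described above. It is a sketch under assumptions, not scx_lavd's BPF code: `TaskCtx`, its field names, and the hook names are hypothetical.

```rust
/// Hypothetical per-task state; names are illustrative only.
#[derive(Debug, Clone, Copy)]
pub struct TaskCtx {
    /// The task's own latency criticality.
    pub lat_cri: u32,
    /// Criticality borrowed from the last waker (0 = nothing inherited).
    pub lat_cri_waker: u32,
}

impl TaskCtx {
    pub fn new(lat_cri: u32) -> Self {
        Self { lat_cri, lat_cri_waker: 0 }
    }

    /// On wakeup, record the waker's *own* criticality. Copying
    /// `waker.lat_cri` (not the waker's inherited value) limits
    /// propagation to one hop.
    pub fn on_wakeup(&mut self, waker: &TaskCtx) {
        self.lat_cri_waker = waker.lat_cri;
    }

    /// Criticality used for the next scheduling decision: the wakee only
    /// benefits when the waker is more latency critical than itself.
    pub fn effective_lat_cri(&self) -> u32 {
        self.lat_cri.max(self.lat_cri_waker)
    }

    /// After the task has been scheduled once, drop the inherited value,
    /// which limits the boost to one schedule.
    pub fn on_scheduled(&mut self) {
        self.lat_cri_waker = 0;
    }
}
```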
Changwoo Min
ad8536b4a4
Merge pull request #670 from multics69/lavd-opt-preemption
scx_lavd: find a victim cpu for preemption within task's compute domain
2024-09-23 10:22:08 +09:00
Daniel Hodges
91d32663bd
scx_layered: Refactor waker tracking to only use last waker
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-22 18:05:54 -04:00
Daniel Hodges
1a2f82b91c
Merge pull request #666 from hodgesds/layered-local-llc
scx_layered: Add topology aware preemption
2024-09-22 17:36:32 -04:00
Daniel Hodges
326f3b7988
Merge pull request #667 from hodgesds/layered-pcore-grow
scx_layered: Add Big/Little core growth algos
2024-09-22 16:59:42 -04:00
Daniel Hodges
1ac9712d2e
scx_layered: Refactor preemption into a separate function
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-22 16:54:11 -04:00
Daniel Hodges
bc34bd867b
scx_layered: Add option to enable XNUMA preemption
Disable XNUMA preemption by default and add an option to enable it.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-22 16:52:57 -04:00
Daniel Hodges
55b185313a
Remove unneeded Cargo lock file
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-22 16:52:57 -04:00
Daniel Hodges
e105d9f8b1
scx_layered: Use cast_mask helper
Use the cast_mask helper to clean up some of the bpf cpumask conversion
code for preemption.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-22 16:52:57 -04:00
Daniel Hodges
5d9d32b65c
scx_layered: Add stats for XLLC/XNUMA preemptions
Add stats for XLLC/XNUMA preemptions.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-22 16:52:57 -04:00
Daniel Hodges
c15ecbb3a4
scx_layered: Add topology aware preemption
Add topology aware preemption that begins in the local LLC and attempts
to preempt from cpus nearest in the topology.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-22 16:52:56 -04:00
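The "local LLC first, then nearest in the topology" search can be sketched as a nearest-first selection. This is an illustration and not scx_layered's implementation: the `Cpu` struct, the three-level distance ranking, and the `preemptible` callback are assumptions, and LLC ids are assumed to be unique across NUMA nodes.

```rust
/// Hypothetical CPU description; not scx_layered's data structures.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Cpu {
    pub id: u32,
    pub llc: u32,
    pub node: u32,
}

/// Distance rank relative to `local`:
/// 0 = same LLC, 1 = same NUMA node, 2 = remote node.
fn distance(local: &Cpu, other: &Cpu) -> u8 {
    if other.llc == local.llc {
        0
    } else if other.node == local.node {
        1
    } else {
        2
    }
}

/// Return the nearest CPU for which `preemptible` holds, if any.
/// Ties within the same distance are broken by the lowest CPU id.
pub fn pick_victim(
    local: &Cpu,
    cpus: &[Cpu],
    preemptible: impl Fn(&Cpu) -> bool,
) -> Option<Cpu> {
    cpus.iter()
        .copied()
        .filter(|c| preemptible(c))
        .min_by_key(|c| (distance(local, c), c.id))
}
```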
Daniel Hodges
6fb2f0b2b4
scx_layered: Clean up waker code
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-22 06:43:10 -04:00
Daniel Hodges
c55b2c6e69
scx_layered: Add waker stats per layer
Update the task context to keep a mask of wakers and add stats for wakes
across layers.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-22 06:43:03 -04:00
Daniel Hodges
140a101874
Merge pull request #449 from hodgesds/layered-dsq-fixes
scx_layered: Add a hi fallback dsq per llc
2024-09-22 06:39:46 -04:00
Changwoo Min
13a68465bf common: add bpf_cpumask_weight() to common.bpf.h
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-22 12:47:38 +09:00
Changwoo Min
7321a89724 scx_lavd: find a victim cpu for preemption within task's compute domain
Previously, we searched for a victim among all CPUs, including remote
or incompatible CPUs. Now we limit the victim search to the task's
compute domain.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-22 12:47:18 +09:00
Changwoo Min
a13082c2b8
Merge pull request #669 from multics69/lavd-opt-select-cpu
scx_lavd: consider waker's CPU when ops.select_cpu()
2024-09-22 09:16:06 +09:00
Andrea Righi
897977bbc1
Merge pull request #663 from vax-r/bpfland_fix
scx_bpfland: Remove the usage of cast_mask in bpfland_enqueue
2024-09-21 22:15:11 +02:00
Changwoo Min
8d8d8f9f61 scx_lavd: consider waker's CPU when ops.select_cpu()
In case of a sync wake-up, also consider the waker's CPU to improve
cache locality.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-22 01:57:49 +09:00
Daniel Hodges
4aa841de0a
scx_layered: Rename HI_FALLBACK_DSQ to HI_FALLBACK_DSQ_BASE
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-20 17:28:38 -04:00
Daniel Hodges
a3d1344293
scx_layered: Add core growth algo for core type
Add core growth algos for Big/Little core support. The algos allow
layers to grow by preferring either big or little cores first.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-20 11:50:15 -04:00
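One way to picture a "prefer big (or little) cores first" growth order is a stable sort keyed on whether a core matches the preferred type. The sketch below is hypothetical: `CoreType`, `Core`, and `growth_order` are stand-ins and do not reflect the actual scx_layered growth code.

```rust
/// Hypothetical stand-ins for the topology types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CoreType {
    Big,
    Little,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Core {
    pub id: usize,
    pub core_type: CoreType,
}

/// Order in which a layer would claim cores, preferred type first.
pub fn growth_order(cores: &[Core], prefer: CoreType) -> Vec<usize> {
    let mut order: Vec<Core> = cores.to_vec();
    // `false` sorts before `true`, so cores of the preferred type come
    // first. `sort_by_key` is stable, so the original relative order is
    // kept within each type.
    order.sort_by_key(|c| c.core_type != prefer);
    order.into_iter().map(|c| c.id).collect()
}
```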
Daniel Hodges
a9f3190b5f
scx_utils: Add extra ordering macros for topology
Add extra ordering macros for Core/CPU structs for ease of use with
Rust standard library features. This issue was hit when trying to sort
cores based on the CoreType. See this similar issue for details:
https://github.com/rust-lang/rust/issues/113550

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-20 11:41:23 -04:00
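As a general illustration of why the ordering derives matter: `Vec::sort` and ordered collections such as `BTreeSet` require `Ord` on the element type. The types below are hypothetical stand-ins, not the scx_utils definitions, and the exact set of derives added by the commit is not shown here.

```rust
/// Hypothetical stand-ins; the real scx_utils structs carry more fields.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub enum CoreType {
    Big,
    Little,
}

#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct Core {
    // Derived ordering compares fields in declaration order, so cores
    // are ordered by type first and by id second.
    pub core_type: CoreType,
    pub id: usize,
}

/// Sorting compiles only because `Core` implements `Ord`.
pub fn sorted(mut cores: Vec<Core>) -> Vec<Core> {
    cores.sort();
    cores
}
```

With derived `Ord`, enum variants compare in declaration order (`Big` before `Little` in this sketch), which makes sorting by core type a one-liner.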
Daniel Hodges
a3cc4c223f
Merge pull request #664 from vax-r/layered_fix_cpumask
scx_layered: Refactor match_layer() and implement helper function to access cpumask within bpf_cpumask
2024-09-20 15:20:35 +02:00
I Hsin Cheng
7799b94f07 scx_layered: Add helper function to access cpumask within bpf_cpumask
Before passing "nodec->cpumask" and "cachec->cpumask" into
"bpf_cpumask_test_cpu()", type conversion should be done first.
Implement "cast_mask()" to convert "struct bpf_cpumask *" into "const
struct cpumask *".

Reference from
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/cpumask_common.h#n63

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-09-20 20:52:03 +08:00
I Hsin Cheng
5596d5e3fe scx_bpfland: Remove the usage of cast_mask in bpfland_enqueue
The usage of cast_mask() within bpfland_enqueue aims to cast the type of
"p->cpus_ptr" from "struct bpf_cpumask *" to "const struct cpumask *".
However, the type of "p->cpus_ptr" is already "const cpumask_t *" aka
"const struct cpumask *", so no conversion is needed.

Passing a value of type "struct cpumask *" where "struct bpf_cpumask *"
is expected also leads to a compile error.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-09-20 20:45:09 +08:00
Daniel Hodges
8532ba3f1e
scx_layered: Fix hi fallback dsq consumption
Fix hi fallback dsq consumption to only consume from the cache local hi
fallback dsq.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-20 04:18:05 -04:00
Andrea Righi
401c9392ed
Merge pull request #665 from vax-r/rustland_core_fix
scx_rustland_core: Access the returned value of saturating_sub()
2024-09-20 07:38:43 +02:00
I Hsin Cheng
9f64db7cbc scx_rustland_core: Access the returned value of saturating_sub()
Use an "_" variable to access the returned value of "saturating_sub()"
to mute the compilation warnings.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-09-19 23:01:17 +08:00
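For context, Rust's integer `saturating_sub()` returns the result instead of modifying the receiver, and the compiler warns when that result is ignored. The snippet below is a generic illustration with a hypothetical `demo` function, not the scx_rustland_core code; it assumes the muted warning was the usual unused-result lint.

```rust
/// Returns (value after the discarded call, value after assigning back).
pub fn demo(mut nr: u64) -> (u64, u64) {
    // Binding the result to `_` silences the unused-result warning,
    // but `nr` itself is NOT modified by this call.
    let _ = nr.saturating_sub(1);
    let unchanged = nr;

    // To actually decrement (clamping at zero), assign the result back.
    nr = nr.saturating_sub(1);
    (unchanged, nr)
}
```

Calling `demo(0)` shows the clamping: the subtraction stops at zero rather than wrapping around.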
I Hsin Cheng
e4bb99efc5 scx_layered: Refactor match_layer()
Refactor match_layer() to prevent the compile error caused by using the
variable "nr_match_ors" before it is initialized.

Move the check of "nr_match_ors" after it reads the value from
"layer->nr_match_ors" to make sure it is initialized.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-09-19 22:20:03 +08:00
Andrea Righi
488f209c28
Merge pull request #662 from sched-ext/rustland-prevent-ci-failures
scx_rustland_core: prevent CI failures
2024-09-19 14:37:20 +02:00
Andrea Righi
809d39aa7f scx_rustland_core: dispatch all kthreads directly from BPF
Dispatching kthreads via user-space can still lead to deadlocks in
certain cases (for example we can still trigger stalls by running the
fork stressor via stress-ng).

To prevent such stalls, simply dispatch kthreads directly from BPF for
now.

In the future we may consider providing an API to restrict the
selection of tasks directly dispatched (for example passing a mask of
PF_* flags to "whitelist" the tasks that are allowed to bypass the
user-space scheduler).

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-19 09:12:13 +02:00
Andrea Righi
e78ee41a2e scx_rustland_core: prevent nr_queued underflow
Updating nr_queued in a non-atomic way when a queued task is consumed can
lead to underflows. We don't really care about being 100% accurate here,
since nr_queued should be considered more of a statistic than an
accurate value.

Therefore, just accept the fact that nr_queued can be inaccurate and
handle potential underflows.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-19 09:09:24 +02:00
Andrea Righi
3f8db5783b
Merge pull request #658 from sched-ext/rustland-core-improve-cpu-selection
scx_rustland_core: improve idle CPU selection API and logic
2024-09-17 22:38:15 +02:00
Andrea Righi
86db45f855 scx_rustland_core: prevent deadlock with per-CPU DSQs and CPU affinity
If a task that is executing sched_setaffinity() is dispatched on a
per-CPU DSQ it may stall the DSQ completely, since the task won't be
able to be consumed from the corresponding CPU.

This can be easily triggered running the following stress test:

  $ stress-ng --aggressive -c (nproc) -f (nproc)

From the stall trace we can see something like the following:

  R stress-ng[2648662] -6880ms
      scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x5 dsq_vtime=0
      cpus=ff

    __set_cpus_allowed_ptr+0x1c8/0x260
    __sched_setaffinity+0x105/0x1c0
    sched_setaffinity+0x1ed/0x2d0
    __x64_sys_sched_setaffinity+0xa5/0x100
    do_syscall_64+0x82/0x190
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

This should probably be addressed in the core sched_ext, but for now
prevent this deadlock by tracking when a task is executing
sched_setaffinity() and automatically bounce those tasks to the shared
DSQ (that can be consumed from any CPU).

This should solve all the recent CI failures with the scx_rustland_core
schedulers.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-17 07:42:37 +02:00
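The workaround amounts to a DSQ selection rule: tasks flagged as being inside sched_setaffinity() go to the shared DSQ, everything else stays on its per-CPU DSQ. The sketch below is a user-space model under assumptions and not the BPF code: the `Task` fields, the `SHARED_DSQ` value, and the mapping of per-CPU DSQ ids to CPU numbers are all hypothetical.

```rust
/// Hypothetical id for the shared DSQ (consumable from any CPU).
pub const SHARED_DSQ: u64 = u64::MAX;

/// Hypothetical per-task state tracked by the scheduler.
pub struct Task {
    /// CPU chosen for the task; also used as the per-CPU DSQ id here.
    pub target_cpu: u32,
    /// Set while the task is executing sched_setaffinity().
    pub in_setaffinity: bool,
}

/// Pick the DSQ a task should be dispatched to.
pub fn pick_dsq(task: &Task) -> u64 {
    if task.in_setaffinity {
        // Bounce to the shared DSQ so any CPU can consume the task and
        // no per-CPU DSQ gets stalled behind it.
        SHARED_DSQ
    } else {
        u64::from(task.target_cpu)
    }
}
```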
Andrea Righi
e6b624a97c scx_rustland_core: improve idle CPU selection API and logic
Pass enqueue flags to user-space: flags will be passed via
QueuedTask.flags and can be forwarded back to BPF via
DispatchedTask.flags.

These flags can also be passed to BpfScheduler.select_cpu() to apply a
more refined CPU selection policy.

Moreover, avoid prioritizing the user-space scheduler too much and
dispatch it only if there are no other tasks that need to be dispatched
in ops.dispatch().

This improves CPU utilization and enhances the fairness, robustness, and
resilience of schedulers based on scx_rustland_core, particularly under
stress test conditions.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-16 22:12:38 +02:00
Jake Hillion
23acd6ebe9 scxstats_to_openmetrics: fix format string
On Python versions that perform validation of this line it fails because
of a square bracket mismatch. This is due to the single quotes being
parsed first. Fix by changing the outer string to double quotes.
2024-09-16 18:16:28 +01:00
Daniel Hodges
4f98de333d
Merge pull request #652 from JakeHillion/layer-growth-rr
scx_layered: add round robin growth strategy
2024-09-16 17:34:48 +02:00
Andrea Righi
8656157ee4
Merge pull request #655 from sched-ext/bpfland-refine-wake-sync
scx_bpfland: refine idle CPU selection logic
2024-09-15 15:51:51 +02:00
Andrea Righi
00eebaf905 scx_bpfland: refine task wakeup logic
On WAKE_SYNC, attempt to migrate the wakee to the same CPU as the waker
if the waker is not exiting, the wakee can use the waker's CPU, the
waker's L3 domain is not saturated, and there are no other tasks queued
to the local DSQ of the waker's CPU.

This is the same logic used in scx_rusty.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-15 14:50:14 +02:00
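The four conditions listed in this commit can be read as a single predicate. The sketch below is a plain Rust model for illustration, not the scx_bpfland BPF code: `WakeCtx` and its field names are hypothetical.

```rust
/// Hypothetical snapshot of the state checked at wakeup time.
pub struct WakeCtx {
    /// The waker is in the process of exiting.
    pub waker_exiting: bool,
    /// The wakee's CPU affinity allows the waker's CPU.
    pub wakee_allowed_on_waker_cpu: bool,
    /// The waker's L3 domain has no spare capacity.
    pub waker_l3_saturated: bool,
    /// Number of tasks queued on the local DSQ of the waker's CPU.
    pub waker_local_dsq_len: u32,
}

/// Decide whether the wakee should be migrated to the waker's CPU.
pub fn migrate_to_waker_cpu(wake_sync: bool, c: &WakeCtx) -> bool {
    wake_sync
        && !c.waker_exiting
        && c.wakee_allowed_on_waker_cpu
        && !c.waker_l3_saturated
        && c.waker_local_dsq_len == 0
}
```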