1709 Commits

Author SHA1 Message Date
Changwoo Min
8d8d8f9f61 scx_lavd: consider waker's CPU when ops.select_cpu()
In case of sync wake-up, consider waker's CPU also to improve cache
locality.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-22 01:57:49 +09:00
Daniel Hodges
a3cc4c223f
Merge pull request #664 from vax-r/layered_fix_cpumask
scx_layered: Refactor match_layer() and implement helper function to access cpumask within bpf_cpumask
2024-09-20 15:20:35 +02:00
I Hsin Cheng
7799b94f07 scx_layered: Add helper function to access cpumask within bpf_cpumask
Before passing "nodec->cpumask" and "cachec->cpumask" into
"bpf_cpumask_test_cpu()", type conversion should be done first.
Implement "cast_mask()" to convert "struct bpf_cpumask *" into "const
struct cpumask *".

Reference from
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/cpumask_common.h#n63

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-09-20 20:52:03 +08:00
Andrea Righi
401c9392ed
Merge pull request #665 from vax-r/rustland_core_fix
scx_rustland_core: Access the returned value of saturating_sub()
2024-09-20 07:38:43 +02:00
I Hsin Cheng
9f64db7cbc scx_rustland_core: Access the returned value of saturating_sub()
Use an "_" variable to access the returned value of "saturating_sub()"
to silence the compilation warnings.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-09-19 23:01:17 +08:00
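
A minimal sketch of the pattern described in commit 9f64db7cbc above, not the actual scx_rustland_core code: Rust's integer `saturating_sub()` is marked `#[must_use]`, so calling it and dropping the result produces a compiler warning, while binding the result to `_` discards it explicitly. The helper `remaining()` and the variable names are hypothetical.

```rust
/// Hypothetical helper: subtract `used` from `budget`, clamping at zero
/// instead of wrapping around.
fn remaining(budget: u64, used: u64) -> u64 {
    budget.saturating_sub(used)
}

fn main() {
    let budget: u64 = 5;

    // Explicitly discard the result. Without `let _ =`, rustc warns that
    // the return value of a `#[must_use]` function is unused.
    let _ = budget.saturating_sub(10);

    // Normal use: keep the clamped result.
    println!("{}", remaining(budget, 10));
    println!("{}", remaining(budget, 2));
}
```

Note that `saturating_sub()` does not modify its receiver, so discarding the result means the subtraction has no effect; the `_` binding only silences the warning.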
I Hsin Cheng
e4bb99efc5 scx_layered: Refactor match_layer()
Refactor match_layer() to prevent the compile error caused by using the
variable "nr_match_ors" before it is initialized.

Move the check of "nr_match_ors" after it is assigned the value of
"layer->nr_match_ors" to make sure it is initialized.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-09-19 22:20:03 +08:00
Andrea Righi
488f209c28
Merge pull request #662 from sched-ext/rustland-prevent-ci-failures
scx_rustland_core: prevent CI failures
2024-09-19 14:37:20 +02:00
Andrea Righi
809d39aa7f scx_rustland_core: dispatch all kthreads directly from BPF
Dispatching kthreads via user-space can still lead to deadlocks in
certain cases (for example we can still trigger stalls by running the
fork stressor via stress-ng).

To prevent such stalls, simply dispatch kthreads directly from BPF for
now.

In the future we may consider providing an API to restrict the
selection of tasks directly dispatched (for example passing a mask of PF_*
flags to "whitelist" the tasks that are allowed to bypass the user-space
scheduler).

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-19 09:12:13 +02:00
Andrea Righi
e78ee41a2e scx_rustland_core: prevent nr_queued underflow
Updating nr_queued in a non-atomic way when a queued task is consumed can
lead to underflows. We don't really care about being 100% accurate here,
since nr_queued should be considered more of a statistic than an
accurate value.

Therefore, just accept the fact that nr_queued can be inaccurate and
handle potential underflows.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-19 09:09:24 +02:00
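
One way to picture the approach in commit e78ee41a2e above is a counter that tolerates lost updates but never wraps below zero. This is a sketch under assumptions: the `Stats` struct and `task_consumed()` method are hypothetical names, and the real counter lives in the scheduler's BPF side rather than in a Rust struct.

```rust
/// Hypothetical stand-in for the scheduler's statistics.
struct Stats {
    /// Approximate number of queued tasks; treated as a statistic,
    /// not as an exact value.
    nr_queued: u64,
}

impl Stats {
    /// Account for a consumed task. If concurrent, non-atomic updates left
    /// the counter at zero already, stay at zero instead of wrapping to
    /// u64::MAX.
    fn task_consumed(&mut self) {
        if self.nr_queued > 0 {
            self.nr_queued -= 1;
        }
    }
}

fn main() {
    let mut stats = Stats { nr_queued: 1 };
    stats.task_consumed();
    stats.task_consumed(); // would underflow without the guard
    println!("nr_queued = {}", stats.nr_queued);
}
```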
Andrea Righi
3f8db5783b
Merge pull request #658 from sched-ext/rustland-core-improve-cpu-selection
scx_rustland_core: improve idle CPU selection API and logic
2024-09-17 22:38:15 +02:00
Andrea Righi
86db45f855 scx_rustland_core: prevent deadlock with per-CPU DSQs and CPU affinity
If a task that is executing sched_setaffinity() is dispatched on a
per-CPU DSQ it may stall the DSQ completely, since the task won't be
able to be consumed from the corresponding CPU.

This can be easily triggered by running the following stress test:

  $ stress-ng --aggressive -c (nproc) -f (nproc)

From the stall trace we can see something like the following:

  R stress-ng[2648662] -6880ms
      scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x5 dsq_vtime=0
      cpus=ff

    __set_cpus_allowed_ptr+0x1c8/0x260
    __sched_setaffinity+0x105/0x1c0
    sched_setaffinity+0x1ed/0x2d0
    __x64_sys_sched_setaffinity+0xa5/0x100
    do_syscall_64+0x82/0x190
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

This should probably be addressed in the core sched_ext, but for now
prevent this deadlock by tracking when a task is executing
sched_setaffinity() and automatically bounce those tasks to the shared
DSQ (that can be consumed from any CPU).

This should solve all the recent CI failures with the scx_rustland_core
schedulers.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-17 07:42:37 +02:00
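
The workaround in commit 86db45f855 above can be sketched as a DSQ selection rule: tasks flagged as executing sched_setaffinity() go to the shared DSQ, everything else goes to the per-CPU DSQ of its target CPU. The types `Dsq` and `Task`, the field `in_setaffinity`, and the function `pick_dsq()` are hypothetical and only model the idea; how the real code tracks the sched_setaffinity() state is not shown in the commit message.

```rust
/// Hypothetical model of the two kinds of dispatch queues.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Dsq {
    /// Can only be consumed by the given CPU.
    PerCpu(u32),
    /// Can be consumed from any CPU.
    Shared,
}

/// Hypothetical per-task state.
struct Task {
    target_cpu: u32,
    /// Set while the task is executing sched_setaffinity().
    in_setaffinity: bool,
}

/// Bounce tasks that are changing their affinity to the shared DSQ, so a
/// per-CPU DSQ never holds a task that its CPU cannot consume.
fn pick_dsq(task: &Task) -> Dsq {
    if task.in_setaffinity {
        Dsq::Shared
    } else {
        Dsq::PerCpu(task.target_cpu)
    }
}

fn main() {
    let normal = Task { target_cpu: 3, in_setaffinity: false };
    let changing = Task { target_cpu: 3, in_setaffinity: true };
    println!("{:?}", pick_dsq(&normal));
    println!("{:?}", pick_dsq(&changing));
}
```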
Andrea Righi
e6b624a97c scx_rustland_core: improve idle CPU selection API and logic
Pass enqueue flags to user-space: flags will be passed via
QueuedTask.flags and can be forwarded back to BPF via
DispatchedTask.flags.

These flags can also be passed to BpfScheduler.select_cpu() to apply a
more refined CPU selection policy.

Moreover, avoid prioritizing the user-space scheduler too much and
dispatch it only if there are no other tasks that need to be dispatched
in ops.dispatch().

This improves CPU utilization and enhances the fairness, robustness, and
resilience of schedulers based on scx_rustland_core, particularly under
stress test conditions.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-16 22:12:38 +02:00
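
The dispatch ordering described in commit e6b624a97c above (run the user-space scheduler only when nothing else is waiting to be dispatched) can be modeled roughly as follows. This is an illustrative sketch, not the ops.dispatch() implementation: `Next`, `dispatch_next()`, and the `usersched_needed` flag are hypothetical names, and tasks are represented by plain PIDs.

```rust
use std::collections::VecDeque;

/// Hypothetical outcome of one dispatch step.
#[derive(Debug, PartialEq, Eq)]
enum Next {
    /// Run a regular pending task, identified by PID.
    Task(u32),
    /// Run the user-space scheduler.
    UserSched,
    /// Nothing to do.
    Idle,
}

/// Pending tasks always win; the user-space scheduler is dispatched only
/// when the pending queue is empty and it actually has work to do.
fn dispatch_next(pending: &mut VecDeque<u32>, usersched_needed: bool) -> Next {
    if let Some(pid) = pending.pop_front() {
        return Next::Task(pid);
    }
    if usersched_needed {
        Next::UserSched
    } else {
        Next::Idle
    }
}

fn main() {
    let mut pending: VecDeque<u32> = VecDeque::from(vec![101, 102]);
    for _ in 0..3 {
        println!("{:?}", dispatch_next(&mut pending, true));
    }
}
```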
Jake Hillion
23acd6ebe9 scxstats_to_openmetrics: fix format string
On Python versions that perform validation of this line it fails because
of a square bracket mismatch. This is due to the single quotes being
parsed first. Fix by changing the outer string to double quotes.
2024-09-16 18:16:28 +01:00
Daniel Hodges
4f98de333d
Merge pull request #652 from JakeHillion/layer-growth-rr
scx_layered: add round robin growth strategy
2024-09-16 17:34:48 +02:00
Andrea Righi
8656157ee4
Merge pull request #655 from sched-ext/bpfland-refine-wake-sync
scx_bpfland: refine idle CPU selection logic
2024-09-15 15:51:51 +02:00
Andrea Righi
00eebaf905 scx_bpfland: refine task wakeup logic
On WAKE_SYNC, attempt to migrate the wakee to the same CPU as the waker
if the waker is not exiting, the wakee can use the waker's CPU, the
waker's L3 domain is not saturated, and there are no other tasks queued
to the local DSQ of the waker's CPU.

This is the same logic used in scx_rusty.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-15 14:50:14 +02:00
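
The four conditions listed in commit 00eebaf905 above can be written as a single predicate. The sketch below is a plain Rust model rather than the scx_bpfland BPF code; the `Waker` struct, its fields, and `sync_wake_cpu()` are hypothetical names chosen for illustration.

```rust
/// Hypothetical snapshot of the waker's state at wake-up time.
struct Waker {
    cpu: u32,
    exiting: bool,
    /// True when the waker's L3 domain has no idle capacity left.
    l3_saturated: bool,
    /// Number of tasks queued on the local DSQ of the waker's CPU.
    local_dsq_len: usize,
}

/// Return the waker's CPU if a sync wake-up should place the wakee there,
/// or None to fall back to the regular idle CPU selection.
fn sync_wake_cpu(waker: &Waker, wakee_allowed_cpus: &[u32]) -> Option<u32> {
    let ok = !waker.exiting
        && wakee_allowed_cpus.contains(&waker.cpu)
        && !waker.l3_saturated
        && waker.local_dsq_len == 0;
    if ok {
        Some(waker.cpu)
    } else {
        None
    }
}

fn main() {
    let waker = Waker { cpu: 2, exiting: false, l3_saturated: false, local_dsq_len: 0 };
    println!("{:?}", sync_wake_cpu(&waker, &[0, 1, 2, 3]));
    println!("{:?}", sync_wake_cpu(&waker, &[0, 1]));
}
```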
Andrea Righi
079a53c689 scx_bpfland: get rid of preferred domain
Using the turbo boosted CPUs as the preferred scheduling domain seems to be
beneficial only in a very few corner cases, for example on battery-powered
devices with an aggressive cpufreq governor that constantly tries to
scale down the frequency (and even in this case it's probably better to
not force the tasks to run on the fast CPUs, to save power).

In practice the preferred domain seems to introduce more overhead than
benefits overall, so let's get rid of it.

This can be improved in the future by adding multiple user-configurable
scheduling domains.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-15 14:50:14 +02:00
Changwoo Min
4fb2b09a6e
Merge pull request #654 from multics69/main
scx_lavd: boost the latency criticality of kernel threads
2024-09-14 10:44:31 +09:00
Changwoo Min
95e2f4dabe scx_lavd: boost the latency criticality of kernel threads
Many kernel threads perform latency-critical tasks (e.g., net, gpu). In
particular, the AMD GPU driver runs for the most part in kernel space using
kworkers. Hence, treat kernel threads as if they were woken-up tasks.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-14 00:41:02 +09:00
Changwoo Min
10f0378e9d
Merge pull request #653 from multics69/lavd-opt
scx_lavd: add a short circuit for the case of no turbo core
2024-09-14 00:34:26 +09:00
Changwoo Min
4b4f42fce1 scx_lavd: add a short circuit for the case of no turbo core
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-13 16:02:07 +09:00
Jake Hillion
3848d87895 scx_layered: add round robin growth strategy 2024-09-12 23:27:21 +01:00
Daniel Hodges
632fcfe4ae
Merge pull request #648 from hodgesds/layered-llc-stats
scx_layered: Add stats for XNUMA/XLLC migrations
2024-09-12 13:23:23 -04:00
Daniel Hodges
ec7f75619a
Merge pull request #649 from hodgesds/layered-topo-grow
scx_layered: Add topology aware core growth selection
2024-09-12 13:20:20 -04:00
Daniel Hodges
dde6e0c7f9 scx_utils: Add node/llc id to core topology
Add ids for node/llc in the Core topology struct.
2024-09-12 10:05:02 -07:00
Daniel Hodges
aee19dd9a1 scx_layered: Add topology aware core growth selection
Add topology aware core growth selection.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-12 06:48:51 -07:00
Daniel Hodges
be1020f517
Merge pull request #650 from frelon/update-tumbleweed-docs
update Tumbleweed installation notes
2024-09-12 09:05:13 -04:00
Daniel Hodges
e9a7d5ce16
Merge pull request #651 from hodgesds/layered-random
scx_layered: Add random layer growth algo
2024-09-12 08:56:45 -04:00
Daniel Hodges
14a19dc3ca scx_layered: Add random layer growth algo
Add a random layer growth algo.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-12 05:35:54 -07:00
Fredrik Lönnegren
7d7bf94bb0 update Tumbleweed installation notes
Kernel package name was changed to kernel-default.

Also add link to documentation to README.md

Signed-off-by: Fredrik Lönnegren <fredrik@frelon.se>
2024-09-12 10:28:03 +02:00
Daniel Hodges
ae57f8d1f9 scx_rusty: Initialize node cpumask
Initialize the node cpumask, which was previously uninitialized, causing
metric calculations to be wrong when attempting to look up CPUs in the
node cpumask.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-11 13:14:44 -07:00
Jake Hillion
8ca45cfa37
lint: enable cargo fmt (#643)
Use `cargo fmt` with a specific nightly branch in the CI to enforce formatting. Globally format these files while the diff is still small so we can stay on top of it.

Test plan:
- CI lint check passes.
2024-09-11 10:03:20 +01:00
Daniel Hodges
43ec8bfe82 scx_layered: Add stats for XNUMA/XLLC migrations
Add stats for XNUMA/XLLC migrations. An example of the output is shown:
```
  hodgesd  : util/frac=    5.4/  0.1 load/frac=    301.0/  0.3 tasks=   476
             tot=   3168 local=97.82 wake/exp/reenq= 2.18/ 0.00/ 0.00
             keep/max/busy= 0.03/ 0.00/ 0.03 kick= 0.00 yield/ign= 0.09/    0
             open_idle= 0.00 mig= 6.82 xnuma_mig= 6.82 xllc_mig= 4.86 affn_viol= 0.00
             preempt/first/idle/fail= 0.00/ 0.00/ 0.00/ 0.00 min_exec= 0.00/   0.00ms
             cpus=  2 [  2,  4] 00000000 00000010 00001000
  normal   : util/frac=   28.7/  0.7 load/frac= 101704.7/ 95.8 tasks=  2450
             tot=   4660 local=99.06 wake/exp/reenq= 0.88/ 0.06/ 0.00
             keep/max/busy= 1.03/ 0.00/ 0.00 kick= 0.06 yield/ign= 0.04/  400
             open_idle=15.73 mig=23.45 xnuma_mig=23.45 xllc_mig= 3.07 affn_viol= 0.00
             preempt/first/idle/fail= 0.00/ 0.00/ 0.00/ 0.88 min_exec= 0.00/   0.00ms
             cpus=  2 [  2,  2] 00000001 00000100 00000000
             excl_coll=12.55 excl_preempt= 0.00
  random   : util/frac=    0.0/  0.0 load/frac=      0.0/  0.0 tasks=     0
             tot=      0 local= 0.00 wake/exp/reenq= 0.00/ 0.00/ 0.00
             keep/max/busy= 0.00/ 0.00/ 0.00 kick= 0.00 yield/ign= 0.00/    0
             open_idle= 0.00 mig= 0.00 xnuma_mig= 0.00 xllc_mig= 0.00 affn_viol= 0.00
             preempt/first/idle/fail= 0.00/ 0.00/ 0.00/ 0.00 min_exec= 0.00/   0.00ms
             cpus=  0 [  0,  0] 00000000 00000000 00000000
             excl_coll= 0.00 excl_preempt= 0.00
  stress-ng: util/frac= 4189.1/ 99.2 load/frac=   4200.0/  4.0 tasks=    43
             tot=     62 local= 0.00 wake/exp/reenq= 0.00/100.0/ 0.00
             keep/max/busy=2433.9/177.4/ 0.00 kick=100.0 yield/ign= 3.23/    0
             open_idle= 0.00 mig=54.84 xnuma_mig=54.84 xllc_mig=35.48 affn_viol= 0.00
             preempt/first/idle/fail= 0.00/ 0.00/ 0.00/ 0.00 min_exec= 0.00/   0.00ms
             cpus=  4 [  4,  4] 00000300 00030000 00000000
             excl_coll= 0.00 excl_preempt= 0.00
```

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-10 19:53:28 -07:00
Tejun Heo
8f0cc89ee8
Merge pull request #645 from frelon/rusty-init-dom
scx_rusty: init domains when calculating averages
2024-09-10 12:25:51 -10:00
Andrea Righi
e6e3579a92
Merge pull request #634 from anh0516/main
scx_bpfland: Documentation consistency fix
2024-09-10 23:25:55 +02:00
Fredrik Lönnegren
f155966b77 scx_rusty: init domains when calculating averages
The domains are added to the aggregator when load is added (and
duty_cycle is not 0.0f64).

This commit makes sure that all domains are added to the aggregator even
when the calculated duty_cycle is 0.

Signed-off-by: Fredrik Lönnegren <fredrik@frelon.se>
2024-09-10 21:51:41 +02:00
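
The fix in commit f155966b77 above amounts to creating the per-domain entry unconditionally, instead of only when a non-zero load is added. A minimal sketch, assuming a hypothetical `aggregate()` function and a plain `BTreeMap` in place of the real scx_rusty aggregator:

```rust
use std::collections::BTreeMap;

/// Sum duty cycles per domain. Every domain that appears in `loads` gets an
/// entry, even when its duty cycle is 0.0, so later average calculations
/// see all domains.
fn aggregate(loads: &[(usize, f64)]) -> BTreeMap<usize, f64> {
    let mut doms = BTreeMap::new();
    for &(dom_id, duty_cycle) in loads {
        // `entry().or_insert()` creates the domain before adding the load,
        // so a zero duty cycle still registers the domain.
        *doms.entry(dom_id).or_insert(0.0) += duty_cycle;
    }
    doms
}

fn main() {
    // Domain 1 has no load at all but must still be present in the result.
    let doms = aggregate(&[(0, 0.5), (1, 0.0), (0, 0.25)]);
    for (id, load) in &doms {
        println!("dom {id}: {load}");
    }
}
```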
likewhatevs
85863d0e1c
Merge pull request #644 from hodgesds/layered-topo-order
scx_layered: Pass layer spec for core growth algo
2024-09-10 14:49:37 -04:00
Daniel Hodges
5fdd257862 scx_layered: Pass layer spec for core growth algo
Pass in the layer spec when determining the layer core growth algo. This
should make it easier to implement layer growth algos that are spec
specific.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-10 10:27:08 -07:00
Daniel Hodges
8f8fe1b2c1
Merge pull request #642 from samuelnair/main
scx_layered: Fix typo in stats
2024-09-10 13:20:01 -04:00
likewhatevs
ffe8ca31e5
Merge pull request #641 from likewhatevs/migrate-ci-24.04
migrate ci vm to 24.04
2024-09-10 11:46:43 -04:00
Samuel Nair
c6af1aa1c8 scx_layered: Fix typo in stats 2024-09-10 08:44:57 -07:00
patso
28319f3205
migrate ci vm to ubuntu 24.04
2024-09-10 09:53:40 -04:00
likewhatevs
c4c3659b6d
Merge pull request #638 from likewhatevs/remove-rlimit-dep
remove dependency on rlimit.rs
2024-09-10 03:14:12 -04:00
Andrea Righi
8efe786799
Merge pull request #640 from sched-ext/bpfland-used-time-slice
scx_bpfland: use sum_exec_runtime to evaluate task's used time slice
2024-09-10 09:10:52 +02:00
Andrea Righi
655ed5b4c6 scx_bpfland: use sum_exec_runtime to evaluate task's used time slice
Using p->scx.slice to evaluate the consumed time slice can be a bit
imprecise, because the sched_ext core implements yielding by setting
p->scx.slice to 0.

When the task's vruntime is evaluated, the task is then treated as if it had
exhausted its entire allocated time slice, even though it voluntarily
released the CPU before the slice fully expired.

To avoid this inaccuracy and prevent penalizing tasks that voluntarily
release the CPU, always evaluate the used time slice based on the
difference in the task's total execution time (p->se.sum_exec_runtime).

This method provides a more precise calculation of vruntime and results
in a fairer evaluation of the task's deadline.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-10 08:03:35 +02:00
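
To illustrate the difference described in commit 655ed5b4c6 above, the sketch below compares the two ways of estimating the used time slice for a task that yields early. It is a plain Rust model, not the scx_bpfland BPF code: `TaskCtx`, `used_from_runtime()`, and `used_from_slice()` are hypothetical names, and the 20 ms slice and 2 ms runtime are made-up numbers.

```rust
/// Hypothetical per-task context remembering the total execution time
/// observed when the task last started running.
struct TaskCtx {
    last_sum_exec_runtime: u64,
}

/// Runtime-based estimate: the time actually spent on the CPU since the
/// last snapshot, independent of what happened to the remaining slice.
fn used_from_runtime(ctx: &mut TaskCtx, sum_exec_runtime: u64) -> u64 {
    let used = sum_exec_runtime.saturating_sub(ctx.last_sum_exec_runtime);
    ctx.last_sum_exec_runtime = sum_exec_runtime;
    used
}

/// Slice-based estimate: assigned slice minus what is left of it. When a
/// task yields, the remaining slice is set to 0, so this reports the whole
/// slice as used.
fn used_from_slice(assigned_slice: u64, remaining_slice: u64) -> u64 {
    assigned_slice.saturating_sub(remaining_slice)
}

fn main() {
    const MS: u64 = 1_000_000; // nanoseconds per millisecond
    let mut ctx = TaskCtx { last_sum_exec_runtime: 100 * MS };

    // The task gets a 20 ms slice, runs for 2 ms, then yields.
    let by_slice = used_from_slice(20 * MS, 0);
    let by_runtime = used_from_runtime(&mut ctx, 102 * MS);

    println!("slice-based:   {} ms", by_slice / MS);
    println!("runtime-based: {} ms", by_runtime / MS);
}
```

With the slice-based estimate the yielding task is charged for the full slice, while the runtime-based estimate charges it only for the time it actually ran.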
patso
c1df85914b
remove dependency on rlimit.rs
the rlimit crate is the only dependency crate
with a build.rs. build.rs files complicate portability.
this removes the need for rlimit.rs
2024-09-10 01:16:53 -04:00
Daniel Hodges
f3b7016a46
Merge pull request #633 from likewhatevs/add-pages
enable docs generation and upload
2024-09-09 19:12:23 -04:00
patso
2e46f71780
generate docs for scx and kernel
generate docs for scx and kernel and push to gh page

this "adds" kernel scheduler and bpf docs to
the generated scx rust docs.
2024-09-09 15:37:59 -04:00
Tejun Heo
249121f15f
Merge pull request #635 from sched-ext/htejun/build
build: Use a single top-level rust workspace
2024-09-08 21:57:27 -10:00
Tejun Heo
b2a71b166e build: Remove unused rust/meson.build
The previous commit makes this file unused but forgot to remove it. Remove
it.
2024-09-08 20:00:35 -10:00