Commit Graph

1129 Commits

Author SHA1 Message Date
Daniel Hodges
4aa841de0a
scx_layered: Rename HI_FALLBACK_DSQ to HI_FALLBACK_DSQ_BASE
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-20 17:28:38 -04:00
Daniel Hodges
a3d1344293
scx_layered: Add core growth algo for core type
Add core growth algos for Big/Little core support. The algos allow
layers to grow by preferring either big or little cores first.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-20 11:50:15 -04:00
I Hsin Cheng
7799b94f07 scx_layered: Add helper function to access cpumask within bpf_cpumask
Before passing "nodec->cpumask" and "cachec->cpumask" into
"bpf_cpumask_test_cpu()", type conversion should be done first.
Implement "cast_mask()" to convert "struct bpf_cpumask *" into "const
struct cpumask *".

Reference from
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/cpumask_common.h#n63

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-09-20 20:52:03 +08:00
I Hsin Cheng
5596d5e3fe scx_bpfland: Remove the usage of cast_mask in bpfland_enqueue
The usage of cast_mask() within bpfland_enqueue aims to cast the type of
"p->cpus_ptr" from "struct bpf_cpumask *" to "const struct cpumask *".
However, the type of "p->cpus_ptr" is already "const cpumask_t *" aka
"const struct cpumask *", so no conversion is needed.

Passing a value of type "struct cpumask *" into "struct bpf_cpumask *"
also leads to a compile error.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-09-20 20:45:09 +08:00
Daniel Hodges
8532ba3f1e
scx_layered: Fix hi fallback dsq consumption
Fix hi fallback dsq consumption to only consume from the cache local hi
fallback dsq.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-20 04:18:05 -04:00
I Hsin Cheng
e4bb99efc5 scx_layered: Refactor match_layer()
Refactor match_layer() to prevent the compile error caused by using the
variable "nr_match_ors" before it is initialized.

Move the check of "nr_match_ors" after the value is read from
"layer->nr_match_ors" to make sure it is initialized first.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-09-19 22:20:03 +08:00
Andrea Righi
3f8db5783b
Merge pull request #658 from sched-ext/rustland-core-improve-cpu-selection
scx_rustland_core: improve idle CPU selection API and logic
2024-09-17 22:38:15 +02:00
Andrea Righi
e6b624a97c scx_rustland_core: improve idle CPU selection API and logic
Pass enqueue flags to user-space: flags will be passed via
QueuedTask.flags and can be forwarded back to BPF via
DispatchedTask.flags.

These flags can be also passed to BpfScheduler.select_cpu() to apply a
more refined CPU selection policy.

Moreover, avoid prioritizing the user-space scheduler too much and
dispatch it only if there are no other tasks that need to be dispatched
in ops.dispatch().

This improves CPU utilization and enhances the fairness, robustness, and
resilience of schedulers based on scx_rustland_core, particularly under
stress test conditions.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-16 22:12:38 +02:00
Daniel Hodges
4f98de333d
Merge pull request #652 from JakeHillion/layer-growth-rr
scx_layered: add round robin growth strategy
2024-09-16 17:34:48 +02:00
Andrea Righi
00eebaf905 scx_bpfland: refine task wakeup logic
On WAKE_SYNC, attempt to migrate the wakee to the same CPU as the waker
if the waker is not exiting, the wakee can use the waker's CPU, the
waker's L3 domain is not saturated and there are no other tasks queued
to the local DSQ of the waker's CPU.

This is the same logic used in scx_rusty.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-15 14:50:14 +02:00
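
The WAKE_SYNC conditions listed in 00eebaf905 read as a single predicate. The following is a minimal Python sketch of that reading, not the scheduler's BPF code; the function and parameter names are hypothetical and stand in for checks the scheduler performs internally.

```python
def should_migrate_to_waker_cpu(
    waker_exiting: bool,
    wakee_allowed_on_waker_cpu: bool,
    waker_l3_saturated: bool,
    waker_local_dsq_len: int,
) -> bool:
    """Return True only when every condition from the commit message holds."""
    return (
        not waker_exiting                 # the waker is not exiting
        and wakee_allowed_on_waker_cpu    # the wakee can use the waker's CPU
        and not waker_l3_saturated        # the waker's L3 domain has room
        and waker_local_dsq_len == 0      # nothing else queued on that CPU
    )
```

Any single failing condition makes the function return False, in which case the sketch assumes the scheduler falls back to its regular idle CPU selection.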
Andrea Righi
079a53c689 scx_bpfland: get rid of preferred domain
Using the turbo boosted CPUs as the preferred scheduling domain seems to be
beneficial only in a very few corner cases, for example on battery-powered
devices with an aggressive cpufreq governor that constantly tries to
scale down the frequency (and even in this case it's probably better to
not force the tasks to run on the fast CPUs, to save power).

In practice the preferred domain seems to introduce more overhead than
benefits overall, so let's get rid of it.

This can be improved in the future by adding multiple user-configurable
scheduling domains.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-15 14:50:14 +02:00
Changwoo Min
95e2f4dabe scx_lavd: boost the latency criticality of kernel threads
Many kernel threads perform latency-critical tasks (e.g., net, gpu). In
particular, the AMD GPU driver runs mostly in kernel space using
kworker. Hence, treat kernel threads as if they were woken-up tasks.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-14 00:41:02 +09:00
Changwoo Min
4b4f42fce1 scx_lavd: add a short circuit for the case of no turbo core
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-13 16:02:07 +09:00
Jake Hillion
3848d87895 scx_layered: add round robin growth strategy 2024-09-12 23:27:21 +01:00
Daniel Hodges
632fcfe4ae
Merge pull request #648 from hodgesds/layered-llc-stats
scx_layered: Add stats for XNUMA/XLLC migrations
2024-09-12 13:23:23 -04:00
Daniel Hodges
dde6e0c7f9 scx_utils: Add node/llc id to core topology
Add ids for node/llc in the Core topology struct.
2024-09-12 10:05:02 -07:00
Daniel Hodges
aee19dd9a1 scx_layered: Add topology aware core growth selection
Add topology aware core growth selection.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-12 06:48:51 -07:00
Daniel Hodges
14a19dc3ca scx_layered: Add random layer growth algo
Add a random layer growth algo.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-12 05:35:54 -07:00
Daniel Hodges
ae57f8d1f9 scx_rusty: Initialize node cpumask
Initialize the node cpumask, which was previously uninitialized causing
metric calculations to be wrong when attempting to lookup CPUs in the
node cpumask.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-11 13:14:44 -07:00
Jake Hillion
8ca45cfa37
lint: enable cargo fmt (#643)
Use `cargo fmt` with a specific nightly branch in the CI to enforce formatting. Globally format these files while the diff is still small so we can stay on top of it.

Test plan:
- CI lint check passes.
2024-09-11 10:03:20 +01:00
Daniel Hodges
43ec8bfe82 scx_layered: Add stats for XNUMA/XLLC migrations
Add stats for XNUMA/XLLC migrations. An example of the output is shown:
```
  hodgesd  : util/frac=    5.4/  0.1 load/frac=    301.0/  0.3 tasks=   476
             tot=   3168 local=97.82 wake/exp/reenq= 2.18/ 0.00/ 0.00
             keep/max/busy= 0.03/ 0.00/ 0.03 kick= 0.00 yield/ign= 0.09/    0
             open_idle= 0.00 mig= 6.82 xnuma_mig= 6.82 xllc_mig= 4.86 affn_viol= 0.00
             preempt/first/idle/fail= 0.00/ 0.00/ 0.00/ 0.00 min_exec= 0.00/   0.00ms
             cpus=  2 [  2,  4] 00000000 00000010 00001000
  normal   : util/frac=   28.7/  0.7 load/frac= 101704.7/ 95.8 tasks=  2450
             tot=   4660 local=99.06 wake/exp/reenq= 0.88/ 0.06/ 0.00
             keep/max/busy= 1.03/ 0.00/ 0.00 kick= 0.06 yield/ign= 0.04/  400
             open_idle=15.73 mig=23.45 xnuma_mig=23.45 xllc_mig= 3.07 affn_viol= 0.00
             preempt/first/idle/fail= 0.00/ 0.00/ 0.00/ 0.88 min_exec= 0.00/   0.00ms
             cpus=  2 [  2,  2] 00000001 00000100 00000000
             excl_coll=12.55 excl_preempt= 0.00
  random   : util/frac=    0.0/  0.0 load/frac=      0.0/  0.0 tasks=     0
             tot=      0 local= 0.00 wake/exp/reenq= 0.00/ 0.00/ 0.00
             keep/max/busy= 0.00/ 0.00/ 0.00 kick= 0.00 yield/ign= 0.00/    0
             open_idle= 0.00 mig= 0.00 xnuma_mig= 0.00 xllc_mig= 0.00 affn_viol= 0.00
             preempt/first/idle/fail= 0.00/ 0.00/ 0.00/ 0.00 min_exec= 0.00/   0.00ms
             cpus=  0 [  0,  0] 00000000 00000000 00000000
             excl_coll= 0.00 excl_preempt= 0.00
  stress-ng: util/frac= 4189.1/ 99.2 load/frac=   4200.0/  4.0 tasks=    43
             tot=     62 local= 0.00 wake/exp/reenq= 0.00/100.0/ 0.00
             keep/max/busy=2433.9/177.4/ 0.00 kick=100.0 yield/ign= 3.23/    0
             open_idle= 0.00 mig=54.84 xnuma_mig=54.84 xllc_mig=35.48 affn_viol= 0.00
             preempt/first/idle/fail= 0.00/ 0.00/ 0.00/ 0.00 min_exec= 0.00/   0.00ms
             cpus=  4 [  4,  4] 00000300 00030000 00000000
             excl_coll= 0.00 excl_preempt= 0.00
```

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-10 19:53:28 -07:00
Tejun Heo
8f0cc89ee8
Merge pull request #645 from frelon/rusty-init-dom
scx_rusty: init domains when calculating averages
2024-09-10 12:25:51 -10:00
Andrea Righi
e6e3579a92
Merge pull request #634 from anh0516/main
scx_bpfland: Documentation consistency fix
2024-09-10 23:25:55 +02:00
Fredrik Lönnegren
f155966b77 scx_rusty: init domains when calculating averages
The domains are added to the aggregator when load is added (and
duty_cycle is not 0.0f64).

This commit makes sure that all domains are added to the aggregator even
when the calculated duty_cycle is 0.

Signed-off-by: Fredrik Lönnegren <fredrik@frelon.se>
2024-09-10 21:51:41 +02:00
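
The effect of f155966b77 can be illustrated with a small aggregator that registers every domain up front, so a domain whose tasks all have a duty cycle of 0 still appears in the result. This is a hedged Python sketch; `aggregate_domain_load` and its arguments are hypothetical names, not scx_rusty's actual API.

```python
def aggregate_domain_load(domain_ids, task_loads):
    """Sum the duty cycle per domain, keeping zero-load domains in the result.

    task_loads is an iterable of (domain_id, duty_cycle) pairs.
    """
    # Register every domain first, so none is missing when averages are taken.
    totals = {dom: 0.0 for dom in domain_ids}
    for dom, duty_cycle in task_loads:
        totals[dom] += duty_cycle
    return totals
```

Without the up-front initialization, domains 1 and 2 in a call such as `aggregate_domain_load([0, 1, 2], [(0, 0.5)])` would be absent from the totals instead of reporting 0.0.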
likewhatevs
85863d0e1c
Merge pull request #644 from hodgesds/layered-topo-order
scx_layered: Pass layer spec for core growth algo
2024-09-10 14:49:37 -04:00
Daniel Hodges
5fdd257862 scx_layered: Pass layer spec for core growth algo
Pass in the layer spec when determining the layer core growth algo. This
should make it easier to implement layer growth algos that are spec
specific.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-10 10:27:08 -07:00
Samuel Nair
c6af1aa1c8 scx_layered: Fix typo in stats 2024-09-10 08:44:57 -07:00
likewhatevs
c4c3659b6d
Merge pull request #638 from likewhatevs/remove-rlimit-dep
remove dependency on rlimit.rs
2024-09-10 03:14:12 -04:00
Andrea Righi
655ed5b4c6 scx_bpfland: use sum_exec_runtime to evaluate task's used time slice
Using p->scx.slice to evaluate the consumed time slice can be a bit
imprecise, because the sched_ext core implements yielding by setting
p->scx.slice to 0.

When the task's vruntime is evaluated this is considered as the task has
exhausted its entire allocated time slice, even though it voluntarily
released the CPU before the slice fully expired.

To avoid this inaccuracy and prevent penalizing tasks that voluntarily
release the CPU, always evaluate the used time slice based on the
difference in the task's total execution time (p->se.sum_exec_runtime).

This method provides a more precise calculation of vruntime and results
in a fairer task's deadline evaluation.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-10 08:03:35 +02:00
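
The difference between the two accounting methods in 655ed5b4c6 shows up when a task yields early. The sketch below is illustrative Python with made-up numbers; the function names and the 5 ms slice are assumptions, and only the idea (use the delta of the total execution time rather than the remaining slice) comes from the commit message.

```python
SLICE_NS = 5_000_000  # hypothetical 5 ms time slice


def used_from_remaining_slice(slice_ns: int, remaining_ns: int) -> int:
    """Old approach: infer usage from what is left of the slice."""
    return slice_ns - remaining_ns


def used_from_exec_runtime(sum_exec_now_ns: int, sum_exec_at_start_ns: int) -> int:
    """New approach: use the growth of the task's total execution time."""
    return sum_exec_now_ns - sum_exec_at_start_ns


# The task runs for 1 ms and then yields, so its remaining slice is set to 0.
charged_old = used_from_remaining_slice(SLICE_NS, 0)           # full 5 ms
charged_new = used_from_exec_runtime(11_000_000, 10_000_000)   # actual 1 ms
```

With the old method the yielding task is charged the whole slice; with the runtime delta it is charged only the time it actually ran.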
patso
c1df85914b
remove dependency on rlimit.rs
the rlimit crate is the only dependency crate
with a build.rs. build.rs files complicate portability.
this removes the need for rlimit.rs
2024-09-10 01:16:53 -04:00
Tejun Heo
56bb963136 build: Use a single top-level rust workspace
Rust build was using two separate workspaces - rust/ and scheds/rust.
There's no reason to separate them and it makes doc generation tricky. Use
single top level workspace so that we can drive all rust building from
cargo.
2024-09-08 14:23:48 -10:00
patso
120211d731
split build and test jobs
split build and test jobs to reduce ci turnaround time
and make it clear what is failing when something fails.

also add virtiofsd to deps to make test compilation faster
(most test time is compilation) and remove all force 9ps.
2024-09-08 02:54:24 -04:00
Changwoo Min
17e0e08e6e
Merge pull request #621 from multics69/lavd-greedy-fix
scx_lavd: improve greedy ratio calculation and more
2024-09-07 10:52:00 +09:00
Tejun Heo
6f8917ceca
Merge pull request #624 from JakeHillion/cleanup-layer_growth_algo
scx_layered: clean up Layer::new layer_growth_algo
2024-09-06 15:10:41 -10:00
Avraham Hollander
f71cc646a3 scx_bpfland: Fix in README.md for the same text as a comment in the
source
2024-09-06 19:12:33 -04:00
Jake Hillion
2c008b2afa scx_layered: clean up Layer::new layer_growth_algo 2024-09-06 18:25:50 +01:00
Changwoo Min
36df970a8f scx_lavd: add debug print for turbo cores
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-06 19:23:17 +09:00
Changwoo Min
351a1c6656 scx_lavd: enable autopilot mode by default
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-06 19:23:12 +09:00
Andrea Righi
8231f8586a scx_rlfifo: better documentation and code readability
Simplify scx_rlfifo code, add detailed documentation of the
scx_rustland_core API and get rid of the additional task queue, since it
just makes the code bigger, slower and it doesn't really provide any
benefit (considering that we are dispatching the tasks in FIFO order
anyway).

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-06 11:25:24 +02:00
Andrea Righi
ed879bae28 scx_rustland_core: expose enq_flags to user-space
Pass the enqueue flags to the user-space scheduler through the
QueuedTask struct.

These flags allow the user-space scheduler to make more informed
scheduling decisions.

Also bump up scx_rustland_core minor version to reflect the new API (we
are not breaking the old API, so we don't need to bump the major version
in this case).

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-06 11:25:24 +02:00
Changwoo Min
ebe9375b6a scx_lavd: pretty printing of status
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-06 16:27:20 +09:00
Changwoo Min
461cb9a3a0 scx_lavd: fix calculation of greedy_ratio
The service time (taskc->svc_time) should be the sum of total CPU time
consumed, not just a delta.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-06 16:22:40 +09:00
Tejun Heo
46fc2e1a49 version: v1.0.4 2024-09-05 18:12:45 -10:00
Tejun Heo
cd555741d0 rust: Synchronize dependency versions 2024-09-05 17:10:02 -10:00
Changwoo Min
e3243c5d51
Merge pull request #612 from multics69/lavd-monitor
scx_lavd: add --monitor flag and two micro-optimizations
2024-09-06 09:33:55 +09:00
Changwoo Min
d9274bd8e6 scx_lavd: drop time slice boost for big cores
Unexpectedly, little cores, which have relatively short time slices, have
more chances to schedule performance-critical tasks. Hence it is better
to keep the time slice the same regardless of the core type.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-06 09:32:38 +09:00
Changwoo Min
fdecba227c scx_lavd: print more info with --monitor
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-06 09:32:31 +09:00
Daniel Hodges
0fa369b914
Merge pull request #619 from hodgesds/stats-fixes
scx_layered: Fix stats typo
2024-09-05 15:44:15 -04:00
Daniel Hodges
25e1642bbc
scx_layered: Fix stats typo
Small typo fix

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-05 14:12:28 -04:00
Andrea Righi
918cfc613d scx_bpfland: optimize producer/consumer workloads
When selecting an idle CPU for a task that has been woken up, prioritize
reusing the same CPU if the waker and wakee share the same L3 cache.

Otherwise, attempt to migrate the wakee to the waker's CPU, provided it
is allowed by the wakee's scheduling domain.

This seems to consistently improve FPS performance when the system is
not operating over its full capacity.

Example:
 $ __GL_SYNC_TO_VBLANK=0 vblank_mode=0 glxgears -geometry 800x600

 - before: ~18305.77 FPS
 - after:  ~19060.62 FPS

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-05 19:02:09 +02:00
Andrea Righi
28050dcd7d
Merge pull request #615 from sched-ext/bpfland-auto
scx_bpfland: enable "auto" mode by default
2024-09-05 19:01:50 +02:00
Daniel Hodges
e6ed9b05ba
Merge pull request #614 from hodgesds/layered-stats-fix
scx_layered: Fix stats formatting
2024-09-05 12:54:56 -04:00
Andrea Righi
844c00fd26 scx_bpfland: enable "auto" mode by default
Rename "turbo domain" to "preferred domain", that conceptually is more
generic and introduce the new option `--preferred-domain CPUMASK`, which
allows users to define the preferred domain, specifying a cpumask as a
hex number. By default ("auto") the scheduler will always try to detect
and use the fastest CPUs in the system.

Moreover, adjust the cpufreq logic to use "auto" both with the
"balance_power" and "balance_performance" EPP profiles.

Then, enable "auto" mode by default: the scheduler will try to
automatically determine the optimal primary domain, preferred domain and
cpufreq level, based on the selected scheduler and energy profiles.

Tested-by: Piotr Gorski < piotr.gorski@cachyos.org >
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-05 16:11:12 +02:00
Daniel Hodges
76ad880475
scx_layered: Fix stats formatting
Fix formatting precision of stats to have lower precision for
readability. The existing formatting is hard to read:

tot=   1538 local=31.27 open_idle= 2.73 affn_viol=23.80 proc=4ms
busy=  1.1 util=   16.6 load=     32.7 fallback_cpu=  6
excl_coll=0.06501950585175553 excl_preempt=0.26007802340702213 excl_idle=0.16384915474642392 excl_wakeup=0.25097529258777634

With this fix the stats formatting is far more readable:

tot=    441 local=33.56 open_idle= 0.00 affn_viol=20.63 proc=3ms
busy=  0.4 util=    6.3 load=     33.6 fallback_cpu=  6
excl_coll=0.454 excl_preempt=0.000 excl_idle=0.132 excl_wakeup=0.200

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-05 06:44:54 -04:00
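
The readability gain in 76ad880475 comes from printing the `excl_*` ratios with a fixed precision. The actual change is in the scheduler's Rust code; the Python sketch below only illustrates the effect of a three-decimal format on the "before" values quoted in the commit message, and `format_excl_stats` is a hypothetical helper.

```python
def format_excl_stats(excl_coll, excl_preempt, excl_idle, excl_wakeup):
    """Render the exclusive-layer ratios with three decimal places each."""
    return (
        f"excl_coll={excl_coll:.3f} excl_preempt={excl_preempt:.3f} "
        f"excl_idle={excl_idle:.3f} excl_wakeup={excl_wakeup:.3f}"
    )


line = format_excl_stats(
    0.06501950585175553,
    0.26007802340702213,
    0.16384915474642392,
    0.25097529258777634,
)
```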
Changwoo Min
f490a55d54 scx_lavd: accumulate more system-wide statistics
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-05 16:03:14 +09:00
Changwoo Min
e5d27d0553 scx_lavd: print basic system status when --monitor is given
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-05 16:03:14 +09:00
Changwoo Min
6b717a3f3d scx_lavd: add --help-stats option
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-05 16:03:14 +09:00
Changwoo Min
ca1c86eb9c scx_lavd: improve pick_idle_cpu() for pinned tasks
When a pinned task cannot run on either active or overflow sets, we try
to stay on the previous CPU which is still okay to run on.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-05 16:03:14 +09:00
Andrea Righi
afc7b5404b
Merge pull request #600 from sched-ext/bpfland-cpufreq
scx_bpfland: improve cpufreq awareness
2024-09-05 07:32:10 +02:00
Tejun Heo
f010eda5c0 meson: Remove scheds/rust/*/meson.build
These aren't used since 43950c65 ("build: Use workspace to group rust
sub-projects"). Drop them.
2024-09-04 06:40:17 -10:00
Andrea Righi
c3cab45f6a scx_rustland_core: bump up version to 2.0.1
Bump up scx_rustland_core version to include this critical fix that
prevents scheduler stalls:

 94a3594 ("scx_rustland_core: always dispatch per-cpu kthreads directly")

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-04 08:00:25 +02:00
Andrea Righi
918f1db4bd scx_bpfland: dynamically adjust cpufreq level in auto mode
In auto mode, rather than keeping the previous fixed cpuperf factor,
dynamically calculate it based on CPU utilization and apply it before a
task runs within its allocated time slot.

Interactive tasks consistently receive the maximum scaling factor to
ensure optimal performance.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-03 21:36:48 +02:00
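
One way to picture the dynamic cpufreq level from 918f1db4bd is a function that maps CPU utilization to a performance target and pins interactive tasks at the maximum. The sketch below is an assumption-heavy illustration in Python: the linear mapping, the 0..1024 scale and the name `cpuperf_target` are not taken from the commit, which only states that the factor is derived from CPU utilization and that interactive tasks always get the maximum.

```python
CPUPERF_MAX = 1024  # assumed full-scale value for the performance target


def cpuperf_target(cpu_util_pct: float, interactive: bool) -> int:
    """Return a performance target in the range 0..CPUPERF_MAX."""
    if interactive:
        # Interactive tasks consistently receive the maximum scaling factor.
        return CPUPERF_MAX
    # Clamp utilization to 0..100 and scale it linearly (assumed mapping).
    util = min(max(cpu_util_pct, 0.0), 100.0)
    return int(util * CPUPERF_MAX / 100)
```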
Daniel Hodges
9c5717577f
Merge pull request #601 from hodgesds/namespace-helpers
scx_helpers: Add pid namespace helpers
2024-09-03 14:38:26 -04:00
Daniel Hodges
8f4e9e5e3b scx_helpers: Add pid namespace helpers
Add pid namespace helpers for translating namespace pids.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-09-03 11:21:32 -07:00
Andrea Righi
fe6ac15015 scx_bpfland: improve turbo domain CPU selection
Always consider the turbo domain when running in "auto" mode.

Additionally, when the turbo domain is used, split the CPU idle
selection logic into two stages:
 1) in ops.select_cpu(), provide the task with a second opportunity to
    remain within the same LLC
 2) in ops.enqueue(), perform another check for an idle CPU, allowing
    the task to move to a different LLC if an idle CPU within the same
    LLC is not available.

This allows tasks to stick more on turbo-boosted CPUs and CPUs within
the same LLC.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-03 09:59:29 +02:00
Andrea Righi
70b93ed641 scx_bpfland: skip idle CPU selection for tasks with changing affinity
When tasks are changing CPU affinity it is pointless to try to find an
optimal idle CPU. In this case just skip the idle CPU selection step
and let the task be dispatched to a global DSQ if needed.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-03 09:59:29 +02:00
Andrea Righi
802d104b46 scx_bpfland: add basic cpufreq support
Add hints for the cpufreq governor based on the selected scheduler's
performance profile and the current energy performance preference (EPP).

With this change applied the scheduler works as follows:

scheduler profile (--primary-domain option):
  - default:
    - use all cores
    - cpufreq: use default scaling factor
  - powersave:
    - use E-cores
    - cpufreq: use min scaling factor
  - performance:
    - use P-cores
    - cpufreq: use max scaling factor
  - auto:
    - EPP: power, powersave
      - use E-cores
      - cpufreq: use min scaling factor
    - EPP: balance_power (typically battery-powered systems)
      - use E-cores
      - cpufreq: use default scaling factor
    - EPP: balance_performance, performance
      - use P-cores
      - cpufreq: use max scaling factor

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-03 09:59:29 +02:00
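
The profile list in 802d104b46 is effectively a lookup table from the `--primary-domain` value (and, in auto mode, the current EPP) to a core set and a cpufreq scaling factor. A minimal Python sketch of that table follows; the dictionary and function names are hypothetical, and the string values simply mirror the wording of the commit message.

```python
# (core set, cpufreq scaling factor) for the explicit scheduler profiles.
PROFILES = {
    "default": ("all cores", "default"),
    "powersave": ("E-cores", "min"),
    "performance": ("P-cores", "max"),
}

# In auto mode the choice depends on the energy performance preference (EPP).
AUTO_BY_EPP = {
    "power": ("E-cores", "min"),
    "powersave": ("E-cores", "min"),
    "balance_power": ("E-cores", "default"),
    "balance_performance": ("P-cores", "max"),
    "performance": ("P-cores", "max"),
}


def resolve_profile(primary_domain: str, epp: str = "") -> tuple:
    """Return the (core set, cpufreq factor) pair for a profile."""
    if primary_domain == "auto":
        return AUTO_BY_EPP[epp]
    return PROFILES[primary_domain]
```

For example, `resolve_profile("auto", "balance_power")` yields E-cores with the default scaling factor, matching the battery-powered case in the list above.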
Andrea Righi
d0fb29a0f7 scx_rustland: aggressively prioritize interactive tasks
scx_rustland was originally designed as a PoC to showcase the benefits
of implementing specialized schedulers via sched_ext, focusing on a very
specific use case: prioritize game responsiveness regardless of what
runs in the background.

Its original design was subsequently modified to better serve as a
general-purpose scheduler, balancing the prioritization of interactive
tasks with CPU-intensive ones to prevent over-prioritization.

With scx_bpfland serving as a more "general-purpose" scheduler, it makes
sense to revisit scx_rustland's original goal and make it much more
aggressive at prioritizing interactive tasks, determined as a function of
their average amount of context switches.

This change makes scx_rustland again a really good PoC to showcase the
benefits of having specialized schedulers, by focusing only on a very
specific use case: provide a high and stable frames-per-second (fps)
while a kernel build is running in the background.

= Results =

 - Test: Run a WebGL application [1] while building the kernel (make -j32)
 - Hardware: 8-cores Intel 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz

  +----------------------+--------+--------+
  |      Scheduler       | avg fps|  stdev |
  +----------------------+--------+--------+
  |               EEVDF  |   28   |  4.00  |
  | scx_rustland-before  |   43   |  1.25  |
  |  scx_rustland-after  |   60   |  0.25  |
  +----------------------+--------+--------+

[1] https://webglsamples.org/aquarium/aquarium.html

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-02 15:53:35 +02:00
Changwoo Min
172fe1efc6
Merge pull request #597 from multics69/lavd-turbo-tuning2
scx_lavd: misc updates (verifier, README, monitor option name, and micro-optimization)
2024-09-02 18:00:26 +09:00
Changwoo Min
0108b83050 scx_lavd: make the old verifier happy (bpf_cpumask_set_cpu)
An old BPF verifier does not allow calling bpf_cpumask_set_cpu() in the
BPF syscall context, so we defer actual bpf_cpumask_set_cpu() to the
timer handler, update_sys_stat(), to work around the problem.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-02 18:00:12 +09:00
Changwoo Min
3bc2fd4977 scx_lavd: update README
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-02 18:00:12 +09:00
Changwoo Min
afbebaeed6 scx_lavd: check a core type of previous cpu at pick_idle_cpu()
If a task is performance-critical, pick_idle_cpu() checks if the
previous core is a big core or not. If not, don't try to run on previous
core since a performance-critical task is better to run on a big core.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-01 17:28:16 +09:00
Changwoo Min
f2122c4197
Merge pull request #595 from multics69/lavd-turbo-tuning
scx_lavd: improve the autopilot mode
2024-09-01 16:24:41 +09:00
Andrea Righi
1595445a63
Merge pull request #594 from sched-ext/scx-rustland-core-version-2
scx_rustland_core: bump up major version to 2.0.0
2024-09-01 08:57:32 +02:00
Changwoo Min
5ca4501139 scx_lavd: dynamically decide autopilot's low watermark
A single threshold for a low watermark does not work well across systems
with various numbers of cores and core types. Instead of using a single
low watermark value, we dynamically decide the low watermark: 1) until
one little core is fully utilized or 2) until two big cores are fully
utilized. This works better across systems.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-01 12:46:57 +09:00
Andrea Righi
0aa71c832b scx_rustland_core: bump up major version to 2.0.0
The scx_rustland_core API has been redesigned recently, breaking the
compatibility with the past.

Considering that Rust crates should update their major version when the
previous API becomes incompatible [1], bump up the version to 2.0.0.

[1] https://semver.org/

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-31 23:23:26 +02:00
Andrea Righi
2cbf252019 scx_bpfland: directly dispatch only per-cpu kthreads with local_kthreads
We want to directly dispatch only kthreads when local_kthreads is
enabled, not all tasks that can run on a single CPU.

Fixes: 7cc1846 ("scx_bpfland: always rely on prev_cpu with single-CPU tasks")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-31 16:35:54 +02:00
Changwoo Min
4a7b806dd2 scx_lavd: when no_freq_scaling, always set to the max freq
When the no_freq_scaling changes during runtime in the autopilot mode,
the last target freq set would not be 1024. So the performance mode
enabled by the autopilot mode would not run in the best profile. Hence,
we set the target freq to 1024 always when no_freq_scaling is set.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-31 18:22:33 +09:00
Daniel Hodges
63a2eecce8
Merge pull request #592 from hodgesds/layered-ts-fixes
scx_layered: Fix layer timeslice not being applied
2024-08-30 15:34:57 -04:00
Daniel Hodges
e04b612688 scx_layered: Fix layer timeslice not being applied
Fix a small bug where the layer timeslice is not applied.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-30 11:53:42 -07:00
Changwoo Min
4d8bf870a1
Merge pull request #591 from multics69/lavd-turbo3
scx_lavd: introduce "autopilot" mode and misc. optimization & bug fix
2024-08-31 02:14:35 +09:00
Andrea Righi
f782467eaf scx_rustland: convert to scx_stats
This allows scx_rustland to avoid generating excessive logs for
statistics while still allowing detailed monitoring on demand.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-30 18:32:32 +02:00
Changwoo Min
9091dd983b scx_lavd: add "--autopilot" mode
Add "--autopilot" option and mode. In the autopilot mode, the scheduler
dynamically changes its power mode according to system's load (cpu
utilization). When the cpu utilization is low enough (say <=5%), it
switches to the powersave mode since there is nothing to process fast so
powersaving is the primary goal. When the utilization is moderate (say
>5%, <=30%), it runs in balanced mode. When the utilization is high
enough (say >30%), it runs in performance mode.

Note that it only changes scheduler's power mode but it does not change
system's energy profile.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-31 01:14:33 +09:00
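
The mode selection described in 9091dd983b can be sketched as a simple threshold function. The 5% and 30% cut-offs are the example values the commit message gives ("say"), so treat them as illustrative rather than the scheduler's exact constants; `autopilot_power_mode` is a hypothetical name and the code is Python rather than the scheduler's own implementation.

```python
def autopilot_power_mode(cpu_util_pct: float) -> str:
    """Pick a scheduler power mode from system-wide CPU utilization."""
    if cpu_util_pct <= 5.0:
        # Nothing needs to be processed fast, so power saving comes first.
        return "powersave"
    if cpu_util_pct <= 30.0:
        return "balanced"
    return "performance"
```

As the commit notes, this only changes the scheduler's own power mode, not the system's energy profile.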
Changwoo Min
5ecaa9ebe2 scx_lavd: improve the accuracy of cpu utilization calculation
When a cpu is idle for a whole interval, its idle time does not
add up correctly, so the utilization of such a cpu tends to be higher
than the actual utilization. Now it is fixed, so cpu utilization
becomes more accurate.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-31 01:14:33 +09:00
Changwoo Min
2f8cc0d60f scx_lavd: rename the "--auto" option to "--autopower" to be clear
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-31 01:14:33 +09:00
Changwoo Min
815f1263b2 scx_lavd: reinitialize active cpumask when power mode changes
When the power mode changes back to performance mode, we should
reinitialize the active/overflow cpumasks to their initial state -- all
big cores are in the active cpumask and all little cores are in the
overflow cpumask. Otherwise, the active/overflow cpumasks left over from
the previous mode will be used in the performance mode.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-31 01:14:33 +09:00
Changwoo Min
afb8c78a09 scx_lavd: print power mode change in the auto mode
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-31 01:14:33 +09:00
Changwoo Min
a89a56dba4 scx_lavd: add a fastpath in ops.select_cpu() for a sharply pinned task
If a task can be run only on a single cpu, we don't need to go through
all the steps in ops.select_cpu(). Instead, we simply check if a task
is still pinned on the prev_cpu and go.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-31 01:14:33 +09:00
Andrea Righi
b54fc202b8
Merge pull request #583 from sched-ext/bpfland-fix-pcpu-direct-dispatch
scx_bpfland: always rely on prev_cpu with single-CPU tasks
2024-08-30 18:12:59 +02:00
Andrea Righi
7cc18460b9 scx_bpfland: always rely on prev_cpu with single-CPU tasks
When selecting an idle CPU for tasks that can only run on a single CPU,
always check if the previously used CPU is still usable, instead of
trying to figure out the single allowed CPU looking at the task's
cpumask.

Apparently, single-CPU tasks can report a prev_cpu that is not in the
allowed cpumask when they rapidly change affinity.

This could lead to stalls, because we may end up dispatching the kthread
to a per-CPU DSQ that is not compatible with its allowed cpumask.

Example:

kworker/u32:2[173797] triggered exit kind 1026:
  runnable task stall (kworker/2:1[70] failed to run for 7.552s)
...
  R kworker/2:1[70] -7552ms
      scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x8 dsq_vtime=234483011369
      cpus=04

In this case kworker/2 can only run on CPU #2 (cpus=0x4), but it's
dispatched to dsq_id=0x8, that can only be consumed by CPU 8 => stall.

To prevent this, do not try to figure out the best idle CPU for tasks
that are changing affinity and just dispatch them to a global DSQ
(either priority or regular, depending on its interactive state).

Moreover, introduce an explicit error check in dispatch_direct_cpu() to
improve detection of similar issues in the future, and drop
lookup_task_ctx() in favor of try_lookup_task_ctx(), since we can now
safely handle all the cases where the task context is not found.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-30 09:45:58 +02:00
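
The stall in 7cc18460b9 comes down to a bitmask mismatch: the task's allowed cpumask is 0x4 (only CPU 2), yet it was queued on the per-CPU DSQ of CPU 8. A hedged Python sketch of the kind of membership check that exposes the problem is shown below; `cpu_in_mask` is a hypothetical helper and does not reproduce the error check added to dispatch_direct_cpu().

```python
def cpu_in_mask(cpumask: int, cpu: int) -> bool:
    """Return True if bit number `cpu` is set in the cpumask."""
    return bool((cpumask >> cpu) & 1)


ALLOWED = 0x04  # cpus=04 from the example: only CPU 2 is allowed

can_run_on_cpu2 = cpu_in_mask(ALLOWED, 2)  # the CPU the kworker is bound to
can_run_on_cpu8 = cpu_in_mask(ALLOWED, 8)  # the CPU whose DSQ held the task
```

A direct dispatch to a per-CPU DSQ is only safe when the check for that CPU succeeds; otherwise the task can sit in a queue that no allowed CPU will ever consume.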
Changwoo Min
3e2e78a9ec
Merge pull request #584 from multics69/lavd-turbo2
scx_lavd: automatically determine power mode and more
2024-08-30 08:56:16 +09:00
Daniel Hodges
47184e9d19
Merge pull request #582 from hodgesds/layered-growth-interface
scx_layered: Add layer growth config
2024-08-29 18:49:59 -04:00
Changwoo Min
bb08919203 scx_lavd: determine power mode automatically with --auto option
It checks the EPP (energy performance preference) periodically and sets
the power profile of the scheduler during runtime as a user changes the
EPP profile (from her desktop UI).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-29 19:15:23 +09:00
Andrea Righi
cc3f696c4b
Merge pull request #577 from sched-ext/bpfland-task-affinity
scx_bpfland: enhanced task affinity
2024-08-29 07:46:57 +02:00
Daniel Hodges
7e0329e45c scx_layered: Add layer growth config
Add a per layer config for different implementations of layer growth
algorithms. Convert the existing default logic into a default layer
growth algorithm and add a linear implementation.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-28 19:17:24 -07:00
Daniel Hodges
cf765562c7
scx_layered: Update docs for layer slice setting
Add docs for layer slice setting.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-28 22:12:07 -04:00
Daniel Hodges
a23308e7b0 scx_layered: Add more docs on tuning
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-28 12:38:05 -07:00
Daniel Hodges
96326b1ef3 scx_layered: Add additional docs
Add some additional docs on tuning layered.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-28 12:27:26 -07:00
Daniel Hodges
cc450f1a4b scx_layered: Add per layer timeslice
Allow setting a different timeslice per layer.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-28 11:21:03 -07:00
Daniel Hodges
c511b42b7b scx_layered: Make verification easier on older kernels
Refactor some BPF code to make verification easier on older kernels.
This is to make it easier to maintain backports.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-28 08:05:10 -07:00
Daniel Hodges
12f8cb74b5 scx_utils: Add GPU topology
Add GPU awareness to the topology crate.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-28 06:35:35 -07:00
Andrea Righi
28cb1ec5cb scx_bpfland: enhanced task affinity
Aggressively try to keep tasks running on the same CPU / cache / domain,
to achieve higher performance when the system is not overcommitted.

This is done by giving a second chance in ops.enqueue(), in addition to
ops.select_cpu(), to find an idle CPU close to the previously used CPU.

Moreover, even if the task is dispatched to the global DSQs, always try
to check if there is an idle CPU in the primary domain that can
immediately consume the task.

= Results =

This change seems to provide a minor, but consistent, boost of
performance with the CPU-intensive benchmarks from the CachyOS
benchmarks selection [1].

Similar results can also be noticed with some WebGL benchmarks [2], when
system usage is close to its maximum capacity.

Test:
 - cachyos-benchmarker

System:
 - AMD Ryzen 7 5800X 8-Core Processor

Metrics:
 - total time: elapsed time of all benchmarks
 - total score: geometric mean of all benchmarks

NOTE: total time is the most relevant, since it gives a measure of the
aggregate performance, while the total score emphasizes more on
performance consistency across all benchmarks.
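
To make the two metrics concrete, here is a minimal Python sketch. The per-benchmark values are hypothetical (they are not taken from the results in this message), and it assumes the score is computed over the same per-benchmark numbers as the total time.

```python
import math


def total_time(values):
    """Sum of all per-benchmark values."""
    return sum(values)


def total_score(values):
    """Geometric mean: the n-th root of the product of n values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-benchmark elapsed times, in seconds.
values = [10.0, 40.0, 160.0]

print(total_time(values))
print(total_score(values))
```

The sum is dominated by the longest benchmark, while the geometric mean weighs relative changes in each benchmark equally.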

== Results: summary ==

 +-------------------------+---------------------+---------------------+
 |         Scheduler       |    Total Time       |    Total Score      |
 |                         |    (less = better)  |    (less = better)  |
 +-------------------------+---------------------+---------------------+
 |                 EEVDF   |  624.44 sec         |      123.68         |
 |               bpfland   |  625.34 sec         |      122.21         |
 | bpfland-task-affinity   |  623.67 sec         |      122.27         |
 +-------------------------+---------------------+---------------------+

== Conclusion ==

With this patch applied, bpfland shows both better performance and
consistency. Although the gains are small (less than 1%), they are still
significant for this type of benchmark and consistently appear across
multiple runs.
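
The size of the gain can be worked out from the total times in the summary table. This is only arithmetic on the published numbers; `gain_pct()` is a hypothetical helper name.

```python
# Total times in seconds, copied from the summary table.
total_time = {
    "EEVDF": 624.44,
    "bpfland": 625.34,
    "bpfland-task-affinity": 623.67,
}


def gain_pct(baseline: float, patched: float) -> float:
    """Relative reduction in elapsed time, in percent."""
    return (baseline - patched) / baseline * 100.0


patched = total_time["bpfland-task-affinity"]
for name in ("EEVDF", "bpfland"):
    print(f"vs {name}: {gain_pct(total_time[name], patched):.2f}%")
```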

[1] https://github.com/CachyOS/cachyos-benchmarker
[2] https://webglsamples.org/aquarium/aquarium.html

Tested-by: Piotr Gorski < piotr.gorski@cachyos.org >
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-28 10:30:54 +02:00
Avraham Hollander
6c5d85401d
Merge branch 'sched-ext:main' into main 2024-08-27 23:07:54 -04:00
Avraham Hollander
2a3cbeb760 scx_lavd: Add same power mode clarification to --no-prefer-turbo-core 2024-08-27 23:06:31 -04:00
Changwoo Min
5588126cff scx_lavd: minor optimization for consume_task()
When iterating neighbors, the existing code unnecessarily iterates over
all the neighbors up to the maximum even if there are no neighbors. The
fix exits early when there are no neighbors.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-28 10:26:50 +09:00
Changwoo Min
95272ae910 scx_lavd: proper handling of ctrl-c in a monitoring mode
Ctrl-c wasn't properly handled in the monitoring mode
(`--monitor-sched-samples`), so the scheduler could not be terminated by
pressing ctrl-c. The missing ctrl-c handling is added to the monitor
thread.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-28 10:05:34 +09:00
Changwoo Min
9c4428fd8b scx_lavd: remove unused rust functions
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-28 10:02:11 +09:00
Andrea Righi
a155d5185d scx_bpfland: rely on Topology to classify core types
Rely on scx_utils::Topology to classify Big, Little and Turbo CPUs.

Moreover, support the special keyword "all" with --primary-domain to
include all the CPUs in the system (default).

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-28 00:23:55 +02:00
Andrea Righi
872e653cd2 scx_utils: introduce Turbo core type to Topology
Integrate the logic used by scx_bpfland to detect turbo-boosted cores in
Topology.

Also change the logic to detect Big/Little cores based on
base_frequency instead of scaling_max_freq; otherwise turbo-boosted
cores in homogeneous systems may be incorrectly classified as Big.

Moreover, introduce the following new methods to Cpu to check for the
core type:
 - is_turbo(): return true if the CPU is Turbo, false otherwise
 - is_big(): return true if the CPU is either Turbo or Big
 - is_little(): return true if the CPU is Little
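
The semantics of the three methods can be modelled as follows. This is a Python sketch of the behaviour listed above, not the Rust API in scx_utils; the `CoreType` enum layout and the constructor are assumptions made for illustration.

```python
from enum import Enum


class CoreType(Enum):
    BIG = "big"
    LITTLE = "little"
    TURBO = "turbo"


class Cpu:
    def __init__(self, core_type: CoreType):
        self.core_type = core_type

    def is_turbo(self) -> bool:
        return self.core_type is CoreType.TURBO

    def is_big(self) -> bool:
        # A Turbo CPU also counts as Big.
        return self.core_type in (CoreType.BIG, CoreType.TURBO)

    def is_little(self) -> bool:
        return self.core_type is CoreType.LITTLE


for kind in CoreType:
    cpu = Cpu(kind)
    print(kind.name, cpu.is_turbo(), cpu.is_big(), cpu.is_little())
```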

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-28 00:09:08 +02:00
Daniel Hodges
41cebb807a
Merge pull request #569 from anh0516/main
scx_layered: Clean up in-code documentation; add commas for consistency
2024-08-27 09:47:29 -04:00
Andrea Righi
6768f9f88c
Merge pull request #572 from sched-ext/bpfland-fix-turbo-domain
scx_bpfland: fix turbo boost domain nullifying primary domain limits
2024-08-27 15:23:12 +02:00
Andrea Righi
e0f49a338a scx_bpfland: fix turbo boost domain nullifying primary domain limits
When creating the turbo boost scheduling domain, we might use a full CPU
mask (selecting all possible CPUs) to indicate "do not prioritize turbo
boost CPUs" or when all CPUs have the same maximum frequency.

This approach works when the primary domain also contains all the CPUs,
as the complete overlap allows the CPU selection logic to ignore the
turbo boost domain and start picking CPUs directly from the primary
domain.

However, if the primary domain doesn't include all CPUs, the two domains
won't fully overlap, which can lead to the turbo boost domain
incorrectly including all CPUs, thereby negating the restrictions set by
the primary scheduling domain.

To resolve this, an empty CPU mask should be used for the turbo boost
domain when turbo boost CPUs aren't prioritized. If the turbo boost
domain is empty, it should be entirely bypassed, and the selection
should proceed directly to the primary domain.
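
A simplified model of the problem and the fix, using Python sets in place of cpumasks. The function name and the "lowest-numbered idle CPU" choice are assumptions; the real selection logic in the scheduler is more elaborate.

```python
def pick_cpu(idle, turbo, primary):
    """Try the turbo domain first, then the primary domain.

    An empty domain is bypassed entirely.
    """
    for domain in (turbo, primary):
        if not domain:
            continue
        candidates = idle & domain
        if candidates:
            return min(candidates)
    return None


all_cpus = set(range(8))
primary = {0, 1, 2, 3}   # primary domain restricted to half of the CPUs
idle = {5, 6}            # only CPUs outside the primary domain are idle

# Full mask as "no turbo preference": a CPU outside the primary domain wins.
print(pick_cpu(idle, turbo=all_cpus, primary=primary))

# Empty mask: the turbo domain is skipped and the primary limits hold.
print(pick_cpu(idle, turbo=set(), primary=primary))
```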

Reported-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-27 13:36:50 +02:00
Changwoo Min
00430c3ded scx_lavd: make a loop easier to correctly verify
With an unfortunate combination of an old kernel and an old LLVM, the BPF
verifier incorrectly detects an infinite loop. After changing the loop to
use a constant bound, the old verifier accepts the code.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-27 17:11:20 +09:00
Changwoo Min
09cff560aa
Merge pull request #566 from multics69/lavd-turbo
scx_lavd: prioritize the turbo boost-able cores
2024-08-27 08:47:25 +09:00
Daniel Hodges
83cd26eb9e
Merge pull request #564 from hodgesds/layered-help
scx_layered: Update help for tgid matching
2024-08-26 14:52:53 -04:00
Andrea Righi
35db89e90d
Merge pull request #568 from sched-ext/rustland-core-design-improv
scx_rustland_core: small core design improvements
2024-08-26 20:06:21 +02:00
Avraham Hollander
7a43801d76 Add quotes for clarity 2024-08-26 13:20:01 -04:00
Avraham Hollander
0b6ebf826e scx_lavd, scx_mitosis, scx_rusty: Add comma for grammatical consistency
with the same change in the other schedulers
2024-08-26 13:06:58 -04:00
Avraham Hollander
07039f1f07 scx_layered: Documentation cleanup 2024-08-26 13:03:52 -04:00
Andrea Righi
1427d7d347 scx_rlfifo: enhance code design
Refactor the code design to make it more suitable as a template for
implementing advanced scheduling policies.

In particular, create separate loops for task consumption and task
dispatching. This will make the scheduler easier to adapt as a
foundation for implementing more complex scheduling policies.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-26 16:10:54 +02:00
Daniel Hodges
c45c2de39f scx_layered: Update help for tgid matching
Forgot to add doc for tgid matching

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-26 07:06:21 -07:00
Changwoo Min
9807e561f0 scx_lavd: prioritize the turbo boost-able cores
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-26 17:57:33 +09:00
Changwoo Min
cd5b2bf664 scx_lavd: replace nix signal handler to ctrlc
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-26 17:57:33 +09:00
Changwoo Min
e887c56da0 scx_lavd: add "--version" option, which prints the current version
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-26 17:57:33 +09:00
Changwoo Min
0f97ca3066 scx_lavd: drop time slice calculation in ops.select_cpu()
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-26 17:55:00 +09:00
Changwoo Min
4e3c36ca3f scx_lavd: handle the missing cases in time slice calculation
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-26 11:43:29 +09:00
Changwoo Min
be7d06e280 scx_lavd: make the old BPF verifier happy :-(
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-26 11:43:29 +09:00
Changwoo Min
82f55b95b2 scx_lavd: add a fast path in pick_idle_cpu() when SMT is not activated
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-26 11:43:29 +09:00
Changwoo Min
38779dbe8b scx_lavd: improve pick_idle_cpu()
Now it checks an active cpumask within a previous core's compute domain
before checking the full active CPUs.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-26 11:43:29 +09:00
Changwoo Min
d1d9e97d08 scx_lavd: reduce LAVD_CPDOM_MAX_DIST to 4
The BPF verifier in older kernels gives up analyzing the nested loop
in consume_task(). We make the loop less complex by reducing
LAVD_CPDOM_MAX_DIST from 6 to 4 in order to make the verifier happy.
Note that the theoretical maximum distance is 6 (numa > llc > core type)
but there is no such hardware today, hence reducing it to 4 should be
okay for the next few years, when hopefully the verifier becomes smarter.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-26 11:43:29 +09:00
Changwoo Min
950710990f scx_lavd: move time slice calculation to ops.enqueue() and ops.select_cpu()
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-26 11:43:29 +09:00
Changwoo Min
954b684a70 scx_lavd: update nr_queued_task every system stat update interval
Updating nr_queued_task on every runqueue operation is expensive and
unnecessary. So we update it once per system stat update interval and use
a moving average, which is accurate enough.
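
The message does not say which kind of moving average is used. As an illustration only, the sketch below assumes an integer exponential moving average that gives each new sample a weight of 1/2; `update_avg()` and the sample values are hypothetical.

```python
def update_avg(old_avg: int, sample: int) -> int:
    """Integer moving average: new sample and history weigh 1/2 each."""
    return (old_avg + sample) // 2


# Hypothetical queue lengths read once per update interval.
samples = [8, 0, 16, 4]

avg = 0
history = []
for sample in samples:
    avg = update_avg(avg, sample)
    history.append(avg)

print(history)
```

One cheap update per interval replaces a counter update on every enqueue and dequeue, at the cost of a smoothed rather than exact value.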

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-26 11:43:29 +09:00
Changwoo Min
4f906f1f49 scx_lavd: update README since it supports multi-CCX/NUMA
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-26 11:43:29 +09:00
Changwoo Min
9551657b42 scx_lavd: prefer big cores in the performance mode
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-26 11:43:29 +09:00
Changwoo Min
d4bb35e651 scx_lavd: use itertools::iproduct!() for a nested loop
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-26 11:43:29 +09:00
Changwoo Min
9368c6881d scx_lavd: replace get_task_cpu_id() to scx_bpf_task_cpu()
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-26 11:43:29 +09:00
Andrea Righi
a469f0f1ce
Merge pull request #561 from sched-ext/bpfland-fix-energy-profile-refresh
scx_bpfland: prevent reading energy profile if not available
2024-08-25 18:31:34 +02:00
Tejun Heo
ca13e13ad6
Merge pull request #559 from sched-ext/htejun/cargo-workspace
build: Use workspace to group rust sub-projects
2024-08-25 06:26:18 -10:00
Andrea Righi
f8acd069f0 scx_bpfland: prevent reading energy profile if not available
Avoid periodically reading the current performance profile from
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference if
it's not available (e.g., with older CPUs or kernels without cpufreq).
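
A minimal sketch of the idea, in Python rather than the scheduler's Rust: treat an unreadable sysfs file as "no profile" instead of an error. The function name is hypothetical; the path is the one named above.

```python
from pathlib import Path
from typing import Optional

EPP_PATH = "/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference"


def read_energy_profile(path: str = EPP_PATH) -> Optional[str]:
    """Return the current profile, or None if the file cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


profile = read_energy_profile()
if profile is None:
    print("energy profile not available, skipping periodic refresh")
else:
    print(f"current energy profile: {profile}")
```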

This fixes issue #560.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-25 16:53:35 +02:00
Andrea Righi
8853d9a9f2
Merge pull request #548 from sched-ext/rustland-core-refactoring
scx_rustland_core: user-space framework refactoring
2024-08-25 16:39:28 +02:00
Tejun Heo
43950c65bd build: Use workspace to group rust sub-projects
meson build script was building each rust sub-project under rust/ and
scheds/rust/ separately. This means that each rust project is built
independently which leads to a couple of problems - 1. There are a lot of
shared dependencies but they have to be built over and over again for each
project. 2. Concurrency management becomes sad - we either have to unleash
multiple cargo builds at the same time possibly thrashing the system or
build one by one.

We've been trying to solve this from meson side in vain. Thankfully, in
issue #546, @vimproved suggested using cargo workspace which makes the
sub-projects share the same target directory and be built together by the same
cargo instance while still allowing each project to behave independently for
development and publishing purposes.

Make the following changes:

- Create two cargo workspaces - one under rust/, the other under
  scheds/rust/. Each contains all rust projects underneath it.

- Don't let meson descend into rust/. These are libraries used by the rust
  schedulers. No need to build them from meson. Cargo will build them as
  needed.

- Change the rust_scheds build target to invoke `cargo build` in
  scheds/rust/ and let cargo do its thing.

- Remove per-scheduler meson.build files and instead generate custom_targets
  in scheds/rust/meson.build which invokes `cargo build -p $SCHED`.

- This changes rust binary directory. Update README and
  meson-scripts/install_rust_user_scheds accordingly.

- Remove per-scheduler Cargo.lock as scheds/rust/Cargo.lock is shared by all
  schedulers now.

- Unify .gitignore handling.

The following are build times on Ryzen 3975W:

Before:
  ________________________________________________________
  Executed in  165.93 secs    fish           external
     usr time   40.55 mins    2.71 millis   40.55 mins
     sys time    3.34 mins   36.40 millis    3.34 mins

After:
  ________________________________________________________
  Executed in   36.04 secs    fish           external
     usr time  336.42 secs    0.00 millis  336.42 secs
     sys time   36.65 secs   43.95 millis   36.61 secs

Wallclock time is reduced 5x and CPU time 7x.
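
The ratios can be recomputed from the timings above (usr and sys times added together, minutes converted to seconds):

```python
# Timings copied from the "Before" and "After" blocks above.
before_wall = 165.93                 # seconds
before_cpu = (40.55 + 3.34) * 60     # usr + sys, minutes to seconds
after_wall = 36.04                   # seconds
after_cpu = 336.42 + 36.65           # usr + sys, seconds

wall_speedup = before_wall / after_wall
cpu_speedup = before_cpu / after_cpu

print(f"wallclock: {wall_speedup:.1f}x, CPU: {cpu_speedup:.1f}x")
```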
2024-08-25 00:47:58 -10:00
Andrea Righi
894f9582d0 scx_rustland_core: hide shutdown boilerplate in BpfScheduler
Refactor the code to hide the shutdown handling inside BpfScheduler and
simply use the exited() method to check when the scheduler is stopped.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-25 12:17:04 +02:00
Tejun Heo
152a8471cc scx_bpfland: When reporting stats, use interval deltas
Three of the reported stats are cumulative. While they obviously can be
processed into delta values, that holds for the other direction too and the
cumulative values are difficult to make intuitive sense of. Report interval
delta values instead.

Note that a stats client can reliably build back cumulative values even
under heavy system contention - the delta values reported between two
consecutive reads are guaranteed to be correct regardless of the duration of
the interval.
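
The two directions can be illustrated with a short Python sketch. The readings are hypothetical and the helper name is not part of scx_stats.

```python
from itertools import accumulate


def to_deltas(cumulative):
    """Per-interval deltas between consecutive cumulative readings."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]


# Hypothetical cumulative counter values at four consecutive reads.
readings = [100, 130, 130, 190]

deltas = to_deltas(readings)

# A client can rebuild the cumulative values from the first reading
# plus the running sum of the deltas.
rebuilt = [readings[0] + s for s in accumulate(deltas)]

print(deltas)
print(rebuilt)
```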
2024-08-24 23:14:57 -10:00
Tejun Heo
bd68e230b9 scx_bpfland: Convert to scx_stats
Use scx_stats instead of prometheus for stats reporting. This has a few
advantages:

- Stats metadata can be defined more succinctly.

- Natural support for nesting statistics which will be useful in making
  scheduler components composable.

- Support for multiple programmable readers where each reader can use their
  own reading interval.

- Built-in stats help message generation.

- Openmetrics integration is still available through
  scx_stats/scripts/scxstats_to_openmetrics.py.
2024-08-24 23:14:55 -10:00
Tejun Heo
625381280c scx_stats: Shorten exported names and add prelude module
Let's make it a bit easier to use:

- Shorten exported names by changing the prefix from ScxStats to Stats. This
  should be distinctive enough and more in line with how most libraries name
  their exports.

- Importing the right set of traits can be tricky. Introduce prelude module
  so that importing is a bit less painful.
2024-08-24 22:04:25 -10:00
Andrea Righi
a2e97fecbb scx_rustland_core: merge verbose and debug in the same option
There is no reason to have two separate options for "verbose" and
"debug" mode. Just merge the two and always use "debug". If enabled,
increase verbosity to stdout and enable reporting BPF scheduling events
in debugfs (e.g., /sys/kernel/debug/tracing/trace_pipe).

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-25 09:45:20 +02:00
Andrea Righi
cb16a11342 scx_rustland_core: get rid of the global scheduler's slice_us
Since scx_rustland_core enables setting a time slice on a per-task basis
during task dispatch, there's no need to maintain a global time slice in
the BPF component. Instead, a global time slice can simply be managed in
user-space, achieving the same outcome.

Therefore, drop the global slice_us property from BpfScheduler to
simplify the API.

NOTE: if a time slice is not specified for a task, SCX_SLICE_DFL will be
used by default.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-25 09:45:18 +02:00
Andrea Righi
e404bee5e7 scx_rustland / scx_rlfifo: small code format fixes
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-25 09:44:52 +02:00
Andrea Righi
1cd11ba916 scx_rlfifo: improve documentation and code readability
Add more comments to make the source code more understandable, so that
it can be easily used as a template for implementing more complex
scheduling policies.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-25 09:44:28 +02:00
Tejun Heo
35a4326aee scx_lavd: Drop unnecessary stat field explanation on startup
The scheduling instance no longer prints out sched samples. No reason to
print field explanation on startup.
2024-08-24 18:48:54 -10:00
Changwoo Min
02ad793c78
Merge branch 'main' into htejun/scx_lavd-stats 2024-08-25 11:57:41 +09:00
Changwoo Min
8b1874c27f
Merge pull request #552 from CachyOS/lavd-mutli-cxx2
scx_lavd: Drop message about unsupported multi-CCX support
2024-08-25 11:48:12 +09:00
Tejun Heo
fdfb7f60f4 Merge branch 'main' into htejun/scx_lavd-stats 2024-08-24 15:53:53 -10:00
Tejun Heo
55e5b8b43f scx_lavd: Switch to scx_stats
Scheduling sample reporting is switched to use scx_stats. This makes the
scheduler run without making too much noise while still allowing monitoring
on demand. It can also make introspection more dynamic - e.g. it shouldn't
be difficult to add other monitoring commands which take scheduling samples
based on different criteria or add other types of statistics.

--nr_sched-samples is replaced with --monitor-nr-samples.
2024-08-24 15:53:02 -10:00
Tejun Heo
1bba713a29
Merge pull request #542 from sched-ext/htejun/scx_stats
scx_stats, scx_rusty, scx_layered: Implement `--help-stats`
2024-08-24 15:38:36 -10:00
Peter Jung
906d054770
scx_lavd: Drop message about unsupported multi-CCX support
Signed-off-by: Peter Jung <admin@ptr1337.dev>
2024-08-25 01:10:38 +02:00
Andrea Righi
0aa23481de scx_rustland_core: drop update_tasks() and introduce notify_complete()
The update_tasks() API is somewhat confusing, so replace it with a
clearer API, notify_complete().

This new API will return control to the BPF component and inform it
about the number of tasks still pending in the user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-25 00:45:23 +02:00
Daniel Hodges
e81faef103
Merge pull request #544 from hodgesds/layered-tgid
scx_layered: Add layer match for tgid
2024-08-24 16:58:19 -04:00
Andrea Righi
5ece102554 scx_rustland: get rid of unnecessary debugging information
Additional statistics will be re-added later via scx_stats.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 21:29:10 +02:00
Andrea Righi
cef8ff8757 scx_rustland_core: get rid of the low_power API
The low-power API is a bit of a hack implemented purely in the BPF
layer, this should be better re-implemented with some concepts of
topology awareness.

Therefore, get rid of this API for now.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 21:29:10 +02:00
Andrea Righi
be7ef1009b scx_rlfifo: user-space idle CPU selection
Select an idle CPU from user-space, instead of always dispatching on the
first CPU available.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 21:29:10 +02:00
Andrea Righi
568e292a24 scx_rustland_core: get rid of the exiting task API
The current API used to notify the user-space scheduler when a task
exits is really confusing (setting a negative value in
queued_task_ctx.cpu), and it's also possible to detect task exiting
events from user-space (or check in procfs, even if it's slower).

In any case, a better API should be provided for this, so drop the
current one for now.

NOTE: this will cause additional memory usage for scx_rustland, but it
can be fixed/addressed later in a separate commit (i.e., providing a
periodic garbage collector for the unused task entries).

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 21:29:10 +02:00
Andrea Righi
5d544ea264 scx_rustland_core: move CPU idle selection logic in user-space
Allow user-space scheduler to pick an idle CPU via
self.bpf.select_cpu(pid, prev_task, flags), mimicking the BPF's
select_cpu() interface.

Also remove the full_user option and always rely on the idle selection
logic from user-space.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 21:28:13 +02:00
Andrea Righi
1dd329dd7d scx_rustland: update Cargo.lock
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 20:24:48 +02:00
Andrea Righi
106d59d997 scx_rlfifo: update Cargo.lock
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 20:24:48 +02:00
Andrea Righi
016aae759f
Merge pull request #545 from sched-ext/bpfland-honor-avg-nvcsw
scx_bpfland: always honor average nvcsw in lowlatency mode
2024-08-24 20:24:33 +02:00
Avraham Hollander
66b5dd0de9 Clean up scx_rusty help info a bit 2024-08-24 11:56:12 -04:00
Avraham Hollander
c34a470024 scx_lavd: Fix my own formatting error 2024-08-24 11:36:19 -04:00
Andrea Righi
5a08855a86 scx_bpfland: always honor average nvcsw in lowlatency mode
Keep evaluating the average number of voluntary context switches for
each task when lowlatency mode is enabled, even when interactive tasks
classification is disabled (via `-c 0`).

The average nvcsw is also used in lowlatency mode to evaluate the
proportional bonus to the tasks' deadline and it shouldn't be ignored
when interactive tasks classification is disabled. Moreover, make sure
that such bonus never exceeds the starvation threshold.
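
As a rough illustration of the capped bonus, the sketch below subtracts a bonus proportional to the average nvcsw from the task's vruntime and clamps it to the starvation threshold. The scaling factor, the parameter names and the numbers are hypothetical; the actual formula in scx_bpfland may differ.

```python
def task_deadline(vruntime, avg_nvcsw, bonus_per_nvcsw, starvation_thresh):
    """vruntime minus a bonus proportional to avg_nvcsw, capped at the threshold."""
    bonus = min(avg_nvcsw * bonus_per_nvcsw, starvation_thresh)
    return vruntime - bonus


# Hypothetical values, in nanoseconds.
THRESH = 5_000_000

print(task_deadline(100_000_000, 10, 100_000, THRESH))    # bonus below the cap
print(task_deadline(100_000_000, 1000, 100_000, THRESH))  # bonus capped
```

Without the cap, a task with a very high context switch rate could keep pushing its deadline ahead of everything else.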

Keep in mind that it is still possible to disable the periodic average
nvcsw evaluation with `-c 0`, without specifying `--lowlatency`.

Fixes: 6a22853 ("scx_bpfland: introduce --lowlatency option")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 10:42:22 +02:00
Tejun Heo
48092c6f88 scx_lavd: Relay introspection output in stats::TaskSample
This indirection doesn't make any visible behavior difference now but will
be used to implement scx_stats support.
2024-08-23 18:49:36 -10:00
Tejun Heo
725fa7f1be Merge branch 'main' into htejun/scx_stats 2024-08-23 17:10:08 -10:00
Daniel Hodges
5a2012763e
scx_layered: Add layer match for tgid
Add layer match for tgid.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-23 23:00:28 -04:00
Avraham Hollander
bedb18b48e Improve scx_lavd help info
A lot of scx_lavd's options do not clearly explain what they do. Add
some short explanations, clean up the existing ones, and direct the user
to read the in-code documentation for more info.
2024-08-23 18:56:14 -04:00
Avraham Hollander
d6e27b59e7 Clean up scx_bpfland help info a bit 2024-08-23 18:55:04 -04:00
Tejun Heo
25e437753c scx_layered, scx_rusty: Implement --help-stats
which shows all the defined stats. While at it, make some cosmetic updates.
2024-08-23 12:39:47 -10:00
Tejun Heo
405bcc63fe scx_stats: Make ScxStatsServerData a public carrier of data needed for stats server
And move related ops into it. This is a bit more natural and will also allow
doing other operations (e.g. describing stats) without launching the server.
2024-08-23 12:23:57 -10:00
Tejun Heo
7bd35b6cd3 scx_lavd: Cargo.lock update (caused by scx_utils depending on scx_stats) 2024-08-23 09:21:44 -10:00
Andrea Righi
e72676ede3
Merge pull request #540 from sched-ext/bpfland-cpufreq-awareness
scx_bpfland: cpu frequency and energy awareness
2024-08-23 21:17:34 +02:00
Tejun Heo
9e3b4e6db0 scx_stats: A bit of cleanups and renames 2024-08-23 09:09:02 -10:00
Tejun Heo
b6ccb87bec
Merge pull request #539 from sched-ext/htejun/scx_rusty
scx_rusty: Convert to scx_stats
2024-08-23 08:42:47 -10:00
Daniel Hodges
7d45059fa9
Merge pull request #538 from hodgesds/layered-pid
scx_layered: Add pid/ppid matches
2024-08-23 14:08:40 -04:00
Tejun Heo
8c8912ccea Merge branch 'main' into htejun/scx_rusty 2024-08-23 07:50:23 -10:00
Andrea Righi
50684e4569 scx_bpfland: introduce Intel Turbo Boost awareness
Make `--primary-domain auto` aware of turbo boosted CPUs and prioritize
them over the primary scheduling domain when the energy model
`balance_power` is used (typically when running on battery power with
the "balanced" profile).

With this change the scheduling hierarchy becomes the following:

 1) CPUs in the turbo scheduling domain
 2) CPUs in the primary scheduling domain
 3) full-idle SMT CPUs
 4) CPUs in the same L2 cache
 5) CPUs in the same L3 cache
 6) CPUs in the task's allowed domain

And the idle selection logic is modified as follows:

 - In the turbo scheduling domain:
   - pick same full-idle SMT CPU
   - pick any other full-idle SMT CPU sharing the same L2 cache
   - pick any other full-idle SMT CPU sharing the same L3 cache
   - pick any other full-idle SMT CPU
   - pick same idle CPU
   - pick any other idle CPU sharing the same L2 cache
   - pick any other idle CPU sharing the same L3 cache
   - pick any other idle SMT CPU
 - In the primary scheduling domain:
   - pick same full-idle SMT CPU
   - pick any other full-idle SMT CPU sharing the same L2 cache
   - pick any other full-idle SMT CPU sharing the same L3 cache
   - pick any other full-idle SMT CPU
   - pick same idle CPU
   - pick any other idle CPU sharing the same L2 cache
   - pick any other idle CPU sharing the same L3 cache
   - pick any other idle SMT CPU
 - In the entire task domain:
   - pick any other idle CPU
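
The ordering above can be condensed into three nested loops. The following is a Python model using sets in place of cpumasks, not the BPF implementation; all names, the topology and the "lowest-numbered candidate" tie-break are assumptions.

```python
def pick_idle_cpu(prev, idle, full_idle, l2, l3, turbo, primary, allowed):
    """Pick an idle CPU following the turbo -> primary -> task order."""
    for domain in (turbo, primary):
        # Full-idle SMT cores first, then any idle CPU.
        for pool in (full_idle, idle):
            # Same CPU, then same L2, then same L3, then the whole domain.
            for scope in ({prev}, l2[prev], l3[prev], domain):
                candidates = pool & scope & domain & allowed
                if candidates:
                    return min(candidates)
    # Entire task domain: any idle CPU the task is allowed to use.
    candidates = idle & allowed
    return min(candidates) if candidates else None


# Hypothetical 8-CPU topology: SMT pairs share an L2, groups of 4 share an L3.
l2 = {cpu: {cpu & ~1, cpu | 1} for cpu in range(8)}
l3 = {cpu: set(range(0, 4)) if cpu < 4 else set(range(4, 8)) for cpu in range(8)}
allowed = set(range(8))
primary = {0, 1, 2, 3}
idle = {1, 3, 6, 7}
full_idle = {6, 7}

print(pick_idle_cpu(2, idle, full_idle, l2, l3, {0, 1}, primary, allowed))
print(pick_idle_cpu(2, idle, full_idle, l2, l3, set(), primary, allowed))
```

In the first call the turbo CPU 1 is chosen even though CPU 3 shares an L2 cache with the previous CPU; with an empty turbo domain, CPU 3 is chosen.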

Keep in mind that the turbo domain will be evaluated only when the
scheduler is started with `--primary-domain auto` and only when the
`balance_power` energy profile is used.

The turbo domain is always made using the subset of CPUs in the system
with the highest max frequency. If such subset can't be determined (for
example if all the CPUs in the primary domain have the same
frequency), the turbo domain will be ignored.
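
A minimal sketch of this rule in Python, assuming the maximum frequency of each CPU is already known. The function name and the frequency values are hypothetical.

```python
def turbo_domain(max_freq):
    """Return the CPUs with the highest max frequency.

    `max_freq` maps CPU id -> max frequency (kHz). An empty set means the
    turbo domain could not be determined and is ignored.
    """
    if not max_freq:
        return set()
    top = max(max_freq.values())
    turbo = {cpu for cpu, freq in max_freq.items() if freq == top}
    if len(turbo) == len(max_freq):
        # All CPUs have the same max frequency: no meaningful subset.
        return set()
    return turbo


# Hypothetical frequencies: CPUs 0 and 1 can boost higher than the rest.
print(turbo_domain({0: 5_000_000, 1: 5_000_000, 2: 4_400_000, 3: 4_400_000}))
print(turbo_domain({0: 4_400_000, 1: 4_400_000}))
```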

Prioritizing turbo boosted CPUs can help to improve performance by
forcing the governor to scale up their frequency, without increasing too
much power consumption, due to the fact that tasks will be preferably
confined into a reduced amount of cores.

This change seems to improve performance, without increasing power
consumption much, on Intel laptops while using the `balance_power`
energy profile.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-23 19:49:08 +02:00
Andrea Righi
d958dd4482 scx_bpfland: introduce dynamic energy profile
Introduce the new option `--primary-domain auto`. With this option the
scheduler dynamically adjusts the primary scheduling domain at
run-time, based on the current energy profile reported in
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference.

When the `power` energy profile is selected, the primary scheduling
domain will prioritize E-cores. Alternatively, when the `performance`
profile is selected, it will prioritize P-cores. For all the other
energy profiles, all the CPUs in the system will be used.

Note that this option is only relevant on hybrid architectures with
P-cores and E-cores.
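
The mapping described above can be sketched as follows. This is a Python model with sets in place of cpumasks; the function name and the CPU numbering are hypothetical.

```python
def primary_domain(profile, p_cores, e_cores):
    """Map an energy profile string to the set of CPUs to prioritize."""
    if profile == "power":
        return set(e_cores)
    if profile == "performance":
        return set(p_cores)
    # Any other profile: use all the CPUs in the system.
    return set(p_cores) | set(e_cores)


# Hypothetical hybrid system: CPUs 0-3 are P-cores, CPUs 4-5 are E-cores.
p_cores, e_cores = {0, 1, 2, 3}, {4, 5}

for profile in ("power", "performance", "balance_power"):
    print(profile, sorted(primary_domain(profile, p_cores, e_cores)))
```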

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-23 19:49:01 +02:00
Tejun Heo
44a0f1b124 scx_utils: Factor out monitor_stats() from scx_rusty and scx_layered 2024-08-23 06:46:19 -10:00
Tejun Heo
ae3024e938 scx_layered: Add --stats and make --monitor behavior consistent with scx_rusty 2024-08-23 05:52:52 -10:00
Tejun Heo
0f04a93dd1 scx_rusty: Add stat descriptions and make minor adjustments 2024-08-23 05:46:13 -10:00
Tejun Heo
36865234f8 scx_rusty: Add scx_stats annotations necessary for openmetrics translation 2024-08-23 04:59:08 -10:00
Tejun Heo
2f3f473cd3 scx_rusty: Improve timestamp reporting 2024-08-23 04:31:27 -10:00
Daniel Hodges
11b978a892 scx_layered: Add pid/ppid matches
Add matches for pid/ppid.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-23 07:20:05 -07:00
Tejun Heo
76934f3aab scx_rusty: Convert to scx_stats
This allows scx_rusty to avoid generating excessive logs for statistics
while still allowing detailed monitoring on demand.
2024-08-22 19:44:12 -10:00
Tejun Heo
16c07a5cd9 scx_rusty: Don't reset bpf_stats, remember prev states and calculate delta
This will ease transition to scx_stats.
2024-08-22 13:02:23 -10:00
Tejun Heo
13fa48a871 scx_rusty: Separate out stats generation and formatting
to prepare for scx_stats conversion.
2024-08-22 10:03:10 -10:00
Tejun Heo
b4564520e5 scx_rusty: Simplify Stats structs and take id out of the structs
to prepare for scx_stats conversion. While at it, make some cosmetic
changes.
2024-08-22 08:45:33 -10:00
Andrea Righi
6a2285398d scx_bpfland: introduce --lowlatency option
Introduce the new `--lowlatency` option, which enables switching between
the default pure vruntime-based scheduling (more optimized for server
workloads) and a deadline-based scheduling (better suited for
low-latency workloads).

When the low-latency mode is activated, a task's deadline is calculated
as its vruntime, adjusted by a bonus proportional to the task's average
number of voluntary context switches (the more voluntary context
switches, the shorter the deadline).

This feature enhances the prioritization of interactive tasks even more,
proportionally to their average voluntary context switches, also within
the two main global queues (priority / shared) and it helps to maintain
interactive workloads always responsive, even in presence of heavy
non-interactive background work.

Low-latency mode helps prevent audio crackling even in the presence of a
large number of short-lived tasks with pseudo-interactive behavior (e.g.,
hackbench) and it enables achieving approximately a +33% average
frames-per-second (FPS) in the typical "gaming while building the
kernel" benchmark.

However, it can also amplify the de-prioritization of CPU-intensive
tasks, making this option more suitable for specific low-latency
scenarios. Therefore the low-latency mode is disabled by default and it
can only be enabled via the `--lowlatency` option.

Tested-by: Piotr Gorski (piotrgorski@cachyos.org)
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-22 13:26:19 +02:00
Tejun Heo
4834dec684 scx_rusty: Move stats structs to stats.rs and rename for consistency 2024-08-21 22:04:38 -10:00
Andrea Righi
b0a8e4a91e scx_bpfland: better time slice control
Explicitly replenish the task's time slice from ops.dispatch() if the
task still wants to run and no other task is selected. In this way the
sched_ext core won't automatically re-schedule the task on the same CPU,
implicitly assigning a time slice of SCX_SLICE_DFL.

Moreover, instead of determining the task time slice in ops.enqueue(),
refresh the time slice immediately before the task is started on its
assigned CPU in ops.running().

This allows using a more precise time slice, adjusted based on the
actual number of tasks that are currently waiting to be scheduled.
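
One way such an adjustment could look is shown below: the slice shrinks as more tasks are waiting, down to a minimum. The proportional scaling, the parameter names and the values are assumptions for illustration, not the formula used by scx_bpfland.

```python
def task_slice(slice_max_ns: int, slice_min_ns: int, nr_waiting: int) -> int:
    """Scale the slice down with the number of waiting tasks, with a floor."""
    return max(slice_max_ns // (nr_waiting + 1), slice_min_ns)


# Hypothetical limits: 5 ms maximum, 0.5 ms minimum.
SLICE_MAX, SLICE_MIN = 5_000_000, 500_000

for nr_waiting in (0, 4, 99):
    print(nr_waiting, task_slice(SLICE_MAX, SLICE_MIN, nr_waiting))
```

Evaluating this right before the task starts running, rather than at enqueue time, means the waiting count reflects the current state of the queues.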

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-22 09:23:37 +02:00
Tejun Heo
d6ac5fbd9c scx_layered: Drop SCX_OPS_ENQ_LAST
The meaning of SCX_OPS_ENQ_LAST will change with future kernel updates and
enqueueing on local DSQ will no longer be sufficient to avoid stalls. No
reason to do it anyway. Just drop it.
2024-08-21 13:13:59 -10:00
Tejun Heo
f726f0b73b Version: Cargo.lock 2024-08-21 06:45:19 -10:00
Tejun Heo
4d1f0639d8 Version: v1.0.3 2024-08-21 06:42:11 -10:00