Commit Graph

93 Commits

Author SHA1 Message Date
Jake Hillion
8ca45cfa37
lint: enable cargo fmt (#643)
Use `cargo fmt` with a specific nightly branch in the CI to enforce formatting. Globally format these files while the diff is still small so we can stay on top of it.

Test plan:
- CI lint check passes.
2024-09-11 10:03:20 +01:00
Andrea Righi
e6e3579a92
Merge pull request #634 from anh0516/main
scx_bpfland: Documentation consistency fix
2024-09-10 23:25:55 +02:00
likewhatevs
c4c3659b6d
Merge pull request #638 from likewhatevs/remove-rlimit-dep
remove dependency on rlimit.rs
2024-09-10 03:14:12 -04:00
Andrea Righi
655ed5b4c6 scx_bpfland: use sum_exec_runtime to evaluate task's used time slice
Using p->scx.slice to evaluate the consumed time slice can be a bit
imprecise, because the sched_ext core implements yielding by setting
p->scx.slice to 0.

When the task's vruntime is evaluated this is considered as the task has
exhausted its entire allocated time slice, even though it voluntarily
released the CPU before the slice fully expired.

To avoid this inaccuracy and prevent penalizing tasks that voluntarily
release the CPU, always evaluate the used time slice based on the
difference in the task's total execution time (p->se.sum_exec_runtime).

This method provides a more precise calculation of vruntime and results
in a fairer task's deadline evaluation.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-10 08:03:35 +02:00
patso
c1df85914b
remove dependency on rlimit.rs
the rlimit crate is the only dependency crate
with a build.rs. build.rs files complicate portability.
this removes the need for rlimit.rs
2024-09-10 01:16:53 -04:00
Avraham Hollander
f71cc646a3 scx_bpfland: Fix in README.md for the same text as a comment in the
source
2024-09-06 19:12:33 -04:00
Tejun Heo
46fc2e1a49 version: v1.0.4 2024-09-05 18:12:45 -10:00
Andrea Righi
918cfc613d scx_bpfland: optimize producer/consumer workloads
When selecting an idle CPU for a task that has been woken up, prioritize
reusing the same CPU if the waker and wakee share the same L3 cache.

Otherwise, attempt to migrate the wakee to the waker's CPU, provided it
is allowed by the wakee's scheduling domain.

This seems to consistently improve FPS performance when the system is
not operating over its full capacity.

Example:
 $ __GL_SYNC_TO_VBLANK=0 vblank_mode=0 glxgears -geometry 800x600

 - before: ~18305.77 FPS
 - after:  ~19060.62 FPS

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-05 19:02:09 +02:00
Andrea Righi
844c00fd26 scx_bpfland: enable "auto" mode by default
Rename "turbo domain" to "preferred domain", that conceptually is more
generic and introduce the new option `--preferred-domain CPUMASK`, which
allows users to define the preferred domain, specifying a cpumask as a
hex number. By default ("auto") the scheduler will always try to detect
and use the fastest CPUs in the system.

Moreover, adjust the cpufreq logic to use "auto" both with the
"balance_power" and "balance_performance" EPP profiles.

Then, enable "auto" mode by default: the scheduler will try to
automatically determine the optimal primary domain, preferred domain and
cpufreq level, based on the selected scheduler and energy profiles.

Tested-by: Piotr Gorski < piotr.gorski@cachyos.org >
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-05 16:11:12 +02:00
Andrea Righi
afc7b5404b
Merge pull request #600 from sched-ext/bpfland-cpufreq
scx_bpfland: improve cpufreq awareness
2024-09-05 07:32:10 +02:00
Tejun Heo
f010eda5c0 meson: Remove scheds/rust/*/meson.build
These aren't used since 43950c65 ("build: Use workspace to group rust
sub-projects"). Drop them.
2024-09-04 06:40:17 -10:00
Andrea Righi
918f1db4bd scx_bpfland: dynamically adjust cpufreq level in auto mode
In auto mode, rather than keeping the previous fixed cpuperf factor,
dynamically calculate it based on CPU utilization and apply it before a
task runs within its allocated time slot.

Interactive tasks consistently receive the maximum scaling factor to
ensure optimal performance.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-03 21:36:48 +02:00
Andrea Righi
fe6ac15015 scx_bpfland: improve turbo domain CPU selection
Always consider the turbo domain when running in "auto" mode.

Additionally, when the turbo domain is used, split the CPU idle
selection logic into two stages:
 1) in ops.select_cpu(), provide the task with a second opportunity to
    remain within the same LLC
 2) in ops.enqueue(), perform another check for an idle CPU, allowing
    the task to move to a different LLC if an idle CPU within the same
    LLC is not available.

This allows tasks to stick more on turbo-boosted CPUs and CPUs within
the same LLC.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-03 09:59:29 +02:00
Andrea Righi
70b93ed641 scx_bpfland: skip idle CPU selection for tasks with changing affinity
When tasks are changing CPU affinity it is pointless to try to find an
optimal idle CPU. In this case just skip the the idle CPU selection step
and let the task being dispatched to a global DSQ if needed.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-03 09:59:29 +02:00
Andrea Righi
802d104b46 scx_bpfland: add basic cpufreq support
Add hints for the cpufreq governor based on the selected scheduler's
performance profile and the current energy performance preference (EPP).

With this change applied the scheduler works as following:

scheduler profile (--primary-domain option):
  - default:
    - use all cores
    - cpufreq: use default scaling factor
  - powersave:
    - use E-cores
    - cpufreq: use min scaling factor
  - performance:
    - use P-cores
    - cpufreq: use max scaling factor
  - auto:
    - EPP: power, powersave
      - use E-cores
      - cpufreq: use min scaling factor
    - EPP: balance_power (typically battery-powered systems)
      - use E-cores
      - cpufreq: use default scaling factor
    - EPP: balance_performance, performance
      - use P-cores
      - cpufreq: use max scaling factor

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-09-03 09:59:29 +02:00
Andrea Righi
2cbf252019 scx_bpfland: directly dispatch only per-cpu kthreads with local_kthreads
We want to directly dispatch only kthreads when local_kthreads is
enabled, not all tasks that can run on a single CPU.

Fixes: 7cc1846 ("scx_bpfland: always rely on prev_cpu with single-CPU tasks")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-31 16:35:54 +02:00
Andrea Righi
7cc18460b9 scx_bpfland: always rely on prev_cpu with single-CPU tasks
When selecting an idle for tasks that can only run on a single CPU,
always check if the previously used CPU is sill usable, instead of
trying to figure out the single allowed CPU looking at the task's
cpumask.

Apparently, single-CPU tasks can report a prev_cpu that is not in the
allowed cpumask when they rapidly change affinity.

This could lead to stalls, because we may end up dispatching the kthread
to a per-CPU DSQ that is not compatible with its allowed cpumask.

Example:

kworker/u32:2[173797] triggered exit kind 1026:
  runnable task stall (kworker/2:1[70] failed to run for 7.552s)
...
  R kworker/2:1[70] -7552ms
      scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x8 dsq_vtime=234483011369
      cpus=04

In this case kworker/2 can only run on CPU #2 (cpus=0x4), but it's
dispatched to dsq_id=0x8, that can only be consumed by CPU 8 => stall.

To prevent this, do not try to figure out the best idle CPU for tasks
that are changing affinity and just dispatch them to a global DSQ
(either priority or regular, depending on its interactive state).

Moreover, introduce an explicit error check in dispatch_direct_cpu() to
improve detection of similar issues in the future, and drop
lookup_task_ctx() in favor of try_lookup_task_ctx(), since we can now
safely handle all the cases where the task context is not found.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-30 09:45:58 +02:00
Andrea Righi
28cb1ec5cb scx_bpfland: enhanced task affinity
Aggressively try to keep tasks running on the same CPU / cache / domain,
to achieve higher performance when the system is not over commissioned.

This is done by giving a second chance in ops.enqueue(), in addition to
ops.select_cpu(), to find an idle CPU close to the previously used CPU.

Moreover, even if the task is dispatched to the global DSQs, always try
to check if there is an idle CPU in the primary domain that can
immediately consume the task.

= Results =

This change seems to provide a minor, but consistent, boost of
performance with the CPU-intensive benchmarks from the CachyOS
benchmarks selection [1].

Similar results can also be noticed with some WebGL benchmarks [2], when
system usage is close to its maximum capacity.

Test:
 - cachyos-benchmarker

System:
 - AMD Ryzen 7 5800X 8-Core Processor

Metrics:
 - total time: elapsed time of all benchmarks
 - total score: geometric mean of all benchmarks

NOTE: total time is the most relevant, since it gives a measure of the
aggregate performance, while the total score emphasizes more on
performance consistency across all benchmarks.

== Results: summary ==

 +-------------------------+---------------------+---------------------+
 |         Scheduler       |    Total Time       |    Total Score      |
 |                         |    (less = better)  |    (less = better)  |
 +-------------------------+---------------------+---------------------+
 |                 EEVDF   |  624.44 sec         |      123.68         |
 |               bpfland   |  625.34 sec         |      122.21         |
 | bpfland-task-affinity   |  623.67 sec         |      122.27         |
 +-------------------------+---------------------+---------------------+

== Conclusion ==

With this patch applied, bpfland shows both a better performance and
consistency. Although the gains are small (less than 1%), they are still
significant for this type of benchmark and consistently appear across
multiple runs.

[1] https://github.com/CachyOS/cachyos-benchmarker
[2] https://webglsamples.org/aquarium/aquarium.html

Tested-by: Piotr Gorski < piotr.gorski@cachyos.org >
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-28 10:30:54 +02:00
Andrea Righi
a155d5185d scx_bpfland: rely on Topology to classify core types
Rely on scx_utils::Topology to classify Big, Little and Turbo CPUs.

Moreover, support the special keyword "all" with --primary-domain to
include all the CPUs in the system (default).

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-28 00:23:55 +02:00
Andrea Righi
872e653cd2 scx_utils: introduce Turbo core type to Topology
Integrate the logic used by scx_bpfland to detect turbo-boosted cores in
Topology.

Also change the logic to detect Big/Little cores in function of
base_frequency, instead of scaling_max_freq, otherwise turbo-boosted
cores in homogeneous systems may be incorrectly classified as Big.

Moreover, introduce the following new methods to Cpu to check for the
core type:
 - is_turbo(): return true if the CPU is Turbo, false otherwise
 - is_big(): return true if the CPU is either Turbo or Big
 - is_little(): return true if the CPU is Little

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-28 00:09:08 +02:00
Andrea Righi
e0f49a338a scx_bpfland: fix turbo boost domain nullifying primary domain limits
When creating the turbo boost scheduling domain, we might use a full CPU
mask (selecting all possible CPUs) to indicate "do not prioritize turbo
boost CPUs" or when all CPUs have the same maximum frequency.

This approach works when the primary domain also contains all the CPUs,
as the complete overlap allows the CPU selection logic to ignore the
turbo boost domain and start picking CPUs directly from the primary
domain.

However, if the primary domain doesn't include all CPUs, the two domains
won't fully overlap, which can lead to the turbo boost domain
incorrectly including all CPUs, thereby negating the restrictions set by
the primary scheduling domain.

To resolve this, an empty CPU mask should be used for the turbo boost
domain when turbo boost CPUs aren't prioritized. If the turbo boost
domain is empty, it should be entirely bypassed, and the selection
should proceed directly to the primary domain.

Reported-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-27 13:36:50 +02:00
Andrea Righi
a469f0f1ce
Merge pull request #561 from sched-ext/bpfland-fix-energy-profile-refresh
scx_bpfland: prevent reading energy profile if not available
2024-08-25 18:31:34 +02:00
Andrea Righi
f8acd069f0 scx_bpfland: prevent reading energy profile if not available
Avoid to periodically read the current performance profile from
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference if
it's not available (i.e., with older CPUs or kernels without cpufreq).

This fixes issue #560.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-25 16:53:35 +02:00
Tejun Heo
43950c65bd build: Use workspace to group rust sub-projects
meson build script was building each rust sub-project under rust/ and
scheds/rust/ separately. This means that each rust project is built
independently which leads to a couple problems - 1. There are a lot of
shared dependencies but they have to be built over and over again for each
proejct. 2. Concurrency management becomes sad - we either have to unleash
multiple cargo builds at the same time possibly thrashing the system or
build one by one.

We've been trying to solve this from meson side in vain. Thankfully, in
issue #546, @vimproved suggested using cargo workspace which makes the
sub-projects share the same target directory and built together by the same
cargo instance while still allowing each project to behave independently for
development and publishing purposes.

Make the following changes:

- Create two cargo workspaces - one under rust/, the other under
  scheds/rust/. Each contains all rust projects underneath it.

- Don't let meson descend into rust/. These are libraries used by the rust
  schedulers. No need to build them from meson. Cargo will build them as
  needed.

- Change the rust_scheds build target to invoke `cargo build` in
  scheds/rust/ and let cargo do its thing.

- Remove per-scheduler meson.build files and instead generate custom_targets
  in scheds/rust/meson.build which invokes `cargo build -p $SCHED`.

- This changes rust binary directory. Update README and
  meson-scripts/install_rust_user_scheds accordingly.

- Remove per-scheduler Cargo.lock as scheds/rust/Cargo.lock is shared by all
  schedulers now.

- Unify .gitignore handling.

The followings are build times on Ryzen 3975W:

Before:
  ________________________________________________________
  Executed in  165.93 secs    fish           external
     usr time   40.55 mins    2.71 millis   40.55 mins
     sys time    3.34 mins   36.40 millis    3.34 mins

After:
  ________________________________________________________
  Executed in   36.04 secs    fish           external
     usr time  336.42 secs    0.00 millis  336.42 secs
     sys time   36.65 secs   43.95 millis   36.61 secs

Wallclock time is reduced 5x and CPU time 7x.
2024-08-25 00:47:58 -10:00
Tejun Heo
152a8471cc scx_bpfland: When reporting stats, use interval deltas
Three of the reported stats are cumulative. While they obviously can be
processed into delta values, that holds for the other direction too and the
cumulative values are difficult to make intutive sense of. Report interval
delta values instead.

Note that a stats client can reliably build back cumulative values even
under heavy system contention - the delta values reported between two
consecutive reads are guaranteed to be correct regardless of the duration of
the interval.
2024-08-24 23:14:57 -10:00
Tejun Heo
bd68e230b9 scx_bpfland: Convert to scx_stats
Use scx_stats instead of prometheus for stats reporting. This has a few
advantages:

- Stats metadata can be defined more succinctly.

- Natural support for nesting statistics which will be useful in making
  scheduler components composable.

- Support for multiple programmable readers where each reader can use their
  own reading interval.

- Built-in stats help message generation.

- Openmetrics integration is still available through
  scx_stats/scripts/scxstats_to_openmetrics.py.
2024-08-24 23:14:55 -10:00
Tejun Heo
1bba713a29
Merge pull request #542 from sched-ext/htejun/scx_stats
scx_stats, scx_rusty, scx_layered: Implement `--help-stats`
2024-08-24 15:38:36 -10:00
Andrea Righi
5a08855a86 scx_bpfland: always honor average nvcsw in lowlatency mode
Keep evaluating the average number of voluntary context switches for
each task when lowlatency mode is enabled, even when interactive tasks
classification is disabled (via `-c 0`).

The average nvcsw is also used in lowlatency mode to evaluate the
proportional bonus to the tasks' deadline and it shouldn't be ignored
when interactive tasks classification is disabled. Moreover, make sure
that such bonus never exceeds the starvation threshold.

Keep in mind that it is still possible to disable the periodic average
nvcsw evaluation with `-c 0`, without specifying `--lowlatency`.

Fixes: 6a22853 ("scx_bpfland: introduce --lowlatency option")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 10:42:22 +02:00
Avraham Hollander
d6e27b59e7 Clean up scx_bpfland help info a bit 2024-08-23 18:55:04 -04:00
Tejun Heo
9e3b4e6db0 scx_stats: A bit of cleanups and renames 2024-08-23 09:09:02 -10:00
Andrea Righi
50684e4569 scx_bpfland: introduce Intel Turbo Boost awareness
Make `--primar-domain auto` aware of turbo boosted CPUs and prioritize
them over the primary scheduling domain when the energy model
`balance_power` is used (typically when running on battery power with
the "balanced" profile).

With this change the scheduling hierarchy becomes the following:

 1) CPUs in the turbo scheduling domain
 2) CPUs in the primary scheduling domain
 3) full-idle SMT CPUs
 4) CPUs in the same L2 cache
 5) CPUs in the same L3 cache
 6) CPUs in the task's allowed domain

And the idle selection logic is modified as following:

 - In the turbo scheduling domain:
   - pick same full-idle SMT CPU
   - pick any other full-idle SMT CPU sharing the same L2 cache
   - pick any other full-idle SMT CPU sharing the same L3 cache
   - pick any other full-idle SMT CPU
   - pick same idle CPU
   - pick any other idle CPU sharing the same L2 cache
   - pick any other idle CPU sharing the same L3 cache
   - pick any other idle SMT CPU
 - In the primary scheduling domain:
   - pick same full-idle SMT CPU
   - pick any other full-idle SMT CPU sharing the same L2 cache
   - pick any other full-idle SMT CPU sharing the same L3 cache
   - pick any other full-idle SMT CPU
   - pick same idle CPU
   - pick any other idle CPU sharing the same L2 cache
   - pick any other idle CPU sharing the same L3 cache
   - pick any other idle SMT CPU
 - In the entire task domain:
   - pick any other idle CPU

Keep in mind that the turbo domain will be evaluated only when the
scheduler is started with `--primary-domain auto` and only when the
`balance_power` energy profile is used.

The turbo domain is always made using the subset of CPUs in the system
with the highest max frequency. If such subset can't be determined (for
example if all the CPUs in the primary domain have all the same
frequency), the turbo domain will be ignored.

Prioritizing turbo boosted CPUs can help to improve performance by
forcing the governor to scale up their frequency, without increasing too
much power consumption, due to the fact that tasks will be preferably
confined into a reduced amount of cores.

This change seems to improve performance, without increasing much
power consuption, on Intel laptops while using the `balanced_power`
energy profile.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-23 19:49:08 +02:00
Andrea Righi
d958dd4482 scx_bpfland: introduce dynamic energy profile
Introduce the new option `--primary-domain auto`. With this option the
scheduler will dynamically adjusts the primary scheduling domain at
run-time, in function of the current energy profile reported in
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference.

When the `power` energy profile is selected, the primary scheduling
domain will prioritize E-cores. Alternatively, when the `performance`
profile is selected, it will prioritize P-cores. For all the other
energy profiles, all the CPUs in the system will be used.

Note that this option is only relevant on hybrid architectures with
P-cores and E-cores.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-23 19:49:01 +02:00
Andrea Righi
6a2285398d scx_bpfland: introduce --lowlatency option
Introduce the new `--lowlatency` option, which enables switching between
the default pure vruntime-based scheduling (more optimized for server
workloads) and a deadline-based scheduling (better suited for
low-latency workloads).

When the low-latency mode is activated, a task's deadline is calculated
as its vruntime, adjusted by a bonus proportional to the task's average
number of voluntary context switches (the more voluntary context
switches, the shorter the deadline).

This feature enhances the prioritization of interactive tasks even more,
proportionally to their average voluntary context switches, also within
the two main global queues (priority / shared) and it helps to maintain
interactive workloads always responsive, even in presence of heavy
non-interactive background work.

Low-latency mode allows to prevent audio cracking even in presence of a
large amount of short-lived tasks with pseudo-interactive behavior (i.e,
hackbench) and it enables achieving approximately a +33% average
frames-per-second (FPS) in the typical "gaming while building the
kernel" benchmark.

However, it can also amplify the de-prioritization of CPU-intensive
tasks, making this option more suitable for specific low-latency
scenarios. Therefore the low-latency mode is disabled by default and it
can only be enabled via the `--lowlatency` option.

Tested-by: Piotr Gorski (piotrgorski@cachyos.org)
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-22 13:26:19 +02:00
Andrea Righi
b0a8e4a91e scx_bpfland: better time slice control
Explicitly replenish the task's time slice from ops.dispatch() if the
task still wants to run and no other task is selected. In this way the
sched_ext core won't automatically re-schedule the task on the same CPU,
implicitly assigning a time slice of SCX_SLICE_DFL.

Moreover, instead of determining the task time slice in ops.enqueue(),
refresh the time slice immediately before the task is started on its
assigned CPU in ops.running().

This allows to use a more precise time slice, adjusted based on the
actual amount of tasks that are currently waiting to be scheduled.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-22 09:23:37 +02:00
Tejun Heo
f726f0b73b Version: Cargo.lock 2024-08-21 06:45:19 -10:00
Tejun Heo
4d1f0639d8 Version: v1.0.3 2024-08-21 06:42:11 -10:00
Andrea Righi
fedfee0bd6 scx_bpfland: drop unused variable
With the global scx_utils::NR_CPU_IDS we don't need Topology anymore in
init_primary_domain(), so drop the variable to fix the following build
warning:

warning: unused variable: `topo`
   --> src/main.rs:385:9
    |
385 |         topo: &Topology,
    |         ^^^^ help: if this is intentional, prefix it with an underscore: `_topo`
    |
    = note: `#[warn(unused_variables)]` on by default

Fixes: 1da249f ("scx_utils::topology: Always use NR_CPU_IDS and NR_CPUS_POSSIBLE")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-21 17:46:12 +02:00
Andrea Righi
9f7a11bba6
Merge pull request #528 from sched-ext/bpfland-turbo-boost
scx_bpfland: properly classify Intel Turbo Boost CPUs
2024-08-21 17:40:25 +02:00
Tejun Heo
9c62019c81
Merge pull request #527 from sched-ext/htejun/scx_utils
scx_utils::cpumask,topology: Misc updates
2024-08-20 22:25:25 -10:00
Andrea Righi
695e3b25b0 scx_bpfland: classify CPUs depending of their the base frequency
Use the base frequency, instead of maximum frequency, to classify fast
and slow CPUs. This ensures accurate distinction between Intel Turbo
Boost CPUs and genuinely faster CPUs when auto-detecting the primary
scheduling domain.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-21 10:16:41 +02:00
Tejun Heo
1da249f063 scx_utils::topology: Always use NR_CPU_IDS and NR_CPUS_POSSIBLE
Always use the LazyLock versions and drop the counterparts from Topology.
2024-08-20 21:57:56 -10:00
Andrea Righi
c85315d527 scx_bpfland: allow to completely disable interactive classification
Tasks enqueued with SCX_ENQ_WAKEUP are immediately classified as
interactive. However, if interactive tasks classification is disabled
(via `-c 0`), we should avoid promoting them as interactive.

This is particularly important because, with the nvcsw logic disabled,
tasks can remain classified as interactive indefinitely and they will
never be demoted to regular tasks.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-21 08:45:13 +02:00
Andrea Righi
a9f5aaa536 scx_bpfland: replace custom CpuMask with scx_utils::Cpumask
Rely on scx_utils::Cpumask instead of re-implementing a custom struct to
parse and manage CPU masks.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-21 07:21:52 +02:00
Andrea Righi
467d4b5ea4 scx_bpfland: get topology information from scx_utils::Topology
Rely on scx_utils::Topology to get CPU and cache information, instead of
re-implementing custom methods.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-20 10:16:02 +02:00
Andrea Righi
f8a2445869 scx_bpfland: introduce performance/powersave primary domain
The primary scheduling domain represents a group of CPUs in the system
where the scheduler will initially attempt to assign tasks. Tasks will
only be dispatched to CPUs within this primary domain until they are
fully utilized, after which tasks may overflow to other available CPUs.

The primary scheduling domain can defined using the option
`--primary-domain CPUMASK` (by default all the CPUs in the system are
used as primary domain).

This change introduces two new special values for the CPUMASK argument:
 - `performance`: automatically detect the fastest CPUs in the system
   and use them as primary scheduling domain,
 - `powersave`: automatically detect the slowest CPUs in the system and
   use them as primary scheduling domain.

The current logic only supports creating two groups: fast and slow CPUs.

The fast CPU group is created by excluding CPUs with the lowest
frequency from the overall set, which means that within the fast CPU
group, CPUs may have different maximum frequencies.

When using the `performance` mode the fast CPUs will be used as primary
domain, whereas in `powersave` mode, the slow CPUs will be used instead.

This option is particularly useful in hybrid architectures (with P-cores
and E-cores), as it allows the use of bpfland to prioritize task
scheduling on either P-cores or E-cores, depending on the desired
performance profile.

Example:

 - Dell Precision 5480
   - CPU: 13th Gen Intel(R) Core(TM) i7-13800H
     - P-cores:  0-11 / max freq: 5.2GHz
     - E-cores: 12-19 / max freq: 4.0GHz

 $ scx_bpfland --primary-domain performance

  0[|||||||||                24.5%]  10[||||||||                  22.8%]
  1[||||||                   14.9%]  11[|||||||||||||             36.9%]
  2[||||||                   16.2%]  12[                           0.0%]
  3[|||||||||                25.3%]  13[                           0.0%]
  4[|||||||||||              33.3%]  14[                           0.0%]
  5[||||                      9.9%]  15[                           0.0%]
  6[|||||||||||              31.5%]  16[                           0.0%]
  7[|||||||                  17.4%]  17[                           0.0%]
  8[||||||||                 23.4%]  18[                           0.0%]
  9[|||||||||                26.1%]  19[                           0.0%]

  Avg power consumption: 3.29W

 $ scx_bpfland --primary-domain powersave

  0[|                         2.5%]  10[                           0.0%]
  1[                          0.0%]  11[                           0.0%]
  2[                          0.0%]  12[||||                       8.0%]
  3[                          0.0%]  13[|||||||||||||||||||||     64.2%]
  4[                          0.0%]  14[||||||||||                29.6%]
  5[                          0.0%]  15[|||||||||||||||||         52.5%]
  6[                          0.0%]  16[|||||||||                 24.7%]
  7[                          0.0%]  17[||||||||||                30.4%]
  8[                          0.0%]  18[|||||||                   22.4%]
  9[                          0.0%]  19[|||||                     12.4%]

  Avg power consumption: 2.17W

(Info collected from htop and turbostat)

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-19 20:19:21 +02:00
Andrea Righi
174993f9d2 scx_bpfland: introduce cache awareness
While the system is not saturated the scheduler will use the following
strategy to select the next CPU for a task:
  - pick the same CPU if it's a full-idle SMT core
  - pick any full-idle SMT core in the primary scheduling group that
    shares the same L2 cache
  - pick any full-idle SMT core in the primary scheduling grouop that
    shares the same L3 cache
  - pick the same CPU (ignoring SMT)
  - pick any idle CPU in the primary scheduling group that shares the
    same L2 cache
  - pick any idle CPU in the primary scheduling group that shares the
    same L3 cache
  - pick any idle CPU in the system

While the system is completely saturated (no idle CPUs available), tasks
will be dispatched on the first CPU that becomes available.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-19 20:19:21 +02:00
Tejun Heo
c16b48d7b2 scheds/rust: Include Cargo.lock in the repo
Binary packages are expected to include Cargo.lock in the repo so that the
produced binaries match across different builds.
2024-08-15 23:08:35 -10:00
Andrea Righi
f9a994412d scx_bpfland: introduce primary scheduling domain
Allow to specify a primary scheduling domain via the new command line
option `--primary-domain CPUMASK`, where CPUMASK can be a hex number of
arbitrary length, representing the CPUs assigned to the domain.

If this option is not specified the scheduler will use all the available
CPUs in the system as primary domain (no behavior change).

Otherwise, if a primary scheduling domain is defined, the scheduler will
try to dispatch tasks only to the CPUs assigned to the primary domain,
until these CPUs are saturated, at which point tasks may overflow to
other available CPUs.

This feature can be used to prioritize certain cores over others and it
can be really effective in systems with heterogeneous cores (e.g.,
hybrid systems with P-cores and E-cores).

== Example (hybrid architecture) ==

Hardware:
 - Dell Precision 5480 with 13th Gen Intel(R) Core(TM) i7-13800H
   - 6 P-cores 0..5  with 2 CPUs each (CPU from  0..11)
   - 8 E-cores 6..13 with 1 CPU  each (CPU from 12..19)

== Test ==

WebGL application (https://webglsamples.org/aquarium/aquarium.html):
this allows to generate a steady workload in the system without
over-saturating the CPUs.

Use different scheduler configurations:

 - EEVDF (default)
 - scx_bpfland using P-cores only (--primary-domain 0x00fff)
 - scx_bpfland using E-cores only (--primary-domain 0xff000)

Measure performance (fps) and power consumption (W).

== Result ==

                  +-----+-----+------+-----+----------+
                  | min | max | avg  |       |        |
                  | fps | fps | fps  | stdev | power  |
+-----------------+-----+-----+------+-------+--------+
| EEVDF           | 28  | 34  | 31.0 |  1.73 |  3.5W  |
| bpfland-p-cores | 33  | 34  | 33.5 |  0.29 |  3.5W  |
| bpfland-e-cores | 25  | 26  | 25.5 |  0.29 |  2.2W  |
+-----------------+-----+-----+------+-------+--------+

Using a primary scheduling domain of only P-cores with scx_bpfland
allows to achieve a more stable and predictable level of performance,
with an average of 33.5 fps and an error of ±0.5 fps.

In contrast, using EEVDF results in an average frame rate of 31.0 fps
with an error of ±3.0 fps, indicating slightly less consistency, due to
the fact that tasks are evenly distributed across all the cores in the
system (both slow and fast cores).

On the other hand, using a scheduling domain solely of E-cores with
scx_bpfland results in a lower average frame rate (25.5 fps), though it
maintains a stable performance (error of ±0.5 fps), but the power
consumption is also reduced, averaging 2.2W, compared to 3.5W with
either of the other configurations.

== Conclusion ==

In summary, with this change users have the flexibility to prioritize
scheduling on performance cores for better performance and consistency,
or prioritize energy efficient cores for reduced power consumption, on
hybrid architectures.

Moreover, this feature can also be used to minimize the number of cores
used by the scheduler, until they reach full capacity. This capability
can be useful for reducing power consumption even in homogeneous systems
or for conducting scheduling experiments with smaller sets of cores,
provided the system is not overcommitted.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-14 16:17:54 +02:00
Andrea Righi
a6e977c70b scx_bpfland: make output more compact
Abbreviate the statistics reported to stdout and remove the slice_ms
metric: this metric can be easily derived from slice_ns, slice_ns_min
and nr_wait, which is already reported to stdout.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-14 16:17:54 +02:00
Andrea Righi
8656effa50 scx_bpfland: update copyright info
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-14 16:17:54 +02:00