Commit Graph

1293 Commits

Author SHA1 Message Date
Andrea Righi
f9d7844b2e scx_bpfland: split direct dispatches and kthread dispatches
Show separate statistics for direct dispatches and kthread direct
dispatches.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-05 09:27:59 +02:00
Andrea Righi
86d2f50230
Merge pull request #410 from sched-ext/bpfland-smooth-perf
scx_bpfland: enhance performance consistency and predictability
2024-07-04 21:37:07 +02:00
Andrea Righi
d98516fe75
Merge pull request #413 from vax-r/Remove_unused_variable
scx_rustland_core: Remove unused variable
2024-07-04 19:09:27 +02:00
I Hsin Cheng
1595da78dc scx_rustland_core: Remove unused variable
Remove unused variable "tctx" in rustland_select_cpu.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-05 01:04:49 +08:00
I Hsin Cheng
aae826b1b3 scx_flatcg: Make good use of __sync_fetch_and_sub()
Fetch the value of "delta" directly from the returned value from
__sync_fetch_and_sub, as it returns the origin value of
cgc->cvtime_delta.

Additional fetching instruction of cgc->cvtime_delta would be redundant
here.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-05 01:03:20 +08:00
I Hsin Cheng
3e52761487 scx_flatcg: Fix_typo
Fix "oppotunistic" to "opportunistic".

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-04 22:04:40 +08:00
Andrea Righi
cfe2ed063d scx_bpfland: time-based starvation prevention
Tasks are consumed from various DSQs in the following order:

  per-CPU DSQs => priority DSQ => shared DSQ

Tasks in the shared DSQ may be starved by those in the priority DSQ,
which in turn may be starved by tasks dispatched to any per-CPU DSQ.

To mitigate this, record the timestamp of the last task scheduling event
both from the priority DSQ and the shared DSQ.

If the starvation threshold is exceeded without consuming a task, the
scheduler will be forced to consume a task from the corresponding DSQ.

The starvation threshold can be adjusted using the --starvation-thresh
command line parameter (default is 5ms).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-04 10:52:39 +02:00
Andrea Righi
9e0db4ae17 scx_bpfland: remove unnecessary RCU read protection
There is no need to RCU protect the cpumask for the offline CPUs: it is
created once when the scheduler is initialized and it's never
deallocated.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-04 10:24:43 +02:00
Andrea Righi
cef6ca93cf scx_bpfland: adjust default time slice to 5ms
Reduce the default time slice down to 5ms for a faster reaction and
system responsiveness when the system is overcomissioned.

This also helps to provide a more predictable level of performance.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-04 10:24:43 +02:00
Andrea Righi
7d15e3171c scx_bpfland: ensure task time slice never exceeds the slice_ns limit
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-04 10:24:43 +02:00
Andrea Righi
e8a4d350ad scx_bpfland: unify dispatching kthreads with direct CPU dispatches
Always use direct CPU dispatch for kthreads, there is no need to treat
kthreads in a special way, simply reuse direct CPU dispatch to
prioritize them.

Moreover, change direct CPU dispatches to use scx_bpf_dispatch_vtime(),
since we may dispatch multiple tasks to the same per-CPU DSQ now.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-03 09:38:43 +02:00
Andrea Righi
d2231b0aed scx_bpfland: drop built-in idle CPU selection logic
Small refactoring of the idle CPU selection logic:
 - optimize idle CPU selection for tasks that can run on a single CPU
 - drop the built-in idle selection policy and completely rely on the
   custom one

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-03 08:54:17 +02:00
Andrea Righi
a72c9058a3
Merge pull request #409 from sched-ext/bpfland-fix-idle-cpumask
scx_bpfland: use the right cpumask to find any idle CPU
2024-07-01 21:37:35 +02:00
Andrea Righi
7c355f50b2 scx_bpfland: use the right cpumask to find any idle CPU
We are incorrectly using the SMT idle cpumask to find any idle CPU, fix
by using the generic idle cpumask.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-01 20:36:24 +02:00
Andrea Righi
c458f345b4
Merge pull request #408 from sched-ext/bpfland-cpu-hotplug
scx_bpfland: support CPU hotplugging
2024-07-01 19:41:00 +02:00
Dan Schatzberg
32ac4b2cff
Merge pull request #389 from dschatzberg/mitosis
mitosis: Update synchronization
2024-07-01 09:44:26 -04:00
Andrea Righi
ff7a518d28 scx_bpfland: support CPU hotplugging
Implement CPU hotplugging in scx_bpfland without restarting the
scheduler.

The idle selection logic has been updated to consider online CPUs.
Additionally, a cpumask for offline CPUs has been introduced. Tasks
that have been dispatched to the DSQs associated with offline CPUs are
consumed by the other CPUs that are still online.

Moreover, the dependency on the Topology crate is temporarily dropped
and instead, /sys/devices/system/cpu/smt/active is used to determine if
SMT should be taken into account during idle selection. The Topology
crate will be re-introduced later when scx_bpfland will gain more
topology-aware capabilities.

This fixes #406.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-30 23:04:13 +02:00
Andrea Righi
f965ceb572
Merge pull request #407 from sched-ext/rusty-fix-stats-init
scx_rusty: fix stats map initialization
2024-06-30 20:03:48 +02:00
Andrea Righi
d76551bbd3 scx_rusty: fix stats map initialization
The stats map in scx_rusty is a BPF_MAP_TYPE_PERCPU_ARRAY, with its size
determined by num_possible_cpus(). Initializing it with nr_cpu_ids() can
result in errors such as:

 Error: Failed to zero stat

 Caused by:
     number of values 6 != number of cpus 8

Fix by using num_possible_cpus() to initialize it.

Fixes: 263e02f6 ("rusty: Use nr_cpu_ids instead of nr_cpus_possible")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-30 17:37:14 +02:00
Andrea Righi
14a33b6275
Merge pull request #404 from sirlucjan/config-update3
scheds: Add scx_bpfland scheduler to /etc/default/scx
2024-06-28 22:08:56 +02:00
Piotr Gorski
ee7c0cbea6
scheds: Add scx_bpfland scheduler to /etc/default/scx
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-06-28 22:01:24 +02:00
Andrea Righi
338dd0dafb
Merge pull request #403 from sched-ext/bpfland-meson
scx_bpfland: properly integrate with meson build
2024-06-28 21:58:17 +02:00
Andrea Righi
74175f5a49 scx_bpfland: properly integrate with meson build
Properly honor the meson build `serialize` option.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-28 21:37:00 +02:00
Andrea Righi
f98c35fd07
Merge pull request #388 from sched-ext/bpfland
scheds: introduce scx_bpfland
2024-06-28 21:27:43 +02:00
Andrea Righi
183b1b2cfc
Merge pull request #399 from sched-ext/meson-serialize
meson: introduce serialize build option
2024-06-28 20:13:16 +02:00
Andrea Righi
b7977f1ce3
Merge pull request #402 from sched-ext/meson-check-git-commit
meson: check if commit exists in remote git repos
2024-06-28 19:03:02 +02:00
Andrea Righi
273728fd2b meson: check if commit exists in remote git repos
When fetching external git repositories (libbpf and bpftool) we don't
check if the target commit exists.

This can leads to issues such as #400, because we may silently use HEAD,
instead of the specified commit.

Prevent this by returning an error when the target SHA1 cannot be found.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-28 15:16:56 +02:00
Andrea Righi
38a05d49f8
Merge pull request #401 from CachyOS/feat/cargo-rel-with-plain
meson: run cargo build in release mode when using plain buildtype
2024-06-28 14:49:12 +02:00
Jose Fernandez
f64cdd1a51
sched_utils: Add log recorder format customization
This change adds the ability to customize the log recorder format for
each metric type. There is a default format that is used if no custom
`MetricFormatter` is provided. This is the same format that was used
before this change.

The `MetricFormatter` should be implemented by the user to customize the
format of the log recorder. The `LogRecorderBuilder` now takes a
`MetricFormatter` as an optional parameter.

Following changes will allow additional customization of the log
recorder format, such as how many metrics are logged per line.

Signed-off-by: Jose Fernandez <josef@netflix.com>
2024-06-28 06:21:03 -06:00
Vladislav Nepogodin
22f13e2284
meson: run cargo build in release mode when using plain buildtype 2024-06-28 16:10:16 +04:00
Andrea Righi
657fb6a4aa
Merge pull request #400 from sched-ext/fix-bpftool
meson: restore previous libbpf version and update bpftool
2024-06-28 13:21:46 +02:00
Andrea Righi
39a06c86f1 meson: restore previous libbpf version and update bpftool
The upstrem bpftool git repo (https://github.com/libbpf/bpftool.git) is
periodically force pushed and the specific commit that we needed is not
available anymore.

Instead of failing we are actually fetching the latest bpftool (HEAD)
that introduced some breakage initially fixed by commit e59c48a6
("Update libbpf commit hash").

However, updating libbpf seems to introduce a run-time problem and all
the schedulers are failing to start:

 libbpf: failed to find skeleton map ''
 libbpf: failed to populate skeleton maps for 'bpf_bpf': -3

So, revert libbpf to the previous version and update the commit for
bpftool to use a version that still allows to generate a compatible BPF
skel.

Fixes: e59c48a6 ("Update libbpf commit hash")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-28 12:43:37 +02:00
Andrea Righi
cf4883fbf8 meson: introduce serialize build option
With commit 5d20f89a ("scheds-rust: build rust schedulers in sequence"),
schedulers are now built serially one after the other to prevent meson
and cargo from forking NxN parallel tasks.

However, this change has made building a single scheduler much more
cumbersome, due to the chain of dependencies.

For example, building scx_rusty using the specific meson target would
still result in all schedulers being built, because they all depend on
each other.

To address this issue, introduce the new meson build option
`serialize=true|false` (default is false).

This option allows to disable the schedulers' build chain, restoring the
old behavior.

With this option enabled, it is now possible to build just a single
scheduler, parallelizing the cargo build properly, without triggering
the build of the others. Example:

  $ meson setup build -Dbuildtype=release -Dserialize=false
  $ meson compile -C build scx_rusty

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-28 10:17:37 +02:00
Tejun Heo
769e6c4e55
Merge pull request #396 from otteryc/main
Update libbpf commit hash
2024-06-27 21:09:00 -10:00
Yu-Cheng Chen
e59c48a6da Update libbpf commit hash
libbpf has added a member `link` in the struct `bpf_map_skeleton`, which causes
a build failure in SCX. This commit updates the libbpf commit hash to the latest
version to avoid this error.

Please refer to commit be998aa in libbpf repo.

Signed-off-by: Yu-Cheng Chen <otteryc210@gmail.com>
2024-06-28 08:42:14 +08:00
Changwoo Min
b18ebfea91
Merge pull request #395 from multics69/lavd-opt-v5
scx_lavd: optimization for communicaiton-intensive workloads and heavily-overloaded cases
2024-06-28 09:28:32 +09:00
Changwoo Min
24a238846e scx_lavd: optimizing deadline related tunables
The competition window was 7.5 msec, half of the targeted latency.
However, it is too wide for some workloads, so unrelated tasks may
compete with each other. Hence, it is tightened to about 1 msec with
LAVD_LAT_WEIGHT_SHIFT to avoid unnecessary competition.

Also, when a system is overloaded, now the time space is stretched more
aggressively (i.e., lat_prio^2) when a task's latency priority is low
(high value).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-28 09:00:45 +09:00
Andrea Righi
02bf883737
Merge pull request #394 from sched-ext/scx-utils-debug-info-btf
scx_utils: clarify error about missing CONFIG_DEBUG_INFO_BTF
2024-06-27 20:50:42 +02:00
Andrea Righi
188b3d3bfc ci: enable stress tests for scx_bpfland
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-27 17:28:42 +02:00
Andrea Righi
7606b95150 scx_bpfland: introduce maximum time slice lag
Introduce a tunable to set a limit of the minimum vruntime that is used
when a task is dispatched, as:

 vtime_min = vtime_now - slice_lag_ns

Increasing the time slice lag can make interactive tasks even more
responsive at the cost of starving regular and newly created tasks.

Default time slice lag is 0.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-27 17:28:42 +02:00
Andrea Righi
5a44329d45 scheds: introduce scx_bpfland
Overview
========

This scheduler is derived from scx_rustland, but it is fully implemented
in BFP with minimal user-space Rust part to process command line
options, collect metrics and logs out scheduling statistics.

Unlike scx_rustland, all scheduling decisions are made by the BPF
component.

Motivation
==========

The primary goal of this scheduler is to act as a performance baseline
for comparison with scx_rustland, allowing for a better assessment of
the overhead caused by kernel/user-space interactions.

It can also be used to deploy prototypes initially tested in the
scx_rustland scheduler. In fact, this scheduler is expected to
outperform scx_rustland, due to the elimitation of the kernel/user-space
overhead.

Scheduling policy
=================

scx_bpfland is a vruntime-based sched_ext scheduler that prioritizes
interactive workloads. Its scheduling policy closely mirrors
scx_rustland, but it has been re-implemented in BPF with some small
adjustments.

Tasks are categorized as either interactive or regular based on their
average rate of voluntary context switches per second: tasks that exceed
a specific voluntary context switch threshold are classified as
interactive.

Interactive tasks are prioritized in a higher-priority DSQ, while
regular tasks are placed in a lower-priority DSQ. Within each queue,
tasks are sorted based on their weighted runtime, using the built-in scx
vtime ordering capabilities (scx_bpf_dispatch_vtime()).

Moreover, each task gets a time slice budget. When a task is dispatched,
it receives a time slice equivalent to the remaining unused portion of
its previously allocated time slice (with a minimum threshold applied).

This gives latency-sensitive workloads more chances to exceed their time
slice when needed to perform short bursts of CPU activity without being
interrupted (i.e., real-time audio encoding / decoding workloads).

Results
=======

According to the initial test results, using the same benchmark "playing
a videogame while recompiling the kernel", this scheduler seems to
provide a +5% improvement in the frames-per-second (fps) compared to
scx_rustland, with video games such as Cyberpunk 2077, Counter-Strike 2
and Baldur's Gate 3.

Initial test results indicate that this scheduler offers around a +5%
improvement in frames-per-second (fps) compared to scx_rustland when
using the benchmark "playing a video game while recompiling the kernel".

This improvement was observed in games such as Cyberpunk 2077,
Counter-Strike 2, and Baldur's Gate 3.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-27 17:28:42 +02:00
Changwoo Min
f86d564d89 scx_lavd: fast path for ops.dispatch() when fully loaded
When fully loaded so all CPUs are using, skip checking the cpumask.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-27 18:00:39 +09:00
Andrea Righi
a7965abdbc scx_utils: clarify error about missing CONFIG_DEBUG_INFO_BTF
If CONFIG_DEBUG_INFO_BTF is not enabled in the kernel, the C schedulers
report the following error via libbpf, clearly indicating the missing
kernel config:

 libbpf: kernel BTF is missing at '/sys/kernel/btf/vmlinux', was CONFIG_DEBUG_INFO_BTF enabled?

In contrast, the Rust schedulers report a less clear error:

 thread 'main' panicked at /home/arighi/src/scx/rust/scx_utils/src/compat.rs:23:9:
 btf__load_vmlinux_btf() returned NULL
 note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Make sure to report a similar error, so that users have a better clue
about the missing kernel config. After this change the error looks like
the following:

 thread 'main' panicked at /home/arighi/src/scx/rust/scx_utils/src/compat.rs:23:9:
 btf__load_vmlinux_btf() returned NULL, was CONFIG_DEBUG_INFO_BTF enabled?

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-27 09:15:43 +02:00
David Vernet
8537c1b474
Merge pull request #393 from sched-ext/revert-382-rusty_refactor
Revert "scx_rusty: Refactor ridx assignment in populate_tasks_by_load"
2024-06-26 16:36:11 -05:00
David Vernet
fe3ce64a9b
Revert "scx_rusty: Refactor ridx assignment in populate_tasks_by_load" 2024-06-26 17:35:22 -04:00
Changwoo Min
41d60aef04
Merge pull request #391 from multics69/lavd-tuning-v4
scx_lavd: tweaks to avoid fork starvation
2024-06-27 00:21:48 +09:00
Andrea Righi
d26a76f238
Merge pull request #390 from sirlucjan/scx-update2
Revert "Add After=graphical.target into service"
2024-06-26 12:11:49 +02:00
Piotr Gorski
1659152a62
Revert "Add After=graphical.target into service"
This reverts commit f7e575808b.

Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-06-26 12:08:21 +02:00
Changwoo Min
abc6e31fef scx_lavd: for a forked task, inherit its parent's statistics
The old approach was too conservative in running a new task, so when a
fork-heavy workload competes with a CPU-bound workload, the fork-heavy
one is starved. The new approach solves the starvation problem by
inheriting parent's statistics. It seems a good (at least better than
old) guess how a new task will behave.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-26 19:00:10 +09:00
Changwoo Min
ac9c49f5b5 scx_lavd: loosen the deadline when overloaded
When the system is highly loaded with compute-intensive tasks, the old
setting chokes latensive-intensive tasks, so loosen the dealine when the
system is overloaded (> 100% utilization).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-26 15:06:31 +09:00