Commit Graph

574 Commits

Author SHA1 Message Date
Tejun Heo
54c487731a Update to vmlinux-v6.10-rc2-g1edab907b57d.h
Sync to vmlinux.h from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap:
Pick idle CPU for direct dispatch on !wakeup enqueues"). This most likely
will be the commit which will be merged during the upcoming kernel v6.11
merge window.

Unfortunately, this is a compatibility breaking change. As the size of
bpf_iter_scx_dsq is reduced, schedulers that use the iterator - scx_lavd and
scx_layered - won't be able to run on older kernels. Likewise, older
binaries from before this commit won't be able to run on newer kernels.
2024-07-12 11:13:34 -10:00
Tejun Heo
f261d0f037 Sync from kernel - 1edab907b57d
Sync from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap: Pick idle
CPU for direct dispatch on !wakeup enqueues")

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-6.11

- cgroup support hasn't landed in the upstream kernel yet. This most likely
  will happen in a few weeks. For the time being, disable scx_flatcg,
  scx_pair and scx_mitosis.

- Compat macro for DSQ task iterator dropped. This is now a part of
  the baseline.

- scx_bpf_consume() isn't upstream yet. BPF interfacing side is still being
  discussed. Dropped example usage from tools/sched_ext. None of the
  practical schedulers use it, so this should be fine for now.

- scx_bpf_cpu_rq() added.

- AUTOATTACH workaround for newer libbpf versions added.
2024-07-12 11:08:41 -10:00
Changwoo Min
00fdc1d949
Merge pull request #417 from multics69/lavd-vdeadline
scx_lavd: improve virtual deadline and current clock handling
2024-07-12 14:05:44 +09:00
Changwoo Min
d4bc92bea7 scx_lavd: print lat_cri to output
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-12 13:23:56 +09:00
Changwoo Min
4c5c564523 scx_lavd: initial current logical clock to zero
To easily distinguish, let's initialize the current logical clock to
zero (not the current physical time). Also, avoid the deadline
calculation being zero by adding +1 here and there.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-12 10:15:54 +09:00
Andrea Righi
640bd562ff scx_bpfland: prevent tasks from abusing interactive priority boost
The priority boost for interactive tasks can be exploited to render the
system nearly unresponsive by creating numerous tasks that constantly
switch between wait/wakeup states.

For example, stress tests like `hackbench -l 10000` can significantly
degrade system responsiveness.

To mitigate this, limit the number of interactive tasks added to the
priority queue to 4x the number of online CPUs.

This simple approach appears to be a quite effective at identifying
potential spam of "fake" interactive tasks, while still prioritizing
legitimate interactive tasks.

Additionally, periodically refresh the interactive status of the tasks
based on their most recent average of voluntary context switches,
preventing the interactive status from being too "sticky".

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 16:13:55 +02:00
Andrea Righi
1babb2b92d scx_bpfland: prevent per-CPU kthreads starving other tasks
Avoid dispatching per-CPU kthreads directly, since this may cause
interactivity problems or unfairness, for example if there are too many
softirqs being scheduled (e.g., in presence of high RX network traffic
or when running certain stress tests, like hackbench).

Moreover, in order to help with testing and benchmarks, introduce the
option --local-kthread, that allows to restore the old behavior if
enabled.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 16:13:48 +02:00
Andrea Righi
c3ebdd338f scx_bpfland: prevent slice delta overflow
When updating the task vruntime, ensure the time slice delta is always a
positive value. Failing to do so may cause the global vruntime to
increase excessively due to overflows.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 15:58:01 +02:00
Andrea Righi
f59aa52fe7 scx_bpfland: expose the amount of online CPUs
Periodically report the amount of online CPUs to stdout.

The online CPUs are initially evaluated looking at the online cpumask,
then the value is updated in the .cpu_offline() / .cpu_online()
callbacks.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 15:58:01 +02:00
Andrea Righi
3a47b484af scx_bpfland: report interactive tasks to stdout
Keep track of the CPUs that are running interactive tasks and report
their amount to stdout.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 15:58:01 +02:00
Andrea Righi
1a1a16b9e9 scx_bpfland: fix typo in slice_ns definition
The correct default value of slice_ns 5ms, not 5s.

This change doesn't really make any difference in practice, since these
values are changed by the Rust part when the scheduler is started, but
it's good to keep this aligned to the proper values for consistency.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 15:58:01 +02:00
Changwoo Min
bdbfeb9fd1 scx_lavd: use logical current clock for virtual deadlines
This commit changes the use of a physical clock to a virtual, logical
clock in calculating deadlines.

- The virtual current clock advances upon a task's running to its
  virtual deadline.

- When enqueuing a task, its virtual deadline from the virtual current
  clock is calculated.

With the above two changes, this guarantees that there is no such task
whose virtual deadline is smaller than the virtual current clock. This
means any enqueuing task can compete with any other already enqueued
tasks. This allows a latency-critical task to be immediately scheduled
if needed.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 22:41:56 +09:00
Changwoo Min
408ea7892c scx_lavd: induce sched_prio_to_latency_weight from slice weight
So sched_prio_to_latency_weight is removed.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:37:21 +09:00
Changwoo Min
bd964acff6 scx_lavd: deprioritize a newly forked task in latency
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:36:32 +09:00
Changwoo Min
48debe416e scx_lavd: tuning the deadline equation under high load
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:35:54 +09:00
Changwoo Min
c72e063680 scx_lavd: do not use lat_prio_to_greedy_thresholds
With other optimizations, lat_prio_to_greedy_thresholds is not effective
any more.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:35:01 +09:00
Changwoo Min
9ed488798e scx_lavd: use task's runtime to determine its deaddline
It has an effect of further perferring shorter jobs.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:34:25 +09:00
Changwoo Min
e081b2a294 scx_lavd: rename LAVD_MAX_CAS_RETRY to LAVD_MAX_RETRY
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:33:56 +09:00
Andrea Righi
995577762a scx_bpfland: refill task time slice
Every time we need to dispatch a task re-evalate its time slice as:

 (unused_time_slice + min_time_slice) / 2

This allows to refill the time slice for tasks that haven't used much of
their previously assigned time, improving fairness.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-06 14:07:24 +02:00
Andrea Righi
6a64182ef2 scx_bpfland: always classify interactive tasks
Make sure to always classify interactive tasks, even when the system is
not fully utilized. This ensures that if the system suddenly becomes
overloaded, we already know which tasks need to be dispatched to the
priority DSQ.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-06 14:07:24 +02:00
Andrea Righi
8dd528abfd scx_bpfland: pass enqueue flags when dispatching kthreads
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-06 14:07:10 +02:00
Andrea Righi
fc0d1bd003
Merge pull request #415 from sched-ext/bpfland-output
scx_bpfland: additional stats and output improvements
2024-07-05 19:50:07 +02:00
Tejun Heo
af5e89e73c
Merge pull request #412 from vax-r/flatcg_delta_fetch
scx_flatcg: Make good use of __sync_fetch_and_sub()
2024-07-05 07:39:22 -10:00
Tejun Heo
14d0a0ef64
Merge pull request #411 from vax-r/Fix_typo
scx_flatcg: Fix_typo
2024-07-05 07:35:31 -10:00
Andrea Righi
2bc8f800e7 scx_bpfland: report build id version
Use the version string provided by scx_utils:build_id.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-05 09:29:29 +02:00
Andrea Righi
bdb31e98e2 scx_bpfland: show statistics in a more human-readable format
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-05 09:29:29 +02:00
Andrea Righi
f9d7844b2e scx_bpfland: split direct dispatches and kthread dispatches
Show separate statistics for direct dispatches and kthread direct
dispatches.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-05 09:27:59 +02:00
I Hsin Cheng
aae826b1b3 scx_flatcg: Make good use of __sync_fetch_and_sub()
Fetch the value of "delta" directly from the returned value from
__sync_fetch_and_sub, as it returns the origin value of
cgc->cvtime_delta.

Additional fetching instruction of cgc->cvtime_delta would be redundant
here.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-05 01:03:20 +08:00
I Hsin Cheng
3e52761487 scx_flatcg: Fix_typo
Fix "oppotunistic" to "opportunistic".

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-04 22:04:40 +08:00
Andrea Righi
cfe2ed063d scx_bpfland: time-based starvation prevention
Tasks are consumed from various DSQs in the following order:

  per-CPU DSQs => priority DSQ => shared DSQ

Tasks in the shared DSQ may be starved by those in the priority DSQ,
which in turn may be starved by tasks dispatched to any per-CPU DSQ.

To mitigate this, record the timestamp of the last task scheduling event
both from the priority DSQ and the shared DSQ.

If the starvation threshold is exceeded without consuming a task, the
scheduler will be forced to consume a task from the corresponding DSQ.

The starvation threshold can be adjusted using the --starvation-thresh
command line parameter (default is 5ms).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-04 10:52:39 +02:00
Andrea Righi
9e0db4ae17 scx_bpfland: remove unnecessary RCU read protection
There is no need to RCU protect the cpumask for the offline CPUs: it is
created once when the scheduler is initialized and it's never
deallocated.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-04 10:24:43 +02:00
Andrea Righi
cef6ca93cf scx_bpfland: adjust default time slice to 5ms
Reduce the default time slice down to 5ms for a faster reaction and
system responsiveness when the system is overcomissioned.

This also helps to provide a more predictable level of performance.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-04 10:24:43 +02:00
Andrea Righi
7d15e3171c scx_bpfland: ensure task time slice never exceeds the slice_ns limit
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-04 10:24:43 +02:00
Andrea Righi
e8a4d350ad scx_bpfland: unify dispatching kthreads with direct CPU dispatches
Always use direct CPU dispatch for kthreads, there is no need to treat
kthreads in a special way, simply reuse direct CPU dispatch to
prioritize them.

Moreover, change direct CPU dispatches to use scx_bpf_dispatch_vtime(),
since we may dispatch multiple tasks to the same per-CPU DSQ now.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-03 09:38:43 +02:00
Andrea Righi
d2231b0aed scx_bpfland: drop built-in idle CPU selection logic
Small refactoring of the idle CPU selection logic:
 - optimize idle CPU selection for tasks that can run on a single CPU
 - drop the built-in idle selection policy and completely rely on the
   custom one

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-03 08:54:17 +02:00
Andrea Righi
7c355f50b2 scx_bpfland: use the right cpumask to find any idle CPU
We are incorrectly using the SMT idle cpumask to find any idle CPU, fix
by using the generic idle cpumask.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-01 20:36:24 +02:00
Andrea Righi
c458f345b4
Merge pull request #408 from sched-ext/bpfland-cpu-hotplug
scx_bpfland: support CPU hotplugging
2024-07-01 19:41:00 +02:00
Dan Schatzberg
32ac4b2cff
Merge pull request #389 from dschatzberg/mitosis
mitosis: Update synchronization
2024-07-01 09:44:26 -04:00
Andrea Righi
ff7a518d28 scx_bpfland: support CPU hotplugging
Implement CPU hotplugging in scx_bpfland without restarting the
scheduler.

The idle selection logic has been updated to consider online CPUs.
Additionally, a cpumask for offline CPUs has been introduced. Tasks
that have been dispatched to the DSQs associated with offline CPUs are
consumed by the other CPUs that are still online.

Moreover, the dependency on the Topology crate is temporarily dropped
and instead, /sys/devices/system/cpu/smt/active is used to determine if
SMT should be taken into account during idle selection. The Topology
crate will be re-introduced later when scx_bpfland will gain more
topology-aware capabilities.

This fixes #406.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-30 23:04:13 +02:00
Andrea Righi
d76551bbd3 scx_rusty: fix stats map initialization
The stats map in scx_rusty is a BPF_MAP_TYPE_PERCPU_ARRAY, with its size
determined by num_possible_cpus(). Initializing it with nr_cpu_ids() can
result in errors such as:

 Error: Failed to zero stat

 Caused by:
     number of values 6 != number of cpus 8

Fix by using num_possible_cpus() to initialize it.

Fixes: 263e02f6 ("rusty: Use nr_cpu_ids instead of nr_cpus_possible")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-30 17:37:14 +02:00
Andrea Righi
74175f5a49 scx_bpfland: properly integrate with meson build
Properly honor the meson build `serialize` option.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-28 21:37:00 +02:00
Andrea Righi
f98c35fd07
Merge pull request #388 from sched-ext/bpfland
scheds: introduce scx_bpfland
2024-06-28 21:27:43 +02:00
Andrea Righi
cf4883fbf8 meson: introduce serialize build option
With commit 5d20f89a ("scheds-rust: build rust schedulers in sequence"),
schedulers are now built serially one after the other to prevent meson
and cargo from forking NxN parallel tasks.

However, this change has made building a single scheduler much more
cumbersome, due to the chain of dependencies.

For example, building scx_rusty using the specific meson target would
still result in all schedulers being built, because they all depend on
each other.

To address this issue, introduce the new meson build option
`serialize=true|false` (default is false).

This option allows to disable the schedulers' build chain, restoring the
old behavior.

With this option enabled, it is now possible to build just a single
scheduler, parallelizing the cargo build properly, without triggering
the build of the others. Example:

  $ meson setup build -Dbuildtype=release -Dserialize=false
  $ meson compile -C build scx_rusty

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-28 10:17:37 +02:00
Changwoo Min
24a238846e scx_lavd: optimizing deadline related tunables
The competition window was 7.5 msec, half of the targeted latency.
However, it is too wide for some workloads, so unrelated tasks may
compete with each other. Hence, it is tightened to about 1 msec with
LAVD_LAT_WEIGHT_SHIFT to avoid unnecessary competition.

Also, when a system is overloaded, now the time space is stretched more
aggressively (i.e., lat_prio^2) when a task's latency priority is low
(high value).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-28 09:00:45 +09:00
Andrea Righi
7606b95150 scx_bpfland: introduce maximum time slice lag
Introduce a tunable to set a limit of the minimum vruntime that is used
when a task is dispatched, as:

 vtime_min = vtime_now - slice_lag_ns

Increasing the time slice lag can make interactive tasks even more
responsive at the cost of starving regular and newly created tasks.

Default time slice lag is 0.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-27 17:28:42 +02:00
Andrea Righi
5a44329d45 scheds: introduce scx_bpfland
Overview
========

This scheduler is derived from scx_rustland, but it is fully implemented
in BFP with minimal user-space Rust part to process command line
options, collect metrics and logs out scheduling statistics.

Unlike scx_rustland, all scheduling decisions are made by the BPF
component.

Motivation
==========

The primary goal of this scheduler is to act as a performance baseline
for comparison with scx_rustland, allowing for a better assessment of
the overhead caused by kernel/user-space interactions.

It can also be used to deploy prototypes initially tested in the
scx_rustland scheduler. In fact, this scheduler is expected to
outperform scx_rustland, due to the elimitation of the kernel/user-space
overhead.

Scheduling policy
=================

scx_bpfland is a vruntime-based sched_ext scheduler that prioritizes
interactive workloads. Its scheduling policy closely mirrors
scx_rustland, but it has been re-implemented in BPF with some small
adjustments.

Tasks are categorized as either interactive or regular based on their
average rate of voluntary context switches per second: tasks that exceed
a specific voluntary context switch threshold are classified as
interactive.

Interactive tasks are prioritized in a higher-priority DSQ, while
regular tasks are placed in a lower-priority DSQ. Within each queue,
tasks are sorted based on their weighted runtime, using the built-in scx
vtime ordering capabilities (scx_bpf_dispatch_vtime()).

Moreover, each task gets a time slice budget. When a task is dispatched,
it receives a time slice equivalent to the remaining unused portion of
its previously allocated time slice (with a minimum threshold applied).

This gives latency-sensitive workloads more chances to exceed their time
slice when needed to perform short bursts of CPU activity without being
interrupted (i.e., real-time audio encoding / decoding workloads).

Results
=======

According to the initial test results, using the same benchmark "playing
a videogame while recompiling the kernel", this scheduler seems to
provide a +5% improvement in the frames-per-second (fps) compared to
scx_rustland, with video games such as Cyberpunk 2077, Counter-Strike 2
and Baldur's Gate 3.

Initial test results indicate that this scheduler offers around a +5%
improvement in frames-per-second (fps) compared to scx_rustland when
using the benchmark "playing a video game while recompiling the kernel".

This improvement was observed in games such as Cyberpunk 2077,
Counter-Strike 2, and Baldur's Gate 3.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-27 17:28:42 +02:00
Changwoo Min
f86d564d89 scx_lavd: fast path for ops.dispatch() when fully loaded
When fully loaded so all CPUs are using, skip checking the cpumask.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-27 18:00:39 +09:00
David Vernet
fe3ce64a9b
Revert "scx_rusty: Refactor ridx assignment in populate_tasks_by_load" 2024-06-26 17:35:22 -04:00
Changwoo Min
abc6e31fef scx_lavd: for a forked task, inherit its parent's statistics
The old approach was too conservative in running a new task, so when a
fork-heavy workload competes with a CPU-bound workload, the fork-heavy
one is starved. The new approach solves the starvation problem by
inheriting parent's statistics. It seems a good (at least better than
old) guess how a new task will behave.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-26 19:00:10 +09:00
Changwoo Min
ac9c49f5b5 scx_lavd: loosen the deadline when overloaded
When the system is highly loaded with compute-intensive tasks, the old
setting chokes latensive-intensive tasks, so loosen the dealine when the
system is overloaded (> 100% utilization).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-26 15:06:31 +09:00