858 Commits

Author SHA1 Message Date
Tejun Heo
bc1bb5c50f Update libbpf and bpftool commits to the latest
For better compat feature support (ignoring ops which are NULL'd out).
2024-06-06 14:26:45 -10:00
Tejun Heo
3e3720fc7f scx_utils: Add compat support for ops.tick() and ops.dump*()
Match rust scx_ops_load!()'s compat support with C's SCX_OPS_LOAD().
2024-06-06 14:16:36 -10:00
Tejun Heo
200af60f2a scx_layered: Fix load failure due to scheduler_tick() -> sched_tick() rename
- scx_utils: Replace kfunc_exists() with ksym_exists() which doesn't care
  about the type of the symbol.

- scx_layered: Fix load failure on kernels >= v6.10-rc due to the
  scheduler_tick() -> sched_tick() rename. Attach the tick fentry function to
  either scheduler_tick() or sched_tick().
2024-06-06 12:54:59 -10:00
Andrea Righi
def1ad2947
Merge pull request #336 from sched-ext/rustland-max-time-slice-limit
scx_rustland: never use a time slice that exceeds the default value
2024-06-06 18:34:10 +02:00
Tejun Heo
1dbeed752c
Merge pull request #335 from sirlucjan/config-update
scx: update /etc/default/scx sample flags
2024-06-06 06:32:15 -10:00
Andrea Righi
8a3ee7b801 scx_rustland: never use a time slice that exceeds the default value
Never assign a time slice longer than the default time slice, which is
now used as an upper limit.

This seems to prevent potential stall conditions (reported by the
CachyOS community) when running CPU-intensive workloads, such as:

 [   68.062813] sched_ext: BPF scheduler "rustland" errored, disabling
 [   68.062831] sched_ext: runnable task stall (ollama_llama_se[3312] failed to run for 5.180s)
 [   68.062832]    scx_watchdog_workfn+0x154/0x1e0
 [   68.062837]    process_one_work+0x18e/0x350
 [   68.062839]    worker_thread+0x2fa/0x490
 [   68.062841]    kthread+0xd2/0x100
 [   68.062842]    ret_from_fork+0x34/0x50
 [   68.062844]    ret_from_fork_asm+0x1a/0x30
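
The upper limit described in this commit can be sketched as a simple clamp.
This is an illustration only: DEFAULT_SLICE_NS and clamp_slice_ns() are
hypothetical names, and the 5 ms value is an assumption, not the scheduler's
actual default.

```rust
/// Hypothetical default time slice in nanoseconds (5 ms). The real value is
/// whatever the scheduler is configured with.
const DEFAULT_SLICE_NS: u64 = 5_000_000;

/// Caps a computed time slice so that it never exceeds the default one.
fn clamp_slice_ns(computed_ns: u64) -> u64 {
    computed_ns.min(DEFAULT_SLICE_NS)
}
```

With this clamp, a task whose computed slice is 50 ms would still be given
only the default slice, while shorter slices pass through unchanged.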

Fixes: 6f4cd853 ("scx_rustland: introduce virtual time slice")
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-06 17:56:23 +02:00
Piotr Gorski
4558d5c3dd
scx: update /etc/default/scx sample flags
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-06-06 17:52:21 +02:00
Andrea Righi
3d62866774
Merge pull request #333 from sched-ext/rustland-virtual-time-slice
scx_rustland: introduce virtual time slice
2024-06-05 07:40:22 +02:00
Tejun Heo
3e921ccb74
Merge pull request #332 from sirlucjan/services-update4
scx.service: start service after graphical target
2024-06-04 11:20:44 -10:00
Andrea Righi
6f4cd853f9 scx_rustland: introduce virtual time slice
Overview
========

Currently, a task's time slice is determined based on the total number
of tasks waiting to be scheduled: the more overloaded the system, the
shorter the time slice.

This approach can help to reduce the average wait time of all tasks,
allowing them to progress more slowly, but uniformly, thus providing a
smoother overall system performance.

However, under heavy system load, this approach can lead to very short
time slices distributed among all tasks, causing excessive context
switches that can badly affect soft real-time workloads.

Moreover, the scheduler tends to operate in a bursty manner (tasks are
queued and dispatched in bursts). This can also result in fluctuations
of longer and shorter time slices, depending on the number of tasks
still waiting in the scheduler's queue.

Such behavior can also negatively impact soft real-time workloads,
such as real-time audio processing.

Virtual time slice
==================

To mitigate this problem, introduce the concept of virtual time slice:
the idea is to evaluate the optimal time slice of a task, considering
the vruntime as a deadline for the task to complete its work before
releasing the CPU.

This is accomplished by calculating the difference between the task's
vruntime and the global current vruntime and using this value as the
task's time slice:

  task_slice = task_vruntime - min_vruntime

In this way, tasks that "promise" to release the CPU quickly (based on
their previous work pattern) get a much higher priority (due to
vruntime-based scheduling and the additional priority boost for being
classified as interactive), but they are also given a shorter time slice
to complete their work and fulfill their promise of rapidity.

At the same time, tasks that are more CPU-intensive get de-prioritized,
but they tend to have a longer time slice available, which reduces the
number of context switches that can negatively affect their performance.

In conclusion, latency-sensitive tasks get a high priority and a short
time slice (and they can preempt other tasks), CPU-intensive tasks get
low priority and a long time slice.
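
The task_slice formula above can be sketched as follows. The function name
virtual_time_slice() is hypothetical, and the use of saturating_sub() is an
assumption of this sketch (it yields 0 rather than underflowing when a task's
vruntime is behind min_vruntime); the commit message does not say how that
case is handled.

```rust
/// Virtual time slice: how far the task's vruntime is ahead of the global
/// minimum vruntime. saturating_sub() is an assumption of this sketch: it
/// returns 0 instead of underflowing when the task is behind min_vruntime.
fn virtual_time_slice(task_vruntime: u64, min_vruntime: u64) -> u64 {
    task_vruntime.saturating_sub(min_vruntime)
}
```

A task with a vruntime close to min_vruntime (an interactive task) gets a
short slice, while a task far ahead of it (a CPU-intensive task) gets a long
one.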

Example
=======

Let's consider the following theoretical scenario:

 task | time
 -----+-----
   A  | 1
   B  | 3
   C  | 6
   D  | 6

In this case task A represents a short interactive task, tasks C and D
are CPU-intensive tasks and task B is mainly interactive, but it also
requires some CPU time.

With a uniform time slice, scaled based on the number of tasks, the
scheduling looks like this (assuming the time slice is 2):

 A B B C C D D A B C C D D C C D D
  |   |   |   | | |   |   |   |
  `---`---`---`-`-`---`---`---`----> 9 context switches

With the virtual time slice the scheduling changes to this:

 A B B C C C D A B C C C D D D D D
  |   |     | | | |     |
  `---`-----`-`-`-`-----`----------> 7 context switches

In the latter scenario, tasks do not receive the same time slice scaled
by the total number of tasks waiting to be scheduled. Instead, their
time slice is adjusted based on their previous CPU usage. Tasks that
used more CPU time are given longer slices and their processing time
tends to be packed together, reducing the number of context switches.

Meanwhile, latency-sensitive tasks can still be processed as soon as
they need to, because they get a higher priority and they can preempt
other tasks. However, they will get a short time slice, so tasks that
were incorrectly classified as interactive will still be forced to
release the CPU quickly.
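
The context switch counts in the two schedules above can be checked with a
small helper. This is only a sketch: context_switches() is a hypothetical
name, and it assumes a schedule is written as whitespace-separated task names,
one per time unit, as in the diagrams.

```rust
/// Counts context switches in a schedule written as whitespace-separated
/// task names, one entry per time unit. A switch is counted every time two
/// consecutive entries name different tasks.
fn context_switches(schedule: &str) -> usize {
    let tasks: Vec<&str> = schedule.split_whitespace().collect();
    tasks.windows(2).filter(|pair| pair[0] != pair[1]).count()
}
```

Applied to "A B B C C D D A B C C D D C C D D" this gives 9, and applied to
"A B B C C C D A B C C C D D D D D" it gives 7, matching the diagrams.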

Experimental results
====================

This patch has been tested on an AMD Ryzen 7 5800X 8-Core Processor
(16 threads with SMT), 16GB RAM, NVIDIA GeForce RTX 3070.

The test case involves the usual benchmark of playing a video game while
simultaneously overloading the system with a parallel kernel build
(`make -j32`).

The average frames per second (fps) reported by Steam is used as a
metric for measuring system responsiveness (the higher the better):

 Game                       |  before |  after  | delta  |
 ---------------------------+---------+---------+--------+
 Baldur's Gate 3            |  40 fps |  48 fps | +20.0% |
 Counter-Strike 2           |   8 fps |  15 fps | +87.5% |
 Cyberpunk 2077             |  41 fps |  46 fps | +12.2% |
 Terraria                   |  98 fps | 108 fps | +10.2% |
 Team Fortress 2            |  81 fps |  92 fps | +13.6% |
 WebGL demo (firefox) [1]   |  32 fps |  42 fps | +31.2% |
 ---------------------------+---------+---------+--------+

Apart from the massive boost with Counter-Strike 2 (which should be taken
with a grain of salt, considering the overall poor performance in both
cases), the virtual time slice seems to systematically provide a boost
in responsiveness of around +10-20% fps.
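
For reference, the delta column in the table above is the relative change
between the two fps values. A small sketch, with fps_delta_percent() as a
hypothetical helper name:

```rust
/// Relative change from `before` to `after`, expressed as a percentage.
fn fps_delta_percent(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

For example, going from 8 fps to 15 fps is (15 - 8) / 8 = 87.5%, and going
from 40 fps to 48 fps is 20%.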

It also seems to largely prevent audio crackling issues when the system
is massively overloaded: no audio crackling was detected during the
entire run of these tests with the virtual time slice change applied.

[1] https://webglsamples.org/aquarium/aquarium.html

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-04 23:01:13 +02:00
Piotr Gorski
1505164ca0
scx.service: start service after graphical target
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-06-04 22:29:33 +02:00
Andrea Righi
40e67897a9
Merge pull request #331 from sched-ext/rustland-core-dispatch-debug
scx_rustland_core: add extra debugging info to dispatch_task()
2024-06-04 18:58:01 +02:00
Tejun Heo
5b496558c1
Merge pull request #329 from sched-ext/htejun/cleanup-loading-and-running
scx: Unify loading and running boilerplate across rust schedulers
2024-06-04 06:51:12 -10:00
Andrea Righi
b363a13310 scx_rustland_core: add enqueue flags to debug info
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-04 17:14:38 +02:00
Andrea Righi
89384754ce scx_rustland_core: add task time slice to the debug info
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-04 17:14:38 +02:00
Tejun Heo
e556dd375d scx: Unify loading and running boilerplate across rust schedulers
Make restart handling with user_exit_info simpler and use the load and
report macros consistently across the rust schedulers. This makes all
schedulers automatically handle auto restarts from CPU hotplug events.
Note that this is necessary even for scx_lavd, which has CPU hotplug
operations, because CPU hotplug operations that take place between skel
open and scheduler init can still trigger a restart.
2024-06-03 12:25:41 -10:00
David Vernet
a26d3f2220
Merge pull request #328 from sched-ext/rusty_cpumask_overlap
rusty: Use cpumask kfuncs in cpumask_intersects_domain()
2024-06-03 20:42:11 +00:00
David Vernet
0ae676a9ca
rusty: Use cpumask kfuncs in cpumask_intersects_domain()
In cpumask_intersects_domain(), we check whether a given cpumask has any
CPUs in common with the specified domain by looking at the const, static
dom_cpumasks map. This map is only really necessary when creating the
domain struct bpf_cpumask objects at scheduler load time. After that, we
can just use the actual struct bpf_cpumask object embedded in the domain
context. Let's use that and cpumask kfuncs instead.

This allows rusty to load with
https://github.com/sched-ext/sched_ext/pull/216.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-03 15:01:19 -05:00
Tejun Heo
dfc642b0b3
Merge pull request #327 from sched-ext/htejun/bump-versions
Bump versions for a release
2024-06-03 08:41:18 -10:00
Tejun Heo
a2d5310cb6 Bump versions for a release
2024-06-03 08:35:21 -10:00
Andrea Righi
85e1fc5767
Merge pull request #325 from sched-ext/rustland-drop-builtin-idle
scx_rustland: get rid of --builtin-idle option
2024-06-03 19:36:41 +02:00
Tejun Heo
54bc9489e9
Merge pull request #326 from sirlucjan/readme-update2
README: Adding dependencies to allow compilation
2024-06-03 07:13:10 -10:00
Piotr Gorski
35249a1888
README: Adding dependencies to allow compilation
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-06-03 15:06:25 +02:00
Andrea Righi
ccef4d0ba1 scx_rustland: get rid of --builtin-idle option
Commit 23b0bb5f ("scx_rustland: dispatch interactive tasks on any CPU")
allows only interactive tasks to be dispatched on any CPU, enabling them
to quickly use the first idle CPU available. Non-interactive tasks, on
the other hand, are kept on the same CPU as much as possible.

This change deprioritizes CPU-intensive tasks further, but it also helps
to exploit cache locality, while latency-sensitive tasks are dispatched
sooner, improving overall responsiveness, despite the potential
migration cost.

Given this new logic, the --builtin-idle option, which forces all tasks to
be dispatched on the CPU assigned during select_cpu(), no longer offers
significant benefits. It would merely reduce the responsiveness of
interactive tasks.

Therefore, simply remove this option, allowing the scheduler to
determine the target CPU(s) for all tasks based on their nature.

Fixes: 23b0bb5f ("scx_rustland: dispatch interactive tasks on any CPU")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-03 10:02:04 +02:00
Tejun Heo
1ec6c7084b
Merge pull request #322 from vax-r/RW_ONCE
scx_lavd: Adding READ_ONCE()/WRITE_ONCE() macros
2024-06-01 06:25:11 -10:00
I Hsin Cheng
0921fde1f1 scx_lavd: Adding READ_ONCE()/WRITE_ONCE() macros
In order to prevent the compiler from merging or refetching load/store
operations, or from reordering them in unwanted ways, take the
implementation of READ_ONCE()/WRITE_ONCE() from the kernel sources under
"/include/asm-generic/rwonce.h".

Use WRITE_ONCE() in flip_sys_cpu_util() so that the compiler does not
optimize the bit-flipping store based on incorrect assumptions.
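
READ_ONCE()/WRITE_ONCE() are C macros; as a rough analogue only, the same
compiler-level guarantee (the access is performed exactly once and is not
merged or refetched) can be sketched in Rust with volatile accesses. The
names read_once(), write_once() and flip_index() are hypothetical and do not
come from scx_lavd. Volatile accesses are not atomic and do not order other
memory operations.

```rust
/// Rough analogue of READ_ONCE(): a volatile load that the compiler must
/// perform exactly once and may not merge with other loads.
fn read_once(src: &u32) -> u32 {
    // Safe to call: `src` is a valid, aligned reference.
    unsafe { std::ptr::read_volatile(src) }
}

/// Rough analogue of WRITE_ONCE(): a volatile store that the compiler must
/// perform exactly once and may not elide.
fn write_once(dst: &mut u32, val: u32) {
    // Safe to call: `dst` is a valid, aligned, exclusive reference.
    unsafe { std::ptr::write_volatile(dst, val) }
}

/// Flips the lowest bit of an index with one volatile load and one volatile
/// store.
fn flip_index(idx: &mut u32) {
    let cur = read_once(idx);
    write_once(idx, cur ^ 0x1);
}
```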

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-06-01 11:07:52 +08:00
Andrea Righi
6b53adb5d3
Merge pull request #323 from aruhier/pr_kernel_config
Document the needed BPF kernel config
2024-05-31 06:49:45 +02:00
Anthony Ruhier
361bd77642
Document the needed BPF kernel config
2024-05-30 19:07:44 +02:00
Andrea Righi
18b902f7ab
Merge pull request #320 from sched-ext/rustland-core-libc-musl
scx_rustland_core: fix build error with musl
2024-05-29 03:23:29 +02:00
Tejun Heo
ebae7d5e6a
Merge pull request #312 from sched-ext/htejun/layered-updates
scx_layered: Improve affn_viol handling and implement dump method
2024-05-28 10:22:31 -10:00
Tejun Heo
d3ed4cb5c7 scx_layered: Successfully consuming from HI_FALLBACK_DSQ should terminate dispatching
layered_dispatch() was incorrectly continuing down to the lower priority
DSQs after successfully consuming from HI_FALLBACK_DSQ which can lead to
latency issues. Fix it.
2024-05-28 10:20:55 -10:00
Changwoo Min
4ac5da9717
Merge pull request #321 from sched-ext/revert-318-Memory_barrier
Revert "scx_lavd: Enforce memory barrier in flip_sys_cpu_util"
2024-05-27 12:20:18 +09:00
Changwoo Min
4c0f996ddc
Revert "scx_lavd: Enforce memory barrier in flip_sys_cpu_util"
2024-05-27 12:19:21 +09:00
Andrea Righi
4503c7080a scx_rustland_core: fix build error with musl
As reported in #319, the build may fail with musl, which requires
additional parameters in sched_param.

Fix by adding a proper conditional to support both GNU libc and musl
libc.

This fixes #319.
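
A conditional of this kind can be sketched with Rust's cfg(target_env)
attribute. This is an illustration under stated assumptions, not the actual
fix: SchedParam is a hypothetical stand-in for the libc structure, and the
extra field names in the musl variant are illustrative.

```rust
/// Hypothetical stand-in for sched_param when building against GNU libc.
#[cfg(not(target_env = "musl"))]
#[derive(Debug, Default)]
struct SchedParam {
    sched_priority: i32,
}

/// Hypothetical stand-in for sched_param when building against musl, which
/// needs additional fields to be initialized (names are illustrative).
#[cfg(target_env = "musl")]
#[derive(Debug, Default)]
struct SchedParam {
    sched_priority: i32,
    sched_ss_low_priority: i32,
    sched_ss_max_repl: i32,
}

/// Builds the structure for either libc: only the priority is set
/// explicitly, any remaining fields take their default (zero) value.
fn make_sched_param(priority: i32) -> SchedParam {
    SchedParam {
        sched_priority: priority,
        ..Default::default()
    }
}
```

Filling the remaining fields through Default keeps the call site identical
for both targets.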

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-05-26 22:30:04 +02:00
Tejun Heo
66e9141e67
Merge pull request #316 from frelon/add-opensuse-install
Add openSUSE installation notes
2024-05-26 08:27:50 -10:00
Changwoo Min
0371ccae40
Merge pull request #318 from vax-r/Memory_barrier
scx_lavd: Enforce memory barrier in flip_sys_cpu_util
2024-05-26 21:00:25 +09:00
I Hsin Cheng
f839106a57 scx_lavd: Enforce memory barrier in flip_sys_cpu_util
Use the GNU built-in __sync_fetch_and_xor() to perform the XOR operation
on the global variable "__sys_cpu_util_idx" to ensure the operation's
visibility.

The built-in function "__sync_fetch_and_xor()" provides both an atomic
operation and a full memory barrier, which every operation (especially
a store) on global variables needs.
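
For comparison, an atomic XOR with full ordering can be sketched in Rust with
AtomicU32::fetch_xor() and Ordering::SeqCst. This is only an analogue of the
C built-in, and flip_sys_cpu_util_idx() is a hypothetical name, not the
scx_lavd function.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Atomically flips the lowest bit of the index and returns the previous
/// value. SeqCst gives the strongest ordering the standard library offers.
fn flip_sys_cpu_util_idx(idx: &AtomicU32) -> u32 {
    idx.fetch_xor(0x1, Ordering::SeqCst)
}
```

Like __sync_fetch_and_xor(), fetch_xor() returns the value held before the
XOR was applied.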

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-05-26 15:27:10 +08:00
Fredrik Lönnegren
888172f432
Add openSUSE installation notes
Add a section on how to install scx and a sched-ext patched kernel on
openSUSE Tumbleweed.

Signed-off-by: Fredrik Lönnegren <fredrik.lonnegren@suse.com>
2024-05-24 14:50:26 +02:00
Tejun Heo
c09bc2ac69
Merge pull request #314 from vax-r/backward_compat
scx_central: Provide backward compatibility
2024-05-23 22:05:09 -10:00
I Hsin Cheng
5881c61a5e scx_central: Provide backward compatibility
Newer sched_ext kernel versions set the scheduler to schedule all tasks
within the system by default. However, some users are still running
older kernel versions.

Therefore, call "__COMPAT_scx_bpf_switch_all()" to move all tasks to the
"SCHED_EXT" class so that scx_central schedules all tasks by default on
older kernels as well.
2024-05-24 15:12:34 +08:00
Tejun Heo
99eb56b6b5 scx_layered: Implement layered_dump()
which dumps layer states.
2024-05-23 12:54:17 -10:00
Tejun Heo
a576242b69 scx_layered: Open and grouped layers can handle tasks with custom affinities
The main reason why custom affinities are tricky for scx_layered is that
if we put a task which doesn't allow all CPUs into a layer's DSQ, it may not
get consumed for an indefinite amount of time. However, this is only true
for confined layers. Both open and grouped layers are always consumed from
all CPUs and thus don't have this risk.

Let's allow tasks with custom affinities in open and grouped layers.

- In select_cpu(), don't consider direct dispatching to a local DSQ as
  affinity violation even if the target CPU is outside the layer's cpumask
  if the layer is open.

- In enqueue(), separate out per-cpu kthread special case into its own
  block. Note that this is only applied if the layer is not preempting as a
  preempting layer has a higher priority than HI_FALLBACK_DSQ anyway.

- Trigger the LO_FALLBACK_DSQ path for other threads only if the layer is
  confined.

- The preemption path now also runs for tasks with a custom affinity in open
  and grouped layers. Update it so that it only considers the CPUs in the
  preempting task's allowed cpumask.

(cherry picked from commit 82d2f887a4608de61ddf5e15643c10e504a88f7b)
2024-05-23 12:54:17 -10:00
Tejun Heo
1ce23760b5 scx_layered: Improve affinity violation handling
- AFFN_VIOL for per-cpu tasks could be double counted: once in select_cpu()
  and again in enqueue(). Count in select_cpu() only when direct
  dispatching.

- Violating tasks were prioritized over non-violating ones because they were
  queued on SCX_DSQ_GLOBAL which has priority over all user DSQs. This
  doesn't make sense. Let's introduce two fallback DSQs - HI_FALLBACK_DSQ
  and LO_FALLBACK_DSQ. HI is used for violating kthreads and LO for
  violating user threads. HI is dispatched after preempting layers and LO
  after all other layers. This shouldn't change the behavior too much for
  kthreads while punishing, rather than rewarding, violating user threads.
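
The dispatch order described above (preempting layers, then HI_FALLBACK_DSQ,
then all other layers, then LO_FALLBACK_DSQ) can be sketched as a
first-non-empty selection. The Queues and Source types and next_source() are
hypothetical and only model the ordering; they are not the scx_layered BPF
code.

```rust
/// Hypothetical snapshot of how many tasks are queued at each level.
struct Queues {
    preempting_layers: usize,
    hi_fallback: usize,
    other_layers: usize,
    lo_fallback: usize,
}

#[derive(Debug, PartialEq)]
enum Source {
    PreemptingLayer,
    HiFallback,
    OtherLayer,
    LoFallback,
}

/// Returns the first non-empty source in priority order, or None if nothing
/// is queued. Each early return models "stop after a successful consume".
fn next_source(q: &Queues) -> Option<Source> {
    if q.preempting_layers > 0 {
        return Some(Source::PreemptingLayer);
    }
    if q.hi_fallback > 0 {
        return Some(Source::HiFallback);
    }
    if q.other_layers > 0 {
        return Some(Source::OtherLayer);
    }
    if q.lo_fallback > 0 {
        return Some(Source::LoFallback);
    }
    None
}
```

In this model, violating kthreads (HI) are served ahead of ordinary layers,
while violating user threads (LO) are served only when everything else is
empty.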

(cherry picked from commit 67f69645667ba8a155cae9a9b7e90c055d39e23c)
2024-05-23 12:54:17 -10:00
Tejun Heo
7d4243a59d
Merge pull request #311 from jordalgo/c-scheds-fedora
Update INSTALL.md with fedora c scheds info
2024-05-23 09:06:55 -10:00
Jordan Rome
fcf872067a Update INSTALL.md with fedora c scheds info
Also add links to the fedora rpms.
2024-05-23 07:51:39 -07:00
Andrea Righi
1bdc8bd37d
Merge pull request #310 from RinHizakura/fix_scx
Reduce MAX_ENQUEUED_TASKS to fit percpu allocator
2024-05-23 16:46:46 +02:00
Yiwei Lin
8c2236770d Reduce MAX_ENQUEUED_TASKS to fit percpu allocator
The setting of ops->dispatch_max_batch leads to an allocation that is too
large for the percpu allocator, which fails when the requested size is
larger than PCPU_MIN_UNIT_SIZE. Reduce MAX_ENQUEUED_TASKS to fix this.
2024-05-23 22:33:02 +08:00
Andrea Righi
3e2e581094
Merge pull request #307 from sched-ext/rustland-improve-audio
scx_rustland: improve audio workload and performance predictability
2024-05-23 07:05:05 +02:00
Andrea Righi
4791d862f5 scx_rustland_core: second chance CPU migration
Implement a second-chance migration in select_cpu(): after a task has
been dispatched directly, do not try to migrate it immediately to a
different CPU, but force it to stay on prev_cpu for another round.

This seems quite effective on certain architectures (such as on a system
with 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz), and it can provide
noticeable benefits with gaming or WebGL applications (such as
https://webglsamples.org/aquarium/aquarium.html) under regular workload
conditions (around +5% fps).
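
One way to picture the second-chance logic is a per-task flag that is set on
a direct dispatch and consumed on the next CPU selection. This is a sketch
under assumptions: TaskCtx, its dispatched_directly field and the signature
of select_cpu() below are hypothetical and do not reflect the actual
scx_rustland_core code.

```rust
/// Hypothetical per-task state.
struct TaskCtx {
    /// Set when the task was dispatched directly in the previous round.
    dispatched_directly: bool,
}

/// Picks a CPU for the task. `idle_cpu` is an idle CPU found by the caller,
/// if any.
fn select_cpu(ctx: &mut TaskCtx, prev_cpu: u32, idle_cpu: Option<u32>) -> u32 {
    if ctx.dispatched_directly {
        // Second chance: stay on prev_cpu for one more round instead of
        // migrating again right away.
        ctx.dispatched_directly = false;
        return prev_cpu;
    }
    match idle_cpu {
        Some(cpu) => {
            // Direct dispatch to the idle CPU; remember it for next time.
            ctx.dispatched_directly = true;
            cpu
        }
        None => prev_cpu,
    }
}
```

In this model a task that was just moved to CPU 3 stays on CPU 3 for the
following round even if another idle CPU is available, and may migrate again
only in the round after that.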

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-05-22 21:38:34 +02:00
Andrea Righi
23b0bb5ff5 scx_rustland: dispatch interactive tasks on any CPU
Dispatch non-interactive tasks on the CPU selected by the built-in idle
selection logic and allow interactive tasks to be dispatched on any CPU.
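
The rule above can be sketched as a small decision function. Target and
dispatch_target() are hypothetical names used only for illustration; how a
task is classified as interactive is outside the scope of this sketch.

```rust
#[derive(Debug, PartialEq)]
enum Target {
    /// The task may run on the first CPU that becomes available.
    AnyCpu,
    /// The task is bound to the CPU chosen by the idle selection logic.
    Cpu(u32),
}

/// Interactive tasks can go anywhere; other tasks stay on the selected CPU.
fn dispatch_target(is_interactive: bool, selected_cpu: u32) -> Target {
    if is_interactive {
        Target::AnyCpu
    } else {
        Target::Cpu(selected_cpu)
    }
}
```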

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-05-22 12:12:55 +02:00