Commit Graph

899 Commits

Author SHA1 Message Date
Pietro Righi
66dea6262b scx.service: allow overriding scx variables
Switching the scheduler requires changing SCX_SCHEDULER (and potentially
also SCX_FLAGS) in /etc/default/scx.

This patch allows overriding these settings using systemd environment
variables SCX_SCHEDULER_OVERRIDE and SCX_FLAGS_OVERRIDE, without
changing the default configuration.

Example:

 > grep SCX_SCHEDULER /etc/default/scx
 SCX_SCHEDULER=scx_rusty

 > sudo systemctl status scx
 ...
   Main PID: 8021 (scx_rusty)
 ...

 > sudo systemctl set-environment SCX_SCHEDULER_OVERRIDE=scx_rustland
 > sudo systemctl restart scx
 > sudo systemctl status scx
 ...
   Main PID: 4021 (scx_rustland)
 ...

This feature can be useful for quickly testing different schedulers and
settings, without altering the global system configuration.

Signed-off-by: Pietro Righi <pietro.righi.email@gmail.com>
2024-06-14 18:51:11 +02:00
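
The precedence described in the commit above can be modeled as a small
lookup. This is only a sketch of the fallback rule, not the unit file itself
(which the message does not show); the function name is hypothetical, and
treating an empty override as "not set" is an assumption.

```python
def resolve_scx_settings(defaults, env):
    """Return the (scheduler, flags) pair the service would start with.

    defaults: values as read from /etc/default/scx
    env:      systemd manager environment (e.g. set via set-environment)
    """
    # Assumption: an override only wins when it is set and non-empty.
    scheduler = env.get("SCX_SCHEDULER_OVERRIDE") or defaults.get("SCX_SCHEDULER", "")
    flags = env.get("SCX_FLAGS_OVERRIDE") or defaults.get("SCX_FLAGS", "")
    return scheduler, flags


defaults = {"SCX_SCHEDULER": "scx_rusty", "SCX_FLAGS": ""}
print(resolve_scx_settings(defaults, {}))
print(resolve_scx_settings(defaults, {"SCX_SCHEDULER_OVERRIDE": "scx_rustland"}))
```

The default file is never modified, so clearing the override (for example
with `systemctl unset-environment`) brings back the configured scheduler on
the next restart.
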
Changwoo Min
3a53162ce7
Merge pull request #355 from multics69/lavd-core-compaction-doc
scx_lavd: add the design of core compaction
2024-06-14 11:55:18 +09:00
Changwoo Min
94a39f419f scx_lavd: add the design of core compaction
Core compaction seems to work well on a variety of hardware, so it is
time to document its design.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-14 11:53:52 +09:00
Changwoo Min
5068d75bf3
Merge pull request #351 from multics69/lavd-power-v2
scx_lavd: improve CPU frequency scaling
2024-06-14 09:29:10 +09:00
Dan Schatzberg
6d7af64943
Merge pull request #346 from sirlucjan/config-update2
scheds: Add scx_mitosis scheduler to /etc/default/scx
2024-06-13 17:48:58 -04:00
Tejun Heo
a3342810c7
Merge pull request #352 from dschatzberg/mitosis
common: Add css iter forward declares
2024-06-13 06:50:06 -10:00
Changwoo Min
1bd2c2206f
Merge pull request #349 from multics69/lavd-suspend-resume
scx_lavd: properly calculate task's runtime after suspend/resume
2024-06-13 07:57:46 +09:00
Dan Schatzberg
114e4b644b common: Add css iter forward declares
These are used in mitosis, but they belong in common code so other
schedulers can do css iteration.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-06-12 15:02:48 -07:00
Tejun Heo
08521d4fec
Merge pull request #350 from vimproved/llvm-version-suffix
Support LLVM_VERSION_SUFFIX in clang version parsing regex
2024-06-12 07:27:03 -10:00
Changwoo Min
747bf2a7d7 scx_lavd: add the design of CPU frequency scaling
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-13 01:42:19 +09:00
Violet Purcell
2341b67971
Support LLVM_VERSION_SUFFIX in clang version parsing regex
If LLVM is compiled with the LLVM_VERSION_SUFFIX cmake option, then the
version may have an additional suffix, for example "18.1.7+libcxx".
Gentoo, for example, uses this to fend off ABI issues between libstdc++
and libc++.

Signed-off-by: Violet Purcell <vimproved@inventati.org>
2024-06-12 11:58:27 -04:00
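
A version parser that tolerates such a suffix might look like the sketch
below. The regex actually used by the build scripts is not shown in the
commit message, so this pattern and the function name are assumptions made
for illustration.

```python
import re

# Hypothetical pattern: major.minor.patch, then an optional suffix that runs
# up to the next whitespace (e.g. "+libcxx").
CLANG_VERSION_RE = re.compile(r"clang version (\d+)\.(\d+)\.(\d+)(\S*)")


def parse_clang_version(output):
    """Return (major, minor, patch, suffix), or None if nothing matches."""
    match = CLANG_VERSION_RE.search(output)
    if match is None:
        return None
    major, minor, patch = (int(part) for part in match.group(1, 2, 3))
    return major, minor, patch, match.group(4)


print(parse_clang_version("clang version 18.1.7+libcxx"))
print(parse_clang_version("clang version 17.0.6"))
```

Capturing the suffix separately keeps the numeric comparison unchanged, so a
minimum-version check behaves the same with or without the suffix.
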
Changwoo Min
2e74b86b4a scx_lavd: logging cpu performance target
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-13 00:44:04 +09:00
Changwoo Min
e6348a11e9 scx_lavd: improve frequency scaling logic
The old CPU frequency scaling logic checked the task's CPU performance
target (i.e., target CPU frequency) every tick interval and applied it
immediately. As a result, the CPU frequency fluctuated every tick
interval, which led to less steady performance.

Now, we take a different strategy. The key idea is to increase the
frequency as soon as possible when a task starts running, for quick
adaptation to load spikes. When the frequency needs to go down, it is
decreased gradually every tick interval to avoid frequency fluctuations.

In my testing, it shows more stable performance in many workloads
(games, compilation).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-12 23:40:40 +09:00
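
The "raise at once, lower gradually" strategy can be sketched as a single
update rule. The function name, the abstract performance units, and the fixed
`step` are assumptions; the commit message does not say how large each
decrease is in scx_lavd.

```python
def next_perf_target(current, desired, step):
    """Return the performance target to apply next.

    current: target currently in effect
    desired: target computed for the running task
    step:    largest decrease allowed per tick (hypothetical knob)
    """
    if desired >= current:
        # Ramp up immediately so load spikes are absorbed quickly.
        return desired
    # Ramp down by at most one step per tick to avoid oscillation.
    return max(desired, current - step)


target = 1024
for tick in range(4):
    target = next_perf_target(target, 400, 256)
    print(tick, target)
```
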
Changwoo Min
753f333c09 scx_lavd: refactoring do_update_sys_stat()
Originally, do_update_sys_stat() simply calculated the system-wide CPU
utilization. Over time, it has evolved to collect all kinds of
system-wide, periodic statistics for decision-making, so it has become
bulky. Now, it is time to refactor it for readability. This commit does
not contain functional changes other than refactoring.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-12 21:15:25 +09:00
Changwoo Min
9d129f0afa scx_lavd: rename LAVD_CPU_UTIL_INTERVAL_NS to LAVD_SYS_STAT_INTERVAL_NS
The periodic CPU utilization routine does a lot of other work now. So we
rename LAVD_CPU_UTIL_INTERVAL_NS to LAVD_SYS_STAT_INTERVAL_NS.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-12 20:06:17 +09:00
Changwoo Min
7046b47b9c scx_lavd: properly calculate task's runtime after suspend/resume
When a device is suspended and resumed, the suspended duration is added
to a task's runtime if the task was running on the CPU. After the
resume, the task's runtime is incorrectly long, and the scheduler
concludes that the system is under heavy load. To avoid this problem,
the suspended duration is measured and subtracted from the task's runtime.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-12 15:58:41 +09:00
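
The correction amounts to one subtraction. In this sketch the function and
parameter names are hypothetical, and `suspended_ns` is assumed to be the time
the system spent suspended between `start_ns` and `stop_ns`.

```python
def task_runtime_ns(start_ns, stop_ns, suspended_ns):
    """Runtime of a task that ran from start_ns to stop_ns, minus suspend time."""
    elapsed = stop_ns - start_ns
    # Clamp at zero so an over-estimated suspend time cannot go negative.
    return max(0, elapsed - suspended_ns)


# A task ran for 3 ms around a 10 s suspend.
print(task_runtime_ns(1_000_000, 10_004_000_000, 10_000_000_000))
```
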
Dan Schatzberg
34075829a4
Merge pull request #348 from dschatzberg/mitosis
mitosis: Fix build
2024-06-11 18:41:58 -04:00
Dan Schatzberg
b95cfb0772 mitosis: Fix build
The target did not depend on the previous sched, so building all
schedulers skipped scx_mitosis, which broke the install script.
2024-06-11 14:33:32 -07:00
Piotr Gorski
bbd3132b8e
scheds: Add scx_mitosis scheduler to /etc/default/scx
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-06-11 23:05:17 +02:00
Dan Schatzberg
9528d4603e
Merge pull request #339 from dschatzberg/mitosis
scheds: Add scx_mitosis scheduler
2024-06-11 16:50:25 -04:00
Dan Schatzberg
3b6e2dee20 scheds: Add scx_mitosis scheduler
scx_mitosis is a dynamic affinity scheduler which assigns cgroups to
Cells and Cells to discrete sets of CPUs. The number of cells is dynamic
as is the CPU assignment. BPF mostly just does vtime scheduling for each
cell, tracks load, and responds to reconfiguration from userspace.
Userspace makes decisions about how to assign cgroups to cells and cells
to CPUs.

This is not yet a complete scheduler; much of the userspace logic is a
placeholder as I experiment with better logic. I also want to add richer
scheduling semantics to userspace, e.g. so that cells can do more
"soft-affinity" rather than the strict partitioning implemented
currently.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-06-11 10:34:53 -07:00
David Vernet
1dbf874709
Merge pull request #341 from vax-r/rusty_data_races
scx_rusty: Eliminate data races possibility for domain min_vruntime
2024-06-11 12:04:40 -05:00
David Vernet
d4a8949f4d
Merge pull request #343 from hodgesds/freq-trans-lat
scx_utils: Add CPU freq transition latency
2024-06-11 11:35:07 -05:00
Tejun Heo
f76ab01a58
Merge pull request #344 from sched-ext/resize_array
uei: Pass skel to RESIZE_ARRAY()
2024-06-11 06:25:32 -10:00
David Vernet
b50ba626cc
uei: Pass skel to RESIZE_ARRAY()
The RESIZE_ARRAY() macro assumes the presence of an in-scope "skel" variable.
This is bad practice and can cause issues in other macros that use it. Let's
update it to explicitly take a skel argument.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-11 10:15:26 -05:00
Daniel Hodges
2ca42428cd scx_utils: Add CPU freq transition latency
This change adds the CPU frequency transition latency, read from
`cpuinfo_transition_latency` in sysfs. The value of this field is
described in the [cpufreq
docs](https://www.kernel.org/doc/Documentation/cpu-freq/user-guide.txt).
On supported systems it returns the CPU frequency transition latency in
nanoseconds. The goal of this change is to let future schedulers use
this data to make better frequency scaling decisions.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-06-11 07:35:34 -07:00
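
Reading the value by hand is a one-file lookup. The sketch below is not the
scx_utils implementation (which is Rust); the function name and the
`sysfs_root` parameter are assumptions added to make it testable, while the
path under `/sys` is the standard cpufreq location.

```python
from pathlib import Path


def read_transition_latency_ns(cpu, sysfs_root="/sys"):
    """Return cpuinfo_transition_latency for a CPU, or None if unavailable."""
    path = (
        Path(sysfs_root)
        / "devices/system/cpu"
        / f"cpu{cpu}"
        / "cpufreq/cpuinfo_transition_latency"
    )
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # No cpufreq driver, no such CPU, or an unparsable value.
        return None


print(read_transition_latency_ns(0))
```

Returning `None` rather than raising reflects the "on supported systems"
caveat: callers need a fallback when the file is missing.
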
I Hsin Cheng
4e30bb9ccf scx_rusty: Eliminate data races possibility for domain min_vruntime
The READ_ONCE()/WRITE_ONCE() macros were added in commit 0932fde, so we
can use them to rule out data races on domc->min_vruntime.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-06-11 10:57:03 +08:00
Tejun Heo
30f27d99d9
Merge pull request #340 from sched-ext/htejun/layered-updates
scx_layered: Improve yield, preemption and other behaviors
2024-06-10 11:27:44 -10:00
Tejun Heo
9ec3594b4f scx_layered: Several fixes to address David's review
- pick_idle_cpu() was putting idle_smtmask that it didn't acquire.

- layered_enqueue() was unnecessarily entering preemption path after finding
  an idle CPU.

- No need to test whether scx_bpf_get_idle_cpu/smtmask() return NULL. They
  never do.

- Relocate cctx->yielding test into keep_running() from its caller.
2024-06-10 11:23:37 -10:00
Tejun Heo
92317aa2f9 Use __always_inline uniformly
Instead of using __attribute__((always_inline)) use the __always_inline
macro provided by BPF.
2024-06-10 11:23:26 -10:00
Changwoo Min
472ab945b8
scx_lavd: core compaction for low power consumption (#338)
scx_lavd: core compaction for low power consumption

When system-wide CPU utilization is low, it is very likely that all the
CPUs are running at very low utilization. That means all CPUs run at a
low clock frequency thanks to dynamic frequency scaling and very
frequently enter and leave C-states. That results in low performance
(i.e., low clock frequency) and high power consumption (i.e., frequent
P-/C-state transitions).

The idea of *core compaction* is to use fewer CPUs when system-wide CPU
utilization is low. The chosen cores (called "active cores") will run at
higher utilization and a higher clock frequency, and the rest of the
cores (called "idle cores") will stay in a C-state for a much longer
duration. Thus, core compaction can achieve higher performance with
lower power consumption.

One potential problem of core compaction is latency spikes when all the
active cores are overloaded. A few techniques are incorporated to solve
this problem.

1) Limit the active CPU core's utilization below a certain limit (say 50%).

2) Do not use the core compaction when the system-wide utilization is
   moderate (say 50%).

3) Do not enforce the core compaction for kernel and pinned user-space
   tasks since they are manually optimized for performance.

In my experiments, under a wide range of system-wide CPU utilization
(5%-80%), core compaction reduces power consumption by 7-30% without
sacrificing average and 99p tail latency.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-08 09:25:27 +09:00
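
Techniques 1) and 2) above suggest a simple way to size the set of active
cores. This is only an illustrative sketch: the function name and the formula
are assumptions, the 50% values are the "say 50%" examples from the message,
and technique 3) (exempting kernel and pinned tasks) is not modeled.

```python
import math


def active_core_count(total_cores, sys_util, per_core_limit=0.5, compaction_cutoff=0.5):
    """Return how many cores to keep active for a system-wide utilization in [0, 1]."""
    if sys_util >= compaction_cutoff:
        # Moderate or high load: do not compact at all.
        return total_cores
    # Total demand expressed in "fully busy cores".
    demand = sys_util * total_cores
    # Enough cores that each stays at or below per_core_limit.
    needed = math.ceil(demand / per_core_limit)
    return max(1, min(total_cores, needed))


for util in (0.0, 0.25, 0.6):
    print(util, active_core_count(8, util))
```
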
Tejun Heo
a165970ab9 scx_layered: Add migration statistic
Keep track of how frequent migrations are.
2024-06-07 11:49:39 -10:00
Tejun Heo
5b31d96c3d scx_layered: Implement "preempt_first" layer property
If set, tasks in the layer will try to preempt tasks on their previous CPUs
before trying to find idle CPUs.
2024-06-07 11:49:39 -10:00
Tejun Heo
ece3638664 scx_layered: Allow confined layers to preempt
There's no reason to restrict confined layers from preempting on the CPUs
that they are entitled to. Allow preemption for confined layers.
2024-06-07 11:49:39 -10:00
Tejun Heo
7c48814ed0 scx_layered: Prefer preempting the CPU the task was previously on
Currently, when preempting, searching for the candidate CPU always starts
from the RR preemption cursor. Let's first try the previous CPU the
preempting task was on as that may have some locality benefits.
2024-06-07 11:49:38 -10:00
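
The resulting search order can be sketched as "previous CPU first, then
round-robin from the cursor". The function and parameter names are
hypothetical, and the sketch ignores which CPUs a layer is allowed to use.

```python
def preemption_candidates(prev_cpu, cursor, nr_cpus):
    """Return CPU ids in the order they would be considered for preemption."""
    # Try the CPU the task last ran on first, for locality.
    order = [prev_cpu]
    # Then walk the remaining CPUs round-robin, starting at the cursor.
    for offset in range(nr_cpus):
        cpu = (cursor + offset) % nr_cpus
        if cpu != prev_cpu:
            order.append(cpu)
    return order


print(preemption_candidates(prev_cpu=2, cursor=5, nr_cpus=6))
```
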
Tejun Heo
3db3257911 scx_layered: Find and kick an idle CPU from enqueue path
When a task is being enqueued outside wakeup path, ops.select_cpu() isn't
called, so we can end up in a situation where a newly enqueued task keeps
waiting in one of the DSQs while there are idle CPUs. Factor out idle CPU
selection path into pick_idle_cpu() and call it from the enqueue path in
such cases. This problem is shared across schedulers and likely needs a more
generic solution in the future.
2024-06-07 11:49:38 -10:00
Tejun Heo
0f2d1ad2fa scx_layered: Implement a new layer parameter "yield_ignore"
yield(2) currently gives up the entire slice. Add the "yield_ignore" layer
parameter, which modulates the magnitude of yielding. When 1.0, yields are
completely ignored. When 0.5, only half of the full slice is given up, and
so on.
2024-06-07 11:49:38 -10:00
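
Read literally, the parameter scales how much of the slice a yield costs. The
sketch below assumes a linear rule that matches the 1.0 and 0.5 examples in
the message; the function name is hypothetical and the real BPF code may
compute this differently.

```python
def slice_given_up_ns(slice_ns, yield_ignore):
    """Return how much of the full slice a yield gives up."""
    if not 0.0 <= yield_ignore <= 1.0:
        raise ValueError("yield_ignore must be within [0.0, 1.0]")
    # 0.0 gives up the whole slice, 1.0 gives up nothing.
    return int(slice_ns * (1.0 - yield_ignore))


for ignore in (0.0, 0.5, 1.0):
    print(ignore, slice_given_up_ns(20_000_000, ignore))
```
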
Tejun Heo
4aa8124b9c scx_layered: Add explicit yield() support
Currently, a task which yields is treated the same as a task which has used
up its slice. As the budget charged to a task is calculated from wall clock
time, a repeatedly yielding task can stay at the top of the queue for quite
a while, hogging the CPU and spiking the number of scheduling events.

Let's add explicit yield support. A yielding task is now always charged the
full slice and not allowed to keep running on the same CPU.
2024-06-07 11:49:38 -10:00
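
The accounting change can be illustrated with a small charging function. The
names are hypothetical and the non-yield branch is a simplification of
"calculated from wall clock time".

```python
def charged_ns(used_ns, slice_ns, yielded):
    """Return the budget charged to a task when it stops running."""
    if yielded:
        # A yielding task always pays for the full slice.
        return slice_ns
    # Otherwise charge the wall clock time used, capped at the slice.
    return min(used_ns, slice_ns)


# A task that yields after 50 us of a 20 ms slice.
print(charged_ns(50_000, 20_000_000, yielded=False))
print(charged_ns(50_000, 20_000_000, yielded=True))
```

Charging the full slice is what stops a task that yields in a tight loop from
staying at the head of the queue.
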
Tejun Heo
436cd7ba9e scx_layered: Make enqueue path comprehensive and handle CPU preemptions
The keep_running path relies on the implicit last task enqueue which makes
the statistics a bit difficult to track. Let's make the enqueue path
comprehensive:

- Set SCX_OPS_ENQ_LAST and handle the last runnable task enqueue explicitly.

- Implement layered_cpu_release() to re-enqueue tasks from a CPU preempted
  by a higher pri sched class and handle the re-enqueued tasks explicitly in
  layered_enqueue().

- Add more statistics to track all enqueue operations.
2024-06-07 11:49:38 -10:00
Tejun Heo
4a0993ceab scx_layered: Allow long-running tasks to keep running on the same CPU
When a task exhausts its slice, layered currently doesn't make any effort to
keep it on the same CPU. It dispatches the next task to run and then
enqueues the running one. This leads to suboptimal behaviors. For example,
when this happens to a task in a preempting layer, the task will most likely
find an idle CPU or a task to preempt and then migrate there, causing a
completely unnecessary migration.

This patch makes layered_dispatch() test whether the current task should keep
running on the CPU and, if so, skip dispatching to keep the task running.
This behavior depends on the implicit local DSQ enqueue mechanism which
triggers when there are no other tasks to run.
2024-06-07 11:49:38 -10:00
David Vernet
5ad8d40713
Merge pull request #337 from sched-ext/htejun/fix-layered-load
Bring rust scheduler's compat support to parity with C
2024-06-06 19:57:52 -05:00
Tejun Heo
bc1bb5c50f Update libbpf and bpftool commits to the latest
For better compat feature support (ignoring ops which are NULL'd out).
2024-06-06 14:26:45 -10:00
Tejun Heo
3e3720fc7f scx_utils: Add compat support for ops.tick() and ops.dump*()
Match rust scx_ops_load!()'s compat support with C's SCX_OPS_LOAD().
2024-06-06 14:16:36 -10:00
Tejun Heo
200af60f2a scx_layered: Fix load failure due to scheduler_tick() -> sched_tick() rename
- scx_utils: Replace kfunc_exists() with ksym_exists() which doesn't care
  about the type of the symbol.

- scx_layered: Fix load failure on kernels >= v6.10-rc due to the
  scheduler_tick() -> sched_tick() rename. Attach the tick fentry function
  to either scheduler_tick() or sched_tick().
2024-06-06 12:54:59 -10:00
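
Choosing between the two names comes down to checking which symbol the
running kernel exports. The sketch below does this over text in the
`/proc/kallsyms` format ("address type name [module]"); it is not the
scx_utils `ksym_exists()` implementation, and the function name is
hypothetical.

```python
def pick_tick_symbol(kallsyms_text, candidates=("sched_tick", "scheduler_tick")):
    """Return the first candidate present in kallsyms-style text, else None."""
    present = set()
    for line in kallsyms_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            # The symbol type (fields[1]) is deliberately ignored.
            present.add(fields[2])
    for name in candidates:
        if name in present:
            return name
    return None


sample = "ffffffff81100000 T sched_tick\nffffffff81100100 t helper_fn [mod]\n"
print(pick_tick_symbol(sample))
```
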
Andrea Righi
def1ad2947
Merge pull request #336 from sched-ext/rustland-max-time-slice-limit
scx_rustland: never use a time slice that exceeds the default value
2024-06-06 18:34:10 +02:00
Tejun Heo
1dbeed752c
Merge pull request #335 from sirlucjan/config-update
scx: update /etc/default/scx sample flags
2024-06-06 06:32:15 -10:00
Andrea Righi
8a3ee7b801 scx_rustland: never use a time slice that exceeds the default value
Make sure to never assign a time slice longer than the default time
slice, which serves as an upper limit.

This seems to prevent potential stall conditions (reported by the
CachyOS community) when running CPU-intensive workloads, such as:

 [   68.062813] sched_ext: BPF scheduler "rustland" errored, disabling
 [   68.062831] sched_ext: runnable task stall (ollama_llama_se[3312] failed to run for 5.180s)
 [   68.062832]    scx_watchdog_workfn+0x154/0x1e0
 [   68.062837]    process_one_work+0x18e/0x350
 [   68.062839]    worker_thread+0x2fa/0x490
 [   68.062841]    kthread+0xd2/0x100
 [   68.062842]    ret_from_fork+0x34/0x50
 [   68.062844]    ret_from_fork_asm+0x1a/0x30

Fixes: 6f4cd853 ("scx_rustland: introduce virtual time slice")
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-06 17:56:23 +02:00
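
The fix boils down to a clamp. In this sketch the function name and the
nanosecond units are assumptions; scx_rustland itself is written in Rust.

```python
def clamp_time_slice_ns(requested_ns, default_ns):
    """Return the time slice to assign, never above the default slice."""
    return min(requested_ns, default_ns)


print(clamp_time_slice_ns(35_000_000, 20_000_000))
print(clamp_time_slice_ns(5_000_000, 20_000_000))
```
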
Piotr Gorski
4558d5c3dd
scx: update /etc/default/scx sample flags
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-06-06 17:52:21 +02:00
Andrea Righi
3d62866774
Merge pull request #333 from sched-ext/rustland-virtual-time-slice
scx_rustland: introduce virtual time slice
2024-06-05 07:40:22 +02:00
Tejun Heo
3e921ccb74
Merge pull request #332 from sirlucjan/services-update4
scx.service: start service after graphical target
2024-06-04 11:20:44 -10:00