Commit Graph

397 Commits

Author SHA1 Message Date
I Hsin Cheng
84b9ac4dce scx_rusty: Pull domain status check
Check whether the BalanceState of pull_dom.load inside function
try_find_move_task is actually the variant NeedsPull. It'll perform task
migration in abit more conservative manner when the system is under high
loading situation.

Experiments are performed when the system is compiling linux kernel and
undergoing a large amount of I/O operation at the same time using fio.

The result showns that before the modification, there're 12,6617 times
of task migrations system wide. After the modification, there're 11,5419
times of task migrations system wide.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-06-23 21:38:23 +08:00
David Vernet
5038f54701
Merge pull request #377 from jfernandez/metrics-rs
rusty: Integrate stats with the metrics framework
2024-06-21 15:23:20 -05:00
David Vernet
3bd15be840
rlfifo: Use topo.nr_cpu_ids() instead of topo.nr_cpus_possible()
In scx_rlfifo, we're currently using topo.nr_cpus_possible() to
determine how many possible CPU IDs we could have on the system. To
properly support systems whose disabled CPUs may be in the middle of the
range of possible CPU IDs, let's instead use topo.nr_cpu_ids() so that
we don't accidentally dispatch to an invalid DSQ.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-21 12:57:20 -05:00
David Vernet
263e02f644
rusty: Use nr_cpu_ids instead of nr_cpus_possible
In scx_rusty, we're currently using topo.nr_cpus_possible() to determine
how many possible CPU IDs we could have on the system. scx_rusty already
accounts for offlined CPUs, so to properly support systems whose
disabled CPUs may be in the middle of the range of possible CPU IDs,
let's instead use topo.nr_cpu_ids().

Signed-off-by: David Vernet <void@manifault.com>
2024-06-21 12:57:19 -05:00
David Vernet
bdbf4b9c05
topo: Return nr_cpu_ids from host Topology
In some cases, a host may have an odd topology where there are gaps in
CPU IDs (including between possible CPUs). A common pattern in
schedulers is to perform allocations for every possible CPU ID, such as
creating a per-cpu DSQ. In order to avoid confusing schedulers, let's
track the maximum CPU ID on a system so that we can return the number of
CPU IDs on the system which is inclusive of gaps.

We also update scx_rustland in this change to accommodate the fact that
we no longer export nr_cpus_possible() from TopologyMap.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-21 12:57:13 -05:00
Jose Fernandez
83373b1f4e
rusty: Integrate stats with the metrics framework
We need a layer of indirection between the stats collection and their
output destinations. Currently, stats are only printed to stdout. Our
goal is to integrate with various telemetry systems such as Prometheus,
StatsD, and custom metric backends like those used by Meta and Netflix.
Importantly, adding a new backend should not require changes to the
existing stats code.

This patch introduces the `metrics` [1] crate, which provides a
framework for defining metrics and publishing them to different
backends.

The initial implementation includes the `dispatched_tasks_count`
metric, tagged with `type`. This metric increments every time a task is
dispatched, emitting the raw count instead of a percentage. A monotonic
counter is the most suitable metric type for this use case, as
percentages can be calculated at query time if needed. Existing logged
metrics continue to print percentages and remain unchanged.

A new flag, `--enable-prometheus`, has been added. When enabled, it
starts a Prometheus endpoint on port 9000 (default is false). This
endpoint allows metrics to be charted in Prometheus or Grafana
dashboards.

Future changes will migrate additional stats to this framework and add
support for other backends.

[1] https://metrics.rs/

Signed-off-by: Jose Fernandez <josef@netflix.com>
2024-06-21 10:18:44 -06:00
Changwoo Min
9c21ace276
Merge pull request #373 from vax-r/lavd_reuse
scx_lavd: Reuse can_task1_kick_task2
2024-06-19 15:29:05 +09:00
I Hsin Cheng
99960ad960 scx_lavd: Reuse can_task1_kick_task2
Use the function can_task1_kick_task2() to replace places which also
checking the comp_preemption_info between two cpus for better
consistency.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-06-19 11:01:31 +08:00
Changwoo Min
691869e83f
Merge pull request #369 from sched-ext/lavd-fix-pick-cpu
scx_lavd: properly check for idle CPUs in pick_cpu()
2024-06-19 09:23:17 +09:00
Changwoo Min
dad25f1b5d
Merge pull request #368 from multics69/lavd-perf-misc
scx_lavd: misc performance tuning and code clean up
2024-06-19 07:26:52 +09:00
Andrea Righi
bad9ed13ef scx_lavd: properly check for idle CPUs in pick_cpu()
It seems that we are not updating `is_idle` when we find an idle CPU
with pick_cpu(), causing unnecessary rescheduling events when
select_cpu() is called.

To resolve this, ensure that the is_idle state is correctly set.
Additionally, always ensure that the task is dispatched to the local DSQ
immediately upon finding (and reserving) an idle CPU.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-18 17:36:39 +02:00
Changwoo Min
632fa9e4f2 scx_lavd: misc code clean up
- clean up u63 and u32 usages in structures to reduce struct size
- refactoring pick_cpu() for readability

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-18 18:11:49 +09:00
Changwoo Min
5165bf5a03 scx_lavd: tuning CPU frequency scaling
The required CPU performance (cpuperf) was set to 1024 (100%) when the
CPU utilization was 100%. When a sudden load spike happens, it makes the
system adapt slowly in the next interval.

The new scheme always reserves some headroom in advance, so it sets
cpuperf to 1024 when the CPU utilization reaches to 85%. This gives some
room to adapt in advance.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-18 18:11:49 +09:00
I Hsin Cheng
94e3616c02 scx_rusty: Refactor lookup operation for new_domc in task_set_domain
Modify the execution sequence before lookup operation for new_domc. If
new_dom_id == NO_DOM_FOUND, lookup operation for new_domc is definitely
going to fail so we don't have to wait until we found that new_domc is
NULL, clearing of cpumask and return operation should be done directly
in that case.

Plus we should avoid using try_lookup_dom_ctx outside the context of
lookup_dom_ctx, as it can keep the interface's consistency.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-06-18 12:58:17 +08:00
David Vernet
0184444285
Merge pull request #366 from sched-ext/task_set_domain_global
rusty: Make dom_xfer_task() a global prog
2024-06-17 14:43:45 -05:00
David Vernet
dfe0ffb312
Merge pull request #347 from sched-ext/rusty_cleanup
rusty: Clean up some logic in rusty
2024-06-17 14:26:53 -05:00
David Vernet
7985ee556e
rusty: Clean up dispatch logic
The rusty dispatch logic is a bit unnecessarily convoluted. Let's clean it up
so that we're just comparing dom ids rather than iterating over arrays nested
inside of pcpu context.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-17 14:24:30 -05:00
David Vernet
87aa86845d
rusty: Refactor + slightly improve wake_sync
Right now, the SCX_WAKE_SYNC logic in rusty is very primitive. We only check to
see if the waker CPU's runqueue is empty, and then migrate the wakee there if
so. We'll want to expand this to be more thorough, such as:

- Checking to see if prev_cpu and waker_cpu share the same LLC when determining
  where to migrate
- Check for whether SCX_WAKE_SYNC migration helps load imbalance between cores
- ...

Right now all of that code is just a big blob in the middle of
rusty_select_cpu(). Let's pull it into its own function to improve readability,
and also add some logic to stay on prev_cpu if it shares an LLC with the waker.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-17 14:24:29 -05:00
David Vernet
fed66fa571
rusty: Make dom_xfer_task() a global prog
It seems that task_set_domain() is nearly at the point where it can
cause the verifier to get confused and think that it's exceeding the
number of available instructions per program. I've seen this a number of
times when making small changes to task_set_domain(), and it's once
again happened @vax-r (I-Hsin Cheng) made a small cleanup change to
rusty in https://github.com/sched-ext/scx/pull/362.

To avoid this, let's just make dom_xfer_task() a separate global program
so that the verifier doens't have to worry about branch pruning, etc
depending on what the caller does. This should hopefully make
task_set_domain() (and its callers) much less brittle.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-17 14:22:26 -05:00
Tejun Heo
dde2942125 compat: Drop __COMPAT_scx_bpf_cpuperf_*()
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_cpuperf_*(). The open helper
macros now check the existence of scx_bpf_cpuperf_cap() and abort if not.
2024-06-16 06:16:53 -10:00
Tejun Heo
13e8388e1e compat: Drop __COMPAT_HAS_CPUMASKS
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_HAS_CPUMASKS(). The open helper macros
now check the existence of scx_bpf_nr_cpu_ids() and abort if not.
2024-06-16 06:12:06 -10:00
Tejun Heo
5b5e5be906 compat: Drop __COMPAT_SCX_KICK_IDLE
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_SCX_KICK_IDLE. The open helper macros
now check the existence of SCX_KICK_IDLE and abort if not.
2024-06-15 20:24:15 -10:00
Tejun Heo
7c9aedaefe compat: Drop __COMPAT_scx_bpf_switch_all()
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_switch_call(). The open helper
macros now check the existence of SCX_OPS_SWITCH_PARTIAL and abort if not.
2024-06-15 20:03:37 -10:00
Tejun Heo
dd6255a601
Merge pull request #359 from sched-ext/htejun/cosmetic
common.bpf.h: Cosmetic changes
2024-06-15 06:42:00 -10:00
Andrea Righi
cb20a6f136 scx_rlfifo: dispatch all tasks on the first CPU available
With commit 786ec0c0 ("scx_rlfifo: schedule all tasks in user-space")
all the scheduling decisions are now happening in user-space. This also
bypasses the built-in idle selection logic, delegating the CPU selection
for each task to the user-space scheduler.

The easiest way to distribute tasks across the available CPUs is to
simply allow to dispatch them on the first CPU available.

In this way the scheduler becomes usable in practical scenarios and at
the same time it also maintains its simplicity.

This allows to spread all tasks across all the available CPUs

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-15 16:13:53 +02:00
Andrea Righi
786ec0c04a scx_rlfifo: schedule all tasks in user-space
Disable all the BPF optimization shortcuts by default and force all
tasks to be processed by the user-space scheduler.

Given that the primary goal of this scheduler is to offer a
straightforward and intuitive example for experimental purposes, this
change simplifies the process for individuals looking to experiment,
allowing them to apply changes to user-space code and quickly observe
the effects, without dealing with any in-kernel optimizations.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-15 16:07:39 +02:00
Andrea Righi
59f47d6659 scx_rlfifo: improve code readability
No functional change, just add some comments to better describe the
parameters used when initializing the main BpfScheduler object.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-15 16:05:28 +02:00
Tejun Heo
d7677e3e5c scx/common.bpf.h: Rename bpf_log2[l]() to u32/64_log2()
The bpf_ prefix is used for BPF API. Rename bpf_log2() to u32_log2() and
bpf_log2l() to u64_log2(). While at it, relocate them below compiler
directive helpers.
2024-06-14 15:22:39 -10:00
Andrea Righi
8c6fe540eb scx_rustland: prevent excessive starvation when system is congested
Keep track of the maximum vruntime among all tasks and flush them if the
difference between the maximum and minimum vruntime exceeds slice_ns.

This helps to prevent excessive starvation, as every task is guaranteed
to be dispatched within the slice_ns time limit.

Tested-by: Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-14 20:09:19 +02:00
Changwoo Min
94a39f419f scx_lavd: add the design of core compaction
The core compaction seems to work great in various hardware. Now it is
time to document its design.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-14 11:53:52 +09:00
Changwoo Min
5068d75bf3
Merge pull request #351 from multics69/lavd-power-v2
scx_lavd: improve CPU frequency scaling
2024-06-14 09:29:10 +09:00
Tejun Heo
a3342810c7
Merge pull request #352 from dschatzberg/mitosis
common: Add css iter forward declares
2024-06-13 06:50:06 -10:00
Dan Schatzberg
114e4b644b common: Add css iter forward declares
These are used in mitosis, but they belong in common code so other
schedulers can do css iteration.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-06-12 15:02:48 -07:00
Changwoo Min
747bf2a7d7 scx_lavd: add the design of CPU frequency scaling
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-13 01:42:19 +09:00
Changwoo Min
2e74b86b4a scx_lavd: logging cpu performance target
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-13 00:44:04 +09:00
Changwoo Min
e6348a11e9 scx_lavd: improve frequency scaling logic
The old logic for CPU frequency scaling is that the task's CPU
performance target (i.e., target CPU frequency) is checked every tick
interval and updated immediately. Indeed, it samples and updates a
performance target every tick interval. Ultimately, it fluctuates CPU
frequency every tick interval, resulting in less steady performance.

Now, we take a different strategy. The key idea is to increase the
frequency as soon as possible when a task starts running for quick
adoption to load spikes. However, if necessary, it decreases gradually
every tick interval to avoid frequency fluctuations.

In my testing, it shows more stable performance in many workloads
(games, compilation).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-12 23:40:40 +09:00
Changwoo Min
753f333c09 scx_lavd: refactoring do_update_sys_stat()
Originally, do_update_sys_stat() simply calculated the system-wide CPU
utilization. Over time, it has evolved to collect all kinds of
system-wide, periodic statistics for decision-making, so it has become
bulky. Now, it is time to refactor it for readability. This commit does
not contain functional changes other than refactoring.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-12 21:15:25 +09:00
Changwoo Min
9d129f0afa scx_lavd: rename LAVD_CPU_UTIL_INTERVAL_NS to LAVD_SYS_STAT_INTERVAL_NS
The periodic CPU utilization routine does a lot of other work now. So we
rename LAVD_CPU_UTIL_INTERVAL_NS to LAVD_SYS_STAT_INTERVAL_NS.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-12 20:06:17 +09:00
Changwoo Min
7046b47b9c scx_lavd: properly calculate task's runtime after suspend/resume
When a device is suspended and resumed, the suspended duration is added
up to a task's runtime if the task was running on the CPU. After the
resume, the task's runtime is incorrectly long and the scheduler starts
to recognize the system is under heavy load. To avoid such problem, the
suspended duration is measured and substracted from the task's runtime.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-12 15:58:41 +09:00
Dan Schatzberg
b95cfb0772 mitosis: Fix build
The target wasn't dependent on the previous sched so building all
schedulers ended up not building scx_mitosis which broke the install
script.
2024-06-11 14:33:32 -07:00
Dan Schatzberg
9528d4603e
Merge pull request #339 from dschatzberg/mitosis
scheds: Add scx_mitosis scheduler
2024-06-11 16:50:25 -04:00
Dan Schatzberg
3b6e2dee20 scheds: Add scx_mitosis scheduler
scx_mitosis is a dynamic affinity scheduler which assigns cgroups to
Cells and Cells to discrete sets of CPUs. The number of cells is dynamic
as is the CPU assignment. BPF mostly just does vtime scheduling for each
cell, tracks load, and responds to reconfiguration from userspace.
Userspace makes decisions about how to assign cgroups to cells and cells
to cpus.

This is not yet a complete scheduler, much of the userspace logic is a
placeholder as I experiment with better logic. I also want to add richer
scheduling semantics to userspace, e.g. so that cells can do more
"soft-affinity" rather than the strict partitioning implemented
currently.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-06-11 10:34:53 -07:00
I Hsin Cheng
4e30bb9ccf scx_rusty: Elimate data races possibility for domain min_vruntime
READ_ONCE()/WRITE_ONCE() macros are added in commit 0932fde, we should
be able to utilize the macros to get around the possibility of data
races for domc->min_vruntime.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-06-11 10:57:03 +08:00
Tejun Heo
30f27d99d9
Merge pull request #340 from sched-ext/htejun/layered-updates
scx_layered: Improve yield, preemption and other behaviors
2024-06-10 11:27:44 -10:00
Tejun Heo
9ec3594b4f scx_layered: Several fixes to address David's review
- pick_idle_cpu() was putting idle_smtmask that it didn't acquire.

- layered_enqueue() was unnecessarily entering preemption path after finding
  an idle CPU.

- No need to test whether scx_bpf_get_idle_cpu/smtmask() return NULL. They
  never do.

- Relocate cctx->yielding test into keep_runinng() from its caller.
2024-06-10 11:23:37 -10:00
Tejun Heo
92317aa2f9 Use __always_inline uniformly
Instead of using __attribute__((always_inline)) use the __always_inline
macro provided by BPF.
2024-06-10 11:23:26 -10:00
Changwoo Min
472ab945b8
scx_lavd: core compaction for low power consumption (#338)
scx_lavd: core compaction for low power consumption

When system-wide CPU utilization is low, it is very likely all the CPUs
are running with very low utilization. That means all CPUs run with low
clock frequency thanks to dynamic frequency scaling and very frequently
go in and out from/to C-state. That results in low performance (i.e.,
low clock frequency) and high power consumption (i.e., frequent
P-/C-state transition).

The idea of *core compaction* is using less number of CPUs when
system-wide CPU utilization is low. The chosen cores (called "active
cores") will run in higher utilization and higher clock frequency, and
the rest of the cores (called "idle cores") will be in a C-state for a
much longer duration. Thus, the core compaction can achieve higher
performance with lower power consumption.

One potential problem of core compaction is latency spikes when all the
active cores are overloaded. A few techniques are incorporated to solve
this problem.

1) Limit the active CPU core's utilization below a certain limit (say 50%).

2) Do not use the core compaction when the system-wide utilization is
   moderate (say 50%).

3) Do not enforce the core compaction for kernel and pinned user-space
   tasks since they are manually optimized for performance.

In my experiments, under a wide range of system-wide CPU utilization
(5%—80%), the core compaction reduces 7-30% power consumption without
sacrificing average and 99p tail latency.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-08 09:25:27 +09:00
Tejun Heo
a165970ab9 scx_layered: Add migration statistic
Keep track of how frequent migrations are.
2024-06-07 11:49:39 -10:00
Tejun Heo
5b31d96c3d scx_layered: Implement "preempt_first" layer property
If set, tasks in the layer will try to preempt tasks in their previous CPUs
before trying to find idle CPUs.
2024-06-07 11:49:39 -10:00
Tejun Heo
ece3638664 scx_layered: Allow confined layers to preempt
There's no reason to restrict confined layers from preempting on the CPUs
that they are entitled to. Allow preemption for confined layers.
2024-06-07 11:49:39 -10:00