We need a layer of indirection between the stats collection and their
output destinations. Currently, stats are only printed to stdout. Our
goal is to integrate with various telemetry systems such as Prometheus,
StatsD, and custom metric backends like those used by Meta and Netflix.
Importantly, adding a new backend should not require changes to the
existing stats code.
This patch introduces the `metrics` [1] crate, which provides a
framework for defining metrics and publishing them to different
backends.
The initial implementation includes the `dispatched_tasks_count`
metric, tagged with `type`. This metric increments every time a task is
dispatched, emitting the raw count instead of a percentage. A monotonic
counter is the most suitable metric type for this use case, as
percentages can be calculated at query time if needed. Existing logged
metrics continue to print percentages and remain unchanged.
A new flag, `--enable-prometheus` (disabled by default), has been added.
When enabled, it starts a Prometheus endpoint on port 9000. This
endpoint allows metrics to be charted in Prometheus or Grafana
dashboards.
Future changes will migrate additional stats to this framework and add
support for other backends.
[1] https://metrics.rs/
Signed-off-by: Jose Fernandez <josef@netflix.com>
The rusty dispatch logic is unnecessarily convoluted. Let's clean it up
so that we're just comparing dom ids rather than iterating over arrays nested
inside of the pcpu context.
Signed-off-by: David Vernet <void@manifault.com>
Right now, the SCX_WAKE_SYNC logic in rusty is very primitive. We only check to
see if the waker CPU's runqueue is empty, and then migrate the wakee there if
so. We'll want to expand this to be more thorough, such as:
- Checking whether prev_cpu and waker_cpu share the same LLC when determining
where to migrate
- Checking whether SCX_WAKE_SYNC migration helps load imbalance between cores
- ...
Right now all of that code is just a big blob in the middle of
rusty_select_cpu(). Let's pull it into its own function to improve readability,
and also add some logic to stay on prev_cpu if it shares an LLC with the waker.
Signed-off-by: David Vernet <void@manifault.com>
It seems that task_set_domain() is nearly at the point where it can
cause the verifier to get confused and think that it's exceeding the
number of available instructions per program. I've seen this a number of
times when making small changes to task_set_domain(), and it happened
once again when @vax-r (I-Hsin Cheng) made a small cleanup change to
rusty in https://github.com/sched-ext/scx/pull/362.
To avoid this, let's just make dom_xfer_task() a separate global program
so that the verifier doesn't have to worry about branch pruning, etc.,
depending on what the caller does. This should hopefully make
task_set_domain() (and its callers) much less brittle.
Signed-off-by: David Vernet <void@manifault.com>
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_cpuperf_*(). The open helper
macros now check the existence of scx_bpf_cpuperf_cap() and abort if not.
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_HAS_CPUMASKS(). The open helper macros
now check the existence of scx_bpf_nr_cpu_ids() and abort if not.
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_SCX_KICK_IDLE. The open helper macros
now check the existence of SCX_KICK_IDLE and abort if not.
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_switch_call(). The open helper
macros now check the existence of SCX_OPS_SWITCH_PARTIAL and abort if not.
With commit 786ec0c0 ("scx_rlfifo: schedule all tasks in user-space")
all the scheduling decisions are now happening in user-space. This also
bypasses the built-in idle selection logic, delegating the CPU selection
for each task to the user-space scheduler.
The easiest way to distribute tasks across the available CPUs is to
simply dispatch each task on the first CPU available.
In this way the scheduler becomes usable in practical scenarios while
still maintaining its simplicity, and all tasks are spread across all
the available CPUs.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Disable all the BPF optimization shortcuts by default and force all
tasks to be processed by the user-space scheduler.
Given that the primary goal of this scheduler is to offer a
straightforward and intuitive example for experimental purposes, this
change simplifies the process for individuals looking to experiment,
allowing them to apply changes to user-space code and quickly observe
the effects, without dealing with any in-kernel optimizations.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
No functional change, just add some comments to better describe the
parameters used when initializing the main BpfScheduler object.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The bpf_ prefix is used for BPF API. Rename bpf_log2() to u32_log2() and
bpf_log2l() to u64_log2(). While at it, relocate them below compiler
directive helpers.
Keep track of the maximum vruntime among all tasks and flush them if the
difference between the maximum and minimum vruntime exceeds slice_ns.
This helps to prevent excessive starvation, as every task is guaranteed
to be dispatched within the slice_ns time limit.
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
These are used in mitosis, but they belong in common code so other
schedulers can do css iteration.
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
The old logic for CPU frequency scaling samples a task's CPU
performance target (i.e., target CPU frequency) every tick interval and
applies it immediately. As a result, the CPU frequency fluctuates every
tick interval, resulting in less steady performance.
Now, we take a different strategy. The key idea is to increase the
frequency as soon as possible when a task starts running, for quick
adaptation to load spikes. However, when a decrease is needed, the
frequency is lowered gradually every tick interval to avoid frequency
fluctuations.
In my testing, it shows more stable performance in many workloads
(games, compilation).
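The raise-fast, lower-gradually policy described above can be sketched
roughly as follows. This is only an illustration, not the scx_lavd code:
the function names, the 0..1024 performance scale, and the fixed
per-tick `step` are assumptions made for the example.

```python
def on_task_running(current: int, demand: int) -> int:
    """When a task starts running, jump straight to a higher target."""
    # Never lower the target here; lowering is left to the tick path.
    return max(current, demand)


def on_tick(current: int, demand: int, step: int) -> int:
    """On each tick, lower the target by at most `step` toward the demand."""
    if demand < current:
        return max(demand, current - step)
    # Assumption: the tick path never raises the target.
    return current
```

For example, with a target of 900 and a demand of 400, successive ticks
with `step=100` would give 800, 700, ... rather than dropping to 400 at
once.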
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Originally, do_update_sys_stat() simply calculated the system-wide CPU
utilization. Over time, it has evolved to collect all kinds of
system-wide, periodic statistics for decision-making, so it has become
bulky. Now, it is time to refactor it for readability. This commit does
not contain functional changes other than refactoring.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The periodic CPU utilization routine does a lot of other work now. So we
rename LAVD_CPU_UTIL_INTERVAL_NS to LAVD_SYS_STAT_INTERVAL_NS.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a device is suspended and resumed, the suspended duration is added
to a task's runtime if the task was running on the CPU. After the
resume, the task's runtime is incorrectly long and the scheduler
concludes that the system is under heavy load. To avoid this problem, the
suspended duration is measured and subtracted from the task's runtime.
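A minimal sketch of the accounting fix, assuming nanosecond timestamps
and a hypothetical helper name (the actual scx_lavd bookkeeping is not
shown in this message):

```python
def task_runtime_ns(start_ns: int, stop_ns: int, suspended_ns: int) -> int:
    """Runtime of a task between start and stop, excluding suspend time."""
    elapsed = stop_ns - start_ns
    # Clamp at zero in case the measured suspend time exceeds the interval.
    return max(0, elapsed - suspended_ns)
```

A task that ran for 5 us around a 10 ms suspend is then charged 5 us
instead of roughly 10 ms.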
Signed-off-by: Changwoo Min <changwoo@igalia.com>
scx_mitosis is a dynamic affinity scheduler which assigns cgroups to
Cells and Cells to discrete sets of CPUs. The number of cells is dynamic
as is the CPU assignment. BPF mostly just does vtime scheduling for each
cell, tracks load, and responds to reconfiguration from userspace.
Userspace makes decisions about how to assign cgroups to cells and cells
to cpus.
This is not yet a complete scheduler; much of the userspace logic is a
placeholder as I experiment with better logic. I also want to add richer
scheduling semantics to userspace, e.g. so that cells can do more
"soft-affinity" rather than the strict partitioning implemented
currently.
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
The READ_ONCE()/WRITE_ONCE() macros were added in commit 0932fde. We should
be able to use them to avoid possible data races on
domc->min_vruntime.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
- pick_idle_cpu() was putting idle_smtmask that it didn't acquire.
- layered_enqueue() was unnecessarily entering preemption path after finding
an idle CPU.
- No need to test whether scx_bpf_get_idle_cpu/smtmask() return NULL. They
never do.
- Relocate cctx->yielding test into keep_running() from its caller.
scx_lavd: core compaction for low power consumption
When system-wide CPU utilization is low, it is very likely that all the
CPUs are running with very low utilization. That means all CPUs run at a
low clock frequency thanks to dynamic frequency scaling and very
frequently enter and exit C-states. That results in low performance
(i.e., low clock frequency) and high power consumption (i.e., frequent
P-/C-state transitions).
The idea of *core compaction* is to use fewer CPUs when system-wide CPU
utilization is low. The chosen cores (called "active cores") will run at
higher utilization and a higher clock frequency, and the rest of the
cores (called "idle cores") will stay in a C-state for a much longer
duration. Thus, core compaction can achieve higher performance with
lower power consumption.
One potential problem of core compaction is latency spikes when all the
active cores are overloaded. A few techniques are incorporated to solve
this problem.
1) Limit the active CPU core's utilization below a certain limit (say 50%).
2) Do not use the core compaction when the system-wide utilization is
moderate (say 50%).
3) Do not enforce the core compaction for kernel and pinned user-space
tasks since they are manually optimized for performance.
In my experiments, under a wide range of system-wide CPU utilization
(5%-80%), core compaction reduces power consumption by 7-30% without
sacrificing average and 99th-percentile tail latency.
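To make the three techniques above concrete, here is a rough sketch of
how the number of active cores could be derived. The function names and
the sizing formula are assumptions made for illustration; only the 50%
figures come from the "say 50%" examples above, and the actual scx_lavd
logic may differ.

```python
import math


def nr_active_cores(nr_cpus: int, sys_util: float,
                    per_core_limit: float = 0.5,
                    compaction_cutoff: float = 0.5) -> int:
    """Pick how many cores stay active for a system-wide utilization (0..1)."""
    # Technique 2: no compaction once system-wide utilization is moderate.
    if sys_util >= compaction_cutoff:
        return nr_cpus
    # Technique 1: keep each active core below per_core_limit utilization.
    needed = math.ceil(nr_cpus * sys_util / per_core_limit)
    return min(nr_cpus, max(1, needed))


def honors_compaction(is_kernel_task: bool, is_pinned: bool) -> bool:
    """Technique 3: kernel and pinned user-space tasks are left alone."""
    return not (is_kernel_task or is_pinned)
```

With 16 CPUs at 10% system-wide utilization, this sketch keeps 4 cores
active, each running at roughly 40% utilization.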
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Currently, when preempting, searching for the candidate CPU always starts
from the RR preemption cursor. Let's first try the previous CPU the
preempting task was on as that may have some locality benefits.
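The resulting search order can be sketched as below. The function name
is hypothetical, and skipping prev_cpu during the round-robin scan is an
assumption made to keep the example free of duplicates.

```python
from typing import Iterator


def preempt_candidates(prev_cpu: int, cursor: int, nr_cpus: int) -> Iterator[int]:
    """Yield candidate CPUs: the previous CPU first, then RR from the cursor."""
    yield prev_cpu
    for i in range(nr_cpus):
        cpu = (cursor + i) % nr_cpus
        if cpu != prev_cpu:  # already tried first
            yield cpu
```

With 8 CPUs, prev_cpu 5 and the cursor at 2, the candidates are tried in
the order 5, 2, 3, 4, 6, 7, 0, 1.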
When a task is being enqueued outside the wakeup path, ops.select_cpu() isn't
called, so we can end up in a situation where a newly enqueued task keeps
waiting in one of the DSQs while there are idle CPUs. Factor out the idle CPU
selection path into pick_idle_cpu() and call it from the enqueue path in
such cases. This problem is shared across schedulers and likely needs a more
generic solution in the future.
yield(2) currently gives up the entire slice. Add "yield_ignore" layer
parameter which can modulate the magnitude of yielding. When 1.0, yields are
completely ignored. When 0.5, only half of the full slice is given up, and
so on.
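One way to read the yield_ignore semantics described above is sketched
below. The helper name and the clamping at zero are assumptions for the
example, not the scx_layered implementation.

```python
def slice_after_yield(remaining_ns: int, slice_ns: int, yield_ignore: float) -> int:
    """Remaining slice after a yield, scaled by the yield_ignore parameter."""
    if not 0.0 <= yield_ignore <= 1.0:
        raise ValueError("yield_ignore must be within [0.0, 1.0]")
    # 1.0 gives up nothing, 0.5 gives up half a full slice, 0.0 a full slice.
    given_up = int(slice_ns * (1.0 - yield_ignore))
    return max(0, remaining_ns - given_up)
```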
Currently, a task which yields is treated the same as a task which has used
up its slice. As the budget charged to a task is calculated from wall clock
time, a repeatedly yielding task can stay at the top of the queue for quite
a while, hogging the CPU and spiking the number of scheduling events.
Let's add explicit yield support. A yielding task is now always charged the
full slice and not allowed to keep running on the same CPU.
The keep_running path relies on the implicit last task enqueue which makes
the statistics a bit difficult to track. Let's make the enqueue path
comprehensive:
- Set SCX_OPS_ENQ_LAST and handle the last runnable task enqueue explicitly.
- Implement layered_cpu_release() to re-enqueue tasks from a CPU preempted
by a higher pri sched class and handle the re-enqueued tasks explicitly in
layered_enqueue().
- Add more statistics to track all enqueue operations.
When a task exhausts its slice, layered currently doesn't make any effort to
keep it on the same CPU. It dispatches the next task to run and then
enqueues the running one. This leads to suboptimal behavior. For example,
when this happens to a task in a preempting layer, the task will most likely
find an idle CPU or a task to preempt and then migrate there, causing a
completely unnecessary migration.
This patch makes layered_dispatch() test whether the current task should keep
running on the CPU and then skip dispatching to keep the task running. This
behavior depends on the implicit local DSQ enqueue mechanism which triggers
when there are no other tasks to run.
- scx_utils: Replace kfunc_exists() with ksym_exists() which doesn't care
about the type of the symbol.
- scx_layered: Fix load failure on kernels >= v6.10-rc due to
scheduler_tick() -> sched_tick rename. Attach the tick fentry function to
either scheduler_tick() or sched_tick().
Never assign a time slice longer than the default time slice, which
serves as the upper limit.
This seems to prevent potential stall conditions (reported by the
CachyOS community) when running CPU-intensive workloads, such as:
[ 68.062813] sched_ext: BPF scheduler "rustland" errored, disabling
[ 68.062831] sched_ext: runnable task stall (ollama_llama_se[3312] failed to run for 5.180s)
[ 68.062832] scx_watchdog_workfn+0x154/0x1e0
[ 68.062837] process_one_work+0x18e/0x350
[ 68.062839] worker_thread+0x2fa/0x490
[ 68.062841] kthread+0xd2/0x100
[ 68.062842] ret_from_fork+0x34/0x50
[ 68.062844] ret_from_fork_asm+0x1a/0x30
Fixes: 6f4cd853 ("scx_rustland: introduce virtual time slice")
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Overview
========
Currently, a task's time slice is determined based on the total number
of tasks waiting to be scheduled: the more overloaded the system, the
shorter the time slice.
This approach can help to reduce the average wait time of all tasks,
allowing them to progress more slowly, but uniformly, thus providing a
smoother overall system performance.
However, under heavy system load, this approach can lead to very short
time slices distributed among all tasks, causing excessive context
switches that can badly affect soft real-time workloads.
Moreover, the scheduler tends to operate in a bursty manner (tasks are
queued and dispatched in bursts). This can also result in fluctuations
of longer and shorter time slices, depending on the number of tasks
still waiting in the scheduler's queue.
Such behavior can also negatively impact soft real-time workloads,
such as real-time audio processing.
Virtual time slice
==================
To mitigate this problem, introduce the concept of virtual time slice:
the idea is to evaluate the optimal time slice of a task, considering
the vruntime as a deadline for the task to complete its work before
releasing the CPU.
This is accomplished by calculating the difference between the task's
vruntime and the global current vruntime and using this value as the task
time slice:
task_slice = task_vruntime - min_vruntime
In this way, tasks that "promise" to release the CPU quickly (based on
their previous work pattern) get a much higher priority (due to
vruntime-based scheduling and the additional priority boost for being
classified as interactive), but they are also given a shorter time slice
to complete their work and fulfill their promise of rapidity.
At the same time tasks that are more CPU-intensive get de-prioritized,
but they will tend to have a longer time slice available, reducing in
this way the amount of context switches that can negatively affect their
performance.
In conclusion, latency-sensitive tasks get a high priority and a short
time slice (and they can preempt other tasks), CPU-intensive tasks get
low priority and a long time slice.
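As a small sketch of the formula above (the function name and the clamp
at zero are assumptions for illustration; units are whatever the
vruntime is expressed in):

```python
def virtual_time_slice(task_vruntime: int, min_vruntime: int) -> int:
    """task_slice = task_vruntime - min_vruntime, never negative."""
    return max(0, task_vruntime - min_vruntime)
```

A task whose vruntime is only slightly ahead of min_vruntime (an
interactive task) gets a short slice, while a task far ahead of it (a
CPU-intensive task) gets a long one.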
Example
=======
Let's consider the following theoretical scenario:
task | time
-----+-----
A    | 1
B    | 3
C    | 6
D    | 6
In this case task A represents a short interactive task, task C and D
are CPU-intensive tasks and task B is mainly interactive, but it also
requires some CPU time.
With a uniform time slice, scaled based on the amount of tasks, the
scheduling looks like this (assuming the time slice is 2):
A B B C C D D A B C C D D C C D D
 |   |   |   | | |   |   |   |
 `---`---`---`-`-`---`---`---`----> 9 context switches
With the virtual time slice the scheduling changes to this:
A B B C C C D A B C C C D D D D D
 |   |     | | | |     |
 `---`-----`-`-`-`-----`----------> 7 context switches
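The context switch counts in the two timelines above can be checked with
a few lines of code. This is just a counting helper written for this
example; a switch is counted whenever two adjacent slots run different
tasks.

```python
def count_context_switches(timeline: str) -> int:
    """Count adjacent slots in a space-separated timeline that differ."""
    slots = timeline.split()
    return sum(1 for prev, cur in zip(slots, slots[1:]) if prev != cur)


UNIFORM_SLICE = "A B B C C D D A B C C D D C C D D"
VIRTUAL_SLICE = "A B B C C C D A B C C C D D D D D"
```

Here `count_context_switches(UNIFORM_SLICE)` gives 9 and
`count_context_switches(VIRTUAL_SLICE)` gives 7.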
In the latter scenario, tasks do not receive the same time slice scaled
by the total number of tasks waiting to be scheduled. Instead, their
time slice is adjusted based on their previous CPU usage. Tasks that
used more CPU time are given longer slices and their processing time
tends to be packed together, reducing the amount of context switches.
Meanwhile, latency-sensitive tasks can still be processed as soon as
they need to, because they get a higher priority and they can preempt
other tasks. However, they will get a short time slice, so tasks that
were incorrectly classified as interactive will still be forced to
release the CPU quickly.
Experimental results
====================
This patch has been tested on an AMD Ryzen 7 5800X 8-Core Processor
(16 threads with SMT), 16GB RAM, NVIDIA GeForce RTX 3070.
The test case involves the usual benchmark of playing a video game while
simultaneously overloading the system with a parallel kernel build
(`make -j32`).
The average frames per second (fps) reported by Steam is used as a
metric for measuring system responsiveness (the higher the better):
Game                       | before  | after   | delta  |
---------------------------+---------+---------+--------+
Baldur's Gate 3            |  40 fps |  48 fps | +20.0% |
Counter-Strike 2           |   8 fps |  15 fps | +87.5% |
Cyberpunk 2077             |  41 fps |  46 fps | +12.2% |
Terraria                   |  98 fps | 108 fps | +10.2% |
Team Fortress 2            |  81 fps |  92 fps | +13.6% |
WebGL demo (firefox) [1]   |  32 fps |  42 fps | +31.2% |
---------------------------+---------+---------+--------+
Apart from the massive boost with Counter-Strike 2 (that should be taken
with a grain of salt, considering the overall poor performance in both
cases), the virtual time slice seems to systematically provide a boost
in responsiveness of around +10-20% fps.
It also seems to largely prevent audio crackling issues when the system
is massively overloaded: no audio crackling was detected during the
entire run of these tests with the virtual deadline change
applied.
[1] https://webglsamples.org/aquarium/aquarium.html
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Make restart handling with user_exit_info simpler and use the load and
report macros consistently across the rust schedulers. This makes all
schedulers automatically handle auto restarts from CPU hotplug events.
Note that this is necessary even for scx_lavd, which has CPU hotplug
operations, because CPU hotplug operations which take place between skel
open and scheduler init can still trigger a restart.
In cpumask_intersects_domain(), we check whether a given cpumask has any
CPUs in common with the specified domain by looking at the const, static
dom_cpumasks map. This map is only really necessary when creating the
domain struct bpf_cpumask objects at scheduler load time. After that, we
can just use the actual struct bpf_cpumask object embedded in the domain
context. Let's use that and cpumask kfuncs instead.
This allows rusty to load with
https://github.com/sched-ext/sched_ext/pull/216.
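Conceptually, the check becomes a plain mask intersection against the
cpumask stored in the domain context. The sketch below models cpumasks
as integer bitmasks; the `DomainCtx` class and helper names are
hypothetical stand-ins, not the rusty BPF code.

```python
from dataclasses import dataclass


@dataclass
class DomainCtx:
    # Bit N set means CPU N belongs to the domain. Built once at load time.
    cpumask: int


def cpus_to_mask(cpus) -> int:
    """Build a bitmask from an iterable of CPU ids."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def cpumask_intersects_domain(cpumask: int, domc: DomainCtx) -> bool:
    """True if the given mask shares at least one CPU with the domain."""
    return (cpumask & domc.cpumask) != 0
```

For a domain covering CPUs 0-3, a task allowed on CPUs 3 and 4
intersects the domain, while a task allowed only on CPUs 4-7 does not.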
Signed-off-by: David Vernet <void@manifault.com>