Remove the check for whether the high fallback DSQ has the highest
budget and aggressively consume from fallback DSQs instead. This is a
performance optimization that yields a small improvement when running
synthetic load tests.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Reduce the size of struct task_ctx from 3 cache lines to 2 cache lines
by dropping unnecessary fields and optimizing the struct layout.
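A minimal plain-C sketch of the general technique, with hypothetical
field names rather than the real task_ctx: group hot fields, order
members from largest to smallest alignment to avoid padding holes, and
assert the size at compile time.
```
/* Illustration only: hypothetical fields, not the actual task_ctx. */
#include <stdbool.h>
#include <stdint.h>

#define CACHELINE 64

/* Ordering members by decreasing alignment avoids padding holes, which
 * is what lets the struct shrink by a cache line once the unnecessary
 * fields are dropped. */
struct task_ctx_sketch {
	uint64_t	runtime_avg;	/* 8-byte members first */
	uint64_t	last_dsq_id;
	int32_t		last_cpu;	/* then 4-byte members */
	int32_t		layer_id;
	uint32_t	flags;
	bool		refresh_layer;	/* 1-byte members last */
	bool		all_cpus_allowed;
};

/* Catch accidental growth past two cache lines at compile time. */
_Static_assert(sizeof(struct task_ctx_sketch) <= 2 * CACHELINE,
	       "task_ctx_sketch must fit in two cache lines");
```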
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Use BPF PROG_RUN from userspace to update cpumasks rather than relying
on scheduler ticks. This should be a lower-overhead approach, as an
extra BPF program no longer needs to be called on every CPU during the
tick.
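As a hedged illustration (the scheduler's userspace is Rust and the
program name here is made up), driving a BPF syscall program on demand
via PROG_RUN looks roughly like this with libbpf:
```
/* Sketch: invoke a hypothetical SEC("syscall") refresh_cpumasks program
 * on demand instead of piggybacking cpumask updates on every tick. */
#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include <stdio.h>

static int run_cpumask_refresh(struct bpf_program *prog)
{
	LIBBPF_OPTS(bpf_test_run_opts, opts);
	int err;

	/* BPF_PROG_RUN executes the program once, right now, in process
	 * context -- no per-CPU work is scheduled on the tick path. */
	err = bpf_prog_test_run_opts(bpf_program__fd(prog), &opts);
	if (err || opts.retval)
		fprintf(stderr, "cpumask refresh failed: err=%d retval=%d\n",
			err, opts.retval);
	return err ?: opts.retval;
}
```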
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
In earlier kernels, the iterator variable wasn't trusted, making the
verifier choke on calling kfuncs on its dereferences. Work around this
by re-looking up the task by PID.
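A hedged BPF-side sketch of the workaround; it assumes the
bpf_task_from_pid()/bpf_task_release() kfuncs and stands in for the
actual iterator code:
```
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct task_struct *bpf_task_from_pid(s32 pid) __ksym;
void bpf_task_release(struct task_struct *p) __ksym;

/* 'p' comes from an iterator and is untrusted on older kernels, so it
 * can't be passed to kfuncs directly. Re-acquire a trusted reference
 * by PID and release it when done. */
static void use_iter_task(struct task_struct *p)
{
	struct task_struct *trusted = bpf_task_from_pid(p->pid);

	if (!trusted)
		return;

	/* ... call kfuncs that require a trusted task pointer ... */

	bpf_task_release(trusted);
}
```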
When --disable-topology is specified, the topology information (e.g. llc
map) supplied to the BPF code disagrees with how the scheduler operates,
requiring code paths to be split unnecessarily and making things
error-prone (e.g. layer_dsq_id() returned the wrong value with
--disable-topology).
- Add Topology::with_flattened_llc_node() which creates a dummy topology
  with one llc and one node regardless of the underlying hardware, and
  make layered use it when --disable-topology is specified.
- Add explicit nr_llcs == 1 handling to layer_dsq_id() to generate better
  code when topology is disabled (see the sketch after this list) and
  remove the explicit disable_topology branches in the callers.
- Fix layer->cache_mask when a layer doesn't explicitly specify nodes and
drop the disable_topology branch in layered_dump().
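A sketch of the nr_llcs == 1 special case; names follow the description
above, but the actual layered code may differ:
```
#include "vmlinux.h"

/* Set from userspace at load time; 1 when topology is disabled
 * (flattened llc/node). */
const volatile u32 nr_llcs = 1;

static u64 layer_dsq_id(u32 layer_id, u32 llc_id)
{
	/* With a single llc the computation collapses, so callers no
	 * longer need separate disable_topology branches. */
	if (nr_llcs == 1)
		return layer_id;

	return layer_id * nr_llcs + llc_id;
}
```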
On 6.9 kernels the verifier is not able to track `struct bpf_cpumask`
pointers properly in nested structs. Move the cpumasks from the
`cached_cpus` struct back into the `task_ctx` struct so older versions
of the verifier can pass.
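A sketch of the layout change with illustrative field names; the point
is only where the __kptr cpumask fields live:
```
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct cached_cpus {
	u64 id;
	/* struct bpf_cpumask __kptr *mask;   old placement: the 6.9
	 * verifier loses track of kptrs nested this deep. */
};

struct task_ctx {
	struct cached_cpus cached_cpus;
	/* cpumask kptrs moved back to the top level of task_ctx so the
	 * older verifier can track them. */
	struct bpf_cpumask __kptr *layered_mask;
};
```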
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
- Remember hi_fallback_dsq_id for each CPU in cpu_ctx and use the remembered
  values (see the sketch after this list).
- Make antistall_scan() walk each hi fallback DSQ once instead of multiple
times through CPU iteration.
- Remove unused functions.
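A small illustrative sketch of the per-CPU caching (names approximate):
```
#include "vmlinux.h"

struct cpu_ctx {
	/* ... */
	u64 hi_fb_dsq_id;	/* hi fallback DSQ for this CPU's llc */
};

/* Resolved once when the CPU/llc association is set up ... */
static void cpu_ctx_init_hi_fb(struct cpu_ctx *cctx, u64 dsq_id)
{
	cctx->hi_fb_dsq_id = dsq_id;
}

/* ... so hot paths and antistall_scan() just read the remembered value
 * instead of recomputing it on every CPU iteration. */
static u64 cpu_hi_fb_dsq_id(struct cpu_ctx *cctx)
{
	return cctx->hi_fb_dsq_id;
}
```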
keep_running() and antistall_scan() were incorrectly assuming that
layer->index equals the DSQ ID. Fix them. Also, while at it, remove a
compile warning around a cpumask cast.
It's confusing to use tctx->last_cpu for making active choices as it makes
layered deviate from other schedulers unnecessarily. Use last_cpu only for
migration accounting in layered_running().
- In layered_enqueue(), layered_select_cpu() already returned prev_cpu for
  non-direct-dispatch cases, and the CPU the task is currently on should
  match tctx->last_cpu. Use task_cpu instead.
- In keep_running(), the current CPU always matches tctx->last_cpu. Always
use bpf_get_smp_processor_id().
A task may end up in a layer which doesn't have any CPUs that are allowed
for the task. They are accounted as affinity violations and put onto a
fallback DSQ. When antistall_set() is trying to find the CPU to run a
stalled DSQ, it ignores CPUs that are not in the first task's
layered_cpumask. This makes antistall skip stalling DSQs with
affinity-violating tasks at the front.
Consider all allowed CPUs for affinity-violating tasks. While at it,
combine the two if blocks that set antistall to improve readability.
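A sketch of the cpumask selection described above (the function and
parameter names are made up):
```
#include "vmlinux.h"

static const struct cpumask *
antistall_allowed_cpus(struct task_struct *p,
		       const struct cpumask *layered_cpumask,
		       bool affn_viol)
{
	/* An affinity-violating task may not be runnable on any CPU in
	 * its layer's cpumask, so fall back to all of the task's allowed
	 * CPUs when picking where to kick the stalled DSQ. */
	return affn_viol ? p->cpus_ptr : layered_cpumask;
}
```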
Instead of using a constant runtime value in the deadline calculation,
use the adjusted runtime value of a task. Since tasks' runtime values
follow a highly skewed distribution, convert it to a mildly skewed one
to avoid stalls. This resolves the audio-breaking issue in osu! under
heavy background workloads.
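The exact transform isn't spelled out here; as one hedged example, a
logarithmic compression turns a heavy-tailed runtime distribution into
a mildly skewed one before it feeds into the deadline:
```
#include <stdint.h>

static uint64_t log2_u64(uint64_t v)
{
	return v ? 63 - (uint64_t)__builtin_clzll(v) : 0;
}

/* Raw runtimes span many orders of magnitude; compressing them keeps a
 * few huge outliers from dominating the deadline calculation and
 * starving short, latency-sensitive tasks. */
static uint64_t adjusted_runtime(uint64_t runtime_ns)
{
	return log2_u64(runtime_ns + 1);
}
```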
Signed-off-by: Changwoo Min <changwoo@igalia.com>
consume_preempting() wasn't testing layer->preempt when
--local-llc-iterations is specified, ending up treating all layers as
preempting layers and often leading to HI fallback starvations under
saturation. Fix it.
layered_running() is calling scx_bpf_cpuperf_set() whenever a task of a
layer w/ cpuperf setting starts running which can be every task switch.
There's no reason to repeatedly call with the same value. Remember the last
value and call iff the new value is different.
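A sketch of the caching (field and helper names are illustrative):
```
#include "vmlinux.h"

void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym;

struct cpu_ctx {
	/* ... */
	u32 perf;	/* last value passed to scx_bpf_cpuperf_set() */
};

static void maybe_set_cpuperf(struct cpu_ctx *cctx, s32 cpu, u32 perf)
{
	if (cctx->perf == perf)
		return;		/* unchanged, skip the kfunc call */

	scx_bpf_cpuperf_set(cpu, perf);
	cctx->perf = perf;
}
```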
This reduces the bpftop reported CPU consumption of scx_bpf_cpuperf_set()
from ~1.2% to ~0.7% while running rd-hashd at full CPU saturation on Ryzen
3900x.
Introduce scx_flash (Fair Latency-Aware ScHeduler), a scheduler that
focuses on ensuring fairness among tasks and performance predictability.
This scheduler is introduced as a replacement for the "lowlatency" mode
in scx_bpfland, which was dropped in commit 78101e4 ("scx_bpfland:
drop lowlatency mode and the priority DSQ").
scx_flash operates based on an EDF (Earliest Deadline First) policy,
where each task is assigned a latency weight. This weight is adjusted
dynamically, influenced by the task's static weight and how often it
releases the CPU before its full assigned time slice is used: tasks that
release the CPU early receive a higher latency weight, granting them
higher priority over tasks that fully use their time slice.
The combination of dynamic latency weights and EDF scheduling ensures
responsive and stable performance, even in overcommitted systems, making
the scheduler particularly well-suited for latency-sensitive workloads,
such as multimedia or real-time audio processing.
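As a generic illustration of that mechanism (not scx_flash's actual code
or formula), a dynamic latency weight could be maintained and folded
into an EDF deadline roughly like this:
```
#include <stdint.h>

struct flash_stats {
	uint64_t static_weight;	/* from the task's nice level */
	uint64_t lat_weight;	/* dynamic, starts at static_weight */
};

/* Called when the task releases the CPU: boost the latency weight if it
 * stopped before using its full slice, decay it otherwise. */
static void update_lat_weight(struct flash_stats *s, uint64_t used_ns,
			      uint64_t slice_ns)
{
	if (used_ns < slice_ns)
		s->lat_weight += s->static_weight;
	else if (s->lat_weight > s->static_weight)
		s->lat_weight -= s->static_weight;
}

/* Higher latency weight => earlier deadline => dispatched sooner. */
static uint64_t task_deadline(uint64_t vtime, uint64_t slice_ns,
			      const struct flash_stats *s)
{
	return vtime + slice_ns * 1024 / s->lat_weight;
}
```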
Tested-by: Peter Jung <ptr1337@cachyos.org>
Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
- Cache llc and node masked cpumasks instead of calculating them each time.
  They're recalculated only when the task has migrated across the matching
  boundary and recalculation is necessary (see the sketch after this list).
- llc and node masks should be taken from the wakee's previous CPU not the
waker's CPU.
- idle_smtmask is already considered by scx_bpf_pick_idle_cpu(). No need to
  AND it in manually.
- big_cpumask code updated to be simpler. This should also be converted to
use cached cpumask. big_cpumask portion is not tested.
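A rough sketch of the cached-cpumask logic (the field names and the
cpu_to_llc_id() helper are hypothetical):
```
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct task_ctx {
	/* ... */
	struct bpf_cpumask __kptr *llc_mask;	/* layer mask & llc mask */
	u32 llc_id;		/* llc the cached mask was built for */
};

static void maybe_refresh_llc_mask(struct task_ctx *tctx, s32 prev_cpu)
{
	u32 llc_id = cpu_to_llc_id(prev_cpu);	/* hypothetical helper */

	/* Note: keyed off the wakee's previous CPU, not the waker's. */
	if (tctx->llc_mask && tctx->llc_id == llc_id)
		return;		/* still valid, skip recomputation */

	/* ... rebuild llc_mask = layer cpumask & llc cpumask ... */
	tctx->llc_id = llc_id;
}
```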
This brings down CPU utilization of select_cpu() from ~2.7% to ~1.7% while
running rd-hashd at saturation on Ryzen 3900x.
We duplicate the definition of most fields in every layer kind. This makes
reading the config harder than it needs to be, and turns every simple read of a
common field into a `match` statement that is largely redundant.
Utilise `#[serde(flatten)]` to embed a common struct into each of the LayerKind
variants. Rather than matching on the type, this can be directly accessed with
`.kind.common()` and `.kind.common_mut()`. Alternatively, you can extend
existing matches to match out the common parts as demonstrated in this diff
where necessary.
There is some further code cleanup that can be done in the changed read sites,
but I wanted to make it clear that this change doesn't change behaviour, so I
tried to make these changes in the least obtrusive way.
Drive-by: fix the formatting of the lazy_static section in main.rs by using
`lazy_static::lazy_static`.
Test plan:
```
# main
$ cargo build --release && target/release/scx_layered --example /tmp/test_old.json
# this change
$ cargo build --release && target/release/scx_layered --example /tmp/test_new.json
$ diff /tmp/test_{old,new}.json
# no diff
```