Commit Graph

1144 Commits

Author SHA1 Message Date
Tejun Heo
77eec19792
Merge pull request #929 from sched-ext/htejun/layered-updates
scx_layered: Perf improvements and a bug fix
2024-11-18 17:41:40 +00:00
Andrea Righi
5b4b6df5e4
Merge branch 'main' into scx-fair 2024-11-18 07:42:09 +01:00
Tejun Heo
56e0dae81d scx_layered: Fix linter disagreement 2024-11-17 06:03:30 -10:00
Tejun Heo
93a0bc9969 scx_layered: Fix consume_preempting() when --local-llc-iteration
consume_preempting() wasn't testing layer->preempt when
--local-llc-iterations was set, ending up treating all layers as preempting
layers and often leading to HI fallback starvations under saturation. Fix
it.
2024-11-17 05:54:03 -10:00
Tejun Heo
51d4945d69 scx_layered: Don't call scx_bpf_cpuperf_set() unnecessarily
layered_running() calls scx_bpf_cpuperf_set() whenever a task of a layer w/
a cpuperf setting starts running, which can be on every task switch.
There's no reason to repeatedly call it with the same value. Remember the
last value and call it iff the new value is different.
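
For illustration, a minimal sketch of this caching, assuming the usual scx
BPF includes; MAX_CPUS and the cache array are hypothetical, not the actual
scx_layered code:

```
#define MAX_CPUS 512			/* hypothetical bound */

static u32 last_cpuperf[MAX_CPUS];	/* hypothetical per-CPU cache */

static void maybe_set_cpuperf(s32 cpu, u32 perf)
{
	if (cpu < 0 || cpu >= MAX_CPUS)
		return;
	if (last_cpuperf[cpu] == perf)
		return;			/* same as last value, skip the call */
	last_cpuperf[cpu] = perf;
	scx_bpf_cpuperf_set(cpu, perf);	/* real scx kfunc */
}
```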

This reduces the bpftop reported CPU consumption of scx_bpf_cpuperf_set()
from ~1.2% to ~0.7% while running rd-hashd at full CPU saturation on Ryzen
3900x.
2024-11-16 05:45:44 -10:00
Andrea Righi
678b10133d scheds: introduce scx_flash
Introduce scx_flash (Fair Latency-Aware ScHeduler), a scheduler that
focuses on ensuring fairness among tasks and performance predictability.

This scheduler is introduced as a replacement for the "lowlatency" mode
in scx_bpfland, which was dropped in commit 78101e4 ("scx_bpfland:
drop lowlatency mode and the priority DSQ").

scx_flash operates based on an EDF (Earliest Deadline First) policy,
where each task is assigned a latency weight. This weight is adjusted
dynamically, influenced by the task's static weight and how often it
releases the CPU before its full assigned time slice is used: tasks that
release the CPU early receive a higher latency weight, granting them
higher priority over tasks that fully use their time slice.
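
A rough sketch of the idea (not scx_flash's actual code; names, types and
scaling are hypothetical, assuming kernel-style u64):

```
#define SLICE_NS 4000000ULL	/* assumed base slice: 4ms */

struct task_stats {
	u64 vtime;		/* task's virtual time */
	u64 lat_weight;		/* dynamic latency weight, >= 1 */
};

/* Higher latency weight => earlier (smaller) deadline. */
static u64 task_deadline(const struct task_stats *ts)
{
	return ts->vtime + SLICE_NS / ts->lat_weight;
}
```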

The combination of dynamic latency weights and EDF scheduling ensures
responsive and stable performance, even in overcommitted systems, making
the scheduler particularly well-suited for latency-sensitive workloads,
such as multimedia or real-time audio processing.

Tested-by: Peter Jung <ptr1337@cachyos.org>
Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
2024-11-16 14:49:25 +01:00
Tejun Heo
75dd81e3e6 scx_layered: Improve topology aware select_cpu()
- Cache llc and node masked cpumasks instead of calculating them each time.
  They're recalculated only when the task has migrated across the matching
  boundary and recalculation is necessary (see the sketch after this list).

- llc and node masks should be taken from the wakee's previous CPU, not the
  waker's CPU.

- idle_smtmask is already considered by scx_bpf_pick_idle_cpu(). No need to
  AND it manually.

- The big_cpumask code is updated to be simpler. It should also be converted
  to use a cached cpumask. The big_cpumask portion is not tested.
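
A sketch of the caching pattern from the first point (cpu_to_llc_id()
exists in scx_layered; the context fields here are hypothetical):

```
struct task_ctx {
	s32 last_llc;				/* LLC the cache was built for */
	struct bpf_cpumask __kptr *llc_mask;	/* cached layer & LLC cpumask */
};

static void refresh_llc_mask(struct task_ctx *tctx, s32 prev_cpu)
{
	s32 llc = cpu_to_llc_id(prev_cpu);	/* wakee's previous CPU */

	if (tctx->last_llc == llc)
		return;		/* same LLC: reuse the cached cpumask */
	tctx->last_llc = llc;
	/* ... rebuild tctx->llc_mask as layer mask & LLC mask ... */
}
```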

This brings down CPU utilization of select_cpu() from ~2.7% to ~1.7% while
running rd-hashd at saturation on Ryzen 3900x.
2024-11-15 16:29:47 -10:00
Tejun Heo
2b52d172d4 scx_layered: Encapsulate per-task layered cpumask caching
and fix build warnings while at it. Maybe we should drop const from
cast_mask().
2024-11-15 14:30:03 -10:00
Tejun Heo
1293ae21fc scx_layered: Stat output format update
Rearrange things a bit so that lines are not too long.
2024-11-15 13:38:56 -10:00
Jake Hillion
d35d5271f5 layered: split out common parts of LayerKind
We duplicate the definition of most fields in every layer kind. This makes
reading the config harder than it needs to be, and turns every simple read of a
common field into a `match` statement that is largely redundant.

Utilise `#[serde(flatten)]` to embed a common struct into each of the LayerKind
variants. Rather than matching on the type, the common part can be accessed
directly with `.kind.common()` and `.kind.common_mut()`. Alternatively, you
can extend existing matches to match out the common parts, as demonstrated
in this diff where necessary.

There is some further code cleanup that could be done at the changed read
sites, but I wanted to make it clear that this change doesn't change
behaviour, so I tried to make these changes in the least obtrusive way.

Drive-by: fix the formatting of the lazy_static section in main.rs by using
`lazy_static::lazy_static`.

Test plan:
```
# main
$ cargo build --release && target/release/scx_layered --example /tmp/test_old.json
# this change
$ cargo build --release && target/release/scx_layered --example /tmp/test_new.json
$ diff /tmp/test_{old,new}.json
# no diff
```
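
For illustration only, the same structural idea in C terms (the actual
change uses Rust and `#[serde(flatten)]`; these names are made up):

```
struct layer_common {			/* fields shared by all kinds */
	unsigned long long slice_us;
	int preempt;
};

struct layer_kind {
	enum { CONFINED, GROUPED, OPEN } kind;
	struct layer_common common;	/* embedded once, not per variant */
	union {				/* variant-specific fields */
		struct { int grow_algo; } confined;
		struct { int exclusive; } grouped;
	};
};

/* Reads of common fields no longer need to match on the kind. */
static struct layer_common *layer_common(struct layer_kind *lk)
{
	return &lk->common;
}
```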
2024-11-15 21:57:22 +00:00
Daniel Hodges
1afb7d5835 scx_layered: Fix formatting
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-15 08:54:05 -08:00
Daniel Hodges
3a3a7d71ad
Merge branch 'main' into layered-dispatch-local 2024-11-14 16:10:12 -05:00
Daniel Hodges
4fc0509178 scx_layered: Add flag to control llc iteration on dispatch
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-13 12:43:45 -08:00
Daniel Hodges
0096c0632b scx_layered: Fix cost accounting for dsqs
Fix cost accounting for fallback DSQs on refresh so that DSQ budgets
get refilled appropriately. Add helper functions for converting between
a DSQ id and an LLC budget id. During preemption, a layer should check
whether it is attempting to preempt a layer with more budget, and only
preempt if the preempting layer has more budget.
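
Such conversion helpers might look like this sketch (the base value matches
the fallback DSQ ids shown in dumps, but the actual encoding may differ):

```
#define FALLBACK_DSQ_BASE 1024ULL	/* assumed: one fallback DSQ per LLC */

static u64 llc_to_fallback_dsq(u32 llc_id)
{
	return FALLBACK_DSQ_BASE + llc_id;
}

static u32 fallback_dsq_to_llc(u64 dsq_id)
{
	return (u32)(dsq_id - FALLBACK_DSQ_BASE);
}
```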

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-13 07:23:53 -08:00
Daniel Hodges
72f21dba06
Merge pull request #922 from hodgesds/layered-cost-dump-fixes
scx_layered: Fix dump format
2024-11-12 18:28:31 +00:00
Daniel Hodges
f7009f7960 scx_layered: Fix dump format
Fix a small bug where incorrect per-CPU costs were being dumped. The
output format should now appropriately match the per-CPU costs. The
following dump shows the correct format:

    HI_FALLBACK[1024] nr_queued=46 -25755ms
    HI_FALLBACK[1025] nr_queued=43 -25947ms
    LO_FALLBACK nr_queued=0 -0ms
    COST GLOBAL[0][random] budget=16791955959896739
    capacity=16791955959896739
    COST GLOBAL[1][hodgesd] budget=16791955959896739
    capacity=16791955959896739
    COST GLOBAL[2][stress-ng] budget=43243243243243243
    capacity=43243243243243243
    COST GLOBAL[3][normal] budget=33583911919793478
    capacity=33583911919793478
    COST FALLBACK[1024][0] budget=16791955959896739
    capacity=16791955959896739
    COST FALLBACK[1025][1] budget=16791955959896739
    capacity=16791955959896739
    COST CPU[0][0][random] budget=5405405405405405 capacity=5405405405405405
    COST CPU[0][1][hodgesd] budget=2702702694605435
    capacity=2702702702702702
    COST CPU[0][2][stress-ng] budget=540514231324919
    capacity=540540540540540
    COST CPU[0][3][normal] budget=5405405342325615 capacity=5405405405405405
    COST CPU[0]FALLBACK[0][1024] budget=0 capacity=5405405405405405
    COST CPU[0]FALLBACK[1][1025] budget=1 capacity=2702702694605435
    COST CPU[1][0][random] budget=5405405405405405 capacity=5405405405405405
    COST CPU[1][1][hodgesd] budget=2702702675501951
    capacity=2702702702702702
    COST CPU[1][2][stress-ng] budget=540514250569731
    capacity=540540540540540

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-12 10:22:17 -08:00
Daniel Hodges
ff15f257be scx_layered: Fix formatting
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-12 10:09:22 -08:00
Daniel Hodges
673316827b
Merge pull request #918 from hodgesds/layered-slice-helper
scx_layered: Add helper for layer slice duration
2024-11-11 18:24:43 +00:00
Daniel Hodges
775d09ae1f scx_layered: Consume from local LLCs for dispatch
When dispatching, consume from DSQs in the local LLC first before trying
remote DSQs. This should still be fair, as the layer iteration order will
be maintained.
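
A sketch of the iteration order (scx_bpf_consume() is the real kfunc;
layer_dsq_id() and the bounds are hypothetical stand-ins):

```
static bool consume_layer(u32 layer_id, u32 my_llc, u32 nr_llcs)
{
	u32 llc;

	/* Local LLC first: cheaper migration, better cache locality. */
	if (scx_bpf_consume(layer_dsq_id(layer_id, my_llc)))
		return true;

	/* Then remote LLCs, preserving the layer iteration order. */
	for (llc = 0; llc < nr_llcs; llc++) {
		if (llc == my_llc)
			continue;
		if (scx_bpf_consume(layer_dsq_id(layer_id, llc)))
			return true;
	}
	return false;
}
```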

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-11 09:22:03 -08:00
Daniel Hodges
b2505e74df
Merge branch 'main' into layered-consume-fix 2024-11-11 11:29:43 -05:00
Daniel Hodges
1ed387d7f3 scx_layered: Fix error in dispatch consumption
Fix a bug in consume_non_open() where it improperly returns 0 when the DSQ
is not consumed.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-11 08:19:54 -08:00
Daniel Hodges
cad3413886 scx_layered: Add helper for layer slice duration
Add a helper that returns the appropriate slice duration for a layer and
replace various instances where the slice value was being recalculated.
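
The helper might look roughly like this (field and variable names assumed,
not the actual scx_layered code):

```
static u64 layer_slice_ns(const struct layer *layer)
{
	/* Use the layer's configured slice if set, else the default. */
	return layer->slice_ns ? layer->slice_ns : default_slice_ns;
}
```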

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-11 06:11:32 -08:00
Pat Somaru
89f4aa1351
scx_layered: add antistall
add timer-based antistall to scx_layered, plus new flags to
enable/disable it and to specify the seconds of delay before
it turns on.

also update the ci config to make sure this verifies/runs.
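
a rough sketch of the timer plumbing (the bpf_timer API is real; the stall
detection itself is elided and the names are hypothetical):

```
#define NSEC_PER_SEC 1000000000ULL

struct antistall_timer {
	struct bpf_timer timer;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, struct antistall_timer);
} antistall_timer_map SEC(".maps");

static int antistall_cb(void *map, int *key, struct bpf_timer *timer)
{
	/* ... scan DSQs and kick any that look stalled ... */
	bpf_timer_start(timer, NSEC_PER_SEC, 0);	/* re-arm: 1s period */
	return 0;
}

static int start_antistall(void)
{
	u32 key = 0;
	struct antistall_timer *t =
		bpf_map_lookup_elem(&antistall_timer_map, &key);

	if (!t)
		return -1;
	bpf_timer_init(&t->timer, &antistall_timer_map, CLOCK_MONOTONIC);
	bpf_timer_set_callback(&t->timer, antistall_cb);
	return bpf_timer_start(&t->timer, NSEC_PER_SEC, 0);
}
```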
2024-11-08 20:31:02 -05:00
Tejun Heo
bb91ad0084 scx_layered: Work around older kernels choking on function calls from sleepable progs
The verifier in older kernels chokes on function calls from sleepable
progs, triggering a nonsensical RCU state error:

   frame1: R1_w=scalar(id=674,smin=smin32=0,smax=umax=smax32=umax32=51,var_off=(0x0; 0x3f)) R10=; return *llc_ptr;
  1072: (61) r0 = *(u32 *)(r2 +0)       ; frame1: R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R2_w=map_value(map=bpf_bpf.rodata,ks=4,vs=9570,off=4400,smin=smin32=0,smax=umax=smax32=umax32=204,var_off=(0x0; 0xfc)) refs=13,647
  ; }
  1073: (95) exit
  bpf_rcu_read_unlock is missing
  processed 10663 insns (limit 1000000) max_states_per_insn 8 total_states 615 peak_states 281 mark_read 20
  -- END PROG LOAD LOG --

Work around it by adding and using an __always_inline variant of
cpu_to_llc_id() from layered_init(). Note that we can't switch everyone to
__always_inline as that can lead to verification failure due to the insn
limit.
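
The shape of the workaround (BPF_STRUCT_OPS_SLEEPABLE and cpu_to_llc_id()
are real scx names; the bodies here are elided placeholders):

```
static __always_inline s32 cpu_to_llc_id_inlined(s32 cpu)
{
	/* ... same body as cpu_to_llc_id() ... */
	return 0;	/* placeholder */
}

s32 BPF_STRUCT_OPS_SLEEPABLE(layered_init)
{
	/* no function call here for old verifiers to choke on */
	s32 llc_id = cpu_to_llc_id_inlined(0);

	/* ... */
	return llc_id < 0 ? -1 : 0;
}
```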
2024-11-08 08:47:57 -10:00
Lohith C V
a2e119ae23 scx_lavd: docs: fix typos 2024-11-08 16:25:55 +05:30
Daniel Hodges
3b47782bf4 scx_layered: Add fallback costs to dump
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-07 19:49:09 -05:00
Daniel Hodges
73926d6481
Merge pull request #912 from hodgesds/layered-mask-cleanup
scx_layered: Cleanup cpumask
2024-11-07 22:52:28 +00:00
5ae1b84533
Merge pull request #908 from JakeHillion/pr908
layered/topo: lift layer specific checks out of per-LLC loop
2024-11-07 21:48:20 +00:00
Daniel Hodges
ee4fd3dace scx_layered: Cleanup cpumask
Cleanup remaining cpumasks to use `cast_mask`.
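
For reference, cast_mask() is essentially this (as in scx's common BPF
headers, modulo details):

```
/* Convert a struct bpf_cpumask to the const struct cpumask *
 * that kfuncs such as scx_bpf_pick_idle_cpu() expect. */
static __always_inline const struct cpumask *
cast_mask(struct bpf_cpumask *mask)
{
	return (const struct cpumask *)mask;
}
```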

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-07 13:18:10 -08:00
Daniel Hodges
637fc3f6e1 scx_layered: Use layer idle_smt option
When selecting an idle CPU, use the idle_smt option on the layer. This
may improve cache locality in some cases by placing tasks on CPUs that
share a closer cache.
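
A sketch of how the option might steer the pick (the kfunc and flag are
real scx APIs; layer->idle_smt is assumed from this message):

```
static s32 layer_pick_idle_cpu(struct layer *layer,
			       const struct cpumask *allowed)
{
	/* Prefer fully idle cores when the layer requests it. */
	u64 flags = layer->idle_smt ? SCX_PICK_IDLE_CORE : 0;

	return scx_bpf_pick_idle_cpu(allowed, flags);
}
```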

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-07 13:05:19 -08:00
Daniel Hodges
7db2ef22d0 scx_layered: Fix verifier issue on older kernels
On some older kernels, layered fails to verify. Prevent certain helpers
from being inlined to pass the verifier.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-07 12:20:58 -08:00
Jake Hillion
ba54808150 layered/topo: lift layer specific checks out of per-LLC loop
The loops in topology aware mode were recently refactored to place the
per-LLC loops inside the per-layer loops. However, the layer specific
checks were left in the inner loops, slowing this down unnecessarily.

Pull the layer specific checks from the inner loop into the outer loop.

Also changes these functions to `__weak` to ensure they don't get inlined -
they're expected to be verified as global functions.
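
The shape of the change, sketched with hypothetical names:

```
static void consume_layers(u32 nr_layers, u32 nr_llcs, s32 cpu)
{
	u32 layer, llc;

	for (layer = 0; layer < nr_layers; layer++) {
		/* hoisted: depends on the layer only, not the LLC */
		if (!layer_wants_cpu(layer, cpu))
			continue;
		for (llc = 0; llc < nr_llcs; llc++)
			try_consume(layer, llc);  /* LLC-dependent work only */
	}
}
```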

Note to reviewers: this looks good to me, but I'd appreciate if you reviewed
the De Morgan applications in detail.

Test plan:
- `cargo build --release && sudo target/release/scx_layered --run-example` on a
  machine with multiple LLCs. It's possible to stall it quite easily with
  stress-ng, but I believe this is also the case on main.
2024-11-07 18:34:44 +00:00
Changwoo Min
416de68b72
Merge pull request #904 from multics69/lavd-drop-padding
scx_lavd: drop padding in cpdom_cpumask, which was a workaround
2024-11-07 16:13:07 +00:00
Changwoo Min
56357a79db
Merge pull request #903 from multics69/lavd-issue-897
scx_lavd: update cur_logical_clk atomically
2024-11-07 16:12:56 +00:00
Daniel Hodges
3cc849f234 scx_layered: Fix verifier issue when tracing
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-07 06:43:40 -08:00
Daniel Hodges
487baa4a03 scx_layered: Add fallback DSQ cost accounting
Add fallback DSQ cost accounting so that fallback DSQ costs are tracked
and dispatch from fallback DSQs can be done in a standardized way.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-07 05:25:57 -08:00
Changwoo Min
22cb9e9ce1 scx_lavd: drop padding in cpdom_cpumask, which was a workaround
The verifier error seems to stem from the wrong vmlinux.h.
Also, PR #889 seems to completely fix the problem.
So, drop the workaround.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-11-07 16:13:06 +09:00
Changwoo Min
e9ba2d53fa scx_lavd: update cur_logical_clk atomically
Previously, cur_logical_clk was updated with WRITE_ONCE(), which does not
guarantee atomicity when concurrent writes happen -- which is possible.
So change it to use CAS (compare-and-swap).
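
The CAS pattern, sketched (the clock only ever moves forward; loop details
assumed, not the actual scx_lavd code):

```
static u64 cur_logical_clk;

static void advance_cur_logical_clk(u64 new_clk)
{
	u64 old = READ_ONCE(cur_logical_clk);

	while (old < new_clk) {
		u64 seen = __sync_val_compare_and_swap(&cur_logical_clk,
						       old, new_clk);
		if (seen == old)
			break;		/* our CAS won */
		old = seen;		/* lost the race: re-evaluate */
	}
}
```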

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-11-07 16:01:50 +09:00
Emil Tsalapatis
5e35a12ce3 remove stray print 2024-11-06 18:02:08 -08:00
Emil Tsalapatis
42880404e1 Merge branch 'main' of https://github.com/sched-ext/scx into core_enums 2024-11-06 12:44:23 -08:00
Emil Tsalapatis
2f174db96f use the enum singleton in the userspace scheduler components 2024-11-06 12:17:16 -08:00
Emil Tsalapatis
1cabed9d09 Autogenerate enums and BPF enum setters for Rust schedulers 2024-11-06 12:17:16 -08:00
Emil Tsalapatis
d500c50098 add autogenerated enum definitions for Rust schedulers 2024-11-06 12:17:16 -08:00
Dan Schatzberg
fb635cb8f0
Merge pull request #438 from dschatzberg/mitosis
Refactor select_cpu + enqueue for proper synchronization and handling of !wakeup
2024-11-06 18:36:05 +00:00
Andrea Righi
f402f118db
Merge pull request #899 from sched-ext/bpfland-rework
scx_bpfland: rework
2024-11-06 17:55:36 +00:00
Tejun Heo
ad45727139 version: v1.0.6 2024-11-06 06:54:26 -10:00
Dan Schatzberg
af2cb1abbe scx_mitosis: add RCU-like synchronization
scx_mitosis relied on the implicit assumption that after a sched tick,
all outstanding scheduling events had completed, but this might not
actually be correct. This feels like a natural use-case for RCU, but
there is no way to directly make use of RCU in BPF. Instead, this commit
implements an RCU-like synchronization mechanism.
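
One possible shape for such a mechanism (a hedged sketch, not necessarily
scx_mitosis's exact scheme): per-CPU counters that tick in quiescent
states, with the writer waiting until every CPU has advanced past a
snapshot.

```
#define MAX_CPUS 512	/* assumed bound */

static u64 qs_cnt[MAX_CPUS];

/* Called by each CPU when outside any scheduling event. */
static void note_quiescent(s32 cpu)
{
	if (cpu >= 0 && cpu < MAX_CPUS)
		__sync_fetch_and_add(&qs_cnt[cpu], 1);
}

/* Writer side: has every CPU passed a quiescent state since snap? */
static bool grace_period_elapsed(const u64 *snap, u32 nr_cpus)
{
	u32 cpu;

	for (cpu = 0; cpu < nr_cpus && cpu < MAX_CPUS; cpu++)
		if (qs_cnt[cpu] == snap[cpu])
			return false;
	return true;
}
```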

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-11-06 08:33:29 -08:00
Emil Tsalapatis
479d515a45
Merge branch 'main' into core_enums 2024-11-06 11:07:42 -05:00
Emil Tsalapatis
23f302cf13 add SCX_SLICE_* macros to scx_utils and use them for the Rust schedulers 2024-11-06 07:52:04 -08:00
Andrea Righi
78101e4688 scx_bpfland: drop lowlatency mode and the priority DSQ
Schedule all tasks using a single global DSQ. This gives better control
to prevent potential starvation conditions.

With this change, scx_bpfland adopts logic similar to scx_rusty and
scx_lavd, prioritizing tasks based on the frequency of their wait and
wake-up events, rather than relying exclusively on the average number of
voluntary context switches.
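
A sketch of the frequency tracking (update rule and constants are
illustrative only, not scx_bpfland's actual math):

```
#define NSEC_PER_SEC 1000000000ULL

struct task_stat {
	u64 last_wake_ns;
	u64 wake_freq;		/* moving average, events per second */
};

static void update_wake_freq(struct task_stat *ts, u64 now_ns)
{
	u64 delta = now_ns - ts->last_wake_ns;
	u64 sample = NSEC_PER_SEC / (delta ? delta : 1);

	/* exponential moving average: 3/4 old estimate, 1/4 new sample */
	ts->wake_freq = (3 * ts->wake_freq + sample) / 4;
	ts->last_wake_ns = now_ns;
}
```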

Tasks are still classified as interactive / non-interactive based on the
number of voluntary context switches, but this now only affects the
cpufreq logic.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
2024-11-06 15:06:39 +01:00