Commit Graph

1291 Commits

Author SHA1 Message Date
Jake Hillion
0f9c1a0a73 layered/timers: support verifying on older kernels and fix logic
Some of the new timer code doesn't verify on older kernels like 6.9. Modify the
code a little to get it verifying again.

Also apply some small fixes to the logic: error handling was a little
off before, and we were using the wrong key in lookups.

Test plan:
- CI
2024-10-25 11:31:00 +01:00
Changwoo Min
ea600d2f3b
Merge pull request #846 from multics69/lavd-issue-385
scx_lavd: fix uninitialized memory access at comp_preemption_info()
2024-10-25 01:47:20 +00:00
Pat Somaru
1e0e0d2f50
make timerlib work the best it can with tooling 2024-10-24 13:12:53 -04:00
Pat Somaru
8ab38559aa
fix lsp to work after multiarch support 2024-10-24 13:12:53 -04:00
Daniel Hodges
e38282d61a scx_layered: Fix declarations in timer 2024-10-24 09:09:53 -07:00
Daniel Hodges
41a612f34d scx_layered: Add monitor
Add a monitor timer for scx_layered. For now the monitor is a noop.
2024-10-24 04:49:41 -04:00
Changwoo Min
4f6947736f scx_lavd: fix uninitialized memory access at comp_preemption_info()
The previous code accessed uninitialized memory in comp_preemption_info()
when called from can_task1_kick_task2() <- try_yield_current_cpu()
to test whether task2 is a lock holder. However, task2 is guaranteed
not to be a lock holder in all its callers, so move the lock holder
test to can_cpu1_kick_cpu2().

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-24 16:07:53 +09:00
Changwoo Min
a13bb8028e
Merge pull request #837 from multics69/lavd-tuning-v4
scx_lavd: various optimizations for more consistent performance
2024-10-23 22:56:31 +00:00
Tejun Heo
cc8633996b Revert "fix ci errors due to __str update in kfunc signature"
This reverts commit 29918c03c8.
2024-10-23 08:58:06 -10:00
Changwoo Min
b90ecd7e8f scx_lavd: proactively kick a CPU at the ops.enqueue() path
When a task is enqueued, kick an idle CPU in the chosen scheduling
domain. This will reduce temporary stall time of the task by waking
up the CPU as early as possible.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-23 21:43:11 +09:00
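
For reference, a minimal sketch of the idea in sched_ext BPF C; the `SHARED_DSQ` id, the sketch's structure, and the function name are illustrative assumptions, not the actual scx_lavd code:
```c
/* Fragment of a sched_ext BPF scheduler (hypothetical names). */
void BPF_STRUCT_OPS(lavd_enqueue_sketch, struct task_struct *p, u64 enq_flags)
{
	s32 cpu;

	/* Queue the task as usual on a shared DSQ. */
	scx_bpf_dispatch(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);

	/*
	 * Proactively kick an idle CPU the task is allowed to run on so
	 * it starts consuming the DSQ as early as possible.
	 */
	cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
	if (cpu >= 0)
		scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
}
```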
Changwoo Min
731a7871d7 scx_lavd: change the greedy penalty function
We used to apply a latency penalty linearly in the greedy ratio.
However, this lets the greedy ratio affect the virtual deadline too
much, especially among under-utilized tasks (< 100.0%). Now we treat
all under-utilized tasks as having the same greedy ratio (= 100.0%).
For over-utilized tasks, we apply a somewhat milder penalty to avoid
sudden latency spikes.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-23 21:42:55 +09:00
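
A back-of-envelope sketch of the new shape of the penalty; the per-mille scale and the damping factor are made-up illustrations, not scx_lavd's actual constants:
```c
/* Greedy ratio in per-mille: 1000 == 100.0% (illustrative scale). */
static u64 effective_greedy_ratio(u64 greedy_ratio)
{
	/* All under-utilized tasks are treated as exactly 100.0%. */
	if (greedy_ratio <= 1000)
		return 1000;

	/* Over-utilized tasks get a milder, dampened penalty. */
	return 1000 + (greedy_ratio - 1000) / 2;
}
```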
Changwoo Min
9acf950b75 scx_lavd: change how to use the context information for latency criticality
Previously, contextual information—such as sync wakeup and kernel
task—was incorporated into the final latency criticality value ad hoc
by adding a constant. Instead, let's make everything proportional to
run time and waker and wakee frequencies by scaling up/down the run
time and the frequencies.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-23 21:32:18 +09:00
Pat Somaru
29918c03c8
fix ci errors due to __str update in kfunc signature 2024-10-23 02:18:26 -04:00
Changwoo Min
fdca0c04ed
Merge pull request #831 from multics69/lavd-fix-bpf-veri
scx_lavd: fix/work around a verifier error
2024-10-23 01:45:01 +09:00
Daniel Hodges
4898f5082a scx_layered: Add timer helpers
Add a registry of timers and a helper for running timers.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-22 07:57:44 -07:00
Changwoo Min
6fb57643fb scx_lavd: remove the time restriction in preemption
Previously, preemption was allowed only when a task was early in
its time slice, gated by LAVD_PREEMPT_KICK_MARGIN and
LAVD_PREEMPT_TICK_MARGIN. This is no longer necessary because the
lock holder preemption mitigation avoids harmful preemptions. So
remove LAVD_PREEMPT_KICK_MARGIN and LAVD_PREEMPT_TICK_MARGIN and
unleash preemption.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-22 17:48:56 +09:00
Changwoo Min
07ed821511 scx_lavd: incorporate task's weight to latency criticality
When calculating a task's latency criticality, incorporate the task's
weight into runtime, wake_freq, and wait_freq more systematically. It
looks nicer and works better under heavy load.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-22 17:48:56 +09:00
Changwoo Min
47dd1b9582 scx_lavd: respect a chosen cpu even if it is not idle
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-22 17:48:56 +09:00
Changwoo Min
257a3db376 scx_lavd: add ops.cpu_release()
When a CPU is released to serve a higher-priority scheduler class,
requeue the tasks from its local DSQ to the global queue.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-22 17:48:56 +09:00
Changwoo Min
89749ecad7 scx_lavd: fix/work around a verifier error
Without this, the BPF verifier emits the following errors with *some*
versions of vmlinux.h, so add +1 to work around the problem.

---------------
; bpf_for(j, 0, 64) { @ main.bpf.c:1926
509: (bf) r1 = r8                     ; R1_w=fp-32 R8_w=fp-32 refs=66,2035
510: (b4) w2 = 0                      ; R2_w=0 refs=66,2035
511: (b4) w3 = 64                     ; R3_w=64 refs=66,2035
512: (85) call bpf_iter_num_new#104189        ; R0=scalar() fp-32=iter_num(ref_id=2048,state=active,depth=0) refs=66,2035,2048
513: (bf) r1 = r8                     ; R1=fp-32 R8=fp-32 refs=66,2035,2048
514: (85) call bpf_iter_num_next#104191 515: R0_w=rdonly_mem(id=2049,ref_obj_id=2048,sz=4) R6=scalar(id=2047,smin=smin32=0,smax=umax=smax32=umax32=7,var_off=(0x0; 0x7)) R7=scalar() R8=fp-32 R9=map_value(map=bpf_bpf.bss,ks=4,vs=4584,off=384,smin=smin32=0,smax=umax=smax32=umax32=3968,var_off=(0x0; 0xf80)) R10=fp0 fp-16=iter_num(ref_id=66,state=active,depth=1) fp-24=iter_num(ref_id=2035,state=active,depth=1) fp-32=iter_num(ref_id=2048,state=active,depth=1) fp-80=scalar(id=1) fp-88=map_value(map=.data.LAVD,ks=4,vs=1320,off=40,smin=smin32=0,smax=umax=smax32=umax32=1240,var_off=(0x0; 0x7f8)) fp-96=????0 fp-112=rcu_ptr_bpf_cpumask() fp-120=rcu_ptr_bpf_cpumask() fp-128=rcu_ptr_bpf_cpumask() fp-136=rcu_ptr_bpf_cpumask() refs=66,2035,2048
; bpf_for(j, 0, 64) { @ main.bpf.c:1926
515: (15) if r0 == 0x0 goto pc+49     ; R0_w=rdonly_mem(id=2049,ref_obj_id=2048,sz=4) refs=66,2035,2048
516: (64) w6 <<= 6                    ; R6=scalar(smin=smin32=0,smax=umax=smax32=umax32=448,var_off=(0x0; 0x1c0)) refs=66,2035,2048
517: (61) r8 = *(u32 *)(r0 +0)        ; R0=rdonly_mem(id=2049,ref_obj_id=2048,sz=4) R8_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) refs=66,2035,2048
518: (26) if w8 > 0x3f goto pc+46     ; R8_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=63,var_off=(0x0; 0x3f)) refs=66,2035,2048
; if (cpumask & 0x1LLU << j) { @ main.bpf.c:1927
519: (bf) r1 = r7                     ; R1_w=scalar(id=2053) R7=scalar(id=2053) refs=66,2035,2048
520: (7f) r1 >>= r8                   ; R1_w=scalar() R8_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=63,var_off=(0x0; 0x3f)) refs=66,2035,2048
521: (57) r1 &= 1                     ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1)) refs=66,2035,2048
522: (15) if r1 == 0x0 goto pc+38     ; R1_w=1 refs=66,2035,2048
; cpu = (i * 64) + j; @ main.bpf.c:1928
523: (4c) w8 |= w6                    ; R6=scalar(smin=smin32=0,smax=umax=smax32=umax32=448,var_off=(0x0; 0x1c0)) R8_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=511,var_off=(0x0; 0x1ff)) refs=66,2035,2048
; bpf_cpumask_set_cpu(cpu, cd_cpumask); @ main.bpf.c:1929
524: (bc) w1 = w8                     ; R1_w=scalar(id=2054,smin=smin32=0,smax=umax=smax32=umax32=511,var_off=(0x0; 0x1ff)) R8_w=scalar(id=2054,smin=smin32=0,smax=umax=smax32=umax32=511,var_off=(0x0; 0x1ff)) refs=66,2035,2048
525: (79) r2 = *(u64 *)(r10 -88)      ; R2_w=map_value(map=.data.LAVD,ks=4,vs=1320,off=40,smin=smin32=0,smax=umax=smax32=umax32=1240,var_off=(0x0; 0x7f8)) R10=fp0 fp-88=map_value(map=.data.LAVD,ks=4,vs=1320,off=40,smin=smin32=0,smax=umax=smax32=umax32=1240,var_off=(0x0; 0x7f8)) refs=66,2035,2048
526: (85) call bpf_cpumask_set_cpu#93595
invalid access to map value, value_size=1320 off=1280 size=48
R2 max value is outside of the allowed memory range
processed 24200 insns (limit 1000000) max_states_per_insn 19 total_states 961 peak_states 789 mark_read 44
---------------

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-22 17:19:37 +09:00
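
The loop in the log above has roughly the shape below; an explicit re-bound on the computed CPU id is the general kind of hint that keeps the verifier's tracked range for the later map-value access in check. `LAVD_CPU_ID_MAX`, the helper name, and its parameters are illustrative, and the actual fix was the "+1" sizing described above:
```c
static void set_cpus_from_words(u64 *words, int nr_words,
				struct bpf_cpumask *cd_cpumask)
{
	int i, j;

	bpf_for(i, 0, nr_words) {
		u64 cpumask = words[i];

		bpf_for(j, 0, 64) {
			if (cpumask & (0x1LLU << j)) {
				s32 cpu = i * 64 + j;

				/* Re-bound cpu so the verifier can prove the
				 * downstream map-value access stays in range. */
				if (cpu >= LAVD_CPU_ID_MAX)
					return;
				bpf_cpumask_set_cpu(cpu, cd_cpumask);
			}
		}
	}
}
```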
Changwoo Min
d5b8aafa1a
Merge pull request #822 from multics69/lavd-tuning-v3
scx_lavd: misc performance tuning
2024-10-22 09:57:58 +09:00
Tejun Heo
6ea15f9f9f
Merge pull request #819 from minosfuture/vmlinux_per_arch
Use per-arch vmlinux.h v2
2024-10-21 19:36:52 +00:00
likewhatevs
303c6d09a0
Merge pull request #824 from likewhatevs/layered-exit-task-no-missing-ctx
scx_layered: fix exit_task ctx lookup err
2024-10-21 14:52:07 +00:00
Jake Hillion
55c9636f78 layered: bpf: add layer kind to layer
Currently we have an approximation of LayerKind in the BPF code with `open` on
the layer, but it is difficult/impossible to tell the difference between an
Open and a Grouped layer. Add a `kind` field to the BPF `layer` and plumb
through an enum from the Rust side.
2024-10-21 11:32:17 +01:00
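
A sketch of the plumbing: the variant names mirror the Rust `LayerKind`, but the exact BPF-side definitions here are assumptions:
```c
enum layer_kind {
	LAYER_KIND_CONFINED,
	LAYER_KIND_GROUPED,
	LAYER_KIND_OPEN,
};

struct layer {
	/* ... existing fields ... */
	int kind;	/* set from the Rust side at load time */
};
```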
Changwoo Min
5f19fa0bab scx_lavd: refill time slice once for a lock holder
When a task holds a lock, refill its time slice once at the
ops.dispatch() path to avoid the lock holder preemption problem.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-21 15:56:51 +09:00
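
A minimal sketch of a one-shot refill at dispatch time; `lookup_task_ctx()`, `is_lock_holder()`, the `slice_refilled` field, and `SHARED_DSQ` are illustrative stand-ins:
```c
void BPF_STRUCT_OPS(lavd_dispatch_sketch, s32 cpu, struct task_struct *prev)
{
	struct task_ctx *taskc;

	/* Give a lock holder one extra slice so it can release the lock. */
	if (prev && (taskc = lookup_task_ctx(prev)) &&
	    is_lock_holder(taskc) && !taskc->slice_refilled) {
		prev->scx.slice = SCX_SLICE_DFL;
		taskc->slice_refilled = true;	/* refill only once */
		return;
	}

	scx_bpf_consume(SHARED_DSQ);
}
```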
Changwoo Min
5a852dc3d9 scx_lavd: direct dispatch when there is an idle CPU
When there is an idle CPU, direct dispatch is performed to reduce
scheduling latency. This didn't work well before, but it seems
to work well now with other tunings.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-21 15:56:51 +09:00
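
The usual sched_ext pattern for this, sketched with the default idle-selection helper; the real scx_lavd logic picks CPUs within its own scheduling domains:
```c
s32 BPF_STRUCT_OPS(lavd_select_cpu_sketch, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);

	/*
	 * An idle CPU was found and reserved: dispatch directly to its
	 * local DSQ, bypassing ops.enqueue() and the shared queues.
	 */
	if (is_idle)
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
	return cpu;
}
```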
Changwoo Min
420de70159 scx_lavd: give more penalty to long-running tasks
Giving a larger penalty to long-running tasks helps segregate
latency-critical tasks, which are usually short-running, from
long-running, compute-intensive tasks.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-21 15:56:41 +09:00
Pat Somaru
d89c571593
scx_layered: do not attempt ctx lookup on tasks exited before running on scx 2024-10-20 17:47:24 -04:00
Andrea Righi
fb3f1d0b43
Merge pull request #821 from sched-ext/rustland-min-vtime-budget
scx_rustland: Adjust task's vruntime budget based on latency weight
2024-10-20 07:44:35 +00:00
Changwoo Min
bf1b014d63
Merge pull request #818 from multics69/lavd-tuning
scx_lavd: add missing reset_lock_futex_boost()
2024-10-20 01:41:54 +00:00
Daniel Hodges
e72e5ce0f4
Merge pull request #744 from minosfuture/main
scx_layered: Fix crash on aarch64 due to unavailable cache id file
2024-10-19 22:33:53 +00:00
Ming Yang
1b5359ef4a Use per-arch vmlinux.h v2
Rework per-arch vmlinux solution
* have per-arch directory under sched/include/arch/, in which we
  maintain vmlinux.h symlink and real file
  vmlinux-{kernel_ver}-g{sha1}.h. The original sched/include/vmlinux/
  folder is removed.
* update meson build `-I` option to find the new vmlinux.h position
* update cargo build scripts to use the per-arch vmlinux.h for
  generating bindings
* keep the original ClangInfo refactoring changes

Signed-off-by: Ming Yang <minos.future@gmail.com>
2024-10-19 10:50:59 -07:00
Andrea Righi
30a2a2013c scx_rustland: Adjust task's vruntime budget based on latency weight
Adjust the amount of vruntime budget an idle task can accumulate as a
function of its latency weight, which is derived from the average number
of voluntary context switches.

This ensures that latency-sensitive tasks naturally receive an
additional priority boost, and we can avoid scaling down the vruntime
to determine the task's deadline, making the scheduler fairer.

It also makes the scheduler more robust: rustland can now survive
intensive stress tests, such as `stress-ng --cpu-sched 64` or hackbench.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-19 19:32:14 +02:00
Daniel Hodges
b1b76ee72a
scx_rusty: Cleanup cpumask casting
Use the cast_mask helper function to clean up scx_rusty.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-19 12:01:36 -04:00
Changwoo Min
2fd395bbbf scx_lavd: remove unnecessary load tracking
The algorithm has evolved to decide the time slice without tracking
the system-wide load, so remove the obsolete load tracking code.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-19 15:39:24 +09:00
Changwoo Min
8d63024be7 scx_lavd: add missing reset_lock_futex_boost()
reset_lock_futex_boost() should be called at every context switch of a
task. Otherwise, in the worst case, a task and its CPU could block
preemption. To avoid such a situation, add the missing
reset_lock_futex_boost() calls.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-19 15:39:18 +09:00
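
A sketch of the placement being described, clearing the boost on both edges of a context switch; `reset_lock_futex_boost()` is the commit's helper, while the ctx lookup and op names are illustrative:
```c
void BPF_STRUCT_OPS(lavd_running_sketch, struct task_struct *p)
{
	/* Task is switched in: start from a clean boost state. */
	reset_lock_futex_boost(lookup_task_ctx(p));
}

void BPF_STRUCT_OPS(lavd_stopping_sketch, struct task_struct *p, bool runnable)
{
	/* Task is switched out: make sure no stale boost survives. */
	reset_lock_futex_boost(lookup_task_ctx(p));
}
```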
Ming Yang
f3f4726c09 scx_layered: Read CPU topology for building CpuPool
Building CpuPool from the cache-cpu topology did not work on arm because
the `/sys/devices/system/cpu/cpu{}/cache/index{}/id` file is unavailable.

Read CPU topology instead.

Signed-off-by: Ming Yang <minos.future@gmail.com>
2024-10-17 23:41:08 -07:00
Andrea Righi
48bbcd24dd scx_bpfland: tune default settings
Adjust some default settings after the rework done with commit 112a5d4
("scx_bpfland: rework lowlatency mode to adjust tasks priority").

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-17 21:46:51 +02:00
Andrea Righi
4d68133f3b scx_bpfland: rework lowlatency mode to adjust tasks priority
Rework lowlatency mode as follows:
 - introduce task dynamic priority: task weight multiplied by the
   average amount of voluntary context switches
 - use dynamic priority to determine task's vruntime (instead of the
   static task's weight)
 - task's minimum vruntime is evaluated in function of the dynamic
   priority (tasks with a higher dynamic priority can have a smaller
   vruntime compared to tasks with a lower dynamic priority)

The dynamic priority makes it possible to maintain good system
responsiveness even without classifying tasks as "interactive" or
"regular"; therefore, in lowlatency mode only the shared DSQ will be
used (the priority DSQ is disabled).

Using a separate priority queue to dispatch "interactive" tasks makes
the scheduler less fair, allowing latency-sensitive tasks to be
prioritized even when there is a high number of tasks in the system
(e.g., `stress-ng -c 1024` or similar scenarios), where relying solely
on dynamic priority may not be sufficient.

On the other hand, disabling the classification of "interactive" tasks
results in a fairer scheduler and more predictable performance, making
it better suited for soft real-time applications (e.g., audio and
multimedia).

Therefore, the --lowlatency option is retained to allow users to choose
between more predictable performance (by disabling the interactive task
classification) or a more responsive system (default).

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-17 21:46:51 +02:00
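
A back-of-envelope sketch of the dynamic priority described above; the constants, names, and clamping are illustrative assumptions, not scx_bpfland's actual code:
```c
/* Dynamic priority: static weight amplified by the average number of
 * voluntary context switches, capped so bursty tasks can't run away. */
static u64 dyn_prio(u64 weight, u64 avg_nvcsw)
{
	u64 factor = avg_nvcsw ?: 1;

	if (factor > MAX_NVCSW_FACTOR)
		factor = MAX_NVCSW_FACTOR;
	return weight * factor;
}

/* vruntime advances inversely to the dynamic priority: tasks that
 * voluntarily release the CPU accumulate vruntime more slowly. */
static u64 vtime_delta(u64 runtime, u64 weight, u64 avg_nvcsw)
{
	return runtime * 100 / dyn_prio(weight, avg_nvcsw);
}
```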
Andrea Righi
d336892c71
Merge pull request #816 from sched-ext/rustland-core-update-doc
scx_rustland_core: update documentation about the new API
2024-10-17 19:18:16 +00:00
Andrea Righi
a155ff2ada scx_rustland_core: update documentation about the new API
Update the documentation, adding the new task statistics provided by
scx_rustland_core.

Fixes: be681c7 ("scx_rustland_core: pass nvcsw, slice and dsq_vtime to user-space")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-17 19:07:51 +02:00
f1b1830512
Merge pull request #814 from JakeHillion/pr814
layered: add RandomTopo layer growth algorithm
2024-10-17 17:05:53 +00:00
Jake Hillion
1415b4a454 layered: make disable_topology arg require equals
The recent changes to `disable_topology` making the arg an `Option<bool>`
instead of a `bool` caused an issue with it incorrectly attaching arguments.
Make the argument `require_equals` to fix this case.

This is a behaviour change for anybody previously relying on `-t true`,
`-t false`, `--disable-topology true`, or `--disable-topology false`. The
equals syntax worked before and continues to work after, as demonstrated in the
CI.

Test plan:

Before:
```sh
$ sudo target/release/scx_layered -t f:/tmp/test.json
error: invalid value 'f:/tmp/test.json' for '--disable-topology
[<DISABLE_TOPOLOGY>]'
  [possible values: true, false]

  For more information, try '--help'.
```

After:
```sh
$ sudo target/release/scx_layered -t f:/tmp/test.json
14:44:00 [INFO] CPUs: online/possible=176/176 nr_cores=88
14:44:00 [INFO] Disabling topology awareness
...
^CEXIT: Scheduler unregistered from user space
```
2024-10-17 15:46:30 +01:00
Jake Hillion
a0fe303b61 layered: add RandomTopo layer growth algorithm
Add an additional layer growth algorithm, named 'RandomTopo'. It follows these
rules:
- Randomise NUMA nodes. List each core in each NUMA node before a core from
  another NUMA node.
- Randomise LLCs within each NUMA node. List each core in each LLC before a
  core in a different LLC.
- Randomise the core order within each LLC.

This attempts to provide a relatively evenly distributed set of cores while
considering topology. Unlike `Topo`, it does not require you to specify the
ordering and instead generates it from the hardware, making desyncs between the
config and the hardware less likely.

Currently `RandomTopo` considers topology even with `--disable-topology=true`.
I can see the arguments for this going both ways. On one hand requesting
disable topology suggests you want no consideration of machine topology, and
`RandomTopo` should decay to `Random` (which it does on single node/LLC machines
anyway). On the other hand, the config explicitly specifies `RandomTopo` and
should consider the topology. If anyone feels strongly I can change this to
respect `disable_topology`.

Test plan:
```sh
$ sudo target/release/scx_layered -v f:/tmp/test.json
...
14:31:19 [DEBUG] layer: batch algo: RandomTopo core order: [47, 44, 43, 42, 40, 45, 46, 41, 38, 37, 36, 39, 34, 32, 35, 33, 54, 49, 50, 52, 51, 48, 55, 53, 68, 64, 66, 67, 70, 69, 71, 65, 9, 10, 12, 15, 14, 11, 8, 13, 59, 60, 57, 63, 62, 56, 58, 61, 2, 3, 5, 4, 0, 6, 7, 1, 86, 83, 85, 87, 84, 81, 80, 82, 20, 22, 19, 23, 21, 18, 17, 16, 30, 25, 26, 31, 28, 27, 29, 24, 78, 73, 74, 79, 75, 77, 76, 72]
14:31:19 [DEBUG] layer: immediate algo: RandomTopo core order: [45, 40, 46, 42, 47, 43, 41, 44, 80, 82, 83, 84, 85, 86, 81, 87, 13, 10, 9, 15, 14, 12, 11, 8, 36, 38, 39, 32, 34, 35, 33, 37, 7, 3, 1, 0, 2, 5, 4, 6, 53, 52, 54, 48, 50, 49, 55, 51, 76, 77, 79, 78, 73, 74, 72, 75, 71, 66, 64, 67, 70, 69, 65, 68, 24, 26, 31, 25, 28, 30, 27, 29, 58, 56, 59, 61, 57, 62, 60, 63, 16, 19, 17, 23, 22, 20, 18, 21]
...
```

This is a machine with 1 NUMA/11 LLCs with 8 cores per LLC and you can see the
results are grouped by LLC but random within.
2024-10-17 15:36:00 +01:00
Daniel Hodges
b01ff79080
Merge pull request #805 from hodgesds/layered-refresh-cleanup
scx_layered: Refactor refresh cpumasks
2024-10-16 19:06:15 +00:00
Andrea Righi
2ea47af4bc
Merge pull request #804 from sched-ext/rustland-fixes
scx_rustland fixes and improvements
2024-10-16 18:26:03 +00:00
Tejun Heo
84d8abf913 Revert "Use per-arch vmlinux.h"
This reverts commit a23f3566e3.
2024-10-16 06:42:28 -10:00
Tejun Heo
bd79059f1a Revert "Add vmlinux.h for multiple arch"
This reverts commit 7067092555.
2024-10-16 06:42:18 -10:00
Dan Schatzberg
730052a0c4
Merge pull request #803 from dschatzberg/mitosis_fallback_dsq
scx_mitosis: Handle pinned tasks
2024-10-16 13:26:23 +00:00
Andrea Righi
763da6ab55 scx_rlfifo: operate in a more work-conserving way
Make scx_rlfifo even simpler and keep dispatching tasks even if the CPUs
are all busy.

This allows better stress testing of the scx_rustland_core backend,
using both the per-CPU DSQs and the global shared DSQ.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-16 14:06:00 +02:00
Andrea Righi
b07de1d7d5 scx_rustland: clarify EDF scheduling
scx_rustland is now effectively a deadline-based scheduler and not a
pure vruntime-based scheduler.

Clarify this in the source code. No functional change.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-16 14:06:00 +02:00
Andrea Righi
c4b6408e92 scx_rustland: smooth vruntime update
Update vruntime by adding each task's used virtual time slice as soon
as it is scheduled.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-16 14:06:00 +02:00
Andrea Righi
0b2de2c10c scx_rustland: use built-in nvcsw metrics
Use the nvcsw metric from the scx_rustland_core backend, instead of
retrieving this metric in user-space via procfs.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-16 14:06:00 +02:00
Andrea Righi
97629178e2 scx_rustland_core: bump up version to 2.2.2
Bump up the minor version to reflect the new backward-compatible
functionality added.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-16 14:06:00 +02:00
Daniel Hodges
907746745e scx_layered: Refactor refresh cpumasks
Refactor the cpumask refresh logic so it is easier to read and verify.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-15 17:58:10 -07:00
Tejun Heo
4841df8138
Merge pull request #793 from minosfuture/vmlinux_per_arch
Use per-arch vmlinux.h
2024-10-15 19:52:42 +00:00
Dan Schatzberg
96ebe6b84a scx_mitosis: Handle pinned tasks
Pinned tasks should just be routed to a fallback DSQ. Kthreads are given
a higher priority than non-kthreads, so use two fallback DSQs.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-10-15 09:09:01 -07:00
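
Sketched in sched_ext terms; the DSQ ids, helper name, and the caller-side pinned-task test are assumptions:
```c
/* Route a pinned task to a fallback DSQ; kthreads get their own, which
 * dispatch drains first so they effectively run at higher priority. */
static void enqueue_pinned(struct task_struct *p, u64 enq_flags)
{
	u64 dsq = (p->flags & PF_KTHREAD) ? KTHREAD_FALLBACK_DSQ
					  : FALLBACK_DSQ;

	scx_bpf_dispatch(p, dsq, SCX_SLICE_DFL, enq_flags);
}
```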
Dan Schatzberg
902f41adf0
Merge pull request #799 from dschatzberg/mitosis_dispatch_no_wakeup
scx_mitosis: handle enqueue() on !wakeup
2024-10-15 13:46:07 +00:00
Daniel Hodges
71d63010af scx_layered: Refactor layer iteration
Remove DSQ iter algos.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-14 13:13:53 -07:00
Dan Schatzberg
a17f16e4b9 scx_mitosis: handle enqueue() on !wakeup
If we're not on the wakeup path, we may see enqueue() invoked without
select_cpu(), which will require an idle CPU lookup. In order to fix
this, we refactor the idle CPU lookup in select_cpu() so it can be
invoked from enqueue().

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-10-14 10:13:07 -07:00
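
The resulting shape, roughly; `pick_idle_cpu()` stands in for the shared lookup helper, and `FALLBACK_DSQ` is an illustrative name:
```c
void BPF_STRUCT_OPS(mitosis_enqueue_sketch, struct task_struct *p, u64 enq_flags)
{
	/*
	 * Without SCX_ENQ_WAKEUP, ops.select_cpu() was never called, so
	 * run the shared idle-CPU lookup from here instead.
	 */
	if (!(enq_flags & SCX_ENQ_WAKEUP)) {
		s32 cpu = pick_idle_cpu(p, scx_bpf_task_cpu(p));

		if (cpu >= 0) {
			scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu,
					 SCX_SLICE_DFL, enq_flags);
			return;
		}
	}
	scx_bpf_dispatch(p, FALLBACK_DSQ, SCX_SLICE_DFL, enq_flags);
}
```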
Daniel Hodges
912d6e01c1 scx_layered: Add LLC integration test
Add an integration test for testing that the `llcs` field on the layer
config works properly.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-14 07:27:29 -07:00
Daniel Hodges
ed18e43612
Merge pull request #795 from hodgesds/bpftrace-tests
scx_layered: Add topology integration test
2024-10-14 12:54:54 +00:00
Daniel Hodges
e456c83536 scx_layered: Add topology integration test
Add a bpftrace script that does a topology-aware test. The test runs a
bpftrace script asserting that stress-ng processes are scheduled on
NUMA node 0 only.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-13 20:23:11 -07:00
Ming Yang
f7cdf08754 scx_mitosis: Fix static assertion of scx_bpf_task_cgroup failing __weak check
It failed the static assertion in the bpf_ksym_exists macro.

Signed-off-by: Ming Yang <minos.future@gmail.com>
2024-10-13 07:57:12 -07:00
Ming Yang
7067092555 Add vmlinux.h for multiple arch
Following the change to use per-arch vmlinux.h, add it for the
remaining archs.

Signed-off-by: Ming Yang <minos.future@gmail.com>
2024-10-13 07:57:12 -07:00
Ming Yang
a23f3566e3 Use per-arch vmlinux.h
vmlinux.h is not compatible across archs.

Handle this compatibility issue by:
* Adding arch info to the vmlinux.h real file name
* Linking vmlinux.h to the target-arch real file at build time
* Using the target-arch real file for scx_utils bindgen.

Also refactored clang related logic into a new clang_info mod, which is
shared by bpf_builder.rs and builder.rs.

Signed-off-by: Ming Yang <minos.future@gmail.com>
2024-10-13 07:57:12 -07:00
Changwoo Min
c1f4051a14 scx_lavd: fix int overflow in calculating avg_lat_cri
u32 is not big enough to hold the sum of lat_cri in a period, so
sum_lat_cri (u32) overflowed, resulting in an incorrect avg_lat_cri.
Change the type from u32 to u64 to avoid the integer overflow. Note
that {sum/avg}_lat_cri is only for debugging, so it is irrelevant to
scheduling decisions.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-13 00:58:36 +09:00
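
The bug in miniature; the field names follow the commit, while the struct shape and update helper are illustrative:
```c
struct sys_stat {
	u64 sum_lat_cri;	/* was u32: overflowed within one period */
	u32 nr_sched;
	u32 avg_lat_cri;	/* debugging only, as noted above */
};

static void update_avg_lat_cri(struct sys_stat *st, u32 lat_cri)
{
	st->sum_lat_cri += lat_cri;	/* u64 accumulator no longer wraps */
	st->nr_sched++;
	st->avg_lat_cri = st->sum_lat_cri / st->nr_sched;
}
```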
Changwoo Min
6c9bbe66dc scx_lavd: remove unnecessary downscaling in deadline calculation
The downscaling is not necessary when calculating a task's virtual
deadline because the virtual deadline represents only a relative order
in task scheduling. Hence, downscaling only introduces inaccuracy
caused by truncation.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-13 00:41:23 +09:00
Changwoo Min
6ddc3f0a2b scx_lavd: do not inspect scx_lavd process itself
Printing the task status of the scx_lavd process itself is not useful,
so filter it out.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-12 17:21:08 +09:00
Andrea Righi
197dee93f4 scx_bpfland: get rid of per-CPU DSQs
Using per-CPU DSQs seems to introduce more issues than benefits
(potential stalls, etc.). Therefore, let's get rid of the per-CPU DSQs
and use SCX_DSQ_LOCAL for tasks directly dispatched to specific CPUs.

This change seems to also improve performance on 6.12 and it makes the
scheduler a lot more stable and consistent.

The issues will be investigated separately, with a dedicated stress
test scheduler designed to stress test per-CPU DSQs.

Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Tested-by: Eric Naim <dnaim@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-12 08:15:51 +02:00
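
The replacement pattern, sketched; the helper is illustrative:
```c
/* Enqueue-time targeting: SCX_DSQ_LOCAL_ON | cpu reaches a specific
 * CPU's built-in local DSQ, which the sched_ext core consumes
 * automatically, so no task can be stranded in a queue nobody drains. */
static void dispatch_to_cpu(struct task_struct *p, s32 cpu, u64 enq_flags)
{
	scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, enq_flags);
}
```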
Andrea Righi
198f22656c scx_bpfland: clarify error code returned by pick_idle_cpu()
Return more meaningful error codes from pick_idle_cpu(). No functional
change, just improved code readability.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-12 08:08:48 +02:00
Andrea Righi
ceb4f1755f scx_bpfland: always refill task timeslice in ops.dispatch()
When a task exhausts its timeslice and no other tasks are ready to run,
we automatically refill its timeslice, but only if the current CPU is a
fully idle SMT core.

If we don’t handle the refill, the sched_ext core will default to
refilling using SCX_SLICE_DFL, which may not be optimal.

To ensure better control over the task’s timeslice, always refill it
when no other tasks are available to run.

Fixes: 6e24fcc ("scx_bpfland: keep tasks running on full-idle SMT cores")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-12 08:08:48 +02:00
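
A minimal sketch of the explicit refill, assuming a `SHARED_DSQ` id and a scheduler-chosen `slice_ns` (both illustrative):
```c
void BPF_STRUCT_OPS(bpfland_dispatch_sketch, s32 cpu, struct task_struct *prev)
{
	if (scx_bpf_consume(SHARED_DSQ))
		return;

	/*
	 * Nothing else to run: refill prev's slice ourselves with the
	 * scheduler's chosen value instead of letting the core fall
	 * back to SCX_SLICE_DFL.
	 */
	if (prev && (prev->scx.flags & SCX_TASK_QUEUED))
		prev->scx.slice = slice_ns;
}
```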
Andrea Righi
54d704ceda scx_bpfland: pick a random idle CPU when prev_cpu is not valid
Pick any random idle CPU when the previous CPU isn't valid anymore
according to the task's cpumask.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-12 08:08:48 +02:00
Changwoo Min
836cf9faa4
Merge pull request #779 from multics69/lavd-futex-v2
scx_lavd: mitigate the lock holder preemption problem
2024-10-12 02:42:33 +00:00
Daniel Hodges
a08a76ccd6 scx_layered: Cleanup non topology path
More cleanup in the non topology path to remove copy/pasta declarations.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-11 10:18:34 -07:00
eb59085e61
Merge pull request #781 from JakeHillion/pr781
layered: move configuration into library component
2024-10-11 16:39:23 +00:00
Jake Hillion
52c279a469 layered: make default value for disable_topology dynamic
Disable topology currently defaults to `false` (topology enabled...). Change
this so that topology is enabled by default on hardware that may benefit from
it (multiple NUMA nodes or LLCs) and disabled on hardware that does not benefit
from it.

This is a slightly noisy change as we have to move ownership of the newly
mutable layer specs into the `Scheduler` object (previously they were a
borrow). We don't have a `Topology` object to make the default decision from
until `Scheduler::init`, and I think this is because of the possibility of hot
plugs. We therefore have to clone the `Vec<LayerSpec>` each time as it is
potentially mutable.

Test plan:
- CI. Updated to be explicit about topology in both cases.

Single NUMA multi-LLC machine:
```
$ scx_layered --run-example
...
13:34:01 [INFO] Topology awareness not specified, selecting enabled based on
hardware
...
$ scx_layered --run-example --disable-topology=true
...
13:33:41 [INFO] Disabling topology awareness
...
$ scx_layered --run-example -t
...
13:33:15 [INFO] Disabling topology awareness
...
$ scx_layered --run-example --disable-topology=false
# none of the above messages present
```

Single NUMA single LLC machine:
```
$ scx_layered --run-example
15:33:10 [INFO] Topology awareness not specified, selecting disabled based on
hardware
```
2024-10-11 17:09:07 +01:00
Jake Hillion
143a55cda1 layered: move configuration into library component
Move the LayerConfig and its children from `main.rs` into `lib.rs`. This allows
other tooling, such as config managers or test executors, to modify layered
configs programmatically.

The end goal is to move everything in `layered` except for the argument parsing
into a `run_layered` function, but I haven't done it in this diff because it's
a larger change. This is a common pattern in Rust projects to do as little as
possible in `main.rs` for extensibility.

The only change here, other than publicity and where things are located, is the
signature of `CpuPool::alloc_cpus`. It previously relied on `&Layer`, and this
changes it to the two elements of `Layer` it uses. This allows `Layer` to stay
confined to `main.rs` (for now) to prevent scope creep in this PR.

This may be inconvenient in the short term for WIPs and anyone doing non-Cargo
builds (cough me), but having things split into more files should make
rebases/merges easier in the long run.

Test plan:
- `cargo build --release`
- CI.
2024-10-11 15:55:29 +01:00
Changwoo Min
648c95be9e scx_lavd: fix incorrect task comparison for preemption
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-11 21:53:24 +09:00
likewhatevs
b88f567e25
Merge pull request #782 from likewhatevs/lsp-nice-util
layered -- make lsp work nice on util include file
2024-10-11 12:30:19 +00:00
Pat Somaru
2b309dbbb4
make lsp work nice on util include 2024-10-11 08:06:29 -04:00
Pat Somaru
7627e1cc42
scx_layered: fix lsp etc on util.bpf.c 2024-10-11 08:02:23 -04:00
Changwoo Min
5b4b255cbb scx_lavd: do not preempt while holding a lock
When a task holds a lock, it should not yield its time slice or be
preempted. In this way, we can mitigate harmful preemption of lock
holders and reduce the total preemption count.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-11 18:49:09 +09:00
Changwoo Min
bd17589a6e scx_lavd: boost latency criticality when a task holds a lock
When a lock holder exhausts its time slice, it will be re-enqueued
to a DSQ, waiting for scheduling while holding a lock. In this case,
boost its latency criticality proportionally so a lock holder does
not get stuck in a DSQ for a long time, improving system-wide
progress.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-11 18:48:56 +09:00
Changwoo Min
77b8e65571 scx_lavd: tracing all blocking locks and futexes
Trace the acquisition and release of blocking locks for the kernel and
futexes for user-space. This is necessary to boost a lock holder
task in terms of latency and time slice. We do not boost shared
lock holders (e.g., read lock in rw_semaphore) since the kernel
already prioritizes readers over writers.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-11 17:03:48 +09:00
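
A sketch of the tracing side: paired probes mark the current task as a (non-shared) lock holder across acquire/release. The probed symbol is just one example of a blocking lock, and the boost helpers are hypothetical:
```c
SEC("fexit/mutex_lock")
int BPF_PROG(sketch_mutex_lock)
{
	inc_lock_boost(lookup_cur_task_ctx());	/* now holding the lock */
	return 0;
}

SEC("fentry/mutex_unlock")
int BPF_PROG(sketch_mutex_unlock)
{
	dec_lock_boost(lookup_cur_task_ctx());	/* released */
	return 0;
}
```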
Ryan Wilson
8c8250b1e2 [layered] Implement reverse weight DSQ algorithm 2024-10-10 12:53:25 -07:00
Daniel Hodges
9f60053312
Merge pull request #775 from hodgesds/layered-idle-cleanup
scx_layered: Cleanup topology preempt path
2024-10-10 18:34:08 +00:00
Daniel Hodges
fb4dcf91eb scx_layered: Change default DSQ iter algo
Change the default DSQ iter algo from round robin to linear.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-10 11:10:27 -07:00
Daniel Hodges
b22e83d4d5 scx_layered: Cleanup topology preempt path
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-10 09:56:42 -07:00
Andrea Righi
d62989e462 scx_bpfland: fix cpumask initialization error
In the WAKE_SYNC path, if L3 cache awareness is disabled (--disable-l3)
we may hit the following error:

  Error: EXIT: scx_bpf_error (CPU L3 cpumask not initialized)

Fix this by setting the L3 cpumask to the whole primary domain if L3
cache awareness is disabled.

Tested-by: Eric Naim <dnaim@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-10 09:30:54 +02:00
Daniel Hodges
fe00e2c7be
scx_layered: Refactor topo preemption
Refactor the topology preemption logic so the non-topology-aware code is
contained in a separate function. This should make maintaining the
non-topology-aware code path far easier.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-09 21:24:07 -04:00
Daniel Hodges
451c68b44e
scx_layered: Cleanup debug messages
Cleanup debug messages to use a common prefix when the scheduler is
initialized.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-09 19:06:28 -04:00
Daniel Hodges
81a5250d49 scx_layered: Fix verifier errors
Fix verifier errors when using different DSQ iteration algorithms and
cleanup some code.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-09 14:36:12 -07:00
Dan Schatzberg
12cf482487
Merge pull request #767 from dschatzberg/mitosis-build
mitosis: Fix build
2024-10-09 19:32:35 +00:00
Dan Schatzberg
c794c389da mitosis: apply autoformatting
Apply clang-format autoformatting on the C code and cargo fmt on the
Rust code.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-10-09 10:56:27 -07:00
483a565d7f
Merge pull request #759 from JakeHillion/pr759
layered: attempt to work steal from own llc before others
2024-10-09 17:42:23 +00:00
Daniel Hodges
678c205572
Merge pull request #766 from hodgesds/layered-load-fixes
scx_layered: Rename load_adj statistic
2024-10-09 17:12:24 +00:00
Jake Hillion
d9dc46b5d2 layered: attempt to work steal from own llc before others 2024-10-09 17:39:06 +01:00
Dan Schatzberg
347147b10d mitosis: fix build
Minimal changes to make sure scx_mitosis can build with the latest scx
changes.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-10-09 08:30:15 -07:00
Daniel Hodges
30258cff1b scx_layered: Update docs for layer_preempt_weight_disable
Update docs for layer_preempt_weight_disable and
layer_growth_weight_disable.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-09 06:37:54 -07:00
Daniel Hodges
edc673460d scx_layered: Rename load_adj statistic
Rename the `load_adj` statistic to `load_frac_adj`, which is a more
accurate representation of what the statistic is calculating. The
statistic is a fractional representation of the load of a layer adjusted
for infeasible weights.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-09 06:23:37 -07:00
c23efb1ed3
Merge pull request #749 from JakeHillion/pr749
layered: split dispatch into no_topo version
2024-10-09 13:15:12 +00:00
Jake Hillion
19d09c3cc1 layered: split dispatch into no_topo version
Refactor layered_dispatch into two functions: layered_dispatch_no_topo and
layered_dispatch. layered_dispatch will delegate to layered_dispatch_no_topo in
the disable_topology case.

Although this code doesn't run when loaded by BPF due to the global constant
bool blocking it, it makes the functions really hard to parse as a human. As
they diverge more and more it makes sense to split them into separate
manageable functions.

This is basically a mechanical change. I duplicated the existing function,
replaced all `disable_topology` with true in `no_topo` and false in the
existing function, then removed all branches which can't be hit.

Test plan:
- Runs on my dev box (6.9.0 fbkernel) with `scx_layered --run-example -n`.
- As above with `-t`.
- CI.
2024-10-09 13:33:06 +01:00
Daniel Hodges
2b5829e275
Merge pull request #763 from ryantimwilson/rusty-default-weights-fix
[rusty] Fix load stats when host is under-utilized
2024-10-09 12:14:51 +00:00
likewhatevs
29bb3110ec
Merge pull request #765 from likewhatevs/update-dispatch
scx_layered: enable configuring layer iteration when no topo
2024-10-09 06:22:40 +00:00
Pat Somaru
8e2f195af1
enable configuring layer iteration when no topo
enable configuring layer iteration order in dispatch
when topology is disabled.

replace some member_vptr's in that iteration with regular
accesses
2024-10-09 01:53:19 -04:00
Andrea Righi
e3e381dc8e
Merge pull request #755 from sched-ext/bpfland-prevent-kthread-stall
scx_bpfland: prevent per-CPU DSQ stall with per-CPU kthreads
2024-10-09 05:28:59 +00:00
Ryan Wilson
fbdb6664ec [rusty] Fix load stats when host is under-utilized 2024-10-08 21:08:07 -07:00
Pat Somaru
c90144d761
Revert "Merge pull request #746 from likewhatevs/layered-delay"
This reverts commit 2077b9a799, reversing
changes made to eb73005d07.
2024-10-08 22:01:05 -04:00
Daniel Hodges
e6773d43b1 scx_layered: Make stress-ng non exclusive in example
CI test hosts are currently VMs, and making stress-ng exclusive may
starve the host.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-08 10:49:51 -07:00
Daniel Hodges
66f967c06d
Merge pull request #756 from hodgesds/layered-example-stress
scx_layered: Add stress-ng example layer
2024-10-08 15:31:44 +00:00
likewhatevs
e1f6c792fe
Merge pull request #757 from JakeHillion/pr757
layered: cleanup warnings in bpf compilation
2024-10-08 15:29:12 +00:00
Jake Hillion
85daa2be32 layered: cleanup warnings in bpf compilation
Clang is correctly warning that we use various uninitialised variables.
Clean these up so real errors are easier to read.

The largest change here is to the non-topological layered_dispatch. The
matching_dsq logic seems to be incorrect: it checks whether an uninitialised
variable is 0, sets it if it is, and then only uses the variable if the value
is 0. I have changed this to default to -1, then use the value if it is no
longer -1.
2024-10-08 16:25:43 +01:00
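
The shape of the fix, sketched: initialise the cursor to a sentinel rather than reading it uninitialised. `nr_layers`, `layer_matches()`, and `layer_dsq_id()` are illustrative names:
```c
static bool consume_matching_dsq(void)
{
	s32 idx, matching_dsq = -1;

	bpf_for(idx, 0, nr_layers) {
		/* Record only the first matching layer's DSQ. */
		if (matching_dsq < 0 && layer_matches(idx))
			matching_dsq = layer_dsq_id(idx);
	}

	return matching_dsq >= 0 && scx_bpf_consume(matching_dsq);
}
```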
Daniel Hodges
f3191afca7 scx_layered: Add stress-ng example layer
Add a stress-ng example layer, which will be used for CI testing with
stress-ng.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-08 07:56:54 -07:00
Andrea Righi
c8a9207371 scx_bpfland: prevent per-CPU DSQ stall with per-CPU kthreads
Since per-CPU kthreads may show an inconsistent prev_cpu and/or cpumask,
dispatch them directly to the local DSQ and allow them to preempt the
currently running task.

This prevents per-CPU kthread stalls and also helps to prioritize them,
as they are usually important for system performance and
responsiveness.

Moreover, change the behavior of --local-kthreads to prioritize all
kthreads when this option is used.

This addresses issue #728.

NOTE: ideally we may want to fix this in the kernel by making sure to
always expose a consistent prev_cpu and cpumask also for kthreads, but
at the moment this change prevents some annoying stalls and,
performance-wise, it doesn't seem to introduce any regression. In fact,
the usual gaming/fps benchmarks show even a slight improvement in
responsiveness with this change applied.

Thanks to YUBY from the CachyOS community for all the extremely valuable
help with the intensive stress tests.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-08 15:02:31 +02:00
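
The kthread special-case, sketched; the surrounding enqueue structure is illustrative:
```c
void BPF_STRUCT_OPS(bpfland_enqueue_sketch, struct task_struct *p, u64 enq_flags)
{
	/*
	 * Per-CPU kthreads: skip the usual queues, go straight to the
	 * local DSQ, and let them preempt the currently running task.
	 */
	if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1) {
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL,
				 enq_flags | SCX_ENQ_PREEMPT);
		return;
	}
	/* ... normal enqueue path ... */
}
```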
Daniel Hodges
d7576d4b44
Merge pull request #754 from minosfuture/cpu_pool_doc
scx_layered: Add doc comment to CpuPool
2024-10-08 12:22:55 +00:00
likewhatevs
2077b9a799
Merge pull request #746 from likewhatevs/layered-delay
scx_layered: lighten/reduce nested loops in layered dispatch
2024-10-08 11:32:55 +00:00
Ming Yang
0dbb8c2374 scx_layered: Add doc comment to CpuPool
Add doc comment to `CpuPool` as a quick reference for each member.
Most importantly, differentiate "cpu" and "core", as logical core and
physical core, respectively.

Signed-off-by: Ming Yang <minos.future@gmail.com>
2024-10-07 21:48:46 -07:00
Pat Somaru
51d9e90d39
formatting 2024-10-07 18:54:30 -04:00
Pat Somaru
d2ac627942
formatting 2024-10-07 18:47:27 -04:00
Pat Somaru
3369836970
formatting 2024-10-07 18:44:44 -04:00
Pat Somaru
e0ce4711d4
flatten and simplify dispatch 2024-10-07 18:36:07 -04:00
Daniel Hodges
eb73005d07
Merge pull request #747 from hodgesds/layered-idle-order
scx_layered: Update idle topology selection order
2024-10-07 20:01:38 +00:00
Ryan Wilson
a76778a4ab scx_rusty: Fix BPF crash during CPU hotplug
When hotplugging CPUs in rapid succession, scx_rusty would crash with:
```
scx_bpf_error (Failed to lookup dom[4294967295]
```

The root cause is if the scheduler is restarted fast enough, a task
on a previously hotplugged CPU may not have moved off that CPU yet.
Thus, the CPU -> domain map would contain an invalid domain (u32::max)
and we would fail to look up the domain correctly in rusty_select_cpu
for prev_cpu.

To fix this, if the CPU is offline, we do not try to allocate within the
same NUMA node (assuming hotplug is a rare operation) beyond the domestic
domain. Instead we use greedy allocation: first idle, then busy, then
any CPU.
2024-10-07 11:59:36 -07:00
Daniel Hodges
0b497d6df0 scx_layered: Update idle topology selection order
Update the idle topology selection order, the current logic is:

core architecture (big/little) -> LLC -> NUMA -> Machine

It's probably better to try to keep cache lines clean and do:

LLC -> core architecture (big/little) -> NUMA -> Machine

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-07 10:34:11 -07:00
Daniel Hodges
024a2aa658 scx_layered: Improve perf on non topo aware paths
Improve the performance of the non-topology-aware paths by skipping some
map lookups and unnecessary initializations.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-07 07:56:18 -07:00
Daniel Hodges
24fba4ab8d scx_layered: Add idle smt layer configuration
Add support for layer configuration for idle CPU selection. This allows
layers to choose whether or not to restrict idle CPU selection to SMT
idle CPUs.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-07 06:58:54 -07:00
Daniel Hodges
2f280ac025 scx_layered: Use idle smt mask for idle selection
In the non-topology-aware code the idle SMT mask is used for finding
idle CPUs. Update topology-aware idle selection to also use the idle
SMT mask. In certain benchmarks this can improve performance.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-07 05:40:59 -07:00
Daniel Hodges
30feecc5ae
Merge pull request #743 from hodgesds/layered-big-little-mask
scx_layered: Add big cpumask
2024-10-07 11:05:01 +00:00
Daniel Hodges
d86638ef0b
scx_layered: Add big cpumask
Add big cpumask to scx_layered and prefer selecting big idle cores when
using the BigLittle growth algo.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-06 14:05:12 -04:00
Andrea Righi
9a29547e5b scx_bpfland: rework lowlatency mode
In lowlatency mode (option --lowlatency) tasks are ordered using a
deadline that is evaluated as the vruntime minus a certain "bonus",
determined as a function of the max time slice and the average amount of
voluntary context switches, to amplify the priority boost of the tasks
that are voluntarily releasing the CPU (which are typically
interactive).

However, this method can be extremely unfair in some cases: tasks with
short bursts of voluntary context switches may receive a huge priority
boost, making the rest of the system almost unresponsive (see massive
hackbench stress tests for example).

To prevent this, rework the task's deadline logic to use the vruntime and
a "deadline component" that is a function of the average used time
slice, scaled using a dynamic task priority (evaluated from the static
task priority and its average amount of voluntary context switches).

This logic seems to prevent excessive prioritization of tasks performing
short intensive bursts of voluntary context switches.

It also makes lowlatency mode in scx_bpfland (somewhat) more similar to
the deadline logic used by scx_rusty.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-05 17:44:09 +02:00
Changwoo Min
a673dcf809
Merge pull request #736 from multics69/scx-futex-v1
scx_lavd: split main.bpf.c into multiple files
2024-10-05 13:11:15 +09:00
Pat Somaru
efabcfcdc3
Replace PID with Task Pointer in Rusty
Replace PID with Task Pointer in Rusty

Fixes: #610
2024-10-04 18:06:37 -04:00
Daniel Hodges
c56e60b86a scx_layered: Add better debug output of iter algo
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-04 11:36:36 -07:00
Daniel Hodges
e1241d6e52 scx_layered: Cleanup layer growth weight limits
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-04 11:16:58 -07:00
Daniel Hodges
17f9b3f4f3 scx_layered: Cleanup layer infeasible weight calc
Clean up the calculation of the infeasible weight to not use an
unnecessary collect.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-04 10:12:22 -07:00
Daniel Hodges
0476a10f83 scx_layered: Cleanup from code review
Cleanup from code review.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-04 10:09:38 -07:00
Daniel Hodges
817e310a31 scx_layered: Add default dsq iter algo
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-04 09:58:26 -07:00
Daniel Hodges
7ee12091c3 scx_layered: Add DSQ iteration algo
Add DSQ iteration algorithms.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-04 09:58:23 -07:00
Daniel Hodges
6929501aea scx_layered: Refactor stats variable names
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-04 09:56:37 -07:00
Daniel Hodges
f066580612 scx_layered: Use dcycle for infeasible weights
Fix a bug so that duty cycle is used for infeasible weight calculations.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-04 09:56:37 -07:00
Daniel Hodges
c55d34c319 scx_layered: Cleanup unused metrics
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-04 09:56:37 -07:00
Daniel Hodges
c0c4e183f0 scx_layered: Cargo fmt
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-04 09:56:37 -07:00
Daniel Hodges
f3b3d4f19c scx_layered: Add weighted layer DSQ iteration
Add a flag to control DSQ iteration across layers by layer weight. This
helps prevent starvation by iterating over layers with the lowest weight
first.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-04 09:56:37 -07:00
Daniel Hodges
bd75ac8dbf scx_layered: Add flags for growth and preemption
Add two new flags `layer_preempt_weight_disable` and
`layer_growth_weight_disable` to disable preemption and layer growth
when weighted layer load exceeds the configured threshold.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-04 09:56:37 -07:00
Daniel Hodges
e48e675cff scx_layered: Remove LoadLedger from stats
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-04 09:56:37 -07:00
Daniel Hodges
2518c99bf2 scx_layered: Refactor load calculation
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-04 09:56:37 -07:00
Daniel Hodges
54dbf35680 scx_layered: Add weights to userspace layer config
Add weights to userspace layer config.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-04 09:56:37 -07:00
Daniel Hodges
07be9dcf59 scx_layered: Add stats for adjusted layer weights
Add stats for layer weights adjusted by the infeasible weights calculation.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-04 09:56:37 -07:00
Daniel Hodges
da38d69009 scx_layered: Add layer weights
Add weights to layers and use the infeasible weights crate to properly
apply weights during contention to prevent starvation.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-04 09:56:37 -07:00