Commit Graph

298 Commits

Author SHA1 Message Date
Changwoo Min
fe5554b83d scx_lavd: move reset_lock_futex_boost() to ops.running()
Resetting reset_lock_futex_boost() at ops.enqueue() is not accurate,
so move it to the running. This way, we can prevent the lock holder
preemption only when a lock is acquired during ops.runnging() and
ops.stopping().

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-30 13:52:52 +09:00
Changwoo Min
0f58a9cd39 scx_lavd: calculate task's latency criticality in the direct dispatch path
Even in the direct dispatch path, calculating the task's latency
criticality is still necessary since the latency criticality is
used for the preemptablity test. This addressed the following
GitHub issue:

https://github.com/sched-ext/scx/issues/856

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-30 12:58:06 +09:00
Changwoo Min
5b91a525bb scx_lavd: kick CPU explicitly at the ops.enqueue() path
When the current task is decided to yield, we should explicitly call
scx_bpf_kick_cpu(_, SCX_KICK_PREEMPT). Setting the current task's time
slice to zero is not sufficient in this because the sched_ext core
does not call resched_curr() at the ops.enqueue() path.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-28 17:31:34 +09:00
Changwoo Min
f56b79b19c scx_lavd: yield for preemption only when a task is ineligible
An eligible task is unlikely preemptible. In other words, an ineligible
task is more likely preemptible since its greedy ratio penalty in virtual
deadline calculation. Hence, we skip the predictability test
for an eligible task.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-28 12:45:50 +09:00
Changwoo Min
4f6947736f scx_lavd: fix uninitialized memory access comp_preemption_info()
The previous code accesses uninitialized memory in comp_preemption_info()
when called from can_task1_kick_task2() <-try_yield_current_cpu()
to test if a task 2 is a lock holder or not. However, task2 is guaranteed
not a lock holder in all its callers. So move the lock holder testing to
can_cpu1_kick_cpu2().

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-24 16:07:53 +09:00
Changwoo Min
b90ecd7e8f scx_lavd: proactively kick a CPU at the ops.enqueue() path
When a task is enqueued, kick an idle CPU in the chosen scheduling
domain. This will reduce temporary stall time of the task by waking
up the CPU as early as possible.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-23 21:43:11 +09:00
Changwoo Min
731a7871d7 scx_lavd: change the greedy penalty function
We used to give a penalty in latency linearly to the greedy ratio.
However, this impacts the greedy ratio too much in determining the
virtual deadline, especially among under-utilized tasks (< 100.0%).
Now, we treat all under-utilized tasks with the same greedy ratio
(= 100.0%). For over-utilized tasks, we give a bit milder penalty
to avoid sudden latency spikes.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-23 21:42:55 +09:00
Changwoo Min
9acf950b75 scx_lavd: change how to use the context information for latency criticality
Previously, contextual information—such as sync wakeup and kernel
task—was incorporated into the final latency criticality value ad hoc
by adding a constant. Instead, let's make everything proportional to
run time and waker and wakee frequencies by scaling up/down the run
time and the frequencies.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-23 21:32:18 +09:00
Changwoo Min
6fb57643fb scx_lavd: remove the time restriction in preemption
Previously, the preemption is allowed only when a task is at the
early in its time slice by using LAVD_PREEMPT_KICK_MARGIN and
LAVD_PREEMPT_TICK_MARGIN. This is not necessary any more because
the lock holder preemption can avoid harmful preemptions. So we
remove LAVD_PREEMPT_KICK_MARGIN and LAVD_PREEMPT_TICK_MARGIN and
unleash the preemption.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-22 17:48:56 +09:00
Changwoo Min
07ed821511 scx_lavd: incorporate task's weight to latency criticality
When calculating task's latency criticality, incorporate task's
weight into runtime, wake_freq, and wait_freq more systematically.
It looks nicer and works better under heavy load.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-22 17:48:56 +09:00
Changwoo Min
47dd1b9582 scx_lavd: respect a chosen cpu even if it is not idle
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-22 17:48:56 +09:00
Changwoo Min
257a3db376 scx_lavd: add ops.cpu_release()
When a CPU is released to serve higher priority scheduler class,
requeue the tasks in a local DSQ to the global enqueue.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-22 17:48:56 +09:00
Changwoo Min
89749ecad7 scx_lavd: fix/work around a verifier error
Without this, the BPF verifier spits the following errors
with *some* version of vmlinux.h. So added +1 to work around
the problem.

---------------
; bpf_for(j, 0, 64) { @ main.bpf.c:1926
509: (bf) r1 = r8                     ; R1_w=fp-32 R8_w=fp-32 refs=66,2035
510: (b4) w2 = 0                      ; R2_w=0 refs=66,2035
511: (b4) w3 = 64                     ; R3_w=64 refs=66,2035
512: (85) call bpf_iter_num_new#104189        ; R0=scalar() fp-32=iter_num(ref_id=2048,state=active,depth=0) refs=66,2035,2048
513: (bf) r1 = r8                     ; R1=fp-32 R8=fp-32 refs=66,2035,2048
514: (85) call bpf_iter_num_next#104191 515: R0_w=rdonly_mem(id=2049,ref_obj_id=2048,sz=4) R6=scalar(id=2047,smin=smin32=0,smax=umax=smax32=umax32=7,var_off=(0x0; 0x7)) R7=scalar() R8=fp-32 R9=map_value(map=bpf_bpf.bss,ks=4,vs=4584,off=384,smin=smin32=0,smax=umax=smax32=umax32=3968,var_off=(0x0; 0xf80)) R10=fp0 fp-16=iter_num(ref_id=66,state=active,depth=1) fp-24=iter_num(ref_id=2035,state=active,depth=1) fp-32=iter_num(ref_id=2048,state=active,depth=1) fp-80=scalar(id=1) fp-88=map_value(map=.data.LAVD,ks=4,vs=1320,off=40,smin=smin32=0,smax=umax=smax32=umax32=1240,var_off=(0x0; 0x7f8)) fp-96=????0 fp-112=rcu_ptr_bpf_cpumask() fp-120=rcu_ptr_bpf_cpumask() fp-128=rcu_ptr_bpf_cpumask() fp-136=rcu_ptr_bpf_cpumask() refs=66,2035,2048
; bpf_for(j, 0, 64) { @ main.bpf.c:1926
515: (15) if r0 == 0x0 goto pc+49     ; R0_w=rdonly_mem(id=2049,ref_obj_id=2048,sz=4) refs=66,2035,2048
516: (64) w6 <<= 6                    ; R6=scalar(smin=smin32=0,smax=umax=smax32=umax32=448,var_off=(0x0; 0x1c0)) refs=66,2035,2048
517: (61) r8 = *(u32 *)(r0 +0)        ; R0=rdonly_mem(id=2049,ref_obj_id=2048,sz=4) R8_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) refs=66,2035,2048
518: (26) if w8 > 0x3f goto pc+46     ; R8_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=63,var_off=(0x0; 0x3f)) refs=66,2035,2048
; if (cpumask & 0x1LLU << j) { @ main.bpf.c:1927
519: (bf) r1 = r7                     ; R1_w=scalar(id=2053) R7=scalar(id=2053) refs=66,2035,2048
520: (7f) r1 >>= r8                   ; R1_w=scalar() R8_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=63,var_off=(0x0; 0x3f)) refs=66,2035,2048
521: (57) r1 &= 1                     ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1)) refs=66,2035,2048
522: (15) if r1 == 0x0 goto pc+38     ; R1_w=1 refs=66,2035,2048
; cpu = (i * 64) + j; @ main.bpf.c:1928
523: (4c) w8 |= w6                    ; R6=scalar(smin=smin32=0,smax=umax=smax32=umax32=448,var_off=(0x0; 0x1c0)) R8_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=511,var_off=(0x0; 0x1ff)) refs=66,2035,2048
; bpf_cpumask_set_cpu(cpu, cd_cpumask); @ main.bpf.c:1929
524: (bc) w1 = w8                     ; R1_w=scalar(id=2054,smin=smin32=0,smax=umax=smax32=umax32=511,var_off=(0x0; 0x1ff)) R8_w=scalar(id=2054,smin=smin32=0,smax=umax=smax32=umax32=511,var_off=(0x0; 0x1ff)) refs=66,2035,2048
525: (79) r2 = *(u64 *)(r10 -88)      ; R2_w=map_value(map=.data.LAVD,ks=4,vs=1320,off=40,smin=smin32=0,smax=umax=smax32=umax32=1240,var_off=(0x0; 0x7f8)) R10=fp0 fp-88=map_value(map=.data.LAVD,ks=4,vs=1320,off=40,smin=smin32=0,smax=umax=smax32=umax32=1240,var_off=(0x0; 0x7f8)) refs=66,2035,2048
526: (85) call bpf_cpumask_set_cpu#93595
invalid access to map value, value_size=1320 off=1280 size=48
R2 max value is outside of the allowed memory range
processed 24200 insns (limit 1000000) max_states_per_insn 19 total_states 961 peak_states 789 mark_read 44
---------------

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-22 17:19:37 +09:00
Changwoo Min
5f19fa0bab scx_lavd: refill time slice once for a lock holder
When a task holds a lock, refill its time slice once at the
ops.dispatch() path to avoid the lock holder preemption problem.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-21 15:56:51 +09:00
Changwoo Min
5a852dc3d9 scx_lavd: direct dispatch when there is an idle CPU
When there is an idle CPU, direct dispatch is performed to reduce
scheduling latency. This didn't work well before, but it seems
to work well now with other tunings.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-21 15:56:51 +09:00
Changwoo Min
420de70159 scx_lavd: give more penalty to long-running tasks
Giving more penalties to a long-running tasks helps to segregate
latency-critical tasks, which are usually short-running, to
long-running tasks, which are compute-intensive.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-21 15:56:41 +09:00
Changwoo Min
2fd395bbbf scx_lavd: remove unnecessary load tracking
The algorithm has been evolved to decide the time slice without
tracking the system-wide load. So remove the obsolete load tracking
code.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-19 15:39:24 +09:00
Changwoo Min
8d63024be7 scx_lavd: add missing reset_lock_futex_boost()
reset_lock_futex_boost() should be called every context switch of a
task. Otherwise, in the worst case, a task and that CPU could block
the preemption. To avoid such a situation, add missing
reset_lock_futex_boost() calls.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-19 15:39:18 +09:00
Changwoo Min
c1f4051a14 scx_lavd: fix int overflow in calculating avg_lat_cri
u32 is not big enough to hold the sum of lat_cri in a period,
so sum_lat_cri (u32) was overflown, resulting in incorrect
avg_lat_cri. Change the type from u32 to u64, avoiding the
interger overflow. Note that {sum/avg}_lat_cri is only for
deubugging so it is irrelevant in making scheduling decisions.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-13 00:58:36 +09:00
Changwoo Min
6c9bbe66dc scx_lavd: remove unnecessary downscaling in deadline calculation
The downscaling is not necessary in calculating task's virtual
deadline because virtual dealine represents only relative order
in task scheduling. Hence downscaling incurs only inacuracy
caused by truncation.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-13 00:41:23 +09:00
Changwoo Min
6ddc3f0a2b scx_lavd: do not inspect scx_lavd process itself
Print the task status of scx_lavd is not useful,
so filter it out.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-12 17:21:08 +09:00
Changwoo Min
648c95be9e scx_lavd: fix incorrect task comparison for preemption
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-11 21:53:24 +09:00
Changwoo Min
5b4b255cbb scx_lavd: do not preempt while holding a lock
When a task holds a lock, it should not yield its time slice or it
should not be preempted out. In this way, we can mitigate harmful
preemption of lock holders and reduce the total preemption counts.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-11 18:49:09 +09:00
Changwoo Min
bd17589a6e scx_lavd: boost latency criticality when a task holds a lock
When a lock holder exhausts its time slide, it will be re-enqueued
to a DSQ waiting for shceduling while holding a lock. In this case,
prioritize its latency criticality proportionally, so a lock holder
would be not stuck in a DSQ for a long time, improving system-wide
progress.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-11 18:48:56 +09:00
Changwoo Min
77b8e65571 scx_lavd: tracing all blocking locks and futexes
Trace the acquisition and release of blocking locks for kernel and
fuxtexes for user-space. This is necessary to boost a lock holder
task in terms of latency and time slice. We do not boost shared
lock holders (e.g., read lock in rw_semaphore) since the kernel
already prioritizes the readers over writers.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-11 17:03:48 +09:00
Changwoo Min
7c5c83a3a2 scx_lavd: split main.bpf.c into multiple files
As the main.bpf.c file grows, it gets hard to maintain.
So, split it into multiple logical files. There is no
functional change.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-05 00:25:40 +09:00
Changwoo Min
b1070449b2
Merge pull request #714 from multics69/lavd-hotplug
scx_lavd: support CPU hotplug correctly
2024-10-03 07:35:25 +09:00
Tejun Heo
7402895f4a version: v1.0.5 2024-10-02 08:34:57 -10:00
I Hsin Cheng
7fbef2aa0b scx_lavd: Fix typo
Fix "alreay" to "already".

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-10-02 17:39:10 +08:00
Changwoo Min
770a59f69d scx_lavd: support CPU hotplug correctly
Use scx_utils::NR_CPU_IDS to iterate whole CPUs and separately count the
number of online CPUs to support CPU hotplug correctly.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-02 14:19:18 +09:00
Changwoo Min
fb7bc0a850 scx_lavd: fix incorrect preemtability test
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-02 13:24:43 +09:00
Ming Yang
445743487a Add #stat_doc attribute macro to Stats struct
`#stat_doc` extends the document from stat desc property.

Add this attribute macro to the remaining Stats structs.

Signed-off-by: Ming Yang <minos.future@gmail.com>
2024-09-30 22:12:11 -07:00
Changwoo Min
ade6931bfc scx_lavd: fix incorrect neighbor_bit initialization
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-29 02:27:40 +09:00
Changwoo Min
6d116208c8 scx_lavd: do not perform the victim selection for an invalid cpu
When finding a victim candidate for preemption, a randomly chosen
candidate could be out of valid CPU range due to CPU offline, etc. In
this case, try another CPU randomly.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-29 02:22:18 +09:00
Changwoo Min
cd7846f4d2 scx_lavd: more accurately determine the performance criticality threshold
We used the average performance criticality of tasks as a threshold to
determine the proper core type (big or little). However, if the big
core's compute capacity is not half of the total compute capacity, such
an average-based determination becomes suboptimal. If fewer tasks are
classified as performance-critical tasks and requested to run on big
cores, the big cores would be wasted by stealing arbitrary
non-performance-critical tasks. That could result in performance
instability.

Hence, determine the threshold more accurately by considering (active)
big cores' compute capacity and the (approximated) distribution of
performance criticality of tasks.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-27 16:56:30 +09:00
Changwoo Min
f07023e42b scx_lavd: rename avg_perf_cri to thr_perf_cri
As a preparation to improve the performance criticality logic, we first
rename "avg_perf_cri" to "thr_perf_cri" since average is no longer the
threshold.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-27 13:12:20 +09:00
I Hsin Cheng
61cb3f7fc5 scx_common_bpf: Append cast_mask()
Remove cast_mask() function distributed throughout different schedulers
and add it in common.bpf.h so every scheduler can reference it once they
need to.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-09-24 16:01:19 +08:00
Changwoo Min
71fa92cf1c scx_lavd: propagate waker's latency criticality to its wakee
If a waker is more latency critical than a wakee, inherit a waker's
latency criticality for the wakee. This allows the wakee to consider the
context of who wakes me up. For now, we limit such inheritance to one
hop and one schedule.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-23 12:56:16 +09:00
Changwoo Min
7321a89724 scx_lavd: find a victim cpu for preemption within task's compute domain
Previously, we found a victim from the entire CPUs, which include remote
or non-compatible CPUs. Now we limit our search for victim finding
within a task's compute domain.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-22 12:47:18 +09:00
Changwoo Min
8d8d8f9f61 scx_lavd: consider waker's CPU when ops.select_cpu()
In case of sync wake-up, consider waker's CPU also to improve cache
locality.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-22 01:57:49 +09:00
Changwoo Min
95e2f4dabe scx_lavd: boost the latency critility of kernel threads
Many kernel threads performs latency critical tasks (e.g., net, gpu). In
particular, AMD GPU driver runs the most part in the kernel space using
kworker. Hence, treat kernel threads as if a woken up task.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-14 00:41:02 +09:00
Changwoo Min
4b4f42fce1 scx_lavd: add a short circuit for the case of no turbo core
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-13 16:02:07 +09:00
Jake Hillion
8ca45cfa37
lint: enable cargo fmt (#643)
Use `cargo fmt` with a specific nightly branch in the CI to enforce formatting. Globally format these files while the diff is still small so we can stay on top of it.

Test plan:
- CI lint check passes.
2024-09-11 10:03:20 +01:00
patso
c1df85914b
remove dependency on rlimit.rs
the rlimit crate is the only dependency crate
with a build.rs. build.rs files complicate portability.
this removes the need for rlimit.rs
2024-09-10 01:16:53 -04:00
Changwoo Min
17e0e08e6e
Merge pull request #621 from multics69/lavd-greedy-fix
scx_lavd: improve greedy ratio calculation and more
2024-09-07 10:52:00 +09:00
Changwoo Min
36df970a8f scx_lavd: add debug print for turbo cores
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-06 19:23:17 +09:00
Changwoo Min
351a1c6656 scx_lavd: enable autopilot mode by default
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-06 19:23:12 +09:00
Changwoo Min
ebe9375b6a scx_lavd: pretty printing of status
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-06 16:27:20 +09:00
Changwoo Min
461cb9a3a0 scx_lavd: fix calculation of greedy_ratio
The service time (taskc->svc_time) should be the sum of total CPU time
consumed not jut a delta.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-09-06 16:22:40 +09:00
Tejun Heo
46fc2e1a49 version: v1.0.4 2024-09-05 18:12:45 -10:00