Commit Graph

1291 Commits

Author SHA1 Message Date
Emil Tsalapatis
c545d23e79 factor enum handling into existing headers/operations 2024-11-06 07:03:40 -08:00
Andrea Righi
78101e4688 scx_bpfland: drop lowlatency mode and the priority DSQ
Schedule all tasks using a single global DSQ. This gives better control
to prevent potential starvation conditions.

With this change, scx_bpfland adopts a logic similar to scx_rusty and
scx_lavd, prioritizing tasks based on the frequency of their wait and
wake-up events, rather than relying exclusively on the average number of
voluntary context switches.

Tasks are still classified as interactive / non-interactive based on the
number of voluntary context switches, but this now only affects the
cpufreq logic.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
2024-11-06 15:06:39 +01:00
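A minimal sketch of the wait/wake-up frequency idea described above, in
sched_ext BPF C (the `task_ctx` fields and the `MAX_FREQ_WEIGHT` cap are
illustrative assumptions, not the actual scx_bpfland code):

```c
/* Illustrative per-task context; the scheduler keeps moving averages. */
struct task_ctx {
	u64 avg_wait_freq;	/* how often the task blocks waiting */
	u64 avg_wake_freq;	/* how often the task wakes up */
};

#define MAX_FREQ_WEIGHT	1024	/* assumed cap on the boost */

/*
 * Tasks that block and wake frequently behave interactively, so give
 * them a larger weight when computing their deadline in the global DSQ.
 */
static u64 task_lat_weight(const struct task_ctx *tctx)
{
	u64 freq = tctx->avg_wait_freq + tctx->avg_wake_freq;

	if (freq < 1)
		freq = 1;
	if (freq > MAX_FREQ_WEIGHT)
		freq = MAX_FREQ_WEIGHT;
	return freq;
}
```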
Changwoo Min
d0eeebf98a scx_lavd: deprioritize a long runtime by prioritizing frequencies further
Since tasks' average runtimes follow a skewed distribution, directly
using the runtime in the deadline calculation causes several
performance regressions. Instead, let's use a constant runtime factor
and further prioritize the frequency factors to deprioritize
long-runtime tasks.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-11-06 18:08:57 +09:00
Changwoo Min
cfe23aa21b scx_lavd: avoid self-IPI at preemption
Revert the change that sent a self-IPI at preemption when the victim
CPU is the current CPU. The cost of a self-IPI is prohibitively expensive
in some workloads (e.g., perf bench). Instead, reset the task's time
slice to zero.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-11-06 13:59:24 +09:00
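Roughly, the pattern being reverted to looks like this (a sketch using
the standard sched_ext kfuncs; `victim` is assumed to be the task
currently running on `victim_cpu`):

```c
static void preempt_cpu(s32 victim_cpu, struct task_struct *victim)
{
	if (victim_cpu == bpf_get_smp_processor_id()) {
		/*
		 * Kicking ourselves with SCX_KICK_PREEMPT would raise a
		 * costly self-IPI. Zeroing the slice instead makes the
		 * core pick a new task at the next scheduling point.
		 */
		victim->scx.slice = 0;
	} else {
		scx_bpf_kick_cpu(victim_cpu, SCX_KICK_PREEMPT);
	}
}
```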
Emil Tsalapatis
a1d0e7e638 autogenerate scx enum definitions 2024-11-05 13:52:25 -08:00
Emil Tsalapatis
31b9fb4135 set all enums in userspace before loading 2024-11-05 12:20:31 -08:00
Emil Tsalapatis
ff861d3e2c introduce CO:RE enum readers and use them for scx_central 2024-11-05 08:29:45 -08:00
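The general shape of a CO:RE enum reader, as a sketch built on libbpf's
CO:RE macros (the actual helpers introduced by this commit may differ):

```c
#include <bpf/bpf_core_read.h>

/*
 * Resolve SCX_DSQ_LOCAL against the running kernel's BTF instead of
 * trusting the value baked into a particular vmlinux.h.
 */
static u64 dsq_local_id(void)
{
	if (bpf_core_enum_value_exists(enum scx_dsq_id_flags, SCX_DSQ_LOCAL))
		return bpf_core_enum_value(enum scx_dsq_id_flags,
					   SCX_DSQ_LOCAL);

	/* Fall back to the compile-time value. */
	return SCX_DSQ_LOCAL;
}
```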
Andrea Righi
efc41dd936 scx_bpfland: strict domain affinity
Rather than always migrating tasks across LLC domains when no idle CPU
is available in their current LLC domain, allow the migration but attempt
to bring tasks back to their original LLC domain whenever possible.

To do so, define the task's scheduling domain upon task creation or when
its affinity changes, and ensure the task remains within this domain
throughout its lifetime.

In the future we will add proper load-balancing logic, but for now
this change seems to provide a consistent performance improvement in
certain server workloads.

For example, simple CUDA benchmarks show a performance boost of about
+10-20% with this change applied (on multi-LLC / NUMA machines).

Signed-off-by: Andrea Righi <arighi@nvidia.com>
2024-11-05 16:37:51 +01:00
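A sketch of the CPU-selection side of this policy (`lookup_task_ctx()`
and the per-task `llc_cpumask` are assumed names, not the actual
scx_bpfland implementation):

```c
s32 BPF_STRUCT_OPS(bpfland_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	struct task_ctx *tctx = lookup_task_ctx(p);	/* hypothetical */
	s32 cpu;

	if (!tctx)
		return prev_cpu;

	/*
	 * First choice: an idle CPU inside the task's home LLC domain,
	 * assigned at task creation or when its affinity changes.
	 */
	cpu = scx_bpf_pick_idle_cpu(cast_mask(tctx->llc_cpumask), 0);
	if (cpu >= 0)
		return cpu;

	/*
	 * No idle CPU at home: allow the migration anyway; a later
	 * wakeup will try to pull the task back to its home LLC.
	 */
	cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
	return cpu >= 0 ? cpu : prev_cpu;
}
```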
Andrea Righi
064d6fb560 scx_bpfland: consider all tasks as regular if priority DSQ is congested
This helps prevent excessive starvation of regular tasks in the presence
of a large number of interactive tasks (e.g., when running stress tests,
such as hackbench).

Signed-off-by: Andrea Righi <arighi@nvidia.com>
2024-11-05 16:37:51 +01:00
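A sketch of the congestion check (the DSQ id and threshold are
illustrative; `scx_bpf_dsq_nr_queued()` is the standard kfunc):

```c
#define PRIO_DSQ		0	/* illustrative DSQ id */
#define PRIO_CONGESTION_THRESH	128	/* illustrative threshold */

/*
 * When too many tasks already sit in the priority DSQ, classify new
 * wakeups as regular so regular tasks cannot be starved.
 */
static bool prio_dsq_congested(void)
{
	return scx_bpf_dsq_nr_queued(PRIO_DSQ) > PRIO_CONGESTION_THRESH;
}
```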
Andrea Righi
8a655d94f5 scx_bpfland: do not overly prioritize WAKE_SYNC tasks
This can lead to stalls when a high number of interactive tasks are
running in the system (i.e.., hackbench or similar stress tests).

Signed-off-by: Andrea Righi <arighi@nvidia.com>
2024-11-05 16:37:51 +01:00
Andrea Righi
f0c8de3477 scx_bpfland: do not exclude exiting tasks
Add SCX_OPS_ENQ_EXITING to the scheduler flags, since we are not using
bpf_task_from_pid() and the scheduler can handle exiting tasks.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
2024-11-05 16:37:51 +01:00
Andrea Righi
eb99e45ced scx_bpfland: consistent vruntime update
Ensure that task vruntime is always updated in ops.running() to maintain
consistency with other schedulers.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
2024-11-05 16:37:51 +01:00
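This refers to the canonical pattern used by schedulers like scx_simple,
where `vtime_now` is the scheduler's global vtime clock:

```c
static u64 vtime_now;

static inline bool vtime_before(u64 a, u64 b)
{
	return (s64)(a - b) < 0;
}

void BPF_STRUCT_OPS(bpfland_running, struct task_struct *p)
{
	/*
	 * Advance the global vtime clock so newly enqueued tasks cannot
	 * be placed far in the past relative to running ones.
	 */
	if (vtime_before(vtime_now, p->scx.dsq_vtime))
		vtime_now = p->scx.dsq_vtime;
}
```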
I Hsin Cheng
a5fd42a719 scx_rusty: Fix filtering logic error
Fix a task filtering logic error to avoid the possibility of migrating
the same task over again. The original logic used "||", which could cause
tasks that had already been migrated to be taken into consideration
again. Change the condition to "&&" so we can eliminate the error.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-11-05 14:46:40 +08:00
I Hsin Cheng
a248fdc5e3 scx_rusty: Restore push domain tasks when no task found
The function "try_find_move_task()" returns directly when no task is
found to be moved. If the cause is that no task can fulfill the condition
imposed by "task_filter()", the load balancer will try to find a task to
move again, effectively removing "task_filter()" by replacing it with a
function that always returns true.

However, in that fallback case, the tasks within the domains will be
empty. Swapping the tasks back into the domains vector before returning
solves the issue.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-11-05 14:25:29 +08:00
Changwoo Min
e10a92f392
Merge pull request #887 from multics69/scx-minor-fixes
scx_lavd: entirely drop kernel lock tracing
2024-11-05 00:19:02 +00:00
Emil Tsalapatis
9c6ad33fda
Merge pull request #891 from etsal/global_costc
scx_layered: point costc to global struct when initializing budgets
2024-11-04 21:41:47 +00:00
Emil Tsalapatis
9cf137be99 scx_layered: remove ->open from layer struct 2024-11-04 11:42:23 -08:00
Emil Tsalapatis
2b0909f9d7 scx_layered: point costc to global struct when initializing budgets 2024-11-04 11:40:07 -08:00
Abdul Rehman
3b0b5e6f38 [bug-fix]
* increase the number of bits for the cpumask
2024-11-04 10:54:29 -05:00
Daniel Hodges
5dcdcfc50f scx_layered: Refactor cost naming
Refactor the cost struct usage so that it is always used as `costc` for
clarity.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-04 07:07:13 -08:00
Changwoo Min
517ef89444 scx_lavd: drop all kernel lock tracing
The combination of kernel versions and kernel configs generates
different kernel symbols. For example, in an old kernel version,
__mutex_lock() is not generated. Also, there is currently no workaround
on the fentry/fexit/kprobe side. Let's entirely drop the kernel lock
tracing for now and revisit it later.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-11-04 21:36:32 +09:00
Changwoo Min
796d324555 scx_lavd: improve readability
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-11-04 21:18:43 +09:00
Changwoo Min
882212574a scx_lavd: fix a variable name to kill a warning
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-11-04 16:20:53 +09:00
Changwoo Min
e1b880f7c3 scx_lavd: fix CI error for missing kernel symbols
Revised the lock tracing code to rely on symbols that are stable across
various kernel configurations. There are two changes:

- Entirely drop tracing of rt_mutex, which can be toggled on and off by
  kconfig.

- Replace the mutex_lock() family with __mutex_lock(), which is stable
  across kernel configs. The downside of this change is that the lock
  fast path can no longer be traced, so lock tracing is a bit less
  accurate. But let's live with it for now until a better solution is
  found.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-11-04 16:05:53 +09:00
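For context, lock tracing of this kind attaches to the mutex slow path,
roughly like this (a sketch; `try_lookup_task_ctx()` and the
`lock_holder` flag are assumed names):

```c
SEC("fentry/__mutex_lock")
int BPF_PROG(fentry__mutex_lock, struct mutex *lock)
{
	struct task_struct *p = bpf_get_current_task_btf();
	struct task_ctx *tctx = try_lookup_task_ctx(p);	/* hypothetical */

	/*
	 * Mark the current task as a lock holder so the scheduler can
	 * avoid preempting it while the mutex is held.
	 */
	if (tctx)
		tctx->lock_holder = true;
	return 0;
}
```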
Andrea Righi
f6f5481081
Merge branch 'main' into bpfland-minor-cleanups 2024-11-02 15:28:37 +01:00
Daniel Hodges
7e0f2cd3f3 scx_layered: Fix trace format
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-01 10:16:14 -07:00
Daniel Hodges
cb5b1961b8
Merge branch 'main' into layered-fallback-stall 2024-11-01 10:24:00 -04:00
Daniel Hodges
0a518e9f9e scx_layered: Add additional drain to fallback DSQs
Fallback DSQs are not accounted for in cost tracking. If a layer is
saturating the machine, it is possible to never consume from the fallback
DSQ and stall its tasks. This introduces an additional consumption from
the fallback DSQ when a layer runs out of budget. In addition, tasks that
use partial CPU affinities should be placed into the fallback DSQ. This
change was tested with stress-ng --cacheline `nproc` for several minutes
without causing stalls (which would occur on main).

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-01 06:04:55 -07:00
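A sketch of the added drain (the cost helpers and DSQ name are
illustrative; `scx_bpf_consume()` is the kfunc that moves a task from a
DSQ onto the local CPU):

```c
/*
 * In the dispatch path: when the layer's budget is exhausted, consume
 * from the fallback DSQ anyway so tasks queued there cannot stall.
 */
static bool drain_fallback(struct cost *costc)
{
	if (!has_budget(costc))			/* hypothetical helper */
		return scx_bpf_consume(LO_FALLBACK_DSQ);
	return false;
}
```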
Daniel Hodges
a8109d3341 scx_layered: Fix dump output format
Flip the order of layer id vs layer name so that the output makes sense.
Example output:

LO_FALLBACK nr_queued=0 -0ms
COST GLOBAL[0][random] budget=22000000000 capacity=22000000000
COST GLOBAL[1][hodgesd] budget=0 capacity=0
COST GLOBAL[2][stress-ng] budget=0 capacity=0
COST GLOBAL[3][normal] budget=0 capacity=0
COST CPU[0][0][random] budget=62500000000000 capacity=62500000000000
COST CPU[0][1][random] budget=100000000000000 capacity=100000000000000
COST CPU[0][2][random] budget=124911500964411 capacity=125000000000000

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-01 04:06:33 -07:00
Changwoo Min
18a80977bb scx_lavd: create DSQs on their associated NUMA nodes
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-11-01 11:18:22 +09:00
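Creating a DSQ on a specific NUMA node is directly supported by the
kfunc's second argument; roughly (the DSQ count and node lookup are
assumed names):

```c
s32 BPF_STRUCT_OPS_SLEEPABLE(lavd_init)
{
	s32 err;
	int i;

	bpf_for(i, 0, nr_dsqs) {	/* nr_dsqs: assumed global */
		/*
		 * Allocate each DSQ on the NUMA node it serves so its
		 * bookkeeping stays local to its consumers.
		 */
		err = scx_bpf_create_dsq(i, dsq_to_node(i)); /* hypothetical */
		if (err)
			return err;
	}
	return 0;
}
```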
Changwoo Min
ce31d3c59e scx_lavd: optimize consume_starving_task()
Loop up to the total number of DSQs, not the theoretical maximum number
of DSQs.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-11-01 10:43:34 +09:00
Changwoo Min
673f80d3f7 scx_lavd: fix warnings
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-11-01 10:13:28 +09:00
Changwoo Min
4c3f1fd61c
Merge pull request #867 from vax-r/lavd_typo
scx_lavd: Fix typos
2024-11-01 09:40:50 +09:00
Changwoo Min
93ec656916
Merge pull request #866 from vax-r/lavd_fix_type
scx_lavd: Correct the type of taskc within lavd_dispatch()
2024-11-01 09:39:58 +09:00
Andrea Righi
628605cdee scx_bpfland: get rid of the global dynamic nvcsw threshold
The dynamic nvcsw threshold is no longer used in the scheduler and it
doesn't make sense to report it in the scheduler's statistics, so let's
just drop it.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
2024-10-31 21:48:44 +01:00
Andrea Righi
827f6c6147 scx_bpfland: get rid of MAX_LATENCY_WEIGHT
Get rid of the static MAX_LATENCY_WEIGHT and always rely on the value
specified by --nvcsw-max-thresh.

This allows tuning the maximum latency weight when running in
lowlatency mode (via --nvcsw-max-thresh), and it also restores the
maximum nvcsw limit in non-lowlatency mode, which was incorrectly changed
during the lowlatency refactoring.

Fixes: 4d68133 ("scx_bpfland: rework lowlatency mode to adjust tasks priority")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
2024-10-31 21:48:44 +01:00
Andrea Righi
72e9451c4a scx_bpfland: evaluate nvcsw without using kernel metrics
Evaluate the number of voluntary context switches directly in the BPF
code, without relying on the kernel's p->nvcsw metric.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
2024-10-31 21:48:44 +01:00
Daniel Hodges
6839a84926 scx_layered: Add layer CPU cost to dump
Add the layer CPU cost when dumping. This is useful for understanding
the per layer cost accounting when layered is stalled.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-31 10:45:33 -07:00
Daniel Hodges
9e42480a62 scx_layered: Add layer name to bpf
Add the layer name to the BPF representation of a layer. When printing
debug output, print the layer name as well as the layer index.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-31 05:12:12 -07:00
I Hsin Cheng
3eed58cd26 scx_lavd: Fix typos
Fix "infomation" to "information"

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-10-31 18:55:55 +08:00
I Hsin Cheng
f55cc965ac scx_lavd: Correct the type of taskc within lavd_dispatch()
The type of "taskc" within "lavd_dispatch()" was "struct task_struct *",
while it should be "struct task_ctx *".

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-10-31 18:37:41 +08:00
Changwoo Min
83b5f4eb23
Merge pull request #861 from multics69/lavd-opt-dispatch
scx_lavd: tuning and optimizing latency criticality calculation
2024-10-31 08:10:21 +09:00
Daniel Hodges
a8d245b164 scx_layered: Refactor dispatch
Refactor dispatch to use a separate set of global helpers for
topology-aware dispatch. This change only refactors dispatch to make it
more maintainable, without any functional changes.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-30 07:27:05 -07:00
Changwoo Min
82b25a94f4 scx_lavd: boost task's latency criticality when pinned to a cpu
Pinning a task to a single CPU is a widely-used optimization to
improve latency by reusing cache. So when a task is pinned to
a single CPU, let's boost its latency criticality.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-30 15:28:18 +09:00
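The check itself is cheap, since the kernel already tracks the size of a
task's allowed CPU set; a sketch (the boost factor is an assumption):

```c
static u64 boost_if_pinned(const struct task_struct *p, u64 lat_cri)
{
	/*
	 * nr_cpus_allowed == 1 means the task is pinned to one CPU and
	 * likely benefits from cache reuse; bump its criticality.
	 */
	if (p->nr_cpus_allowed == 1)
		lat_cri += lat_cri / 2;	/* +50%: assumed boost factor */
	return lat_cri;
}
```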
Changwoo Min
fe5554b83d scx_lavd: move reset_lock_futex_boost() to ops.running()
Calling reset_lock_futex_boost() at ops.enqueue() is not accurate, so
move it to ops.running(). This way, we prevent lock holder
preemption only when a lock is acquired between ops.running() and
ops.stopping().

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-30 13:52:52 +09:00
Changwoo Min
0f58a9cd39 scx_lavd: calculate task's latency criticality in the direct dispatch path
Even in the direct dispatch path, calculating the task's latency
criticality is still necessary, since the latency criticality is
used for the preemptability test. This addresses the following
GitHub issue:

https://github.com/sched-ext/scx/issues/856

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-30 12:58:06 +09:00
Changwoo Min
3dcaefcb2f
Merge pull request #854 from multics69/lavd-preempt
scx_lavd: optimize preemption
2024-10-29 00:32:12 +00:00
Daniel Hodges
ad1bfee885 scx_layered: Add cost accounting
Add cost accounting for layers to make weights work on the BPF side.
This is done both at the CPU level and globally. When a CPU
runs out of budget, it acquires budget from the global context. If a
layer runs out of global budget, then all budgets are reset. Weight
handling is done by iterating over layers in order of their available
budget. Layer budgets are proportional to their weights.
2024-10-28 13:09:04 -07:00
Changwoo Min
5b91a525bb scx_lavd: kick CPU explicitly at the ops.enqueue() path
When the current task is chosen to yield, we should explicitly call
scx_bpf_kick_cpu(_, SCX_KICK_PREEMPT). Setting the current task's time
slice to zero is not sufficient in this case because the sched_ext core
does not call resched_curr() on the ops.enqueue() path.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-28 17:31:34 +09:00
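So the yield has to pair the slice reset with an explicit kick, roughly:

```c
/*
 * Called from ops.enqueue() when `victim`, running on `cpu`, should
 * yield: zeroing the slice alone is not noticed until the next
 * scheduling event, so force one with a preemption kick.
 */
static void force_yield(s32 cpu, struct task_struct *victim)
{
	victim->scx.slice = 0;
	scx_bpf_kick_cpu(cpu, SCX_KICK_PREEMPT);
}
```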
Changwoo Min
f56b79b19c scx_lavd: yield for preemption only when a task is ineligible
An eligible task is unlikely to be preemptible. In other words, an
ineligible task is more likely to be preemptible because of its greedy
ratio penalty in the virtual deadline calculation. Hence, we skip the
preemptibility test for an eligible task.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-28 12:45:50 +09:00
Jake Hillion
0f9c1a0a73 layered/timers: support verifying on older kernels and fix logic
Some of the new timer code doesn't verify on older kernels like 6.9. Modify the
code a little to get it verifying again.

Also applies some small fixes to the logic. Error handling was a little off
before and we were using the wrong key in lookups.

Test plan:
- CI
2024-10-25 11:31:00 +01:00
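For reference, the BPF timer pattern involved here looks roughly like
this (the interval and map layout are illustrative; note that the lookup
key must match the map's key type and value, which is the kind of bug
fixed here; the usual scx/libbpf includes are assumed):

```c
#ifndef CLOCK_BOOTTIME
#define CLOCK_BOOTTIME 7
#endif

struct timer_wrapper {
	struct bpf_timer timer;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, struct timer_wrapper);
} timer_map SEC(".maps");

static int timer_cb(void *map, int *key, struct bpf_timer *timer)
{
	/* periodic work goes here, then re-arm */
	bpf_timer_start(timer, 1000000000 /* 1s, illustrative */, 0);
	return 0;
}

static s32 start_timers(void)
{
	u32 key = 0;	/* must be the map's actual key type and value */
	struct timer_wrapper *w = bpf_map_lookup_elem(&timer_map, &key);
	s32 err;

	if (!w)
		return -ENOENT;
	err = bpf_timer_init(&w->timer, &timer_map, CLOCK_BOOTTIME);
	if (err)
		return err;
	err = bpf_timer_set_callback(&w->timer, timer_cb);
	if (err)
		return err;
	return bpf_timer_start(&w->timer, 1000000000, 0);
}
```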
Changwoo Min
ea600d2f3b
Merge pull request #846 from multics69/lavd-issue-385
scx_lavd: fix uninitialized memory access at comp_preemption_info()
2024-10-25 01:47:20 +00:00
Pat Somaru
1e0e0d2f50
make timerlib work the best it can with tooling 2024-10-24 13:12:53 -04:00
Pat Somaru
8ab38559aa
fix lsp to work after multiarch support 2024-10-24 13:12:53 -04:00
Daniel Hodges
e38282d61a scx_layered: Fix declarations in timer 2024-10-24 09:09:53 -07:00
Daniel Hodges
41a612f34d scx_layered: Add monitor
Add a monitor timer for scx_layered. For now the monitor is a noop.
2024-10-24 04:49:41 -04:00
Changwoo Min
4f6947736f scx_lavd: fix uninitialized memory access comp_preemption_info()
The previous code accessed uninitialized memory in comp_preemption_info()
when called from can_task1_kick_task2() <- try_yield_current_cpu()
to test whether task2 is a lock holder. However, task2 is guaranteed
not to be a lock holder in all its callers. So move the lock holder test
to can_cpu1_kick_cpu2().

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-24 16:07:53 +09:00
Changwoo Min
a13bb8028e
Merge pull request #837 from multics69/lavd-tuning-v4
scx_lavd: various optimizations for more consistent performance
2024-10-23 22:56:31 +00:00
Tejun Heo
cc8633996b Revert "fix ci errors due to __str update in kfunc signature"
This reverts commit 29918c03c8.
2024-10-23 08:58:06 -10:00
Changwoo Min
b90ecd7e8f scx_lavd: proactively kick a CPU at the ops.enqueue() path
When a task is enqueued, kick an idle CPU in the chosen scheduling
domain. This reduces the task's temporary stall time by waking up the
CPU as early as possible.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-23 21:43:11 +09:00
Changwoo Min
731a7871d7 scx_lavd: change the greedy penalty function
We used to apply the latency penalty linearly in the greedy ratio.
However, this gives the greedy ratio too much influence in determining
the virtual deadline, especially among under-utilized tasks (< 100.0%).
Now, we treat all under-utilized tasks with the same greedy ratio
(= 100.0%). For over-utilized tasks, we apply a somewhat milder penalty
to avoid sudden latency spikes.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-23 21:42:55 +09:00
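The described shape of the new penalty function, as a sketch (the
per-mille scale and the halved slope are assumptions):

```c
/* greedy_ratio is in per-mille: 1000 == 100.0% CPU utilization. */
static u64 greedy_penalty(u64 greedy_ratio)
{
	/* All under-utilized tasks are treated identically... */
	if (greedy_ratio <= 1000)
		return 1000;

	/*
	 * ...and over-utilized tasks get a milder, sub-linear slope to
	 * avoid sudden latency spikes.
	 */
	return 1000 + (greedy_ratio - 1000) / 2;
}
```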
Changwoo Min
9acf950b75 scx_lavd: change how to use the context information for latency criticality
Previously, contextual information, such as sync wakeup and kernel
task status, was incorporated into the final latency criticality value
ad hoc by adding a constant. Instead, let's make everything proportional
to the run time and the waker and wakee frequencies by scaling the run
time and the frequencies up or down.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-23 21:32:18 +09:00
Pat Somaru
29918c03c8
fix ci errors due to __str update in kfunc signature 2024-10-23 02:18:26 -04:00
Changwoo Min
fdca0c04ed
Merge pull request #831 from multics69/lavd-fix-bpf-veri
scx_lavd: fix/work around a verifier error
2024-10-23 01:45:01 +09:00
Daniel Hodges
4898f5082a scx_layered: Add timer helpers
Add a registry of timers and a helper for running them.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-22 07:57:44 -07:00
Changwoo Min
6fb57643fb scx_lavd: remove the time restriction in preemption
Previously, preemption was allowed only when a task was early in its
time slice, by using LAVD_PREEMPT_KICK_MARGIN and
LAVD_PREEMPT_TICK_MARGIN. This is not necessary any more because the
lock holder preemption avoidance prevents harmful preemptions. So we
remove LAVD_PREEMPT_KICK_MARGIN and LAVD_PREEMPT_TICK_MARGIN and
unleash the preemption.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-22 17:48:56 +09:00
Changwoo Min
07ed821511 scx_lavd: incorporate task's weight to latency criticality
When calculating a task's latency criticality, incorporate the task's
weight into runtime, wake_freq, and wait_freq more systematically.
It reads more cleanly and works better under heavy load.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-22 17:48:56 +09:00
Changwoo Min
47dd1b9582 scx_lavd: respect a chosen cpu even if it is not idle
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-22 17:48:56 +09:00
Changwoo Min
257a3db376 scx_lavd: add ops.cpu_release()
When a CPU is released to serve a higher-priority scheduler class,
requeue the tasks in its local DSQ so they are enqueued globally.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-22 17:48:56 +09:00
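This is the standard sched_ext idiom for the callback:

```c
void BPF_STRUCT_OPS(lavd_cpu_release, s32 cpu,
		    struct scx_cpu_release_args *args)
{
	/*
	 * A higher-priority sched class (e.g., RT) took this CPU; push
	 * everything queued on its local DSQ back through ops.enqueue()
	 * so the tasks can run elsewhere.
	 */
	scx_bpf_reenqueue_local();
}
```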
Changwoo Min
89749ecad7 scx_lavd: fix/work around a verifier error
Without this, the BPF verifier emits the following errors with *some*
versions of vmlinux.h. So add +1 to work around the problem.

---------------
; bpf_for(j, 0, 64) { @ main.bpf.c:1926
509: (bf) r1 = r8                     ; R1_w=fp-32 R8_w=fp-32 refs=66,2035
510: (b4) w2 = 0                      ; R2_w=0 refs=66,2035
511: (b4) w3 = 64                     ; R3_w=64 refs=66,2035
512: (85) call bpf_iter_num_new#104189        ; R0=scalar() fp-32=iter_num(ref_id=2048,state=active,depth=0) refs=66,2035,2048
513: (bf) r1 = r8                     ; R1=fp-32 R8=fp-32 refs=66,2035,2048
514: (85) call bpf_iter_num_next#104191 515: R0_w=rdonly_mem(id=2049,ref_obj_id=2048,sz=4) R6=scalar(id=2047,smin=smin32=0,smax=umax=smax32=umax32=7,var_off=(0x0; 0x7)) R7=scalar() R8=fp-32 R9=map_value(map=bpf_bpf.bss,ks=4,vs=4584,off=384,smin=smin32=0,smax=umax=smax32=umax32=3968,var_off=(0x0; 0xf80)) R10=fp0 fp-16=iter_num(ref_id=66,state=active,depth=1) fp-24=iter_num(ref_id=2035,state=active,depth=1) fp-32=iter_num(ref_id=2048,state=active,depth=1) fp-80=scalar(id=1) fp-88=map_value(map=.data.LAVD,ks=4,vs=1320,off=40,smin=smin32=0,smax=umax=smax32=umax32=1240,var_off=(0x0; 0x7f8)) fp-96=????0 fp-112=rcu_ptr_bpf_cpumask() fp-120=rcu_ptr_bpf_cpumask() fp-128=rcu_ptr_bpf_cpumask() fp-136=rcu_ptr_bpf_cpumask() refs=66,2035,2048
; bpf_for(j, 0, 64) { @ main.bpf.c:1926
515: (15) if r0 == 0x0 goto pc+49     ; R0_w=rdonly_mem(id=2049,ref_obj_id=2048,sz=4) refs=66,2035,2048
516: (64) w6 <<= 6                    ; R6=scalar(smin=smin32=0,smax=umax=smax32=umax32=448,var_off=(0x0; 0x1c0)) refs=66,2035,2048
517: (61) r8 = *(u32 *)(r0 +0)        ; R0=rdonly_mem(id=2049,ref_obj_id=2048,sz=4) R8_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) refs=66,2035,2048
518: (26) if w8 > 0x3f goto pc+46     ; R8_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=63,var_off=(0x0; 0x3f)) refs=66,2035,2048
; if (cpumask & 0x1LLU << j) { @ main.bpf.c:1927
519: (bf) r1 = r7                     ; R1_w=scalar(id=2053) R7=scalar(id=2053) refs=66,2035,2048
520: (7f) r1 >>= r8                   ; R1_w=scalar() R8_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=63,var_off=(0x0; 0x3f)) refs=66,2035,2048
521: (57) r1 &= 1                     ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1)) refs=66,2035,2048
522: (15) if r1 == 0x0 goto pc+38     ; R1_w=1 refs=66,2035,2048
; cpu = (i * 64) + j; @ main.bpf.c:1928
523: (4c) w8 |= w6                    ; R6=scalar(smin=smin32=0,smax=umax=smax32=umax32=448,var_off=(0x0; 0x1c0)) R8_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=511,var_off=(0x0; 0x1ff)) refs=66,2035,2048
; bpf_cpumask_set_cpu(cpu, cd_cpumask); @ main.bpf.c:1929
524: (bc) w1 = w8                     ; R1_w=scalar(id=2054,smin=smin32=0,smax=umax=smax32=umax32=511,var_off=(0x0; 0x1ff)) R8_w=scalar(id=2054,smin=smin32=0,smax=umax=smax32=umax32=511,var_off=(0x0; 0x1ff)) refs=66,2035,2048
525: (79) r2 = *(u64 *)(r10 -88)      ; R2_w=map_value(map=.data.LAVD,ks=4,vs=1320,off=40,smin=smin32=0,smax=umax=smax32=umax32=1240,var_off=(0x0; 0x7f8)) R10=fp0 fp-88=map_value(map=.data.LAVD,ks=4,vs=1320,off=40,smin=smin32=0,smax=umax=smax32=umax32=1240,var_off=(0x0; 0x7f8)) refs=66,2035,2048
526: (85) call bpf_cpumask_set_cpu#93595
invalid access to map value, value_size=1320 off=1280 size=48
R2 max value is outside of the allowed memory range
processed 24200 insns (limit 1000000) max_states_per_insn 19 total_states 961 peak_states 789 mark_read 44
---------------

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-22 17:19:37 +09:00
Changwoo Min
d5b8aafa1a
Merge pull request #822 from multics69/lavd-tuning-v3
scx_lavd: misc performance tuning
2024-10-22 09:57:58 +09:00
Tejun Heo
6ea15f9f9f
Merge pull request #819 from minosfuture/vmlinux_per_arch
Use per-arch vmlinux.h v2
2024-10-21 19:36:52 +00:00
likewhatevs
303c6d09a0
Merge pull request #824 from likewhatevs/layered-exit-task-no-missing-ctx
scx_layered: fix exit_task ctx lookup err
2024-10-21 14:52:07 +00:00
Jake Hillion
55c9636f78 layered: bpf: add layer kind to layer
Currently we have an approximation of LayerKind in the BPF code with `open` on
the layer, but it is difficult/impossible to tell the difference between an
Open and a Grouped layer. Add a `kind` field to the BPF `layer` and plumb
through an enum from the Rust side.
2024-10-21 11:32:17 +01:00
Changwoo Min
5f19fa0bab scx_lavd: refill time slice once for a lock holder
When a task holds a lock, refill its time slice once at the
ops.dispatch() path to avoid the lock holder preemption problem.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-21 15:56:51 +09:00
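A sketch of the one-shot refill at dispatch time (the task-context
fields and the slice constant are assumed names):

```c
static void refill_lock_holder(struct task_struct *p, struct task_ctx *tctx)
{
	/*
	 * Refill at most once per lock acquisition, so a holder can
	 * finish its critical section without being preempted, while a
	 * misbehaving holder cannot monopolize the CPU.
	 */
	if (tctx->lock_holder && !tctx->slice_refilled) {
		p->scx.slice = LAVD_SLICE_MAX_NS;	/* assumed constant */
		tctx->slice_refilled = true;
	}
}
```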
Changwoo Min
5a852dc3d9 scx_lavd: direct dispatch when there is an idle CPU
When there is an idle CPU, direct dispatch is performed to reduce
scheduling latency. This didn't work well before, but it seems
to work well now with other tunings.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-21 15:56:51 +09:00
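The common direct-dispatch pattern this enables, shown with the default
idle-CPU picker (lavd's actual CPU selection is more elaborate):

```c
s32 BPF_STRUCT_OPS(lavd_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);

	/*
	 * An idle CPU was found and reserved: dispatch straight to its
	 * local DSQ and skip ops.enqueue() for minimal wakeup latency.
	 */
	if (is_idle)
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
	return cpu;
}
```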
Changwoo Min
420de70159 scx_lavd: give more penalty to long-running tasks
Giving a larger penalty to long-running tasks helps to segregate
latency-critical tasks, which are usually short-running, from
long-running tasks, which are compute-intensive.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-21 15:56:41 +09:00
Pat Somaru
d89c571593
scx_layered: do not attempt ctx lookup on tasks exited before running on scx 2024-10-20 17:47:24 -04:00
Andrea Righi
fb3f1d0b43
Merge pull request #821 from sched-ext/rustland-min-vtime-budget
scx_rustland: Adjust task's vruntime budget based on latency weight
2024-10-20 07:44:35 +00:00
Changwoo Min
bf1b014d63
Merge pull request #818 from multics69/lavd-tuning
scx_lavd: add missing reset_lock_futex_boost()
2024-10-20 01:41:54 +00:00
Daniel Hodges
e72e5ce0f4
Merge pull request #744 from minosfuture/main
scx_layered: Fix crash on aarch64 due to unavailable cache id file
2024-10-19 22:33:53 +00:00
Ming Yang
1b5359ef4a Use per-arch vmlinux.h v2
Rework the per-arch vmlinux solution:
* add a per-arch directory under sched/include/arch/, in which we
  maintain the vmlinux.h symlink and the real file
  vmlinux-{kernel_ver}-g{sha1}.h. The original sched/include/vmlinux/
  folder is removed.
* update the meson build `-I` option to find the new vmlinux.h location
* update the cargo build scripts to use the per-arch vmlinux.h for
  generating bindings
* keep the original ClangInfo refactoring changes

Signed-off-by: Ming Yang <minos.future@gmail.com>
2024-10-19 10:50:59 -07:00
Andrea Righi
30a2a2013c scx_rustland: Adjust task's vruntime budget based on latency weight
Adjust the amount of vruntime budget an idle task can accumulate as a
function of its latency weight, which is derived from the average number
of voluntary context switches.

This ensures that latency-sensitive tasks naturally receive an
additional priority boost, and we can avoid scaling down the vruntime
to determine the task's deadline, making the scheduler fairer.

It also makes the scheduler more robust: rustland can now survive
intensive stress tests, such as `stress-ng --cpu-sched 64` or hackbench.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-19 19:32:14 +02:00
Daniel Hodges
b1b76ee72a
scx_rusty: Cleanup cpumask casting
Use the cast_mask() helper function to clean up scx_rusty.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-10-19 12:01:36 -04:00
Changwoo Min
2fd395bbbf scx_lavd: remove unnecessary load tracking
The algorithm has evolved to decide the time slice without tracking
the system-wide load, so remove the obsolete load tracking code.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-19 15:39:24 +09:00
Changwoo Min
8d63024be7 scx_lavd: add missing reset_lock_futex_boost()
reset_lock_futex_boost() should be called at every context switch of a
task. Otherwise, in the worst case, a task and its CPU could block
preemption. To avoid such a situation, add the missing
reset_lock_futex_boost() calls.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-10-19 15:39:18 +09:00
Ming Yang
f3f4726c09 scx_layered: Read CPU topology for building CpuPool
Building the CpuPool from the cache-CPU topology did not work on arm,
because the `/sys/devices/system/cpu/cpu{}/cache/index{}/id` file is
unavailable.

Read the CPU topology instead.

Signed-off-by: Ming Yang <minos.future@gmail.com>
2024-10-17 23:41:08 -07:00
Andrea Righi
48bbcd24dd scx_bpfland: tune default settings
Adjust some default settings after the rework done with commit 112a5d4
("scx_bpfland: rework lowlatency mode to adjust tasks priority").

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-17 21:46:51 +02:00
Andrea Righi
4d68133f3b scx_bpfland: rework lowlatency mode to adjust tasks priority
Rework lowlatency mode as follows:
 - introduce a task dynamic priority: the task weight multiplied by the
   average number of voluntary context switches
 - use the dynamic priority to determine the task's vruntime (instead of
   the static task weight)
 - evaluate the task's minimum vruntime as a function of the dynamic
   priority (tasks with a higher dynamic priority can have a smaller
   vruntime compared to tasks with a lower dynamic priority)

The dynamic priority makes it possible to maintain good system
responsiveness even without classifying tasks as "interactive" and
"regular"; therefore, in lowlatency mode only the shared DSQ is
used (the priority DSQ is disabled).

Using a separate priority queue to dispatch "interactive" tasks makes
the scheduler less fair, allowing latency-sensitive tasks to be
prioritized even when there is a high number of tasks in the system
(e.g., `stress-ng -c 1024` or similar scenarios), where relying solely
on the dynamic priority may not be sufficient.

On the other hand, disabling the classification of "interactive" tasks
results in a fairer scheduler and more predictable performance, making
it better suited for soft real-time applications (e.g., audio and
multimedia).

Therefore, the --lowlatency option is retained to allow users to choose
between more predictable performance (by disabling the interactive task
classification) and a more responsive system (the default).

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-17 21:46:51 +02:00
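A sketch of the dynamic priority described above (the averaging input
and the cap are assumptions):

```c
#define MAX_AVG_NVCSW	128	/* assumed cap on the nvcsw factor */

/*
 * Dynamic priority: the static weight scaled by the average voluntary
 * context switch rate, so I/O-bound tasks sort earlier in vruntime
 * order without needing a separate priority DSQ.
 */
static u64 task_dyn_prio(const struct task_struct *p, u64 avg_nvcsw)
{
	if (avg_nvcsw < 1)
		avg_nvcsw = 1;
	if (avg_nvcsw > MAX_AVG_NVCSW)
		avg_nvcsw = MAX_AVG_NVCSW;
	return (u64)p->scx.weight * avg_nvcsw;
}
```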
Andrea Righi
d336892c71
Merge pull request #816 from sched-ext/rustland-core-update-doc
scx_rustland_core: update documentation about the new API
2024-10-17 19:18:16 +00:00
Andrea Righi
a155ff2ada scx_rustland_core: update documentation about the new API
Update the documentation, adding the new task statistics provided by
scx_rustland_core.

Fixes: be681c7 ("scx_rustland_core: pass nvcsw, slice and dsq_vtime to user-space")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-17 19:07:51 +02:00
f1b1830512
Merge pull request #814 from JakeHillion/pr814
layered: add RandomTopo layer growth algorithm
2024-10-17 17:05:53 +00:00
Jake Hillion
1415b4a454 layered: make disable_topology arg require equals
The recent changes to `disable_topology`, making the arg an `Option<bool>`
instead of a `bool`, caused an issue where it incorrectly attached
arguments. Make the argument `require_equals` to fix this case.

This is a behaviour change for anybody previously relying on `-t true`,
`-t false`, `--disable-topology true`, or `--disable-topology false`. The
equals syntax worked before and continues to work after, as demonstrated in the
CI.

Test plan:

Before:
```sh
$ sudo target/release/scx_layered -t f:/tmp/test.json
error: invalid value 'f:/tmp/test.json' for '--disable-topology
[<DISABLE_TOPOLOGY>]'
  [possible values: true, false]

  For more information, try '--help'.
```

After:
```sh
$ sudo target/release/scx_layered -t f:/tmp/test.json
14:44:00 [INFO] CPUs: online/possible=176/176 nr_cores=88
14:44:00 [INFO] Disabling topology awareness
...
^CEXIT: Scheduler unregistered from user space
```
2024-10-17 15:46:30 +01:00
Jake Hillion
a0fe303b61 layered: add RandomTopo layer growth algorithm
Add an additional layer growth algorithm, named 'RandomTopo'. It follows these
rules:
- Randomise NUMA nodes. List each core in each NUMA node before a core from
  another NUMA node.
- Randomise LLCs within each NUMA node. List each core in each LLC before a
  core in a different LLC.
- Randomise the core order within each LLC.

This attempts to provide a relatively evenly distributed set of cores while
considering topology. Unlike `Topo`, it does not require you to specify the
ordering and instead generates it from the hardware, making desyncs between the
config and the hardware less likely.

Currently `RandomTopo` considers topology even with `--disable-topology=true`.
I can see the arguments for this going both ways. On one hand requesting
disable topology suggests you want no consideration of machine topology, and
`RandomTopo` should decay to `Random` (which it does on single node/LLC machines
anyway). On the other hand, the config explicitly specifies `RandomTopo` and
should consider the topology. If anyone feels strongly I can change this to
respect `disable_topology`.

Test plan:
```sh
$ sudo target/release/scx_layered -v f:/tmp/test.json
...
14:31:19 [DEBUG] layer: batch algo: RandomTopo core order: [47, 44, 43, 42, 40, 45, 46, 41, 38, 37, 36, 39, 34, 32, 35, 33, 54, 49, 50, 52, 51, 48, 55, 53, 68, 64, 66, 67, 70, 69, 71, 65, 9, 10, 12, 15, 14, 11, 8, 13, 59, 60, 57, 63, 62, 56, 58, 61, 2, 3, 5, 4, 0, 6, 7, 1, 86, 83, 85, 87, 84, 81, 80, 82, 20, 22, 19, 23, 21, 18, 17, 16, 30, 25, 26, 31, 28, 27, 29, 24, 78, 73, 74, 79, 75, 77, 76, 72]
14:31:19 [DEBUG] layer: immediate algo: RandomTopo core order: [45, 40, 46, 42, 47, 43, 41, 44, 80, 82, 83, 84, 85, 86, 81, 87, 13, 10, 9, 15, 14, 12, 11, 8, 36, 38, 39, 32, 34, 35, 33, 37, 7, 3, 1, 0, 2, 5, 4, 6, 53, 52, 54, 48, 50, 49, 55, 51, 76, 77, 79, 78, 73, 74, 72, 75, 71, 66, 64, 67, 70, 69, 65, 68, 24, 26, 31, 25, 28, 30, 27, 29, 58, 56, 59, 61, 57, 62, 60, 63, 16, 19, 17, 23, 22, 20, 18, 21]
...
```

This is a machine with 1 NUMA/11 LLCs with 8 cores per LLC and you can see the
results are grouped by LLC but random within.
2024-10-17 15:36:00 +01:00
Daniel Hodges
b01ff79080
Merge pull request #805 from hodgesds/layered-refresh-cleanup
scx_layered: Refactor refresh cpumasks
2024-10-16 19:06:15 +00:00
Andrea Righi
2ea47af4bc
Merge pull request #804 from sched-ext/rustland-fixes
scx_rustland fixes and improvements
2024-10-16 18:26:03 +00:00
Tejun Heo
84d8abf913 Revert "Use per-arch vmlinux.h"
This reverts commit a23f3566e3.
2024-10-16 06:42:28 -10:00
Tejun Heo
bd79059f1a Revert "Add vmlinux.h for multiple arch"
This reverts commit 7067092555.
2024-10-16 06:42:18 -10:00
Dan Schatzberg
730052a0c4
Merge pull request #803 from dschatzberg/mitosis_fallback_dsq
scx_mitosis: Handle pinned tasks
2024-10-16 13:26:23 +00:00
Andrea Righi
763da6ab55 scx_rlfifo: operate in a more work-conserving way
Make scx_rlfifo even simpler and keep dispatching tasks even if the CPUs
are all busy.

This allows better stress testing of the scx_rustland_core backend, by
using both the per-CPU DSQs and the global shared DSQ.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-10-16 14:06:00 +02:00