Commit Graph

1122 Commits

Pat Somaru
89f4aa1351
scx_layered: add antistall
Add a timer-based antistall mechanism to scx_layered, along with new
flags to enable/disable it and to specify the delay in seconds before
it kicks in.

Also update the CI config to make sure this verifies/runs.
2024-11-08 20:31:02 -05:00
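A minimal sketch of how a timer-based antistall check can be wired up in
BPF, assuming hypothetical names (antistall_timer_map, antistall_sec,
antistall_cb); the actual flag plumbing and DSQ scan live in scx_layered
itself:

    struct antistall_timer {
        struct bpf_timer timer;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, u32);
        __type(value, struct antistall_timer);
    } antistall_timer_map SEC(".maps");

    const volatile u64 antistall_sec = 3;   /* seconds of delay, set from userspace */

    static int antistall_cb(void *map, u32 *key, struct bpf_timer *timer)
    {
        /* scan DSQs for tasks queued longer than antistall_sec and kick
         * CPUs to service them (scan elided), then re-arm the timer */
        bpf_timer_start(timer, antistall_sec * 1000000000ULL, 0);
        return 0;
    }

    static s32 antistall_setup(void)
    {
        struct antistall_timer *t;
        u32 key = 0;

        t = bpf_map_lookup_elem(&antistall_timer_map, &key);
        if (!t)
            return -ENOENT;
        bpf_timer_init(&t->timer, &antistall_timer_map, CLOCK_MONOTONIC);
        bpf_timer_set_callback(&t->timer, antistall_cb);
        return bpf_timer_start(&t->timer, antistall_sec * 1000000000ULL, 0);
    }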
Tejun Heo
bb91ad0084 scx_layered: Work around older kernels choking on function calls from sleepable progs
The verifier in older kernels chokes on function calls from sleepable
progs, triggering a nonsensical RCU state error:

   frame1: R1_w=scalar(id=674,smin=smin32=0,smax=umax=smax32=umax32=51,var_off=(0x0; 0x3f)) R10=; return *llc_ptr;
  1072: (61) r0 = *(u32 *)(r2 +0)       ; frame1: R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R2_w=map_value(map=bpf_bpf.rodata,ks=4,vs=9570,off=4400,smin=smin32=0,smax=umax=smax32=umax32=204,var_off=(0x0; 0xfc)) refs=13,647
  ; }
  1073: (95) exit
  bpf_rcu_read_unlock is missing
  processed 10663 insns (limit 1000000) max_states_per_insn 8 total_states 615 peak_states 281 mark_read 20
  -- END PROG LOAD LOG --

Work around this by adding an __always_inline variant of cpu_to_llc_id()
and using it from layered_init(). Note that we can't switch everything to
__always_inline, as that can lead to verification failure due to the
instruction limit.
2024-11-08 08:47:57 -10:00
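The shape of the workaround, sketched with an assumed rodata table
(cpu_llc_ids and MAX_CPUS are illustrative): the sleepable init path
calls the __always_inline variant, while hot paths keep the regular
function to stay under the instruction limit.

    #define MAX_CPUS 512
    const volatile u32 cpu_llc_ids[MAX_CPUS];   /* filled in by userspace */

    /* __always_inline variant: no function call for the older verifier
     * to misjudge when invoked from the sleepable layered_init() */
    static __always_inline u32 __cpu_to_llc_id(s32 cpu)
    {
        if (cpu < 0 || cpu >= MAX_CPUS)
            return 0;
        return cpu_llc_ids[cpu];
    }

    /* regular version for hot paths; inlining everything could blow
     * past the verifier's instruction limit */
    static u32 cpu_to_llc_id(s32 cpu)
    {
        return __cpu_to_llc_id(cpu);
    }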
Lohith C V
a2e119ae23 scx_lavd: docs: fix typos 2024-11-08 16:25:55 +05:30
Daniel Hodges
3b47782bf4 scx_layered: Add fallback costs to dump
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-07 19:49:09 -05:00
Daniel Hodges
73926d6481
Merge pull request #912 from hodgesds/layered-mask-cleanup
scx_layered: Cleanup cpumask
2024-11-07 22:52:28 +00:00
5ae1b84533
Merge pull request #908 from JakeHillion/pr908
layered/topo: lift layer specific checks out of per-LLC loop
2024-11-07 21:48:20 +00:00
Daniel Hodges
ee4fd3dace scx_layered: Cleanup cpumask
Clean up the remaining cpumasks to use `cast_mask()`.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-07 13:18:10 -08:00
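For reference, cast_mask() (from the shared BPF headers) converts an
allocated struct bpf_cpumask * into the read-only struct cpumask * that
kfuncs expect; an illustrative fragment:

    struct bpf_cpumask *mask = bpf_cpumask_create();
    s32 cpu;

    if (!mask)
        return -ENOMEM;

    bpf_cpumask_set_cpu(0, mask);
    /* kfuncs take const struct cpumask *, hence the cast helper */
    cpu = scx_bpf_pick_idle_cpu(cast_mask(mask), 0);
    bpf_cpumask_release(mask);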
Daniel Hodges
637fc3f6e1 scx_layered: Use layer idle_smt option
When selecting an idle CPU, use the layer's idle_smt option. This may
improve cache locality in some cases by placing tasks on CPUs that share
caches.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-07 13:05:19 -08:00
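A sketch of the selection logic, assuming the layer struct carries an
idle_smt flag (field and helper names are illustrative); SCX_PICK_IDLE_CORE
restricts the search to fully idle SMT cores:

    struct layer {
        bool idle_smt;
        /* other per-layer config elided */
    };

    static s32 pick_idle_cpu_for(struct task_struct *p, struct layer *layer)
    {
        /* prefer whole idle cores when the layer asks for it */
        u64 flags = layer->idle_smt ? SCX_PICK_IDLE_CORE : 0;

        return scx_bpf_pick_idle_cpu(p->cpus_ptr, flags);
    }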
Daniel Hodges
7db2ef22d0 scx_layered: Fix verifier issue on older kernels
On some older kernels, layered fails verification. Prevent certain
helpers from being inlined to pass the verifier.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-07 12:20:58 -08:00
Jake Hillion
ba54808150 layered/topo: lift layer specific checks out of per-LLC loop
The loops in topology aware mode were recently refactored to place the
per-LLC loops inside the per-layer loops. However, the layer-specific
checks were left in the inner loops, slowing this down unnecessarily.

Pull the layer-specific checks out of the inner loop into the outer loop.

Also changes these functions to `__weak` to ensure they don't get inlined -
they're expected to be verified as global functions.

Note to reviewers: this looks good to me, but I'd appreciate it if you
reviewed the De Morgan applications in detail.

Test plan:
- `cargo build --release && sudo target/release/scx_layered --run-example` on a
  machine with multiple LLCs. It's possible to stall it quite easily with
  stress-ng, but I believe this is also the case on main.
2024-11-07 18:34:44 +00:00
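The shape of the change, with illustrative names; the layer check is
loop-invariant with respect to the LLC, so it moves out one level:

    u32 layer_idx, llc_idx;

    /* before: the check ran once per (layer, LLC) pair */
    bpf_for(layer_idx, 0, nr_layers)
        bpf_for(llc_idx, 0, nr_llcs) {
            if (!layer_wants_dispatch(layer_idx))
                continue;
            try_consume_llc(layer_idx, llc_idx);
        }

    /* after: the check runs once per layer */
    bpf_for(layer_idx, 0, nr_layers) {
        if (!layer_wants_dispatch(layer_idx))
            continue;
        bpf_for(llc_idx, 0, nr_llcs)
            try_consume_llc(layer_idx, llc_idx);
    }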
Changwoo Min
416de68b72
Merge pull request #904 from multics69/lavd-drop-padding
scx_lavd: drop padding in cpdom_cpumask, which was a workaround
2024-11-07 16:13:07 +00:00
Changwoo Min
56357a79db
Merge pull request #903 from multics69/lavd-issue-897
scx_lavd: update cur_logical_clk atomically
2024-11-07 16:12:56 +00:00
Daniel Hodges
3cc849f234 scx_layered: Fix verifier issue when tracing
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-07 06:43:40 -08:00
Daniel Hodges
487baa4a03 scx_layered: Add fallback DSQ cost accounting
Add fallback DSQ cost accounting so that fallback DSQ costs are tracked
and dispatch of fallback DSQs can be done in a standardized way.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-07 05:25:57 -08:00
Changwoo Min
22cb9e9ce1 scx_lavd: drop padding in cpdom_cpumask, which was a workaround
The verifier error seems to stem from using the wrong vmlinux.h. Also,
PR #889 seems to completely fix the problem, so drop the workaround.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-11-07 16:13:06 +09:00
Changwoo Min
e9ba2d53fa scx_lavd: update cur_logical_clk atomically
Previously, cur_logical_clk was updated with WRITE_ONCE(), which does
not guarantee atomicity when concurrent writes happen -- which is
possible. So change it to use CAS (compare-and-swap).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-11-07 16:01:50 +09:00
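A sketch of the CAS-based update, assuming a global u64 cur_logical_clk
and a bounded retry count to keep the verifier happy; only
monotonically-forward updates win the race:

    static u64 cur_logical_clk;

    static void advance_cur_logical_clk(u64 new_clk)
    {
        u64 cur = READ_ONCE(cur_logical_clk);
        int i;

        bpf_for(i, 0, 8) {  /* bounded retries; 8 is an arbitrary cap */
            u64 seen;

            if (cur >= new_clk)     /* another CPU already advanced it */
                break;
            seen = __sync_val_compare_and_swap(&cur_logical_clk, cur, new_clk);
            if (seen == cur)        /* our CAS won */
                break;
            cur = seen;             /* lost the race; re-evaluate */
        }
    }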
Emil Tsalapatis
5e35a12ce3 remove stray print 2024-11-06 18:02:08 -08:00
Emil Tsalapatis
42880404e1 Merge branch 'main' of https://github.com/sched-ext/scx into core_enums 2024-11-06 12:44:23 -08:00
Emil Tsalapatis
2f174db96f use the enum singleton in the userspace scheduler components 2024-11-06 12:17:16 -08:00
Emil Tsalapatis
1cabed9d09 Autogenerate enums and BPF enum setters for Rust schedulers 2024-11-06 12:17:16 -08:00
Emil Tsalapatis
d500c50098 add autogenerated enum definitions for Rust schedulers 2024-11-06 12:17:16 -08:00
Dan Schatzberg
fb635cb8f0
Merge pull request #438 from dschatzberg/mitosis
Refactor select_cpu + enqueue for proper synchronization and handling of !wakeup
2024-11-06 18:36:05 +00:00
Andrea Righi
f402f118db
Merge pull request #899 from sched-ext/bpfland-rework
scx_bpfland: rework
2024-11-06 17:55:36 +00:00
Tejun Heo
ad45727139 version: v1.0.6 2024-11-06 06:54:26 -10:00
Dan Schatzberg
af2cb1abbe scx_mitosis: add RCU-like synchronization
scx_mitosis relied on the implicit assumption that after a sched tick,
all outstanding scheduling events had completed, but this might not
actually be correct. This feels like a natural use case for RCU, but
there is no way to directly make use of RCU in BPF. Instead, this commit
implements an RCU-like synchronization mechanism.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-11-06 08:33:29 -08:00
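A minimal sketch of such an RCU-like scheme under assumed names
(global_gen, percpu_gen): CPUs stamp the current generation at quiescent
points, and a grace period is over once every CPU has observed the
generation that was current when it began.

    #define MAX_CPUS 512

    static u64 global_gen = 1;
    static u64 percpu_gen[MAX_CPUS];    /* last generation each CPU observed */
    const volatile u32 nr_cpus = 1;     /* set from userspace */

    /* called from scheduling paths at quiescent points */
    static void note_quiescent(s32 cpu)
    {
        if (cpu >= 0 && cpu < MAX_CPUS)
            WRITE_ONCE(percpu_gen[cpu], READ_ONCE(global_gen));
    }

    /* called from the tick: has every CPU passed through a quiescent
     * point since start_gen was current? */
    static bool grace_period_done(u64 start_gen)
    {
        u32 cpu;

        bpf_for(cpu, 0, nr_cpus) {
            if (cpu < MAX_CPUS && READ_ONCE(percpu_gen[cpu]) < start_gen)
                return false;
        }
        return true;
    }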
Emil Tsalapatis
479d515a45
Merge branch 'main' into core_enums 2024-11-06 11:07:42 -05:00
Emil Tsalapatis
23f302cf13 add SCX_SLICE_* macros to scx_utils and use them for the Rust schedulers 2024-11-06 07:52:04 -08:00
Andrea Righi
78101e4688 scx_bpfland: drop lowlatency mode and the priority DSQ
Schedule all tasks using a single global DSQ. This gives better control
to prevent potential starvation conditions.

With this change, scx_bpfland adopts a logic similar to scx_rusty and
scx_lavd, prioritizing tasks based on the frequency of their wait and
wake-up events, rather than relying exclusively on the average amount of
voluntary context switches.

Tasks are still classified as interactive / non-interactive based on the
amount of voluntary context switches, but this now only affects the
cpufreq logic.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
2024-11-06 15:06:39 +01:00
Changwoo Min
d0eeebf98a scx_lavd: deprioritize long runtimes by prioritizing frequencies further
Since tasks' average runtimes show a skewed distribution, directly
using the runtime in the deadline calculation causes several
performance regressions. Instead, let's use a constant factor and
further prioritize frequency factors to deprioritize long-runtime
tasks.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-11-06 18:08:57 +09:00
Changwoo Min
cfe23aa21b scx_lavd: avoid self-IPI at preemption
Revert the change that sent a self-IPI at preemption when the victim
CPU is the current CPU. The cost of a self-IPI is prohibitively high in
some workloads (e.g., perf bench). Instead, reset the task's time slice
to zero.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-11-06 13:59:24 +09:00
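The shape of the fix as a hedged sketch: when the preemption victim is
the local CPU, zeroing the current task's slice makes it yield at the
next scheduling point instead of paying for a self-IPI.

    static void preempt_victim(s32 victim_cpu)
    {
        if (victim_cpu == bpf_get_smp_processor_id()) {
            struct task_struct *cur = bpf_get_current_task_btf();

            /* expire the slice: the local task yields at the next
             * scheduling point, no self-IPI needed */
            cur->scx.slice = 0;
        } else {
            /* a remote victim still needs the IPI */
            scx_bpf_kick_cpu(victim_cpu, SCX_KICK_PREEMPT);
        }
    }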
Andrea Righi
efc41dd936 scx_bpfland: strict domain affinity
Rather than always migrating tasks across LLC domains when no idle CPU
is available in their current LLC domain, allow migration but attempt to
bring tasks back to their original LLC domain whenever possible.

To do so, define the task's scheduling domain upon task creation or when
its affinity changes, and ensure the task remains within this domain
throughout its lifetime.

In the future we will add a proper load balancing logic, but for now
this change seems to provide consistent performance improvement in
certain server workloads.

For example, simple CUDA benchmarks show a performance boost of about
+10-20% with this change applied (on multi-LLC / NUMA machines).

Signed-off-by: Andrea Righi <arighi@nvidia.com>
2024-11-05 16:37:51 +01:00
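A sketch of how the scheduling domain can be pinned, with illustrative
helper and field names (llc_cpumask_of and task_ctx.llc_mask are
assumptions): the domain is (re)computed only at task creation and
affinity changes, then consulted on every enqueue to steer the task home.

    struct task_ctx {
        struct bpf_cpumask __kptr *llc_mask;    /* task's home LLC */
    };

    static void task_set_domain(struct task_struct *p, struct task_ctx *tctx)
    {
        /* hypothetical helper mapping a CPU to its LLC's cpumask */
        const struct cpumask *llc = llc_cpumask_of(scx_bpf_task_cpu(p));

        /* remember the primary domain; enqueue/dispatch paths try to
         * bring the task back here when it had to migrate away */
        if (tctx->llc_mask && llc)
            bpf_cpumask_copy(tctx->llc_mask, llc);
    }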
Andrea Righi
064d6fb560 scx_bpfland: consider all tasks as regular if priority DSQ is congested
This helps prevent excessive starvation of regular tasks in the presence
of a high number of interactive tasks (e.g., when running stress tests
such as hackbench).

Signed-off-by: Andrea Righi <arighi@nvidia.com>
2024-11-05 16:37:51 +01:00
Andrea Righi
8a655d94f5 scx_bpfland: do not overly prioritize WAKE_SYNC tasks
This can lead to stalls when a high number of interactive tasks are
running in the system (e.g., hackbench or similar stress tests).

Signed-off-by: Andrea Righi <arighi@nvidia.com>
2024-11-05 16:37:51 +01:00
Andrea Righi
f0c8de3477 scx_bpfland: do not exclude exiting tasks
Add SCX_OPS_ENQ_EXITING to the scheduler flags, since we are not using
bpf_task_from_pid() and the scheduler can handle exiting tasks.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
2024-11-05 16:37:51 +01:00
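The flag is set in the ops definition; an abridged sketch (callback list
shortened):

    SCX_OPS_DEFINE(bpfland_ops,
                   .select_cpu  = (void *)bpfland_select_cpu,
                   .enqueue     = (void *)bpfland_enqueue,
                   .dispatch    = (void *)bpfland_dispatch,
                   .init        = (void *)bpfland_init,
                   .exit        = (void *)bpfland_exit,
                   /* keep receiving enqueues for exiting tasks */
                   .flags       = SCX_OPS_ENQ_EXITING,
                   .name        = "bpfland");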
Andrea Righi
eb99e45ced scx_bpfland: consistent vruntime update
Ensure that task vruntime is always updated in ops.running() to maintain
consistency with other schedulers.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
2024-11-05 16:37:51 +01:00
I Hsin Cheng
a5fd42a719 scx_rusty: Fix filtering logic error
Fix a task filtering logic error to avoid the possibility of migrating
the same task over again. The original logical operator was "||", which
could cause tasks that were already migrated to be considered again.
Change the condition to "&&" to eliminate the error.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-11-05 14:46:40 +08:00
I Hsin Cheng
a248fdc5e3 scx_rusty: Restore push domain tasks when no task found
The function "try_find_move_task()" returns directly when no task is
found to be moved. If the cause is that no task could satisfy the
condition checked by "task_filter()", the load balancer will try to find
a task to move again, replacing "task_filter()" with a function that
always returns true.

However, in that fallback case, the tasks within the domains will be
empty. Swapping the tasks back into the domains vector before returning
solves the issue.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-11-05 14:25:29 +08:00
Changwoo Min
e10a92f392
Merge pull request #887 from multics69/scx-minor-fixes
scx_lavd: entirely drop kernel lock tracing
2024-11-05 00:19:02 +00:00
Emil Tsalapatis
9c6ad33fda
Merge pull request #891 from etsal/global_costc
scx_layered: point costc to global struct when initializing budgets
2024-11-04 21:41:47 +00:00
Emil Tsalapatis
9cf137be99 scx_layered: remove ->open from layer struct 2024-11-04 11:42:23 -08:00
Emil Tsalapatis
2b0909f9d7 scx_layered: point costc to global struct when initializing budgets 2024-11-04 11:40:07 -08:00
Daniel Hodges
5dcdcfc50f scx_layered: Refactor cost naming
Refactor cost struct usage so that it is always referenced as `costc`
for clarity.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-04 07:07:13 -08:00
Changwoo Min
517ef89444 scx_lavd: drop all kernel lock tracing
The combination of kernel versions and kernel configs generates
different kernel symbols. For example, in an old kernel version,
__mutex_lock() is not generated. Also, there is currently no workaround
on the fentry/fexit/kprobe side. Let's entirely drop the kernel lock
tracing for now and revisit it later.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-11-04 21:36:32 +09:00
Changwoo Min
796d324555 scx_lavd: improve readability
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-11-04 21:18:43 +09:00
Changwoo Min
882212574a scx_lavd: fix a variable name to kill a warning
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-11-04 16:20:53 +09:00
Changwoo Min
e1b880f7c3 scx_lavd: fix CI error for missing kernel symbols
Revise the lock tracing code to rely on symbols that are stable across
various kernel configurations. There are two changes:

- Entirely drop tracing of rt_mutex, which can be on and off with kconfig.

- Replace the mutex_lock() family with __mutex_lock(), which is stable
  across kernel configs. The downside of this change is that it is no
  longer possible to trace the lock fast path, so lock tracing is a bit
  less accurate. But let's live with it for now until a better solution
  is found.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-11-04 16:05:53 +09:00
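A sketch of what attaching to the stable symbol looks like; the probe
body is elided and the program name is illustrative:

    SEC("fentry/__mutex_lock")
    int BPF_PROG(trace_mutex_lock, struct mutex *lock)
    {
        /* record the slow-path acquisition; fast-path acquisitions
         * never reach __mutex_lock(), so they are invisible here */
        return 0;
    }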
Andrea Righi
f6f5481081
Merge branch 'main' into bpfland-minor-cleanups 2024-11-02 15:28:37 +01:00
Daniel Hodges
7e0f2cd3f3 scx_layered: Fix trace format
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-01 10:16:14 -07:00
Daniel Hodges
cb5b1961b8
Merge branch 'main' into layered-fallback-stall 2024-11-01 10:24:00 -04:00
Daniel Hodges
0a518e9f9e scx_layered: Add additional drain to fallback DSQs
Fallback DSQs are not accounted for with costs. If a layer is saturating
the machine, it is possible to never consume from the fallback DSQ and
stall its tasks. This introduces an additional consumption from the
fallback DSQ when a layer runs out of budget. In addition, tasks that
use partial CPU affinities should be placed into the fallback DSQ. This
change was tested with stress-ng --cacheline `nproc` for several minutes
without causing stalls (which would occur on main).

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-11-01 06:04:55 -07:00
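A sketch of the extra drain under assumed names (has_budget,
refresh_budget, and layer_dsq_id are illustrative; LO_FALLBACK_DSQ stands
in for the fallback DSQ id):

    static bool try_consume_layer(u32 layer_idx)
    {
        if (!has_budget(layer_idx)) {
            /* out of budget: service the fallback DSQ first so a
             * saturating layer can't starve tasks parked there */
            if (scx_bpf_consume(LO_FALLBACK_DSQ))
                return true;
            refresh_budget(layer_idx);
        }
        return scx_bpf_consume(layer_dsq_id(layer_idx));
    }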