Add a timer-based antistall to scx_layered, along with new flags to
enable/disable it and to specify the delay in seconds before it kicks in.
Also update the CI config to make sure this verifies and runs.
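A minimal sketch of how such a timer can be wired up; the names used here
(enable_antistall, antistall_sec, antistall_timer_cb and friends) are
illustrative, not the actual scx_layered identifiers:

  #include <scx/common.bpf.h>

  #define CLOCK_BOOTTIME  7
  #define NSEC_PER_SEC    1000000000ULL

  /* Tunables filled in by the userspace loader from the new flags. */
  const volatile bool enable_antistall = true;
  const volatile u64 antistall_sec = 3;

  struct antistall_timer {
          struct bpf_timer timer;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(max_entries, 1);
          __type(key, u32);
          __type(value, struct antistall_timer);
  } antistall_timer_map SEC(".maps");

  static int antistall_timer_cb(void *map, u32 *key, struct antistall_timer *t)
  {
          /* Scan for DSQs that haven't been consumed for too long and kick
           * CPUs / force a dispatch as needed (omitted here), then re-arm. */
          bpf_timer_start(&t->timer, antistall_sec * NSEC_PER_SEC, 0);
          return 0;
  }

  static int antistall_init(void)
  {
          struct antistall_timer *t;
          u32 key = 0;

          if (!enable_antistall)
                  return 0;

          t = bpf_map_lookup_elem(&antistall_timer_map, &key);
          if (!t)
                  return -ENOENT;

          bpf_timer_init(&t->timer, &antistall_timer_map, CLOCK_BOOTTIME);
          bpf_timer_set_callback(&t->timer, antistall_timer_cb);
          return bpf_timer_start(&t->timer, antistall_sec * NSEC_PER_SEC, 0);
  }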
The verifier in older kernels chokes on function calls from sleepable progs,
triggering a nonsensical RCU state error:
frame1: R1_w=scalar(id=674,smin=smin32=0,smax=umax=smax32=umax32=51,var_off=(0x0; 0x3f)) R10=; return *llc_ptr;
1072: (61) r0 = *(u32 *)(r2 +0) ; frame1: R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R2_w=map_value(map=bpf_bpf.rodata,ks=4,vs=9570,off=4400,smin=smin32=0,smax=umax=smax32=umax32=204,var_off=(0x0; 0xfc)) refs=13,647
; }
1073: (95) exit
bpf_rcu_read_unlock is missing
processed 10663 insns (limit 1000000) max_states_per_insn 8 total_states 615 peak_states 281 mark_read 20
-- END PROG LOAD LOG --
Work around this by adding an __always_inline variant of cpu_to_llc_id() and
using it from layered_init(). Note that we can't switch everything to
__always_inline, as that can lead to verification failures due to the
instruction limit.
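Roughly, the workaround looks like the following sketch; the array name
cpu_llc_ids and the MAX_CPUS bound are illustrative stand-ins for the real
scx_layered data:

  #include <scx/common.bpf.h>

  #define MAX_CPUS 512                            /* illustrative bound */

  const volatile u32 cpu_llc_ids[MAX_CPUS];       /* filled in by userspace */

  /* Forced-inline twin of cpu_to_llc_id() for use from the sleepable
   * layered_init() path, so older verifiers never see a BPF-to-BPF call
   * there. */
  static __always_inline u32 __cpu_to_llc_id(s32 cpu)
  {
          const volatile u32 *llc_ptr;

          llc_ptr = MEMBER_VPTR(cpu_llc_ids, [cpu]);
          return llc_ptr ? *llc_ptr : 0;
  }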
When selecting an idle CPU, honor the layer's idle_smt option. This may
improve cache locality in some cases by placing tasks on CPUs that are
closer in the cache hierarchy.
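One way this can look, as a rough sketch using the stock sched_ext idle
selection (the helper name and the per-layer idle_smt parameter are
illustrative, not the actual scx_layered code):

  #include <scx/common.bpf.h>

  static s32 pick_idle_cpu_for_layer(struct task_struct *p, bool idle_smt)
  {
          /* With idle_smt, only fully idle cores are considered; otherwise
           * any idle CPU in the task's allowed set will do. */
          return scx_bpf_pick_idle_cpu(p->cpus_ptr,
                                       idle_smt ? SCX_PICK_IDLE_CORE : 0);
  }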
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
On some older kernels, layered fails to verify. Prevent certain helpers from
being inlined so that it passes the verifier.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
The loops in topology-aware mode were recently refactored to place the per-LLC
loops inside the per-layer loops. However, the layer-specific checks were left
in the inner loops, slowing things down unnecessarily.
Pull the layer-specific checks from the inner loop into the outer loop.
Also change these functions to `__weak` to ensure they don't get inlined -
they're expected to be verified as global functions.
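The shape of the change, as a rough sketch (nr_llcs, layer_runnable() and
try_consume_llc() are placeholders, not the real scx_layered helpers):

  #include <scx/common.bpf.h>

  const volatile u32 nr_llcs = 1;

  static bool layer_runnable(u32 layer_id) { return true; }                /* placeholder */
  static int try_consume_llc(u32 layer_id, u32 llc_id) { return -ENOENT; } /* placeholder */

  /* __weak keeps this a global (non-inlined) function for the verifier. */
  __weak int consume_layer(u32 layer_id)
  {
          int llc;

          /* Layer-wide check hoisted out of the per-LLC inner loop. */
          if (!layer_runnable(layer_id))
                  return -ENOENT;

          bpf_for(llc, 0, nr_llcs) {
                  if (!try_consume_llc(layer_id, llc))
                          return 0;
          }

          return -ENOENT;
  }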
Note to reviewers: this looks good to me, but I'd appreciate it if you reviewed
the De Morgan applications in detail.
Test plan:
- `cargo build --release && sudo target/release/scx_layered --run-example` on a
machine with multiple LLCs. It's possible to stall it quite easily with
stress-ng, but I believe this is also the case on main.
Add cost accounting for fallback DSQs so that their costs are tracked and
dispatching from fallback DSQs can be done in a standardized way.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
The verifier error seems to stem from the wrong vmlinux.h.
Also, PR #889 seems to completely fix the problem.
So, drop the workaround.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Previously, cur_logical_clk was updated with WRITE_ONCE(), which does not
guarantee atomicity when concurrent writes happen -- which is possible. So
change it to use CAS (compare-and-swap).
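Conceptually, the update now looks like this sketch (variable and helper names
are illustrative): the clock only moves forward, and concurrent updaters retry
instead of silently overwriting each other.

  #include <scx/common.bpf.h>

  static u64 cur_logical_clk;

  static void advance_cur_logical_clk(u64 new_clk)
  {
          u64 old_clk;
          int i;

          /* Bounded retry loop to keep the verifier happy. */
          bpf_for(i, 0, 8) {
                  old_clk = READ_ONCE(cur_logical_clk);
                  if (old_clk >= new_clk)
                          return;
                  if (__sync_val_compare_and_swap(&cur_logical_clk, old_clk,
                                                  new_clk) == old_clk)
                          return;
          }
  }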
Signed-off-by: Changwoo Min <changwoo@igalia.com>
scx_mitosis relied on the implicit assumption that after a sched tick,
all outstanding scheduling events had completed, but this might not
actually be correct. This feels like a natural use-case for RCU, but
there is no way to directly make use of RCU in BPF. Instead, this commit
implements an RCU-like synchronization mechanism.
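The following is a very rough sketch of the idea, not the actual scx_mitosis
code (all names are illustrative): a writer bumps a global epoch, every CPU
copies it from a hot path it is guaranteed to pass through, and the writer
only reclaims the old state once all CPUs have observed the new epoch.

  #include <scx/common.bpf.h>

  const volatile u32 nr_cpu_ids = 1;      /* set by the loader */

  static u64 global_epoch = 1;            /* bumped by the reconfiguration path */

  struct {
          __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
          __uint(max_entries, 1);
          __type(key, u32);
          __type(value, u64);
  } cpu_epoch SEC(".maps");

  /* Called from a per-CPU path every CPU is guaranteed to pass through,
   * e.g. ops.running() or ops.tick(). */
  static void note_epoch(void)
  {
          u32 key = 0;
          u64 *e = bpf_map_lookup_elem(&cpu_epoch, &key);

          if (e)
                  *e = READ_ONCE(global_epoch);
  }

  /* True once every CPU has observed the current epoch, i.e. all scheduling
   * events started before the epoch bump have drained. */
  static bool grace_period_elapsed(void)
  {
          u64 cur = READ_ONCE(global_epoch);
          u32 key = 0;
          int cpu;

          bpf_for(cpu, 0, nr_cpu_ids) {
                  u64 *e = bpf_map_lookup_percpu_elem(&cpu_epoch, &key, cpu);

                  if (e && *e < cur)
                          return false;
          }

          return true;
  }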
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Schedule all tasks using a single global DSQ. This gives better control
to prevent potential starvation conditions.
With this change, scx_bpfland adopts a logic similar to scx_rusty and
scx_lavd, prioritizing tasks based on the frequency of their wait and
wake-up events, rather than relying exclusively on the average amount of
voluntary context switches.
Tasks are still classified as interactive / non-interactive based on the
number of voluntary context switches, but this now only affects the
cpufreq logic.
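As a purely illustrative sketch of the kind of logic involved (not the actual
scx_bpfland formula), the wait and wake-up frequencies can be combined into a
boost factor for interactive tasks:

  #include <scx/common.bpf.h>

  static u64 clamp_freq(u64 freq, u64 max)
  {
          return freq > max ? max : freq;
  }

  /* Tasks that block often and wake others often behave interactively, so
   * give them a larger boost (which elsewhere translates into an earlier
   * deadline), with both terms capped to bound the effect. */
  static u64 task_latency_boost(u64 wait_freq, u64 wakeup_freq)
  {
          u64 boost = clamp_freq(wait_freq, 128) * clamp_freq(wakeup_freq, 128);

          return boost ?: 1;
  }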
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Since tasks' average runtimes show a skewed distribution, directly
using the runtime in the deadline calculation causes several
performance regressions. Instead, let's use a constant factor and
further prioritize the frequency factors to deprioritize tasks with
long runtimes.
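A minimal sketch of the shape of this change (the constant and names are made
up, not the actual scx_lavd math): the term that used to scale with average
runtime becomes a fixed base, while the frequency factors keep pulling the
deadline of interactive tasks earlier.

  #include <scx/common.bpf.h>

  #define DL_BASE_NS      (4ULL * 1000 * 1000)    /* constant factor */

  static u64 calc_deadline(u64 now, u64 wait_freq_ft, u64 wake_freq_ft)
  {
          /* Larger frequency factors -> earlier (smaller) deadline; the
           * task's own runtime no longer appears in the term. */
          return now + DL_BASE_NS / (1 + wait_freq_ft + wake_freq_ft);
  }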
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Revert the change that sent a self-IPI at preemption when the victim
CPU is the current CPU. The cost of a self-IPI is prohibitively expensive
in some workloads (e.g., perf bench). Instead, reset the task's time slice
to zero.
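A minimal sketch of the resulting behavior (the function name and the `victim`
argument are illustrative; scx_bpf_kick_cpu() and p->scx.slice are stock
sched_ext facilities):

  #include <scx/common.bpf.h>

  static void preempt_cpu(s32 cpu, struct task_struct *victim)
  {
          if (cpu == (s32)bpf_get_smp_processor_id())
                  /* Local CPU: no IPI needed, just make the victim yield at
                   * the next scheduling point by zeroing its slice. */
                  victim->scx.slice = 0;
          else
                  scx_bpf_kick_cpu(cpu, SCX_KICK_PREEMPT);
  }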
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Rather than always migrating tasks across LLC domains when no idle CPU
is available in their current LLC domain, allow migration but attempt to
bring tasks back to their original LLC domain whenever possible.
To do so, define the task's scheduling domain upon task creation or when
its affinity changes, and ensure the task remains within this domain
throughout its lifetime.
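A sketch of the mechanism with illustrative names (task_ctx, llc_domain_of()
and the callbacks shown are placeholders, not the real scx_bpfland code):

  #include <scx/common.bpf.h>

  struct task_ctx {
          s32 llc_id;                             /* task's scheduling domain */
  };

  struct {
          __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
          __uint(map_flags, BPF_F_NO_PREALLOC);
          __type(key, int);
          __type(value, struct task_ctx);
  } task_ctx_stor SEC(".maps");

  static s32 llc_domain_of(struct task_struct *p) { return 0; }   /* placeholder */

  static void task_update_domain(struct task_struct *p)
  {
          struct task_ctx *tctx;

          tctx = bpf_task_storage_get(&task_ctx_stor, p, 0,
                                      BPF_LOCAL_STORAGE_GET_F_CREATE);
          if (tctx)
                  tctx->llc_id = llc_domain_of(p);
  }

  /* Pin the domain at task creation ... */
  s32 BPF_STRUCT_OPS(sched_init_task, struct task_struct *p,
                     struct scx_init_task_args *args)
  {
          task_update_domain(p);
          return 0;
  }

  /* ... and refresh it whenever the task's affinity changes. */
  void BPF_STRUCT_OPS(sched_set_cpumask, struct task_struct *p,
                      const struct cpumask *cpumask)
  {
          task_update_domain(p);
  }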
In the future we will add proper load balancing logic, but for now
this change seems to provide consistent performance improvement in
certain server workloads.
For example, simple CUDA benchmarks show a performance boost of about
+10-20% with this change applied (on multi-LLC / NUMA machines).
Signed-off-by: Andrea Righi <arighi@nvidia.com>
This helps prevent excessive starvation of regular tasks in the presence
of a high number of interactive tasks (e.g., when running stress tests
such as hackbench).
Signed-off-by: Andrea Righi <arighi@nvidia.com>
This can lead to stalls when a high number of interactive tasks are
running in the system (e.g., hackbench or similar stress tests).
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Add SCX_OPS_ENQ_EXITING to the scheduler flags, since we are not using
bpf_task_from_pid() and the scheduler can handle exiting tasks.
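For reference, a minimal self-contained sketch of where the flag goes
(callback and ops names are illustrative, not the real scx_bpfland ones):

  #include <scx/common.bpf.h>

  /* Illustrative callback: real schedulers define the full set. */
  void BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags)
  {
          scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
  }

  SCX_OPS_DEFINE(sched_ops,
                 .enqueue = (void *)sched_enqueue,
                 /* SCX_OPS_ENQ_EXITING: keep calling ops.enqueue() for tasks
                  * that are exiting instead of bypassing the BPF scheduler. */
                 .flags   = SCX_OPS_ENQ_EXITING,
                 .name    = "example");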
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Ensure that task vruntime is always updated in ops.running() to maintain
consistency with other schedulers.
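A minimal sketch of the usual pattern (vtime_now and the callback name are
illustrative; p->scx.dsq_vtime is the stock sched_ext field):

  #include <scx/common.bpf.h>

  static u64 vtime_now;

  static inline bool vtime_before(u64 a, u64 b)
  {
          return (s64)(a - b) < 0;
  }

  void BPF_STRUCT_OPS(sched_running, struct task_struct *p)
  {
          /* Keep the global vtime high-water mark in sync with the task
           * that is about to run. */
          if (vtime_before(vtime_now, p->scx.dsq_vtime))
                  vtime_now = p->scx.dsq_vtime;
  }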
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Fix a task filtering logic error to avoid the possibility of migrating the
same task over again. The original logic used "||", which could cause
tasks that were already migrated to be taken into consideration again.
Change the condition to "&&" so we can eliminate the error.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Inside "try_find_move_task()", we return directly when no task is found
to be moved. If the cause is that no task could satisfy the condition
checked by "task_filter()", the load balancer tries to find a task to
move again, dropping "task_filter()" by replacing it with a function
that always returns true.
However, in that fallback case the tasks within the domains would be
empty. Swapping the tasks back into the domains vector before returning
solves the issue.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Different combinations of kernel versions and kernel configs generate
different kernel symbols. For example, in an old kernel version,
__mutex_lock() is not generated. Also, there is currently no workaround
on the fentry/fexit/kprobe side. Let's entirely drop the kernel lock
tracking for now and revisit it later.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Revise the lock tracking code to rely on symbols that are stable across
various kernel configurations. There are two changes:
- Entirely drop tracing of rt_mutex, which can be enabled or disabled by kconfig.
- Replace the mutex_lock() family with __mutex_lock(), which is stable
  across kernel configs. The downside of this change is that the lock
  fast path can no longer be traced, so lock tracing is a bit less
  accurate. But let's live with it for now until a better solution is found.
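A rough sketch of the tracing side under this approach (the map and the depth
bookkeeping are illustrative; the real scx_lavd code handles failed
acquisitions and more lock types):

  #include <scx/common.bpf.h>

  struct {
          __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
          __uint(map_flags, BPF_F_NO_PREALLOC);
          __type(key, int);
          __type(value, int);                     /* locks held by the task */
  } lock_depth SEC(".maps");

  static void adjust_lock_depth(int delta)
  {
          struct task_struct *p = bpf_get_current_task_btf();
          int *depth;

          depth = bpf_task_storage_get(&lock_depth, p, 0,
                                       BPF_LOCAL_STORAGE_GET_F_CREATE);
          if (depth)
                  *depth += delta;
  }

  /* __mutex_lock() is the slow path shared by the mutex_lock() family and is
   * present regardless of kconfig, unlike the rt_mutex variants. */
  SEC("fentry/__mutex_lock")
  int BPF_PROG(on_mutex_lock, struct mutex *lock)
  {
          adjust_lock_depth(1);
          return 0;
  }

  SEC("fentry/mutex_unlock")
  int BPF_PROG(on_mutex_unlock, struct mutex *lock)
  {
          adjust_lock_depth(-1);
          return 0;
  }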
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Fallback DSQs are not accounted for in the cost model. If a layer is
saturating the machine, it is possible to never consume from the fallback
DSQ and stall its tasks. This introduces an additional consumption from
the fallback DSQ when a layer runs out of budget. In addition, tasks that
use partial CPU affinities should be placed into the fallback DSQ. This
change was tested with stress-ng --cacheline `nproc` for several minutes
without causing stalls (which would occur on main).
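Roughly, the dispatch path now falls through to the fallback DSQ when a
layer's budget is exhausted; a sketch with placeholder helpers
(layer_has_budget(), layer_dsq_id() and LO_FALLBACK_DSQ are illustrative):

  #include <scx/common.bpf.h>

  #define LO_FALLBACK_DSQ 0                                       /* illustrative DSQ id */

  static bool layer_has_budget(u32 layer_id) { return true; }     /* placeholder */
  static u64 layer_dsq_id(u32 layer_id) { return layer_id + 1; }  /* placeholder */

  static bool try_consume_layer(u32 layer_id)
  {
          /* Out of budget: charge and consume the fallback DSQ instead, so
           * tasks queued there (e.g. partially-affinitized ones) can't stall. */
          if (!layer_has_budget(layer_id))
                  return scx_bpf_consume(LO_FALLBACK_DSQ);

          if (scx_bpf_consume(layer_dsq_id(layer_id)))
                  return true;

          return false;
  }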
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>