Fix task filtering logic error to avoid the possibility of migrate the
same task over again. The orginal logic operation was "||" which might
include tasks already migrated to be taken into consideration again.
Change the condition to "&&" so we can elimate the error.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Inside the function "try_find_move_task()", it returns directly when
there's no task found to be moved. If the cause is from lack of ability
to fulfilled the condition by "task_filter()", load balancer will try to
find move task again and remove "task_filter()" by setting it directly
to a function returns true.
However, in the fallback case, the tasks within the domains will be
empty. Swap the tasks back into domains vector before returning can
solve the issue.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
The combination of kernel versions and kerenl configs generates
different kernel symbols. For example, in an old kernel version,
__mutex_lock() is not generated. Also, there is no workaround
from the fentry/fexit/kprobe side currently. Let's entirely drop
the kernel locking for now and revisit it later.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Revised the lock tracking code, relying on stable symbols with various
kernel configurations. There are two changes:
- Entirely drop tracing rt_mutex, which can be on and off with kconfig
- Replace mutex_lock() families to __mutex_lock(), which is stable
across kernel configs. The downside of such change is it is now
possible to trace the lock fast path, so lock tracing is a bit less
accurate. But let's live with it for now until a better solution is found.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Fallback DSQs are not accounted with costs. If a layer is saturating the
machine it is possible to not consume from the fallback DSQ and stall
the task. This introduces and additional consumption from the fallback
DSQ when a layer runs out of budget. In addition, tasks that use partial
CPU affinities should be placed into the fallback DSQ. This change was
tested with stress-ng --cacheline `nproc` for several minutes without
causing stalls (which would stall on main).
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Flip the order of layer id vs layer name so that the output makes sense.
Example output:
LO_FALLBACK nr_queued=0 -0ms
COST GLOBAL[0][random] budget=22000000000 capacity=22000000000
COST GLOBAL[1][hodgesd] budget=0 capacity=0
COST GLOBAL[2][stress-ng] budget=0 capacity=0
COST GLOBAL[3][normal] budget=0 capacity=0
COST CPU[0][0][random] budget=62500000000000 capacity=62500000000000
COST CPU[0][1][random] budget=100000000000000 capacity=100000000000000
COST CPU[0][2][random] budget=124911500964411 capacity=125000000000000
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
The dynamic nvcsw threshold is not used anymore in the scheduler and it
doesn't make sense to report it in the scheduler's statistics, so let's
just drop it.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Get rid of the static MAX_LATENCY_WEIGHT and always rely on the value
specified by --nvcsw-max-thresh.
This allows to tune the maximum latency weight when running in
lowlatency mode (via --nvcsw-max-thresh) and it also restores the
maximum nvcsw limit in non-lowlatency mode, that was incorrectly changed
during the lowlatency refactoring.
Fixes: 4d68133 ("scx_bpfland: rework lowlatency mode to adjust tasks priority")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Evalute the amount of voluntary context switches directly in the BPF
code, without relying on the kernel p->nvcsw metric.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Add the layer CPU cost when dumping. This is useful for understanding
the per layer cost accounting when layered is stalled.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add the layer name to the bpf representation of a layer. When printing
debug output print the layer name as well as the layer index.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
The type of "taskc" within "lavd_dispatch()" was "struct task_struct *",
while it should be "struct task_ctx *".
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Refactor dispatch to use a separate set of global helpers for topo aware
dispatch. This change only refactors dispatch to make it more
maintainable, without any functional changes.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Pinning a task to a single CPU is a widely-used optimization to
improve latency by reusing cache. So when a task is pinned to
a single CPU, let's boost its latency criticality.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Resetting reset_lock_futex_boost() at ops.enqueue() is not accurate,
so move it to the running. This way, we can prevent the lock holder
preemption only when a lock is acquired during ops.runnging() and
ops.stopping().
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Even in the direct dispatch path, calculating the task's latency
criticality is still necessary since the latency criticality is
used for the preemptablity test. This addressed the following
GitHub issue:
https://github.com/sched-ext/scx/issues/856
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Add cost accounting for layers to make weights work on the BPF side.
This is done at both the CPU level as well as globally. When a CPU
runs out of budget it acquires budget from the global context. If a
layer runs out of global budgets then all budgets are reset. Weight
handling is done by iterating over layers by their available budget.
Layers budgets are proportional to their weights.
When the current task is decided to yield, we should explicitly call
scx_bpf_kick_cpu(_, SCX_KICK_PREEMPT). Setting the current task's time
slice to zero is not sufficient in this because the sched_ext core
does not call resched_curr() at the ops.enqueue() path.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
An eligible task is unlikely preemptible. In other words, an ineligible
task is more likely preemptible since its greedy ratio penalty in virtual
deadline calculation. Hence, we skip the predictability test
for an eligible task.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Some of the new timer code doesn't verify on older kernels like 6.9. Modify the
code a little to get it verifying again.
Also applies some small fixes to the logic. Error handling was a little off
before and we were using the wrong key in lookups.
Test plan:
- CI