Add SCX_OPS_ENQ_EXITING to the scheduler flags, since we are not using
bpf_task_from_pid() and the scheduler can handle exiting tasks.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Ensure that task vruntime is always updated in ops.running() to maintain
consistency with other schedulers.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Fix a logic error in task filtering that could cause the same task to be
migrated again. The original condition used "||", which could let tasks
that had already been migrated be considered again. Change the operator
to "&&" to eliminate the error.
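
A minimal sketch of the difference, written in plain C for illustration.
The struct, its fields, and both function names are assumptions and are
not taken from the load balancer source; only the "||" versus "&&"
contrast comes from this commit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task record; the real load balancer tracks more state. */
struct task_info {
	uint64_t dom_mask;   /* bit i set: task may run in domain i */
	bool     is_kworker;
	bool     migrated;   /* already moved during this balancing pass */
};

/* "||" form: any single true clause admits the task. */
static bool filter_or(const struct task_info *t, unsigned int dom,
		      bool skip_kworkers)
{
	return (t->dom_mask & (1ULL << dom)) != 0 ||
	       !(skip_kworkers && t->is_kworker) ||
	       !t->migrated;
}

/* "&&" form: every clause must hold, so migrated tasks are rejected. */
static bool filter_and(const struct task_info *t, unsigned int dom,
		       bool skip_kworkers)
{
	return (t->dom_mask & (1ULL << dom)) != 0 &&
	       !(skip_kworkers && t->is_kworker) &&
	       !t->migrated;
}
```

With a task that is allowed in the target domain but already migrated,
the "||" form still accepts it while the "&&" form does not.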
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
The function "try_find_move_task()" returns directly when no task is
found to be moved. If the cause is that no task satisfies the condition
in "task_filter()", the load balancer tries to find a task to move
again, replacing "task_filter()" with a function that always returns
true.
However, in the fallback case, the task lists within the domains will be
empty. Swapping the tasks back into the domains vector before returning
solves the issue.
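
The pattern can be sketched as follows. This is a simplified C model,
not the actual implementation: the struct domain layout, the integer
task ids, and the helper names are all assumptions, and the sketch only
shows the "restore before returning" step, not the migration itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical domain: just a list of task ids. */
struct domain {
	int    *tasks;
	size_t  nr_tasks;
};

typedef bool (*task_filter_fn)(int task);

/* Returns the first task accepted by filter, or -1 if none qualifies. */
static int find_move_task(struct domain *dom, task_filter_fn filter)
{
	/* Take the task list out of the domain while scanning it. */
	int *tasks = dom->tasks;
	size_t n = dom->nr_tasks;
	int found = -1;

	dom->tasks = NULL;
	dom->nr_tasks = 0;

	for (size_t i = 0; i < n; i++) {
		if (filter(tasks[i])) {
			found = tasks[i];
			break;
		}
	}

	/* Put the list back on every path, including "nothing found". */
	dom->tasks = tasks;
	dom->nr_tasks = n;
	return found;
}

/* Stand-ins for a strict filter and the always-true fallback. */
static bool reject_all(int task) { (void)task; return false; }
static bool accept_all(int task) { (void)task; return true; }
```

Because the list is restored even when the strict filter finds nothing,
a second call with the always-true fallback still sees the tasks.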
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
The combination of kernel versions and kernel configs generates
different kernel symbols. For example, in an old kernel version,
__mutex_lock() is not generated. Also, there is no workaround
from the fentry/fexit/kprobe side currently. Let's entirely drop
the kernel locking for now and revisit it later.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Revise the lock tracking code to rely on symbols that are stable across
kernel configurations. There are two changes:
- Entirely drop tracing of rt_mutex, which can be turned on and off with kconfig
- Replace the mutex_lock() family with __mutex_lock(), which is stable
across kernel configs. The downside of this change is that it is no
longer possible to trace the lock fast path, so lock tracing is a bit less
accurate. But let's live with it for now until a better solution is found.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Fallback DSQs are not accounted with costs. If a layer is saturating the
machine, the fallback DSQ may never be consumed, stalling its tasks.
This introduces an additional consumption from the fallback DSQ when a
layer runs out of budget. In addition, tasks that use partial CPU
affinities should be placed into the fallback DSQ. This change was
tested with stress-ng --cacheline `nproc` for several minutes without
causing stalls (the same test stalls on main).
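
As a rough illustration of the consumption order described above, here
is a toy model in C. It is not the scheduler's dispatch path: the state
struct, the enum, and consume_next() are hypothetical, and budget
refill is deliberately left out.

```c
#include <stdint.h>

enum source { SRC_NONE, SRC_LAYER, SRC_FALLBACK };

/* Toy model: one budgeted layer DSQ plus an unbudgeted fallback DSQ. */
struct toy_state {
	int64_t  layer_budget;    /* remaining budget, in ns */
	uint32_t layer_queued;    /* tasks waiting in the layer DSQ */
	uint32_t fallback_queued; /* tasks waiting in the fallback DSQ */
};

static enum source consume_next(struct toy_state *s, int64_t slice_ns)
{
	/* Serve the layer while it has both work and budget. */
	if (s->layer_queued > 0 && s->layer_budget >= slice_ns) {
		s->layer_budget -= slice_ns;
		s->layer_queued--;
		return SRC_LAYER;
	}

	/* Layer is out of budget or empty: consume from the fallback DSQ. */
	if (s->fallback_queued > 0) {
		s->fallback_queued--;
		return SRC_FALLBACK;
	}

	return SRC_NONE;
}
```

In this model a saturating layer cannot starve the fallback DSQ forever,
because the fallback is served as soon as the layer's budget is gone.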
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Flip the order of layer id vs layer name so that the output makes sense.
Example output:
LO_FALLBACK nr_queued=0 -0ms
COST GLOBAL[0][random] budget=22000000000 capacity=22000000000
COST GLOBAL[1][hodgesd] budget=0 capacity=0
COST GLOBAL[2][stress-ng] budget=0 capacity=0
COST GLOBAL[3][normal] budget=0 capacity=0
COST CPU[0][0][random] budget=62500000000000 capacity=62500000000000
COST CPU[0][1][random] budget=100000000000000 capacity=100000000000000
COST CPU[0][2][random] budget=124911500964411 capacity=125000000000000
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
The dynamic nvcsw threshold is not used anymore in the scheduler and it
doesn't make sense to report it in the scheduler's statistics, so let's
just drop it.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Get rid of the static MAX_LATENCY_WEIGHT and always rely on the value
specified by --nvcsw-max-thresh.
This allows tuning the maximum latency weight when running in
lowlatency mode (via --nvcsw-max-thresh) and also restores the
maximum nvcsw limit in non-lowlatency mode, which was incorrectly changed
during the lowlatency refactoring.
Fixes: 4d68133 ("scx_bpfland: rework lowlatency mode to adjust tasks priority")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Evaluate the amount of voluntary context switches directly in the BPF
code, without relying on the kernel p->nvcsw metric.
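
One way such accounting could look is sketched below in plain C rather
than BPF. Everything here is an assumption made for illustration: the
task_ctx fields, the one-second window, and the idea of counting a
voluntary switch whenever a task stops running while no longer runnable.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical per-task context; field names are illustrative. */
struct task_ctx {
	uint64_t nvcsw;      /* voluntary switches in the current window */
	uint64_t nvcsw_ts;   /* start of the current window, in ns */
	uint64_t avg_nvcsw;  /* last computed switches per second */
};

/*
 * Model of a "task stops running" callback. If the task is no longer
 * runnable, it released the CPU on its own: count a voluntary switch.
 */
static void task_stopping(struct task_ctx *tctx, bool runnable, uint64_t now)
{
	uint64_t delta;

	if (!runnable)
		tctx->nvcsw++;

	delta = now - tctx->nvcsw_ts;
	if (delta < NSEC_PER_SEC)
		return;

	/* Window elapsed: convert the count to a per-second rate, restart. */
	tctx->avg_nvcsw = tctx->nvcsw * NSEC_PER_SEC / delta;
	tctx->nvcsw = 0;
	tctx->nvcsw_ts = now;
}
```

The counter is owned by the scheduler, so it does not depend on how or
when the kernel updates its own per-task statistics.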
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Add the layer CPU cost when dumping. This is useful for understanding
the per-layer cost accounting when layered is stalled.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add the layer name to the BPF representation of a layer. When printing
debug output, print the layer name as well as the layer index.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
The type of "taskc" within "lavd_dispatch()" was "struct task_struct *",
while it should be "struct task_ctx *".
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Refactor dispatch to use a separate set of global helpers for topo aware
dispatch. This change only refactors dispatch to make it more
maintainable, without any functional changes.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Pinning a task to a single CPU is a widely-used optimization to
improve latency by reusing cache. So when a task is pinned to
a single CPU, let's boost its latency criticality.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Calling reset_lock_futex_boost() at ops.enqueue() is not accurate,
so move it to ops.running(). This way, we can prevent lock holder
preemption only when a lock is acquired between ops.running() and
ops.stopping().
Signed-off-by: Changwoo Min <changwoo@igalia.com>