The dynamic nvcsw threshold is not used anymore in the scheduler and it
doesn't make sense to report it in the scheduler's statistics, so let's
just drop it.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Get rid of the static MAX_LATENCY_WEIGHT and always rely on the value
specified by --nvcsw-max-thresh.
This allows tuning the maximum latency weight when running in
lowlatency mode (via --nvcsw-max-thresh), and it also restores the
maximum nvcsw limit in non-lowlatency mode, which was incorrectly
changed during the lowlatency refactoring.
Fixes: 4d68133 ("scx_bpfland: rework lowlatency mode to adjust tasks priority")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Evaluate the number of voluntary context switches directly in the BPF
code, without relying on the kernel's p->nvcsw metric.
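A minimal plain-C sketch of the idea (struct and function names are
hypothetical, not the scheduler's actual code): a context switch is
counted as voluntary when the task releases the CPU with residual time
slice, i.e. it blocked or yielded rather than being preempted.

```c
/* Hypothetical per-task context tracked by the BPF scheduler. */
struct task_ctx {
	unsigned long nvcsw;	/* voluntary context switches observed */
};

/*
 * Sketch of the accounting done on the ops.stopping() path: if the
 * task stops with time slice left, it gave up the CPU voluntarily.
 */
static void task_account_stop(struct task_ctx *tctx, unsigned long slice_left)
{
	if (slice_left > 0)
		tctx->nvcsw++;
}
```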
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Add the layer CPU cost when dumping. This is useful for understanding
the per-layer cost accounting when layered is stalled.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add the layer name to the BPF representation of a layer. When printing
debug output, print the layer name as well as the layer index.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
The type of "taskc" within "lavd_dispatch()" was "struct task_struct *",
when it should have been "struct task_ctx *".
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Refactor dispatch to use a separate set of global helpers for
topology-aware dispatch. This only restructures dispatch to make it
more maintainable, without any functional changes.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Pinning a task to a single CPU is a widely used optimization to
improve latency through cache reuse. So when a task is pinned to
a single CPU, let's boost its latency criticality.
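The boost can be sketched as follows (plain C; the function name and the
boost factor are hypothetical, not the scheduler's actual values):

```c
/* Hypothetical multiplier applied to pinned tasks. */
#define LAT_CRIT_PIN_BOOST	2

/*
 * Boost the latency criticality of a task that is pinned to a single
 * CPU (nr_cpus_allowed == 1); leave other tasks unchanged.
 */
static unsigned long boost_lat_crit(unsigned long lat_crit,
				    unsigned int nr_cpus_allowed)
{
	if (nr_cpus_allowed == 1)
		return lat_crit * LAT_CRIT_PIN_BOOST;
	return lat_crit;
}
```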
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Calling reset_lock_futex_boost() at ops.enqueue() is not accurate,
so move it to ops.running(). This way, we prevent lock holder
preemption only when a lock is acquired between ops.running() and
ops.stopping().
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Even in the direct dispatch path, calculating the task's latency
criticality is still necessary since the latency criticality is
used for the preemptibility test. This addresses the following
GitHub issue:
https://github.com/sched-ext/scx/issues/856
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Add cost accounting for layers to make weights work on the BPF side.
This is done at both the CPU level and globally. When a CPU runs
out of budget, it acquires budget from the global context. If a
layer runs out of global budget, then all budgets are reset. Weight
handling is done by iterating over layers in order of their available
budget. Layer budgets are proportional to their weights.
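The accounting scheme can be sketched in plain C as follows (struct
names, the refill chunk size, and the total budget are hypothetical,
and a single CPU budget is shown for simplicity):

```c
#define REFILL		100	/* chunk moved from global to CPU budget */
#define GLOBAL_BUDGET	1000	/* total budget distributed by weight */

struct layer_acct {
	int weight;
	long global_budget;
	long cpu_budget;	/* one CPU shown for simplicity */
};

/* Redistribute the global budget across layers proportionally to weight. */
static void reset_budgets(struct layer_acct *layers, int n)
{
	int i, total_weight = 0;

	for (i = 0; i < n; i++)
		total_weight += layers[i].weight;
	for (i = 0; i < n; i++) {
		layers[i].global_budget =
			(long)GLOBAL_BUDGET * layers[i].weight / total_weight;
		layers[i].cpu_budget = 0;
	}
}

/*
 * Charge @cost to layer @idx on this CPU: refill the CPU budget from
 * the layer's global budget when it runs dry, and reset all budgets
 * when the layer's global budget is exhausted.
 */
static void charge(struct layer_acct *layers, int n, int idx, long cost)
{
	struct layer_acct *l = &layers[idx];

	if (l->cpu_budget < cost) {
		if (l->global_budget < REFILL)
			reset_budgets(layers, n);
		l->global_budget -= REFILL;
		l->cpu_budget += REFILL;
	}
	l->cpu_budget -= cost;
}
```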
When the current task decides to yield, we should explicitly call
scx_bpf_kick_cpu(_, SCX_KICK_PREEMPT). Setting the current task's time
slice to zero is not sufficient in this case, because the sched_ext
core does not call resched_curr() in the ops.enqueue() path.
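A plain-C harness sketching the fix (scx_bpf_kick_cpu() and
SCX_KICK_PREEMPT are real sched_ext names, but here the kfunc is
replaced by a recording stub and the flag value is illustrative):

```c
#define SCX_KICK_PREEMPT	1UL	/* illustrative value */

static int last_kick_cpu = -1;
static unsigned long last_kick_flags;

/* Stub standing in for the scx_bpf_kick_cpu() kfunc. */
static void scx_bpf_kick_cpu(int cpu, unsigned long flags)
{
	last_kick_cpu = cpu;
	last_kick_flags = flags;
}

struct task {
	unsigned long slice;
};

/*
 * Yield path: zeroing the slice alone is not enough, since the
 * sched_ext core does not call resched_curr() from ops.enqueue();
 * an explicit preemptive kick of the CPU is also required.
 */
static void try_yield(struct task *p, int cpu)
{
	p->slice = 0;
	scx_bpf_kick_cpu(cpu, SCX_KICK_PREEMPT);
}
```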
Signed-off-by: Changwoo Min <changwoo@igalia.com>
An eligible task is unlikely to be preemptible. In other words, an
ineligible task is more likely to be preemptible because of the greedy
ratio penalty in its virtual deadline calculation. Hence, we skip the
preemptibility test for an eligible task.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Set CONFIG_PROVE_LOCKING=y in the default virtme-ng config, to enable
lockdep and additional lock debugging features, which can help catch
lock-related kernel issues in advance.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Some of the new timer code doesn't verify on older kernels like 6.9.
Modify the code a little to get it verifying again.
This also applies some small fixes to the logic: error handling was a
little off before, and we were using the wrong key in lookups.
Test plan:
- CI
The previous code accessed uninitialized memory in comp_preemption_info()
when called from can_task1_kick_task2() <- try_yield_current_cpu()
to test whether task2 is a lock holder. However, task2 is guaranteed
not to be a lock holder in all its callers. So move the lock holder
test to can_cpu1_kick_cpu2().
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Add retries to the kernel clone step.
Odds are that when we most care about cloning a fresh upstream
(i.e. some fix was just pushed), things are going to be flakiest (i.e.
no caches anywhere), so retry 10 times, ~3 minutes total between tries,
when trying to clone the kernel source.
When a task is enqueued, kick an idle CPU in the chosen scheduling
domain. This will reduce temporary stall time of the task by waking
up the CPU as early as possible.
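In the real scheduler this would pair an idle-CPU lookup with
scx_bpf_kick_cpu(); the selection step can be modeled in plain C as a
scan of a hypothetical idle mask (the helper name and the mask encoding
are illustrative, not the scheduler's actual code):

```c
/*
 * Model of picking an idle CPU out of a scheduling domain: bit N of
 * @idle_mask is set when CPU N is idle. Returns the lowest-numbered
 * idle CPU, or -1 when the domain has no idle CPU to kick.
 */
static int pick_idle_cpu(unsigned long idle_mask)
{
	if (!idle_mask)
		return -1;
	return __builtin_ctzl(idle_mask);
}
```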
Signed-off-by: Changwoo Min <changwoo@igalia.com>
We used to apply a latency penalty linearly proportional to the greedy
ratio. However, this gives the greedy ratio too much influence in
determining the virtual deadline, especially among under-utilized
tasks (< 100.0%). Now, we treat all under-utilized tasks as having the
same greedy ratio (= 100.0%). For over-utilized tasks, we apply a
somewhat milder penalty to avoid sudden latency spikes.
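A plain-C sketch of the clamping (the fixed-point base and the "milder"
slope are hypothetical, not the scheduler's actual constants):

```c
#define GREEDY_BASE	1000	/* 100.0% in fixed point */

/*
 * All under-utilized tasks get the same neutral penalty; over-utilized
 * tasks get a sub-linear penalty (here, half of the excess over 100%)
 * instead of the old linear one.
 */
static unsigned long greedy_penalty(unsigned long greedy_ratio)
{
	if (greedy_ratio <= GREEDY_BASE)
		return GREEDY_BASE;
	return GREEDY_BASE + (greedy_ratio - GREEDY_BASE) / 2;
}
```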
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Previously, contextual information (such as sync wakeup and kernel
task) was incorporated into the final latency criticality value ad hoc
by adding a constant. Instead, let's make everything proportional to
run time and to waker and wakee frequencies, by scaling the run time
and the frequencies up or down.
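The difference can be sketched in plain C (function names, the 150%
factor, and the toy latency criticality formula are hypothetical):

```c
/* Scale @v by @pct percent, e.g. scale_pct(x, 150) boosts x by 50%. */
static unsigned long scale_pct(unsigned long v, unsigned int pct)
{
	return v * pct / 100;
}

/*
 * Toy latency criticality: instead of adding a constant for a sync
 * wakeup, scale the frequency input proportionally, so the boost
 * tracks the magnitude of the underlying metrics.
 */
static unsigned long lat_crit(unsigned long freq, unsigned long runtime_ft,
			      int sync_wakeup)
{
	if (sync_wakeup)
		freq = scale_pct(freq, 150);
	return freq * runtime_ft;
}
```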
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Enabling SCHED_MC in the kernel used for testing allows us to
potentially run more complex tests, simulating different CPU topologies
and accessing such topology data through the in-kernel scheduler's
information.
This can be useful as we add more topology awareness logic to the
sched_ext core (e.g., in the built-in idle CPU selection policy).
Therefore add this option to the default .config (and also fix a missing
newline at the end of the file).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>