The ability for kthreads to preempt other tasks was initially introduced
as a workaround for a kernel bug that caused kthreads to become stuck in
the DSQ without being consumed.
Instead of dispatching kthreads directly and allowing them to preempt
other tasks, consider them interactive and always dispatch them to the
priority DSQ.
Moreover, if local_kthreads is specified, per-CPU kthreads are always
dispatched directly to the local DSQ, ahead of any other task (without
preempting them).
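
A minimal sketch of this enqueue policy, assuming hypothetical
prio_dsq_id and slice_ns variables (the actual scheduler code may
differ):

    if (p->flags & PF_KTHREAD) {
        if (local_kthreads && p->nr_cpus_allowed == 1) {
            /*
             * Per-CPU kthread: queue it at the head of the local
             * DSQ so it runs before any other queued task, but
             * without SCX_ENQ_PREEMPT, so the current task is not
             * preempted.
             */
            scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns, SCX_ENQ_HEAD);
            return;
        }
        /* Treat all other kthreads as interactive tasks. */
        scx_bpf_dispatch(p, prio_dsq_id, slice_ns, 0);
        return;
    }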
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Get rid of L2 cache awareness, which usually doesn't give any benefit
and only introduces overhead, and only consider the LLC domain for
cache awareness.
Moreover, reduce the usage of cpumasks and apply the scheduling-domain
logic only to tasks that can run on all CPUs: if a task's scheduling
domain is restricted by user-space (through CPU affinity), the task
will simply use the flat scheduling domain defined by user-space.
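
A sketch of the resulting domain selection, with hypothetical helper
and variable names (nr_online_cpus, llc_domain_of):

    static const struct cpumask *task_domain(const struct task_struct *p)
    {
        /*
         * Affinity-restricted task: use the flat scheduling domain
         * defined by user-space, i.e. the task's own cpumask.
         */
        if (p->nr_cpus_allowed < nr_online_cpus)
            return p->cpus_ptr;

        /* Otherwise apply the LLC-based scheduling domain. */
        return llc_domain_of(p);
    }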
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Set CONFIG_PROVE_LOCKING=y in the default virtme-ng config, to enable
lockdep and additional lock debugging features, which can help catch
lock-related kernel issues in advance.
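
The added fragment in the virtme-ng config (CONFIG_PROVE_LOCKING also
selects the underlying lockdep machinery):

    CONFIG_PROVE_LOCKING=y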
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Some of the new timer code doesn't pass the BPF verifier on older
kernels like 6.9. Modify the code a little to get it verifying again.
Also apply some small fixes to the logic: the error handling was a
little off before, and we were using the wrong key in lookups.
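
The shape of the lookup fix, with a hypothetical map and constants
(not the actual patch):

    struct timer_wrapper {
        struct bpf_timer timer;
    };

    u32 key = 0;    /* must match the key the timer was initialized under */
    struct timer_wrapper *w;
    int err;

    w = bpf_map_lookup_elem(&timer_map, &key);
    if (!w)
        return -ENOENT;

    err = bpf_timer_start(&w->timer, INTERVAL_NS, 0);
    if (err)
        return err;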
Test plan:
- CI
The previous code accessed uninitialized memory in comp_preemption_info(),
when called from can_task1_kick_task2() <- try_yield_current_cpu(),
to test whether task2 is a lock holder or not. However, task2 is
guaranteed not to be a lock holder in all its callers. So move the
lock holder test to can_cpu1_kick_cpu2().
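
The shape of the fix, with simplified, hypothetical signatures:

    static bool can_cpu1_kick_cpu2(struct cpu_ctx *cpuc1,
                                   struct cpu_ctx *cpuc2,
                                   struct task_ctx *task2c)
    {
        /*
         * The lock holder test lives here now, where task2's
         * context is always valid, instead of in
         * comp_preemption_info().
         */
        if (is_lock_holder(task2c))
            return false;

        return comp_preemption_info(cpuc1, cpuc2);
    }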
Signed-off-by: Changwoo Min <changwoo@igalia.com>
add retries to kernel clone step
The times we most care about cloning a fresh upstream (i.e. some fix
was just pushed) are exactly when things are going to be flakiest
(i.e. no caches anywhere), so retry up to 10 times, ~3 minutes total
between tries, when trying to clone the kernel source.
When a task is enqueued, kick an idle CPU in the chosen scheduling
domain. This reduces the task's temporary stall time by waking up the
CPU as early as possible.
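
A minimal sketch of the kick using the sched_ext kfuncs, where
dom_mask (the chosen domain's cpumask) is hypothetical:

    s32 cpu = scx_bpf_pick_idle_cpu(dom_mask, 0);
    if (cpu >= 0)
        scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);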
Signed-off-by: Changwoo Min <changwoo@igalia.com>
We used to apply a latency penalty linear in the greedy ratio.
However, this lets the greedy ratio weigh too heavily in determining
the virtual deadline, especially among under-utilized tasks (< 100.0%).
Now we treat all under-utilized tasks with the same greedy ratio
(= 100.0%). For over-utilized tasks, we apply a somewhat milder
penalty to avoid sudden latency spikes.
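
A sketch of the new mapping, where 1000 represents 100.0% and the
dampening factor is purely illustrative:

    u64 ratio = greedy_ratio(p);    /* hypothetical helper */

    if (ratio <= 1000)
        /* All under-utilized tasks are treated alike. */
        ratio = 1000;
    else
        /* Milder, sub-linear penalty for over-utilized tasks. */
        ratio = 1000 + (ratio - 1000) / 2;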
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Previously, contextual information, such as sync wakeups and kernel
tasks, was incorporated into the final latency criticality value ad
hoc, by adding a constant. Instead, let's make everything proportional
to run time and to waker and wakee frequencies by scaling the run
time and the frequencies up or down.
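
A sketch of the proportional approach; the scaling factors and
directions here are illustrative only:

    /* Instead of: lat_cri += SOME_CONSTANT; */
    if (is_sync_wakeup)
        wake_freq *= 2;         /* scale up the wakeup frequency */
    if (p->flags & PF_KTHREAD)
        run_time /= 2;          /* kernel tasks appear shorter-running */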
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Enabling SCHED_MC in the kernel used for testing allows us to
potentially run more complex tests, simulating different CPU
topologies and accessing such topology data through the in-kernel
scheduler's information.
This can be useful as we add more topology-awareness logic to the
sched_ext core (e.g., in the built-in idle CPU selection policy).
Therefore, add this option to the default .config (and also fix a
missing newline at the end of the file).
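
The added fragment in the default .config:

    CONFIG_SCHED_MC=y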
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
This script iterates over a list of archs and generates vmlinux.h for
each. Generated files are put under the corresponding arch directory.
Signed-off-by: Ming Yang <minos.future@gmail.com>
Previously, preemption was allowed only when a task was early in its
time slice, enforced through LAVD_PREEMPT_KICK_MARGIN and
LAVD_PREEMPT_TICK_MARGIN. This is no longer necessary because the
lock holder protection already avoids harmful preemptions. So remove
LAVD_PREEMPT_KICK_MARGIN and LAVD_PREEMPT_TICK_MARGIN and unleash
the preemption.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When calculating a task's latency criticality, incorporate the task's
weight into runtime, wake_freq, and wait_freq more systematically.
It reads more cleanly and works better under heavy load.
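
A sketch of folding the weight in uniformly; the scaling directions
are illustrative, with 100 being the nice-0 weight in sched_ext:

    u64 w = p->scx.weight;              /* 100 == nice 0 */

    runtime   = runtime * 100 / w;      /* heavier tasks appear shorter */
    wake_freq = wake_freq * w / 100;
    wait_freq = wait_freq * w / 100;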
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a CPU is released to serve a higher-priority scheduler class,
requeue the tasks in its local DSQ back through the global enqueue
path.
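
sched_ext provides a kfunc for exactly this; a sketch of the callback
(the callback name here is illustrative):

    void BPF_STRUCT_OPS(sched_cpu_release, s32 cpu,
                        struct scx_cpu_release_args *args)
    {
        /*
         * Push the tasks queued on this CPU's local DSQ back
         * through ops.enqueue() so they can run elsewhere while
         * the CPU serves the higher-priority class.
         */
        scx_bpf_reenqueue_local();
    }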
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Currently we have an approximation of LayerKind in the BPF code with `open` on
the layer, but it is difficult/impossible to tell the difference between an
Open and a Grouped layer. Add a `kind` field to the BPF `layer` and plumb
through an enum from the Rust side.
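
A sketch of the BPF side of the plumbing (the field placement and the
Confined variant are assumptions; the commit itself only names Open
and Grouped):

    enum layer_kind {
        LAYER_KIND_CONFINED,
        LAYER_KIND_GROUPED,
        LAYER_KIND_OPEN,
    };

    struct layer {
        /* ... */
        int kind;       /* set from the Rust LayerKind at init time */
    };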
When a task holds a lock, refill its time slice once at the
ops.dispatch() path to avoid the lock holder preemption problem.
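
A sketch of the one-shot refill on the ops.dispatch() path (the helper
and flag names are hypothetical):

    if (is_lock_holder(taskc) && !taskc->slice_refilled) {
        p->scx.slice = slice_ns;        /* refill exactly once */
        taskc->slice_refilled = true;
    }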
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When there is an idle CPU, perform direct dispatch to reduce
scheduling latency. This didn't work well before, but it seems
to work well now with the other tunings.
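
A sketch of the common sched_ext direct-dispatch pattern from
ops.select_cpu() (simplified, not the exact code):

    s32 BPF_STRUCT_OPS(sched_select_cpu, struct task_struct *p,
                       s32 prev_cpu, u64 wake_flags)
    {
        bool is_idle = false;
        s32 cpu;

        cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
        if (is_idle)
            /* An idle CPU was found: dispatch directly to it. */
            scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns, 0);

        return cpu;
    }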
Signed-off-by: Changwoo Min <changwoo@igalia.com>