Resetting reset_lock_futex_boost() at ops.enqueue() is not accurate,
so move it to the running. This way, we can prevent the lock holder
preemption only when a lock is acquired during ops.runnging() and
ops.stopping().
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Even in the direct dispatch path, calculating the task's latency
criticality is still necessary since the latency criticality is
used for the preemptablity test. This addressed the following
GitHub issue:
https://github.com/sched-ext/scx/issues/856
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When the current task is decided to yield, we should explicitly call
scx_bpf_kick_cpu(_, SCX_KICK_PREEMPT). Setting the current task's time
slice to zero is not sufficient in this because the sched_ext core
does not call resched_curr() at the ops.enqueue() path.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
An eligible task is unlikely preemptible. In other words, an ineligible
task is more likely preemptible since its greedy ratio penalty in virtual
deadline calculation. Hence, we skip the predictability test
for an eligible task.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The previous code accesses uninitialized memory in comp_preemption_info()
when called from can_task1_kick_task2() <-try_yield_current_cpu()
to test if a task 2 is a lock holder or not. However, task2 is guaranteed
not a lock holder in all its callers. So move the lock holder testing to
can_cpu1_kick_cpu2().
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a task is enqueued, kick an idle CPU in the chosen scheduling
domain. This will reduce temporary stall time of the task by waking
up the CPU as early as possible.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
We used to give a penalty in latency linearly to the greedy ratio.
However, this impacts the greedy ratio too much in determining the
virtual deadline, especially among under-utilized tasks (< 100.0%).
Now, we treat all under-utilized tasks with the same greedy ratio
(= 100.0%). For over-utilized tasks, we give a bit milder penalty
to avoid sudden latency spikes.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Previously, contextual information—such as sync wakeup and kernel
task—was incorporated into the final latency criticality value ad hoc
by adding a constant. Instead, let's make everything proportional to
run time and waker and wakee frequencies by scaling up/down the run
time and the frequencies.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Previously, the preemption is allowed only when a task is at the
early in its time slice by using LAVD_PREEMPT_KICK_MARGIN and
LAVD_PREEMPT_TICK_MARGIN. This is not necessary any more because
the lock holder preemption can avoid harmful preemptions. So we
remove LAVD_PREEMPT_KICK_MARGIN and LAVD_PREEMPT_TICK_MARGIN and
unleash the preemption.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When calculating task's latency criticality, incorporate task's
weight into runtime, wake_freq, and wait_freq more systematically.
It looks nicer and works better under heavy load.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a CPU is released to serve higher priority scheduler class,
requeue the tasks in a local DSQ to the global enqueue.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a task holds a lock, refill its time slice once at the
ops.dispatch() path to avoid the lock holder preemption problem.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When there is an idle CPU, direct dispatch is performed to reduce
scheduling latency. This didn't work well before, but it seems
to work well now with other tunings.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Giving more penalties to a long-running tasks helps to segregate
latency-critical tasks, which are usually short-running, to
long-running tasks, which are compute-intensive.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The algorithm has been evolved to decide the time slice without
tracking the system-wide load. So remove the obsolete load tracking
code.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
reset_lock_futex_boost() should be called every context switch of a
task. Otherwise, in the worst case, a task and that CPU could block
the preemption. To avoid such a situation, add missing
reset_lock_futex_boost() calls.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
u32 is not big enough to hold the sum of lat_cri in a period,
so sum_lat_cri (u32) was overflown, resulting in incorrect
avg_lat_cri. Change the type from u32 to u64, avoiding the
interger overflow. Note that {sum/avg}_lat_cri is only for
deubugging so it is irrelevant in making scheduling decisions.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The downscaling is not necessary in calculating task's virtual
deadline because virtual dealine represents only relative order
in task scheduling. Hence downscaling incurs only inacuracy
caused by truncation.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a task holds a lock, it should not yield its time slice or it
should not be preempted out. In this way, we can mitigate harmful
preemption of lock holders and reduce the total preemption counts.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a lock holder exhausts its time slide, it will be re-enqueued
to a DSQ waiting for shceduling while holding a lock. In this case,
prioritize its latency criticality proportionally, so a lock holder
would be not stuck in a DSQ for a long time, improving system-wide
progress.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Trace the acquisition and release of blocking locks for kernel and
fuxtexes for user-space. This is necessary to boost a lock holder
task in terms of latency and time slice. We do not boost shared
lock holders (e.g., read lock in rw_semaphore) since the kernel
already prioritizes the readers over writers.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
As the main.bpf.c file grows, it gets hard to maintain.
So, split it into multiple logical files. There is no
functional change.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Use scx_utils::NR_CPU_IDS to iterate whole CPUs and separately count the
number of online CPUs to support CPU hotplug correctly.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
`#stat_doc` extends the document from stat desc property.
Add this attribute macro to the remaining Stats structs.
Signed-off-by: Ming Yang <minos.future@gmail.com>
When finding a victim candidate for preemption, a randomly chosen
candidate could be out of valid CPU range due to CPU offline, etc. In
this case, try another CPU randomly.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
We used the average performance criticality of tasks as a threshold to
determine the proper core type (big or little). However, if the big
core's compute capacity is not half of the total compute capacity, such
an average-based determination becomes suboptimal. If fewer tasks are
classified as performance-critical tasks and requested to run on big
cores, the big cores would be wasted by stealing arbitrary
non-performance-critical tasks. That could result in performance
instability.
Hence, determine the threshold more accurately by considering (active)
big cores' compute capacity and the (approximated) distribution of
performance criticality of tasks.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
As a preparation to improve the performance criticality logic, we first
rename "avg_perf_cri" to "thr_perf_cri" since average is no longer the
threshold.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Remove cast_mask() function distributed throughout different schedulers
and add it in common.bpf.h so every scheduler can reference it once they
need to.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
If a waker is more latency critical than a wakee, inherit a waker's
latency criticality for the wakee. This allows the wakee to consider the
context of who wakes me up. For now, we limit such inheritance to one
hop and one schedule.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Previously, we found a victim from the entire CPUs, which include remote
or non-compatible CPUs. Now we limit our search for victim finding
within a task's compute domain.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Many kernel threads performs latency critical tasks (e.g., net, gpu). In
particular, AMD GPU driver runs the most part in the kernel space using
kworker. Hence, treat kernel threads as if a woken up task.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Use `cargo fmt` with a specific nightly branch in the CI to enforce formatting. Globally format these files while the diff is still small so we can stay on top of it.
Test plan:
- CI lint check passes.