Tasks are consumed from various DSQs in the following order:
  per-CPU DSQs => priority DSQ => shared DSQ
Tasks in the shared DSQ may be starved by those in the priority DSQ,
which in turn may be starved by tasks dispatched to any per-CPU DSQ.
To mitigate this, record the timestamp of the last task consumed from
both the priority DSQ and the shared DSQ.
If the starvation threshold is exceeded without consuming a task from
one of these DSQs, the scheduler is forced to consume a task from that
DSQ.
The starvation threshold can be adjusted using the --starvation-thresh
command line parameter (default is 5ms).
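The consumption order with the starvation check can be sketched in plain
C as follows. This is only an illustrative model, not the BPF code: the
names (dsq_state, pick_dsq) are hypothetical, and checking the shared DSQ
for starvation before the priority DSQ is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

enum dsq_kind { DSQ_NONE, DSQ_PER_CPU, DSQ_PRIORITY, DSQ_SHARED };

/* Hypothetical per-CPU view of the three DSQs. */
struct dsq_state {
	bool per_cpu_ready;      /* per-CPU DSQ has at least one task */
	bool prio_ready;         /* priority DSQ has at least one task */
	bool shared_ready;       /* shared DSQ has at least one task */
	uint64_t prio_last_ns;   /* last time a task was consumed from priority DSQ */
	uint64_t shared_last_ns; /* last time a task was consumed from shared DSQ */
};

static bool is_starving(uint64_t now_ns, uint64_t last_ns, uint64_t thresh_ns)
{
	return now_ns - last_ns > thresh_ns;
}

/* Pick the DSQ to consume from and refresh its timestamp. */
enum dsq_kind pick_dsq(struct dsq_state *s, uint64_t now_ns, uint64_t thresh_ns)
{
	enum dsq_kind pick = DSQ_NONE;

	/* Starving DSQs are served first (order assumed: shared, then priority). */
	if (s->shared_ready && is_starving(now_ns, s->shared_last_ns, thresh_ns))
		pick = DSQ_SHARED;
	else if (s->prio_ready && is_starving(now_ns, s->prio_last_ns, thresh_ns))
		pick = DSQ_PRIORITY;
	/* Otherwise follow the regular order. */
	else if (s->per_cpu_ready)
		pick = DSQ_PER_CPU;
	else if (s->prio_ready)
		pick = DSQ_PRIORITY;
	else if (s->shared_ready)
		pick = DSQ_SHARED;

	if (pick == DSQ_PRIORITY)
		s->prio_last_ns = now_ns;
	else if (pick == DSQ_SHARED)
		s->shared_last_ns = now_ns;

	return pick;
}
```

With a 5ms threshold, a CPU that always has per-CPU work still serves the
shared DSQ and the priority DSQ once each of them has waited longer than
5ms.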
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
There is no need to RCU protect the cpumask for the offline CPUs: it is
created once when the scheduler is initialized and it's never
deallocated.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Reduce the default time slice to 5ms for faster reaction times and
better system responsiveness when the system is overcommitted.
This also helps to provide a more predictable level of performance.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Always use direct CPU dispatch for kthreads. There is no need to treat
kthreads in a special way: simply reuse direct CPU dispatch to
prioritize them.
Moreover, change direct CPU dispatches to use scx_bpf_dispatch_vtime(),
since we may dispatch multiple tasks to the same per-CPU DSQ now.
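To illustrate why ordering matters once several tasks can sit in the
same per-CPU DSQ, here is a small stand-alone C model of a vtime-ordered
queue. It does not use the sched_ext API; struct dsq, DSQ_CAP and
dsq_insert_vtime() are hypothetical names used only for this sketch.

```c
#include <stdint.h>

#define DSQ_CAP 8 /* arbitrary capacity for the sketch */

/* Toy queue kept sorted by ascending vtime. */
struct dsq {
	uint64_t vtime[DSQ_CAP];
	int pid[DSQ_CAP];
	int len;
};

/* Insert a task so that the smallest vtime is always at index 0. */
int dsq_insert_vtime(struct dsq *q, int pid, uint64_t vtime)
{
	int i;

	if (q->len >= DSQ_CAP)
		return -1;

	/* Shift entries with a larger vtime right; ties keep arrival order. */
	for (i = q->len; i > 0 && q->vtime[i - 1] > vtime; i--) {
		q->vtime[i] = q->vtime[i - 1];
		q->pid[i] = q->pid[i - 1];
	}
	q->vtime[i] = vtime;
	q->pid[i] = pid;
	q->len++;

	return 0;
}
```

With FIFO ordering a task arriving late would always run last; with
vtime ordering a task with a smaller vtime is consumed first regardless
of its arrival order.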
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Small refactoring of the idle CPU selection logic:
- optimize idle CPU selection for tasks that can run on a single CPU
- drop the built-in idle selection policy and completely rely on the
  custom one
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
We are incorrectly using the SMT idle cpumask to find any idle CPU. Fix
this by using the generic idle cpumask instead.
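The difference between the two masks can be modeled with plain 64-bit
bitmasks. This is a simplified sketch, not the BPF implementation:
pick_idle_cpu() and its parameters are hypothetical, and preferring a
fully idle core before falling back to any idle CPU is an assumption
about the selection policy.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return the lowest set bit index in mask, or -1 if mask is empty. */
static int first_cpu(uint64_t mask)
{
	int cpu;

	for (cpu = 0; cpu < 64; cpu++)
		if (mask & (1ULL << cpu))
			return cpu;
	return -1;
}

/*
 * allowed:  CPUs the task can run on
 * idle_smt: CPUs whose whole core is idle (SMT idle mask)
 * idle_any: CPUs that are idle, even if a sibling is busy (generic mask)
 */
int pick_idle_cpu(uint64_t allowed, uint64_t idle_smt, uint64_t idle_any,
		  bool smt_enabled)
{
	int cpu;

	/* Prefer a fully idle core when SMT is enabled. */
	if (smt_enabled) {
		cpu = first_cpu(allowed & idle_smt);
		if (cpu >= 0)
			return cpu;
	}

	/* Fall back to any idle CPU: this must use the generic mask. */
	return first_cpu(allowed & idle_any);
}
```

If the fallback also used idle_smt, a CPU that is idle while its sibling
is busy would never be selected.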
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Implement CPU hotplugging in scx_bpfland without restarting the
scheduler.
The idle selection logic has been updated to consider online CPUs.
Additionally, a cpumask for offline CPUs has been introduced. Tasks
that have been dispatched to the DSQs associated with offline CPUs are
consumed by the other CPUs that are still online.
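A minimal model of how an online CPU could find pending work left on
the DSQs of offline CPUs is shown here. The bitmask representation, the
nr_queued array and next_offline_dsq() are all assumptions made for
illustration; the scheduler itself uses a BPF cpumask.

```c
#include <stdint.h>

/*
 * Return the first offline CPU whose DSQ still has queued tasks,
 * or -1 if there is nothing to drain.
 *
 * offline_mask: bit N set means CPU N is offline
 * nr_queued:    number of tasks queued on each per-CPU DSQ
 */
int next_offline_dsq(uint64_t offline_mask, const uint32_t *nr_queued,
		     int nr_cpus)
{
	int cpu;

	for (cpu = 0; cpu < nr_cpus && cpu < 64; cpu++) {
		if ((offline_mask & (1ULL << cpu)) && nr_queued[cpu] > 0)
			return cpu;
	}

	return -1;
}
```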
Moreover, the dependency on the Topology crate is temporarily dropped
and instead, /sys/devices/system/cpu/smt/active is used to determine if
SMT should be taken into account during idle selection. The Topology
crate will be re-introduced later when scx_bpfland gains more
topology-aware capabilities.
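Reading the SMT state from sysfs can be sketched as follows. The
scheduler's user-space part is written in Rust; this C version is only
an illustration, and the helper names (parse_smt_active,
read_smt_active) are hypothetical.

```c
#include <stdio.h>

/* Parse the sysfs value: "1" means SMT is active, "0" means it is not. */
int parse_smt_active(const char *buf)
{
	if (buf[0] == '1')
		return 1;
	if (buf[0] == '0')
		return 0;
	return -1;
}

/* Return 1 if SMT is active, 0 if not, -1 if the state is unknown. */
int read_smt_active(const char *path)
{
	char buf[8] = { 0 };
	FILE *f = fopen(path, "r");
	int ret;

	if (!f)
		return -1;
	ret = fgets(buf, sizeof(buf), f) ? parse_smt_active(buf) : -1;
	fclose(f);

	return ret;
}
```

A caller would pass "/sys/devices/system/cpu/smt/active" as the path and
could treat -1 as "SMT not available".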
This fixes #406.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Introduce a tunable (the time slice lag) that limits the minimum
vruntime assigned to a task when it is dispatched:
  vtime_min = vtime_now - slice_lag_ns
Increasing the time slice lag can make interactive tasks even more
responsive at the cost of starving regular and newly created tasks.
Default time slice lag is 0.
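The effect of the lag on the vruntime lower bound can be illustrated
with a small C helper. This is a simplified sketch: clamp_vtime() is a
hypothetical name, and plain unsigned comparisons are used here instead
of wraparound-safe ones.

```c
#include <stdint.h>

/* Clamp a task's vruntime so that it is never below vtime_min. */
uint64_t clamp_vtime(uint64_t task_vtime, uint64_t vtime_now,
		     uint64_t slice_lag_ns)
{
	/* vtime_min = vtime_now - slice_lag_ns, saturating at 0. */
	uint64_t vtime_min = vtime_now > slice_lag_ns ?
			     vtime_now - slice_lag_ns : 0;

	return task_vtime < vtime_min ? vtime_min : task_vtime;
}
```

With a lag of 0 a task that slept for a long time is bumped up to
vtime_now; with a larger lag it keeps part of its accumulated credit and
is therefore dispatched ahead of tasks that have been running.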
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Overview
========
This scheduler is derived from scx_rustland, but it is fully implemented
in BPF, with a minimal user-space Rust part that processes command line
options, collects metrics and logs scheduling statistics.
Unlike scx_rustland, all scheduling decisions are made by the BPF
component.
Motivation
==========
The primary goal of this scheduler is to act as a performance baseline
for comparison with scx_rustland, allowing for a better assessment of
the overhead caused by kernel/user-space interactions.
It can also be used to deploy prototypes initially tested in the
scx_rustland scheduler. In fact, this scheduler is expected to
outperform scx_rustland, due to the elimination of the kernel/user-space
overhead.
Scheduling policy
=================
scx_bpfland is a vruntime-based sched_ext scheduler that prioritizes
interactive workloads. Its scheduling policy closely mirrors
scx_rustland, but it has been re-implemented in BPF with some small
adjustments.
Tasks are categorized as either interactive or regular based on their
average rate of voluntary context switches per second: tasks that exceed
a specific voluntary context switch threshold are classified as
interactive.
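The classification can be sketched in C as shown here. The one-second
sampling window, the simple averaging with the previous value and all
names (task_stats, update_interactive) are assumptions made for this
example, not a description of the actual BPF code.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical per-task accounting. */
struct task_stats {
	uint64_t nvcsw;           /* voluntary context switches in this window */
	uint64_t window_start_ns; /* when the current window started */
	uint64_t avg_nvcsw;       /* smoothed switches per second */
	bool interactive;
};

/* Refresh the average rate and reclassify the task once per window. */
void update_interactive(struct task_stats *t, uint64_t now_ns,
			uint64_t thresh)
{
	uint64_t delta_ns = now_ns - t->window_start_ns;
	uint64_t rate;

	if (delta_ns < NSEC_PER_SEC)
		return;

	/* Voluntary context switches per second over the elapsed window. */
	rate = t->nvcsw * NSEC_PER_SEC / delta_ns;

	/* Smooth the rate by averaging it with the previous value. */
	t->avg_nvcsw = (t->avg_nvcsw + rate) / 2;
	t->interactive = t->avg_nvcsw > thresh;

	t->nvcsw = 0;
	t->window_start_ns = now_ns;
}
```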
Interactive tasks are prioritized in a higher-priority DSQ, while
regular tasks are placed in a lower-priority DSQ. Within each queue,
tasks are sorted based on their weighted runtime, using the built-in scx
vtime ordering capabilities (scx_bpf_dispatch_vtime()).
Moreover, each task gets a time slice budget. When a task is dispatched,
it receives a time slice equivalent to the remaining unused portion of
its previously allocated time slice (with a minimum threshold applied).
This gives latency-sensitive workloads more chances to exceed their time
slice when needed to perform short bursts of CPU activity without being
interrupted (e.g., real-time audio encoding / decoding workloads).
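The time slice budget described above can be expressed as a small C
helper. This is a sketch under assumptions: next_slice() and its
parameter names are hypothetical, and the minimum value is passed in by
the caller rather than taken from the scheduler.

```c
#include <stdint.h>

/*
 * Return the time slice to assign on the next dispatch: the unused part
 * of the previous slice, but never less than slice_min_ns.
 */
uint64_t next_slice(uint64_t prev_slice_ns, uint64_t used_ns,
		    uint64_t slice_min_ns)
{
	uint64_t remaining = used_ns < prev_slice_ns ?
			     prev_slice_ns - used_ns : 0;

	return remaining > slice_min_ns ? remaining : slice_min_ns;
}
```

A task that used 1ms of a 5ms slice gets 4ms next time, while a task
that consumed its whole slice only gets the minimum.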
Results
=======
Initial test results indicate that this scheduler offers around a +5%
improvement in frames-per-second (fps) compared to scx_rustland when
using the benchmark "playing a video game while recompiling the kernel".
This improvement was observed in games such as Cyberpunk 2077,
Counter-Strike 2, and Baldur's Gate 3.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>