Introduce a basic support of CPU topology awareness. With this change,
the scheduler will prioritize dispatching tasks to idle CPUs with fewer
busy SMT siblings, then, it will proceed to CPUs with more busy SMT
siblings, in ascending order.
To implement this, introduce a new CoreMapping abstraction, that
provides a mapping of the available core IDs in the system along with
their corresponding lists of CPU IDs. This, coupled with the
get_cpu_pid() method from the BpfScheduler abstraction, allows the
user-space scheduler to enforce the policy outlined above and improve
performance on SMT systems.
Keep in mind that this improvement is relevent only when the amount of
tasks running in the system is less than the amount of CPUs. As soon as
the amount of running tasks increases, they will be distributed across
all available CPUs and cores, thereby negating the advantages of SMT
isolation.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Even if the current implementation of the user-space scheduler doesn't
require to allocate aligned memory, add a simple support to aligned
allocations in RustLandAllocator, in order to make it more generic and
potentially usable by other schedulers / components.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Periodically report a page fault counter in the scheduler output. The
user-space scheduler should never trigger page faults, otherwise we may
experience deadlocks (that would trigger the sched-ext watchdog,
unloading the scheduler).
Reporting a page fault counter periodically to stdout can be really
helpful to debug potential issues with the custom allocator.
Moreover, group together also nr_sched_congested and
nr_failed_dispatches with nr_page_faults and use the sum of all these
counters to determine the healthy status of the user-space scheduler
(reporting it to stdout as well).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
To prevent potential deadlock conditions under heavy loads, any
scheduler that delegates scheduling decisions to user-space should avoid
triggering page faults.
To address this issue, replace the default Rust allocator with a custom
one (RustLandAllocator), designed to operate on a pre-allocated buffer.
This, coupled with the memory locking (via mlockall), prevents page
faults from happening during the execution of the user-space scheduler,
avoiding the deadlock condition.
This memory allocator is completely transparent to the user-space
scheduler code and it is applied automatically when the bpf module is
imported.
In the future we may decide to move this allocator to a more generic
place (scx_utils crate), so that also other user-space Rust schedulers
can use it.
This initial implementation of the RustLandAllocator is very simple: a
basic block-based allocator that uses an array to track the status of
each memory block (allocated or free).
This allocator can be improved in the future, but right now, despite its
simplicity, it shows a reasonable speed and efficiency in meeting memory
requests from the user-space scheduler, having to deal mostly with small
and uniformly sized allocations.
With this change in place scx_rustland survived more than 10hrs on a
heavily stressed system (with stress-ng and kernel builds running in a
loop):
$ ps -o pid,rss,etime,cmd -p `pidof scx_rustland`
PID RSS ELAPSED CMD
34966 75840 10:00:44 ./build/scheds/rust/scx_rustland/debug/scx_rustland
Without this change it is possible to trigger the sched-ext watchdog
timeout in less than 5min, under the same system load conditions.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Entries from TaskInfoMap associated to exiting tasks are already removed
via the BPF .exit_task() callback, so drop the obsolete TODO note and
replace it with a proper comment.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Improve priority boosting using voluntary context switches metric.
Overview
========
The current criteria to apply the time slice boost (option `-b`) is to
distinguish between newly created tasks and tasks that are already
running: in order to prioritize interactive applications (games,
multimedia, etc.) we apply a time slice usage penalty on newly created
tasks, indirectly boosting the priority of tasks that are already
running, which are likely to be the interactive applications that we
aim to prioritize.
Problem
=======
This approach works well when the background workload forks a bunch of
short-lived tasks (e.g., a parallel kernel build), but it fails to
properly classify CPU-intensive background tasks (i.e., video/3D
rendering, encryption, large data analysis, etc.), because these
applications, typically, do not generate many short-lived processes.
In presence of such workloads the time slice penalty is not enforced,
resulting in a lack of any boost for interactive applications.
Solution
========
A more effective critiria for distinguishing between interactive
applications and background CPU-intensive applications is to examine the
voluntary context switches: an application that periodically releases
the CPU voluntarily is very likely to be interactive.
Therefore, change the time slice boost logic to apply a bonus (scale down
the accounted used time slice) to tasks that show an increase in their
voluntary context switches counter over a time frame of 10 sec.
Based on experimental results, this simple heurstic appears to be quite
effective in classifying interactive tasks and prioritize them over
potential background CPU-intensive tasks.
Additionally, having a better criteria to identify interactive tasks
allow to prioritize also newly created tasks, thereby enhancing the
responsiveness of interactive shell sessions.
This always ensures the prompt execution of system commands, even when
the system is massively overloaded, unlike the previous time slice boost
logic, which made interactive shell sessions less responsive by
deprioritizing newly created tasks.
Results
=======
With this new logic in place it is possible to play a video game (e.g.,
Terraria) without experiencing any frame rate drop (60 fps), while a
parallel CPU stress test (`stress-ng -c 32`) is running in the
background. The same result can also be obtained with a parallel kernel
build (`make -j 32`). Thus, there is no regression compared to the
previous "ideal" test case.
Even when mixing both workloads (`make -j 16` + `stress-ng -c 16`),
Terraria can still be played without noticeable lag in the audio or
video, maintaining a consistent 60 fps.
In addition to that, shell commands are also very responsive.
Following, the results (average and standard deviation of 10 runs) of
two simple interactive shell commands, while both the `make -j 16` and
`stress-ng -c 16` workloads are running in background:
avg time "uname -r" "ps axuw > /dev/null"
=========================================================
EEVDF 11.1ms 231.8ms
scx_rustland 2.6ms 212.0ms
stdev "uname -r" "ps axuw > /dev/null"
=========================================================
EEVDF 2.28 23.41
scx_rustland 0.70 9.11
Tests conducted on a 8-cores laptop (11th Gen Intel i7-1195G7 @
4.800GHz) with 16GB of RAM.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Provide the number of voluntary context switches (nvcsw) for each task
to the user-space scheduler.
This extra information can then be used by the scheduler to enhance its
decision-making process when scheduling tasks.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
552b75a9c7 ("scx: Build fix after kernel update") updated scx_flatcg along
with other schedulers to use the new direct dispatching from
ops.select_cpu() mechanism. However, this was buggy for flatcg.
flatcg uses direct dispatch for two purposes - as an optimization when there
are idle cpus and to avoid dealing with custom CPU affinities in the
dispatch logic. While the former can be moved to ops.select_cpu(), the
latter can't as it should also apply to tasks which get enqueued without
preceding ops.select_cpu(), e.g., when the task gets requeued after an
attribute change or runs out of time slice. The API update incorrectly moved
both to ops.select_cpu() leading to futile retries of try_pick_next_cgroup()
and scheduling misbheaviors.
Fix it by separating out the two cases and only keeping the idle
optimization case in ops.select_cpu().
Signed-off-by: Tejun Heo <tj@kernel.org>
We may end up stalling for too long in fcg_dispatch() if
try_pick_next_cgroup() doesn't find another valid cgroup to pick. This
can be quite risky, considering that we are holding the rq lock in
dispatch().
This condition can be reproduced easily in our CI, where we can trigger
stalling softirq works:
[ 4.972926] NOHZ tick-stop error: local softirq work is pending, handler #200!!!
Or rcu stalls:
[ 47.731900] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 47.731900] rcu: 1-...!: (0 ticks this GP) idle=b29c/1/0x4000000000000000 softirq=2204/2204 fqs=0
[ 47.731900] rcu: 3-...!: (0 ticks this GP) idle=db74/1/0x4000000000000000 softirq=2286/2286 fqs=0
[ 47.731900] rcu: (detected by 0, t=26002 jiffies, g=6029, q=54 ncpus=4)
[ 47.731900] Sending NMI from CPU 0 to CPUs 1:
To mitigate this issue reduce the amount of try_pick_next_cgroup()
retries from BPF_MAX_LOOPS (8M) to CGROUP_MAX_RETRIES (1024).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Introduce a parameter to prioritize active running tasks over newly
created tasks.
This option can be used to enhance interactive applications (e.g.,
games, audio/video, GUIs, etc.) that are concurrently running with
fork-intensive background workloads (such as a large parallel build for
example).
The boost value (which functions as a penalty) is applied to the time
slice attributed to newly generated tasks, increasing their vruntime
and, in an indirect manner, "boosting" the priority of all the other
concurrent active tasks.
The time slice boost parameter was applied in the live demo video [1] to
enhance the frames per second (fps) of a video game (Terraria), running
simultaneously with a parallel kernel build (`make -j 32`) on an 8-core
laptop (the value used in the video matches the existing setting of
running `scx_rustland -b 200`).
[1] https://www.youtube.com/watch?v=oCfVbz9jvVQ
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
With the introduction of a the dynamic time slice that scales down based
on the number of tasks in the system, there is no obvious benefit in
utilizing SCX_ENQ_PREEMPT to dispatch the user-space scheduler.
The reduced time slice as the task count increases already enhances the
user-space scheduler's opportunities to run and efficiently manage
scheduling tasks, even when the system is massively overloaded.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Move scaling after tasks are sent to the dispatcher: tasks are
dispatched based on the amount of idle CPUs, so checking for any
remaining tasks still sitting in the scheduler after dispatch gives a
better idea how busy the system is.
Moreover, do not scale the time slice based on nr_cpus (otherwise,
systems with a large amount of CPUs would rarely get any scaling at
all).
Instead, apply a scaling factor as a function of how many tasks are
still waiting in the scheduler: nr_scheduled / 2. This method scales
better as the number of CPUs increases.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Now that we can dispatch directly from select_cpu() we can make the code
more compact and readable by removing the force_local logic.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
After updates to reflect the updated init and direct dispatch API, the
schedulers aren't compatible with older kernels. Bump versions and publish
releases.
In the latest kernel, sched_ext API has changed in two areas:
- ops.prep_enable/cancel_enable/enable/disable() replaced with
ops.init_task/enable/disable/exit_task().
- scx_bpf_dispatch() can now be called from ops.select_cpu(). Also,
SCX_ENQ_LOCAL flag is removed. Instead, users can call
scx_bpf_select_cpu_dfl() from ops.select_cpu() and use the @is_idle out
param value to determine whether to dispatch directly.
This commit updates all schedules so that they build.
- Init functions renamed / merged / split.
- ops.select_cpu() is added to several schedulers and local direct
disptching logic is moved there.
This is the minimum update which is need to make the schedulers build and
work. It needs further update to e.g. move vtime udpates to ops.enable().
With the introduction of a the dynamic time slice that scales down based
on the number of tasks in the system, there is no need anymore to apply
a constant scaling factor to time slice to extend the range of the
allowed time slices.
Therefore, get rid of the static scaling and use slice_ns as the upper
limit for the time slice accounted to the tasks.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
libbpf_rs::num_possible_cpus() may take into account multi-threads
multi-cores information, that are not used efficiently by the scheduler
at the moment.
For simplicity rely on /proc/stat to determine the amount of CPUs that
can be used by the scheduler and provide a proper abstraction to access
this information from the bpf Rust module.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Fix the ternary operator expression to return the CPU id, instead of the
boolean result of the condition.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
With the introduction of a variable time slice that scales down in
function of the amount of waiting tasks, the scheduler is able to handle
a steady stream of newly spawned tasks, without having to de-prioritize
them to guarantee a good level of system responsiveness.
Hence, the logic for de-prioritizing new tasks can be removed, as it
currently doesn't provide any measurable benefits. In fact, it even
proves counterproductive as it can implicitly slow down the interactive
performance of shell sessions when the system is overloaded with a
significant amount of CPU hogs (e.g, `stress-ng -c 128`).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Implement a simple logic in the user-space scheduler to automatically
adjust the tasks' time slice: reduce the time slice by a scaling factor
of (nr_waiting / nr_cpus + 1), where nr_waiting is the amount of tasks
waiting in the scheduler and nr_cpus is the amount of CPUs in the
system.
Using a fine-grained time slice as the number of tasks in the system
grows, improves responsiveness of low-latency activities (e.g., audio,
video games), also in presence of other CPU-intensive tasks that are
concurrently running in the system.
On the other hand, extending the time slice when only a limited number
of tasks are active in the system contributes to an enhancement in the
overall system throughput and a reduced amount of context switches.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Add to BpfScheduler() the new methods set_effective_slice_us() and
get_effective_slice_us().
These methods can be used by the user-space scheduler to dynamically
adjust (and retrieve) the effective time slice used to dispatch tasks
within the BPF dispatcher.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Cache slice_ns into the main scheduler class to avoid accessing it via
self.bpf.skel.rodata().slice_ns every single time.
This also makes the scheduler code more clear and more abstracted from
the BPF details.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Always report task comm, nr_queued and nr_scheduled in the log messages.
Moreover, report also task name (comm) and cpu when possible.
All these extra information can be really helpful to trace and debug
scheduling issues.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
We allow tasks to bypass the user-space scheduler and be dispatched
directly using a shortcut in the enqueue path, if their running CPU is
immediately available or if the task is per-CPU kthread.
However, the shortcut is disabled if the user-space scheduler has some
pending activities to do (to avoid disrupting too much its decision).
In this case the shortcut is disabled also for per-CPU kthreads and that
may cause priority-inversion problems in the system, triggering some
stall of some per-CPU kthreads (such as rcuog/N) and short system
lockups, if the system is overloaded.
Prevent this by always enabing the dispatch shortcut for per-CPU
kthreads.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
When we fail to push a task to the queued BPF map we fallback to direct
dispatch, but we can't use SCX_DSQ_LOCAL_ON. So, make sure to use
SCX_DSQ_GLOBAL in this case to prevent scheduler crashes.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Instead of just trying the target CPU and the previously used CPU, we
could cycle among all the available CPUs (if both those CPUs cannot be
used), before using the global DSQ.
This allows to not de-prioritize too much tasks that can't be scheduled
on the CPU selected by the scheduler (or their previously used CPU), and
we can still dispatch them using SCX_DSQ_LOCAL_ON, like any other task.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Assign min_vruntime to the task before the weighted time slice is
evaluated, then add the time slice.
In this way we still ensure that the task's vruntime is in the range
(min_vruntime + 1, min_vruntime + max_slice_ns], but we don't nullify
the effect of the evaluated time slice if the starting vruntime of the
task is too small.
Also change update_enqueued() to return the evaluated weighted time
slice (that can be used in the future).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Change TaskTree.push() to accept directly a Task object, rather than
each individual attribute. Moreover, Task attributes don't need to be
public, since both TaskTree and Task are only used locally.
This makes the code more elegant and more readable.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Introduce a new counter to report the amount of failed dispatches: if
the scheduler designates a target CPU for a task, and both the chosen
CPU and the previously utilized one are unavailable when the task is
dispatched, the task will be sent to the global DSQ, and the counter
will be incremented.
Also mark all the methods to access these statistics counters as
optional. In the future we may also provide a "verbose" option and show
these statistics only when the scheduler runs in verbose mode.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Move the code responsible for interfacing with the BPF component into
its own module and provide high-level abstractions for the user-space
scheduler, hiding all the internal BPF implementation details.
This makes the user-space scheduler code much more readable and it
allows potential developers/contributors that want to focus at the pure
scheduling details to modify the scheduler in a generic way, without
having to worry about the internal BPF details.
In the future we may even decide to provide the BPF abstraction as a
separate crate, that could be used as a baseline to implement user-space
schedulers in Rust.
API overview
============
The main BPF interface is provided by BpfScheduler(). When this object
is initialized it will take care of registering and initializing the BPF
component.
Then the scheduler can use the BpfScheduler() instance to receive tasks
(in the form of QueuedTask object) and dispatch tasks (in the form of
DispatchedTask objects), using respectively the methods dequeue_task()
and dispatch_task().
The CPU ownership map can be accessed using the method get_cpu_pid(),
this also allows to keep track of the idle and busy CPUs, with the
corrsponding PIDs associated to them.
BPF counters and statistics can be accessed using the methods
nr_*_mut(), in particular nr_queued_mut() and nr_scheduled_mut() can be
updated to notify the BPF component if the user-space scheduler has some
pending work to do or not.
Finally the methods read_bpf_exit_kind() and report_bpf_exit_kind() can
be used respectively to read the exit code and exit message from the BPF
component, when the scheduler is unregistered.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
This because each scheduler has it's own Rust Crate
and it's better if they had a README associated with each one.
https://crates.io/crates/scx_layered
We always try to use the current CPU (from the .dispatch() callback) to
run the user-space scheduler itself and if the current CPU is not usable
(according to the cpumask) we just re-use the previouly used CPU.
However, if the previously used CPU is also not usable, we may trigger
the following error:
sched_ext: runtime error (SCX_DSQ_LOCAL[_ON] verdict target cpu 4 not allowed for scx_rustland[256201])
Potentially this can also happen with any task, so improve the dispatch
logic as following:
- dispatch on the target CPU, if usable
- otherwise dispatch on the previously used CPU, if usable
- otherwise dispatch on the global DSQ
Moreover, rename dispatch_on_cpu() -> dispatch_task() for better
clarity.
This should be enough to handle all the possible decisions made by the
user-space scheduler, making the dispatcher more robust.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
In the dispatch callback we can dispatch tasks to any CPU, according to
the scheduler decisions, so there's no reason to check for the available
dispatch slots in the current CPU only, to determine if we need to stop
dispatching tasks.
Since the scheduler is aware of the idle state of the CPUs (via the CPU
ownership map) it has all the information to automatically regulate the
flow of dispatched tasks and not overflow the dispatch slots, therefore
it is safe to remove this check.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
No functional change, only a little polishing, including updates to
comments and documentation to align with the latest changes in the code.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
While bypassing the user-space scheduler can provide some benefits at
reducing the scheduling overhead, doing so underneath the scheduler
while it is actively taking decisions may disrupt its work and have a
negative effect on the overall system performance.
For this reason, activate the logic to bypass the user-space scheduler
only when there is no pending work it.
This change makes the scheduler much more reliable, for example on a
8-cores system it is really easy to trigger short lockups or even
trigger the sched-ext watchdog that kicks out the scheduler, running the
following stress test:
$ stress-ng -c 128
With this change applied the system remains reasonably responsive and
the scheduler is never disabled by the sched-ext watchdog.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Instead of accounting (max_slice_ns / 2) to the vruntime of all the new
tasks, add that to thier regular weighted time delta, as an additional
penalty.
This allows to distinguish new CPU intensive tasks vs new less CPU
intensive tasks, and prioritize the latter over the former.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Use SCX_ENQ_PREEMPT to dispatch the user-space scheduler. This can help
to mitigate starvation in presence of many cpu hogs (way more than the
amount of available CPUs) running in the system, by giving the scheduler
more chances to drain the amount of tasks that may be starving in a
waiting state.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The current implementation of the user-space scheduler is strongly
prioritizing newly created tasks by setting their initial vruntime to
(min_vruntime + 1); this prioritization places them ahead of other tasks
waiting to run.
While this approach is efficient for processing short-lived tasks, it
makes the scheduler vulnerable to fork-bomb attacks and significantly
penalizes interactive workloads (e.g., "foreground" applications), in
particular in the presence of background applications that are spawning
multiple tasks, such as parallel builds.
Instead of prioritizing newly created tasks, do the opposite and account
(max_slice_ns / 2) to their initial vruntime, to make sure they are not
scheduled before the other tasks that are already waiting for the CPU in
the current scheduler run.
This allows to mitigate potential fork-bomb attacks and it strongly
improves the responsiveness of interactive applications (such as UI,
audio/video streams, gaming, etc.).
With this change applied, under certain conditions, scx_rustland can
even outperform the default Linux scheduler.
For example, with a parallel kernel build (make -j32) running in the
background, I can play Terraria with a constant rate of ~30-40 fps,
while the default Linux scheduler can handle only ~20-30 fps under the
same conditions.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Avoid updating task information for tasks that are exiting, as they
won't be used by the user-space scheduler.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
With commit a7677fd ("scx_rustland: bypass user-space scheduler for
short-lived kthreads") we were try to mitigate a problem that was
actually introduced by using the wrong formula to evaluate weighted
vruntime, see commit 2900b20 ("scx_rustland: evaluate the proper
vruntime delta").
Reverting that (pseudo-)optimization doesn't seem to introduce any
performance/latency regression and it makes the code more elegant,
therefore drop it.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
We can sometimes hit scenarios in the scx_userland scheduler where there
is work to be done in user space, but we incorrectly fail to run the
user space scheduler. In order to avoid this, we can use global
variables that are set from both BPF and user space. The BPF-side
variable reflects when one or more tasks have been enqueued, and the
user space-side variable reflects when user space has received tasks but
has not yet dispatched them.
In the ops.update_idle() callback, we can check these variables and send
a resched IPI to a core to ensure that the user-space scheduler is
always scheduled when there's work to be done.
Signed-off-by: David Vernet <void@manifault.com>
Improve build portability by including asm-generic/errno.h, instead of
linux/errno.h.
The difference between these two headers can be summarized as following:
- asm-generic/errno.h contains generic error code definitions that are
intended to be common across different architectures,
- linux/errno.h includes architecture-specific error codes and
provides additional (or overrides) error code definitions based on
the specific architecture where the code is compiled.
Considering the architecture-independent nature of scx, the advantages
of being able to use architecture-specific error codes are marginal or
negligible (and we should probably discourage using them).
Moving towards asm-generic/errno.h, however, allows the removal of
cross-compilation dependencies (such as the gcc-multilib package in
Debian/Ubuntu) and improves the code portability across various
architectures and distributions.
This also allows to remove a symlink hack from the github workflow.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>