Commit Graph

210 Commits

Author SHA1 Message Date
Andrea Righi
732ba4900b scx_rustland: avoid using SCX_ENQ_PREEMPT
With the introduction of a the dynamic time slice that scales down based
on the number of tasks in the system, there is no obvious benefit in
utilizing SCX_ENQ_PREEMPT to dispatch the user-space scheduler.

The reduced time slice as the task count increases already enhances the
user-space scheduler's opportunities to run and efficiently manage
scheduling tasks, even when the system is massively overloaded.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-10 17:32:29 +01:00
Andrea Righi
db9a29d618 scx_rustland: improve dynamic slice scaling
Move scaling after tasks are sent to the dispatcher: tasks are
dispatched based on the amount of idle CPUs, so checking for any
remaining tasks still sitting in the scheduler after dispatch gives a
better idea how busy the system is.

Moreover, do not scale the time slice based on nr_cpus (otherwise,
systems with a large amount of CPUs would rarely get any scaling at
all).

Instead, apply a scaling factor as a function of how many tasks are
still waiting in the scheduler: nr_scheduled / 2. This method scales
better as the number of CPUs increases.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-09 22:11:07 +01:00
Andrea Righi
1da2983804 scx_rustland: get rid of force_local
Now that we can dispatch directly from select_cpu() we can make the code
more compact and readable by removing the force_local logic.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-09 22:11:07 +01:00
Andrea Righi
6ead675fb6 scx_rustland: add a link to the live demo in the README
Update the README.md adding a link to a live demo video of the
scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-09 22:11:07 +01:00
Tejun Heo
942b0269b8 Bump versions
After updates to reflect the updated init and direct dispatch API, the
schedulers aren't compatible with older kernels. Bump versions and publish
releases.
2024-01-08 18:49:54 -10:00
Tejun Heo
552b75a9c7 scx: Build fix after kernel update
In the latest kernel, sched_ext API has changed in two areas:

- ops.prep_enable/cancel_enable/enable/disable() replaced with
  ops.init_task/enable/disable/exit_task().

- scx_bpf_dispatch() can now be called from ops.select_cpu(). Also,
  SCX_ENQ_LOCAL flag is removed. Instead, users can call
  scx_bpf_select_cpu_dfl() from ops.select_cpu() and use the @is_idle out
  param value to determine whether to dispatch directly.

This commit updates all schedules so that they build.

- Init functions renamed / merged / split.

- ops.select_cpu() is added to several schedulers and local direct
  disptching logic is moved there.

This is the minimum update which is need to make the schedulers build and
work. It needs further update to e.g. move vtime udpates to ops.enable().
2024-01-08 14:48:24 -10:00
Andrea Righi
1ea5aebfb4 scx_rustland: always consider slice_ns as maximum time slice
With the introduction of a the dynamic time slice that scales down based
on the number of tasks in the system, there is no need anymore to apply
a constant scaling factor to time slice to extend the range of the
allowed time slices.

Therefore, get rid of the static scaling and use slice_ns as the upper
limit for the time slice accounted to the tasks.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 19:22:38 +01:00
Andrea Righi
9b482f48f1 scx_rustland: determine the amount of cores via /proc/stat
libbpf_rs::num_possible_cpus() may take into account multi-threads
multi-cores information, that are not used efficiently by the scheduler
at the moment.

For simplicity rely on /proc/stat to determine the amount of CPUs that
can be used by the scheduler and provide a proper abstraction to access
this information from the bpf Rust module.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 19:11:25 +01:00
Andrea Righi
0d107d6220 scx_rustland: return the proper cpu value from get_task_cpu()
Fix the ternary operator expression to return the CPU id, instead of the
boolean result of the condition.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 19:10:59 +01:00
Andrea Righi
fa6915cc0a scx_rustland: simplify update_enqueued()
With the introduction of a variable time slice that scales down in
function of the amount of waiting tasks, the scheduler is able to handle
a steady stream of newly spawned tasks, without having to de-prioritize
them to guarantee a good level of system responsiveness.

Hence, the logic for de-prioritizing new tasks can be removed, as it
currently doesn't provide any measurable benefits. In fact, it even
proves counterproductive as it can implicitly slow down the interactive
performance of shell sessions when the system is overloaded with a
significant amount of CPU hogs (e.g, `stress-ng -c 128`).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 07:38:52 +01:00
Andrea Righi
bf98154ee1 scx_rustland: use dynamic time slice in the user-space scheduler
Implement a simple logic in the user-space scheduler to automatically
adjust the tasks' time slice: reduce the time slice by a scaling factor
of (nr_waiting / nr_cpus + 1), where nr_waiting is the amount of tasks
waiting in the scheduler and nr_cpus is the amount of CPUs in the
system.

Using a fine-grained time slice as the number of tasks in the system
grows, improves responsiveness of low-latency activities (e.g., audio,
video games), also in presence of other CPU-intensive tasks that are
concurrently running in the system.

On the other hand, extending the time slice when only a limited number
of tasks are active in the system contributes to an enhancement in the
overall system throughput and a reduced amount of context switches.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 07:38:52 +01:00
Andrea Righi
303c4ea548 scx_rustland: dynamic time slice support
Add to BpfScheduler() the new methods set_effective_slice_us() and
get_effective_slice_us().

These methods can be used by the user-space scheduler to dynamically
adjust (and retrieve) the effective time slice used to dispatch tasks
within the BPF dispatcher.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 07:35:31 +01:00
Andrea Righi
2a32d81859 scx_rustland: store default slice_ns in the scheduler class
Cache slice_ns into the main scheduler class to avoid accessing it via
self.bpf.skel.rodata().slice_ns every single time.

This also makes the scheduler code more clear and more abstracted from
the BPF details.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-07 16:14:51 +01:00
Andrea Righi
8ccbbdadee scx_userland: improve BPF logging
Always report task comm, nr_queued and nr_scheduled in the log messages.
Moreover, report also task name (comm) and cpu when possible.

All these extra information can be really helpful to trace and debug
scheduling issues.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-07 16:14:51 +01:00
Andrea Righi
295873ac41 scx_rustland: always dispatch per-CPU kthreads from enqueue
We allow tasks to bypass the user-space scheduler and be dispatched
directly using a shortcut in the enqueue path, if their running CPU is
immediately available or if the task is per-CPU kthread.

However, the shortcut is disabled if the user-space scheduler has some
pending activities to do (to avoid disrupting too much its decision).

In this case the shortcut is disabled also for per-CPU kthreads and that
may cause priority-inversion problems in the system, triggering some
stall of some per-CPU kthreads (such as rcuog/N) and short system
lockups, if the system is overloaded.

Prevent this by always enabing the dispatch shortcut for per-CPU
kthreads.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
0c3bdb16fe scx_rustland: prevent using SCX_DSQ_LOCAL_ON from enqueue()
When we fail to push a task to the queued BPF map we fallback to direct
dispatch, but we can't use SCX_DSQ_LOCAL_ON. So, make sure to use
SCX_DSQ_GLOBAL in this case to prevent scheduler crashes.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
05d997c539 scx_rustland: more robust CPU selection logic in the dispatcher
Instead of just trying the target CPU and the previously used CPU, we
could cycle among all the available CPUs (if both those CPUs cannot be
used), before using the global DSQ.

This allows to not de-prioritize too much tasks that can't be scheduled
on the CPU selected by the scheduler (or their previously used CPU), and
we can still dispatch them using SCX_DSQ_LOCAL_ON, like any other task.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
18a990ae82 scx_rustland: assign min_vruntime before time slice evaluation
Assign min_vruntime to the task before the weighted time slice is
evaluated, then add the time slice.

In this way we still ensure that the task's vruntime is in the range
(min_vruntime + 1, min_vruntime + max_slice_ns], but we don't nullify
the effect of the evaluated time slice if the starting vruntime of the
task is too small.

Also change update_enqueued() to return the evaluated weighted time
slice (that can be used in the future).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
92109c95a9 scx_rustland: small TaskTree.push() refactoring
Change TaskTree.push() to accept directly a Task object, rather than
each individual attribute. Moreover, Task attributes don't need to be
public, since both TaskTree and Task are only used locally.

This makes the code more elegant and more readable.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
96f3eb42be
Merge pull request #68 from sched-ext/scx-rustland-refactoring
scx_rustland: refactoring
2024-01-04 20:42:30 +01:00
Andrea Righi
7813992896 scx_rustland: introduce nr_failed_dispatches
Introduce a new counter to report the amount of failed dispatches: if
the scheduler designates a target CPU for a task, and both the chosen
CPU and the previously utilized one are unavailable when the task is
dispatched, the task will be sent to the global DSQ, and the counter
will be incremented.

Also mark all the methods to access these statistics counters as
optional. In the future we may also provide a "verbose" option and show
these statistics only when the scheduler runs in verbose mode.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-04 17:36:06 +01:00
Andrea Righi
796a7ebc0e scx_rustland: provide an abstraction layer for the BPF component
Move the code responsible for interfacing with the BPF component into
its own module and provide high-level abstractions for the user-space
scheduler, hiding all the internal BPF implementation details.

This makes the user-space scheduler code much more readable and it
allows potential developers/contributors that want to focus at the pure
scheduling details to modify the scheduler in a generic way, without
having to worry about the internal BPF details.

In the future we may even decide to provide the BPF abstraction as a
separate crate, that could be used as a baseline to implement user-space
schedulers in Rust.

API overview
============

The main BPF interface is provided by BpfScheduler(). When this object
is initialized it will take care of registering and initializing the BPF
component.

Then the scheduler can use the BpfScheduler() instance to receive tasks
(in the form of QueuedTask object) and dispatch tasks (in the form of
DispatchedTask objects), using respectively the methods dequeue_task()
and dispatch_task().

The CPU ownership map can be accessed using the method get_cpu_pid(),
this also allows to keep track of the idle and busy CPUs, with the
corrsponding PIDs associated to them.

BPF counters and statistics can be accessed using the methods
nr_*_mut(), in particular nr_queued_mut() and nr_scheduled_mut() can be
updated to notify the BPF component if the user-space scheduler has some
pending work to do or not.

Finally the methods read_bpf_exit_kind() and report_bpf_exit_kind() can
be used respectively to read the exit code and exit message from the BPF
component, when the scheduler is unregistered.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-04 16:49:09 +01:00
Jordan Rome
5bacefcdbe Add README files for each rust scheduler
This because each scheduler has it's own Rust Crate
and it's better if they had a README associated with each one.

https://crates.io/crates/scx_layered
2024-01-04 07:35:44 -08:00
Andrea Righi
7c11837a61 scx_rustland: make dispatcher more robust
We always try to use the current CPU (from the .dispatch() callback) to
run the user-space scheduler itself and if the current CPU is not usable
(according to the cpumask) we just re-use the previouly used CPU.

However, if the previously used CPU is also not usable, we may trigger
the following error:

 sched_ext: runtime error (SCX_DSQ_LOCAL[_ON] verdict target cpu 4 not allowed for scx_rustland[256201])

Potentially this can also happen with any task, so improve the dispatch
logic as following:

 - dispatch on the target CPU, if usable
 - otherwise dispatch on the previously used CPU, if usable
 - otherwise dispatch on the global DSQ

Moreover, rename dispatch_on_cpu() -> dispatch_task() for better
clarity.

This should be enough to handle all the possible decisions made by the
user-space scheduler, making the dispatcher more robust.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-04 10:21:40 +01:00
Andrea Righi
69c1dfc03c scx_rustland: remove unnecessary scx_bpf_dispatch_nr_slots() check
In the dispatch callback we can dispatch tasks to any CPU, according to
the scheduler decisions, so there's no reason to check for the available
dispatch slots in the current CPU only, to determine if we need to stop
dispatching tasks.

Since the scheduler is aware of the idle state of the CPUs (via the CPU
ownership map) it has all the information to automatically regulate the
flow of dispatched tasks and not overflow the dispatch slots, therefore
it is safe to remove this check.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-04 09:41:54 +01:00
Andrea Righi
6b1e7d927d scx_rustland: update comments and documentation in the BPF part
No functional change, only a little polishing, including updates to
comments and documentation to align with the latest changes in the code.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-04 09:40:49 +01:00
Andrea Righi
bb1c32d395 scx_rustland: avoid bypassing the scheduler with pending activities
While bypassing the user-space scheduler can provide some benefits at
reducing the scheduling overhead, doing so underneath the scheduler
while it is actively taking decisions may disrupt its work and have a
negative effect on the overall system performance.

For this reason, activate the logic to bypass the user-space scheduler
only when there is no pending work it.

This change makes the scheduler much more reliable, for example on a
8-cores system it is really easy to trigger short lockups or even
trigger the sched-ext watchdog that kicks out the scheduler, running the
following stress test:

  $ stress-ng -c 128

With this change applied the system remains reasonably responsive and
the scheduler is never disabled by the sched-ext watchdog.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 22:54:14 +01:00
Andrea Righi
5d15d34777 scx_rustland: charge additional time slice to new tasks
Instead of accounting (max_slice_ns / 2) to the vruntime of all the new
tasks, add that to thier regular weighted time delta, as an additional
penalty.

This allows to distinguish new CPU intensive tasks vs new less CPU
intensive tasks, and prioritize the latter over the former.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 22:54:10 +01:00
Andrea Righi
8820af8d36 scx_rustland: enable user-space scheduler to preempt other tasks
Use SCX_ENQ_PREEMPT to dispatch the user-space scheduler. This can help
to mitigate starvation in presence of many cpu hogs (way more than the
amount of available CPUs) running in the system, by giving the scheduler
more chances to drain the amount of tasks that may be starving in a
waiting state.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 22:54:00 +01:00
Andrea Righi
5d9182d9c3 scx_rustland: prioritize interactive workloads
The current implementation of the user-space scheduler is strongly
prioritizing newly created tasks by setting their initial vruntime to
(min_vruntime + 1); this prioritization places them ahead of other tasks
waiting to run.

While this approach is efficient for processing short-lived tasks, it
makes the scheduler vulnerable to fork-bomb attacks and significantly
penalizes interactive workloads (e.g., "foreground" applications), in
particular in the presence of background applications that are spawning
multiple tasks, such as parallel builds.

Instead of prioritizing newly created tasks, do the opposite and account
(max_slice_ns / 2) to their initial vruntime, to make sure they are not
scheduled before the other tasks that are already waiting for the CPU in
the current scheduler run.

This allows to mitigate potential fork-bomb attacks and it strongly
improves the responsiveness of interactive applications (such as UI,
audio/video streams, gaming, etc.).

With this change applied, under certain conditions, scx_rustland can
even outperform the default Linux scheduler.

For example, with a parallel kernel build (make -j32) running in the
background, I can play Terraria with a constant rate of ~30-40 fps,
while the default Linux scheduler can handle only ~20-30 fps under the
same conditions.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 18:28:54 +01:00
Andrea Righi
50b5f6e8c6 scx_rustland: do not update exiting tasks statistics
Avoid updating task information for tasks that are exiting, as they
won't be used by the user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 09:10:20 +01:00
Andrea Righi
b7a9d3775a scx_rustland: schedule non-cpu intensive kthreads normally
With commit a7677fd ("scx_rustland: bypass user-space scheduler for
short-lived kthreads") we were try to mitigate a problem that was
actually introduced by using the wrong formula to evaluate weighted
vruntime, see commit 2900b20 ("scx_rustland: evaluate the proper
vruntime delta").

Reverting that (pseudo-)optimization doesn't seem to introduce any
performance/latency regression and it makes the code more elegant,
therefore drop it.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 07:46:01 +01:00
Andrea Righi
a09482f0ef scx_rustland: notify user-space scheduler about exiting tasks
Instead of implementing a garbage collector to periodically free up
exiting tasks' resources, implement a proper synchronous mechanism to
notify the user-space scheduler about the exiting tasks from the BPF
component, using the .disable() callback.

When the user-space scheduler receives a queued task with a negative CPU
number, it can then release all the resources associated with that task
(which currently includes only the entry in the TaskInfoMap for now).

This allows to get rid of the TaskInfoMap periodic garbage collector
routine, save a lot of syscalls in procfs (used to check if the pids
were still alive), and improve the overall scheduler performance.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-02 12:57:27 +01:00
Andrea Righi
280796c4bd scx_rustland: small code refactoring
No functional change, make the user-space scheduler code a bit more
readable and more Rust idiomatic.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 19:47:30 +01:00
Andrea Righi
2900b208fe scx_rustland: evaluate the proper vruntime delta
The forumla used to evaluate the weighted time delta is not correct,
it's not considering the weight as a percentage. Fix this by using the
proper formula.

Moreover, take into account also the task weight when evaluating the
maximum time delta to account in vruntime and make sure that we never
charge a task more than slice_ns.

This helps to prevent starvation of low priority tasks.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 19:47:30 +01:00
Andrea Righi
90e92ace2d scx_rustland: prevent starvation handling short-lived tasks properly
Prevent newly created short-lived tasks from starving the other tasks
sitting in the user-space scheduler.

This can be done setting an initial vruntime of (min_vruntime + 1) to
newly scheduled tasks, instead of min_vruntime: this ensures a
progressing global vruntime durig each scheduler run, providing a
priority boost to newer tasks (that is still beneficial for potential
short-lived tasks) while also preventing excessive starvation of the
other tasks sitting in the user-space scheduler, waiting to be
dispatched.

Without this change it is really easy to create a stall condition simply
by forking a bunch of short-lived tasks in a busy loop, with this change
applied the scheduler can handle properly the consistent flow of newly
created short-lived tasks, without introducing any stall.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 16:58:28 +01:00
Andrea Righi
676bd88ada bpf_rustland: do not dispatch the scheduler to the global DSQ
Never dispatch the user-space scheduler to the global DSQ, while all
the other tasks are dispatched to the local per-CPU DSQ.

Since tasks are consumed from the local DSQ first and then from the
global DSQ, we may end up starving the scheduler if we dispatch only
this one on the global DSQ.

In fact it is really easy to trigger a stall with a workload that
triggers many context switches in the system, for example (on a 8 cores
system):

 $ stress-ng --cpu 32 --iomix 4 --vm 2 --vm-bytes 128M --fork 4 --timeout 30s

 ...
 09:28:11 [WARN] EXIT: scx_rustland[1455943] failed to run for 5.275s
 09:28:11 [INFO] Unregister RustLand scheduler

To prevent this from happening also dispatch the user-space scheduler on
the local DSQ, using the current CPU where .dispatch() is called, if
possible, or the previously used CPU otherwise.

Apply the same logic when the scheduler is congested: dispatch on the
previously used CPU using the local DSQ.

In this way all tasks will always get the same "dispatch priority" and
we can prevent the scheduler starvation issue.

Note that with this change in place dispatch_global() is never used and
we can get rid of it.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 14:17:23 +01:00
Andrea Righi
0fc46b2be2 scx_rustland: remove SCX_ENQ_LAST check in is_task_cpu_available()
With commit 49f2e7c ("scx_rustland: enable SCX_OPS_ENQ_LAST") we have
enabled SCX_OPS_ENQ_LAST that seems to save some unnecessary user-space
scheduler activations when the system is mostly idle.

We are also checking for the SCX_ENQ_LAST in the enqueue flags, that
apparently it is not needed and we can achieve the same behavior
dropping this check.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 14:17:23 +01:00
Andrea Righi
840260141d scx_rustland: never account more than slice_ns to vruntime
In any case make sure that we never account more than the maximum
slice_ns to a task's vruntime.

This helps to prevent starving a task for too long in the user-space
scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 14:17:23 +01:00
Andrea Righi
61c77b7d87 scx_rustland: clean up old entries in the task map
The user-space scheduler maintains an internal hash map of tasks
information (indexed by their pid). Tasks are only added to this hash
map and never removed. After running the scheduler for a while we may
experience a performance degration, because the hash map keeps growing.

Therefore implement a mechanism of garbage collector to remove the old
entries from the task map (periodically removing pids that don't exist
anymore).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 14:17:23 +01:00
Andrea Righi
27739065bc scx_rustland: rename variable id -> pos for better clarity
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 14:17:23 +01:00
Andrea Righi
1cdcb8af60 scx_rustland: show the CPU where the scheduler is running
In the scheduler statistics reported periodically to stdout, instead of
showing "pid=0" for the CPU where the scheduler is running (like an idle
CPU), show "[self]".

This helps to identify exactly where the user-space scheduler is running
(when and where it migrates, etc.).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-31 17:03:30 +01:00
Andrea Righi
a7677fdf28 scx_rustland: bypass user-space scheduler for short-lived kthreads
Bypass the user-space scheduler for kthreads that still have more than
half of their runtime budget.

As they are likely to release the CPU soon, granting them a substantial
priority boost can enhance the overall system performance.

In the event that one of these kthreads turns into a CPU hog, it will
deplete its runtime budget and therefore it will be scheduled like
any other normal task through the user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-31 16:40:05 +01:00
Andrea Righi
405a11308e scx_rustland: always use dispatch_on_cpu() when possible
Use dispatch_on_cpu() when possible, so that all tasks dispatched by the
user-space scheduler gets the same priority, instead of having some of
them dispatched to the global DSQ and others dispatched to the per-CPU
DSQ.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-31 16:08:31 +01:00
Andrea Righi
49f2e7ce06 scx_rustland: enable SCX_OPS_ENQ_LAST
Make sure the scheduler is not activated if we are deadling with the
last task running.

This allows to consistency reduce scx_rustland CPU usage in systems that
are mostly idle (and avoid unnecessary power consumption).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-31 16:06:45 +01:00
Andrea Righi
0522219bea scx_rustland: prevent dispatching multiple tasks on the same idle cpu
When a task is dispatched we always try to pick the previously used CPU
(if idle) to minimize the migration overhead. Alternatively, if such CPU
is not available, we pick any other idle CPU in the system.

However, we don't update the list of idle CPUs as we dispatch tasks,
therefore we may end up sending multiple tasks to the same idle CPU (if
their previously used CPU is the same) and we may even skip some idle
CPUs completely.

Change this logic to make sure that we never dispatch multiple tasks to
the same idle CPU, by updating the list of idle CPUs as we send tasks to
the BPF dispatcher.

This also avoids dispatching tasks with a closely matched vruntime to
the same CPU, thereby negating the advantages of the vruntime ordering.
With this change in place, we ensure that tasks with a similar vruntime
are dispatched to different CPUs, leading to significant improvements in
latency performance.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-31 09:37:39 +01:00
Andrea Righi
38145f8dc9 scx_rustland: check CPU selection validity
When the scheduler decides to assign a different CPU to the task always
make sure the assignment is valid according to the task cpumask. If it's
not valid simply dispatch the task to the global DSQ.

This prevents the scheduler from exiting with errors like this:

  09:11:02 [WARN] EXIT: SCX_DSQ_LOCAL[_ON] verdict target cpu 7 not allowed for gcc[440718]

In the future we may want move this check directly into the user-space
scheduler, but for now let's keep this check in the BPF dispatcher as a
quick fix.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-30 10:40:46 +01:00
Andrea Righi
1a2c9f5fd4 scx_rustland: improve scheduler's idle CPU selection
The current CPU selection logic in the scheduler presents some
inefficiencies.

When a task is drained from the BPF queue, the scheduler immediately
checks whether the CPU previously assigned to the task is still idle,
assigning it if it is. Otherwise, it iterates through available CPUs,
always starting from CPU #0, and selects the first idle one without
updating its state. This approach is consistently applied to the entire
batch of tasks drained from the BPF queue, resulting in all of them
being assigned to the same idle CPU (also with a higher likelihood of
allocation to lower CPU ids rather than higher ones).

While dispatching a batch of tasks to the same idle CPU is not
necessarily problematic, a fairer distribution among the list of idle
CPUs would be preferable.

Therefore change the CPU selection logic to distribute tasks equally
among the idle CPUs, still maintaining the preference for the previously
used one. Additionally, apply the CPU selection logic just before tasks
are dispatched, rather than assigning a CPU when tasks are drained from
the BPF queue. This adjustment is important, because tasks may linger in
the scheduler's internal structures for a bit and the idle state of the
CPUs in the system may change during that period.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-30 10:34:08 +01:00
Andrea Righi
e90bc923f9 scx_rustland: introduce nr_waiting concept
We want to activate the user-space scheduler only when there are pending
tasks that require scheduling actions.

To do so we keep track of the queued tasks via nr_queued, that is
incremented in .enqueue() when a task is sent to the user-space
scheduler and decremented in .dispatch() when a task is dispatched.

However, we may trigger an unbalance if the same pid is sent to the
scheduler multiple times (because the scheduler store all the tasks by
their unique pid).

When this happens nr_queued is never decremented back to 0, leading the
user-space scheduler to constantly spin, even if there's no activity to
do.

To prevent this from happening split nr_queued into nr_queued and
nr_scheduled. The former will be updated by the BPF component every time
that a task is sent to the scheduler and it's up to the user-space
scheduler to reset the counter when the queue is fully dreained. The
latter is maintained by the user-space scheduler and represents the
amount of tasks that are still processed by the scheduler and are
waiting to be dispatched.

The sum of nr_queued + nr_scheduled will be called nr_waiting and we can
rely on this metric to determine if the user-space scheduler has some
pending work to do or not.

This change makes rust_rustland more reliable and it strongly reduces
the CPU usage of the user-space scheduler by eliminating a lot of
unnecessary activations.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-29 21:15:04 +01:00
Andrea Righi
d67dfe50f9 scx_rustland: treat the CPU running the user-space scheduler as idle
Considering the CPU where the user-space scheduler is running as busy
doesn't really provide any benefit, since the user-space scheduler is
constantly dispatching an amount of tasks equal to the amount of idle
CPUs and then yields (therefore its own CPU should be considered idle).

Considering the CPU where the user-space scheduler is running as busy
doesn't provide any benefit, as the scheduler consistently dispatches
tasks equal to the number of idle CPUs and then yields (therefore its
own CPU should be considered idle).

This also allows to reduce the overall user-space scheduler CPU
utilization, especially when the system is mostly idle, without
introducing any measurable performance regression.

Measuring the average CPU utilization of a (mostly) idle system over a
time period of 60 sec:

 - wihout this patch: 5.41% avg cpu util
 - with this patch:   2.26% avg cpu util

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-29 21:14:58 +01:00
Andrea Righi
6df4d7e0c6 scx_rustland: introduce an update_idle() callback
Move the logic to activate the userspace scheduler to an update_idle()
callback, which is called when the CPU is about to go idle.

This disables the built-in idle tracking mechanism, so it allows to rely
completely on the internal CPU ownership logic (via get_cpu_owner() and
set_cpu_owner()) and it also allows to share the idle state with the
user-space scheduler via the BPF_MAP_TYPE_ARRAY cpu_map.

Moreover, when the user-space scheduler is activated, kick the idle cpu
to trigger immediate dispatch and avoid bubbles in the scheduling
pipeline.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-28 14:41:08 +01:00
Andrea Righi
1baae38e7f Revert "scx_rustland: always dispatch kthreads on the local CPU"
This reverts commit 9237e1d ("scx_rustland: always dispatch kthreads on
the local CPU").

Do not always prioritize all kthreads, we may have unbound workqueue
workers that can consume a lot of CPU cycles (e.g., encryption workers),
so we definitely want to apply the scheduling for those.

Therefore, restore the old behavior to prioritize only per-CPU kthreads.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-28 14:40:03 +01:00
Andrea Righi
9237e1d835 scx_rustland: always dispatch kthreads on the local CPU
Adding extra overhead to any kthread can potentially slow down the
entire system, so make sure this never happens by dispatching all
kthreads directly on the same local CPU (not just the per-CPU kthreads),
bypassing the user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-27 14:15:46 +01:00
Andrea Righi
f0ece7af6b scx_rustland: wake-up user-space scheduler when a CPU is released
Trigger the user-space scheduler only upon a task's CPU release event
(avoiding its activation during each enqueue event) and only if there
are tasks waiting to be processed by the user-space scheduler.

This should save unnecessary calls to the user-space scheduler, reducing
the overall overhead of the scheduler.

Moreover, rename nr_enqueues to nr_queued and store the amount of tasks
currently queued to the user-space scheduler (that are waiting to be
dispatched).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-27 14:15:46 +01:00
Andrea Righi
7d01be9568 scx_rustland: provide get/set_cpu_owner()
Provide the following primitives to get and set CPU ownership in the BPF
part. This improves code readability and these primitives can be used by
the BPF part as a baseline to implement a better CPU idle tracking in
the future.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-27 14:15:39 +01:00
Andrea Righi
cd7e1c6248 scx_rustland: clarify BPF / user-space interlocking
BPF doesn't have full memory model yet, and while strict atomicity might
not be necessary in this context, it is advisable to enhance clarity in
the interlocking model.

To achieve this, provide the following primitives to operate on
usersched_needed:

  static void set_usersched_needed(void)

  static bool test_and_clear_usersched_needed(void)

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-26 14:28:24 +01:00
Andrea Righi
e038a530ae scx_rustland: dispatch tasks in batch
Dispatch tasks in a batch equal to the amount of idle CPUs in the
system.

This allows to reduce the pressure on the dispatcher queues, improving
the effectiveness of the scheduler (by having more tasks sitting in the
scheduler task pool) and mitigating potential priority inversion issues.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-23 10:44:03 +01:00
Andrea Righi
4d98862674 scx_rustland: expose CPU information to the user-space scheduler
Provide an interface for the BPF dispatcher and user-space scheduler to
share CPU information. This information can empower the user-space
scheduler to make more informed decisions and enable the implementation
of a broader range of scheduling policies.

With this change the BPF dispatcher provides a CPU map (one entry per
CPU) that stores the pid that is running on each CPU (0 if the CPU is
idle). The CPU map is updated by the BPF dispatcher in the .running()
and .stopping() callbacks.

The dispatcher then sends to the user-space scheduler a suggestion of
the candidate CPU for each task that needs to run (that is always the
previously used CPU), along with all the task's information.

The user-space scheduler can decide to confirm the selected CPU or to
choose a different one, using all the shared CPU information.

Lastly, the selected CPU is communicated back to the dispatcher along
with all the task's information and the BPF dispatcher takes care of
executing the task on the selected CPU, eventually triggering a
migration.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-23 10:38:56 +01:00
Andrea Righi
968ac80a3f scx_rustland: handle graceful vs non-graceful exit
Do not report an exit error message if it's empty. Moreover, distinguish
between a graceful exit vs a non-graceful exit.

In general, try to follow the behavior of user_exit_info.h for the C
schedulers.

NOTE: in the future the whole exit handling probably can be moved to a
more generic place (scx_utils) to prevent code duplication across
schedulers and also to prevent small inconsistencies like this one.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-22 19:44:14 +01:00
Andrea Righi
f7f0e3236c scx_rustland: rename from scx_rustlite
Rename scx_rustlite to scx_rustland to better represent the mirroring of
scx_userland (in C), but implemented in Rust.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-22 00:20:14 +01:00