Commit Graph

271 Commits

Author SHA1 Message Date
Andrea Righi
1da2983804 scx_rustland: get rid of force_local
Now that we can dispatch directly from select_cpu() we can make the code
more compact and readable by removing the force_local logic.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-09 22:11:07 +01:00
Andrea Righi
6ead675fb6 scx_rustland: add a link to the live demo in the README
Update the README.md adding a link to a live demo video of the
scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-09 22:11:07 +01:00
Tejun Heo
74923c6cdb
Merge pull request #76 from sched-ext/htejun
Bump versions
2024-01-08 18:51:47 -10:00
Tejun Heo
942b0269b8 Bump versions
After updates to reflect the updated init and direct dispatch API, the
schedulers aren't compatible with older kernels. Bump versions and publish
releases.
2024-01-08 18:49:54 -10:00
David Vernet
4ff504a65c
Merge pull request #75 from sched-ext/htejun
scx: Build fix after kernel update
2024-01-08 21:22:20 -06:00
Tejun Heo
552b75a9c7 scx: Build fix after kernel update
In the latest kernel, sched_ext API has changed in two areas:

- ops.prep_enable/cancel_enable/enable/disable() replaced with
  ops.init_task/enable/disable/exit_task().

- scx_bpf_dispatch() can now be called from ops.select_cpu(). Also,
  SCX_ENQ_LOCAL flag is removed. Instead, users can call
  scx_bpf_select_cpu_dfl() from ops.select_cpu() and use the @is_idle out
  param value to determine whether to dispatch directly.

This commit updates all schedules so that they build.

- Init functions renamed / merged / split.

- ops.select_cpu() is added to several schedulers and local direct
  disptching logic is moved there.

This is the minimum update which is need to make the schedulers build and
work. It needs further update to e.g. move vtime udpates to ops.enable().
2024-01-08 14:48:24 -10:00
Tejun Heo
0ed47cd9a3
Merge pull request #74 from sched-ext/scx-rustland-multicore-fixes
scx_rustland: multicore fixes
2024-01-08 09:21:13 -10:00
Andrea Righi
1ea5aebfb4 scx_rustland: always consider slice_ns as maximum time slice
With the introduction of a the dynamic time slice that scales down based
on the number of tasks in the system, there is no need anymore to apply
a constant scaling factor to time slice to extend the range of the
allowed time slices.

Therefore, get rid of the static scaling and use slice_ns as the upper
limit for the time slice accounted to the tasks.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 19:22:38 +01:00
Andrea Righi
9b482f48f1 scx_rustland: determine the amount of cores via /proc/stat
libbpf_rs::num_possible_cpus() may take into account multi-threads
multi-cores information, that are not used efficiently by the scheduler
at the moment.

For simplicity rely on /proc/stat to determine the amount of CPUs that
can be used by the scheduler and provide a proper abstraction to access
this information from the bpf Rust module.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 19:11:25 +01:00
Andrea Righi
0d107d6220 scx_rustland: return the proper cpu value from get_task_cpu()
Fix the ternary operator expression to return the CPU id, instead of the
boolean result of the condition.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 19:10:59 +01:00
Andrea Righi
b008830ee8
Merge pull request #73 from sched-ext/scx-rustland-dynamic-timeslice
scx_rustland: dynamic time slice
2024-01-08 09:53:36 +01:00
Andrea Righi
fa6915cc0a scx_rustland: simplify update_enqueued()
With the introduction of a variable time slice that scales down in
function of the amount of waiting tasks, the scheduler is able to handle
a steady stream of newly spawned tasks, without having to de-prioritize
them to guarantee a good level of system responsiveness.

Hence, the logic for de-prioritizing new tasks can be removed, as it
currently doesn't provide any measurable benefits. In fact, it even
proves counterproductive as it can implicitly slow down the interactive
performance of shell sessions when the system is overloaded with a
significant amount of CPU hogs (e.g, `stress-ng -c 128`).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 07:38:52 +01:00
Andrea Righi
bf98154ee1 scx_rustland: use dynamic time slice in the user-space scheduler
Implement a simple logic in the user-space scheduler to automatically
adjust the tasks' time slice: reduce the time slice by a scaling factor
of (nr_waiting / nr_cpus + 1), where nr_waiting is the amount of tasks
waiting in the scheduler and nr_cpus is the amount of CPUs in the
system.

Using a fine-grained time slice as the number of tasks in the system
grows, improves responsiveness of low-latency activities (e.g., audio,
video games), also in presence of other CPU-intensive tasks that are
concurrently running in the system.

On the other hand, extending the time slice when only a limited number
of tasks are active in the system contributes to an enhancement in the
overall system throughput and a reduced amount of context switches.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 07:38:52 +01:00
Andrea Righi
303c4ea548 scx_rustland: dynamic time slice support
Add to BpfScheduler() the new methods set_effective_slice_us() and
get_effective_slice_us().

These methods can be used by the user-space scheduler to dynamically
adjust (and retrieve) the effective time slice used to dispatch tasks
within the BPF dispatcher.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 07:35:31 +01:00
Tejun Heo
63fe690271
Merge pull request #72 from sched-ext/scx-rustland-enhancements
scx_rustland: enhancements
2024-01-07 11:16:36 -10:00
Andrea Righi
2a32d81859 scx_rustland: store default slice_ns in the scheduler class
Cache slice_ns into the main scheduler class to avoid accessing it via
self.bpf.skel.rodata().slice_ns every single time.

This also makes the scheduler code more clear and more abstracted from
the BPF details.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-07 16:14:51 +01:00
Andrea Righi
8ccbbdadee scx_userland: improve BPF logging
Always report task comm, nr_queued and nr_scheduled in the log messages.
Moreover, report also task name (comm) and cpu when possible.

All these extra information can be really helpful to trace and debug
scheduling issues.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-07 16:14:51 +01:00
Andrea Righi
295873ac41 scx_rustland: always dispatch per-CPU kthreads from enqueue
We allow tasks to bypass the user-space scheduler and be dispatched
directly using a shortcut in the enqueue path, if their running CPU is
immediately available or if the task is per-CPU kthread.

However, the shortcut is disabled if the user-space scheduler has some
pending activities to do (to avoid disrupting too much its decision).

In this case the shortcut is disabled also for per-CPU kthreads and that
may cause priority-inversion problems in the system, triggering some
stall of some per-CPU kthreads (such as rcuog/N) and short system
lockups, if the system is overloaded.

Prevent this by always enabing the dispatch shortcut for per-CPU
kthreads.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
0c3bdb16fe scx_rustland: prevent using SCX_DSQ_LOCAL_ON from enqueue()
When we fail to push a task to the queued BPF map we fallback to direct
dispatch, but we can't use SCX_DSQ_LOCAL_ON. So, make sure to use
SCX_DSQ_GLOBAL in this case to prevent scheduler crashes.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
05d997c539 scx_rustland: more robust CPU selection logic in the dispatcher
Instead of just trying the target CPU and the previously used CPU, we
could cycle among all the available CPUs (if both those CPUs cannot be
used), before using the global DSQ.

This allows to not de-prioritize too much tasks that can't be scheduled
on the CPU selected by the scheduler (or their previously used CPU), and
we can still dispatch them using SCX_DSQ_LOCAL_ON, like any other task.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
18a990ae82 scx_rustland: assign min_vruntime before time slice evaluation
Assign min_vruntime to the task before the weighted time slice is
evaluated, then add the time slice.

In this way we still ensure that the task's vruntime is in the range
(min_vruntime + 1, min_vruntime + max_slice_ns], but we don't nullify
the effect of the evaluated time slice if the starting vruntime of the
task is too small.

Also change update_enqueued() to return the evaluated weighted time
slice (that can be used in the future).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
92109c95a9 scx_rustland: small TaskTree.push() refactoring
Change TaskTree.push() to accept directly a Task object, rather than
each individual attribute. Moreover, Task attributes don't need to be
public, since both TaskTree and Task are only used locally.

This makes the code more elegant and more readable.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Tejun Heo
dfcc52c866
Merge pull request #71 from jordalgo/bump-rust
bump scx_rusty and scx_layered
2024-01-05 07:03:49 +09:00
Jordan Rome
661ea57c5c bump scx_rusty and scx_layered
These were supposed to be bumped in this commit:
fed1dae9da
2024-01-04 13:57:29 -08:00
Andrea Righi
96f3eb42be
Merge pull request #68 from sched-ext/scx-rustland-refactoring
scx_rustland: refactoring
2024-01-04 20:42:30 +01:00
Andrea Righi
7813992896 scx_rustland: introduce nr_failed_dispatches
Introduce a new counter to report the amount of failed dispatches: if
the scheduler designates a target CPU for a task, and both the chosen
CPU and the previously utilized one are unavailable when the task is
dispatched, the task will be sent to the global DSQ, and the counter
will be incremented.

Also mark all the methods to access these statistics counters as
optional. In the future we may also provide a "verbose" option and show
these statistics only when the scheduler runs in verbose mode.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-04 17:36:06 +01:00
David Vernet
c9494e00ed
Merge pull request #67 from jordalgo/rust-readmes
Add README files for each rust scheduler
2024-01-04 09:49:33 -06:00
Andrea Righi
796a7ebc0e scx_rustland: provide an abstraction layer for the BPF component
Move the code responsible for interfacing with the BPF component into
its own module and provide high-level abstractions for the user-space
scheduler, hiding all the internal BPF implementation details.

This makes the user-space scheduler code much more readable and it
allows potential developers/contributors that want to focus at the pure
scheduling details to modify the scheduler in a generic way, without
having to worry about the internal BPF details.

In the future we may even decide to provide the BPF abstraction as a
separate crate, that could be used as a baseline to implement user-space
schedulers in Rust.

API overview
============

The main BPF interface is provided by BpfScheduler(). When this object
is initialized it will take care of registering and initializing the BPF
component.

Then the scheduler can use the BpfScheduler() instance to receive tasks
(in the form of QueuedTask object) and dispatch tasks (in the form of
DispatchedTask objects), using respectively the methods dequeue_task()
and dispatch_task().

The CPU ownership map can be accessed using the method get_cpu_pid(),
this also allows to keep track of the idle and busy CPUs, with the
corrsponding PIDs associated to them.

BPF counters and statistics can be accessed using the methods
nr_*_mut(), in particular nr_queued_mut() and nr_scheduled_mut() can be
updated to notify the BPF component if the user-space scheduler has some
pending work to do or not.

Finally the methods read_bpf_exit_kind() and report_bpf_exit_kind() can
be used respectively to read the exit code and exit message from the BPF
component, when the scheduler is unregistered.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-04 16:49:09 +01:00
Jordan Rome
5bacefcdbe Add README files for each rust scheduler
This because each scheduler has it's own Rust Crate
and it's better if they had a README associated with each one.

https://crates.io/crates/scx_layered
2024-01-04 07:35:44 -08:00
Andrea Righi
7c11837a61 scx_rustland: make dispatcher more robust
We always try to use the current CPU (from the .dispatch() callback) to
run the user-space scheduler itself and if the current CPU is not usable
(according to the cpumask) we just re-use the previouly used CPU.

However, if the previously used CPU is also not usable, we may trigger
the following error:

 sched_ext: runtime error (SCX_DSQ_LOCAL[_ON] verdict target cpu 4 not allowed for scx_rustland[256201])

Potentially this can also happen with any task, so improve the dispatch
logic as following:

 - dispatch on the target CPU, if usable
 - otherwise dispatch on the previously used CPU, if usable
 - otherwise dispatch on the global DSQ

Moreover, rename dispatch_on_cpu() -> dispatch_task() for better
clarity.

This should be enough to handle all the possible decisions made by the
user-space scheduler, making the dispatcher more robust.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-04 10:21:40 +01:00
Andrea Righi
69c1dfc03c scx_rustland: remove unnecessary scx_bpf_dispatch_nr_slots() check
In the dispatch callback we can dispatch tasks to any CPU, according to
the scheduler decisions, so there's no reason to check for the available
dispatch slots in the current CPU only, to determine if we need to stop
dispatching tasks.

Since the scheduler is aware of the idle state of the CPUs (via the CPU
ownership map) it has all the information to automatically regulate the
flow of dispatched tasks and not overflow the dispatch slots, therefore
it is safe to remove this check.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-04 09:41:54 +01:00
Andrea Righi
6b1e7d927d scx_rustland: update comments and documentation in the BPF part
No functional change, only a little polishing, including updates to
comments and documentation to align with the latest changes in the code.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-04 09:40:49 +01:00
Tejun Heo
c6b9173bf4
Merge pull request #66 from sched-ext/scx-userland-reliability
Improve scx_rustland reliability
2024-01-04 10:46:35 +09:00
Andrea Righi
bb1c32d395 scx_rustland: avoid bypassing the scheduler with pending activities
While bypassing the user-space scheduler can provide some benefits at
reducing the scheduling overhead, doing so underneath the scheduler
while it is actively taking decisions may disrupt its work and have a
negative effect on the overall system performance.

For this reason, activate the logic to bypass the user-space scheduler
only when there is no pending work it.

This change makes the scheduler much more reliable, for example on a
8-cores system it is really easy to trigger short lockups or even
trigger the sched-ext watchdog that kicks out the scheduler, running the
following stress test:

  $ stress-ng -c 128

With this change applied the system remains reasonably responsive and
the scheduler is never disabled by the sched-ext watchdog.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 22:54:14 +01:00
Andrea Righi
5d15d34777 scx_rustland: charge additional time slice to new tasks
Instead of accounting (max_slice_ns / 2) to the vruntime of all the new
tasks, add that to thier regular weighted time delta, as an additional
penalty.

This allows to distinguish new CPU intensive tasks vs new less CPU
intensive tasks, and prioritize the latter over the former.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 22:54:10 +01:00
Andrea Righi
8820af8d36 scx_rustland: enable user-space scheduler to preempt other tasks
Use SCX_ENQ_PREEMPT to dispatch the user-space scheduler. This can help
to mitigate starvation in presence of many cpu hogs (way more than the
amount of available CPUs) running in the system, by giving the scheduler
more chances to drain the amount of tasks that may be starving in a
waiting state.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 22:54:00 +01:00
Tejun Heo
020ae33fe2
Merge pull request #65 from jordalgo/arch-map-update
Add new archs for bpf_builder
2024-01-04 04:02:33 +09:00
Jordan Rome
6caf6c5c99 Add new archs for bpf_builder
This is to fix fedora build failures for these archs:
s390x and ppc64le

Error:
```
---- bpf_builder::tests::test_bpf_builder_new stdout ----
thread 'bpf_builder::tests::test_bpf_builder_new' panicked at src/bpf_builder.rs:592:9:
Failed to create BpfBuilder (Err(CPU arch "s390x" not found in ARCH_MAP))
```

https://koji.fedoraproject.org/koji/taskinfo?taskID=111114326
2024-01-03 10:50:33 -08:00
David Vernet
9f1a3973d8
Merge pull request #64 from arighi/improve-interactive-workloads
scx_rustland: improve interactive workloads
2024-01-03 12:10:26 -06:00
Andrea Righi
5d9182d9c3 scx_rustland: prioritize interactive workloads
The current implementation of the user-space scheduler is strongly
prioritizing newly created tasks by setting their initial vruntime to
(min_vruntime + 1); this prioritization places them ahead of other tasks
waiting to run.

While this approach is efficient for processing short-lived tasks, it
makes the scheduler vulnerable to fork-bomb attacks and significantly
penalizes interactive workloads (e.g., "foreground" applications), in
particular in the presence of background applications that are spawning
multiple tasks, such as parallel builds.

Instead of prioritizing newly created tasks, do the opposite and account
(max_slice_ns / 2) to their initial vruntime, to make sure they are not
scheduled before the other tasks that are already waiting for the CPU in
the current scheduler run.

This allows to mitigate potential fork-bomb attacks and it strongly
improves the responsiveness of interactive applications (such as UI,
audio/video streams, gaming, etc.).

With this change applied, under certain conditions, scx_rustland can
even outperform the default Linux scheduler.

For example, with a parallel kernel build (make -j32) running in the
background, I can play Terraria with a constant rate of ~30-40 fps,
while the default Linux scheduler can handle only ~20-30 fps under the
same conditions.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 18:28:54 +01:00
Andrea Righi
50b5f6e8c6 scx_rustland: do not update exiting tasks statistics
Avoid updating task information for tasks that are exiting, as they
won't be used by the user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 09:10:20 +01:00
Andrea Righi
b7a9d3775a scx_rustland: schedule non-cpu intensive kthreads normally
With commit a7677fd ("scx_rustland: bypass user-space scheduler for
short-lived kthreads") we were try to mitigate a problem that was
actually introduced by using the wrong formula to evaluate weighted
vruntime, see commit 2900b20 ("scx_rustland: evaluate the proper
vruntime delta").

Reverting that (pseudo-)optimization doesn't seem to introduce any
performance/latency regression and it makes the code more elegant,
therefore drop it.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 07:46:01 +01:00
Tejun Heo
70088fd7da
Merge pull request #63 from sched-ext/userland_updates
Userland scheduler updates
2024-01-03 10:15:18 +09:00
David Vernet
e8978ebe23
scx_userland: Introduce ops.update_idle() callback
We can sometimes hit scenarios in the scx_userland scheduler where there
is work to be done in user space, but we incorrectly fail to run the
user space scheduler. In order to avoid this, we can use global
variables that are set from both BPF and user space. The BPF-side
variable reflects when one or more tasks have been enqueued, and the
user space-side variable reflects when user space has received tasks but
has not yet dispatched them.

In the ops.update_idle() callback, we can check these variables and send
a resched IPI to a core to ensure that the user-space scheduler is
always scheduled when there's work to be done.

Signed-off-by: David Vernet <void@manifault.com>
2024-01-02 16:29:19 -06:00
David Vernet
620abac46f
Merge pull request #62 from arighi/c-include-portability
scheds: c: improve build portability
2024-01-02 11:51:43 -06:00
Andrea Righi
bcbce040b6 scheds: c: improve build portability
Improve build portability by including asm-generic/errno.h, instead of
linux/errno.h.

The difference between these two headers can be summarized as following:

  - asm-generic/errno.h contains generic error code definitions that are
    intended to be common across different architectures,

  - linux/errno.h includes architecture-specific error codes and
    provides additional (or overrides) error code definitions based on
    the specific architecture where the code is compiled.

Considering the architecture-independent nature of scx, the advantages
of being able to use architecture-specific error codes are marginal or
negligible (and we should probably discourage using them).

Moving towards asm-generic/errno.h, however, allows the removal of
cross-compilation dependencies (such as the gcc-multilib package in
Debian/Ubuntu) and improves the code portability across various
architectures and distributions.

This also allows to remove a symlink hack from the github workflow.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-02 17:39:46 +01:00
David Vernet
d05c7cf6c3
Merge pull request #51 from arighi/virtme-ng-github-workflow
test the schedulers in the github workflow using virtme-ng
2024-01-02 08:43:54 -06:00
Tejun Heo
54363be254
Merge pull request #61 from arighi/scx-rustland-exiting-tasks
scx_rustland: notify user-space scheduler about exiting tasks
2024-01-02 21:59:36 +09:00
Andrea Righi
a09482f0ef scx_rustland: notify user-space scheduler about exiting tasks
Instead of implementing a garbage collector to periodically free up
exiting tasks' resources, implement a proper synchronous mechanism to
notify the user-space scheduler about the exiting tasks from the BPF
component, using the .disable() callback.

When the user-space scheduler receives a queued task with a negative CPU
number, it can then release all the resources associated with that task
(which currently includes only the entry in the TaskInfoMap for now).

This allows to get rid of the TaskInfoMap periodic garbage collector
routine, save a lot of syscalls in procfs (used to check if the pids
were still alive), and improve the overall scheduler performance.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-02 12:57:27 +01:00
Tejun Heo
6c437ae2b6
Merge pull request #60 from arighi/scx-rustland-prevent-starvation
scx_rustland: prevent starvation and improve responsiveness
2024-01-02 13:46:20 +09:00