Commit Graph

465 Commits

Andrea Righi
61c77b7d87 scx_rustland: clean up old entries in the task map
The user-space scheduler maintains an internal hash map of task
information (indexed by pid). Tasks are only added to this hash map and
never removed, so after running the scheduler for a while we may
experience a performance degradation because the hash map keeps growing.

Therefore, implement a garbage collection mechanism that periodically
removes old entries from the task map (pids that no longer exist).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 14:17:23 +01:00
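
For reference, a minimal sketch of such a garbage-collection pass,
assuming the task map is a plain HashMap keyed by pid (names are
illustrative, not the actual scx_rustland code):

  use std::collections::HashMap;
  use std::path::Path;

  struct TaskInfo {
      vruntime: u64, // placeholder for the per-task state kept by the scheduler
  }

  // Periodically drop entries whose pid no longer exists: the
  // disappearance of /proc/<pid> is a simple (if racy) liveness check.
  fn gc_task_map(tasks: &mut HashMap<i32, TaskInfo>) {
      tasks.retain(|&pid, _| Path::new(&format!("/proc/{}", pid)).exists());
  }
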
Andrea Righi
27739065bc scx_rustland: rename variable id -> pos for better clarity
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 14:17:23 +01:00
Tejun Heo
70803d5e14
Merge pull request #59 from arighi/lowlatency-improvements
scx_rustland: lowlatency improvements
2024-01-01 06:14:50 +09:00
Andrea Righi
1cdcb8af60 scx_rustland: show the CPU where the scheduler is running
In the scheduler statistics reported periodically to stdout, instead of
showing "pid=0" for the CPU where the scheduler is running (like an idle
CPU), show "[self]".

This helps to identify exactly where the user-space scheduler is running
(when and where it migrates, etc.).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-31 17:03:30 +01:00
Andrea Righi
a7677fdf28 scx_rustland: bypass user-space scheduler for short-lived kthreads
Bypass the user-space scheduler for kthreads that still have more than
half of their runtime budget.

As they are likely to release the CPU soon, granting them a substantial
priority boost can enhance the overall system performance.

In the event that one of these kthreads turns into a CPU hog, it will
deplete its runtime budget and therefore it will be scheduled like
any other normal task through the user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-31 16:40:05 +01:00
Andrea Righi
405a11308e scx_rustland: always use dispatch_on_cpu() when possible
Use dispatch_on_cpu() when possible, so that all tasks dispatched by the
user-space scheduler get the same priority, instead of having some of
them dispatched to the global DSQ and others dispatched to the per-CPU
DSQ.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-31 16:08:31 +01:00
Andrea Righi
49f2e7ce06 scx_rustland: enable SCX_OPS_ENQ_LAST
Make sure the scheduler is not activated if we are dealing with the
last task running.

This consistently reduces scx_rustland's CPU usage on systems that are
mostly idle (and avoids unnecessary power consumption).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-31 16:06:45 +01:00
Tejun Heo
804180a74a
Merge pull request #58 from arighi/scx-rustland-improve-idle-cpu-assignment
scx_rustland: prevent dispatching multiple tasks on the same idle cpu
2023-12-31 18:00:47 +09:00
Andrea Righi
0522219bea scx_rustland: prevent dispatching multiple tasks on the same idle cpu
When a task is dispatched we always try to pick the previously used CPU
(if idle) to minimize the migration overhead. Alternatively, if such CPU
is not available, we pick any other idle CPU in the system.

However, we don't update the list of idle CPUs as we dispatch tasks,
therefore we may end up sending multiple tasks to the same idle CPU (if
their previously used CPU is the same) and we may even skip some idle
CPUs completely.

Change this logic to make sure that we never dispatch multiple tasks to
the same idle CPU, by updating the list of idle CPUs as we send tasks to
the BPF dispatcher.

This also avoids dispatching tasks with a closely matched vruntime to
the same CPU, thereby negating the advantages of the vruntime ordering.
With this change in place, we ensure that tasks with a similar vruntime
are dispatched to different CPUs, leading to significant improvements in
latency performance.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-31 09:37:39 +01:00
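
The gist of the fix, as a rough sketch with hypothetical names: keep the
idle CPUs in a set and remove each CPU as soon as a task is dispatched
to it, so the next task in the same batch cannot pick it again:

  use std::collections::HashSet;

  // Pick a CPU for a task, preferring its previously used CPU, and mark
  // it busy immediately so later tasks in the batch cannot choose it too.
  fn pick_idle_cpu(idle_cpus: &mut HashSet<i32>, prev_cpu: i32) -> Option<i32> {
      let cpu = if idle_cpus.contains(&prev_cpu) {
          prev_cpu
      } else {
          *idle_cpus.iter().next()?
      };
      idle_cpus.remove(&cpu);
      Some(cpu)
  }
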
Tejun Heo
641f9b76e9
Merge pull request #57 from arighi/scx-rustland-improve-cpu-selection
scx_rustland: improve scheduler cpu selection
2023-12-30 21:56:48 +09:00
Andrea Righi
38145f8dc9 scx_rustland: check CPU selection validity
When the scheduler decides to assign a different CPU to the task, always
make sure the assignment is valid according to the task's cpumask. If
it's not valid, simply dispatch the task to the global DSQ.

This prevents the scheduler from exiting with errors like this:

  09:11:02 [WARN] EXIT: SCX_DSQ_LOCAL[_ON] verdict target cpu 7 not allowed for gcc[440718]

In the future we may want to move this check directly into the
user-space scheduler, but for now let's keep it in the BPF dispatcher as
a quick fix.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-30 10:40:46 +01:00
Andrea Righi
1a2c9f5fd4 scx_rustland: improve scheduler's idle CPU selection
The current CPU selection logic in the scheduler presents some
inefficiencies.

When a task is drained from the BPF queue, the scheduler immediately
checks whether the CPU previously assigned to the task is still idle,
assigning it if it is. Otherwise, it iterates through available CPUs,
always starting from CPU #0, and selects the first idle one without
updating its state. This approach is consistently applied to the entire
batch of tasks drained from the BPF queue, resulting in all of them
being assigned to the same idle CPU (also with a higher likelihood of
allocation to lower CPU ids rather than higher ones).

While dispatching a batch of tasks to the same idle CPU is not
necessarily problematic, a fairer distribution among the list of idle
CPUs would be preferable.

Therefore change the CPU selection logic to distribute tasks equally
among the idle CPUs, still maintaining the preference for the previously
used one. Additionally, apply the CPU selection logic just before tasks
are dispatched, rather than assigning a CPU when tasks are drained from
the BPF queue. This adjustment is important, because tasks may linger in
the scheduler's internal structures for a bit and the idle state of the
CPUs in the system may change during that period.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-30 10:34:08 +01:00
Tejun Heo
474a14970e
Merge pull request #56 from arighi/scx-rustland-reduce-scheduler-overhead
scx_rustland: reduce scheduler overhead
2023-12-30 08:02:09 +09:00
Andrea Righi
e90bc923f9 scx_rustland: introduce nr_waiting concept
We want to activate the user-space scheduler only when there are pending
tasks that require scheduling actions.

To do so, we keep track of the queued tasks via nr_queued, which is
incremented in .enqueue() when a task is sent to the user-space
scheduler and decremented in .dispatch() when a task is dispatched.

However, we may introduce an imbalance if the same pid is sent to the
scheduler multiple times (because the scheduler stores all tasks by
their unique pid).

When this happens nr_queued is never decremented back to 0, leading the
user-space scheduler to constantly spin, even if there's no activity to
do.

To prevent this from happening, split nr_queued into nr_queued and
nr_scheduled. The former is updated by the BPF component every time a
task is sent to the scheduler, and it's up to the user-space scheduler
to reset the counter once the queue is fully drained. The latter is
maintained by the user-space scheduler and represents the number of
tasks still being processed by the scheduler and waiting to be
dispatched.

The sum nr_queued + nr_scheduled is called nr_waiting, and we can rely
on this metric to determine whether the user-space scheduler has pending
work to do.

This change makes scx_rustland more reliable and significantly reduces
the CPU usage of the user-space scheduler by eliminating many
unnecessary activations.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-29 21:15:04 +01:00
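
As a rough illustration of the mechanism (hypothetical field names, not
the actual code), the user-space scheduler only keeps running while
nr_waiting is non-zero:

  struct Counters {
      nr_queued: u64,    // updated by the BPF component, reset by user space
      nr_scheduled: u64, // tasks still held in the user-space task pool
  }

  impl Counters {
      // There is pending work only if tasks are either queued by BPF or
      // still sitting in the user-space pool waiting to be dispatched.
      fn nr_waiting(&self) -> u64 {
          self.nr_queued + self.nr_scheduled
      }

      fn has_pending_work(&self) -> bool {
          self.nr_waiting() > 0
      }
  }
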
Andrea Righi
d67dfe50f9 scx_rustland: treat the CPU running the user-space scheduler as idle
Considering the CPU where the user-space scheduler is running as busy
doesn't provide any benefit, since the scheduler consistently dispatches
an amount of tasks equal to the number of idle CPUs and then yields
(therefore its own CPU should be considered idle).

This also helps reduce the overall CPU utilization of the user-space
scheduler, especially when the system is mostly idle, without
introducing any measurable performance regression.

Measuring the average CPU utilization of a (mostly) idle system over a
time period of 60 sec:

 - without this patch: 5.41% avg cpu util
 - with this patch:   2.26% avg cpu util

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-29 21:14:58 +01:00
Andrea Righi
05f5c69747 ci: use virtme-ng to test the schedulers
Use virtme-ng to run the schedulers after they're built; virtme-ng
allows picking an arbitrary sched-ext enabled kernel and running it
while virtualizing the entire user-space root filesystem, so we can
basically execute the recompiled schedulers inside that kernel.

This should allow us to catch potential run-time issues in advance (both
in the kernel and in the schedulers).

The sched-ext kernel is taken from the Ubuntu ppa (ppa:arighi/sched-ext)
at the moment, since it is the easiest / fastest way to get a
precompiled sched-ext kernel to run inside the Ubuntu 22.04 testing
environment.

The schedulers are tested using the new meson target "test_sched", the
specific actions are defined in meson-scripts/test_sched.

By default each test has a timeout of 30 sec after virtme-ng completes
the boot (which should be enough to initialize the scheduler and run it
for a few seconds), while the total lifetime of the virtme-ng guest is
set to 60 sec; after this time the guest is killed (this allows catching
potential kernel crashes / hangs).

If a single scheduler fails the test, the entire "test_sched" action
will be interrupted and the overall test result will be considered a
failure.

At the moment scx_layered is excluded from the tests, because it
requires a special configuration (we should probably pre-generate a
default config in the workflow actions and change the scheduler to use
the default config if it's executed without any argument).

Moreover, scx_flatcg is also temporarily excluded from the tests,
because of these known issues:
 - https://github.com/sched-ext/scx/issues/49
 - https://github.com/sched-ext/sched_ext/pull/101

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-29 15:54:10 +01:00
Andrea Righi
dbc8e23980 scx_userland: flush stdout when printing stats
Periodically flush stdout to help follow the scheduler's progress
during testing.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-29 15:53:12 +01:00
Andrea Righi
614a1ff901 scx_flatcg: flush stdout when printing stats
Periodically flush stdout to help follow the scheduler's progress
during testing.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-29 15:53:12 +01:00
Tejun Heo
3206464405
Merge pull request #55 from arighi/scx-rustland-doc
scx_rustland: add documentation to scheds/rust/README.md
2023-12-29 17:35:09 +09:00
Andrea Righi
cc17780c24 scx_rustland: add documentation to scheds/rust/README.md
Add documentation for scx_rustland to the README.md files of the Rust
schedulers.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-29 09:13:54 +01:00
Tejun Heo
d2a173fc51
Merge pull request #53 from sched-ext/htejun
Suppress the deprecation warning from bindgen and bump versions
2023-12-29 07:07:06 +09:00
Tejun Heo
98773131df Bump versions to publish scx_utils fedora compat change 2023-12-29 06:58:45 +09:00
Tejun Heo
c47a4b6716 scx_utils: Explain what's going on with bindgen version and suppress deprecation warning
This is a followup to https://github.com/sched-ext/scx/pull/50. See the
comment in BpfBuilder::bindgen_bpf_intf() for details.
2023-12-29 06:56:07 +09:00
Tejun Heo
1d868dbf89
Merge pull request #50 from jordalgo/downgrade-bindgen
Downgrade bindgen to 0.68
2023-12-29 06:28:20 +09:00
Tejun Heo
e230e86272
Merge pull request #52 from arighi/scx-rustland-update-idle
scx_rustland: introduce update_idle callback
2023-12-29 06:10:40 +09:00
Andrea Righi
6df4d7e0c6 scx_rustland: introduce an update_idle() callback
Move the logic to activate the userspace scheduler to an update_idle()
callback, which is called when the CPU is about to go idle.

This disables the built-in idle tracking mechanism, allowing us to rely
completely on the internal CPU ownership logic (via get_cpu_owner() and
set_cpu_owner()) and to share the idle state with the user-space
scheduler via the BPF_MAP_TYPE_ARRAY cpu_map.

Moreover, when the user-space scheduler is activated, kick the idle CPU
to trigger an immediate dispatch and avoid bubbles in the scheduling
pipeline.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-28 14:41:08 +01:00
Andrea Righi
1baae38e7f Revert "scx_rustland: always dispatch kthreads on the local CPU"
This reverts commit 9237e1d ("scx_rustland: always dispatch kthreads on
the local CPU").

Do not always prioritize all kthreads: we may have unbound workqueue
workers that can consume a lot of CPU cycles (e.g., encryption workers),
so we definitely want to apply the normal scheduling policy to those.

Therefore, restore the old behavior and prioritize only per-CPU kthreads.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-28 14:40:03 +01:00
Tejun Heo
990cd058fe
Merge pull request #48 from arighi/scx-rustland-userspace-interlocking
scx_rustland: clarify and improve BPF / userspace interlocking
2023-12-28 08:26:55 +09:00
Jordan Rome
c8a721b033 Downgrade bindgen to 0.68
This is so we can package scx_utils into fedora without having
to upgrade rust-bindgen
(https://bodhi.fedoraproject.org/updates/FEDORA-2023-18e7f124e1).

To make this happen we need to stop using the `CargoCallbacks::new`
constructor, which was added in 0.69. The old way is still legitimate
according to the docs:
https://rust-lang.github.io/rust-bindgen/non-system-libraries.html
2023-12-27 12:19:28 -08:00
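
Concretely, this means passing the callbacks as the unit struct that
bindgen 0.68 exposes instead of using the 0.69-only constructor; a
minimal build.rs sketch (not the exact scx_utils code):

  // build.rs sketch for bindgen 0.68: CargoCallbacks is used as a unit
  // struct; CargoCallbacks::new() only exists from bindgen 0.69 onwards.
  fn main() {
      let bindings = bindgen::Builder::default()
          .header("wrapper.h")
          .parse_callbacks(Box::new(bindgen::CargoCallbacks))
          .generate()
          .expect("unable to generate bindings");

      bindings
          .write_to_file("bindings.rs")
          .expect("couldn't write bindings");
  }
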
Andrea Righi
9237e1d835 scx_rustland: always dispatch kthreads on the local CPU
Adding extra overhead to any kthread can potentially slow down the
entire system, so make sure this never happens by dispatching all
kthreads directly on the same local CPU (not just the per-CPU kthreads),
bypassing the user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-27 14:15:46 +01:00
Andrea Righi
f0ece7af6b scx_rustland: wake-up user-space scheduler when a CPU is released
Trigger the user-space scheduler only upon a task's CPU release event
(avoiding its activation during each enqueue event) and only if there
are tasks waiting to be processed by the user-space scheduler.

This should save unnecessary calls to the user-space scheduler, reducing
the overall overhead of the scheduler.

Moreover, rename nr_enqueues to nr_queued and store the number of tasks
currently queued to the user-space scheduler (waiting to be dispatched).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-27 14:15:46 +01:00
Andrea Righi
7d01be9568 scx_rustland: provide get/set_cpu_owner()
Provide primitives to get and set CPU ownership in the BPF part. This
improves code readability, and these primitives can be used by the BPF
part as a baseline for implementing better CPU idle tracking in the
future.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-27 14:15:39 +01:00
Andrea Righi
cd7e1c6248 scx_rustland: clarify BPF / user-space interlocking
BPF doesn't have a full memory model yet, and while strict atomicity
might not be necessary in this context, it is advisable to make the
interlocking model clearer.

To achieve this, provide the following primitives to operate on
usersched_needed:

  static void set_usersched_needed(void)

  static bool test_and_clear_usersched_needed(void)

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-26 14:28:24 +01:00
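
The actual primitives live in the BPF (C) side of scx_rustland; as a
language-neutral illustration of the test-and-clear pattern they
implement, here is a small Rust sketch using an atomic flag:

  use std::sync::atomic::{AtomicBool, Ordering};

  static USERSCHED_NEEDED: AtomicBool = AtomicBool::new(false);

  // Request an activation of the user-space scheduler.
  fn set_usersched_needed() {
      USERSCHED_NEEDED.store(true, Ordering::SeqCst);
  }

  // Atomically consume the request: returns true at most once per set.
  fn test_and_clear_usersched_needed() -> bool {
      USERSCHED_NEEDED.swap(false, Ordering::SeqCst)
  }
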
Tejun Heo
8443d8ac16
Merge pull request #47 from arighi/scx-rustland-cpu
scx_rustland improvements
2023-12-24 06:29:15 +09:00
Andrea Righi
e038a530ae scx_rustland: dispatch tasks in batch
Dispatch tasks in batches equal to the number of idle CPUs in the
system.

This reduces the pressure on the dispatcher queues, improving the
effectiveness of the scheduler (by having more tasks sitting in the
scheduler's task pool) and mitigating potential priority inversion
issues.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-23 10:44:03 +01:00
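
A small sketch of the batching idea with illustrative names: drain at
most as many tasks from the (vruntime-ordered) task pool as there are
idle CPUs, leaving the rest queued for the next round:

  use std::collections::VecDeque;

  struct Task {
      pid: i32,
      vruntime: u64,
  }

  // Dispatch one batch: at most `nr_idle_cpus` tasks, taken in vruntime
  // order from the pool; the remaining tasks stay queued.
  fn dispatch_batch(pool: &mut VecDeque<Task>, nr_idle_cpus: usize) -> Vec<Task> {
      let n = nr_idle_cpus.min(pool.len());
      pool.drain(..n).collect()
  }
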
Andrea Righi
4d98862674 scx_rustland: expose CPU information to the user-space scheduler
Provide an interface for the BPF dispatcher and user-space scheduler to
share CPU information. This information can empower the user-space
scheduler to make more informed decisions and enable the implementation
of a broader range of scheduling policies.

With this change the BPF dispatcher provides a CPU map (one entry per
CPU) that stores the pid that is running on each CPU (0 if the CPU is
idle). The CPU map is updated by the BPF dispatcher in the .running()
and .stopping() callbacks.

The dispatcher then sends the user-space scheduler a candidate CPU for
each task that needs to run (always the previously used CPU), along with
all the task's information.

The user-space scheduler can decide to confirm the selected CPU or to
choose a different one, using all the shared CPU information.

Lastly, the selected CPU is communicated back to the dispatcher along
with all the task's information, and the BPF dispatcher takes care of
executing the task on the selected CPU, triggering a migration if
needed.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-23 10:38:56 +01:00
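
Conceptually, each queued task now carries a candidate CPU that the
user-space scheduler may confirm or override by consulting the shared
CPU map; a rough sketch with hypothetical types (not the real message
layout):

  // One entry per CPU: pid currently running there, or 0 if the CPU is idle.
  type CpuMap = Vec<i32>;

  struct QueuedTask {
      pid: i32,
      cpu: i32, // candidate CPU suggested by the BPF dispatcher (the prev CPU)
  }

  // Confirm the suggested CPU if it is idle, otherwise fall back to any
  // idle CPU; keep the original suggestion if nothing else is idle.
  fn choose_cpu(task: &QueuedTask, cpu_map: &CpuMap) -> i32 {
      if cpu_map.get(task.cpu as usize) == Some(&0) {
          return task.cpu;
      }
      cpu_map
          .iter()
          .position(|&pid| pid == 0)
          .map(|c| c as i32)
          .unwrap_or(task.cpu)
  }
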
Andrea Righi
968ac80a3f scx_rustland: handle graceful vs non-graceful exit
Do not report an exit error message if it's empty. Moreover, distinguish
between a graceful exit and a non-graceful exit.

In general, try to follow the behavior of user_exit_info.h for the C
schedulers.

NOTE: in the future the whole exit handling can probably be moved to a
more generic place (scx_utils) to prevent code duplication across
schedulers and to avoid small inconsistencies like this one.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-22 19:44:14 +01:00
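
A minimal sketch of the intended behavior (illustrative only): treat an
empty exit message as a graceful unregister and report an error
otherwise, mirroring what user_exit_info.h does for the C schedulers:

  // Only surface an error when the kernel provided a non-empty message.
  fn report_exit(exit_msg: &str) {
      if exit_msg.is_empty() {
          println!("EXIT: scheduler unregistered");
      } else {
          eprintln!("EXIT: {}", exit_msg);
      }
  }
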
Tejun Heo
c7b52d485d
Merge pull request #45 from sirlucjan/0.1.3
Bump to 0.1.3
2023-12-22 08:50:45 +09:00
Piotr Gorski
c6eb66616f
Bump to 0.1.3
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2023-12-22 00:48:50 +01:00
Tejun Heo
d3e8e52b1a
Merge pull request #44 from arighi/scx-rustland
scx_rustland: rename from scx_rustlite
2023-12-22 08:40:01 +09:00
Andrea Righi
f7f0e3236c scx_rustland: rename from scx_rustlite
Rename scx_rustlite to scx_rustland to better represent the mirroring of
scx_userland (in C), but implemented in Rust.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-22 00:20:14 +01:00
David Vernet
4cadb92003
Merge pull request #38 from arighi/scx-rustlite
scx_rustlite: simple vtime-based scheduler written in Rust
2023-12-21 13:31:52 -06:00
Andrea Righi
086c6dffc8 scx_rustlite: simple user-space scheduler written in Rust
This scheduler is made of a BPF component (dispatcher) that implements
the low level sched-ext functionalities and a user-space counterpart
(scheduler), written in Rust, that implements the actual scheduling
policy.

The main goal of this scheduler is to be easy to read and well
documented, so that newcomers (i.e., students, researchers, junior devs,
etc.) can use it as a template to quickly experiment with scheduling
theory.

For this reason the design of this scheduler is mostly focused on
simplicity and code readability.

Moreover, the BPF dispatcher is completely agnostic of the particular
scheduling policy implemented by the user-space scheduler. For this
reason, developers who want to use this scheduler to experiment with
scheduling policies should be able to simply modify the Rust component,
without having to deal with any internal kernel / BPF details.

Future improvements:

 - Transfer the responsibility of determining the CPU for executing a
   particular task to the user-space scheduler.

   Right now this logic is still fully implemented in the BPF part and
   the user-space scheduler can only decide the order of execution of
   the tasks, which significantly restricts the scheduling policies that
   can be implemented in the user-space scheduler.

 - Experiment with sending tasks from the user-space scheduler to the
   BPF dispatcher in batches of a given size, instead of draining the
   task queue completely and sending all the tasks at once every time.

   A batch size should help reduce the overhead and the number of
   wakeups of the user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-21 18:53:30 +01:00
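
A minimal sketch of the core user-space idea, a vruntime-ordered task
pool, with illustrative names (not the actual scx_rustlite code):

  use std::collections::BTreeSet;

  #[derive(PartialEq, Eq, PartialOrd, Ord)]
  struct PoolEntry {
      vruntime: u64, // primary ordering key: lowest vruntime runs first
      pid: i32,
  }

  struct TaskPool {
      tasks: BTreeSet<PoolEntry>,
  }

  impl TaskPool {
      fn push(&mut self, pid: i32, vruntime: u64) {
          self.tasks.insert(PoolEntry { vruntime, pid });
      }

      // Pop the task with the smallest vruntime to dispatch it next.
      fn pop(&mut self) -> Option<PoolEntry> {
          self.tasks.pop_first()
      }
  }
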
Tejun Heo
cfb41a77fc
Merge pull request #43 from sched-ext/ubuntu_2204_ci
ci: Run CI job on Ubuntu 22.04
2023-12-21 06:55:45 +09:00
David Vernet
1bf04d0972
ci: Run CI job on Ubuntu 22.04
Andrea pointed out that we can and should be using Ubuntu 22.04.
Unfortunately it still doesn't ship some of the deps we need, like
clang-17, but it does at least ship virtme-ng, so it's good for us to
use it so that we can actually test running the schedulers in a
virtme-ng VM once it supports being run in docker.

Also, update the job to run on pushes, and not just when a PR is opened.

Suggested-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: David Vernet <void@manifault.com>
2023-12-20 15:29:49 -06:00
Tejun Heo
79b0c3ea89
Merge pull request #41 from multics69/link-blogs
Update README for additional resources (blog posts and articles)
2023-12-19 16:57:32 +09:00
Changwoo Min
23cecf2532 Update README for additional resources (blog posts and articles) 2023-12-19 12:28:44 +09:00
David Vernet
eb7b3c99f0
Merge pull request #40 from sched-ext/ci
scx: Add CI action that builds schedulers for PRs
2023-12-18 21:17:47 -06:00
David Vernet
4523b10e45
scx: Add CI action that builds schedulers for PRs
When Ubuntu ships with sched_ext, we can also maybe test loading the
schedulers (not sure if the runners can run as root though). For now, we
should at least have a CI job that lets us verify that the schedulers
can _build_. To that end, this patch adds a basic CI action that builds
the schedulers.

As is, this is a bit brittle in that we're having to manually download
and install a few dependencies. I don't see a better way for now without
hosting our own runners with our own containers, but that's a bigger
investment. For now, hopefully this will get us _some_ coverage.

Signed-off-by: David Vernet <void@manifault.com>
2023-12-18 21:12:50 -06:00
Tejun Heo
3049d60883
Merge pull request #39 from sched-ext/nest_fixes
Fix some things in Nest
2023-12-18 13:15:41 -10:00