Commit Graph

428 Commits

Author SHA1 Message Date
Tejun Heo
3e7ef35649
Merge pull request #250 from multics69/lavd-issue-234
scx_lavd: replesih time slice at ops.running() only when necessary
2024-04-29 09:01:04 -10:00
Tejun Heo
5b7b7d5193
Merge pull request #247 from multics69/lavd-issue-244
scx_lavd: always inline submit_task_ctx to make the verifier happy
2024-04-29 07:53:38 -10:00
Changwoo Min
5f63e0ca30 scx_lavd: replesih time slice at ops.running() only when necessary
The current code replenishes the task's time slice whenever the task
becomes ops.running(). However, there is a case where such behavior can
starve the other tasks, causing the watchdog timeout error. One (if not
all) such case is when a task is preempted while running by the higher
scheduler class (e.g., RT, DL). In such a case, the task will be transit
in a cycle of ops.running() -> ops.stopping() -> ops.running() -> etc.
Whenever it becomes re-running, it will be placed at the head of local
DSQ and ops.running() will renew its time slice. Hence, in the worst
case, the task can run forever since its time slice is never exhausted.
The fix is assigning the time slice only once by checking if the time
slice is calculated before.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-04-29 12:13:31 +09:00
Andrea Righi
cabde30736 scx_utils: bump up version to 0.8.0
Bump up scx-utils version to provide the new scx_utils::TopologyMap.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-04-28 21:01:16 +02:00
Andrea Righi
5effb4fc4c scx_rustland: bump up version to 0.0.5
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-04-28 12:01:38 +02:00
Andrea Righi
0785246ee2 scx_rustland: provide --version option
Provide a command line option to print the version of the scheduler and
the scx_rustland_core crate.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-04-28 12:01:38 +02:00
Andrea Righi
fb2f5c240e scx_rustland_core: bump up version to 0.3
Given that rustland_core now supports task preemption and it has been
tested successfully, it's worhtwhile to cut a new version of the crate.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-04-28 12:01:38 +02:00
Andrea Righi
905960f752 scx_lavd: use c_char consistently
In Rust c_char can be aliased to i8 or u8, depending on the particular
target architecture.

For example, trying to build scx_lavd on ppc64 triggers the following
error:

error[E0308]: mismatched types
   --> src/main.rs:200:38
    |
200 |         let c_tx_cm: *const c_char = (&tx.comm as *const [i8; 17]) as *const i8;
    |                      -------------   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `*const u8`, found `*const i8`
    |                      |
    |                      expected due to this
    |
    = note: expected raw pointer `*const u8`
               found raw pointer `*const i8`

To fix this, consistently use c_char instead of assuming it corresponds
to i8.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-04-27 17:21:19 +02:00
Changwoo Min
f470b1aa13 scx_lavd: always inline submit_task_ctx to make the verifier happy
In _some_ kernel versions, loading scx_lavd fails with an error of
"bpf_rcu_read_unlock is missing". The usage of
bpf_rcu_read_lock/unlock() in proc_dump_all_tasks() is correct but the
bpf verifier still think bpf_rcu_read_unlock() is missing. The most
plausible reason so far is that the problematic kernel does not have a
commit 6fceea0fa59f ("bpf: Transfer RCU lock state between subprog
calls"), failing inter-procedural analysis between proc_dump_all_tasks()
and submit_task_ctx(). Thus, we force inline submit_task_ctx() (no
inter-procedural analysis by the verifier is necessary) for the time
being.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-04-28 00:11:38 +09:00
Changwoo Min
d0d0a18b10 scx_lavd: fix copyright information
Correct the copyright and author information

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-04-26 16:36:58 +09:00
Andrea Righi
973aded5a8
Merge pull request #238 from sched-ext/rustland-reduce-topology-overhead
scx_rustland: reduce overhead by caching host topology
2024-04-24 22:24:23 +02:00
David Vernet
5ba137e8c9
layered: Make layered backwards compat with cpufreq
Only the very newest kernels support scx_bpf_cpuperf_set(). Let's update
scx_layered to accommodate older kernels as well.

Signed-off-by: David Vernet <void@manifault.com>
2024-04-24 14:01:51 -05:00
Tejun Heo
9a9b4dd23e
Merge pull request #239 from hodgesds/cpufreq_helpers
Add CPU frequency related helpers and extend scx_layered
2024-04-24 07:22:15 -10:00
Andrea Righi
5302ff1cdc scx_rustland: use TopologyMap for efficient CPU topology iteration
Looking at perf top it seems that the scheduler can spend a significant
amount of time iterating over the CPU topology/cpumask information,
especially when the system is running a significant amount of tasks:

  2.57% scx_rustland [.] <scx_utils::cpumask::CpumaskIntoIterator as core::iter::traits::iterator::Iterator>::next

Considering that scx_rustland doesn't support CPU hotplugging yet (it
requires a full restart to properly handle CPU hotplug events), we can
completely avoid this overhead by caching a TopologyMap object at the
beginning, when the scheduler starts, instead of constantly
re-evaluating the CPU topology information.

This allows to reduce the scheduler overhead by ~5% CPU utilization
under heavy load conditions (from ~65% -> ~60%, according to top).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-04-24 17:08:06 +02:00
Daniel Hodges
32e97bf4d5 Adds CPU frequency related helpers and extend scx_layered
This change adds `scx_bpf_cpuperf_cap`, `scx_bpf_cpuperf_cur` and
`scx_bpf_cpuperf_set` definitions that were recently introduced into
[`sched_ext`](https://github.com/sched-ext/sched_ext/pull/180). It adds
a `perf` field to `scx_layered` to allow for controlling performance per
layer.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-04-24 07:27:52 -07:00
David Vernet
a8daf372b2
Merge pull request #241 from sched-ext/cpumask_efficient
topology: Don't allocate on calls to span()
2024-04-24 09:21:15 -05:00
David Vernet
24c248eebb
layered: Add support for filtering on process name
If a library creates threads, those threads will often have the same
name. If two different processes of different priority both use a
library, it may be that we want the library's threads in each process to
be put into different layers.

To support this, let's add the ability to filter not only by task name,
but also by process name via the task thread group leader's comm.

Tested by creating two executables named "foo" and "bar", which both
spawn a bunch of tasks named "exp_worker" that spin until being
interrupted. With this config: https://pastebin.com/Uz2phzxQ, the tasks
were correctly matched to the expected layers.

Signed-off-by: David Vernet <void@manifault.com>
2024-04-23 23:12:37 -05:00
David Vernet
c187c65702
topology: Don't allocate on calls to span()
We're currently cloning cpumasks returned by calls to {Core, Cache,
Node, Topology}::span(). If a caller needs to clone it, they can. Let's
not penalize the callers that just want to query the underlying cpumask.

Signed-off-by: David Vernet <void@manifault.com>
2024-04-23 22:59:42 -05:00
David Vernet
a998fb7d01
layered: Clarify f: and file: prefix behavior
Some people have expressed confusion at this behavior. Let's be a bit
more explicit in the documentation.

Signed-off-by: David Vernet <void@manifault.com>
2024-04-23 20:39:28 -05:00
Andrea Righi
fbe9a80af8 scx_rustland: introduce --no-preemption
Provide a run-time option to disable task preemption.

This option can be used to improve the throughput of the CPU-intensive
tasks while still providing a good level of responsiveness in the
system.

By default preemption is enabled, to provide a higher level of
responsiveness to the interactive tasks.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-04-23 07:13:30 +02:00
Andrea Righi
0ffaaac6db scx_rustland: enable preemption
Use the new scx_rustland_core dispatch flag RL_PREEMPT_CPU to allow
interactive tasks to preempt other tasks with scx_rustland.

If the built-in idle selection logic is enforced (option `-i`), the
scheduler prioritizes keeping tasks on the target CPU designated by this
logic. With preemption enabled, these tasks have a higher likelihood of
reusing their cached working set, potentially improving performance.

Alternatively, when tasks are dispatched to the first available CPU
(default behavior), interactive tasks benefit from running more promptly
by kicking out other tasks before their assigned time slice expires.

This potentially allows to increase the default time slice to higher
values in the future, to improve the overall throughput in the system
and, at the same time, still maintain a good level of responsiveness,
because interactive tasks are now able to run pretty much immediately,
independently on the remaining time slice of the other tasks that are
contending the CPUs in the system.

= Results =

Measuring the performance of the usual benchmark "playing a video game
while running a parallel kernel build in background" seems to give
around 2-10% boost in the fps with preemption enabled, depending on the
particular video game.

Results were obtained running a `make -j32` kernel build on a AMD Ryzen
7 5800X 8-Cores 16GB RAM, while testing video games such as Baldur's
Gate 3 (with a solid +10% fps), Counter Strike 2 (around +5%) and Team
Fortress 2 (+2% boost).

Moreover, some WebGL applications (such as
https://webglsamples.org/aquarium/aquarium.html) seem to benefit even
more with preemption enabled, providing up to a +15% fps boost.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-04-23 07:13:30 +02:00
Andrea Righi
6d2aac1591 scx_rustland_core: introduce dispatch flags
Reserve some bits of the `cpu` attribute of a task to store special
dispatch flags.

Initially, let's introduce just RL_CPU_ANY to replace the special value
NO_CPU, indicating that the task can be dispatched on any CPU,
specifically the first CPU that becomes available.

This allows to keep the CPU value assigned by the builtin idle selection
logic, that can potentially be used later for further optimizations.

Moreover, having the possibility to specify dispatch flags gives more
flexibility and it allows to map new scheduling features to such flags.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-04-23 07:13:30 +02:00
takase1121
3e12676ca2
scheds-rust: add explanation for chaining schedulers 2024-04-23 08:30:38 +08:00
takase1121
5d20f89a87
scheds-rust: build rust schedulers in sequence 2024-04-23 08:06:27 +08:00
David Vernet
5f1eac85ff
layered: Fix init_task
When I transitioned layered to using task local storage, I messed up
initializing the task ctx, not realizing we previously had a separate
variable that was initializing the hasmap entry. We need to initialize
the task's layer to -11, and also set refresh_layer to 1.

Signed-off-by: David Vernet <void@manifault.com>
2024-04-18 09:44:32 -05:00
David Vernet
45589cd0f7
lavd: Fix a few typos
Noticed a few typos. Let's fix em up

Signed-off-by: David Vernet <void@manifault.com>
2024-04-17 08:17:52 -05:00
David Vernet
ffced1f615
rusty: Remove explicit padding
As of libbpf-rs 0.23.0 (which contains commit
9d9e979fcf),
libbpf-rs now generates rust structs that honor padding. We can
therefore remove the custom padding in scx_rusty's struct pcpu_ctx.

For example, here is the generated pub struct pcpu_ctx:

pub struct pcpu_ctx {
    pub dom_rr_cur: u32,
    pub dom_id: u32,
    pub nr_node_doms: u32,
    pub node_doms: [u32; 64],
    pub __pad_268: [u8; 52],
}

And here is the matching struct in the BPF object file:

struct pcpu_ctx {
        u32                        dom_rr_cur;           /*     0     4 */
        u32                        dom_id;               /*     4     4 */
        u32                        nr_node_doms;         /*     8     4 */
        u32                        node_doms[64];        /*    12   256 */

        /* size: 320, cachelines: 5, members: 4 */
        /* padding: 52 */
} __attribute__((__aligned__(64)));

Signed-off-by: David Vernet <void@manifault.com>
2024-04-12 13:52:13 -05:00
David Vernet
e032ee7cc0
rusty: Add lookup_pcpu_ctx() helper
Getting rid of more boilerplate

Signed-off-by: David Vernet <void@manifault.com>
2024-04-11 19:30:23 -05:00
David Vernet
885a9fd7da
rusty: Make lookup_task_ctx() static
It doesn't need to be a global prog. Let's make it static.

Signed-off-by: David Vernet <void@manifault.com>
2024-04-11 19:30:23 -05:00
David Vernet
0ff73754cf
rusty: Add create_save_cpumask() helper
We have a lot of boilerplate code where we create a cpumask, initialize
it, and then bpf_kptr_xchg() it into the map. In an effort to slightly
reduce the amount of boilerplate, let's create a helper that can
alleviate some of it.

Signed-off-by: David Vernet <void@manifault.com>
2024-04-11 19:30:21 -05:00
David Vernet
e27d5b4e67
rusty: Fix a few random issues
There are some random issues in the code, like unused variables, and bad
print formatters. I'm not sure why the compiler isn't consistently
complaining, but let's fix them.

Signed-off-by: David Vernet <void@manifault.com>
2024-04-11 19:21:02 -05:00
David Vernet
31cc2dccb9
rusty: Allocate DSQ on appropriate NUMA node
In scx_rusty, now that we have a complete view of the host's topology
thanks to the Topology crate, we can update our calls to
scx_bpf_create_dsq() to create the DSQ on the NUMA node of the domain.
It's unclear how much this will end up mattering for performance in the
typical case, but we might as well do the right thing given that host
topolgoy is static, and we have the information.

Signed-off-by: David Vernet <void@manifault.com>
2024-04-11 00:01:25 -05:00
Dan Schatzberg
6eefc8c27f
Fix error typo
ENONET means "Machine is not on the network" - this was supposed to be ENOENT "No such file or directory"
2024-04-10 15:28:05 -04:00
Changwoo Min
f53c29759e
scx_lavd: support preemption (in some scenarios) (#224)
* scx-lavd: preemption of a lower-priority task using kick cpu

When a task is enqueued to the global queue, the scheduler checks if
there is a lower priority task than the enqueued task. If so, it kicks
out the lower-priority task, hoping the newly enqueued task or another
higher-priority task runs on the kicked CPU. Kicking another CPU is
expensive as an IPI is involved, so the scheduler judiciously kicks the
CPU when its benefit (i.e., priority gap) is clear enough.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-04-09 14:25:53 +09:00
David Vernet
9a8ed8ab44
Merge pull request #218 from sched-ext/rusty_hotplug
Gracefully handle hotplug in scx_rusty
2024-04-04 16:03:59 -05:00
Andrea Righi
17a30bddc9 scx_rustland_core: bump up version to 0.2
Bump up the version of the crate and update dependencies.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-04-04 22:44:55 +02:00
David Vernet
622b61dd2f
rusty: Support restarting rusty on hotplug events
The scx_rusty scheduler does not support hotplug, and expects a static
host topology throughout its runtime. Though the kernel does have
support for detecting hotplug events, we currently don't detect this in
the kernel, nor surface it to user space when it happens. Now that we
have scx_bpf_exit(), we can gracefully exit the kernel in the event of a
hotplug, and communicate to user space that it should restart the
scheduler.

This patch adds that support to scx_rusty. Note that this assumes that
we're running on a recent enough kernel that has scx_bpf_exit(). If it
doesn't, then we instead just error out of the kernel scheduler and exit
the application.

Signed-off-by: David Vernet <void@manifault.com>
2024-04-04 14:52:48 -05:00
Tejun Heo
ba52cc131b scx_lavd: Add .gitignore 2024-04-04 07:15:37 -10:00
Tejun Heo
a60737a6bf
Merge pull request #207 from sched-ext/api-updates
scx: Apply API updates from sched_ext
2024-04-02 14:26:42 -10:00
Tejun Heo
b925bdf94d Cargo.toml: Update libbpf-rs/cargo dependencies to 0.23 and drop patch.crates-io sections
New versions of libbpf-rs and libbpf-cargo are now available with all the
needed features. Update the dependencies and drop the patch sections.
2024-04-02 11:19:39 -10:00
Tejun Heo
6f81409df4 Bump versions
- scx_utils bumped from 0.6.0 to 0.7.0.

- Repo and rust schedulers get a PATCH level bump.
2024-04-02 10:58:50 -10:00
Tejun Heo
f3e20ae9b3 scx_rustland: Apply API updates and add --exit-dump-len option to scx_rustland 2024-04-02 10:30:56 -10:00
David Vernet
5088328f9e
rusty: Check LOCAL_DSQ length for WAKE_SYNC
In rusty_select_cpu(), if a task is WAKE_SYNC, we'll currently migrate
the task to that CPU if there are any idle cores on the system. As in
[0], this condition is insufficient, as there could be idle cores
elsewhere on the system, but still tasks piled up on a single local DSQ.
Let's add a condition that the local DSQ has to be empty in order to
apply the WAKE_SYNC migration.

Before patch:

[void@maniforge src]$ hackbench
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 100 messages of 100 bytes
Time: 0.433

With patch:
[void@maniforge src]$ hackbench
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 100 messages of 100 bytes
Time: 0.035

Signed-off-by: David Vernet <void@manifault.com>
2024-04-02 15:17:32 -05:00
Tejun Heo
dfa978d166 scx_lavd: Apply API updates 2024-04-02 10:08:02 -10:00
Tejun Heo
0c07f382b1 scx_rusty: Apply API updates 2024-04-02 10:07:54 -10:00
Tejun Heo
59bbd800c1 compat: Implement scx_utils::compat and fix up scx_layered
Implement scx_utils::compat to match C's scx/compat.h and update
scx_layered. Other rust scheds are still broken.
2024-04-02 07:08:56 -10:00
Changwoo Min
3a3bd2a750 scx_lavd: increase the upper bound of ineligible duration
Change the upper bound of ineligible duration (LAVD_ELIGIBLE_TIME_MAX).
The updated (2x increased) upper bound reflects the distribution of
tasks' eligible_delta_ns better.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-30 22:59:06 +09:00
Changwoo Min
8efaf0c4c2 scx_lavd: improve the accuracy of task's run_freq
Change the calculation of the run_frequence using the wait_period from
the last time the task yielded CPU to this time when the task is
running. The old implementation measures the time interval between the
last stopping and the current running and increases run_freq without
reason.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-30 22:55:17 +09:00
Changwoo Min
fe3efb8ce2 scx_lavd: rename last_{start/stop/wait/wake}_clk for consistency
Change the last_{start/stop/wait/wake}_clk in task_ctx to
last_{running/stopping/quiescent/runnable}_clk, matching with state
transition names. In addition, add comments and reorder fields in
task_ctx for readability.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-30 10:13:20 +09:00
Changwoo Min
3ba10a8d4f scx_lavd: accumulate consecutive runnings
When a task runs more than once (running <->stopping) within one
runnable-quiescent transition, accumulate runtime of multiple runnings
for statistics. This helps to get the task's runtime per schedule when
supposing that a huge time slice is given, which is what we want to
collect for scheduling decisions.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-29 17:19:30 +09:00
Changwoo Min
7b99ed9c5c scx_lavd: drop runtime_boost using slice_boost_prio
Remove runtime_boost using slice_boost_prio. Without slice_boost_prio,
the scheduler collects the exact time slice.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-29 16:31:03 +09:00
Changwoo Min
5629189527 scx_lavd: change update_stat_for_*() for consistency
Let's change the function names of update_stat_for_*() as follow their
callers for consistency and less confusion.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-29 14:49:06 +09:00
Changwoo Min
04c9e7fe9d
Merge pull request #201 from multics69/perf-vdeadline01
scx_lavd: fix merge conflicts between PR 197 and 199
2024-03-28 14:15:00 +09:00
Changwoo Min
0ea1aab070 scx_lavd: fix merge conflicts
Merge branch 'perf-vdeadline01' of github.com:sched-ext/scx into perf-vdeadline01
2024-03-28 13:49:19 +09:00
Tejun Heo
340938025f
Merge pull request #200 from sched-ext/layered_delete
layered: Use TLS map instead of hash map
2024-03-27 17:09:20 -10:00
Changwoo Min
60472db845
Merge pull request #197 from multics69/perf-vdeadline01
scx_lavd: improve virtual deadline calculation
2024-03-28 11:44:54 +09:00
Changwoo Min
67f41c7d83 scx_lavd: bug fix: slice_boost should be update before adjusted runtime
The run_time_boosted_ns calculation requires updated slice_boost_prio,
so updating slice_boost_prio should be done before updating
run_time_boosted_ns.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-28 11:21:42 +09:00
David Vernet
e857dd90ab
layered: Use TLS map instead of hash map
In scx_layered, we're using a BPF_MAP_TYPE_HASH map (indexed by pid)
rather than a BPF_MAP_TYPE_TASK_STORAGE, to track local storage for a
task. As far as I can tell, there's no reason we need to be doing this.
We never access the map from user space, and we're even passing a
struct task_struct * to a helper subprog to look up the task context
rather than only doing it by pid.

Using a hashmap is error prone for this because we end up having to
manually track lifecycles for entries in the map rather than relying on
BPF to do it for us. For example, BPF will automatically free a task's
entry from the map when it exits. Let's just use TLS here rather than a
hashmap to avoid issues from this (e.g. we've observed the scheduler
getting evicted because we're accessing a stale map entry after a task
has been destroyed).

Reported-by: Valentin Andrei <vandrei@meta.com>
Signed-off-by: David Vernet <void@manifault.com>
2024-03-27 20:14:27 -05:00
Changwoo Min
31157ebc81 scx-lavd: make the comments in update_sys_cpu_load() clear
The current description is a bit confusing, so update the comments for clarity.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-28 06:45:57 +09:00
Tejun Heo
129d99f542 scx_lavd: Remove custom task state tracking
transit_task_stat() is now tracking the same runnable, running, stopping,
quiescent transitions that sched_ext core already tracks and always returns
%true. Let's remove it.
2024-03-26 12:23:19 -10:00
Tejun Heo
d7ec05e017 scx_lavd: Call update_stat_for_enq() from lavd_runnable()
LAVD_TASK_STAT_ENQ is tracking a subset of runnable task state transitions -
the ones which end up calling ops.enqueue(). However, what it is trying to
track is a task becoming runnable so that its load can be added to the cpu's
load sum.

Move the LAVD_TASK_STAT_ENQ state transition and update_stat_for_enq()
invocation to ops.runnable() which is called for all runnable transitions.

Note that when all the methods are invoked, the invocation order would be
ops.select_cpu(), runnable() and then enqueue(). So, this change moves
update_stat_for_enq() invocation before calc_when_to_run() for
put_global_rq(). update_stat_for_enq() updates taskc->load_actual which is
consumed by calc_greedy_ratio() and thus affects calc_when_to_run().

Before this patch, calc_greedy_ratio() would use load_actual which doesn't
reflect the last running period. After this patch, the latest running period
will be reflected when the task gets queued to the global queue.

The difference is unlikely to matter but it'd probably make sense to make it
more consistent (e.g. do it at the end of quiescent transition).

After this change, transit_task_stat() doesn't detect any invalid
transitions.
2024-03-26 12:23:19 -10:00
Tejun Heo
625bb84bc4 scx_lavd: Move load subtraction to quiescent state transition
scx_lavd tracks task state transitions and updates statistics on each valid
transition. However, there's an asymmetry between the runnable/running and
stopping/quiescent transitions. In the former, the runnable and running
transitions are accounted separately in update_stat_for_enq() and
update_stat_for_run(), respectively. However, in the latter, the two
transitions are combined together in update_stat_for_stop().

This asymmetry leads to incorrect accounting. For example, a task's load
should be added to the cpu's load sum when the task gets enqueued and
subtracted when the task is no longer runnable (quiescent). The former is
accounted correctly from update_stat_for_enq() but the latter is done
whenever the task stops. A task can transit between running and stopping
multiple times before becoming quiescent, so the asymmetry can end up
subtracting the load of a task which is still running from the cpu's load
sum.

This patch:

- introduces LAVD_TASK_STAT_QUIESCENT and updates transit_task_stat() so
  that it can handle all valid state transitions including the multiple back
  and forth transitions between two pairs - QUIESCENT <-> ENQ and RUNNING
  <-> STOPPING.

- restores the symmetry by moving load adjustments part from
  update_stat_for_stop() to new update_stat_for_quiescent().

This removes a good chunk of ignored transitions. The next patch will take
care of the rest.
2024-03-26 12:23:19 -10:00
Tejun Heo
dd40377f03 scx_lavd: Drop unnecessary extern crates
Since https://doc.rust-lang.org/edition-guide/rust-2018/path-changes.html,
extern crate declarations aren't necessary. Let's drop them.
2024-03-26 12:23:19 -10:00
David Vernet
602ec5ada3
layered: Make helper functions static
lookup_task_ctx(), lookup_task_ctx_may_fail(), and lookup_layer()
currently don't have the static keyword, so BPF may treat them as a
global function. We don't actually want these to be global, so let's
make them static to avoid confusing the verifier.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-26 15:08:32 -05:00
Changwoo Min
83169481a6 scx_lavd: improve latency criticality to latency priority mapping
The old approach is mapping [0, maximum latency criticliy] to [-boost
range, boost range). This approach is easily affected by one outlier
maximum value and suffers from the integer truncation error. The new
approach divides the range into two -- [minimum latency criticality,
average latency criticality) and [average latency criticality, maximum
latency criticality] -- and maps them into [boost range/2, 0) and [0,
-boost range/2), respectively,

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-25 22:13:41 +09:00
Changwoo Min
2b5d3c1300 scx_lavd: change sched_prio_to_latency_weight to more skewed one
Replace a latency weight arrary to more skewed one, which is the
inverse of sched_prio_to_slice_weight. It turns out more skewed one
works better under highly CPU-overloaded cases since it gives a longer
deadline to non-latency-critical tasks.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-21 14:01:44 +09:00
Changwoo Min
9c12b607ca scx_lavd: increase LAVD_LC_RUNTIME_MAX for improved lat_prio
As the calculated runtime increases by considering the number of
full-time slice consumption, increase the upper bound
(LAVD_LC_RUNTIME_MAX) of runtime to be considered in latency
calculation. Also, add LAVD_SLICE_BOOST_MAX_PRIO to avoid
slice_boost_prio dropping to zero suddenly.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-21 10:59:13 +09:00
Changwoo Min
32570789d8 scx_lavd: improve the accuracy of runtime per schedule
Take slice_boost_prio -- how many times a full time slice was consumed
-- into consideration in calculating run_time_ns (runtime per schedule).
This improve the accuracy especially when a task is overscheduled and
its time slice is reduced for enforcing fairness.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-21 10:32:09 +09:00
Changwoo Min
b37370bb35 scx_lavd: entail two invalid task state transitions
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-20 00:15:47 +09:00
Changwoo Min
8860f26ff4 scx_lavd: add a sanity check if runtime is negative
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-20 00:15:37 +09:00
Changwoo Min
fa2282363b scx_lavd: more explanation about sched_prio_to_latency_weight
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-19 21:31:37 +09:00
Changwoo Min
24bddad9b4 scx_lavd: fix a typo
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-19 21:19:55 +09:00
Changwoo Min
512c4e794f scx_lavd: fix potential CPU stall in lavd_select_cpu()
Returning prev_cpu after picking an idle CPU will cause the idle CPU
stall because the idle core was already punched out from the idle mask
by the scx core so it is no longer idle from the scx core's point of
view.

This fix conducts the idle core selection at the last step so it never
return prev_cpu after picking the idle core.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-19 00:46:46 +09:00
Changwoo Min
e41c674fae scx_lavd: remove redundant latency calculation at calc_latency_weight()
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-19 00:46:15 +09:00
Changwoo Min
865269f438 scx_lavd: remove unnecessary condition check at slice_fully_consumed()
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-19 00:46:15 +09:00
Changwoo Min
c2b1a10e17 scx_lavd: remove unnecessary condition check at update_stat_for_stop()
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-19 00:46:15 +09:00
Changwoo Min
a27b509452 sdx_lavd: use is_wakeup_ef() in checking wait flag
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-19 00:46:15 +09:00
Changwoo Min
419ccae8db scx_lavd: improve the clarity of the task state transition
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-19 00:46:01 +09:00
Changwoo Min
66e15285ea scx_lavd: move scx_bpf_error() calls to get_cpu_ctx{_id}()
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-19 00:46:01 +09:00
Changwoo Min
0fc5591bf6 scx_lavd: add a utility func, {try_}get_task_ctx()
get_task_ctx() and try_get_task_ctx() were added for common error
handling for task lookup failure. Since idle "swapper" task is not under
sched_ext, try_get_task_ctx() is added for the case such that idle task
can be searched.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-19 00:46:01 +09:00
Changwoo Min
97b4d9ce5a scx_lavd: remove unnecessary condition check in is_wakeup_wf()
We don't need to test SCX_WAKE_SYNC because SCX_WAKE_SYNC should only be
set when SCX_WAKE_TTWU is set.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-19 00:46:01 +09:00
Changwoo Min
47e7238b13 scs_lavd: improve the description of fairness
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-19 00:45:37 +09:00
Changwoo Min
670c1b5b92 scx_lavd: print one scheduling decision by default
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-19 00:30:41 +09:00
Changwoo Min
315e5b3fe2 scx_lavd: remove unnecessary arg from put_local_rq()
cpu_id is unused and not necessary in pu_local_rq(), so it it removed.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-19 00:30:26 +09:00
Changwoo Min
ead7d55c5c scx_lavd: replace num_cpus to scx_utils::Topology
This removes the external carte depenendy and avoides the known bugs
in the num_cpus carte.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-19 00:30:26 +09:00
Changwoo Min
17bce169e7 scx_lavd: fix formatting issues in main.rs and main.bpf.c
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-03-19 00:30:26 +09:00
Changwoo Min
fb73520990 scx_lavd: add scx_lavd to the meson build 2024-03-16 10:55:37 +09:00
Changwoo Min
6ab3928a0d scx_lavd: add scx_lavd (Latency-criticality Aware Virtual Deadline) scheduler
scx_lavd is a BPF scheduler that implements an LAVD (Latency-criticality
Aware Virtual Deadline) scheduling algorithm. While LAVD is new and
still evolving, its core ideas are 1) measuring how much a task is
latency critical and 2) leveraging the task's latency-criticality
information in making various scheduling decisions (e.g., task's
deadline, time slice, etc.). As the name implies, LAVD is based on the
foundation of deadline scheduling. This scheduler consists of the BPF
part and the rust part. The BPF part makes all the scheduling decisions;
the rust part loads the BPF code and conducts other chores (e.g.,
printing sampled scheduling decisions).
2024-03-16 10:31:07 +09:00
David Vernet
35b7dc95d0
rusty: Fix up the scheduler description
There were a few issues, e.g. us still mentioning the infeasible weights
problem, and arguments using underscores despite clap rendering them
with dashes. Let's fix them up.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-14 11:21:03 -05:00
David Vernet
4520514fe8
rusty: Account for disabled but offline CPUs
As described in https://bugzilla.kernel.org/show_bug.cgi?id=218109,
https://github.com/sched-ext/scx/issues/147 and
https://github.com/sched-ext/sched_ext/issues/69, AMD chips can
sometimes report fully disabled CPUs as offline, which causes us to
count them when looking at /sys/devices/system/cpu/possible.

Additionally, systems can have holes in their active CPU maps. For
example, a system with CPUs 0, 1, 2, 3 possible, may have only 0 and 2
active. To address this, we need to do a few things:

1. Update topology.rs to be clear that it's returning the number of
   _possible_ CPUs in the system. Also update Topology to only record
   online CPUs when creating its span and iterating over sysfs when
   creating domains. It was previously trying to record when a CPU was
   online, but this was actually broken as the topology directory isn't
   present in sysfs when the CPU is offline.

2. Schedulers should not be relying on nr_possible_cpus for anything
   other than interacting with per-CPU data (e.g. for stats extraction),
   or e.g. verifying maximum sizes of statically sized arrays in BPF. It
   should _not_ be used for e.g. performing load calculations, etc. With
   that said, we'll also need to update schedulers to not rely on the
   nr_possible_cpus figure being exported by the topology crate. We do
   that for rusty in this patch, but don't fix any of the others other
   than updating how they call topology.rs.

3. Account for the fact that LLC IDs may be non-contiguous. For example,
   if there is a single core in an LLC, then if we assign LLC IDs to
   domains, then the domain IDs won't be contiguous. This doesn't fit
   our current model which is used by e.g. infeasible_weights.rs. We'll
   update some of the code in rusty to accomodate this, but we'll need
   to do more.

4. Update schedulers to properly reset themselves in the event of a
   hotplug event. We'll take care of that in a follow-on change.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-14 11:15:28 -05:00
David Vernet
2b8a3ea984
rusty: Iterate over domains, not IDs
If a CPU is offline, it could cause an LLC to go offline, which could
cause us to have non-contiguous domain IDs. Right now, a few places in
code assume contiguous domain IDs, such as in the infeasible weights
crate. Let's update domain.rs and load_balaance.rs to do the right
thing. We'll fix the others later.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-14 11:02:01 -05:00
David Vernet
4e9cf5181e
rusty: Fix domain weight() function
We were looking at the domain cpumask length, instead of its weight.
Correct the oversight.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-14 11:02:01 -05:00
David Vernet
bc0336d727
cpumask: Add bitwise ops for cpumask
We implement functions or(), and(), and xor() for cpumasks, but we
should also implement the bitwise ops for those operations in case
people prefer that syntax.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-14 11:02:01 -05:00
David Vernet
583696f940
topology: Include last CPU in online
We're iterating from min..max cpu in cpus_online(), but that's not
inclusive of the max CPU. Let's also include that so we don't think that
last CPU is offline.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-14 11:01:52 -05:00
Andrea Righi
2cd3929475 scx_rustland: mitigate sub-optimal performance with offline CPUs
Most of the schedulers assume that the amount of possible CPUs in the
system represents the actual number of CPUs available.

This is not always true: some CPUs may be offline or certain CPU models
(AMD CPUs for example) may include unavailable CPUs in this number.

This can lead to sub-optimal performance or even errors in the scheduler
(see for example [1][2]).

Ideally, we need to attack this issue in a more generic way, such as
having a proper API provided by a C library, that can be used by all
schedulers and the topology Rust module (scx_utils crate).

But for now, let's try to mitigate most of the common sub-optimal cases
separately inside each scheduler.

For rustland we can apply some mitigations both in select_cpu() (for the
BPF part) and in the user-space part:

 - the former is fixed in the sched-ext kernel by commit 94dc0c01b957
   ("scx: Use cpu_online_mask when resetting idle masks"). However,
   adding an extra check `cpu < num_possible_cpus` in select_cpu(),
   allows to properly support AMD CPUs, even with kernels that don't
   have the cpu_online_mask fix yet (this doesn't always guarantee the
   validity of cpu, but it should be enough to mitigate the majority of
   the potential sub-optimal cases, without introducing any significant
   overhead)

 - the latter can be fixed relying on topology.span(), instead of
   topology.nr_cpus(), to count the amount of available CPUs in the
   system.

[1] https://github.com/sched-ext/sched_ext/issues/69
[2] https://github.com/sched-ext/scx/issues/147

Link: 94dc0c01b9
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-03-14 10:19:31 +01:00
David Vernet
3cda1bc690
Merge pull request #187 from sched-ext/layered-updates
scx_layered: Make config json assume default vaules for unspecified fields
2024-03-13 17:15:18 -05:00
Tejun Heo
76fb0fdd8f scx_layered: Make config json assume default vaules for unspecified fields
This makes writing configs and allows introducing new fields without
breaking existing configs.
2024-03-13 11:10:38 -10:00
Tejun Heo
6048992ca7
Merge pull request #185 from sched-ext/layered-updates
scx_layered: Implement layer properties `exclusive` and `min_exec_us`
2024-03-13 09:59:37 -10:00
Tejun Heo
60b346c1fc scx_layered: Add more comments 2024-03-13 09:56:28 -10:00
David Vernet
91cb5ce8ab
Merge pull request #178 from sched-ext/multi_numa_rusty
rusty: Implement NUMA-aware load balancing
2024-03-12 15:50:27 -05:00
David Vernet
c8d841d50b rusty: Add comments + use VecDeque
Given the complexity of migrating load between nodes (we're doing four
nested loops), we should add a comment explaining what we're doing. This
commit does that. In addition, we use a VecDeque to store (and then
restore) push nodes and push domains so that we can re-add them to their
respective lists in load-sorted order rather than reverse-load-sorted
order. This allows us to avoid having to do unnecessary right-shifts
every time a push object is re-added to its containing list.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-12 13:49:14 -07:00
Tejun Heo
a9457a408e scx_layered: stat reporting updates 2024-03-12 10:48:21 -10:00
Tejun Heo
a642fc873b scx_layered: Fix stat reporting
GSTAT_TASK_CTX_FREE_FAILED should report total while EXCL_* should report
delta pct. Fix them.
2024-03-12 10:25:51 -10:00
David Vernet
03f68092ee rusty: Fix a few remaining issues
Fixing alignment, moving a couple bail! calls around, and adding a
missing break from move_between_nodes() that lets us bail out of a loop
early.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-12 12:44:38 -07:00
Tejun Heo
58cbc5361d scx_layered: warn if omitted stats aren't zero 2024-03-12 09:29:31 -10:00
Tejun Heo
37006d1bc1 scx_layered: Use saturating sub when reading system stats, other misc changes
Sometimes io_wait time goes in the wrong direction. Use saturating sub.
2024-03-12 06:14:06 -10:00
Tejun Heo
342a4946af scx_layered: Better pct formatting when printing stats 2024-03-11 22:18:03 -10:00
Tejun Heo
be2102775b scx_layered: Implement min_exec_us option
which can be used to penalize tasks which wake up very frequently without
doing much.
2024-03-11 22:13:11 -10:00
Tejun Heo
0c62b24993 scx_layered: Implement exclusive property
A task in an exclusive grouped or open layer occupied a whole core - the
sibling CPU is kept idle.
2024-03-11 18:27:16 -10:00
David Vernet
24d798c2ff rusty: Use a flat list of NumaNodes during LB
As Tejun pointed out in review, the disadvantage of using
push/pull/balanced lists is that if the domains inside the nodes are
balanced, we won't be able to push load between them. I'd originally
done it that way both as an optimization, but also to allow me to
iterate over the lists of pushable and pullable domains mutably. That
was addressed in the prior commit, but the nodes themselves were still
put into 3 buckets.

I think this is generally just a cleaner way of doing things, so let's
just collapse the nodes into a flat list as well. This prevents us from
having to coalesce the lists, std::mem::swap them, etc.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-11 21:04:10 -07:00
David Vernet
829b1d3ced rusty: Don't use multiple SortedVec's in struct NumaNode
Tejun pointed out that a possible issue exists in the current
implementation, wherein if you have two NUMA nodes that are imbalanced,
but their domains are internally balanced, we'll fail to migrate between
them if all nodes are in the balanced_nodes list.

To address this, let's just use a single global list for all types of
domains, and do checking internally for imbalances. The reason it was
done this way in the first place was to allow me to mutably iterate over
both vectors in a nested loop. The way around that is to just use loop
{} and push/pop domains from the list.

We could do the same thing for the NUMA nodes themselves, which are also
in 3 separate lists in the LoadBalancer. We'll do that in a subsequent
commit.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-11 21:04:10 -07:00
David Vernet
3d2507e6f2 rusty: Add separate flag for x NUMA greedy task stealing
In scx_rusty, a CPU that is going to go idle will attempt to steal tasks
from remote domains when its domain has no tasks to run, and a remote
domain has at least greedy_threshold enqueued tasks. This stealing is
temporary, but of course has a cost in that the CPU that's stealing the
task may cause it to suffer from cache misses, or in the case of
multi-node machines, remote NUMA accesses and working sets split across
multiple domains.

Given the higher cost of x NUMA work stealing, let's add a separate flag
that lets users tune the threshold for doing cross NUMA greedy task
stealing.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-11 21:02:23 -07:00
Tejun Heo
76cc337d78 scx_layered: Add exclusive option to Open and Grouped layers
Actual implementation isn't done yet.
2024-03-11 12:07:03 -10:00
Jordan Rome
54fe1c954e
Merge pull request #179 from jordalgo/bpftool
Fetch and build bpftool by default
2024-03-11 17:54:29 -04:00
Andrea Righi
bd2c18afd5 Revert "scx_rustland_core: use new consume_raw() libbpf-rs API"
In order to use the new consume_raw() API we need to depend on a version
of libbpf-rs that is not released yet.

Apparently adding such dependency may introduce a potential dependency
conflict with libbpf-sys.

Therefore, revert this change and go back to the previous consume() API.
One a new version of libbpf-rs will be out we can update all our
dependencies to use the new libbpf-rs and re-apply this patch to
scx_rustland_core.

Fixes: 7c8c5fd ("scx_rustland_core: use new consume_raw() libbpf-rs API")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-03-11 21:54:21 +01:00
Jordan Rome
ffc7b7dc4a Fetch and build bpftool by default
This pairs with the new default behavior to fetch and build libbpf
and is mostly being used so we can use the latest bpftool and libbpf.
2024-03-11 10:00:01 -07:00
Andrea Righi
b7c06b9ed9
Merge pull request #181 from sched-ext/rustland-interactive-tuning
scx_rustland: interactive tuning
2024-03-10 19:31:00 +01:00
Andrea Righi
155444e1c0 scx_rustland: set default time slice to 5ms
In line with rustland's focus on prioritizing interactive tasks, set the
default base time slice to 5ms.

This allows to mitigate potential audio craking issues or system lags
when the system is overloaded or under memory pressure condition (i.e.,
https://github.com/sched-ext/scx/issues/96#issuecomment-1978154324).

A downside of this change is to introduce potential regressions in the
throughput of CPU-intensive workloads, but in such scenarios rustland
may not be the optimal choice and alternative schedulers may be
preferred.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-03-10 14:46:11 +01:00
Andrea Righi
0a7161cbc7 scx_rustland: limit range of task weight
Some high-priority tasks may have a weight too high, that can
potentially disrupt the slice boost optimization logic, causing
interactive tasks to be less responsive.

In line with rustland's focus on prioritizing interactive tasks, prevent
giving too much CPU bandwidth to such high-priority tasks by limiting
the maximum task weight to 1000.

This allows to maintain a good level of system responsiveness even in
presence of tasks with a really high priority.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-03-10 14:39:29 +01:00
Andrea Righi
7c8c5fdd48 scx_rustland_core: use new consume_raw() libbpf-rs API
Use the new consume_raw() API provided by libbpf-rs with
https://github.com/libbpf/libbpf-rs/pull/680.

This allows to be more precise and efficient at processing tasks
consumed from the BPF ring buffer.

NOTE: the new consume_raw() API is not available yet in any official
release of the libbpf-rs crate, but cargo allows to pick versions
directly from git. This slightly increases the build time of
scx_rustland_core and the schedulers based on this crate (since we need
to recompile libbpf-rs from source), but we can re-add a proper
versioned dependency once the libbpf-rs is out.

TODO: this new API also offers the possibility to consume multiple items
from the BPF ring buffer with a single call to consume_raw(). This could
be investigated and implemented as a potential future enhancement.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-03-10 09:55:17 +01:00
David Vernet
1c3168d2a4
topology: Don't assume unique core IDs
The current topology.rs crate assumes that all cores have unique core
IDs in a system. This need not be the case, such as in certain Intel
Xeon processors which reuse core IDs in different NUMA nodes. Let's
update the crate to assume unique core IDs only per socket.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-08 15:13:46 -06:00
David Vernet
26a94b1b14
rusty: Add debug! logging to load_balance.rs
We removed the debug!() output that was previously present in main.rs. Let's
add more debug!() output that helps debug the current LB hierarchy.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-08 15:13:46 -06:00
David Vernet
0d0b101398
rusty: Add load balancing statistics to rusty
The scx_rusty load balancer is currently no longer exporting statistics such as
domain load averages, load sums, etc. Now that we're also balancing by NUMA,
we'll need a way to hierarchically illustrate load balancing statistics. This
patch adds support for that.

Signed-off-by: David Vernet <void@manifault.com>

updating stats printing

Signed-off-by: David Vernet <void@manifault.com>
2024-03-08 15:13:36 -06:00
David Vernet
0871a9525d
rusty: Add direct_greedy_numa flag
Users may want to toggle whether tasks can be temporarily sent to idle CPUs on
remote NUMA nodes. By default, we want it to be disabled as a task spanning
multiple NUMA nodes will end up having its working set spanning both nodes,
which is probably not desirable. However, in case a workload really wants to
encourage work conservation, let's add a flag that allows them to toggle it.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-08 15:12:00 -06:00
David Vernet
d0ebfb85ef
rusty: Disable direct greedy stealing between NUMA nodes
scx_rusty currently pushes tasks to idle cores if the direct greedy threshold
is exceeded, even if the core is on a remote NUMA node. This behavior is
probably not desired in most scenarios. The worst that will happen if a task is
pushed to an idle core in the same node is some L3 cache miss traffic, but for
multiple NUMA nodes, it could cause the task to have its working set span
multiple nodes.

Let's disable direct greedy work stealing across NUMA nodes. A future commit
will add a flag that's disabled by default, and let's users turn this on if
they really want to encourage work conservation.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-08 15:11:59 -06:00
David Vernet
db152cfbe8
rusty: Implement NUMA-aware load balancing
Right now, scx_rusty has no notion of domains spanning NUMA nodes, and makes no
distinction when making load balancing decisions, or work stealing. This can
cause problems on multi-NUMA machines, as load balancing and work stealing
across NUMA nodes has significantly different cost from across L3 cache
boundaries.

In order to better support multi-NUMA machines, this commit adds another layer
to the rusty load balancer, which balances across NUMA nodes using a different
cost function from balancing across domains. Load balancing now takes place
over the span of two passes:

1. In the first pass, we fix imbalances across NUMA nodes by moving tasks
   between domains across those NUMA node boundaries. We require a load
   imbalance of at least 17% in order to move load at this stage. The ratio of
   load imbalance we attempt to adjust (50%) and the maximum amount of load
   we're allowed to push out of a domain (50%) is still the same as when
   balancing between domains inside a NUMA node, but this is easy to tune with
   the current setup.

2. Once we've balanced across NUMA nodes, we iterate over all nodes and balance
   between the domains within each NUMA node. The cost function here is the
   same as what it has been thus far: we require at least a 5% imbalance in
   order to trigger load balancing.

There are a few additional changes / improvements to load balancing in this
commit:

1. NUMA nodes and domains are now ordered according to their load by using
   SortedVec objects. We were previously using BTreeMap keyed by load, but this
   was suboptimal due to the fact that it doesn't allow duplicate entries.

2. We're no longer exporting load balancing statistics as a vector of data such
   as load sums, averages, and imbalances. This is instead all encapsulated in
   the load balancing hierarchy we setup in lb.load_balance(). These statistics
   are not yet exported, but they will be in a subsequent commit.

One of the issues with this commit is that it does introduce some
almost-identical logic that somehow begs to be deduplicated. For example, when
we balance between NUMA nodes, the logic for iterating over push nodes and
pushing to pull nodes is very similar to the logic of iterating over push
domains and pull domains when balancing within a node. It may be that this can
be improved.

The following are some benchmarks run on an Intel Xeon Gold 6138 (2 x 40 core
processor):

kcompile
--------

On Commit a27648c74210 ("afs: Fix setting of mtime when creating a
file/dir/symlink"):

1. make allyesconfig
2. make -j $(nproc) built-in.a
3. make -j clean
4. goto 2

Runtime
-------

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 562.688s  | 566.085s  | -.6%     |
---------o-----------o-----------o----------o
Variance | 0.54387   | 0.72431   | -24.9%   |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 562.688s  | 563.209s  | -.092%   |
---------o-----------o-----------o----------o
Variance | 0.54387   | 0.42038   | 29.38%   |
---------o-----------o-----------o----------o

scx_rusty with NUMA awareness clearly beats CFS, but only barely beats
scx_rusty without it. This isn't necessarily super surprising given that
this is kcompile, which has very poor front-end CPU locality. Further
experimentation with toggling the cost function for performing
migrations may improve this further.

CPU util
--------

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 7654.25%  | 7551.67%  | 1.11%    |
---------o-----------o-----------o----------o
Variance | 165.35714 | 158.3333  | 4.436%   |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 7654.25%  | 7641.57%  | 0.1659%  |
---------o-----------o-----------o----------o
Variance | 165.35714 | 1230.619  | -86.5%   |
---------o-----------o-----------o----------o

As expected, CPU util is quite a bit higher with scx_rusty than it is
with CFS. Further experiments that could be interesting are always
enabling direct-greedy stealing between domains within a NUMA node, and
then comparing rusty NUMA and rusty ORIG. rusty NUMA prevents stealing
between NUMA nodes, so this would show whether the locality introduced
by NUMA awareness appropriately offsets the loss of work conservation.

Major PFs
---------

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 5332      | 3950      | 36.566%  |
---------o-----------o-----------o----------o
Variance | 6975.5    | 5986.333  | 16.5237% |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 5332      | 5336.5    | -.084%   |
---------o-----------o-----------o----------o
Variance | 6975.5    | 955.5     | 630.03%  |
---------o-----------o-----------o----------o

Also as expected, major page faults are far highe higher with scx_rusty
than with CFS. This is expected even with NUMA awareness, given that
scx_rusty is still less sticky than CFS.

Further experiments that could be interesting are tuning the threshold
for which we perform x NUMA migrations to try and keep this value even
lower. The rate of major page faults between rusty NUMA and rusty ORIG
were very close, though rusty NUMA was a bit lower.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-08 15:11:17 -06:00
David Vernet
0b1c3713b2
rusty: Remove lb_apply_weight param from lb_step()
Let's just query self.tuner.fully_utilized directly and save a few lines of
code.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-08 15:11:17 -06:00
David Vernet
758f762058
rusty: Move LoadBalancer out of rusty.rs
More cleanup of scx_rusty. Let's move the LoadBalancer out of rusty.rs and into
its own file. It will soon be extended quite a bit to support multi-NUMA and
other multivariate LB cost functions, so it's time to clean things up and split
it out.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-08 15:11:17 -06:00
David Vernet
94f75bcec6
rusty: Refactor Tuner and DomainGroup out of rusty.rs
rusty.rs is growing a bit unwieldy. We're going to want to update its load
balancing logic somewhat significantly to account for multi-NUMA and other cost
functions, so let's start cleaning the code up so that things are more
logically segmented and easier to work with.

To start, we move the Tuner and DomainGroup/Domain objects into their own
modules.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-08 15:10:37 -06:00
Andrea Righi
be5e51dfaa scx_rlfifo: print a performance warning banner
scx_rlfifo is provided as a simple example to show how to use
scx_rustland_core and it's not supposed to be used in a real production
environment.

To prevent performance bug reports print an explicit warning when it's
started to clarify the goal of this scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-03-05 19:36:17 +01:00
Andrea Righi
fe19754132 scx_rlfifo: replace 1ms sleep with sched_yield()
Small improvement to make the scheduler a bit more responsive, without
introducing too much complexity or too much CPU overhead.

This can be achieved by replacing a sleep of 1ms with a sched_yield()
every time that the scheduler has finished to dispatch all the queued
tasks.

This also makes the code a bit smaller and easier to read.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-03-05 18:42:24 +01:00
Andrea Righi
5cf113f058 scx_rustland_core: provide DispatchedTask API methods
Provide distinct methods to set the target CPU and the per-task time
slice to dispatched tasks.

Moreover, also provide a constructor to create a DispatchedTask from a
QueuedTask (this allows to automatically bounce a task from the
scheduler to the BPF dispatcher without having to take care of setting
the individual task's attributes).

This also allows to make most of the attributes of DispatchedTask
private, especially it allows to hide cpumask_cnt, that should be only
used internally between the BPF and the user-space component.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-03-03 15:49:37 +01:00
Andrea Righi
e10f8a2d8e scx_rustland_core: introduce per-task time slice
Provide a way to set a different time slice per-task, by adding a new
attribute slice_ns to the DispatchedTask struct.

This attribute determines the time slice assigned to the task, if it is
set to 0 then the global time slice (either the default one or the
effective one, if set) will be used.

At the same time, remove the payload attribute, that is basically unused
(scx_rustland uses it to send the task's vruntime to the BPF dispatcher
for debugging purposes, but it's not very useful anymore at this point).

In the future we may introduce a proper interface to attach a custom
payload to each task with a proper interface.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-03-03 15:06:56 +01:00
Jordan Rome
499924ead8 Add libbpf as a submodule
This is to potentinally reduce issues with folks
using different versions of libbpf at runtime.

This also:
- makes static linking of libbpf the default
- adds steps in `meson setup` to fetch libbpf and make it
2024-03-01 12:39:35 -08:00
Andrea Righi
0d1c6555a4 scx_rustland_core: generate source files in-tree
There is no need to generate source code in a temporary directory with
RustLandBuilder(), we can simply generate code in-tree and exclude the
generated source files from .gitignore.

Having the generated source files in-tree can help to debug potential
build issues (and it also allows to drop the the tempfile crate
dependency).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
Andrea Righi
2ac1a5924f scx_rustland_core: introduce RustLandBuilder()
Introduce a wrapper to scx_utils::BpfBuilder that can be used to build
the BPF component provided by scx_rustland_core.

The source of the BPF components (main.bpf.c) is included in the crate
as an array of bytes, the content is then unpacked in a temporary file
to perform the build.

The RustLandBuilder() helper is also used to generate bpf.rs (that
implements the low-level user-space Rust connector to the BPF
commponent).

Schedulers based on scx_rustland_core can simply use RustLandBuilder(),
to build the backend provided by scx_rustland_core.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
Andrea Righi
e23426e299 scx_rustland_core: introduce method bpf.update_tasks()
Introduce a helper function to update the counter of queued and
scheduled tasks (used to notify the BPF component if the user-space
scheduler has still some pending work to do).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
Andrea Righi
00e25530bc scx_rlfifo: simple user-space FIFO scheduler written in Rust
Implement a FIFO scheduler as an example usage of scx_rustland_core.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
Andrea Righi
cf43129d89 scx_rustland: update documentation
scx_rustland has significantly evolved since its original design.

With the introduction of scx_rustland_core and the inclusion of the
scx_rlfifo example, scx_rustland's focus can be shifted from solely
being an "easy-to-read Rust scheduler template" to a fully functional
scheduler.

For this reason, update the README and documentation to reflect its
revised design, objectives, and intended use cases.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
Andrea Righi
871a6c10f9 scx_rustland_core: include scx_rustland backend
Move the BPF component of scx_rustland to scx_rustland_core and make it
available to other user-space schedulers.

NOTE: main.bpf.c and bpf.rs are not pre-compiled in the
scx_rustland_core crate, they need to be included in the user-space
scheduler's source code in order to be compiled/linked properly.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
Andrea Righi
416d6a940f rust: introduce scx_rustland_core crate
Introduce a separate crate (scx_rustland_core) that can be used to
implement sched-ext schedulers in Rust that run in user-space.

This commit only provides the basic layout for the new crate and the
abstraction to the custom allocator.

In general, any scheduler that has a user-space component needs to use
the custom allocator to prevent potential deadlock conditions, caused by
page faults (a kthread needs to run to resolve the page fault, but the
scheduler is blocked waiting for the user-space page fault to be
resolved => deadlock).

However, we don't want to necessarily enforce this constraint to all the
existing Rust schedulers, some of them may do all user-space allocations
in safe paths, hence the separate scx_rustland_core crate.

Merging this code in scx_utils would force all the Rust schedulers to
use the custom allocator.

In a future commit the scx_rustland backend will be moved to
scx_rustland_core, making it a totally generic BPF scheduler framework
that can be used to implement user-space schedulers in Rust.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
David Vernet
8b04a2687f
rusty: Use new infeasible crate
Now that we have a new 'infeasible' crate that abstracts the logic for
implementing the infeasible weights solution. Let's update rusty to use
it.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-26 10:51:54 -06:00
David Vernet
87eab38506
rustland: Update rustland to use topology.rs
The new topology crate allows us to replace the custom rustland topology
logic with the logic in the topology crate itself.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-23 13:09:06 -06:00
David Vernet
43624a87ce
rusty: Use new topology crate
Now that we have this new Topology crate, let's update Rusty to use it
instead of using the old one.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-23 10:39:55 -06:00
Tejun Heo
4dc77f8ddf
Merge pull request #149 from davemarchevsky/davemarchevsky_nice_equals
scx_layered: Add MATCH_NICE_EQUALS match kind
2024-02-22 06:38:17 -10:00
Dave Marchevsky
9f510f18cd scx_layered: Add MATCH_NICE_EQUALS match kind
I have a usecase where specific nice values are used to bucket tasks
into groups that are handled separately by different `scx_layered`
policies, with no implications of relative priority between niceness X,
X + 1, X - 1, etc. In other words, nicevals are used as simple tags in
this scenario.

If we wanted to treat a specific niceness this way e.g. `11`, we could
do so with AND'd MATCH_NICE_{ABOVE,BELOW} like so:

```json
  "matches" : [
    [
      {
        "NiceAbove": 10
      },
      {
        "NiceBelow": 12
      },
    ],
  ],
```

But this is unnecessarily verbose and doesn't communicate the intent of
the match very well. Adding a `NiceEquals` matcher simplifies the
config and makes intent obvious:

```json
  "matches" : [
    [
      {
        "NiceEquals": 11
      },
    ],
  ],
```

This PR adds support for such a matcher.

Also, rename `layer_match.nice_above_or_below` to just
`layer_match.nice`, as the former doesn't describe the newly-added
matcher's use of the field. It's still obvious that `layer_match.nice`
is relevant to MATCH_NICE_{ABOVE, BELOW, EQUALS}.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
2024-02-22 04:08:07 -08:00
David Vernet
615b594e1c
layered: Don't refresh cpumasks before attaching
As mentioned in the previous commit, for some reason we're sometimes
(non-deterministically) not seeing the updated cpumask / layer values in
BPF if we initialize the cpumasks here before attaching. Let's undo this
for now so to avoid observing buggy behavior, until we figure it out.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-21 19:19:45 -06:00
David Vernet
68d317079a
Revert "layered: Set layered cpumask in scheduler init call"
This reverts commit 56ff3437a2.

For some reason we seem to be non-deterministically failing to see the
updated layer values in BPF if we initialize before attaching. Let's
just undo this specific part so that we can unblock this being broken,
and we can figure it out async.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-21 19:17:19 -06:00
David Vernet
31df8fbd09
layered: Consume from layer with cpumask in layered_dispatch
Currently, in layered_dispatch, we do the following:

1. Iterate over all preempt=true layers, and first try to consume from
   them.

2. Iterate over all confined layers, and consume from them if we find a
   layer with a cpumask that contains the consuming CPU.

3. Iterate over all grouped and open layers and consume from them in
   ordered sequence.

In (2), we're only iterating over confined layers, but we should also be
iterating over grouped layers. Otherwise, despite a consuming CPU being
allocated to a specific grouped layer, the CPU will consume from
whichever grouped or open layer has a task that's ready to run.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-21 15:38:23 -06:00
David Vernet
56ff3437a2
layered: Set layered cpumask in scheduler init call
In layered_init, we're currently setting all bits in every layers'
cpumask, and then asynchronously updating the cpumasks at later time to
reflect their actual values at runtime. Now that we're updating the
layered code to initialize the cpumasks before we attach the scheduler,
we can instead have the init path actually refresh and initialize the
cpumasks directly.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-21 15:38:23 -06:00
David Vernet
1f834e7f94
layered: Initialize layers before attaching scheduler
We currently have a bug in layered wherein we could fail to propagate
layer updates from user space to kernel space if a layer is never
adjusted after it's first initialized. For example, in the following
configuration:

[
	{
		"name": "workload.slice",
		"comment": "main workload slice",
		"matches": [
			[
				{
					"CgroupPrefix": "workload.slice/"
				}
			]
		],
		"kind": {
			"Grouped": {
				"cpus_range": [30, 30],
				"util_range": [
					0.0,
					1.0
				],
				"preempt": false
			}
		}
	},
	{
		"name": "normal",
		"comment": "the rest",
		"matches": [
			[]
		],
		"kind": {
			"Grouped": {
				"cpus_range": [2, 2],
				"util_range": [
					0.0,
					1.0
				],
				"preempt": false
			}
		}
	}
]

Both layers are static, and need only be resized a single time, so the
configuration would never be propagated to the kernel due to us never
calling update_bpf_layer_cpumask(). Let's instead have the
initialization propagate changes to the skeleton before we attach the
scheduler.

This has the advantage both of fixing the bug mentioned above where a
static configuration is never propagated to the kernel, and that we
don't have a short period when the scheduler is first attached where we
don't make optimal scheduling decisions due to the layer resizing not
being propagated.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-21 15:38:21 -06:00
Tejun Heo
22d635c385
Merge pull request #141 from jordalgo/rusty-logging
Add libbpf logging to rust schedulers
2024-02-20 13:52:39 -10:00
Andrea Righi
80de48ec83 scx_rustland: introduce --builtin-idle
Add a command line option to enable/disable the sched-ext built-in idle
selection logic in the user-space scheduler.

With this option the user-space scheduler will try to dispatch tasks on
the CPU selected during the .select_cpu() phase (using the built-in idle
selection logic).

Without this option the user-space scheduler will try to dispatch tasks
to the first CPU available.

The former can be useful to improve throughput, since tasks are more
likely to stick on the same CPU, while the latter can provide better
system responsiveness, especially when the system is significantly busy.

Given that, by default, tasks can be dispatched directly bypassing the
user-space scheduler if an idle CPU is found during .select_cpu(), the
user-space scheduler is primarily engaged only when the system is busy
(no idle CPUs are available). Under these circumstances, it is typically
more efficient to dispatch tasks on the first available CPU. Hence, the
default behavior is to ignore built-in idle selection logic in the
user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-21 00:25:14 +01:00
Andrea Righi
e487d71032 scx_rustland: simply CPU selection by relying on built-in idle selection
Checking if a CPU is idle or busy in the user-space scheduler is a bit
redundant, considering that we also rely on the built-in idle selection
logic in the BPF part.

Therefore get rid of the additional idle selection logic in the
user-space scheduler and rely on the built-in idle selection.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-21 00:25:14 +01:00
Andrea Righi
2cd1d4b684 scx_rustland: introduce --full-user
Introduce an option to send all scheduling events and actions to
user-space, disabling any form of in-kernel optimization.

Enabling this option will likely make the system less responsive (but
more predictable in terms of performance) and it can be useful for
debugging purposes.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-21 00:25:14 +01:00
Jordan Rome
7c32acece0 Add libbpf logging to the rust schedulers
This is to get better logs when failing to load, attach, etc.
2024-02-20 15:17:10 -08:00
David Vernet
ef8aa9ea31
add documentation
Signed-off-by: David Vernet <void@manifault.com>
2024-02-20 14:57:09 -06:00
David Vernet
8aba090d4f
rust: Add topology module to utils crate
scx_rusty has logic in the scheduler to inspect the host to
automatically build scheduling domains across every L3 cache. This would
be generically useful for many different types of schedulers, so let's
add it to the scx_utils crate so it can be used by others.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-20 14:57:09 -06:00
Andrea Righi
7ff06a6ff0 scx_rustland: prevent misaligned pointer dereference
The buffer used to store struct queued_task_ctx items fetched from the
BPF ring buffer needs to be aligned to the architecture register size,
otherwise we may hit misaligned pointer dereference issues, such as:

  thread 'main' panicked at src/bpf.rs:162:43:
  misaligned pointer dereference: address must be a multiple of 0x8 but is 0x56516a51e004
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Prevent this by making sure the buffer is always aligned to 64-bits.

Fixes: 93dc615 ("scx_rustland: use a ring buffer for queued tasks")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-20 19:08:38 +01:00
Andrea Righi
93dc615653 scx_rustland: use a ring buffer for queued tasks
Switch from a BPF_MAP_TYPE_QUEUE to a BPF_MAP_TYPE_RINGBUF to store the
tasks that need to be processed by the user-space scheduler.

A ring buffer allows to save a lot of memory copies and syscalls, since
the memory is directly shared between the BPF and the user-space
components.

Performance profile before this change:

  2.44%  [kernel]  [k] __memset
  2.19%  [kernel]  [k] __sys_bpf
  1.59%  [kernel]  [k] __kmem_cache_alloc_node
  1.00%  [kernel]  [k] _copy_from_user

After this change:

  1.42%  [kernel]  [k] __memset
  0.14%  [kernel]  [k] __sys_bpf
  0.10%  [kernel]  [k] __kmem_cache_alloc_node
  0.07%  [kernel]  [k] _copy_from_user

Both the overhead of sys_bpf() and copy_from_user() are reduced by a
factor of ~15x now (only the dispatch path is using sys_bpf() now).

NOTE: despite being very effective, the current implementation is a bit
of a hack. This is because the present ring buffer API exclusively
permits consumption in a greedy manner, where multiple items can be
consumed simultaneously. However, libbpf-rs does not provide precise
information regarding the exact number of items consumed. By utilizing a
more refined libbpf-rs API [1] we may be able to improve this code a
bit.

Moreover, libbpf-rs doesn't provide an API for the user_ring_buffer, so
at the moment there's not a trivial way to apply the same change to the
dispatched tasks.

However, just with this change applied, the overhead of sys_bpf() and
copy_from_user() is already minimal, so we won't get much benefits by
changing the dispatch path to use a BPF ring buffer.

[1] https://github.com/libbpf/libbpf-rs/pull/680

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-20 12:30:22 +01:00
Andrea Righi
04685e633f scx_rustland: avoid memory copies while accessing cpu_map
Instead of using a BPF_MAP_TYPE_ARRAY to store which tasks are running
on which CPU we can simply use a global array, mapped in the user-space
address space.

In this way we can avoid a lot of memory copies and call to sys_bpf(),
significantly reducing the scheduler's overhead.

Keep in mind that we don't need to be 100% correct while accessing this
information, so we can accept some fuzziness in order to significantly
reduce the scheduler's overhead.

Performance profile before this change:

   5.52%  [kernel]  [k] __sys_bpf
   4.84%  [kernel]  [k] __kmem_cache_alloc_node
   4.71%  [kernel]  [k] map_lookup_elem
   4.10%  [kernel]  [k] _copy_from_user
   3.51%  [kernel]  [k] bpf_map_copy_value
   3.12%  [kernel]  [k] check_heap_object

After this change:

   2.20%  [kernel]  [k] __sys_bpf
   1.91%  [kernel]  [k] map_lookup_and_delete_elem
   1.60%  [kernel]  [k] __kmem_cache_alloc_node
   1.10%  [kernel]  [k] _copy_from_user
   0.12%  [kernel]  [k] check_heap_object
                    n/a bpf_map_copy_value
                    n/a map_lookup_elem

With this change we can reduce the overhead of sys_bpf() by ~2x and
the overhead of copy_from_user() by ~4x.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-20 12:30:16 +01:00
Andrea Righi
fc889c6995 scx_rustland: replace custom allocator with buddy-alloc
Currently, the primary bottleneck in scx_rustland lies within its custom
memory allocator, which is used to prevent page faults in the user-space
scheduler.

This is pretty evident looking at perf top:

  39.95%  scx_rustland             [.] <scx_rustland::bpf::alloc::RustLandAllocator as core::alloc::global::GlobalAlloc>::alloc
   3.41%  [kernel]                 [k] _copy_from_user
   3.20%  [kernel]                 [k] __kmem_cache_alloc_node
   2.59%  [kernel]                 [k] __sys_bpf
   2.30%  [kernel]                 [k] __kmem_cache_free
   1.48%  libc.so.6                [.] syscall
   1.45%  [kernel]                 [k] __virt_addr_valid
   1.42%  scx_rustland             [.] <scx_rustland::bpf::alloc::RustLandAllocator as core::alloc::global::GlobalAlloc>::dealloc
   1.31%  [kernel]                 [k] _copy_to_user
   1.23%  [kernel]                 [k] entry_SYSRETQ_unsafe_stack

However, there's no need to reinvent the wheel here, rather than relying
on an overly simplistic and inefficient allocator, we can rely on
buddy-alloc [1], which is also capable of operating on a preallocated
memory buffer.

After switching to buddy-alloc, the performance profile under the same
workload conditions looks like the following:

   6.01%  [kernel]                 [k] _copy_from_user
   5.21%  [kernel]                 [k] __kmem_cache_alloc_node
   4.45%  [kernel]                 [k] __sys_bpf
   3.80%  [kernel]                 [k] __kmem_cache_free
   2.79%  libc.so.6                [.] syscall
   2.34%  [kernel]                 [k] __virt_addr_valid
   2.26%  [kernel]                 [k] _copy_to_user
   2.14%  [kernel]                 [k] __check_heap_object
   2.10%  [kernel]                 [k] __check_object_size.part.0
   2.02%  [kernel]                 [k] entry_SYSRETQ_unsafe_stack

With this change in place, the primary overhead is now moved to the
bpf() syscall and the copies between kernel and user-space (this could
potentially be optimized in the future using BPF ring buffers, instead
of BPF FIFO queues).

A better focus at the allocator overhead before vs after this change:

 [before]
 39.95%  scx_rustland  [.] core::alloc::global::GlobalAlloc>::alloc
  1.42%  scx_rustland  [.] core::alloc::global::GlobalAlloc>::dealloc

 [after]
  1.50%  scx_rustland  [.] core::alloc::global::GlobalAlloc>::alloc
  0.76%  scx_rustland  [.] core::alloc::global::GlobalAlloc>::dealloc

[1] https://crates.io/crates/buddy-alloc

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-11 14:33:39 +01:00
Andrea Righi
ccf5946425 scx_rustland: speed up search by PID in tasks BTreeSet
In order to prevent duplicate PIDs in the TaskTree (BTreeSet), we
perform an O(N) search each time we add an item, to verify whether the
PID already exists or not.

Under heavy stress test conditions the O(N) complexity can have a
potential impact on the overall performance.

To mitigate this, introduce a HashMap that can be used to retrieve tasks
by PID typically with a O(1) complexity. This could potentially degrade
to O(N) in presence of hash collisions, but even in this case, accessing
the hash map is still more efficient than scanning all the entries in
the BTreeSet to search for the target PID.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-11 14:11:38 +01:00
Andrea Righi
7ce0d038e4
Merge pull request #133 from sched-ext/rustland-cpumask-gen-cnt
scx_rustland: per-task cpumask generation counter
2024-02-10 19:07:02 +01:00
Andrea Righi
61d1ed338a scx_rustland: per-task cpumask generation counter
Introduce a per-task generation counter to check the validity of the
cpumask at dispatch time.

The logic is the following:

 - the cpumask generation number is incremented every time a task
   calls .set_cpumask()

 - when a task is enqueued the current generation number is stored in
   the queued_task_ctx and relayed to the user-space scheduler

 - the user-space scheduler can decide to dispatch the task on the CPU
   determined by the BPF layer in .select_cpu(), redirect the task to
   any other specific CPU, or redirect to the first CPU available (using
   NO_CPU)

 - task is then dispatched back to the BPF code along with its cpumask
   generation counter

 - at dispatch time the BPF code checks if the generation number is the
   same and it discards the dispatch attempt if the cpumask is not valid
   anymore (the task will be automatically re-enqueued by the sched-ext
   core code, potentially selecting another CPU / cpumask)

 - if the cpumask is valid, but the CPU selected by the user-space
   scheduler is invalid (according to the cpumask), the task will be
   transparently bounced by the BPF code to the shared DSQ (in this way
   the user-space code can be completely abstracted and dispatches that
   target invalid CPUs can be automatically fixed by the BPF layer)

This solution can prevent stalls due to dispatches targeting invalid
CPUs and it can also avoid redundant dispatch events, making the code
more efficient and the cpumask interlocking more reliable.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-10 18:02:42 +01:00
David Vernet
1c00de9402
Merge pull request #129 from sched-ext/infeasible_weights
Implement solution to infeasible weights problem
2024-02-09 16:23:56 -06:00
David Vernet
e627176d90
scx: Implement solution to infeasible weights problem
As described in [0], there is an open problem in load balancing called
the "infeasible weights" problem. Essentially, the problem boils down to
the fact that a task with disproportionately high load can be granted
more CPU time than they can actually consume per their duty cycle.

This patch implements a solution to that problem, wherein we apply the
algorithm described in this paper to adjust all infeasible weights in
the system down to a feasible wight that gives them their full duty
cycle, while allowing the remaining feasible tasks on the system to
share the remaining compute capacity on the machine.

[0]: https://drive.google.com/file/d/1fAoWUlmW-HTp6akuATVpMxpUpvWcGSAv/view?usp=drive_link

Signed-off-by: David Vernet <void@manifault.com>
2024-02-09 16:23:12 -06:00
Andrea Righi
8e47602f00 scx_rustland: keep default CPU selection when idle
Dispatch to the shared DSQ (NO_CPU) only when the assigned CPU is not
idle anymore, otherwise maintain the same CPU that has been assigned by
the BPF layer.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-08 22:48:07 +01:00
Andrea Righi
7085d57709 scx_rustland: kick user-space scheduler when a CPU is released
When the system is not being fully utilized there may be delays in
promptly awakening the user-space scheduler.

This can happen for example, when some CPU-intensive tasks are
constantly dispatched bypassing the user-space scheduler (e.g., using
SCX_DSQ_LOCAL) and other CPUs are completely idle.

Under this condition the update_idle() can fail to activate the
user-space scheduler, because there are no pending events, and only the
periodic timer will wake up the scheduler, potentially introducing lags
of up to 1 sec.

This can be reproduced, for example, running a video game that doesn't
use all the CPUs available in the system (i.e., Team Fortress 2). With
this game it is pretty easy to notice sporadic lags that are resumed
after ~1sec, due to the periodic timer kicking scheduler.

To prevent this from happening wake up the user-space scheduler
immediately as soon as a CPU is released, speculating on the fact that
most of the time there will be always another task ready to run.

This can introduce a little more overhead in the scheduler (due to
potential unnecessary wake up events), but it also prevents stuttery
behaviors and it makes the system much more smooth and responsive,
especially with video games.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-08 22:48:07 +01:00
Andrea Righi
cb82d91e0f scx_rustland: use scx_bpf_dispatch_cancel()
Use scx_bpf_dispatch_cancel() to invalidate dispatches on wrong per-CPU
DSQ, due to cpumask race conditions, and redirect them to the shared
DSQ.

This prevents dispatching tasks to CPU that cannot be used according to
the task's cpumask.

With this applied the scheduler passed all the `stress-ng --race-sched`
stress tests.

Moreover, introduce a counter that is periodically reported to stdout as
an additional statistic, that can be helpful for debugging.

Link: https://github.com/sched-ext/sched_ext/pull/135
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-08 22:48:07 +01:00
Andrea Righi
13e23e8cc9 scx_rustland: dump scheduler statistics before exiting
Print all the scheduler statistics before exiting. Reporting the very
last state of the scheduler can help to debug events that could trigger
error conditions (such as page faults, scheduler congestions, etc.).

While at it, fix also some minor coding style issues (tabs vs spaces).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-08 15:37:44 +01:00
David Vernet
c574598dc7
scx_rusty: Fix typos
Signed-off-by: David Vernet <void@manifault.com>
2024-02-07 23:38:26 -06:00
Tejun Heo
2062d1ad1f scx: Add compat support for SCX_KICK_IDLE and use it for idle CPU wakeups
SCX_KICK_IDLE is a new feature which isn't defined in older kernels. Add
compat wrapper and use it for idle CPU wakeups.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-06 15:28:40 -10:00
Andrea Righi
acb174aa51 scx_rustland: prevent duplicate PIDs in the task BTreeSet
Items in the task BTreeSet are stored by pid and vruntime. Make sure
that we never store multiple items with the same PID, so that
re-enqueued tasks are not dispatched multiple times.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-03 14:46:39 +01:00
Andrea Righi
681b3fd807 scx_rustland: more aggressive time slice scaling
Allow to scale the effective time slice down to 250 us. This can help to
maintain a good quality of the audio even when the system is overloaded
by multiple CPU-intensive tasks.

Moreover, always round up the time slice scaling factor to be a little
more aggressive and prioritize at scaling the time slice, so that we can
prioritize low latency tasks even more.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-01 16:40:59 +01:00
Andrea Righi
26d6d530f0 scx_rustland: enhance interactive task classification
Evaluate the number of voluntary context switches per second (nvcsw/sec)
for each task using an exponentially weighted moving average (EWMA) with
weight 0.5, that allows to classify interactive tasks with more
accuracy.

Using a simple average over a period of time of 10 sec can introduce
small lags every 10 sec, as the statistics for the number of voluntary
context switches are refreshed. This can result in interactive tasks
taking a brief time to catch up in order to be accurately classified as
so, causing for example short audio cracks, small drop of 5-10 fps in
games, etc.

Using a EMWA allows to smooth the average of nvcsw/sec, preventing short
lags in the interactive tasks, while also preventing to incorrectly
classify as interactive tasks that may experience an isolated short
burst of voluntary context switches.

This patch has been tested with the usual test case of playing a
videogame while running a parallel kernel build in the background.

Without this patch the short lag every 10 sec is clearly noticeable,
with this patch applied the game and audio run smoothly.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-01 16:40:59 +01:00
Andrea Righi
baeea306fc scx_rustland: rely on the built-in idle selection logic
Simplify the idle selection logic by relying only on the built-in idle
selection performed in the BPF layer.

When there are idle CPUs available in the system, tasks are dispatched
directly by the BPF dispatcher without invoking the user-space
scheduler. This allows to avoid the user-space overhead and get the best
system performance when CPU resources are not overcommitted.

Once the number of tasks exceeds the available CPUs, the user-space
scheduler takes over. However, by this time, the system is already
overcommitted, so there's little advantage in attempting to pinpoint the
optimal idle CPU through the user-space scheduler. Instead, tasks can be
executed on the first available CPU, consistently dispatching them to
the shared DSQ.

This allows to achieve the optimal performance both with system
under-utilization and over-utilization.

With this change in place the user-space scheduler won't dispatch tasks
directly to specific CPUs, but we still want to keep this as a generic
feature in the BPF layer, so that it can be potentially used in the
future by this scheduler or even by other user-space schedulers (once
the BPF layer will be moved to a more generic place).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-01 16:40:59 +01:00
Andrea Righi
b9e60f71ed scx_rustland: usersched: code refactoring
No functional change, just move code around to make it more readable.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-01 16:40:59 +01:00
Andrea Righi
d13ed5c025 scx_rustland: BPF: refine CPU dispatch logic
When the user-space scheduler dispatches a task on a specific CPU, that
CPU might not be valid, since the user-space doesn't have visibility of
the task's cpumask.

When this happens the BPF dispatcher (that has direct visibility of the
cpumask) should automatically redirect the task to a valid CPU, but
instead of bouncing the task on the shared DSQ, we should try to use the
CPU assigned by the built-in idle selection logic.

If this CPU is also not valid, then we can simply ignore the task, that
has been de-queued and re-enqueued, since a valid CPU will be naturally
re-selected at a later time.

Moreover, avoid to kick any specific CPU when the task is dispatched to
shared DSQ, since the task can be consumed on any CPU and the additional
kick would simply add more overhead.

Lastly, rename dsq_id_to_cpu() to dsq_to_cpu() and cpu_to_dsq_id() to
cpu_to_dsq() for more clarity.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-01 16:38:17 +01:00
Andrea Righi
45d8b54eb9 scx_rustland: re-introduce per-CPU DSQ + a global shared DSQ
With commit c6ada25 ("scx_rustland: use custom pcpu DSQ instead of
SCX_DSQ_LOCAL{_ON}") we tried to introduce custom per-CPU DSQs, instead
of using SCX_DSQ_LOCAL and SCX_DSQ_LOCAL_ON to dispatch tasks.

This was required, because dispatching tasks using SCX_DSQ_LOCAL_ON
doesn't provide a guarantee that the cpumask, checked at dispatch time
to determine the validity of a target CPU, remains valid.

This method solved the cpumask validity issue, but unfortunately it
introduced a noticeable performance regression and a potential
starvation issue (that were probably caused by the same problem): if a
task is assigned to a CPU in select_cpu() and the scheduler decides to
dispatch it on a different CPU, the task will be added to the new CPU's
DSQ, but if no dispatch event happens there, the task may remain stuck
in the per-CPU DSQ for a long time, triggering the sched-ext watchdog
timeout that would kick out the scheduler, for example:

  12:53:28 [WARN] FAIL: IPC:CSteamEngin[7217] failed to run for 6.482s (err=1026)
  12:53:28 [INFO] Unregister RustLand scheduler

Therefore, we reverted this change with 6d89ece ("scx_rustland: dispatch
tasks only on the global DSQ"), dispatching all the tasks to the global
DSQ, completely delegating the kernel to distribute tasks among the
available CPUs.

This is not the ideal solution, because we still want to give the
possibility to the user-space scheduler to assign tasks to specific
CPUs.

Therefore, re-introduce distinct per-CPU DSQs, but also provide a global
shared DSQ. Tasks dispatched in the per-CPU DSQs are consumed from the
dispatch() callback of their corresponding CPU, tasks dispatched in the
global shared DSQ are consumed from any CPU.

In this way the BPF layer is able to provide an interface that gives
the flexibility to the user-space to dispatch a task on a specific CPU
or on the first CPU available, depending on the particular scheduler's
need.

If an invalid CPU (according to the cpumask) is selected the BPF
dispatcher will transparently redirect the task to a valid CPU, selected
using the built-in idle selection logic.

In the future we may want to improve this part, giving to the
user-space the visibility of the cpumask, in order to pick a valid CPU
in advance and in a proper synchronized way.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-01 00:33:35 +01:00
Andrea Righi
b5e846c538 scx_rustland: BPF: small refactoring
No functional change, just some refactoring to make the code more clear.

We have is_usersched_needed() and set_usersched_needed() that are doing
different things (the former is checkig if there are pending tasks for
the scheduler, the latter is setting the usersched_needed flag to
activate the dispatch of the user-space scheduler).

Rename is_usersched_needed() to usersched_has_pending_tasks() to make
the code more clear and understandable.

Also move dispatch_user_scheduler() closer to the other dispatch-related
helper functions.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-01 00:33:35 +01:00
Tejun Heo
6db362b27a scx_rustland: Use scx_utils::user_exit_info
Instead of the bespoke implementation. This also makes scx_rustland to print
out debug dump if exists.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-31 11:44:15 -10:00
Tejun Heo
965926f393 scx_rusty: Use scx_utils::user_exit_info
Instead of the bespoke implementation. This also makes scx_rusty to print
out debug dump if exists.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-31 11:08:17 -10:00
Tejun Heo
105dc36b8f scx_layered: Use scx_utils::user_exit_info
Instead of the bespoke implementation. This also makes scx_layered to print
out debug dump if exists.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-31 10:54:20 -10:00
Tejun Heo
4ee8104a6d
Merge pull request #114 from dschatzberg/local_avoid_enqueue
scx_layered: dispatch from select_cpu if possible
2024-01-31 08:33:26 -10:00
Dan Schatzberg
11e487c165 scx_layered: dispatch from select_cpu if possible
If we are doing local dispatch, we can avoid enqueue() altogether by
dispatching from select_cpu()

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-31 09:54:26 -08:00
Jordan Rome
1b3a9a1e72 [scx_layered] downgrade prometheus-client
This library at version 22 is not available in fedora:
https://src.fedoraproject.org/rpms/rust-prometheus-client

Rather than bothering the maintainer, let's just downgrade here.
2024-01-31 04:36:01 -08:00
Dan Schatzberg
ab5635ff6d scx_layered: Grab idle_smtmask a bit later
This is a really minor optimization, but we don't need idle_smtmask to
schedule pinned tasks, so defer it so the nr_cpus_allowed == 1 path is
marginally faster.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-29 08:16:37 -08:00
Dan Schatzberg
8c9e65d880 scx_layered: Remove unnecessary idle_cpumask
idle_cpumask isn't used at all in pick_idle_cpu_from. The only need for
these cpumasks is to check if prev_cpu is a wholly idle CPU (and we only
do this when smt_enabled). idle_smtmask is sufficient for that check.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-29 08:16:37 -08:00
Dan Schatzberg
142b6230b2 scx_layered: Fix AFFN_VIOL stat bump
Prior to this patch, we only bump LSTAT_AFFN_BIOL when the target cpu
was idle, but in both cases it should be counted as AFFN_VIOL.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-26 13:13:16 -08:00
Tejun Heo
988b7d13c1 Bump versions
scx_exit_info change doesn't require code to be updated but breaks binary
compatbility. Bump versions and cut a new release.
2024-01-25 09:01:23 -10:00
Tejun Heo
eb997a6e55
Merge pull request #101 from dschatzberg/openmetrics
scx_layered: Add support for OpenMetrics format
2024-01-25 08:59:16 -10:00
Dan Schatzberg
7f9548eb34 scx_layered: Add support for OpenMetrics format
Currently scx_layered outputs statistics periodically as info! logs. The
format of this is largely unstructured and mostly suitable for running
scx_layered interactively (e.g. observing its behavior on the command
line or via logs after the fact).

In order to run scx_layered at larger scale, it's desireable to have
statistics output in some format that is amenable to being ingested into
monitoring databases (e.g. Prometheseus). This allows collection of
stats across many machines.

This commit adds a command line flag (-o) that outputs statistics to
stdout in OpenMetrics format instead of the normal log mechanism.
OpenMetrics has a public format
specification (https://github.com/OpenObservability/OpenMetrics) and is
in use by many projects.

The library for producing OpenMetrics metrics is lightweight but does
induce some changes. Primarily, metrics need to be pre-registered (see
OpenMetricsStats::new()).

Without -o, the output looks as before, for example:

```
19:39:54 [INFO] CPUs: online/possible=52/52 nr_cores=26
19:39:54 [INFO] Layered Scheduler Attached
19:39:56 [INFO] tot=   9912 local=76.71 open_idle= 0.00 affn_viol= 2.63 tctx_err=0 proc=21ms
19:39:56 [INFO] busy=  1.3 util=   65.2 load=    263.4 fallback_cpu=  1
19:39:56 [INFO]   batch    : util/frac=   49.7/ 76.3 load/frac=    252.0: 95.7 tasks=   458
19:39:56 [INFO]              tot=   2842 local=45.04 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:56 [INFO]              cpus=  2 [  0,  2] 04000001 00000000
19:39:56 [INFO]   immediate: util/frac=    0.0/  0.0 load/frac=      0.0:  0.0 tasks=     0
19:39:56 [INFO]              tot=      0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:56 [INFO]              cpus= 50 [  0, 50] fbfffffe 000fffff
19:39:56 [INFO]   normal   : util/frac=   15.4/ 23.7 load/frac=     11.4:  4.3 tasks=   556
19:39:56 [INFO]              tot=   7070 local=89.43 open_idle= 0.00 preempt= 0.00 affn_viol= 3.69
19:39:56 [INFO]              cpus= 50 [  0, 50] fbfffffe 000fffff
19:39:58 [INFO] tot=   7091 local=84.91 open_idle= 0.00 affn_viol= 2.64 tctx_err=0 proc=21ms
19:39:58 [INFO] busy=  0.6 util=   31.2 load=    107.1 fallback_cpu=  1
19:39:58 [INFO]   batch    : util/frac=   18.3/ 58.5 load/frac=     93.9: 87.7 tasks=   589
19:39:58 [INFO]              tot=   2011 local=60.67 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:58 [INFO]              cpus=  2 [  2,  2] 04000001 00000000
19:39:58 [INFO]   immediate: util/frac=    0.0/  0.0 load/frac=      0.0:  0.0 tasks=     0
19:39:58 [INFO]              tot=      0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:58 [INFO]              cpus= 50 [ 50, 50] fbfffffe 000fffff
19:39:58 [INFO]   normal   : util/frac=   13.0/ 41.5 load/frac=     13.2: 12.3 tasks=   650
19:39:58 [INFO]              tot=   5080 local=94.51 open_idle= 0.00 preempt= 0.00 affn_viol= 3.68
19:39:58 [INFO]              cpus= 50 [ 50, 50] fbfffffe 000fffff
^C19:39:59 [INFO] EXIT: BPF scheduler unregistered
```

With -o passed, the output is in OpenMetrics format:

```
19:40:08 [INFO] CPUs: online/possible=52/52 nr_cores=26
19:40:08 [INFO] Layered Scheduler Attached
 # HELP total Total scheduling events in the period.
 # TYPE total gauge
total 8489
 # HELP local % that got scheduled directly into an idle CPU.
 # TYPE local gauge
local 86.45305689716104
 # HELP open_idle % of open layer tasks scheduled into occupied idle CPUs.
 # TYPE open_idle gauge
open_idle 0.0
 # HELP affn_viol % which violated configured policies due to CPU affinity restrictions.
 # TYPE affn_viol gauge
affn_viol 2.332430203793144
 # HELP tctx_err Failures to free task contexts.
 # TYPE tctx_err gauge
tctx_err 0
 # HELP proc_ms CPU time this binary has consumed during the period.
 # TYPE proc_ms gauge
proc_ms 20
 # HELP busy CPU busy % (100% means all CPUs were fully occupied).
 # TYPE busy gauge
busy 0.5294061026085283
 # HELP util CPU utilization % (100% means one CPU was fully occupied).
 # TYPE util gauge
util 27.37195512782239
 # HELP load Sum of weight * duty_cycle for all tasks.
 # TYPE load gauge
load 81.55024768702126
 # HELP layer_util CPU utilization of the layer (100% means one CPU was fully occupied).
 # TYPE layer_util gauge
layer_util{layer_name="immediate"} 0.0
layer_util{layer_name="normal"} 19.340849995024997
layer_util{layer_name="batch"} 8.031105132797393
 # HELP layer_util_frac Fraction of total CPU utilization consumed by the layer.
 # TYPE layer_util_frac gauge
layer_util_frac{layer_name="batch"} 29.34063385422595
layer_util_frac{layer_name="immediate"} 0.0
layer_util_frac{layer_name="normal"} 70.65936614577405
 # HELP layer_load Sum of weight * duty_cycle for tasks in the layer.
 # TYPE layer_load gauge
layer_load{layer_name="immediate"} 0.0
layer_load{layer_name="normal"} 11.14363313258934
layer_load{layer_name="batch"} 70.40661455443191
 # HELP layer_load_frac Fraction of total load consumed by the layer.
 # TYPE layer_load_frac gauge
layer_load_frac{layer_name="normal"} 13.664744680306903
layer_load_frac{layer_name="immediate"} 0.0
layer_load_frac{layer_name="batch"} 86.33525531969309
 # HELP layer_tasks Number of tasks in the layer.
 # TYPE layer_tasks gauge
layer_tasks{layer_name="immediate"} 0
layer_tasks{layer_name="normal"} 490
layer_tasks{layer_name="batch"} 343
 # HELP layer_total Number of scheduling events in the layer.
 # TYPE layer_total gauge
layer_total{layer_name="normal"} 6711
layer_total{layer_name="batch"} 1778
layer_total{layer_name="immediate"} 0
 # HELP layer_local % of scheduling events directly into an idle CPU.
 # TYPE layer_local gauge
layer_local{layer_name="batch"} 69.79752530933632
layer_local{layer_name="immediate"} 0.0
layer_local{layer_name="normal"} 90.86574281031143
 # HELP layer_open_idle % of scheduling events into idle CPUs occupied by other layers.
 # TYPE layer_open_idle gauge
layer_open_idle{layer_name="immediate"} 0.0
layer_open_idle{layer_name="batch"} 0.0
layer_open_idle{layer_name="normal"} 0.0
 # HELP layer_preempt % of scheduling events that preempted other tasks. #
 # TYPE layer_preempt gauge
layer_preempt{layer_name="normal"} 0.0
layer_preempt{layer_name="batch"} 0.0
layer_preempt{layer_name="immediate"} 0.0
 # HELP layer_affn_viol % of scheduling events that violated configured policies due to CPU affinity restrictions.
 # TYPE layer_affn_viol gauge
layer_affn_viol{layer_name="normal"} 2.950379973178364
layer_affn_viol{layer_name="batch"} 0.0
layer_affn_viol{layer_name="immediate"} 0.0
 # HELP layer_cur_nr_cpus Current  # of CPUs assigned to the layer.
 # TYPE layer_cur_nr_cpus gauge
layer_cur_nr_cpus{layer_name="normal"} 50
layer_cur_nr_cpus{layer_name="batch"} 2
layer_cur_nr_cpus{layer_name="immediate"} 50
 # HELP layer_min_nr_cpus Minimum  # of CPUs assigned to the layer.
 # TYPE layer_min_nr_cpus gauge
layer_min_nr_cpus{layer_name="normal"} 0
layer_min_nr_cpus{layer_name="batch"} 0
layer_min_nr_cpus{layer_name="immediate"} 0
 # HELP layer_max_nr_cpus Maximum  # of CPUs assigned to the layer.
 # TYPE layer_max_nr_cpus gauge
layer_max_nr_cpus{layer_name="immediate"} 50
layer_max_nr_cpus{layer_name="normal"} 50
layer_max_nr_cpus{layer_name="batch"} 2
 # EOF
^C19:40:11 [INFO] EXIT: BPF scheduler unregistered
```

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-25 09:59:49 -08:00
Andrea Righi
6d89eceb93 scx_rustland: dispatch tasks only on the global DSQ
Commit c6ada25 ("scx_rustland: use custom pcpu DSQ instead of
SCX_DSQ_LOCAL{_ON}") fixed the race issues with the cpumask, but it also
introduced performance regressions.

Until we figure out the reasons of the performance regressions, simplify
the dispatcher and go back at using only the global DSQ, relying on the
built-in idle cpu selection.

In this way we can still enforce task affinity properly
(`stress-ng --race-sched N` does not crash the scheduler) and we can
also provide a better level of system responsiveness (according to the
results of the stress tests done recently).

The idea of this change is to make the scheduler usable in certain
real-world scenarios (and as bug-free as possible), while we figure out
the performance regressions of the per-CPU DSQ approach, that will
likely be re-introduced later on in the future.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-23 13:24:12 +01:00
Andrea Righi
06b5ff3d2f scx_rustland: clarify the logic to determine interactive tasks
No functional change, simply rewrite the code a bit and update the
comment to clarify the logic to detect interactive tasks and apply the
priority boost.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-23 08:28:44 +01:00
Andrea Righi
ab1c4f66a8 scx_rustland: allow to disable the slice boost completely
Allow to specify `-b 0` to completely disable the slice boost logic and
fallback to standard vruntime-based scheduler with variable time slice.

In this way interactive tasks will not get over-prioritized over the
other tasks in the system.

Having this option can help to easily track down potential performance
regressions arising for over-prioritizing interactive tasks.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-23 00:34:06 +01:00
Andrea Righi
b4269452fc scx_userland: handle preemption events from higher sched_class
Make sure to re-schedule the user-space scheduler if it's preempted by a
task from a higher priority sched_class.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-23 00:34:06 +01:00
Andrea Righi
2426d1024f scx_rustland: increase max amount of enqueued tasks
As the scheduler is progressing towards a more stable and usable state,
it may be subject to heavy stress tests.

For this reason, bump up the limit of MAX_ENQUEUED_TASKS to 8192 in the
BPF component, to be able to sustain task-intensive stress tests,
reducing the risk of potential scheduling congestion conditions.

The downside is a negligible increase in the memory footprint of the BPF
component, that is worth the cost in order to have an improved scheduler
stability.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-21 15:47:35 +01:00
Andrea Righi
28bf96c78e scx_rustland: mitigate unevictable memory page faults
Page faults cannot happen when the user-space scheduler is running,
otherwise we may hit deadlock conditions: a kthread may need to run to
resolve the page fault, but the user-space scheduler is waiting on the
page fault to be resolved => deadlock.

We solved this problem (mostly) in commit 9708a80 ("scx_userland: use a
custom memory allocator to prevent page faults"), introducing a custom
allocator for the user-space scheduler that operates on a pre-allocated
mlocked memory buffer, but there is an exception that can still trigger
page faults: kcompactd.

When memory compaction is enabled, specifically with
vm.compact_unevictable_allowed=1 (which is often the default in many
distributions), kcompactd regularly attempts to compact all memory
zones, such that free memory is available in contiguous blocks where
feasible, including unevictable memory as well.

In the event that kcompactd remaps pages within the user-space
scheduler's address space, it can lead to page faults, resulting in a
potential deadlock.

To prevent this from happening automatically set
vm.compact_unevictable_allowed=0 when the scheduler is loaded and
restore the previous value when the scheduler in unloaded. In this way
we can prevent kcompactd from touching the unevictable memory associated
to the user-space scheduler.

Keep in mind that this is not a full bullet proof solution: something
else in the system may still set vm.compact_unevictable_allowed=1 while
the scheduler is running, re-enabling the risk of deadlock.

Ideally we would need a way to mark the user-space scheduler memory as
"really unevictable", or a proper kernel ABI to instruct kcompactd to
exclude certain tasks (or better, cgroups) from its proactive memory
compaction actions, but since then, this seems to be the best way to
mitigate this issue.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-21 15:47:35 +01:00
David Vernet
c6ada251ef scx_rustland: use custom pcpu DSQ instead of SCX_DSQ_LOCAL{_ON}
We still don't have a reliable and non-racy way to manage cpumasks from
the user-space scheduler, so it is quite hard for the scheduler to
enforce the proper CPU affinity behavior.

Despite checking the cpumask in the BPF part, tasks may still be
assigned to a CPU that they cannot use, triggering scheduler errors.

For example, it is really easy to crash the scheduler with a simple CPU
affinity stress test (`stress-ng --race-sched 8 --timeout 5`):

  14:51:28 [WARN] FAIL: SCX_DSQ_LOCAL[_ON] verdict target cpu 1 not allowed for stress-ng-race-[567048] (err=1024)

To prevent this issue from happening, create custom DSQ for each CPU
available in the system and use these per-CPU DSQs to dispatch all the
tasks processed by the user-space scheduler, including the user-space
scheduler itself.

Then consume the these DSQs from the .dispatch() callback of the
respective CPU, to transfer all the tasks to the consuming CPU's local
DSQ, preventing the cpumask race condition encountered using
SCX_DSQ_LOCAL_ON.

With this patch applied the `stress-ng --race-sched N` stress test can
be executed successfully (even with large values of N) without causing
the scheduler to crash.

Signed-off-by: David Vernet <void@manifault.com>
[ arighi: kick target cpu to improve responsiveness, update comments ]
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-21 15:47:35 +01:00