Commit Graph

82 Commits

Author SHA1 Message Date
Tejun Heo
45f7fd13b7 versions: Synchronize crate dependency versions 2024-08-08 14:45:46 -10:00
Tejun Heo
63c4a0191f
Merge branch 'main' into topic/inlined-skeleton-members 2024-08-08 14:23:37 -10:00
Tejun Heo
cd6a4d72c7 Bump versions for 1.0.2 release 2024-08-08 14:10:16 -10:00
Tejun Heo
7c3ffe96e1 Unify crate dependency versions
Different sub-projects are using different versions for the same crates.
Synchronize them to the latest.
2024-08-08 13:26:47 -10:00
Andrea Righi
51cfb69199 scx_rustland_core: re-introduce partial mode
Re-add the partial mode option that was dropped during the refactoring.

The partial option allows to apply the scheduler only to the tasks which
have their scheduling policy set to SCHED_EXT via sched_setscheduler().

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-07 08:41:06 +02:00
Andrea Righi
e1f2b3822e scx_rustland_core: drop CPU ownership API
The API for determining which PID is running on a specific CPU is racy
and is unnecessary since this information can be obtained from user
space.

Additionally, it's not reliable for identifying idle CPUs.  Therefore,
it's better to remove this API and, in the future, provide a cpumask
alternative that can export the idle state of the CPUs to user space.

As a consequence also change scx_rustland to dispatch one task a time,
instead of dispatching tasks in batches of idle cores (that are usually
not accurate due to the racy nature of the CPU ownership interaface).

Dispatching one task at a time even makes the scheduler more performant,
due to the vruntime scheduling being applied to more tasks sitting in
the scheduler's queue.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-07 08:41:06 +02:00
Andrea Righi
9a0e7755df scx_rustland_core: export counter of online CPUs
Introduce a helper to get the amount of online CPUs tracked by the BPF
part.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-07 08:10:53 +02:00
Andrea Righi
d9c9f78e3e scx_rustland: re-align vruntime and time slice evaluation to scx_bpfland
Drop the slice boost logic and apply a vruntime and task time slice
evaluation approach similar to scx_bpfland (but implement this in the
user-space component instead of the BPF part).

Additionally, introduce a slice_us_min parameter to define the minimum
time slice that can be assigned to a task, also similar to scx_bpfland.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-07 08:10:53 +02:00
Andrea Righi
e1e6e31208 scx_rustland_core: update copyright info
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-07 08:10:53 +02:00
Andrea Righi
b87541a26e scx_rustland_core: refactor idle CPU selection logic
Use the same idle selection logic used in scx_bpfland also in
scx_rustland_core.

Also drop fifo_mode and always use the BPF idle selection logic by
default as long as the system is not saturated, unless full_user is
specified.

This approach allows user-space schedulers aiming for maximum
performance to leverage the BPF idle selection logic (bypassing
user-space), while those seeking full control can enable full_user to
bypass the BPF CPU idle selection logic and choose the target CPU for
each task from user-space.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-07 08:10:53 +02:00
Andrea Righi
d8985306f4 scx_rustland: user-space interactive task classifier
We don't need to send the number of voluntary context switches (nvcsw)
from BPF to user-space, as this information is already accessible in
user-space via procfs. Sending this data would only create unnecessary
overhead for schedulers that don't require it, and those that do can
easily retrieve it through procfs.

Therefore, drop this metric from scx_rustland_core and change
scx_rustland implementing an interactive task classifier fully in the
user-space part of the scheduler.

Also drop some options that are not provide any significant benefit
(also in preparation of a bigger refactoring to define a better API for
the user-space framework).

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-06 17:56:58 +02:00
Andrea Righi
a8d14fc0c4
Merge pull request #471 from sched-ext/rustland-core-musl
scx_rustland_core: add support for musl
2024-08-06 17:56:19 +02:00
Kawanaao
c3109ebeed
scx_rustland_core alloc: Replaced RefCell with Mutex
Necessary for some multi-threaded cases
2024-08-06 13:10:21 +00:00
Andrea Righi
4c7fb5cdbd scx_rustland_core: add support for musl
It seems that musl glibc is not POSIX.1-2001, POSIX.1-2008 compliant and
using sched_setscheduler() just returns -ENOSYS:
https://git.musl-libc.org/cgit/musl/commit/src/sched/sched_setscheduler.c?id=1e21e78bf7a5c24c217446d8760be7b7188711c2

Switch to pthread_setschedparam() to properly support building
scx_rustland_core with musl.

This fixes #469.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-06 07:32:38 +02:00
Andrea Righi
d4005dd186 scx_rustland_core: fix missing import (timespec)
Explicitly import timespec to fix the following potential build error:

error[E0422]: cannot find struct, variant or union type `timespec` in this scope
 --> src/bpf.rs:365:35
    |
365 | sched_ss_repl_period: timespec {
    |                       ^^^^^^^^ not found in this scope
    |
help: consider importing this struct
    |
6 + use libc::timespec;
    |

This fixes issue #469.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-05 23:15:48 +02:00
Daniel Müller
565aec3662 rust: Update libbpf-rs & libbpf-cargo to 0.24
Update libbpf-rs & libbpf-cargo to 0.24. Among other things, generated
skeletons now contain directly accessible map and program objects, no
longer necessitating the use of accessor methods. As a result, the risk
for mutability conflicts is reduced greatly.

Signed-off-by: Daniel Müller <deso@posteo.net>
2024-07-16 11:48:52 -07:00
Tejun Heo
51334b5c4d Bump versions for 1.0.1 release 2024-07-15 13:21:52 -10:00
Tejun Heo
761ec142ce Bump most versions to 1.0.0
sched_ext is about to be merged upstream. There are some compatibility
breaking changes and we're making the current sched_ext/for-6.11
1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on
!wakeup enqueues") the baseline.

Tag everything except scx_mitosis as 1.0.0. As scx_mitosis is still in early
development and is currently temporarily disabled, only the patchlevel is
bumped.
2024-07-12 11:34:14 -10:00
I Hsin Cheng
1595da78dc scx_rustland_core: Remove unused variable
Remove unused variable "tctx" in rustland_select_cpu.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-05 01:04:49 +08:00
Andrea Righi
5db0908530 scx_rustland_core: make sure to use a valid CPU during direct dispatch
We may end up selecting an invalid CPU (according to the task's cpumask)
when dispatching the task via dispatch_direct_cpu().

When this happens simply return an error and do not dispatch the task
and let the caller handle the error: in the context of select_cpu() we
can simply ignore the dispatch and return the target CPU; in the context
of FIFO mode dispatch we can fallback to SCX_DSQ_LOCAL if the target CPU
is not valid.

This fixes issue #353.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-25 14:11:46 +02:00
Andrea Righi
e4b13b2aa6 scx_rustland_core: reduce dispatch overhead
Kick CPUs in the dispatch path only when needed (typically when tasks
are bounced to other CPUs).

Moreover, avoid to consume all the tasks dispatched at once.

This seems to reduce the BPF overhead (according to bpftop), going from
~10% CPU usage down to ~6% CPU usage of rustland_dispatch() on an over
commissioned system, without introducing any measureable performance
regression.

Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-25 14:02:39 +02:00
Andrea Righi
631d5576dc scx_rustland_core: refactor CPU selection logic
Allow to dispatch tasks directly (bypassing the user-space scheduler)
only when the scheduler is operating in FIFO mode.

On an over-commissioned system, directly dispatching tasks can only
increase OS noise. These tasks can get a brief priority boost and an
extended time slice just because they found an idle CPU, which can lead
to erratic behavior.

This is particularly problematic when measuring performance stability,
such as evaluating the frames-per-second (fps) of a video game on an
overloaded system.

In such cases, it's better to bounce all tasks to the user-space
scheduler, that will ensure a better level of fairness and smoother
performance.

Moreover, get rid of the second chance dispatch logic introduced in
commit 4791d862 ("scx_rustland_core: second chance CPU migration"). This
seems to provide benefits only on certain architectures (Intel), but it
can introduces lags in others (AMD).

Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-25 14:02:39 +02:00
Andrea Righi
19217b5722 scx_rustland_core: clarify comment about resuming FIFO mode
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-25 14:02:39 +02:00
Andrea Righi
081d4bdb86 scx_rustland_core: add debugging to dispatch_direct_cpu()
Report dispatch_direct_cpu() events in the trace, like any other
dispatch-related event.

Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-25 14:02:39 +02:00
Andrea Righi
b04e82b5eb scx_rustland_core: include buddy-alloc and refactor allocator code
The dependency of the buddy-alloc crate [1] seems to cause some troubles
with packaging, mostly because the selftests for the crate are failing
when it's compiled in release mode.

For example:

 $ cargo test --release -- --nocapture
 thread 'tests::fast_alloc::test_basic_malloc' panicked at src/tests/fast_alloc.rs:25:13:
 assertion `left == right` failed
   left: 0
  right: 42

Some of these failures with BuddyAlloc can be fixed by using a memory
arena buffer aligned to page size.

However, some test failures with FastAlloc persist that cannot be
resolved merely by aligning the pre-allocated memory arena to the page
size, as mentioned in [2].

The concern is that this may potentially lead to actual memory bugs.

Therefore, it seems safer to refactor the custom allocator code to
simply use BuddyAlloc, dropping FastAlloc completely.

To achieve this, the entire BuddyAlloc code has been directly included
in scx_rustland_core, referencing the original project and its MIT
licensing information (with the entire code still distributed under the
GPLv2 license).

Then the code has been slightly modified to remove FastAlloc and the
external dependency on the buddy-alloc crate has been dropped.

From a performance perspective this change doesn't seem to introduce any
measurable regression.

[1] https://github.com/jjyr/buddy-alloc
[2] https://github.com/jjyr/buddy-alloc/issues/16

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-19 14:44:04 +02:00
Tejun Heo
d18bb4831a
Merge pull request #363 from sched-ext/htejun/compat-strip
Strip compat support
2024-06-16 20:04:50 -10:00
Andrea Righi
812aeb3b81 scx_rustland_core: fix potential race in dispatch_task()
Fix a potential race condition that might lead to a task being
dispatched without kicking the target CPU, which could result in a
potential stall.

With this applied, scx_rustland has been running without any stall for
about 18 hours on a system where the issue was previously quite easy to
reproduce.

Moreover, clarify a couple of comments in the dispatch path.

This fixes issue #353.

Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-16 08:31:48 +02:00
Tejun Heo
5b5e5be906 compat: Drop __COMPAT_SCX_KICK_IDLE
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_SCX_KICK_IDLE. The open helper macros
now check the existence of SCX_KICK_IDLE and abort if not.
2024-06-15 20:24:15 -10:00
Tejun Heo
7c9aedaefe compat: Drop __COMPAT_scx_bpf_switch_all()
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_switch_call(). The open helper
macros now check the existence of SCX_OPS_SWITCH_PARTIAL and abort if not.
2024-06-15 20:03:37 -10:00
Andrea Righi
6cc2595922 scx_rustland_core: allow user-space scheduler to preempt other tasks
The same change was previously applied with commit 8820af8d
("scx_rustland: enable user-space scheduler to preempt other tasks").

However, it was reverted in commit 732ba490 ("scx_rustland: avoid using
SCX_ENQ_PREEMPT") with the introduction of the global dynamic time
slice (inversely proportional to the number of running tasks), because
it already provided sufficient context switch opportunities, negating
any advantage of the preemption.

With the introduction of the virtual time slice in commit 6f4cd853
("scx_rustland: introduce virtual time slice"), re-introducing the
ability for the user-space scheduler to preempt other tasks now appears
beneficial again.

As confirmed by experimental results, this change helps prevent
potential audio cracking issues and enhances overall system
responsiveness, resulting in a 4-5% increase in fps performing the
usual benchmark of gaming while recompiling the kernel.

Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-14 20:09:19 +02:00
Andrea Righi
2d2c6083d7 scx_rustland_core: use SCX_DSQ_LOCAL only with kthreads
Tasks dispatched directly using SCX_DSQ_LOCAL may receive excessively
high priority compared to those dispatched by the user-space scheduler.

To avoid this priority disparity, dispatch tasks to the per-CPU DSQs
using direct dispatch and reserve SCX_DSQ_LOCAL for per-CPU kthreads
only.

Tested-by: Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-14 20:09:19 +02:00
Andrea Righi
40e67897a9
Merge pull request #331 from sched-ext/rustland-core-dispatch-debug
scx_rustland_core: add extra debugging info to dispatch_task()
2024-06-04 18:58:01 +02:00
Andrea Righi
b363a13310 scx_rustland_core: add enqueue flags to debug info
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-04 17:14:38 +02:00
Andrea Righi
89384754ce scx_rustland_core: add task time slice to the debug info
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-04 17:14:38 +02:00
Tejun Heo
e556dd375d scx: Unify loading and running boilerplate across rust schedulers
Make restart handling with user_exit_info simpler and consistently use the
load and report macros consistently across the rust schedulers. This makes
all schedulers automatically handle auto restarts from CPU hotplug events.
Note that this is necessary even for scx_lavd which has CPU hotplug
operations as CPU hotplug operations which took place between skel open and
scheduler init can still trigger restart.
2024-06-03 12:25:41 -10:00
Tejun Heo
a2d5310cb6 Bump versions for a release 2024-06-03 08:35:21 -10:00
Andrea Righi
4503c7080a scx_rustland_core: fix build error with musl
As reported in #319, we may get a build failure in presence of musl,
that requires additional parameters in sched_param.

Fix by adding a proper conditional to support both gnu libc and musl
libc.

This fixes #319.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-05-26 22:30:04 +02:00
Yiwei Lin
8c2236770d Reduce MAX_ENQUEUED_TASKS to fit percpu allocator
The setting of ops->dispatch_max_batch leads to a too large allocated
size for percpu allocator, and it will be unhappy if we want a size
larger than PCPU_MIN_UNIT_SIZE. Reduce MAX_ENQUEUED_TASKS for fix.
2024-05-23 22:33:02 +08:00
Andrea Righi
4791d862f5 scx_rustland_core: second chance CPU migration
Implement a second-chance migration in select_cpu(): after a task has
been dispatched directly do not try to migrate it immediately on a
different CPU, but force it to stay on prev_cpu for another round.

This seems quite effective on certain architectures (such as on a system
with 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz), and it can provide
noticeable benefits with gaming or WebGL applications (such as
https://webglsamples.org/aquarium/aquarium.html) under regular workload
conditions (around +5% fps).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-05-22 21:38:34 +02:00
Andrea Righi
6c129e9b4b scx_rustland_core: consume from the shared DSQ before local DSQ
The shared DSQ is typically used to prioritize tasks and dispatch them
on the first CPU available, so consume from the shared DSQ before the
local CPU DSQ.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-05-22 12:12:55 +02:00
Andrea Righi
9e4bea4a1c scx_rustland_core: switch to FIFO when system is underutilized
Provide a knob in scx_rustland_core to automatically turn the scheduler
into a simple FIFO when the system is underutilized.

This choice is based on the assumption that, in the case of system
underutilization (less tasks running than the amount of available CPUs),
the best scheduling policy is FIFO.

With this option enabled the scheduler starts in FIFO mode. If most of
the CPUs are busy (nr_running >= num_cpus - 1), the scheduler
immediately exits from FIFO mode and starts to apply the logic
implemented by the user-space component. Then the scheduler can switch
back to FIFO if there are no tasks waiting to be scheduled (evaluated
using a moving average).

This option can be enabled/disabled by the user-space scheduler using
the fifo_sched parameter in BpfScheduler: if set, the BPF component will
periodically check for system utilization and switch back and forth to
FIFO mode based on that.

This allows to improve performance of workloads that are using a small
amount of the available CPUs in the system, while still maintaining the
same good level of performance for interactive tasks when the system is
over commissioned.

In certain video games, such as Baldur's Gate 3 or Counter-Strike 2,
running in "normal" system conditions, we can experience a boost in fps
of approximately 4-8% with this change applied.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-05-22 09:02:02 +02:00
Andrea Righi
912bde6a52 scx_rustland_core: simplify CPU selection logic
Simplify the CPU idle selection logic relying on the built-in logic.

If something can be improved in this logic it should be done in the
backend, changing the default idle selection logic, rustland doesn't
need to do anything special here for now.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-05-22 09:02:02 +02:00
Andrea Righi
0d75c80587 Revert "Merge pull request #305 from sched-ext/rustland-fifo-mode"
This merge included additional commits that were supposed to be included
in a separate pull request and have nothing to do with the fifo-mode
changes.

Therefore, revert the whole pull request and create a separate one with
the correct list of commits required to implement this feature.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-05-22 09:00:25 +02:00
Andrea Righi
f27e67dfb8 scx_rustland_core: second chance CPU migration
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-05-21 18:08:58 +02:00
Andrea Righi
778ee1406f scx_rustland_core: consume from the shared DSQ before local DSQ
The shared DSQ is typically used to prioritize tasks and dispatch them
on the first CPU available, so consume from the shared DSQ before the
local CPU DSQ.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-05-21 18:08:19 +02:00
Andrea Righi
d25675ff44 scx_rustland_core: switch to FIFO when system is underutilized
Provide a knob in scx_rustland_core to automatically turn the scheduler
into a simple FIFO when the system is underutilized.

This choice is based on the assumption that, in the case of system
underutilization (less tasks running than the amount of available CPUs),
the best scheduling policy is FIFO.

With this option enabled the scheduler starts in FIFO mode. If most of
the CPUs are busy (nr_running >= num_cpus - 1), the scheduler
immediately exits from FIFO mode and starts to apply the logic
implemented by the user-space component. Then the scheduler can switch
back to FIFO if there are no tasks waiting to be scheduled (evaluated
using a moving average).

This option can be enabled/disabled by the user-space scheduler using
the fifo_sched parameter in BpfScheduler: if set, the BPF component will
periodically check for system utilization and switch back and forth to
FIFO mode based on that.

This allows to improve performance of workloads that are using a small
amount of the available CPUs in the system, while still maintaining the
same good level of performance for interactive tasks when the system is
over commissioned.

In certain video games, such as Baldur's Gate 3 or Counter-Strike 2,
running in "normal" system conditions, we can experience a boost in fps
of approximately 4-8% with this change applied.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-05-21 17:39:11 +02:00
Andrea Righi
3ad634c293 scx_rustland_core: simplify CPU selection logic
Simplify the CPU idle selection logic relying on the built-in logic.

If something can be improved in this logic it should be done in the
backend, changing the default idle selection logic, rustland doesn't
need to do anything special here for now.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-05-21 17:08:06 +02:00
David Vernet
ee940bd8b5
rustland: Mark get_cpu_owner() as __maybe_unused
scx_rustland has a function called get_cpu_owner() in BPF which
currently has no callers. There's nothing wrong with the function, but
it causes a warning due to an unused function. Let's just annotate it
with __maybe_unused to tell the compiler that it's not a problem.

Signed-off-by: David Vernet <void@manifault.com>
2024-05-18 07:51:20 -05:00
Tejun Heo
ab25992416 Add missing skel.attach() calls
C SCX_OPS_ATTACH() and rust scx_ops_attach() macros were not calling
.attach() and were only attaching the struct_ops. This meant that all
non-struct_ops BPF programs contained in the skels were never attached which
breaks e.g. scx_layered.

Let's fix it by adding .attach() invocation the the attach macros.
2024-05-17 14:33:04 -10:00
Andrea Righi
e9ac6105c7 scx_rustland_core: introduce low-power mode
Introduce a low-power mode to force the scheduler to operate in a very
non-work conserving way, causing a significant saving in terms of power
consumption, while still providing a good level of responsiveness in the
system.

This option can be enabled in scx_rustland via the --low_power / -l
option.

The idea is to not immediately re-kick a CPU when it enters an idle
state, but do that only if there are no other tasks running in the
system.

In this way, latency-critical tasks can be still dispatched immediately
on the other active CPUs, while CPU-bound tasks will be forced to spend
more time waiting to be scheduled, basically enforcing a special CPU
throttling mechanism that affects only the tasks that are not latency
critical.

The consequence is a reduction in the overall system throughput, but
also a significant reduction of power consumption, that can be useful
for mobile / battery-powered devices.

Test case (using `scx_rustland -l`):

 - play a video game (Terraria) while recompiling the kernel
 - measure game performance (fps) and core power consumption (W)
 - compare the result of normal mode vs low-power mode

Result:
                  Game performance | Power consumption |
     ------------+-----------------+-------------------+
     normal mode |          60 fps |               6W  |
  low-power mode |          60 fps |               3W  |

As we can see from the result the reduction of power consumption is
quite significant (50%), while the responsiveness of the game (fps)
remains the same, that means battery life can be potentially doubled
without significantly affecting system responsiveness.

The overall throughput of the system is, of course, affected in a
negative way (kernel build is approximately 50% slower during this
test), but the goal here is to save power while still maintaining a good
level of responsiveness in the system.

For this reason the low-power mode should be considered only in
emergency conditions, for example when the system is close to completely
run out of power or simply to extend the battery life of a mobile device
without compromising its responsiveness.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-05-15 20:32:05 +02:00