sched_ext is about to be merged upstream. There are some compatibility
breaking changes and we're making the current sched_ext/for-6.11
1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on
!wakeup enqueues") the baseline.
Tag everything except scx_mitosis as 1.0.0. As scx_mitosis is still in early
development and is currently temporarily disabled, only the patchlevel is
bumped.
This change adds the ability to customize the log recorder format for
each metric type. There is a default format that is used if no custom
`MetricFormatter` is provided. This is the same format that was used
before this change.
The `MetricFormatter` should be implemented by the user to customize the
format of the log recorder. The `LogRecorderBuilder` now takes a
`MetricFormatter` as an optional parameter.
Following changes will allow additional customization of the log
recorder format, such as how many metrics are logged per line.
Signed-off-by: Jose Fernandez <josef@netflix.com>
If CONFIG_DEBUG_INFO_BTF is not enabled in the kernel, the C schedulers
report the following error via libbpf, clearly indicating the missing
kernel config:
libbpf: kernel BTF is missing at '/sys/kernel/btf/vmlinux', was CONFIG_DEBUG_INFO_BTF enabled?
In contrast, the Rust schedulers report a less clear error:
thread 'main' panicked at /home/arighi/src/scx/rust/scx_utils/src/compat.rs:23:9:
btf__load_vmlinux_btf() returned NULL
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Make sure to report a similar error, so that users have a better clue
about the missing kernel config. After this change the error looks like
the following:
thread 'main' panicked at /home/arighi/src/scx/rust/scx_utils/src/compat.rs:23:9:
btf__load_vmlinux_btf() returned NULL, was CONFIG_DEBUG_INFO_BTF enabled?
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
We want schedulers to be able to print, log, etc the build ID of the
repository. To do this, we can use the vergen Cargo crate to generate
environment variables that contain values that we can export from scx_utils.
This patch update scx_utils accordingly to use vergen to generate build
ID output that can be printed from schedulers. A subsequent patch will
update scx_rusty to print this build ID value.
Signed-off-by: David Vernet <void@manifault.com>
We may end up selecting an invalid CPU (according to the task's cpumask)
when dispatching the task via dispatch_direct_cpu().
When this happens simply return an error and do not dispatch the task
and let the caller handle the error: in the context of select_cpu() we
can simply ignore the dispatch and return the target CPU; in the context
of FIFO mode dispatch we can fallback to SCX_DSQ_LOCAL if the target CPU
is not valid.
This fixes issue #353.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Kick CPUs in the dispatch path only when needed (typically when tasks
are bounced to other CPUs).
Moreover, avoid to consume all the tasks dispatched at once.
This seems to reduce the BPF overhead (according to bpftop), going from
~10% CPU usage down to ~6% CPU usage of rustland_dispatch() on an over
commissioned system, without introducing any measureable performance
regression.
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Allow to dispatch tasks directly (bypassing the user-space scheduler)
only when the scheduler is operating in FIFO mode.
On an over-commissioned system, directly dispatching tasks can only
increase OS noise. These tasks can get a brief priority boost and an
extended time slice just because they found an idle CPU, which can lead
to erratic behavior.
This is particularly problematic when measuring performance stability,
such as evaluating the frames-per-second (fps) of a video game on an
overloaded system.
In such cases, it's better to bounce all tasks to the user-space
scheduler, that will ensure a better level of fairness and smoother
performance.
Moreover, get rid of the second chance dispatch logic introduced in
commit 4791d862 ("scx_rustland_core: second chance CPU migration"). This
seems to provide benefits only on certain architectures (Intel), but it
can introduces lags in others (AMD).
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Report dispatch_direct_cpu() events in the trace, like any other
dispatch-related event.
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
This change adds a new module to the scx_utils crate that provides a
log recorder for metrics-rs. The log recorder will log all metrics to
the console at a configurable interval in an easy to read format. Each
metric type will be displayed in a separate section. Indentation will
be used to show the hierarchy of the metrics. This results in a more
verbose output, but it is easier to read and understand.
scx_rusty was updated to use the log recorder and all explicit metric
logging was removed.
Counters will show the total count and the rate of change per second.
Counters with an additional label, like `type` in
`dispatched_tasks_total` in rusty, will show the count, rate, and
percentage of the total count.
Counters:
dispatched_tasks_total: 65559 [1344.8/s]
prev_idle: 44963 (68.6%) [966.5/s]
wsync_prev_idle: 15696 (23.9%) [317.3/s]
direct_dispatch: 2833 (4.3%) [35.3/s]
dsq: 1804 (2.8%) [21.3/s]
wsync: 262 (0.4%) [4.3/s]
direct_greedy: 1 (0.0%) [0.0/s]
pinned: 0 (0.0%) [0.0/s]
greedy_idle: 0 (0.0%) [0.0/s]
greedy_xnuma: 0 (0.0%) [0.0/s]
direct_greedy_far: 0 (0.0%) [0.0/s]
greedy_local: 0 (0.0%) [0.0/s]
dl_clamped_total: 1290 [20.3/s]
dl_preset_total: 514 [1.0/s]
kick_greedy_total: 6 [0.3/s]
lb_data_errors_total: 0 [0.0/s]
load_balance_total: 0 [0.0/s]
repatriate_total: 0 [0.0/s]
task_errors_total: 0 [0.0/s]
Gauges will show the last set value:
Gauges:
slice_length_us: 20000.00
Histograms will show the average, min, and max. The histogram will be
reset after each log interval to avoid memory leaks, since the data
structure that holds the samples is unbounded.
Histograms:
cpu_busy_pct: avg=1.66 min=1.16 max=2.16
load_avg node=0: avg=0.31 min=0.23 max=0.39
load_avg node=0 dom=0: avg=0.31 min=0.23 max=0.39
processing_duration_us: avg=297.50 min=296.00 max=299.00
Signed-off-by: Jose Fernandez <josef@netflix.com>
In some cases, a host may have an odd topology where there are gaps in
CPU IDs (including between possible CPUs). A common pattern in
schedulers is to perform allocations for every possible CPU ID, such as
creating a per-cpu DSQ. In order to avoid confusing schedulers, let's
track the maximum CPU ID on a system so that we can return the number of
CPU IDs on the system which is inclusive of gaps.
We also update scx_rustland in this change to accommodate the fact that
we no longer export nr_cpus_possible() from TopologyMap.
Signed-off-by: David Vernet <void@manifault.com>
The dependency of the buddy-alloc crate [1] seems to cause some troubles
with packaging, mostly because the selftests for the crate are failing
when it's compiled in release mode.
For example:
$ cargo test --release -- --nocapture
thread 'tests::fast_alloc::test_basic_malloc' panicked at src/tests/fast_alloc.rs:25:13:
assertion `left == right` failed
left: 0
right: 42
Some of these failures with BuddyAlloc can be fixed by using a memory
arena buffer aligned to page size.
However, some test failures with FastAlloc persist that cannot be
resolved merely by aligning the pre-allocated memory arena to the page
size, as mentioned in [2].
The concern is that this may potentially lead to actual memory bugs.
Therefore, it seems safer to refactor the custom allocator code to
simply use BuddyAlloc, dropping FastAlloc completely.
To achieve this, the entire BuddyAlloc code has been directly included
in scx_rustland_core, referencing the original project and its MIT
licensing information (with the entire code still distributed under the
GPLv2 license).
Then the code has been slightly modified to remove FastAlloc and the
external dependency on the buddy-alloc crate has been dropped.
From a performance perspective this change doesn't seem to introduce any
measurable regression.
[1] https://github.com/jjyr/buddy-alloc
[2] https://github.com/jjyr/buddy-alloc/issues/16
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Take advantages of BTreeMap's Entry API working with or_insert() to do
the conditional insertion. Insert only when the entry doesn't exist.
Doing so can reduce the amount of code and provide better readability
and perform in-place manipulation.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop support for missing sched_ext_ops.dump*(). The
open helper macros now check the existence of the fields and abort if
missing.
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop support for missing sched_ext_ops.tick(). The
open helper macros now check the existence of the field and abort if
missing.
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop support for missing sched_ext_ops.exit_dump_len.
The open helper macros now check the existence of the field and abort if
missing.
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop support for missing sched_ext_ops.hotplug_seq.
The open helper macros now check the existence of the field and abort if
missing.
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_cpuperf_*(). The open helper
macros now check the existence of scx_bpf_cpuperf_cap() and abort if not.
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_HAS_CPUMASKS(). The open helper macros
now check the existence of scx_bpf_nr_cpu_ids() and abort if not.
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_dump(). The open helper macros
now check the existence of scx_bpf_dump_bstr() and abort if not.
While at it, reorder the min requirement checks so that newly added ones are
up top to make testing easier.
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_exit(). The open helper macros
now check the existence of scx_bpf_exit_bstr() and abort if not.
Fix a potential race condition that might lead to a task being
dispatched without kicking the target CPU, which could result in a
potential stall.
With this applied, scx_rustland has been running without any stall for
about 18 hours on a system where the issue was previously quite easy to
reproduce.
Moreover, clarify a couple of comments in the dispatch path.
This fixes issue #353.
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_SCX_KICK_IDLE. The open helper macros
now check the existence of SCX_KICK_IDLE and abort if not.
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_switch_call(). The open helper
macros now check the existence of SCX_OPS_SWITCH_PARTIAL and abort if not.
scx_ops_open!() and scx_ops_attach!() could return the calling function
after an error, which can be surprising. Forutnately, as all the current
callers are either unwrapping or returning on error, the surprising behavior
is currently not very noticeable.
Fix it by breaking out of the macro block on errors.
The same change was previously applied with commit 8820af8d
("scx_rustland: enable user-space scheduler to preempt other tasks").
However, it was reverted in commit 732ba490 ("scx_rustland: avoid using
SCX_ENQ_PREEMPT") with the introduction of the global dynamic time
slice (inversely proportional to the number of running tasks), because
it already provided sufficient context switch opportunities, negating
any advantage of the preemption.
With the introduction of the virtual time slice in commit 6f4cd853
("scx_rustland: introduce virtual time slice"), re-introducing the
ability for the user-space scheduler to preempt other tasks now appears
beneficial again.
As confirmed by experimental results, this change helps prevent
potential audio cracking issues and enhances overall system
responsiveness, resulting in a 4-5% increase in fps performing the
usual benchmark of gaming while recompiling the kernel.
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Tasks dispatched directly using SCX_DSQ_LOCAL may receive excessively
high priority compared to those dispatched by the user-space scheduler.
To avoid this priority disparity, dispatch tasks to the per-CPU DSQs
using direct dispatch and reserve SCX_DSQ_LOCAL for per-CPU kthreads
only.
Tested-by: Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
This change adds the CPU frequency transition latency from the
`cpuinfo_transition_latency` from sysfs. The value of this field is
described [cpufreq
docs](https://www.kernel.org/doc/Documentation/cpu-freq/user-guide.txt).
On supported systems it returns the CPU frequency transition latency in
nanoseconds. The goal of this change is so that in the future schedulers
can use this data to make better frequency scaling decisions.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
- scx_utils: Replace kfunc_exists() with ksym_exists() which doesn't care
about the type of the symbol.
- scx_layered: Fix load failure on kernels >= v6.10-rc due to
scheduler_tick() -> sched_tick rename. Attach the tick fentry function to
either scheduler_tick() or sched_tick().
Make restart handling with user_exit_info simpler and consistently use the
load and report macros consistently across the rust schedulers. This makes
all schedulers automatically handle auto restarts from CPU hotplug events.
Note that this is necessary even for scx_lavd which has CPU hotplug
operations as CPU hotplug operations which took place between skel open and
scheduler init can still trigger restart.
As reported in #319, we may get a build failure in presence of musl,
that requires additional parameters in sched_param.
Fix by adding a proper conditional to support both gnu libc and musl
libc.
This fixes#319.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The setting of ops->dispatch_max_batch leads to a too large allocated
size for percpu allocator, and it will be unhappy if we want a size
larger than PCPU_MIN_UNIT_SIZE. Reduce MAX_ENQUEUED_TASKS for fix.
Implement a second-chance migration in select_cpu(): after a task has
been dispatched directly do not try to migrate it immediately on a
different CPU, but force it to stay on prev_cpu for another round.
This seems quite effective on certain architectures (such as on a system
with 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz), and it can provide
noticeable benefits with gaming or WebGL applications (such as
https://webglsamples.org/aquarium/aquarium.html) under regular workload
conditions (around +5% fps).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The shared DSQ is typically used to prioritize tasks and dispatch them
on the first CPU available, so consume from the shared DSQ before the
local CPU DSQ.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Provide a knob in scx_rustland_core to automatically turn the scheduler
into a simple FIFO when the system is underutilized.
This choice is based on the assumption that, in the case of system
underutilization (less tasks running than the amount of available CPUs),
the best scheduling policy is FIFO.
With this option enabled the scheduler starts in FIFO mode. If most of
the CPUs are busy (nr_running >= num_cpus - 1), the scheduler
immediately exits from FIFO mode and starts to apply the logic
implemented by the user-space component. Then the scheduler can switch
back to FIFO if there are no tasks waiting to be scheduled (evaluated
using a moving average).
This option can be enabled/disabled by the user-space scheduler using
the fifo_sched parameter in BpfScheduler: if set, the BPF component will
periodically check for system utilization and switch back and forth to
FIFO mode based on that.
This allows to improve performance of workloads that are using a small
amount of the available CPUs in the system, while still maintaining the
same good level of performance for interactive tasks when the system is
over commissioned.
In certain video games, such as Baldur's Gate 3 or Counter-Strike 2,
running in "normal" system conditions, we can experience a boost in fps
of approximately 4-8% with this change applied.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Simplify the CPU idle selection logic relying on the built-in logic.
If something can be improved in this logic it should be done in the
backend, changing the default idle selection logic, rustland doesn't
need to do anything special here for now.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
This merge included additional commits that were supposed to be included
in a separate pull request and have nothing to do with the fifo-mode
changes.
Therefore, revert the whole pull request and create a separate one with
the correct list of commits required to implement this feature.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The shared DSQ is typically used to prioritize tasks and dispatch them
on the first CPU available, so consume from the shared DSQ before the
local CPU DSQ.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Provide a knob in scx_rustland_core to automatically turn the scheduler
into a simple FIFO when the system is underutilized.
This choice is based on the assumption that, in the case of system
underutilization (less tasks running than the amount of available CPUs),
the best scheduling policy is FIFO.
With this option enabled the scheduler starts in FIFO mode. If most of
the CPUs are busy (nr_running >= num_cpus - 1), the scheduler
immediately exits from FIFO mode and starts to apply the logic
implemented by the user-space component. Then the scheduler can switch
back to FIFO if there are no tasks waiting to be scheduled (evaluated
using a moving average).
This option can be enabled/disabled by the user-space scheduler using
the fifo_sched parameter in BpfScheduler: if set, the BPF component will
periodically check for system utilization and switch back and forth to
FIFO mode based on that.
This allows to improve performance of workloads that are using a small
amount of the available CPUs in the system, while still maintaining the
same good level of performance for interactive tasks when the system is
over commissioned.
In certain video games, such as Baldur's Gate 3 or Counter-Strike 2,
running in "normal" system conditions, we can experience a boost in fps
of approximately 4-8% with this change applied.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Simplify the CPU idle selection logic relying on the built-in logic.
If something can be improved in this logic it should be done in the
backend, changing the default idle selection logic, rustland doesn't
need to do anything special here for now.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
scx_rustland has a function called get_cpu_owner() in BPF which
currently has no callers. There's nothing wrong with the function, but
it causes a warning due to an unused function. Let's just annotate it
with __maybe_unused to tell the compiler that it's not a problem.
Signed-off-by: David Vernet <void@manifault.com>
C SCX_OPS_ATTACH() and rust scx_ops_attach() macros were not calling
.attach() and were only attaching the struct_ops. This meant that all
non-struct_ops BPF programs contained in the skels were never attached which
breaks e.g. scx_layered.
Let's fix it by adding .attach() invocation the the attach macros.
In order to make it easy for schedulers to use the hotplug_seq feature that's
available in recent kernels, we'll need to provide a macro wrapper so that we
can support the feature with backwards compatibility. This adds scx_ops_open!()
to abstract that. Any scheduler that uses scx_ops_open()! will be exited if a
hotplug event happens between opening the skeleton, and loading it.
Signed-off-by: David Vernet <void@manifault.com>
Now that the kernel exports the SCX_ECODE_ACT_RESTART exit code, we can
remove the custom hotplug logic from scx_rusty, and instead rely on the
built-in logic from the kernel. There's still a corner case that we're not
honoring: when a hotplug event happens on the init path. A future change will
address this as well.
Signed-off-by: David Vernet <void@manifault.com>
Introduce a low-power mode to force the scheduler to operate in a very
non-work conserving way, causing a significant saving in terms of power
consumption, while still providing a good level of responsiveness in the
system.
This option can be enabled in scx_rustland via the --low_power / -l
option.
The idea is to not immediately re-kick a CPU when it enters an idle
state, but do that only if there are no other tasks running in the
system.
In this way, latency-critical tasks can be still dispatched immediately
on the other active CPUs, while CPU-bound tasks will be forced to spend
more time waiting to be scheduled, basically enforcing a special CPU
throttling mechanism that affects only the tasks that are not latency
critical.
The consequence is a reduction in the overall system throughput, but
also a significant reduction of power consumption, that can be useful
for mobile / battery-powered devices.
Test case (using `scx_rustland -l`):
- play a video game (Terraria) while recompiling the kernel
- measure game performance (fps) and core power consumption (W)
- compare the result of normal mode vs low-power mode
Result:
Game performance | Power consumption |
------------+-----------------+-------------------+
normal mode | 60 fps | 6W |
low-power mode | 60 fps | 3W |
As we can see from the result the reduction of power consumption is
quite significant (50%), while the responsiveness of the game (fps)
remains the same, that means battery life can be potentially doubled
without significantly affecting system responsiveness.
The overall throughput of the system is, of course, affected in a
negative way (kernel build is approximately 50% slower during this
test), but the goal here is to save power while still maintaining a good
level of responsiveness in the system.
For this reason the low-power mode should be considered only in
emergency conditions, for example when the system is close to completely
run out of power or simply to extend the battery life of a mobile device
without compromising its responsiveness.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Schedulers and the kernel can include an exit code when exiting a scheduler.
There are some built-in codes that can be specified: SCX_ECODE_RSN_HOTPLUG,
and SCX_ECODE_ACT_RESTART. Some schedulers may want to check the exit
code against these values, so let's export them from user_exit_info.rs.
We use lazy_static so that we can read the values for the enum for the
currently-running kernel.
Signed-off-by: David Vernet <void@manifault.com>
The comment that describes rustland_update_idle() is still incorrectly
reporting an old implemention detail. Update its description for better
clarity.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Change the BPF CPU selection logic as following:
- if the previously used CPU is idle, keep using it
- if the task is not coming from a wait state, try to stick as much as
possible to the same CPU (for better cache usage)
- if the task is waking up from a wait state rely on the sched_ext
built-int idle selection logic
This logic can be completely disabled when the full user-space mode is
enabled. In this case tasks will always be assigned to the previously
used CPU and the user-space scheduler should take care of distributing
them among the available CPUs.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Some users are running with NUMA disabled, which makes sense given that it's
useless in a lot of contexts. Let's make the Topology crate assume a default
node with ID 0 in such cases.
Signed-off-by: David Vernet <void@manifault.com>
Add a method to TopologyMap to get the amount of online CPUs.
Considering that most of the schedulers are not handling CPU hotplugging
it can be useful to expose also this metric in addition to the amount of
available CPUs in the system.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Drop the global effective time-slice and use the more fine-grained
per-task time-slice to implement the dynamic time-slice capability.
This allows to reduce the scheduler's overhead (dropping the global time
slice volatile variable shared between user-space and BPF) and it
provides a more fine-grained control on the per-task time slice.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
If another scheduler is already running, the Rust schedulers based on
scx_utils are reporting an error like the following, that can be a bit
difficult to understand:
Error: Failed to attach struct ops
Caused by:
bpf call "libbpf_rs::map::Map::attach_struct_ops::{{closure}}" returned NULL
Change the scx_ops_attach macro to check if another sched_ext scheduler
is running and in that case report a more explicit error.
With this applied:
$ sudo scx_rustland
Error: another sched_ext scheduler is already running
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Replace the BPF_MAP_TYPE_QUEUE with a BPF_MAP_TYPE_USER_RINGBUF to store
the tasks dispatched from the user-space scheduler to the BPF component.
This eliminates the need of the bpf() syscalls, significantly reducing
the overhead of the user-space->kernel communication and delivering a
notable performance boost in the overall system throughput.
Based on experimental results, this change allows to reduces the scheduling
overhead by approximately 30-35% when the system is overcommitted.
This improvement has the potential to make user-space schedulers based
on scx_rustland_core viable options for real production systems.
Link: https://github.com/libbpf/libbpf-rs/pull/776
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
In order to prevent deadlock conditions user-space schedulers need to
perform memory allocations from a pool of pre-allocated memory that will
never be unmapped/reclaimed by the kernel (unevictable memory).
However, there is a special kernel sysctl setting
(vm.compact_unevictable_allowed) that allows kcompactd to reclaim
unevictable memory. This behavior should be prevented by setting
vm.compact_unevictable_allowed = 0, that is what scx_rustland_core does
transparently when a scheduler is started, restoring the previous value
when the scheduler is stopped.
Unfortunately, this is not always doable, especially when running a
scheduler inside a containerized environment or under certain
security/privilege restrictions (e.g., AppArmor confinement).
Therefore, just report a WARNING if we are unable to change this
parameter, instead of considering it a hard-failure, to allow running
scx_rustland_core schedulers also inside such limited environments.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Given that rustland_core now supports task preemption and it has been
tested successfully, it's worhtwhile to cut a new version of the crate.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
It would be useful to be able to check whether a kfunc exists from rust
user space. For example, now that we have support for adjusting cpufreq
in scx_layered, we'll want to be able to test whether or not a the
scx_bpf_cpuperf_set() (and friends) kfuncs are present for backwards
compat purposes. Let's add a kfunc_exists() function to compat.rs for
this purpose.
Signed-off-by: David Vernet <void@manifault.com>
Introuce a TopologyMap object, represented as an array of arrays, where
each inner array corresponds to a core containing its associated CPU
IDs.
This object can be used as a cache to facilitate efficient iteration
over the entire host's topology.
Example usage:
let topo = Topology::new()?;
let topo_map = TopologyMap::new(topo)?;
for (core_id, core) in topo_map.iter().enumerate() {
for cpu in core {
println!("core={} cpu={}", core_id, cpu);
}
}
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
We're currently cloning cpumasks returned by calls to {Core, Cache,
Node, Topology}::span(). If a caller needs to clone it, they can. Let's
not penalize the callers that just want to query the underlying cpumask.
Signed-off-by: David Vernet <void@manifault.com>
Do not encode dispatch flags in the cpu field, but simply use a separate
"flags" field.
This makes the code much simpler and it increases the size of
dispatched_task_ctx from 24 to 32, that is probably better in terms of
cacheline allocation / performance.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Introduce the new dispatch flag RL_PREEMPT_CPU that can be used to
dispatch tasks that can preempt others.
Tasks with this flag set will be dispatched by the BPF part using
SCX_ENQ_PREEMPT, so they can potentially preempt any other task running
on the target CPU.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Reserve some bits of the `cpu` attribute of a task to store special
dispatch flags.
Initially, let's introduce just RL_CPU_ANY to replace the special value
NO_CPU, indicating that the task can be dispatched on any CPU,
specifically the first CPU that becomes available.
This allows to keep the CPU value assigned by the builtin idle selection
logic, that can potentially be used later for further optimizations.
Moreover, having the possibility to specify dispatch flags gives more
flexibility and it allows to map new scheduling features to such flags.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Newer kernels also support exiting gracefully with an exit code. Let's
update the UserExitInfo struct to also read and export this value.
Signed-off-by: David Vernet <void@manifault.com>
scx_rustland_core needs to ship both a binary part and a source code
part, which will be used to build schedulers based on it.
To effectively publish the scx_rustland_core crate on crates.io we need
to properly separate the source code assets from the crate's main source
code.
To achieve this, move the assets into a separate directory and declare
them inside a [lib] section in Cargo.toml.
This allows to publish the crate on crates.io, providing also a clear
separation between source code and assets.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Now that libbpf-rs 0.23 has been officially released with the new
consume_raw() API (https://github.com/libbpf/libbpf-rs/pull/680) we can
re-introduce the change in rustland-core that allows to use this API to
improve the quality of the code and make it slightly more efficient when
consuming tasks from BPF to user-space.
Fixes: bd2c18a ("Revert "scx_rustland_core: use new consume_raw() libbpf-rs API"")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
As described in https://github.com/sched-ext/scx/issues/195, apparently
some chips don't export information about their cache topology. There's
not much we can do if we don't have that information, so let's just
assume a unified cache per node if that happens.
Andrea suggested this patch -- I'm applying exactly what he proposed,
with a slightly modified comment.
Suggested-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: David Vernet <void@manifault.com>
As described in https://bugzilla.kernel.org/show_bug.cgi?id=218109,
https://github.com/sched-ext/scx/issues/147 and
https://github.com/sched-ext/sched_ext/issues/69, AMD chips can
sometimes report fully disabled CPUs as offline, which causes us to
count them when looking at /sys/devices/system/cpu/possible.
Additionally, systems can have holes in their active CPU maps. For
example, a system with CPUs 0, 1, 2, 3 possible, may have only 0 and 2
active. To address this, we need to do a few things:
1. Update topology.rs to be clear that it's returning the number of
_possible_ CPUs in the system. Also update Topology to only record
online CPUs when creating its span and iterating over sysfs when
creating domains. It was previously trying to record when a CPU was
online, but this was actually broken as the topology directory isn't
present in sysfs when the CPU is offline.
2. Schedulers should not be relying on nr_possible_cpus for anything
other than interacting with per-CPU data (e.g. for stats extraction),
or e.g. verifying maximum sizes of statically sized arrays in BPF. It
should _not_ be used for e.g. performing load calculations, etc. With
that said, we'll also need to update schedulers to not rely on the
nr_possible_cpus figure being exported by the topology crate. We do
that for rusty in this patch, but don't fix any of the others other
than updating how they call topology.rs.
3. Account for the fact that LLC IDs may be non-contiguous. For example,
if there is a single core in an LLC, then if we assign LLC IDs to
domains, then the domain IDs won't be contiguous. This doesn't fit
our current model which is used by e.g. infeasible_weights.rs. We'll
update some of the code in rusty to accomodate this, but we'll need
to do more.
4. Update schedulers to properly reset themselves in the event of a
hotplug event. We'll take care of that in a follow-on change.
Signed-off-by: David Vernet <void@manifault.com>
We implement functions or(), and(), and xor() for cpumasks, but we
should also implement the bitwise ops for those operations in case
people prefer that syntax.
Signed-off-by: David Vernet <void@manifault.com>
Offline CPUs don't have a /sys/devices/system/cpu/cpuN/topology
directory, so let's just skip them if they're not online. Schedulers are
expected to detect hotplug, and handle gracefully restarting.
Signed-off-by: David Vernet <void@manifault.com>
We're iterating from min..max cpu in cpus_online(), but that's not
inclusive of the max CPU. Let's also include that so we don't think that
last CPU is offline.
Signed-off-by: David Vernet <void@manifault.com>
Most of the schedulers assume that the amount of possible CPUs in the
system represents the actual number of CPUs available.
This is not always true: some CPUs may be offline or certain CPU models
(AMD CPUs for example) may include unavailable CPUs in this number.
This can lead to sub-optimal performance or even errors in the scheduler
(see for example [1][2]).
Ideally, we need to attack this issue in a more generic way, such as
having a proper API provided by a C library, that can be used by all
schedulers and the topology Rust module (scx_utils crate).
But for now, let's try to mitigate most of the common sub-optimal cases
separately inside each scheduler.
For rustland we can apply some mitigations both in select_cpu() (for the
BPF part) and in the user-space part:
- the former is fixed in the sched-ext kernel by commit 94dc0c01b957
("scx: Use cpu_online_mask when resetting idle masks"). However,
adding an extra check `cpu < num_possible_cpus` in select_cpu(),
allows to properly support AMD CPUs, even with kernels that don't
have the cpu_online_mask fix yet (this doesn't always guarantee the
validity of cpu, but it should be enough to mitigate the majority of
the potential sub-optimal cases, without introducing any significant
overhead)
- the latter can be fixed relying on topology.span(), instead of
topology.nr_cpus(), to count the amount of available CPUs in the
system.
[1] https://github.com/sched-ext/sched_ext/issues/69
[2] https://github.com/sched-ext/scx/issues/147
Link: 94dc0c01b9
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
We are failing to parse /sys/devices/system/cpu/online in systems with
just one CPU, for example:
$ vng -r --cpus 1 -- scx_rusty
Error: Failed to parse online cpus 0
Correctly handle strings containing only a single CPU during parsing.
Fixes: c5a3b83b ("topology: Add new topology crate")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The current topology.rs crate assumes that all cores have unique core
IDs in a system. This need not be the case, such as in certain Intel
Xeon processors which reuse core IDs in different NUMA nodes. Let's
update the crate to assume unique core IDs only per socket.
Signed-off-by: David Vernet <void@manifault.com>
Provide distinct methods to set the target CPU and the per-task time
slice to dispatched tasks.
Moreover, also provide a constructor to create a DispatchedTask from a
QueuedTask (this allows to automatically bounce a task from the
scheduler to the BPF dispatcher without having to take care of setting
the individual task's attributes).
This also allows to make most of the attributes of DispatchedTask
private, especially it allows to hide cpumask_cnt, that should be only
used internally between the BPF and the user-space component.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Provide a way to set a different time slice per-task, by adding a new
attribute slice_ns to the DispatchedTask struct.
This attribute determines the time slice assigned to the task, if it is
set to 0 then the global time slice (either the default one or the
effective one, if set) will be used.
At the same time, remove the payload attribute, that is basically unused
(scx_rustland uses it to send the task's vruntime to the BPF dispatcher
for debugging purposes, but it's not very useful anymore at this point).
In the future we may introduce a proper interface to attach a custom
payload to each task with a proper interface.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
This is to potentinally reduce issues with folks
using different versions of libbpf at runtime.
This also:
- makes static linking of libbpf the default
- adds steps in `meson setup` to fetch libbpf and make it
There is no need to generate source code in a temporary directory with
RustLandBuilder(), we can simply generate code in-tree and exclude the
generated source files from .gitignore.
Having the generated source files in-tree can help to debug potential
build issues (and it also allows to drop the the tempfile crate
dependency).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Introduce a Builder() class in scx_utils that can be used by other scx
crates (such as scx_rustland_core) to prevent code duplication.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Introduce a wrapper to scx_utils::BpfBuilder that can be used to build
the BPF component provided by scx_rustland_core.
The source of the BPF components (main.bpf.c) is included in the crate
as an array of bytes, the content is then unpacked in a temporary file
to perform the build.
The RustLandBuilder() helper is also used to generate bpf.rs (that
implements the low-level user-space Rust connector to the BPF
commponent).
Schedulers based on scx_rustland_core can simply use RustLandBuilder(),
to build the backend provided by scx_rustland_core.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Introduce a helper function to update the counter of queued and
scheduled tasks (used to notify the BPF component if the user-space
scheduler has still some pending work to do).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Move the BPF component of scx_rustland to scx_rustland_core and make it
available to other user-space schedulers.
NOTE: main.bpf.c and bpf.rs are not pre-compiled in the
scx_rustland_core crate, they need to be included in the user-space
scheduler's source code in order to be compiled/linked properly.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Introduce a separate crate (scx_rustland_core) that can be used to
implement sched-ext schedulers in Rust that run in user-space.
This commit only provides the basic layout for the new crate and the
abstraction to the custom allocator.
In general, any scheduler that has a user-space component needs to use
the custom allocator to prevent potential deadlock conditions, caused by
page faults (a kthread needs to run to resolve the page fault, but the
scheduler is blocked waiting for the user-space page fault to be
resolved => deadlock).
However, we don't want to necessarily enforce this constraint to all the
existing Rust schedulers, some of them may do all user-space allocations
in safe paths, hence the separate scx_rustland_core crate.
Merging this code in scx_utils would force all the Rust schedulers to
use the custom allocator.
In a future commit the scx_rustland backend will be moved to
scx_rustland_core, making it a totally generic BPF scheduler framework
that can be used to implement user-space schedulers in Rust.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
We want to avoid every scheduler implementation from having to implement
the solution to the infeasible weights problem, but we also want to
enable sufficient flexibility where not every program has to have the
same partition of scheduling domains, etc. To enable this, a new
infeasible crate is added which encapsulates all of the logic for being
given duty cycle and weight, and performing the necessary math to adjust
for infeasibility.
Signed-off-by: David Vernet <void@manifault.com>
For convenience, let's provide callers with a way to easily look up
cores and CPUs from the root topology object.
Signed-off-by: David Vernet <void@manifault.com>
The topology.rs crate is insufficiently generic, and reflects
implementation details of scx_rusty more than it provides generic use
cases for modeling a host's topology. This adds a new topology2.rs crate
that will replace topology.rs. We have this as an intermediate commit so
that we don't bundle updating scx_rusty with adding this crate.
Signed-off-by: David Vernet <void@manifault.com>
Now that we have cpumask.rs, we can remove some logic from topology.rs
and have it create and use Cpumasks.
Signed-off-by: David Vernet <void@manifault.com>
Let's add a Cpumask trait that schedulers can use to avoid all having to
deal directly with BitVec and the like.
Signed-off-by: David Vernet <void@manifault.com>
We currently panic! if we're building a Topology that detects more than
two siblings on a physical core. This can and will likely happen on
multi-socket machines. Given that we're planning to add support for
detecting NUMA nodes soon, let's just demote the panic! to a warn!.
Signed-off-by: David Vernet <void@manifault.com>
scx_rusty has logic in the scheduler to inspect the host to
automatically build scheduling domains across every L3 cache. This would
be generically useful for many different types of schedulers, so let's
add it to the scx_utils crate so it can be used by others.
Signed-off-by: David Vernet <void@manifault.com>
Use c_char to convert C strings, that is more portable across different
architectures.
This prevents a build failure on arm64 and ppc64el.
Fixes: d57a23f ("rust/scx_utils: Add user_exit_info support")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
After updates to reflect the updated init and direct dispatch API, the
schedulers aren't compatible with older kernels. Bump versions and publish
releases.
This is to fix fedora build failures for these archs:
s390x and ppc64le
Error:
```
---- bpf_builder::tests::test_bpf_builder_new stdout ----
thread 'bpf_builder::tests::test_bpf_builder_new' panicked at src/bpf_builder.rs:592:9:
Failed to create BpfBuilder (Err(CPU arch "s390x" not found in ARCH_MAP))
```
https://koji.fedoraproject.org/koji/taskinfo?taskID=111114326
- combine c and kernel-examples as it's confusing to have both
- rename 'rust-user' and 'c-user' to just 'rust' and 'c', which is simpler
- update and fix sync-to-kernel.sh
This is a follow on to #32, which got reverted. I wrongly assumed that
scx_rusty resides in the sched_ext tree and consumes published version
of scx_utils.
With this change we update the other in-tree dependencies. I built
scx_layered & scx_rusty. I bumped scx-utils to 0.4, because the
libbpf-cargo seems to be part of the public API surface and libbpf-cargo
0.21 and 0.22 are not compatible with each other.
Signed-off-by: Daniel Müller <deso@posteo.net>
Apply the same logic of commit 00cd15a ("build: properly detect clang
version in Ubuntu") in scx_utils as well.
This allows to build scx_utils properly in Ubuntu.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>