In order to prevent deadlock conditions user-space schedulers need to
perform memory allocations from a pool of pre-allocated memory that will
never be unmapped/reclaimed by the kernel (unevictable memory).
However, there is a special kernel sysctl setting
(vm.compact_unevictable_allowed) that allows kcompactd to reclaim
unevictable memory. This behavior should be prevented by setting
vm.compact_unevictable_allowed = 0, that is what scx_rustland_core does
transparently when a scheduler is started, restoring the previous value
when the scheduler is stopped.
Unfortunately, this is not always doable, especially when running a
scheduler inside a containerized environment or under certain
security/privilege restrictions (e.g., AppArmor confinement).
Therefore, just report a WARNING if we are unable to change this
parameter, instead of considering it a hard-failure, to allow running
scx_rustland_core schedulers also inside such limited environments.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Given that rustland_core now supports task preemption and it has been
tested successfully, it's worhtwhile to cut a new version of the crate.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
It would be useful to be able to check whether a kfunc exists from rust
user space. For example, now that we have support for adjusting cpufreq
in scx_layered, we'll want to be able to test whether or not a the
scx_bpf_cpuperf_set() (and friends) kfuncs are present for backwards
compat purposes. Let's add a kfunc_exists() function to compat.rs for
this purpose.
Signed-off-by: David Vernet <void@manifault.com>
Introuce a TopologyMap object, represented as an array of arrays, where
each inner array corresponds to a core containing its associated CPU
IDs.
This object can be used as a cache to facilitate efficient iteration
over the entire host's topology.
Example usage:
let topo = Topology::new()?;
let topo_map = TopologyMap::new(topo)?;
for (core_id, core) in topo_map.iter().enumerate() {
for cpu in core {
println!("core={} cpu={}", core_id, cpu);
}
}
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
We're currently cloning cpumasks returned by calls to {Core, Cache,
Node, Topology}::span(). If a caller needs to clone it, they can. Let's
not penalize the callers that just want to query the underlying cpumask.
Signed-off-by: David Vernet <void@manifault.com>
Do not encode dispatch flags in the cpu field, but simply use a separate
"flags" field.
This makes the code much simpler and it increases the size of
dispatched_task_ctx from 24 to 32, that is probably better in terms of
cacheline allocation / performance.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Introduce the new dispatch flag RL_PREEMPT_CPU that can be used to
dispatch tasks that can preempt others.
Tasks with this flag set will be dispatched by the BPF part using
SCX_ENQ_PREEMPT, so they can potentially preempt any other task running
on the target CPU.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Reserve some bits of the `cpu` attribute of a task to store special
dispatch flags.
Initially, let's introduce just RL_CPU_ANY to replace the special value
NO_CPU, indicating that the task can be dispatched on any CPU,
specifically the first CPU that becomes available.
This allows to keep the CPU value assigned by the builtin idle selection
logic, that can potentially be used later for further optimizations.
Moreover, having the possibility to specify dispatch flags gives more
flexibility and it allows to map new scheduling features to such flags.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Newer kernels also support exiting gracefully with an exit code. Let's
update the UserExitInfo struct to also read and export this value.
Signed-off-by: David Vernet <void@manifault.com>
scx_rustland_core needs to ship both a binary part and a source code
part, which will be used to build schedulers based on it.
To effectively publish the scx_rustland_core crate on crates.io we need
to properly separate the source code assets from the crate's main source
code.
To achieve this, move the assets into a separate directory and declare
them inside a [lib] section in Cargo.toml.
This allows to publish the crate on crates.io, providing also a clear
separation between source code and assets.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Now that libbpf-rs 0.23 has been officially released with the new
consume_raw() API (https://github.com/libbpf/libbpf-rs/pull/680) we can
re-introduce the change in rustland-core that allows to use this API to
improve the quality of the code and make it slightly more efficient when
consuming tasks from BPF to user-space.
Fixes: bd2c18a ("Revert "scx_rustland_core: use new consume_raw() libbpf-rs API"")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
As described in https://github.com/sched-ext/scx/issues/195, apparently
some chips don't export information about their cache topology. There's
not much we can do if we don't have that information, so let's just
assume a unified cache per node if that happens.
Andrea suggested this patch -- I'm applying exactly what he proposed,
with a slightly modified comment.
Suggested-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: David Vernet <void@manifault.com>
As described in https://bugzilla.kernel.org/show_bug.cgi?id=218109,
https://github.com/sched-ext/scx/issues/147 and
https://github.com/sched-ext/sched_ext/issues/69, AMD chips can
sometimes report fully disabled CPUs as offline, which causes us to
count them when looking at /sys/devices/system/cpu/possible.
Additionally, systems can have holes in their active CPU maps. For
example, a system with CPUs 0, 1, 2, 3 possible, may have only 0 and 2
active. To address this, we need to do a few things:
1. Update topology.rs to be clear that it's returning the number of
_possible_ CPUs in the system. Also update Topology to only record
online CPUs when creating its span and iterating over sysfs when
creating domains. It was previously trying to record when a CPU was
online, but this was actually broken as the topology directory isn't
present in sysfs when the CPU is offline.
2. Schedulers should not be relying on nr_possible_cpus for anything
other than interacting with per-CPU data (e.g. for stats extraction),
or e.g. verifying maximum sizes of statically sized arrays in BPF. It
should _not_ be used for e.g. performing load calculations, etc. With
that said, we'll also need to update schedulers to not rely on the
nr_possible_cpus figure being exported by the topology crate. We do
that for rusty in this patch, but don't fix any of the others other
than updating how they call topology.rs.
3. Account for the fact that LLC IDs may be non-contiguous. For example,
if there is a single core in an LLC, then if we assign LLC IDs to
domains, then the domain IDs won't be contiguous. This doesn't fit
our current model which is used by e.g. infeasible_weights.rs. We'll
update some of the code in rusty to accomodate this, but we'll need
to do more.
4. Update schedulers to properly reset themselves in the event of a
hotplug event. We'll take care of that in a follow-on change.
Signed-off-by: David Vernet <void@manifault.com>
We implement functions or(), and(), and xor() for cpumasks, but we
should also implement the bitwise ops for those operations in case
people prefer that syntax.
Signed-off-by: David Vernet <void@manifault.com>
Offline CPUs don't have a /sys/devices/system/cpu/cpuN/topology
directory, so let's just skip them if they're not online. Schedulers are
expected to detect hotplug, and handle gracefully restarting.
Signed-off-by: David Vernet <void@manifault.com>
We're iterating from min..max cpu in cpus_online(), but that's not
inclusive of the max CPU. Let's also include that so we don't think that
last CPU is offline.
Signed-off-by: David Vernet <void@manifault.com>
Most of the schedulers assume that the amount of possible CPUs in the
system represents the actual number of CPUs available.
This is not always true: some CPUs may be offline or certain CPU models
(AMD CPUs for example) may include unavailable CPUs in this number.
This can lead to sub-optimal performance or even errors in the scheduler
(see for example [1][2]).
Ideally, we need to attack this issue in a more generic way, such as
having a proper API provided by a C library, that can be used by all
schedulers and the topology Rust module (scx_utils crate).
But for now, let's try to mitigate most of the common sub-optimal cases
separately inside each scheduler.
For rustland we can apply some mitigations both in select_cpu() (for the
BPF part) and in the user-space part:
- the former is fixed in the sched-ext kernel by commit 94dc0c01b957
("scx: Use cpu_online_mask when resetting idle masks"). However,
adding an extra check `cpu < num_possible_cpus` in select_cpu(),
allows to properly support AMD CPUs, even with kernels that don't
have the cpu_online_mask fix yet (this doesn't always guarantee the
validity of cpu, but it should be enough to mitigate the majority of
the potential sub-optimal cases, without introducing any significant
overhead)
- the latter can be fixed relying on topology.span(), instead of
topology.nr_cpus(), to count the amount of available CPUs in the
system.
[1] https://github.com/sched-ext/sched_ext/issues/69
[2] https://github.com/sched-ext/scx/issues/147
Link: 94dc0c01b9
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
We are failing to parse /sys/devices/system/cpu/online in systems with
just one CPU, for example:
$ vng -r --cpus 1 -- scx_rusty
Error: Failed to parse online cpus 0
Correctly handle strings containing only a single CPU during parsing.
Fixes: c5a3b83b ("topology: Add new topology crate")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The current topology.rs crate assumes that all cores have unique core
IDs in a system. This need not be the case, such as in certain Intel
Xeon processors which reuse core IDs in different NUMA nodes. Let's
update the crate to assume unique core IDs only per socket.
Signed-off-by: David Vernet <void@manifault.com>
Provide distinct methods to set the target CPU and the per-task time
slice to dispatched tasks.
Moreover, also provide a constructor to create a DispatchedTask from a
QueuedTask (this allows to automatically bounce a task from the
scheduler to the BPF dispatcher without having to take care of setting
the individual task's attributes).
This also allows to make most of the attributes of DispatchedTask
private, especially it allows to hide cpumask_cnt, that should be only
used internally between the BPF and the user-space component.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Provide a way to set a different time slice per-task, by adding a new
attribute slice_ns to the DispatchedTask struct.
This attribute determines the time slice assigned to the task, if it is
set to 0 then the global time slice (either the default one or the
effective one, if set) will be used.
At the same time, remove the payload attribute, that is basically unused
(scx_rustland uses it to send the task's vruntime to the BPF dispatcher
for debugging purposes, but it's not very useful anymore at this point).
In the future we may introduce a proper interface to attach a custom
payload to each task with a proper interface.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
This is to potentinally reduce issues with folks
using different versions of libbpf at runtime.
This also:
- makes static linking of libbpf the default
- adds steps in `meson setup` to fetch libbpf and make it
There is no need to generate source code in a temporary directory with
RustLandBuilder(), we can simply generate code in-tree and exclude the
generated source files from .gitignore.
Having the generated source files in-tree can help to debug potential
build issues (and it also allows to drop the the tempfile crate
dependency).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Introduce a Builder() class in scx_utils that can be used by other scx
crates (such as scx_rustland_core) to prevent code duplication.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Introduce a wrapper to scx_utils::BpfBuilder that can be used to build
the BPF component provided by scx_rustland_core.
The source of the BPF components (main.bpf.c) is included in the crate
as an array of bytes, the content is then unpacked in a temporary file
to perform the build.
The RustLandBuilder() helper is also used to generate bpf.rs (that
implements the low-level user-space Rust connector to the BPF
commponent).
Schedulers based on scx_rustland_core can simply use RustLandBuilder(),
to build the backend provided by scx_rustland_core.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Introduce a helper function to update the counter of queued and
scheduled tasks (used to notify the BPF component if the user-space
scheduler has still some pending work to do).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Move the BPF component of scx_rustland to scx_rustland_core and make it
available to other user-space schedulers.
NOTE: main.bpf.c and bpf.rs are not pre-compiled in the
scx_rustland_core crate, they need to be included in the user-space
scheduler's source code in order to be compiled/linked properly.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Introduce a separate crate (scx_rustland_core) that can be used to
implement sched-ext schedulers in Rust that run in user-space.
This commit only provides the basic layout for the new crate and the
abstraction to the custom allocator.
In general, any scheduler that has a user-space component needs to use
the custom allocator to prevent potential deadlock conditions, caused by
page faults (a kthread needs to run to resolve the page fault, but the
scheduler is blocked waiting for the user-space page fault to be
resolved => deadlock).
However, we don't want to necessarily enforce this constraint to all the
existing Rust schedulers, some of them may do all user-space allocations
in safe paths, hence the separate scx_rustland_core crate.
Merging this code in scx_utils would force all the Rust schedulers to
use the custom allocator.
In a future commit the scx_rustland backend will be moved to
scx_rustland_core, making it a totally generic BPF scheduler framework
that can be used to implement user-space schedulers in Rust.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
We want to avoid every scheduler implementation from having to implement
the solution to the infeasible weights problem, but we also want to
enable sufficient flexibility where not every program has to have the
same partition of scheduling domains, etc. To enable this, a new
infeasible crate is added which encapsulates all of the logic for being
given duty cycle and weight, and performing the necessary math to adjust
for infeasibility.
Signed-off-by: David Vernet <void@manifault.com>
For convenience, let's provide callers with a way to easily look up
cores and CPUs from the root topology object.
Signed-off-by: David Vernet <void@manifault.com>
The topology.rs crate is insufficiently generic, and reflects
implementation details of scx_rusty more than it provides generic use
cases for modeling a host's topology. This adds a new topology2.rs crate
that will replace topology.rs. We have this as an intermediate commit so
that we don't bundle updating scx_rusty with adding this crate.
Signed-off-by: David Vernet <void@manifault.com>
Now that we have cpumask.rs, we can remove some logic from topology.rs
and have it create and use Cpumasks.
Signed-off-by: David Vernet <void@manifault.com>