Tejun pointed out that a possible issue exists in the current
implementation, wherein if you have two NUMA nodes that are imbalanced,
but their domains are internally balanced, we'll fail to migrate between
them if all nodes are in the balanced_nodes list.
To address this, let's just use a single global list for all types of
domains, and do checking internally for imbalances. The reason it was
done this way in the first place was to allow me to mutably iterate over
both vectors in a nested loop. The way around that is to just use loop
{} and push/pop domains from the list.
We could do the same thing for the NUMA nodes themselves, which are also
in 3 separate lists in the LoadBalancer. We'll do that in a subsequent
commit.
Signed-off-by: David Vernet <void@manifault.com>
In scx_rusty, a CPU that is going to go idle will attempt to steal tasks
from remote domains when its domain has no tasks to run, and a remote
domain has at least greedy_threshold enqueued tasks. This stealing is
temporary, but of course has a cost in that the CPU that's stealing the
task may cause it to suffer from cache misses, or in the case of
multi-node machines, remote NUMA accesses and working sets split across
multiple domains.
Given the higher cost of x NUMA work stealing, let's add a separate flag
that lets users tune the threshold for doing cross NUMA greedy task
stealing.
Signed-off-by: David Vernet <void@manifault.com>
The current topology.rs crate assumes that all cores have unique core
IDs in a system. This need not be the case, such as in certain Intel
Xeon processors which reuse core IDs in different NUMA nodes. Let's
update the crate to assume unique core IDs only per socket.
Signed-off-by: David Vernet <void@manifault.com>
We removed the debug!() output that was previously present in main.rs. Let's
add more debug!() output that helps debug the current LB hierarchy.
Signed-off-by: David Vernet <void@manifault.com>
The scx_rusty load balancer is currently no longer exporting statistics such as
domain load averages, load sums, etc. Now that we're also balancing by NUMA,
we'll need a way to hierarchically illustrate load balancing statistics. This
patch adds support for that.
Signed-off-by: David Vernet <void@manifault.com>
updating stats printing
Signed-off-by: David Vernet <void@manifault.com>
Users may want to toggle whether tasks can be temporarily sent to idle CPUs on
remote NUMA nodes. By default, we want it to be disabled as a task spanning
multiple NUMA nodes will end up having its working set spanning both nodes,
which is probably not desirable. However, in case a workload really wants to
encourage work conservation, let's add a flag that allows them to toggle it.
Signed-off-by: David Vernet <void@manifault.com>
scx_rusty currently pushes tasks to idle cores if the direct greedy threshold
is exceeded, even if the core is on a remote NUMA node. This behavior is
probably not desired in most scenarios. The worst that will happen if a task is
pushed to an idle core in the same node is some L3 cache miss traffic, but for
multiple NUMA nodes, it could cause the task to have its working set span
multiple nodes.
Let's disable direct greedy work stealing across NUMA nodes. A future commit
will add a flag that's disabled by default, and let's users turn this on if
they really want to encourage work conservation.
Signed-off-by: David Vernet <void@manifault.com>
Right now, scx_rusty has no notion of domains spanning NUMA nodes, and makes no
distinction when making load balancing decisions, or work stealing. This can
cause problems on multi-NUMA machines, as load balancing and work stealing
across NUMA nodes has significantly different cost from across L3 cache
boundaries.
In order to better support multi-NUMA machines, this commit adds another layer
to the rusty load balancer, which balances across NUMA nodes using a different
cost function from balancing across domains. Load balancing now takes place
over the span of two passes:
1. In the first pass, we fix imbalances across NUMA nodes by moving tasks
between domains across those NUMA node boundaries. We require a load
imbalance of at least 17% in order to move load at this stage. The ratio of
load imbalance we attempt to adjust (50%) and the maximum amount of load
we're allowed to push out of a domain (50%) is still the same as when
balancing between domains inside a NUMA node, but this is easy to tune with
the current setup.
2. Once we've balanced across NUMA nodes, we iterate over all nodes and balance
between the domains within each NUMA node. The cost function here is the
same as what it has been thus far: we require at least a 5% imbalance in
order to trigger load balancing.
There are a few additional changes / improvements to load balancing in this
commit:
1. NUMA nodes and domains are now ordered according to their load by using
SortedVec objects. We were previously using BTreeMap keyed by load, but this
was suboptimal due to the fact that it doesn't allow duplicate entries.
2. We're no longer exporting load balancing statistics as a vector of data such
as load sums, averages, and imbalances. This is instead all encapsulated in
the load balancing hierarchy we setup in lb.load_balance(). These statistics
are not yet exported, but they will be in a subsequent commit.
One of the issues with this commit is that it does introduce some
almost-identical logic that somehow begs to be deduplicated. For example, when
we balance between NUMA nodes, the logic for iterating over push nodes and
pushing to pull nodes is very similar to the logic of iterating over push
domains and pull domains when balancing within a node. It may be that this can
be improved.
The following are some benchmarks run on an Intel Xeon Gold 6138 (2 x 40 core
processor):
kcompile
--------
On Commit a27648c74210 ("afs: Fix setting of mtime when creating a
file/dir/symlink"):
1. make allyesconfig
2. make -j $(nproc) built-in.a
3. make -j clean
4. goto 2
Runtime
-------
o-----------o-----------o----------o
| scx_rusty | CFS | Delta |
---------o-----------o-----------o----------o
Mean | 562.688s | 566.085s | -.6% |
---------o-----------o-----------o----------o
Variance | 0.54387 | 0.72431 | -24.9% |
---------o-----------o-----------o----------o
o-----------o-----------o----------o
| rusty NUMA| rusty ORIG| Delta |
---------o-----------o-----------o----------o
Mean | 562.688s | 563.209s | -.092% |
---------o-----------o-----------o----------o
Variance | 0.54387 | 0.42038 | 29.38% |
---------o-----------o-----------o----------o
scx_rusty with NUMA awareness clearly beats CFS, but only barely beats
scx_rusty without it. This isn't necessarily super surprising given that
this is kcompile, which has very poor front-end CPU locality. Further
experimentation with toggling the cost function for performing
migrations may improve this further.
CPU util
--------
o-----------o-----------o----------o
| scx_rusty | CFS | Delta |
---------o-----------o-----------o----------o
Mean | 7654.25% | 7551.67% | 1.11% |
---------o-----------o-----------o----------o
Variance | 165.35714 | 158.3333 | 4.436% |
---------o-----------o-----------o----------o
o-----------o-----------o----------o
| rusty NUMA| rusty ORIG| Delta |
---------o-----------o-----------o----------o
Mean | 7654.25% | 7641.57% | 0.1659% |
---------o-----------o-----------o----------o
Variance | 165.35714 | 1230.619 | -86.5% |
---------o-----------o-----------o----------o
As expected, CPU util is quite a bit higher with scx_rusty than it is
with CFS. Further experiments that could be interesting are always
enabling direct-greedy stealing between domains within a NUMA node, and
then comparing rusty NUMA and rusty ORIG. rusty NUMA prevents stealing
between NUMA nodes, so this would show whether the locality introduced
by NUMA awareness appropriately offsets the loss of work conservation.
Major PFs
---------
o-----------o-----------o----------o
| scx_rusty | CFS | Delta |
---------o-----------o-----------o----------o
Mean | 5332 | 3950 | 36.566% |
---------o-----------o-----------o----------o
Variance | 6975.5 | 5986.333 | 16.5237% |
---------o-----------o-----------o----------o
o-----------o-----------o----------o
| rusty NUMA| rusty ORIG| Delta |
---------o-----------o-----------o----------o
Mean | 5332 | 5336.5 | -.084% |
---------o-----------o-----------o----------o
Variance | 6975.5 | 955.5 | 630.03% |
---------o-----------o-----------o----------o
Also as expected, major page faults are far highe higher with scx_rusty
than with CFS. This is expected even with NUMA awareness, given that
scx_rusty is still less sticky than CFS.
Further experiments that could be interesting are tuning the threshold
for which we perform x NUMA migrations to try and keep this value even
lower. The rate of major page faults between rusty NUMA and rusty ORIG
were very close, though rusty NUMA was a bit lower.
Signed-off-by: David Vernet <void@manifault.com>
More cleanup of scx_rusty. Let's move the LoadBalancer out of rusty.rs and into
its own file. It will soon be extended quite a bit to support multi-NUMA and
other multivariate LB cost functions, so it's time to clean things up and split
it out.
Signed-off-by: David Vernet <void@manifault.com>
rusty.rs is growing a bit unwieldy. We're going to want to update its load
balancing logic somewhat significantly to account for multi-NUMA and other cost
functions, so let's start cleaning the code up so that things are more
logically segmented and easier to work with.
To start, we move the Tuner and DomainGroup/Domain objects into their own
modules.
Signed-off-by: David Vernet <void@manifault.com>
Instead clone the libbpf repo at a specific hash during setup.
This is to fix an issue whereby submodules are not included
in the tarball and therefore won't be updated/fetched during
setup after unzipping the tarball.
This is to fix an error sometimes seen when compiling with gcc,
whereby the position independent sched ext rust libraries
don't play nicely with the libbpf library if it's not also built
as position independent.
This also explicitly sets the rustc relocation-mode to "pic",
which is the default (just so this doesn't accidentally change
out from under us).
scx_rlfifo is provided as a simple example to show how to use
scx_rustland_core and it's not supposed to be used in a real production
environment.
To prevent performance bug reports print an explicit warning when it's
started to clarify the goal of this scheduler.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Small improvement to make the scheduler a bit more responsive, without
introducing too much complexity or too much CPU overhead.
This can be achieved by replacing a sleep of 1ms with a sched_yield()
every time that the scheduler has finished to dispatch all the queued
tasks.
This also makes the code a bit smaller and easier to read.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Otherwise, we end up passing CC=ccache to libbpf's Makefile which triggers
an error as ccache invoked on its own can't act as a stand-in for the
compiler.
Provide distinct methods to set the target CPU and the per-task time
slice to dispatched tasks.
Moreover, also provide a constructor to create a DispatchedTask from a
QueuedTask (this allows to automatically bounce a task from the
scheduler to the BPF dispatcher without having to take care of setting
the individual task's attributes).
This also allows to make most of the attributes of DispatchedTask
private, especially it allows to hide cpumask_cnt, that should be only
used internally between the BPF and the user-space component.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Provide a way to set a different time slice per-task, by adding a new
attribute slice_ns to the DispatchedTask struct.
This attribute determines the time slice assigned to the task, if it is
set to 0 then the global time slice (either the default one or the
effective one, if set) will be used.
At the same time, remove the payload attribute, that is basically unused
(scx_rustland uses it to send the task's vruntime to the BPF dispatcher
for debugging purposes, but it's not very useful anymore at this point).
In the future we may introduce a proper interface to attach a custom
payload to each task with a proper interface.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
This is to potentinally reduce issues with folks
using different versions of libbpf at runtime.
This also:
- makes static linking of libbpf the default
- adds steps in `meson setup` to fetch libbpf and make it
This reverts commit a7b39f24e2, reversing
changes made to cf7404fb03.
The PR doesn't do what the description says. It instead limits the number of
rustc instances to 1 for each cargo build making rust builds extremely slow.
Let's revert and try again.
This reverts commit 5b9b953e3c, reversing
changes made to a7b39f24e2.
The current git submodule approach is a bit cumbersome and doesn't provide a
unified build environment for both libbpf and scx scheds. Also, the build
instruction doesn't seem to work. Let's revert it for now.
Each cargo build is already parallelized, spreading multiple rustc
across all the available CPUs by default.
Allowing to run multiple instances of cargo at the same time doesn't
provide any benefit and it can only increase the risk of triggering OOM
conditions or overloading the build system.
Therefore, limit the amount of parallel cargo build instances to 1.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
There is no need to generate source code in a temporary directory with
RustLandBuilder(), we can simply generate code in-tree and exclude the
generated source files from .gitignore.
Having the generated source files in-tree can help to debug potential
build issues (and it also allows to drop the the tempfile crate
dependency).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Introduce a Builder() class in scx_utils that can be used by other scx
crates (such as scx_rustland_core) to prevent code duplication.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Introduce a wrapper to scx_utils::BpfBuilder that can be used to build
the BPF component provided by scx_rustland_core.
The source of the BPF components (main.bpf.c) is included in the crate
as an array of bytes, the content is then unpacked in a temporary file
to perform the build.
The RustLandBuilder() helper is also used to generate bpf.rs (that
implements the low-level user-space Rust connector to the BPF
commponent).
Schedulers based on scx_rustland_core can simply use RustLandBuilder(),
to build the backend provided by scx_rustland_core.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Introduce a helper function to update the counter of queued and
scheduled tasks (used to notify the BPF component if the user-space
scheduler has still some pending work to do).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
scx_rustland has significantly evolved since its original design.
With the introduction of scx_rustland_core and the inclusion of the
scx_rlfifo example, scx_rustland's focus can be shifted from solely
being an "easy-to-read Rust scheduler template" to a fully functional
scheduler.
For this reason, update the README and documentation to reflect its
revised design, objectives, and intended use cases.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Move the BPF component of scx_rustland to scx_rustland_core and make it
available to other user-space schedulers.
NOTE: main.bpf.c and bpf.rs are not pre-compiled in the
scx_rustland_core crate, they need to be included in the user-space
scheduler's source code in order to be compiled/linked properly.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Introduce a separate crate (scx_rustland_core) that can be used to
implement sched-ext schedulers in Rust that run in user-space.
This commit only provides the basic layout for the new crate and the
abstraction to the custom allocator.
In general, any scheduler that has a user-space component needs to use
the custom allocator to prevent potential deadlock conditions, caused by
page faults (a kthread needs to run to resolve the page fault, but the
scheduler is blocked waiting for the user-space page fault to be
resolved => deadlock).
However, we don't want to necessarily enforce this constraint to all the
existing Rust schedulers, some of them may do all user-space allocations
in safe paths, hence the separate scx_rustland_core crate.
Merging this code in scx_utils would force all the Rust schedulers to
use the custom allocator.
In a future commit the scx_rustland backend will be moved to
scx_rustland_core, making it a totally generic BPF scheduler framework
that can be used to implement user-space schedulers in Rust.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>