Commit Graph

464 Commits

Author SHA1 Message Date
David Vernet
829b1d3ced rusty: Don't use multiple SortedVec's in struct NumaNode
Tejun pointed out that a possible issue exists in the current
implementation, wherein if you have two NUMA nodes that are imbalanced,
but their domains are internally balanced, we'll fail to migrate between
them if all nodes are in the balanced_nodes list.

To address this, let's just use a single global list for all types of
domains, and do checking internally for imbalances. The reason it was
done this way in the first place was to allow me to mutably iterate over
both vectors in a nested loop. The way around that is to just use loop
{} and push/pop domains from the list.
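
A minimal sketch of that pattern; Domain, try_transfer(), and the cutoff
value below are illustrative stand-ins rather than the actual scx_rusty
code:

  #[derive(Debug)]
  struct Domain {
      id: usize,
      load: f64,
  }

  /// Pretend transfer: move half of the imbalance from `push` to `pull` if
  /// it's worth fixing. Returns true if anything was moved.
  fn try_transfer(push: &mut Domain, pull: &mut Domain, avg: f64) -> bool {
      let imbal = push.load - avg;
      if imbal <= avg * 0.05 {
          return false;
      }
      let xfer = imbal / 2.0;
      push.load -= xfer;
      pull.load += xfer;
      true
  }

  fn balance(doms: &mut Vec<Domain>) {
      let avg = doms.iter().map(|d| d.load).sum::<f64>() / doms.len() as f64;

      // Instead of nested iteration over two separate vectors (which would
      // require two simultaneous mutable borrows), pop the extremes off a
      // single list, mutate them, and push them back.
      loop {
          doms.sort_by(|a, b| a.load.partial_cmp(&b.load).unwrap());
          if doms.len() < 2 {
              break;
          }
          let mut push = doms.pop().unwrap(); // most loaded
          let mut pull = doms.remove(0);      // least loaded

          let moved = try_transfer(&mut push, &mut pull, avg);
          doms.push(pull);
          doms.push(push);
          if !moved {
              break;
          }
      }
  }

  fn main() {
      let mut doms = vec![
          Domain { id: 0, load: 80.0 },
          Domain { id: 1, load: 20.0 },
          Domain { id: 2, load: 50.0 },
      ];
      balance(&mut doms);
      println!("{:?}", doms);
  }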

We could do the same thing for the NUMA nodes themselves, which are also
in 3 separate lists in the LoadBalancer. We'll do that in a subsequent
commit.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-11 21:04:10 -07:00
David Vernet
3d2507e6f2 rusty: Add separate flag for x NUMA greedy task stealing
In scx_rusty, a CPU that is going to go idle will attempt to steal tasks
from remote domains when its domain has no tasks to run, and a remote
domain has at least greedy_threshold enqueued tasks. This stealing is
temporary, but of course has a cost in that the CPU that's stealing the
task may cause it to suffer from cache misses, or in the case of
multi-node machines, remote NUMA accesses and working sets split across
multiple domains.

Given the higher cost of x NUMA work stealing, let's add a separate flag
that lets users tune the threshold for doing cross NUMA greedy task
stealing.
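
As a rough illustration, such a tunable could be exposed as a clap option
like the one below; the flag name, default value, and surrounding struct
are hypothetical, so check scx_rusty's actual --help output for the real
interface:

  use clap::Parser;

  #[derive(Debug, Parser)]
  struct Opts {
      /// Number of queued tasks a remote domain must have before an idle
      /// CPU will greedily steal from it (within the same NUMA node).
      #[clap(long, default_value = "1")]
      greedy_threshold: u64,

      /// Same idea, but for stealing across NUMA node boundaries. A higher
      /// value makes cross-NUMA stealing rarer, reflecting its higher cost
      /// (remote accesses, working sets split across nodes).
      #[clap(long, default_value = "0")]
      greedy_threshold_x_numa: u64,
  }

  fn main() {
      let opts = Opts::parse();
      println!("{:?}", opts);
  }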

Signed-off-by: David Vernet <void@manifault.com>
2024-03-11 21:02:23 -07:00
David Vernet
1c3168d2a4
topology: Don't assume unique core IDs
The current topology.rs crate assumes that all cores in a system have
unique core IDs. That need not be the case: certain Intel Xeon processors,
for example, reuse core IDs across NUMA nodes. Let's update the crate to
assume that core IDs are unique only per socket.
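
A sketch of the idea, keying cores by (node ID, core ID) rather than by a
globally-unique core ID; the types and field names are illustrative and
not the actual topology.rs definitions:

  use std::collections::BTreeMap;

  #[derive(Debug, Default)]
  struct Core {
      cpus: Vec<usize>,
  }

  #[derive(Debug, Default)]
  struct Node {
      /// Core IDs only need to be unique within this node/socket.
      cores: BTreeMap<usize, Core>,
  }

  #[derive(Debug, Default)]
  struct Topology {
      nodes: BTreeMap<usize, Node>,
  }

  impl Topology {
      fn add_cpu(&mut self, node_id: usize, core_id: usize, cpu: usize) {
          self.nodes
              .entry(node_id)
              .or_default()
              .cores
              .entry(core_id) // the same core_id on another node is fine
              .or_default()
              .cpus
              .push(cpu);
      }
  }

  fn main() {
      let mut topo = Topology::default();
      // Two sockets that both report core_id 0, as some Xeons do.
      topo.add_cpu(0, 0, 0);
      topo.add_cpu(1, 0, 1);
      println!("{:#?}", topo);
  }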

Signed-off-by: David Vernet <void@manifault.com>
2024-03-08 15:13:46 -06:00
David Vernet
26a94b1b14
rusty: Add debug! logging to load_balance.rs
We removed the debug!() output that was previously present in main.rs. Let's
add more debug!() output that helps debug the current LB hierarchy.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-08 15:13:46 -06:00
David Vernet
0d0b101398
rusty: Add load balancing statistics to rusty
The scx_rusty load balancer no longer exports statistics such as domain
load averages, load sums, etc. Now that we're also balancing by NUMA,
we'll need a way to hierarchically illustrate load balancing statistics. This
patch adds support for that.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-08 15:13:36 -06:00
David Vernet
0871a9525d
rusty: Add direct_greedy_numa flag
Users may want to toggle whether tasks can be temporarily sent to idle CPUs on
remote NUMA nodes. By default, we want it to be disabled as a task spanning
multiple NUMA nodes will end up having its working set spanning both nodes,
which is probably not desirable. However, for workloads that really want to
encourage work conservation, let's add a flag that lets users enable it.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-08 15:12:00 -06:00
David Vernet
d0ebfb85ef
rusty: Disable direct greedy stealing between NUMA nodes
scx_rusty currently pushes tasks to idle cores if the direct greedy threshold
is exceeded, even if the core is on a remote NUMA node. This behavior is
probably not desired in most scenarios. The worst that will happen if a task is
pushed to an idle core in the same node is some L3 cache miss traffic, but
pushing it to a remote NUMA node could cause the task's working set to span
multiple nodes.

Let's disable direct greedy work stealing across NUMA nodes. A future commit
will add a flag that's disabled by default, and lets users turn this on if
they really want to encourage work conservation.
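
An illustrative sketch of the policy; the function, masks, and flag below
are hypothetical (the real CPU selection happens in scx_rusty's BPF code),
but the idea is to restrict direct-greedy candidates to the local node's
cpumask unless cross-NUMA stealing is explicitly allowed:

  fn pick_direct_greedy_cpu(idle_cpus: u64, node_mask: u64,
                            allow_x_numa: bool) -> Option<u32> {
      // By default, only consider idle CPUs within the local NUMA node.
      let candidates = if allow_x_numa {
          idle_cpus
      } else {
          idle_cpus & node_mask
      };
      if candidates == 0 {
          None
      } else {
          Some(candidates.trailing_zeros())
      }
  }

  fn main() {
      let idle = 0b1111_0000;  // idle CPUs 4-7, all on a remote node
      let local = 0b0000_1111; // the local node owns CPUs 0-3
      assert_eq!(pick_direct_greedy_cpu(idle, local, false), None);
      assert_eq!(pick_direct_greedy_cpu(idle, local, true), Some(4));
  }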

Signed-off-by: David Vernet <void@manifault.com>
2024-03-08 15:11:59 -06:00
David Vernet
12e0586fe9
cpumask: Update cpumask fmt function
The cpumask print formatter doesn't look great in its current form, which uses
the BitVec formatter under the hood:

[INFO] NUMA[00] 32:<[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]>
[INFO]         DOM[00] 32:<[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]>
[INFO]         DOM[01] 32:<[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]>

Let's just iterate over the mask and manually format the string using the
binary formatter over the slice of u64's, which renders like this:

[INFO] NUMA[00] 0b11111111111111111111111111111111
[INFO]         DOM[00] 0b00000000111111110000000011111111
[INFO]         DOM[01] 0b11111111000000001111111100000000
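
A rough sketch of that approach, with a stand-in Cpumask type in place of
the real scx_utils one:

  use std::fmt;

  struct Cpumask {
      words: Vec<u64>, // word 0 holds CPUs 0-63, word 1 CPUs 64-127, ...
      nr_cpus: usize,
  }

  impl fmt::Display for Cpumask {
      fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
          write!(f, "0b")?;
          // Print the highest word first so CPU 0 ends up as the rightmost bit.
          for (i, word) in self.words.iter().enumerate().rev() {
              let bits = std::cmp::min(64, self.nr_cpus - i * 64);
              write!(f, "{:0width$b}", word, width = bits)?;
          }
          Ok(())
      }
  }

  fn main() {
      let mask = Cpumask { words: vec![0x00ff00ff], nr_cpus: 32 };
      println!("{}", mask); // 0b00000000111111110000000011111111
  }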

Signed-off-by: David Vernet <void@manifault.com>
2024-03-08 15:11:17 -06:00
David Vernet
db152cfbe8
rusty: Implement NUMA-aware load balancing
Right now, scx_rusty has no notion of domains spanning NUMA nodes, and makes no
distinction between nodes when making load balancing or work stealing decisions.
This can cause problems on multi-NUMA machines, as load balancing and work
stealing across NUMA nodes have a significantly different cost than across L3
cache boundaries.

In order to better support multi-NUMA machines, this commit adds another layer
to the rusty load balancer, which balances across NUMA nodes using a different
cost function than the one used for balancing across domains. Load balancing
now takes place in two passes (a rough sketch follows the list):

1. In the first pass, we fix imbalances across NUMA nodes by moving tasks
   between domains across those NUMA node boundaries. We require a load
   imbalance of at least 17% in order to move load at this stage. The ratio of
   load imbalance we attempt to adjust (50%) and the maximum amount of load
   we're allowed to push out of a domain (50%) are still the same as when
   balancing between domains inside a NUMA node, but this is easy to tune with
   the current setup.

2. Once we've balanced across NUMA nodes, we iterate over all nodes and balance
   between the domains within each NUMA node. The cost function here is the
   same as what it has been thus far: we require at least a 5% imbalance in
   order to trigger load balancing.
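
Rough pseudo-Rust of the two-pass structure, purely for illustration; the
real logic lives in load_balance.rs and is considerably more involved, and
the types, thresholds, and commented-out transfer hooks below are
stand-ins:

  const NUMA_IMBAL_THRESH: f64 = 0.17; // 17% between NUMA nodes
  const DOM_IMBAL_THRESH: f64 = 0.05;  // 5% between domains within a node

  struct Domain { load: f64 }
  struct NumaNode { doms: Vec<Domain> }

  impl NumaNode {
      fn load(&self) -> f64 {
          self.doms.iter().map(|d| d.load).sum()
      }
  }

  fn balance(nodes: &mut Vec<NumaNode>) {
      let node_avg = nodes.iter().map(|n| n.load()).sum::<f64>() / nodes.len() as f64;

      // Pass 1: move load between domains across NUMA node boundaries, but
      // only when a node's imbalance exceeds the higher cross-NUMA threshold.
      for i in 0..nodes.len() {
          let imbal = nodes[i].load() - node_avg;
          if imbal.abs() / node_avg >= NUMA_IMBAL_THRESH {
              // transfer_across_nodes(nodes, i, imbal / 2.0); // hypothetical
          }
      }

      // Pass 2: balance the domains inside each node with the usual 5% rule.
      for node in nodes.iter_mut() {
          let dom_avg = node.load() / node.doms.len() as f64;
          for dom in node.doms.iter_mut() {
              let imbal = dom.load - dom_avg;
              if imbal.abs() / dom_avg >= DOM_IMBAL_THRESH {
                  // transfer_within_node(dom, imbal / 2.0); // hypothetical
              }
          }
      }
  }

  fn main() {
      let mut nodes = vec![
          NumaNode { doms: vec![Domain { load: 60.0 }, Domain { load: 40.0 }] },
          NumaNode { doms: vec![Domain { load: 20.0 }, Domain { load: 20.0 }] },
      ];
      balance(&mut nodes);
  }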

There are a few additional changes / improvements to load balancing in this
commit:

1. NUMA nodes and domains are now ordered according to their load by using
   SortedVec objects. We were previously using BTreeMap keyed by load, but this
   was suboptimal because it doesn't allow duplicate keys.

2. We're no longer exporting load balancing statistics as a vector of data such
   as load sums, averages, and imbalances. This is instead all encapsulated in
   the load balancing hierarchy we set up in lb.load_balance(). These statistics
   are not yet exported, but they will be in a subsequent commit.

One of the issues with this commit is that it does introduce some
almost-identical logic that begs to be deduplicated. For example, when
we balance between NUMA nodes, the logic for iterating over push nodes and
pushing to pull nodes is very similar to the logic of iterating over push
domains and pull domains when balancing within a node. It may be that this can
be improved.

The following are some benchmarks run on an Intel Xeon Gold 6138 (2 x 40 core
processor):

kcompile
--------

On Commit a27648c74210 ("afs: Fix setting of mtime when creating a
file/dir/symlink"):

1. make allyesconfig
2. make -j $(nproc) built-in.a
3. make -j clean
4. goto 2

Runtime
-------

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 562.688s  | 566.085s  | -0.6%    |
---------o-----------o-----------o----------o
Variance | 0.54387   | 0.72431   | -24.9%   |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 562.688s  | 563.209s  | -0.092%  |
---------o-----------o-----------o----------o
Variance | 0.54387   | 0.42038   | 29.38%   |
---------o-----------o-----------o----------o

scx_rusty with NUMA awareness clearly beats CFS, but only barely beats
scx_rusty without it. This isn't necessarily super surprising given that
this is kcompile, which has very poor front-end CPU locality. Further
experimentation with toggling the cost function for performing
migrations may improve this further.

CPU util
--------

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 7654.25%  | 7551.67%  | 1.11%    |
---------o-----------o-----------o----------o
Variance | 165.35714 | 158.3333  | 4.436%   |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 7654.25%  | 7641.57%  | 0.1659%  |
---------o-----------o-----------o----------o
Variance | 165.35714 | 1230.619  | -86.5%   |
---------o-----------o-----------o----------o

As expected, CPU util is quite a bit higher with scx_rusty than it is
with CFS. Further experiments that could be interesting are always
enabling direct-greedy stealing between domains within a NUMA node, and
then comparing rusty NUMA and rusty ORIG. rusty NUMA prevents stealing
between NUMA nodes, so this would show whether the locality introduced
by NUMA awareness appropriately offsets the loss of work conservation.

Major PFs
---------

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 5332      | 3950      | 36.566%  |
---------o-----------o-----------o----------o
Variance | 6975.5    | 5986.333  | 16.5237% |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 5332      | 5336.5    | -0.084%  |
---------o-----------o-----------o----------o
Variance | 6975.5    | 955.5     | 630.03%  |
---------o-----------o-----------o----------o

Also as expected, major page faults are far higher with scx_rusty
than with CFS. This is expected even with NUMA awareness, given that
scx_rusty is still less sticky than CFS.

A further experiment that could be interesting is tuning the threshold at
which we perform x NUMA migrations to try and keep this value even
lower. The rates of major page faults for rusty NUMA and rusty ORIG
were very close, though rusty NUMA was a bit lower.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-08 15:11:17 -06:00
David Vernet
0b1c3713b2
rusty: Remove lb_apply_weight param from lb_step()
Let's just query self.tuner.fully_utilized directly and save a few lines of
code.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-08 15:11:17 -06:00
David Vernet
758f762058
rusty: Move LoadBalancer out of rusty.rs
More cleanup of scx_rusty. Let's move the LoadBalancer out of rusty.rs and into
its own file. It will soon be extended quite a bit to support multi-NUMA and
other multivariate LB cost functions, so it's time to clean things up and split
it out.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-08 15:11:17 -06:00
David Vernet
94f75bcec6
rusty: Refactor Tuner and DomainGroup out of rusty.rs
rusty.rs is growing a bit unwieldy. We're going to want to update its load
balancing logic somewhat significantly to account for multi-NUMA and other cost
functions, so let's start cleaning the code up so that things are more
logically segmented and easier to work with.

To start, we move the Tuner and DomainGroup/Domain objects into their own
modules.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-08 15:10:37 -06:00
Jordan Rome
6c7617a037
Merge pull request #177 from jordalgo/libbpf-shallow
Remove libbpf as a submodule
2024-03-08 06:12:40 -05:00
Jordan Rome
1769dece7d Remove libbpf as a submodule
Instead, clone the libbpf repo at a specific hash during setup.
This is to fix an issue whereby submodules are not included
in the tarball and therefore won't be updated/fetched during
setup after unzipping the tarball.
2024-03-07 18:31:09 -08:00
David Vernet
1a6ff1a871
Merge pull request #175 from sched-ext/docs
[trivial] docs: Update rhone link
2024-03-07 10:00:57 -06:00
David Vernet
fdf5f5be55
docs: Update rhone link
I changed my GitHub username to Byte-Lab, so let's update the docs.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-07 09:16:27 -06:00
Jordan Rome
a77793bd10
Merge pull request #174 from jordalgo/build-libbpf-static-only
Libbpf - add BUILD_STATIC_ONLY flag
2024-03-05 19:31:26 -05:00
Jordan Rome
96fe285588 Libbpf - add BUILD_STATIC_ONLY flag 2024-03-05 15:11:51 -08:00
Jordan Rome
3eb700156a
Merge pull request #172 from jordalgo/libbpf-flags
Always build libbpf as a PIE
2024-03-05 16:29:32 -05:00
Jordan Rome
38dab12459 Always build libbpf as a PIE
This is to fix an error sometimes seen when compiling with gcc,
whereby the position-independent sched_ext Rust libraries
don't play nicely with the libbpf library if it's not also built
as position-independent code.

This also explicitly sets the rustc relocation-model to "pic",
which is the default (just so this doesn't accidentally change
out from under us).
2024-03-05 12:56:09 -08:00
Andrea Righi
a42dd32ff4
Merge pull request #173 from sched-ext/scx-rlfifo-warning
scx_rlfifo: warn user about performance
2024-03-05 19:59:12 +01:00
Andrea Righi
be5e51dfaa scx_rlfifo: print a performance warning banner
scx_rlfifo is provided as a simple example to show how to use
scx_rustland_core and it's not supposed to be used in a real production
environment.

To prevent performance-related bug reports, print an explicit warning at
startup that clarifies the goal of this scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-03-05 19:36:17 +01:00
Andrea Righi
fe19754132 scx_rlfifo: replace 1ms sleep with sched_yield()
Small improvement to make the scheduler a bit more responsive, without
introducing too much complexity or too much CPU overhead.

This can be achieved by replacing a sleep of 1ms with a sched_yield()
every time the scheduler has finished dispatching all the queued
tasks.

This also makes the code a bit smaller and easier to read.
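
A hedged sketch of the loop shape; dispatch_all() is a stand-in for the
real dispatch path, and std::thread::yield_now() stands in for the
sched_yield() call the actual code makes:

  fn dispatch_all() {
      // ... pop queued tasks and dispatch them to the BPF component ...
  }

  fn run_loop(shutdown: &std::sync::atomic::AtomicBool) {
      use std::sync::atomic::Ordering;

      while !shutdown.load(Ordering::Relaxed) {
          dispatch_all();

          // Previously: std::thread::sleep(std::time::Duration::from_millis(1));
          // Yielding keeps the scheduler responsive without burning a full
          // CPU in a tight spin.
          std::thread::yield_now();
      }
  }

  fn main() {
      // Set to true so this demo exits immediately instead of spinning.
      let shutdown = std::sync::atomic::AtomicBool::new(true);
      run_loop(&shutdown);
  }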

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-03-05 18:42:24 +01:00
Tejun Heo
db17905930
Merge pull request #170 from sched-ext/htejun
meson-scripts/build_libbpf: Accommodate meson setting CC to "ccache $COMPILER"
2024-03-04 10:09:19 -10:00
Tejun Heo
069c390ef2 meson-scripts/build_libbpf: Accommodate meson setting CC to "ccache $COMPILER"
Otherwise, we end up passing CC=ccache to libbpf's Makefile which triggers
an error as ccache invoked on its own can't act as a stand-in for the
compiler.
2024-03-04 10:04:25 -10:00
Andrea Righi
ea1a6029c5
Merge pull request #169 from sched-ext/rustland-api-improvements
scx_rustland_core: API improvements
2024-03-04 07:05:43 +01:00
Andrea Righi
5cf113f058 scx_rustland_core: provide DispatchedTask API methods
Provide distinct methods to set the target CPU and the per-task time
slice to dispatched tasks.

Moreover, provide a constructor to create a DispatchedTask from a
QueuedTask (this makes it possible to automatically bounce a task from the
scheduler to the BPF dispatcher without having to set each of the
task's attributes individually).

This also allows most of the attributes of DispatchedTask to be made
private; in particular, it allows hiding cpumask_cnt, which should only be
used internally between the BPF and the user-space components.
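
A hedged sketch of the API shape being described; the real definitions
live in scx_rustland_core and differ in detail (RL_CPU_ANY, the field
names, and the exact setters below are illustrative):

  const RL_CPU_ANY: i32 = -1; // hypothetical "any CPU" sentinel

  pub struct QueuedTask {
      pub pid: i32,
      pub cpu: i32,
      // ... other fields filled in by the BPF component ...
  }

  #[allow(dead_code)]
  pub struct DispatchedTask {
      pid: i32,
      cpu: i32,
      slice_ns: u64, // 0 means "use the global time slice"
  }

  impl DispatchedTask {
      /// Bounce a queued task back to the dispatcher without the caller
      /// having to set each attribute by hand.
      pub fn new(task: &QueuedTask) -> Self {
          Self { pid: task.pid, cpu: task.cpu, slice_ns: 0 }
      }

      pub fn set_cpu(&mut self, cpu: i32) {
          self.cpu = cpu;
      }

      pub fn set_slice_ns(&mut self, slice_ns: u64) {
          self.slice_ns = slice_ns;
      }
  }

  fn main() {
      let queued = QueuedTask { pid: 1234, cpu: 0 };
      let mut task = DispatchedTask::new(&queued);
      task.set_cpu(RL_CPU_ANY);
      task.set_slice_ns(5_000_000); // 5ms for this task only
      let _ = task;
  }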

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-03-03 15:49:37 +01:00
Andrea Righi
e10f8a2d8e scx_rustland_core: introduce per-task time slice
Provide a way to set a different time slice per task by adding a new
attribute slice_ns to the DispatchedTask struct.

This attribute determines the time slice assigned to the task; if it is
set to 0, then the global time slice (either the default one or the
effective one, if set) will be used.

At the same time, remove the payload attribute, which is basically unused
(scx_rustland uses it to send the task's vruntime to the BPF dispatcher
for debugging purposes, but it's not very useful anymore at this point).

In the future we may introduce a proper interface to attach a custom
payload to each task.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-03-03 15:06:56 +01:00
Jordan Rome
143743ce3e
Merge pull request #168 from jordalgo/libbpf-submodule-2
Add libbpf as a submodule (take 2)
2024-03-01 20:21:22 -05:00
Jordan Rome
499924ead8 Add libbpf as a submodule
This is to potentially reduce issues with folks
using different versions of libbpf at runtime.

This also:
- makes static linking of libbpf the default
- adds steps in `meson setup` to fetch libbpf and make it
2024-03-01 12:39:35 -08:00
Tejun Heo
79dac2ee70
Merge pull request #167 from sched-ext/htejun
Revert "Merge pull request #165 from sched-ext/reduce-rust-build-load"
2024-02-29 07:50:19 -10:00
Tejun Heo
c3c71781f1 Revert "Merge pull request #165 from sched-ext/reduce-rust-build-load"
This reverts commit a7b39f24e2, reversing
changes made to cf7404fb03.

The PR doesn't do what the description says. It instead limits the number of
rustc instances to 1 for each cargo build, making Rust builds extremely slow.
Let's revert and try again.
2024-02-29 07:46:37 -10:00
Tejun Heo
0d3eeef7f0
Merge pull request #166 from sched-ext/htejun
Revert "Merge pull request #163 from jordalgo/libbpf-submodule"
2024-02-29 07:43:22 -10:00
Tejun Heo
438373a8cc Revert "Merge pull request #163 from jordalgo/libbpf-submodule"
This reverts commit 5b9b953e3c, reversing
changes made to a7b39f24e2.

The current git submodule approach is a bit cumbersome and doesn't provide a
unified build environment for both libbpf and scx scheds. Also, the build
instruction doesn't seem to work. Let's revert it for now.
2024-02-29 07:39:01 -10:00
David Vernet
5b9b953e3c
Merge pull request #163 from jordalgo/libbpf-submodule
Add libbpf as a submodule
2024-02-29 09:31:40 -06:00
Jordan Rome
626e66686a Add libbpf as a submodule
This is to potentially reduce issues with folks using
different versions of libbpf at runtime.
2024-02-29 07:31:13 -08:00
David Vernet
a7b39f24e2
Merge pull request #165 from sched-ext/reduce-rust-build-load
build: limit the maximum amount of parallel cargo build
2024-02-29 09:19:42 -06:00
Andrea Righi
274eb8b4d8 build: limit the maximum amount of parallel cargo build
Each cargo build is already parallelized, spreading multiple rustc
instances across all the available CPUs by default.

Allowing multiple instances of cargo to run at the same time doesn't
provide any benefit, and it can only increase the risk of triggering OOM
conditions or overloading the build system.

Therefore, limit the number of parallel cargo build instances to 1.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-29 15:25:32 +01:00
David Vernet
cf7404fb03
Merge pull request #164 from sirlucjan/services-update2
scx: update /etc/default/scx
2024-02-28 12:30:01 -06:00
Piotr Gorski
f87fe20de2
scx: update /etc/default/scx
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-02-28 18:59:55 +01:00
David Vernet
7278d88632
Merge pull request #161 from sched-ext/scx-user
Introduce scx_rustland_core: a generic layer to implement user-space schedulers in Rust
2024-02-28 10:57:19 -06:00
Andrea Righi
0d1c6555a4 scx_rustland_core: generate source files in-tree
There is no need to generate source code in a temporary directory with
RustLandBuilder(); we can simply generate code in-tree and exclude the
generated source files from git via .gitignore.

Having the generated source files in-tree can help to debug potential
build issues (and it also allows dropping the tempfile crate
dependency).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
Andrea Righi
06d8170f9f scx_utils: introduce Builder()
Introduce a Builder() class in scx_utils that can be used by other scx
crates (such as scx_rustland_core) to prevent code duplication.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
Andrea Righi
2ac1a5924f scx_rustland_core: introduce RustLandBuilder()
Introduce a wrapper around scx_utils::BpfBuilder that can be used to build
the BPF component provided by scx_rustland_core.

The source of the BPF component (main.bpf.c) is included in the crate
as an array of bytes; the content is then unpacked into a temporary file
to perform the build.

The RustLandBuilder() helper is also used to generate bpf.rs (which
implements the low-level user-space Rust connector to the BPF
component).

Schedulers based on scx_rustland_core can simply use RustLandBuilder()
to build the backend provided by scx_rustland_core.
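
A very rough sketch of the embed-and-unpack idea for a consumer's
build.rs, with hypothetical paths and names; the real RustLandBuilder
wraps scx_utils::BpfBuilder and also emits bpf.rs, neither of which is
shown here:

  use std::fs;
  use std::path::PathBuf;

  // The BPF source ships inside the crate as raw bytes...
  static BPF_SRC: &[u8] = include_bytes!("main.bpf.c");

  fn main() -> std::io::Result<()> {
      // ...and gets unpacked somewhere writable at build time so the usual
      // BPF build machinery can compile it.
      let out_dir = PathBuf::from(std::env::var("OUT_DIR").unwrap_or_else(|_| ".".into()));
      let src_path = out_dir.join("main.bpf.c");
      fs::write(&src_path, BPF_SRC)?;

      // From here the real builder would hand src_path to scx_utils::BpfBuilder
      // to compile the BPF object and generate the skeleton glue.
      println!("unpacked BPF source to {}", src_path.display());
      Ok(())
  }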

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
Andrea Righi
e23426e299 scx_rustland_core: introduce method bpf.update_tasks()
Introduce a helper function to update the counters of queued and
scheduled tasks (used to notify the BPF component whether the user-space
scheduler still has some pending work to do).
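
A hypothetical sketch of what such a helper conveys: two counters that
tell the BPF side whether user space still has pending work. The real
code updates BPF global variables through the generated skeleton; plain
atomics stand in for them here:

  use std::sync::atomic::{AtomicU64, Ordering};

  static NR_QUEUED: AtomicU64 = AtomicU64::new(0);
  static NR_SCHEDULED: AtomicU64 = AtomicU64::new(0);

  /// Mirror the scheduler's bookkeeping into the (stand-in) shared counters
  /// so the BPF component knows whether it can let CPUs go idle.
  fn update_tasks(nr_queued: u64, nr_scheduled: u64) {
      NR_QUEUED.store(nr_queued, Ordering::Relaxed);
      NR_SCHEDULED.store(nr_scheduled, Ordering::Relaxed);
  }

  fn main() {
      update_tasks(3, 1); // 3 tasks still queued, 1 waiting to be dispatched
      println!(
          "queued={} scheduled={}",
          NR_QUEUED.load(Ordering::Relaxed),
          NR_SCHEDULED.load(Ordering::Relaxed)
      );
  }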

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
Andrea Righi
00e25530bc scx_rlfifo: simple user-space FIFO scheduler written in Rust
Implement a FIFO scheduler as an example usage of scx_rustland_core.
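
A toy sketch of the FIFO policy itself, using a plain VecDeque in place of
the scx_rustland_core queue/dispatch API (the real scheduler consumes
QueuedTasks from BPF and emits DispatchedTasks):

  use std::collections::VecDeque;

  #[derive(Debug)]
  struct Task {
      pid: i32,
  }

  fn main() {
      let mut queue: VecDeque<Task> = VecDeque::new();

      // Tasks arrive in some order...
      queue.push_back(Task { pid: 100 });
      queue.push_back(Task { pid: 200 });
      queue.push_back(Task { pid: 300 });

      // ...and a FIFO scheduler simply dispatches them in that same order,
      // with no priority, vruntime, or load-balancing logic at all.
      while let Some(task) = queue.pop_front() {
          println!("dispatching pid {}", task.pid);
      }
  }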

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
Andrea Righi
cf43129d89 scx_rustland: update documentation
scx_rustland has significantly evolved since its original design.

With the introduction of scx_rustland_core and the inclusion of the
scx_rlfifo example, scx_rustland's focus can be shifted from solely
being an "easy-to-read Rust scheduler template" to a fully functional
scheduler.

For this reason, update the README and documentation to reflect its
revised design, objectives, and intended use cases.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
Andrea Righi
871a6c10f9 scx_rustland_core: include scx_rustland backend
Move the BPF component of scx_rustland to scx_rustland_core and make it
available to other user-space schedulers.

NOTE: main.bpf.c and bpf.rs are not pre-compiled in the
scx_rustland_core crate; they need to be included in the user-space
scheduler's source code in order to be compiled/linked properly.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
Andrea Righi
416d6a940f rust: introduce scx_rustland_core crate
Introduce a separate crate (scx_rustland_core) that can be used to
implement sched-ext schedulers in Rust that run in user-space.

This commit only provides the basic layout for the new crate and the
abstraction to the custom allocator.

In general, any scheduler that has a user-space component needs to use
the custom allocator to prevent potential deadlock conditions, caused by
page faults (a kthread needs to run to resolve the page fault, but the
scheduler is blocked waiting for the user-space page fault to be
resolved => deadlock).
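
A greatly simplified sketch of why a custom allocator helps: serve all
user-space allocations from a fixed arena so the allocation path itself
never has to fault in new memory. This toy bump allocator never frees and
skips the memory locking the real setup would need; it only illustrates
the idea:

  use std::alloc::{GlobalAlloc, Layout};
  use std::sync::atomic::{AtomicUsize, Ordering};

  const ARENA_SIZE: usize = 64 * 1024 * 1024;

  static mut ARENA: [u8; ARENA_SIZE] = [0; ARENA_SIZE];

  struct ArenaAlloc {
      next: AtomicUsize,
  }

  unsafe impl GlobalAlloc for ArenaAlloc {
      unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
          let align = layout.align();
          loop {
              let cur = self.next.load(Ordering::Relaxed);
              let start = (cur + align - 1) & !(align - 1);
              let end = start + layout.size();
              if end > ARENA_SIZE {
                  return std::ptr::null_mut(); // arena exhausted
              }
              if self
                  .next
                  .compare_exchange(cur, end, Ordering::Relaxed, Ordering::Relaxed)
                  .is_ok()
              {
                  return std::ptr::addr_of_mut!(ARENA).cast::<u8>().add(start);
              }
          }
      }

      unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {
          // Intentionally a no-op in this toy version.
      }
  }

  #[global_allocator]
  static GLOBAL: ArenaAlloc = ArenaAlloc { next: AtomicUsize::new(0) };

  fn main() {
      let v: Vec<u64> = (0..8).collect(); // served from the static arena
      println!("{:?}", v);
  }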

However, we don't necessarily want to enforce this constraint on all the
existing Rust schedulers; some of them may do all user-space allocations
in safe paths, hence the separate scx_rustland_core crate.

Merging this code in scx_utils would force all the Rust schedulers to
use the custom allocator.

In a future commit the scx_rustland backend will be moved to
scx_rustland_core, making it a totally generic BPF scheduler framework
that can be used to implement user-space schedulers in Rust.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
David Vernet
4dfb898a08
Merge pull request #159 from sched-ext/load_balancer
Add new infeasible.rs crate
2024-02-26 11:15:50 -06:00