scx/scheds/rust/scx_rusty/Cargo.toml
David Vernet db152cfbe8
rusty: Implement NUMA-aware load balancing
Right now, scx_rusty has no notion of domains spanning NUMA nodes, and makes no
distinction when making load balancing decisions, or work stealing. This can
cause problems on multi-NUMA machines, as load balancing and work stealing
across NUMA nodes has significantly different cost from across L3 cache
boundaries.

In order to better support multi-NUMA machines, this commit adds another layer
to the rusty load balancer, which balances across NUMA nodes using a different
cost function from balancing across domains. Load balancing now takes place
over the span of two passes:

1. In the first pass, we fix imbalances across NUMA nodes by moving tasks
   between domains across those NUMA node boundaries. We require a load
   imbalance of at least 17% in order to move load at this stage. The ratio of
   load imbalance we attempt to adjust (50%) and the maximum amount of load
   we're allowed to push out of a domain (50%) is still the same as when
   balancing between domains inside a NUMA node, but this is easy to tune with
   the current setup.

2. Once we've balanced across NUMA nodes, we iterate over all nodes and balance
   between the domains within each NUMA node. The cost function here is the
   same as what it has been thus far: we require at least a 5% imbalance in
   order to trigger load balancing.

There are a few additional changes / improvements to load balancing in this
commit:

1. NUMA nodes and domains are now ordered according to their load by using
   SortedVec objects. We were previously using BTreeMap keyed by load, but this
   was suboptimal due to the fact that it doesn't allow duplicate entries.

2. We're no longer exporting load balancing statistics as a vector of data such
   as load sums, averages, and imbalances. This is instead all encapsulated in
   the load balancing hierarchy we setup in lb.load_balance(). These statistics
   are not yet exported, but they will be in a subsequent commit.

One of the issues with this commit is that it does introduce some
almost-identical logic that somehow begs to be deduplicated. For example, when
we balance between NUMA nodes, the logic for iterating over push nodes and
pushing to pull nodes is very similar to the logic of iterating over push
domains and pull domains when balancing within a node. It may be that this can
be improved.

The following are some benchmarks run on an Intel Xeon Gold 6138 (2 x 40 core
processor):

kcompile
--------

On Commit a27648c74210 ("afs: Fix setting of mtime when creating a
file/dir/symlink"):

1. make allyesconfig
2. make -j $(nproc) built-in.a
3. make -j clean
4. goto 2

Runtime
-------

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 562.688s  | 566.085s  | -.6%     |
---------o-----------o-----------o----------o
Variance | 0.54387   | 0.72431   | -24.9%   |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 562.688s  | 563.209s  | -.092%   |
---------o-----------o-----------o----------o
Variance | 0.54387   | 0.42038   | 29.38%   |
---------o-----------o-----------o----------o

scx_rusty with NUMA awareness clearly beats CFS, but only barely beats
scx_rusty without it. This isn't necessarily super surprising given that
this is kcompile, which has very poor front-end CPU locality. Further
experimentation with toggling the cost function for performing
migrations may improve this further.

CPU util
--------

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 7654.25%  | 7551.67%  | 1.11%    |
---------o-----------o-----------o----------o
Variance | 165.35714 | 158.3333  | 4.436%   |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 7654.25%  | 7641.57%  | 0.1659%  |
---------o-----------o-----------o----------o
Variance | 165.35714 | 1230.619  | -86.5%   |
---------o-----------o-----------o----------o

As expected, CPU util is quite a bit higher with scx_rusty than it is
with CFS. Further experiments that could be interesting are always
enabling direct-greedy stealing between domains within a NUMA node, and
then comparing rusty NUMA and rusty ORIG. rusty NUMA prevents stealing
between NUMA nodes, so this would show whether the locality introduced
by NUMA awareness appropriately offsets the loss of work conservation.

Major PFs
---------

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 5332      | 3950      | 36.566%  |
---------o-----------o-----------o----------o
Variance | 6975.5    | 5986.333  | 16.5237% |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 5332      | 5336.5    | -.084%   |
---------o-----------o-----------o----------o
Variance | 6975.5    | 955.5     | 630.03%  |
---------o-----------o-----------o----------o

Also as expected, major page faults are far highe higher with scx_rusty
than with CFS. This is expected even with NUMA awareness, given that
scx_rusty is still less sticky than CFS.

Further experiments that could be interesting are tuning the threshold
for which we perform x NUMA migrations to try and keep this value even
lower. The rate of major page faults between rusty NUMA and rusty ORIG
were very close, though rusty NUMA was a bit lower.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-08 15:11:17 -06:00

28 lines
928 B
TOML

[package]
name = "scx_rusty"
version = "0.5.4"
authors = ["Dan Schatzberg <dschatzberg@meta.com>", "Meta"]
edition = "2021"
description = "A multi-domain, BPF / user space hybrid scheduler used within sched_ext, which is a Linux kernel feature which enables implementing kernel thread schedulers in BPF and dynamically loading them. https://github.com/sched-ext/scx/tree/main"
license = "GPL-2.0-only"
[dependencies]
anyhow = "1.0.65"
clap = { version = "4.1", features = ["derive", "env", "unicode", "wrap_help"] }
ctrlc = { version = "3.1", features = ["termination"] }
fb_procfs = "0.7.0"
libbpf-rs = "0.22.0"
libc = "0.2.137"
log = "0.4.17"
ordered-float = "3.4.0"
scx_utils = { path = "../../../rust/scx_utils", version = "0.6" }
simplelog = "0.12.0"
sorted-vec = "0.8.3"
static_assertions = "1.1.0"
[build-dependencies]
scx_utils = { path = "../../../rust/scx_utils", version = "0.6" }
[features]
enable_backtrace = []