mirror of
https://github.com/sched-ext/scx.git
synced 2024-11-25 04:00:24 +00:00
db152cfbe8
Right now, scx_rusty has no notion of domains spanning NUMA nodes, and does
not distinguish between them when making load balancing or work stealing
decisions. This can cause problems on multi-NUMA machines, as load balancing
and work stealing across NUMA nodes has a significantly different cost than
doing so across L3 cache boundaries.

In order to better support multi-NUMA machines, this commit adds another
layer to the rusty load balancer, which balances across NUMA nodes using a
different cost function from balancing across domains. Load balancing now
takes place over the span of two passes:

1. In the first pass, we fix imbalances across NUMA nodes by moving tasks
   between domains across those NUMA node boundaries. We require a load
   imbalance of at least 17% in order to move load at this stage. The ratio
   of load imbalance we attempt to adjust (50%) and the maximum amount of
   load we're allowed to push out of a domain (50%) are still the same as
   when balancing between domains inside a NUMA node, but this is easy to
   tune with the current setup.

2. Once we've balanced across NUMA nodes, we iterate over all nodes and
   balance between the domains within each NUMA node. The cost function here
   is the same as what it has been thus far: we require at least a 5%
   imbalance in order to trigger load balancing.

There are a few additional changes / improvements to load balancing in this
commit:

1. NUMA nodes and domains are now ordered according to their load by using
   SortedVec objects. We were previously using a BTreeMap keyed by load, but
   this was suboptimal because it doesn't allow duplicate entries.

2. We're no longer exporting load balancing statistics as a vector of data
   such as load sums, averages, and imbalances. This is instead all
   encapsulated in the load balancing hierarchy we set up in
   lb.load_balance(). These statistics are not yet exported, but they will
   be in a subsequent commit.
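The two passes described above can be sketched roughly as follows. This is a
minimal illustration, not the code from this commit: the Domain and Node
types, the out_of_balance() and plan() helpers, and the definition of
"imbalance" as relative distance from the mean load are all assumptions made
for the example. Only the 17% and 5% thresholds come from the description
above. The sketch only selects which nodes and domains exceed their
threshold; it does not move any load. A plain sorted Vec stands in for
SortedVec, and like SortedVec it keeps entries that have equal load.

```rust
// Thresholds taken from the commit description.
const NUMA_IMBAL_THRESHOLD: f64 = 0.17; // pass 1: across NUMA nodes
const DOM_IMBAL_THRESHOLD: f64 = 0.05; // pass 2: between domains in a node

/// Hypothetical domain: an id and its current load.
#[derive(Debug, Clone)]
struct Domain {
    id: usize,
    load: f64,
}

/// Hypothetical NUMA node: an id and the domains it contains.
#[derive(Debug, Clone)]
struct Node {
    id: usize,
    domains: Vec<Domain>,
}

impl Node {
    /// A node's load is the sum of its domains' loads.
    fn load(&self) -> f64 {
        self.domains.iter().map(|d| d.load).sum()
    }
}

/// Assumed imbalance metric: relative distance of `load` from `avg`.
fn imbalance_ratio(load: f64, avg: f64) -> f64 {
    if avg == 0.0 {
        0.0
    } else {
        (load - avg).abs() / avg
    }
}

/// Returns the ids whose load deviates from the mean by at least
/// `threshold`, ordered from least to most loaded.
fn out_of_balance(loads: &[(usize, f64)], threshold: f64) -> Vec<usize> {
    if loads.is_empty() {
        return Vec::new();
    }
    let avg = loads.iter().map(|(_, l)| l).sum::<f64>() / loads.len() as f64;
    let mut sorted = loads.to_vec();
    // Sorting a Vec keeps entries with equal load, unlike a map keyed by load.
    sorted.sort_by(|a, b| a.1.total_cmp(&b.1));
    sorted
        .into_iter()
        .filter(|(_, l)| imbalance_ratio(*l, avg) >= threshold)
        .map(|(id, _)| id)
        .collect()
}

/// Pass 1 result: nodes that qualify for cross-node balancing.
/// Pass 2 result: for each node, the domains that qualify for balancing.
fn plan(nodes: &[Node]) -> (Vec<usize>, Vec<(usize, Vec<usize>)>) {
    let node_loads: Vec<(usize, f64)> =
        nodes.iter().map(|n| (n.id, n.load())).collect();
    let pass1 = out_of_balance(&node_loads, NUMA_IMBAL_THRESHOLD);

    let pass2 = nodes
        .iter()
        .map(|n| {
            let dom_loads: Vec<(usize, f64)> =
                n.domains.iter().map(|d| (d.id, d.load)).collect();
            (n.id, out_of_balance(&dom_loads, DOM_IMBAL_THRESHOLD))
        })
        .collect();

    (pass1, pass2)
}
```

In a real balancer the second pass would run on the loads left behind by the
first pass; here both passes read the same snapshot to keep the example
short.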

One of the issues with this commit is that it introduces some near-identical
logic that begs to be deduplicated. For example, when we balance between
NUMA nodes, the logic for iterating over push nodes and pushing to pull
nodes is very similar to the logic of iterating over push domains and pull
domains when balancing within a node. It may be that this can be improved.

The following are some benchmarks run on an Intel Xeon Gold 6138 (2 x 40
core processor):

kcompile
--------

On Commit a27648c74210 ("afs: Fix setting of mtime when creating a
file/dir/symlink"):

1. make allyesconfig
2. make -j $(nproc) built-in.a
3. make -j clean
4. goto 2

Runtime
-------

         o-----------o-----------o----------o
         | scx_rusty |    CFS    |  Delta   |
---------o-----------o-----------o----------o
Mean     | 562.688s  | 566.085s  | -.6%     |
---------o-----------o-----------o----------o
Variance | 0.54387   | 0.72431   | -24.9%   |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|  Delta   |
---------o-----------o-----------o----------o
Mean     | 562.688s  | 563.209s  | -.092%   |
---------o-----------o-----------o----------o
Variance | 0.54387   | 0.42038   | 29.38%   |
---------o-----------o-----------o----------o

scx_rusty with NUMA awareness clearly beats CFS, but only barely beats
scx_rusty without it. This isn't necessarily super surprising given that
this is kcompile, which has very poor front-end CPU locality. Further
experimentation with toggling the cost function for performing migrations
may improve this further.
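
The commit does not state how the Delta column was computed. Assuming it is
the relative change of the left-hand column against the right-hand column,
the small helper below (delta_pct is a name made up for this example)
reproduces the runtime figures above to within rounding.

```rust
/// Percentage change of `new` relative to `baseline`.
/// Assumed formula; the commit message does not spell it out.
fn delta_pct(new: f64, baseline: f64) -> f64 {
    (new - baseline) / baseline * 100.0
}
```

For example, delta_pct(562.688, 566.085) gives the mean runtime delta of
scx_rusty against CFS, and delta_pct(0.54387, 0.42038) gives the variance
delta of rusty NUMA against rusty ORIG.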

CPU util
--------

         o-----------o-----------o----------o
         | scx_rusty |    CFS    |  Delta   |
---------o-----------o-----------o----------o
Mean     | 7654.25%  | 7551.67%  | 1.11%    |
---------o-----------o-----------o----------o
Variance | 165.35714 | 158.3333  | 4.436%   |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|  Delta   |
---------o-----------o-----------o----------o
Mean     | 7654.25%  | 7641.57%  | 0.1659%  |
---------o-----------o-----------o----------o
Variance | 165.35714 | 1230.619  | -86.5%   |
---------o-----------o-----------o----------o

As expected, CPU util is quite a bit higher with scx_rusty than it is with
CFS. Further experiments that could be interesting are always enabling
direct-greedy stealing between domains within a NUMA node, and then
comparing rusty NUMA and rusty ORIG. rusty NUMA prevents stealing between
NUMA nodes, so this would show whether the locality introduced by NUMA
awareness appropriately offsets the loss of work conservation.

Major PFs
---------

         o-----------o-----------o----------o
         | scx_rusty |    CFS    |  Delta   |
---------o-----------o-----------o----------o
Mean     | 5332      | 3950      | 36.566%  |
---------o-----------o-----------o----------o
Variance | 6975.5    | 5986.333  | 16.5237% |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|  Delta   |
---------o-----------o-----------o----------o
Mean     | 5332      | 5336.5    | -.084%   |
---------o-----------o-----------o----------o
Variance | 6975.5    | 955.5     | 630.03%  |
---------o-----------o-----------o----------o

Also as expected, major page faults are far higher with scx_rusty than with
CFS. This is expected even with NUMA awareness, given that scx_rusty is
still less sticky than CFS. Further experiments that could be interesting
are tuning the threshold at which we perform cross-NUMA migrations to try
and keep this value even lower.
The rates of major page faults for rusty NUMA and rusty ORIG were very
close, though rusty NUMA was a bit lower.

Signed-off-by: David Vernet <void@manifault.com>
28 lines
928 B
TOML
[package]
name = "scx_rusty"
version = "0.5.4"
authors = ["Dan Schatzberg <dschatzberg@meta.com>", "Meta"]
edition = "2021"
description = "A multi-domain, BPF / user space hybrid scheduler used within sched_ext, which is a Linux kernel feature which enables implementing kernel thread schedulers in BPF and dynamically loading them. https://github.com/sched-ext/scx/tree/main"
license = "GPL-2.0-only"

[dependencies]
anyhow = "1.0.65"
clap = { version = "4.1", features = ["derive", "env", "unicode", "wrap_help"] }
ctrlc = { version = "3.1", features = ["termination"] }
fb_procfs = "0.7.0"
libbpf-rs = "0.22.0"
libc = "0.2.137"
log = "0.4.17"
ordered-float = "3.4.0"
scx_utils = { path = "../../../rust/scx_utils", version = "0.6" }
simplelog = "0.12.0"
sorted-vec = "0.8.3"
static_assertions = "1.1.0"

[build-dependencies]
scx_utils = { path = "../../../rust/scx_utils", version = "0.6" }

[features]
enable_backtrace = []