sched_ext is about to be merged upstream. There are some
compatibility-breaking changes, so we're making the current
sched_ext/for-6.11 commit 1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU
for direct dispatch on !wakeup enqueues") the baseline.
Tag everything except scx_mitosis as 1.0.0. As scx_mitosis is still in early
development and is temporarily disabled, only its patchlevel is bumped.
We need a layer of indirection between collecting stats and delivering them
to their output destinations. Currently, stats are only printed to stdout. Our
goal is to integrate with various telemetry systems such as Prometheus,
StatsD, and custom metric backends like those used by Meta and Netflix.
Importantly, adding a new backend should not require changes to the
existing stats code.
This patch introduces the `metrics` [1] crate, which provides a
framework for defining metrics and publishing them to different
backends.
The initial implementation includes the `dispatched_tasks_count`
metric, tagged with `type`. This metric increments every time a task is
dispatched, emitting the raw count instead of a percentage. A monotonic
counter is the most suitable metric type for this use case, as
percentages can be calculated at query time if needed. Existing logged
metrics continue to print percentages and remain unchanged.
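As a rough illustration (not the literal code in this patch), recording the
counter with the metrics crate looks something like the sketch below. The
exact counter! macro syntax depends on the crate version, and the "direct"
label value and the helper function are just examples:

  use metrics::counter;

  // Bump the monotonic dispatch counter, tagged with the dispatch type.
  // Percentages or rates can be derived from the raw counts at query time.
  fn record_dispatch(kind: &'static str) {
      counter!("dispatched_tasks_count", "type" => kind).increment(1);
  }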
A new flag, `--enable-prometheus` (default: false), has been added. When
enabled, it starts a Prometheus endpoint on port 9000, allowing metrics to
be charted in Prometheus or Grafana dashboards.
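Wiring the flag up might look roughly like this sketch built on the
metrics-exporter-prometheus crate; the builder calls here are assumptions
about that crate's API rather than the patch's literal code:

  use metrics_exporter_prometheus::PrometheusBuilder;

  // Install a Prometheus recorder/exporter that serves scrapes on :9000.
  fn maybe_start_prometheus(enable_prometheus: bool) -> anyhow::Result<()> {
      if enable_prometheus {
          PrometheusBuilder::new()
              .with_http_listener(([0, 0, 0, 0], 9000))
              .install()?;
      }
      Ok(())
  }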
Future changes will migrate additional stats to this framework and add
support for other backends.
[1] https://metrics.rs/
Signed-off-by: Jose Fernandez <josef@netflix.com>
Right now, scx_rusty has no notion of domains spanning NUMA nodes, and it
makes no distinction between load balancing and work stealing decisions that
cross NUMA node boundaries and those that stay within a node. This can cause
problems on multi-NUMA machines, as load balancing and work stealing across
NUMA nodes have a significantly different cost than they do across L3 cache
boundaries.
In order to better support multi-NUMA machines, this commit adds another layer
to the rusty load balancer, which balances across NUMA nodes using a different
cost function than the one used for balancing between domains. Load balancing
now takes place in two passes (a simplified sketch follows the list):
1. In the first pass, we fix imbalances across NUMA nodes by moving tasks
between domains across those NUMA node boundaries. We require a load
imbalance of at least 17% in order to move load at this stage. The ratio of
load imbalance we attempt to adjust (50%) and the maximum amount of load
we're allowed to push out of a domain (50%) are still the same as when
balancing between domains inside a NUMA node, but these are easy to tune with
the current setup.
2. Once we've balanced across NUMA nodes, we iterate over all nodes and balance
between the domains within each NUMA node. The cost function here is the
same as what it has been thus far: we require at least a 5% imbalance in
order to trigger load balancing.
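Here is a very rough, self-contained sketch of that two-pass shape. The
types and the "push half the excess" placeholders are hypothetical; only the
17% and 5% thresholds come from this commit:

  struct Domain {
      load: f64,
  }

  struct Node {
      domains: Vec<Domain>,
  }

  fn node_load(node: &Node) -> f64 {
      node.domains.iter().map(|d| d.load).sum()
  }

  fn imbalance_pct(load: f64, avg: f64) -> f64 {
      if avg == 0.0 { 0.0 } else { (load - avg) / avg * 100.0 }
  }

  fn load_balance(nodes: &mut [Node]) {
      // Pass 1: balance across NUMA nodes; only act on imbalances >= 17%.
      let node_avg =
          nodes.iter().map(node_load).sum::<f64>() / nodes.len() as f64;
      for node in nodes.iter() {
          if imbalance_pct(node_load(node), node_avg) >= 17.0 {
              // ...push up to half the excess load to less loaded nodes...
          }
      }

      // Pass 2: balance domains within each node; act on imbalances >= 5%.
      for node in nodes.iter() {
          let dom_avg = node.domains.iter().map(|d| d.load).sum::<f64>()
              / node.domains.len() as f64;
          for dom in node.domains.iter() {
              if imbalance_pct(dom.load, dom_avg) >= 5.0 {
                  // ...push up to half the excess load to sibling domains...
              }
          }
      }
  }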
There are a few additional changes / improvements to load balancing in this
commit:
1. NUMA nodes and domains are now ordered according to their load by using
SortedVec objects. We were previously using a BTreeMap keyed by load, but
this was suboptimal because a BTreeMap can't hold duplicate keys (see the
short illustration after this list).
2. We're no longer exporting load balancing statistics as a vector of data such
as load sums, averages, and imbalances. This is instead all encapsulated in
the load balancing hierarchy we set up in lb.load_balance(). These statistics
are not yet exported, but they will be in a subsequent commit.
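As a small illustration of the BTreeMap issue mentioned above (purely an
example, not code from this commit): two domains with equal load collapse
into a single BTreeMap entry, while a sorted vector keeps both. A plain Vec
stands in for the SortedVec type here for simplicity:

  use std::collections::BTreeMap;

  fn main() {
      let mut by_load: BTreeMap<u64, &str> = BTreeMap::new();
      by_load.insert(100, "dom0");
      by_load.insert(100, "dom1"); // silently overwrites dom0
      assert_eq!(by_load.len(), 1);

      // Keeping (load, domain) pairs in a sorted vector preserves both
      // entries while still allowing ordered iteration.
      let mut sorted: Vec<(u64, &str)> = vec![(100, "dom1"), (100, "dom0")];
      sorted.sort();
      assert_eq!(sorted.len(), 2);
  }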
One of the issues with this commit is that it introduces some almost-identical
logic that begs to be deduplicated. For example, when we balance between NUMA
nodes, the logic for iterating over push nodes and pushing to pull nodes is
very similar to the logic of iterating over push domains and pull domains when
balancing within a node. It may be that this can be improved.
The following are some benchmarks run on an Intel Xeon Gold 6138 (2 x 40 core
processor):
kcompile
--------
On Commit a27648c74210 ("afs: Fix setting of mtime when creating a
file/dir/symlink"):
1. make allyesconfig
2. make -j $(nproc) built-in.a
3. make -j clean
4. goto 2
Runtime
-------
         o-----------o-----------o----------o
         | scx_rusty |    CFS    |   Delta  |
---------o-----------o-----------o----------o
Mean     | 562.688s  | 566.085s  |   -.6%   |
---------o-----------o-----------o----------o
Variance | 0.54387   | 0.72431   |  -24.9%  |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 562.688s  | 563.209s  |  -.092%  |
---------o-----------o-----------o----------o
Variance | 0.54387   | 0.42038   |  29.38%  |
---------o-----------o-----------o----------o
scx_rusty with NUMA awareness clearly beats CFS, but only barely beats
scx_rusty without it. This isn't necessarily super surprising given that
this is kcompile, which has very poor front-end CPU locality. Additional
experimentation with toggling the cost function for performing migrations
may improve this further.
CPU util
--------
         o-----------o-----------o----------o
         | scx_rusty |    CFS    |   Delta  |
---------o-----------o-----------o----------o
Mean     | 7654.25%  | 7551.67%  |   1.11%  |
---------o-----------o-----------o----------o
Variance | 165.35714 | 158.3333  |  4.436%  |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 7654.25%  | 7641.57%  | 0.1659%  |
---------o-----------o-----------o----------o
Variance | 165.35714 | 1230.619  |  -86.5%  |
---------o-----------o-----------o----------o
As expected, CPU util is quite a bit higher with scx_rusty than it is
with CFS. Further experiments that could be interesting are always
enabling direct-greedy stealing between domains within a NUMA node, and
then comparing rusty NUMA and rusty ORIG. rusty NUMA prevents stealing
between NUMA nodes, so this would show whether the locality introduced
by NUMA awareness appropriately offsets the loss of work conservation.
Major PFs
---------
         o-----------o-----------o----------o
         | scx_rusty |    CFS    |   Delta  |
---------o-----------o-----------o----------o
Mean     | 5332      | 3950      | 36.566%  |
---------o-----------o-----------o----------o
Variance | 6975.5    | 5986.333  | 16.5237% |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 5332      | 5336.5    |  -.084%  |
---------o-----------o-----------o----------o
Variance | 6975.5    | 955.5     | 630.03%  |
---------o-----------o-----------o----------o
Also as expected, major page faults are far higher with scx_rusty
than with CFS. This is expected even with NUMA awareness, given that
scx_rusty is still less sticky than CFS.
Further experiments that could be interesting are tuning the threshold at
which we perform cross-NUMA migrations to try and keep this value even
lower. The rate of major page faults between rusty NUMA and rusty ORIG
was very close, though rusty NUMA's was a bit lower.
Signed-off-by: David Vernet <void@manifault.com>
scx_rusty has logic in the scheduler to inspect the host to
automatically build scheduling domains across every L3 cache. This would
be generically useful for many different types of schedulers, so let's
add it to the scx_utils crate so it can be used by others.
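A hypothetical sketch of how another scheduler might consume the shared
helper is below; the type and accessor names are assumptions for
illustration, not the crate's exact API:

  use anyhow::Result;
  use scx_utils::Topology;

  // Build one scheduling domain per L3 cache: every CPU that shares an
  // LLC lands in the same domain.
  fn print_llc_domains() -> Result<()> {
      let topo = Topology::new()?;
      for (id, llc) in topo.llcs().iter().enumerate() {
          println!("domain {}: cpus {:?}", id, llc.span());
      }
      Ok(())
  }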
Signed-off-by: David Vernet <void@manifault.com>
As described in [0], there is an open problem in load balancing called
the "infeasible weights" problem. Essentially, the problem boils down to
the fact that a task with a disproportionately high load can be granted
more CPU time than it can actually consume per its duty cycle.
This patch implements a solution to that problem, wherein we apply the
algorithm described in this paper to adjust all infeasible weights in
the system down to a feasible weight that gives them their full duty
cycle, while allowing the remaining feasible tasks on the system to
share the remaining compute capacity on the machine.
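The following is a minimal, self-contained sketch of the capping idea (a
simplified reading of the paper, not this patch's actual implementation).
A weight is infeasible when its proportional share of the machine exceeds
one full CPU; such weights are lowered to a cap lambda chosen so that each
capped task receives exactly one CPU's worth of time:

  fn cap_infeasible_weights(weights: &mut [f64], nr_cpus: f64) {
      // With at least as many CPUs as tasks, every task can own a whole
      // CPU, so no weight can be infeasible.
      if weights.len() as f64 <= nr_cpus {
          return;
      }

      // Walk weights from heaviest to lightest, peeling off any weight
      // whose share of the remaining capacity would exceed one CPU.
      let mut sorted = weights.to_vec();
      sorted.sort_by(|a, b| b.partial_cmp(a).unwrap());

      let mut w_remaining: f64 = sorted.iter().sum();
      let mut capped = 0;
      for &w in sorted.iter() {
          if w * (nr_cpus - capped as f64) <= w_remaining {
              break;
          }
          w_remaining -= w;
          capped += 1;
      }
      if capped == 0 {
          return;
      }

      // Feasible tasks share what's left; the cap grants each infeasible
      // task exactly one CPU's worth of weight.
      let lambda = w_remaining / (nr_cpus - capped as f64);
      for w in weights.iter_mut() {
          if *w > lambda {
              *w = lambda;
          }
      }
  }

  fn main() {
      // Example: 4 CPUs, one task with a hugely disproportionate weight.
      let mut weights = vec![10000.0, 100.0, 100.0, 100.0, 100.0];
      cap_infeasible_weights(&mut weights, 4.0);
      // The heavy weight is capped to 400/3 ~= 133.3, which grants that
      // task exactly one of the four CPUs; the rest share the other three.
      println!("{:?}", weights);
  }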
[0]: https://drive.google.com/file/d/1fAoWUlmW-HTp6akuATVpMxpUpvWcGSAv/view?usp=drive_link
Signed-off-by: David Vernet <void@manifault.com>
After updates to reflect the updated init and direct dispatch API, the
schedulers aren't compatible with older kernels. Bump versions and publish
releases.
- combine c and kernel-examples as it's confusing to have both
- rename 'rust-user' and 'c-user' to just 'rust' and 'c', which is simpler
- update and fix sync-to-kernel.sh