OVERVIEW.md

# Overview

[sched_ext](https://github.com/sched-ext/scx) is a Linux kernel feature which
enables implementing and dynamically loading safe kernel thread schedulers in
BPF.

The benefits of such a framework are multifaceted, with three main axes where
sched_ext is specifically designed to provide significant value:

1. Ease of experimentation and exploration: Enabling rapid iteration of new
   scheduling policies.

2. Customization: Building application-specific schedulers which implement
   policies that are not applicable to general-purpose schedulers.

3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling
   policies in production environments.

We'll begin by doing a deeper dive into the motivation of sched_ext in the
following [Motivation](#motivation) section. Following that, we'll provide some
details on the overall architecture of sched_ext in the [How](#how) section
below.

# Motivation<a name="motivation"></a>

## 1. Ease of experimentation and exploration

### Why is exploration important?

Scheduling is a challenging problem space. Small changes in scheduling behavior
can have a significant impact on various components of a system, with the
corresponding effects varying widely across different platforms, architectures,
and workloads.

While complexities have always existed in scheduling, they have increased
dramatically over the past 10-15 years. In the mid-late 2000s, cores were
typically homogeneous and further apart from each other, with the criteria for
scheduling being roughly the same across the entire die.

Systems in the modern age are by comparison much more complex. Modern CPU
designs, where the total power budget of all CPU cores often far exceeds the
power budget of the socket, with dynamic frequency scaling, and with or without
chiplets, have significantly expanded the scheduling problem space. Cache
hierarchies have become less uniform, with Core Complex (CCX) designs such as
recent AMD processors having multiple shared L3 caches within a single socket.
Such topologies resemble NUMA sans persistent NUMA node stickiness.

Use-cases have become increasingly complex and diverse as well. Applications
such as mobile and VR have strict latency requirements to avoid missing
deadlines that impact user experience, and datacenter workloads are constantly
pushing the demands on the scheduler in terms of workload isolation and
resource distribution.

Experimentation and exploration are important for any non-trivial problem
space. However, given the recent hardware and software developments, we believe
that experimentation and exploration are not just important, but _critical_ in
the scheduling problem space.

Indeed, other approaches in industry are already being explored. AMD has
proposed an experimental [patch
set](https://lore.kernel.org/lkml/20220910105326.1797-1-kprateek.nayak@amd.com/)
which enables userspace to provide hints to the scheduler via "Userspace
Hinting". The approach adds a prctl() API which allows callers to set a
numerical "hint" value on a struct task_struct. This hint is then optionally
read by the scheduler to adjust the cost calculus for various scheduling
decisions.

Huawei have also [expressed
interest](https://lore.kernel.org/bpf/dedc7b72-9da4-91d0-d81d-75360c177188@huawei.com/)
in enabling some form of programmable scheduling. While we're unaware of any
patch sets which have been sent to the upstream list for this proposal, it
similarly illustrates the need for more flexibility in the scheduler.

Additionally, Google has developed
[ghOSt](https://dl.acm.org/doi/pdf/10.1145/3477132.3483542) with the goal of
enabling custom, userspace driven scheduling policies. Prior
[presentations](https://lpc.events/event/16/contributions/1365/) at LPC have
discussed ghOSt and how BPF can be used to accelerate scheduling.

### Why can't we just explore directly with CFS?

Experimenting with CFS directly or implementing a new sched_class from scratch
is of course possible, but is often difficult and time consuming. Newcomers to
the scheduler often require years to understand the codebase and become
productive contributors. Even for seasoned kernel engineers, experimenting with
and upstreaming features can take a very long time. The iteration process
itself is also time consuming, as testing scheduler changes on real hardware
requires reinstalling the kernel and rebooting the host.

One such experiment implemented separate runqueues per SMT sibling. This caused
issues, for example ensuring proper fairness between the independent runqueues
of SMT siblings.

The high barrier to entry for working on the scheduler is an impediment to
academia as well. Master's/PhD candidates who are interested in improving the
scheduler will spend years ramping-up, only to complete their degrees just as
they're finally ready to make significant changes. A lower entrance barrier
would allow researchers to more quickly ramp up, test out hypotheses, and
iterate on novel ideas. Research methodology is also severely hampered by the
high barrier of entry to make modifications; for example, the
[Shenango](https://www.usenix.org/system/files/nsdi19-ousterhout.pdf) and
Shinjuku scheduling policies used sched affinity to replicate the desired
policy semantics, due to the difficulty of incorporating these policies into
the kernel directly.

The iterative process itself also imposes a significant cost to working on the
scheduler. Testing changes requires developers to recompile and reinstall the
kernel, reboot their machines, rewarm their workloads, and then finally rerun
their experiments. Rewarming workload caches across many instances in the Meta
production environment takes hours, for example.

### How does sched_ext help with exploration?

sched_ext attempts to address all of the problems described above. In this
section, we'll describe the benefits to experimentation and exploration that
are afforded by sched_ext, provide real-world examples of those benefits, and
discuss some of the trade-offs and considerations in our design choices.

A buggy scheduler cannot crash the host or indefinitely starve tasks. BPF also
enables sched_ext to significantly improve iteration speed for running
experiments. Loading and unloading a BPF scheduler is simply a matter of
running and terminating a sched_ext binary.

BPF also provides programs with a rich set of APIs, such as maps, kfuncs, and
BPF helpers. In addition to providing useful building blocks to programs that
run entirely in kernel space (such as many of our example schedulers), these
APIs also allow programs to leverage user space in making scheduling decisions.
Specifically, the Atropos sample scheduler has a relatively simple weighted
vtime or FIFO scheduling layer in BPF, paired with a load balancing component
in userspace written in Rust. As described in more detail below, we also built
a more general user-space scheduling framework called "rhone" by leveraging
various BPF features.

On the other hand, BPF does have shortcomings, as can be plainly seen from the
complexity in some of the example schedulers. scx_pair.bpf.c illustrates this
point well. To start, it requires a good amount of code to emulate
cgroup-local-storage. In the kernel proper, this would simply be a matter of
adding another pointer to the struct cgroup, but in BPF, it requires a complex
juggling of data amongst multiple different maps, a good amount of boilerplate
code, and some unwieldy `bpf_loop()`'s and atomics. The code is also littered
with explicit and often unnecessary sanity checks to appease the verifier.

That being said, BPF is being rapidly improved. For example, Yonghong Song
recently upstreamed a
[patch set](https://lore.kernel.org/bpf/20221026042835.672317-1-yhs@fb.com/) to
add a cgroup local storage map type, allowing scx_pair.bpf.c to be simplified.
There are plans to address other issues as well, such as providing
statically-verified locking, and avoiding the need for unnecessary sanity
checks. Addressing these shortcomings is a high priority for BPF, and as
progress continues to be made, we expect most deficiencies to be addressed in
the not-too-distant future.

Yet another exploration advantage of sched_ext is that it helps widen the scope
of experiments. For example, sched_ext makes it easy to defer CPU assignment
until a task starts executing, allowing schedulers to share scheduling queues
at any granularity (hyper-twin, CCX and so on). Additionally, higher level
frameworks can be built on top to further widen the scope. For example, the
aforementioned [rhone](https://github.com/Decave/rhone) library allows
implementing scheduling policies in user-space by encapsulating the complexity
around communicating scheduling decisions with the kernel. This allows taking
advantage of a richer programming environment in user-space, enabling
experimenting with, for instance, more complex mathematical models.

sched_ext also allows developers to leverage machine learning. At Meta, we
experimented with using machine learning to predict whether a running task
would soon yield its CPU. These predictions can be used to aid the scheduler in
deciding whether to keep a runnable task on its current CPU rather than
migrating it to an idle CPU, with the hope of avoiding unnecessary cache
misses. Using a tiny neural net model with only one hidden layer of size 16,
and a decaying count of 64 syscalls as a feature, we were able to achieve a 15%
throughput improvement on an Nginx benchmark, with an 87% inference accuracy.

## 2. Customization

sched_ext enables building application-specific schedulers, which can take
scheduling hints directly from the application (for example, a service that
knows the different deadlines of incoming RPCs).

Google has also experimented with some promising, novel scheduling policies.
One example is "central" scheduling, wherein a single CPU makes all scheduling
decisions for the entire system. This allows most cores on the system to be
fully dedicated to running workloads, and can have significant performance
improvements for certain use cases. For example, central scheduling with VCPUs
can avoid expensive vmexits and cache flushes, by instead delegating the
responsibility of preemption checks from the tick to a single CPU. See
scx_central.bpf.c for a simple example of a central scheduling policy built in
sched_ext.

Some workloads also have non-generalizable constraints which enable
optimizations in a scheduling policy which would otherwise not be feasible.
For example, VM workloads at Google typically have a low overcommit ratio
compared to the number of physical CPUs. This allows the scheduler to support
bounded tail latencies, as well as longer blocks of uninterrupted time.

Yet another interesting use case is the scx_flatcg scheduler, which provides a
flattened hierarchical vtree for cgroups. This scheduler does not account for
thundering herd problems among cgroups, and therefore may not be suitable for
inclusion in CFS. However, in a simple benchmark using
[wrk](https://github.com/wg/wrk) on apache serving a CGI script calculating
sha1sum of a small file, it outperformed CFS by ~3% with CPU controller
disabled and by ~10% with two apache instances competing with 2:1 weight ratio
nested four level deep.

Certain industries require specific scheduling behaviors that do not apply
broadly. For example, ARINC 653 defines scheduling behavior that is widely used
by avionic software, and some out-of-tree implementations
(https://ieeexplore.ieee.org/document/7005306) have been built. While the
upstream community may decide to merge one such implementation in the future,
it would also be entirely reasonable to not do so given the narrowness of the
use case. Such niche policies are well served by sched_ext.

There are also classes of policy exploration, such as machine learning, or
responding in real-time to application hints, that are significantly harder
(and not necessarily appropriate) to integrate within the kernel itself.

### Won't this increase fragmentation?

We acknowledge that to some degree, sched_ext does run the risk of increasing
the fragmentation of scheduler implementations. As a result of exploration,
however, we believe that enabling the larger ecosystem to innovate will
ultimately accelerate the overall development and performance of Linux.

BPF programs are required to be GPLv2, which is enforced by the verifier on
program loads. With regards to API stability, just as with other semi-internal
interfaces such as BPF kfuncs, we won't be providing any API stability
guarantees to BPF schedulers. While we intend to make an effort to provide
compatibility when possible, we will not provide any explicit, strong
guarantees as the kernel typically does with e.g. UAPI headers. For users who
want stability, some schedulers, such as the example schedulers and the
scx_rusty scheduler, will be upstreamed as part of the kernel tree. Distros
will be able to package and release these schedulers with the kernel, allowing
users to utilize these schedulers out-of-the-box without requiring any
additional work or dependencies such as clang or building the scheduler
programs themselves. Other schedulers and scheduling frameworks such as rhone
may be open-sourced through separate per-project repos.

## 3. Rapid scheduler deployments

Rolling out kernel upgrades is a slow and iterative process. At a large scale
it can take months to roll a new kernel out to a fleet of servers. While this
latency is expected and inevitable for normal kernel upgrades, it can become
highly problematic when kernel changes are required to fix bugs.
[Livepatch](https://www.kernel.org/doc/html/latest/livepatch/livepatch.html) is
available to quickly roll out critical security fixes to large fleets, but the
scope of changes that can be applied with livepatching is fairly limited, and
would likely not be usable for patching scheduling policies. With sched_ext,
new scheduling policies can be rapidly rolled out to production environments.

As an example, one of the variants of the [L1 Terminal Fault
(L1TF)](https://www.intel.com/content/www/us/en/architecture-and-technology/l1tf.html)
vulnerability allows a VCPU running a VM to read arbitrary host kernel memory
for pages in L1 data cache. The solution was to implement core scheduling,
which ensures that tasks running as hypertwins have the same "cookie".

While core scheduling works well, it took a long time to finalize and land
upstream. This long rollout period was painful, and required organizations to
make difficult interim trade-offs, such as disabling SMT at a significant
performance cost.

Once core scheduling was upstream, organizations had to upgrade the kernels on
their entire fleets. As downtime is not an option for many, these upgrades had
to be gradually rolled out, which can take a very long time for large fleets.

An example of a sched_ext scheduler that illustrates core scheduling semantics
is scx_pair.bpf.c, which co-schedules pairs of tasks from the same cgroup, and
is resilient to L1TF vulnerabilities. While this example scheduler is certainly
not suitable for production in its current form, a similar scheduler that is
more performant and featureful could be written and deployed if necessary.

Rapid scheduling deployments can similarly be useful to quickly roll out new
scheduling features without requiring kernel upgrades. At Google, for example,
it was observed that some low-priority workloads were causing degraded
performance for higher-priority workloads due to consuming a disproportionate
share of memory bandwidth. While a temporary mitigation was to use sched
affinity to limit the footprint of this low-priority workload to a small subset
of CPUs, a preferable solution would be to implement a more featureful
task-priority mechanism which automatically throttles lower-priority tasks
which are causing memory contention for the rest of the system. Implementing
this in CFS and rolling it out to the fleet could take a very long time.

sched_ext would directly address these gaps. If another hardware bug or
resource contention issue comes in that requires scheduler support to mitigate,
sched_ext can be used to experiment with and test different policies. Once a
scheduler is available, it can quickly be rolled out to as many hosts as
necessary, and function as a stop-gap solution until a longer-term mitigation
is upstreamed.

# How

sched_ext is a new sched_class which allows scheduling policies to be
implemented in BPF programs.

sched_ext leverages BPF's struct_ops feature to define a structure which
exports function callbacks and flags to BPF programs that wish to implement
scheduling policies. The struct_ops structure exported by sched_ext is struct
sched_ext_ops, and is conceptually similar to struct sched_class. The role of
sched_ext is to map the complex sched_class callbacks to the more simple and
ergonomic struct sched_ext_ops callbacks.

Unlike some other BPF program types which have ABI requirements due to
exporting UAPIs, struct_ops has no ABI requirements whatsoever. This provides
us with the flexibility to change the APIs provided to schedulers as necessary.
BPF struct_ops is also already being used successfully in other subsystems,
such as in support of TCP congestion control.

The only struct_ops field that is required to be specified by a scheduler is
the 'name' field. Otherwise, sched_ext will provide sane default behavior, such
as automatically choosing an idle CPU on the task wakeup path if
`.select_cpu()` is missing.

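To give a feel for the shape of a scheduler, below is a minimal sketch loosely
modeled on the scx_simple example scheduler. It assumes the common header used
by the example schedulers in this tree (scx_common.bpf.h), and macro and kfunc
signatures such as `BPF_STRUCT_OPS` and `scx_bpf_dispatch()` may differ
slightly between sched_ext versions:

```c
/* A hypothetical, minimal sched_ext scheduler sketch. Every task is
 * sent to the global FIFO at enqueue time; all other callbacks are
 * left unset so sched_ext falls back to its default behavior.
 */
#include "scx_common.bpf.h"

char _license[] SEC("license") = "GPL";

void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Dispatch to the built-in global DSQ with the default slice. */
	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

SEC(".struct_ops")
struct sched_ext_ops minimal_ops = {
	.enqueue	= (void *)minimal_enqueue,
	/* .name is the only mandatory field. */
	.name		= "minimal",
};
```
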
### Dispatch queues

To bridge the workflow imbalance between the scheduler core and sched_ext_ops
callbacks, sched_ext uses simple FIFOs called dispatch queues (DSQ's). By
default, there is one global DSQ (`SCX_DSQ_GLOBAL`), and one local per-CPU DSQ
(`SCX_DSQ_LOCAL`). `SCX_DSQ_GLOBAL` is provided for convenience and need not be
used by a scheduler that doesn't require it. As described in more detail below,
`SCX_DSQ_LOCAL` is the per-CPU FIFO that sched_ext pulls from when putting the
next task on the CPU. The BPF scheduler can manage an arbitrary number of DSQ's
using `scx_bpf_create_dsq()` and `scx_bpf_destroy_dsq()`.

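As a hedged sketch of the DSQ lifecycle, a scheduler that wants its own shared
queue could create it when the scheduler is loaded and destroy it on exit. The
DSQ id (`SHARED_DSQ` here) is an arbitrary scheduler-chosen value, and the
callback signatures follow the example schedulers at the time of writing:

```c
#define SHARED_DSQ 0	/* arbitrary, scheduler-chosen DSQ id */

s32 BPF_STRUCT_OPS(minimal_init)
{
	/* Create the custom DSQ; -1 means no NUMA node preference. */
	return scx_bpf_create_dsq(SHARED_DSQ, -1);
}

void BPF_STRUCT_OPS(minimal_exit, struct scx_exit_info *ei)
{
	/* Built-in DSQ's are managed by sched_ext; only user-created
	 * DSQ's need to be destroyed. */
	scx_bpf_destroy_dsq(SHARED_DSQ);
}
```
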
### Scheduling cycle

The following briefly shows a typical workflow for how a waking task is
scheduled and executed.

1. When a task is waking up, `.select_cpu()` is the first operation invoked.
   This serves two purposes. It allows a scheduler to optimize task placement
   by specifying a CPU where it expects the task to eventually be scheduled,
   and it ensures that the selected CPU will be woken if it's idle.

2. Once the target CPU is selected, `.enqueue()` is invoked. It can make one of
   the following decisions:

   - Immediately dispatch the task to either the global DSQ (`SCX_DSQ_GLOBAL`)
     or the current CPU's local DSQ (`SCX_DSQ_LOCAL`).

   - Immediately dispatch the task to a user-created dispatch queue.

   - Queue the task on the BPF side, e.g. in an rbtree map for a vruntime
     scheduler, with the intention of dispatching it at a later time from
     `.dispatch()`.

3. When a CPU is ready to schedule, it first looks at its local DSQ. If empty,
   it invokes `.consume()` which should make one or more `scx_bpf_consume()`
   calls to consume tasks from DSQ's. If a `scx_bpf_consume()` call succeeds,
   the CPU has the next task to run and `.consume()` can return. If
   `.consume()` is not defined, sched_ext will by-default consume from only the
   built-in `SCX_DSQ_GLOBAL` DSQ.

4. If there's still no task to run, `.dispatch()` is invoked which should make
   one or more `scx_bpf_dispatch()` calls to dispatch tasks from the BPF
   scheduler to one of the DSQ's. If more than one task has been dispatched,
   go back to the previous consumption step.

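To make the cycle above concrete, here is a hedged sketch of steps 2 and 4 for
a scheduler that queues tasks on the BPF side and dispatches them lazily. It
borrows the `BPF_MAP_TYPE_QUEUE` approach used by the scx_qmap example; the map
name and size are illustrative, and kfuncs like `bpf_task_from_pid()` are
assumed to be available to the program:

```c
/* Illustrative only: queue task pids at enqueue time, then dispatch
 * them to the local DSQ when the CPU runs out of work.
 */
struct {
	__uint(type, BPF_MAP_TYPE_QUEUE);
	__uint(max_entries, 4096);	/* arbitrary queue depth */
	__type(value, s32);
} task_queue SEC(".maps");

void BPF_STRUCT_OPS(sketch_enqueue, struct task_struct *p, u64 enq_flags)
{
	s32 pid = p->pid;

	/* Step 2: queue on the BPF side instead of dispatching directly.
	 * Fall back to the global DSQ if the queue is full. */
	if (bpf_map_push_elem(&task_queue, &pid, 0))
		scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

void BPF_STRUCT_OPS(sketch_dispatch, s32 cpu, struct task_struct *prev)
{
	struct task_struct *p;
	s32 pid;

	/* Step 4: move one queued task to this CPU's local DSQ. */
	if (bpf_map_pop_elem(&task_queue, &pid))
		return;

	p = bpf_task_from_pid(pid);	/* acquires a reference, may fail */
	if (!p)
		return;
	scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
	bpf_task_release(p);
}
```
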
### Verifying callback behavior

sched_ext always verifies that any value returned from a callback is valid, and
will issue an error and unload the scheduler if it is not. For example, an
error is issued if `.select_cpu()` returns an invalid CPU, or if an attempt is
made to invoke `scx_bpf_dispatch()` with invalid enqueue flags. Furthermore, if
a task remains runnable for too long without being scheduled, sched_ext will
detect it and error-out the scheduler.

# Closing Thoughts

Both Meta and Google have experimented quite a lot with schedulers in the last
several years. Google has benchmarked various workloads using user space
scheduling, and has achieved performance wins by trading off generality for
application specific needs. At Meta, we are actively experimenting with
multiple production workloads and seeing significant performance gains, and are
in the process of deploying sched_ext schedulers on production workloads at
scale. We expect to leverage it extensively to run various experiments and
develop customized schedulers for a number of critical workloads.

# Written By

README.md

sched_ext is a Linux kernel feature
which enables implementing kernel thread schedulers in BPF and dynamically
loading them. This repository contains various scheduler implementations and
support utilities.

sched_ext enables safe and rapid iterations of scheduler implementations, thus
radically widening the scope of scheduling strategies that can be experimented
with and deployed, even in massive and complex production environments.

- The [scx_layered case
  study](https://github.com/sched-ext/scx/blob/case-studies/case-studies/scx_layered.md)
  concretely demonstrates the power and benefits of sched_ext.
- For a high-level but thorough overview of sched_ext (especially its
  motivation), please refer to the [overview document](OVERVIEW.md).
- For a description of the schedulers shipped with this tree, please refer to
  the [schedulers document](scheds/README.md).

While the kernel feature is not upstream yet, we believe sched_ext has a
reasonable chance of landing upstream in the foreseeable future. Both Meta
and Google are fully committed to sched_ext.

You can reach us through the following channels:

- Reddit: https://reddit.com/r/sched_ext

We also hold weekly office hours every Monday. Please see the #office-hours
channel on Slack for details. To join the Slack community, you can use [this
link](https://bit.ly/scx_slack).

scheds/README.md

SCHED_EXT SCHEDULERS
====================

# Introduction

This directory contains the repo's schedulers.

Some of these schedulers are simply examples of different types of schedulers
that can be built using sched_ext. They can be loaded and used to schedule on
your system, but their primary purpose is to illustrate how various features of
sched_ext can be used.

Other schedulers are actually performant, production-ready schedulers. That is,
for the correct workload and with the correct tuning, they may be deployed in a
production environment with acceptable or possibly even improved performance.
Some of the examples could be improved to become production schedulers.

Please see the following README files for details on each of the various types
of schedulers:

- [kernel-examples](kernel-examples/README.md) describes all of the example
  schedulers that are also shipped with the Linux kernel tree.
- [rust-user](rust-user/README.md) describes all of the schedulers with rust
  user space components. All of these schedulers are production ready.

## Note on syncing

Note that there is a [sync-to-kernel.sh](sync-to-kernel.sh) script in this
directory. This is used to sync any changes to the kernel-examples/ schedulers
with the Linux kernel tree. If you've made any changes to a scheduler in
kernel-examples/, please use the script to synchronize with the sched_ext Linux
kernel tree:

```
$ ./sync-to-kernel.sh /path/to/kernel/tree
```

scheds/kernel-examples/README.md

EXAMPLE SCHEDULERS
==================

# Introduction

This directory contains example schedulers that are shipped with the sched_ext
Linux kernel tree.

While these schedulers can be loaded and used to schedule on your system, their
primary purpose is to illustrate how various features of sched_ext can be used.

This document will give some background on each example scheduler, including
describing the types of workloads or scenarios they're designed to accommodate.
For more details on any of these schedulers, please see the header comment in
their .bpf.c file.

# Schedulers

This section lists, in alphabetical order, all of the current example
schedulers.

--------------------------------------------------------------------------------

## scx_simple

### Overview

A simple scheduler that provides an example of a minimal sched_ext scheduler.
scx_simple can be run in either global weighted vtime mode, or FIFO mode.

### Typical Use Case

Though very simple, this scheduler should perform reasonably well on
single-socket CPUs with a uniform L3 cache topology. Note that while running in
global FIFO mode may work well for some workloads, saturating threads can
easily drown out inactive ones.

### Production Ready?

This scheduler could be used in a production environment, assuming the hardware
constraints enumerated above, and assuming the workload can accommodate a
simple scheduling policy.

--------------------------------------------------------------------------------

## scx_qmap

### Overview

Another simple, yet slightly more complex scheduler that provides an example of
a basic weighted FIFO queuing policy. It also provides examples of some common
useful BPF features, such as sleepable per-task storage allocation in the
`ops.prep_enable()` callback, and using the `BPF_MAP_TYPE_QUEUE` map type to
enqueue tasks. It also illustrates how core-sched support could be implemented.

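A hedged sketch of the sleepable per-task storage allocation mentioned above
(the `task_ctx` names are illustrative, and the callback and argument types
follow the example schedulers at the time of writing):

```c
struct task_ctx {
	u64 enq_count;	/* illustrative per-task counter */
};

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct task_ctx);
} task_ctx_map SEC(".maps");

/* ops.prep_enable() is sleepable, so per-task storage can be allocated
 * with BPF_LOCAL_STORAGE_GET_F_CREATE before the task is ever enqueued. */
s32 BPF_STRUCT_OPS(qmap_sketch_prep_enable, struct task_struct *p,
		   struct scx_enable_args *args)
{
	struct task_ctx *ctx;

	ctx = bpf_task_storage_get(&task_ctx_map, p, NULL,
				   BPF_LOCAL_STORAGE_GET_F_CREATE);
	return ctx ? 0 : -ENOMEM;	/* -ENOMEM per usual kernel errno */
}
```
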
### Typical Use Case

Purely used to illustrate sched_ext features.

### Production Ready?

No

--------------------------------------------------------------------------------

## scx_central

### Overview

A "central" scheduler where scheduling decisions are made from a single CPU.
This scheduler illustrates how scheduling decisions can be dispatched from a
single CPU, allowing other cores to run with infinite slices, without timer
ticks, and without having to incur the overhead of making scheduling decisions.

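The core idea can be sketched as follows: every CPU's `.enqueue()` forwards
tasks to a central DSQ and pokes the central CPU, which alone runs dispatch
logic. This is a simplified, hypothetical rendition (the real scx_central.bpf.c
handles many more details), using kfuncs and constants that follow the
conventions of the example schedulers:

```c
#define CENTRAL_CPU	0
#define CENTRAL_DSQ	0	/* custom DSQ, created in ops.init() (not shown) */

void BPF_STRUCT_OPS(central_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* All tasks funnel through one queue; only the central CPU
	 * makes actual placement decisions. */
	scx_bpf_dispatch(p, CENTRAL_DSQ, SCX_SLICE_INF, enq_flags);
	scx_bpf_kick_cpu(CENTRAL_CPU, 0);
}

void BPF_STRUCT_OPS(central_dispatch, s32 cpu, struct task_struct *prev)
{
	if (cpu != CENTRAL_CPU)
		return;	/* non-central CPUs never make decisions */

	/* The central CPU would pop tasks from CENTRAL_DSQ here and send
	 * each to a target CPU's local DSQ, e.g. with
	 * scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | target, SCX_SLICE_INF, 0),
	 * kicking the target CPU afterwards. */
}
```
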
### Typical Use Case

This scheduler could theoretically be useful for any workload that benefits
from minimizing scheduling overhead and timer ticks. An example of where this
could be particularly useful is running VMs, where running with infinite slices
and no timer ticks allows the VM to avoid unnecessary expensive vmexits.

### Production Ready?

Not yet. While tasks are run with an infinite slice (SCX_SLICE_INF), they're
preempted every 20ms in a timer callback. The scheduler also puts the core
scheduling logic inside of the central / scheduling CPU's ops.dispatch() path,
and does not yet have any kind of priority mechanism.

--------------------------------------------------------------------------------

## scx_pair

### Overview

A sibling scheduler which ensures that tasks will only ever be co-located on a
physical core if they're in the same cgroup. It illustrates how a scheduling
policy could be implemented to mitigate CPU bugs, such as L1TF, and also shows
how some useful kfuncs such as `scx_bpf_kick_cpu()` can be utilized.

### Typical Use Case

While this scheduler is only meant to be used to illustrate certain sched_ext
features, with a bit more work (e.g. by adding some form of priority handling
inside and across cgroups), it could have been used as a way to quickly
mitigate L1TF before core scheduling was implemented and rolled out.

### Production Ready?

No

--------------------------------------------------------------------------------

## scx_flatcg

### Overview

A flattened cgroup hierarchy scheduler. This scheduler implements hierarchical
weight-based cgroup CPU control by flattening the cgroup hierarchy into a
single layer, by compounding the active weight share at each level. The effect
of this is a much more performant CPU controller, which does not need to
descend down cgroup trees in order to properly compute a cgroup's share.

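To illustrate the weight compounding with a toy model (this is not the actual
scx_flatcg implementation): a cgroup's flattened share is the product of its
weight fraction at each level of the hierarchy, so it can be computed once up
front rather than by walking the tree on every scheduling decision.

```c
#include <stdio.h>

/* Compound the active weight share along a cgroup's ancestor chain.
 * weights[i] is the cgroup's weight at level i; siblings_total[i] is
 * the total active weight among its siblings (including itself). */
static double flattened_share(const double *weights,
			      const double *siblings_total, int levels)
{
	double share = 1.0;

	for (int i = 0; i < levels; i++)
		share *= weights[i] / siblings_total[i];
	return share;
}

int main(void)
{
	/* e.g. weight 100 of 400 at the top level, then weight 50 of
	 * 100 among its children: 0.25 * 0.5 = 0.125 */
	double w[] = { 100, 50 }, tot[] = { 400, 100 };

	printf("%.3f\n", flattened_share(w, tot, 2));
	return 0;
}
```
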
### Typical Use Case

This scheduler could be useful for any typical workload requiring a CPU
controller, but which cannot tolerate the higher overheads of the fair CPU
controller.

### Production Ready?

Yes, though the scheduler (currently) does not adequately accommodate
thundering herds of cgroups. If, for example, many cgroups which are nested
behind a low-priority cgroup were to wake up around the same time, they may be
able to consume more CPU cycles than they are entitled to.

--------------------------------------------------------------------------------

## scx_userland

### Overview

A simple weighted vtime scheduler where all scheduling decisions take place in
user space. This is in contrast to Rusty, where load balancing lives in user
space, but scheduling decisions are still made in the kernel.

### Typical Use Case

There are many advantages to writing schedulers in user space. For example, you
can use a debugger, you can write the scheduler in Rust, and you can use data
structures bundled with your favorite library.

On the other hand, user space scheduling can be hard to get right. You can
potentially deadlock due to not scheduling a task that's required for the
scheduler itself to make forward progress (though the sched_ext watchdog will
protect the system by unloading your scheduler after a timeout if that
happens). You also have to bootstrap some communication protocol between the
kernel and user space.

A more robust solution to this would be building a user space scheduling
framework that abstracts much of this complexity away from you.

### Production Ready?

No. This scheduler uses an ordered list for vtime scheduling, and is strictly
less performant than just using something like `scx_simple`. It is purely meant
to illustrate that it's possible to build a user space scheduler on top of
sched_ext.

scheds/rust-user/README.md

RUST SCHEDULERS
===============

# Introduction

This directory contains schedulers with user space rust components.

This document will give some background on each scheduler, including describing
the types of workloads or scenarios they're designed to accommodate. For more
details on any of these schedulers, please see the header comment in their
main.rs or \*.bpf.c files.

# Schedulers

This section lists, in alphabetical order, all of the current rust user-space
schedulers.

--------------------------------------------------------------------------------

## scx_layered

### Overview

A highly configurable multi-layer BPF / user space hybrid scheduler.

scx_layered allows the user to classify tasks into multiple layers, and apply
different scheduling policies to those layers. For example, a layer could be
created of all tasks that are part of the `user.slice` cgroup slice, and a
policy could be specified that ensures that the layer is given at least 80% CPU
utilization for some subset of CPUs on the system.

### Typical Use Case

scx_layered is designed to be highly customizable, and can be targeted for
specific applications. For example, if you had a high-priority service that
required priority access to all but 1 physical core to ensure acceptable p99
latencies, you could specify that the service would get priority access to all
but 1 core on the system. If that service ends up not utilizing all of those
cores, they could be used by other layers until they're needed.

### Production Ready?

Yes. If tuned correctly, scx_layered should be performant across various CPU
architectures and workloads.

That said, you may run into an issue with infeasible weights, where a task with
a very high weight may cause the scheduler to incorrectly leave cores idle
because it thinks they're necessary to accommodate the compute for a single
task. This can also happen in CFS, and should soon be addressed for
scx_layered.

--------------------------------------------------------------------------------

## scx_rusty

### Overview

A multi-domain, BPF / user space hybrid scheduler. The BPF portion of the
scheduler does a simple round robin in each domain, and the user space portion
(written in Rust) calculates the load factor of each domain, and informs BPF of
how tasks should be load balanced accordingly.

### Typical Use Case

Rusty is designed to be flexible, and to accommodate different architectures
and workloads. Various load balancing thresholds (e.g. greediness, frequency,
etc), as well as how Rusty should partition the system into scheduling domains,
can be tuned to achieve the optimal configuration for any given system or
workload.

### Production Ready?

Yes. If tuned correctly, rusty should be performant across various CPU
architectures and workloads. Rusty by default creates a separate scheduling
domain per-LLC, so its default configuration may be performant as well. Note
however that scx_rusty does not yet disambiguate between LLCs in different NUMA
nodes, so it may perform better on multi-CCX machines where all the LLCs share
the same socket, as opposed to multi-socket machines.

Note as well that you may run into an issue with infeasible weights, where a
task with a very high weight may cause the scheduler to incorrectly leave cores
idle because it thinks they're necessary to accommodate the compute for a
single task. This can also happen in CFS, and should soon be addressed for
scx_rusty.