OVERVIEW.md

# Overview

[sched_ext](https://github.com/sched-ext/scx) is a Linux kernel feature which
enables implementing and dynamically loading safe kernel thread schedulers in
BPF.

The benefits of such a framework are multifaceted, with three main axes where
sched_ext is specifically designed to provide significant value:

1. Ease of experimentation and exploration: Enabling rapid iteration of new
   scheduling policies.

2. Customization: Building application-specific schedulers which implement
   policies that are not applicable to general-purpose schedulers.

3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling
   policies in production environments.

We'll begin by doing a deeper dive into the motivation of sched_ext in the
following [Motivation](#motivation) section. Following that, we'll provide some
details on the overall architecture of sched_ext in the [How](#how) section
below.

# Motivation<a name="motivation"></a>

## 1. Ease of experimentation and exploration

### Why is exploration important?

Scheduling is a challenging problem space. Small changes in scheduling behavior
can have a significant impact on various components of a system, with the
corresponding effects varying widely across different platforms, architectures,
and workloads.

While complexities have always existed in scheduling, they have increased
dramatically over the past 10-15 years. In the mid-late 2000s, cores were
typically homogeneous and further apart from each other, with the criteria for
scheduling being roughly the same across the entire die.

Systems in the modern age are by comparison much more complex. Modern CPU
designs, where the total power budget of all CPU cores often far exceeds the
power budget of the socket, with dynamic frequency scaling, and with or without
chiplets, have significantly expanded the scheduling problem space. Cache
hierarchies have become less uniform, with Core Complex (CCX) designs such as
recent AMD processors having multiple shared L3 caches within a single socket.
Such topologies resemble NUMA sans persistent NUMA node stickiness.

Use-cases have become increasingly complex and diverse as well. Applications
such as mobile and VR have strict latency requirements to avoid missing
deadlines that impact user experience, and datacenter workloads are constantly
pushing the demands on the scheduler in terms of workload isolation and
resource distribution.

Experimentation and exploration are important for any non-trivial problem
space. However, given the recent hardware and software developments, we believe
that experimentation and exploration are not just important, but _critical_ in
the scheduling problem space.

Indeed, other approaches in industry are already being explored. AMD has
proposed an experimental [patch
set](https://lore.kernel.org/lkml/20220910105326.1797-1-kprateek.nayak@amd.com/)
which enables userspace to provide hints to the scheduler via "Userspace
Hinting". The approach adds a prctl() API which allows callers to set a
numerical "hint" value on a struct task_struct. This hint is then optionally
read by the scheduler to adjust the cost calculus for various scheduling
decisions.

Huawei have also [expressed
interest](https://lore.kernel.org/bpf/dedc7b72-9da4-91d0-d81d-75360c177188@huawei.com/)
in enabling some form of programmable scheduling. While we're unaware of any
patch sets which have been sent to the upstream list for this proposal, it
similarly illustrates the need for more flexibility in the scheduler.

Additionally, Google has developed
[ghOSt](https://dl.acm.org/doi/pdf/10.1145/3477132.3483542) with the goal of
enabling custom, userspace driven scheduling policies. Prior
[presentations](https://lpc.events/event/16/contributions/1365/) at LPC have
discussed ghOSt and how BPF can be used to accelerate scheduling.

### Why can't we just explore directly with CFS?

Experimenting with CFS directly or implementing a new sched_class from scratch
is of course possible, but is often difficult and time consuming. Newcomers to
the scheduler often require years to understand the codebase and become
productive contributors. Even for seasoned kernel engineers, experimenting with
and upstreaming features can take a very long time. The iteration process
itself is also time consuming, as testing scheduler changes on real hardware
requires reinstalling the kernel and rebooting the host.

One such experiment implemented separate runqueues per SMT sibling. This caused
issues, for example ensuring proper fairness between the independent runqueues
of SMT siblings.

The high barrier to entry for working on the scheduler is an impediment to
academia as well. Master's/PhD candidates who are interested in improving the
scheduler will spend years ramping-up, only to complete their degrees just as
they're finally ready to make significant changes. A lower entrance barrier
would allow researchers to more quickly ramp up, test out hypotheses, and
iterate on novel ideas. Research methodology is also severely hampered by the
high barrier of entry to make modifications; for example, the
[Shenango](https://www.usenix.org/system/files/nsdi19-ousterhout.pdf) and
Shinjuku scheduling policies used sched affinity to replicate the desired
policy semantics, due to the difficulty of incorporating these policies into
the kernel directly.

The iterative process itself also imposes a significant cost to working on the
scheduler. Testing changes requires developers to recompile and reinstall the
kernel, reboot their machines, rewarm their workloads, and then finally rerun
their experiments. Rewarming workload caches across many instances in the Meta
production environment takes hours, for example.

### How does sched_ext help with exploration?

sched_ext attempts to address all of the problems described above. In this
section, we'll describe the benefits to experimentation and exploration that
are afforded by sched_ext, provide real-world examples of those benefits, and
discuss some of the trade-offs and considerations in our design choices.

A buggy scheduler cannot crash the host or indefinitely starve tasks. BPF also
enables sched_ext to significantly improve iteration speed for running
experiments. Loading and unloading a BPF scheduler is simply a matter of
running and terminating a sched_ext binary.

BPF also provides programs with a rich set of APIs, such as maps, kfuncs, and
BPF helpers. In addition to providing useful building blocks to programs that
run entirely in kernel space (such as many of our example schedulers), these
APIs also allow programs to leverage user space in making scheduling decisions.
Specifically, the Atropos sample scheduler has a relatively simple weighted
vtime or FIFO scheduling layer in BPF, paired with a load balancing component
in userspace written in Rust. As described in more detail below, we also built
a more general user-space scheduling framework called "rhone" by leveraging
various BPF features.

On the other hand, BPF does have shortcomings, as can be plainly seen from the
complexity in some of the example schedulers. scx_pair.bpf.c illustrates this
point well. To start, it requires a good amount of code to emulate
cgroup-local-storage. In the kernel proper, this would simply be a matter of
adding another pointer to the struct cgroup, but in BPF, it requires a complex
juggling of data amongst multiple different maps, a good amount of boilerplate
code, and some unwieldy `bpf_loop()`'s and atomics. The code is also littered
with explicit and often unnecessary sanity checks to appease the verifier.

That being said, BPF is being rapidly improved. For example, Yonghong Song
recently upstreamed a
[patch set](https://lore.kernel.org/bpf/20221026042835.672317-1-yhs@fb.com/) to
add a cgroup local storage map type, allowing scx_pair.bpf.c to be simplified.
There are plans to address other issues as well, such as providing
statically-verified locking, and avoiding the need for unnecessary sanity
checks. Addressing these shortcomings is a high priority for BPF, and as
progress continues to be made, we expect most deficiencies to be addressed in
the not-too-distant future.

Yet another exploration advantage of sched_ext is that it helps widen the scope
of experiments. For example, sched_ext makes it easy to defer CPU assignment
until a task starts executing, allowing schedulers to share scheduling queues
at any granularity (hyper-twin, CCX and so on). Additionally, higher level
frameworks can be built on top to further widen the scope. For example, the
aforementioned [rhone](https://github.com/Decave/rhone) library allows
implementing scheduling policies in user-space by encapsulating the complexity
around communicating scheduling decisions with the kernel. This allows taking
advantage of a richer programming environment in user-space, enabling
experimenting with, for instance, more complex mathematical models.

sched_ext also allows developers to leverage machine learning. At Meta, we
experimented with using machine learning to predict whether a running task
would soon yield its CPU. These predictions can be used to aid the scheduler in
deciding whether to keep a runnable task on its current CPU rather than
migrating it to an idle CPU, with the hope of avoiding unnecessary cache
misses. Using a tiny neural net model with only one hidden layer of size 16,
and a decaying count of 64 syscalls as a feature, we were able to achieve a 15%
throughput improvement on an Nginx benchmark, with an 87% inference accuracy.

## 2. Customization

sched_ext enables building application-specific schedulers, which can take
scheduling hints directly from the application (for example, a service that
knows the different deadlines of incoming RPCs).

Google has also experimented with some promising, novel scheduling policies.
One example is "central" scheduling, wherein a single CPU makes all scheduling
decisions for the entire system. This allows most cores on the system to be
fully dedicated to running workloads, and can have significant performance
improvements for certain use cases. For example, central scheduling with VCPUs
can avoid expensive vmexits and cache flushes, by instead delegating the
responsibility of preemption checks from the tick to a single CPU. See
scx_central.bpf.c for a simple example of a central scheduling policy built in
sched_ext.

Some workloads also have non-generalizable constraints which enable
optimizations in a scheduling policy which would otherwise not be feasible.
For example, VM workloads at Google typically have a low overcommit ratio
compared to the number of physical CPUs. This allows the scheduler to support
bounded tail latencies, as well as longer blocks of uninterrupted time.

Yet another interesting use case is the scx_flatcg scheduler, which provides a
flattened hierarchical vtree for cgroups. This scheduler does not account for
thundering herd problems among cgroups, and therefore may not be suitable for
inclusion in CFS. However, in a simple benchmark using
[wrk](https://github.com/wg/wrk) on apache serving a CGI script calculating
sha1sum of a small file, it outperformed CFS by ~3% with CPU controller
disabled and by ~10% with two apache instances competing with 2:1 weight ratio
nested four level deep.

Certain industries require specific scheduling behaviors that do not apply
broadly. For example, ARINC 653 defines scheduling behavior that is widely used
by avionic software, and some out-of-tree implementations
(https://ieeexplore.ieee.org/document/7005306) have been built. While the
upstream community may decide to merge one such implementation in the future,
it would also be entirely reasonable to not do so given the narrowness of the
use case. Such niche policies are well served by sched_ext.

There are also classes of policy exploration, such as machine learning, or
responding in real-time to application hints, that are significantly harder
(and not necessarily appropriate) to integrate within the kernel itself.

### Won't this increase fragmentation?

We acknowledge that to some degree, sched_ext does run the risk of increasing
the fragmentation of scheduler implementations. As a result of exploration,
however, we believe that enabling the larger ecosystem to innovate will
ultimately accelerate the overall development and performance of Linux.

BPF programs are required to be GPLv2, which is enforced by the verifier on
program loads. With regards to API stability, just as with other semi-internal
interfaces such as BPF kfuncs, we won't be providing any API stability
guarantees to BPF schedulers. While we intend to make an effort to provide
compatibility when possible, we will not provide any explicit, strong
guarantees as the kernel typically does with e.g. UAPI headers. For users who
want stability, some schedulers, such as the example schedulers and the
scx_rusty scheduler, will be upstreamed as part of the kernel tree. Distros
will be able to package and release these schedulers with the kernel, allowing
users to utilize these schedulers out-of-the-box without requiring any
additional work or dependencies such as clang or building the scheduler
programs themselves. Other schedulers and scheduling frameworks such as rhone
may be open-sourced through separate per-project repos.

## 3. Rapid scheduler deployments

Rolling out kernel upgrades is a slow and iterative process. At a large scale
it can take months to roll a new kernel out to a fleet of servers. While this
latency is expected and inevitable for normal kernel upgrades, it can become
highly problematic when kernel changes are required to fix bugs.
[Livepatch](https://www.kernel.org/doc/html/latest/livepatch/livepatch.html) is
available to quickly roll out critical security fixes to large fleets, but the
scope of changes that can be applied with livepatching is fairly limited, and
would likely not be usable for patching scheduling policies. With sched_ext,
new scheduling policies can be rapidly rolled out to production environments.

As an example, one of the variants of the [L1 Terminal Fault
(L1TF)](https://www.intel.com/content/www/us/en/architecture-and-technology/l1tf.html)
vulnerability allows a VCPU running a VM to read arbitrary host kernel memory
for pages in L1 data cache. The solution was to implement core scheduling,
which ensures that tasks running as hypertwins have the same "cookie".

While core scheduling works well, it took a long time to finalize and land
upstream. This long rollout period was painful, and required organizations to
make difficult interim trade-offs, such as disabling SMT at a significant
performance cost.

Once core scheduling was upstream, organizations had to upgrade the kernels on
their entire fleets. As downtime is not an option for many, these upgrades had
to be gradually rolled out, which can take a very long time for large fleets.

An example of a sched_ext scheduler that illustrates core scheduling semantics
is scx_pair.bpf.c, which co-schedules pairs of tasks from the same cgroup, and
is resilient to L1TF vulnerabilities. While this example scheduler is certainly
not suitable for production in its current form, a similar scheduler that is
more performant and featureful could be written and deployed if necessary.

Rapid scheduling deployments can similarly be useful to quickly roll out new
scheduling features without requiring kernel upgrades. At Google, for example,
it was observed that some low-priority workloads were causing degraded
performance for higher-priority workloads due to consuming a disproportionate
share of memory bandwidth. While a temporary mitigation was to use sched
affinity to limit the footprint of this low-priority workload to a small subset
of CPUs, a preferable solution would be to implement a more featureful
task-priority mechanism which automatically throttles lower-priority tasks
which are causing memory contention for the rest of the system. Implementing
this in CFS and rolling it out to the fleet could take a very long time.

sched_ext would directly address these gaps. If another hardware bug or
resource contention issue comes in that requires scheduler support to mitigate,
sched_ext can be used to experiment with and test different policies. Once a
scheduler is available, it can quickly be rolled out to as many hosts as
necessary, and function as a stop-gap solution until a longer-term mitigation
is upstreamed.

# How

sched_ext is a new sched_class which allows scheduling policies to be
implemented in BPF programs.

sched_ext leverages BPF's struct_ops feature to define a structure which
exports function callbacks and flags to BPF programs that wish to implement
scheduling policies. The struct_ops structure exported by sched_ext is struct
sched_ext_ops, and is conceptually similar to struct sched_class. The role of
sched_ext is to map the complex sched_class callbacks to the more simple and
ergonomic struct sched_ext_ops callbacks.

Unlike some other BPF program types which have ABI requirements due to
exporting UAPIs, struct_ops has no ABI requirements whatsoever. This provides
us with the flexibility to change the APIs provided to schedulers as necessary.
BPF struct_ops is also already being used successfully in other subsystems,
such as in support of TCP congestion control.

The only struct_ops field that is required to be specified by a scheduler is
the 'name' field. Otherwise, sched_ext will provide sane default behavior, such
as automatically choosing an idle CPU on the task wakeup path if
`.select_cpu()` is missing.

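To give a feel for the shape of a scheduler, below is a minimal sketch loosely
modeled on the scx_simple example scheduler. It assumes the common header used
by the example schedulers in this tree (scx_common.bpf.h), and macro and kfunc
signatures such as `BPF_STRUCT_OPS` and `scx_bpf_dispatch()` may differ
slightly between sched_ext versions:

```c
/* A hypothetical, minimal sched_ext scheduler sketch. Every task is
 * sent to the global FIFO at enqueue time; all other callbacks are
 * left unset so sched_ext falls back to its default behavior.
 */
#include "scx_common.bpf.h"

char _license[] SEC("license") = "GPL";

void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Dispatch to the built-in global DSQ with the default slice. */
	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

SEC(".struct_ops")
struct sched_ext_ops minimal_ops = {
	.enqueue	= (void *)minimal_enqueue,
	/* .name is the only mandatory field. */
	.name		= "minimal",
};
```
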
### Dispatch queues

To bridge the workflow imbalance between the scheduler core and sched_ext_ops
callbacks, sched_ext uses simple FIFOs called dispatch queues (DSQ's). By
default, there is one global DSQ (`SCX_DSQ_GLOBAL`), and one local per-CPU DSQ
(`SCX_DSQ_LOCAL`). `SCX_DSQ_GLOBAL` is provided for convenience and need not be
used by a scheduler that doesn't require it. As described in more detail below,
`SCX_DSQ_LOCAL` is the per-CPU FIFO that sched_ext pulls from when putting the
next task on the CPU. The BPF scheduler can manage an arbitrary number of DSQ's
using `scx_bpf_create_dsq()` and `scx_bpf_destroy_dsq()`.

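As a hedged sketch of the DSQ lifecycle, a scheduler that wants its own shared
queue could create it when the scheduler is loaded and destroy it on exit. The
DSQ id (`SHARED_DSQ` here) is an arbitrary scheduler-chosen value, and the
callback signatures follow the example schedulers at the time of writing:

```c
#define SHARED_DSQ 0	/* arbitrary, scheduler-chosen DSQ id */

s32 BPF_STRUCT_OPS(minimal_init)
{
	/* Create the custom DSQ; -1 means no NUMA node preference. */
	return scx_bpf_create_dsq(SHARED_DSQ, -1);
}

void BPF_STRUCT_OPS(minimal_exit, struct scx_exit_info *ei)
{
	/* Built-in DSQ's are managed by sched_ext; only user-created
	 * DSQ's need to be destroyed. */
	scx_bpf_destroy_dsq(SHARED_DSQ);
}
```
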
### Scheduling cycle

The following briefly shows a typical workflow for how a waking task is
scheduled and executed.

1. When a task is waking up, `.select_cpu()` is the first operation invoked.
   This serves two purposes. It allows a scheduler to optimize task placement
   by specifying a CPU where it expects the task to eventually be scheduled,
   and it ensures that the selected CPU will be woken if it's idle.

2. Once the target CPU is selected, `.enqueue()` is invoked. It can make one of
   the following decisions:

   - Immediately dispatch the task to either the global DSQ (`SCX_DSQ_GLOBAL`)
     or the current CPU's local DSQ (`SCX_DSQ_LOCAL`).

   - Immediately dispatch the task to a user-created dispatch queue.

   - Queue the task on the BPF side, e.g. in an rbtree map for a vruntime
     scheduler, with the intention of dispatching it at a later time from
     `.dispatch()`.

3. When a CPU is ready to schedule, it first looks at its local DSQ. If empty,
   it invokes `.consume()` which should make one or more `scx_bpf_consume()`
   calls to consume tasks from DSQ's. If a `scx_bpf_consume()` call succeeds,
   the CPU has the next task to run and `.consume()` can return. If
   `.consume()` is not defined, sched_ext will by-default consume from only the
   built-in `SCX_DSQ_GLOBAL` DSQ.

4. If there's still no task to run, `.dispatch()` is invoked which should make
   one or more `scx_bpf_dispatch()` calls to dispatch tasks from the BPF
   scheduler to one of the DSQ's. If more than one task has been dispatched,
   go back to the previous consumption step.

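To make the cycle above concrete, here is a hedged sketch of steps 2 and 4 for
a scheduler that queues tasks on the BPF side and dispatches them lazily. It
borrows the `BPF_MAP_TYPE_QUEUE` approach used by the scx_qmap example; the map
name and size are illustrative, and kfuncs like `bpf_task_from_pid()` are
assumed to be available to the program:

```c
/* Illustrative only: queue task pids at enqueue time, then dispatch
 * them to the local DSQ when the CPU runs out of work.
 */
struct {
	__uint(type, BPF_MAP_TYPE_QUEUE);
	__uint(max_entries, 4096);	/* arbitrary queue depth */
	__type(value, s32);
} task_queue SEC(".maps");

void BPF_STRUCT_OPS(sketch_enqueue, struct task_struct *p, u64 enq_flags)
{
	s32 pid = p->pid;

	/* Step 2: queue on the BPF side instead of dispatching directly.
	 * Fall back to the global DSQ if the queue is full. */
	if (bpf_map_push_elem(&task_queue, &pid, 0))
		scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

void BPF_STRUCT_OPS(sketch_dispatch, s32 cpu, struct task_struct *prev)
{
	struct task_struct *p;
	s32 pid;

	/* Step 4: move one queued task to this CPU's local DSQ. */
	if (bpf_map_pop_elem(&task_queue, &pid))
		return;

	p = bpf_task_from_pid(pid);	/* acquires a reference, may fail */
	if (!p)
		return;
	scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
	bpf_task_release(p);
}
```
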
### Verifying callback behavior

sched_ext always verifies that any value returned from a callback is valid, and
will issue an error and unload the scheduler if it is not. For example, an
error is issued if `.select_cpu()` returns an invalid CPU, or if an attempt is
made to invoke `scx_bpf_dispatch()` with invalid enqueue flags. Furthermore, if
a task remains runnable for too long without being scheduled, sched_ext will
detect it and error-out the scheduler.

# Closing Thoughts

Both Meta and Google have experimented quite a lot with schedulers in the last
several years. Google has benchmarked various workloads using user space
scheduling, and has achieved performance wins by trading off generality for
application specific needs. At Meta, we are actively experimenting with
multiple production workloads and seeing significant performance gains, and are
in the process of deploying sched_ext schedulers on production workloads at
scale. We expect to leverage it extensively to run various experiments and
develop customized schedulers for a number of critical workloads.

# Written By

README.md

sched_ext is a Linux kernel feature
which enables implementing kernel thread schedulers in BPF and dynamically
loading them. This repository contains various scheduler implementations and
support utilities.

sched_ext enables safe and rapid iterations of scheduler implementations, thus
radically widening the scope of scheduling strategies that can be experimented
with and deployed, even in massive and complex production environments.

- The [scx_layered case
  study](https://github.com/sched-ext/scx/blob/case-studies/case-studies/scx_layered.md)
  concretely demonstrates the power and benefits of sched_ext.
- For a high-level but thorough overview of sched_ext (especially its
  motivation), please refer to the [overview document](OVERVIEW.md).
- For a description of the schedulers shipped with this tree, please refer to
  the [schedulers document](scheds/README.md).

While the kernel feature is not upstream yet, we believe sched_ext has a
reasonable chance of landing upstream in the foreseeable future. Both Meta
and Google are fully committed to sched_ext.

You can reach us through the following channels:

- Reddit: https://reddit.com/r/sched_ext

We also hold weekly office hours every Monday. Please see the #office-hours
channel on Slack for details. To join the Slack community, you can use [this
link](https://bit.ly/scx_slack).

scheds/README.md

SCHED_EXT SCHEDULERS
====================

# Introduction

This directory contains the repo's schedulers.

Some of these schedulers are simply examples of different types of schedulers
that can be built using sched_ext. They can be loaded and used to schedule on
your system, but their primary purpose is to illustrate how various features of
sched_ext can be used.

Other schedulers are actually performant, production-ready schedulers. That is,
for the correct workload and with the correct tuning, they may be deployed in a
production environment with acceptable or possibly even improved performance.
Some of the examples could be improved to become production schedulers.

Please see the following README files for details on each of the various types
of schedulers:

- [kernel-examples](kernel-examples/README.md) describes all of the example
  schedulers that are also shipped with the Linux kernel tree.
- [rust-user](rust-user/README.md) describes all of the schedulers with rust
  user space components. All of these schedulers are production ready.

## Note on syncing

Note that there is a [sync-to-kernel.sh](sync-to-kernel.sh) script in this
directory. This is used to sync any changes to the kernel-examples/ schedulers
with the Linux kernel tree. If you've made any changes to a scheduler in
kernel-examples/, please use the script to synchronize with the sched_ext Linux
kernel tree:

```
$ ./sync-to-kernel.sh /path/to/kernel/tree
```

scheds/kernel-examples/README.md

EXAMPLE SCHEDULERS
==================

# Introduction

This directory contains example schedulers that are shipped with the sched_ext
Linux kernel tree.

While these schedulers can be loaded and used to schedule on your system, their
primary purpose is to illustrate how various features of sched_ext can be used.

This document will give some background on each example scheduler, including
describing the types of workloads or scenarios they're designed to accommodate.
For more details on any of these schedulers, please see the header comment in
their .bpf.c file.

# Schedulers

This section lists, in alphabetical order, all of the current example
schedulers.

--------------------------------------------------------------------------------

## scx_simple

### Overview

A simple scheduler that provides an example of a minimal sched_ext scheduler.
scx_simple can be run in either global weighted vtime mode, or FIFO mode.

### Typical Use Case

Though very simple, this scheduler should perform reasonably well on
single-socket CPUs with a uniform L3 cache topology. Note that while running in
global FIFO mode may work well for some workloads, saturating threads can
easily drown out inactive ones.

### Production Ready?

This scheduler could be used in a production environment, assuming the hardware
constraints enumerated above, and assuming the workload can accommodate a
simple scheduling policy.

--------------------------------------------------------------------------------

## scx_qmap

### Overview

Another simple, yet slightly more complex scheduler that provides an example of
a basic weighted FIFO queuing policy. It also provides examples of some common
useful BPF features, such as sleepable per-task storage allocation in the
`ops.prep_enable()` callback, and using the `BPF_MAP_TYPE_QUEUE` map type to
enqueue tasks. It also illustrates how core-sched support could be implemented.

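A hedged sketch of the sleepable per-task storage allocation mentioned above
(the `task_ctx` names are illustrative, and the callback and argument types
follow the example schedulers at the time of writing):

```c
struct task_ctx {
	u64 enq_count;	/* illustrative per-task counter */
};

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct task_ctx);
} task_ctx_map SEC(".maps");

/* ops.prep_enable() is sleepable, so per-task storage can be allocated
 * with BPF_LOCAL_STORAGE_GET_F_CREATE before the task is ever enqueued. */
s32 BPF_STRUCT_OPS(qmap_sketch_prep_enable, struct task_struct *p,
		   struct scx_enable_args *args)
{
	struct task_ctx *ctx;

	ctx = bpf_task_storage_get(&task_ctx_map, p, NULL,
				   BPF_LOCAL_STORAGE_GET_F_CREATE);
	return ctx ? 0 : -ENOMEM;	/* -ENOMEM per usual kernel errno */
}
```
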
### Typical Use Case

Purely used to illustrate sched_ext features.

### Production Ready?

No

--------------------------------------------------------------------------------

## scx_central

### Overview

A "central" scheduler where scheduling decisions are made from a single CPU.
This scheduler illustrates how scheduling decisions can be dispatched from a
single CPU, allowing other cores to run with infinite slices, without timer
ticks, and without having to incur the overhead of making scheduling decisions.

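The core idea can be sketched as follows: every CPU's `.enqueue()` forwards
tasks to a central DSQ and pokes the central CPU, which alone runs dispatch
logic. This is a simplified, hypothetical rendition (the real scx_central.bpf.c
handles many more details), using kfuncs and constants that follow the
conventions of the example schedulers:

```c
#define CENTRAL_CPU	0
#define CENTRAL_DSQ	0	/* custom DSQ, created in ops.init() (not shown) */

void BPF_STRUCT_OPS(central_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* All tasks funnel through one queue; only the central CPU
	 * makes actual placement decisions. */
	scx_bpf_dispatch(p, CENTRAL_DSQ, SCX_SLICE_INF, enq_flags);
	scx_bpf_kick_cpu(CENTRAL_CPU, 0);
}

void BPF_STRUCT_OPS(central_dispatch, s32 cpu, struct task_struct *prev)
{
	if (cpu != CENTRAL_CPU)
		return;	/* non-central CPUs never make decisions */

	/* The central CPU would pop tasks from CENTRAL_DSQ here and send
	 * each to a target CPU's local DSQ, e.g. with
	 * scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | target, SCX_SLICE_INF, 0),
	 * kicking the target CPU afterwards. */
}
```
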
### Typical Use Case

This scheduler could theoretically be useful for any workload that benefits
from minimizing scheduling overhead and timer ticks. An example of where this
could be particularly useful is running VMs, where running with infinite slices
and no timer ticks allows the VM to avoid unnecessary expensive vmexits.

### Production Ready?

Not yet. While tasks are run with an infinite slice (SCX_SLICE_INF), they're
preempted every 20ms in a timer callback. The scheduler also puts the core
scheduling logic inside of the central / scheduling CPU's ops.dispatch() path,
and does not yet have any kind of priority mechanism.

--------------------------------------------------------------------------------

## scx_pair

### Overview

A sibling scheduler which ensures that tasks will only ever be co-located on a
physical core if they're in the same cgroup. It illustrates how a scheduling
policy could be implemented to mitigate CPU bugs, such as L1TF, and also shows
how some useful kfuncs such as `scx_bpf_kick_cpu()` can be utilized.

### Typical Use Case

While this scheduler is only meant to be used to illustrate certain sched_ext
features, with a bit more work (e.g. by adding some form of priority handling
inside and across cgroups), it could have been used as a way to quickly
mitigate L1TF before core scheduling was implemented and rolled out.

### Production Ready?

No

--------------------------------------------------------------------------------

## scx_flatcg

### Overview

A flattened cgroup hierarchy scheduler. This scheduler implements hierarchical
weight-based cgroup CPU control by flattening the cgroup hierarchy into a
single layer, by compounding the active weight share at each level. The effect
of this is a much more performant CPU controller, which does not need to
descend down cgroup trees in order to properly compute a cgroup's share.

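To illustrate the weight compounding with a toy model (this is not the actual
scx_flatcg implementation): a cgroup's flattened share is the product of its
weight fraction at each level of the hierarchy, so it can be computed once up
front rather than by walking the tree on every scheduling decision.

```c
#include <stdio.h>

/* Compound the active weight share along a cgroup's ancestor chain.
 * weights[i] is the cgroup's weight at level i; siblings_total[i] is
 * the total active weight among its siblings (including itself). */
static double flattened_share(const double *weights,
			      const double *siblings_total, int levels)
{
	double share = 1.0;

	for (int i = 0; i < levels; i++)
		share *= weights[i] / siblings_total[i];
	return share;
}

int main(void)
{
	/* e.g. weight 100 of 400 at the top level, then weight 50 of
	 * 100 among its children: 0.25 * 0.5 = 0.125 */
	double w[] = { 100, 50 }, tot[] = { 400, 100 };

	printf("%.3f\n", flattened_share(w, tot, 2));
	return 0;
}
```
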
### Typical Use Case

This scheduler could be useful for any typical workload requiring a CPU
controller, but which cannot tolerate the higher overheads of the fair CPU
controller.

### Production Ready?

Yes, though the scheduler (currently) does not adequately accommodate
thundering herds of cgroups. If, for example, many cgroups which are nested
behind a low-priority cgroup were to wake up around the same time, they may be
able to consume more CPU cycles than they are entitled to.

--------------------------------------------------------------------------------

## scx_userland

### Overview

A simple weighted vtime scheduler where all scheduling decisions take place in
user space. This is in contrast to Rusty, where load balancing lives in user
space, but scheduling decisions are still made in the kernel.

### Typical Use Case

There are many advantages to writing schedulers in user space. For example, you
can use a debugger, you can write the scheduler in Rust, and you can use data
structures bundled with your favorite library.

On the other hand, user space scheduling can be hard to get right. You can
potentially deadlock due to not scheduling a task that's required for the
scheduler itself to make forward progress (though the sched_ext watchdog will
protect the system by unloading your scheduler after a timeout if that
happens). You also have to bootstrap some communication protocol between the
kernel and user space.

A more robust solution to this would be building a user space scheduling
framework that abstracts much of this complexity away from you.

### Production Ready?

No. This scheduler uses an ordered list for vtime scheduling, and is strictly
less performant than just using something like `scx_simple`. It is purely meant
to illustrate that it's possible to build a user space scheduler on top of
sched_ext.

scheds/rust-user/README.md

RUST SCHEDULERS
===============

# Introduction

This directory contains schedulers with user space rust components.

This document will give some background on each scheduler, including describing
the types of workloads or scenarios they're designed to accommodate. For more
details on any of these schedulers, please see the header comment in their
main.rs or \*.bpf.c files.

# Schedulers

This section lists, in alphabetical order, all of the current rust user-space
schedulers.

--------------------------------------------------------------------------------

## scx_layered

### Overview

A highly configurable multi-layer BPF / user space hybrid scheduler.

scx_layered allows the user to classify tasks into multiple layers, and apply
different scheduling policies to those layers. For example, a layer could be
created of all tasks that are part of the `user.slice` cgroup slice, and a
policy could be specified that ensures that the layer is given at least 80% CPU
utilization for some subset of CPUs on the system.

### Typical Use Case

scx_layered is designed to be highly customizable, and can be targeted for
specific applications. For example, if you had a high-priority service that
required priority access to all but 1 physical core to ensure acceptable p99
latencies, you could specify that the service would get priority access to all
but 1 core on the system. If that service ends up not utilizing all of those
cores, they could be used by other layers until they're needed.

### Production Ready?

Yes. If tuned correctly, scx_layered should be performant across various CPU
architectures and workloads.

That said, you may run into an issue with infeasible weights, where a task with
a very high weight may cause the scheduler to incorrectly leave cores idle
because it thinks they're necessary to accommodate the compute for a single
task. This can also happen in CFS, and should soon be addressed for
scx_layered.

--------------------------------------------------------------------------------

## scx_rusty

### Overview

A multi-domain, BPF / user space hybrid scheduler. The BPF portion of the
scheduler does a simple round robin in each domain, and the user space portion
(written in Rust) calculates the load factor of each domain, and informs BPF of
how tasks should be load balanced accordingly.

### Typical Use Case

Rusty is designed to be flexible, and to accommodate different architectures
and workloads. Various load balancing thresholds (e.g. greediness, frequency,
etc), as well as how Rusty should partition the system into scheduling domains,
can be tuned to achieve the optimal configuration for any given system or
workload.

### Production Ready?

Yes. If tuned correctly, rusty should be performant across various CPU
architectures and workloads. Rusty by default creates a separate scheduling
domain per-LLC, so its default configuration may be performant as well. Note
however that scx_rusty does not yet disambiguate between LLCs in different NUMA
nodes, so it may perform better on multi-CCX machines where all the LLCs share
the same socket, as opposed to multi-socket machines.

Note as well that you may run into an issue with infeasible weights, where a
task with a very high weight may cause the scheduler to incorrectly leave cores
idle because it thinks they're necessary to accommodate the compute for a
single task. This can also happen in CFS, and should soon be addressed for
scx_rusty.