commit c4c994c9ce by Kumar Kartikeya Dwivedi
scx_central: Break dispatch_to_cpu loop when running out of buffer slots
When many tasks popped from the central queue cannot be dispatched to
the local DSQ of the target CPU, we keep bouncing them to the fallback
DSQ and continue the dispatch_to_cpu loop until we find one which can
be dispatched to the target CPU's local DSQ.

In a contrived case, all tasks may pin themselves to CPUs other than
the target CPU, so their affinity prevents them from being dispatched
to that CPU's local DSQ. If they fill up the central queue, we keep
looping in dispatch_to_cpu and eventually run out of slots in the
dispatch buffer. The nr_mismatched counter rises quickly, and sched-ext
notices the error and unloads the BPF scheduler.

To remedy this, break out of the dispatch_to_cpu loop as soon as no
further dispatch operation can be performed. The outer loop in
central_dispatch for the central CPU should likewise break when these
slots run out and schedule a self-IPI to the central core, allowing
sched-ext to consume the dispatch buffer before the dispatch loop
restarts.
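As a rough illustration, the bail-out behaviour can be modelled in a few lines of user-space Python. The function name mirrors dispatch_to_cpu, but the queue shape, slot count, and return convention are invented for the sketch; this is not the BPF code.

```python
MAX_DISPATCH_SLOTS = 4  # hypothetical dispatch buffer size

def dispatch_to_cpu(queue, target_cpu, slots=MAX_DISPATCH_SLOTS):
    """Pop tasks until one can run on target_cpu or the slots run out.

    Each pop that bounces a mismatched task to the fallback DSQ costs
    one buffer slot. Returns the dispatched task, or None when the loop
    must break so the outer loop can self-IPI the central core and
    retry after the buffer has been consumed.
    """
    while queue and slots > 0:
        task = queue.pop(0)
        slots -= 1
        if target_cpu in task["allowed_cpus"]:
            return task          # fits the target CPU's local DSQ
        # affinity mismatch: bounced to the fallback DSQ, keep looping
    return None                  # out of tasks or out of slots: break

# Every task pinned to CPU 0, target is CPU 1: every pop mismatches,
# and the loop must break cleanly instead of overflowing the buffer.
pinned = [{"pid": i, "allowed_cpus": {0}} for i in range(10)]
assert dispatch_to_cpu(pinned, target_cpu=1) is None

# A matching task is found before the slots run out.
mixed = [{"pid": 1, "allowed_cpus": {0}}, {"pid": 2, "allowed_cpus": {1}}]
assert dispatch_to_cpu(mixed, target_cpu=1)["pid"] == 2
```

Without the slot budget, the first call above would spin over the pinned tasks indefinitely, which is exactly the dispatch buffer overflow the fix prevents.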

A basic way to reproduce this scenario is to do:
taskset -c 0 perf bench sched messaging

The error in the kernel will be:
sched_ext: BPF scheduler "central" errored, disabling
sched_ext: runtime error (dispatch buffer overflow)
bpf_prog_6a473147db3cec67_dispatch_to_cpu+0xc2/0x19a
bpf_prog_c9e51ba75372a829_central_dispatch+0x103/0x1a5

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
2023-12-12 07:50:46 +00:00

EXAMPLE SCHEDULERS

Introduction

This directory contains example schedulers that are shipped with the sched_ext Linux kernel tree.

While these schedulers can be loaded and used to schedule on your system, their primary purpose is to illustrate how various features of sched_ext can be used.

This document will give some background on each example scheduler, including describing the types of workloads or scenarios they're designed to accommodate. For more details on any of these schedulers, please see the header comment in their .bpf.c file.

Schedulers

This section describes each of the current example schedulers.


scx_simple

Overview

A simple scheduler that provides an example of a minimal sched_ext scheduler. scx_simple can be run in either global weighted vtime mode, or FIFO mode.
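The global weighted vtime mode can be illustrated with a toy model. Everything here (field names, the base weight, the charging formula) is a simplified sketch, not the actual scx_simple BPF code; the point is only that a task's vtime advances inversely to its weight, so heavier tasks get proportionally more CPU.

```python
def pick_next(tasks):
    """Pick the runnable task with the smallest virtual time."""
    return min(tasks, key=lambda t: t["vtime"])

def charge(task, slice_ns, base_weight=100):
    """Advance vtime inversely proportional to weight."""
    task["vtime"] += slice_ns * base_weight // task["weight"]

tasks = [{"name": "heavy", "weight": 200, "vtime": 0},
         {"name": "light", "weight": 100, "vtime": 0}]

# Over 6 slices, the weight-200 task gets twice the slices of the
# weight-100 task, matching their 2:1 weight ratio.
runs = {"heavy": 0, "light": 0}
for _ in range(6):
    t = pick_next(tasks)
    runs[t["name"]] += 1
    charge(t, slice_ns=1000)
assert runs["heavy"] == 2 * runs["light"]
```

FIFO mode would simply replace pick_next with popping the head of the queue, which is why saturating threads can starve others in that mode.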

Typical Use Case

Though very simple, this scheduler should perform reasonably well on single-socket systems with a uniform L3 cache topology. Note that while running in global FIFO mode may work well for some workloads, saturating threads can easily drown out inactive ones.

Production Ready?

This scheduler could be used in a production environment, assuming the hardware constraints enumerated above, and assuming the workload can accommodate a simple scheduling policy.


scx_qmap

Overview

Another simple, yet slightly more complex scheduler that provides an example of a basic weighted FIFO queuing policy. It also provides examples of some common useful BPF features, such as sleepable per-task storage allocation in the ops.prep_enable() callback, and using the BPF_MAP_TYPE_QUEUE map type to enqueue tasks. It also illustrates how core-sched support could be implemented.
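A hedged sketch of what a weighted FIFO queuing policy built on a few FIFO maps might look like: tasks land in one of a small number of FIFO levels based on weight, and dispatch drains higher levels more often. The queue count, the weight-to-level mapping, and the drain ratios below are all invented for illustration and are not scx_qmap's actual parameters.

```python
from collections import deque

NUM_QUEUES = 3
queues = [deque() for _ in range(NUM_QUEUES)]

def queue_index(weight):
    """Map a task weight to a FIFO level (0 = lowest priority)."""
    return min(weight // 100, NUM_QUEUES - 1)

def enqueue(task):
    queues[queue_index(task["weight"])].append(task)

def dispatch():
    """Drain queue i up to i+1 tasks per round: higher levels get more."""
    out = []
    for i in reversed(range(NUM_QUEUES)):
        for _ in range(i + 1):
            if queues[i]:
                out.append(queues[i].popleft())
    return out

for w, name in [(100, "a"), (250, "b"), (50, "c"), (120, "d")]:
    enqueue({"name": name, "weight": w})

order = [t["name"] for t in dispatch()]
assert order == ["b", "a", "d", "c"]  # highest level first, FIFO within
```

In the real scheduler the per-level queues are BPF_MAP_TYPE_QUEUE maps, which is why a kernel-side FIFO map type is such a natural fit for this policy.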

Typical Use Case

Purely used to illustrate sched_ext features.

Production Ready?

No


scx_central

Overview

A "central" scheduler where scheduling decisions are made from a single CPU. This scheduler illustrates how scheduling decisions can be dispatched from a single CPU, allowing other cores to run with infinite slices, without timer ticks, and without having to incur the overhead of making scheduling decisions.

Typical Use Case

This scheduler could theoretically be useful for any workload that benefits from minimizing scheduling overhead and timer ticks. An example of where this could be particularly useful is running VMs, where running with infinite slices and no timer ticks allows the VM to avoid unnecessary expensive vmexits.

Production Ready?

Not yet. While tasks are run with an infinite slice (SCX_SLICE_INF), they're preempted every 20ms in a timer callback. The scheduler also puts the core scheduling logic inside the central / scheduling CPU's ops.dispatch() path, and does not yet have any kind of priority mechanism.


scx_pair

Overview

A sibling scheduler which ensures that tasks will only ever be co-located on a physical core if they're in the same cgroup. It illustrates how a scheduling policy could be implemented to mitigate CPU bugs, such as L1TF, and also shows how some useful kfuncs such as scx_bpf_kick_cpu() can be utilized.
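The core invariant can be stated as a tiny predicate: a task may run on one hardware sibling of a physical core only if the other sibling is idle or running the same cgroup. The model below is illustrative only; the real scx_pair logic lives in BPF and coordinates the siblings with scx_bpf_kick_cpu().

```python
def can_run(core_state, sibling, task_cgroup):
    """Allow a task onto `sibling` only if the other sibling of the
    physical core is idle (None) or already running the same cgroup."""
    other = core_state[1 - sibling]
    return other is None or other == task_cgroup

core = [None, None]           # cgroup running on each sibling, or None

assert can_run(core, 0, "A")      # core fully idle: anything may run
core[0] = "A"
assert can_run(core, 1, "A")      # same cgroup may share the core
assert not can_run(core, 1, "B")  # different cgroup must wait
```

Because two mutually distrusting cgroups never share a core's L1 cache concurrently, an attack like L1TF cannot leak data between them through the sibling.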

Typical Use Case

While this scheduler is only meant to be used to illustrate certain sched_ext features, with a bit more work (e.g. by adding some form of priority handling inside and across cgroups), it could have been used as a way to quickly mitigate L1TF before core scheduling was implemented and rolled out.

Production Ready?

No


scx_flatcg

Overview

A flattened cgroup hierarchy scheduler. This scheduler implements hierarchical weight-based cgroup CPU control by flattening the cgroup hierarchy into a single layer, by compounding the active weight share at each level. The effect of this is a much more performant CPU controller, which does not need to descend down cgroup trees in order to properly compute a cgroup's share.
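The compounding can be shown with a small worked example (the weights are invented): a leaf's effective share is the product, over each level of the hierarchy, of its weight divided by the sum of the active sibling weights at that level.

```python
def flattened_share(path):
    """path: one (weight, sum_of_active_sibling_weights) pair per level,
    from the root's child down to the leaf cgroup."""
    share = 1.0
    for weight, sibling_sum in path:
        share *= weight / sibling_sum
    return share

# A child holding 100 of 400 active weight (25%), containing a leaf
# holding 200 of 400 active weight (50%): the leaf's flat share is
# 50% of 25% = 12.5%.
share = flattened_share([(100, 400), (200, 400)])
assert abs(share - 0.125) < 1e-9
```

Once every active leaf carries a single compounded weight like this, the scheduler can pick among leaves in one flat layer instead of descending the cgroup tree on every decision.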

Typical Use Case

This scheduler could be useful for any typical workload requiring a CPU controller, but which cannot tolerate the higher overheads of the fair CPU controller.

Production Ready?

Yes, though the scheduler (currently) does not adequately accommodate thundering herds of cgroups. If, for example, many cgroups which are nested behind a low-priority cgroup were to wake up around the same time, they may be able to consume more CPU cycles than they are entitled to.


scx_userland

Overview

A simple weighted vtime scheduler where all scheduling decisions take place in user space. This is in contrast to Rusty, where load balancing lives in user space, but scheduling decisions are still made in the kernel.

Typical Use Case

There are many advantages to writing schedulers in user space. For example, you can use a debugger, you can write the scheduler in Rust, and you can use data structures bundled with your favorite library.

On the other hand, user space scheduling can be hard to get right. You can potentially deadlock due to not scheduling a task that's required for the scheduler itself to make forward progress (though the sched_ext watchdog will protect the system by unloading your scheduler after a timeout if that happens). You also have to bootstrap some communication protocol between the kernel and user space.

A more robust solution to this would be building a user space scheduling framework that abstracts much of this complexity away from you.

Production Ready?

No. This scheduler uses an ordered list for vtime scheduling, and is strictly less performant than just using something like scx_simple. It is purely meant to illustrate that it's possible to build a user space scheduler on top of sched_ext.
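The ordered-list approach mentioned above can be sketched in a few lines (class and field names are invented; this is a model of the data structure, not the actual implementation): every enqueue is a sorted insert keyed on vtime, and picking the next task is just popping the head, which is simple but O(n) per enqueue.

```python
import bisect

class UserlandQueue:
    """Toy ordered-list vtime queue, kept sorted by (vtime, pid)."""

    def __init__(self):
        self._list = []

    def enqueue(self, pid, vtime):
        bisect.insort(self._list, (vtime, pid))  # O(n) sorted insert

    def pick_next(self):
        """Pop and return the pid with the smallest vtime, or None."""
        return self._list.pop(0)[1] if self._list else None

q = UserlandQueue()
q.enqueue(pid=10, vtime=300)
q.enqueue(pid=11, vtime=100)
q.enqueue(pid=12, vtime=200)
assert [q.pick_next() for _ in range(3)] == [11, 12, 10]
```

A production user-space scheduler would want something like a balanced tree or a timing wheel here, which is part of why this example is illustrative rather than performant.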