f8a2445869
The primary scheduling domain represents a group of CPUs in the system where the scheduler will initially attempt to assign tasks. Tasks will only be dispatched to CPUs within this primary domain until they are fully utilized, after which tasks may overflow to other available CPUs.

The primary scheduling domain can be defined using the option `--primary-domain CPUMASK` (by default all the CPUs in the system are used as primary domain).

This change introduces two new special values for the CPUMASK argument:
- `performance`: automatically detect the fastest CPUs in the system and use them as primary scheduling domain,
- `powersave`: automatically detect the slowest CPUs in the system and use them as primary scheduling domain.

The current logic only supports creating two groups: fast and slow CPUs. The fast CPU group is created by excluding CPUs with the lowest frequency from the overall set, which means that within the fast CPU group, CPUs may have different maximum frequencies.

When using the `performance` mode the fast CPUs will be used as primary domain, whereas in `powersave` mode, the slow CPUs will be used instead.

This option is particularly useful in hybrid architectures (with P-cores and E-cores), as it allows the use of bpfland to prioritize task scheduling on either P-cores or E-cores, depending on the desired performance profile.

Example:

- Dell Precision 5480
  - CPU: 13th Gen Intel(R) Core(TM) i7-13800H
    - P-cores: 0-11 / max freq: 5.2GHz
    - E-cores: 12-19 / max freq: 4.0GHz

    $ scx_bpfland --primary-domain performance

    0[||||||||| 24.5%]        10[|||||||| 22.8%]
    1[|||||| 14.9%]           11[||||||||||||| 36.9%]
    2[|||||| 16.2%]           12[ 0.0%]
    3[||||||||| 25.3%]        13[ 0.0%]
    4[||||||||||| 33.3%]      14[ 0.0%]
    5[|||| 9.9%]              15[ 0.0%]
    6[||||||||||| 31.5%]      16[ 0.0%]
    7[||||||| 17.4%]          17[ 0.0%]
    8[|||||||| 23.4%]         18[ 0.0%]
    9[||||||||| 26.1%]        19[ 0.0%]

    Avg power consumption: 3.29W

    $ scx_bpfland --primary-domain powersave

    0[| 2.5%]                 10[ 0.0%]
    1[ 0.0%]                  11[ 0.0%]
    2[ 0.0%]                  12[|||| 8.0%]
    3[ 0.0%]                  13[||||||||||||||||||||| 64.2%]
    4[ 0.0%]                  14[|||||||||| 29.6%]
    5[ 0.0%]                  15[||||||||||||||||| 52.5%]
    6[ 0.0%]                  16[||||||||| 24.7%]
    7[ 0.0%]                  17[|||||||||| 30.4%]
    8[ 0.0%]                  18[||||||| 22.4%]
    9[ 0.0%]                  19[||||| 12.4%]

    Avg power consumption: 2.17W

(Info collected from htop and turbostat)

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
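The fast/slow split described in the commit message above can be illustrated with a small Rust sketch. This is not the scheduler's actual code: `Profile`, `primary_domain`, and the `max_freq_khz` slice are hypothetical names, and how the per-CPU maximum frequencies are discovered is left out. The fallback to all CPUs when every CPU has the same maximum frequency is an assumption of this sketch, not something the commit message states.

```rust
use std::collections::BTreeSet;

/// The two special `--primary-domain` values (hypothetical enum for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Profile {
    Performance,
    Powersave,
}

/// Pick the primary domain from per-CPU maximum frequencies.
///
/// `max_freq_khz[i]` is the maximum frequency of CPU `i` (hypothetical input).
pub fn primary_domain(max_freq_khz: &[u64], profile: Profile) -> BTreeSet<usize> {
    let min_freq = match max_freq_khz.iter().min() {
        Some(&f) => f,
        None => return BTreeSet::new(), // no CPUs, no domain
    };

    // Slow group: CPUs with the lowest maximum frequency.
    // Fast group: everything else, so it may mix different maximum frequencies.
    let (slow, fast): (BTreeSet<usize>, BTreeSet<usize>) =
        (0..max_freq_khz.len()).partition(|&cpu| max_freq_khz[cpu] == min_freq);

    match profile {
        Profile::Powersave => slow,
        // Assumption of this sketch: with uniform frequencies there is no
        // distinct fast group, so fall back to all CPUs (which is `slow` here).
        Profile::Performance if fast.is_empty() => slow,
        Profile::Performance => fast,
    }
}
```

With the frequencies from the example machine (CPUs 0-11 at 5.2GHz, CPUs 12-19 at 4.0GHz), this sketch selects CPUs 0-11 for `Performance` and CPUs 12-19 for `Powersave`.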
scx_bpfland
This is a single user-defined scheduler used within sched_ext, a Linux kernel feature that enables implementing kernel thread schedulers in BPF and loading them dynamically. Read more about sched_ext.
Overview
scx_bpfland: a vruntime-based sched_ext scheduler that prioritizes interactive workloads.
This scheduler is derived from scx_rustland, but it is fully implemented in BPF, with a minimal user-space Rust part that processes command-line options, collects metrics, and logs scheduling statistics. The BPF part makes all the scheduling decisions.
Tasks are categorized as either interactive or regular based on their average rate of voluntary context switches per second. Tasks that exceed a specific voluntary context switch threshold are classified as interactive. Interactive tasks are placed in a higher-priority queue, while regular tasks are placed in a lower-priority queue. Within each queue, tasks are sorted by their weighted runtime: tasks that have a higher weight (priority) or use the CPU for less time (smaller runtime) get a higher position in the queue and are therefore scheduled sooner.
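The classification and ordering rules can be sketched in user-space Rust as follows. This is only an illustration of the idea, not the BPF implementation: the `Task` struct, its field names, the `weighted_runtime` formula (runtime scaled by `100 / weight`, assuming 100 is the default weight), and the threshold parameter are all assumptions made for this example.

```rust
/// Hypothetical task record used only for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Task {
    pub pid: u32,
    /// Scheduling weight; 100 is assumed to be the default priority.
    pub weight: u64,
    /// CPU time consumed so far, in nanoseconds.
    pub runtime_ns: u64,
    /// Average voluntary context switches per second.
    pub nvcsw_per_sec: u64,
}

/// A task is interactive when its voluntary context switch rate
/// exceeds the threshold.
pub fn is_interactive(task: &Task, nvcsw_threshold: u64) -> bool {
    task.nvcsw_per_sec > nvcsw_threshold
}

/// Runtime scaled inversely by weight: a higher weight or a smaller
/// runtime yields a smaller key, hence an earlier position in the queue.
pub fn weighted_runtime(task: &Task) -> u64 {
    task.runtime_ns.saturating_mul(100) / task.weight.max(1)
}

/// Return PIDs in the order they would be picked: the whole interactive
/// queue first, then the regular queue, each sorted by weighted runtime.
pub fn dispatch_order(tasks: &[Task], nvcsw_threshold: u64) -> Vec<u32> {
    let mut sorted: Vec<&Task> = tasks.iter().collect();
    // `false < true`, so interactive tasks (key `false`) sort first.
    sorted.sort_by_key(|t| (!is_interactive(t, nvcsw_threshold), weighted_runtime(t)));
    sorted.iter().map(|t| t.pid).collect()
}
```

Modelling the two queues as a single sort with a `(queue, weighted runtime)` key keeps the sketch short; the real scheduler may well keep separate queues and apply additional rules not described here.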
Moreover, each task gets a time slice budget. When a task is dispatched, it receives a time slice equivalent to the remaining unused portion of its previously allocated time slice (with a minimum threshold applied). This gives latency-sensitive workloads more chances to exceed their time slice when they need to perform short bursts of CPU activity without being interrupted (e.g., real-time audio encoding / decoding workloads).
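The time slice budget rule amounts to a clamped subtraction. A minimal sketch, assuming nanosecond units and hypothetical parameter names (`assigned_ns`, `used_ns`, `min_slice_ns`) that do not come from the scheduler's source:

```rust
/// Time slice granted on the next dispatch: the unused part of the
/// previously assigned slice, but never less than `min_slice_ns`.
pub fn next_time_slice(assigned_ns: u64, used_ns: u64, min_slice_ns: u64) -> u64 {
    // `saturating_sub` yields 0 if the task overran its previous slice.
    let remaining = assigned_ns.saturating_sub(used_ns);
    remaining.max(min_slice_ns)
}
```

For example, a task that used 1 ms of a 5 ms slice would get 4 ms next time, while a task that consumed its whole slice would fall back to the minimum.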
Typical Use Case
Interactive workloads, such as gaming, live streaming, multimedia, real-time audio encoding/decoding, especially when these workloads are running alongside CPU-intensive background tasks.
In this scenario scx_bpfland ensures that interactive workloads maintain a high level of responsiveness.
Production Ready?
The scheduler is based on scx_rustland, implementing nearly the same scheduling algorithm with minor changes and optimizations to be fully implemented in BPF.
Given that the scx_rustland scheduling algorithm has been extensively tested, this scheduler can be considered ready for production use.