Instead of keeping one copy of sched_stats, each stats server session
carries its own so that stats can be generated independently by each
client at any interval. CPU allocation min/max tracking is broken for now.
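
A minimal sketch of why per-session state lets each client pick its own
interval. It assumes the stats are derived from cumulative counters and that
each session remembers the last snapshot it saw; the message above does not
spell this out, and `StatsSession`, `read_counters` and the `dispatched`
counter are hypothetical names used only for illustration:

```python
class StatsSession:
    """One client's view of the stats (hypothetical, not the scx API)."""

    def __init__(self, read_counters):
        # read_counters is an assumed callable returning cumulative counters.
        self._read = read_counters
        self._last = dict(read_counters())

    def delta(self):
        """Return the change since this session last asked, then advance."""
        cur = dict(self._read())
        out = {k: v - self._last.get(k, 0) for k, v in cur.items()}
        self._last = cur
        return out


counters = {"dispatched": 0}
fast_client = StatsSession(lambda: counters)
slow_client = StatsSession(lambda: counters)

counters["dispatched"] += 5
first = fast_client.delta()    # fast client polls after the first burst
counters["dispatched"] += 3
second = fast_client.delta()   # and again after the second burst
total = slow_client.delta()    # slow client polls once, covering both bursts
```

Because neither session resets anything the other depends on, the slow
client still sees the full interval even though the fast client polled twice.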
The primary scheduling domain represents a group of CPUs in the system
where the scheduler will initially attempt to assign tasks. Tasks will
only be dispatched to CPUs within this primary domain until they are
fully utilized, after which tasks may overflow to other available CPUs.
The primary scheduling domain can be defined using the option
`--primary-domain CPUMASK` (by default all the CPUs in the system are
used as primary domain).
This change introduces two new special values for the CPUMASK argument:
- `performance`: automatically detect the fastest CPUs in the system
and use them as primary scheduling domain,
- `powersave`: automatically detect the slowest CPUs in the system and
use them as primary scheduling domain.
The current logic only supports creating two groups: fast and slow CPUs.
The fast CPU group is created by excluding CPUs with the lowest
frequency from the overall set, which means that within the fast CPU
group, CPUs may have different maximum frequencies.
When using the `performance` mode the fast CPUs will be used as primary
domain, whereas in `powersave` mode, the slow CPUs will be used instead.
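
The grouping rule described above can be sketched as follows. This is an
illustration only, not the bpfland code: the function names are hypothetical,
the input is assumed to be a mapping from CPU id to maximum frequency (however
it is obtained), and the fallback used when all CPUs share the same frequency
is an assumption that the message above does not cover:

```python
def split_by_max_freq(max_freq_khz):
    """Split CPUs into (fast, slow) given {cpu_id: max frequency in kHz}."""
    lowest = min(max_freq_khz.values())
    # Slow group: only the CPUs with the lowest maximum frequency.
    slow = sorted(c for c, f in max_freq_khz.items() if f == lowest)
    # Fast group: everything else, so it may mix different frequencies.
    fast = sorted(c for c, f in max_freq_khz.items() if f > lowest)
    return fast, slow


def primary_domain(mode, max_freq_khz):
    """Return the CPUs used as primary domain for the given mode."""
    fast, slow = split_by_max_freq(max_freq_khz)
    if mode == "performance":
        # Assumption: if every CPU has the same frequency, use all of them.
        return fast or slow
    if mode == "powersave":
        return slow
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical 8-CPU system with three distinct maximum frequencies.
freqs = {0: 5_200_000, 1: 5_200_000, 2: 5_000_000, 3: 5_000_000,
         4: 4_000_000, 5: 4_000_000, 6: 4_000_000, 7: 4_000_000}
perf_cpus = primary_domain("performance", freqs)
save_cpus = primary_domain("powersave", freqs)
```

Here CPUs 0-3 land in the fast group even though two of them top out at a
lower frequency than the other two, which is the mixed-frequency case
mentioned above.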
This option is particularly useful in hybrid architectures (with P-cores
and E-cores), as it allows the use of bpfland to prioritize task
scheduling on either P-cores or E-cores, depending on the desired
performance profile.
Example:
- Dell Precision 5480
- CPU: 13th Gen Intel(R) Core(TM) i7-13800H
- P-cores: 0-11 / max freq: 5.2GHz
- E-cores: 12-19 / max freq: 4.0GHz
$ scx_bpfland --primary-domain performance
0[||||||||| 24.5%] 10[|||||||| 22.8%]
1[|||||| 14.9%] 11[||||||||||||| 36.9%]
2[|||||| 16.2%] 12[ 0.0%]
3[||||||||| 25.3%] 13[ 0.0%]
4[||||||||||| 33.3%] 14[ 0.0%]
5[|||| 9.9%] 15[ 0.0%]
6[||||||||||| 31.5%] 16[ 0.0%]
7[||||||| 17.4%] 17[ 0.0%]
8[|||||||| 23.4%] 18[ 0.0%]
9[||||||||| 26.1%] 19[ 0.0%]
Avg power consumption: 3.29W
$ scx_bpfland --primary-domain powersave
0[| 2.5%] 10[ 0.0%]
1[ 0.0%] 11[ 0.0%]
2[ 0.0%] 12[|||| 8.0%]
3[ 0.0%] 13[||||||||||||||||||||| 64.2%]
4[ 0.0%] 14[|||||||||| 29.6%]
5[ 0.0%] 15[||||||||||||||||| 52.5%]
6[ 0.0%] 16[||||||||| 24.7%]
7[ 0.0%] 17[|||||||||| 30.4%]
8[ 0.0%] 18[||||||| 22.4%]
9[ 0.0%] 19[||||| 12.4%]
Avg power consumption: 2.17W
(Info collected from htop and turbostat)
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
While the system is not saturated the scheduler will use the following
strategy to select the next CPU for a task:
- pick the same CPU if it's a full-idle SMT core
- pick any full-idle SMT core in the primary scheduling group that
shares the same L2 cache
- pick any full-idle SMT core in the primary scheduling group that
shares the same L3 cache
- pick the same CPU (ignoring SMT)
- pick any idle CPU in the primary scheduling group that shares the
same L2 cache
- pick any idle CPU in the primary scheduling group that shares the
same L3 cache
- pick any idle CPU in the system
While the system is completely saturated (no idle CPUs available), tasks
will be dispatched on the first CPU that becomes available.
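
The selection order above can be sketched as a list of fallbacks tried in
sequence. This is a simplified model, not the BPF implementation: the `Cpu`
record, the `pick_cpu` helper and the idea of passing idle and primary CPUs as
sets are all assumptions made for illustration, and ties are broken by taking
the first candidate:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cpu:
    cpu_id: int
    core: int  # SMT siblings share the same core id
    l2: int    # id of the L2 cache this CPU belongs to
    l3: int    # id of the L3 cache this CPU belongs to


def pick_cpu(prev, cpus, idle, primary):
    """Return the next CPU id for a task that last ran on `prev`, or None."""
    p = next(c for c in cpus if c.cpu_id == prev)

    def full_idle(c):
        # A core is full-idle when all of its SMT siblings are idle.
        return all(s.cpu_id in idle for s in cpus if s.core == c.core)

    prim = [c for c in cpus if c.cpu_id in primary]
    steps = [
        lambda: [p] if full_idle(p) else [],                         # same CPU, full-idle core
        lambda: [c for c in prim if c.l2 == p.l2 and full_idle(c)],  # full-idle core, same L2
        lambda: [c for c in prim if c.l3 == p.l3 and full_idle(c)],  # full-idle core, same L3
        lambda: [p] if prev in idle else [],                         # same CPU, ignoring SMT
        lambda: [c for c in prim if c.l2 == p.l2 and c.cpu_id in idle],  # idle CPU, same L2
        lambda: [c for c in prim if c.l3 == p.l3 and c.cpu_id in idle],  # idle CPU, same L3
        lambda: [c for c in cpus if c.cpu_id in idle],               # any idle CPU
    ]
    for step in steps:
        candidates = step()
        if candidates:
            return candidates[0].cpu_id
    return None  # saturated: the task waits for the first CPU to become free


# Hypothetical topology: three 2-way SMT cores, the last one outside primary.
topo = [Cpu(0, 0, 0, 0), Cpu(1, 0, 0, 0), Cpu(2, 1, 1, 0),
        Cpu(3, 1, 1, 0), Cpu(4, 2, 2, 1), Cpu(5, 2, 2, 1)]
chosen = pick_cpu(0, topo, idle={0, 2, 3}, primary={0, 1, 2, 3})
```

In this example CPU 0 is idle but its sibling is busy, so the full-idle core
that shares the L3 cache (CPUs 2 and 3) is preferred over staying on CPU 0.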
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
This option prefers little (efficiency) cores over big (performance) cores
when performing core compaction, in order to reduce power consumption.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The changes include 1) chopping down a big function into smaller ones
for readability and maintainability and 2) using the interior mutability
pattern (Cell and RefCell) to avoid unnecessary clone() calls. There
are no functional changes.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Fix the uninitialized variable "layer" in the function match_layer which
caused the compiling process to fail. "layer" is supposed to be the same
as "&layers[layer_id]".
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
- Allow no-value user attributes which are automatically assigned "true"
when specified.
- Make "top" attribute string "true" instead of bool true for consistency.
Testing for existence is always enough for value-less attributes.
- Don't drop leading "_" from user attribute names when storing in dicts.
Dropping makes things more confusing.
- Add "_om_skip" to scx_layered fields which don't work well with OM.
scxstats_to_openmetrics.py is updated accordingly and no longer generates
warnings on those fields.
- Examples and README updated accordingly.
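
The attribute rules above can be illustrated with a small parser. This is a
sketch under stated assumptions, not the scx_stats implementation: the
comma-separated `name` / `name=value` input syntax, the `parse_user_attrs`
helper and the `_example` attribute are hypothetical:

```python
def parse_user_attrs(spec):
    """Parse 'a, _b, c=val' into a dict of string attributes."""
    attrs = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, value = item.partition("=")
        # Value-less attributes get the string "true", not the bool True.
        # The leading "_" of the name is kept as-is.
        attrs[name.strip()] = value.strip() if sep else "true"
    return attrs


attrs = parse_user_attrs("top, _om_skip, _example=value")
is_top = "top" in attrs  # an existence test is enough for value-less attrs
```

Storing the string "true" keeps every attribute value the same type, so
consumers only need to test whether the key is present.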
This is a generic tool to pipe from scx_stats to OpenMetrics. This is a
bare-bones implementation and the current output may not match what scx_layered
was outputting before. Will be updated later.
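
As a rough picture of what such a pipe does, the sketch below renders a flat
set of numeric stats in the OpenMetrics text format. It is not taken from
scxstats_to_openmetrics.py: the `to_openmetrics` helper, the field names, the
choice to export everything as a gauge and the way per-field attributes are
passed in are all assumptions:

```python
def to_openmetrics(stats, field_attrs):
    """Render {name: value} as OpenMetrics text, skipping unsuitable fields."""
    lines = []
    for name, value in stats.items():
        # Fields marked with "_om_skip" are left out (existence test only).
        if "_om_skip" in field_attrs.get(name, {}):
            continue
        # Only plain numbers are exported in this sketch.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    lines.append("# EOF")  # OpenMetrics text output ends with this marker
    return "\n".join(lines) + "\n"


stats = {"total": 10, "busy": 12.5, "layers": {"a": 1}}
text = to_openmetrics(stats, {"layers": {"_om_skip": "true"}})
```

Nested values such as the hypothetical `layers` field would need labels or
name prefixes to be exported properly, which this sketch does not attempt.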
The guide that is currently available for Fedora sched-ext is outdated. To remedy this,
I have opted to update the guide to use CachyOS's kernel that is also available on Fedora.
The scx schedulers that are available on Fedora's repositories are also outdated and don't work
with the current patchset. I have also updated the scheduler installation to use our package
in the CachyOS Addons COPR.
Signed-off-by: Eric Naim <dnaim@proton.me>