Rely on scx_utils::Topology to classify Big, Little and Turbo CPUs.
Moreover, support the special keyword "all" with --primary-domain to
include all the CPUs in the system (default).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Integrate the logic used by scx_bpfland to detect turbo-boosted cores in
Topology.
Also change the logic to detect Big/Little cores in function of
base_frequency, instead of scaling_max_freq, otherwise turbo-boosted
cores in homogeneous systems may be incorrectly classified as Big.
Moreover, introduce the following new methods to Cpu to check for the
core type:
- is_turbo(): return true if the CPU is Turbo, false otherwise
- is_big(): return true if the CPU is either Turbo or Big
- is_little(): return true if the CPU is Little
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When creating the turbo boost scheduling domain, we might use a full CPU
mask (selecting all possible CPUs) to indicate "do not prioritize turbo
boost CPUs" or when all CPUs have the same maximum frequency.
This approach works when the primary domain also contains all the CPUs,
as the complete overlap allows the CPU selection logic to ignore the
turbo boost domain and start picking CPUs directly from the primary
domain.
However, if the primary domain doesn't include all CPUs, the two domains
won't fully overlap, which can lead to the turbo boost domain
incorrectly including all CPUs, thereby negating the restrictions set by
the primary scheduling domain.
To resolve this, an empty CPU mask should be used for the turbo boost
domain when turbo boost CPUs aren't prioritized. If the turbo boost
domain is empty, it should be entirely bypassed, and the selection
should proceed directly to the primary domain.
Reported-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
With an ill combination of old kernel and old LLVM, the BPF verifier
incorrectly detects an infinite loop. After changing the loop with a
constant end, the old verifier can pass the code.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Refactor the code design to make it more suitable as a template for
implementing advanced scheduling policies.
In particular, create separate loops for task consumption and task
dispatching. This will make the scheduler easier to adapt as a
foundation for implementing more complex scheduling policies.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Now it checks an active cpumask within a previous core's compute domain
before checking the full active CPUs.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The BPF verifier in the old kernel gives up to analysis the nested loop
in the consume_task(). We reduce the loop less complex by reducing
LAVD_CPDOM_MAX_DIST from 6 to 4 in order to make the verifier happy.
Note that the theoretical maximum distance is 6 (numa > llc > core type)
but there is no such hardware today, hence reducing it to 6 should be
okay in next few years, when hopefully the verifier becomes smarter.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Updating nr_queue_task every runqueue operation is expensive and
unnecessary. So we do update every system state update interval and use
moving average, which is accurate enough.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Avoid to periodically read the current performance profile from
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference if
it's not available (i.e., with older CPUs or kernels without cpufreq).
This fixes issue #560.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
meson build script was building each rust sub-project under rust/ and
scheds/rust/ separately. This means that each rust project is built
independently which leads to a couple problems - 1. There are a lot of
shared dependencies but they have to be built over and over again for each
proejct. 2. Concurrency management becomes sad - we either have to unleash
multiple cargo builds at the same time possibly thrashing the system or
build one by one.
We've been trying to solve this from meson side in vain. Thankfully, in
issue #546, @vimproved suggested using cargo workspace which makes the
sub-projects share the same target directory and built together by the same
cargo instance while still allowing each project to behave independently for
development and publishing purposes.
Make the following changes:
- Create two cargo workspaces - one under rust/, the other under
scheds/rust/. Each contains all rust projects underneath it.
- Don't let meson descend into rust/. These are libraries used by the rust
schedulers. No need to build them from meson. Cargo will build them as
needed.
- Change the rust_scheds build target to invoke `cargo build` in
scheds/rust/ and let cargo do its thing.
- Remove per-scheduler meson.build files and instead generate custom_targets
in scheds/rust/meson.build which invokes `cargo build -p $SCHED`.
- This changes rust binary directory. Update README and
meson-scripts/install_rust_user_scheds accordingly.
- Remove per-scheduler Cargo.lock as scheds/rust/Cargo.lock is shared by all
schedulers now.
- Unify .gitignore handling.
The followings are build times on Ryzen 3975W:
Before:
________________________________________________________
Executed in 165.93 secs fish external
usr time 40.55 mins 2.71 millis 40.55 mins
sys time 3.34 mins 36.40 millis 3.34 mins
After:
________________________________________________________
Executed in 36.04 secs fish external
usr time 336.42 secs 0.00 millis 336.42 secs
sys time 36.65 secs 43.95 millis 36.61 secs
Wallclock time is reduced 5x and CPU time 7x.
Refactor the code to hide the shutdown handling inside BpfScheduler and
simply use the exited() method to check when the scheduler is stopped.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Three of the reported stats are cumulative. While they obviously can be
processed into delta values, that holds for the other direction too and the
cumulative values are difficult to make intutive sense of. Report interval
delta values instead.
Note that a stats client can reliably build back cumulative values even
under heavy system contention - the delta values reported between two
consecutive reads are guaranteed to be correct regardless of the duration of
the interval.
Use scx_stats instead of prometheus for stats reporting. This has a few
advantages:
- Stats metadata can be defined more succinctly.
- Natural support for nesting statistics which will be useful in making
scheduler components composable.
- Support for multiple programmable readers where each reader can use their
own reading interval.
- Built-in stats help message generation.
- Openmetrics integration is still available through
scx_stats/scripts/scxstats_to_openmetrics.py.
Let's make it a bit easier to use:
- Shorten exported names by changing the prefix from ScxStats to Stats. This
should be distinctive enough and more inline with how most libraries name
their exports.
- Importing the right set of traits can be tricky. Introduce prelude module
so that importing is a bit less painful.
There is no reason to have two separate options for "verbose" and
"debug" mode. Just merge the two and always use "debug". If enabled,
increase verbosity to stdout and enable reporting BPF scheduling events
in debugfs (e.g., /sys/kernel/debug/tracing/trace_pipe).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Since scx_rustland_core enables setting a time slice on a per-task basis
during task dispatch, there's no need to maintain a global time slice in
the BPF component. Instead, a global time slice can simply be managed in
user-space, achieving the same outcome.
Therefore, drop the global slice_us property from BpfScheduler to
simplify the API.
NOTE: if a time slice is not specified for a task, SCX_SLICE_DFL will be
used by default.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Add more comments to make the source code more understandable, so that
it can be easily used as a template for implementing more complex
scheduling policies.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Scheduling sample reporting is switched to use scx_stats. This makes the
scheduler run without making too much noise while still allowing monitoring
on demand. It can also make introspection more dynamic - e.g. it shouldn't
be difficult to add other monitoring commands which take scheduling samples
based on different criteria or add other types of staisitcs.
--nr_sched-samples is replaced with --monitor-nr-samples.