Add a per layer config for different implementations of layer growth
algorithms. Convert the existing default logic into a default layer
growth algorithm and add a linear implementation.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Refactor some BPF code to make verification easier on older kernels.
This is to make it easier to maintain backports.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Aggressively try to keep tasks running on the same CPU / cache / domain,
to achieve higher performance when the system is not over commissioned.
This is done by giving a second chance in ops.enqueue(), in addition to
ops.select_cpu(), to find an idle CPU close to the previously used CPU.
Moreover, even if the task is dispatched to the global DSQs, always try
to check if there is an idle CPU in the primary domain that can
immediately consume the task.
= Results =
This change seems to provide a minor, but consistent, boost of
performance with the CPU-intensive benchmarks from the CachyOS
benchmarks selection [1].
Similar results can also be noticed with some WebGL benchmarks [2], when
system usage is close to its maximum capacity.
Test:
- cachyos-benchmarker
System:
- AMD Ryzen 7 5800X 8-Core Processor
Metrics:
- total time: elapsed time of all benchmarks
- total score: geometric mean of all benchmarks
NOTE: total time is the most relevant, since it gives a measure of the
aggregate performance, while the total score emphasizes more on
performance consistency across all benchmarks.
== Results: summary ==
+-------------------------+---------------------+---------------------+
| Scheduler | Total Time | Total Score |
| | (less = better) | (less = better) |
+-------------------------+---------------------+---------------------+
| EEVDF | 624.44 sec | 123.68 |
| bpfland | 625.34 sec | 122.21 |
| bpfland-task-affinity | 623.67 sec | 122.27 |
+-------------------------+---------------------+---------------------+
== Conclusion ==
With this patch applied, bpfland shows both a better performance and
consistency. Although the gains are small (less than 1%), they are still
significant for this type of benchmark and consistently appear across
multiple runs.
[1] https://github.com/CachyOS/cachyos-benchmarker
[2] https://webglsamples.org/aquarium/aquarium.html
Tested-by: Piotr Gorski < piotr.gorski@cachyos.org >
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When iterating neighbors, the existing code unnecessarily iterates all
the neighbors to the maximum even if there is no neighors. So the fix
escapes early when there is no neighbors.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Ctrl-c wasn't properly handled in the monitoring mode
(`--monitor-sched-samples`), so the scheduler could not be terminated by
pressing ctrl-c. The missing ctrl-c handling is added to the monitor
thread.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Rely on scx_utils::Topology to classify Big, Little and Turbo CPUs.
Moreover, support the special keyword "all" with --primary-domain to
include all the CPUs in the system (default).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Integrate the logic used by scx_bpfland to detect turbo-boosted cores in
Topology.
Also change the logic to detect Big/Little cores in function of
base_frequency, instead of scaling_max_freq, otherwise turbo-boosted
cores in homogeneous systems may be incorrectly classified as Big.
Moreover, introduce the following new methods to Cpu to check for the
core type:
- is_turbo(): return true if the CPU is Turbo, false otherwise
- is_big(): return true if the CPU is either Turbo or Big
- is_little(): return true if the CPU is Little
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
When creating the turbo boost scheduling domain, we might use a full CPU
mask (selecting all possible CPUs) to indicate "do not prioritize turbo
boost CPUs" or when all CPUs have the same maximum frequency.
This approach works when the primary domain also contains all the CPUs,
as the complete overlap allows the CPU selection logic to ignore the
turbo boost domain and start picking CPUs directly from the primary
domain.
However, if the primary domain doesn't include all CPUs, the two domains
won't fully overlap, which can lead to the turbo boost domain
incorrectly including all CPUs, thereby negating the restrictions set by
the primary scheduling domain.
To resolve this, an empty CPU mask should be used for the turbo boost
domain when turbo boost CPUs aren't prioritized. If the turbo boost
domain is empty, it should be entirely bypassed, and the selection
should proceed directly to the primary domain.
Reported-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
With an ill combination of old kernel and old LLVM, the BPF verifier
incorrectly detects an infinite loop. After changing the loop with a
constant end, the old verifier can pass the code.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Refactor the code design to make it more suitable as a template for
implementing advanced scheduling policies.
In particular, create separate loops for task consumption and task
dispatching. This will make the scheduler easier to adapt as a
foundation for implementing more complex scheduling policies.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Now it checks an active cpumask within a previous core's compute domain
before checking the full active CPUs.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The BPF verifier in the old kernel gives up to analysis the nested loop
in the consume_task(). We reduce the loop less complex by reducing
LAVD_CPDOM_MAX_DIST from 6 to 4 in order to make the verifier happy.
Note that the theoretical maximum distance is 6 (numa > llc > core type)
but there is no such hardware today, hence reducing it to 6 should be
okay in next few years, when hopefully the verifier becomes smarter.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Updating nr_queue_task every runqueue operation is expensive and
unnecessary. So we do update every system state update interval and use
moving average, which is accurate enough.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Avoid to periodically read the current performance profile from
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference if
it's not available (i.e., with older CPUs or kernels without cpufreq).
This fixes issue #560.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
meson build script was building each rust sub-project under rust/ and
scheds/rust/ separately. This means that each rust project is built
independently which leads to a couple problems - 1. There are a lot of
shared dependencies but they have to be built over and over again for each
proejct. 2. Concurrency management becomes sad - we either have to unleash
multiple cargo builds at the same time possibly thrashing the system or
build one by one.
We've been trying to solve this from meson side in vain. Thankfully, in
issue #546, @vimproved suggested using cargo workspace which makes the
sub-projects share the same target directory and built together by the same
cargo instance while still allowing each project to behave independently for
development and publishing purposes.
Make the following changes:
- Create two cargo workspaces - one under rust/, the other under
scheds/rust/. Each contains all rust projects underneath it.
- Don't let meson descend into rust/. These are libraries used by the rust
schedulers. No need to build them from meson. Cargo will build them as
needed.
- Change the rust_scheds build target to invoke `cargo build` in
scheds/rust/ and let cargo do its thing.
- Remove per-scheduler meson.build files and instead generate custom_targets
in scheds/rust/meson.build which invokes `cargo build -p $SCHED`.
- This changes rust binary directory. Update README and
meson-scripts/install_rust_user_scheds accordingly.
- Remove per-scheduler Cargo.lock as scheds/rust/Cargo.lock is shared by all
schedulers now.
- Unify .gitignore handling.
The followings are build times on Ryzen 3975W:
Before:
________________________________________________________
Executed in 165.93 secs fish external
usr time 40.55 mins 2.71 millis 40.55 mins
sys time 3.34 mins 36.40 millis 3.34 mins
After:
________________________________________________________
Executed in 36.04 secs fish external
usr time 336.42 secs 0.00 millis 336.42 secs
sys time 36.65 secs 43.95 millis 36.61 secs
Wallclock time is reduced 5x and CPU time 7x.
Refactor the code to hide the shutdown handling inside BpfScheduler and
simply use the exited() method to check when the scheduler is stopped.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Three of the reported stats are cumulative. While they obviously can be
processed into delta values, that holds for the other direction too and the
cumulative values are difficult to make intutive sense of. Report interval
delta values instead.
Note that a stats client can reliably build back cumulative values even
under heavy system contention - the delta values reported between two
consecutive reads are guaranteed to be correct regardless of the duration of
the interval.
Use scx_stats instead of prometheus for stats reporting. This has a few
advantages:
- Stats metadata can be defined more succinctly.
- Natural support for nesting statistics which will be useful in making
scheduler components composable.
- Support for multiple programmable readers where each reader can use their
own reading interval.
- Built-in stats help message generation.
- Openmetrics integration is still available through
scx_stats/scripts/scxstats_to_openmetrics.py.
Let's make it a bit easier to use:
- Shorten exported names by changing the prefix from ScxStats to Stats. This
should be distinctive enough and more inline with how most libraries name
their exports.
- Importing the right set of traits can be tricky. Introduce prelude module
so that importing is a bit less painful.
There is no reason to have two separate options for "verbose" and
"debug" mode. Just merge the two and always use "debug". If enabled,
increase verbosity to stdout and enable reporting BPF scheduling events
in debugfs (e.g., /sys/kernel/debug/tracing/trace_pipe).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Since scx_rustland_core enables setting a time slice on a per-task basis
during task dispatch, there's no need to maintain a global time slice in
the BPF component. Instead, a global time slice can simply be managed in
user-space, achieving the same outcome.
Therefore, drop the global slice_us property from BpfScheduler to
simplify the API.
NOTE: if a time slice is not specified for a task, SCX_SLICE_DFL will be
used by default.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Add more comments to make the source code more understandable, so that
it can be easily used as a template for implementing more complex
scheduling policies.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Scheduling sample reporting is switched to use scx_stats. This makes the
scheduler run without making too much noise while still allowing monitoring
on demand. It can also make introspection more dynamic - e.g. it shouldn't
be difficult to add other monitoring commands which take scheduling samples
based on different criteria or add other types of staisitcs.
--nr_sched-samples is replaced with --monitor-nr-samples.
The update_tasks() API is somewhat confusing, so replace it with a
clearer API, notify_complete().
This new API will return control to the BPF component and inform it
about the number of tasks still pending in the user-space scheduler.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The low-power API is a bit of a hack implemented purely in the BPF
layer, this should be better re-implemented with some concepts of
topology awareness.
Therefore, get rid of this API for now.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The current API used to notify the user-space scheduler when a task
exits is really confusing (setting a negative value in
queued_task_ctx.cpu), and it's also possible to detect task exiting
events from user-space (or check in procfs, even if it's slower).
In any case, a better API should be provided for this, so drop the
current one for now.
NOTE: this will cause additional memory usage for scx_rustland, but it
can be fixed/addressed later in a separate commit (i.e., providing a
periodic garbage collector for the unused task entries).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Allow user-space scheduler to pick an idle CPU via
self.bpf.select_cpu(pid, prev_task, flags), mimicking the BPF's
select_cpu() iterface.
Also remove the full_user option and always rely on the idle selection
logic from user-space.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Keep evaluating the average number of voluntary context switches for
each task when lowlatency mode is enabled, even when interactive tasks
classification is disabled (via `-c 0`).
The average nvcsw is also used in lowlatency mode to evaluate the
proportional bonus to the tasks' deadline and it shouldn't be ignored
when interactive tasks classification is disabled. Moreover, make sure
that such bonus never exceeds the starvation threshold.
Keep in mind that it is still possible to disable the periodic average
nvcsw evaluation with `-c 0`, without specifying `--lowlatency`.
Fixes: 6a22853 ("scx_bpfland: introduce --lowlatency option")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
A lot of scx_lavd's options do not clearly explain what they do. Add
some short explanations, clean up the existing ones, and direct the user
to read the in-code documentation for more info.
And move related ops into it. This is a bit more natural and will also allow
doing other operaitons (e.g. describing stats) without launching the server.
Make `--primar-domain auto` aware of turbo boosted CPUs and prioritize
them over the primary scheduling domain when the energy model
`balance_power` is used (typically when running on battery power with
the "balanced" profile).
With this change the scheduling hierarchy becomes the following:
1) CPUs in the turbo scheduling domain
2) CPUs in the primary scheduling domain
3) full-idle SMT CPUs
4) CPUs in the same L2 cache
5) CPUs in the same L3 cache
6) CPUs in the task's allowed domain
And the idle selection logic is modified as following:
- In the turbo scheduling domain:
- pick same full-idle SMT CPU
- pick any other full-idle SMT CPU sharing the same L2 cache
- pick any other full-idle SMT CPU sharing the same L3 cache
- pick any other full-idle SMT CPU
- pick same idle CPU
- pick any other idle CPU sharing the same L2 cache
- pick any other idle CPU sharing the same L3 cache
- pick any other idle SMT CPU
- In the primary scheduling domain:
- pick same full-idle SMT CPU
- pick any other full-idle SMT CPU sharing the same L2 cache
- pick any other full-idle SMT CPU sharing the same L3 cache
- pick any other full-idle SMT CPU
- pick same idle CPU
- pick any other idle CPU sharing the same L2 cache
- pick any other idle CPU sharing the same L3 cache
- pick any other idle SMT CPU
- In the entire task domain:
- pick any other idle CPU
Keep in mind that the turbo domain will be evaluated only when the
scheduler is started with `--primary-domain auto` and only when the
`balance_power` energy profile is used.
The turbo domain is always made using the subset of CPUs in the system
with the highest max frequency. If such subset can't be determined (for
example if all the CPUs in the primary domain have all the same
frequency), the turbo domain will be ignored.
Prioritizing turbo boosted CPUs can help to improve performance by
forcing the governor to scale up their frequency, without increasing too
much power consumption, due to the fact that tasks will be preferably
confined into a reduced amount of cores.
This change seems to improve performance, without increasing much
power consuption, on Intel laptops while using the `balanced_power`
energy profile.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Introduce the new option `--primary-domain auto`. With this option the
scheduler will dynamically adjusts the primary scheduling domain at
run-time, in function of the current energy profile reported in
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference.
When the `power` energy profile is selected, the primary scheduling
domain will prioritize E-cores. Alternatively, when the `performance`
profile is selected, it will prioritize P-cores. For all the other
energy profiles, all the CPUs in the system will be used.
Note that this option is only relevant on hybrid architectures with
P-cores and E-cores.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Introduce the new `--lowlatency` option, which enables switching between
the default pure vruntime-based scheduling (more optimized for server
workloads) and a deadline-based scheduling (better suited for
low-latency workloads).
When the low-latency mode is activated, a task's deadline is calculated
as its vruntime, adjusted by a bonus proportional to the task's average
number of voluntary context switches (the more voluntary context
switches, the shorter the deadline).
This feature enhances the prioritization of interactive tasks even more,
proportionally to their average voluntary context switches, also within
the two main global queues (priority / shared) and it helps to maintain
interactive workloads always responsive, even in presence of heavy
non-interactive background work.
Low-latency mode allows to prevent audio cracking even in presence of a
large amount of short-lived tasks with pseudo-interactive behavior (i.e,
hackbench) and it enables achieving approximately a +33% average
frames-per-second (FPS) in the typical "gaming while building the
kernel" benchmark.
However, it can also amplify the de-prioritization of CPU-intensive
tasks, making this option more suitable for specific low-latency
scenarios. Therefore the low-latency mode is disabled by default and it
can only be enabled via the `--lowlatency` option.
Tested-by: Piotr Gorski (piotrgorski@cachyos.org)
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Explicitly replenish the task's time slice from ops.dispatch() if the
task still wants to run and no other task is selected. In this way the
sched_ext core won't automatically re-schedule the task on the same CPU,
implicitly assigning a time slice of SCX_SLICE_DFL.
Moreover, instead of determining the task time slice in ops.enqueue(),
refresh the time slice immediately before the task is started on its
assigned CPU in ops.running().
This allows to use a more precise time slice, adjusted based on the
actual amount of tasks that are currently waiting to be scheduled.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The meaning of SCX_OPS_ENQ_LAST will change with future kernel updates and
enqueueing on local DSQ will no longer be sufficient to avoid stalls. No
reason to do it anyway. Just drop it.
With the global scx_utils::NR_CPU_IDS we don't need Topology anymore in
init_primary_domain(), so drop the variable to fix the following build
warning:
warning: unused variable: `topo`
--> src/main.rs:385:9
|
385 | topo: &Topology,
| ^^^^ help: if this is intentional, prefix it with an underscore: `_topo`
|
= note: `#[warn(unused_variables)]` on by default
Fixes: 1da249f ("scx_utils::topology: Always use NR_CPU_IDS and NR_CPUS_POSSIBLE")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Use the base frequency, instead of maximum frequency, to classify fast
and slow CPUs. This ensures accurate distinction between Intel Turbo
Boost CPUs and genuinely faster CPUs when auto-detecting the primary
scheduling domain.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
- Update scx_utils/build.rs so that 12 char SHA1 is generated instead of
full one.
- Add --version to scx_rusty. Use custom one as we don't want to use the
default cargo version one.
Tasks enqueued with SCX_ENQ_WAKEUP are immediately classified as
interactive. However, if interactive tasks classification is disabled
(via `-c 0`), we should avoid promoting them as interactive.
This is particularly important because, with the nvcsw logic disabled,
tasks can remain classified as interactive indefinitely and they will
never be demoted to regular tasks.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Rely on scx_utils::Cpumask instead of re-implementing a custom struct to
parse and manage CPU masks.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Fix a bug introduced in #510 where it assumed core ids are incremental.
This refactors the core ordering for layers to be far more simple and
provide some space for layer core isolation in low utilization.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Rely on scx_utils::Topology to get CPU and cache information, instead of
re-implementing custom methods.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Currently the core selection logic in scx_layered uses the first
available core in the bitmask. This is suboptimal when the scheduler is
configured with specific NUMA/LLC restrictions. The ideal core selection
logic should try to find the least used cores within the preferred
scheduling domain and allocate new cpus from shared cores within that
domain.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
- If --monitor is specified with layer specs, the scheduler also starts
stats monitoring on a thread.
- Standalone monitoring mode no longer exits when the scheduler isn't there.
Instead of keeping one copy of sched_stats, each stats server session
carries their own so that stats can be generated independently by each
client at any interval. CPU allocation min/max tracking is broken for now.
The primary scheduling domain represents a group of CPUs in the system
where the scheduler will initially attempt to assign tasks. Tasks will
only be dispatched to CPUs within this primary domain until they are
fully utilized, after which tasks may overflow to other available CPUs.
The primary scheduling domain can defined using the option
`--primary-domain CPUMASK` (by default all the CPUs in the system are
used as primary domain).
This change introduces two new special values for the CPUMASK argument:
- `performance`: automatically detect the fastest CPUs in the system
and use them as primary scheduling domain,
- `powersave`: automatically detect the slowest CPUs in the system and
use them as primary scheduling domain.
The current logic only supports creating two groups: fast and slow CPUs.
The fast CPU group is created by excluding CPUs with the lowest
frequency from the overall set, which means that within the fast CPU
group, CPUs may have different maximum frequencies.
When using the `performance` mode the fast CPUs will be used as primary
domain, whereas in `powersave` mode, the slow CPUs will be used instead.
This option is particularly useful in hybrid architectures (with P-cores
and E-cores), as it allows the use of bpfland to prioritize task
scheduling on either P-cores or E-cores, depending on the desired
performance profile.
Example:
- Dell Precision 5480
- CPU: 13th Gen Intel(R) Core(TM) i7-13800H
- P-cores: 0-11 / max freq: 5.2GHz
- E-cores: 12-19 / max freq: 4.0GHz
$ scx_bpfland --primary-domain performance
0[||||||||| 24.5%] 10[|||||||| 22.8%]
1[|||||| 14.9%] 11[||||||||||||| 36.9%]
2[|||||| 16.2%] 12[ 0.0%]
3[||||||||| 25.3%] 13[ 0.0%]
4[||||||||||| 33.3%] 14[ 0.0%]
5[|||| 9.9%] 15[ 0.0%]
6[||||||||||| 31.5%] 16[ 0.0%]
7[||||||| 17.4%] 17[ 0.0%]
8[|||||||| 23.4%] 18[ 0.0%]
9[||||||||| 26.1%] 19[ 0.0%]
Avg power consumption: 3.29W
$ scx_bpfland --primary-domain powersave
0[| 2.5%] 10[ 0.0%]
1[ 0.0%] 11[ 0.0%]
2[ 0.0%] 12[|||| 8.0%]
3[ 0.0%] 13[||||||||||||||||||||| 64.2%]
4[ 0.0%] 14[|||||||||| 29.6%]
5[ 0.0%] 15[||||||||||||||||| 52.5%]
6[ 0.0%] 16[||||||||| 24.7%]
7[ 0.0%] 17[|||||||||| 30.4%]
8[ 0.0%] 18[||||||| 22.4%]
9[ 0.0%] 19[||||| 12.4%]
Avg power consumption: 2.17W
(Info collected from htop and turbostat)
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
While the system is not saturated the scheduler will use the following
strategy to select the next CPU for a task:
- pick the same CPU if it's a full-idle SMT core
- pick any full-idle SMT core in the primary scheduling group that
shares the same L2 cache
- pick any full-idle SMT core in the primary scheduling grouop that
shares the same L3 cache
- pick the same CPU (ignoring SMT)
- pick any idle CPU in the primary scheduling group that shares the
same L2 cache
- pick any idle CPU in the primary scheduling group that shares the
same L3 cache
- pick any idle CPU in the system
While the system is completely saturated (no idle CPUs available), tasks
will be dispatched on the first CPU that becomes available.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
This option chooses little (effiency) cores over big (performance) cores
to save power consumption for core compaction.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The changes include 1) chopping down a big function into smaller ones
for readability and maintainability and 2) using the interior mutability
pattern (Cell and RefCell) to avoid unnecessary clone() calls. There
are no functional changes.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Fix the uninitialized variable "layer" in the function match_layer which
caused the compiling process to fail. "layer" is supposed to be the same
as "&layers[layer_id]".
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
- Allow no-value user attributes which are automatically assigned "true"
when specified.
- Make "top" attribute string "true" instead of bool true for consistency.
Testing for existence is always enough for value-less attributes.
- Don't drop leading "_" from user attribute names when storing in dicts.
Dropping makes things more confusing.
- Add "_om_skip" to scx_layered fields which don't jive well with OM.
scxstats_to_openmetrics.py is updated accordignly and no longer generates
warnings on those fields.
- Examples and README updated accordingly.
After updating scx_layered to be topology aware the nr_cpus field on the
layer was not being updated properly. Update layer growing/shrinking
logic to correctly update the nr_cpus count.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
- This makes the scheduler side simpler and allows on-demand monitoring.
- OpenMetrics support is dropped for now. Will add a generic tool for it.
- This is a naive conversion. Will be further refined.
scx_layered no longer prints statistics by default. To watch statistics, run
`scx_layered --monitor` while the scheduler is running.
Allow to specify a primary scheduling domain via the new command line
option `--primary-domain CPUMASK`, where CPUMASK can be a hex number of
arbitrary length, representing the CPUs assigned to the domain.
If this option is not specified the scheduler will use all the available
CPUs in the system as primary domain (no behavior change).
Otherwise, if a primary scheduling domain is defined, the scheduler will
try to dispatch tasks only to the CPUs assigned to the primary domain,
until these CPUs are saturated, at which point tasks may overflow to
other available CPUs.
This feature can be used to prioritize certain cores over others and it
can be really effective in systems with heterogeneous cores (e.g.,
hybrid systems with P-cores and E-cores).
== Example (hybrid architecture) ==
Hardware:
- Dell Precision 5480 with 13th Gen Intel(R) Core(TM) i7-13800H
- 6 P-cores 0..5 with 2 CPUs each (CPU from 0..11)
- 8 E-cores 6..13 with 1 CPU each (CPU from 12..19)
== Test ==
WebGL application (https://webglsamples.org/aquarium/aquarium.html):
this allows to generate a steady workload in the system without
over-saturating the CPUs.
Use different scheduler configurations:
- EEVDF (default)
- scx_bpfland using P-cores only (--primary-domain 0x00fff)
- scx_bpfland using E-cores only (--primary-domain 0xff000)
Measure performance (fps) and power consumption (W).
== Result ==
+-----+-----+------+-----+----------+
| min | max | avg | | |
| fps | fps | fps | stdev | power |
+-----------------+-----+-----+------+-------+--------+
| EEVDF | 28 | 34 | 31.0 | 1.73 | 3.5W |
| bpfland-p-cores | 33 | 34 | 33.5 | 0.29 | 3.5W |
| bpfland-e-cores | 25 | 26 | 25.5 | 0.29 | 2.2W |
+-----------------+-----+-----+------+-------+--------+
Using a primary scheduling domain of only P-cores with scx_bpfland
allows to achieve a more stable and predictable level of performance,
with an average of 33.5 fps and an error of ±0.5 fps.
In contrast, using EEVDF results in an average frame rate of 31.0 fps
with an error of ±3.0 fps, indicating slightly less consistency, due to
the fact that tasks are evenly distributed across all the cores in the
system (both slow and fast cores).
On the other hand, using a scheduling domain solely of E-cores with
scx_bpfland results in a lower average frame rate (25.5 fps), though it
maintains a stable performance (error of ±0.5 fps), but the power
consumption is also reduced, averaging 2.2W, compared to 3.5W with
either of the other configurations.
== Conclusion ==
In summary, with this change users have the flexibility to prioritize
scheduling on performance cores for better performance and consistency,
or prioritize energy efficient cores for reduced power consumption, on
hybrid architectures.
Moreover, this feature can also be used to minimize the number of cores
used by the scheduler, until they reach full capacity. This capability
can be useful for reducing power consumption even in homogeneous systems
or for conducting scheduling experiments with smaller sets of cores,
provided the system is not overcommitted.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Abbreviate the statistics reported to stdout and remove the slice_ms
metric: this metric can be easily derived from slice_ns, slice_ns_min
and nr_wait, which is already reported to stdout.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Layer matching currently takes a large number of bpf instructions.
Moving layer matching to a global function will reduce the overall
instruction count and allow for other layer matching methods such as
glob.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Put a performance-critical task to a performance critical queue and a
regular task to a regular queue.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The logic of tasks filtering were moved from find_first_candidate() into
a vector filter operation in commit 1c3b563. However, it was forgotten
to transfer the logic with "NOT" since now .filter() will populate the
tasks we want, rather than .skip_while() which was throwing unwanted
tasks out.
That's why the logic here should be reverse so we won't take kworker or
migrated tasks into considerations.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
The member "topo_map" in Scheduler is never used and thus should be
removed, the related imports are removed as well.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Re-add the partial mode option that was dropped during the refactoring.
The partial option allows to apply the scheduler only to the tasks which
have their scheduling policy set to SCHED_EXT via sched_setscheduler().
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The API for determining which PID is running on a specific CPU is racy
and is unnecessary since this information can be obtained from user
space.
Additionally, it's not reliable for identifying idle CPUs. Therefore,
it's better to remove this API and, in the future, provide a cpumask
alternative that can export the idle state of the CPUs to user space.
As a consequence also change scx_rustland to dispatch one task a time,
instead of dispatching tasks in batches of idle cores (that are usually
not accurate due to the racy nature of the CPU ownership interaface).
Dispatching one task at a time even makes the scheduler more performant,
due to the vruntime scheduling being applied to more tasks sitting in
the scheduler's queue.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Drop the slice boost logic and apply a vruntime and task time slice
evaluation approach similar to scx_bpfland (but implement this in the
user-space component instead of the BPF part).
Additionally, introduce a slice_us_min parameter to define the minimum
time slice that can be assigned to a task, also similar to scx_bpfland.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Use the same idle selection logic used in scx_bpfland also in
scx_rustland_core.
Also drop fifo_mode and always use the BPF idle selection logic by
default as long as the system is not saturated, unless full_user is
specified.
This approach allows user-space schedulers aiming for maximum
performance to leverage the BPF idle selection logic (bypassing
user-space), while those seeking full control can enable full_user to
bypass the BPF CPU idle selection logic and choose the target CPU for
each task from user-space.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
We don't need to send the number of voluntary context switches (nvcsw)
from BPF to user-space, as this information is already accessible in
user-space via procfs. Sending this data would only create unnecessary
overhead for schedulers that don't require it, and those that do can
easily retrieve it through procfs.
Therefore, drop this metric from scx_rustland_core and change
scx_rustland implementing an interactive task classifier fully in the
user-space part of the scheduler.
Also drop some options that are not provide any significant benefit
(also in preparation of a bigger refactoring to define a better API for
the user-space framework).
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
- Use .enumerate() consistently while building the cpu_fids vector.
- Use .then_with() to chain .cmp() when sorting cpu_fids.
Both reduce visual clutter.
Add a parameter to disable topology awareness. This is useful when
trying to compare the scheduling performance of topology aware
scheduling compared to the previous scheduling strategy.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
With optimizations of calculatring ineligibility duration, now the
scheduler works well under heavy load without 2-level scheduling, so we
drop it for simplicitiy.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
This commit include a few changes:
- treat a new forked task more conservatively
- defer the execution of more tasks for longer time using ineligibility duration
- consider if a task is waken up in calculating ineligibility duration
Immediately re-align p->scx.dsq_vtime to the global vruntime (+/- slice
lag) as soon as we are evaluating the task's vruntime.
This allows rapidly chase the minimum global vruntime, ensuring to not
over prioritize tasks tasks with a predominantly sleeping behavior
pattern.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
When the previous CPU for a task is not known do not fall back to
dispatching to CPU 0, use the current CPU.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
L or R: Latency-critical, Regular
H or I: performance-Hungry, performance-Insensitive
B or T: Big, liTtle
E or G: Eligible, Greedy
P or N: Preemption, Not
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Add a per cpu counter offset to round robin when iterating on layers.
This is to make selection from different layers more fair.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Tuning the time slice under high load and change the kick/tick margins
for preemption more conservative. Especially, aggressive IPI-based
preemption (kick) causes performance unstability.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Instead of using coarse-grained log(), let's directly use the ratio of
task's service time. Also, the virtual dealine equation is also updated
to reflect this change.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The max_entries parameter in BPF_MAP_TYPE_PERCPU_ARRAY defines the
number of values per CPU and for cpu_ctx_stor we only need one item: the
CPU context.
Set max_entries to 1 to avoid allocating unnecessary memory and slightly
reduce the memory footprint.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
We introduce two-level scheduling similar to scx_bpfland. The two-level
scheduling consists of two DSQs: 1) latency-critical run queue and 2)
regular run queue. The scheduler prioritizes scheduling tasks on the
latency-critical queue but makes its best effort to schedule tasks on
the regular queue. The scheduler could be more resilient under heavy
load by segregating regular, non-latency-critical tasks from
latency-critical tasks.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The max frequency information from topology (from sysfs) seems not
always true. In some installations, it returns zero for all CPUs. In
this case, let's just consider all CPUs have the same capacity (1024),
hoping the kernel can give more preceise information.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Latency criticality is a task's inherent property, but the starvation
factor is its dynamic status for the urgency of scheduling. Hence, we
segregate the starvation factor out. Also, cleaned up unnecessary
arguments and struct fields related.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a task is running on more performant core, the scheduler will give
a longer time slice. On the other hand, on a less performant core, a
shorter time slice will be assigned. The longer time slice helps
boosting clock frequency on a performant core. Also, the shorter time
slice gives more chance the performant core being utilized.
Regarding the CPU capacity, we first check if kernel-provided capacitiy values
are trustworthy or not. If not (i.e., all the same values), we rely on
the user-provided value, based on each CPU's maximum clock frequency.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
With the --prefer-smt-core option is on, the core compaction prefers to
utilizae hyper-twin first before utilizing the other physical CPUs. By
default, the option is off.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Previously, the core compaction assumed that each core's capacity was
the same. Now, we additionally consider each core's max clock frequency.
So, it always tries to use the higher-frequency cores first.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Remove unused constants and rename outdated constants to proper names
(LAVD_TC_* to LAVC_CC_* and LAVD_ELIGIBLE_DSQ to LAVD_GLOBAL_DSQ).
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Using negative values with --slice-us-lag can be useful to make
performance more consistent and prioritize newly created tasks over the
running tasks.
Therefore, allow to specify negative values from the command line and
also update the documentation of this option.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
In some scenarios, a CPU-intensive task may be on the critical path for
interactive workloads. For example, you may have a game with CPU-intensive
tasks that are crunching the logic for the game, and that's required for the
game to proceed without being choppy.
To support such workflows, this change adds logic to allow a non-interactive
task to inherit the lower (i.e. stronger) latency priority of another task if
it wakes or is woken by that task.
Signed-off-by: David Vernet <void@manifault.com>
Currently, a task's deadline is computed as its vtime + a scaled function of
its average runtime (with its deadline being scaled down if it's more
interactive). This makes sense intuitively, as we do want an interactive task
to have an earlier deadline, but it also has some flaws.
For one thing, we're currently ignoring duty cycle when determining a task's
deadline. This has a few implications. Firstly, because we reward tasks with
higher waker and blocked frequencies due to considering them to be part of a
work chain, we implicitly penalize tasks that rarely ever use the CPU because
those frequencies are low. While those tasks are likely not part of a work
chain, they also should get an interactivity boost just by pure virtue of not
using the CPU very often. This should in theory be addressed by vruntime, but
because we cap the amount of vtime that a task can accumulate to one slice, it
may not be adequately reflected after a task runs for the first time.
Another problem is that we're minimizing a task's deadline if it's interactive,
but we're also not really penalizing a task that's a super CPU hog by
increasing its deadline. We sort of do a bit by applying a higher niceness
which gives it a higher deadline for a lower weight, but its somewhat minimal
considering that we're using niceness, and that the best an interactive task
can do is minimize its deadline to near zero relative to its vtime.
What we really want to do is "negatively" scale an interactive task's deadline
with the same magnitude as we "positively" scale a CPU-hogging task's deadline.
To do this, we make two major changes to how we compute deadline:
1. Instead of using niceness, we now instead use our own straightforward
scaling factor. This was chosen arbitrarily to be a scaling by 1000, but we
can and should improve this in the future.
2. We now create a _signed_ linear latency priority factor as a sum of the
three following inputs:
- Work-chain factor (log_2 of product of blocked freq and waker freq)
- Inverse duty cycle factor (log_2 of the inverse of a task's duty cycle --
higher duty cycle means lower factor)
- Average runtime factor (Higher avg runtime means higher average runtime
factor)
We then compute the latency priority as:
lat_prio := Average runtime factor - (work-chain factor + duty cycle factor)
This gives us a signed value that can be negative. With this, we can compute a
non-negative weight value by calculating a weight from the absolute value of
lat_prio, and use this to scale slice_ns. If lat_prio is negative we calculate
a task's deadline as its vtime MINUS its scaled slice_ns, and if it's positive,
it's the task's vtime PLUS scaled slice_ns.
This ends up working well because you get a higher weight both for highly
interactive tasks, and highly CPU-hogging / non-interactive tasks, which lets
you scale a task's deadline "more negatively" for interactive tasks, and "more
positively" for the CPU hogs.
With this change, we get a significant improvement in FPS. On a 7950X, if I run
the following workload:
$ stress-ng -c $((8 * $(nproc)))
1. I get 60 FPS when playing Stellaris (while time is progressing at max
speed), whereas EEVDF gets 6-7 FPS.
2. I get ~15-40 FPS while playing Civ6, whereas EEVDF seems to get < 1 FPS. The
Civ6 benchmark doesn't even start after over 4 minutes in the initial frame
with EEVDF, but gets us 13s / turn with rusty.
3. It seems that EEVDF has improved with Terraria in v6.9. It was able to
maintain ~30-55 FPS, as opposed to the ~5-10FPS we've seen in the past.
rusty is still able to maintain a solid 60-62FPS consistently with no
problem, however.
Add a layer match based on either the effective user id or the effective
group id. This allows for creating layers for individual users or
groups.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add NUMA node topology awareness for scx_layared. This borrows some of
the NUMA handling from scx_rusty and allows layers to set a node mask.
Different layer kinds will use the node mask differently.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Simplify LoadBalancer::populate_tasks_by_load() by cutting out the
heap allocation bits, by moving mutable accesses in front of immutable
ones. Because multiple immutable accesses (between bss and rodata) do
not conflict, we don't need the intermediate PID storage.
Signed-off-by: Daniel Müller <deso@posteo.net>
Periodically report to stdout samples of the effective time slice
applied to tasks.
While one could determine this metric by examining the max slice_ns and
nr_waiting metrics, directly reporting it to stdout allows users to
quickly identify what is happening and it provides a clearer overview of
the scheduling behavior.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Dispatching per-CPU kthreads directly is disabled by default, reporting
this metric can generate some confusion (since it is always 0), and even
if local kthread dispatches are enabled, they should be still considered
as regular direct dispatches (there is no difference in practice).
Therefore, merge direct kthread dispatches into direct dispatches and
drop the separate nr_kthread_dispatches metric.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Scale the task's time slice based on the average amount of tasks that
are currently waiting to be dispatched.
Use a moving average for the amount of waiting tasks to smooth out
potential spikes caused by temporary bursts of tasks piling in the wait
queues.
This was initially modeled in scx_rustland and it seems to work pretty
well also in scx_bpfland now.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
With all the other optimizations and tunings, it turns out that maintaining
two runqueues has more harm than good.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Further depenalize above-average latency-critical tasks and penalize
further below-avergage latency-critical tasks in ineligibility duration.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
LAVD_VDL_LOOSENESS_FT represents how loose the deadline is. The smaller
value means the deadline is tighter. While it is unlikely to be tuned,
let's keep it as a tunable for now.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Non-kthreads with custom affinities in non-open layers are dispatched into a
LO_FALLBACK_DSQ, with the idea being that they're penalized for their custom
affinities. When a host is fully utilized, these tasks can end up being starved
due to LO_FALLBACK_DSQ being consumed only when there are no other layers to
consume from. In internal workloads at Meta, we've observed that this can
happen in practice.
Longer term, we can probably address this by implementing layer weights and
applying that to fallback DSQs to avoid starvation. For now, let's just
dispatch them to HI_FALLBACK_DSQ to avoid this starvation issue.
Signed-off-by: David Vernet <void@manifault.com>
Refactor the main module for scx_layered to move metrics into a separate
module. This change does no functional differences, only code structure.
This will make it a little easier to navigate the logic in the main
scheduler code.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
That is okay since the runtime is considered in calculating a virtual
deadline. A shorter runtime will result in a tighter deadline linearly.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
If inheriting the parent's properties, a new fork task tends to be too
prioritized. That is, many parent processes, such as `make,` are a bit
more latency-critical than average.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Instead of using a static value to classify tasks based on their average
amount of voluntary context switches, try to periodically evaluate an
optimal threshold, based on a global average of voluntary context
switches among of all the running tasks.
Tasks with an average amount of voluntary context switches greater than
the global average will be classified as interactive.
The global average is evaluated as an exponentially weighted moving
average (EWMA), as:
avg(t) = avg(t - 1) * 0.75 - task_avg(t) * 0.25
This approach is more efficient than iterating through all tasks and it
helps to prevent rapid fluctuations that may be caused by bursts of
voluntary context switch events.
The dynamic nvcsw threshold enables a more precise adjustment of
the classification criteria to swiftly respond to global system changes:
tasks can be quickly classified as interactive, but if the system
experiences too many interactive events, the criteria for maintaining
interactive status become stricter. This creates a natural selection
process where only the most deserving tasks remain interactive.
Additionally, introduce the new option `--nvcsw-max-thresh N`, which
allows to extend or restrict the fluctuation range of the global average
threshold for voluntary context switches.
Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Advancing the clock slower when overloaded gives more opportunities for
latency-critical tasks to cut in the run queue. Controlling the clock
better reflects the actual load than the prior approach of stretching
the time-space when overloaded.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
We now maintain two run queues—an eligible run queue (DSQ) and an
ineligible run queue (rbtree)—sorted by the task's virtual deadline.
When the eligible run queue is empty, or the ineligible run queue has
not been consumed for too long (e.g., 15 msec), a task in the ineligible
run queue is moved to the eligible run queue for execution. With these
two queues, we have a better admission control.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Followed commit 1c3b563, move the checking of task.migrated.get() into
the vector filter. In this way, we can remove the skip_while() call in
find_first_candidate().
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Update libbpf-rs & libbpf-cargo to 0.24. Among other things, generated
skeletons now contain directly accessible map and program objects, no
longer necessitating the use of accessor methods. As a result, the risk
for mutability conflicts is reduced greatly.
Signed-off-by: Daniel Müller <deso@posteo.net>
This change refactors some of the helper methods for getting the
preferred node for tasks using mempolicy. The load balancing logic in
try_find_move_task is updated to allow for a filter, which is used to
filter for tasks with a preferred mempolicy.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
This change makes scx_rusty mempolicy aware. When a process uses
set_mempolicy it can change NUMA memory preferences and cause
performance issues when tasks are scheduled on remote NUMA nodes. This
change modifies task_pick_domain to use the new helper method that
returns the preferred node id.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Estimating the service time from run time and frequency is not
incorrect. However, it reacts slowly to sudden changes since it relies
on the moving average. Hence, we directly measure the service time to
enforce fairness.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Instead of performing domain mask checking inside
"find_first_candidate()" every time, check whether the tasks within push
domain are abled to run on pull domain by performing the mask check at
vector generation stage.
This way can also avoid repeated computation generated by the same
(task, pull_dom) pair as they'll try to check whether the pull domain is
in the task domain mask.
Also since whether a task is a kworker won't change in time, we can
perform the check earlier and put it in the filter, too.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
We always use nr_cpu_ids to represent the maximum CPU id returned by
scx_bpf_nr_cpu_ids().
Replace cpu_max with nr_cpu_ids to be more consistent with the rest of
the code.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
We can rely on scx_bpf_nr_cpu_ids() to create all the possible per-CPU
DSQs, eliminating the need for the hard-coded limit MAX_CPUS.
In this way scx_bpfland can support the same amount of CPUs that the
kernel can handle.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Instead of constantly checking the need to drain tasks from the DSQs of
the offline CPUs, provide an atomic flag to notify when there are tasks
to be drained from the offline CPUs.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Refine the safeguard mechanism to avoid generating too many interactive
tasks in the system, which could nullify the effect of the
interactive/regular task classification.
The safeguard mechanism operates by pausing the promotion of new tasks
to interactive status during the task wake-up process, whenever the
number of interactive tasks in the priority queue exceeds a specific
limit (set to 4x the number of online CPUs).
Halting the promotion of additional interactive tasks allows to
prioritize those already classified as interactive, thereby preventing
potential "bursts" of excessive interactive tasks in the system.
This refines the mitigation already provided by commit 640bd562
("scx_bpfland: prevent tasks from abusing interactive priority boost").
Fixes: 640bd562 ("scx_bpfland: prevent tasks from abusing interactive priority boost")
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Always assign the maximum time slice if there are idle CPUs in the
system.
Otherwise, double the task's unused time slice to reward tasks that use
less CPU time and at the same time refill the time slice of the tasks
every time they're dispatched.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
sched_ext is about to be merged upstream. There are some compatibility
breaking changes and we're making the current sched_ext/for-6.11
1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on
!wakeup enqueues") the baseline.
Tag everything except scx_mitosis as 1.0.0. As scx_mitosis is still in early
development and is currently temporarily disabled, only the patchlevel is
bumped.
Sync to vmlinux.h from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap:
Pick idle CPU for direct dispatch on !wakeup enqueues"). This most likely
will be the commit which will be merged during the upcoming kernel v6.11
merge window.
Unfortunately, this is a compatibility breaking change. As the size of
bpf_iter_scx_dsq is reduced, schedulers that use the iterator - scx_lavd and
scx_layered - won't be able to run on older kernels. Likewise, older
binaries from before this commit won't be able to run on newer kernels.
Sync from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap: Pick idle
CPU for direct dispatch on !wakeup enqueues")
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-6.11
- cgroup support hasn't landed in the upstream kernel yet. This most likely
will happen in a few weeks. For the time being, disable scx_flatcg,
scx_pair and scx_mitosis.
- Compat macro for DSQ task iterator dropped. This is now a part of
the baseline.
- scx_bpf_consume() isn't upstream yet. BPF interfacing side is still being
discussed. Dropped example usage from tools/sched_ext. None of the
practical schedulers use it, so this should be fine for now.
- scx_bpf_cpu_rq() added.
- AUTOATTACH workaround for newer libbpf versions added.
A task can become a runnable on any task's context not only its waker
task. Thus, we should not count wake-up on unrelated task's context.
With this commit, the scheduler can (much more) accurately detect
waker-wakee relationsships.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The prior approach using the sum of weights gives too much penalty to
nice tasks with large nice values. With this commit, the time slice is
determined by the number of runnable tasks regardless of nice priority.
Note that the fairness will still be enforced based on tasks' nice
priorities (weights).
Signed-off-by: Changwoo Min <changwoo@igalia.com>
To easily distinguish, let's initialize the current logical clock to
zero (not the current physical time). Also, avoid the deadline
calculation being zero by adding +1 here and there.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The priority boost for interactive tasks can be exploited to render the
system nearly unresponsive by creating numerous tasks that constantly
switch between wait/wakeup states.
For example, stress tests like `hackbench -l 10000` can significantly
degrade system responsiveness.
To mitigate this, limit the number of interactive tasks added to the
priority queue to 4x the number of online CPUs.
This simple approach appears to be a quite effective at identifying
potential spam of "fake" interactive tasks, while still prioritizing
legitimate interactive tasks.
Additionally, periodically refresh the interactive status of the tasks
based on their most recent average of voluntary context switches,
preventing the interactive status from being too "sticky".
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Avoid dispatching per-CPU kthreads directly, since this may cause
interactivity problems or unfairness, for example if there are too many
softirqs being scheduled (e.g., in presence of high RX network traffic
or when running certain stress tests, like hackbench).
Moreover, in order to help with testing and benchmarks, introduce the
option --local-kthread, that allows to restore the old behavior if
enabled.
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
When updating the task vruntime, ensure the time slice delta is always a
positive value. Failing to do so may cause the global vruntime to
increase excessively due to overflows.
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Periodically report the amount of online CPUs to stdout.
The online CPUs are initially evaluated looking at the online cpumask,
then the value is updated in the .cpu_offline() / .cpu_online()
callbacks.
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Keep track of the CPUs that are running interactive tasks and report
their amount to stdout.
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
The correct default value of slice_ns 5ms, not 5s.
This change doesn't really make any difference in practice, since these
values are changed by the Rust part when the scheduler is started, but
it's good to keep this aligned to the proper values for consistency.
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
This commit changes the use of a physical clock to a virtual, logical
clock in calculating deadlines.
- The virtual current clock advances upon a task's running to its
virtual deadline.
- When enqueuing a task, its virtual deadline from the virtual current
clock is calculated.
With the above two changes, this guarantees that there is no such task
whose virtual deadline is smaller than the virtual current clock. This
means any enqueuing task can compete with any other already enqueued
tasks. This allows a latency-critical task to be immediately scheduled
if needed.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Every time we need to dispatch a task re-evalate its time slice as:
(unused_time_slice + min_time_slice) / 2
This allows to refill the time slice for tasks that haven't used much of
their previously assigned time, improving fairness.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Make sure to always classify interactive tasks, even when the system is
not fully utilized. This ensures that if the system suddenly becomes
overloaded, we already know which tasks need to be dispatched to the
priority DSQ.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Fetch the value of "delta" directly from the returned value from
__sync_fetch_and_sub, as it returns the origin value of
cgc->cvtime_delta.
Additional fetching instruction of cgc->cvtime_delta would be redundant
here.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Tasks are consumed from various DSQs in the following order:
per-CPU DSQs => priority DSQ => shared DSQ
Tasks in the shared DSQ may be starved by those in the priority DSQ,
which in turn may be starved by tasks dispatched to any per-CPU DSQ.
To mitigate this, record the timestamp of the last task scheduling event
both from the priority DSQ and the shared DSQ.
If the starvation threshold is exceeded without consuming a task, the
scheduler will be forced to consume a task from the corresponding DSQ.
The starvation threshold can be adjusted using the --starvation-thresh
command line parameter (default is 5ms).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
There is no need to RCU protect the cpumask for the offline CPUs: it is
created once when the scheduler is initialized and it's never
deallocated.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Reduce the default time slice down to 5ms for a faster reaction and
system responsiveness when the system is overcomissioned.
This also helps to provide a more predictable level of performance.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Always use direct CPU dispatch for kthreads, there is no need to treat
kthreads in a special way, simply reuse direct CPU dispatch to
prioritize them.
Moreover, change direct CPU dispatches to use scx_bpf_dispatch_vtime(),
since we may dispatch multiple tasks to the same per-CPU DSQ now.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Small refactoring of the idle CPU selection logic:
- optimize idle CPU selection for tasks that can run on a single CPU
- drop the built-in idle selection policy and completely rely on the
custom one
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
We are incorrectly using the SMT idle cpumask to find any idle CPU, fix
by using the generic idle cpumask.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Implement CPU hotplugging in scx_bpfland without restarting the
scheduler.
The idle selection logic has been updated to consider online CPUs.
Additionally, a cpumask for offline CPUs has been introduced. Tasks
that have been dispatched to the DSQs associated with offline CPUs are
consumed by the other CPUs that are still online.
Moreover, the dependency on the Topology crate is temporarily dropped
and instead, /sys/devices/system/cpu/smt/active is used to determine if
SMT should be taken into account during idle selection. The Topology
crate will be re-introduced later when scx_bpfland will gain more
topology-aware capabilities.
This fixes#406.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>