Dispatching per-CPU kthreads directly is disabled by default, reporting
this metric can generate some confusion (since it is always 0), and even
if local kthread dispatches are enabled, they should be still considered
as regular direct dispatches (there is no difference in practice).
Therefore, merge direct kthread dispatches into direct dispatches and
drop the separate nr_kthread_dispatches metric.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Scale the task's time slice based on the average amount of tasks that
are currently waiting to be dispatched.
Use a moving average for the amount of waiting tasks to smooth out
potential spikes caused by temporary bursts of tasks piling in the wait
queues.
This was initially modeled in scx_rustland and it seems to work pretty
well also in scx_bpfland now.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
With all the other optimizations and tunings, it turns out that maintaining
two runqueues has more harm than good.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Further depenalize above-average latency-critical tasks and penalize
further below-avergage latency-critical tasks in ineligibility duration.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
LAVD_VDL_LOOSENESS_FT represents how loose the deadline is. The smaller
value means the deadline is tighter. While it is unlikely to be tuned,
let's keep it as a tunable for now.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Non-kthreads with custom affinities in non-open layers are dispatched into a
LO_FALLBACK_DSQ, with the idea being that they're penalized for their custom
affinities. When a host is fully utilized, these tasks can end up being starved
due to LO_FALLBACK_DSQ being consumed only when there are no other layers to
consume from. In internal workloads at Meta, we've observed that this can
happen in practice.
Longer term, we can probably address this by implementing layer weights and
applying that to fallback DSQs to avoid starvation. For now, let's just
dispatch them to HI_FALLBACK_DSQ to avoid this starvation issue.
Signed-off-by: David Vernet <void@manifault.com>
Refactor the main module for scx_layered to move metrics into a separate
module. This change does no functional differences, only code structure.
This will make it a little easier to navigate the logic in the main
scheduler code.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
That is okay since the runtime is considered in calculating a virtual
deadline. A shorter runtime will result in a tighter deadline linearly.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
If inheriting the parent's properties, a new fork task tends to be too
prioritized. That is, many parent processes, such as `make,` are a bit
more latency-critical than average.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Instead of using a static value to classify tasks based on their average
amount of voluntary context switches, try to periodically evaluate an
optimal threshold, based on a global average of voluntary context
switches among of all the running tasks.
Tasks with an average amount of voluntary context switches greater than
the global average will be classified as interactive.
The global average is evaluated as an exponentially weighted moving
average (EWMA), as:
avg(t) = avg(t - 1) * 0.75 - task_avg(t) * 0.25
This approach is more efficient than iterating through all tasks and it
helps to prevent rapid fluctuations that may be caused by bursts of
voluntary context switch events.
The dynamic nvcsw threshold enables a more precise adjustment of
the classification criteria to swiftly respond to global system changes:
tasks can be quickly classified as interactive, but if the system
experiences too many interactive events, the criteria for maintaining
interactive status become stricter. This creates a natural selection
process where only the most deserving tasks remain interactive.
Additionally, introduce the new option `--nvcsw-max-thresh N`, which
allows to extend or restrict the fluctuation range of the global average
threshold for voluntary context switches.
Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Advancing the clock slower when overloaded gives more opportunities for
latency-critical tasks to cut in the run queue. Controlling the clock
better reflects the actual load than the prior approach of stretching
the time-space when overloaded.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
We now maintain two run queues—an eligible run queue (DSQ) and an
ineligible run queue (rbtree)—sorted by the task's virtual deadline.
When the eligible run queue is empty, or the ineligible run queue has
not been consumed for too long (e.g., 15 msec), a task in the ineligible
run queue is moved to the eligible run queue for execution. With these
two queues, we have a better admission control.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Followed commit 1c3b563, move the checking of task.migrated.get() into
the vector filter. In this way, we can remove the skip_while() call in
find_first_candidate().
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Update libbpf-rs & libbpf-cargo to 0.24. Among other things, generated
skeletons now contain directly accessible map and program objects, no
longer necessitating the use of accessor methods. As a result, the risk
for mutability conflicts is reduced greatly.
Signed-off-by: Daniel Müller <deso@posteo.net>
This change refactors some of the helper methods for getting the
preferred node for tasks using mempolicy. The load balancing logic in
try_find_move_task is updated to allow for a filter, which is used to
filter for tasks with a preferred mempolicy.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
This change makes scx_rusty mempolicy aware. When a process uses
set_mempolicy it can change NUMA memory preferences and cause
performance issues when tasks are scheduled on remote NUMA nodes. This
change modifies task_pick_domain to use the new helper method that
returns the preferred node id.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Estimating the service time from run time and frequency is not
incorrect. However, it reacts slowly to sudden changes since it relies
on the moving average. Hence, we directly measure the service time to
enforce fairness.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Instead of performing domain mask checking inside
"find_first_candidate()" every time, check whether the tasks within push
domain are abled to run on pull domain by performing the mask check at
vector generation stage.
This way can also avoid repeated computation generated by the same
(task, pull_dom) pair as they'll try to check whether the pull domain is
in the task domain mask.
Also since whether a task is a kworker won't change in time, we can
perform the check earlier and put it in the filter, too.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
We always use nr_cpu_ids to represent the maximum CPU id returned by
scx_bpf_nr_cpu_ids().
Replace cpu_max with nr_cpu_ids to be more consistent with the rest of
the code.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
We can rely on scx_bpf_nr_cpu_ids() to create all the possible per-CPU
DSQs, eliminating the need for the hard-coded limit MAX_CPUS.
In this way scx_bpfland can support the same amount of CPUs that the
kernel can handle.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Instead of constantly checking the need to drain tasks from the DSQs of
the offline CPUs, provide an atomic flag to notify when there are tasks
to be drained from the offline CPUs.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Refine the safeguard mechanism to avoid generating too many interactive
tasks in the system, which could nullify the effect of the
interactive/regular task classification.
The safeguard mechanism operates by pausing the promotion of new tasks
to interactive status during the task wake-up process, whenever the
number of interactive tasks in the priority queue exceeds a specific
limit (set to 4x the number of online CPUs).
Halting the promotion of additional interactive tasks allows to
prioritize those already classified as interactive, thereby preventing
potential "bursts" of excessive interactive tasks in the system.
This refines the mitigation already provided by commit 640bd562
("scx_bpfland: prevent tasks from abusing interactive priority boost").
Fixes: 640bd562 ("scx_bpfland: prevent tasks from abusing interactive priority boost")
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Always assign the maximum time slice if there are idle CPUs in the
system.
Otherwise, double the task's unused time slice to reward tasks that use
less CPU time and at the same time refill the time slice of the tasks
every time they're dispatched.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
sched_ext is about to be merged upstream. There are some compatibility
breaking changes and we're making the current sched_ext/for-6.11
1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on
!wakeup enqueues") the baseline.
Tag everything except scx_mitosis as 1.0.0. As scx_mitosis is still in early
development and is currently temporarily disabled, only the patchlevel is
bumped.
Sync from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap: Pick idle
CPU for direct dispatch on !wakeup enqueues")
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-6.11
- cgroup support hasn't landed in the upstream kernel yet. This most likely
will happen in a few weeks. For the time being, disable scx_flatcg,
scx_pair and scx_mitosis.
- Compat macro for DSQ task iterator dropped. This is now a part of
the baseline.
- scx_bpf_consume() isn't upstream yet. BPF interfacing side is still being
discussed. Dropped example usage from tools/sched_ext. None of the
practical schedulers use it, so this should be fine for now.
- scx_bpf_cpu_rq() added.
- AUTOATTACH workaround for newer libbpf versions added.
A task can become a runnable on any task's context not only its waker
task. Thus, we should not count wake-up on unrelated task's context.
With this commit, the scheduler can (much more) accurately detect
waker-wakee relationsships.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The prior approach using the sum of weights gives too much penalty to
nice tasks with large nice values. With this commit, the time slice is
determined by the number of runnable tasks regardless of nice priority.
Note that the fairness will still be enforced based on tasks' nice
priorities (weights).
Signed-off-by: Changwoo Min <changwoo@igalia.com>
To easily distinguish, let's initialize the current logical clock to
zero (not the current physical time). Also, avoid the deadline
calculation being zero by adding +1 here and there.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The priority boost for interactive tasks can be exploited to render the
system nearly unresponsive by creating numerous tasks that constantly
switch between wait/wakeup states.
For example, stress tests like `hackbench -l 10000` can significantly
degrade system responsiveness.
To mitigate this, limit the number of interactive tasks added to the
priority queue to 4x the number of online CPUs.
This simple approach appears to be a quite effective at identifying
potential spam of "fake" interactive tasks, while still prioritizing
legitimate interactive tasks.
Additionally, periodically refresh the interactive status of the tasks
based on their most recent average of voluntary context switches,
preventing the interactive status from being too "sticky".
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Avoid dispatching per-CPU kthreads directly, since this may cause
interactivity problems or unfairness, for example if there are too many
softirqs being scheduled (e.g., in presence of high RX network traffic
or when running certain stress tests, like hackbench).
Moreover, in order to help with testing and benchmarks, introduce the
option --local-kthread, that allows to restore the old behavior if
enabled.
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
When updating the task vruntime, ensure the time slice delta is always a
positive value. Failing to do so may cause the global vruntime to
increase excessively due to overflows.
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Periodically report the amount of online CPUs to stdout.
The online CPUs are initially evaluated looking at the online cpumask,
then the value is updated in the .cpu_offline() / .cpu_online()
callbacks.
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Keep track of the CPUs that are running interactive tasks and report
their amount to stdout.
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
The correct default value of slice_ns 5ms, not 5s.
This change doesn't really make any difference in practice, since these
values are changed by the Rust part when the scheduler is started, but
it's good to keep this aligned to the proper values for consistency.
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
This commit changes the use of a physical clock to a virtual, logical
clock in calculating deadlines.
- The virtual current clock advances upon a task's running to its
virtual deadline.
- When enqueuing a task, its virtual deadline from the virtual current
clock is calculated.
With the above two changes, this guarantees that there is no such task
whose virtual deadline is smaller than the virtual current clock. This
means any enqueuing task can compete with any other already enqueued
tasks. This allows a latency-critical task to be immediately scheduled
if needed.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Every time we need to dispatch a task re-evalate its time slice as:
(unused_time_slice + min_time_slice) / 2
This allows to refill the time slice for tasks that haven't used much of
their previously assigned time, improving fairness.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Make sure to always classify interactive tasks, even when the system is
not fully utilized. This ensures that if the system suddenly becomes
overloaded, we already know which tasks need to be dispatched to the
priority DSQ.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Tasks are consumed from various DSQs in the following order:
per-CPU DSQs => priority DSQ => shared DSQ
Tasks in the shared DSQ may be starved by those in the priority DSQ,
which in turn may be starved by tasks dispatched to any per-CPU DSQ.
To mitigate this, record the timestamp of the last task scheduling event
both from the priority DSQ and the shared DSQ.
If the starvation threshold is exceeded without consuming a task, the
scheduler will be forced to consume a task from the corresponding DSQ.
The starvation threshold can be adjusted using the --starvation-thresh
command line parameter (default is 5ms).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
There is no need to RCU protect the cpumask for the offline CPUs: it is
created once when the scheduler is initialized and it's never
deallocated.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Reduce the default time slice down to 5ms for a faster reaction and
system responsiveness when the system is overcomissioned.
This also helps to provide a more predictable level of performance.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Always use direct CPU dispatch for kthreads, there is no need to treat
kthreads in a special way, simply reuse direct CPU dispatch to
prioritize them.
Moreover, change direct CPU dispatches to use scx_bpf_dispatch_vtime(),
since we may dispatch multiple tasks to the same per-CPU DSQ now.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Small refactoring of the idle CPU selection logic:
- optimize idle CPU selection for tasks that can run on a single CPU
- drop the built-in idle selection policy and completely rely on the
custom one
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
We are incorrectly using the SMT idle cpumask to find any idle CPU, fix
by using the generic idle cpumask.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Implement CPU hotplugging in scx_bpfland without restarting the
scheduler.
The idle selection logic has been updated to consider online CPUs.
Additionally, a cpumask for offline CPUs has been introduced. Tasks
that have been dispatched to the DSQs associated with offline CPUs are
consumed by the other CPUs that are still online.
Moreover, the dependency on the Topology crate is temporarily dropped
and instead, /sys/devices/system/cpu/smt/active is used to determine if
SMT should be taken into account during idle selection. The Topology
crate will be re-introduced later when scx_bpfland will gain more
topology-aware capabilities.
This fixes#406.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The stats map in scx_rusty is a BPF_MAP_TYPE_PERCPU_ARRAY, with its size
determined by num_possible_cpus(). Initializing it with nr_cpu_ids() can
result in errors such as:
Error: Failed to zero stat
Caused by:
number of values 6 != number of cpus 8
Fix by using num_possible_cpus() to initialize it.
Fixes: 263e02f6 ("rusty: Use nr_cpu_ids instead of nr_cpus_possible")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
With commit 5d20f89a ("scheds-rust: build rust schedulers in sequence"),
schedulers are now built serially one after the other to prevent meson
and cargo from forking NxN parallel tasks.
However, this change has made building a single scheduler much more
cumbersome, due to the chain of dependencies.
For example, building scx_rusty using the specific meson target would
still result in all schedulers being built, because they all depend on
each other.
To address this issue, introduce the new meson build option
`serialize=true|false` (default is false).
This option allows to disable the schedulers' build chain, restoring the
old behavior.
With this option enabled, it is now possible to build just a single
scheduler, parallelizing the cargo build properly, without triggering
the build of the others. Example:
$ meson setup build -Dbuildtype=release -Dserialize=false
$ meson compile -C build scx_rusty
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The competition window was 7.5 msec, half of the targeted latency.
However, it is too wide for some workloads, so unrelated tasks may
compete with each other. Hence, it is tightened to about 1 msec with
LAVD_LAT_WEIGHT_SHIFT to avoid unnecessary competition.
Also, when a system is overloaded, now the time space is stretched more
aggressively (i.e., lat_prio^2) when a task's latency priority is low
(high value).
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Introduce a tunable to set a limit of the minimum vruntime that is used
when a task is dispatched, as:
vtime_min = vtime_now - slice_lag_ns
Increasing the time slice lag can make interactive tasks even more
responsive at the cost of starving regular and newly created tasks.
Default time slice lag is 0.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Overview
========
This scheduler is derived from scx_rustland, but it is fully implemented
in BFP with minimal user-space Rust part to process command line
options, collect metrics and logs out scheduling statistics.
Unlike scx_rustland, all scheduling decisions are made by the BPF
component.
Motivation
==========
The primary goal of this scheduler is to act as a performance baseline
for comparison with scx_rustland, allowing for a better assessment of
the overhead caused by kernel/user-space interactions.
It can also be used to deploy prototypes initially tested in the
scx_rustland scheduler. In fact, this scheduler is expected to
outperform scx_rustland, due to the elimitation of the kernel/user-space
overhead.
Scheduling policy
=================
scx_bpfland is a vruntime-based sched_ext scheduler that prioritizes
interactive workloads. Its scheduling policy closely mirrors
scx_rustland, but it has been re-implemented in BPF with some small
adjustments.
Tasks are categorized as either interactive or regular based on their
average rate of voluntary context switches per second: tasks that exceed
a specific voluntary context switch threshold are classified as
interactive.
Interactive tasks are prioritized in a higher-priority DSQ, while
regular tasks are placed in a lower-priority DSQ. Within each queue,
tasks are sorted based on their weighted runtime, using the built-in scx
vtime ordering capabilities (scx_bpf_dispatch_vtime()).
Moreover, each task gets a time slice budget. When a task is dispatched,
it receives a time slice equivalent to the remaining unused portion of
its previously allocated time slice (with a minimum threshold applied).
This gives latency-sensitive workloads more chances to exceed their time
slice when needed to perform short bursts of CPU activity without being
interrupted (i.e., real-time audio encoding / decoding workloads).
Results
=======
According to the initial test results, using the same benchmark "playing
a videogame while recompiling the kernel", this scheduler seems to
provide a +5% improvement in the frames-per-second (fps) compared to
scx_rustland, with video games such as Cyberpunk 2077, Counter-Strike 2
and Baldur's Gate 3.
Initial test results indicate that this scheduler offers around a +5%
improvement in frames-per-second (fps) compared to scx_rustland when
using the benchmark "playing a video game while recompiling the kernel".
This improvement was observed in games such as Cyberpunk 2077,
Counter-Strike 2, and Baldur's Gate 3.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The old approach was too conservative in running a new task, so when a
fork-heavy workload competes with a CPU-bound workload, the fork-heavy
one is starved. The new approach solves the starvation problem by
inheriting parent's statistics. It seems a good (at least better than
old) guess how a new task will behave.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When the system is highly loaded with compute-intensive tasks, the old
setting chokes latensive-intensive tasks, so loosen the dealine when the
system is overloaded (> 100% utilization).
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When the lavd is loaded, it prints out its build id. It helps to easily
identify what version it is when testing.
```
01:56:54 [INFO] scx_lavd scheduler is initialized (build ID: 0.8.1-g98a5fa8595430414115c504857cea1a458393838-dirty x86_64-unknown-linux-gnu)
```
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The synchronization for mitosis is a bit ad-hoc, working around lack of
atomics in BPF. This commit updates the logic to use READ/WRITE_ONCE and
compiler barriers to get the behaviors we want.
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
When someone is testing schedulers, we often have to ask what version
the scheduler is running as. Now that we can access the build ID from
rust schedulers, let's update scx_rusty to print the build ID when rusty
first starts running.
This results in output such as the following:
```
[void@maniforge scx]$ rusty
19:04:26 [INFO] Running scx_rusty (build ID: 0.8.1-g2043d2537f37c8d75753bb65eb75bca965067564 x86_64-unknown-linux-gnu/debug)
19:04:26 [INFO] NUMA[00] mask= 0b11111111111111111111111111111111
19:04:26 [INFO] DOM[00] mask= 0b00000000111111110000000011111111
19:04:26 [INFO] DOM[01] mask= 0b11111111000000001111111100000000
19:04:26 [INFO] Rusty scheduler started!
```
Signed-off-by: David Vernet <void@manifault.com>
This is a second attempt to optimize tunables for a wider range of
games.
1) LAVD_BOOST_RANGE increased from 14 (35%) to 40 (100% of nice range).
Now the latency priority (biased by nice value) will decide which
task should run first . The nice value will decide the time slice.
2) The first change will give higher priority to latency-critical task
compared to before. For compensation, the slice boost also increased
(2x -> 3x).
Signed-off-by: Changwoo Min <changwoo@igalia.com>
This change adds a new module to the scx_utils crate that provides a
log recorder for metrics-rs. The log recorder will log all metrics to
the console at a configurable interval in an easy to read format. Each
metric type will be displayed in a separate section. Indentation will
be used to show the hierarchy of the metrics. This results in a more
verbose output, but it is easier to read and understand.
scx_rusty was updated to use the log recorder and all explicit metric
logging was removed.
Counters will show the total count and the rate of change per second.
Counters with an additional label, like `type` in
`dispatched_tasks_total` in rusty, will show the count, rate, and
percentage of the total count.
Counters:
dispatched_tasks_total: 65559 [1344.8/s]
prev_idle: 44963 (68.6%) [966.5/s]
wsync_prev_idle: 15696 (23.9%) [317.3/s]
direct_dispatch: 2833 (4.3%) [35.3/s]
dsq: 1804 (2.8%) [21.3/s]
wsync: 262 (0.4%) [4.3/s]
direct_greedy: 1 (0.0%) [0.0/s]
pinned: 0 (0.0%) [0.0/s]
greedy_idle: 0 (0.0%) [0.0/s]
greedy_xnuma: 0 (0.0%) [0.0/s]
direct_greedy_far: 0 (0.0%) [0.0/s]
greedy_local: 0 (0.0%) [0.0/s]
dl_clamped_total: 1290 [20.3/s]
dl_preset_total: 514 [1.0/s]
kick_greedy_total: 6 [0.3/s]
lb_data_errors_total: 0 [0.0/s]
load_balance_total: 0 [0.0/s]
repatriate_total: 0 [0.0/s]
task_errors_total: 0 [0.0/s]
Gauges will show the last set value:
Gauges:
slice_length_us: 20000.00
Histograms will show the average, min, and max. The histogram will be
reset after each log interval to avoid memory leaks, since the data
structure that holds the samples is unbounded.
Histograms:
cpu_busy_pct: avg=1.66 min=1.16 max=2.16
load_avg node=0: avg=0.31 min=0.23 max=0.39
load_avg node=0 dom=0: avg=0.31 min=0.23 max=0.39
processing_duration_us: avg=297.50 min=296.00 max=299.00
Signed-off-by: Jose Fernandez <josef@netflix.com>
In some games (e.g., Elden Ring), it was observed that preemption
happens much less frequently. The reason is that tasks' runtime per
schedule is similar, so it does not meet the existing criteria. To
alleviate the problem, the following three tunables are revised:
1) Smaller LAVD_PREEMPT_KICK_MARGIN and LAVD_PREEMPT_TICK_MARGIN help to
trigger more preemption.
2) Smaller LAVD_SLICE_MAX_NS works better especially 250 or 300Hz
kernels.
3) Longer LAVD_ELIGIBLE_TIME_MAX purturbes time lines less frequently.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Origin assignment of the variable ridx is equivalent to comparing
between "ridx" and "wids - MAX_PIDS". Using u64 max library helper
function to perform the comparison and provide better readability.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Check whether the BalanceState of pull_dom.load inside function
try_find_move_task is actually the variant NeedsPull. It'll perform task
migration in abit more conservative manner when the system is under high
loading situation.
Experiments are performed when the system is compiling linux kernel and
undergoing a large amount of I/O operation at the same time using fio.
The result showns that before the modification, there're 12,6617 times
of task migrations system wide. After the modification, there're 11,5419
times of task migrations system wide.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
In scx_rlfifo, we're currently using topo.nr_cpus_possible() to
determine how many possible CPU IDs we could have on the system. To
properly support systems whose disabled CPUs may be in the middle of the
range of possible CPU IDs, let's instead use topo.nr_cpu_ids() so that
we don't accidentally dispatch to an invalid DSQ.
Signed-off-by: David Vernet <void@manifault.com>
In scx_rusty, we're currently using topo.nr_cpus_possible() to determine
how many possible CPU IDs we could have on the system. scx_rusty already
accounts for offlined CPUs, so to properly support systems whose
disabled CPUs may be in the middle of the range of possible CPU IDs,
let's instead use topo.nr_cpu_ids().
Signed-off-by: David Vernet <void@manifault.com>
In some cases, a host may have an odd topology where there are gaps in
CPU IDs (including between possible CPUs). A common pattern in
schedulers is to perform allocations for every possible CPU ID, such as
creating a per-cpu DSQ. In order to avoid confusing schedulers, let's
track the maximum CPU ID on a system so that we can return the number of
CPU IDs on the system which is inclusive of gaps.
We also update scx_rustland in this change to accommodate the fact that
we no longer export nr_cpus_possible() from TopologyMap.
Signed-off-by: David Vernet <void@manifault.com>
We need a layer of indirection between the stats collection and their
output destinations. Currently, stats are only printed to stdout. Our
goal is to integrate with various telemetry systems such as Prometheus,
StatsD, and custom metric backends like those used by Meta and Netflix.
Importantly, adding a new backend should not require changes to the
existing stats code.
This patch introduces the `metrics` [1] crate, which provides a
framework for defining metrics and publishing them to different
backends.
The initial implementation includes the `dispatched_tasks_count`
metric, tagged with `type`. This metric increments every time a task is
dispatched, emitting the raw count instead of a percentage. A monotonic
counter is the most suitable metric type for this use case, as
percentages can be calculated at query time if needed. Existing logged
metrics continue to print percentages and remain unchanged.
A new flag, `--enable-prometheus`, has been added. When enabled, it
starts a Prometheus endpoint on port 9000 (default is false). This
endpoint allows metrics to be charted in Prometheus or Grafana
dashboards.
Future changes will migrate additional stats to this framework and add
support for other backends.
[1] https://metrics.rs/
Signed-off-by: Jose Fernandez <josef@netflix.com>
Use the function can_task1_kick_task2() to replace places which also
checking the comp_preemption_info between two cpus for better
consistency.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
It seems that we are not updating `is_idle` when we find an idle CPU
with pick_cpu(), causing unnecessary rescheduling events when
select_cpu() is called.
To resolve this, ensure that the is_idle state is correctly set.
Additionally, always ensure that the task is dispatched to the local DSQ
immediately upon finding (and reserving) an idle CPU.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
- clean up u63 and u32 usages in structures to reduce struct size
- refactoring pick_cpu() for readability
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The required CPU performance (cpuperf) was set to 1024 (100%) when the
CPU utilization was 100%. When a sudden load spike happens, it makes the
system adapt slowly in the next interval.
The new scheme always reserves some headroom in advance, so it sets
cpuperf to 1024 when the CPU utilization reaches to 85%. This gives some
room to adapt in advance.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Modify the execution sequence before lookup operation for new_domc. If
new_dom_id == NO_DOM_FOUND, lookup operation for new_domc is definitely
going to fail so we don't have to wait until we found that new_domc is
NULL, clearing of cpumask and return operation should be done directly
in that case.
Plus we should avoid using try_lookup_dom_ctx outside the context of
lookup_dom_ctx, as it can keep the interface's consistency.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
The rusty dispatch logic is a bit unnecessarily convoluted. Let's clean it up
so that we're just comparing dom ids rather than iterating over arrays nested
inside of pcpu context.
Signed-off-by: David Vernet <void@manifault.com>
Right now, the SCX_WAKE_SYNC logic in rusty is very primitive. We only check to
see if the waker CPU's runqueue is empty, and then migrate the wakee there if
so. We'll want to expand this to be more thorough, such as:
- Checking to see if prev_cpu and waker_cpu share the same LLC when determining
where to migrate
- Check for whether SCX_WAKE_SYNC migration helps load imbalance between cores
- ...
Right now all of that code is just a big blob in the middle of
rusty_select_cpu(). Let's pull it into its own function to improve readability,
and also add some logic to stay on prev_cpu if it shares an LLC with the waker.
Signed-off-by: David Vernet <void@manifault.com>
It seems that task_set_domain() is nearly at the point where it can
cause the verifier to get confused and think that it's exceeding the
number of available instructions per program. I've seen this a number of
times when making small changes to task_set_domain(), and it's once
again happened @vax-r (I-Hsin Cheng) made a small cleanup change to
rusty in https://github.com/sched-ext/scx/pull/362.
To avoid this, let's just make dom_xfer_task() a separate global program
so that the verifier doens't have to worry about branch pruning, etc
depending on what the caller does. This should hopefully make
task_set_domain() (and its callers) much less brittle.
Signed-off-by: David Vernet <void@manifault.com>
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_cpuperf_*(). The open helper
macros now check the existence of scx_bpf_cpuperf_cap() and abort if not.
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_HAS_CPUMASKS(). The open helper macros
now check the existence of scx_bpf_nr_cpu_ids() and abort if not.
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_SCX_KICK_IDLE. The open helper macros
now check the existence of SCX_KICK_IDLE and abort if not.
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_switch_call(). The open helper
macros now check the existence of SCX_OPS_SWITCH_PARTIAL and abort if not.
With commit 786ec0c0 ("scx_rlfifo: schedule all tasks in user-space")
all the scheduling decisions are now happening in user-space. This also
bypasses the built-in idle selection logic, delegating the CPU selection
for each task to the user-space scheduler.
The easiest way to distribute tasks across the available CPUs is to
simply allow to dispatch them on the first CPU available.
In this way the scheduler becomes usable in practical scenarios and at
the same time it also maintains its simplicity.
This allows to spread all tasks across all the available CPUs
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Disable all the BPF optimization shortcuts by default and force all
tasks to be processed by the user-space scheduler.
Given that the primary goal of this scheduler is to offer a
straightforward and intuitive example for experimental purposes, this
change simplifies the process for individuals looking to experiment,
allowing them to apply changes to user-space code and quickly observe
the effects, without dealing with any in-kernel optimizations.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
No functional change, just add some comments to better describe the
parameters used when initializing the main BpfScheduler object.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The bpf_ prefix is used for BPF API. Rename bpf_log2() to u32_log2() and
bpf_log2l() to u64_log2(). While at it, relocate them below compiler
directive helpers.
Keep track of the maximum vruntime among all tasks and flush them if the
difference between the maximum and minimum vruntime exceeds slice_ns.
This helps to prevent excessive starvation, as every task is guaranteed
to be dispatched within the slice_ns time limit.
Tested-by: Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
These are used in mitosis, but they belong in common code so other
schedulers can do css iteration.
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
The old logic for CPU frequency scaling is that the task's CPU
performance target (i.e., target CPU frequency) is checked every tick
interval and updated immediately. Indeed, it samples and updates a
performance target every tick interval. Ultimately, it fluctuates CPU
frequency every tick interval, resulting in less steady performance.
Now, we take a different strategy. The key idea is to increase the
frequency as soon as possible when a task starts running for quick
adoption to load spikes. However, if necessary, it decreases gradually
every tick interval to avoid frequency fluctuations.
In my testing, it shows more stable performance in many workloads
(games, compilation).
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Originally, do_update_sys_stat() simply calculated the system-wide CPU
utilization. Over time, it has evolved to collect all kinds of
system-wide, periodic statistics for decision-making, so it has become
bulky. Now, it is time to refactor it for readability. This commit does
not contain functional changes other than refactoring.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The periodic CPU utilization routine does a lot of other work now. So we
rename LAVD_CPU_UTIL_INTERVAL_NS to LAVD_SYS_STAT_INTERVAL_NS.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a device is suspended and resumed, the suspended duration is added
up to a task's runtime if the task was running on the CPU. After the
resume, the task's runtime is incorrectly long and the scheduler starts
to recognize the system is under heavy load. To avoid such problem, the
suspended duration is measured and substracted from the task's runtime.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
scx_mitosis is a dynamic affinity scheduler which assigns cgroups to
Cells and Cells to discrete sets of CPUs. The number of cells is dynamic
as is the CPU assignment. BPF mostly just does vtime scheduling for each
cell, tracks load, and responds to reconfiguration from userspace.
Userspace makes decisions about how to assign cgroups to cells and cells
to cpus.
This is not yet a complete scheduler, much of the userspace logic is a
placeholder as I experiment with better logic. I also want to add richer
scheduling semantics to userspace, e.g. so that cells can do more
"soft-affinity" rather than the strict partitioning implemented
currently.
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
READ_ONCE()/WRITE_ONCE() macros are added in commit 0932fde, we should
be able to utilize the macros to get around the possibility of data
races for domc->min_vruntime.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
- pick_idle_cpu() was putting idle_smtmask that it didn't acquire.
- layered_enqueue() was unnecessarily entering preemption path after finding
an idle CPU.
- No need to test whether scx_bpf_get_idle_cpu/smtmask() return NULL. They
never do.
- Relocate cctx->yielding test into keep_runinng() from its caller.
scx_lavd: core compaction for low power consumption
When system-wide CPU utilization is low, it is very likely all the CPUs
are running with very low utilization. That means all CPUs run with low
clock frequency thanks to dynamic frequency scaling and very frequently
go in and out from/to C-state. That results in low performance (i.e.,
low clock frequency) and high power consumption (i.e., frequent
P-/C-state transition).
The idea of *core compaction* is using less number of CPUs when
system-wide CPU utilization is low. The chosen cores (called "active
cores") will run in higher utilization and higher clock frequency, and
the rest of the cores (called "idle cores") will be in a C-state for a
much longer duration. Thus, the core compaction can achieve higher
performance with lower power consumption.
One potential problem of core compaction is latency spikes when all the
active cores are overloaded. A few techniques are incorporated to solve
this problem.
1) Limit the active CPU core's utilization below a certain limit (say 50%).
2) Do not use the core compaction when the system-wide utilization is
moderate (say 50%).
3) Do not enforce the core compaction for kernel and pinned user-space
tasks since they are manually optimized for performance.
In my experiments, under a wide range of system-wide CPU utilization
(5%—80%), the core compaction reduces 7-30% power consumption without
sacrificing average and 99p tail latency.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Currently, when preempting, searching for the candidate CPU always starts
from the RR preemption cursor. Let's first try the previous CPU the
preempting task was on as that may have some locality benefits.
When a task is being enqueued outside wakeup path, ops.select_cpu() isn't
called, so we can end up in a situation where a newly enqueued task keeps
waiting in one of the DSQs while there are idle CPUs. Factor out idle CPU
selection path into pick_idle_cpu() and call it from the enqueue path in
such cases. This problem is shared across schedulers and likely needs a more
generic solution in the future.
yield(2) currently gives up the entire slice. Add "yield_ignore" layer
parameter which can modulate the magnitude of yiedling. When 1.0, yields are
completely ignored. 0.5, only half worth of the full slice is given up and
so on.
Currently, a task which yields is treated the same as a task which has run
out its slice. As the budget charged to a task is calculated from wall clock
time, a repeatedly yielding task can stay at the top of the queue for quite
a while hogging the CPU and spiking the number of scheduling events.
Let's add explicit yield support. An yielding task is now always charged the
full slice and not allowed to keep running on the same CPU.
The keep_running path relies on the implicit last task enqueue which makes
the statistics a bit difficult to track. Let's make the enqueue path
comprehensive:
- Set SCX_OPS_ENQ_LAST and handle the last runnable task enqueue explicitly.
- Implement layered_cpu_release() to re-enqueue tasks from a CPU preempted
by a higher pri sched class and handle the re-enqueued tasks explicitly in
layered_enqueue().
- Add more statistics to track all enqueue operations.
When a task exhausts its slice, layered currently doesn't make any effort to
keep it on the same CPU. It dispatches the next task to run and then
enqueues the running one. This leads to suboptimal behaviors. e.g. When this
happens to a task in a preempting layer, the task will most likely find an
idle CPU or a task to preempt and then migrate there causing a completely
unnecessary migration.
This patch layered_dispatch() test whether the current task should keep
running on the CPU and then skip dispatching to keep the task running. This
behavior depends on the implicit local DSQ enqueue mechanism which triggers
when there are no other tasks to run.
- scx_utils: Replace kfunc_exists() with ksym_exists() which doesn't care
about the type of the symbol.
- scx_layered: Fix load failure on kernels >= v6.10-rc due to
scheduler_tick() -> sched_tick rename. Attach the tick fentry function to
either scheduler_tick() or sched_tick().
Make sure to never assign a time slice longer than the default time
slice, that can be used as an upper limit.
This seems to prevent potential stall conditions (reported by the
CachyOS community) when running CPU-intensive workloads, such as:
[ 68.062813] sched_ext: BPF scheduler "rustland" errored, disabling
[ 68.062831] sched_ext: runnable task stall (ollama_llama_se[3312] failed to run for 5.180s)
[ 68.062832] scx_watchdog_workfn+0x154/0x1e0
[ 68.062837] process_one_work+0x18e/0x350
[ 68.062839] worker_thread+0x2fa/0x490
[ 68.062841] kthread+0xd2/0x100
[ 68.062842] ret_from_fork+0x34/0x50
[ 68.062844] ret_from_fork_asm+0x1a/0x30
Fixes: 6f4cd853 ("scx_rustland: introduce virtual time slice")
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Overview
========
Currently, a task's time slice is determined based on the total number
of tasks waiting to be scheduled: the more overloaded the system, the
shorter the time slice.
This approach can help to reduce the average wait time of all tasks,
allowing them to progress more slowly, but uniformly, thus providing a
smoother overall system performance.
However, under heavy system load, this approach can lead to very short
time slices distributed among all tasks, causing excessive context
switches that can badly affect soft real-time workloads.
Moreover, the scheduler tends to operate in a bursty manner (tasks are
queued and dispatched in bursts). This can also result in fluctuations
of longer and shorter time slices, depending on the number of tasks
still waiting in the scheduler's queue.
Such behavior can also negatively impact on soft real-time workloads,
such as real-time audio processing.
Virtual time slice
==================
To mitigate this problem, introduce the concept of virtual time slice:
the idea is to evaluate the optimal time slice of a task, considering
the vruntime as a deadline for the task to complete its work before
releasing the CPU.
This is accomplished by calculating the difference between the task's
vruntime and the global current vruntime and use this value as the task
time slice:
task_slice = task_vruntime - min_vruntime
In this way, tasks that "promise" to release the CPU quickly (based on
their previous work pattern) get a much higher priority (due to
vruntime-based scheduling and the additional priority boost for being
classified as interactive), but they are also given a shorter time slice
to complete their work and fulfill their promise of rapidity.
At the same time tasks that are more CPU-intensive get de-prioritized,
but they will tend to have a longer time slice available, reducing in
this way the amount of context switches that can negatively affect their
performance.
In conclusion, latency-sensitive tasks get a high priority and a short
time slice (and they can preempt other tasks), CPU-intensive tasks get
low priority and a long time slice.
Example
=======
Let's consider the following theoretical scenario:
task | time
-----+-----
A | 1
B | 3
C | 6
D | 6
In this case task A represents a short interactive task, task C and D
are CPU-intensive tasks and task B is mainly interactive, but it also
requires some CPU time.
With a uniform time slice, scaled based on the amount of tasks, the
scheduling looks like this (assuming the time slice is 2):
A B B C C D D A B C C D D C C D D
| | | | | | | | |
`---`---`---`-`-`---`---`---`----> 9 context switches
With the virtual time slice the scheduling changes to this:
A B B C C C D A B C C C D D D D D
| | | | | | |
`---`-----`-`-`-`-----`----------> 7 context switches
In the latter scenario, tasks do not receive the same time slice scaled
by the total number of tasks waiting to be scheduled. Instead, their
time slice is adjusted based on their previous CPU usage. Tasks that
used more CPU time are given longer slices and their processing time
tends to be packed together, reducing the amount of context switches.
Meanwhile, latency-sensitive tasks can still be processed as soon as
they need to, because they get a higher priority and they can preempt
other tasks. However, they will get a short time slice, so tasks that
were incorrectly classified as interactive will still be forced to
release the CPU quickly.
Experimental results
====================
This patch has been tested on a on a 8-cores AMD Ryzen 7 5800X 8-Core
Processor (16 threads with SMT), 16GB RAM, NVIDIA GeForce RTX 3070.
The test case involves the usual benchmark of playing a video game while
simultaneously overloading the system with a parallel kernel build
(`make -j32`).
The average frames per second (fps) reported by Steam is used as a
metric for measuring system responsiveness (the higher the better):
Game | before | after | delta |
---------------------------+---------+---------+--------+
Baldur's Gate 3 | 40 fps | 48 fps | +20.0% |
Counter-Strike 2 | 8 fps | 15 fps | +87.5% |
Cyberpunk 2077 | 41 fps | 46 fps | +12.2% |
Terraria | 98 fps | 108 fps | +10.2% |
Team Fortress 2 | 81 fps | 92 fps | +13.6% |
WebGL demo (firefox) [1] | 32 fps | 42 fps | +31.2% |
---------------------------+---------+---------+--------+
Apart from the massive boost with Counter-Strike 2 (that should be taken
with a grain of salt, considering the overall poor performance in both
cases), the virtual time slice seems to systematically provide a boost
in responsiveness of around +10-20% fps.
It also seems to significantly prevent potential audio cracking issues
when the system is massively overloaded: no audio cracking was detected
during the entire run of these tests with the virtual deadline change
applied.
[1] https://webglsamples.org/aquarium/aquarium.html
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Make restart handling with user_exit_info simpler and consistently use the
load and report macros consistently across the rust schedulers. This makes
all schedulers automatically handle auto restarts from CPU hotplug events.
Note that this is necessary even for scx_lavd which has CPU hotplug
operations as CPU hotplug operations which took place between skel open and
scheduler init can still trigger restart.
In cpumask_intersects_domain(), we check whether a given cpumask has any
CPUs in common with the specified domain by looking at the const, static
dom_cpumasks map. This map is only really necessary when creating the
domain struct bpf_cpumask objects at scheduler load time. After that, we
can just use the actual struct bpf_cpumask object embedded in the domain
context. Let's use that and cpumask kfuncs instead.
This allows rusty to load with
https://github.com/sched-ext/sched_ext/pull/216.
Signed-off-by: David Vernet <void@manifault.com>
Commit 23b0bb5f ("scx_rustland: dispatch interactive tasks on any CPU")
allows only interactive tasks to be dispatched on any CPU, enabling them
to quickly use the first idle CPU available. Non-interactive tasks, on
the other hand, are kept on the same CPU as much as possible.
This change deprioritizes CPU-intensive tasks further, but it also helps
to exploit cache locality, while latency-sensitive tasks are dispatched
sooner, improving overall responsiveness, despite the potential
migration cost.
Given this new logic, the built-idle option, which forces all tasks to
be dispatched on the CPU assigned during select_cpu(), no longer offers
significant benefits. It would merely reduce the responsiveness of
interactive tasks.
Therefore, simply remove this option, allowing the scheduler to
determine the target CPU(s) for all tasks based on their nature.
Fixes: 23b0bb5f ("scx_rustland: dispatch interactive tasks on any CPU")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
In order to prevent compiler from merging or refetching load/store
operations or unwanted reordering, we take the implemetation of
READ_ONCE()/WRITE_ONCE() from kernel sources under
"/include/asm-generic/rwonce.h".
Use WRITE_ONCE() in function flip_sys_cpu_util() to ensure the compiler
doesn't perform unnecessary optimization so the compiler won't make
incorrect assumptions when performing the operation of modifying of bit
flipping.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
layered_dispatch() was incorrectly continuing down to the lower priority
DSQs after successfully consuming from HI_FALLBACK_DSQ which can lead to
latency issues. Fix it.
The main reason why custom affinities are tricky for scx_layered is because
if we put a task which doesn't allow all CPUs into a layer's DSQ, it may not
get consumed for an indefinite amount of time. However, this is only true
for confined layers. Both open and grouped layers always consumed from all
CPUs and thus don't have this risk.
Let's allow tasks with custom affinities in open and grouped layers.
- In select_cpu(), don't consider direct dispatching to a local DSQ as
affinity violation even if the target CPU is outside the layer's cpumask
if the layer is open.
- In enqueue(), separate out per-cpu kthread special case into its own
block. Note that this is only applied if the layer is not preempting as a
preempting layer has a higher priority than HI_FALLBACK_DSQ anyway.
- Trigger the LO_FALLBACK_DSQ path for other threads only if the layer is
confined.
- The preemption path now also runs for tasks with a custom affinity in open
and grouped layers. Update it so that it only considers the CPUs in the
preempting task's allowed cpumask.
(cherry picked from commit 82d2f887a4608de61ddf5e15643c10e504a88f7b)
- AFFN_VIOL for per-cpu tasks could be double counted. Once in select_cpu()
and again in enqueue(). Count in select_cpu() only when direct
dispatching.
- Violating tasks were prioritized over non-violating ones because they were
queued on SCX_DSQ_GLOBAL which has priority over all user DSQs. This
doesn't make sense. Let's introduce two fallback DSQs - HI_FALLBACK_DSQ
and LO_FALLBACK_DSQ. HI is used for violating kthreads and LO for
violating user threads. HI is dispatched after preempting layers and LO
after all other layers. This shouldn't change the behavior too much for
kthreads while punshing, rather than rewarding, violating user threads.
(cherry picked from commit 67f69645667ba8a155cae9a9b7e90c055d39e23c)
Dispatch non-interactive tasks on the CPU selected by the built-in idle
selection logic and allow interactive tasks to be dispatched on any CPU.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Do not always assign the maximum time slice to interactive tasks, but
use the same value of the dynamic time slice for everyone.
This seems to prevent potential audio cracking when the system is over
commissioned.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The option --full-user is provided to delegate *all* scheduling
decisions to the user-space scheduler with no exception, including the
idle selection logic.
Therefore, make this option incompatible with --builtin-idle and
completely bypass the built-in idle selection logic when running in
full-user mode.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Provide a knob in scx_rustland_core to automatically turn the scheduler
into a simple FIFO when the system is underutilized.
This choice is based on the assumption that, in the case of system
underutilization (less tasks running than the amount of available CPUs),
the best scheduling policy is FIFO.
With this option enabled the scheduler starts in FIFO mode. If most of
the CPUs are busy (nr_running >= num_cpus - 1), the scheduler
immediately exits from FIFO mode and starts to apply the logic
implemented by the user-space component. Then the scheduler can switch
back to FIFO if there are no tasks waiting to be scheduled (evaluated
using a moving average).
This option can be enabled/disabled by the user-space scheduler using
the fifo_sched parameter in BpfScheduler: if set, the BPF component will
periodically check for system utilization and switch back and forth to
FIFO mode based on that.
This allows to improve performance of workloads that are using a small
amount of the available CPUs in the system, while still maintaining the
same good level of performance for interactive tasks when the system is
over commissioned.
In certain video games, such as Baldur's Gate 3 or Counter-Strike 2,
running in "normal" system conditions, we can experience a boost in fps
of approximately 4-8% with this change applied.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
This merge included additional commits that were supposed to be included
in a separate pull request and have nothing to do with the fifo-mode
changes.
Therefore, revert the whole pull request and create a separate one with
the correct list of commits required to implement this feature.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Dispatch non-interactive tasks on the CPU selected by the built-in idle
selection logic and allow interactive tasks to be dispatched on any CPU.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Do not always assign the maximum time slice to interactive tasks, but
use the same value of the dynamic time slice for everyone.
This seems to prevent potential audio cracking when the system is over
commissioned.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Provide a knob in scx_rustland_core to automatically turn the scheduler
into a simple FIFO when the system is underutilized.
This choice is based on the assumption that, in the case of system
underutilization (less tasks running than the amount of available CPUs),
the best scheduling policy is FIFO.
With this option enabled the scheduler starts in FIFO mode. If most of
the CPUs are busy (nr_running >= num_cpus - 1), the scheduler
immediately exits from FIFO mode and starts to apply the logic
implemented by the user-space component. Then the scheduler can switch
back to FIFO if there are no tasks waiting to be scheduled (evaluated
using a moving average).
This option can be enabled/disabled by the user-space scheduler using
the fifo_sched parameter in BpfScheduler: if set, the BPF component will
periodically check for system utilization and switch back and forth to
FIFO mode based on that.
This allows to improve performance of workloads that are using a small
amount of the available CPUs in the system, while still maintaining the
same good level of performance for interactive tasks when the system is
over commissioned.
In certain video games, such as Baldur's Gate 3 or Counter-Strike 2,
running in "normal" system conditions, we can experience a boost in fps
of approximately 4-8% with this change applied.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Report the amount of running tasks to stdout. This value also represents
the amount of active CPUs that are currently executing a task.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The dynamic slice boost is not used anymore in the code, so there is no
reason to keep evaluating it.
Moreover, using it instead of the static slice boost seems to make
things worse, so let's just get rid of it.
Fixes: 0b3c399 ("scx_rustland: introduce dynamic slice boost")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
When building with warnings enabled, a few obvious bugs are pointed out:
- We're not correctly calculating waker frequency
- We're not taking the min of avg_run_raw compared to max latency
- We're missing an element from sched_prio_to_weight
Fix these. With these changes, interactivity is seemingly improved. We
go from ~12 sec / turn -> 11 seconds / turn in the Civ 6 AI benchmark
with a 4 x nproc CPU hogging workload in the background. It's clear,
however, that we really need preemption.
Signed-off-by: David Vernet <void@manifault.com>
C SCX_OPS_ATTACH() and rust scx_ops_attach() macros were not calling
.attach() and were only attaching the struct_ops. This meant that all
non-struct_ops BPF programs contained in the skels were never attached which
breaks e.g. scx_layered.
Let's fix it by adding .attach() invocation the the attach macros.
Originally the implementation of function rsigmoid_u64 will
perform substraction even when the value of "v" equals to the value
of "max" , in which the result is certainly zero.
We can avoid this redundant substration by changing the condition from
">" to ">=" since we know when the value of "v" and "max" are equal
we can return 0 without any substract operation.
Now that the scx_ops_open!() macro is available, let's use it in scx_rusty to
cover all cases of when hotplug can happen.
Signed-off-by: David Vernet <void@manifault.com>
Now that the kernel exports the SCX_ECODE_ACT_RESTART exit code, we can
remove the custom hotplug logic from scx_rusty, and instead rely on the
built-in logic from the kernel. There's still a corner case that we're not
honoring: when a hotplug event happens on the init path. A future change will
address this as well.
Signed-off-by: David Vernet <void@manifault.com>
Introduce a low-power mode to force the scheduler to operate in a very
non-work conserving way, causing a significant saving in terms of power
consumption, while still providing a good level of responsiveness in the
system.
This option can be enabled in scx_rustland via the --low_power / -l
option.
The idea is to not immediately re-kick a CPU when it enters an idle
state, but do that only if there are no other tasks running in the
system.
In this way, latency-critical tasks can be still dispatched immediately
on the other active CPUs, while CPU-bound tasks will be forced to spend
more time waiting to be scheduled, basically enforcing a special CPU
throttling mechanism that affects only the tasks that are not latency
critical.
The consequence is a reduction in the overall system throughput, but
also a significant reduction of power consumption, that can be useful
for mobile / battery-powered devices.
Test case (using `scx_rustland -l`):
- play a video game (Terraria) while recompiling the kernel
- measure game performance (fps) and core power consumption (W)
- compare the result of normal mode vs low-power mode
Result:
Game performance | Power consumption |
------------+-----------------+-------------------+
normal mode | 60 fps | 6W |
low-power mode | 60 fps | 3W |
As we can see from the result the reduction of power consumption is
quite significant (50%), while the responsiveness of the game (fps)
remains the same, that means battery life can be potentially doubled
without significantly affecting system responsiveness.
The overall throughput of the system is, of course, affected in a
negative way (kernel build is approximately 50% slower during this
test), but the goal here is to save power while still maintaining a good
level of responsiveness in the system.
For this reason the low-power mode should be considered only in
emergency conditions, for example when the system is close to completely
run out of power or simply to extend the battery life of a mobile device
without compromising its responsiveness.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
During the initialization phase the scheduler needs to be aware of all
the available CPUs in the system (also those that are offline), in order
to create a proper per-CPU DSQ for all of them.
Otherwise, if some cores are offline, we may get errors like the
following:
swapper/7[0] triggered exit kind 1024:
runtime error (invalid DSQ ID 0x0000000000000007)
Backtrace:
scx_bpf_consume+0xaa/0xd0
bpf_prog_42ff1b9d1ac5b184_rustland_dispatch+0x12b/0x187
Change the code to configure the BpfScheduler object with the total
amount of CPUs available in the system and prevent such failure.
This fixes#280.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Always dispatch at least one task, even if all the CPUs are busy.
This small overcommitment allows to maximize the CPU utilization without
introducing bubbles in the scheduling and also without introducing
regressions in terms of resposiveness.
Before this change the average CPU utilization of a `stress-ng -c 8` on
an 8-cores system is around 95%. With this change applied the CPU
utilization goes up to a consistent 100%.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Add a method to TopologyMap to get the amount of online CPUs.
Considering that most of the schedulers are not handling CPU hotplugging
it can be useful to expose also this metric in addition to the amount of
available CPUs in the system.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Drop the global effective time-slice and use the more fine-grained
per-task time-slice to implement the dynamic time-slice capability.
This allows to reduce the scheduler's overhead (dropping the global time
slice volatile variable shared between user-space and BPF) and it
provides a more fine-grained control on the per-task time slice.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
If there is a higher priority task when running ops.tick(),
ops.select_cpu(), and ops.enqueue() callbacks, the current running tasks
yields its CPU by shrinking time slice to zero and a higher priority
task can run on the current CPU.
As low-cost, fine-grained preemption becomes available, default
parameters are adjusted as follows:
- Raise the bar for remote CPU preemption to avoid IPIs.
- Increase the maximum time slice.
- Gradually enforce the fair use of CPU time (i.e., ineligible duration)
Lastly, using CAS, we ensure that a remote CPU is preempted by only one
CPU. This removes unnecessary remote preemptions (and IPIs).
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Replace the BPF_MAP_TYPE_QUEUE with a BPF_MAP_TYPE_USER_RINGBUF to store
the tasks dispatched from the user-space scheduler to the BPF component.
This eliminates the need of the bpf() syscalls, significantly reducing
the overhead of the user-space->kernel communication and delivering a
notable performance boost in the overall system throughput.
Based on experimental results, this change allows to reduces the scheduling
overhead by approximately 30-35% when the system is overcommitted.
This improvement has the potential to make user-space schedulers based
on scx_rustland_core viable options for real production systems.
Link: https://github.com/libbpf/libbpf-rs/pull/776
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
scx_rusty's intention is to support hotplug by automatically restarting
whenever a hotplug event is encountered. Now that we're not trying to
consume a bogus DSQ in the rusty_dispatch() on a newly hotplugged CPU,
let's just remove offline tracking. It's really just there as a sanity
check, but it triggers if an offline task is made runnable during a
hotplug event before the ops.hotplug() callback has been invoked.
Signed-off-by: David Vernet <void@manifault.com>
There's currently a slight issue on existing kernels on the hotplug
path wherein we can start to receive scheduling callbacks on a CPU
before that CPU has received hotplug events. For CPUs going online, this
can possibly confuse a scheduler because it may not be expecting
anything to ever happen on that CPU, and therefore may do things that
could cause the scheduler to crash. For example, without this patch in
scx_rusty, we try to consume from a bogus DSQ that doesn't exist, which
causes ext.c to boot out the scheduler.
Though this issue will soon be fixed in ext.c, let's explicitly avoid
dispatching from an onlining CPU in rusty so that we properly support
hotplug on older kernels as well.
Signed-off-by: David Vernet <void@manifault.com>
scx_lavd implemented 32 and 64 bit versions of a base-2 logarithm
function. This is now also used in rusty. To avoid code duplication,
let's pull it into a shared header.
Note that there is technically a functional change here as we remove the
always inline compiler directive. We instead assume that the compiler
will know best whether or not to inline the function.
Signed-off-by: David Vernet <void@manifault.com>
In user space in rusty, the tuner detects system utilization, and uses
it to inform how we do load balancing, our greedy / direct cpumasks,
etc. Something else we could be doing but currently aren't, is using
system utilization to inform how we dispatch tasks. We currently have a
static, unchanging slice length for the runtime of the program, but this
is inefficient for all scenarios.
Giving a task a long slice length does have advantages, such as
decreasing the number of involuntary context switches, decreasing the
overhead of preemption by doing it less frequently, possibly getting
better cache locality due to a task running on a CPU for a longer amount
of time, etc. On the other hand, long slices can be problematic as well.
When a system is highly utilized, a CPU-hogging task running for too
long can harm interactive tasks. When the system is under-utilized,
those interactive tasks can likely find an idle, or under-utilized core
to run on. When the system is over-utilized, however, they're likely to
have to park in a runqueue.
Thus, in order to better accommodate such scenarios, this patch
implements a rudimentary slice scaling mechanism in scx_rusty. Rather
than having one global, static slice length, we instead have a dynamic,
global slice length that can be changed depending on system utilization.
When over-utilized, we go with a longer slice length, and vice versa for
when the system is under-utilized. With Terraria, this results in
roughly a 50% improvement in mean FPS when playing on an AMD Ryzen 9
7950X, while running Spotify, and stress-ng -c $((4 * $(nproc))).
Signed-off-by: David Vernet <void@manifault.com>
scx_rusty doesn't do terribly well with interactive workloads. In order
to improve the situation, this patch adds support for basic deadline
scheduling in rusty. This approach doesn't incorporate eligibility, and
simply uses a crude avg_runtime tracking approach to scaling a task's
deadline.
In a series of follow-on changes, we'll update the scheduler to use more
indicators for interactivity that affect both slice length, and deadline
calculation.
Signed-off-by: David Vernet <void@manifault.com>
To know the required CPU performance (e.g., frequency) demand, we keep
track of 1) utilization of each CPU and 2) _performance criticality_ of
each task. The performance criticality of a task denotes how critical it
is to CPU performance (frequency). Like the notion of latency
criticality, we use three factors: the task's average runtime, wake-up
frequency, and waken-up frequency. A task's runtime is longer, and its
two frequencies are higher; the task is more performance-critical
because it would be a bottleneck in the middle of the task chain.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Let's remove the extraneous copy pasting and use a lookup helper like we
do for task and pcpu context.
Signed-off-by: David Vernet <void@manifault.com>
A LoadEntity gets the load to transfer between two entities by taking
the minimum of their imbalances and reducing its abs value by
xfer_ratio.
In practice self.imbal(), the push node or domain, always has positive
imbalance and other.imbal(), the pull node or domain, always has
negative imbalance, so other.imbal() is always the minimum even though
the abs value of its imbalance might be greater than the abs value of
self.imbal(). It seems like the intent is to take the minimum of the
two absolute values instead to avoid overbalancing at the puller, so
make both values abs.
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Rusty's load balancer calculates load differently based on average
system CPU utilization in create_domain_hierarchy(). At >= 99.999%
utilization, load is the product of a task's weight and duty cycle;
below that, load is the same as the task's duty cycle.
populate_tasks_by_load(), however, always uses the product when
calculating per-task load so that in the sub-99.999% util case, load is
inflated, typically by a factor of 100 with a normal priority task.
Tasks look too heavy to migrate as a result because a single task would
transfer more load than the domain imbalance allows, leading to
significant imbalance in some cases.
Make populate_tasks_by_load() calculate task load the same way as
domain load, checking lb_apply_weight.
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
The current code replenishes the task's time slice whenever the task
becomes ops.running(). However, there is a case where such behavior can
starve the other tasks, causing the watchdog timeout error. One (if not
all) such case is when a task is preempted while running by the higher
scheduler class (e.g., RT, DL). In such a case, the task will be transit
in a cycle of ops.running() -> ops.stopping() -> ops.running() -> etc.
Whenever it becomes re-running, it will be placed at the head of local
DSQ and ops.running() will renew its time slice. Hence, in the worst
case, the task can run forever since its time slice is never exhausted.
The fix is assigning the time slice only once by checking if the time
slice is calculated before.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Provide a command line option to print the version of the scheduler and
the scx_rustland_core crate.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Given that rustland_core now supports task preemption and it has been
tested successfully, it's worhtwhile to cut a new version of the crate.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
In Rust c_char can be aliased to i8 or u8, depending on the particular
target architecture.
For example, trying to build scx_lavd on ppc64 triggers the following
error:
error[E0308]: mismatched types
--> src/main.rs:200:38
|
200 | let c_tx_cm: *const c_char = (&tx.comm as *const [i8; 17]) as *const i8;
| ------------- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `*const u8`, found `*const i8`
| |
| expected due to this
|
= note: expected raw pointer `*const u8`
found raw pointer `*const i8`
To fix this, consistently use c_char instead of assuming it corresponds
to i8.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
In _some_ kernel versions, loading scx_lavd fails with an error of
"bpf_rcu_read_unlock is missing". The usage of
bpf_rcu_read_lock/unlock() in proc_dump_all_tasks() is correct but the
bpf verifier still think bpf_rcu_read_unlock() is missing. The most
plausible reason so far is that the problematic kernel does not have a
commit 6fceea0fa59f ("bpf: Transfer RCU lock state between subprog
calls"), failing inter-procedural analysis between proc_dump_all_tasks()
and submit_task_ctx(). Thus, we force inline submit_task_ctx() (no
inter-procedural analysis by the verifier is necessary) for the time
being.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Only the very newest kernels support scx_bpf_cpuperf_set(). Let's update
scx_layered to accommodate older kernels as well.
Signed-off-by: David Vernet <void@manifault.com>
Looking at perf top it seems that the scheduler can spend a significant
amount of time iterating over the CPU topology/cpumask information,
especially when the system is running a significant amount of tasks:
2.57% scx_rustland [.] <scx_utils::cpumask::CpumaskIntoIterator as core::iter::traits::iterator::Iterator>::next
Considering that scx_rustland doesn't support CPU hotplugging yet (it
requires a full restart to properly handle CPU hotplug events), we can
completely avoid this overhead by caching a TopologyMap object at the
beginning, when the scheduler starts, instead of constantly
re-evaluating the CPU topology information.
This allows to reduce the scheduler overhead by ~5% CPU utilization
under heavy load conditions (from ~65% -> ~60%, according to top).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
This change adds `scx_bpf_cpuperf_cap`, `scx_bpf_cpuperf_cur` and
`scx_bpf_cpuperf_set` definitions that were recently introduced into
[`sched_ext`](https://github.com/sched-ext/sched_ext/pull/180). It adds
a `perf` field to `scx_layered` to allow for controlling performance per
layer.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
If a library creates threads, those threads will often have the same
name. If two different processes of different priority both use a
library, it may be that we want the library's threads in each process to
be put into different layers.
To support this, let's add the ability to filter not only by task name,
but also by process name via the task thread group leader's comm.
Tested by creating two executables named "foo" and "bar", which both
spawn a bunch of tasks named "exp_worker" that spin until being
interrupted. With this config: https://pastebin.com/Uz2phzxQ, the tasks
were correctly matched to the expected layers.
Signed-off-by: David Vernet <void@manifault.com>
We're currently cloning cpumasks returned by calls to {Core, Cache,
Node, Topology}::span(). If a caller needs to clone it, they can. Let's
not penalize the callers that just want to query the underlying cpumask.
Signed-off-by: David Vernet <void@manifault.com>
Some people have expressed confusion at this behavior. Let's be a bit
more explicit in the documentation.
Signed-off-by: David Vernet <void@manifault.com>
Provide a run-time option to disable task preemption.
This option can be used to improve the throughput of the CPU-intensive
tasks while still providing a good level of responsiveness in the
system.
By default preemption is enabled, to provide a higher level of
responsiveness to the interactive tasks.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Use the new scx_rustland_core dispatch flag RL_PREEMPT_CPU to allow
interactive tasks to preempt other tasks with scx_rustland.
If the built-in idle selection logic is enforced (option `-i`), the
scheduler prioritizes keeping tasks on the target CPU designated by this
logic. With preemption enabled, these tasks have a higher likelihood of
reusing their cached working set, potentially improving performance.
Alternatively, when tasks are dispatched to the first available CPU
(default behavior), interactive tasks benefit from running more promptly
by kicking out other tasks before their assigned time slice expires.
This potentially allows to increase the default time slice to higher
values in the future, to improve the overall throughput in the system
and, at the same time, still maintain a good level of responsiveness,
because interactive tasks are now able to run pretty much immediately,
independently on the remaining time slice of the other tasks that are
contending the CPUs in the system.
= Results =
Measuring the performance of the usual benchmark "playing a video game
while running a parallel kernel build in background" seems to give
around 2-10% boost in the fps with preemption enabled, depending on the
particular video game.
Results were obtained running a `make -j32` kernel build on a AMD Ryzen
7 5800X 8-Cores 16GB RAM, while testing video games such as Baldur's
Gate 3 (with a solid +10% fps), Counter Strike 2 (around +5%) and Team
Fortress 2 (+2% boost).
Moreover, some WebGL applications (such as
https://webglsamples.org/aquarium/aquarium.html) seem to benefit even
more with preemption enabled, providing up to a +15% fps boost.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Reserve some bits of the `cpu` attribute of a task to store special
dispatch flags.
Initially, let's introduce just RL_CPU_ANY to replace the special value
NO_CPU, indicating that the task can be dispatched on any CPU,
specifically the first CPU that becomes available.
This allows to keep the CPU value assigned by the builtin idle selection
logic, that can potentially be used later for further optimizations.
Moreover, having the possibility to specify dispatch flags gives more
flexibility and it allows to map new scheduling features to such flags.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
When I transitioned layered to using task local storage, I messed up
initializing the task ctx, not realizing we previously had a separate
variable that was initializing the hasmap entry. We need to initialize
the task's layer to -11, and also set refresh_layer to 1.
Signed-off-by: David Vernet <void@manifault.com>
We have a lot of boilerplate code where we create a cpumask, initialize
it, and then bpf_kptr_xchg() it into the map. In an effort to slightly
reduce the amount of boilerplate, let's create a helper that can
alleviate some of it.
Signed-off-by: David Vernet <void@manifault.com>
There are some random issues in the code, like unused variables, and bad
print formatters. I'm not sure why the compiler isn't consistently
complaining, but let's fix them.
Signed-off-by: David Vernet <void@manifault.com>
In scx_rusty, now that we have a complete view of the host's topology
thanks to the Topology crate, we can update our calls to
scx_bpf_create_dsq() to create the DSQ on the NUMA node of the domain.
It's unclear how much this will end up mattering for performance in the
typical case, but we might as well do the right thing given that host
topolgoy is static, and we have the information.
Signed-off-by: David Vernet <void@manifault.com>
* scx-lavd: preemption of a lower-priority task using kick cpu
When a task is enqueued to the global queue, the scheduler checks if
there is a lower priority task than the enqueued task. If so, it kicks
out the lower-priority task, hoping the newly enqueued task or another
higher-priority task runs on the kicked CPU. Kicking another CPU is
expensive as an IPI is involved, so the scheduler judiciously kicks the
CPU when its benefit (i.e., priority gap) is clear enough.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The scx_rusty scheduler does not support hotplug, and expects a static
host topology throughout its runtime. Though the kernel does have
support for detecting hotplug events, we currently don't detect this in
the kernel, nor surface it to user space when it happens. Now that we
have scx_bpf_exit(), we can gracefully exit the kernel in the event of a
hotplug, and communicate to user space that it should restart the
scheduler.
This patch adds that support to scx_rusty. Note that this assumes that
we're running on a recent enough kernel that has scx_bpf_exit(). If it
doesn't, then we instead just error out of the kernel scheduler and exit
the application.
Signed-off-by: David Vernet <void@manifault.com>
In rusty_select_cpu(), if a task is WAKE_SYNC, we'll currently migrate
the task to that CPU if there are any idle cores on the system. As in
[0], this condition is insufficient, as there could be idle cores
elsewhere on the system, but still tasks piled up on a single local DSQ.
Let's add a condition that the local DSQ has to be empty in order to
apply the WAKE_SYNC migration.
Before patch:
[void@maniforge src]$ hackbench
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 100 messages of 100 bytes
Time: 0.433
With patch:
[void@maniforge src]$ hackbench
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 100 messages of 100 bytes
Time: 0.035
Signed-off-by: David Vernet <void@manifault.com>
Change the upper bound of ineligible duration (LAVD_ELIGIBLE_TIME_MAX).
The updated (2x increased) upper bound reflects the distribution of
tasks' eligible_delta_ns better.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Change the calculation of the run_frequence using the wait_period from
the last time the task yielded CPU to this time when the task is
running. The old implementation measures the time interval between the
last stopping and the current running and increases run_freq without
reason.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Change the last_{start/stop/wait/wake}_clk in task_ctx to
last_{running/stopping/quiescent/runnable}_clk, matching with state
transition names. In addition, add comments and reorder fields in
task_ctx for readability.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a task runs more than once (running <->stopping) within one
runnable-quiescent transition, accumulate runtime of multiple runnings
for statistics. This helps to get the task's runtime per schedule when
supposing that a huge time slice is given, which is what we want to
collect for scheduling decisions.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Remove runtime_boost using slice_boost_prio. Without slice_boost_prio,
the scheduler collects the exact time slice.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Let's change the function names of update_stat_for_*() as follow their
callers for consistency and less confusion.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The run_time_boosted_ns calculation requires updated slice_boost_prio,
so updating slice_boost_prio should be done before updating
run_time_boosted_ns.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
In scx_layered, we're using a BPF_MAP_TYPE_HASH map (indexed by pid)
rather than a BPF_MAP_TYPE_TASK_STORAGE, to track local storage for a
task. As far as I can tell, there's no reason we need to be doing this.
We never access the map from user space, and we're even passing a
struct task_struct * to a helper subprog to look up the task context
rather than only doing it by pid.
Using a hashmap is error prone for this because we end up having to
manually track lifecycles for entries in the map rather than relying on
BPF to do it for us. For example, BPF will automatically free a task's
entry from the map when it exits. Let's just use TLS here rather than a
hashmap to avoid issues from this (e.g. we've observed the scheduler
getting evicted because we're accessing a stale map entry after a task
has been destroyed).
Reported-by: Valentin Andrei <vandrei@meta.com>
Signed-off-by: David Vernet <void@manifault.com>
transit_task_stat() is now tracking the same runnable, running, stopping,
quiescent transitions that sched_ext core already tracks and always returns
%true. Let's remove it.
LAVD_TASK_STAT_ENQ is tracking a subset of runnable task state transitions -
the ones which end up calling ops.enqueue(). However, what it is trying to
track is a task becoming runnable so that its load can be added to the cpu's
load sum.
Move the LAVD_TASK_STAT_ENQ state transition and update_stat_for_enq()
invocation to ops.runnable() which is called for all runnable transitions.
Note that when all the methods are invoked, the invocation order would be
ops.select_cpu(), runnable() and then enqueue(). So, this change moves
update_stat_for_enq() invocation before calc_when_to_run() for
put_global_rq(). update_stat_for_enq() updates taskc->load_actual which is
consumed by calc_greedy_ratio() and thus affects calc_when_to_run().
Before this patch, calc_greedy_ratio() would use load_actual which doesn't
reflect the last running period. After this patch, the latest running period
will be reflected when the task gets queued to the global queue.
The difference is unlikely to matter but it'd probably make sense to make it
more consistent (e.g. do it at the end of quiescent transition).
After this change, transit_task_stat() doesn't detect any invalid
transitions.
scx_lavd tracks task state transitions and updates statistics on each valid
transition. However, there's an asymmetry between the runnable/running and
stopping/quiescent transitions. In the former, the runnable and running
transitions are accounted separately in update_stat_for_enq() and
update_stat_for_run(), respectively. However, in the latter, the two
transitions are combined together in update_stat_for_stop().
This asymmetry leads to incorrect accounting. For example, a task's load
should be added to the cpu's load sum when the task gets enqueued and
subtracted when the task is no longer runnable (quiescent). The former is
accounted correctly from update_stat_for_enq() but the latter is done
whenever the task stops. A task can transit between running and stopping
multiple times before becoming quiescent, so the asymmetry can end up
subtracting the load of a task which is still running from the cpu's load
sum.
This patch:
- introduces LAVD_TASK_STAT_QUIESCENT and updates transit_task_stat() so
that it can handle all valid state transitions including the multiple back
and forth transitions between two pairs - QUIESCENT <-> ENQ and RUNNING
<-> STOPPING.
- restores the symmetry by moving load adjustments part from
update_stat_for_stop() to new update_stat_for_quiescent().
This removes a good chunk of ignored transitions. The next patch will take
care of the rest.
lookup_task_ctx(), lookup_task_ctx_may_fail(), and lookup_layer()
currently don't have the static keyword, so BPF may treat them as a
global function. We don't actually want these to be global, so let's
make them static to avoid confusing the verifier.
Signed-off-by: David Vernet <void@manifault.com>
The old approach is mapping [0, maximum latency criticliy] to [-boost
range, boost range). This approach is easily affected by one outlier
maximum value and suffers from the integer truncation error. The new
approach divides the range into two -- [minimum latency criticality,
average latency criticality) and [average latency criticality, maximum
latency criticality] -- and maps them into [boost range/2, 0) and [0,
-boost range/2), respectively,
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Replace a latency weight arrary to more skewed one, which is the
inverse of sched_prio_to_slice_weight. It turns out more skewed one
works better under highly CPU-overloaded cases since it gives a longer
deadline to non-latency-critical tasks.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
As the calculated runtime increases by considering the number of
full-time slice consumption, increase the upper bound
(LAVD_LC_RUNTIME_MAX) of runtime to be considered in latency
calculation. Also, add LAVD_SLICE_BOOST_MAX_PRIO to avoid
slice_boost_prio dropping to zero suddenly.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Take slice_boost_prio -- how many times a full time slice was consumed
-- into consideration in calculating run_time_ns (runtime per schedule).
This improve the accuracy especially when a task is overscheduled and
its time slice is reduced for enforcing fairness.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Returning prev_cpu after picking an idle CPU will cause the idle CPU
stall because the idle core was already punched out from the idle mask
by the scx core so it is no longer idle from the scx core's point of
view.
This fix conducts the idle core selection at the last step so it never
return prev_cpu after picking the idle core.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
get_task_ctx() and try_get_task_ctx() were added for common error
handling for task lookup failure. Since idle "swapper" task is not under
sched_ext, try_get_task_ctx() is added for the case such that idle task
can be searched.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
We don't need to test SCX_WAKE_SYNC because SCX_WAKE_SYNC should only be
set when SCX_WAKE_TTWU is set.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
scx_lavd is a BPF scheduler that implements an LAVD (Latency-criticality
Aware Virtual Deadline) scheduling algorithm. While LAVD is new and
still evolving, its core ideas are 1) measuring how much a task is
latency critical and 2) leveraging the task's latency-criticality
information in making various scheduling decisions (e.g., task's
deadline, time slice, etc.). As the name implies, LAVD is based on the
foundation of deadline scheduling. This scheduler consists of the BPF
part and the rust part. The BPF part makes all the scheduling decisions;
the rust part loads the BPF code and conducts other chores (e.g.,
printing sampled scheduling decisions).
There were a few issues, e.g. us still mentioning the infeasible weights
problem, and arguments using underscores despite clap rendering them
with dashes. Let's fix them up.
Signed-off-by: David Vernet <void@manifault.com>
As described in https://bugzilla.kernel.org/show_bug.cgi?id=218109,
https://github.com/sched-ext/scx/issues/147 and
https://github.com/sched-ext/sched_ext/issues/69, AMD chips can
sometimes report fully disabled CPUs as offline, which causes us to
count them when looking at /sys/devices/system/cpu/possible.
Additionally, systems can have holes in their active CPU maps. For
example, a system with CPUs 0, 1, 2, 3 possible, may have only 0 and 2
active. To address this, we need to do a few things:
1. Update topology.rs to be clear that it's returning the number of
_possible_ CPUs in the system. Also update Topology to only record
online CPUs when creating its span and iterating over sysfs when
creating domains. It was previously trying to record when a CPU was
online, but this was actually broken as the topology directory isn't
present in sysfs when the CPU is offline.
2. Schedulers should not be relying on nr_possible_cpus for anything
other than interacting with per-CPU data (e.g. for stats extraction),
or e.g. verifying maximum sizes of statically sized arrays in BPF. It
should _not_ be used for e.g. performing load calculations, etc. With
that said, we'll also need to update schedulers to not rely on the
nr_possible_cpus figure being exported by the topology crate. We do
that for rusty in this patch, but don't fix any of the others other
than updating how they call topology.rs.
3. Account for the fact that LLC IDs may be non-contiguous. For example,
if there is a single core in an LLC, then if we assign LLC IDs to
domains, then the domain IDs won't be contiguous. This doesn't fit
our current model which is used by e.g. infeasible_weights.rs. We'll
update some of the code in rusty to accomodate this, but we'll need
to do more.
4. Update schedulers to properly reset themselves in the event of a
hotplug event. We'll take care of that in a follow-on change.
Signed-off-by: David Vernet <void@manifault.com>
If a CPU is offline, it could cause an LLC to go offline, which could
cause us to have non-contiguous domain IDs. Right now, a few places in
code assume contiguous domain IDs, such as in the infeasible weights
crate. Let's update domain.rs and load_balaance.rs to do the right
thing. We'll fix the others later.
Signed-off-by: David Vernet <void@manifault.com>
We implement functions or(), and(), and xor() for cpumasks, but we
should also implement the bitwise ops for those operations in case
people prefer that syntax.
Signed-off-by: David Vernet <void@manifault.com>
We're iterating from min..max cpu in cpus_online(), but that's not
inclusive of the max CPU. Let's also include that so we don't think that
last CPU is offline.
Signed-off-by: David Vernet <void@manifault.com>
Most of the schedulers assume that the amount of possible CPUs in the
system represents the actual number of CPUs available.
This is not always true: some CPUs may be offline or certain CPU models
(AMD CPUs for example) may include unavailable CPUs in this number.
This can lead to sub-optimal performance or even errors in the scheduler
(see for example [1][2]).
Ideally, we need to attack this issue in a more generic way, such as
having a proper API provided by a C library, that can be used by all
schedulers and the topology Rust module (scx_utils crate).
But for now, let's try to mitigate most of the common sub-optimal cases
separately inside each scheduler.
For rustland we can apply some mitigations both in select_cpu() (for the
BPF part) and in the user-space part:
- the former is fixed in the sched-ext kernel by commit 94dc0c01b957
("scx: Use cpu_online_mask when resetting idle masks"). However,
adding an extra check `cpu < num_possible_cpus` in select_cpu(),
allows to properly support AMD CPUs, even with kernels that don't
have the cpu_online_mask fix yet (this doesn't always guarantee the
validity of cpu, but it should be enough to mitigate the majority of
the potential sub-optimal cases, without introducing any significant
overhead)
- the latter can be fixed relying on topology.span(), instead of
topology.nr_cpus(), to count the amount of available CPUs in the
system.
[1] https://github.com/sched-ext/sched_ext/issues/69
[2] https://github.com/sched-ext/scx/issues/147
Link: 94dc0c01b9
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Given the complexity of migrating load between nodes (we're doing four
nested loops), we should add a comment explaining what we're doing. This
commit does that. In addition, we use a VecDeque to store (and then
restore) push nodes and push domains so that we can re-add them to their
respective lists in load-sorted order rather than reverse-load-sorted
order. This allows us to avoid having to do unnecessary right-shifts
every time a push object is re-added to its containing list.
Signed-off-by: David Vernet <void@manifault.com>
Fixing alignment, moving a couple bail! calls around, and adding a
missing break from move_between_nodes() that lets us bail out of a loop
early.
Signed-off-by: David Vernet <void@manifault.com>
As Tejun pointed out in review, the disadvantage of using
push/pull/balanced lists is that if the domains inside the nodes are
balanced, we won't be able to push load between them. I'd originally
done it that way both as an optimization, but also to allow me to
iterate over the lists of pushable and pullable domains mutably. That
was addressed in the prior commit, but the nodes themselves were still
put into 3 buckets.
I think this is generally just a cleaner way of doing things, so let's
just collapse the nodes into a flat list as well. This prevents us from
having to coalesce the lists, std::mem::swap them, etc.
Signed-off-by: David Vernet <void@manifault.com>
Tejun pointed out that a possible issue exists in the current
implementation, wherein if you have two NUMA nodes that are imbalanced,
but their domains are internally balanced, we'll fail to migrate between
them if all nodes are in the balanced_nodes list.
To address this, let's just use a single global list for all types of
domains, and do checking internally for imbalances. The reason it was
done this way in the first place was to allow me to mutably iterate over
both vectors in a nested loop. The way around that is to just use loop
{} and push/pop domains from the list.
We could do the same thing for the NUMA nodes themselves, which are also
in 3 separate lists in the LoadBalancer. We'll do that in a subsequent
commit.
Signed-off-by: David Vernet <void@manifault.com>
In scx_rusty, a CPU that is going to go idle will attempt to steal tasks
from remote domains when its domain has no tasks to run, and a remote
domain has at least greedy_threshold enqueued tasks. This stealing is
temporary, but of course has a cost in that the CPU that's stealing the
task may cause it to suffer from cache misses, or in the case of
multi-node machines, remote NUMA accesses and working sets split across
multiple domains.
Given the higher cost of x NUMA work stealing, let's add a separate flag
that lets users tune the threshold for doing cross NUMA greedy task
stealing.
Signed-off-by: David Vernet <void@manifault.com>
In order to use the new consume_raw() API we need to depend on a version
of libbpf-rs that is not released yet.
Apparently adding such dependency may introduce a potential dependency
conflict with libbpf-sys.
Therefore, revert this change and go back to the previous consume() API.
One a new version of libbpf-rs will be out we can update all our
dependencies to use the new libbpf-rs and re-apply this patch to
scx_rustland_core.
Fixes: 7c8c5fd ("scx_rustland_core: use new consume_raw() libbpf-rs API")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
In line with rustland's focus on prioritizing interactive tasks, set the
default base time slice to 5ms.
This allows to mitigate potential audio craking issues or system lags
when the system is overloaded or under memory pressure condition (i.e.,
https://github.com/sched-ext/scx/issues/96#issuecomment-1978154324).
A downside of this change is to introduce potential regressions in the
throughput of CPU-intensive workloads, but in such scenarios rustland
may not be the optimal choice and alternative schedulers may be
preferred.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Some high-priority tasks may have a weight too high, that can
potentially disrupt the slice boost optimization logic, causing
interactive tasks to be less responsive.
In line with rustland's focus on prioritizing interactive tasks, prevent
giving too much CPU bandwidth to such high-priority tasks by limiting
the maximum task weight to 1000.
This allows to maintain a good level of system responsiveness even in
presence of tasks with a really high priority.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Use the new consume_raw() API provided by libbpf-rs with
https://github.com/libbpf/libbpf-rs/pull/680.
This allows to be more precise and efficient at processing tasks
consumed from the BPF ring buffer.
NOTE: the new consume_raw() API is not available yet in any official
release of the libbpf-rs crate, but cargo allows to pick versions
directly from git. This slightly increases the build time of
scx_rustland_core and the schedulers based on this crate (since we need
to recompile libbpf-rs from source), but we can re-add a proper
versioned dependency once the libbpf-rs is out.
TODO: this new API also offers the possibility to consume multiple items
from the BPF ring buffer with a single call to consume_raw(). This could
be investigated and implemented as a potential future enhancement.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The current topology.rs crate assumes that all cores have unique core
IDs in a system. This need not be the case, such as in certain Intel
Xeon processors which reuse core IDs in different NUMA nodes. Let's
update the crate to assume unique core IDs only per socket.
Signed-off-by: David Vernet <void@manifault.com>
We removed the debug!() output that was previously present in main.rs. Let's
add more debug!() output that helps debug the current LB hierarchy.
Signed-off-by: David Vernet <void@manifault.com>
The scx_rusty load balancer is currently no longer exporting statistics such as
domain load averages, load sums, etc. Now that we're also balancing by NUMA,
we'll need a way to hierarchically illustrate load balancing statistics. This
patch adds support for that.
Signed-off-by: David Vernet <void@manifault.com>
updating stats printing
Signed-off-by: David Vernet <void@manifault.com>
Users may want to toggle whether tasks can be temporarily sent to idle CPUs on
remote NUMA nodes. By default, we want it to be disabled as a task spanning
multiple NUMA nodes will end up having its working set spanning both nodes,
which is probably not desirable. However, in case a workload really wants to
encourage work conservation, let's add a flag that allows them to toggle it.
Signed-off-by: David Vernet <void@manifault.com>
scx_rusty currently pushes tasks to idle cores if the direct greedy threshold
is exceeded, even if the core is on a remote NUMA node. This behavior is
probably not desired in most scenarios. The worst that will happen if a task is
pushed to an idle core in the same node is some L3 cache miss traffic, but for
multiple NUMA nodes, it could cause the task to have its working set span
multiple nodes.
Let's disable direct greedy work stealing across NUMA nodes. A future commit
will add a flag that's disabled by default, and let's users turn this on if
they really want to encourage work conservation.
Signed-off-by: David Vernet <void@manifault.com>
Right now, scx_rusty has no notion of domains spanning NUMA nodes, and makes no
distinction when making load balancing decisions, or work stealing. This can
cause problems on multi-NUMA machines, as load balancing and work stealing
across NUMA nodes has significantly different cost from across L3 cache
boundaries.
In order to better support multi-NUMA machines, this commit adds another layer
to the rusty load balancer, which balances across NUMA nodes using a different
cost function from balancing across domains. Load balancing now takes place
over the span of two passes:
1. In the first pass, we fix imbalances across NUMA nodes by moving tasks
between domains across those NUMA node boundaries. We require a load
imbalance of at least 17% in order to move load at this stage. The ratio of
load imbalance we attempt to adjust (50%) and the maximum amount of load
we're allowed to push out of a domain (50%) is still the same as when
balancing between domains inside a NUMA node, but this is easy to tune with
the current setup.
2. Once we've balanced across NUMA nodes, we iterate over all nodes and balance
between the domains within each NUMA node. The cost function here is the
same as what it has been thus far: we require at least a 5% imbalance in
order to trigger load balancing.
There are a few additional changes / improvements to load balancing in this
commit:
1. NUMA nodes and domains are now ordered according to their load by using
SortedVec objects. We were previously using BTreeMap keyed by load, but this
was suboptimal due to the fact that it doesn't allow duplicate entries.
2. We're no longer exporting load balancing statistics as a vector of data such
as load sums, averages, and imbalances. This is instead all encapsulated in
the load balancing hierarchy we setup in lb.load_balance(). These statistics
are not yet exported, but they will be in a subsequent commit.
One of the issues with this commit is that it does introduce some
almost-identical logic that somehow begs to be deduplicated. For example, when
we balance between NUMA nodes, the logic for iterating over push nodes and
pushing to pull nodes is very similar to the logic of iterating over push
domains and pull domains when balancing within a node. It may be that this can
be improved.
The following are some benchmarks run on an Intel Xeon Gold 6138 (2 x 40 core
processor):
kcompile
--------
On Commit a27648c74210 ("afs: Fix setting of mtime when creating a
file/dir/symlink"):
1. make allyesconfig
2. make -j $(nproc) built-in.a
3. make -j clean
4. goto 2
Runtime
-------
o-----------o-----------o----------o
| scx_rusty | CFS | Delta |
---------o-----------o-----------o----------o
Mean | 562.688s | 566.085s | -.6% |
---------o-----------o-----------o----------o
Variance | 0.54387 | 0.72431 | -24.9% |
---------o-----------o-----------o----------o
o-----------o-----------o----------o
| rusty NUMA| rusty ORIG| Delta |
---------o-----------o-----------o----------o
Mean | 562.688s | 563.209s | -.092% |
---------o-----------o-----------o----------o
Variance | 0.54387 | 0.42038 | 29.38% |
---------o-----------o-----------o----------o
scx_rusty with NUMA awareness clearly beats CFS, but only barely beats
scx_rusty without it. This isn't necessarily super surprising given that
this is kcompile, which has very poor front-end CPU locality. Further
experimentation with toggling the cost function for performing
migrations may improve this further.
CPU util
--------
o-----------o-----------o----------o
| scx_rusty | CFS | Delta |
---------o-----------o-----------o----------o
Mean | 7654.25% | 7551.67% | 1.11% |
---------o-----------o-----------o----------o
Variance | 165.35714 | 158.3333 | 4.436% |
---------o-----------o-----------o----------o
o-----------o-----------o----------o
| rusty NUMA| rusty ORIG| Delta |
---------o-----------o-----------o----------o
Mean | 7654.25% | 7641.57% | 0.1659% |
---------o-----------o-----------o----------o
Variance | 165.35714 | 1230.619 | -86.5% |
---------o-----------o-----------o----------o
As expected, CPU util is quite a bit higher with scx_rusty than it is
with CFS. Further experiments that could be interesting are always
enabling direct-greedy stealing between domains within a NUMA node, and
then comparing rusty NUMA and rusty ORIG. rusty NUMA prevents stealing
between NUMA nodes, so this would show whether the locality introduced
by NUMA awareness appropriately offsets the loss of work conservation.
Major PFs
---------
o-----------o-----------o----------o
| scx_rusty | CFS | Delta |
---------o-----------o-----------o----------o
Mean | 5332 | 3950 | 36.566% |
---------o-----------o-----------o----------o
Variance | 6975.5 | 5986.333 | 16.5237% |
---------o-----------o-----------o----------o
o-----------o-----------o----------o
| rusty NUMA| rusty ORIG| Delta |
---------o-----------o-----------o----------o
Mean | 5332 | 5336.5 | -.084% |
---------o-----------o-----------o----------o
Variance | 6975.5 | 955.5 | 630.03% |
---------o-----------o-----------o----------o
Also as expected, major page faults are far highe higher with scx_rusty
than with CFS. This is expected even with NUMA awareness, given that
scx_rusty is still less sticky than CFS.
Further experiments that could be interesting are tuning the threshold
for which we perform x NUMA migrations to try and keep this value even
lower. The rate of major page faults between rusty NUMA and rusty ORIG
were very close, though rusty NUMA was a bit lower.
Signed-off-by: David Vernet <void@manifault.com>
More cleanup of scx_rusty. Let's move the LoadBalancer out of rusty.rs and into
its own file. It will soon be extended quite a bit to support multi-NUMA and
other multivariate LB cost functions, so it's time to clean things up and split
it out.
Signed-off-by: David Vernet <void@manifault.com>
rusty.rs is growing a bit unwieldy. We're going to want to update its load
balancing logic somewhat significantly to account for multi-NUMA and other cost
functions, so let's start cleaning the code up so that things are more
logically segmented and easier to work with.
To start, we move the Tuner and DomainGroup/Domain objects into their own
modules.
Signed-off-by: David Vernet <void@manifault.com>