This change makes scx_rusty mempolicy aware. When a process uses
set_mempolicy it can change NUMA memory preferences and cause
performance issues when tasks are scheduled on remote NUMA nodes. This
change modifies task_pick_domain to use the new helper method that
returns the preferred node id.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Estimating the service time from run time and frequency is not
incorrect. However, it reacts slowly to sudden changes since it relies
on the moving average. Hence, we directly measure the service time to
enforce fairness.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Instead of performing domain mask checking inside
"find_first_candidate()" every time, check whether the tasks within push
domain are abled to run on pull domain by performing the mask check at
vector generation stage.
This way can also avoid repeated computation generated by the same
(task, pull_dom) pair as they'll try to check whether the pull domain is
in the task domain mask.
Also since whether a task is a kworker won't change in time, we can
perform the check earlier and put it in the filter, too.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
We always use nr_cpu_ids to represent the maximum CPU id returned by
scx_bpf_nr_cpu_ids().
Replace cpu_max with nr_cpu_ids to be more consistent with the rest of
the code.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
We can rely on scx_bpf_nr_cpu_ids() to create all the possible per-CPU
DSQs, eliminating the need for the hard-coded limit MAX_CPUS.
In this way scx_bpfland can support the same amount of CPUs that the
kernel can handle.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Instead of constantly checking the need to drain tasks from the DSQs of
the offline CPUs, provide an atomic flag to notify when there are tasks
to be drained from the offline CPUs.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Refine the safeguard mechanism to avoid generating too many interactive
tasks in the system, which could nullify the effect of the
interactive/regular task classification.
The safeguard mechanism operates by pausing the promotion of new tasks
to interactive status during the task wake-up process, whenever the
number of interactive tasks in the priority queue exceeds a specific
limit (set to 4x the number of online CPUs).
Halting the promotion of additional interactive tasks allows to
prioritize those already classified as interactive, thereby preventing
potential "bursts" of excessive interactive tasks in the system.
This refines the mitigation already provided by commit 640bd562
("scx_bpfland: prevent tasks from abusing interactive priority boost").
Fixes: 640bd562 ("scx_bpfland: prevent tasks from abusing interactive priority boost")
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Always assign the maximum time slice if there are idle CPUs in the
system.
Otherwise, double the task's unused time slice to reward tasks that use
less CPU time and at the same time refill the time slice of the tasks
every time they're dispatched.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
sched_ext is about to be merged upstream. There are some compatibility
breaking changes and we're making the current sched_ext/for-6.11
1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on
!wakeup enqueues") the baseline.
Tag everything except scx_mitosis as 1.0.0. As scx_mitosis is still in early
development and is currently temporarily disabled, only the patchlevel is
bumped.
Sync from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap: Pick idle
CPU for direct dispatch on !wakeup enqueues")
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-6.11
- cgroup support hasn't landed in the upstream kernel yet. This most likely
will happen in a few weeks. For the time being, disable scx_flatcg,
scx_pair and scx_mitosis.
- Compat macro for DSQ task iterator dropped. This is now a part of
the baseline.
- scx_bpf_consume() isn't upstream yet. BPF interfacing side is still being
discussed. Dropped example usage from tools/sched_ext. None of the
practical schedulers use it, so this should be fine for now.
- scx_bpf_cpu_rq() added.
- AUTOATTACH workaround for newer libbpf versions added.
A task can become a runnable on any task's context not only its waker
task. Thus, we should not count wake-up on unrelated task's context.
With this commit, the scheduler can (much more) accurately detect
waker-wakee relationsships.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The prior approach using the sum of weights gives too much penalty to
nice tasks with large nice values. With this commit, the time slice is
determined by the number of runnable tasks regardless of nice priority.
Note that the fairness will still be enforced based on tasks' nice
priorities (weights).
Signed-off-by: Changwoo Min <changwoo@igalia.com>
To easily distinguish, let's initialize the current logical clock to
zero (not the current physical time). Also, avoid the deadline
calculation being zero by adding +1 here and there.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The priority boost for interactive tasks can be exploited to render the
system nearly unresponsive by creating numerous tasks that constantly
switch between wait/wakeup states.
For example, stress tests like `hackbench -l 10000` can significantly
degrade system responsiveness.
To mitigate this, limit the number of interactive tasks added to the
priority queue to 4x the number of online CPUs.
This simple approach appears to be a quite effective at identifying
potential spam of "fake" interactive tasks, while still prioritizing
legitimate interactive tasks.
Additionally, periodically refresh the interactive status of the tasks
based on their most recent average of voluntary context switches,
preventing the interactive status from being too "sticky".
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Avoid dispatching per-CPU kthreads directly, since this may cause
interactivity problems or unfairness, for example if there are too many
softirqs being scheduled (e.g., in presence of high RX network traffic
or when running certain stress tests, like hackbench).
Moreover, in order to help with testing and benchmarks, introduce the
option --local-kthread, that allows to restore the old behavior if
enabled.
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
When updating the task vruntime, ensure the time slice delta is always a
positive value. Failing to do so may cause the global vruntime to
increase excessively due to overflows.
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Periodically report the amount of online CPUs to stdout.
The online CPUs are initially evaluated looking at the online cpumask,
then the value is updated in the .cpu_offline() / .cpu_online()
callbacks.
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Keep track of the CPUs that are running interactive tasks and report
their amount to stdout.
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
The correct default value of slice_ns 5ms, not 5s.
This change doesn't really make any difference in practice, since these
values are changed by the Rust part when the scheduler is started, but
it's good to keep this aligned to the proper values for consistency.
Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
This commit changes the use of a physical clock to a virtual, logical
clock in calculating deadlines.
- The virtual current clock advances upon a task's running to its
virtual deadline.
- When enqueuing a task, its virtual deadline from the virtual current
clock is calculated.
With the above two changes, this guarantees that there is no such task
whose virtual deadline is smaller than the virtual current clock. This
means any enqueuing task can compete with any other already enqueued
tasks. This allows a latency-critical task to be immediately scheduled
if needed.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Every time we need to dispatch a task re-evalate its time slice as:
(unused_time_slice + min_time_slice) / 2
This allows to refill the time slice for tasks that haven't used much of
their previously assigned time, improving fairness.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Make sure to always classify interactive tasks, even when the system is
not fully utilized. This ensures that if the system suddenly becomes
overloaded, we already know which tasks need to be dispatched to the
priority DSQ.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Tasks are consumed from various DSQs in the following order:
per-CPU DSQs => priority DSQ => shared DSQ
Tasks in the shared DSQ may be starved by those in the priority DSQ,
which in turn may be starved by tasks dispatched to any per-CPU DSQ.
To mitigate this, record the timestamp of the last task scheduling event
both from the priority DSQ and the shared DSQ.
If the starvation threshold is exceeded without consuming a task, the
scheduler will be forced to consume a task from the corresponding DSQ.
The starvation threshold can be adjusted using the --starvation-thresh
command line parameter (default is 5ms).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
There is no need to RCU protect the cpumask for the offline CPUs: it is
created once when the scheduler is initialized and it's never
deallocated.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Reduce the default time slice down to 5ms for a faster reaction and
system responsiveness when the system is overcomissioned.
This also helps to provide a more predictable level of performance.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Always use direct CPU dispatch for kthreads, there is no need to treat
kthreads in a special way, simply reuse direct CPU dispatch to
prioritize them.
Moreover, change direct CPU dispatches to use scx_bpf_dispatch_vtime(),
since we may dispatch multiple tasks to the same per-CPU DSQ now.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Small refactoring of the idle CPU selection logic:
- optimize idle CPU selection for tasks that can run on a single CPU
- drop the built-in idle selection policy and completely rely on the
custom one
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
We are incorrectly using the SMT idle cpumask to find any idle CPU, fix
by using the generic idle cpumask.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Implement CPU hotplugging in scx_bpfland without restarting the
scheduler.
The idle selection logic has been updated to consider online CPUs.
Additionally, a cpumask for offline CPUs has been introduced. Tasks
that have been dispatched to the DSQs associated with offline CPUs are
consumed by the other CPUs that are still online.
Moreover, the dependency on the Topology crate is temporarily dropped
and instead, /sys/devices/system/cpu/smt/active is used to determine if
SMT should be taken into account during idle selection. The Topology
crate will be re-introduced later when scx_bpfland will gain more
topology-aware capabilities.
This fixes#406.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The stats map in scx_rusty is a BPF_MAP_TYPE_PERCPU_ARRAY, with its size
determined by num_possible_cpus(). Initializing it with nr_cpu_ids() can
result in errors such as:
Error: Failed to zero stat
Caused by:
number of values 6 != number of cpus 8
Fix by using num_possible_cpus() to initialize it.
Fixes: 263e02f6 ("rusty: Use nr_cpu_ids instead of nr_cpus_possible")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
With commit 5d20f89a ("scheds-rust: build rust schedulers in sequence"),
schedulers are now built serially one after the other to prevent meson
and cargo from forking NxN parallel tasks.
However, this change has made building a single scheduler much more
cumbersome, due to the chain of dependencies.
For example, building scx_rusty using the specific meson target would
still result in all schedulers being built, because they all depend on
each other.
To address this issue, introduce the new meson build option
`serialize=true|false` (default is false).
This option allows to disable the schedulers' build chain, restoring the
old behavior.
With this option enabled, it is now possible to build just a single
scheduler, parallelizing the cargo build properly, without triggering
the build of the others. Example:
$ meson setup build -Dbuildtype=release -Dserialize=false
$ meson compile -C build scx_rusty
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The competition window was 7.5 msec, half of the targeted latency.
However, it is too wide for some workloads, so unrelated tasks may
compete with each other. Hence, it is tightened to about 1 msec with
LAVD_LAT_WEIGHT_SHIFT to avoid unnecessary competition.
Also, when a system is overloaded, now the time space is stretched more
aggressively (i.e., lat_prio^2) when a task's latency priority is low
(high value).
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Introduce a tunable to set a limit of the minimum vruntime that is used
when a task is dispatched, as:
vtime_min = vtime_now - slice_lag_ns
Increasing the time slice lag can make interactive tasks even more
responsive at the cost of starving regular and newly created tasks.
Default time slice lag is 0.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Overview
========
This scheduler is derived from scx_rustland, but it is fully implemented
in BFP with minimal user-space Rust part to process command line
options, collect metrics and logs out scheduling statistics.
Unlike scx_rustland, all scheduling decisions are made by the BPF
component.
Motivation
==========
The primary goal of this scheduler is to act as a performance baseline
for comparison with scx_rustland, allowing for a better assessment of
the overhead caused by kernel/user-space interactions.
It can also be used to deploy prototypes initially tested in the
scx_rustland scheduler. In fact, this scheduler is expected to
outperform scx_rustland, due to the elimitation of the kernel/user-space
overhead.
Scheduling policy
=================
scx_bpfland is a vruntime-based sched_ext scheduler that prioritizes
interactive workloads. Its scheduling policy closely mirrors
scx_rustland, but it has been re-implemented in BPF with some small
adjustments.
Tasks are categorized as either interactive or regular based on their
average rate of voluntary context switches per second: tasks that exceed
a specific voluntary context switch threshold are classified as
interactive.
Interactive tasks are prioritized in a higher-priority DSQ, while
regular tasks are placed in a lower-priority DSQ. Within each queue,
tasks are sorted based on their weighted runtime, using the built-in scx
vtime ordering capabilities (scx_bpf_dispatch_vtime()).
Moreover, each task gets a time slice budget. When a task is dispatched,
it receives a time slice equivalent to the remaining unused portion of
its previously allocated time slice (with a minimum threshold applied).
This gives latency-sensitive workloads more chances to exceed their time
slice when needed to perform short bursts of CPU activity without being
interrupted (i.e., real-time audio encoding / decoding workloads).
Results
=======
According to the initial test results, using the same benchmark "playing
a videogame while recompiling the kernel", this scheduler seems to
provide a +5% improvement in the frames-per-second (fps) compared to
scx_rustland, with video games such as Cyberpunk 2077, Counter-Strike 2
and Baldur's Gate 3.
Initial test results indicate that this scheduler offers around a +5%
improvement in frames-per-second (fps) compared to scx_rustland when
using the benchmark "playing a video game while recompiling the kernel".
This improvement was observed in games such as Cyberpunk 2077,
Counter-Strike 2, and Baldur's Gate 3.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The old approach was too conservative in running a new task, so when a
fork-heavy workload competes with a CPU-bound workload, the fork-heavy
one is starved. The new approach solves the starvation problem by
inheriting parent's statistics. It seems a good (at least better than
old) guess how a new task will behave.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When the system is highly loaded with compute-intensive tasks, the old
setting chokes latensive-intensive tasks, so loosen the dealine when the
system is overloaded (> 100% utilization).
Signed-off-by: Changwoo Min <changwoo@igalia.com>
When the lavd is loaded, it prints out its build id. It helps to easily
identify what version it is when testing.
```
01:56:54 [INFO] scx_lavd scheduler is initialized (build ID: 0.8.1-g98a5fa8595430414115c504857cea1a458393838-dirty x86_64-unknown-linux-gnu)
```
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The synchronization for mitosis is a bit ad-hoc, working around lack of
atomics in BPF. This commit updates the logic to use READ/WRITE_ONCE and
compiler barriers to get the behaviors we want.
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
When someone is testing schedulers, we often have to ask what version
the scheduler is running as. Now that we can access the build ID from
rust schedulers, let's update scx_rusty to print the build ID when rusty
first starts running.
This results in output such as the following:
```
[void@maniforge scx]$ rusty
19:04:26 [INFO] Running scx_rusty (build ID: 0.8.1-g2043d2537f37c8d75753bb65eb75bca965067564 x86_64-unknown-linux-gnu/debug)
19:04:26 [INFO] NUMA[00] mask= 0b11111111111111111111111111111111
19:04:26 [INFO] DOM[00] mask= 0b00000000111111110000000011111111
19:04:26 [INFO] DOM[01] mask= 0b11111111000000001111111100000000
19:04:26 [INFO] Rusty scheduler started!
```
Signed-off-by: David Vernet <void@manifault.com>
This is a second attempt to optimize tunables for a wider range of
games.
1) LAVD_BOOST_RANGE increased from 14 (35%) to 40 (100% of nice range).
Now the latency priority (biased by nice value) will decide which
task should run first . The nice value will decide the time slice.
2) The first change will give higher priority to latency-critical task
compared to before. For compensation, the slice boost also increased
(2x -> 3x).
Signed-off-by: Changwoo Min <changwoo@igalia.com>
This change adds a new module to the scx_utils crate that provides a
log recorder for metrics-rs. The log recorder will log all metrics to
the console at a configurable interval in an easy to read format. Each
metric type will be displayed in a separate section. Indentation will
be used to show the hierarchy of the metrics. This results in a more
verbose output, but it is easier to read and understand.
scx_rusty was updated to use the log recorder and all explicit metric
logging was removed.
Counters will show the total count and the rate of change per second.
Counters with an additional label, like `type` in
`dispatched_tasks_total` in rusty, will show the count, rate, and
percentage of the total count.
Counters:
dispatched_tasks_total: 65559 [1344.8/s]
prev_idle: 44963 (68.6%) [966.5/s]
wsync_prev_idle: 15696 (23.9%) [317.3/s]
direct_dispatch: 2833 (4.3%) [35.3/s]
dsq: 1804 (2.8%) [21.3/s]
wsync: 262 (0.4%) [4.3/s]
direct_greedy: 1 (0.0%) [0.0/s]
pinned: 0 (0.0%) [0.0/s]
greedy_idle: 0 (0.0%) [0.0/s]
greedy_xnuma: 0 (0.0%) [0.0/s]
direct_greedy_far: 0 (0.0%) [0.0/s]
greedy_local: 0 (0.0%) [0.0/s]
dl_clamped_total: 1290 [20.3/s]
dl_preset_total: 514 [1.0/s]
kick_greedy_total: 6 [0.3/s]
lb_data_errors_total: 0 [0.0/s]
load_balance_total: 0 [0.0/s]
repatriate_total: 0 [0.0/s]
task_errors_total: 0 [0.0/s]
Gauges will show the last set value:
Gauges:
slice_length_us: 20000.00
Histograms will show the average, min, and max. The histogram will be
reset after each log interval to avoid memory leaks, since the data
structure that holds the samples is unbounded.
Histograms:
cpu_busy_pct: avg=1.66 min=1.16 max=2.16
load_avg node=0: avg=0.31 min=0.23 max=0.39
load_avg node=0 dom=0: avg=0.31 min=0.23 max=0.39
processing_duration_us: avg=297.50 min=296.00 max=299.00
Signed-off-by: Jose Fernandez <josef@netflix.com>
In some games (e.g., Elden Ring), it was observed that preemption
happens much less frequently. The reason is that tasks' runtime per
schedule is similar, so it does not meet the existing criteria. To
alleviate the problem, the following three tunables are revised:
1) Smaller LAVD_PREEMPT_KICK_MARGIN and LAVD_PREEMPT_TICK_MARGIN help to
trigger more preemption.
2) Smaller LAVD_SLICE_MAX_NS works better especially 250 or 300Hz
kernels.
3) Longer LAVD_ELIGIBLE_TIME_MAX purturbes time lines less frequently.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Origin assignment of the variable ridx is equivalent to comparing
between "ridx" and "wids - MAX_PIDS". Using u64 max library helper
function to perform the comparison and provide better readability.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Check whether the BalanceState of pull_dom.load inside function
try_find_move_task is actually the variant NeedsPull. It'll perform task
migration in abit more conservative manner when the system is under high
loading situation.
Experiments are performed when the system is compiling linux kernel and
undergoing a large amount of I/O operation at the same time using fio.
The result showns that before the modification, there're 12,6617 times
of task migrations system wide. After the modification, there're 11,5419
times of task migrations system wide.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
In scx_rlfifo, we're currently using topo.nr_cpus_possible() to
determine how many possible CPU IDs we could have on the system. To
properly support systems whose disabled CPUs may be in the middle of the
range of possible CPU IDs, let's instead use topo.nr_cpu_ids() so that
we don't accidentally dispatch to an invalid DSQ.
Signed-off-by: David Vernet <void@manifault.com>
In scx_rusty, we're currently using topo.nr_cpus_possible() to determine
how many possible CPU IDs we could have on the system. scx_rusty already
accounts for offlined CPUs, so to properly support systems whose
disabled CPUs may be in the middle of the range of possible CPU IDs,
let's instead use topo.nr_cpu_ids().
Signed-off-by: David Vernet <void@manifault.com>
In some cases, a host may have an odd topology where there are gaps in
CPU IDs (including between possible CPUs). A common pattern in
schedulers is to perform allocations for every possible CPU ID, such as
creating a per-cpu DSQ. In order to avoid confusing schedulers, let's
track the maximum CPU ID on a system so that we can return the number of
CPU IDs on the system which is inclusive of gaps.
We also update scx_rustland in this change to accommodate the fact that
we no longer export nr_cpus_possible() from TopologyMap.
Signed-off-by: David Vernet <void@manifault.com>
We need a layer of indirection between the stats collection and their
output destinations. Currently, stats are only printed to stdout. Our
goal is to integrate with various telemetry systems such as Prometheus,
StatsD, and custom metric backends like those used by Meta and Netflix.
Importantly, adding a new backend should not require changes to the
existing stats code.
This patch introduces the `metrics` [1] crate, which provides a
framework for defining metrics and publishing them to different
backends.
The initial implementation includes the `dispatched_tasks_count`
metric, tagged with `type`. This metric increments every time a task is
dispatched, emitting the raw count instead of a percentage. A monotonic
counter is the most suitable metric type for this use case, as
percentages can be calculated at query time if needed. Existing logged
metrics continue to print percentages and remain unchanged.
A new flag, `--enable-prometheus`, has been added. When enabled, it
starts a Prometheus endpoint on port 9000 (default is false). This
endpoint allows metrics to be charted in Prometheus or Grafana
dashboards.
Future changes will migrate additional stats to this framework and add
support for other backends.
[1] https://metrics.rs/
Signed-off-by: Jose Fernandez <josef@netflix.com>
Use the function can_task1_kick_task2() to replace places which also
checking the comp_preemption_info between two cpus for better
consistency.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
It seems that we are not updating `is_idle` when we find an idle CPU
with pick_cpu(), causing unnecessary rescheduling events when
select_cpu() is called.
To resolve this, ensure that the is_idle state is correctly set.
Additionally, always ensure that the task is dispatched to the local DSQ
immediately upon finding (and reserving) an idle CPU.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
- clean up u63 and u32 usages in structures to reduce struct size
- refactoring pick_cpu() for readability
Signed-off-by: Changwoo Min <changwoo@igalia.com>
The required CPU performance (cpuperf) was set to 1024 (100%) when the
CPU utilization was 100%. When a sudden load spike happens, it makes the
system adapt slowly in the next interval.
The new scheme always reserves some headroom in advance, so it sets
cpuperf to 1024 when the CPU utilization reaches to 85%. This gives some
room to adapt in advance.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Modify the execution sequence before lookup operation for new_domc. If
new_dom_id == NO_DOM_FOUND, lookup operation for new_domc is definitely
going to fail so we don't have to wait until we found that new_domc is
NULL, clearing of cpumask and return operation should be done directly
in that case.
Plus we should avoid using try_lookup_dom_ctx outside the context of
lookup_dom_ctx, as it can keep the interface's consistency.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
The rusty dispatch logic is a bit unnecessarily convoluted. Let's clean it up
so that we're just comparing dom ids rather than iterating over arrays nested
inside of pcpu context.
Signed-off-by: David Vernet <void@manifault.com>
Right now, the SCX_WAKE_SYNC logic in rusty is very primitive. We only check to
see if the waker CPU's runqueue is empty, and then migrate the wakee there if
so. We'll want to expand this to be more thorough, such as:
- Checking to see if prev_cpu and waker_cpu share the same LLC when determining
where to migrate
- Check for whether SCX_WAKE_SYNC migration helps load imbalance between cores
- ...
Right now all of that code is just a big blob in the middle of
rusty_select_cpu(). Let's pull it into its own function to improve readability,
and also add some logic to stay on prev_cpu if it shares an LLC with the waker.
Signed-off-by: David Vernet <void@manifault.com>
It seems that task_set_domain() is nearly at the point where it can
cause the verifier to get confused and think that it's exceeding the
number of available instructions per program. I've seen this a number of
times when making small changes to task_set_domain(), and it's once
again happened @vax-r (I-Hsin Cheng) made a small cleanup change to
rusty in https://github.com/sched-ext/scx/pull/362.
To avoid this, let's just make dom_xfer_task() a separate global program
so that the verifier doens't have to worry about branch pruning, etc
depending on what the caller does. This should hopefully make
task_set_domain() (and its callers) much less brittle.
Signed-off-by: David Vernet <void@manifault.com>
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_cpuperf_*(). The open helper
macros now check the existence of scx_bpf_cpuperf_cap() and abort if not.
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_HAS_CPUMASKS(). The open helper macros
now check the existence of scx_bpf_nr_cpu_ids() and abort if not.
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_SCX_KICK_IDLE. The open helper macros
now check the existence of SCX_KICK_IDLE and abort if not.
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_switch_call(). The open helper
macros now check the existence of SCX_OPS_SWITCH_PARTIAL and abort if not.
With commit 786ec0c0 ("scx_rlfifo: schedule all tasks in user-space")
all the scheduling decisions are now happening in user-space. This also
bypasses the built-in idle selection logic, delegating the CPU selection
for each task to the user-space scheduler.
The easiest way to distribute tasks across the available CPUs is to
simply allow to dispatch them on the first CPU available.
In this way the scheduler becomes usable in practical scenarios and at
the same time it also maintains its simplicity.
This allows to spread all tasks across all the available CPUs
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Disable all the BPF optimization shortcuts by default and force all
tasks to be processed by the user-space scheduler.
Given that the primary goal of this scheduler is to offer a
straightforward and intuitive example for experimental purposes, this
change simplifies the process for individuals looking to experiment,
allowing them to apply changes to user-space code and quickly observe
the effects, without dealing with any in-kernel optimizations.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
No functional change, just add some comments to better describe the
parameters used when initializing the main BpfScheduler object.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The bpf_ prefix is used for BPF API. Rename bpf_log2() to u32_log2() and
bpf_log2l() to u64_log2(). While at it, relocate them below compiler
directive helpers.
Keep track of the maximum vruntime among all tasks and flush them if the
difference between the maximum and minimum vruntime exceeds slice_ns.
This helps to prevent excessive starvation, as every task is guaranteed
to be dispatched within the slice_ns time limit.
Tested-by: Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>