Commit Graph

773 Commits

Author SHA1 Message Date
Andrea Righi
46ddca6bd5 scx_bpfland: report task time slice to stdout
Periodically report to stdout samples of the effective time slice
applied to tasks.

While one could determine this metric by examining the max slice_ns and
nr_waiting metrics, directly reporting it to stdout allows users to
quickly identify what is happening and it provides a clearer overview of
the scheduling behavior.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-22 22:01:49 +02:00
Andrea Righi
c1d93d2a00 scx_bpfland: drop kthread dispatches metric
Dispatching per-CPU kthreads directly is disabled by default, reporting
this metric can generate some confusion (since it is always 0), and even
if local kthread dispatches are enabled, they should be still considered
as regular direct dispatches (there is no difference in practice).

Therefore, merge direct kthread dispatches into direct dispatches and
drop the separate nr_kthread_dispatches metric.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-22 22:01:49 +02:00
Andrea Righi
a5f1d6b595 scx_bpfland: show average amount of tasks waiting to be dispatched
Periodically report the average amount of tasks sitting in the priority
and shared DSQs.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-22 22:01:45 +02:00
Andrea Righi
5908a985bc scx_bpfland: adjust task time slice based on the amount of waiting tasks
Scale the task's time slice based on the average amount of tasks that
are currently waiting to be dispatched.

Use a moving average for the amount of waiting tasks to smooth out
potential spikes caused by temporary bursts of tasks piling in the wait
queues.

This was initially modeled in scx_rustland and it seems to work pretty
well also in scx_bpfland now.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-22 21:53:25 +02:00
Changwoo Min
af75d147c8
Merge pull request #443 from multics69/lavd-vtime
scx_lavd: overhaul the virtual deadline algorithm
2024-07-21 18:00:57 +09:00
Changwoo Min
a9aab6b229 scx_lavd: fix typo
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-21 17:58:44 +09:00
Changwoo Min
add96f0e18 scx_lavd: do not maintain ineligible runnable tasks separately
With all the other optimizations and tunings, it turns out that maintaining
two runqueues has more harm than good.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-20 17:49:12 +09:00
Changwoo Min
827187d213 scx_lavd: adjust ineligible duration according to task's lat_cri
Further depenalize above-average latency-critical tasks and penalize
further below-avergage latency-critical tasks in ineligibility duration.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-20 17:37:27 +09:00
Changwoo Min
c653622ed9 scx_lavd: add LAVD_VDL_LOOSENESS_FT in calculating virtual deadline
LAVD_VDL_LOOSENESS_FT represents how loose the deadline is. The smaller
value means the deadline is tighter. While it is unlikely to be tuned,
let's keep it as a tunable for now.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-20 12:00:50 +09:00
Changwoo Min
e94070d5ca scx_lavd: remove LAVD_BOOST_*
These are no longer necessary after directly using latency criticality.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-20 11:53:20 +09:00
Changwoo Min
43f0fcb87c scx_lavd: removed unused LAVD_LOAD_FACTOR_*
These are no longer necessary after remnoving load factor calculation.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-20 11:51:12 +09:00
David Vernet
4f11e2abe2
layered: Don't dispatch to LO_FALLBACK_DSQ
Non-kthreads with custom affinities in non-open layers are dispatched into a
LO_FALLBACK_DSQ, with the idea being that they're penalized for their custom
affinities. When a host is fully utilized, these tasks can end up being starved
due to LO_FALLBACK_DSQ being consumed only when there are no other layers to
consume from. In internal workloads at Meta, we've observed that this can
happen in practice.

Longer term, we can probably address this by implementing layer weights and
applying that to fallback DSQs to avoid starvation. For now, let's just
dispatch them to HI_FALLBACK_DSQ to avoid this starvation issue.

Signed-off-by: David Vernet <void@manifault.com>
2024-07-19 19:14:18 -05:00
Changwoo Min
3924ebaa4d scx_lavd: properly synchronize taskc->vdeadline_log_clk
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-20 01:41:29 +09:00
Changwoo Min
02ad43d116 scx_lavd: directly use p->scx.weight instead load_ideal
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-20 00:25:11 +09:00
Changwoo Min
c955caefd8 scx_lavd: drop sys_load_factor
In theory, sys_load_factor should not be necessary since we do not
stretch the time space anymore.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-20 00:10:29 +09:00
Changwoo Min
67a6deb983 scx_lavd: use lat_cri instead of lat_prio universally
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-19 23:56:51 +09:00
Daniel Hodges
b98a9f56a8 scx_layered: Add separate module for metrics
Refactor the main module for scx_layered to move metrics into a separate
module. This change does no functional differences, only code structure.
This will make it a little easier to navigate the logic in the main
scheduler code.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-07-19 07:40:24 -07:00
Changwoo Min
6f10d6907c scx_lavd: drop sched_prio_to_slice_weight[] table
Use p->scx.weight instead.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-19 22:39:01 +09:00
Changwoo Min
034303f00f scx_lavd: consider starvation factor in determining latency criticality
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-19 22:17:50 +09:00
Daniel Hodges
d974690b5d
Merge pull request #435 from vax-r/remove_skip_while
scx_rusty: Remove skip_while in find_first_candidate
2024-07-19 08:38:58 -04:00
Changwoo Min
99e0d21c3c scx_lavd: drop the runtime factor in calculating latency criticality
That is okay since the runtime is considered in calculating a virtual
deadline. A shorter runtime will result in a tighter deadline linearly.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-19 17:28:40 +09:00
Changwoo Min
b90599e967 scx_lavd: do not inherit parent's properties
If inheriting the parent's properties, a new fork task tends to be too
prioritized. That is, many parent processes, such as `make,` are a bit
more latency-critical than average.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-19 15:29:13 +09:00
Andrea Righi
c4eb3ce7b4 scx_bpfland: introduce dynamic nvcsw threshold
Instead of using a static value to classify tasks based on their average
amount of voluntary context switches, try to periodically evaluate an
optimal threshold, based on a global average of voluntary context
switches among of all the running tasks.

Tasks with an average amount of voluntary context switches greater than
the global average will be classified as interactive.

The global average is evaluated as an exponentially weighted moving
average (EWMA), as:

  avg(t) = avg(t - 1) * 0.75 - task_avg(t) * 0.25

This approach is more efficient than iterating through all tasks and it
helps to prevent rapid fluctuations that may be caused by bursts of
voluntary context switch events.

The dynamic nvcsw threshold enables a more precise adjustment of
the classification criteria to swiftly respond to global system changes:
tasks can be quickly classified as interactive, but if the system
experiences too many interactive events, the criteria for maintaining
interactive status become stricter. This creates a natural selection
process where only the most deserving tasks remain interactive.

Additionally, introduce the new option `--nvcsw-max-thresh N`, which
allows to extend or restrict the fluctuation range of the global average
threshold for voluntary context switches.

Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-18 19:03:25 +02:00
Changwoo Min
78d96a6fb6 scx_lavd: advance clock by reverse proportional to the system load
Advancing the clock slower when overloaded gives more opportunities for
latency-critical tasks to cut in the run queue. Controlling the clock
better reflects the actual load than the prior approach of stretching
the time-space when overloaded.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-18 15:53:38 +09:00
Changwoo Min
9bc20f9160 scx_lavd: maintain ineligible runnable tasks separately
We now maintain two run queues—an eligible run queue (DSQ) and an
ineligible run queue (rbtree)—sorted by the task's virtual deadline.
When the eligible run queue is empty, or the ineligible run queue has
not been consumed for too long (e.g., 15 msec), a task in the ineligible
run queue is moved to the eligible run queue for execution. With these
two queues, we have a better admission control.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-17 23:46:11 +09:00
I Hsin Cheng
2525b94af4 scx_rusty: Remove unused variable
Remove unused variable "has_preferred_dom".

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-17 20:30:17 +08:00
I Hsin Cheng
bf2f0fbf35 scx_rusty: Remove skip_while in find_first_candidate
Followed commit 1c3b563, move the checking of task.migrated.get() into
the vector filter. In this way, we can remove the skip_while() call in
find_first_candidate().

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-17 20:27:12 +08:00
Changwoo Min
55e19ea5df scx_lavd: do not prioritize a wake-up task in ops.select_cpu()
This is a prep for adding an ineligible DSQ.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-17 11:16:02 +09:00
Changwoo Min
c84b73e971 scx_lavd: rename LAVD_GLOBAL_DSQ to LAVD_ELIGIBLE_DSQ
This is a prep to add a global ineligible dsq.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-17 10:34:34 +09:00
Daniel Müller
565aec3662 rust: Update libbpf-rs & libbpf-cargo to 0.24
Update libbpf-rs & libbpf-cargo to 0.24. Among other things, generated
skeletons now contain directly accessible map and program objects, no
longer necessitating the use of accessor methods. As a result, the risk
for mutability conflicts is reduced greatly.

Signed-off-by: Daniel Müller <deso@posteo.net>
2024-07-16 11:48:52 -07:00
Daniel Hodges
27122a8a00 scx_rusty: refactor mempolicy handling bpf code and load balancing
This change refactors some of the helper methods for getting the
preferred node for tasks using mempolicy. The load balancing logic in
try_find_move_task is updated to allow for a filter, which is used to
filter for tasks with a preferred mempolicy.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-07-16 09:40:00 -07:00
Daniel Hodges
43a263aa75 scx_rusty: Use preferred node mask with balancer
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-07-16 08:11:19 -07:00
Daniel Hodges
bab6e9523c scx_rusty: Add mempolicy checks to rusty
This change makes scx_rusty mempolicy aware. When a process uses
set_mempolicy it can change NUMA memory preferences and cause
performance issues when tasks are scheduled on remote NUMA nodes. This
change modifies task_pick_domain to use the new helper method that
returns the preferred node id.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-07-16 08:11:19 -07:00
Changwoo Min
971bb2e024 scx_lavd: pretty formatting for ineligible duration
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-16 23:54:15 +09:00
Changwoo Min
adfbf3934c scx_lavd: tuning the max ineligible duration
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-16 23:52:23 +09:00
Changwoo Min
eff444516f scx_lavd: directly measure service time for eligibility enforcement
Estimating the service time from run time and frequency is not
incorrect. However, it reacts slowly to sudden changes since it relies
on the moving average. Hence, we directly measure the service time to
enforce fairness.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-16 23:48:26 +09:00
I Hsin Cheng
1c3b563caf scx_rusty: Pre-check task domain mask with pull domain mask
Instead of performing domain mask checking inside
"find_first_candidate()" every time, check whether the tasks within push
domain are abled to run on pull domain by performing the mask check at
vector generation stage.

This way can also avoid repeated computation generated by the same
(task, pull_dom) pair as they'll try to check whether the pull domain is
in the task domain mask.

Also since whether a task is a kworker won't change in time, we can
perform the check earlier and put it in the filter, too.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-16 21:48:06 +08:00
Tejun Heo
51334b5c4d Bump versions for 1.0.1 release 2024-07-15 13:21:52 -10:00
Andrea Righi
8e7a526356 scx_bpfland: use nr_cpu_ids for consistency
We always use nr_cpu_ids to represent the maximum CPU id returned by
scx_bpf_nr_cpu_ids().

Replace cpu_max with nr_cpu_ids to be more consistent with the rest of
the code.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-15 08:44:35 +02:00
Andrea Righi
33d06f653b scx_bpfland: get rid of the MAX_CPUS hard-coded limit
We can rely on scx_bpf_nr_cpu_ids() to create all the possible per-CPU
DSQs, eliminating the need for the hard-coded limit MAX_CPUS.

In this way scx_bpfland can support the same amount of CPUs that the
kernel can handle.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-15 00:17:30 +02:00
Andrea Righi
b80ef7d8eb scx_bpfland: optimize offline CPU handling
Instead of constantly checking the need to drain tasks from the DSQs of
the offline CPUs, provide an atomic flag to notify when there are tasks
to be drained from the offline CPUs.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-15 00:17:23 +02:00
Andrea Righi
0530706710 scx_bpfland: properly initialize the nvcsw metrics
Initialize the number of voluntary context switches metrics in the local
task storage.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-15 00:16:10 +02:00
Andrea Righi
bf4ad23599 scx_bpfland: refine interactive tasks flood safeguard
Refine the safeguard mechanism to avoid generating too many interactive
tasks in the system, which could nullify the effect of the
interactive/regular task classification.

The safeguard mechanism operates by pausing the promotion of new tasks
to interactive status during the task wake-up process, whenever the
number of interactive tasks in the priority queue exceeds a specific
limit (set to 4x the number of online CPUs).

Halting the promotion of additional interactive tasks allows to
prioritize those already classified as interactive, thereby preventing
potential "bursts" of excessive interactive tasks in the system.

This refines the mitigation already provided by commit 640bd562
("scx_bpfland: prevent tasks from abusing interactive priority boost").

Fixes: 640bd562 ("scx_bpfland: prevent tasks from abusing interactive priority boost")
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-15 00:11:34 +02:00
Andrea Righi
eb1cf0e670 scx_bpfland: improve task time slice evaluation
Always assign the maximum time slice if there are idle CPUs in the
system.

Otherwise, double the task's unused time slice to reward tasks that use
less CPU time and at the same time refill the time slice of the tasks
every time they're dispatched.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-14 23:24:24 +02:00
Tejun Heo
3ae76acd12
Merge pull request #424 from sched-ext/sync-upstream-kernel-and-bump-to-1.0
Sync to upstream kernel and bump to 1.0
2024-07-14 07:00:38 -10:00
Changwoo Min
5b2112dd81
Merge pull request #421 from multics69/lavd-metrics
scx_lavd: improve time slice and waker freq calculation
2024-07-14 18:49:36 +09:00
Tejun Heo
761ec142ce Bump most versions to 1.0.0
sched_ext is about to be merged upstream. There are some compatibility
breaking changes and we're making the current sched_ext/for-6.11
1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on
!wakeup enqueues") the baseline.

Tag everything except scx_mitosis as 1.0.0. As scx_mitosis is still in early
development and is currently temporarily disabled, only the patchlevel is
bumped.
2024-07-12 11:34:14 -10:00
Tejun Heo
54c487731a Update to vmlinux-v6.10-rc2-g1edab907b57d.h
Sync to vmlinux.h from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap:
Pick idle CPU for direct dispatch on !wakeup enqueues"). This most likely
will be the commit which will be merged during the upcoming kernel v6.11
merge window.

Unfortunately, this is a compatibility breaking change. As the size of
bpf_iter_scx_dsq is reduced, schedulers that use the iterator - scx_lavd and
scx_layered - won't be able to run on older kernels. Likewise, older
binaries from before this commit won't be able to run on newer kernels.
2024-07-12 11:13:34 -10:00
Tejun Heo
f261d0f037 Sync from kernel - 1edab907b57d
Sync from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap: Pick idle
CPU for direct dispatch on !wakeup enqueues")

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-6.11

- cgroup support hasn't landed in the upstream kernel yet. This most likely
  will happen in a few weeks. For the time being, disable scx_flatcg,
  scx_pair and scx_mitosis.

- Compat macro for DSQ task iterator dropped. This is now a part of
  the baseline.

- scx_bpf_consume() isn't upstream yet. BPF interfacing side is still being
  discussed. Dropped example usage from tools/sched_ext. None of the
  practical schedulers use it, so this should be fine for now.

- scx_bpf_cpu_rq() added.

- AUTOATTACH workaround for newer libbpf versions added.
2024-07-12 11:08:41 -10:00
Changwoo Min
512bd143a5 scx_lavd: count only related tasks in calculating waker_freq
A task can become a runnable on any task's context not only its waker
task.  Thus, we should not count wake-up on unrelated task's context.
With this commit, the scheduler can (much more) accurately detect
waker-wakee relationsships.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-12 22:51:09 +09:00
Changwoo Min
95733f63ab scx_lavd: calculate time slice as a function of run queue length
The prior approach using the sum of weights gives too much penalty to
nice tasks with large nice values. With this commit, the time slice is
determined by the number of runnable tasks regardless of nice priority.
Note that the fairness will still be enforced based on tasks' nice
priorities (weights).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-12 22:45:22 +09:00
Changwoo Min
00fdc1d949
Merge pull request #417 from multics69/lavd-vdeadline
scx_lavd: improve virtual deadline and current clock handling
2024-07-12 14:05:44 +09:00
Changwoo Min
d4bc92bea7 scx_lavd: print lat_cri to output
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-12 13:23:56 +09:00
Changwoo Min
4c5c564523 scx_lavd: initial current logical clock to zero
To easily distinguish, let's initialize the current logical clock to
zero (not the current physical time). Also, avoid the deadline
calculation being zero by adding +1 here and there.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-12 10:15:54 +09:00
Andrea Righi
640bd562ff scx_bpfland: prevent tasks from abusing interactive priority boost
The priority boost for interactive tasks can be exploited to render the
system nearly unresponsive by creating numerous tasks that constantly
switch between wait/wakeup states.

For example, stress tests like `hackbench -l 10000` can significantly
degrade system responsiveness.

To mitigate this, limit the number of interactive tasks added to the
priority queue to 4x the number of online CPUs.

This simple approach appears to be a quite effective at identifying
potential spam of "fake" interactive tasks, while still prioritizing
legitimate interactive tasks.

Additionally, periodically refresh the interactive status of the tasks
based on their most recent average of voluntary context switches,
preventing the interactive status from being too "sticky".

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 16:13:55 +02:00
Andrea Righi
1babb2b92d scx_bpfland: prevent per-CPU kthreads starving other tasks
Avoid dispatching per-CPU kthreads directly, since this may cause
interactivity problems or unfairness, for example if there are too many
softirqs being scheduled (e.g., in presence of high RX network traffic
or when running certain stress tests, like hackbench).

Moreover, in order to help with testing and benchmarks, introduce the
option --local-kthread, that allows to restore the old behavior if
enabled.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 16:13:48 +02:00
Andrea Righi
c3ebdd338f scx_bpfland: prevent slice delta overflow
When updating the task vruntime, ensure the time slice delta is always a
positive value. Failing to do so may cause the global vruntime to
increase excessively due to overflows.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 15:58:01 +02:00
Andrea Righi
f59aa52fe7 scx_bpfland: expose the amount of online CPUs
Periodically report the amount of online CPUs to stdout.

The online CPUs are initially evaluated looking at the online cpumask,
then the value is updated in the .cpu_offline() / .cpu_online()
callbacks.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 15:58:01 +02:00
Andrea Righi
3a47b484af scx_bpfland: report interactive tasks to stdout
Keep track of the CPUs that are running interactive tasks and report
their amount to stdout.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 15:58:01 +02:00
Andrea Righi
1a1a16b9e9 scx_bpfland: fix typo in slice_ns definition
The correct default value of slice_ns 5ms, not 5s.

This change doesn't really make any difference in practice, since these
values are changed by the Rust part when the scheduler is started, but
it's good to keep this aligned to the proper values for consistency.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 15:58:01 +02:00
Changwoo Min
bdbfeb9fd1 scx_lavd: use logical current clock for virtual deadlines
This commit changes the use of a physical clock to a virtual, logical
clock in calculating deadlines.

- The virtual current clock advances upon a task's running to its
  virtual deadline.

- When enqueuing a task, its virtual deadline from the virtual current
  clock is calculated.

With the above two changes, this guarantees that there is no such task
whose virtual deadline is smaller than the virtual current clock. This
means any enqueuing task can compete with any other already enqueued
tasks. This allows a latency-critical task to be immediately scheduled
if needed.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 22:41:56 +09:00
Changwoo Min
408ea7892c scx_lavd: induce sched_prio_to_latency_weight from slice weight
So sched_prio_to_latency_weight is removed.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:37:21 +09:00
Changwoo Min
bd964acff6 scx_lavd: deprioritize a newly forked task in latency
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:36:32 +09:00
Changwoo Min
48debe416e scx_lavd: tuning the deadline equation under high load
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:35:54 +09:00
Changwoo Min
c72e063680 scx_lavd: do not use lat_prio_to_greedy_thresholds
With other optimizations, lat_prio_to_greedy_thresholds is not effective
any more.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:35:01 +09:00
Changwoo Min
9ed488798e scx_lavd: use task's runtime to determine its deaddline
It has an effect of further perferring shorter jobs.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:34:25 +09:00
Changwoo Min
e081b2a294 scx_lavd: rename LAVD_MAX_CAS_RETRY to LAVD_MAX_RETRY
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:33:56 +09:00
Andrea Righi
995577762a scx_bpfland: refill task time slice
Every time we need to dispatch a task re-evalate its time slice as:

 (unused_time_slice + min_time_slice) / 2

This allows to refill the time slice for tasks that haven't used much of
their previously assigned time, improving fairness.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-06 14:07:24 +02:00
Andrea Righi
6a64182ef2 scx_bpfland: always classify interactive tasks
Make sure to always classify interactive tasks, even when the system is
not fully utilized. This ensures that if the system suddenly becomes
overloaded, we already know which tasks need to be dispatched to the
priority DSQ.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-06 14:07:24 +02:00
Andrea Righi
8dd528abfd scx_bpfland: pass enqueue flags when dispatching kthreads
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-06 14:07:10 +02:00
Andrea Righi
fc0d1bd003
Merge pull request #415 from sched-ext/bpfland-output
scx_bpfland: additional stats and output improvements
2024-07-05 19:50:07 +02:00
Tejun Heo
af5e89e73c
Merge pull request #412 from vax-r/flatcg_delta_fetch
scx_flatcg: Make good use of __sync_fetch_and_sub()
2024-07-05 07:39:22 -10:00
Tejun Heo
14d0a0ef64
Merge pull request #411 from vax-r/Fix_typo
scx_flatcg: Fix_typo
2024-07-05 07:35:31 -10:00
Andrea Righi
2bc8f800e7 scx_bpfland: report build id version
Use the version string provided by scx_utils:build_id.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-05 09:29:29 +02:00
Andrea Righi
bdb31e98e2 scx_bpfland: show statistics in a more human-readable format
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-05 09:29:29 +02:00
Andrea Righi
f9d7844b2e scx_bpfland: split direct dispatches and kthread dispatches
Show separate statistics for direct dispatches and kthread direct
dispatches.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-05 09:27:59 +02:00
I Hsin Cheng
aae826b1b3 scx_flatcg: Make good use of __sync_fetch_and_sub()
Fetch the value of "delta" directly from the returned value from
__sync_fetch_and_sub, as it returns the origin value of
cgc->cvtime_delta.

Additional fetching instruction of cgc->cvtime_delta would be redundant
here.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-05 01:03:20 +08:00
I Hsin Cheng
3e52761487 scx_flatcg: Fix_typo
Fix "oppotunistic" to "opportunistic".

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-04 22:04:40 +08:00
Andrea Righi
cfe2ed063d scx_bpfland: time-based starvation prevention
Tasks are consumed from various DSQs in the following order:

  per-CPU DSQs => priority DSQ => shared DSQ

Tasks in the shared DSQ may be starved by those in the priority DSQ,
which in turn may be starved by tasks dispatched to any per-CPU DSQ.

To mitigate this, record the timestamp of the last task scheduling event
both from the priority DSQ and the shared DSQ.

If the starvation threshold is exceeded without consuming a task, the
scheduler will be forced to consume a task from the corresponding DSQ.

The starvation threshold can be adjusted using the --starvation-thresh
command line parameter (default is 5ms).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-04 10:52:39 +02:00
Andrea Righi
9e0db4ae17 scx_bpfland: remove unnecessary RCU read protection
There is no need to RCU protect the cpumask for the offline CPUs: it is
created once when the scheduler is initialized and it's never
deallocated.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-04 10:24:43 +02:00
Andrea Righi
cef6ca93cf scx_bpfland: adjust default time slice to 5ms
Reduce the default time slice down to 5ms for a faster reaction and
system responsiveness when the system is overcomissioned.

This also helps to provide a more predictable level of performance.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-04 10:24:43 +02:00
Andrea Righi
7d15e3171c scx_bpfland: ensure task time slice never exceeds the slice_ns limit
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-04 10:24:43 +02:00
Andrea Righi
e8a4d350ad scx_bpfland: unify dispatching kthreads with direct CPU dispatches
Always use direct CPU dispatch for kthreads, there is no need to treat
kthreads in a special way, simply reuse direct CPU dispatch to
prioritize them.

Moreover, change direct CPU dispatches to use scx_bpf_dispatch_vtime(),
since we may dispatch multiple tasks to the same per-CPU DSQ now.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-03 09:38:43 +02:00
Andrea Righi
d2231b0aed scx_bpfland: drop built-in idle CPU selection logic
Small refactoring of the idle CPU selection logic:
 - optimize idle CPU selection for tasks that can run on a single CPU
 - drop the built-in idle selection policy and completely rely on the
   custom one

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-03 08:54:17 +02:00
Andrea Righi
7c355f50b2 scx_bpfland: use the right cpumask to find any idle CPU
We are incorrectly using the SMT idle cpumask to find any idle CPU, fix
by using the generic idle cpumask.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-01 20:36:24 +02:00
Andrea Righi
c458f345b4
Merge pull request #408 from sched-ext/bpfland-cpu-hotplug
scx_bpfland: support CPU hotplugging
2024-07-01 19:41:00 +02:00
Dan Schatzberg
32ac4b2cff
Merge pull request #389 from dschatzberg/mitosis
mitosis: Update synchronization
2024-07-01 09:44:26 -04:00
Andrea Righi
ff7a518d28 scx_bpfland: support CPU hotplugging
Implement CPU hotplugging in scx_bpfland without restarting the
scheduler.

The idle selection logic has been updated to consider online CPUs.
Additionally, a cpumask for offline CPUs has been introduced. Tasks
that have been dispatched to the DSQs associated with offline CPUs are
consumed by the other CPUs that are still online.

Moreover, the dependency on the Topology crate is temporarily dropped
and instead, /sys/devices/system/cpu/smt/active is used to determine if
SMT should be taken into account during idle selection. The Topology
crate will be re-introduced later when scx_bpfland will gain more
topology-aware capabilities.

This fixes #406.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-30 23:04:13 +02:00
Andrea Righi
d76551bbd3 scx_rusty: fix stats map initialization
The stats map in scx_rusty is a BPF_MAP_TYPE_PERCPU_ARRAY, with its size
determined by num_possible_cpus(). Initializing it with nr_cpu_ids() can
result in errors such as:

 Error: Failed to zero stat

 Caused by:
     number of values 6 != number of cpus 8

Fix by using num_possible_cpus() to initialize it.

Fixes: 263e02f6 ("rusty: Use nr_cpu_ids instead of nr_cpus_possible")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-30 17:37:14 +02:00
Andrea Righi
74175f5a49 scx_bpfland: properly integrate with meson build
Properly honor the meson build `serialize` option.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-28 21:37:00 +02:00
Andrea Righi
f98c35fd07
Merge pull request #388 from sched-ext/bpfland
scheds: introduce scx_bpfland
2024-06-28 21:27:43 +02:00
Andrea Righi
cf4883fbf8 meson: introduce serialize build option
With commit 5d20f89a ("scheds-rust: build rust schedulers in sequence"),
schedulers are now built serially one after the other to prevent meson
and cargo from forking NxN parallel tasks.

However, this change has made building a single scheduler much more
cumbersome, due to the chain of dependencies.

For example, building scx_rusty using the specific meson target would
still result in all schedulers being built, because they all depend on
each other.

To address this issue, introduce the new meson build option
`serialize=true|false` (default is false).

This option allows to disable the schedulers' build chain, restoring the
old behavior.

With this option enabled, it is now possible to build just a single
scheduler, parallelizing the cargo build properly, without triggering
the build of the others. Example:

  $ meson setup build -Dbuildtype=release -Dserialize=false
  $ meson compile -C build scx_rusty

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-28 10:17:37 +02:00
Changwoo Min
24a238846e scx_lavd: optimizing deadline related tunables
The competition window was 7.5 msec, half of the targeted latency.
However, it is too wide for some workloads, so unrelated tasks may
compete with each other. Hence, it is tightened to about 1 msec with
LAVD_LAT_WEIGHT_SHIFT to avoid unnecessary competition.

Also, when a system is overloaded, now the time space is stretched more
aggressively (i.e., lat_prio^2) when a task's latency priority is low
(high value).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-28 09:00:45 +09:00
Andrea Righi
7606b95150 scx_bpfland: introduce maximum time slice lag
Introduce a tunable to set a limit of the minimum vruntime that is used
when a task is dispatched, as:

 vtime_min = vtime_now - slice_lag_ns

Increasing the time slice lag can make interactive tasks even more
responsive at the cost of starving regular and newly created tasks.

Default time slice lag is 0.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-27 17:28:42 +02:00
Andrea Righi
5a44329d45 scheds: introduce scx_bpfland
Overview
========

This scheduler is derived from scx_rustland, but it is fully implemented
in BFP with minimal user-space Rust part to process command line
options, collect metrics and logs out scheduling statistics.

Unlike scx_rustland, all scheduling decisions are made by the BPF
component.

Motivation
==========

The primary goal of this scheduler is to act as a performance baseline
for comparison with scx_rustland, allowing for a better assessment of
the overhead caused by kernel/user-space interactions.

It can also be used to deploy prototypes initially tested in the
scx_rustland scheduler. In fact, this scheduler is expected to
outperform scx_rustland, due to the elimitation of the kernel/user-space
overhead.

Scheduling policy
=================

scx_bpfland is a vruntime-based sched_ext scheduler that prioritizes
interactive workloads. Its scheduling policy closely mirrors
scx_rustland, but it has been re-implemented in BPF with some small
adjustments.

Tasks are categorized as either interactive or regular based on their
average rate of voluntary context switches per second: tasks that exceed
a specific voluntary context switch threshold are classified as
interactive.

Interactive tasks are prioritized in a higher-priority DSQ, while
regular tasks are placed in a lower-priority DSQ. Within each queue,
tasks are sorted based on their weighted runtime, using the built-in scx
vtime ordering capabilities (scx_bpf_dispatch_vtime()).

Moreover, each task gets a time slice budget. When a task is dispatched,
it receives a time slice equivalent to the remaining unused portion of
its previously allocated time slice (with a minimum threshold applied).

This gives latency-sensitive workloads more chances to exceed their time
slice when needed to perform short bursts of CPU activity without being
interrupted (i.e., real-time audio encoding / decoding workloads).

Results
=======

According to the initial test results, using the same benchmark "playing
a videogame while recompiling the kernel", this scheduler seems to
provide a +5% improvement in the frames-per-second (fps) compared to
scx_rustland, with video games such as Cyberpunk 2077, Counter-Strike 2
and Baldur's Gate 3.

Initial test results indicate that this scheduler offers around a +5%
improvement in frames-per-second (fps) compared to scx_rustland when
using the benchmark "playing a video game while recompiling the kernel".

This improvement was observed in games such as Cyberpunk 2077,
Counter-Strike 2, and Baldur's Gate 3.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-27 17:28:42 +02:00
Changwoo Min
f86d564d89 scx_lavd: fast path for ops.dispatch() when fully loaded
When fully loaded so all CPUs are using, skip checking the cpumask.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-27 18:00:39 +09:00
David Vernet
fe3ce64a9b
Revert "scx_rusty: Refactor ridx assignment in populate_tasks_by_load" 2024-06-26 17:35:22 -04:00
Changwoo Min
abc6e31fef scx_lavd: for a forked task, inherit its parent's statistics
The old approach was too conservative in running a new task, so when a
fork-heavy workload competes with a CPU-bound workload, the fork-heavy
one is starved. The new approach solves the starvation problem by
inheriting parent's statistics. It seems a good (at least better than
old) guess how a new task will behave.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-26 19:00:10 +09:00
Changwoo Min
ac9c49f5b5 scx_lavd: loosen the deadline when overloaded
When the system is highly loaded with compute-intensive tasks, the old
setting chokes latensive-intensive tasks, so loosen the dealine when the
system is overloaded (> 100% utilization).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-26 15:06:31 +09:00
Changwoo Min
b32734168b scx_lavd: print build ID when lavd is loaded
When the lavd is loaded, it prints out its build id. It helps to easily
identify what version it is when testing.

```
01:56:54 [INFO] scx_lavd scheduler is initialized (build ID: 0.8.1-g98a5fa8595430414115c504857cea1a458393838-dirty x86_64-unknown-linux-gnu)
```

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-26 10:57:19 +09:00
Dan Schatzberg
d349f86d04 mitosis: Update synchronization
The synchronization for mitosis is a bit ad-hoc, working around lack of
atomics in BPF. This commit updates the logic to use READ/WRITE_ONCE and
compiler barriers to get the behaviors we want.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-06-25 12:44:16 -07:00
David Vernet
d42bae4fcf
rusty: Print build ID when rusty is loaded
When someone is testing schedulers, we often have to ask what version
the scheduler is running as. Now that we can access the build ID from
rust schedulers, let's update scx_rusty to print the build ID when rusty
first starts running.

This results in output such as the following:

```
[void@maniforge scx]$ rusty
19:04:26 [INFO] Running scx_rusty (build ID: 0.8.1-g2043d2537f37c8d75753bb65eb75bca965067564 x86_64-unknown-linux-gnu/debug)
19:04:26 [INFO] NUMA[00] mask= 0b11111111111111111111111111111111
19:04:26 [INFO]   DOM[00] mask= 0b00000000111111110000000011111111
19:04:26 [INFO]   DOM[01] mask= 0b11111111000000001111111100000000
19:04:26 [INFO] Rusty scheduler started!
```

Signed-off-by: David Vernet <void@manifault.com>
2024-06-25 11:44:46 -05:00
David Vernet
9d9ece11aa
Merge pull request #384 from jfernandez/log-recorder
scx_utils: Add log_recorder module for metrics-rs
2024-06-25 11:43:37 -05:00
Changwoo Min
5d0db5c5fe scx_lavd: revising tunables to reduce micro-stutters
This is a second attempt to optimize tunables for a wider range of
games.

1) LAVD_BOOST_RANGE increased from 14 (35%) to 40 (100% of nice range).
   Now the latency priority (biased by nice value) will decide which
   task should run first . The nice value will decide the time slice.

2) The first change will give higher priority to latency-critical task
   compared to before. For compensation, the slice boost also increased
   (2x -> 3x).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-25 16:13:32 +09:00
Jose Fernandez
e5984ed016
scx_utils: Add log_recorder module for metrics-rs
This change adds a new module to the scx_utils crate that provides a
log recorder for metrics-rs. The log recorder will log all metrics to
the console at a configurable interval in an easy to read format. Each
metric type will be displayed in a separate section. Indentation will
be used to show the hierarchy of the metrics. This results in a more
verbose output, but it is easier to read and understand.

scx_rusty was updated to use the log recorder and all explicit metric
logging was removed.

Counters will show the total count and the rate of change per second.
Counters with an additional label, like `type` in
`dispatched_tasks_total` in rusty, will show the count, rate, and
percentage of the total count.

Counters:
  dispatched_tasks_total: 65559 [1344.8/s]
    prev_idle: 44963 (68.6%) [966.5/s]
    wsync_prev_idle: 15696 (23.9%) [317.3/s]
    direct_dispatch: 2833 (4.3%) [35.3/s]
    dsq: 1804 (2.8%) [21.3/s]
    wsync: 262 (0.4%) [4.3/s]
    direct_greedy: 1 (0.0%) [0.0/s]
    pinned: 0 (0.0%) [0.0/s]
    greedy_idle: 0 (0.0%) [0.0/s]
    greedy_xnuma: 0 (0.0%) [0.0/s]
    direct_greedy_far: 0 (0.0%) [0.0/s]
    greedy_local: 0 (0.0%) [0.0/s]
  dl_clamped_total: 1290 [20.3/s]
  dl_preset_total: 514 [1.0/s]
  kick_greedy_total: 6 [0.3/s]
  lb_data_errors_total: 0 [0.0/s]
  load_balance_total: 0 [0.0/s]
  repatriate_total: 0 [0.0/s]
  task_errors_total: 0 [0.0/s]

Gauges will show the last set value:

Gauges:
  slice_length_us: 20000.00

Histograms will show the average, min, and max. The histogram will be
reset after each log interval to avoid memory leaks, since the data
structure that holds the samples is unbounded.

Histograms:
  cpu_busy_pct: avg=1.66 min=1.16 max=2.16
  load_avg node=0: avg=0.31 min=0.23 max=0.39
  load_avg node=0 dom=0: avg=0.31 min=0.23 max=0.39
  processing_duration_us: avg=297.50 min=296.00 max=299.00

Signed-off-by: Jose Fernandez <josef@netflix.com>
2024-06-24 18:45:02 -06:00
David Vernet
8059acb634
Merge pull request #381 from vax-r/rusty_dom_load_status_check
scx_rusty: Pull domain status check
2024-06-24 17:54:54 -05:00
David Vernet
55ee210d42
Merge pull request #382 from vax-r/rusty_refactor
scx_rusty: Refactor ridx assignment in populate_tasks_by_load
2024-06-24 17:47:55 -05:00
Changwoo Min
016229cbcf scx_lavd: revising tunables for less-preemptive games
In some games (e.g., Elden Ring), it was observed that preemption
happens much less frequently. The reason is that tasks' runtime per
schedule is similar, so it does not meet the existing criteria. To
alleviate the problem, the following three tunables are revised:

1) Smaller LAVD_PREEMPT_KICK_MARGIN and LAVD_PREEMPT_TICK_MARGIN help to
   trigger more preemption.

2) Smaller LAVD_SLICE_MAX_NS works better especially 250 or 300Hz
   kernels.

3) Longer LAVD_ELIGIBLE_TIME_MAX purturbes time lines less frequently.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-24 00:27:33 +09:00
I Hsin Cheng
eab234a74f scx_rusty: Refactor ridx assignment in populate_tasks_by_load
Origin assignment of the variable ridx is equivalent to comparing
between "ridx" and "wids - MAX_PIDS". Using u64 max library helper
function to perform the comparison and provide better readability.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-06-23 21:58:51 +08:00
I Hsin Cheng
84b9ac4dce scx_rusty: Pull domain status check
Check whether the BalanceState of pull_dom.load inside function
try_find_move_task is actually the variant NeedsPull. It'll perform task
migration in abit more conservative manner when the system is under high
loading situation.

Experiments are performed when the system is compiling linux kernel and
undergoing a large amount of I/O operation at the same time using fio.

The result showns that before the modification, there're 12,6617 times
of task migrations system wide. After the modification, there're 11,5419
times of task migrations system wide.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-06-23 21:38:23 +08:00
David Vernet
5038f54701
Merge pull request #377 from jfernandez/metrics-rs
rusty: Integrate stats with the metrics framework
2024-06-21 15:23:20 -05:00
David Vernet
9919b71fd4
Merge pull request #379 from sched-ext/topo_nr_cpu_ids
Add topo.nr_cpu_ids() to Topology crate
2024-06-21 13:35:05 -05:00
David Vernet
3bd15be840
rlfifo: Use topo.nr_cpu_ids() instead of topo.nr_cpus_possible()
In scx_rlfifo, we're currently using topo.nr_cpus_possible() to
determine how many possible CPU IDs we could have on the system. To
properly support systems whose disabled CPUs may be in the middle of the
range of possible CPU IDs, let's instead use topo.nr_cpu_ids() so that
we don't accidentally dispatch to an invalid DSQ.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-21 12:57:20 -05:00
David Vernet
263e02f644
rusty: Use nr_cpu_ids instead of nr_cpus_possible
In scx_rusty, we're currently using topo.nr_cpus_possible() to determine
how many possible CPU IDs we could have on the system. scx_rusty already
accounts for offlined CPUs, so to properly support systems whose
disabled CPUs may be in the middle of the range of possible CPU IDs,
let's instead use topo.nr_cpu_ids().

Signed-off-by: David Vernet <void@manifault.com>
2024-06-21 12:57:19 -05:00
David Vernet
bdbf4b9c05
topo: Return nr_cpu_ids from host Topology
In some cases, a host may have an odd topology where there are gaps in
CPU IDs (including between possible CPUs). A common pattern in
schedulers is to perform allocations for every possible CPU ID, such as
creating a per-cpu DSQ. In order to avoid confusing schedulers, let's
track the maximum CPU ID on a system so that we can return the number of
CPU IDs on the system which is inclusive of gaps.

We also update scx_rustland in this change to accommodate the fact that
we no longer export nr_cpus_possible() from TopologyMap.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-21 12:57:13 -05:00
Jose Fernandez
83373b1f4e
rusty: Integrate stats with the metrics framework
We need a layer of indirection between the stats collection and their
output destinations. Currently, stats are only printed to stdout. Our
goal is to integrate with various telemetry systems such as Prometheus,
StatsD, and custom metric backends like those used by Meta and Netflix.
Importantly, adding a new backend should not require changes to the
existing stats code.

This patch introduces the `metrics` [1] crate, which provides a
framework for defining metrics and publishing them to different
backends.

The initial implementation includes the `dispatched_tasks_count`
metric, tagged with `type`. This metric increments every time a task is
dispatched, emitting the raw count instead of a percentage. A monotonic
counter is the most suitable metric type for this use case, as
percentages can be calculated at query time if needed. Existing logged
metrics continue to print percentages and remain unchanged.

A new flag, `--enable-prometheus`, has been added. When enabled, it
starts a Prometheus endpoint on port 9000 (default is false). This
endpoint allows metrics to be charted in Prometheus or Grafana
dashboards.

Future changes will migrate additional stats to this framework and add
support for other backends.

[1] https://metrics.rs/

Signed-off-by: Jose Fernandez <josef@netflix.com>
2024-06-21 10:18:44 -06:00
Tejun Heo
7a40059b55 Revert "scx_flatcg: Keep cgroup rb nodes stashed"
This reverts commit 3b7f33ea1b.

I haven't root caused it yet but it's easy to reproduce stall and trigger
the watchdog after the commit - just running stress in multiple cgroups
easily triggers stalls after a couple tens of seconds. Let's revert it for
now.
2024-06-19 14:44:26 -10:00
Changwoo Min
9c21ace276
Merge pull request #373 from vax-r/lavd_reuse
scx_lavd: Reuse can_task1_kick_task2
2024-06-19 15:29:05 +09:00
I Hsin Cheng
99960ad960 scx_lavd: Reuse can_task1_kick_task2
Use the function can_task1_kick_task2() to replace places which also
checking the comp_preemption_info between two cpus for better
consistency.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-06-19 11:01:31 +08:00
Changwoo Min
691869e83f
Merge pull request #369 from sched-ext/lavd-fix-pick-cpu
scx_lavd: properly check for idle CPUs in pick_cpu()
2024-06-19 09:23:17 +09:00
Changwoo Min
dad25f1b5d
Merge pull request #368 from multics69/lavd-perf-misc
scx_lavd: misc performance tuning and code clean up
2024-06-19 07:26:52 +09:00
Andrea Righi
bad9ed13ef scx_lavd: properly check for idle CPUs in pick_cpu()
It seems that we are not updating `is_idle` when we find an idle CPU
with pick_cpu(), causing unnecessary rescheduling events when
select_cpu() is called.

To resolve this, ensure that the is_idle state is correctly set.
Additionally, always ensure that the task is dispatched to the local DSQ
immediately upon finding (and reserving) an idle CPU.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-18 17:36:39 +02:00
Changwoo Min
632fa9e4f2 scx_lavd: misc code clean up
- clean up u63 and u32 usages in structures to reduce struct size
- refactoring pick_cpu() for readability

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-18 18:11:49 +09:00
Changwoo Min
5165bf5a03 scx_lavd: tuning CPU frequency scaling
The required CPU performance (cpuperf) was set to 1024 (100%) when the
CPU utilization was 100%. When a sudden load spike happens, it makes the
system adapt slowly in the next interval.

The new scheme always reserves some headroom in advance, so it sets
cpuperf to 1024 when the CPU utilization reaches to 85%. This gives some
room to adapt in advance.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-18 18:11:49 +09:00
I Hsin Cheng
94e3616c02 scx_rusty: Refactor lookup operation for new_domc in task_set_domain
Modify the execution sequence before lookup operation for new_domc. If
new_dom_id == NO_DOM_FOUND, lookup operation for new_domc is definitely
going to fail so we don't have to wait until we found that new_domc is
NULL, clearing of cpumask and return operation should be done directly
in that case.

Plus we should avoid using try_lookup_dom_ctx outside the context of
lookup_dom_ctx, as it can keep the interface's consistency.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-06-18 12:58:17 +08:00
Tejun Heo
819ffd527f
Merge pull request #367 from sched-ext/htejun/dsq-iter-fix
scx/compat.bpf.h: Fix __COMPAT_scx_bpf_consume_task() and improve scx_qmap example
2024-06-17 10:29:38 -10:00
Tejun Heo
1012e3a6db scx/compat.bpf.h: Fix __COMPAT_scx_bpf_consume_task() and improve scx_qmap example
__COMPAT_scx_bpf_consume_task() wasn't calling scx_bpf_consume_task() at all
and was always returning false. Fix it.

Also, update scx_qmap usage example so that it matches cgroup ID rather than
comm prefix. This should make testing with multiple processes a bit easier.
2024-06-17 10:11:06 -10:00
David Vernet
0184444285
Merge pull request #366 from sched-ext/task_set_domain_global
rusty: Make dom_xfer_task() a global prog
2024-06-17 14:43:45 -05:00
David Vernet
dfe0ffb312
Merge pull request #347 from sched-ext/rusty_cleanup
rusty: Clean up some logic in rusty
2024-06-17 14:26:53 -05:00
David Vernet
7985ee556e
rusty: Clean up dispatch logic
The rusty dispatch logic is a bit unnecessarily convoluted. Let's clean it up
so that we're just comparing dom ids rather than iterating over arrays nested
inside of pcpu context.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-17 14:24:30 -05:00
David Vernet
87aa86845d
rusty: Refactor + slightly improve wake_sync
Right now, the SCX_WAKE_SYNC logic in rusty is very primitive. We only check to
see if the waker CPU's runqueue is empty, and then migrate the wakee there if
so. We'll want to expand this to be more thorough, such as:

- Checking to see if prev_cpu and waker_cpu share the same LLC when determining
  where to migrate
- Check for whether SCX_WAKE_SYNC migration helps load imbalance between cores
- ...

Right now all of that code is just a big blob in the middle of
rusty_select_cpu(). Let's pull it into its own function to improve readability,
and also add some logic to stay on prev_cpu if it shares an LLC with the waker.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-17 14:24:29 -05:00
David Vernet
fed66fa571
rusty: Make dom_xfer_task() a global prog
It seems that task_set_domain() is nearly at the point where it can
cause the verifier to get confused and think that it's exceeding the
number of available instructions per program. I've seen this a number of
times when making small changes to task_set_domain(), and it's once
again happened @vax-r (I-Hsin Cheng) made a small cleanup change to
rusty in https://github.com/sched-ext/scx/pull/362.

To avoid this, let's just make dom_xfer_task() a separate global program
so that the verifier doens't have to worry about branch pruning, etc
depending on what the caller does. This should hopefully make
task_set_domain() (and its callers) much less brittle.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-17 14:22:26 -05:00
Tejun Heo
b6ebdc635a compat: Compact min requirement checks
Let's check only the latest one.
2024-06-16 06:53:58 -10:00
Tejun Heo
aeb805a93e compat: Drop support for missing sched_ext_ops.dump*()
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop support for missing sched_ext_ops.dump*(). The
open helper macros now check the existence of the fields and abort if
missing.
2024-06-16 06:43:43 -10:00
Tejun Heo
4cca1e9acf compat: Drop support for missing sched_ext_ops.tick()
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop support for missing sched_ext_ops.tick(). The
open helper macros now check the existence of the field and abort if
missing.
2024-06-16 06:40:28 -10:00
Tejun Heo
970c04b43a compat: Drop support for missing sched_ext_ops.exit_dump_len
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop support for missing sched_ext_ops.exit_dump_len.
The open helper macros now check the existence of the field and abort if
missing.
2024-06-16 06:37:34 -10:00
Tejun Heo
046bdfd5e0 compat: Drop support for missing sched_ext_ops.hotplug_seq
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop support for missing sched_ext_ops.hotplug_seq.
The open helper macros now check the existence of the field and abort if
missing.
2024-06-16 06:34:59 -10:00
Tejun Heo
dde2942125 compat: Drop __COMPAT_scx_bpf_cpuperf_*()
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_cpuperf_*(). The open helper
macros now check the existence of scx_bpf_cpuperf_cap() and abort if not.
2024-06-16 06:16:53 -10:00
Tejun Heo
13e8388e1e compat: Drop __COMPAT_HAS_CPUMASKS
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_HAS_CPUMASKS(). The open helper macros
now check the existence of scx_bpf_nr_cpu_ids() and abort if not.
2024-06-16 06:12:06 -10:00
Tejun Heo
66901e2b44 compat: Drop __COMPAT_scx_bpf_dump()
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_dump(). The open helper macros
now check the existence of scx_bpf_dump_bstr() and abort if not.

While at it, reorder the min requirement checks so that newly added ones are
up top to make testing easier.
2024-06-16 06:02:47 -10:00
Tejun Heo
0d8adf2260 compat: Drop __COMPAT_scx_bpf_exit()
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_exit(). The open helper macros
now check the existence of scx_bpf_exit_bstr() and abort if not.
2024-06-15 20:36:17 -10:00
Tejun Heo
5b5e5be906 compat: Drop __COMPAT_SCX_KICK_IDLE
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_SCX_KICK_IDLE. The open helper macros
now check the existence of SCX_KICK_IDLE and abort if not.
2024-06-15 20:24:15 -10:00
Tejun Heo
b730f35e68 scx/common.h: Improve SCX_BUG() macro
There's no guarantee that errno is set or contains relevant information when
SCX_BUG() is invoked. This sometimes leads to "task failed successfully"
messages:

  # ./scx_simple
  ../scheds/c/scx_simple.c:72 [scx panic]: Success
  SCX_OPS_SWITCH_PARTIAL missing, kernel too old?

While not critical, it's not great. Let's update it so that errno is printed
in parentheses when non-zero and match the tag to the macro name so that
what's printed is the following:

  # ./scx_simple
  [SCX_BUG] ../scheds/c/scx_simple.c:72
  SCX_OPS_SWITCH_PARTIAL missing, kernel too old?
2024-06-15 20:17:32 -10:00
Tejun Heo
7c9aedaefe compat: Drop __COMPAT_scx_bpf_switch_all()
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_switch_call(). The open helper
macros now check the existence of SCX_OPS_SWITCH_PARTIAL and abort if not.
2024-06-15 20:03:37 -10:00
Tejun Heo
dd6255a601
Merge pull request #359 from sched-ext/htejun/cosmetic
common.bpf.h: Cosmetic changes
2024-06-15 06:42:00 -10:00
Andrea Righi
cb20a6f136 scx_rlfifo: dispatch all tasks on the first CPU available
With commit 786ec0c0 ("scx_rlfifo: schedule all tasks in user-space")
all the scheduling decisions are now happening in user-space. This also
bypasses the built-in idle selection logic, delegating the CPU selection
for each task to the user-space scheduler.

The easiest way to distribute tasks across the available CPUs is to
simply allow to dispatch them on the first CPU available.

In this way the scheduler becomes usable in practical scenarios and at
the same time it also maintains its simplicity.

This allows to spread all tasks across all the available CPUs

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-15 16:13:53 +02:00
Andrea Righi
786ec0c04a scx_rlfifo: schedule all tasks in user-space
Disable all the BPF optimization shortcuts by default and force all
tasks to be processed by the user-space scheduler.

Given that the primary goal of this scheduler is to offer a
straightforward and intuitive example for experimental purposes, this
change simplifies the process for individuals looking to experiment,
allowing them to apply changes to user-space code and quickly observe
the effects, without dealing with any in-kernel optimizations.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-15 16:07:39 +02:00
Andrea Righi
59f47d6659 scx_rlfifo: improve code readability
No functional change, just add some comments to better describe the
parameters used when initializing the main BpfScheduler object.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-15 16:05:28 +02:00
Tejun Heo
d3b34d1df7 scx_qmap: Rename central_timer to monitor_timer
The name was copied from scx_central.bpf.c and doesn't match what the timer
is used for in scx_qmap.bpf.c.
2024-06-14 16:07:20 -10:00
Tejun Heo
13abb6fd26 scx/common.bpf.h: Reorganize
Currently, the BPF declarations and generic helpers are in the same section.
Let's move the generic helpers down to its own section.
2024-06-14 15:36:00 -10:00
Tejun Heo
d7677e3e5c scx/common.bpf.h: Rename bpf_log2[l]() to u32/64_log2()
The bpf_ prefix is used for BPF API. Rename bpf_log2() to u32_log2() and
bpf_log2l() to u64_log2(). While at it, relocate them below compiler
directive helpers.
2024-06-14 15:22:39 -10:00
Tejun Heo
5a2412c211 scx/common.bpf.h: Minor comment updates 2024-06-14 15:22:29 -10:00
Andrea Righi
8c6fe540eb scx_rustland: prevent excessive starvation when system is congested
Keep track of the maximum vruntime among all tasks and flush them if the
difference between the maximum and minimum vruntime exceeds slice_ns.

This helps to prevent excessive starvation, as every task is guaranteed
to be dispatched within the slice_ns time limit.

Tested-by: Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-14 20:09:19 +02:00
Changwoo Min
94a39f419f scx_lavd: add the design of core compaction
The core compaction seems to work great in various hardware. Now it is
time to document its design.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-14 11:53:52 +09:00
Changwoo Min
5068d75bf3
Merge pull request #351 from multics69/lavd-power-v2
scx_lavd: improve CPU frequency scaling
2024-06-14 09:29:10 +09:00
Tejun Heo
a3342810c7
Merge pull request #352 from dschatzberg/mitosis
common: Add css iter forward declares
2024-06-13 06:50:06 -10:00
Dan Schatzberg
114e4b644b common: Add css iter forward declares
These are used in mitosis, but they belong in common code so other
schedulers can do css iteration.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-06-12 15:02:48 -07:00
Changwoo Min
747bf2a7d7 scx_lavd: add the design of CPU frequency scaling
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-13 01:42:19 +09:00
Changwoo Min
2e74b86b4a scx_lavd: logging cpu performance target
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-13 00:44:04 +09:00
Changwoo Min
e6348a11e9 scx_lavd: improve frequency scaling logic
The old logic for CPU frequency scaling is that the task's CPU
performance target (i.e., target CPU frequency) is checked every tick
interval and updated immediately. Indeed, it samples and updates a
performance target every tick interval. Ultimately, it fluctuates CPU
frequency every tick interval, resulting in less steady performance.

Now, we take a different strategy. The key idea is to increase the
frequency as soon as possible when a task starts running for quick
adoption to load spikes. However, if necessary, it decreases gradually
every tick interval to avoid frequency fluctuations.

In my testing, it shows more stable performance in many workloads
(games, compilation).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-12 23:40:40 +09:00
Changwoo Min
753f333c09 scx_lavd: refactoring do_update_sys_stat()
Originally, do_update_sys_stat() simply calculated the system-wide CPU
utilization. Over time, it has evolved to collect all kinds of
system-wide, periodic statistics for decision-making, so it has become
bulky. Now, it is time to refactor it for readability. This commit does
not contain functional changes other than refactoring.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-12 21:15:25 +09:00
Changwoo Min
9d129f0afa scx_lavd: rename LAVD_CPU_UTIL_INTERVAL_NS to LAVD_SYS_STAT_INTERVAL_NS
The periodic CPU utilization routine does a lot of other work now. So we
rename LAVD_CPU_UTIL_INTERVAL_NS to LAVD_SYS_STAT_INTERVAL_NS.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-12 20:06:17 +09:00
Changwoo Min
7046b47b9c scx_lavd: properly calculate task's runtime after suspend/resume
When a device is suspended and resumed, the suspended duration is added
up to a task's runtime if the task was running on the CPU. After the
resume, the task's runtime is incorrectly long and the scheduler starts
to recognize the system is under heavy load. To avoid such problem, the
suspended duration is measured and substracted from the task's runtime.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-12 15:58:41 +09:00
Dan Schatzberg
b95cfb0772 mitosis: Fix build
The target wasn't dependent on the previous sched so building all
schedulers ended up not building scx_mitosis which broke the install
script.
2024-06-11 14:33:32 -07:00
Dan Schatzberg
9528d4603e
Merge pull request #339 from dschatzberg/mitosis
scheds: Add scx_mitosis scheduler
2024-06-11 16:50:25 -04:00
Dan Schatzberg
3b6e2dee20 scheds: Add scx_mitosis scheduler
scx_mitosis is a dynamic affinity scheduler which assigns cgroups to
Cells and Cells to discrete sets of CPUs. The number of cells is dynamic
as is the CPU assignment. BPF mostly just does vtime scheduling for each
cell, tracks load, and responds to reconfiguration from userspace.
Userspace makes decisions about how to assign cgroups to cells and cells
to cpus.

This is not yet a complete scheduler, much of the userspace logic is a
placeholder as I experiment with better logic. I also want to add richer
scheduling semantics to userspace, e.g. so that cells can do more
"soft-affinity" rather than the strict partitioning implemented
currently.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-06-11 10:34:53 -07:00
David Vernet
1dbf874709
Merge pull request #341 from vax-r/rusty_data_races
scx_rusty: Elimate data races possibility for domain min_vruntime
2024-06-11 12:04:40 -05:00
David Vernet
b50ba626cc
uei: Pass skel to RESIZE_ARRAY()
The RESIZE_ARRAY() macro assumes the presence of an in-scope "skel" variable.
This is bad practice and can cause issues in other macros that use it. Let's
update it to explicitly take a skel argument.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-11 10:15:26 -05:00
I Hsin Cheng
4e30bb9ccf scx_rusty: Elimate data races possibility for domain min_vruntime
READ_ONCE()/WRITE_ONCE() macros are added in commit 0932fde, we should
be able to utilize the macros to get around the possibility of data
races for domc->min_vruntime.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-06-11 10:57:03 +08:00
Tejun Heo
30f27d99d9
Merge pull request #340 from sched-ext/htejun/layered-updates
scx_layered: Improve yield, preemption and other behaviors
2024-06-10 11:27:44 -10:00
Tejun Heo
9ec3594b4f scx_layered: Several fixes to address David's review
- pick_idle_cpu() was putting idle_smtmask that it didn't acquire.

- layered_enqueue() was unnecessarily entering preemption path after finding
  an idle CPU.

- No need to test whether scx_bpf_get_idle_cpu/smtmask() return NULL. They
  never do.

- Relocate cctx->yielding test into keep_runinng() from its caller.
2024-06-10 11:23:37 -10:00
Tejun Heo
92317aa2f9 Use __always_inline uniformly
Instead of using __attribute__((always_inline)) use the __always_inline
macro provided by BPF.
2024-06-10 11:23:26 -10:00
Changwoo Min
472ab945b8
scx_lavd: core compaction for low power consumption (#338)
scx_lavd: core compaction for low power consumption

When system-wide CPU utilization is low, it is very likely all the CPUs
are running with very low utilization. That means all CPUs run with low
clock frequency thanks to dynamic frequency scaling and very frequently
go in and out from/to C-state. That results in low performance (i.e.,
low clock frequency) and high power consumption (i.e., frequent
P-/C-state transition).

The idea of *core compaction* is using less number of CPUs when
system-wide CPU utilization is low. The chosen cores (called "active
cores") will run in higher utilization and higher clock frequency, and
the rest of the cores (called "idle cores") will be in a C-state for a
much longer duration. Thus, the core compaction can achieve higher
performance with lower power consumption.

One potential problem of core compaction is latency spikes when all the
active cores are overloaded. A few techniques are incorporated to solve
this problem.

1) Limit the active CPU core's utilization below a certain limit (say 50%).

2) Do not use the core compaction when the system-wide utilization is
   moderate (say 50%).

3) Do not enforce the core compaction for kernel and pinned user-space
   tasks since they are manually optimized for performance.

In my experiments, under a wide range of system-wide CPU utilization
(5%—80%), the core compaction reduces 7-30% power consumption without
sacrificing average and 99p tail latency.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-08 09:25:27 +09:00
Tejun Heo
a165970ab9 scx_layered: Add migration statistic
Keep track of how frequent migrations are.
2024-06-07 11:49:39 -10:00
Tejun Heo
5b31d96c3d scx_layered: Implement "preempt_first" layer property
If set, tasks in the layer will try to preempt tasks in their previous CPUs
before trying to find idle CPUs.
2024-06-07 11:49:39 -10:00
Tejun Heo
ece3638664 scx_layered: Allow confined layers to preempt
There's no reason to restrict confined layers from preempting on the CPUs
that they are entitled to. Allow preemption for confined layers.
2024-06-07 11:49:39 -10:00
Tejun Heo
7c48814ed0 scx_layered: Prefer preempting the CPU the task was previously on
Currently, when preempting, searching for the candidate CPU always starts
from the RR preemption cursor. Let's first try the previous CPU the
preempting task was on as that may have some locality benefits.
2024-06-07 11:49:38 -10:00
Tejun Heo
3db3257911 scx_layered: Find and kick an idle CPU from enqueue path
When a task is being enqueued outside wakeup path, ops.select_cpu() isn't
called, so we can end up in a situation where a newly enqueued task keeps
waiting in one of the DSQs while there are idle CPUs. Factor out idle CPU
selection path into pick_idle_cpu() and call it from the enqueue path in
such cases. This problem is shared across schedulers and likely needs a more
generic solution in the future.
2024-06-07 11:49:38 -10:00
Tejun Heo
0f2d1ad2fa scx_layered: Implement a new layer parameter "yield_ignore"
yield(2) currently gives up the entire slice. Add "yield_ignore" layer
parameter which can modulate the magnitude of yiedling. When 1.0, yields are
completely ignored. 0.5, only half worth of the full slice is given up and
so on.
2024-06-07 11:49:38 -10:00
Tejun Heo
4aa8124b9c scx_layered: Add explicit yield() support
Currently, a task which yields is treated the same as a task which has run
out its slice. As the budget charged to a task is calculated from wall clock
time, a repeatedly yielding task can stay at the top of the queue for quite
a while hogging the CPU and spiking the number of scheduling events.

Let's add explicit yield support. An yielding task is now always charged the
full slice and not allowed to keep running on the same CPU.
2024-06-07 11:49:38 -10:00
Tejun Heo
436cd7ba9e scx_layered: Make enqueue path comprehensive and handle CPU preemptions
The keep_running path relies on the implicit last task enqueue which makes
the statistics a bit difficult to track. Let's make the enqueue path
comprehensive:

- Set SCX_OPS_ENQ_LAST and handle the last runnable task enqueue explicitly.

- Implement layered_cpu_release() to re-enqueue tasks from a CPU preempted
  by a higher pri sched class and handle the re-enqueued tasks explicitly in
  layered_enqueue().

- Add more statistics to track all enqueue operations.
2024-06-07 11:49:38 -10:00
Tejun Heo
4a0993ceab scx_layered: Allow long-running tasks to keep running on the same CPU
When a task exhausts its slice, layered currently doesn't make any effort to
keep it on the same CPU. It dispatches the next task to run and then
enqueues the running one. This leads to suboptimal behaviors. e.g. When this
happens to a task in a preempting layer, the task will most likely find an
idle CPU or a task to preempt and then migrate there causing a completely
unnecessary migration.

This patch layered_dispatch() test whether the current task should keep
running on the CPU and then skip dispatching to keep the task running. This
behavior depends on the implicit local DSQ enqueue mechanism which triggers
when there are no other tasks to run.
2024-06-07 11:49:38 -10:00
Tejun Heo
200af60f2a scx_layered: Fix load failure due to scheduler_tick() -> sched_tick() rename
- scx_utils: Replace kfunc_exists() with ksym_exists() which doesn't care
  about the type of the symbol.

- scx_layered: Fix load failure on kernels >= v6.10-rc due to
  scheduler_tick() -> sched_tick rename. Attach the tick fentry function to
  either scheduler_tick() or sched_tick().
2024-06-06 12:54:59 -10:00
Andrea Righi
8a3ee7b801 scx_rustland: never use a time slice that exceeds the default value
Make sure to never assign a time slice longer than the default time
slice, that can be used as an upper limit.

This seems to prevent potential stall conditions (reported by the
CachyOS community) when running CPU-intensive workloads, such as:

 [   68.062813] sched_ext: BPF scheduler "rustland" errored, disabling
 [   68.062831] sched_ext: runnable task stall (ollama_llama_se[3312] failed to run for 5.180s)
 [   68.062832]    scx_watchdog_workfn+0x154/0x1e0
 [   68.062837]    process_one_work+0x18e/0x350
 [   68.062839]    worker_thread+0x2fa/0x490
 [   68.062841]    kthread+0xd2/0x100
 [   68.062842]    ret_from_fork+0x34/0x50
 [   68.062844]    ret_from_fork_asm+0x1a/0x30

Fixes: 6f4cd853 ("scx_rustland: introduce virtual time slice")
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-06 17:56:23 +02:00
Andrea Righi
6f4cd853f9 scx_rustland: introduce virtual time slice
Overview
========

Currently, a task's time slice is determined based on the total number
of tasks waiting to be scheduled: the more overloaded the system, the
shorter the time slice.

This approach can help to reduce the average wait time of all tasks,
allowing them to progress more slowly, but uniformly, thus providing a
smoother overall system performance.

However, under heavy system load, this approach can lead to very short
time slices distributed among all tasks, causing excessive context
switches that can badly affect soft real-time workloads.

Moreover, the scheduler tends to operate in a bursty manner (tasks are
queued and dispatched in bursts). This can also result in fluctuations
of longer and shorter time slices, depending on the number of tasks
still waiting in the scheduler's queue.

Such behavior can also negatively impact on soft real-time workloads,
such as real-time audio processing.

Virtual time slice
==================

To mitigate this problem, introduce the concept of virtual time slice:
the idea is to evaluate the optimal time slice of a task, considering
the vruntime as a deadline for the task to complete its work before
releasing the CPU.

This is accomplished by calculating the difference between the task's
vruntime and the global current vruntime and use this value as the task
time slice:

  task_slice = task_vruntime - min_vruntime

In this way, tasks that "promise" to release the CPU quickly (based on
their previous work pattern) get a much higher priority (due to
vruntime-based scheduling and the additional priority boost for being
classified as interactive), but they are also given a shorter time slice
to complete their work and fulfill their promise of rapidity.

At the same time tasks that are more CPU-intensive get de-prioritized,
but they will tend to have a longer time slice available, reducing in
this way the amount of context switches that can negatively affect their
performance.

In conclusion, latency-sensitive tasks get a high priority and a short
time slice (and they can preempt other tasks), CPU-intensive tasks get
low priority and a long time slice.

Example
=======

Let's consider the following theoretical scenario:

 task | time
 -----+-----
   A  | 1
   B  | 3
   C  | 6
   D  | 6

In this case task A represents a short interactive task, task C and D
are CPU-intensive tasks and task B is mainly interactive, but it also
requires some CPU time.

With a uniform time slice, scaled based on the amount of tasks, the
scheduling looks like this (assuming the time slice is 2):

 A B B C C D D A B C C D D C C D D
  |   |   |   | | |   |   |   |
  `---`---`---`-`-`---`---`---`----> 9 context switches

With the virtual time slice the scheduling changes to this:

 A B B C C C D A B C C C D D D D D
  |   |     | | | |     |
  `---`-----`-`-`-`-----`----------> 7 context switches

In the latter scenario, tasks do not receive the same time slice scaled
by the total number of tasks waiting to be scheduled. Instead, their
time slice is adjusted based on their previous CPU usage. Tasks that
used more CPU time are given longer slices and their processing time
tends to be packed together, reducing the amount of context switches.

Meanwhile, latency-sensitive tasks can still be processed as soon as
they need to, because they get a higher priority and they can preempt
other tasks. However, they will get a short time slice, so tasks that
were incorrectly classified as interactive will still be forced to
release the CPU quickly.

Experimental results
====================

This patch has been tested on a on a 8-cores AMD Ryzen 7 5800X 8-Core
Processor (16 threads with SMT), 16GB RAM, NVIDIA GeForce RTX 3070.

The test case involves the usual benchmark of playing a video game while
simultaneously overloading the system with a parallel kernel build
(`make -j32`).

The average frames per second (fps) reported by Steam is used as a
metric for measuring system responsiveness (the higher the better):

 Game                       |  before |  after  | delta  |
 ---------------------------+---------+---------+--------+
 Baldur's Gate 3            |  40 fps |  48 fps | +20.0% |
 Counter-Strike 2           |   8 fps |  15 fps | +87.5% |
 Cyberpunk 2077             |  41 fps |  46 fps | +12.2% |
 Terraria                   |  98 fps | 108 fps | +10.2% |
 Team Fortress 2            |  81 fps |  92 fps | +13.6% |
 WebGL demo (firefox) [1]   |  32 fps |  42 fps | +31.2% |
 ---------------------------+---------+---------+--------+

Apart from the massive boost with Counter-Strike 2 (that should be taken
with a grain of salt, considering the overall poor performance in both
cases), the virtual time slice seems to systematically provide a boost
in responsiveness of around +10-20% fps.

It also seems to significantly prevent potential audio cracking issues
when the system is massively overloaded: no audio cracking was detected
during the entire run of these tests with the virtual deadline change
applied.

[1] https://webglsamples.org/aquarium/aquarium.html

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-04 23:01:13 +02:00
Tejun Heo
e556dd375d scx: Unify loading and running boilerplate across rust schedulers
Make restart handling with user_exit_info simpler and consistently use the
load and report macros consistently across the rust schedulers. This makes
all schedulers automatically handle auto restarts from CPU hotplug events.
Note that this is necessary even for scx_lavd which has CPU hotplug
operations as CPU hotplug operations which took place between skel open and
scheduler init can still trigger restart.
2024-06-03 12:25:41 -10:00
David Vernet
a26d3f2220
Merge pull request #328 from sched-ext/rusty_cpumask_overlap
rusty: Use cpumask kfuncs in cpumask_intersects_domain()
2024-06-03 20:42:11 +00:00
David Vernet
0ae676a9ca
rusty: Use cpumask kfuncs in cpumask_intersects_domain()
In cpumask_intersects_domain(), we check whether a given cpumask has any
CPUs in common with the specified domain by looking at the const, static
dom_cpumasks map. This map is only really necessary when creating the
domain struct bpf_cpumask objects at scheduler load time. After that, we
can just use the actual struct bpf_cpumask object embedded in the domain
context. Let's use that and cpumask kfuncs instead.

This allows rusty to load with
https://github.com/sched-ext/sched_ext/pull/216.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-03 15:01:19 -05:00
Tejun Heo
a2d5310cb6 Bump versions for a release 2024-06-03 08:35:21 -10:00
Andrea Righi
ccef4d0ba1 scx_rustland: get rid of --builtin-idle option
Commit 23b0bb5f ("scx_rustland: dispatch interactive tasks on any CPU")
allows only interactive tasks to be dispatched on any CPU, enabling them
to quickly use the first idle CPU available. Non-interactive tasks, on
the other hand, are kept on the same CPU as much as possible.

This change deprioritizes CPU-intensive tasks further, but it also helps
to exploit cache locality, while latency-sensitive tasks are dispatched
sooner, improving overall responsiveness, despite the potential
migration cost.

Given this new logic, the built-idle option, which forces all tasks to
be dispatched on the CPU assigned during select_cpu(), no longer offers
significant benefits. It would merely reduce the responsiveness of
interactive tasks.

Therefore, simply remove this option, allowing the scheduler to
determine the target CPU(s) for all tasks based on their nature.

Fixes: 23b0bb5f ("scx_rustland: dispatch interactive tasks on any CPU")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-03 10:02:04 +02:00
I Hsin Cheng
0921fde1f1 scx_lavd: Adding READ_ONCE()/WRITE_ONCE() macros
In order to prevent compiler from merging or refetching load/store
operations or unwanted reordering, we take the implemetation of
READ_ONCE()/WRITE_ONCE() from kernel sources under
"/include/asm-generic/rwonce.h".

Use WRITE_ONCE() in function flip_sys_cpu_util() to ensure the compiler
doesn't perform unnecessary optimization so the compiler won't make
incorrect assumptions when performing the operation of modifying of bit
 flipping.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-06-01 11:07:52 +08:00
Tejun Heo
ebae7d5e6a
Merge pull request #312 from sched-ext/htejun/layered-updates
scx_layered: Improve affn_viol handling and implement dump method
2024-05-28 10:22:31 -10:00
Tejun Heo
d3ed4cb5c7 scx_layered: Successfully consuming from HI_FALLBACK_DSQ should terminate dispatching
layered_dispatch() was incorrectly continuing down to the lower priority
DSQs after successfully consuming from HI_FALLBACK_DSQ which can lead to
latency issues. Fix it.
2024-05-28 10:20:55 -10:00
Changwoo Min
4c0f996ddc
Revert "scx_lavd: Enforce memory barrier in flip_sys_cpu_util" 2024-05-27 12:19:21 +09:00
Changwoo Min
0371ccae40
Merge pull request #318 from vax-r/Memory_barrier
scx_lavd: Enforce memory barrier in flip_sys_cpu_util
2024-05-26 21:00:25 +09:00
I Hsin Cheng
f839106a57 scx_lavd: Enforce memory barrier in flip_sys_cpu_util
Use the GNU built-in __sync_fetch_and_xor() to perform the XOR operation
on global variable "__sys_cpu_util_idx" to ensure the operations
visibility.

The built-in function "__sync_fetch_and_xor()" can provide both atomic
operation and full memory barrier which is needed by every operation
(especially store operation) on global variables.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-05-26 15:27:10 +08:00
I Hsin Cheng
5881c61a5e scx_central: Provide backward compability
Newer sched_ext kernel versions sets the scheduler to schedule all tasks
within the system by default. However, some users are using the old
versions of kernel.

Therefore we call "__COMPAT_scx_bpf_switch_all()" to move all tasks to
"SCHED_EXT" class so scx_central would schedule all tasks by default in
older kernels.
2024-05-24 15:12:34 +08:00
Tejun Heo
99eb56b6b5 scx_layered: Implement layered_dump()
which dumps layer states.
2024-05-23 12:54:17 -10:00
Tejun Heo
a576242b69 scx_layered: Open and grouped layers can handle tasks with custom affinities
The main reason why custom affinities are tricky for scx_layered is because
if we put a task which doesn't allow all CPUs into a layer's DSQ, it may not
get consumed for an indefinite amount of time. However, this is only true
for confined layers. Both open and grouped layers always consumed from all
CPUs and thus don't have this risk.

Let's allow tasks with custom affinities in open and grouped layers.

- In select_cpu(), don't consider direct dispatching to a local DSQ as
  affinity violation even if the target CPU is outside the layer's cpumask
  if the layer is open.

- In enqueue(), separate out per-cpu kthread special case into its own
  block. Note that this is only applied if the layer is not preempting as a
  preempting layer has a higher priority than HI_FALLBACK_DSQ anyway.

- Trigger the LO_FALLBACK_DSQ path for other threads only if the layer is
  confined.

- The preemption path now also runs for tasks with a custom affinity in open
  and grouped layers. Update it so that it only considers the CPUs in the
  preempting task's allowed cpumask.

(cherry picked from commit 82d2f887a4608de61ddf5e15643c10e504a88f7b)
2024-05-23 12:54:17 -10:00
Tejun Heo
1ce23760b5 scx_layered: Improve affinity violation handling
- AFFN_VIOL for per-cpu tasks could be double counted. Once in select_cpu()
  and again in enqueue(). Count in select_cpu() only when direct
  dispatching.

- Violating tasks were prioritized over non-violating ones because they were
  queued on SCX_DSQ_GLOBAL which has priority over all user DSQs. This
  doesn't make sense. Let's introduce two fallback DSQs - HI_FALLBACK_DSQ
  and LO_FALLBACK_DSQ. HI is used for violating kthreads and LO for
  violating user threads. HI is dispatched after preempting layers and LO
  after all other layers. This shouldn't change the behavior too much for
  kthreads while punshing, rather than rewarding, violating user threads.

(cherry picked from commit 67f69645667ba8a155cae9a9b7e90c055d39e23c)
2024-05-23 12:54:17 -10:00