Commit Graph

1282 Commits

Author SHA1 Message Date
Peter Jung
9e2caa74c0
systemd: Drop temporarily disabled schedulers from service
Signed-off-by: Peter Jung <admin@ptr1337.dev>
2024-07-14 20:26:04 +02:00
Peter Jung
a7fa651bfc
install_user_scheds: Skip packaging of scx_mitosis
Signed-off-by: Peter Jung <admin@ptr1337.dev>
2024-07-14 20:21:14 +02:00
Tejun Heo
3ae76acd12
Merge pull request #424 from sched-ext/sync-upstream-kernel-and-bump-to-1.0
Sync to upstream kernel and bump to 1.0
2024-07-14 07:00:38 -10:00
Changwoo Min
5b2112dd81
Merge pull request #421 from multics69/lavd-metrics
scx_lavd: improve time slice and waker freq calculation
2024-07-14 18:49:36 +09:00
Tejun Heo
761ec142ce Bump most versions to 1.0.0
sched_ext is about to be merged upstream. There are some compatibility
breaking changes and we're making the current sched_ext/for-6.11
1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on
!wakeup enqueues") the baseline.

Tag everything except scx_mitosis as 1.0.0. As scx_mitosis is still in early
development and is currently temporarily disabled, only the patchlevel is
bumped.
2024-07-12 11:34:14 -10:00
Tejun Heo
54c487731a Update to vmlinux-v6.10-rc2-g1edab907b57d.h
Sync to vmlinux.h from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap:
Pick idle CPU for direct dispatch on !wakeup enqueues"). This most likely
will be the commit which will be merged during the upcoming kernel v6.11
merge window.

Unfortunately, this is a compatibility breaking change. As the size of
bpf_iter_scx_dsq is reduced, schedulers that use the iterator - scx_lavd and
scx_layered - won't be able to run on older kernels. Likewise, older
binaries from before this commit won't be able to run on newer kernels.
2024-07-12 11:13:34 -10:00
Tejun Heo
f261d0f037 Sync from kernel - 1edab907b57d
Sync from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap: Pick idle
CPU for direct dispatch on !wakeup enqueues")

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-6.11

- cgroup support hasn't landed in the upstream kernel yet. This most likely
  will happen in a few weeks. For the time being, disable scx_flatcg,
  scx_pair and scx_mitosis.

- Compat macro for DSQ task iterator dropped. This is now a part of
  the baseline.

- scx_bpf_consume() isn't upstream yet. BPF interfacing side is still being
  discussed. Dropped example usage from tools/sched_ext. None of the
  practical schedulers use it, so this should be fine for now.

- scx_bpf_cpu_rq() added.

- AUTOATTACH workaround for newer libbpf versions added.
2024-07-12 11:08:41 -10:00
Tejun Heo
228080606c Update libbpf and bpftool commits
Sync to the latest. Right now, it's in an awkward place where receiving
AUTOATTACH compat updatees from kernel breaks build as the libbpf version is
marked as 1.5 but bpf_map__set_autoattach() is not available yet. Sync to
the latest.
2024-07-12 10:51:44 -10:00
Tejun Heo
274bcf7f02
Merge pull request #420 from CachyOS/fix/meson-env-var-append
meson: fix RUSTFLAGS being appended incorrectly
2024-07-12 09:06:03 -10:00
Changwoo Min
512bd143a5 scx_lavd: count only related tasks in calculating waker_freq
A task can become a runnable on any task's context not only its waker
task.  Thus, we should not count wake-up on unrelated task's context.
With this commit, the scheduler can (much more) accurately detect
waker-wakee relationsships.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-12 22:51:09 +09:00
Changwoo Min
95733f63ab scx_lavd: calculate time slice as a function of run queue length
The prior approach using the sum of weights gives too much penalty to
nice tasks with large nice values. With this commit, the time slice is
determined by the number of runnable tasks regardless of nice priority.
Note that the fairness will still be enforced based on tasks' nice
priorities (weights).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-12 22:45:22 +09:00
Vladislav Nepogodin
c7f1909415
meson: fix RUSTFLAGS being appended incorrectly 2024-07-12 16:42:53 +04:00
Changwoo Min
00fdc1d949
Merge pull request #417 from multics69/lavd-vdeadline
scx_lavd: improve virtual deadline and current clock handling
2024-07-12 14:05:44 +09:00
Changwoo Min
d4bc92bea7 scx_lavd: print lat_cri to output
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-12 13:23:56 +09:00
Changwoo Min
4c5c564523 scx_lavd: initial current logical clock to zero
To easily distinguish, let's initialize the current logical clock to
zero (not the current physical time). Also, avoid the deadline
calculation being zero by adding +1 here and there.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-12 10:15:54 +09:00
Andrea Righi
641a8c4c5c
Merge pull request #418 from sched-ext/bpfland-mitigations
scx_bpfland: mitigations and additional statistics
2024-07-11 18:32:18 +02:00
Andrea Righi
640bd562ff scx_bpfland: prevent tasks from abusing interactive priority boost
The priority boost for interactive tasks can be exploited to render the
system nearly unresponsive by creating numerous tasks that constantly
switch between wait/wakeup states.

For example, stress tests like `hackbench -l 10000` can significantly
degrade system responsiveness.

To mitigate this, limit the number of interactive tasks added to the
priority queue to 4x the number of online CPUs.

This simple approach appears to be a quite effective at identifying
potential spam of "fake" interactive tasks, while still prioritizing
legitimate interactive tasks.

Additionally, periodically refresh the interactive status of the tasks
based on their most recent average of voluntary context switches,
preventing the interactive status from being too "sticky".

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 16:13:55 +02:00
Andrea Righi
1babb2b92d scx_bpfland: prevent per-CPU kthreads starving other tasks
Avoid dispatching per-CPU kthreads directly, since this may cause
interactivity problems or unfairness, for example if there are too many
softirqs being scheduled (e.g., in presence of high RX network traffic
or when running certain stress tests, like hackbench).

Moreover, in order to help with testing and benchmarks, introduce the
option --local-kthread, that allows to restore the old behavior if
enabled.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 16:13:48 +02:00
Andrea Righi
c3ebdd338f scx_bpfland: prevent slice delta overflow
When updating the task vruntime, ensure the time slice delta is always a
positive value. Failing to do so may cause the global vruntime to
increase excessively due to overflows.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 15:58:01 +02:00
Andrea Righi
f59aa52fe7 scx_bpfland: expose the amount of online CPUs
Periodically report the amount of online CPUs to stdout.

The online CPUs are initially evaluated looking at the online cpumask,
then the value is updated in the .cpu_offline() / .cpu_online()
callbacks.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 15:58:01 +02:00
Andrea Righi
3a47b484af scx_bpfland: report interactive tasks to stdout
Keep track of the CPUs that are running interactive tasks and report
their amount to stdout.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 15:58:01 +02:00
Andrea Righi
1a1a16b9e9 scx_bpfland: fix typo in slice_ns definition
The correct default value of slice_ns 5ms, not 5s.

This change doesn't really make any difference in practice, since these
values are changed by the Rust part when the scheduler is started, but
it's good to keep this aligned to the proper values for consistency.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 15:58:01 +02:00
Changwoo Min
bdbfeb9fd1 scx_lavd: use logical current clock for virtual deadlines
This commit changes the use of a physical clock to a virtual, logical
clock in calculating deadlines.

- The virtual current clock advances upon a task's running to its
  virtual deadline.

- When enqueuing a task, its virtual deadline from the virtual current
  clock is calculated.

With the above two changes, this guarantees that there is no such task
whose virtual deadline is smaller than the virtual current clock. This
means any enqueuing task can compete with any other already enqueued
tasks. This allows a latency-critical task to be immediately scheduled
if needed.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 22:41:56 +09:00
Changwoo Min
408ea7892c scx_lavd: induce sched_prio_to_latency_weight from slice weight
So sched_prio_to_latency_weight is removed.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:37:21 +09:00
Changwoo Min
bd964acff6 scx_lavd: deprioritize a newly forked task in latency
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:36:32 +09:00
Changwoo Min
48debe416e scx_lavd: tuning the deadline equation under high load
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:35:54 +09:00
Changwoo Min
c72e063680 scx_lavd: do not use lat_prio_to_greedy_thresholds
With other optimizations, lat_prio_to_greedy_thresholds is not effective
any more.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:35:01 +09:00
Changwoo Min
9ed488798e scx_lavd: use task's runtime to determine its deaddline
It has an effect of further perferring shorter jobs.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:34:25 +09:00
Changwoo Min
e081b2a294 scx_lavd: rename LAVD_MAX_CAS_RETRY to LAVD_MAX_RETRY
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:33:56 +09:00
Andrea Righi
3df7a13117
Merge pull request #416 from sched-ext/bpfland-small-improvements
bpfland: small improvements
2024-07-08 23:11:47 +02:00
Andrea Righi
995577762a scx_bpfland: refill task time slice
Every time we need to dispatch a task re-evalate its time slice as:

 (unused_time_slice + min_time_slice) / 2

This allows to refill the time slice for tasks that haven't used much of
their previously assigned time, improving fairness.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-06 14:07:24 +02:00
Andrea Righi
6a64182ef2 scx_bpfland: always classify interactive tasks
Make sure to always classify interactive tasks, even when the system is
not fully utilized. This ensures that if the system suddenly becomes
overloaded, we already know which tasks need to be dispatched to the
priority DSQ.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-06 14:07:24 +02:00
Andrea Righi
8dd528abfd scx_bpfland: pass enqueue flags when dispatching kthreads
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-06 14:07:10 +02:00
Andrea Righi
fc0d1bd003
Merge pull request #415 from sched-ext/bpfland-output
scx_bpfland: additional stats and output improvements
2024-07-05 19:50:07 +02:00
Tejun Heo
af5e89e73c
Merge pull request #412 from vax-r/flatcg_delta_fetch
scx_flatcg: Make good use of __sync_fetch_and_sub()
2024-07-05 07:39:22 -10:00
Tejun Heo
7f8e4edb53
Merge pull request #397 from jfernandez/log-recorder-customize
sched_utils: Add log recorder format customization
2024-07-05 07:37:39 -10:00
Tejun Heo
14d0a0ef64
Merge pull request #411 from vax-r/Fix_typo
scx_flatcg: Fix_typo
2024-07-05 07:35:31 -10:00
Andrea Righi
2bc8f800e7 scx_bpfland: report build id version
Use the version string provided by scx_utils:build_id.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-05 09:29:29 +02:00
Andrea Righi
bdb31e98e2 scx_bpfland: show statistics in a more human-readable format
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-05 09:29:29 +02:00
Andrea Righi
f9d7844b2e scx_bpfland: split direct dispatches and kthread dispatches
Show separate statistics for direct dispatches and kthread direct
dispatches.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-05 09:27:59 +02:00
Andrea Righi
86d2f50230
Merge pull request #410 from sched-ext/bpfland-smooth-perf
scx_bpfland: enhance performance consistency and predictability
2024-07-04 21:37:07 +02:00
Andrea Righi
d98516fe75
Merge pull request #413 from vax-r/Remove_unused_variable
scx_rustland_core: Remove unused variable
2024-07-04 19:09:27 +02:00
I Hsin Cheng
1595da78dc scx_rustland_core: Remove unused variable
Remove unused variable "tctx" in rustland_select_cpu.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-05 01:04:49 +08:00
I Hsin Cheng
aae826b1b3 scx_flatcg: Make good use of __sync_fetch_and_sub()
Fetch the value of "delta" directly from the returned value from
__sync_fetch_and_sub, as it returns the origin value of
cgc->cvtime_delta.

Additional fetching instruction of cgc->cvtime_delta would be redundant
here.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-05 01:03:20 +08:00
I Hsin Cheng
3e52761487 scx_flatcg: Fix_typo
Fix "oppotunistic" to "opportunistic".

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-04 22:04:40 +08:00
Andrea Righi
cfe2ed063d scx_bpfland: time-based starvation prevention
Tasks are consumed from various DSQs in the following order:

  per-CPU DSQs => priority DSQ => shared DSQ

Tasks in the shared DSQ may be starved by those in the priority DSQ,
which in turn may be starved by tasks dispatched to any per-CPU DSQ.

To mitigate this, record the timestamp of the last task scheduling event
both from the priority DSQ and the shared DSQ.

If the starvation threshold is exceeded without consuming a task, the
scheduler will be forced to consume a task from the corresponding DSQ.

The starvation threshold can be adjusted using the --starvation-thresh
command line parameter (default is 5ms).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-04 10:52:39 +02:00
Andrea Righi
9e0db4ae17 scx_bpfland: remove unnecessary RCU read protection
There is no need to RCU protect the cpumask for the offline CPUs: it is
created once when the scheduler is initialized and it's never
deallocated.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-04 10:24:43 +02:00
Andrea Righi
cef6ca93cf scx_bpfland: adjust default time slice to 5ms
Reduce the default time slice down to 5ms for a faster reaction and
system responsiveness when the system is overcomissioned.

This also helps to provide a more predictable level of performance.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-04 10:24:43 +02:00
Andrea Righi
7d15e3171c scx_bpfland: ensure task time slice never exceeds the slice_ns limit
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-04 10:24:43 +02:00
Andrea Righi
e8a4d350ad scx_bpfland: unify dispatching kthreads with direct CPU dispatches
Always use direct CPU dispatch for kthreads, there is no need to treat
kthreads in a special way, simply reuse direct CPU dispatch to
prioritize them.

Moreover, change direct CPU dispatches to use scx_bpf_dispatch_vtime(),
since we may dispatch multiple tasks to the same per-CPU DSQ now.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-03 09:38:43 +02:00