Commit Graph

1152 Commits

Author SHA1 Message Date
Changwoo Min
eff444516f scx_lavd: directly measure service time for eligibility enforcement
Estimating the service time from run time and frequency is not
incorrect. However, it reacts slowly to sudden changes since it relies
on the moving average. Hence, we directly measure the service time to
enforce fairness.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-16 23:48:26 +09:00
Daniel Hodges
97fb0fa0c8 scx_utils: Add LLC id to CPU
Add the LLC id to the Cpu struct, which will make it easier for
referencing the LLC when doing operations on a Cpu.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-07-16 07:46:55 -07:00
David Vernet
a4d9bcd453
Merge pull request #425 from vax-r/dom_mask_precheck
scx_rusty: Pre-check task domain mask with pull domain mask
2024-07-16 09:30:27 -05:00
I Hsin Cheng
1c3b563caf scx_rusty: Pre-check task domain mask with pull domain mask
Instead of performing domain mask checking inside
"find_first_candidate()" every time, check whether the tasks within push
domain are abled to run on pull domain by performing the mask check at
vector generation stage.

This way can also avoid repeated computation generated by the same
(task, pull_dom) pair as they'll try to check whether the pull domain is
in the task domain mask.

Also since whether a task is a kworker won't change in time, we can
perform the check earlier and put it in the filter, too.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-16 21:48:06 +08:00
Tejun Heo
757cf6ff54
Merge pull request #433 from sched-ext/htejun/bump-versions
Bump versions for 1.0.1 release
2024-07-15 13:33:33 -10:00
Tejun Heo
51334b5c4d Bump versions for 1.0.1 release 2024-07-15 13:21:52 -10:00
Tejun Heo
99afc66251
Merge pull request #432 from jfernandez/fix-sub-counters
scx_utils: Display the tag value for nested counters in log_recorder
2024-07-15 12:14:49 -10:00
Jose Fernandez
a63bf77d68
scx_utils: Display the tag value for nested counters in log_recorder
The log_recorder was incorrectly displaying the tag name instead of the
value for nested counters.

Signed-off-by: Jose Fernandez <josef@netflix.com>
2024-07-15 15:20:44 -06:00
Andrea Righi
e268c589ca
Merge pull request #429 from sched-ext/bpfland-small-improvements
scx_bpfland: small improvements
2024-07-15 20:08:46 +02:00
Andrea Righi
8e7a526356 scx_bpfland: use nr_cpu_ids for consistency
We always use nr_cpu_ids to represent the maximum CPU id returned by
scx_bpf_nr_cpu_ids().

Replace cpu_max with nr_cpu_ids to be more consistent with the rest of
the code.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-15 08:44:35 +02:00
Tejun Heo
8aae6917e6
Merge pull request #431 from sched-ext/ci-add-kernel-config
ci: Use latest upstream kernel and exclude scx_flatcg and scx_pair from testing
2024-07-14 19:40:55 -10:00
Andrea Righi
f72e7567bf ci: include kernel config chunk
The kernel from kernel.org lacks the necessary .config chunk required
by the build via virtme-ng.

To fix this, include the kernel .config chunk directly in the scx
repository and use it from here.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-15 07:10:07 +02:00
Tejun Heo
759be0e406 ci: Use latest upstream kernel and exclude scx_flatcg and scx_pair from testing 2024-07-14 13:27:51 -10:00
Tejun Heo
e580ddec3c
Merge pull request #427 from ptr1337/fix-packaging
install_user_scheds: Skip packaging of scx_mitosis
2024-07-14 13:19:49 -10:00
Tejun Heo
4141082a26
Merge pull request #428 from ptr1337/systemd
systemd: Drop temporarily disabled schedulers from service
2024-07-14 13:16:26 -10:00
Andrea Righi
33d06f653b scx_bpfland: get rid of the MAX_CPUS hard-coded limit
We can rely on scx_bpf_nr_cpu_ids() to create all the possible per-CPU
DSQs, eliminating the need for the hard-coded limit MAX_CPUS.

In this way scx_bpfland can support the same amount of CPUs that the
kernel can handle.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-15 00:17:30 +02:00
Andrea Righi
b80ef7d8eb scx_bpfland: optimize offline CPU handling
Instead of constantly checking the need to drain tasks from the DSQs of
the offline CPUs, provide an atomic flag to notify when there are tasks
to be drained from the offline CPUs.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-15 00:17:23 +02:00
Andrea Righi
0530706710 scx_bpfland: properly initialize the nvcsw metrics
Initialize the number of voluntary context switches metrics in the local
task storage.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-15 00:16:10 +02:00
Andrea Righi
bf4ad23599 scx_bpfland: refine interactive tasks flood safeguard
Refine the safeguard mechanism to avoid generating too many interactive
tasks in the system, which could nullify the effect of the
interactive/regular task classification.

The safeguard mechanism operates by pausing the promotion of new tasks
to interactive status during the task wake-up process, whenever the
number of interactive tasks in the priority queue exceeds a specific
limit (set to 4x the number of online CPUs).

Halting the promotion of additional interactive tasks allows to
prioritize those already classified as interactive, thereby preventing
potential "bursts" of excessive interactive tasks in the system.

This refines the mitigation already provided by commit 640bd562
("scx_bpfland: prevent tasks from abusing interactive priority boost").

Fixes: 640bd562 ("scx_bpfland: prevent tasks from abusing interactive priority boost")
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-15 00:11:34 +02:00
Andrea Righi
eb1cf0e670 scx_bpfland: improve task time slice evaluation
Always assign the maximum time slice if there are idle CPUs in the
system.

Otherwise, double the task's unused time slice to reward tasks that use
less CPU time and at the same time refill the time slice of the tasks
every time they're dispatched.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-14 23:24:24 +02:00
Peter Jung
9e2caa74c0
systemd: Drop temporarily disabled schedulers from service
Signed-off-by: Peter Jung <admin@ptr1337.dev>
2024-07-14 20:26:04 +02:00
Peter Jung
a7fa651bfc
install_user_scheds: Skip packaging of scx_mitosis
Signed-off-by: Peter Jung <admin@ptr1337.dev>
2024-07-14 20:21:14 +02:00
Tejun Heo
3ae76acd12
Merge pull request #424 from sched-ext/sync-upstream-kernel-and-bump-to-1.0
Sync to upstream kernel and bump to 1.0
2024-07-14 07:00:38 -10:00
Changwoo Min
5b2112dd81
Merge pull request #421 from multics69/lavd-metrics
scx_lavd: improve time slice and waker freq calculation
2024-07-14 18:49:36 +09:00
Tejun Heo
761ec142ce Bump most versions to 1.0.0
sched_ext is about to be merged upstream. There are some compatibility
breaking changes and we're making the current sched_ext/for-6.11
1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on
!wakeup enqueues") the baseline.

Tag everything except scx_mitosis as 1.0.0. As scx_mitosis is still in early
development and is currently temporarily disabled, only the patchlevel is
bumped.
2024-07-12 11:34:14 -10:00
Tejun Heo
54c487731a Update to vmlinux-v6.10-rc2-g1edab907b57d.h
Sync to vmlinux.h from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap:
Pick idle CPU for direct dispatch on !wakeup enqueues"). This most likely
will be the commit which will be merged during the upcoming kernel v6.11
merge window.

Unfortunately, this is a compatibility breaking change. As the size of
bpf_iter_scx_dsq is reduced, schedulers that use the iterator - scx_lavd and
scx_layered - won't be able to run on older kernels. Likewise, older
binaries from before this commit won't be able to run on newer kernels.
2024-07-12 11:13:34 -10:00
Tejun Heo
f261d0f037 Sync from kernel - 1edab907b57d
Sync from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap: Pick idle
CPU for direct dispatch on !wakeup enqueues")

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-6.11

- cgroup support hasn't landed in the upstream kernel yet. This most likely
  will happen in a few weeks. For the time being, disable scx_flatcg,
  scx_pair and scx_mitosis.

- Compat macro for DSQ task iterator dropped. This is now a part of
  the baseline.

- scx_bpf_consume() isn't upstream yet. BPF interfacing side is still being
  discussed. Dropped example usage from tools/sched_ext. None of the
  practical schedulers use it, so this should be fine for now.

- scx_bpf_cpu_rq() added.

- AUTOATTACH workaround for newer libbpf versions added.
2024-07-12 11:08:41 -10:00
Tejun Heo
228080606c Update libbpf and bpftool commits
Sync to the latest. Right now, it's in an awkward place where receiving
AUTOATTACH compat updatees from kernel breaks build as the libbpf version is
marked as 1.5 but bpf_map__set_autoattach() is not available yet. Sync to
the latest.
2024-07-12 10:51:44 -10:00
Tejun Heo
274bcf7f02
Merge pull request #420 from CachyOS/fix/meson-env-var-append
meson: fix RUSTFLAGS being appended incorrectly
2024-07-12 09:06:03 -10:00
Changwoo Min
512bd143a5 scx_lavd: count only related tasks in calculating waker_freq
A task can become a runnable on any task's context not only its waker
task.  Thus, we should not count wake-up on unrelated task's context.
With this commit, the scheduler can (much more) accurately detect
waker-wakee relationsships.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-12 22:51:09 +09:00
Changwoo Min
95733f63ab scx_lavd: calculate time slice as a function of run queue length
The prior approach using the sum of weights gives too much penalty to
nice tasks with large nice values. With this commit, the time slice is
determined by the number of runnable tasks regardless of nice priority.
Note that the fairness will still be enforced based on tasks' nice
priorities (weights).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-12 22:45:22 +09:00
Vladislav Nepogodin
c7f1909415
meson: fix RUSTFLAGS being appended incorrectly 2024-07-12 16:42:53 +04:00
Changwoo Min
00fdc1d949
Merge pull request #417 from multics69/lavd-vdeadline
scx_lavd: improve virtual deadline and current clock handling
2024-07-12 14:05:44 +09:00
Changwoo Min
d4bc92bea7 scx_lavd: print lat_cri to output
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-12 13:23:56 +09:00
Changwoo Min
4c5c564523 scx_lavd: initial current logical clock to zero
To easily distinguish, let's initialize the current logical clock to
zero (not the current physical time). Also, avoid the deadline
calculation being zero by adding +1 here and there.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-12 10:15:54 +09:00
Andrea Righi
641a8c4c5c
Merge pull request #418 from sched-ext/bpfland-mitigations
scx_bpfland: mitigations and additional statistics
2024-07-11 18:32:18 +02:00
Andrea Righi
640bd562ff scx_bpfland: prevent tasks from abusing interactive priority boost
The priority boost for interactive tasks can be exploited to render the
system nearly unresponsive by creating numerous tasks that constantly
switch between wait/wakeup states.

For example, stress tests like `hackbench -l 10000` can significantly
degrade system responsiveness.

To mitigate this, limit the number of interactive tasks added to the
priority queue to 4x the number of online CPUs.

This simple approach appears to be a quite effective at identifying
potential spam of "fake" interactive tasks, while still prioritizing
legitimate interactive tasks.

Additionally, periodically refresh the interactive status of the tasks
based on their most recent average of voluntary context switches,
preventing the interactive status from being too "sticky".

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 16:13:55 +02:00
Andrea Righi
1babb2b92d scx_bpfland: prevent per-CPU kthreads starving other tasks
Avoid dispatching per-CPU kthreads directly, since this may cause
interactivity problems or unfairness, for example if there are too many
softirqs being scheduled (e.g., in presence of high RX network traffic
or when running certain stress tests, like hackbench).

Moreover, in order to help with testing and benchmarks, introduce the
option --local-kthread, that allows to restore the old behavior if
enabled.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 16:13:48 +02:00
Andrea Righi
c3ebdd338f scx_bpfland: prevent slice delta overflow
When updating the task vruntime, ensure the time slice delta is always a
positive value. Failing to do so may cause the global vruntime to
increase excessively due to overflows.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 15:58:01 +02:00
Andrea Righi
f59aa52fe7 scx_bpfland: expose the amount of online CPUs
Periodically report the amount of online CPUs to stdout.

The online CPUs are initially evaluated looking at the online cpumask,
then the value is updated in the .cpu_offline() / .cpu_online()
callbacks.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 15:58:01 +02:00
Andrea Righi
3a47b484af scx_bpfland: report interactive tasks to stdout
Keep track of the CPUs that are running interactive tasks and report
their amount to stdout.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 15:58:01 +02:00
Andrea Righi
1a1a16b9e9 scx_bpfland: fix typo in slice_ns definition
The correct default value of slice_ns 5ms, not 5s.

This change doesn't really make any difference in practice, since these
values are changed by the Rust part when the scheduler is started, but
it's good to keep this aligned to the proper values for consistency.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 15:58:01 +02:00
Changwoo Min
bdbfeb9fd1 scx_lavd: use logical current clock for virtual deadlines
This commit changes the use of a physical clock to a virtual, logical
clock in calculating deadlines.

- The virtual current clock advances upon a task's running to its
  virtual deadline.

- When enqueuing a task, its virtual deadline from the virtual current
  clock is calculated.

With the above two changes, this guarantees that there is no such task
whose virtual deadline is smaller than the virtual current clock. This
means any enqueuing task can compete with any other already enqueued
tasks. This allows a latency-critical task to be immediately scheduled
if needed.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 22:41:56 +09:00
Changwoo Min
408ea7892c scx_lavd: induce sched_prio_to_latency_weight from slice weight
So sched_prio_to_latency_weight is removed.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:37:21 +09:00
Changwoo Min
bd964acff6 scx_lavd: deprioritize a newly forked task in latency
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:36:32 +09:00
Changwoo Min
48debe416e scx_lavd: tuning the deadline equation under high load
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:35:54 +09:00
Changwoo Min
c72e063680 scx_lavd: do not use lat_prio_to_greedy_thresholds
With other optimizations, lat_prio_to_greedy_thresholds is not effective
any more.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:35:01 +09:00
Changwoo Min
9ed488798e scx_lavd: use task's runtime to determine its deaddline
It has an effect of further perferring shorter jobs.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:34:25 +09:00
Changwoo Min
e081b2a294 scx_lavd: rename LAVD_MAX_CAS_RETRY to LAVD_MAX_RETRY
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:33:56 +09:00
Andrea Righi
3df7a13117
Merge pull request #416 from sched-ext/bpfland-small-improvements
bpfland: small improvements
2024-07-08 23:11:47 +02:00