Commit Graph

873 Commits

Author SHA1 Message Date
Daniel Hodges
0319afc88e scx_layered: Update nr_cpus when resizing layers
After updating scx_layered to be topology aware the nr_cpus field on the
layer was not being updated properly. Update layer growing/shrinking
logic to correctly update the nr_cpus count.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-15 13:22:26 -07:00
Tejun Heo
cc73b6a826
Merge pull request #496 from sched-ext/htejun/scx_stat
scx_stat: Initial commit
2024-08-15 09:24:55 -10:00
Tejun Heo
b614cf848f scx_layered: Make monitor time based iterations dumber
This makes ctrl-c a bit more responsive without complicating code.
2024-08-15 09:23:29 -10:00
Tejun Heo
45fb724ee2 scx_layered: Restore cpumask reporting 2024-08-15 09:12:29 -10:00
Tejun Heo
751a38e34e scx_layered: Refactor stats printing code 2024-08-15 08:53:19 -10:00
Tejun Heo
a4f424056e scx_layered: Move stats server launching to stats.rs 2024-08-15 06:30:42 -10:00
Tejun Heo
17afc72479 scx_stats: Rename cleanups
- s/stat/stats/ on several stragglers.

- Rename traits so that they are more distinctive from struct and other
  names and follow the convention.
2024-08-15 06:24:56 -10:00
Tejun Heo
a091d5ea7d scx_layered: s/monitor.rs/stats.rs/ and make stats refresh code struct ops 2024-08-15 06:13:05 -10:00
Tejun Heo
8aae9a5de2 scx_stats: s/scx_stat/scx_stats/
Use plural form which is more widespread and also used in scheduler
implementations. No functional changes.
2024-08-15 05:31:34 -10:00
Tejun Heo
6e466d18df scx_layered: Initial switch to scx_stat
- This makes the scheduler side simpler and allows on-demand monitoring.

- OpenMetrics support is dropped for now. Will add a generic tool for it.

- This is a naive conversion. Will be further refined.

scx_layered no longer prints statistics by default. To watch statistics, run
`scx_layered --monitor` while the scheduler is running.
2024-08-14 13:48:41 -10:00
Tejun Heo
7820ec9b46 scx_stat, scx_layered: cargo fmt 2024-08-14 11:47:37 -10:00
Tejun Heo
099b6c266a scx_lavd: Build fix
Add "signal" feature to nix dependency; otherwise, build fails.
2024-08-14 07:55:04 -10:00
Andrea Righi
0f018c5fff
Merge pull request #484 from vax-r/rustland_unused
scx: Remove unused variables, imports and functions
2024-08-14 19:03:26 +02:00
Andrea Righi
f9a994412d scx_bpfland: introduce primary scheduling domain
Allow to specify a primary scheduling domain via the new command line
option `--primary-domain CPUMASK`, where CPUMASK can be a hex number of
arbitrary length, representing the CPUs assigned to the domain.

If this option is not specified the scheduler will use all the available
CPUs in the system as primary domain (no behavior change).

Otherwise, if a primary scheduling domain is defined, the scheduler will
try to dispatch tasks only to the CPUs assigned to the primary domain,
until these CPUs are saturated, at which point tasks may overflow to
other available CPUs.

This feature can be used to prioritize certain cores over others and it
can be really effective in systems with heterogeneous cores (e.g.,
hybrid systems with P-cores and E-cores).

== Example (hybrid architecture) ==

Hardware:
 - Dell Precision 5480 with 13th Gen Intel(R) Core(TM) i7-13800H
   - 6 P-cores 0..5  with 2 CPUs each (CPU from  0..11)
   - 8 E-cores 6..13 with 1 CPU  each (CPU from 12..19)

== Test ==

WebGL application (https://webglsamples.org/aquarium/aquarium.html):
this allows to generate a steady workload in the system without
over-saturating the CPUs.

Use different scheduler configurations:

 - EEVDF (default)
 - scx_bpfland using P-cores only (--primary-domain 0x00fff)
 - scx_bpfland using E-cores only (--primary-domain 0xff000)

Measure performance (fps) and power consumption (W).

== Result ==

                  +-----+-----+------+-----+----------+
                  | min | max | avg  |       |        |
                  | fps | fps | fps  | stdev | power  |
+-----------------+-----+-----+------+-------+--------+
| EEVDF           | 28  | 34  | 31.0 |  1.73 |  3.5W  |
| bpfland-p-cores | 33  | 34  | 33.5 |  0.29 |  3.5W  |
| bpfland-e-cores | 25  | 26  | 25.5 |  0.29 |  2.2W  |
+-----------------+-----+-----+------+-------+--------+

Using a primary scheduling domain of only P-cores with scx_bpfland
allows to achieve a more stable and predictable level of performance,
with an average of 33.5 fps and an error of ±0.5 fps.

In contrast, using EEVDF results in an average frame rate of 31.0 fps
with an error of ±3.0 fps, indicating slightly less consistency, due to
the fact that tasks are evenly distributed across all the cores in the
system (both slow and fast cores).

On the other hand, using a scheduling domain solely of E-cores with
scx_bpfland results in a lower average frame rate (25.5 fps), though it
maintains a stable performance (error of ±0.5 fps), but the power
consumption is also reduced, averaging 2.2W, compared to 3.5W with
either of the other configurations.

== Conclusion ==

In summary, with this change users have the flexibility to prioritize
scheduling on performance cores for better performance and consistency,
or prioritize energy efficient cores for reduced power consumption, on
hybrid architectures.

Moreover, this feature can also be used to minimize the number of cores
used by the scheduler, until they reach full capacity. This capability
can be useful for reducing power consumption even in homogeneous systems
or for conducting scheduling experiments with smaller sets of cores,
provided the system is not overcommitted.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-14 16:17:54 +02:00
Andrea Righi
a6e977c70b scx_bpfland: make output more compact
Abbreviate the statistics reported to stdout and remove the slice_ms
metric: this metric can be easily derived from slice_ns, slice_ns_min
and nr_wait, which is already reported to stdout.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-14 16:17:54 +02:00
Andrea Righi
8656effa50 scx_bpfland: update copyright info
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-14 16:17:54 +02:00
Changwoo Min
3c6d86b342 scx_lavd: upgrade nix package from 0.28.0 to 0.29.0
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-14 22:31:05 +09:00
Changwoo Min
444f0b86a5
Merge pull request #489 from multics69/lavd-amp-v4
lavd: make LAVD core-type (AMP) aware
2024-08-14 14:24:09 +09:00
Tejun Heo
4612764b82
Merge pull request #486 from vax-r/Fix_rusty_logic
scx_rusty: Fix logical error when filtering tasks
2024-08-13 09:39:12 -10:00
Daniel Hodges
646cefd46d
Merge pull request #477 from hodgesds/layered-global-match
scx_rusty: Make layer matching a global function
2024-08-12 09:14:58 -04:00
Daniel Hodges
be5213e129 scx_rusty: Make layer matching a global function
Layer matching currently takes a large number of bpf instructions.
Moving layer matching to a global function will reduce the overall
instruction count and allow for other layer matching methods such as
glob.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-12 05:44:34 -07:00
Changwoo Min
b7b8c8de90 scx_lavd: fix build errors
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 14:10:40 +09:00
Changwoo Min
182b0bd249 scx_lavd: make the verifier in 6.8 kernel happy
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:04:04 +09:00
Changwoo Min
4ecf3fc94e scx_lavd: build cpdom map from rust
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:03:18 +09:00
Changwoo Min
1f1a3dc4f1 scx_lavd: sort cores in descending order of max freq
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:40 +09:00
Changwoo Min
c213a3e44f scx_lavd: make core compaction core type aware
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:40 +09:00
Changwoo Min
c35b6b27ff scx_lavd: consider task pinning for core-type-aware ops.enqueue()
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:40 +09:00
Changwoo Min
25bf98d2a0 scx_lavd: make ops.select_cpu() core type aware
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:40 +09:00
Changwoo Min
fa87e1c593 scx_lavd: make ops.dispatch() core type aware
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:40 +09:00
Changwoo Min
c1cf11f7b1 scx_lavd: make ops.enqueue() core type aware
Put a performance-critical task to a performance critical queue and a
regular task to a regular queue.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:40 +09:00
Changwoo Min
03a8c10ece scx_lavd: add cpdom_ctx to abstract compute domain and its DSQ
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:40 +09:00
Changwoo Min
623b05a282 scx_lavd: revise perf_cri factor to reflect wakeup, runtime, and run_freq
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:40 +09:00
Changwoo Min
15871fd032 scx_lavd: turn off pinned core less aggressively
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:40 +09:00
Changwoo Min
9dc7f94cb6 scx_lavd: unifiy the deadline calculation and ineligibility calculation
The unified version is not only simpler but also works better.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:40 +09:00
Changwoo Min
4705520d40 scx_lavd: remove unnecessary options which has never been used
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:34 +09:00
I Hsin Cheng
15b40de408 scx_rusty: Fix logical error when filtering tasks
The logic of tasks filtering were moved from find_first_candidate() into
a vector filter operation in commit 1c3b563. However, it was forgotten
to transfer the logic with "NOT" since now .filter() will populate the
tasks we want, rather than .skip_while() which was throwing unwanted
tasks out.

That's why the logic here should be reverse so we won't take kworker or
migrated tasks into considerations.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-08-10 22:56:20 +08:00
I Hsin Cheng
4e40ba3b11 scx_rustland: Removed unused imports and variables
The member "topo_map" in Scheduler is never used and thus should be
removed, the related imports are removed as well.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-08-09 20:35:12 +08:00
I Hsin Cheng
b7e03b7a76 scx_bpfland: Remove unused variable
Remove unused variable "vtime" in task_vtime().

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-08-09 20:28:42 +08:00
Tejun Heo
45f7fd13b7 versions: Synchronize crate dependency versions 2024-08-08 14:45:46 -10:00
Tejun Heo
63c4a0191f
Merge branch 'main' into topic/inlined-skeleton-members 2024-08-08 14:23:37 -10:00
Tejun Heo
cd6a4d72c7 Bump versions for 1.0.2 release 2024-08-08 14:10:16 -10:00
Tejun Heo
7c3ffe96e1 Unify crate dependency versions
Different sub-projects are using different versions for the same crates.
Synchronize them to the latest.
2024-08-08 13:26:47 -10:00
Andrea Righi
9d808ae206
Merge pull request #468 from sched-ext/rustland-refactoring
scx_rustland refactoring
2024-08-07 11:38:21 +02:00
Andrea Righi
51cfb69199 scx_rustland_core: re-introduce partial mode
Re-add the partial mode option that was dropped during the refactoring.

The partial option allows to apply the scheduler only to the tasks which
have their scheduling policy set to SCHED_EXT via sched_setscheduler().

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-07 08:41:06 +02:00
Andrea Righi
e1f2b3822e scx_rustland_core: drop CPU ownership API
The API for determining which PID is running on a specific CPU is racy
and is unnecessary since this information can be obtained from user
space.

Additionally, it's not reliable for identifying idle CPUs.  Therefore,
it's better to remove this API and, in the future, provide a cpumask
alternative that can export the idle state of the CPUs to user space.

As a consequence also change scx_rustland to dispatch one task a time,
instead of dispatching tasks in batches of idle cores (that are usually
not accurate due to the racy nature of the CPU ownership interaface).

Dispatching one task at a time even makes the scheduler more performant,
due to the vruntime scheduling being applied to more tasks sitting in
the scheduler's queue.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-07 08:41:06 +02:00
Andrea Righi
9a0e7755df scx_rustland_core: export counter of online CPUs
Introduce a helper to get the amount of online CPUs tracked by the BPF
part.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-07 08:10:53 +02:00
Andrea Righi
d9c9f78e3e scx_rustland: re-align vruntime and time slice evaluation to scx_bpfland
Drop the slice boost logic and apply a vruntime and task time slice
evaluation approach similar to scx_bpfland (but implement this in the
user-space component instead of the BPF part).

Additionally, introduce a slice_us_min parameter to define the minimum
time slice that can be assigned to a task, also similar to scx_bpfland.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-07 08:10:53 +02:00
Andrea Righi
38a725ea34 scx_rlfifo: update copyright info
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-07 08:10:53 +02:00
Andrea Righi
c963d5eb05 scx_rustland: update copyright info
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-07 08:10:53 +02:00
Andrea Righi
b87541a26e scx_rustland_core: refactor idle CPU selection logic
Use the same idle selection logic used in scx_bpfland also in
scx_rustland_core.

Also drop fifo_mode and always use the BPF idle selection logic by
default as long as the system is not saturated, unless full_user is
specified.

This approach allows user-space schedulers aiming for maximum
performance to leverage the BPF idle selection logic (bypassing
user-space), while those seeking full control can enable full_user to
bypass the BPF CPU idle selection logic and choose the target CPU for
each task from user-space.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-07 08:10:53 +02:00
Andrea Righi
d8985306f4 scx_rustland: user-space interactive task classifier
We don't need to send the number of voluntary context switches (nvcsw)
from BPF to user-space, as this information is already accessible in
user-space via procfs. Sending this data would only create unnecessary
overhead for schedulers that don't require it, and those that do can
easily retrieve it through procfs.

Therefore, drop this metric from scx_rustland_core and change
scx_rustland implementing an interactive task classifier fully in the
user-space part of the scheduler.

Also drop some options that are not provide any significant benefit
(also in preparation of a bigger refactoring to define a better API for
the user-space framework).

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-06 17:56:58 +02:00
Daniel Hodges
d5efcd3245 scx_layered: Fix cred declaration
The use of the cred struct should be const.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-06 05:22:12 -07:00
Tejun Heo
b226865b96 scx_lavd: Make FlatTopology::new() a bit prettier
- Use .enumerate() consistently while building the cpu_fids vector.

- Use .then_with() to chain .cmp() when sorting cpu_fids.

Both reduce visual clutter.
2024-08-04 11:16:19 -10:00
Changwoo Min
130ea97fbf
Merge pull request #464 from multics69/lavd-amp-v3
scx_lavd: improve the calculation of ineligibility duration
2024-08-03 09:57:41 +09:00
Andrea Righi
3ad2875240
Merge pull request #463 from sched-ext/bpfland-update-dsq-vtime
scx_bpfland: always re-align task's vruntime to the global vruntime
2024-08-02 22:13:12 +02:00
Daniel Hodges
1f922b9a73 scx_layered: Add support for disabling topology awareness
Add a parameter to disable topology awareness. This is useful when
trying to compare the scheduling performance of topology aware
scheduling compared to the previous scheduling strategy.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-02 08:07:19 -07:00
Changwoo Min
f3fd6e9cb3 scx_lavd: drop 2-level-scheduling
With optimizations of calculatring ineligibility duration, now the
scheduler works well under heavy load without 2-level scheduling, so we
drop it for simplicitiy.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-02 21:46:07 +09:00
Changwoo Min
c38e749c36 scx_lavd: improve the equation for calculating ineligibility duration
This commit include a few changes:
- treat a new forked task more conservatively
- defer the execution of more tasks for longer time using ineligibility duration
- consider if a task is waken up in calculating ineligibility duration
2024-08-02 21:08:29 +09:00
Andrea Righi
bee0d699ef scx_bpfland: always re-align task's vruntime to the global vruntime
Immediately re-align p->scx.dsq_vtime to the global vruntime (+/- slice
lag) as soon as we are evaluating the task's vruntime.

This allows rapidly chase the minimum global vruntime, ensuring to not
over prioritize tasks tasks with a predominantly sleeping behavior
pattern.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-08-02 13:11:25 +02:00
Changwoo Min
5e194330f0 scx_lavd: consider task's wakeup and vruntime (starvation) more aggressively
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-02 12:25:29 +09:00
Daniel Hodges
de7b5fe190 scx_layered: Fix dispatch fallback CPU selection
When the previous CPU for a task is not known do not fall back to
dispatching to CPU 0, use the current CPU.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-07-31 12:35:22 -07:00
Changwoo Min
fc0ffeb45b scx_lavd: print the overall status of a scheduled task
L or R: Latency-critical, Regular
H or I: performance-Hungry, performance-Insensitive
B or T: Big, liTtle
E or G: Eligible, Greedy
P or N: Preemption, Not

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-31 19:00:35 +09:00
Changwoo Min
22d4b13e8e scx_lavd: classify CPUs into BIG and little ones based on their average capacity
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-31 19:00:35 +09:00
Changwoo Min
0ad2f30fa8
Merge pull request #460 from multics69/lavd-misc
scx_lavd: misc updates
2024-07-31 08:55:04 +09:00
Daniel Hodges
c224154866
Merge pull request #459 from hodgesds/layer-cpu-counter
scx_layered: Add per cpu layer iterator offset
2024-07-30 16:00:37 -04:00
Daniel Hodges
4f12bebaa5 scx_layered: Add per cpu layer iterator offset
Add a per cpu counter offset to round robin when iterating on layers.
This is to make selection from different layers more fair.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-07-30 10:44:41 -07:00
Changwoo Min
9b455cf010
Merge pull request #458 from sched-ext/lavd-fix-cpu-ctx-size
scx_lavd: set correct size for cpu_ctx_stor
2024-07-31 00:39:13 +09:00
Changwoo Min
6136cbee65 scx_lavd: tuning the time slice and preemption margins
Tuning the time slice under high load and change the kick/tick margins
for preemption more conservative. Especially, aggressive IPI-based
preemption (kick) causes performance unstability.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-31 00:30:59 +09:00
Changwoo Min
35b0d9f3c2 scx_lavd: improve starvation factor equation
Instead of using coarse-grained log(), let's directly use the ratio of
task's service time. Also, the virtual dealine equation is also updated
to reflect this change.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-31 00:27:17 +09:00
Changwoo Min
f9657a549f scx_lavd: fix bpf verification error in old kernel versions
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-31 00:22:43 +09:00
Changwoo Min
d2615b4975 scx_lavd: fix warnings from the rust code
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-31 00:21:32 +09:00
Andrea Righi
2015faa745 scx_lavd: set correct size for cpu_ctx_stor
The max_entries parameter in BPF_MAP_TYPE_PERCPU_ARRAY defines the
number of values per CPU and for cpu_ctx_stor we only need one item: the
CPU context.

Set max_entries to 1 to avoid allocating unnecessary memory and slightly
reduce the memory footprint.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-30 09:32:55 +02:00
Changwoo Min
643edb5431
Merge pull request #457 from multics69/lavd-amp-v2
scx_lavd: support two-level scheduling for heavy-loaded cases (like bpfland)
2024-07-30 10:39:06 +09:00
Changwoo Min
b91c1e4759 scx_lavd: add more comments on no_2_level_scheduling implementation
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-29 12:22:28 +09:00
Changwoo Min
f71fff9bbe scx_lavd: print a warning message when system does not provide a proper freq info
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-28 15:53:02 +09:00
Changwoo Min
4449d8e31c scx_lavd: incorporate a task's static priority in calculating its latency criticality
That's because static (nice) priority is a strong hint to distinguish
latency-critical tasks.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-28 15:41:43 +09:00
Changwoo Min
221f1fe12a scx_lavd: further prioritize producers over consumers
That is because many latency-critical tasks are producers.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-28 15:38:54 +09:00
Changwoo Min
7106e8cdca scx_lavd: support two-level scheduling for heavy-loaded cases
We introduce two-level scheduling similar to scx_bpfland. The two-level
scheduling consists of two DSQs: 1) latency-critical run queue and 2)
regular run queue. The scheduler prioritizes scheduling tasks on the
latency-critical queue but makes its best effort to schedule tasks on
the regular queue. The scheduler could be more resilient under heavy
load by segregating regular, non-latency-critical tasks from
latency-critical tasks.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-28 15:33:17 +09:00
Changwoo Min
9236c3e57c scx_lavd: increase the targeted latency for heavy loaded cases
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-28 15:30:01 +09:00
Changwoo Min
230512208d scx_lavd: fix div by zero error in some installations
The max frequency information from topology (from sysfs) seems not
always true. In some installations, it returns zero for all CPUs. In
this case, let's just consider all CPUs have the same capacity (1024),
hoping the kernel can give more preceise information.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-28 12:47:00 +09:00
Changwoo Min
59e54f4972 scx_lavd: print how to disable logging
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-28 12:31:51 +09:00
Changwoo Min
df1108ec6c scx_lavd: segregate starvation factor from the latency criticality (refactoring)
Latency criticality is a task's inherent property, but the starvation
factor is its dynamic status for the urgency of scheduling. Hence, we
segregate the starvation factor out. Also, cleaned up unnecessary
arguments and struct fields related.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-27 17:25:39 +09:00
Changwoo Min
d4a5a629ff
Merge pull request #452 from multics69/lavd-core-compaction-v2
lavd_lavd: initial support for AMP (asynmmetric multi-processor) architecture
2024-07-27 16:22:27 +09:00
Changwoo Min
eeea847697 scx_lavd: adjust time slice based on CPU's capacity
When a task is running on more performant core, the scheduler will give
a longer time slice. On the other hand, on a less performant core, a
shorter time slice will be assigned. The longer time slice helps
boosting clock frequency on a performant core. Also, the shorter time
slice gives more chance the performant core being utilized.

Regarding the CPU capacity, we first check if kernel-provided capacitiy values
are trustworthy or not. If not (i.e., all the same values), we rely on
the user-provided value, based on each CPU's maximum clock frequency.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-26 18:46:21 +09:00
Changwoo Min
e7b6ed1838 scx_lavd: add --prefer-smt-core option
With the --prefer-smt-core option is on, the core compaction prefers to
 utilizae hyper-twin first before utilizing the other physical CPUs. By
 default, the option is off.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-26 18:46:21 +09:00
Changwoo Min
19e337cd9b scx_lavd: make the core compaction AMP-aware
Previously, the core compaction assumed that each core's capacity was
the same. Now, we additionally consider each core's max clock frequency.
So, it always tries to use the higher-frequency cores first.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-26 18:46:21 +09:00
Changwoo Min
dbb3957eb1 scx_lavd: add a missing no_freq_scaling option check
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-26 18:46:21 +09:00
Changwoo Min
90b57a3fd7 scx_lavd: put a pinned kernel task to an overflow set
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-26 18:46:21 +09:00
Changwoo Min
e76bf999df scx_lavd: clean up constants (no functional changes)
Remove unused constants and rename outdated constants to proper names
(LAVD_TC_* to LAVC_CC_* and LAVD_ELIGIBLE_DSQ to LAVD_GLOBAL_DSQ).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-26 18:46:21 +09:00
Andrea Righi
19854f1535 scx_bpfland: allow to specify negative values with --slice-us-lag
Using negative values with --slice-us-lag can be useful to make
performance more consistent and prioritize newly created tasks over the
running tasks.

Therefore, allow to specify negative values from the command line and
also update the documentation of this option.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-26 09:10:18 +02:00
David Vernet
5401876430
Revert "rusty: Rework deadline as a signed sum" 2024-07-25 14:50:45 -05:00
David Vernet
09536aa15d
Merge pull request #309 from sched-ext/rusty_improved_dl
rusty: Rework deadline as a signed sum
2024-07-25 13:44:54 -05:00
David Vernet
c1ad602ce5
rusty: Transfer latency priority between CPU-intensive and interactive tasks
In some scenarios, a CPU-intensive task may be on the critical path for
interactive workloads. For example, you may have a game with CPU-intensive
tasks that are crunching the logic for the game, and that's required for the
game to proceed without being choppy.

To support such workflows, this change adds logic to allow a non-interactive
task to inherit the lower (i.e. stronger) latency priority of another task if
it wakes or is woken by that task.

Signed-off-by: David Vernet <void@manifault.com>
2024-07-25 11:55:40 -05:00
David Vernet
933ea9baa1
rusty: Rework deadline as a signed sum
Currently, a task's deadline is computed as its vtime + a scaled function of
its average runtime (with its deadline being scaled down if it's more
interactive). This makes sense intuitively, as we do want an interactive task
to have an earlier deadline, but it also has some flaws.

For one thing, we're currently ignoring duty cycle when determining a task's
deadline. This has a few implications. Firstly, because we reward tasks with
higher waker and blocked frequencies due to considering them to be part of a
work chain, we implicitly penalize tasks that rarely ever use the CPU because
those frequencies are low. While those tasks are likely not part of a work
chain, they also should get an interactivity boost just by pure virtue of not
using the CPU very often. This should in theory be addressed by vruntime, but
because we cap the amount of vtime that a task can accumulate to one slice, it
may not be adequately reflected after a task runs for the first time.

Another problem is that we're minimizing a task's deadline if it's interactive,
but we're also not really penalizing a task that's a super CPU hog by
increasing its deadline. We sort of do a bit by applying a higher niceness
which gives it a higher deadline for a lower weight, but its somewhat minimal
considering that we're using niceness, and that the best an interactive task
can do is minimize its deadline to near zero relative to its vtime.

What we really want to do is "negatively" scale an interactive task's deadline
with the same magnitude as we "positively" scale a CPU-hogging task's deadline.
To do this, we make two major changes to how we compute deadline:

1. Instead of using niceness, we now instead use our own straightforward
   scaling factor. This was chosen arbitrarily to be a scaling by 1000, but we
   can and should improve this in the future.

2. We now create a _signed_ linear latency priority factor as a sum of the
   three following inputs:
   - Work-chain factor (log_2 of product of blocked freq and waker freq)
   - Inverse duty cycle factor (log_2 of the inverse of a task's duty cycle --
     higher duty cycle means lower factor)
   - Average runtime factor (Higher avg runtime means higher average runtime
     factor)

We then compute the latency priority as:

	lat_prio := Average runtime factor - (work-chain factor + duty cycle factor)

This gives us a signed value that can be negative. With this, we can compute a
non-negative weight value by calculating a weight from the absolute value of
lat_prio, and use this to scale slice_ns. If lat_prio is negative we calculate
a task's deadline as its vtime MINUS its scaled slice_ns, and if it's positive,
it's the task's vtime PLUS scaled slice_ns.

This ends up working well because you get a higher weight both for highly
interactive tasks, and highly CPU-hogging / non-interactive tasks, which lets
you scale a task's deadline "more negatively" for interactive tasks, and "more
positively" for the CPU hogs.

With this change, we get a significant improvement in FPS. On a 7950X, if I run
the following workload:

	$ stress-ng -c $((8 * $(nproc)))

1. I get 60 FPS when playing Stellaris (while time is progressing at max
   speed), whereas EEVDF gets 6-7 FPS.

2. I get ~15-40 FPS while playing Civ6, whereas EEVDF seems to get < 1 FPS. The
   Civ6 benchmark doesn't even start after over 4 minutes in the initial frame
   with EEVDF, but gets us 13s / turn with rusty.

3. It seems that EEVDF has improved with Terraria in v6.9. It was able to
   maintain ~30-55 FPS, as opposed to the ~5-10FPS we've seen in the past.
   rusty is still able to maintain a solid 60-62FPS consistently with no
   problem, however.
2024-07-25 11:55:03 -05:00
Daniel Hodges
4c3fd6cd9b scx_layered: Rename UserId and GroupId
TLDR; rename UserId and GroupId to UIDEquals and GIDEquals.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-07-24 15:09:08 -07:00
Daniel Hodges
55f6d68eef scx_layered: Add user and group layers
Add a layer match based on either the effective user id or the effective
group id. This allows for creating layers for individual users or
groups.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-07-24 15:09:08 -07:00
Daniel Hodges
4042fc42d7
Merge pull request #446 from hodgesds/layered-topo
scx_layered: Add topology awareness for NUMA nodes and LLCs
2024-07-24 18:06:43 -04:00
Daniel Hodges
2803f9c127 scx_layered: Fix formatting issues
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-07-24 14:39:02 -07:00
Daniel Hodges
0814abf0b8 scx_layered: Add node topology awareness
Add NUMA node topology awareness for scx_layared. This borrows some of
the NUMA handling from scx_rusty and allows layers to set a node mask.
Different layer kinds will use the node mask differently.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-07-24 09:53:48 -07:00
Daniel Müller
98af514972 scx_rusty: Simplify LoadBalancer::populate_tasks_by_load()
Simplify LoadBalancer::populate_tasks_by_load() by cutting out the
heap allocation bits, by moving mutable accesses in front of immutable
ones. Because multiple immutable accesses (between bss and rodata) do
not conflict, we don't need the intermediate PID storage.

Signed-off-by: Daniel Müller <deso@posteo.net>
2024-07-23 13:59:26 -07:00
Andrea Righi
46ddca6bd5 scx_bpfland: report task time slice to stdout
Periodically report to stdout samples of the effective time slice
applied to tasks.

While one could determine this metric by examining the max slice_ns and
nr_waiting metrics, directly reporting it to stdout allows users to
quickly identify what is happening and it provides a clearer overview of
the scheduling behavior.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-22 22:01:49 +02:00
Andrea Righi
c1d93d2a00 scx_bpfland: drop kthread dispatches metric
Dispatching per-CPU kthreads directly is disabled by default, reporting
this metric can generate some confusion (since it is always 0), and even
if local kthread dispatches are enabled, they should be still considered
as regular direct dispatches (there is no difference in practice).

Therefore, merge direct kthread dispatches into direct dispatches and
drop the separate nr_kthread_dispatches metric.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-22 22:01:49 +02:00
Andrea Righi
a5f1d6b595 scx_bpfland: show average amount of tasks waiting to be dispatched
Periodically report the average amount of tasks sitting in the priority
and shared DSQs.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-22 22:01:45 +02:00
Andrea Righi
5908a985bc scx_bpfland: adjust task time slice based on the amount of waiting tasks
Scale the task's time slice based on the average amount of tasks that
are currently waiting to be dispatched.

Use a moving average for the amount of waiting tasks to smooth out
potential spikes caused by temporary bursts of tasks piling in the wait
queues.

This was initially modeled in scx_rustland and it seems to work pretty
well also in scx_bpfland now.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-22 21:53:25 +02:00
Changwoo Min
af75d147c8
Merge pull request #443 from multics69/lavd-vtime
scx_lavd: overhaul the virtual deadline algorithm
2024-07-21 18:00:57 +09:00
Changwoo Min
a9aab6b229 scx_lavd: fix typo
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-21 17:58:44 +09:00
Changwoo Min
add96f0e18 scx_lavd: do not maintain ineligible runnable tasks separately
With all the other optimizations and tunings, it turns out that maintaining
two runqueues has more harm than good.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-20 17:49:12 +09:00
Changwoo Min
827187d213 scx_lavd: adjust ineligible duration according to task's lat_cri
Further depenalize above-average latency-critical tasks and penalize
further below-avergage latency-critical tasks in ineligibility duration.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-20 17:37:27 +09:00
Changwoo Min
c653622ed9 scx_lavd: add LAVD_VDL_LOOSENESS_FT in calculating virtual deadline
LAVD_VDL_LOOSENESS_FT represents how loose the deadline is. The smaller
value means the deadline is tighter. While it is unlikely to be tuned,
let's keep it as a tunable for now.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-20 12:00:50 +09:00
Changwoo Min
e94070d5ca scx_lavd: remove LAVD_BOOST_*
These are no longer necessary after directly using latency criticality.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-20 11:53:20 +09:00
Changwoo Min
43f0fcb87c scx_lavd: removed unused LAVD_LOAD_FACTOR_*
These are no longer necessary after remnoving load factor calculation.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-20 11:51:12 +09:00
David Vernet
4f11e2abe2
layered: Don't dispatch to LO_FALLBACK_DSQ
Non-kthreads with custom affinities in non-open layers are dispatched into a
LO_FALLBACK_DSQ, with the idea being that they're penalized for their custom
affinities. When a host is fully utilized, these tasks can end up being starved
due to LO_FALLBACK_DSQ being consumed only when there are no other layers to
consume from. In internal workloads at Meta, we've observed that this can
happen in practice.

Longer term, we can probably address this by implementing layer weights and
applying that to fallback DSQs to avoid starvation. For now, let's just
dispatch them to HI_FALLBACK_DSQ to avoid this starvation issue.

Signed-off-by: David Vernet <void@manifault.com>
2024-07-19 19:14:18 -05:00
Changwoo Min
3924ebaa4d scx_lavd: properly synchronize taskc->vdeadline_log_clk
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-20 01:41:29 +09:00
Changwoo Min
02ad43d116 scx_lavd: directly use p->scx.weight instead load_ideal
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-20 00:25:11 +09:00
Changwoo Min
c955caefd8 scx_lavd: drop sys_load_factor
In theory, sys_load_factor should not be necessary since we do not
stretch the time space anymore.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-20 00:10:29 +09:00
Changwoo Min
67a6deb983 scx_lavd: use lat_cri instead of lat_prio universally
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-19 23:56:51 +09:00
Daniel Hodges
b98a9f56a8 scx_layered: Add separate module for metrics
Refactor the main module for scx_layered to move metrics into a separate
module. This change does no functional differences, only code structure.
This will make it a little easier to navigate the logic in the main
scheduler code.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-07-19 07:40:24 -07:00
Changwoo Min
6f10d6907c scx_lavd: drop sched_prio_to_slice_weight[] table
Use p->scx.weight instead.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-19 22:39:01 +09:00
Changwoo Min
034303f00f scx_lavd: consider starvation factor in determining latency criticality
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-19 22:17:50 +09:00
Daniel Hodges
d974690b5d
Merge pull request #435 from vax-r/remove_skip_while
scx_rusty: Remove skip_while in find_first_candidate
2024-07-19 08:38:58 -04:00
Changwoo Min
99e0d21c3c scx_lavd: drop the runtime factor in calculating latency criticality
That is okay since the runtime is considered in calculating a virtual
deadline. A shorter runtime will result in a tighter deadline linearly.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-19 17:28:40 +09:00
Changwoo Min
b90599e967 scx_lavd: do not inherit parent's properties
If inheriting the parent's properties, a new fork task tends to be too
prioritized. That is, many parent processes, such as `make,` are a bit
more latency-critical than average.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-19 15:29:13 +09:00
Andrea Righi
c4eb3ce7b4 scx_bpfland: introduce dynamic nvcsw threshold
Instead of using a static value to classify tasks based on their average
amount of voluntary context switches, try to periodically evaluate an
optimal threshold, based on a global average of voluntary context
switches among of all the running tasks.

Tasks with an average amount of voluntary context switches greater than
the global average will be classified as interactive.

The global average is evaluated as an exponentially weighted moving
average (EWMA), as:

  avg(t) = avg(t - 1) * 0.75 - task_avg(t) * 0.25

This approach is more efficient than iterating through all tasks and it
helps to prevent rapid fluctuations that may be caused by bursts of
voluntary context switch events.

The dynamic nvcsw threshold enables a more precise adjustment of
the classification criteria to swiftly respond to global system changes:
tasks can be quickly classified as interactive, but if the system
experiences too many interactive events, the criteria for maintaining
interactive status become stricter. This creates a natural selection
process where only the most deserving tasks remain interactive.

Additionally, introduce the new option `--nvcsw-max-thresh N`, which
allows to extend or restrict the fluctuation range of the global average
threshold for voluntary context switches.

Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-18 19:03:25 +02:00
Changwoo Min
78d96a6fb6 scx_lavd: advance clock by reverse proportional to the system load
Advancing the clock slower when overloaded gives more opportunities for
latency-critical tasks to cut in the run queue. Controlling the clock
better reflects the actual load than the prior approach of stretching
the time-space when overloaded.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-18 15:53:38 +09:00
Changwoo Min
9bc20f9160 scx_lavd: maintain ineligible runnable tasks separately
We now maintain two run queues—an eligible run queue (DSQ) and an
ineligible run queue (rbtree)—sorted by the task's virtual deadline.
When the eligible run queue is empty, or the ineligible run queue has
not been consumed for too long (e.g., 15 msec), a task in the ineligible
run queue is moved to the eligible run queue for execution. With these
two queues, we have a better admission control.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-17 23:46:11 +09:00
I Hsin Cheng
2525b94af4 scx_rusty: Remove unused variable
Remove unused variable "has_preferred_dom".

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-17 20:30:17 +08:00
I Hsin Cheng
bf2f0fbf35 scx_rusty: Remove skip_while in find_first_candidate
Followed commit 1c3b563, move the checking of task.migrated.get() into
the vector filter. In this way, we can remove the skip_while() call in
find_first_candidate().

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-17 20:27:12 +08:00
Changwoo Min
55e19ea5df scx_lavd: do not prioritize a wake-up task in ops.select_cpu()
This is a prep for adding an ineligible DSQ.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-17 11:16:02 +09:00
Changwoo Min
c84b73e971 scx_lavd: rename LAVD_GLOBAL_DSQ to LAVD_ELIGIBLE_DSQ
This is a prep to add a global ineligible dsq.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-17 10:34:34 +09:00
Daniel Müller
565aec3662 rust: Update libbpf-rs & libbpf-cargo to 0.24
Update libbpf-rs & libbpf-cargo to 0.24. Among other things, generated
skeletons now contain directly accessible map and program objects, no
longer necessitating the use of accessor methods. As a result, the risk
for mutability conflicts is reduced greatly.

Signed-off-by: Daniel Müller <deso@posteo.net>
2024-07-16 11:48:52 -07:00
Daniel Hodges
27122a8a00 scx_rusty: refactor mempolicy handling bpf code and load balancing
This change refactors some of the helper methods for getting the
preferred node for tasks using mempolicy. The load balancing logic in
try_find_move_task is updated to allow for a filter, which is used to
filter for tasks with a preferred mempolicy.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-07-16 09:40:00 -07:00
Daniel Hodges
43a263aa75 scx_rusty: Use preferred node mask with balancer
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-07-16 08:11:19 -07:00
Daniel Hodges
bab6e9523c scx_rusty: Add mempolicy checks to rusty
This change makes scx_rusty mempolicy aware. When a process uses
set_mempolicy it can change NUMA memory preferences and cause
performance issues when tasks are scheduled on remote NUMA nodes. This
change modifies task_pick_domain to use the new helper method that
returns the preferred node id.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-07-16 08:11:19 -07:00
Changwoo Min
971bb2e024 scx_lavd: pretty formatting for ineligible duration
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-16 23:54:15 +09:00
Changwoo Min
adfbf3934c scx_lavd: tuning the max ineligible duration
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-16 23:52:23 +09:00
Changwoo Min
eff444516f scx_lavd: directly measure service time for eligibility enforcement
Estimating the service time from run time and frequency is not
incorrect. However, it reacts slowly to sudden changes since it relies
on the moving average. Hence, we directly measure the service time to
enforce fairness.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-16 23:48:26 +09:00
I Hsin Cheng
1c3b563caf scx_rusty: Pre-check task domain mask with pull domain mask
Instead of performing domain mask checking inside
"find_first_candidate()" every time, check whether the tasks within push
domain are abled to run on pull domain by performing the mask check at
vector generation stage.

This way can also avoid repeated computation generated by the same
(task, pull_dom) pair as they'll try to check whether the pull domain is
in the task domain mask.

Also since whether a task is a kworker won't change in time, we can
perform the check earlier and put it in the filter, too.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-16 21:48:06 +08:00
Tejun Heo
51334b5c4d Bump versions for 1.0.1 release 2024-07-15 13:21:52 -10:00
Andrea Righi
8e7a526356 scx_bpfland: use nr_cpu_ids for consistency
We always use nr_cpu_ids to represent the maximum CPU id returned by
scx_bpf_nr_cpu_ids().

Replace cpu_max with nr_cpu_ids to be more consistent with the rest of
the code.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-15 08:44:35 +02:00
Andrea Righi
33d06f653b scx_bpfland: get rid of the MAX_CPUS hard-coded limit
We can rely on scx_bpf_nr_cpu_ids() to create all the possible per-CPU
DSQs, eliminating the need for the hard-coded limit MAX_CPUS.

In this way scx_bpfland can support the same amount of CPUs that the
kernel can handle.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-15 00:17:30 +02:00
Andrea Righi
b80ef7d8eb scx_bpfland: optimize offline CPU handling
Instead of constantly checking the need to drain tasks from the DSQs of
the offline CPUs, provide an atomic flag to notify when there are tasks
to be drained from the offline CPUs.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-15 00:17:23 +02:00
Andrea Righi
0530706710 scx_bpfland: properly initialize the nvcsw metrics
Initialize the number of voluntary context switches metrics in the local
task storage.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-15 00:16:10 +02:00
Andrea Righi
bf4ad23599 scx_bpfland: refine interactive tasks flood safeguard
Refine the safeguard mechanism to avoid generating too many interactive
tasks in the system, which could nullify the effect of the
interactive/regular task classification.

The safeguard mechanism operates by pausing the promotion of new tasks
to interactive status during the task wake-up process, whenever the
number of interactive tasks in the priority queue exceeds a specific
limit (set to 4x the number of online CPUs).

Halting the promotion of additional interactive tasks allows to
prioritize those already classified as interactive, thereby preventing
potential "bursts" of excessive interactive tasks in the system.

This refines the mitigation already provided by commit 640bd562
("scx_bpfland: prevent tasks from abusing interactive priority boost").

Fixes: 640bd562 ("scx_bpfland: prevent tasks from abusing interactive priority boost")
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-15 00:11:34 +02:00
Andrea Righi
eb1cf0e670 scx_bpfland: improve task time slice evaluation
Always assign the maximum time slice if there are idle CPUs in the
system.

Otherwise, double the task's unused time slice to reward tasks that use
less CPU time and at the same time refill the time slice of the tasks
every time they're dispatched.

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-14 23:24:24 +02:00
Tejun Heo
3ae76acd12
Merge pull request #424 from sched-ext/sync-upstream-kernel-and-bump-to-1.0
Sync to upstream kernel and bump to 1.0
2024-07-14 07:00:38 -10:00
Changwoo Min
5b2112dd81
Merge pull request #421 from multics69/lavd-metrics
scx_lavd: improve time slice and waker freq calculation
2024-07-14 18:49:36 +09:00
Tejun Heo
761ec142ce Bump most versions to 1.0.0
sched_ext is about to be merged upstream. There are some compatibility
breaking changes and we're making the current sched_ext/for-6.11
1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on
!wakeup enqueues") the baseline.

Tag everything except scx_mitosis as 1.0.0. As scx_mitosis is still in early
development and is currently temporarily disabled, only the patchlevel is
bumped.
2024-07-12 11:34:14 -10:00
Tejun Heo
54c487731a Update to vmlinux-v6.10-rc2-g1edab907b57d.h
Sync to vmlinux.h from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap:
Pick idle CPU for direct dispatch on !wakeup enqueues"). This most likely
will be the commit which will be merged during the upcoming kernel v6.11
merge window.

Unfortunately, this is a compatibility breaking change. As the size of
bpf_iter_scx_dsq is reduced, schedulers that use the iterator - scx_lavd and
scx_layered - won't be able to run on older kernels. Likewise, older
binaries from before this commit won't be able to run on newer kernels.
2024-07-12 11:13:34 -10:00
Tejun Heo
f261d0f037 Sync from kernel - 1edab907b57d
Sync from sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap: Pick idle
CPU for direct dispatch on !wakeup enqueues")

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-6.11

- cgroup support hasn't landed in the upstream kernel yet. This most likely
  will happen in a few weeks. For the time being, disable scx_flatcg,
  scx_pair and scx_mitosis.

- Compat macro for DSQ task iterator dropped. This is now a part of
  the baseline.

- scx_bpf_consume() isn't upstream yet. BPF interfacing side is still being
  discussed. Dropped example usage from tools/sched_ext. None of the
  practical schedulers use it, so this should be fine for now.

- scx_bpf_cpu_rq() added.

- AUTOATTACH workaround for newer libbpf versions added.
2024-07-12 11:08:41 -10:00
Changwoo Min
512bd143a5 scx_lavd: count only related tasks in calculating waker_freq
A task can become a runnable on any task's context not only its waker
task.  Thus, we should not count wake-up on unrelated task's context.
With this commit, the scheduler can (much more) accurately detect
waker-wakee relationsships.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-12 22:51:09 +09:00
Changwoo Min
95733f63ab scx_lavd: calculate time slice as a function of run queue length
The prior approach using the sum of weights gives too much penalty to
nice tasks with large nice values. With this commit, the time slice is
determined by the number of runnable tasks regardless of nice priority.
Note that the fairness will still be enforced based on tasks' nice
priorities (weights).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-12 22:45:22 +09:00
Changwoo Min
00fdc1d949
Merge pull request #417 from multics69/lavd-vdeadline
scx_lavd: improve virtual deadline and current clock handling
2024-07-12 14:05:44 +09:00
Changwoo Min
d4bc92bea7 scx_lavd: print lat_cri to output
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-12 13:23:56 +09:00
Changwoo Min
4c5c564523 scx_lavd: initial current logical clock to zero
To easily distinguish, let's initialize the current logical clock to
zero (not the current physical time). Also, avoid the deadline
calculation being zero by adding +1 here and there.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-12 10:15:54 +09:00
Andrea Righi
640bd562ff scx_bpfland: prevent tasks from abusing interactive priority boost
The priority boost for interactive tasks can be exploited to render the
system nearly unresponsive by creating numerous tasks that constantly
switch between wait/wakeup states.

For example, stress tests like `hackbench -l 10000` can significantly
degrade system responsiveness.

To mitigate this, limit the number of interactive tasks added to the
priority queue to 4x the number of online CPUs.

This simple approach appears to be a quite effective at identifying
potential spam of "fake" interactive tasks, while still prioritizing
legitimate interactive tasks.

Additionally, periodically refresh the interactive status of the tasks
based on their most recent average of voluntary context switches,
preventing the interactive status from being too "sticky".

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 16:13:55 +02:00
Andrea Righi
1babb2b92d scx_bpfland: prevent per-CPU kthreads starving other tasks
Avoid dispatching per-CPU kthreads directly, since this may cause
interactivity problems or unfairness, for example if there are too many
softirqs being scheduled (e.g., in presence of high RX network traffic
or when running certain stress tests, like hackbench).

Moreover, in order to help with testing and benchmarks, introduce the
option --local-kthread, that allows to restore the old behavior if
enabled.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 16:13:48 +02:00
Andrea Righi
c3ebdd338f scx_bpfland: prevent slice delta overflow
When updating the task vruntime, ensure the time slice delta is always a
positive value. Failing to do so may cause the global vruntime to
increase excessively due to overflows.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 15:58:01 +02:00
Andrea Righi
f59aa52fe7 scx_bpfland: expose the amount of online CPUs
Periodically report the amount of online CPUs to stdout.

The online CPUs are initially evaluated looking at the online cpumask,
then the value is updated in the .cpu_offline() / .cpu_online()
callbacks.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 15:58:01 +02:00
Andrea Righi
3a47b484af scx_bpfland: report interactive tasks to stdout
Keep track of the CPUs that are running interactive tasks and report
their amount to stdout.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 15:58:01 +02:00
Andrea Righi
1a1a16b9e9 scx_bpfland: fix typo in slice_ns definition
The correct default value of slice_ns 5ms, not 5s.

This change doesn't really make any difference in practice, since these
values are changed by the Rust part when the scheduler is started, but
it's good to keep this aligned to the proper values for consistency.

Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
2024-07-11 15:58:01 +02:00
Changwoo Min
bdbfeb9fd1 scx_lavd: use logical current clock for virtual deadlines
This commit changes the use of a physical clock to a virtual, logical
clock in calculating deadlines.

- The virtual current clock advances upon a task's running to its
  virtual deadline.

- When enqueuing a task, its virtual deadline from the virtual current
  clock is calculated.

With the above two changes, this guarantees that there is no such task
whose virtual deadline is smaller than the virtual current clock. This
means any enqueuing task can compete with any other already enqueued
tasks. This allows a latency-critical task to be immediately scheduled
if needed.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 22:41:56 +09:00
Changwoo Min
408ea7892c scx_lavd: induce sched_prio_to_latency_weight from slice weight
So sched_prio_to_latency_weight is removed.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:37:21 +09:00
Changwoo Min
bd964acff6 scx_lavd: deprioritize a newly forked task in latency
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:36:32 +09:00
Changwoo Min
48debe416e scx_lavd: tuning the deadline equation under high load
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:35:54 +09:00
Changwoo Min
c72e063680 scx_lavd: do not use lat_prio_to_greedy_thresholds
With other optimizations, lat_prio_to_greedy_thresholds is not effective
any more.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:35:01 +09:00
Changwoo Min
9ed488798e scx_lavd: use task's runtime to determine its deaddline
It has an effect of further perferring shorter jobs.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:34:25 +09:00
Changwoo Min
e081b2a294 scx_lavd: rename LAVD_MAX_CAS_RETRY to LAVD_MAX_RETRY
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-07-11 21:33:56 +09:00
Andrea Righi
995577762a scx_bpfland: refill task time slice
Every time we need to dispatch a task re-evalate its time slice as:

 (unused_time_slice + min_time_slice) / 2

This allows to refill the time slice for tasks that haven't used much of
their previously assigned time, improving fairness.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-06 14:07:24 +02:00
Andrea Righi
6a64182ef2 scx_bpfland: always classify interactive tasks
Make sure to always classify interactive tasks, even when the system is
not fully utilized. This ensures that if the system suddenly becomes
overloaded, we already know which tasks need to be dispatched to the
priority DSQ.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-06 14:07:24 +02:00
Andrea Righi
8dd528abfd scx_bpfland: pass enqueue flags when dispatching kthreads
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-06 14:07:10 +02:00
Andrea Righi
fc0d1bd003
Merge pull request #415 from sched-ext/bpfland-output
scx_bpfland: additional stats and output improvements
2024-07-05 19:50:07 +02:00
Tejun Heo
af5e89e73c
Merge pull request #412 from vax-r/flatcg_delta_fetch
scx_flatcg: Make good use of __sync_fetch_and_sub()
2024-07-05 07:39:22 -10:00
Tejun Heo
14d0a0ef64
Merge pull request #411 from vax-r/Fix_typo
scx_flatcg: Fix_typo
2024-07-05 07:35:31 -10:00
Andrea Righi
2bc8f800e7 scx_bpfland: report build id version
Use the version string provided by scx_utils:build_id.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-05 09:29:29 +02:00
Andrea Righi
bdb31e98e2 scx_bpfland: show statistics in a more human-readable format
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-05 09:29:29 +02:00
Andrea Righi
f9d7844b2e scx_bpfland: split direct dispatches and kthread dispatches
Show separate statistics for direct dispatches and kthread direct
dispatches.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-05 09:27:59 +02:00
I Hsin Cheng
aae826b1b3 scx_flatcg: Make good use of __sync_fetch_and_sub()
Fetch the value of "delta" directly from the returned value from
__sync_fetch_and_sub, as it returns the origin value of
cgc->cvtime_delta.

Additional fetching instruction of cgc->cvtime_delta would be redundant
here.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-05 01:03:20 +08:00
I Hsin Cheng
3e52761487 scx_flatcg: Fix_typo
Fix "oppotunistic" to "opportunistic".

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-04 22:04:40 +08:00
Andrea Righi
cfe2ed063d scx_bpfland: time-based starvation prevention
Tasks are consumed from various DSQs in the following order:

  per-CPU DSQs => priority DSQ => shared DSQ

Tasks in the shared DSQ may be starved by those in the priority DSQ,
which in turn may be starved by tasks dispatched to any per-CPU DSQ.

To mitigate this, record the timestamp of the last task scheduling event
both from the priority DSQ and the shared DSQ.

If the starvation threshold is exceeded without consuming a task, the
scheduler will be forced to consume a task from the corresponding DSQ.

The starvation threshold can be adjusted using the --starvation-thresh
command line parameter (default is 5ms).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-04 10:52:39 +02:00
Andrea Righi
9e0db4ae17 scx_bpfland: remove unnecessary RCU read protection
There is no need to RCU protect the cpumask for the offline CPUs: it is
created once when the scheduler is initialized and it's never
deallocated.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-04 10:24:43 +02:00
Andrea Righi
cef6ca93cf scx_bpfland: adjust default time slice to 5ms
Reduce the default time slice down to 5ms for a faster reaction and
system responsiveness when the system is overcomissioned.

This also helps to provide a more predictable level of performance.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-04 10:24:43 +02:00
Andrea Righi
7d15e3171c scx_bpfland: ensure task time slice never exceeds the slice_ns limit
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-04 10:24:43 +02:00
Andrea Righi
e8a4d350ad scx_bpfland: unify dispatching kthreads with direct CPU dispatches
Always use direct CPU dispatch for kthreads, there is no need to treat
kthreads in a special way, simply reuse direct CPU dispatch to
prioritize them.

Moreover, change direct CPU dispatches to use scx_bpf_dispatch_vtime(),
since we may dispatch multiple tasks to the same per-CPU DSQ now.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-03 09:38:43 +02:00
Andrea Righi
d2231b0aed scx_bpfland: drop built-in idle CPU selection logic
Small refactoring of the idle CPU selection logic:
 - optimize idle CPU selection for tasks that can run on a single CPU
 - drop the built-in idle selection policy and completely rely on the
   custom one

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-03 08:54:17 +02:00
Andrea Righi
7c355f50b2 scx_bpfland: use the right cpumask to find any idle CPU
We are incorrectly using the SMT idle cpumask to find any idle CPU, fix
by using the generic idle cpumask.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-07-01 20:36:24 +02:00
Andrea Righi
c458f345b4
Merge pull request #408 from sched-ext/bpfland-cpu-hotplug
scx_bpfland: support CPU hotplugging
2024-07-01 19:41:00 +02:00
Dan Schatzberg
32ac4b2cff
Merge pull request #389 from dschatzberg/mitosis
mitosis: Update synchronization
2024-07-01 09:44:26 -04:00
Andrea Righi
ff7a518d28 scx_bpfland: support CPU hotplugging
Implement CPU hotplugging in scx_bpfland without restarting the
scheduler.

The idle selection logic has been updated to consider online CPUs.
Additionally, a cpumask for offline CPUs has been introduced. Tasks
that have been dispatched to the DSQs associated with offline CPUs are
consumed by the other CPUs that are still online.

Moreover, the dependency on the Topology crate is temporarily dropped
and instead, /sys/devices/system/cpu/smt/active is used to determine if
SMT should be taken into account during idle selection. The Topology
crate will be re-introduced later when scx_bpfland will gain more
topology-aware capabilities.

This fixes #406.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-30 23:04:13 +02:00
Andrea Righi
d76551bbd3 scx_rusty: fix stats map initialization
The stats map in scx_rusty is a BPF_MAP_TYPE_PERCPU_ARRAY, with its size
determined by num_possible_cpus(). Initializing it with nr_cpu_ids() can
result in errors such as:

 Error: Failed to zero stat

 Caused by:
     number of values 6 != number of cpus 8

Fix by using num_possible_cpus() to initialize it.

Fixes: 263e02f6 ("rusty: Use nr_cpu_ids instead of nr_cpus_possible")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-30 17:37:14 +02:00
Andrea Righi
74175f5a49 scx_bpfland: properly integrate with meson build
Properly honor the meson build `serialize` option.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-28 21:37:00 +02:00
Andrea Righi
f98c35fd07
Merge pull request #388 from sched-ext/bpfland
scheds: introduce scx_bpfland
2024-06-28 21:27:43 +02:00
Andrea Righi
cf4883fbf8 meson: introduce serialize build option
With commit 5d20f89a ("scheds-rust: build rust schedulers in sequence"),
schedulers are now built serially one after the other to prevent meson
and cargo from forking NxN parallel tasks.

However, this change has made building a single scheduler much more
cumbersome, due to the chain of dependencies.

For example, building scx_rusty using the specific meson target would
still result in all schedulers being built, because they all depend on
each other.

To address this issue, introduce the new meson build option
`serialize=true|false` (default is false).

This option allows to disable the schedulers' build chain, restoring the
old behavior.

With this option enabled, it is now possible to build just a single
scheduler, parallelizing the cargo build properly, without triggering
the build of the others. Example:

  $ meson setup build -Dbuildtype=release -Dserialize=false
  $ meson compile -C build scx_rusty

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-28 10:17:37 +02:00
Changwoo Min
24a238846e scx_lavd: optimizing deadline related tunables
The competition window was 7.5 msec, half of the targeted latency.
However, it is too wide for some workloads, so unrelated tasks may
compete with each other. Hence, it is tightened to about 1 msec with
LAVD_LAT_WEIGHT_SHIFT to avoid unnecessary competition.

Also, when a system is overloaded, now the time space is stretched more
aggressively (i.e., lat_prio^2) when a task's latency priority is low
(high value).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-28 09:00:45 +09:00
Andrea Righi
7606b95150 scx_bpfland: introduce maximum time slice lag
Introduce a tunable to set a limit of the minimum vruntime that is used
when a task is dispatched, as:

 vtime_min = vtime_now - slice_lag_ns

Increasing the time slice lag can make interactive tasks even more
responsive at the cost of starving regular and newly created tasks.

Default time slice lag is 0.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-27 17:28:42 +02:00
Andrea Righi
5a44329d45 scheds: introduce scx_bpfland
Overview
========

This scheduler is derived from scx_rustland, but it is fully implemented
in BFP with minimal user-space Rust part to process command line
options, collect metrics and logs out scheduling statistics.

Unlike scx_rustland, all scheduling decisions are made by the BPF
component.

Motivation
==========

The primary goal of this scheduler is to act as a performance baseline
for comparison with scx_rustland, allowing for a better assessment of
the overhead caused by kernel/user-space interactions.

It can also be used to deploy prototypes initially tested in the
scx_rustland scheduler. In fact, this scheduler is expected to
outperform scx_rustland, due to the elimitation of the kernel/user-space
overhead.

Scheduling policy
=================

scx_bpfland is a vruntime-based sched_ext scheduler that prioritizes
interactive workloads. Its scheduling policy closely mirrors
scx_rustland, but it has been re-implemented in BPF with some small
adjustments.

Tasks are categorized as either interactive or regular based on their
average rate of voluntary context switches per second: tasks that exceed
a specific voluntary context switch threshold are classified as
interactive.

Interactive tasks are prioritized in a higher-priority DSQ, while
regular tasks are placed in a lower-priority DSQ. Within each queue,
tasks are sorted based on their weighted runtime, using the built-in scx
vtime ordering capabilities (scx_bpf_dispatch_vtime()).

Moreover, each task gets a time slice budget. When a task is dispatched,
it receives a time slice equivalent to the remaining unused portion of
its previously allocated time slice (with a minimum threshold applied).

This gives latency-sensitive workloads more chances to exceed their time
slice when needed to perform short bursts of CPU activity without being
interrupted (i.e., real-time audio encoding / decoding workloads).

Results
=======

According to the initial test results, using the same benchmark "playing
a videogame while recompiling the kernel", this scheduler seems to
provide a +5% improvement in the frames-per-second (fps) compared to
scx_rustland, with video games such as Cyberpunk 2077, Counter-Strike 2
and Baldur's Gate 3.

Initial test results indicate that this scheduler offers around a +5%
improvement in frames-per-second (fps) compared to scx_rustland when
using the benchmark "playing a video game while recompiling the kernel".

This improvement was observed in games such as Cyberpunk 2077,
Counter-Strike 2, and Baldur's Gate 3.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-27 17:28:42 +02:00
Changwoo Min
f86d564d89 scx_lavd: fast path for ops.dispatch() when fully loaded
When fully loaded so all CPUs are using, skip checking the cpumask.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-27 18:00:39 +09:00
David Vernet
fe3ce64a9b
Revert "scx_rusty: Refactor ridx assignment in populate_tasks_by_load" 2024-06-26 17:35:22 -04:00
Changwoo Min
abc6e31fef scx_lavd: for a forked task, inherit its parent's statistics
The old approach was too conservative in running a new task, so when a
fork-heavy workload competes with a CPU-bound workload, the fork-heavy
one is starved. The new approach solves the starvation problem by
inheriting parent's statistics. It seems a good (at least better than
old) guess how a new task will behave.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-26 19:00:10 +09:00
Changwoo Min
ac9c49f5b5 scx_lavd: loosen the deadline when overloaded
When the system is highly loaded with compute-intensive tasks, the old
setting chokes latensive-intensive tasks, so loosen the dealine when the
system is overloaded (> 100% utilization).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-26 15:06:31 +09:00
Changwoo Min
b32734168b scx_lavd: print build ID when lavd is loaded
When the lavd is loaded, it prints out its build id. It helps to easily
identify what version it is when testing.

```
01:56:54 [INFO] scx_lavd scheduler is initialized (build ID: 0.8.1-g98a5fa8595430414115c504857cea1a458393838-dirty x86_64-unknown-linux-gnu)
```

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-26 10:57:19 +09:00