Commit Graph

801 Commits

Author SHA1 Message Date
Peter Jung
906d054770
scx_lavd: Drop message about unsupported multi-CCX support
Signed-off-by: Peter Jung <admin@ptr1337.dev>
2024-08-25 01:10:38 +02:00
Andrea Righi
0aa23481de scx_rustland_core: drop update_tasks() and introduce notify_complete()
The update_tasks() API is somewhat confusing, so replace it with a
clearer API, notify_complete().

This new API will return control to the BPF component and inform it
about the number of tasks still pending in the user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-25 00:45:23 +02:00
Daniel Hodges
e81faef103
Merge pull request #544 from hodgesds/layered-tgid
scx_layered: Add layer match for tgid
2024-08-24 16:58:19 -04:00
Andrea Righi
5ece102554 scx_rustland: get rid of unnecessary debugging information
Additional statistics will be re-added later via scx_stats.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 21:29:10 +02:00
Andrea Righi
cef8ff8757 scx_rustland_core: get rid of the low_power API
The low-power API is a bit of a hack implemented purely in the BPF
layer; it would be better to re-implement it with some notion of
topology awareness.

Therefore, get rid of this API for now.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 21:29:10 +02:00
Andrea Righi
be7ef1009b scx_rlfifo: user-space idle CPU selection
Select an idle CPU from user-space, instead of always dispatching on the
first CPU available.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 21:29:10 +02:00
Andrea Righi
568e292a24 scx_rustland_core: get rid of the exiting task API
The current API used to notify the user-space scheduler when a task
exits is really confusing (setting a negative value in
queued_task_ctx.cpu), and it's also possible to detect task exiting
events from user-space (or check in procfs, even if it's slower).

In any case, a better API should be provided for this, so drop the
current one for now.

NOTE: this will cause additional memory usage for scx_rustland, but it
can be fixed/addressed later in a separate commit (i.e., providing a
periodic garbage collector for the unused task entries).

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 21:29:10 +02:00
Andrea Righi
5d544ea264 scx_rustland_core: move CPU idle selection logic in user-space
Allow user-space scheduler to pick an idle CPU via
self.bpf.select_cpu(pid, prev_task, flags), mimicking the BPF's
select_cpu() interface.

Also remove the full_user option and always rely on the idle selection
logic from user-space.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 21:28:13 +02:00
Andrea Righi
1dd329dd7d scx_rustland: update Cargo.lock
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 20:24:48 +02:00
Andrea Righi
106d59d997 scx_rlfifo: update Cargo.lock
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 20:24:48 +02:00
Andrea Righi
016aae759f
Merge pull request #545 from sched-ext/bpfland-honor-avg-nvcsw
scx_bpfland: always honor average nvcsw in lowlatency mode
2024-08-24 20:24:33 +02:00
Avraham Hollander
66b5dd0de9 Clean up scx_rusty help info a bit 2024-08-24 11:56:12 -04:00
Avraham Hollander
c34a470024 scx_lavd: Fix my own formatting error 2024-08-24 11:36:19 -04:00
Andrea Righi
5a08855a86 scx_bpfland: always honor average nvcsw in lowlatency mode
Keep evaluating the average number of voluntary context switches for
each task when lowlatency mode is enabled, even when interactive tasks
classification is disabled (via `-c 0`).

The average nvcsw is also used in lowlatency mode to evaluate the
proportional bonus to the tasks' deadline and it shouldn't be ignored
when interactive tasks classification is disabled. Moreover, make sure
that such bonus never exceeds the starvation threshold.

Keep in mind that it is still possible to disable the periodic average
nvcsw evaluation with `-c 0`, without specifying `--lowlatency`.

Fixes: 6a22853 ("scx_bpfland: introduce --lowlatency option")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 10:42:22 +02:00
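The gating and the cap described in commit 5a08855a86 above can be sketched as two small helpers. This is only an illustration under stated assumptions: the function names, the `nvcsw_thresh` parameter (standing in for the value passed via `-c`) and the nanosecond units are hypothetical and are not taken from the scx_bpfland sources.

```rust
/// Hypothetical helper: the periodic average-nvcsw evaluation keeps running
/// when low-latency mode is on, or when classification is enabled
/// (threshold > 0). It is skipped only for `-c 0` without `--lowlatency`.
fn should_track_nvcsw(lowlatency: bool, nvcsw_thresh: u64) -> bool {
    lowlatency || nvcsw_thresh > 0
}

/// Hypothetical helper: compute a bonus proportional to the average number
/// of voluntary context switches, never exceeding the starvation threshold.
fn capped_bonus(avg_nvcsw: u64, bonus_per_switch_ns: u64, starvation_thresh_ns: u64) -> u64 {
    avg_nvcsw
        .saturating_mul(bonus_per_switch_ns)
        .min(starvation_thresh_ns)
}
```

With these definitions, `should_track_nvcsw(true, 0)` is true while `should_track_nvcsw(false, 0)` is false, which matches the behavior the commit message describes.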
Tejun Heo
48092c6f88 scx_lavd: Relay introspection output in stats::TaskSample
This indirection doesn't make any visible behavior difference now but will
be used to implement scx_stats support.
2024-08-23 18:49:36 -10:00
Tejun Heo
725fa7f1be Merge branch 'main' into htejun/scx_stats 2024-08-23 17:10:08 -10:00
Daniel Hodges
5a2012763e
scx_layered: Add layer match for tgid
Add layer match for tgid.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-23 23:00:28 -04:00
Avraham Hollander
bedb18b48e Improve scx_lavd help info
A lot of scx_lavd's options do not clearly explain what they do. Add
some short explanations, clean up the existing ones, and direct the user
to read the in-code documentation for more info.
2024-08-23 18:56:14 -04:00
Avraham Hollander
d6e27b59e7 Clean up scx_bpfland help info a bit 2024-08-23 18:55:04 -04:00
Tejun Heo
25e437753c scx_layered, scx_rusty: Implement --help-stats
which shows all the defined stats. While at it, make some cosmetic updates.
2024-08-23 12:39:47 -10:00
Tejun Heo
405bcc63fe scx_stats: Make ScxStatsServerData a public carrier of data needed for stats server
And move related ops into it. This is a bit more natural and will also allow
doing other operations (e.g. describing stats) without launching the server.
2024-08-23 12:23:57 -10:00
Tejun Heo
7bd35b6cd3 scx_lavd: Cargo.lock update (caused by scx_utils depending on scx_stats) 2024-08-23 09:21:44 -10:00
Andrea Righi
e72676ede3
Merge pull request #540 from sched-ext/bpfland-cpufreq-awareness
scx_bpfland: cpu frequency and energy awareness
2024-08-23 21:17:34 +02:00
Tejun Heo
9e3b4e6db0 scx_stats: A bit of cleanups and renames 2024-08-23 09:09:02 -10:00
Tejun Heo
b6ccb87bec
Merge pull request #539 from sched-ext/htejun/scx_rusty
scx_rusty: Convert to scx_stats
2024-08-23 08:42:47 -10:00
Daniel Hodges
7d45059fa9
Merge pull request #538 from hodgesds/layered-pid
scx_layered: Add pid/ppid matches
2024-08-23 14:08:40 -04:00
Tejun Heo
8c8912ccea Merge branch 'main' into htejun/scx_rusty 2024-08-23 07:50:23 -10:00
Andrea Righi
50684e4569 scx_bpfland: introduce Intel Turbo Boost awareness
Make `--primary-domain auto` aware of turbo boosted CPUs and prioritize
them over the primary scheduling domain when the energy model
`balance_power` is used (typically when running on battery power with
the "balanced" profile).

With this change the scheduling hierarchy becomes the following:

 1) CPUs in the turbo scheduling domain
 2) CPUs in the primary scheduling domain
 3) full-idle SMT CPUs
 4) CPUs in the same L2 cache
 5) CPUs in the same L3 cache
 6) CPUs in the task's allowed domain

And the idle selection logic is modified as follows:

 - In the turbo scheduling domain:
   - pick same full-idle SMT CPU
   - pick any other full-idle SMT CPU sharing the same L2 cache
   - pick any other full-idle SMT CPU sharing the same L3 cache
   - pick any other full-idle SMT CPU
   - pick same idle CPU
   - pick any other idle CPU sharing the same L2 cache
   - pick any other idle CPU sharing the same L3 cache
   - pick any other idle SMT CPU
 - In the primary scheduling domain:
   - pick same full-idle SMT CPU
   - pick any other full-idle SMT CPU sharing the same L2 cache
   - pick any other full-idle SMT CPU sharing the same L3 cache
   - pick any other full-idle SMT CPU
   - pick same idle CPU
   - pick any other idle CPU sharing the same L2 cache
   - pick any other idle CPU sharing the same L3 cache
   - pick any other idle SMT CPU
 - In the entire task domain:
   - pick any other idle CPU

Keep in mind that the turbo domain will be evaluated only when the
scheduler is started with `--primary-domain auto` and only when the
`balance_power` energy profile is used.

The turbo domain is always made using the subset of CPUs in the system
with the highest max frequency. If such subset can't be determined (for
example if all the CPUs in the primary domain have the same
frequency), the turbo domain will be ignored.

Prioritizing turbo boosted CPUs can help to improve performance by
forcing the governor to scale up their frequency, without increasing too
much power consumption, due to the fact that tasks will be preferably
confined into a reduced amount of cores.

This change seems to improve performance, without increasing much
power consumption, on Intel laptops while using the `balance_power`
energy profile.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-23 19:49:08 +02:00
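The turbo domain rule from commit 50684e4569 above (the subset of CPUs with the highest max frequency, ignored when no such subset can be determined) might look roughly like the sketch below. The function name and the `(cpu_id, max_freq_khz)` input shape are assumptions made for illustration; the real scheduler gathers this information differently.

```rust
/// Hypothetical sketch: return the CPUs whose max frequency equals the
/// highest max frequency in the set, or None when every CPU has the same
/// max frequency (or the set is empty), meaning no turbo subset exists.
fn turbo_domain(cpus: &[(usize, u64)]) -> Option<Vec<usize>> {
    let highest = cpus.iter().map(|&(_, freq)| freq).max()?;
    let lowest = cpus.iter().map(|&(_, freq)| freq).min()?;
    if highest == lowest {
        // All frequencies are identical: the turbo domain is ignored.
        return None;
    }
    let mut turbo = Vec::new();
    for &(cpu, freq) in cpus {
        if freq == highest {
            turbo.push(cpu);
        }
    }
    Some(turbo)
}
```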
Andrea Righi
d958dd4482 scx_bpfland: introduce dynamic energy profile
Introduce the new option `--primary-domain auto`. With this option the
scheduler will dynamically adjust the primary scheduling domain at
run-time, based on the current energy profile reported in
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference.

When the `power` energy profile is selected, the primary scheduling
domain will prioritize E-cores. Alternatively, when the `performance`
profile is selected, it will prioritize P-cores. For all the other
energy profiles, all the CPUs in the system will be used.

Note that this option is only relevant on hybrid architectures with
P-cores and E-cores.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-23 19:49:01 +02:00
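The profile-to-domain mapping described in commit d958dd4482 above can be sketched as a simple match. The `PrimaryDomain` enum and both function names are hypothetical; only the sysfs path and the `power` / `performance` profile names come from the commit message.

```rust
use std::fs;

/// Path quoted in the commit message.
const EPP_PATH: &str =
    "/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference";

/// Hypothetical representation of the primary scheduling domain choice.
#[derive(Debug, PartialEq)]
enum PrimaryDomain {
    ECores,
    PCores,
    AllCpus,
}

/// Map an energy profile string to a primary domain:
/// `power` prefers E-cores, `performance` prefers P-cores,
/// every other profile uses all the CPUs in the system.
fn domain_for_profile(profile: &str) -> PrimaryDomain {
    match profile.trim() {
        "power" => PrimaryDomain::ECores,
        "performance" => PrimaryDomain::PCores,
        _ => PrimaryDomain::AllCpus,
    }
}

/// Read the current profile from sysfs; fall back to all CPUs when the
/// file is missing or unreadable (an assumption of this sketch).
fn current_domain() -> PrimaryDomain {
    fs::read_to_string(EPP_PATH)
        .map(|s| domain_for_profile(&s))
        .unwrap_or(PrimaryDomain::AllCpus)
}
```

Calling `current_domain()` periodically would be one way to follow profile changes at run-time; how often the real scheduler re-reads the file is not stated in the commit message.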
Tejun Heo
44a0f1b124 scx_utils: Factor out monitor_stats() from scx_rusty and scx_layered 2024-08-23 06:46:19 -10:00
Tejun Heo
ae3024e938 scx_layered: Add --stats and make --monitor behavior consistent with scx_rusty 2024-08-23 05:52:52 -10:00
Tejun Heo
0f04a93dd1 scx_rusty: Add stat descriptions and make minor adjustments 2024-08-23 05:46:13 -10:00
Tejun Heo
36865234f8 scx_rusty: Add scx_stats annotations necessary for openmetrics translation 2024-08-23 04:59:08 -10:00
Tejun Heo
2f3f473cd3 scx_rusty: Improve timestamp reporting 2024-08-23 04:31:27 -10:00
Daniel Hodges
11b978a892 scx_layered: Add pid/ppid matches
Add matches for pid/ppid.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-23 07:20:05 -07:00
Tejun Heo
76934f3aab scx_rusty: Convert to scx_stats
This allows scx_rusty to avoid generating excessive logs for statistics
while still allowing detailed monitoring on demand.
2024-08-22 19:44:12 -10:00
Tejun Heo
16c07a5cd9 scx_rusty: Don't reset bpf_stats, remember prev states and calculate delta
This will ease transition to scx_stats.
2024-08-22 13:02:23 -10:00
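The approach named in commit 16c07a5cd9 above (keep cumulative counters, remember the previous snapshot, report the difference) can be illustrated as follows. The `Counters` fields and the `DeltaTracker` type are invented for this sketch and do not mirror the actual scx_rusty structures.

```rust
/// Hypothetical cumulative counters, never reset at the source.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct Counters {
    dispatched: u64,
    migrated: u64,
}

/// Remembers the previous snapshot so each update yields a delta.
struct DeltaTracker {
    prev: Counters,
}

impl DeltaTracker {
    fn new() -> Self {
        Self { prev: Counters::default() }
    }

    /// Return the change since the last call and store the new snapshot.
    /// wrapping_sub keeps the delta meaningful if a counter wraps around.
    fn update(&mut self, cur: Counters) -> Counters {
        let delta = Counters {
            dispatched: cur.dispatched.wrapping_sub(self.prev.dispatched),
            migrated: cur.migrated.wrapping_sub(self.prev.migrated),
        };
        self.prev = cur;
        delta
    }
}
```

Because the source counters are never cleared, several independent readers can each keep their own `DeltaTracker` without interfering with one another.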
Tejun Heo
13fa48a871 scx_rusty: Separate out stats generation and formatting
to prepare for scx_stats conversion.
2024-08-22 10:03:10 -10:00
Tejun Heo
b4564520e5 scx_rusty: Simplify Stats structs and take id out of the structs
to prepare for scx_stats conversion. While at it, make some cosmetic
changes.
2024-08-22 08:45:33 -10:00
Andrea Righi
6a2285398d scx_bpfland: introduce --lowlatency option
Introduce the new `--lowlatency` option, which enables switching between
the default pure vruntime-based scheduling (more optimized for server
workloads) and a deadline-based scheduling (better suited for
low-latency workloads).

When the low-latency mode is activated, a task's deadline is calculated
as its vruntime, adjusted by a bonus proportional to the task's average
number of voluntary context switches (the more voluntary context
switches, the shorter the deadline).

This feature enhances the prioritization of interactive tasks even more,
proportionally to their average voluntary context switches, also within
the two main global queues (priority / shared) and it helps to maintain
interactive workloads always responsive, even in presence of heavy
non-interactive background work.

Low-latency mode helps prevent audio crackling even in presence of a
large amount of short-lived tasks with pseudo-interactive behavior (e.g.,
hackbench) and it enables achieving approximately a +33% average
frames-per-second (FPS) in the typical "gaming while building the
kernel" benchmark.

However, it can also amplify the de-prioritization of CPU-intensive
tasks, making this option more suitable for specific low-latency
scenarios. Therefore the low-latency mode is disabled by default and it
can only be enabled via the `--lowlatency` option.

Tested-by: Piotr Gorski (piotrgorski@cachyos.org)
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-22 13:26:19 +02:00
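The deadline rule from commit 6a2285398d above (vruntime adjusted by a bonus proportional to the average number of voluntary context switches) can be sketched like this. The exact formula is not given in the commit message, so the subtraction, the `bonus_per_switch_ns` scaling factor and the function name are assumptions.

```rust
/// Hypothetical sketch: the more voluntary context switches a task
/// averages, the larger the bonus and the earlier its deadline.
/// Saturating arithmetic keeps the result within u64 bounds.
fn task_deadline(vruntime: u64, avg_nvcsw: u64, bonus_per_switch_ns: u64) -> u64 {
    let bonus = avg_nvcsw.saturating_mul(bonus_per_switch_ns);
    vruntime.saturating_sub(bonus)
}
```

In this sketch two tasks with the same vruntime are ordered by their average nvcsw, so the more interactive task is dispatched first.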
Tejun Heo
4834dec684 scx_rusty: Move stats structs to stats.rs and rename for consistency 2024-08-21 22:04:38 -10:00
Andrea Righi
b0a8e4a91e scx_bpfland: better time slice control
Explicitly replenish the task's time slice from ops.dispatch() if the
task still wants to run and no other task is selected. In this way the
sched_ext core won't automatically re-schedule the task on the same CPU,
implicitly assigning a time slice of SCX_SLICE_DFL.

Moreover, instead of determining the task time slice in ops.enqueue(),
refresh the time slice immediately before the task is started on its
assigned CPU in ops.running().

This allows using a more precise time slice, adjusted based on the
actual amount of tasks that are currently waiting to be scheduled.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-22 09:23:37 +02:00
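One way to picture the time slice adjustment from commit b0a8e4a91e above is to scale a maximum slice by the number of waiting tasks. The commit message does not state the formula, so the division, the lower bound and all parameter names below are assumptions of this sketch.

```rust
/// Hypothetical sketch: shrink the slice as more tasks wait to be
/// scheduled, but never go below a minimum slice.
fn task_slice_ns(slice_max_ns: u64, slice_min_ns: u64, nr_waiting: u64) -> u64 {
    (slice_max_ns / nr_waiting.saturating_add(1)).max(slice_min_ns)
}
```

Evaluating this right before the task starts running (rather than at enqueue time) means `nr_waiting` reflects the queue as it is at that moment.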
Tejun Heo
d6ac5fbd9c scx_layered: Drop SCX_OPS_ENQ_LAST
The meaning of SCX_OPS_ENQ_LAST will change with future kernel updates and
enqueueing on local DSQ will no longer be sufficient to avoid stalls. No
reason to do it anyway. Just drop it.
2024-08-21 13:13:59 -10:00
Tejun Heo
f726f0b73b Version: Cargo.lock 2024-08-21 06:45:19 -10:00
Tejun Heo
4d1f0639d8 Version: v1.0.3 2024-08-21 06:42:11 -10:00
Andrea Righi
fedfee0bd6 scx_bpfland: drop unused variable
With the global scx_utils::NR_CPU_IDS we don't need Topology anymore in
init_primary_domain(), so drop the variable to fix the following build
warning:

warning: unused variable: `topo`
   --> src/main.rs:385:9
    |
385 |         topo: &Topology,
    |         ^^^^ help: if this is intentional, prefix it with an underscore: `_topo`
    |
    = note: `#[warn(unused_variables)]` on by default

Fixes: 1da249f ("scx_utils::topology: Always use NR_CPU_IDS and NR_CPUS_POSSIBLE")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-21 17:46:12 +02:00
Andrea Righi
9f7a11bba6
Merge pull request #528 from sched-ext/bpfland-turbo-boost
scx_bpfland: properly classify Intel Turbo Boost CPUs
2024-08-21 17:40:25 +02:00
Daniel Hodges
f2a6661a85
Merge pull request #524 from hodgesds/layered-core-fixes
scx_layered: Fix core selection
2024-08-21 08:13:33 -04:00
Tejun Heo
9c62019c81
Merge pull request #527 from sched-ext/htejun/scx_utils
scx_utils::cpumask,topology: Misc updates
2024-08-20 22:25:25 -10:00
Andrea Righi
695e3b25b0 scx_bpfland: classify CPUs depending on their base frequency
Use the base frequency, instead of maximum frequency, to classify fast
and slow CPUs. This ensures accurate distinction between Intel Turbo
Boost CPUs and genuinely faster CPUs when auto-detecting the primary
scheduling domain.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-21 10:16:41 +02:00
Andrea Righi
e0fb99835d
Merge pull request #525 from sched-ext/bpfland-disable-interactive
scx_bpfland: allow to completely disable interactive classification
2024-08-21 10:02:43 +02:00
Tejun Heo
5cf4212330 Revert "rusty: Integrate stats with the metrics framework"
This reverts commit 83373b1f4e in preparation
for converting to scx_stats.
2024-08-20 21:59:25 -10:00
Tejun Heo
516a7590db scx_rusty: Revert log_recorder conversion
scx_rusty will be converted to scx_stats in a similar fashion to
scx_layered. Undo log_recorder conversion in preparation.
2024-08-20 21:59:20 -10:00
Tejun Heo
1da249f063 scx_utils::topology: Always use NR_CPU_IDS and NR_CPUS_POSSIBLE
Always use the LazyLock versions and drop the counterparts from Topology.
2024-08-20 21:57:56 -10:00
Tejun Heo
092f5422d6
Merge pull request #518 from sched-ext/htejun/misc
scx_layered: Add `--run-example` and enable CI testing
2024-08-20 21:42:45 -10:00
Tejun Heo
f7c193e528 scx_utils, scx_rusty: Minor updates to version handling
- Update scx_utils/build.rs so that 12 char SHA1 is generated instead of
  full one.

- Add --version to scx_rusty. Use custom one as we don't want to use the
  default cargo version one.
2024-08-20 21:03:05 -10:00
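The 12-character SHA1 mentioned in commit f7c193e528 above amounts to truncating the full hash. The helper below is a hypothetical illustration and is not the code in scx_utils/build.rs; the hash used in the usage note is a made-up placeholder.

```rust
/// Hypothetical helper: return the first 12 characters of a hash string,
/// or the whole string when it is shorter than that.
fn short_sha(full: &str) -> &str {
    full.get(..12).unwrap_or(full)
}
```

For example, `short_sha("0123456789abcdef0123456789abcdef01234567")` yields `"0123456789ab"`.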
Tejun Heo
8f786be08f scx_rusty: cargo fmt 2024-08-20 21:03:05 -10:00
Tejun Heo
4440567949 scx_rusty: Update Cargo.lock 2024-08-20 21:03:05 -10:00
Andrea Righi
c85315d527 scx_bpfland: allow to completely disable interactive classification
Tasks enqueued with SCX_ENQ_WAKEUP are immediately classified as
interactive. However, if interactive tasks classification is disabled
(via `-c 0`), we should avoid promoting them as interactive.

This is particularly important because, with the nvcsw logic disabled,
tasks can remain classified as interactive indefinitely and they will
never be demoted to regular tasks.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-21 08:45:13 +02:00
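The check added by commit c85315d527 above can be sketched as a single predicate. The function name and the `classification_thresh` parameter (standing in for the value passed via `-c`) are hypothetical.

```rust
/// Hypothetical sketch: a wakeup promotes a task to interactive only when
/// interactive classification is enabled (threshold > 0). With `-c 0`
/// a wakeup no longer changes the task's classification.
fn is_interactive_after_enqueue(
    is_wakeup: bool,
    classification_thresh: u64,
    was_interactive: bool,
) -> bool {
    was_interactive || (is_wakeup && classification_thresh > 0)
}
```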
Andrea Righi
a9f5aaa536 scx_bpfland: replace custom CpuMask with scx_utils::Cpumask
Rely on scx_utils::Cpumask instead of re-implementing a custom struct to
parse and manage CPU masks.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-21 07:21:52 +02:00
Daniel Hodges
4d1c932619 scx_layered: Fix core selection
Fix a bug introduced in #510 where it assumed core ids are incremental.
This refactors the core ordering for layers to be much simpler and
provide some space for layer core isolation in low utilization.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-20 19:26:53 -07:00
Andrea Righi
33b6ada98e
Merge pull request #509 from sched-ext/bpfland-topology
scx_bpfland: topology awareness
2024-08-20 14:37:23 +02:00
Andrea Righi
467d4b5ea4 scx_bpfland: get topology information from scx_utils::Topology
Rely on scx_utils::Topology to get CPU and cache information, instead of
re-implementing custom methods.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-20 10:16:02 +02:00
Tejun Heo
c0418250f4 scx_layered: Add --run-example option
So that scx_layered can be run in CI environment in a single command.
2024-08-19 20:50:10 -10:00
Changwoo Min
41bc6f0967
Merge pull request #511 from multics69/lavd-perf-profile
scx_lavd: add power profile options: --performance, --balanced, --powersave
2024-08-20 09:02:37 +09:00
Changwoo Min
1d61dd4c1d
Merge pull request #508 from multics69/lavd-numa-fix
scx_lavd: fix a potential watchdog timeout error at multi-NUMA/CCX platforms
2024-08-20 09:02:23 +09:00
Changwoo Min
2c4c2a0ccf
Merge pull request #507 from multics69/lavd-pretty-rust
scx_lavd: revise FlatTopology prettier
2024-08-20 09:01:26 +09:00
Daniel Hodges
05a2721f8e
Merge pull request #510 from hodgesds/layered-core-topo-selection
scx_layered: Use topology for core selection
2024-08-19 20:01:16 -04:00
Tejun Heo
d01b49bd0e scx_layered: Fix verification failure
4fccc06905 ("scx_layered: Fix uninitialized variable") causes the
following verification failure. Fix it by moving assignments below range
checking.

  Validating match_layer() func#1...
  283: R1=scalar() R2=scalar() R3=mem_or_null(id=49,sz=1) R10=fp0
  ; int match_layer(u32 layer_id, pid_t pid, const char *cgrp_path) @ main.bpf.c:1029
  283: (7b) *(u64 *)(r10 -24) = r3      ; R3=mem_or_null(id=49,sz=1) R10=fp0 fp-24_w=mem_or_null(id=49,sz=1)
  284: (bc) w7 = w1                     ; R1=scalar() R7_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  ; struct layer *layer = &layers[layer_id]; @ main.bpf.c:1033
  285: (bc) w1 = w7                     ; R1_w=scalar(id=50,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R7_w=scalar(id=50,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  286: (27) r1 *= 1061192               ; R1_w=scalar(smin=0,smax=umax=0x103147ffefceb8,smax32=0x7ffffff8,umax32=0xfffffff8,var_off=(0x0; 0x1ffffffffffff8))
  287: (18) r8 = 0xffffc90002a26000     ; R8_w=map_value(map=bpf_bpf.bss,ks=4,vs=16979080)
  289: (0f) r8 += r1                    ; R1_w=scalar(smin=0,smax=umax=0x103147ffefceb8,smax32=0x7ffffff8,umax32=0xfffffff8,var_off=(0x0; 0x1ffffffffffff8)) R8_w=map_value(map=bpf_bpf.bss,ks=4,vs=16979080,smin=0,smax=umax=0x103147ffefceb8,smax32=0x7ffffff8,umax32=0xfffffff8,var_off=(0x0; 0x1ffffffffffff8))
  ; u32 nr_match_ors = layer->nr_match_ors; @ main.bpf.c:1034
  290: (bf) r1 = r8                     ; R1_w=map_value(map=bpf_bpf.bss,ks=4,vs=16979080,smin=0,smax=umax=0x103147ffefceb8,smax32=0x7ffffff8,umax32=0xfffffff8,var_off=(0x0; 0x1ffffffffffff8)) R8_w=map_value(map=bpf_bpf.bss,ks=4,vs=16979080,smin=0,smax=umax=0x103147ffefceb8,smax32=0x7ffffff8,umax32=0xfffffff8,var_off=(0x0; 0x1ffffffffffff8))
  291: (07) r1 += 1060992               ; R1_w=map_value(map=bpf_bpf.bss,ks=4,vs=16979080,off=0x103080,smin=0,smax=umax=0x103147ffefceb8,smax32=0x7ffffff8,umax32=0xfffffff8,var_off=(0x0; 0x1ffffffffffff8))
  292: (61) r1 = *(u32 *)(r1 +0)
  R1 unbounded memory access, make sure to bounds check any such access
  processed 1099 insns (limit 1000000) max_states_per_insn 2 total_states 72 peak_states 72 mark_read 9
  -- END PROG LOAD LOG --
2024-08-19 13:18:20 -10:00
Daniel Hodges
b3793e0069 scx_layered: Use topology for core selection
Currently the core selection logic in scx_layered uses the first
available core in the bitmask. This is suboptimal when the scheduler is
configured with specific NUMA/LLC restrictions. The ideal core selection
logic should try to find the least used cores within the preferred
scheduling domain and allocate new cpus from shared cores within that
domain.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-19 15:51:35 -07:00
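The idea in commit b3793e0069 above (prefer the least used cores inside the preferred scheduling domain instead of the first available core) can be sketched as below. The `Core` struct, its fields and the use of an LLC id as the domain are assumptions; the sketch also leaves out the allocation of new CPUs from shared cores.

```rust
/// Hypothetical per-core bookkeeping.
struct Core {
    id: usize,
    llc: usize,
    used_cpus: usize,
}

/// Pick the least used core within the preferred LLC domain.
/// Ties go to the first such core in the slice; None means the
/// domain has no cores at all.
fn pick_core(cores: &[Core], preferred_llc: usize) -> Option<usize> {
    cores
        .iter()
        .filter(|core| core.llc == preferred_llc)
        .min_by_key(|core| core.used_cpus)
        .map(|core| core.id)
}
```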
Tejun Heo
3498a2b899
Merge pull request #514 from sched-ext/htejun/scx_stats
scx_stats, scx_layered: Implement independent stats client sessions
2024-08-19 11:24:53 -10:00
Tejun Heo
f6bc52d31e scx_layered: Make --monitor behavior more useful
- If --monitor is specified with layer specs, the scheduler also starts
  stats monitoring on a thread.

- Standalone monitoring mode no longer exits when the scheduler isn't there.
2024-08-19 10:55:02 -10:00
Tejun Heo
d03e48eb75 scx_layered: Implement per-stats-client nr_layer_cpus_ranges tracking
With this, every client sees the correct nr_layer_cpus_ranges without
interfering with each other.
2024-08-19 09:12:51 -10:00
Tejun Heo
448aacfd60 scx_layered: Initialize Stats.prev_layer_cycles properly on new()
So that new stats session doesn't start with an inflated utilization number.
2024-08-19 08:40:40 -10:00
Tejun Heo
25d7e6f787 scx_layered: Implement on-demand statistics generation
Instead of keeping one copy of sched_stats, each stats server session
carries their own so that stats can be generated independently by each
client at any interval. CPU allocation min/max tracking is broken for now.
2024-08-19 08:27:36 -10:00
Andrea Righi
f8a2445869 scx_bpfland: introduce performance/powersave primary domain
The primary scheduling domain represents a group of CPUs in the system
where the scheduler will initially attempt to assign tasks. Tasks will
only be dispatched to CPUs within this primary domain until they are
fully utilized, after which tasks may overflow to other available CPUs.

The primary scheduling domain can be defined using the option
`--primary-domain CPUMASK` (by default all the CPUs in the system are
used as primary domain).

This change introduces two new special values for the CPUMASK argument:
 - `performance`: automatically detect the fastest CPUs in the system
   and use them as primary scheduling domain,
 - `powersave`: automatically detect the slowest CPUs in the system and
   use them as primary scheduling domain.

The current logic only supports creating two groups: fast and slow CPUs.

The fast CPU group is created by excluding CPUs with the lowest
frequency from the overall set, which means that within the fast CPU
group, CPUs may have different maximum frequencies.

When using the `performance` mode the fast CPUs will be used as primary
domain, whereas in `powersave` mode, the slow CPUs will be used instead.

This option is particularly useful in hybrid architectures (with P-cores
and E-cores), as it allows the use of bpfland to prioritize task
scheduling on either P-cores or E-cores, depending on the desired
performance profile.

Example:

 - Dell Precision 5480
   - CPU: 13th Gen Intel(R) Core(TM) i7-13800H
     - P-cores:  0-11 / max freq: 5.2GHz
     - E-cores: 12-19 / max freq: 4.0GHz

 $ scx_bpfland --primary-domain performance

  0[|||||||||                24.5%]  10[||||||||                  22.8%]
  1[||||||                   14.9%]  11[|||||||||||||             36.9%]
  2[||||||                   16.2%]  12[                           0.0%]
  3[|||||||||                25.3%]  13[                           0.0%]
  4[|||||||||||              33.3%]  14[                           0.0%]
  5[||||                      9.9%]  15[                           0.0%]
  6[|||||||||||              31.5%]  16[                           0.0%]
  7[|||||||                  17.4%]  17[                           0.0%]
  8[||||||||                 23.4%]  18[                           0.0%]
  9[|||||||||                26.1%]  19[                           0.0%]

  Avg power consumption: 3.29W

 $ scx_bpfland --primary-domain powersave

  0[|                         2.5%]  10[                           0.0%]
  1[                          0.0%]  11[                           0.0%]
  2[                          0.0%]  12[||||                       8.0%]
  3[                          0.0%]  13[|||||||||||||||||||||     64.2%]
  4[                          0.0%]  14[||||||||||                29.6%]
  5[                          0.0%]  15[|||||||||||||||||         52.5%]
  6[                          0.0%]  16[|||||||||                 24.7%]
  7[                          0.0%]  17[||||||||||                30.4%]
  8[                          0.0%]  18[|||||||                   22.4%]
  9[                          0.0%]  19[|||||                     12.4%]

  Avg power consumption: 2.17W

(Info collected from htop and turbostat)

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-19 20:19:21 +02:00
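The two-group split described in commit f8a2445869 above (the fast group is everything except the CPUs with the lowest frequency) can be sketched as follows. The function name and the `(cpu_id, max_freq_khz)` input are assumptions made for illustration.

```rust
/// Hypothetical sketch: split CPUs into (fast, slow) groups.
/// The slow group holds the CPUs with the lowest max frequency; the fast
/// group holds all the others, so it may mix different max frequencies.
/// If every CPU has the same frequency, all of them land in the slow group.
fn split_by_frequency(cpus: &[(usize, u64)]) -> (Vec<usize>, Vec<usize>) {
    let mut fast = Vec::new();
    let mut slow = Vec::new();
    let lowest = match cpus.iter().map(|&(_, freq)| freq).min() {
        Some(freq) => freq,
        None => return (fast, slow),
    };
    for &(cpu, freq) in cpus {
        if freq == lowest {
            slow.push(cpu);
        } else {
            fast.push(cpu);
        }
    }
    (fast, slow)
}
```

Applied to the i7-13800H layout quoted in the commit (CPUs 0-11 at 5.2GHz, CPUs 12-19 at 4.0GHz), the fast group is 0-11 and the slow group is 12-19, which is consistent with the htop output shown for `performance` and `powersave`.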
Andrea Righi
174993f9d2 scx_bpfland: introduce cache awareness
While the system is not saturated the scheduler will use the following
strategy to select the next CPU for a task:
  - pick the same CPU if it's a full-idle SMT core
  - pick any full-idle SMT core in the primary scheduling group that
    shares the same L2 cache
  - pick any full-idle SMT core in the primary scheduling group that
    shares the same L3 cache
  - pick the same CPU (ignoring SMT)
  - pick any idle CPU in the primary scheduling group that shares the
    same L2 cache
  - pick any idle CPU in the primary scheduling group that shares the
    same L3 cache
  - pick any idle CPU in the system

While the system is completely saturated (no idle CPUs available), tasks
will be dispatched on the first CPU that becomes available.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-19 20:19:21 +02:00
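The ordered fallback from commit 174993f9d2 above can be modeled by ranking each idle CPU by the first step it satisfies and taking the lowest rank. This is a user-space illustration only: the `Cpu` struct, its fields and both function names are hypothetical, and the real logic lives in the BPF component.

```rust
/// Hypothetical CPU description used only by this sketch.
struct Cpu {
    id: usize,
    core: usize,
    l2: usize,
    l3: usize,
    idle: bool,
    in_primary: bool,
}

/// Rank an idle CPU by the first selection step it satisfies
/// (0 = best). Busy CPUs get no rank.
fn step_rank(c: &Cpu, prev: &Cpu, full_idle: bool) -> Option<u8> {
    if !c.idle {
        return None;
    }
    let same = c.id == prev.id;
    let rank = if same && full_idle {
        0 // same CPU, full-idle SMT core
    } else if full_idle && c.in_primary && c.l2 == prev.l2 {
        1 // full-idle core, primary group, same L2
    } else if full_idle && c.in_primary && c.l3 == prev.l3 {
        2 // full-idle core, primary group, same L3
    } else if same {
        3 // same CPU, ignoring SMT
    } else if c.in_primary && c.l2 == prev.l2 {
        4 // idle CPU, primary group, same L2
    } else if c.in_primary && c.l3 == prev.l3 {
        5 // idle CPU, primary group, same L3
    } else {
        6 // any idle CPU in the system
    };
    Some(rank)
}

/// Return the best idle CPU for a task that last ran on `prev_id`,
/// or None when the system is saturated (no idle CPUs).
fn pick_idle_cpu(cpus: &[Cpu], prev_id: usize) -> Option<usize> {
    let prev = cpus.iter().find(|c| c.id == prev_id)?;
    cpus.iter()
        .filter_map(|c| {
            // A core is full-idle when all of its SMT siblings are idle.
            let full_idle = cpus.iter().filter(|o| o.core == c.core).all(|o| o.idle);
            step_rank(c, prev, full_idle).map(|rank| (rank, c.id))
        })
        .min()
        .map(|(_, id)| id)
}
```

Ranking and taking the minimum gives the same result as walking the steps one by one and stopping at the first step that matches some CPU; within a step this sketch picks the lowest CPU id.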
Tejun Heo
27c530e17e scx_stats: Add missing trait exports 2024-08-19 07:16:43 -10:00
Tejun Heo
0cf5ca605d scx_layered: Move processing_dur accounting into Stats and protect it with Arc<Mutex<>> 2024-08-19 06:25:23 -10:00
Tejun Heo
a77fe372d6 scx_stats: Make server shutdown when connection is dropped and add communication channel
This will make implementing connection sessions easier where each stats
client connection maintains a set of states.
2024-08-19 06:23:16 -10:00
Changwoo Min
832f194845 scx_lavd: add power profile options: --performance, --powersave, --balanced
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-19 19:03:51 +09:00
Changwoo Min
c4c157f91c scx_lavd: add "--prefer-little-core" option
This option chooses little (efficiency) cores over big (performance) cores
to save power consumption for core compaction.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-19 18:23:35 +09:00
Changwoo Min
73b873827d scx_lavd: merge put_cpdom_rq() to ops.enqueue()
Clean up and reorganize the code around ops.enqueue()

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-19 14:22:03 +09:00
Changwoo Min
9475ace336 scx_lavd: always enqueue to a DSQ in task's compute domain
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-19 14:07:56 +09:00
Changwoo Min
0656c3232e scx_lavd: revise FlatTopology prettier
The changes include 1) chopping down a big function into smaller ones
for readability and maintainability and 2) using the interior mutability
pattern (Cell and RefCell) to avoid unnecessary clone() calls.  There
are no functional changes.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-19 11:03:52 +09:00
I Hsin Cheng
4fccc06905 scx_layered: Fix uninitialized variable
Fix the uninitialized variable "layer" in the function match_layer which
caused the compiling process to fail. "layer" is supposed to be the same
as "&layers[layer_id]".

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-08-17 23:32:53 +08:00
Tejun Heo
3a688cfde7 scx_stats: Add support for no-value user attributes and a bunch of other changes
- Allow no-value user attributes which are automatically assigned "true"
  when specified.

- Make "top" attribute string "true" instead of bool true for consistency.
  Testing for existence is always enough for value-less attributes.

- Don't drop leading "_" from user attribute names when storing in dicts.
  Dropping makes things more confusing.

- Add "_om_skip" to scx_layered fields which don't jive well with OM.
  scxstats_to_openmetrics.py is updated accordingly and no longer generates
  warnings on those fields.

- Examples and README updated accordingly.
2024-08-16 07:52:02 -10:00
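The attribute handling described in commit 3a688cfde7 above (value-less attributes become the string "true", leading "_" is kept) can be illustrated with a plain string parser. This is not the scx_stats derive macro: `parse_attrs` is a hypothetical stand-in, and it naively splits on commas, so quoted values that contain commas are not supported.

```rust
use std::collections::BTreeMap;

/// Hypothetical sketch: parse a comma-separated attribute list.
/// `key = "value"` pairs keep their value; a bare `key` is stored
/// with the string "true". Names are stored as written, including
/// any leading underscore.
fn parse_attrs(input: &str) -> BTreeMap<String, String> {
    input
        .split(',')
        .map(str::trim)
        .filter(|attr| !attr.is_empty())
        .map(|attr| match attr.split_once('=') {
            Some((key, value)) => (
                key.trim().to_string(),
                value.trim().trim_matches('"').to_string(),
            ),
            None => (attr.to_string(), "true".to_string()),
        })
        .collect()
}
```

Since every value-less attribute maps to "true", a consumer only needs to test for the key's presence, as the commit message notes.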
I Hsin Cheng
5d85937842 scx_rusty: Fix typo
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-08-16 22:03:59 +08:00
Tejun Heo
c16b48d7b2 scheds/rust: Include Cargo.lock in the repo
Binary packages are expected to include Cargo.lock in the repo so that the
produced binaries match across different builds.
2024-08-15 23:08:35 -10:00
Tejun Heo
22167aeb14
Merge pull request #502 from sched-ext/htejun/scx_stats
scx_stats: Refine scx_stats and implement scxstats_to_openmetrics.py
2024-08-15 22:55:11 -10:00
Tejun Heo
570ca56c57 scx_layered: s/_om_field_prefix/_om_prefix/ 2024-08-15 21:29:58 -10:00
Tejun Heo
af01dd19ec
Merge pull request #500 from sched-ext/htejun/scx_stats
scx_stats, scx_layered: Add `om_prefix` attribute and fix s/stat/stats/ stragglers
2024-08-15 21:27:38 -10:00
Tejun Heo
ea453e51d3 scx_stats: Rename "all" attribute to "top" and clean up examples a bit 2024-08-15 21:24:55 -10:00
Tejun Heo
a910fa451a scx_layered: Add _om attributes to LayerStats for OpenMetrics piping 2024-08-15 19:11:49 -10:00
Tejun Heo
6a5d6f7c27 scx_stats: Replace field_prefix attribute with '_' prefixed user attributes 2024-08-15 19:09:59 -10:00
Tejun Heo
a9922deaa2 scx_stats: Add "all" attribute and rename metadata type strings 2024-08-15 14:50:00 -10:00
Tejun Heo
ebc1a89c34 scx_stats: s/stat/stats/ stragglers 2024-08-15 14:00:00 -10:00
Tejun Heo
bafd67b568 scx_stats: Fix parsing for multiple stat attributes
The code was assuming single attribute per #[stat()] block. Update it so
that there can be multiple comma separated attributes in a single block.
2024-08-15 13:46:20 -10:00
Tejun Heo
8f361af077 scx_layered: Shorten stat field descriptions 2024-08-15 13:25:48 -10:00
Tejun Heo
1912e05f0b
Merge pull request #499 from sched-ext/htejun/scx_stats
scx_stats: Misc changes to sync dep versions and publish on crates.io
2024-08-15 12:32:44 -10:00
Tejun Heo
0b9c8b5cbd scx_stats: Update versions to 0.2.0 to republish 2024-08-15 12:29:27 -10:00
Daniel Hodges
0319afc88e scx_layered: Update nr_cpus when resizing layers
After updating scx_layered to be topology aware the nr_cpus field on the
layer was not being updated properly. Update layer growing/shrinking
logic to correctly update the nr_cpus count.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-15 13:22:26 -07:00
Tejun Heo
cc73b6a826
Merge pull request #496 from sched-ext/htejun/scx_stat
scx_stat: Initial commit
2024-08-15 09:24:55 -10:00
Tejun Heo
b614cf848f scx_layered: Make monitor time based iterations dumber
This makes ctrl-c a bit more responsive without complicating code.
2024-08-15 09:23:29 -10:00
Tejun Heo
45fb724ee2 scx_layered: Restore cpumask reporting 2024-08-15 09:12:29 -10:00
Tejun Heo
751a38e34e scx_layered: Refactor stats printing code 2024-08-15 08:53:19 -10:00
Tejun Heo
a4f424056e scx_layered: Move stats server launching to stats.rs 2024-08-15 06:30:42 -10:00
Tejun Heo
17afc72479 scx_stats: Rename cleanups
- s/stat/stats/ on several stragglers.

- Rename traits so that they are more distinctive from struct and other
  names and follow the convention.
2024-08-15 06:24:56 -10:00
Tejun Heo
a091d5ea7d scx_layered: s/monitor.rs/stats.rs/ and make stats refresh code struct ops 2024-08-15 06:13:05 -10:00
Tejun Heo
8aae9a5de2 scx_stats: s/scx_stat/scx_stats/
Use plural form which is more widespread and also used in scheduler
implementations. No functional changes.
2024-08-15 05:31:34 -10:00
Tejun Heo
6e466d18df scx_layered: Initial switch to scx_stat
- This makes the scheduler side simpler and allows on-demand monitoring.

- OpenMetrics support is dropped for now. Will add a generic tool for it.

- This is a naive conversion. Will be further refined.

scx_layered no longer prints statistics by default. To watch statistics, run
`scx_layered --monitor` while the scheduler is running.
2024-08-14 13:48:41 -10:00
Tejun Heo
7820ec9b46 scx_stat, scx_layered: cargo fmt 2024-08-14 11:47:37 -10:00
Tejun Heo
099b6c266a scx_lavd: Build fix
Add "signal" feature to nix dependency; otherwise, build fails.
2024-08-14 07:55:04 -10:00
Andrea Righi
0f018c5fff
Merge pull request #484 from vax-r/rustland_unused
scx: Remove unused variables, imports and functions
2024-08-14 19:03:26 +02:00
Andrea Righi
f9a994412d scx_bpfland: introduce primary scheduling domain
Allow specifying a primary scheduling domain via the new command line
option `--primary-domain CPUMASK`, where CPUMASK can be a hex number of
arbitrary length, representing the CPUs assigned to the domain.

If this option is not specified, the scheduler will use all the available
CPUs in the system as the primary domain (no behavior change).

Otherwise, if a primary scheduling domain is defined, the scheduler will
try to dispatch tasks only to the CPUs assigned to the primary domain,
until these CPUs are saturated, at which point tasks may overflow to
other available CPUs.

This feature can be used to prioritize certain cores over others and it
can be really effective in systems with heterogeneous cores (e.g.,
hybrid systems with P-cores and E-cores).

== Example (hybrid architecture) ==

Hardware:
 - Dell Precision 5480 with 13th Gen Intel(R) Core(TM) i7-13800H
   - 6 P-cores 0..5  with 2 CPUs each (CPU from  0..11)
   - 8 E-cores 6..13 with 1 CPU  each (CPU from 12..19)

== Test ==

WebGL application (https://webglsamples.org/aquarium/aquarium.html):
this generates a steady workload in the system without
over-saturating the CPUs.

Use different scheduler configurations:

 - EEVDF (default)
 - scx_bpfland using P-cores only (--primary-domain 0x00fff)
 - scx_bpfland using E-cores only (--primary-domain 0xff000)
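
The two masks above follow directly from the CPU numbering in the hardware description (one bit per CPU, bit 0 for CPU 0). A minimal sketch of that mapping, using hypothetical helper names (`cpus_to_mask`, `mask_to_cpus`) that are not part of scx_bpfland:

```python
def cpus_to_mask(cpus):
    """Return a hex CPU mask string with one bit set per CPU id."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return hex(mask)


def mask_to_cpus(mask_str):
    """Return the sorted CPU ids whose bits are set in a hex mask string."""
    mask = int(mask_str, 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


# CPU numbering taken from the hardware description in this commit message.
p_core_cpus = range(0, 12)   # CPUs 0..11 (6 P-cores, 2 CPUs each)
e_core_cpus = range(12, 20)  # CPUs 12..19 (8 E-cores, 1 CPU each)

print("P-cores:", cpus_to_mask(p_core_cpus))
print("E-cores:", cpus_to_mask(e_core_cpus))
print("0xff000 covers CPUs:", mask_to_cpus("0xff000"))
```

Leading zeros in the mask (as in 0x00fff) do not change its value; they only pad it to the same width as the E-core mask.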

Measure performance (fps) and power consumption (W).

== Result ==

+-----------------+-----+-----+------+-------+--------+
|                 | min | max | avg  |       |        |
|                 | fps | fps | fps  | stdev | power  |
+-----------------+-----+-----+------+-------+--------+
| EEVDF           | 28  | 34  | 31.0 |  1.73 |  3.5W  |
| bpfland-p-cores | 33  | 34  | 33.5 |  0.29 |  3.5W  |
| bpfland-e-cores | 25  | 26  | 25.5 |  0.29 |  2.2W  |
+-----------------+-----+-----+------+-------+--------+

Using a primary scheduling domain of only P-cores with scx_bpfland
achieves a more stable and predictable level of performance,
with an average of 33.5 fps and an error of ±0.5 fps.

In contrast, using EEVDF results in an average frame rate of 31.0 fps
with an error of ±3.0 fps, indicating slightly less consistency, due to
the fact that tasks are evenly distributed across all the cores in the
system (both slow and fast cores).

On the other hand, using a scheduling domain of only E-cores with
scx_bpfland results in a lower average frame rate (25.5 fps) while
maintaining stable performance (error of ±0.5 fps). Power consumption
is also reduced, averaging 2.2W compared to 3.5W with either of the
other configurations.
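
The ± figures quoted in this section are consistent with half of the min-max range in the result table. That reading is an assumption, since the commit message does not say how the error was computed. The sketch below recomputes it from the table values and also derives frames per watt, a figure that is not in the original and is shown only to make the performance/power trade-off concrete:

```python
# (min fps, max fps, avg fps, power in W), copied from the result table.
results = {
    "EEVDF": (28, 34, 31.0, 3.5),
    "bpfland-p-cores": (33, 34, 33.5, 3.5),
    "bpfland-e-cores": (25, 26, 25.5, 2.2),
}


def half_range(min_fps, max_fps):
    """Half of the min-max spread (assumed meaning of the quoted error)."""
    return (max_fps - min_fps) / 2


def fps_per_watt(avg_fps, power_w):
    """Average frames per second delivered per watt consumed."""
    return avg_fps / power_w


for name, (lo, hi, avg, power) in results.items():
    print(f"{name}: avg {avg} fps, +/-{half_range(lo, hi):.1f} fps, "
          f"{fps_per_watt(avg, power):.2f} fps/W")
```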

== Conclusion ==

In summary, with this change users have the flexibility to prioritize
scheduling on performance cores for better performance and consistency,
or prioritize energy efficient cores for reduced power consumption, on
hybrid architectures.

Moreover, this feature can also be used to minimize the number of cores
used by the scheduler, until they reach full capacity. This capability
can be useful for reducing power consumption even in homogeneous systems
or for conducting scheduling experiments with smaller sets of cores,
provided the system is not overcommitted.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-14 16:17:54 +02:00
Andrea Righi
a6e977c70b scx_bpfland: make output more compact
Abbreviate the statistics reported to stdout and remove the slice_ms
metric: this metric can be easily derived from slice_ns, slice_ns_min
and nr_wait, which are already reported to stdout.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-14 16:17:54 +02:00
Andrea Righi
8656effa50 scx_bpfland: update copyright info
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-14 16:17:54 +02:00
Changwoo Min
3c6d86b342 scx_lavd: upgrade nix package from 0.28.0 to 0.29.0
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-14 22:31:05 +09:00
Changwoo Min
444f0b86a5
Merge pull request #489 from multics69/lavd-amp-v4
lavd: make LAVD core-type (AMP) aware
2024-08-14 14:24:09 +09:00
Tejun Heo
4612764b82
Merge pull request #486 from vax-r/Fix_rusty_logic
scx_rusty: Fix logical error when filtering tasks
2024-08-13 09:39:12 -10:00
Daniel Hodges
646cefd46d
Merge pull request #477 from hodgesds/layered-global-match
scx_rusty: Make layer matching a global function
2024-08-12 09:14:58 -04:00
Daniel Hodges
be5213e129 scx_rusty: Make layer matching a global function
Layer matching currently takes a large number of bpf instructions.
Moving layer matching to a global function will reduce the overall
instruction count and allow for other layer matching methods such as
glob.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-12 05:44:34 -07:00
Changwoo Min
b7b8c8de90 scx_lavd: fix build errors
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 14:10:40 +09:00
Changwoo Min
182b0bd249 scx_lavd: make the verifier in 6.8 kernel happy
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:04:04 +09:00
Changwoo Min
4ecf3fc94e scx_lavd: build cpdom map from rust
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:03:18 +09:00
Changwoo Min
1f1a3dc4f1 scx_lavd: sort cores in descending order of max freq
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:40 +09:00
Changwoo Min
c213a3e44f scx_lavd: make core compaction core type aware
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:40 +09:00
Changwoo Min
c35b6b27ff scx_lavd: consider task pinning for core-type-aware ops.enqueue()
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:40 +09:00
Changwoo Min
25bf98d2a0 scx_lavd: make ops.select_cpu() core type aware
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:40 +09:00
Changwoo Min
fa87e1c593 scx_lavd: make ops.dispatch() core type aware
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:40 +09:00
Changwoo Min
c1cf11f7b1 scx_lavd: make ops.enqueue() core type aware
Put performance-critical tasks in a performance-critical queue and
regular tasks in a regular queue.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:40 +09:00
Changwoo Min
03a8c10ece scx_lavd: add cpdom_ctx to abstract compute domain and its DSQ
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:40 +09:00
Changwoo Min
623b05a282 scx_lavd: revise perf_cri factor to reflect wakeup, runtime, and run_freq
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:40 +09:00
Changwoo Min
15871fd032 scx_lavd: turn off pinned core less aggressively
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:40 +09:00
Changwoo Min
9dc7f94cb6 scx_lavd: unify the deadline calculation and ineligibility calculation
The unified version is not only simpler but also works better.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:40 +09:00
Changwoo Min
4705520d40 scx_lavd: remove unnecessary options which have never been used
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-08-12 13:01:34 +09:00
I Hsin Cheng
15b40de408 scx_rusty: Fix logical error when filtering tasks
The task filtering logic was moved from find_first_candidate() into
a vector filter operation in commit 1c3b563. However, the predicate was
not negated in the process: .filter() keeps the tasks we want, whereas
.skip_while() was throwing the unwanted tasks out.

Reverse the logic so that kworker and migrated tasks are not taken
into consideration.
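
The difference between the two iterator styles can be sketched in Python, with `itertools.dropwhile` standing in for Rust's `.skip_while()` and the built-in `filter` for `.filter()`. The task records and the `unwanted` predicate are hypothetical and only illustrate the inversion; they are not taken from scx_rusty:

```python
from itertools import dropwhile

# Hypothetical task records; the field names are illustrative only.
tasks = [
    {"pid": 1, "is_kworker": True, "migrated": False},
    {"pid": 2, "is_kworker": False, "migrated": False},
    {"pid": 3, "is_kworker": False, "migrated": True},
    {"pid": 4, "is_kworker": False, "migrated": False},
]


def unwanted(task):
    """True for tasks that should not be considered."""
    return task["is_kworker"] or task["migrated"]


# skip_while style: the predicate names what to throw away, and only the
# leading run of matching items is dropped.
first_candidate = next(dropwhile(unwanted, tasks))

# filter style: the predicate names what to keep, so it must be negated.
candidates = [t for t in tasks if not unwanted(t)]

# Reusing the old predicate unchanged selects exactly the wrong tasks.
wrong = list(filter(unwanted, tasks))

print(first_candidate["pid"], [t["pid"] for t in candidates],
      [t["pid"] for t in wrong])
```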

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-08-10 22:56:20 +08:00
I Hsin Cheng
4e40ba3b11 scx_rustland: Removed unused imports and variables
The member "topo_map" in Scheduler is never used and thus should be
removed; the related imports are removed as well.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-08-09 20:35:12 +08:00
I Hsin Cheng
b7e03b7a76 scx_bpfland: Remove unused variable
Remove unused variable "vtime" in task_vtime().

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-08-09 20:28:42 +08:00
Tejun Heo
45f7fd13b7 versions: Synchronize crate dependency versions 2024-08-08 14:45:46 -10:00
Tejun Heo
63c4a0191f
Merge branch 'main' into topic/inlined-skeleton-members 2024-08-08 14:23:37 -10:00
Tejun Heo
cd6a4d72c7 Bump versions for 1.0.2 release 2024-08-08 14:10:16 -10:00
Tejun Heo
7c3ffe96e1 Unify crate dependency versions
Different sub-projects are using different versions for the same crates.
Synchronize them to the latest.
2024-08-08 13:26:47 -10:00
Andrea Righi
9d808ae206
Merge pull request #468 from sched-ext/rustland-refactoring
scx_rustland refactoring
2024-08-07 11:38:21 +02:00
Andrea Righi
51cfb69199 scx_rustland_core: re-introduce partial mode
Re-add the partial mode option that was dropped during the refactoring.

The partial option applies the scheduler only to the tasks which
have their scheduling policy set to SCHED_EXT via sched_setscheduler().
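
A task can opt in to partial mode by changing its own policy. The sketch below does this from Python and is only illustrative: the helper name `switch_to_sched_ext` is hypothetical, and the policy number 7 is taken from the kernel UAPI header because Python's `os` module does not export a SCHED_EXT constant.

```python
import os

# Policy number from the kernel UAPI header <linux/sched.h>; spelled out
# here because the os module has no SCHED_EXT constant.
SCHED_EXT = 7


def switch_to_sched_ext(pid=0):
    """Try to move `pid` (0 = calling process) to SCHED_EXT.

    Returns True on success, False if the platform or kernel rejects it
    (for example, a kernel without sched_ext support).
    """
    if not hasattr(os, "sched_setscheduler"):
        return False
    try:
        # SCHED_EXT is not a real-time policy, so the priority is 0.
        os.sched_setscheduler(pid, SCHED_EXT, os.sched_param(0))
    except OSError:
        return False
    return True
```

Calling `switch_to_sched_ext()` with no argument targets the calling process; whether it succeeds depends on the running kernel.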

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-07 08:41:06 +02:00
Andrea Righi
e1f2b3822e scx_rustland_core: drop CPU ownership API
The API for determining which PID is running on a specific CPU is racy
and is unnecessary since this information can be obtained from user
space.

Additionally, it's not reliable for identifying idle CPUs.  Therefore,
it's better to remove this API and, in the future, provide a cpumask
alternative that can export the idle state of the CPUs to user space.

As a consequence, also change scx_rustland to dispatch one task at a time,
instead of dispatching tasks in batches of idle cores (which are usually
not accurate due to the racy nature of the CPU ownership interface).

Dispatching one task at a time even makes the scheduler more performant,
due to the vruntime scheduling being applied to more tasks sitting in
the scheduler's queue.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-07 08:41:06 +02:00
Andrea Righi
9a0e7755df scx_rustland_core: export counter of online CPUs
Introduce a helper to get the number of online CPUs tracked by the BPF
part.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-07 08:10:53 +02:00
Andrea Righi
d9c9f78e3e scx_rustland: re-align vruntime and time slice evaluation to scx_bpfland
Drop the slice boost logic and apply a vruntime and task time slice
evaluation approach similar to scx_bpfland (but implement this in the
user-space component instead of the BPF part).

Additionally, introduce a slice_us_min parameter to define the minimum
time slice that can be assigned to a task, also similar to scx_bpfland.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-07 08:10:53 +02:00
Andrea Righi
38a725ea34 scx_rlfifo: update copyright info
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-07 08:10:53 +02:00
Andrea Righi
c963d5eb05 scx_rustland: update copyright info
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-07 08:10:53 +02:00