Commit Graph

716 Commits

Author SHA1 Message Date
Andrea Righi
8853d9a9f2
Merge pull request #548 from sched-ext/rustland-core-refactoring
scx_rustland_core: user-space framework refactoring
2024-08-25 16:39:28 +02:00
Andrea Righi
894f9582d0 scx_rustland_core: hide shutdown boilerplate in BpfScheduler
Refactor the code to hide the shutdown handling inside BpfScheduler and
simply use the exited() method to check when the scheduler is stopped.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-25 12:17:04 +02:00
Tejun Heo
152a8471cc scx_bpfland: When reporting stats, use interval deltas
Three of the reported stats are cumulative. While they obviously can be
processed into delta values, that holds for the other direction too and the
cumulative values are difficult to make intutive sense of. Report interval
delta values instead.

Note that a stats client can reliably build back cumulative values even
under heavy system contention - the delta values reported between two
consecutive reads are guaranteed to be correct regardless of the duration of
the interval.
2024-08-24 23:14:57 -10:00
Tejun Heo
bd68e230b9 scx_bpfland: Convert to scx_stats
Use scx_stats instead of prometheus for stats reporting. This has a few
advantages:

- Stats metadata can be defined more succinctly.

- Natural support for nesting statistics which will be useful in making
  scheduler components composable.

- Support for multiple programmable readers where each reader can use their
  own reading interval.

- Built-in stats help message generation.

- Openmetrics integration is still available through
  scx_stats/scripts/scxstats_to_openmetrics.py.
2024-08-24 23:14:55 -10:00
Tejun Heo
625381280c scx_stats: Shorten exported names and add prelude module
Let's make it a bit easier to use:

- Shorten exported names by changing the prefix from ScxStats to Stats. This
  should be distinctive enough and more inline with how most libraries name
  their exports.

- Importing the right set of traits can be tricky. Introduce prelude module
  so that importing is a bit less painful.
2024-08-24 22:04:25 -10:00
Andrea Righi
a2e97fecbb scx_rustland_core: merge verbose and debug in the same option
There is no reason to have two separate options for "verbose" and
"debug" mode. Just merge the two and always use "debug". If enabled,
increase verbosity to stdout and enable reporting BPF scheduling events
in debugfs (e.g., /sys/kernel/debug/tracing/trace_pipe).

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-25 09:45:20 +02:00
Andrea Righi
cb16a11342 scx_rustland_core: get rid of the global scheduler's slice_us
Since scx_rustland_core enables setting a time slice on a per-task basis
during task dispatch, there's no need to maintain a global time slice in
the BPF component. Instead, a global time slice can simply be managed in
user-space, achieving the same outcome.

Therefore, drop the global slice_us property from BpfScheduler to
simplify the API.

NOTE: if a time slice is not specified for a task, SCX_SLICE_DFL will be
used by default.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-25 09:45:18 +02:00
Andrea Righi
e404bee5e7 scx_rustland / scx_rlfifo: small code format fixes
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-25 09:44:52 +02:00
Andrea Righi
1cd11ba916 scx_rlfifo: improve documentation and code readability
Add more comments to make the source code more understandable, so that
it can be easily used as a template for implementing more complex
scheduling policies.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-25 09:44:28 +02:00
Tejun Heo
35a4326aee scx_lavd: Drop unnecessary stat field explanation on startup
The scheduling instances no longer prints out sched samples. No reason to
print field explanation on startup.
2024-08-24 18:48:54 -10:00
Changwoo Min
02ad793c78
Merge branch 'main' into htejun/scx_lavd-stats 2024-08-25 11:57:41 +09:00
Changwoo Min
8b1874c27f
Merge pull request #552 from CachyOS/lavd-mutli-cxx2
scx_lavd: Drop message about unsupported multi-CXX support
2024-08-25 11:48:12 +09:00
Tejun Heo
fdfb7f60f4 Merge branch 'main' into htejun/scx_lavd-stats 2024-08-24 15:53:53 -10:00
Tejun Heo
55e5b8b43f scx_lavd: Switch to scx_stats
Scheduling sample reporting is switched to use scx_stats. This makes the
scheduler run without making too much noise while still allowing monitoring
on demand. It can also make introspection more dynamic - e.g. it shouldn't
be difficult to add other monitoring commands which take scheduling samples
based on different criteria or add other types of staisitcs.

--nr_sched-samples is replaced with --monitor-nr-samples.
2024-08-24 15:53:02 -10:00
Tejun Heo
1bba713a29
Merge pull request #542 from sched-ext/htejun/scx_stats
scx_stats, scx_rusty, scx_layered: Implement `--help-stats`
2024-08-24 15:38:36 -10:00
Peter Jung
906d054770
scx_lavd: Drop message about unsupported multi-CXX support
Signed-off-by: Peter Jung <admin@ptr1337.dev>
2024-08-25 01:10:38 +02:00
Andrea Righi
0aa23481de scx_rustland_core: drop update_tasks() and introduce notify_complete()
The update_tasks() API is somewhat confusing, so replace it with a
clearer API, notify_complete().

This new API will return control to the BPF component and inform it
about the number of tasks still pending in the user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-25 00:45:23 +02:00
Daniel Hodges
e81faef103
Merge pull request #544 from hodgesds/layered-tgid
scx_layered: Add layer match for tgid
2024-08-24 16:58:19 -04:00
Andrea Righi
5ece102554 scx_rustland: get rid of unnecessary debugging information
Additional statistics will be re-added later via scx_stats.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 21:29:10 +02:00
Andrea Righi
cef8ff8757 scx_rustland_core: get rid of the low_power API
The low-power API is a bit of a hack implemented purely in the BPF
layer, this should be better re-implemented with some concepts of
topology awareness.

Therefore, get rid of this API for now.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 21:29:10 +02:00
Andrea Righi
be7ef1009b scx_rlfifo: user-space idle CPU selection
Select an idle CPU from user-space, instead of always dispatching on the
first CPU available.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 21:29:10 +02:00
Andrea Righi
568e292a24 scx_rustland_core: get rid of the exiting task API
The current API used to notify the user-space scheduler when a task
exits is really confusing (setting a negative value in
queued_task_ctx.cpu), and it's also possible to detect task exiting
events from user-space (or check in procfs, even if it's slower).

In any case, a better API should be provided for this, so drop the
current one for now.

NOTE: this will cause additional memory usage for scx_rustland, but it
can be fixed/addressed later in a separate commit (i.e., providing a
periodic garbage collector for the unused task entries).

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 21:29:10 +02:00
Andrea Righi
5d544ea264 scx_rustland_core: move CPU idle selection logic in user-space
Allow user-space scheduler to pick an idle CPU via
self.bpf.select_cpu(pid, prev_task, flags), mimicking the BPF's
select_cpu() iterface.

Also remove the full_user option and always rely on the idle selection
logic from user-space.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 21:28:13 +02:00
Andrea Righi
1dd329dd7d scx_rustland: update Cargo.lock
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 20:24:48 +02:00
Andrea Righi
106d59d997 scx_rlfifo: update Cargo.lock
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 20:24:48 +02:00
Andrea Righi
016aae759f
Merge pull request #545 from sched-ext/bpfland-honor-avg-nvcsw
scx_bpfland: always honor average nvcsw in lowlatency mode
2024-08-24 20:24:33 +02:00
Avraham Hollander
66b5dd0de9 Clean up scx_rusty help info a bit 2024-08-24 11:56:12 -04:00
Avraham Hollander
c34a470024 scx_lavd: Fix my own formatting error 2024-08-24 11:36:19 -04:00
Andrea Righi
5a08855a86 scx_bpfland: always honor average nvcsw in lowlatency mode
Keep evaluating the average number of voluntary context switches for
each task when lowlatency mode is enabled, even when interactive tasks
classification is disabled (via `-c 0`).

The average nvcsw is also used in lowlatency mode to evaluate the
proportional bonus to the tasks' deadline and it shouldn't be ignored
when interactive tasks classification is disabled. Moreover, make sure
that such bonus never exceeds the starvation threshold.

Keep in mind that it is still possible to disable the periodic average
nvcsw evaluation with `-c 0`, without specifying `--lowlatency`.

Fixes: 6a22853 ("scx_bpfland: introduce --lowlatency option")
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-24 10:42:22 +02:00
Tejun Heo
48092c6f88 scx_lavd: Relay introspection output in stats::TaskSample
This indirection doesn't make any visible behavior difference now but will
be used to implement scx_stats support.
2024-08-23 18:49:36 -10:00
Tejun Heo
725fa7f1be Merge branch 'main' into htejun/scx_stats 2024-08-23 17:10:08 -10:00
Daniel Hodges
5a2012763e
scx_layered: Add layer match for tgid
Add layer match for tgid.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-23 23:00:28 -04:00
Avraham Hollander
bedb18b48e Improve scx_lavd help info
A lot of scx_lavd's options do not clearly explain what they do. Add
some short explanations, clean up the existing ones, and direct the user
to read the in-code documentation for more info.
2024-08-23 18:56:14 -04:00
Avraham Hollander
d6e27b59e7 Clean up scx_bpfland help info a bit 2024-08-23 18:55:04 -04:00
Tejun Heo
25e437753c scx_layered, scx_rusty: Implement --help-stats
which shows all the defined stats. While at it, make some cosmetic updates.
2024-08-23 12:39:47 -10:00
Tejun Heo
405bcc63fe scx_stats: Make ScxStatsServerData a public carrier of data needed for stats server
And move related ops into it. This is a bit more natural and will also allow
doing other operaitons (e.g. describing stats) without launching the server.
2024-08-23 12:23:57 -10:00
Tejun Heo
7bd35b6cd3 scx_lavd: Cargo.lock update (caused by scx_utils depending on scx_stats) 2024-08-23 09:21:44 -10:00
Andrea Righi
e72676ede3
Merge pull request #540 from sched-ext/bpfland-cpufreq-awareness
scx_bpfland: cpu frequency and energy awareness
2024-08-23 21:17:34 +02:00
Tejun Heo
9e3b4e6db0 scx_stats: A bit of cleanups and renames 2024-08-23 09:09:02 -10:00
Tejun Heo
b6ccb87bec
Merge pull request #539 from sched-ext/htejun/scx_rusty
scx_rusty: Convert to scx_stats
2024-08-23 08:42:47 -10:00
Daniel Hodges
7d45059fa9
Merge pull request #538 from hodgesds/layered-pid
scx_layered: Add pid/ppid matches
2024-08-23 14:08:40 -04:00
Tejun Heo
8c8912ccea Merge branch 'main' into htejun/scx_rusty 2024-08-23 07:50:23 -10:00
Andrea Righi
50684e4569 scx_bpfland: introduce Intel Turbo Boost awareness
Make `--primar-domain auto` aware of turbo boosted CPUs and prioritize
them over the primary scheduling domain when the energy model
`balance_power` is used (typically when running on battery power with
the "balanced" profile).

With this change the scheduling hierarchy becomes the following:

 1) CPUs in the turbo scheduling domain
 2) CPUs in the primary scheduling domain
 3) full-idle SMT CPUs
 4) CPUs in the same L2 cache
 5) CPUs in the same L3 cache
 6) CPUs in the task's allowed domain

And the idle selection logic is modified as following:

 - In the turbo scheduling domain:
   - pick same full-idle SMT CPU
   - pick any other full-idle SMT CPU sharing the same L2 cache
   - pick any other full-idle SMT CPU sharing the same L3 cache
   - pick any other full-idle SMT CPU
   - pick same idle CPU
   - pick any other idle CPU sharing the same L2 cache
   - pick any other idle CPU sharing the same L3 cache
   - pick any other idle SMT CPU
 - In the primary scheduling domain:
   - pick same full-idle SMT CPU
   - pick any other full-idle SMT CPU sharing the same L2 cache
   - pick any other full-idle SMT CPU sharing the same L3 cache
   - pick any other full-idle SMT CPU
   - pick same idle CPU
   - pick any other idle CPU sharing the same L2 cache
   - pick any other idle CPU sharing the same L3 cache
   - pick any other idle SMT CPU
 - In the entire task domain:
   - pick any other idle CPU

Keep in mind that the turbo domain will be evaluated only when the
scheduler is started with `--primary-domain auto` and only when the
`balance_power` energy profile is used.

The turbo domain is always made using the subset of CPUs in the system
with the highest max frequency. If such subset can't be determined (for
example if all the CPUs in the primary domain have all the same
frequency), the turbo domain will be ignored.

Prioritizing turbo boosted CPUs can help to improve performance by
forcing the governor to scale up their frequency, without increasing too
much power consumption, due to the fact that tasks will be preferably
confined into a reduced amount of cores.

This change seems to improve performance, without increasing much
power consuption, on Intel laptops while using the `balanced_power`
energy profile.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-23 19:49:08 +02:00
Andrea Righi
d958dd4482 scx_bpfland: introduce dynamic energy profile
Introduce the new option `--primary-domain auto`. With this option the
scheduler will dynamically adjusts the primary scheduling domain at
run-time, in function of the current energy profile reported in
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference.

When the `power` energy profile is selected, the primary scheduling
domain will prioritize E-cores. Alternatively, when the `performance`
profile is selected, it will prioritize P-cores. For all the other
energy profiles, all the CPUs in the system will be used.

Note that this option is only relevant on hybrid architectures with
P-cores and E-cores.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
2024-08-23 19:49:01 +02:00
Tejun Heo
44a0f1b124 scx_utils: Factor out monitor_stats() from scx_rusty and scx_layered 2024-08-23 06:46:19 -10:00
Tejun Heo
ae3024e938 scx_layered: Add --stats and make --monitor behavior consistent with scx_rusty 2024-08-23 05:52:52 -10:00
Tejun Heo
0f04a93dd1 scx_rusty: Add stat descriptions and make minor adjustments 2024-08-23 05:46:13 -10:00
Tejun Heo
36865234f8 scx_rusty: Add scx_stats annotations necessary for openmetrics translation 2024-08-23 04:59:08 -10:00
Tejun Heo
2f3f473cd3 scx_rusty: Improve timestamp reporting 2024-08-23 04:31:27 -10:00
Daniel Hodges
11b978a892 scx_layered: Add pid/ppid matches
Add matches for pid/ppid.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-08-23 07:20:05 -07:00