Commit Graph

310 Commits

Author SHA1 Message Date
Dan Schatzberg
11e487c165 scx_layered: dispatch from select_cpu if possible
If we are doing local dispatch, we can avoid enqueue() altogether by
dispatching from select_cpu()
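
As a rough illustration of the idea above, here is a toy Python model, not the sched_ext BPF API. The class `ToyScheduler` and its methods are hypothetical names chosen for this sketch; it only shows that when select_cpu() finds an idle CPU and dispatches to its local queue, enqueue() is never reached for that wakeup.

```python
from collections import deque


class ToyScheduler:
    """Toy model (not the sched_ext API): counts how often enqueue() runs."""

    def __init__(self, nr_cpus):
        self.idle = set(range(nr_cpus))  # CPUs currently idle
        self.local_dsq = {cpu: deque() for cpu in range(nr_cpus)}
        self.global_dsq = deque()
        self.enqueue_calls = 0

    def select_cpu(self, task, prev_cpu):
        """Pick a CPU; if it is idle, dispatch right here and skip enqueue()."""
        if prev_cpu in self.idle:
            self.idle.discard(prev_cpu)
            self.local_dsq[prev_cpu].append(task)  # direct local dispatch
            return prev_cpu, True
        return prev_cpu, False

    def enqueue(self, task):
        """Slow path: only reached when select_cpu() did not dispatch."""
        self.enqueue_calls += 1
        self.global_dsq.append(task)

    def wake(self, task, prev_cpu):
        """Model one wakeup: select_cpu() first, enqueue() only if needed."""
        cpu, dispatched = self.select_cpu(task, prev_cpu)
        if not dispatched:
            self.enqueue(task)
        return cpu
```

In this model the first wakeup on an idle CPU lands in that CPU's local queue with `enqueue_calls` still at zero; a second wakeup on the now-busy CPU takes the enqueue() path.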

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-31 09:54:26 -08:00
Tejun Heo
8f806c41b1
Merge pull request #113 from jordalgo/breaking-changes
Add BREAKING_CHANGES.md
2024-01-29 10:56:13 -10:00
Jordan Rome
347a81fcff Add BREAKING_CHANGES.md 2024-01-29 10:44:44 -08:00
Tejun Heo
53106aafa9
Merge pull request #111 from dschatzberg/cleanup_pick_idle
scx_layered: small idle_cpumask cleanups
2024-01-29 08:40:35 -10:00
Tejun Heo
d978df2a2f
Merge pull request #112 from sirlucjan/services-meson
Make meson.build more readable
2024-01-29 06:44:17 -10:00
Dan Schatzberg
ab5635ff6d scx_layered: Grab idle_smtmask a bit later
This is a really minor optimization, but we don't need idle_smtmask to
schedule pinned tasks, so defer it so the nr_cpus_allowed == 1 path is
marginally faster.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-29 08:16:37 -08:00
Dan Schatzberg
8c9e65d880 scx_layered: Remove unnecessary idle_cpumask
idle_cpumask isn't used at all in pick_idle_cpu_from. The only need for
these cpumasks is to check if prev_cpu is a wholly idle CPU (and we only
do this when smt_enabled). idle_smtmask is sufficient for that check.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-29 08:16:37 -08:00
Piotr Gorski
22e775842a
Make meson.build more readable
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-01-29 17:14:39 +01:00
David Vernet
46ba5908ab
Merge pull request #109 from sirlucjan/services-rework
Reworking systemd-service and adding a config file
2024-01-26 22:49:00 -06:00
Piotr Gorski
561cbc4e6d
Update README.md
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-01-27 00:59:23 +01:00
Piotr Gorski
26d53233de
systemd-services: add one service for all schedulers and config file
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-01-27 00:41:00 +01:00
David Vernet
5329ac698f
Merge pull request #108 from sirlucjan/services-conflicts
systemd-services: setting conflict between schedulers
2024-01-26 15:34:12 -06:00
Tejun Heo
4eb2367048
Merge pull request #107 from dschatzberg/fix_affn_viol
scx_layered: Fix AFFN_VIOL stat bump
2024-01-26 11:21:07 -10:00
Piotr Gorski
23223b8b77
systemd-services: setting conflict between schedulers
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-01-26 22:18:51 +01:00
Dan Schatzberg
142b6230b2 scx_layered: Fix AFFN_VIOL stat bump
Prior to this patch, we only bumped LSTAT_AFFN_VIOL when the target cpu
was idle, but in both cases it should be counted as AFFN_VIOL.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-26 13:13:16 -08:00
Tejun Heo
ee2e0c091c
Merge pull request #106 from sched-ext/github-ci-kernel-commit
ci: print the latest commit of the checked out sched-ext kernel
2024-01-25 10:30:22 -10:00
Andrea Righi
a8a1944a5d ci: print the latest commit of the checked out sched-ext kernel
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-25 21:20:50 +01:00
Tejun Heo
885d176ea1
Merge pull request #105 from sched-ext/htejun
Bump versions
2024-01-25 09:03:36 -10:00
Tejun Heo
988b7d13c1 Bump versions
scx_exit_info change doesn't require code to be updated but breaks binary
compatibility. Bump versions and cut a new release.
2024-01-25 09:01:23 -10:00
Tejun Heo
eb997a6e55
Merge pull request #101 from dschatzberg/openmetrics
scx_layered: Add support for OpenMetrics format
2024-01-25 08:59:16 -10:00
Tejun Heo
7117a22009
Merge pull request #104 from sched-ext/htejun
user_exit_info: Print out debug dump if available
2024-01-25 08:54:46 -10:00
Tejun Heo
740e382f12 user_exit_info: Print out debug dump if available
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-25 08:49:31 -10:00
Tejun Heo
09e2824b57
Merge pull request #103 from dschatzberg/update_vmlinux
Update vmlinux.h
2024-01-25 08:26:40 -10:00
Dan Schatzberg
975c698843 Update vmlinux.h
26ae1b0356 changed scx_exit_info, which requires us to rebuild with a new
vmlinux.h. This patch updates vmlinux.h to the current sched_ext branch in
the github repo.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-25 10:21:48 -08:00
Dan Schatzberg
7f9548eb34 scx_layered: Add support for OpenMetrics format
Currently scx_layered outputs statistics periodically as info! logs. The
format of this is largely unstructured and mostly suitable for running
scx_layered interactively (e.g. observing its behavior on the command
line or via logs after the fact).

In order to run scx_layered at larger scale, it's desirable to have
statistics output in some format that is amenable to being ingested into
monitoring databases (e.g. Prometheus). This allows collection of
stats across many machines.

This commit adds a command line flag (-o) that outputs statistics to
stdout in OpenMetrics format instead of the normal log mechanism.
OpenMetrics has a public format
specification (https://github.com/OpenObservability/OpenMetrics) and is
in use by many projects.

The library for producing OpenMetrics metrics is lightweight but does
induce some changes. Primarily, metrics need to be pre-registered (see
OpenMetricsStats::new()).

Without -o, the output looks as before, for example:

```
19:39:54 [INFO] CPUs: online/possible=52/52 nr_cores=26
19:39:54 [INFO] Layered Scheduler Attached
19:39:56 [INFO] tot=   9912 local=76.71 open_idle= 0.00 affn_viol= 2.63 tctx_err=0 proc=21ms
19:39:56 [INFO] busy=  1.3 util=   65.2 load=    263.4 fallback_cpu=  1
19:39:56 [INFO]   batch    : util/frac=   49.7/ 76.3 load/frac=    252.0: 95.7 tasks=   458
19:39:56 [INFO]              tot=   2842 local=45.04 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:56 [INFO]              cpus=  2 [  0,  2] 04000001 00000000
19:39:56 [INFO]   immediate: util/frac=    0.0/  0.0 load/frac=      0.0:  0.0 tasks=     0
19:39:56 [INFO]              tot=      0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:56 [INFO]              cpus= 50 [  0, 50] fbfffffe 000fffff
19:39:56 [INFO]   normal   : util/frac=   15.4/ 23.7 load/frac=     11.4:  4.3 tasks=   556
19:39:56 [INFO]              tot=   7070 local=89.43 open_idle= 0.00 preempt= 0.00 affn_viol= 3.69
19:39:56 [INFO]              cpus= 50 [  0, 50] fbfffffe 000fffff
19:39:58 [INFO] tot=   7091 local=84.91 open_idle= 0.00 affn_viol= 2.64 tctx_err=0 proc=21ms
19:39:58 [INFO] busy=  0.6 util=   31.2 load=    107.1 fallback_cpu=  1
19:39:58 [INFO]   batch    : util/frac=   18.3/ 58.5 load/frac=     93.9: 87.7 tasks=   589
19:39:58 [INFO]              tot=   2011 local=60.67 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:58 [INFO]              cpus=  2 [  2,  2] 04000001 00000000
19:39:58 [INFO]   immediate: util/frac=    0.0/  0.0 load/frac=      0.0:  0.0 tasks=     0
19:39:58 [INFO]              tot=      0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:58 [INFO]              cpus= 50 [ 50, 50] fbfffffe 000fffff
19:39:58 [INFO]   normal   : util/frac=   13.0/ 41.5 load/frac=     13.2: 12.3 tasks=   650
19:39:58 [INFO]              tot=   5080 local=94.51 open_idle= 0.00 preempt= 0.00 affn_viol= 3.68
19:39:58 [INFO]              cpus= 50 [ 50, 50] fbfffffe 000fffff
^C19:39:59 [INFO] EXIT: BPF scheduler unregistered
```

With -o passed, the output is in OpenMetrics format:

```
19:40:08 [INFO] CPUs: online/possible=52/52 nr_cores=26
19:40:08 [INFO] Layered Scheduler Attached
 # HELP total Total scheduling events in the period.
 # TYPE total gauge
total 8489
 # HELP local % that got scheduled directly into an idle CPU.
 # TYPE local gauge
local 86.45305689716104
 # HELP open_idle % of open layer tasks scheduled into occupied idle CPUs.
 # TYPE open_idle gauge
open_idle 0.0
 # HELP affn_viol % which violated configured policies due to CPU affinity restrictions.
 # TYPE affn_viol gauge
affn_viol 2.332430203793144
 # HELP tctx_err Failures to free task contexts.
 # TYPE tctx_err gauge
tctx_err 0
 # HELP proc_ms CPU time this binary has consumed during the period.
 # TYPE proc_ms gauge
proc_ms 20
 # HELP busy CPU busy % (100% means all CPUs were fully occupied).
 # TYPE busy gauge
busy 0.5294061026085283
 # HELP util CPU utilization % (100% means one CPU was fully occupied).
 # TYPE util gauge
util 27.37195512782239
 # HELP load Sum of weight * duty_cycle for all tasks.
 # TYPE load gauge
load 81.55024768702126
 # HELP layer_util CPU utilization of the layer (100% means one CPU was fully occupied).
 # TYPE layer_util gauge
layer_util{layer_name="immediate"} 0.0
layer_util{layer_name="normal"} 19.340849995024997
layer_util{layer_name="batch"} 8.031105132797393
 # HELP layer_util_frac Fraction of total CPU utilization consumed by the layer.
 # TYPE layer_util_frac gauge
layer_util_frac{layer_name="batch"} 29.34063385422595
layer_util_frac{layer_name="immediate"} 0.0
layer_util_frac{layer_name="normal"} 70.65936614577405
 # HELP layer_load Sum of weight * duty_cycle for tasks in the layer.
 # TYPE layer_load gauge
layer_load{layer_name="immediate"} 0.0
layer_load{layer_name="normal"} 11.14363313258934
layer_load{layer_name="batch"} 70.40661455443191
 # HELP layer_load_frac Fraction of total load consumed by the layer.
 # TYPE layer_load_frac gauge
layer_load_frac{layer_name="normal"} 13.664744680306903
layer_load_frac{layer_name="immediate"} 0.0
layer_load_frac{layer_name="batch"} 86.33525531969309
 # HELP layer_tasks Number of tasks in the layer.
 # TYPE layer_tasks gauge
layer_tasks{layer_name="immediate"} 0
layer_tasks{layer_name="normal"} 490
layer_tasks{layer_name="batch"} 343
 # HELP layer_total Number of scheduling events in the layer.
 # TYPE layer_total gauge
layer_total{layer_name="normal"} 6711
layer_total{layer_name="batch"} 1778
layer_total{layer_name="immediate"} 0
 # HELP layer_local % of scheduling events directly into an idle CPU.
 # TYPE layer_local gauge
layer_local{layer_name="batch"} 69.79752530933632
layer_local{layer_name="immediate"} 0.0
layer_local{layer_name="normal"} 90.86574281031143
 # HELP layer_open_idle % of scheduling events into idle CPUs occupied by other layers.
 # TYPE layer_open_idle gauge
layer_open_idle{layer_name="immediate"} 0.0
layer_open_idle{layer_name="batch"} 0.0
layer_open_idle{layer_name="normal"} 0.0
 # HELP layer_preempt % of scheduling events that preempted other tasks. #
 # TYPE layer_preempt gauge
layer_preempt{layer_name="normal"} 0.0
layer_preempt{layer_name="batch"} 0.0
layer_preempt{layer_name="immediate"} 0.0
 # HELP layer_affn_viol % of scheduling events that violated configured policies due to CPU affinity restrictions.
 # TYPE layer_affn_viol gauge
layer_affn_viol{layer_name="normal"} 2.950379973178364
layer_affn_viol{layer_name="batch"} 0.0
layer_affn_viol{layer_name="immediate"} 0.0
 # HELP layer_cur_nr_cpus Current  # of CPUs assigned to the layer.
 # TYPE layer_cur_nr_cpus gauge
layer_cur_nr_cpus{layer_name="normal"} 50
layer_cur_nr_cpus{layer_name="batch"} 2
layer_cur_nr_cpus{layer_name="immediate"} 50
 # HELP layer_min_nr_cpus Minimum  # of CPUs assigned to the layer.
 # TYPE layer_min_nr_cpus gauge
layer_min_nr_cpus{layer_name="normal"} 0
layer_min_nr_cpus{layer_name="batch"} 0
layer_min_nr_cpus{layer_name="immediate"} 0
 # HELP layer_max_nr_cpus Maximum  # of CPUs assigned to the layer.
 # TYPE layer_max_nr_cpus gauge
layer_max_nr_cpus{layer_name="immediate"} 50
layer_max_nr_cpus{layer_name="normal"} 50
layer_max_nr_cpus{layer_name="batch"} 2
 # EOF
^C19:40:11 [INFO] EXIT: BPF scheduler unregistered
```
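
For readers unfamiliar with the text format shown above, the following is a minimal Python sketch of how gauge families could be rendered into it. The function `render_openmetrics` and its input shape are assumptions made for illustration; scx_layered itself uses a Rust library and pre-registered metrics, which this sketch does not reproduce. Label values are emitted without escaping, which a real exporter would need to handle.

```python
def render_openmetrics(families):
    """Render gauge families as OpenMetrics-style text.

    `families` is a list of (name, help_text, samples) tuples; each sample
    is a (labels_dict, value) pair. Illustrative helper, not scx_layered code.
    """
    lines = []
    for name, help_text, samples in families:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        for labels, value in samples:
            if labels:
                # Labels are rendered as key="value" pairs inside braces.
                body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
                lines.append(f"{name}{{{body}}} {value}")
            else:
                lines.append(f"{name} {value}")
    lines.append("# EOF")  # the exposition ends with an EOF marker
    return "\n".join(lines) + "\n"
```

Passing a family such as `("layer_tasks", "Number of tasks in the layer.", [({"layer_name": "batch"}, 343)])` yields the HELP and TYPE lines followed by one labelled sample, in the same shape as the output above.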

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-25 09:59:49 -08:00
David Vernet
9ce481255b
Merge pull request #102 from sirlucjan/services-update
systemd-services: replace ConditionPathExists with ConditionPathIsDirectory
2024-01-25 09:06:27 -06:00
Piotr Gorski
128fa63cc2
systemd-services: replace ConditionPathExists with ConditionPathIsDirectory
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-01-25 15:12:15 +01:00
David Vernet
911c3c03a2
Merge pull request #100 from sirlucjan/services-readme
Add README.md for systemd services
2024-01-24 09:37:07 -06:00
Piotr Gorski
db5d7c53d8
Update descriptions
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-01-24 16:35:47 +01:00
Piotr Gorski
25cc69b3c4
Add README.md for systemd services
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-01-24 14:56:45 +01:00
Andrea Righi
83c2b414d6
Merge pull request #99 from sched-ext/rustland-fixes
scx_rustland: fixes to improve scheduler stability
2024-01-23 13:51:28 +01:00
Andrea Righi
6d89eceb93 scx_rustland: dispatch tasks only on the global DSQ
Commit c6ada25 ("scx_rustland: use custom pcpu DSQ instead of
SCX_DSQ_LOCAL{_ON}") fixed the race issues with the cpumask, but it also
introduced performance regressions.

Until we figure out the reasons for the performance regressions, simplify
the dispatcher and go back to using only the global DSQ, relying on the
built-in idle cpu selection.

In this way we can still enforce task affinity properly
(`stress-ng --race-sched N` does not crash the scheduler) and we can
also provide a better level of system responsiveness (according to the
results of the stress tests done recently).

The idea of this change is to make the scheduler usable in certain
real-world scenarios (and as bug-free as possible), while we figure out
the performance regressions of the per-CPU DSQ approach, which will
likely be reintroduced later.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-23 13:24:12 +01:00
Andrea Righi
06b5ff3d2f scx_rustland: clarify the logic to determine interactive tasks
No functional change, simply rewrite the code a bit and update the
comment to clarify the logic to detect interactive tasks and apply the
priority boost.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-23 08:28:44 +01:00
Andrea Righi
ab1c4f66a8 scx_rustland: allow to disable the slice boost completely
Allow specifying `-b 0` to completely disable the slice boost logic and
fall back to a standard vruntime-based scheduler with a variable time slice.

In this way interactive tasks will not get over-prioritized over the
other tasks in the system.

Having this option can help to easily track down potential performance
regressions arising from over-prioritizing interactive tasks.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-23 00:34:06 +01:00
Andrea Righi
b4269452fc scx_userland: handle preemption events from higher sched_class
Make sure to re-schedule the user-space scheduler if it's preempted by a
task from a higher priority sched_class.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-23 00:34:06 +01:00
Andrea Righi
2426d1024f scx_rustland: increase max amount of enqueued tasks
As the scheduler is progressing towards a more stable and usable state,
it may be subject to heavy stress tests.

For this reason, bump up the limit of MAX_ENQUEUED_TASKS to 8192 in the
BPF component, to be able to sustain task-intensive stress tests,
reducing the risk of potential scheduling congestion conditions.

The downside is a negligible increase in the memory footprint of the BPF
component, which is worth the cost in order to have improved scheduler
stability.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-21 15:47:35 +01:00
Andrea Righi
28bf96c78e scx_rustland: mitigate unevictable memory page faults
Page faults cannot happen when the user-space scheduler is running,
otherwise we may hit deadlock conditions: a kthread may need to run to
resolve the page fault, but the user-space scheduler is waiting on the
page fault to be resolved => deadlock.

We solved this problem (mostly) in commit 9708a80 ("scx_userland: use a
custom memory allocator to prevent page faults"), introducing a custom
allocator for the user-space scheduler that operates on a pre-allocated
mlocked memory buffer, but there is an exception that can still trigger
page faults: kcompactd.

When memory compaction is enabled, specifically with
vm.compact_unevictable_allowed=1 (which is often the default in many
distributions), kcompactd regularly attempts to compact all memory
zones, such that free memory is available in contiguous blocks where
feasible, including unevictable memory as well.

In the event that kcompactd remaps pages within the user-space
scheduler's address space, it can lead to page faults, resulting in a
potential deadlock.

To prevent this from happening, automatically set
vm.compact_unevictable_allowed=0 when the scheduler is loaded and
restore the previous value when the scheduler is unloaded. In this way
we can prevent kcompactd from touching the unevictable memory associated
with the user-space scheduler.
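
The save, set, and restore sequence described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the scx_rustland implementation: the context manager name `compaction_of_unevictable_disabled` is hypothetical, the sysctl is accessed through its usual /proc/sys path, and writing to it requires root.

```python
from contextlib import contextmanager

# procfs path of the vm.compact_unevictable_allowed sysctl
SYSCTL_PATH = "/proc/sys/vm/compact_unevictable_allowed"


@contextmanager
def compaction_of_unevictable_disabled(path=SYSCTL_PATH):
    """Write 0 to the sysctl for the duration of the block, then restore it."""
    with open(path) as f:
        previous = f.read().strip()  # remember the value found at load time
    with open(path, "w") as f:
        f.write("0\n")
    try:
        yield previous
    finally:
        # Restore on normal exit and on exceptions alike.
        with open(path, "w") as f:
            f.write(previous + "\n")
```

The scheduler's main loop would run inside the `with` block. The `path` parameter exists only so the logic can be exercised against an ordinary file.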

Keep in mind that this is not a fully bulletproof solution: something
else in the system may still set vm.compact_unevictable_allowed=1 while
the scheduler is running, re-enabling the risk of deadlock.

Ideally we would need a way to mark the user-space scheduler memory as
"really unevictable", or a proper kernel ABI to instruct kcompactd to
exclude certain tasks (or better, cgroups) from its proactive memory
compaction actions, but until then, this seems to be the best way to
mitigate this issue.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-21 15:47:35 +01:00
David Vernet
c6ada251ef scx_rustland: use custom pcpu DSQ instead of SCX_DSQ_LOCAL{_ON}
We still don't have a reliable and non-racy way to manage cpumasks from
the user-space scheduler, so it is quite hard for the scheduler to
enforce the proper CPU affinity behavior.

Despite checking the cpumask in the BPF part, tasks may still be
assigned to a CPU that they cannot use, triggering scheduler errors.

For example, it is really easy to crash the scheduler with a simple CPU
affinity stress test (`stress-ng --race-sched 8 --timeout 5`):

  14:51:28 [WARN] FAIL: SCX_DSQ_LOCAL[_ON] verdict target cpu 1 not allowed for stress-ng-race-[567048] (err=1024)

To prevent this issue from happening, create a custom DSQ for each CPU
available in the system and use these per-CPU DSQs to dispatch all the
tasks processed by the user-space scheduler, including the user-space
scheduler itself.

Then consume these DSQs from the .dispatch() callback of the
respective CPU, to transfer all the tasks to the consuming CPU's local
DSQ, preventing the cpumask race condition encountered using
SCX_DSQ_LOCAL_ON.
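
The two-stage flow described above can be pictured with a toy Python model. It is not the sched_ext API and does not model the cpumask race itself; `PerCpuDsqModel` and its method names are hypothetical. It only shows that tasks are parked on a per-CPU custom queue and moved to the local queue by the CPU that consumes them.

```python
from collections import deque


class PerCpuDsqModel:
    """Toy model of per-CPU dispatch queues (not the sched_ext API)."""

    def __init__(self, nr_cpus):
        self.pcpu_dsq = [deque() for _ in range(nr_cpus)]   # custom per-CPU DSQs
        self.local_dsq = [deque() for _ in range(nr_cpus)]  # each CPU's local DSQ

    def dispatch_to(self, task, cpu):
        """User-space scheduler's decision: park the task on cpu's custom DSQ."""
        self.pcpu_dsq[cpu].append(task)

    def dispatch_callback(self, cpu):
        """Runs on `cpu`: move everything from its custom DSQ to its local DSQ."""
        moved = 0
        while self.pcpu_dsq[cpu]:
            self.local_dsq[cpu].append(self.pcpu_dsq[cpu].popleft())
            moved += 1
        return moved
```

Because the transfer to the local queue happens on the consuming CPU, no other CPU ever writes into that local queue in this model.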

With this patch applied the `stress-ng --race-sched N` stress test can
be executed successfully (even with large values of N) without causing
the scheduler to crash.

Signed-off-by: David Vernet <void@manifault.com>
[ arighi: kick target cpu to improve responsiveness, update comments ]
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-21 15:47:35 +01:00
David Vernet
497229a590
Merge pull request #98 from jordalgo/cargo-toml 2024-01-20 11:18:18 -06:00
Jordan Rome
9f9a97a97f Update descriptions in cargo toml files 2024-01-19 18:19:46 -08:00
David Vernet
0ac9d40e43
Merge pull request #97 from sirlucjan/services-fixes
Set the correct value for sched-ext journald namespace
2024-01-19 14:46:24 -06:00
Piotr Gorski
9848ab4183
Increase log size to 25M
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-01-19 21:30:33 +01:00
Piotr Gorski
1a1290d54c
Simplify the location of the journal-sched-ext file
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-01-19 19:13:28 +01:00
Piotr Gorski
b6650fa4dc
Set the correct value for sched-ext journald namespace
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-01-19 18:22:47 +01:00
Andrea Righi
af11da2661
Merge pull request #95 from sched-ext/github-ci
ci: test the schedulers with the latest sched-ext kernel
2024-01-18 21:16:27 +01:00
Andrea Righi
c730e0558f ci: test the schedulers with the latest sched-ext kernel
Instead of downloading a precompiled sched-ext enabled kernel from the
Ubuntu ppa, fetch the latest kernel directly from the sched-ext git
repository and recompile it on-the-fly using virtme-ng.

This allows us to get rid of the Ubuntu ppa dependency, takes potential
Ubuntu-specific patches out of the equation, and ensures testing all the
schedulers with the most up-to-date sched-ext kernel (which should also
help to detect potential kernel-related issues in advance).

The downside is that the CI runs will take a bit longer now, because we
are recompiling the kernel from scratch. However, the kernel built with
virtme-ng is relatively quick to compile and includes all the sched-ext
features required for testing.

It's worth noting that this method aligns with the current sched-ext
kernel CI, where we test only the in-kernel schedulers (as intended).

This change allows us to extend the test coverage, using the same kernel to
also test the schedulers included in this repository.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-18 20:51:59 +01:00
David Vernet
dd07c442fc
Merge pull request #93 from sirlucjan/services-improvements
Set log size to 10M
2024-01-17 17:43:17 -06:00
Piotr Gorski
8c61d38743
Drop unneeded default value
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-01-18 00:23:04 +01:00
Piotr Gorski
1abd319cae
Set log size to 10M
Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
2024-01-18 00:03:07 +01:00
Andrea Righi
24ef0f6c00
Merge pull request #94 from sched-ext/scx-rustland-smt-improvements
scx-rustland: SMT improvements
2024-01-17 21:01:26 +01:00