Commit Graph

89 Commits

Author SHA1 Message Date
David Vernet
e857dd90ab
layered: Use TLS map instead of hash map
In scx_layered, we're using a BPF_MAP_TYPE_HASH map (indexed by pid)
rather than a BPF_MAP_TYPE_TASK_STORAGE, to track local storage for a
task. As far as I can tell, there's no reason we need to be doing this.
We never access the map from user space, and we're even passing a
struct task_struct * to a helper subprog to look up the task context
rather than only doing it by pid.

Using a hashmap is error prone for this because we end up having to
manually track lifecycles for entries in the map rather than relying on
BPF to do it for us. For example, BPF will automatically free a task's
entry from the map when it exits. Let's just use TLS here rather than a
hashmap to avoid issues from this (e.g. we've observed the scheduler
getting evicted because we're accessing a stale map entry after a task
has been destroyed).

Reported-by: Valentin Andrei <vandrei@meta.com>
Signed-off-by: David Vernet <void@manifault.com>
2024-03-27 20:14:27 -05:00
David Vernet
602ec5ada3
layered: Make helper functions static
lookup_task_ctx(), lookup_task_ctx_may_fail(), and lookup_layer()
currently don't have the static keyword, so BPF may treat them as a
global function. We don't actually want these to be global, so let's
make them static to avoid confusing the verifier.

Signed-off-by: David Vernet <void@manifault.com>
2024-03-26 15:08:32 -05:00
David Vernet
3cda1bc690
Merge pull request #187 from sched-ext/layered-updates
scx_layered: Make config json assume default vaules for unspecified fields
2024-03-13 17:15:18 -05:00
Tejun Heo
76fb0fdd8f scx_layered: Make config json assume default vaules for unspecified fields
This makes writing configs and allows introducing new fields without
breaking existing configs.
2024-03-13 11:10:38 -10:00
Tejun Heo
6048992ca7
Merge pull request #185 from sched-ext/layered-updates
scx_layered: Implement layer properties `exclusive` and `min_exec_us`
2024-03-13 09:59:37 -10:00
Tejun Heo
60b346c1fc scx_layered: Add more comments 2024-03-13 09:56:28 -10:00
Tejun Heo
a9457a408e scx_layered: stat reporting updates 2024-03-12 10:48:21 -10:00
Tejun Heo
a642fc873b scx_layered: Fix stat reporting
GSTAT_TASK_CTX_FREE_FAILED should report total while EXCL_* should report
delta pct. Fix them.
2024-03-12 10:25:51 -10:00
Tejun Heo
58cbc5361d scx_layered: warn if omitted stats aren't zero 2024-03-12 09:29:31 -10:00
Tejun Heo
37006d1bc1 scx_layered: Use saturating sub when reading system stats, other misc changes
Sometimes io_wait time goes in the wrong direction. Use saturating sub.
2024-03-12 06:14:06 -10:00
Tejun Heo
342a4946af scx_layered: Better pct formatting when printing stats 2024-03-11 22:18:03 -10:00
Tejun Heo
be2102775b scx_layered: Implement min_exec_us option
which can be used to penalize tasks which wake up very frequently without
doing much.
2024-03-11 22:13:11 -10:00
Tejun Heo
0c62b24993 scx_layered: Implement exclusive property
A task in an exclusive grouped or open layer occupied a whole core - the
sibling CPU is kept idle.
2024-03-11 18:27:16 -10:00
Tejun Heo
76cc337d78 scx_layered: Add exclusive option to Open and Grouped layers
Actual implementation isn't done yet.
2024-03-11 12:07:03 -10:00
Jordan Rome
ffc7b7dc4a Fetch and build bpftool by default
This pairs with the new default behavior to fetch and build libbpf
and is mostly being used so we can use the latest bpftool and libbpf.
2024-03-11 10:00:01 -07:00
Jordan Rome
499924ead8 Add libbpf as a submodule
This is to potentinally reduce issues with folks
using different versions of libbpf at runtime.

This also:
- makes static linking of libbpf the default
- adds steps in `meson setup` to fetch libbpf and make it
2024-03-01 12:39:35 -08:00
Tejun Heo
4dc77f8ddf
Merge pull request #149 from davemarchevsky/davemarchevsky_nice_equals
scx_layered: Add MATCH_NICE_EQUALS match kind
2024-02-22 06:38:17 -10:00
Dave Marchevsky
9f510f18cd scx_layered: Add MATCH_NICE_EQUALS match kind
I have a usecase where specific nice values are used to bucket tasks
into groups that are handled separately by different `scx_layered`
policies, with no implications of relative priority between niceness X,
X + 1, X - 1, etc. In other words, nicevals are used as simple tags in
this scenario.

If we wanted to treat a specific niceness this way e.g. `11`, we could
do so with AND'd MATCH_NICE_{ABOVE,BELOW} like so:

```json
  "matches" : [
    [
      {
        "NiceAbove": 10
      },
      {
        "NiceBelow": 12
      },
    ],
  ],
```

But this is unnecessarily verbose and doesn't communicate the intent of
the match very well. Adding a `NiceEquals` matcher simplifies the
config and makes intent obvious:

```json
  "matches" : [
    [
      {
        "NiceEquals": 11
      },
    ],
  ],
```

This PR adds support for such a matcher.

Also, rename `layer_match.nice_above_or_below` to just
`layer_match.nice`, as the former doesn't describe the newly-added
matcher's use of the field. It's still obvious that `layer_match.nice`
is relevant to MATCH_NICE_{ABOVE, BELOW, EQUALS}.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
2024-02-22 04:08:07 -08:00
David Vernet
615b594e1c
layered: Don't refresh cpumasks before attaching
As mentioned in the previous commit, for some reason we're sometimes
(non-deterministically) not seeing the updated cpumask / layer values in
BPF if we initialize the cpumasks here before attaching. Let's undo this
for now so to avoid observing buggy behavior, until we figure it out.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-21 19:19:45 -06:00
David Vernet
68d317079a
Revert "layered: Set layered cpumask in scheduler init call"
This reverts commit 56ff3437a2.

For some reason we seem to be non-deterministically failing to see the
updated layer values in BPF if we initialize before attaching. Let's
just undo this specific part so that we can unblock this being broken,
and we can figure it out async.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-21 19:17:19 -06:00
David Vernet
31df8fbd09
layered: Consume from layer with cpumask in layered_dispatch
Currently, in layered_dispatch, we do the following:

1. Iterate over all preempt=true layers, and first try to consume from
   them.

2. Iterate over all confined layers, and consume from them if we find a
   layer with a cpumask that contains the consuming CPU.

3. Iterate over all grouped and open layers and consume from them in
   ordered sequence.

In (2), we're only iterating over confined layers, but we should also be
iterating over grouped layers. Otherwise, despite a consuming CPU being
allocated to a specific grouped layer, the CPU will consume from
whichever grouped or open layer has a task that's ready to run.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-21 15:38:23 -06:00
David Vernet
56ff3437a2
layered: Set layered cpumask in scheduler init call
In layered_init, we're currently setting all bits in every layers'
cpumask, and then asynchronously updating the cpumasks at later time to
reflect their actual values at runtime. Now that we're updating the
layered code to initialize the cpumasks before we attach the scheduler,
we can instead have the init path actually refresh and initialize the
cpumasks directly.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-21 15:38:23 -06:00
David Vernet
1f834e7f94
layered: Initialize layers before attaching scheduler
We currently have a bug in layered wherein we could fail to propagate
layer updates from user space to kernel space if a layer is never
adjusted after it's first initialized. For example, in the following
configuration:

[
	{
		"name": "workload.slice",
		"comment": "main workload slice",
		"matches": [
			[
				{
					"CgroupPrefix": "workload.slice/"
				}
			]
		],
		"kind": {
			"Grouped": {
				"cpus_range": [30, 30],
				"util_range": [
					0.0,
					1.0
				],
				"preempt": false
			}
		}
	},
	{
		"name": "normal",
		"comment": "the rest",
		"matches": [
			[]
		],
		"kind": {
			"Grouped": {
				"cpus_range": [2, 2],
				"util_range": [
					0.0,
					1.0
				],
				"preempt": false
			}
		}
	}
]

Both layers are static, and need only be resized a single time, so the
configuration would never be propagated to the kernel due to us never
calling update_bpf_layer_cpumask(). Let's instead have the
initialization propagate changes to the skeleton before we attach the
scheduler.

This has the advantage both of fixing the bug mentioned above where a
static configuration is never propagated to the kernel, and that we
don't have a short period when the scheduler is first attached where we
don't make optimal scheduling decisions due to the layer resizing not
being propagated.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-21 15:38:21 -06:00
Jordan Rome
7c32acece0 Add libbpf logging to the rust schedulers
This is to get better logs when failing to load, attach, etc.
2024-02-20 15:17:10 -08:00
Tejun Heo
105dc36b8f scx_layered: Use scx_utils::user_exit_info
Instead of the bespoke implementation. This also makes scx_layered to print
out debug dump if exists.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-31 10:54:20 -10:00
Tejun Heo
4ee8104a6d
Merge pull request #114 from dschatzberg/local_avoid_enqueue
scx_layered: dispatch from select_cpu if possible
2024-01-31 08:33:26 -10:00
Dan Schatzberg
11e487c165 scx_layered: dispatch from select_cpu if possible
If we are doing local dispatch, we can avoid enqueue() altogether by
dispatching from select_cpu()

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-31 09:54:26 -08:00
Jordan Rome
1b3a9a1e72 [scx_layered] downgrade prometheus-client
This library at version 22 is not available in fedora:
https://src.fedoraproject.org/rpms/rust-prometheus-client

Rather than bothering the maintainer, let's just downgrade here.
2024-01-31 04:36:01 -08:00
Dan Schatzberg
ab5635ff6d scx_layered: Grab idle_smtmask a bit later
This is a really minor optimization, but we don't need idle_smtmask to
schedule pinned tasks, so defer it so the nr_cpus_allowed == 1 path is
marginally faster.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-29 08:16:37 -08:00
Dan Schatzberg
8c9e65d880 scx_layered: Remove unnecessary idle_cpumask
idle_cpumask isn't used at all in pick_idle_cpu_from. The only need for
these cpumasks is to check if prev_cpu is a wholly idle CPU (and we only
do this when smt_enabled). idle_smtmask is sufficient for that check.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-29 08:16:37 -08:00
Dan Schatzberg
142b6230b2 scx_layered: Fix AFFN_VIOL stat bump
Prior to this patch, we only bump LSTAT_AFFN_BIOL when the target cpu
was idle, but in both cases it should be counted as AFFN_VIOL.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-26 13:13:16 -08:00
Tejun Heo
988b7d13c1 Bump versions
scx_exit_info change doesn't require code to be updated but breaks binary
compatbility. Bump versions and cut a new release.
2024-01-25 09:01:23 -10:00
Dan Schatzberg
7f9548eb34 scx_layered: Add support for OpenMetrics format
Currently scx_layered outputs statistics periodically as info! logs. The
format of this is largely unstructured and mostly suitable for running
scx_layered interactively (e.g. observing its behavior on the command
line or via logs after the fact).

In order to run scx_layered at larger scale, it's desireable to have
statistics output in some format that is amenable to being ingested into
monitoring databases (e.g. Prometheseus). This allows collection of
stats across many machines.

This commit adds a command line flag (-o) that outputs statistics to
stdout in OpenMetrics format instead of the normal log mechanism.
OpenMetrics has a public format
specification (https://github.com/OpenObservability/OpenMetrics) and is
in use by many projects.

The library for producing OpenMetrics metrics is lightweight but does
induce some changes. Primarily, metrics need to be pre-registered (see
OpenMetricsStats::new()).

Without -o, the output looks as before, for example:

```
19:39:54 [INFO] CPUs: online/possible=52/52 nr_cores=26
19:39:54 [INFO] Layered Scheduler Attached
19:39:56 [INFO] tot=   9912 local=76.71 open_idle= 0.00 affn_viol= 2.63 tctx_err=0 proc=21ms
19:39:56 [INFO] busy=  1.3 util=   65.2 load=    263.4 fallback_cpu=  1
19:39:56 [INFO]   batch    : util/frac=   49.7/ 76.3 load/frac=    252.0: 95.7 tasks=   458
19:39:56 [INFO]              tot=   2842 local=45.04 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:56 [INFO]              cpus=  2 [  0,  2] 04000001 00000000
19:39:56 [INFO]   immediate: util/frac=    0.0/  0.0 load/frac=      0.0:  0.0 tasks=     0
19:39:56 [INFO]              tot=      0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:56 [INFO]              cpus= 50 [  0, 50] fbfffffe 000fffff
19:39:56 [INFO]   normal   : util/frac=   15.4/ 23.7 load/frac=     11.4:  4.3 tasks=   556
19:39:56 [INFO]              tot=   7070 local=89.43 open_idle= 0.00 preempt= 0.00 affn_viol= 3.69
19:39:56 [INFO]              cpus= 50 [  0, 50] fbfffffe 000fffff
19:39:58 [INFO] tot=   7091 local=84.91 open_idle= 0.00 affn_viol= 2.64 tctx_err=0 proc=21ms
19:39:58 [INFO] busy=  0.6 util=   31.2 load=    107.1 fallback_cpu=  1
19:39:58 [INFO]   batch    : util/frac=   18.3/ 58.5 load/frac=     93.9: 87.7 tasks=   589
19:39:58 [INFO]              tot=   2011 local=60.67 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:58 [INFO]              cpus=  2 [  2,  2] 04000001 00000000
19:39:58 [INFO]   immediate: util/frac=    0.0/  0.0 load/frac=      0.0:  0.0 tasks=     0
19:39:58 [INFO]              tot=      0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:58 [INFO]              cpus= 50 [ 50, 50] fbfffffe 000fffff
19:39:58 [INFO]   normal   : util/frac=   13.0/ 41.5 load/frac=     13.2: 12.3 tasks=   650
19:39:58 [INFO]              tot=   5080 local=94.51 open_idle= 0.00 preempt= 0.00 affn_viol= 3.68
19:39:58 [INFO]              cpus= 50 [ 50, 50] fbfffffe 000fffff
^C19:39:59 [INFO] EXIT: BPF scheduler unregistered
```

With -o passed, the output is in OpenMetrics format:

```
19:40:08 [INFO] CPUs: online/possible=52/52 nr_cores=26
19:40:08 [INFO] Layered Scheduler Attached
 # HELP total Total scheduling events in the period.
 # TYPE total gauge
total 8489
 # HELP local % that got scheduled directly into an idle CPU.
 # TYPE local gauge
local 86.45305689716104
 # HELP open_idle % of open layer tasks scheduled into occupied idle CPUs.
 # TYPE open_idle gauge
open_idle 0.0
 # HELP affn_viol % which violated configured policies due to CPU affinity restrictions.
 # TYPE affn_viol gauge
affn_viol 2.332430203793144
 # HELP tctx_err Failures to free task contexts.
 # TYPE tctx_err gauge
tctx_err 0
 # HELP proc_ms CPU time this binary has consumed during the period.
 # TYPE proc_ms gauge
proc_ms 20
 # HELP busy CPU busy % (100% means all CPUs were fully occupied).
 # TYPE busy gauge
busy 0.5294061026085283
 # HELP util CPU utilization % (100% means one CPU was fully occupied).
 # TYPE util gauge
util 27.37195512782239
 # HELP load Sum of weight * duty_cycle for all tasks.
 # TYPE load gauge
load 81.55024768702126
 # HELP layer_util CPU utilization of the layer (100% means one CPU was fully occupied).
 # TYPE layer_util gauge
layer_util{layer_name="immediate"} 0.0
layer_util{layer_name="normal"} 19.340849995024997
layer_util{layer_name="batch"} 8.031105132797393
 # HELP layer_util_frac Fraction of total CPU utilization consumed by the layer.
 # TYPE layer_util_frac gauge
layer_util_frac{layer_name="batch"} 29.34063385422595
layer_util_frac{layer_name="immediate"} 0.0
layer_util_frac{layer_name="normal"} 70.65936614577405
 # HELP layer_load Sum of weight * duty_cycle for tasks in the layer.
 # TYPE layer_load gauge
layer_load{layer_name="immediate"} 0.0
layer_load{layer_name="normal"} 11.14363313258934
layer_load{layer_name="batch"} 70.40661455443191
 # HELP layer_load_frac Fraction of total load consumed by the layer.
 # TYPE layer_load_frac gauge
layer_load_frac{layer_name="normal"} 13.664744680306903
layer_load_frac{layer_name="immediate"} 0.0
layer_load_frac{layer_name="batch"} 86.33525531969309
 # HELP layer_tasks Number of tasks in the layer.
 # TYPE layer_tasks gauge
layer_tasks{layer_name="immediate"} 0
layer_tasks{layer_name="normal"} 490
layer_tasks{layer_name="batch"} 343
 # HELP layer_total Number of scheduling events in the layer.
 # TYPE layer_total gauge
layer_total{layer_name="normal"} 6711
layer_total{layer_name="batch"} 1778
layer_total{layer_name="immediate"} 0
 # HELP layer_local % of scheduling events directly into an idle CPU.
 # TYPE layer_local gauge
layer_local{layer_name="batch"} 69.79752530933632
layer_local{layer_name="immediate"} 0.0
layer_local{layer_name="normal"} 90.86574281031143
 # HELP layer_open_idle % of scheduling events into idle CPUs occupied by other layers.
 # TYPE layer_open_idle gauge
layer_open_idle{layer_name="immediate"} 0.0
layer_open_idle{layer_name="batch"} 0.0
layer_open_idle{layer_name="normal"} 0.0
 # HELP layer_preempt % of scheduling events that preempted other tasks. #
 # TYPE layer_preempt gauge
layer_preempt{layer_name="normal"} 0.0
layer_preempt{layer_name="batch"} 0.0
layer_preempt{layer_name="immediate"} 0.0
 # HELP layer_affn_viol % of scheduling events that violated configured policies due to CPU affinity restrictions.
 # TYPE layer_affn_viol gauge
layer_affn_viol{layer_name="normal"} 2.950379973178364
layer_affn_viol{layer_name="batch"} 0.0
layer_affn_viol{layer_name="immediate"} 0.0
 # HELP layer_cur_nr_cpus Current  # of CPUs assigned to the layer.
 # TYPE layer_cur_nr_cpus gauge
layer_cur_nr_cpus{layer_name="normal"} 50
layer_cur_nr_cpus{layer_name="batch"} 2
layer_cur_nr_cpus{layer_name="immediate"} 50
 # HELP layer_min_nr_cpus Minimum  # of CPUs assigned to the layer.
 # TYPE layer_min_nr_cpus gauge
layer_min_nr_cpus{layer_name="normal"} 0
layer_min_nr_cpus{layer_name="batch"} 0
layer_min_nr_cpus{layer_name="immediate"} 0
 # HELP layer_max_nr_cpus Maximum  # of CPUs assigned to the layer.
 # TYPE layer_max_nr_cpus gauge
layer_max_nr_cpus{layer_name="immediate"} 50
layer_max_nr_cpus{layer_name="normal"} 50
layer_max_nr_cpus{layer_name="batch"} 2
 # EOF
^C19:40:11 [INFO] EXIT: BPF scheduler unregistered
```

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-25 09:59:49 -08:00
Jordan Rome
9f9a97a97f Update descriptions in cargo toml files 2024-01-19 18:19:46 -08:00
Tejun Heo
942b0269b8 Bump versions
After updates to reflect the updated init and direct dispatch API, the
schedulers aren't compatible with older kernels. Bump versions and publish
releases.
2024-01-08 18:49:54 -10:00
Tejun Heo
552b75a9c7 scx: Build fix after kernel update
In the latest kernel, sched_ext API has changed in two areas:

- ops.prep_enable/cancel_enable/enable/disable() replaced with
  ops.init_task/enable/disable/exit_task().

- scx_bpf_dispatch() can now be called from ops.select_cpu(). Also,
  SCX_ENQ_LOCAL flag is removed. Instead, users can call
  scx_bpf_select_cpu_dfl() from ops.select_cpu() and use the @is_idle out
  param value to determine whether to dispatch directly.

This commit updates all schedules so that they build.

- Init functions renamed / merged / split.

- ops.select_cpu() is added to several schedulers and local direct
  disptching logic is moved there.

This is the minimum update which is need to make the schedulers build and
work. It needs further update to e.g. move vtime udpates to ops.enable().
2024-01-08 14:48:24 -10:00
Jordan Rome
661ea57c5c bump scx_rusty and scx_layered
These were supposed to be bumped in this commit:
fed1dae9da
2024-01-04 13:57:29 -08:00
Jordan Rome
5bacefcdbe Add README files for each rust scheduler
This because each scheduler has it's own Rust Crate
and it's better if they had a README associated with each one.

https://crates.io/crates/scx_layered
2024-01-04 07:35:44 -08:00
Jordan Rome
e9a9d32ab6 Restructure scheds folder names
- combine c and kernel-examples as it's confusing to have both
- rename 'rust-user' and 'c-user' to just 'rust' and 'c', which is simpler
- update and fix sync-to-kernel.sh
2023-12-17 13:14:31 -08:00