Prior to this patch, we only bump LSTAT_AFFN_BIOL when the target cpu
was idle, but in both cases it should be counted as AFFN_VIOL.
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
26ae1b0356
changed scx_exit_info which requires us to rebuild with a new vmlinux.h
This patch updates vmlinux.h to the current sched_ext branch in the
github repo.
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Currently scx_layered outputs statistics periodically as info! logs. The
format of this is largely unstructured and mostly suitable for running
scx_layered interactively (e.g. observing its behavior on the command
line or via logs after the fact).
In order to run scx_layered at larger scale, it's desireable to have
statistics output in some format that is amenable to being ingested into
monitoring databases (e.g. Prometheseus). This allows collection of
stats across many machines.
This commit adds a command line flag (-o) that outputs statistics to
stdout in OpenMetrics format instead of the normal log mechanism.
OpenMetrics has a public format
specification (https://github.com/OpenObservability/OpenMetrics) and is
in use by many projects.
The library for producing OpenMetrics metrics is lightweight but does
induce some changes. Primarily, metrics need to be pre-registered (see
OpenMetricsStats::new()).
Without -o, the output looks as before, for example:
```
19:39:54 [INFO] CPUs: online/possible=52/52 nr_cores=26
19:39:54 [INFO] Layered Scheduler Attached
19:39:56 [INFO] tot= 9912 local=76.71 open_idle= 0.00 affn_viol= 2.63 tctx_err=0 proc=21ms
19:39:56 [INFO] busy= 1.3 util= 65.2 load= 263.4 fallback_cpu= 1
19:39:56 [INFO] batch : util/frac= 49.7/ 76.3 load/frac= 252.0: 95.7 tasks= 458
19:39:56 [INFO] tot= 2842 local=45.04 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:56 [INFO] cpus= 2 [ 0, 2] 04000001 00000000
19:39:56 [INFO] immediate: util/frac= 0.0/ 0.0 load/frac= 0.0: 0.0 tasks= 0
19:39:56 [INFO] tot= 0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:56 [INFO] cpus= 50 [ 0, 50] fbfffffe 000fffff
19:39:56 [INFO] normal : util/frac= 15.4/ 23.7 load/frac= 11.4: 4.3 tasks= 556
19:39:56 [INFO] tot= 7070 local=89.43 open_idle= 0.00 preempt= 0.00 affn_viol= 3.69
19:39:56 [INFO] cpus= 50 [ 0, 50] fbfffffe 000fffff
19:39:58 [INFO] tot= 7091 local=84.91 open_idle= 0.00 affn_viol= 2.64 tctx_err=0 proc=21ms
19:39:58 [INFO] busy= 0.6 util= 31.2 load= 107.1 fallback_cpu= 1
19:39:58 [INFO] batch : util/frac= 18.3/ 58.5 load/frac= 93.9: 87.7 tasks= 589
19:39:58 [INFO] tot= 2011 local=60.67 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:58 [INFO] cpus= 2 [ 2, 2] 04000001 00000000
19:39:58 [INFO] immediate: util/frac= 0.0/ 0.0 load/frac= 0.0: 0.0 tasks= 0
19:39:58 [INFO] tot= 0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:58 [INFO] cpus= 50 [ 50, 50] fbfffffe 000fffff
19:39:58 [INFO] normal : util/frac= 13.0/ 41.5 load/frac= 13.2: 12.3 tasks= 650
19:39:58 [INFO] tot= 5080 local=94.51 open_idle= 0.00 preempt= 0.00 affn_viol= 3.68
19:39:58 [INFO] cpus= 50 [ 50, 50] fbfffffe 000fffff
^C19:39:59 [INFO] EXIT: BPF scheduler unregistered
```
With -o passed, the output is in OpenMetrics format:
```
19:40:08 [INFO] CPUs: online/possible=52/52 nr_cores=26
19:40:08 [INFO] Layered Scheduler Attached
# HELP total Total scheduling events in the period.
# TYPE total gauge
total 8489
# HELP local % that got scheduled directly into an idle CPU.
# TYPE local gauge
local 86.45305689716104
# HELP open_idle % of open layer tasks scheduled into occupied idle CPUs.
# TYPE open_idle gauge
open_idle 0.0
# HELP affn_viol % which violated configured policies due to CPU affinity restrictions.
# TYPE affn_viol gauge
affn_viol 2.332430203793144
# HELP tctx_err Failures to free task contexts.
# TYPE tctx_err gauge
tctx_err 0
# HELP proc_ms CPU time this binary has consumed during the period.
# TYPE proc_ms gauge
proc_ms 20
# HELP busy CPU busy % (100% means all CPUs were fully occupied).
# TYPE busy gauge
busy 0.5294061026085283
# HELP util CPU utilization % (100% means one CPU was fully occupied).
# TYPE util gauge
util 27.37195512782239
# HELP load Sum of weight * duty_cycle for all tasks.
# TYPE load gauge
load 81.55024768702126
# HELP layer_util CPU utilization of the layer (100% means one CPU was fully occupied).
# TYPE layer_util gauge
layer_util{layer_name="immediate"} 0.0
layer_util{layer_name="normal"} 19.340849995024997
layer_util{layer_name="batch"} 8.031105132797393
# HELP layer_util_frac Fraction of total CPU utilization consumed by the layer.
# TYPE layer_util_frac gauge
layer_util_frac{layer_name="batch"} 29.34063385422595
layer_util_frac{layer_name="immediate"} 0.0
layer_util_frac{layer_name="normal"} 70.65936614577405
# HELP layer_load Sum of weight * duty_cycle for tasks in the layer.
# TYPE layer_load gauge
layer_load{layer_name="immediate"} 0.0
layer_load{layer_name="normal"} 11.14363313258934
layer_load{layer_name="batch"} 70.40661455443191
# HELP layer_load_frac Fraction of total load consumed by the layer.
# TYPE layer_load_frac gauge
layer_load_frac{layer_name="normal"} 13.664744680306903
layer_load_frac{layer_name="immediate"} 0.0
layer_load_frac{layer_name="batch"} 86.33525531969309
# HELP layer_tasks Number of tasks in the layer.
# TYPE layer_tasks gauge
layer_tasks{layer_name="immediate"} 0
layer_tasks{layer_name="normal"} 490
layer_tasks{layer_name="batch"} 343
# HELP layer_total Number of scheduling events in the layer.
# TYPE layer_total gauge
layer_total{layer_name="normal"} 6711
layer_total{layer_name="batch"} 1778
layer_total{layer_name="immediate"} 0
# HELP layer_local % of scheduling events directly into an idle CPU.
# TYPE layer_local gauge
layer_local{layer_name="batch"} 69.79752530933632
layer_local{layer_name="immediate"} 0.0
layer_local{layer_name="normal"} 90.86574281031143
# HELP layer_open_idle % of scheduling events into idle CPUs occupied by other layers.
# TYPE layer_open_idle gauge
layer_open_idle{layer_name="immediate"} 0.0
layer_open_idle{layer_name="batch"} 0.0
layer_open_idle{layer_name="normal"} 0.0
# HELP layer_preempt % of scheduling events that preempted other tasks. #
# TYPE layer_preempt gauge
layer_preempt{layer_name="normal"} 0.0
layer_preempt{layer_name="batch"} 0.0
layer_preempt{layer_name="immediate"} 0.0
# HELP layer_affn_viol % of scheduling events that violated configured policies due to CPU affinity restrictions.
# TYPE layer_affn_viol gauge
layer_affn_viol{layer_name="normal"} 2.950379973178364
layer_affn_viol{layer_name="batch"} 0.0
layer_affn_viol{layer_name="immediate"} 0.0
# HELP layer_cur_nr_cpus Current # of CPUs assigned to the layer.
# TYPE layer_cur_nr_cpus gauge
layer_cur_nr_cpus{layer_name="normal"} 50
layer_cur_nr_cpus{layer_name="batch"} 2
layer_cur_nr_cpus{layer_name="immediate"} 50
# HELP layer_min_nr_cpus Minimum # of CPUs assigned to the layer.
# TYPE layer_min_nr_cpus gauge
layer_min_nr_cpus{layer_name="normal"} 0
layer_min_nr_cpus{layer_name="batch"} 0
layer_min_nr_cpus{layer_name="immediate"} 0
# HELP layer_max_nr_cpus Maximum # of CPUs assigned to the layer.
# TYPE layer_max_nr_cpus gauge
layer_max_nr_cpus{layer_name="immediate"} 50
layer_max_nr_cpus{layer_name="normal"} 50
layer_max_nr_cpus{layer_name="batch"} 2
# EOF
^C19:40:11 [INFO] EXIT: BPF scheduler unregistered
```
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Commit c6ada25 ("scx_rustland: use custom pcpu DSQ instead of
SCX_DSQ_LOCAL{_ON}") fixed the race issues with the cpumask, but it also
introduced performance regressions.
Until we figure out the reasons of the performance regressions, simplify
the dispatcher and go back at using only the global DSQ, relying on the
built-in idle cpu selection.
In this way we can still enforce task affinity properly
(`stress-ng --race-sched N` does not crash the scheduler) and we can
also provide a better level of system responsiveness (according to the
results of the stress tests done recently).
The idea of this change is to make the scheduler usable in certain
real-world scenarios (and as bug-free as possible), while we figure out
the performance regressions of the per-CPU DSQ approach, that will
likely be re-introduced later on in the future.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
No functional change, simply rewrite the code a bit and update the
comment to clarify the logic to detect interactive tasks and apply the
priority boost.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Allow to specify `-b 0` to completely disable the slice boost logic and
fallback to standard vruntime-based scheduler with variable time slice.
In this way interactive tasks will not get over-prioritized over the
other tasks in the system.
Having this option can help to easily track down potential performance
regressions arising for over-prioritizing interactive tasks.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Make sure to re-schedule the user-space scheduler if it's preempted by a
task from a higher priority sched_class.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
As the scheduler is progressing towards a more stable and usable state,
it may be subject to heavy stress tests.
For this reason, bump up the limit of MAX_ENQUEUED_TASKS to 8192 in the
BPF component, to be able to sustain task-intensive stress tests,
reducing the risk of potential scheduling congestion conditions.
The downside is a negligible increase in the memory footprint of the BPF
component, that is worth the cost in order to have an improved scheduler
stability.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Page faults cannot happen when the user-space scheduler is running,
otherwise we may hit deadlock conditions: a kthread may need to run to
resolve the page fault, but the user-space scheduler is waiting on the
page fault to be resolved => deadlock.
We solved this problem (mostly) in commit 9708a80 ("scx_userland: use a
custom memory allocator to prevent page faults"), introducing a custom
allocator for the user-space scheduler that operates on a pre-allocated
mlocked memory buffer, but there is an exception that can still trigger
page faults: kcompactd.
When memory compaction is enabled, specifically with
vm.compact_unevictable_allowed=1 (which is often the default in many
distributions), kcompactd regularly attempts to compact all memory
zones, such that free memory is available in contiguous blocks where
feasible, including unevictable memory as well.
In the event that kcompactd remaps pages within the user-space
scheduler's address space, it can lead to page faults, resulting in a
potential deadlock.
To prevent this from happening automatically set
vm.compact_unevictable_allowed=0 when the scheduler is loaded and
restore the previous value when the scheduler in unloaded. In this way
we can prevent kcompactd from touching the unevictable memory associated
to the user-space scheduler.
Keep in mind that this is not a full bullet proof solution: something
else in the system may still set vm.compact_unevictable_allowed=1 while
the scheduler is running, re-enabling the risk of deadlock.
Ideally we would need a way to mark the user-space scheduler memory as
"really unevictable", or a proper kernel ABI to instruct kcompactd to
exclude certain tasks (or better, cgroups) from its proactive memory
compaction actions, but since then, this seems to be the best way to
mitigate this issue.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
We still don't have a reliable and non-racy way to manage cpumasks from
the user-space scheduler, so it is quite hard for the scheduler to
enforce the proper CPU affinity behavior.
Despite checking the cpumask in the BPF part, tasks may still be
assigned to a CPU that they cannot use, triggering scheduler errors.
For example, it is really easy to crash the scheduler with a simple CPU
affinity stress test (`stress-ng --race-sched 8 --timeout 5`):
14:51:28 [WARN] FAIL: SCX_DSQ_LOCAL[_ON] verdict target cpu 1 not allowed for stress-ng-race-[567048] (err=1024)
To prevent this issue from happening, create custom DSQ for each CPU
available in the system and use these per-CPU DSQs to dispatch all the
tasks processed by the user-space scheduler, including the user-space
scheduler itself.
Then consume the these DSQs from the .dispatch() callback of the
respective CPU, to transfer all the tasks to the consuming CPU's local
DSQ, preventing the cpumask race condition encountered using
SCX_DSQ_LOCAL_ON.
With this patch applied the `stress-ng --race-sched N` stress test can
be executed successfully (even with large values of N) without causing
the scheduler to crash.
Signed-off-by: David Vernet <void@manifault.com>
[ arighi: kick target cpu to improve responsiveness, update comments ]
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Instead of downloading a precompiled sched-ext enabled kernel from the
Ubuntu ppa, fetch the latest kernel directly from the sched-ext git
repository and recompile it on-the-fly using virtme-ng.
This allows to get rid of the Ubuntu ppa dependency, take out from the
equation potential Ubuntu-specific patches, and ensures testing all the
schedulers with the most up-to-date sched-ext kernel (that should also
help to detect potential kernel-related issues in advance).
The downside is that the CI runs will take a bit longer now, because we
are recompiling the kernel from scratch. However, the kernel built with
virtme-ng is relatively quick to compile and includes all the sched-ext
features required for testing.
It's worth noting that this method aligns with the current sched-ext
kernel CI, where we test only the in-kernel schedulers (as intended).
This change allows to extend the test coverage, using the same kernel to
test also the schedulers included in this repository.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The user-space scheduler dispatches tasks in batches, with the batch
size matching the number of idle CPUs.
Commit 791bdbe ("scx_rustland: introduce SMT support") changed the order
of idle CPUs, prioritizing dispatching tasks on the least busy cores
(those with the most idle CPUs) before moving on to busier cores (those
with the least idle CPUs).
While this approach works well for a small number of tasks, it can lead
to uneven performance as the number of tasks increases and all cores are
saturated. Such uneven performance can be attributed to SMT interactions
causing potential short lags and erratic system performance. In some
cases, disabling SMT entirely results in better system responsiveness.
To address this issue, instruct the scheduler to implicitly disable SMT
and consistently dispatch tasks only on the first (or last) CPU of each
core. This approach ensures an equal distribution of tasks among the
available cores, preventing SMT disturbances and aligning with non-SMT
performance, also when a significant amount of tasks are running.
Additionally, the unused sibling CPUs within each core can be used as
"spare" CPUs for the BPF dispatcher. This is particularly beneficial for
tasks that cannot be dispatched on the target CPU selected by the
scheduler, due to cpumask restrictions or congestion conditions.
Therefore, this new approach allows to enhance system responsiveness on
SMT systems, while simultaneously improving scheduler stability.
Some preliminary results on an AMD Ryzen 7 5800X 8-Cores (SMT enabled):
running my usual benchmark of measuring the fps of a videogame
(Counter-Strike 2) during a parallel kernel build-induced system
overload, shows an improvement of approximately 2x (from 8-10fps to
15-25fps vs 1-2fps with EEVDF).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Prior to commit 676bd88 ("bpf_rustland: do not dispatch the scheduler to
the global DSQ"), the user-space scheduler was dispatched using
SCX_DSQ_GLOBAL and we needed to explicitly kick idle CPUs from
update_idle() to ensure that at least one CPU was available to run the
user-space scheduler.
Now that we are using SCX_DSQ_LOCAL_ON|cpu to dispatch the user-space
scheduler, the target CPU is implicitly kicked. Therefore, the call to
scx_bpf_kick_cpu() within .update_idle() becomes redundant and we can
get rid of it.
Fixes: 676bd88 ("bpf_rustland: do not dispatch the scheduler to the global DSQ")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Now that scheduling BPF timers works correctly, we don't need this extra
logic to eagerly compact if a scheduling for compaction has happened a
few times in a row. Let's remove it.
Signed-off-by: David Vernet <void@manifault.com>
In scx_nest, we use a per-cpu BPF timer to schedule compaction for a
primary core before it goes idle. If a task comes along that could use
that core, we cancel the callback with bpf_timer_cancel().
bpf_timer_cancel() drops a refcnt on the prog and nullifies the
callback, so if we want to schedule the callback again, we must use
bpf_timer_set_callback() to reset the prog. This patch does that.
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: David Vernet <void@manifault.com>
Update the slice boost dynamically, as a function of the amount of CPUs
in the system and the amount of tasks currently waiting to be
dispatched: as the amount of waiting tasks in the task_pool increases,
reduce the slice boost.
This adjustment ensures that the scheduler adheres more closely to a
pure vruntime-based policy as the amount of tasks contending the
available CPUs increases and it allows to sustain stress tests that are
spawning a massive amount of tasks.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Introduce a basic support of CPU topology awareness. With this change,
the scheduler will prioritize dispatching tasks to idle CPUs with fewer
busy SMT siblings, then, it will proceed to CPUs with more busy SMT
siblings, in ascending order.
To implement this, introduce a new CoreMapping abstraction, that
provides a mapping of the available core IDs in the system along with
their corresponding lists of CPU IDs. This, coupled with the
get_cpu_pid() method from the BpfScheduler abstraction, allows the
user-space scheduler to enforce the policy outlined above and improve
performance on SMT systems.
Keep in mind that this improvement is relevent only when the amount of
tasks running in the system is less than the amount of CPUs. As soon as
the amount of running tasks increases, they will be distributed across
all available CPUs and cores, thereby negating the advantages of SMT
isolation.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>