Items in the task BTreeSet are stored by pid and vruntime. Make sure
that we never store multiple items with the same PID, so that
re-enqueued tasks are not dispatched multiple times.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Only scx_simple/qmap are in the kernel tree now. Drop the rest from the sync
script. Also update the sync script so that it can handle empty
rust_scheds variable.
Signed-off-by: Tejun Heo <tj@kernel.org>
Pierre Jacquet pointed out that our docs in the scx repo are out of date
for the latest APIs. Let's update it so readers don't get confused.
Signed-off-by: David Vernet <void@manifault.com>
Allow to scale the effective time slice down to 250 us. This can help to
maintain a good quality of the audio even when the system is overloaded
by multiple CPU-intensive tasks.
Moreover, always round up the time slice scaling factor to be a little
more aggressive and prioritize at scaling the time slice, so that we can
prioritize low latency tasks even more.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Evaluate the number of voluntary context switches per second (nvcsw/sec)
for each task using an exponentially weighted moving average (EWMA) with
weight 0.5, that allows to classify interactive tasks with more
accuracy.
Using a simple average over a period of time of 10 sec can introduce
small lags every 10 sec, as the statistics for the number of voluntary
context switches are refreshed. This can result in interactive tasks
taking a brief time to catch up in order to be accurately classified as
so, causing for example short audio cracks, small drop of 5-10 fps in
games, etc.
Using a EMWA allows to smooth the average of nvcsw/sec, preventing short
lags in the interactive tasks, while also preventing to incorrectly
classify as interactive tasks that may experience an isolated short
burst of voluntary context switches.
This patch has been tested with the usual test case of playing a
videogame while running a parallel kernel build in the background.
Without this patch the short lag every 10 sec is clearly noticeable,
with this patch applied the game and audio run smoothly.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Simplify the idle selection logic by relying only on the built-in idle
selection performed in the BPF layer.
When there are idle CPUs available in the system, tasks are dispatched
directly by the BPF dispatcher without invoking the user-space
scheduler. This allows to avoid the user-space overhead and get the best
system performance when CPU resources are not overcommitted.
Once the number of tasks exceeds the available CPUs, the user-space
scheduler takes over. However, by this time, the system is already
overcommitted, so there's little advantage in attempting to pinpoint the
optimal idle CPU through the user-space scheduler. Instead, tasks can be
executed on the first available CPU, consistently dispatching them to
the shared DSQ.
This allows to achieve the optimal performance both with system
under-utilization and over-utilization.
With this change in place the user-space scheduler won't dispatch tasks
directly to specific CPUs, but we still want to keep this as a generic
feature in the BPF layer, so that it can be potentially used in the
future by this scheduler or even by other user-space schedulers (once
the BPF layer will be moved to a more generic place).
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
When the user-space scheduler dispatches a task on a specific CPU, that
CPU might not be valid, since the user-space doesn't have visibility of
the task's cpumask.
When this happens the BPF dispatcher (that has direct visibility of the
cpumask) should automatically redirect the task to a valid CPU, but
instead of bouncing the task on the shared DSQ, we should try to use the
CPU assigned by the built-in idle selection logic.
If this CPU is also not valid, then we can simply ignore the task, that
has been de-queued and re-enqueued, since a valid CPU will be naturally
re-selected at a later time.
Moreover, avoid to kick any specific CPU when the task is dispatched to
shared DSQ, since the task can be consumed on any CPU and the additional
kick would simply add more overhead.
Lastly, rename dsq_id_to_cpu() to dsq_to_cpu() and cpu_to_dsq_id() to
cpu_to_dsq() for more clarity.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
With commit c6ada25 ("scx_rustland: use custom pcpu DSQ instead of
SCX_DSQ_LOCAL{_ON}") we tried to introduce custom per-CPU DSQs, instead
of using SCX_DSQ_LOCAL and SCX_DSQ_LOCAL_ON to dispatch tasks.
This was required, because dispatching tasks using SCX_DSQ_LOCAL_ON
doesn't provide a guarantee that the cpumask, checked at dispatch time
to determine the validity of a target CPU, remains valid.
This method solved the cpumask validity issue, but unfortunately it
introduced a noticeable performance regression and a potential
starvation issue (that were probably caused by the same problem): if a
task is assigned to a CPU in select_cpu() and the scheduler decides to
dispatch it on a different CPU, the task will be added to the new CPU's
DSQ, but if no dispatch event happens there, the task may remain stuck
in the per-CPU DSQ for a long time, triggering the sched-ext watchdog
timeout that would kick out the scheduler, for example:
12:53:28 [WARN] FAIL: IPC:CSteamEngin[7217] failed to run for 6.482s (err=1026)
12:53:28 [INFO] Unregister RustLand scheduler
Therefore, we reverted this change with 6d89ece ("scx_rustland: dispatch
tasks only on the global DSQ"), dispatching all the tasks to the global
DSQ, completely delegating the kernel to distribute tasks among the
available CPUs.
This is not the ideal solution, because we still want to give the
possibility to the user-space scheduler to assign tasks to specific
CPUs.
Therefore, re-introduce distinct per-CPU DSQs, but also provide a global
shared DSQ. Tasks dispatched in the per-CPU DSQs are consumed from the
dispatch() callback of their corresponding CPU, tasks dispatched in the
global shared DSQ are consumed from any CPU.
In this way the BPF layer is able to provide an interface that gives
the flexibility to the user-space to dispatch a task on a specific CPU
or on the first CPU available, depending on the particular scheduler's
need.
If an invalid CPU (according to the cpumask) is selected the BPF
dispatcher will transparently redirect the task to a valid CPU, selected
using the built-in idle selection logic.
In the future we may want to improve this part, giving to the
user-space the visibility of the cpumask, in order to pick a valid CPU
in advance and in a proper synchronized way.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
No functional change, just some refactoring to make the code more clear.
We have is_usersched_needed() and set_usersched_needed() that are doing
different things (the former is checkig if there are pending tasks for
the scheduler, the latter is setting the usersched_needed flag to
activate the dispatch of the user-space scheduler).
Rename is_usersched_needed() to usersched_has_pending_tasks() to make
the code more clear and understandable.
Also move dispatch_user_scheduler() closer to the other dispatch-related
helper functions.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
If we are doing local dispatch, we can avoid enqueue() altogether by
dispatching from select_cpu()
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
This is a really minor optimization, but we don't need idle_smtmask to
schedule pinned tasks, so defer it so the nr_cpus_allowed == 1 path is
marginally faster.
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
idle_cpumask isn't used at all in pick_idle_cpu_from. The only need for
these cpumasks is to check if prev_cpu is a wholly idle CPU (and we only
do this when smt_enabled). idle_smtmask is sufficient for that check.
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Prior to this patch, we only bump LSTAT_AFFN_BIOL when the target cpu
was idle, but in both cases it should be counted as AFFN_VIOL.
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
26ae1b0356
changed scx_exit_info which requires us to rebuild with a new vmlinux.h
This patch updates vmlinux.h to the current sched_ext branch in the
github repo.
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Currently scx_layered outputs statistics periodically as info! logs. The
format of this is largely unstructured and mostly suitable for running
scx_layered interactively (e.g. observing its behavior on the command
line or via logs after the fact).
In order to run scx_layered at larger scale, it's desireable to have
statistics output in some format that is amenable to being ingested into
monitoring databases (e.g. Prometheseus). This allows collection of
stats across many machines.
This commit adds a command line flag (-o) that outputs statistics to
stdout in OpenMetrics format instead of the normal log mechanism.
OpenMetrics has a public format
specification (https://github.com/OpenObservability/OpenMetrics) and is
in use by many projects.
The library for producing OpenMetrics metrics is lightweight but does
induce some changes. Primarily, metrics need to be pre-registered (see
OpenMetricsStats::new()).
Without -o, the output looks as before, for example:
```
19:39:54 [INFO] CPUs: online/possible=52/52 nr_cores=26
19:39:54 [INFO] Layered Scheduler Attached
19:39:56 [INFO] tot= 9912 local=76.71 open_idle= 0.00 affn_viol= 2.63 tctx_err=0 proc=21ms
19:39:56 [INFO] busy= 1.3 util= 65.2 load= 263.4 fallback_cpu= 1
19:39:56 [INFO] batch : util/frac= 49.7/ 76.3 load/frac= 252.0: 95.7 tasks= 458
19:39:56 [INFO] tot= 2842 local=45.04 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:56 [INFO] cpus= 2 [ 0, 2] 04000001 00000000
19:39:56 [INFO] immediate: util/frac= 0.0/ 0.0 load/frac= 0.0: 0.0 tasks= 0
19:39:56 [INFO] tot= 0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:56 [INFO] cpus= 50 [ 0, 50] fbfffffe 000fffff
19:39:56 [INFO] normal : util/frac= 15.4/ 23.7 load/frac= 11.4: 4.3 tasks= 556
19:39:56 [INFO] tot= 7070 local=89.43 open_idle= 0.00 preempt= 0.00 affn_viol= 3.69
19:39:56 [INFO] cpus= 50 [ 0, 50] fbfffffe 000fffff
19:39:58 [INFO] tot= 7091 local=84.91 open_idle= 0.00 affn_viol= 2.64 tctx_err=0 proc=21ms
19:39:58 [INFO] busy= 0.6 util= 31.2 load= 107.1 fallback_cpu= 1
19:39:58 [INFO] batch : util/frac= 18.3/ 58.5 load/frac= 93.9: 87.7 tasks= 589
19:39:58 [INFO] tot= 2011 local=60.67 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:58 [INFO] cpus= 2 [ 2, 2] 04000001 00000000
19:39:58 [INFO] immediate: util/frac= 0.0/ 0.0 load/frac= 0.0: 0.0 tasks= 0
19:39:58 [INFO] tot= 0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:58 [INFO] cpus= 50 [ 50, 50] fbfffffe 000fffff
19:39:58 [INFO] normal : util/frac= 13.0/ 41.5 load/frac= 13.2: 12.3 tasks= 650
19:39:58 [INFO] tot= 5080 local=94.51 open_idle= 0.00 preempt= 0.00 affn_viol= 3.68
19:39:58 [INFO] cpus= 50 [ 50, 50] fbfffffe 000fffff
^C19:39:59 [INFO] EXIT: BPF scheduler unregistered
```
With -o passed, the output is in OpenMetrics format:
```
19:40:08 [INFO] CPUs: online/possible=52/52 nr_cores=26
19:40:08 [INFO] Layered Scheduler Attached
# HELP total Total scheduling events in the period.
# TYPE total gauge
total 8489
# HELP local % that got scheduled directly into an idle CPU.
# TYPE local gauge
local 86.45305689716104
# HELP open_idle % of open layer tasks scheduled into occupied idle CPUs.
# TYPE open_idle gauge
open_idle 0.0
# HELP affn_viol % which violated configured policies due to CPU affinity restrictions.
# TYPE affn_viol gauge
affn_viol 2.332430203793144
# HELP tctx_err Failures to free task contexts.
# TYPE tctx_err gauge
tctx_err 0
# HELP proc_ms CPU time this binary has consumed during the period.
# TYPE proc_ms gauge
proc_ms 20
# HELP busy CPU busy % (100% means all CPUs were fully occupied).
# TYPE busy gauge
busy 0.5294061026085283
# HELP util CPU utilization % (100% means one CPU was fully occupied).
# TYPE util gauge
util 27.37195512782239
# HELP load Sum of weight * duty_cycle for all tasks.
# TYPE load gauge
load 81.55024768702126
# HELP layer_util CPU utilization of the layer (100% means one CPU was fully occupied).
# TYPE layer_util gauge
layer_util{layer_name="immediate"} 0.0
layer_util{layer_name="normal"} 19.340849995024997
layer_util{layer_name="batch"} 8.031105132797393
# HELP layer_util_frac Fraction of total CPU utilization consumed by the layer.
# TYPE layer_util_frac gauge
layer_util_frac{layer_name="batch"} 29.34063385422595
layer_util_frac{layer_name="immediate"} 0.0
layer_util_frac{layer_name="normal"} 70.65936614577405
# HELP layer_load Sum of weight * duty_cycle for tasks in the layer.
# TYPE layer_load gauge
layer_load{layer_name="immediate"} 0.0
layer_load{layer_name="normal"} 11.14363313258934
layer_load{layer_name="batch"} 70.40661455443191
# HELP layer_load_frac Fraction of total load consumed by the layer.
# TYPE layer_load_frac gauge
layer_load_frac{layer_name="normal"} 13.664744680306903
layer_load_frac{layer_name="immediate"} 0.0
layer_load_frac{layer_name="batch"} 86.33525531969309
# HELP layer_tasks Number of tasks in the layer.
# TYPE layer_tasks gauge
layer_tasks{layer_name="immediate"} 0
layer_tasks{layer_name="normal"} 490
layer_tasks{layer_name="batch"} 343
# HELP layer_total Number of scheduling events in the layer.
# TYPE layer_total gauge
layer_total{layer_name="normal"} 6711
layer_total{layer_name="batch"} 1778
layer_total{layer_name="immediate"} 0
# HELP layer_local % of scheduling events directly into an idle CPU.
# TYPE layer_local gauge
layer_local{layer_name="batch"} 69.79752530933632
layer_local{layer_name="immediate"} 0.0
layer_local{layer_name="normal"} 90.86574281031143
# HELP layer_open_idle % of scheduling events into idle CPUs occupied by other layers.
# TYPE layer_open_idle gauge
layer_open_idle{layer_name="immediate"} 0.0
layer_open_idle{layer_name="batch"} 0.0
layer_open_idle{layer_name="normal"} 0.0
# HELP layer_preempt % of scheduling events that preempted other tasks. #
# TYPE layer_preempt gauge
layer_preempt{layer_name="normal"} 0.0
layer_preempt{layer_name="batch"} 0.0
layer_preempt{layer_name="immediate"} 0.0
# HELP layer_affn_viol % of scheduling events that violated configured policies due to CPU affinity restrictions.
# TYPE layer_affn_viol gauge
layer_affn_viol{layer_name="normal"} 2.950379973178364
layer_affn_viol{layer_name="batch"} 0.0
layer_affn_viol{layer_name="immediate"} 0.0
# HELP layer_cur_nr_cpus Current # of CPUs assigned to the layer.
# TYPE layer_cur_nr_cpus gauge
layer_cur_nr_cpus{layer_name="normal"} 50
layer_cur_nr_cpus{layer_name="batch"} 2
layer_cur_nr_cpus{layer_name="immediate"} 50
# HELP layer_min_nr_cpus Minimum # of CPUs assigned to the layer.
# TYPE layer_min_nr_cpus gauge
layer_min_nr_cpus{layer_name="normal"} 0
layer_min_nr_cpus{layer_name="batch"} 0
layer_min_nr_cpus{layer_name="immediate"} 0
# HELP layer_max_nr_cpus Maximum # of CPUs assigned to the layer.
# TYPE layer_max_nr_cpus gauge
layer_max_nr_cpus{layer_name="immediate"} 50
layer_max_nr_cpus{layer_name="normal"} 50
layer_max_nr_cpus{layer_name="batch"} 2
# EOF
^C19:40:11 [INFO] EXIT: BPF scheduler unregistered
```
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>