Commit Graph

409 Commits

Author SHA1 Message Date
Tejun Heo
6db362b27a scx_rustland: Use scx_utils::user_exit_info
Instead of the bespoke implementation. This also makes scx_rustland to print
out debug dump if exists.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-31 11:44:15 -10:00
Tejun Heo
965926f393 scx_rusty: Use scx_utils::user_exit_info
Instead of the bespoke implementation. This also makes scx_rusty to print
out debug dump if exists.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-31 11:08:17 -10:00
Tejun Heo
105dc36b8f scx_layered: Use scx_utils::user_exit_info
Instead of the bespoke implementation. This also makes scx_layered to print
out debug dump if exists.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-31 10:54:20 -10:00
Tejun Heo
e6e8c8c9cb scx: Clean up user_exit_info.h
- Separate field size constants into enums.

- Add includes so that the header is self-sufficient.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-31 10:47:24 -10:00
Tejun Heo
4ee8104a6d
Merge pull request #114 from dschatzberg/local_avoid_enqueue
scx_layered: dispatch from select_cpu if possible
2024-01-31 08:33:26 -10:00
Dan Schatzberg
11e487c165 scx_layered: dispatch from select_cpu if possible
If we are doing local dispatch, we can avoid enqueue() altogether by
dispatching from select_cpu()

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-31 09:54:26 -08:00
Jordan Rome
1b3a9a1e72 [scx_layered] downgrade prometheus-client
This library at version 22 is not available in fedora:
https://src.fedoraproject.org/rpms/rust-prometheus-client

Rather than bothering the maintainer, let's just downgrade here.
2024-01-31 04:36:01 -08:00
Dan Schatzberg
ab5635ff6d scx_layered: Grab idle_smtmask a bit later
This is a really minor optimization, but we don't need idle_smtmask to
schedule pinned tasks, so defer it so the nr_cpus_allowed == 1 path is
marginally faster.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-29 08:16:37 -08:00
Dan Schatzberg
8c9e65d880 scx_layered: Remove unnecessary idle_cpumask
idle_cpumask isn't used at all in pick_idle_cpu_from. The only need for
these cpumasks is to check if prev_cpu is a wholly idle CPU (and we only
do this when smt_enabled). idle_smtmask is sufficient for that check.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-29 08:16:37 -08:00
Dan Schatzberg
142b6230b2 scx_layered: Fix AFFN_VIOL stat bump
Prior to this patch, we only bump LSTAT_AFFN_BIOL when the target cpu
was idle, but in both cases it should be counted as AFFN_VIOL.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-26 13:13:16 -08:00
Tejun Heo
988b7d13c1 Bump versions
scx_exit_info change doesn't require code to be updated but breaks binary
compatbility. Bump versions and cut a new release.
2024-01-25 09:01:23 -10:00
Tejun Heo
eb997a6e55
Merge pull request #101 from dschatzberg/openmetrics
scx_layered: Add support for OpenMetrics format
2024-01-25 08:59:16 -10:00
Tejun Heo
740e382f12 user_exit_info: Print out debug dump if available
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-25 08:49:31 -10:00
Dan Schatzberg
975c698843 Update vmlinux.h
26ae1b0356

changed scx_exit_info which requires us to rebuild with a new vmlinux.h
This patch updates vmlinux.h to the current sched_ext branch in the
github repo.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-25 10:21:48 -08:00
Dan Schatzberg
7f9548eb34 scx_layered: Add support for OpenMetrics format
Currently scx_layered outputs statistics periodically as info! logs. The
format of this is largely unstructured and mostly suitable for running
scx_layered interactively (e.g. observing its behavior on the command
line or via logs after the fact).

In order to run scx_layered at larger scale, it's desireable to have
statistics output in some format that is amenable to being ingested into
monitoring databases (e.g. Prometheseus). This allows collection of
stats across many machines.

This commit adds a command line flag (-o) that outputs statistics to
stdout in OpenMetrics format instead of the normal log mechanism.
OpenMetrics has a public format
specification (https://github.com/OpenObservability/OpenMetrics) and is
in use by many projects.

The library for producing OpenMetrics metrics is lightweight but does
induce some changes. Primarily, metrics need to be pre-registered (see
OpenMetricsStats::new()).

Without -o, the output looks as before, for example:

```
19:39:54 [INFO] CPUs: online/possible=52/52 nr_cores=26
19:39:54 [INFO] Layered Scheduler Attached
19:39:56 [INFO] tot=   9912 local=76.71 open_idle= 0.00 affn_viol= 2.63 tctx_err=0 proc=21ms
19:39:56 [INFO] busy=  1.3 util=   65.2 load=    263.4 fallback_cpu=  1
19:39:56 [INFO]   batch    : util/frac=   49.7/ 76.3 load/frac=    252.0: 95.7 tasks=   458
19:39:56 [INFO]              tot=   2842 local=45.04 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:56 [INFO]              cpus=  2 [  0,  2] 04000001 00000000
19:39:56 [INFO]   immediate: util/frac=    0.0/  0.0 load/frac=      0.0:  0.0 tasks=     0
19:39:56 [INFO]              tot=      0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:56 [INFO]              cpus= 50 [  0, 50] fbfffffe 000fffff
19:39:56 [INFO]   normal   : util/frac=   15.4/ 23.7 load/frac=     11.4:  4.3 tasks=   556
19:39:56 [INFO]              tot=   7070 local=89.43 open_idle= 0.00 preempt= 0.00 affn_viol= 3.69
19:39:56 [INFO]              cpus= 50 [  0, 50] fbfffffe 000fffff
19:39:58 [INFO] tot=   7091 local=84.91 open_idle= 0.00 affn_viol= 2.64 tctx_err=0 proc=21ms
19:39:58 [INFO] busy=  0.6 util=   31.2 load=    107.1 fallback_cpu=  1
19:39:58 [INFO]   batch    : util/frac=   18.3/ 58.5 load/frac=     93.9: 87.7 tasks=   589
19:39:58 [INFO]              tot=   2011 local=60.67 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:58 [INFO]              cpus=  2 [  2,  2] 04000001 00000000
19:39:58 [INFO]   immediate: util/frac=    0.0/  0.0 load/frac=      0.0:  0.0 tasks=     0
19:39:58 [INFO]              tot=      0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:58 [INFO]              cpus= 50 [ 50, 50] fbfffffe 000fffff
19:39:58 [INFO]   normal   : util/frac=   13.0/ 41.5 load/frac=     13.2: 12.3 tasks=   650
19:39:58 [INFO]              tot=   5080 local=94.51 open_idle= 0.00 preempt= 0.00 affn_viol= 3.68
19:39:58 [INFO]              cpus= 50 [ 50, 50] fbfffffe 000fffff
^C19:39:59 [INFO] EXIT: BPF scheduler unregistered
```

With -o passed, the output is in OpenMetrics format:

```
19:40:08 [INFO] CPUs: online/possible=52/52 nr_cores=26
19:40:08 [INFO] Layered Scheduler Attached
 # HELP total Total scheduling events in the period.
 # TYPE total gauge
total 8489
 # HELP local % that got scheduled directly into an idle CPU.
 # TYPE local gauge
local 86.45305689716104
 # HELP open_idle % of open layer tasks scheduled into occupied idle CPUs.
 # TYPE open_idle gauge
open_idle 0.0
 # HELP affn_viol % which violated configured policies due to CPU affinity restrictions.
 # TYPE affn_viol gauge
affn_viol 2.332430203793144
 # HELP tctx_err Failures to free task contexts.
 # TYPE tctx_err gauge
tctx_err 0
 # HELP proc_ms CPU time this binary has consumed during the period.
 # TYPE proc_ms gauge
proc_ms 20
 # HELP busy CPU busy % (100% means all CPUs were fully occupied).
 # TYPE busy gauge
busy 0.5294061026085283
 # HELP util CPU utilization % (100% means one CPU was fully occupied).
 # TYPE util gauge
util 27.37195512782239
 # HELP load Sum of weight * duty_cycle for all tasks.
 # TYPE load gauge
load 81.55024768702126
 # HELP layer_util CPU utilization of the layer (100% means one CPU was fully occupied).
 # TYPE layer_util gauge
layer_util{layer_name="immediate"} 0.0
layer_util{layer_name="normal"} 19.340849995024997
layer_util{layer_name="batch"} 8.031105132797393
 # HELP layer_util_frac Fraction of total CPU utilization consumed by the layer.
 # TYPE layer_util_frac gauge
layer_util_frac{layer_name="batch"} 29.34063385422595
layer_util_frac{layer_name="immediate"} 0.0
layer_util_frac{layer_name="normal"} 70.65936614577405
 # HELP layer_load Sum of weight * duty_cycle for tasks in the layer.
 # TYPE layer_load gauge
layer_load{layer_name="immediate"} 0.0
layer_load{layer_name="normal"} 11.14363313258934
layer_load{layer_name="batch"} 70.40661455443191
 # HELP layer_load_frac Fraction of total load consumed by the layer.
 # TYPE layer_load_frac gauge
layer_load_frac{layer_name="normal"} 13.664744680306903
layer_load_frac{layer_name="immediate"} 0.0
layer_load_frac{layer_name="batch"} 86.33525531969309
 # HELP layer_tasks Number of tasks in the layer.
 # TYPE layer_tasks gauge
layer_tasks{layer_name="immediate"} 0
layer_tasks{layer_name="normal"} 490
layer_tasks{layer_name="batch"} 343
 # HELP layer_total Number of scheduling events in the layer.
 # TYPE layer_total gauge
layer_total{layer_name="normal"} 6711
layer_total{layer_name="batch"} 1778
layer_total{layer_name="immediate"} 0
 # HELP layer_local % of scheduling events directly into an idle CPU.
 # TYPE layer_local gauge
layer_local{layer_name="batch"} 69.79752530933632
layer_local{layer_name="immediate"} 0.0
layer_local{layer_name="normal"} 90.86574281031143
 # HELP layer_open_idle % of scheduling events into idle CPUs occupied by other layers.
 # TYPE layer_open_idle gauge
layer_open_idle{layer_name="immediate"} 0.0
layer_open_idle{layer_name="batch"} 0.0
layer_open_idle{layer_name="normal"} 0.0
 # HELP layer_preempt % of scheduling events that preempted other tasks. #
 # TYPE layer_preempt gauge
layer_preempt{layer_name="normal"} 0.0
layer_preempt{layer_name="batch"} 0.0
layer_preempt{layer_name="immediate"} 0.0
 # HELP layer_affn_viol % of scheduling events that violated configured policies due to CPU affinity restrictions.
 # TYPE layer_affn_viol gauge
layer_affn_viol{layer_name="normal"} 2.950379973178364
layer_affn_viol{layer_name="batch"} 0.0
layer_affn_viol{layer_name="immediate"} 0.0
 # HELP layer_cur_nr_cpus Current  # of CPUs assigned to the layer.
 # TYPE layer_cur_nr_cpus gauge
layer_cur_nr_cpus{layer_name="normal"} 50
layer_cur_nr_cpus{layer_name="batch"} 2
layer_cur_nr_cpus{layer_name="immediate"} 50
 # HELP layer_min_nr_cpus Minimum  # of CPUs assigned to the layer.
 # TYPE layer_min_nr_cpus gauge
layer_min_nr_cpus{layer_name="normal"} 0
layer_min_nr_cpus{layer_name="batch"} 0
layer_min_nr_cpus{layer_name="immediate"} 0
 # HELP layer_max_nr_cpus Maximum  # of CPUs assigned to the layer.
 # TYPE layer_max_nr_cpus gauge
layer_max_nr_cpus{layer_name="immediate"} 50
layer_max_nr_cpus{layer_name="normal"} 50
layer_max_nr_cpus{layer_name="batch"} 2
 # EOF
^C19:40:11 [INFO] EXIT: BPF scheduler unregistered
```

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-25 09:59:49 -08:00
Andrea Righi
6d89eceb93 scx_rustland: dispatch tasks only on the global DSQ
Commit c6ada25 ("scx_rustland: use custom pcpu DSQ instead of
SCX_DSQ_LOCAL{_ON}") fixed the race issues with the cpumask, but it also
introduced performance regressions.

Until we figure out the reasons of the performance regressions, simplify
the dispatcher and go back at using only the global DSQ, relying on the
built-in idle cpu selection.

In this way we can still enforce task affinity properly
(`stress-ng --race-sched N` does not crash the scheduler) and we can
also provide a better level of system responsiveness (according to the
results of the stress tests done recently).

The idea of this change is to make the scheduler usable in certain
real-world scenarios (and as bug-free as possible), while we figure out
the performance regressions of the per-CPU DSQ approach, that will
likely be re-introduced later on in the future.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-23 13:24:12 +01:00
Andrea Righi
06b5ff3d2f scx_rustland: clarify the logic to determine interactive tasks
No functional change, simply rewrite the code a bit and update the
comment to clarify the logic to detect interactive tasks and apply the
priority boost.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-23 08:28:44 +01:00
Andrea Righi
ab1c4f66a8 scx_rustland: allow to disable the slice boost completely
Allow to specify `-b 0` to completely disable the slice boost logic and
fallback to standard vruntime-based scheduler with variable time slice.

In this way interactive tasks will not get over-prioritized over the
other tasks in the system.

Having this option can help to easily track down potential performance
regressions arising for over-prioritizing interactive tasks.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-23 00:34:06 +01:00
Andrea Righi
b4269452fc scx_userland: handle preemption events from higher sched_class
Make sure to re-schedule the user-space scheduler if it's preempted by a
task from a higher priority sched_class.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-23 00:34:06 +01:00
Andrea Righi
2426d1024f scx_rustland: increase max amount of enqueued tasks
As the scheduler is progressing towards a more stable and usable state,
it may be subject to heavy stress tests.

For this reason, bump up the limit of MAX_ENQUEUED_TASKS to 8192 in the
BPF component, to be able to sustain task-intensive stress tests,
reducing the risk of potential scheduling congestion conditions.

The downside is a negligible increase in the memory footprint of the BPF
component, that is worth the cost in order to have an improved scheduler
stability.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-21 15:47:35 +01:00
Andrea Righi
28bf96c78e scx_rustland: mitigate unevictable memory page faults
Page faults cannot happen when the user-space scheduler is running,
otherwise we may hit deadlock conditions: a kthread may need to run to
resolve the page fault, but the user-space scheduler is waiting on the
page fault to be resolved => deadlock.

We solved this problem (mostly) in commit 9708a80 ("scx_userland: use a
custom memory allocator to prevent page faults"), introducing a custom
allocator for the user-space scheduler that operates on a pre-allocated
mlocked memory buffer, but there is an exception that can still trigger
page faults: kcompactd.

When memory compaction is enabled, specifically with
vm.compact_unevictable_allowed=1 (which is often the default in many
distributions), kcompactd regularly attempts to compact all memory
zones, such that free memory is available in contiguous blocks where
feasible, including unevictable memory as well.

In the event that kcompactd remaps pages within the user-space
scheduler's address space, it can lead to page faults, resulting in a
potential deadlock.

To prevent this from happening automatically set
vm.compact_unevictable_allowed=0 when the scheduler is loaded and
restore the previous value when the scheduler in unloaded. In this way
we can prevent kcompactd from touching the unevictable memory associated
to the user-space scheduler.

Keep in mind that this is not a full bullet proof solution: something
else in the system may still set vm.compact_unevictable_allowed=1 while
the scheduler is running, re-enabling the risk of deadlock.

Ideally we would need a way to mark the user-space scheduler memory as
"really unevictable", or a proper kernel ABI to instruct kcompactd to
exclude certain tasks (or better, cgroups) from its proactive memory
compaction actions, but since then, this seems to be the best way to
mitigate this issue.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-21 15:47:35 +01:00
David Vernet
c6ada251ef scx_rustland: use custom pcpu DSQ instead of SCX_DSQ_LOCAL{_ON}
We still don't have a reliable and non-racy way to manage cpumasks from
the user-space scheduler, so it is quite hard for the scheduler to
enforce the proper CPU affinity behavior.

Despite checking the cpumask in the BPF part, tasks may still be
assigned to a CPU that they cannot use, triggering scheduler errors.

For example, it is really easy to crash the scheduler with a simple CPU
affinity stress test (`stress-ng --race-sched 8 --timeout 5`):

  14:51:28 [WARN] FAIL: SCX_DSQ_LOCAL[_ON] verdict target cpu 1 not allowed for stress-ng-race-[567048] (err=1024)

To prevent this issue from happening, create custom DSQ for each CPU
available in the system and use these per-CPU DSQs to dispatch all the
tasks processed by the user-space scheduler, including the user-space
scheduler itself.

Then consume the these DSQs from the .dispatch() callback of the
respective CPU, to transfer all the tasks to the consuming CPU's local
DSQ, preventing the cpumask race condition encountered using
SCX_DSQ_LOCAL_ON.

With this patch applied the `stress-ng --race-sched N` stress test can
be executed successfully (even with large values of N) without causing
the scheduler to crash.

Signed-off-by: David Vernet <void@manifault.com>
[ arighi: kick target cpu to improve responsiveness, update comments ]
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-21 15:47:35 +01:00
Jordan Rome
9f9a97a97f Update descriptions in cargo toml files 2024-01-19 18:19:46 -08:00
Andrea Righi
24ef0f6c00
Merge pull request #94 from sched-ext/scx-rustland-smt-improvements
scx-rustland: SMT improvements
2024-01-17 21:01:26 +01:00
Andrea Righi
be1cb8774b scx_rustland: improve SMT performance
The user-space scheduler dispatches tasks in batches, with the batch
size matching the number of idle CPUs.

Commit 791bdbe ("scx_rustland: introduce SMT support") changed the order
of idle CPUs, prioritizing dispatching tasks on the least busy cores
(those with the most idle CPUs) before moving on to busier cores (those
with the least idle CPUs).

While this approach works well for a small number of tasks, it can lead
to uneven performance as the number of tasks increases and all cores are
saturated. Such uneven performance can be attributed to SMT interactions
causing potential short lags and erratic system performance. In some
cases, disabling SMT entirely results in better system responsiveness.

To address this issue, instruct the scheduler to implicitly disable SMT
and consistently dispatch tasks only on the first (or last) CPU of each
core. This approach ensures an equal distribution of tasks among the
available cores, preventing SMT disturbances and aligning with non-SMT
performance, also when a significant amount of tasks are running.

Additionally, the unused sibling CPUs within each core can be used as
"spare" CPUs for the BPF dispatcher. This is particularly beneficial for
tasks that cannot be dispatched on the target CPU selected by the
scheduler, due to cpumask restrictions or congestion conditions.

Therefore, this new approach allows to enhance system responsiveness on
SMT systems, while simultaneously improving scheduler stability.

Some preliminary results on an AMD Ryzen 7 5800X 8-Cores (SMT enabled):
running my usual benchmark of measuring the fps of a videogame
(Counter-Strike 2) during a parallel kernel build-induced system
overload, shows an improvement of approximately 2x (from 8-10fps to
15-25fps vs 1-2fps with EEVDF).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-17 20:49:17 +01:00
Andrea Righi
f0c33320ab scx_rustland: avoid calling scx_bpf_kick_cpu() from update_idle()
Prior to commit 676bd88 ("bpf_rustland: do not dispatch the scheduler to
the global DSQ"), the user-space scheduler was dispatched using
SCX_DSQ_GLOBAL and we needed to explicitly kick idle CPUs from
update_idle() to ensure that at least one CPU was available to run the
user-space scheduler.

Now that we are using SCX_DSQ_LOCAL_ON|cpu to dispatch the user-space
scheduler, the target CPU is implicitly kicked. Therefore, the call to
scx_bpf_kick_cpu() within .update_idle() becomes redundant and we can
get rid of it.

Fixes: 676bd88 ("bpf_rustland: do not dispatch the scheduler to the global DSQ")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-17 20:49:17 +01:00
Tejun Heo
9089cc09bb
Merge pull request #92 from sched-ext/nest_callbacks
scx_nest: Set timer callback after cancelling
2024-01-17 09:27:22 -10:00
David Vernet
7a3fe759f2
scx_nest: Remove -D option for eager compaction
Now that scheduling BPF timers works correctly, we don't need this extra
logic to eagerly compact if a scheduling for compaction has happened a
few times in a row. Let's remove it.

Signed-off-by: David Vernet <void@manifault.com>
2024-01-16 14:08:36 -06:00
David Vernet
607119d8a4
scx_nest: Set timer callback after cancelling
In scx_nest, we use a per-cpu BPF timer to schedule compaction for a
primary core before it goes idle. If a task comes along that could use
that core, we cancel the callback with bpf_timer_cancel().
bpf_timer_cancel() drops a refcnt on the prog and nullifies the
callback, so if we want to schedule the callback again, we must use
bpf_timer_set_callback() to reset the prog. This patch does that.

Reported-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: David Vernet <void@manifault.com>
2024-01-16 14:01:39 -06:00
Andrea Righi
0b3c399519 scx_rustland: introduce dynamic slice boost
Update the slice boost dynamically, as a function of the amount of CPUs
in the system and the amount of tasks currently waiting to be
dispatched: as the amount of waiting tasks in the task_pool increases,
reduce the slice boost.

This adjustment ensures that the scheduler adheres more closely to a
pure vruntime-based policy as the amount of tasks contending the
available CPUs increases and it allows to sustain stress tests that are
spawning a massive amount of tasks.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-16 11:51:51 +01:00
Andrea Righi
791bdbec97 scx_rustland: introduce SMT support
Introduce a basic support of CPU topology awareness. With this change,
the scheduler will prioritize dispatching tasks to idle CPUs with fewer
busy SMT siblings, then, it will proceed to CPUs with more busy SMT
siblings, in ascending order.

To implement this, introduce a new CoreMapping abstraction, that
provides a mapping of the available core IDs in the system along with
their corresponding lists of CPU IDs. This, coupled with the
get_cpu_pid() method from the BpfScheduler abstraction, allows the
user-space scheduler to enforce the policy outlined above and improve
performance on SMT systems.

Keep in mind that this improvement is relevent only when the amount of
tasks running in the system is less than the amount of CPUs. As soon as
the amount of running tasks increases, they will be distributed across
all available CPUs and cores, thereby negating the advantages of SMT
isolation.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-16 11:33:35 +01:00
Andrea Righi
63209b865d scx_rustland: support aligned allocations in RustLandAllocator
Even if the current implementation of the user-space scheduler doesn't
require to allocate aligned memory, add a simple support to aligned
allocations in RustLandAllocator, in order to make it more generic and
potentially usable by other schedulers / components.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-15 13:44:33 +01:00
Andrea Righi
c593e3605e scx_rustland: report user-space scheduler page fault counter
Periodically report a page fault counter in the scheduler output. The
user-space scheduler should never trigger page faults, otherwise we may
experience deadlocks (that would trigger the sched-ext watchdog,
unloading the scheduler).

Reporting a page fault counter periodically to stdout can be really
helpful to debug potential issues with the custom allocator.

Moreover, group together also nr_sched_congested and
nr_failed_dispatches with nr_page_faults and use the sum of all these
counters to determine the healthy status of the user-space scheduler
(reporting it to stdout as well).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-14 22:07:37 +01:00
Andrea Righi
9708a80130 scx_userland: use a custom memory allocator to prevent page faults
To prevent potential deadlock conditions under heavy loads, any
scheduler that delegates scheduling decisions to user-space should avoid
triggering page faults.

To address this issue, replace the default Rust allocator with a custom
one (RustLandAllocator), designed to operate on a pre-allocated buffer.

This, coupled with the memory locking (via mlockall), prevents page
faults from happening during the execution of the user-space scheduler,
avoiding the deadlock condition.

This memory allocator is completely transparent to the user-space
scheduler code and it is applied automatically when the bpf module is
imported.

In the future we may decide to move this allocator to a more generic
place (scx_utils crate), so that also other user-space Rust schedulers
can use it.

This initial implementation of the RustLandAllocator is very simple: a
basic block-based allocator that uses an array to track the status of
each memory block (allocated or free).

This allocator can be improved in the future, but right now, despite its
simplicity, it shows a reasonable speed and efficiency in meeting memory
requests from the user-space scheduler, having to deal mostly with small
and uniformly sized allocations.

With this change in place scx_rustland survived more than 10hrs on a
heavily stressed system (with stress-ng and kernel builds running in a
loop):

 $ ps -o pid,rss,etime,cmd -p `pidof scx_rustland`
     PID   RSS     ELAPSED CMD
   34966 75840    10:00:44 ./build/scheds/rust/scx_rustland/debug/scx_rustland

Without this change it is possible to trigger the sched-ext watchdog
timeout in less than 5min, under the same system load conditions.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-14 22:07:37 +01:00
Andrea Righi
acc1d51560 scx_rustland: remove obsolete TODO note
Entries from TaskInfoMap associated to exiting tasks are already removed
via the BPF .exit_task() callback, so drop the obsolete TODO note and
replace it with a proper comment.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-11 20:47:36 +01:00
Andrea Righi
12d89e1d84 scx_rustland: add a troubleshooting section
Add a brief troubleshooting section to the command line help.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-11 18:14:46 +01:00
Andrea Righi
2157f638df scx_rustland: voluntary context switch boost
Improve priority boosting using voluntary context switches metric.

Overview
========

The current criteria to apply the time slice boost (option `-b`) is to
distinguish between newly created tasks and tasks that are already
running: in order to prioritize interactive applications (games,
multimedia, etc.) we apply a time slice usage penalty on newly created
tasks, indirectly boosting the priority of tasks that are already
running, which are likely to be the interactive applications that we
aim to prioritize.

Problem
=======

This approach works well when the background workload forks a bunch of
short-lived tasks (e.g., a parallel kernel build), but it fails to
properly classify CPU-intensive background tasks (i.e., video/3D
rendering, encryption, large data analysis, etc.), because these
applications, typically, do not generate many short-lived processes.

In presence of such workloads the time slice penalty is not enforced,
resulting in a lack of any boost for interactive applications.

Solution
========

A more effective critiria for distinguishing between interactive
applications and background CPU-intensive applications is to examine the
voluntary context switches: an application that periodically releases
the CPU voluntarily is very likely to be interactive.

Therefore, change the time slice boost logic to apply a bonus (scale down
the accounted used time slice) to tasks that show an increase in their
voluntary context switches counter over a time frame of 10 sec.

Based on experimental results, this simple heurstic appears to be quite
effective in classifying interactive tasks and prioritize them over
potential background CPU-intensive tasks.

Additionally, having a better criteria to identify interactive tasks
allow to prioritize also newly created tasks, thereby enhancing the
responsiveness of interactive shell sessions.

This always ensures the prompt execution of system commands, even when
the system is massively overloaded, unlike the previous time slice boost
logic, which made interactive shell sessions less responsive by
deprioritizing newly created tasks.

Results
=======

With this new logic in place it is possible to play a video game (e.g.,
Terraria) without experiencing any frame rate drop (60 fps), while a
parallel CPU stress test (`stress-ng -c 32`) is running in the
background. The same result can also be obtained with a parallel kernel
build (`make -j 32`). Thus, there is no regression compared to the
previous "ideal" test case.

Even when mixing both workloads (`make -j 16` + `stress-ng -c 16`),
Terraria can still be played without noticeable lag in the audio or
video, maintaining a consistent 60 fps.

In addition to that, shell commands are also very responsive.

Following, the results (average and standard deviation of 10 runs) of
two simple interactive shell commands, while both the `make -j 16` and
`stress-ng -c 16` workloads are running in background:

  avg time           "uname -r"       "ps axuw > /dev/null"
  =========================================================
  EEVDF                 11.1ms                     231.8ms
  scx_rustland           2.6ms                     212.0ms

  stdev              "uname -r"       "ps axuw > /dev/null"
  =========================================================
  EEVDF                   2.28                       23.41
  scx_rustland            0.70                        9.11

Tests conducted on a 8-cores laptop (11th Gen Intel i7-1195G7 @
4.800GHz) with 16GB of RAM.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-11 18:14:30 +01:00
Andrea Righi
1cf03770c7 scx_rustland: expose voluntary context switches to the scheduler
Provide the number of voluntary context switches (nvcsw) for each task
to the user-space scheduler.

This extra information can then be used by the scheduler to enhance its
decision-making process when scheduling tasks.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-11 14:10:39 +01:00
Tejun Heo
1395f14975
Update README.md
Embed the video and drop "live" from section title as it's not really live.
2024-01-10 14:47:33 -10:00
Tejun Heo
18f7fe8477 scx_flatcg: Fix fallout from direct dispatch API update
552b75a9c7 ("scx: Build fix after kernel update") updated scx_flatcg along
with other schedulers to use the new direct dispatching from
ops.select_cpu() mechanism. However, this was buggy for flatcg.

flatcg uses direct dispatch for two purposes - as an optimization when there
are idle cpus and to avoid dealing with custom CPU affinities in the
dispatch logic. While the former can be moved to ops.select_cpu(), the
latter can't as it should also apply to tasks which get enqueued without
preceding ops.select_cpu(), e.g., when the task gets requeued after an
attribute change or runs out of time slice. The API update incorrectly moved
both to ops.select_cpu() leading to futile retries of try_pick_next_cgroup()
and scheduling misbheaviors.

Fix it by separating out the two cases and only keeping the idle
optimization case in ops.select_cpu().

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-10 10:57:50 -10:00
Tejun Heo
c1f22ea073 scx_flatcg: Report pick_next_cgroup() race and fail counts
To improve visibility into failure mode. While at it, improve output
formatting.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-10 10:52:24 -10:00
Tejun Heo
ae50b155ca
Merge pull request #80 from sched-ext/scx-flatcg-mitigate-stall
scx_flatcg: introduce CGROUP_MAX_RETRIES
2024-01-10 09:49:09 -10:00
Andrea Righi
0609abdca6 scx_flatcg: introduce CGROUP_MAX_RETRIES
We may end up stalling for too long in fcg_dispatch() if
try_pick_next_cgroup() doesn't find another valid cgroup to pick. This
can be quite risky, considering that we are holding the rq lock in
dispatch().

This condition can be reproduced easily in our CI, where we can trigger
stalling softirq works:

[    4.972926] NOHZ tick-stop error: local softirq work is pending, handler #200!!!

Or rcu stalls:

[   47.731900] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   47.731900] rcu:     1-...!: (0 ticks this GP) idle=b29c/1/0x4000000000000000 softirq=2204/2204 fqs=0
[   47.731900] rcu:     3-...!: (0 ticks this GP) idle=db74/1/0x4000000000000000 softirq=2286/2286 fqs=0
[   47.731900] rcu:     (detected by 0, t=26002 jiffies, g=6029, q=54 ncpus=4)
[   47.731900] Sending NMI from CPU 0 to CPUs 1:

To mitigate this issue reduce the amount of try_pick_next_cgroup()
retries from BPF_MAX_LOOPS (8M) to CGROUP_MAX_RETRIES (1024).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-10 17:36:17 +01:00
Andrea Righi
0198d893ce scx_rustland: introduce time slice boost parameter
Introduce a parameter to prioritize active running tasks over newly
created tasks.

This option can be used to enhance interactive applications (e.g.,
games, audio/video, GUIs, etc.) that are concurrently running with
fork-intensive background workloads (such as a large parallel build for
example).

The boost value (which functions as a penalty) is applied to the time
slice attributed to newly generated tasks, increasing their vruntime
and, in an indirect manner, "boosting" the priority of all the other
concurrent active tasks.

The time slice boost parameter was applied in the live demo video [1] to
enhance the frames per second (fps) of a video game (Terraria), running
simultaneously with a parallel kernel build (`make -j 32`) on an 8-core
laptop (the value used in the video matches the existing setting of
running `scx_rustland -b 200`).

[1] https://www.youtube.com/watch?v=oCfVbz9jvVQ

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-10 17:32:29 +01:00
Andrea Righi
732ba4900b scx_rustland: avoid using SCX_ENQ_PREEMPT
With the introduction of a the dynamic time slice that scales down based
on the number of tasks in the system, there is no obvious benefit in
utilizing SCX_ENQ_PREEMPT to dispatch the user-space scheduler.

The reduced time slice as the task count increases already enhances the
user-space scheduler's opportunities to run and efficiently manage
scheduling tasks, even when the system is massively overloaded.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-10 17:32:29 +01:00
Andrea Righi
db9a29d618 scx_rustland: improve dynamic slice scaling
Move scaling after tasks are sent to the dispatcher: tasks are
dispatched based on the amount of idle CPUs, so checking for any
remaining tasks still sitting in the scheduler after dispatch gives a
better idea how busy the system is.

Moreover, do not scale the time slice based on nr_cpus (otherwise,
systems with a large amount of CPUs would rarely get any scaling at
all).

Instead, apply a scaling factor as a function of how many tasks are
still waiting in the scheduler: nr_scheduled / 2. This method scales
better as the number of CPUs increases.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-09 22:11:07 +01:00
Andrea Righi
1da2983804 scx_rustland: get rid of force_local
Now that we can dispatch directly from select_cpu() we can make the code
more compact and readable by removing the force_local logic.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-09 22:11:07 +01:00
Andrea Righi
6ead675fb6 scx_rustland: add a link to the live demo in the README
Update the README.md adding a link to a live demo video of the
scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-09 22:11:07 +01:00
Tejun Heo
942b0269b8 Bump versions
After updates to reflect the updated init and direct dispatch API, the
schedulers aren't compatible with older kernels. Bump versions and publish
releases.
2024-01-08 18:49:54 -10:00
Tejun Heo
552b75a9c7 scx: Build fix after kernel update
In the latest kernel, sched_ext API has changed in two areas:

- ops.prep_enable/cancel_enable/enable/disable() replaced with
  ops.init_task/enable/disable/exit_task().

- scx_bpf_dispatch() can now be called from ops.select_cpu(). Also,
  SCX_ENQ_LOCAL flag is removed. Instead, users can call
  scx_bpf_select_cpu_dfl() from ops.select_cpu() and use the @is_idle out
  param value to determine whether to dispatch directly.

This commit updates all schedules so that they build.

- Init functions renamed / merged / split.

- ops.select_cpu() is added to several schedulers and local direct
  disptching logic is moved there.

This is the minimum update which is need to make the schedulers build and
work. It needs further update to e.g. move vtime udpates to ops.enable().
2024-01-08 14:48:24 -10:00
Andrea Righi
1ea5aebfb4 scx_rustland: always consider slice_ns as maximum time slice
With the introduction of a the dynamic time slice that scales down based
on the number of tasks in the system, there is no need anymore to apply
a constant scaling factor to time slice to extend the range of the
allowed time slices.

Therefore, get rid of the static scaling and use slice_ns as the upper
limit for the time slice accounted to the tasks.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 19:22:38 +01:00
Andrea Righi
9b482f48f1 scx_rustland: determine the amount of cores via /proc/stat
libbpf_rs::num_possible_cpus() may take into account multi-threads
multi-cores information, that are not used efficiently by the scheduler
at the moment.

For simplicity rely on /proc/stat to determine the amount of CPUs that
can be used by the scheduler and provide a proper abstraction to access
this information from the bpf Rust module.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 19:11:25 +01:00
Andrea Righi
0d107d6220 scx_rustland: return the proper cpu value from get_task_cpu()
Fix the ternary operator expression to return the CPU id, instead of the
boolean result of the condition.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 19:10:59 +01:00
Andrea Righi
fa6915cc0a scx_rustland: simplify update_enqueued()
With the introduction of a variable time slice that scales down in
function of the amount of waiting tasks, the scheduler is able to handle
a steady stream of newly spawned tasks, without having to de-prioritize
them to guarantee a good level of system responsiveness.

Hence, the logic for de-prioritizing new tasks can be removed, as it
currently doesn't provide any measurable benefits. In fact, it even
proves counterproductive as it can implicitly slow down the interactive
performance of shell sessions when the system is overloaded with a
significant amount of CPU hogs (e.g, `stress-ng -c 128`).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 07:38:52 +01:00
Andrea Righi
bf98154ee1 scx_rustland: use dynamic time slice in the user-space scheduler
Implement a simple logic in the user-space scheduler to automatically
adjust the tasks' time slice: reduce the time slice by a scaling factor
of (nr_waiting / nr_cpus + 1), where nr_waiting is the amount of tasks
waiting in the scheduler and nr_cpus is the amount of CPUs in the
system.

Using a fine-grained time slice as the number of tasks in the system
grows, improves responsiveness of low-latency activities (e.g., audio,
video games), also in presence of other CPU-intensive tasks that are
concurrently running in the system.

On the other hand, extending the time slice when only a limited number
of tasks are active in the system contributes to an enhancement in the
overall system throughput and a reduced amount of context switches.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 07:38:52 +01:00
Andrea Righi
303c4ea548 scx_rustland: dynamic time slice support
Add to BpfScheduler() the new methods set_effective_slice_us() and
get_effective_slice_us().

These methods can be used by the user-space scheduler to dynamically
adjust (and retrieve) the effective time slice used to dispatch tasks
within the BPF dispatcher.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 07:35:31 +01:00
Andrea Righi
2a32d81859 scx_rustland: store default slice_ns in the scheduler class
Cache slice_ns into the main scheduler class to avoid accessing it via
self.bpf.skel.rodata().slice_ns every single time.

This also makes the scheduler code more clear and more abstracted from
the BPF details.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-07 16:14:51 +01:00
Andrea Righi
8ccbbdadee scx_userland: improve BPF logging
Always report task comm, nr_queued and nr_scheduled in the log messages.
Moreover, report also task name (comm) and cpu when possible.

All these extra information can be really helpful to trace and debug
scheduling issues.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-07 16:14:51 +01:00
Andrea Righi
295873ac41 scx_rustland: always dispatch per-CPU kthreads from enqueue
We allow tasks to bypass the user-space scheduler and be dispatched
directly using a shortcut in the enqueue path, if their running CPU is
immediately available or if the task is per-CPU kthread.

However, the shortcut is disabled if the user-space scheduler has some
pending activities to do (to avoid disrupting too much its decision).

In this case the shortcut is disabled also for per-CPU kthreads and that
may cause priority-inversion problems in the system, triggering some
stall of some per-CPU kthreads (such as rcuog/N) and short system
lockups, if the system is overloaded.

Prevent this by always enabing the dispatch shortcut for per-CPU
kthreads.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
0c3bdb16fe scx_rustland: prevent using SCX_DSQ_LOCAL_ON from enqueue()
When we fail to push a task to the queued BPF map we fallback to direct
dispatch, but we can't use SCX_DSQ_LOCAL_ON. So, make sure to use
SCX_DSQ_GLOBAL in this case to prevent scheduler crashes.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
05d997c539 scx_rustland: more robust CPU selection logic in the dispatcher
Instead of just trying the target CPU and the previously used CPU, we
could cycle among all the available CPUs (if both those CPUs cannot be
used), before using the global DSQ.

This allows to not de-prioritize too much tasks that can't be scheduled
on the CPU selected by the scheduler (or their previously used CPU), and
we can still dispatch them using SCX_DSQ_LOCAL_ON, like any other task.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
18a990ae82 scx_rustland: assign min_vruntime before time slice evaluation
Assign min_vruntime to the task before the weighted time slice is
evaluated, then add the time slice.

In this way we still ensure that the task's vruntime is in the range
(min_vruntime + 1, min_vruntime + max_slice_ns], but we don't nullify
the effect of the evaluated time slice if the starting vruntime of the
task is too small.

Also change update_enqueued() to return the evaluated weighted time
slice (that can be used in the future).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
92109c95a9 scx_rustland: small TaskTree.push() refactoring
Change TaskTree.push() to accept directly a Task object, rather than
each individual attribute. Moreover, Task attributes don't need to be
public, since both TaskTree and Task are only used locally.

This makes the code more elegant and more readable.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Jordan Rome
661ea57c5c bump scx_rusty and scx_layered
These were supposed to be bumped in this commit:
fed1dae9da
2024-01-04 13:57:29 -08:00
Andrea Righi
96f3eb42be
Merge pull request #68 from sched-ext/scx-rustland-refactoring
scx_rustland: refactoring
2024-01-04 20:42:30 +01:00
Andrea Righi
7813992896 scx_rustland: introduce nr_failed_dispatches
Introduce a new counter to report the amount of failed dispatches: if
the scheduler designates a target CPU for a task, and both the chosen
CPU and the previously utilized one are unavailable when the task is
dispatched, the task will be sent to the global DSQ, and the counter
will be incremented.

Also mark all the methods to access these statistics counters as
optional. In the future we may also provide a "verbose" option and show
these statistics only when the scheduler runs in verbose mode.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-04 17:36:06 +01:00
Andrea Righi
796a7ebc0e scx_rustland: provide an abstraction layer for the BPF component
Move the code responsible for interfacing with the BPF component into
its own module and provide high-level abstractions for the user-space
scheduler, hiding all the internal BPF implementation details.

This makes the user-space scheduler code much more readable and it
allows potential developers/contributors that want to focus at the pure
scheduling details to modify the scheduler in a generic way, without
having to worry about the internal BPF details.

In the future we may even decide to provide the BPF abstraction as a
separate crate, that could be used as a baseline to implement user-space
schedulers in Rust.

API overview
============

The main BPF interface is provided by BpfScheduler(). When this object
is initialized it will take care of registering and initializing the BPF
component.

Then the scheduler can use the BpfScheduler() instance to receive tasks
(in the form of QueuedTask object) and dispatch tasks (in the form of
DispatchedTask objects), using respectively the methods dequeue_task()
and dispatch_task().

The CPU ownership map can be accessed using the method get_cpu_pid(),
this also allows to keep track of the idle and busy CPUs, with the
corrsponding PIDs associated to them.

BPF counters and statistics can be accessed using the methods
nr_*_mut(), in particular nr_queued_mut() and nr_scheduled_mut() can be
updated to notify the BPF component if the user-space scheduler has some
pending work to do or not.

Finally the methods read_bpf_exit_kind() and report_bpf_exit_kind() can
be used respectively to read the exit code and exit message from the BPF
component, when the scheduler is unregistered.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-04 16:49:09 +01:00
Jordan Rome
5bacefcdbe Add README files for each rust scheduler
This because each scheduler has it's own Rust Crate
and it's better if they had a README associated with each one.

https://crates.io/crates/scx_layered
2024-01-04 07:35:44 -08:00
Andrea Righi
7c11837a61 scx_rustland: make dispatcher more robust
We always try to use the current CPU (from the .dispatch() callback) to
run the user-space scheduler itself and if the current CPU is not usable
(according to the cpumask) we just re-use the previouly used CPU.

However, if the previously used CPU is also not usable, we may trigger
the following error:

 sched_ext: runtime error (SCX_DSQ_LOCAL[_ON] verdict target cpu 4 not allowed for scx_rustland[256201])

Potentially this can also happen with any task, so improve the dispatch
logic as following:

 - dispatch on the target CPU, if usable
 - otherwise dispatch on the previously used CPU, if usable
 - otherwise dispatch on the global DSQ

Moreover, rename dispatch_on_cpu() -> dispatch_task() for better
clarity.

This should be enough to handle all the possible decisions made by the
user-space scheduler, making the dispatcher more robust.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-04 10:21:40 +01:00
Andrea Righi
69c1dfc03c scx_rustland: remove unnecessary scx_bpf_dispatch_nr_slots() check
In the dispatch callback we can dispatch tasks to any CPU, according to
the scheduler decisions, so there's no reason to check for the available
dispatch slots in the current CPU only, to determine if we need to stop
dispatching tasks.

Since the scheduler is aware of the idle state of the CPUs (via the CPU
ownership map) it has all the information to automatically regulate the
flow of dispatched tasks and not overflow the dispatch slots, therefore
it is safe to remove this check.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-04 09:41:54 +01:00
Andrea Righi
6b1e7d927d scx_rustland: update comments and documentation in the BPF part
No functional change, only a little polishing, including updates to
comments and documentation to align with the latest changes in the code.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-04 09:40:49 +01:00
Andrea Righi
bb1c32d395 scx_rustland: avoid bypassing the scheduler with pending activities
While bypassing the user-space scheduler can provide some benefits at
reducing the scheduling overhead, doing so underneath the scheduler
while it is actively taking decisions may disrupt its work and have a
negative effect on the overall system performance.

For this reason, activate the logic to bypass the user-space scheduler
only when there is no pending work it.

This change makes the scheduler much more reliable, for example on a
8-cores system it is really easy to trigger short lockups or even
trigger the sched-ext watchdog that kicks out the scheduler, running the
following stress test:

  $ stress-ng -c 128

With this change applied the system remains reasonably responsive and
the scheduler is never disabled by the sched-ext watchdog.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 22:54:14 +01:00
Andrea Righi
5d15d34777 scx_rustland: charge additional time slice to new tasks
Instead of accounting (max_slice_ns / 2) to the vruntime of all the new
tasks, add that to thier regular weighted time delta, as an additional
penalty.

This allows to distinguish new CPU intensive tasks vs new less CPU
intensive tasks, and prioritize the latter over the former.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 22:54:10 +01:00
Andrea Righi
8820af8d36 scx_rustland: enable user-space scheduler to preempt other tasks
Use SCX_ENQ_PREEMPT to dispatch the user-space scheduler. This can help
to mitigate starvation in presence of many cpu hogs (way more than the
amount of available CPUs) running in the system, by giving the scheduler
more chances to drain the amount of tasks that may be starving in a
waiting state.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 22:54:00 +01:00
Andrea Righi
5d9182d9c3 scx_rustland: prioritize interactive workloads
The current implementation of the user-space scheduler is strongly
prioritizing newly created tasks by setting their initial vruntime to
(min_vruntime + 1); this prioritization places them ahead of other tasks
waiting to run.

While this approach is efficient for processing short-lived tasks, it
makes the scheduler vulnerable to fork-bomb attacks and significantly
penalizes interactive workloads (e.g., "foreground" applications), in
particular in the presence of background applications that are spawning
multiple tasks, such as parallel builds.

Instead of prioritizing newly created tasks, do the opposite and account
(max_slice_ns / 2) to their initial vruntime, to make sure they are not
scheduled before the other tasks that are already waiting for the CPU in
the current scheduler run.

This allows to mitigate potential fork-bomb attacks and it strongly
improves the responsiveness of interactive applications (such as UI,
audio/video streams, gaming, etc.).

With this change applied, under certain conditions, scx_rustland can
even outperform the default Linux scheduler.

For example, with a parallel kernel build (make -j32) running in the
background, I can play Terraria with a constant rate of ~30-40 fps,
while the default Linux scheduler can handle only ~20-30 fps under the
same conditions.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 18:28:54 +01:00
Andrea Righi
50b5f6e8c6 scx_rustland: do not update exiting tasks statistics
Avoid updating task information for tasks that are exiting, as they
won't be used by the user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 09:10:20 +01:00
Andrea Righi
b7a9d3775a scx_rustland: schedule non-cpu intensive kthreads normally
With commit a7677fd ("scx_rustland: bypass user-space scheduler for
short-lived kthreads") we were try to mitigate a problem that was
actually introduced by using the wrong formula to evaluate weighted
vruntime, see commit 2900b20 ("scx_rustland: evaluate the proper
vruntime delta").

Reverting that (pseudo-)optimization doesn't seem to introduce any
performance/latency regression and it makes the code more elegant,
therefore drop it.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 07:46:01 +01:00
David Vernet
e8978ebe23
scx_userland: Introduce ops.update_idle() callback
We can sometimes hit scenarios in the scx_userland scheduler where there
is work to be done in user space, but we incorrectly fail to run the
user space scheduler. In order to avoid this, we can use global
variables that are set from both BPF and user space. The BPF-side
variable reflects when one or more tasks have been enqueued, and the
user space-side variable reflects when user space has received tasks but
has not yet dispatched them.

In the ops.update_idle() callback, we can check these variables and send
a resched IPI to a core to ensure that the user-space scheduler is
always scheduled when there's work to be done.

Signed-off-by: David Vernet <void@manifault.com>
2024-01-02 16:29:19 -06:00
Andrea Righi
bcbce040b6 scheds: c: improve build portability
Improve build portability by including asm-generic/errno.h, instead of
linux/errno.h.

The difference between these two headers can be summarized as following:

  - asm-generic/errno.h contains generic error code definitions that are
    intended to be common across different architectures,

  - linux/errno.h includes architecture-specific error codes and
    provides additional (or overrides) error code definitions based on
    the specific architecture where the code is compiled.

Considering the architecture-independent nature of scx, the advantages
of being able to use architecture-specific error codes are marginal or
negligible (and we should probably discourage using them).

Moving towards asm-generic/errno.h, however, allows the removal of
cross-compilation dependencies (such as the gcc-multilib package in
Debian/Ubuntu) and improves the code portability across various
architectures and distributions.

This also allows to remove a symlink hack from the github workflow.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-02 17:39:46 +01:00
David Vernet
d05c7cf6c3
Merge pull request #51 from arighi/virtme-ng-github-workflow
test the schedulers in the github workflow using virtme-ng
2024-01-02 08:43:54 -06:00
Andrea Righi
a09482f0ef scx_rustland: notify user-space scheduler about exiting tasks
Instead of implementing a garbage collector to periodically free up
exiting tasks' resources, implement a proper synchronous mechanism to
notify the user-space scheduler about the exiting tasks from the BPF
component, using the .disable() callback.

When the user-space scheduler receives a queued task with a negative CPU
number, it can then release all the resources associated with that task
(which currently includes only the entry in the TaskInfoMap for now).

This allows to get rid of the TaskInfoMap periodic garbage collector
routine, save a lot of syscalls in procfs (used to check if the pids
were still alive), and improve the overall scheduler performance.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-02 12:57:27 +01:00
Andrea Righi
280796c4bd scx_rustland: small code refactoring
No functional change, make the user-space scheduler code a bit more
readable and more Rust idiomatic.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 19:47:30 +01:00
Andrea Righi
2900b208fe scx_rustland: evaluate the proper vruntime delta
The forumla used to evaluate the weighted time delta is not correct,
it's not considering the weight as a percentage. Fix this by using the
proper formula.

Moreover, take into account also the task weight when evaluating the
maximum time delta to account in vruntime and make sure that we never
charge a task more than slice_ns.

This helps to prevent starvation of low priority tasks.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 19:47:30 +01:00
Andrea Righi
90e92ace2d scx_rustland: prevent starvation handling short-lived tasks properly
Prevent newly created short-lived tasks from starving the other tasks
sitting in the user-space scheduler.

This can be done setting an initial vruntime of (min_vruntime + 1) to
newly scheduled tasks, instead of min_vruntime: this ensures a
progressing global vruntime durig each scheduler run, providing a
priority boost to newer tasks (that is still beneficial for potential
short-lived tasks) while also preventing excessive starvation of the
other tasks sitting in the user-space scheduler, waiting to be
dispatched.

Without this change it is really easy to create a stall condition simply
by forking a bunch of short-lived tasks in a busy loop, with this change
applied the scheduler can handle properly the consistent flow of newly
created short-lived tasks, without introducing any stall.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 16:58:28 +01:00
Andrea Righi
676bd88ada bpf_rustland: do not dispatch the scheduler to the global DSQ
Never dispatch the user-space scheduler to the global DSQ, while all
the other tasks are dispatched to the local per-CPU DSQ.

Since tasks are consumed from the local DSQ first and then from the
global DSQ, we may end up starving the scheduler if we dispatch only
this one on the global DSQ.

In fact it is really easy to trigger a stall with a workload that
triggers many context switches in the system, for example (on a 8 cores
system):

 $ stress-ng --cpu 32 --iomix 4 --vm 2 --vm-bytes 128M --fork 4 --timeout 30s

 ...
 09:28:11 [WARN] EXIT: scx_rustland[1455943] failed to run for 5.275s
 09:28:11 [INFO] Unregister RustLand scheduler

To prevent this from happening also dispatch the user-space scheduler on
the local DSQ, using the current CPU where .dispatch() is called, if
possible, or the previously used CPU otherwise.

Apply the same logic when the scheduler is congested: dispatch on the
previously used CPU using the local DSQ.

In this way all tasks will always get the same "dispatch priority" and
we can prevent the scheduler starvation issue.

Note that with this change in place dispatch_global() is never used and
we can get rid of it.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 14:17:23 +01:00
Andrea Righi
0fc46b2be2 scx_rustland: remove SCX_ENQ_LAST check in is_task_cpu_available()
With commit 49f2e7c ("scx_rustland: enable SCX_OPS_ENQ_LAST") we have
enabled SCX_OPS_ENQ_LAST that seems to save some unnecessary user-space
scheduler activations when the system is mostly idle.

We are also checking for the SCX_ENQ_LAST in the enqueue flags, that
apparently it is not needed and we can achieve the same behavior
dropping this check.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 14:17:23 +01:00
Andrea Righi
840260141d scx_rustland: never account more than slice_ns to vruntime
In any case make sure that we never account more than the maximum
slice_ns to a task's vruntime.

This helps to prevent starving a task for too long in the user-space
scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 14:17:23 +01:00
Andrea Righi
61c77b7d87 scx_rustland: clean up old entries in the task map
The user-space scheduler maintains an internal hash map of tasks
information (indexed by their pid). Tasks are only added to this hash
map and never removed. After running the scheduler for a while we may
experience a performance degration, because the hash map keeps growing.

Therefore implement a mechanism of garbage collector to remove the old
entries from the task map (periodically removing pids that don't exist
anymore).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 14:17:23 +01:00
Andrea Righi
27739065bc scx_rustland: rename variable id -> pos for better clarity
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 14:17:23 +01:00
Andrea Righi
1cdcb8af60 scx_rustland: show the CPU where the scheduler is running
In the scheduler statistics reported periodically to stdout, instead of
showing "pid=0" for the CPU where the scheduler is running (like an idle
CPU), show "[self]".

This helps to identify exactly where the user-space scheduler is running
(when and where it migrates, etc.).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-31 17:03:30 +01:00
Andrea Righi
a7677fdf28 scx_rustland: bypass user-space scheduler for short-lived kthreads
Bypass the user-space scheduler for kthreads that still have more than
half of their runtime budget.

As they are likely to release the CPU soon, granting them a substantial
priority boost can enhance the overall system performance.

In the event that one of these kthreads turns into a CPU hog, it will
deplete its runtime budget and therefore it will be scheduled like
any other normal task through the user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-31 16:40:05 +01:00
Andrea Righi
405a11308e scx_rustland: always use dispatch_on_cpu() when possible
Use dispatch_on_cpu() when possible, so that all tasks dispatched by the
user-space scheduler gets the same priority, instead of having some of
them dispatched to the global DSQ and others dispatched to the per-CPU
DSQ.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-31 16:08:31 +01:00
Andrea Righi
49f2e7ce06 scx_rustland: enable SCX_OPS_ENQ_LAST
Make sure the scheduler is not activated if we are deadling with the
last task running.

This allows to consistency reduce scx_rustland CPU usage in systems that
are mostly idle (and avoid unnecessary power consumption).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-31 16:06:45 +01:00
Andrea Righi
0522219bea scx_rustland: prevent dispatching multiple tasks on the same idle cpu
When a task is dispatched we always try to pick the previously used CPU
(if idle) to minimize the migration overhead. Alternatively, if such CPU
is not available, we pick any other idle CPU in the system.

However, we don't update the list of idle CPUs as we dispatch tasks,
therefore we may end up sending multiple tasks to the same idle CPU (if
their previously used CPU is the same) and we may even skip some idle
CPUs completely.

Change this logic to make sure that we never dispatch multiple tasks to
the same idle CPU, by updating the list of idle CPUs as we send tasks to
the BPF dispatcher.

This also avoids dispatching tasks with a closely matched vruntime to
the same CPU, thereby negating the advantages of the vruntime ordering.
With this change in place, we ensure that tasks with a similar vruntime
are dispatched to different CPUs, leading to significant improvements in
latency performance.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-31 09:37:39 +01:00
Andrea Righi
38145f8dc9 scx_rustland: check CPU selection validity
When the scheduler decides to assign a different CPU to the task always
make sure the assignment is valid according to the task cpumask. If it's
not valid simply dispatch the task to the global DSQ.

This prevents the scheduler from exiting with errors like this:

  09:11:02 [WARN] EXIT: SCX_DSQ_LOCAL[_ON] verdict target cpu 7 not allowed for gcc[440718]

In the future we may want move this check directly into the user-space
scheduler, but for now let's keep this check in the BPF dispatcher as a
quick fix.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-30 10:40:46 +01:00
Andrea Righi
1a2c9f5fd4 scx_rustland: improve scheduler's idle CPU selection
The current CPU selection logic in the scheduler presents some
inefficiencies.

When a task is drained from the BPF queue, the scheduler immediately
checks whether the CPU previously assigned to the task is still idle,
assigning it if it is. Otherwise, it iterates through available CPUs,
always starting from CPU #0, and selects the first idle one without
updating its state. This approach is consistently applied to the entire
batch of tasks drained from the BPF queue, resulting in all of them
being assigned to the same idle CPU (also with a higher likelihood of
allocation to lower CPU ids rather than higher ones).

While dispatching a batch of tasks to the same idle CPU is not
necessarily problematic, a fairer distribution among the list of idle
CPUs would be preferable.

Therefore change the CPU selection logic to distribute tasks equally
among the idle CPUs, still maintaining the preference for the previously
used one. Additionally, apply the CPU selection logic just before tasks
are dispatched, rather than assigning a CPU when tasks are drained from
the BPF queue. This adjustment is important, because tasks may linger in
the scheduler's internal structures for a bit and the idle state of the
CPUs in the system may change during that period.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-30 10:34:08 +01:00
Andrea Righi
e90bc923f9 scx_rustland: introduce nr_waiting concept
We want to activate the user-space scheduler only when there are pending
tasks that require scheduling actions.

To do so we keep track of the queued tasks via nr_queued, that is
incremented in .enqueue() when a task is sent to the user-space
scheduler and decremented in .dispatch() when a task is dispatched.

However, we may trigger an unbalance if the same pid is sent to the
scheduler multiple times (because the scheduler store all the tasks by
their unique pid).

When this happens nr_queued is never decremented back to 0, leading the
user-space scheduler to constantly spin, even if there's no activity to
do.

To prevent this from happening split nr_queued into nr_queued and
nr_scheduled. The former will be updated by the BPF component every time
that a task is sent to the scheduler and it's up to the user-space
scheduler to reset the counter when the queue is fully dreained. The
latter is maintained by the user-space scheduler and represents the
amount of tasks that are still processed by the scheduler and are
waiting to be dispatched.

The sum of nr_queued + nr_scheduled will be called nr_waiting and we can
rely on this metric to determine if the user-space scheduler has some
pending work to do or not.

This change makes rust_rustland more reliable and it strongly reduces
the CPU usage of the user-space scheduler by eliminating a lot of
unnecessary activations.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-29 21:15:04 +01:00
Andrea Righi
d67dfe50f9 scx_rustland: treat the CPU running the user-space scheduler as idle
Considering the CPU where the user-space scheduler is running as busy
doesn't really provide any benefit, since the user-space scheduler is
constantly dispatching an amount of tasks equal to the amount of idle
CPUs and then yields (therefore its own CPU should be considered idle).

Considering the CPU where the user-space scheduler is running as busy
doesn't provide any benefit, as the scheduler consistently dispatches
tasks equal to the number of idle CPUs and then yields (therefore its
own CPU should be considered idle).

This also allows to reduce the overall user-space scheduler CPU
utilization, especially when the system is mostly idle, without
introducing any measurable performance regression.

Measuring the average CPU utilization of a (mostly) idle system over a
time period of 60 sec:

 - wihout this patch: 5.41% avg cpu util
 - with this patch:   2.26% avg cpu util

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-29 21:14:58 +01:00
Andrea Righi
dbc8e23980 scx_userland: flush stdout when printing stats
Periodically flush stdout to help following the scheduler progress
during testing.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-29 15:53:12 +01:00
Andrea Righi
614a1ff901 scx_flatcg: flush stdout when printing stats
Periodically flush stdout to help following the scheduler progress
during testing.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-29 15:53:12 +01:00
Andrea Righi
cc17780c24 scx_rustland: add documentation to scheds/rust/README.md
Add documentation for scx_rustland to the README.md files of the Rust
schedulers.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-29 09:13:54 +01:00
Andrea Righi
6df4d7e0c6 scx_rustland: introduce an update_idle() callback
Move the logic to activate the userspace scheduler to an update_idle()
callback, which is called when the CPU is about to go idle.

This disables the built-in idle tracking mechanism, so it allows to rely
completely on the internal CPU ownership logic (via get_cpu_owner() and
set_cpu_owner()) and it also allows to share the idle state with the
user-space scheduler via the BPF_MAP_TYPE_ARRAY cpu_map.

Moreover, when the user-space scheduler is activated, kick the idle cpu
to trigger immediate dispatch and avoid bubbles in the scheduling
pipeline.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-28 14:41:08 +01:00
Andrea Righi
1baae38e7f Revert "scx_rustland: always dispatch kthreads on the local CPU"
This reverts commit 9237e1d ("scx_rustland: always dispatch kthreads on
the local CPU").

Do not always prioritize all kthreads, we may have unbound workqueue
workers that can consume a lot of CPU cycles (e.g., encryption workers),
so we definitely want to apply the scheduling for those.

Therefore, restore the old behavior to prioritize only per-CPU kthreads.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-28 14:40:03 +01:00
Andrea Righi
9237e1d835 scx_rustland: always dispatch kthreads on the local CPU
Adding extra overhead to any kthread can potentially slow down the
entire system, so make sure this never happens by dispatching all
kthreads directly on the same local CPU (not just the per-CPU kthreads),
bypassing the user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-27 14:15:46 +01:00
Andrea Righi
f0ece7af6b scx_rustland: wake-up user-space scheduler when a CPU is released
Trigger the user-space scheduler only upon a task's CPU release event
(avoiding its activation during each enqueue event) and only if there
are tasks waiting to be processed by the user-space scheduler.

This should save unnecessary calls to the user-space scheduler, reducing
the overall overhead of the scheduler.

Moreover, rename nr_enqueues to nr_queued and store the amount of tasks
currently queued to the user-space scheduler (that are waiting to be
dispatched).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-27 14:15:46 +01:00
Andrea Righi
7d01be9568 scx_rustland: provide get/set_cpu_owner()
Provide the following primitives to get and set CPU ownership in the BPF
part. This improves code readability and these primitives can be used by
the BPF part as a baseline to implement a better CPU idle tracking in
the future.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-27 14:15:39 +01:00
Andrea Righi
cd7e1c6248 scx_rustland: clarify BPF / user-space interlocking
BPF doesn't have full memory model yet, and while strict atomicity might
not be necessary in this context, it is advisable to enhance clarity in
the interlocking model.

To achieve this, provide the following primitives to operate on
usersched_needed:

  static void set_usersched_needed(void)

  static bool test_and_clear_usersched_needed(void)

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-26 14:28:24 +01:00
Andrea Righi
e038a530ae scx_rustland: dispatch tasks in batch
Dispatch tasks in a batch equal to the amount of idle CPUs in the
system.

This allows to reduce the pressure on the dispatcher queues, improving
the effectiveness of the scheduler (by having more tasks sitting in the
scheduler task pool) and mitigating potential priority inversion issues.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-23 10:44:03 +01:00
Andrea Righi
4d98862674 scx_rustland: expose CPU information to the user-space scheduler
Provide an interface for the BPF dispatcher and user-space scheduler to
share CPU information. This information can empower the user-space
scheduler to make more informed decisions and enable the implementation
of a broader range of scheduling policies.

With this change the BPF dispatcher provides a CPU map (one entry per
CPU) that stores the pid that is running on each CPU (0 if the CPU is
idle). The CPU map is updated by the BPF dispatcher in the .running()
and .stopping() callbacks.

The dispatcher then sends to the user-space scheduler a suggestion of
the candidate CPU for each task that needs to run (that is always the
previously used CPU), along with all the task's information.

The user-space scheduler can decide to confirm the selected CPU or to
choose a different one, using all the shared CPU information.

Lastly, the selected CPU is communicated back to the dispatcher along
with all the task's information and the BPF dispatcher takes care of
executing the task on the selected CPU, eventually triggering a
migration.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-23 10:38:56 +01:00
Andrea Righi
968ac80a3f scx_rustland: handle graceful vs non-graceful exit
Do not report an exit error message if it's empty. Moreover, distinguish
between a graceful exit vs a non-graceful exit.

In general, try to follow the behavior of user_exit_info.h for the C
schedulers.

NOTE: in the future the whole exit handling probably can be moved to a
more generic place (scx_utils) to prevent code duplication across
schedulers and also to prevent small inconsistencies like this one.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-22 19:44:14 +01:00
Andrea Righi
f7f0e3236c scx_rustland: rename from scx_rustlite
Rename scx_rustlite to scx_rustland to better represent the mirroring of
scx_userland (in C), but implemented in Rust.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-22 00:20:14 +01:00
Andrea Righi
086c6dffc8 scx_rustlite: simple user-space scheduler written in Rust
This scheduler is made of a BPF component (dispatcher) that implements
the low level sched-ext functionalities and a user-space counterpart
(scheduler), written in Rust, that implements the actual scheduling
policy.

The main goal of this scheduler is to be easy to read and well
documented, so that newcomers (i.e., students, researchers, junior devs,
etc.) can use this as a template to quickly experiment scheduling
theory.

For this reason the design of this scheduler is mostly focused on
simplicity and code readability.

Moreover, the BPF dispatcher is completely agnostic of the particular
scheduling policy implemented by the user-space scheduler. For this
reason developers that are willing to use this scheduler to experiment
scheduling policies should be able to simply modify the Rust component,
without having to deal with any internal kernel / BPF details.

Future improvements:

 - Transfer the responsibility of determining the CPU for executing a
   particular task to the user-space scheduler.

   Right now this logic is still fully implemented in the BPF part and
   the user-space scheduler can only decide the order of execution of
   the tasks, that significantly restricts the scheduling policies that
   can be implemented in the user-space scheduler.

 - Experiment the possibility to send tasks from the user-space
   scheduler to the BPF dispatcher using a batch size, instead of
   draining the task queue completely and sending all the tasks at once
   every single time.

   A batch size should help to reduce the overhead and it should also
   help to reduce the wakeups of the user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-21 18:53:30 +01:00
David Vernet
eb7b3c99f0
Merge pull request #40 from sched-ext/ci
scx: Add CI action that builds schedulers for PRs
2023-12-18 21:17:47 -06:00
David Vernet
4523b10e45
scx: Add CI action that builds schedulers for PRs
When Ubuntu ships with sched_ext, we can also maybe test loading the
schedulers (not sure if the runners can run as root though). For now, we
should at least have a CI job that lets us verify that the schedulers
can _build_. To that end, this patch adds a basic CI action that builds
the schedulers.

As is, this is a bit brittle in that we're having to manually download
and install a few dependencies. I don't see a better way for now without
hosting our own runners with our own containers, but that's a bigger
investment. For now, hopefully this will get us _some_ coverage.

Signed-off-by: David Vernet <void@manifault.com>
2023-12-18 21:12:50 -06:00
David Vernet
318c06fa9c
nest: Skip out of idle cpu selection on exec() path
The core sched code calls select_task_rq() in a few places: the task
wakeup path (typical path), the fork() path, and the exec() path. For
nest scheduling, we don't want to select a core from the nest on the
exec() path. If we were previously able to find an idle core, we would
have found it on the fork() path, so we don't gain much by checking on
the exec() path. In fact, it's actually harmful, because we could
incorrectly blow up the primary nest unnecessarily by bumping the same
task between multiple cores for no reason. Let's just opt-out of
select_task_rq calls on the exec() path.

Suggested-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: David Vernet <void@manifault.com>
2023-12-18 13:51:15 -06:00
David Vernet
ab0e36f9ce
scx_nest: Apply r_impatient if no task is found in primary nest
Julia pointed out that our current implementation of r_impatient is
incorrect. r_impatient is meant to be a mechanism for more aggressively
growing the primary nest if a task repeatedly isn't able to find a core.
Right now, we trigger r_impatient if we're not able to find an attached
or previous core in the primary nest, but we _should_ be triggering it
only if we're unable to find _any_ core in the primary nest. Fixing the
implementation to do this drastically decreases how aggressively we grow
the primary nest when r_impatient is in effect.

Reported-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: David Vernet <void@manifault.com>
2023-12-18 11:05:36 -06:00
Jordan Rome
e9a9d32ab6 Restructure scheds folder names
- combine c and kernel-examples as it's confusing to have both
- rename 'rust-user' and 'c-user' to just 'rust' and 'c', which is simpler
- update and fix sync-to-kernel.sh
2023-12-17 13:14:31 -08:00
Daniel Müller
fed1dae9da rust: Update libbpf-rs & libbpf-cargo to 0.22
This is a follow on to #32, which got reverted. I wrongly assumed that
scx_rusty resides in the sched_ext tree and consumes published version
of scx_utils.
With this change we update the other in-tree dependencies. I built
scx_layered & scx_rusty. I bumped scx-utils to 0.4, because the
libbpf-cargo seems to be part of the public API surface and libbpf-cargo
0.21 and 0.22 are not compatible with each other.

Signed-off-by: Daniel Müller <deso@posteo.net>
2023-12-14 14:33:58 -08:00
David Vernet
b8f70fa09a
Merge pull request #31 from jordalgo/minor-rusty-refactor
minor refactor of scx_rusty
2023-12-14 09:35:18 -06:00
Jordan Rome
ba35b97bb7 minor refactor of scx_rusty 2023-12-14 07:33:53 -08:00
Andrea Righi
dc81311d79 scx_userland: align MAX_ENQUEUED_TASKS to dispatch batch
With commit 48bba8e ("scx_userland: survive to dispatch failures")
scx_useland can better tolerate dispatch failures, so we can reduce a
bit MAX_ENQUEUED_TASKS and align it with the size used in bpf_repeat(),
when tasks are actually dispatched in the bpf counterpart.

This allows reducing the memory footprint of the scheduler and makes it
more consistent between enqueue and dispatch events.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-13 22:25:04 +01:00
Andrea Righi
48bba8e4f6 scx_userland: survive to dispatch failures
If the scheduler fails to dispatch a task we immediately give up,
exiting with an error like the following:

 Failed to dispatch task 251 in 1
 EXIT: BPF scheduler unregistered

This scenario can be simulated decreasing dramatically the value of
MAX_ENQUEUED_TASKS.

We can make the scheduler a little more robust simply by re-adding the
task that cannot be dispatched to vruntime_head and stop dispatching
additional tasks in the same batch.

This can give enough room, under such "dispatch overload" condition, to
catch up and resume the normal execution without crashing.

Moreover, introduce nr_vruntime_failed to report failed dispatch events
in the scheduler's statistics.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-13 22:19:36 +01:00
Andrea Righi
1e9e6778bc scx_userland: allocate tasks array based on kernel.pid_max
Currently the array of enqueued tasks is statically allocated to a fixed
size of USERLAND_MAX_TASKS to avoid potential deadlocks that could be
introduced by performing dynamic allocations in the enqueue path.

However, this also adds a limit on the maximum pid that the scheduler
can handle, since the pid is used as the index to access the array.

In fact, it is quite easy to trigger the following failure on an average
desktop system (making this scheduler pretty much unusable in such
scenario):

 $ sudo scx_userland
 ...
 Failed to enqueue task 33258: No such file or directory
 EXIT: BPF scheduler unregistered

Prevent this by using sysctl's kernel.pid_max as the size of the tasks
array (and still allocate it all at once during initialization).

The downside of this change is that scx_userland may require additional
memory to start and in small systems it could even trigger OOMs. For
this reason add an explicit message to the command help, suggesting to
reduce kernel.pid_max in case of OOM conditions.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-13 17:33:10 +01:00
Tejun Heo
8a07bcc31b Bump versions and add LICENSE symlinks for scx_layered and scx_rusty 2023-12-12 11:21:08 -10:00
Kumar Kartikeya Dwivedi
c4c994c9ce
scx_central: Break dispatch_to_cpu loop when running out of buffer slots
For the case where many tasks being popped from the central queue cannot
be dispatched to the local DSQ of the target CPU, we will keep bouncing
them to the fallback DSQ and continue the dispatch_to_cpu loop until we
find one which can be dispatch to the local DSQ of the target CPU.

In a contrived case, it might be so that all tasks pin themselves to
CPUs != target CPU, and due to their affinity cannot be dispatched to
that CPU's local DSQ. If all of them are filling up the central queue,
then we will keep looping in the dispatch_to_cpu loop and eventually run
out of slots for the dispatch buffer. The nr_mismatched counter will
quickly rise and sched-ext will notice the error and unload the BPF
scheduler.

To remedy this, ensure that we break the dispatch_to_cpu loop when we
can no longer perform a dispatch operation. The outer loop in
central_dispatch for the central CPU should ensure the loop breaks when
we run out of these slots and schedule a self-IPI to the central core,
and allow sched-ext to consume the dispatch buffer before restarting the
dispatch loop again.

A basic way to reproduce this scenario is to do:
taskset -c 0 perf bench sched messaging

The error in the kernel will be:
sched_ext: BPF scheduler "central" errored, disabling
sched_ext: runtime error (dispatch buffer overflow)
bpf_prog_6a473147db3cec67_dispatch_to_cpu+0xc2/0x19a
bpf_prog_c9e51ba75372a829_central_dispatch+0x103/0x1a5

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
2023-12-12 07:50:46 +00:00
Tejun Heo
c9d0cc640a
Merge pull request #22 from arighi/enable-rust-build-option
build: introduce enable_rust build option
2023-12-09 15:59:19 -10:00
Tejun Heo
abbb6a0276
Merge pull request #20 from arighi/scx-rusty-fix
scx_rusty: fix "subtract with overflow" error
2023-12-09 15:58:11 -10:00
Andrea Righi
6343bcf360 build: introduce enable_rust build option
Introduce an option to enable/disable the build of all the Rust
sub-projects.

This can be useful to build scx on those systems where Rust is not
fully supported (e.g., armhf).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-09 15:05:23 +01:00
Andrea Righi
0637b6a0b5 scx_nest: use proper format string for u64 types
This prevents some warnings when building scx_nest on 32-bit
architectures.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-09 14:49:50 +01:00
Andrea Righi
adc01140aa scx_qmap: use proper format string for u64 types
This prevents some warnings when building scx_qmap on 32-bit
architectures.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-09 14:49:44 +01:00
Andrea Righi
4df979ccb7 scx_pair: use proper format string for u64 types
This prevents some warnings when building scx_pair on 32-bit
architectures.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-09 14:49:38 +01:00
Andrea Righi
14e70fd134 scx_flatcg: use proper data size for hweight_gen
We should explicitly use u64 for hweight_gen to prevent the following
build failures on 32-bit architectures:

scheds/kernel-examples/scx_flatcg.p/scx_flatcg.bpf.skel.h: In function ‘scx_flatcg__assert’:
scheds/kernel-examples/scx_flatcg.p/scx_flatcg.bpf.skel.h:3523:9: error: static assertion failed: "unexpected size of \'hweight_gen\'"
 3523 |         _Static_assert(sizeof(s->data->hweight_gen) == 8, "unexpected size of 'hweight_gen'");

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-09 14:49:30 +01:00
Andrea Righi
00c5d2dfb7 scx_qmap: use proper data size for scheduler stats
We should explicitly use u64 for scheduler statistics to prevent the
following build failures on 32-bit architectures:

scheds/kernel-examples/scx_qmap.p/scx_qmap.bpf.skel.h: In function ‘scx_qmap__assert’:
scheds/kernel-examples/scx_qmap.p/scx_qmap.bpf.skel.h:2560:9: error: static assertion failed: "unexpected size of \'nr_enqueued\'"
 2560 |         _Static_assert(sizeof(s->bss->nr_enqueued) == 8, "unexpected size of 'nr_enqueued'");
      |         ^~~~~~~~~~~~~~
scheds/kernel-examples/scx_qmap.p/scx_qmap.bpf.skel.h:2561:9: error: static assertion failed: "unexpected size of \'nr_dispatched\'"
 2561 |         _Static_assert(sizeof(s->bss->nr_dispatched) == 8, "unexpected size of 'nr_dispatched'");
      |         ^~~~~~~~~~~~~~
scheds/kernel-examples/scx_qmap.p/scx_qmap.bpf.skel.h:2562:9: error: static assertion failed: "unexpected size of \'nr_reenqueued\'"
 2562 |         _Static_assert(sizeof(s->bss->nr_reenqueued) == 8, "unexpected size of 'nr_reenqueued'");
      |         ^~~~~~~~~~~~~~
scheds/kernel-examples/scx_qmap.p/scx_qmap.bpf.skel.h:2563:9: error: static assertion failed: "unexpected size of \'nr_dequeued\'"
 2563 |         _Static_assert(sizeof(s->bss->nr_dequeued) == 8, "unexpected size of 'nr_dequeued'");
      |         ^~~~~~~~~~~~~~
scheds/kernel-examples/scx_qmap.p/scx_qmap.bpf.skel.h:2564:9: error: static assertion failed: "unexpected size of \'nr_core_sched_execed\'"
 2564 |         _Static_assert(sizeof(s->bss->nr_core_sched_execed) == 8, "unexpected size of 'nr_core_sched_execed'");
      |         ^~~~~~~~~~~~~~

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-09 14:49:25 +01:00
Andrea Righi
4c65e71c48 scx_central: use proper format string for u64
When printing scheduler statistics we use %lu to print u64 values, that
works well on 64-bit architectures, but on 32-bit architectures we get
errors like the following:

  106 |                 printf("total   :%10lu    local:%10lu   queued:%10lu  lost:%10lu\n",
      |                                  ~~~~^
      |                                      |
      |                                      long unsigned int
      |                                  %10llu
  107 |                        skel->bss->nr_total,
      |                        ~~~~~~~~~~~~~~~~~~~
      |                                 |
      |                                 u64 {aka long long unsigned int}

Fix this by using the proper format %llu.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-09 14:49:20 +01:00
Andrea Righi
e396f1e467 scx_userland: get rid of strings.h include
Use compiler's built-in stack initialization instead of memset().

In this way we can get rid of the string.h include and make
cross-compilation easier in certain small environments (i.e., arm).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-09 14:49:14 +01:00
Andrea Righi
c5d1bc3577 scx_rusty: fix "subtract with overflow" error
It seems that under certain conditions, the difference between the
current and the previous procfs::CpuStat values may become negative,
triggering the following crash/trace:

thread 'main' panicked at /build/rustc-VvCkKl/rustc-1.73.0+dfsg0ubuntu1/library/core/src/ops/arith.rs:217:1:
attempt to subtract with overflow
stack backtrace:
...
  19:     0x590d8481909e - scx_rusty::calc_util::h46f2af9c512c2ecd
                               at /home/arighi/src/scx/scheds/rust-user/scx_rusty/src/main.rs:217:31
  20:     0x590d8481c794 - scx_rusty::Tuner::step::h2e51076f043a8593
                               at /home/arighi/src/scx/scheds/rust-user/scx_rusty/src/main.rs:444:38
  21:     0x590d84828270 - scx_rusty::Scheduler::run::hb5483f1e585f52fe
                               at /home/arighi/src/scx/scheds/rust-user/scx_rusty/src/main.rs:1198:17
  22:     0x590d848289e9 - scx_rusty::main::h9ba8c62ad33aeee1
...

Prevent this by introducing a sub_or_zero() helper function that returns
zero if the difference is negative.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-09 14:47:35 +01:00
David Vernet
c953ee47a6
scx_nest: Reset schedulings when a task is dispatched
In scx_nest, we currently count the number of times that a core is
scheduled for compaction before we eventually just eagerly compact the
core. The idea is that the core could thrash between being scheduled and
then "de-scheduled" for compaction if there are a couple of tasks that
are bouncing between cores in the primary nest often enough to kick them
out of being compacted.

We're currently resetting schedulings when a core is eagerly compacted,
but to be precise we should probably also reset the count when a core
consumes a task from the fallback DSQ, at this indicates that the system
is overcommitted and that we likely won't benefit from compacting the
primary nest.

Signed-off-by: David Vernet <void@manifault.com>
2023-12-08 13:16:40 -06:00
Tejun Heo
11c6a809b2
Merge pull request #12 from sched-ext/scx_nest
scx_nest: Add scx_nest scheduler
2023-12-07 12:19:40 -10:00
David Vernet
ca21842908
scx_nest: Add scx_nest scheduler
The scx_nest scheduler seems to be behaving well. Let's merge it to the
scx repo so that CachyOS can package and use it more easily.

Signed-off-by: David Vernet <void@manifault.com>
2023-12-07 13:28:09 -06:00
David Vernet
b53e8251a1
rusty: Fix calc_util() in rusty
We were assigning curr to prev stats, and vice versa, in calc_util().
This was causing the following crash on debug builds:

[void@maniforge scheds]$ sudo RUST_BACKTRACE=1 scx_rusty
00:00:56 [INFO] CPUs: online/possible = 32/32
00:00:56 [INFO] DOM[00] cpumask 0000000000FF00FF (16 cpus)
00:00:56 [INFO] DOM[01] cpumask 00000000FF00FF00 (16 cpus)
00:00:56 [INFO] Rusty Scheduler Attached
thread 'main' panicked at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/core/src/ops/arith.rs:217:1:
attempt to subtract with overflow
stack backtrace:
   0: rust_begin_unwind
             at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/std/src/panicking.rs:597:5
   1: core::panicking::panic_fmt
             at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/core/src/panicking.rs:72:14
   2: core::panicking::panic
             at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/core/src/panicking.rs:127:5
   3: <u64 as core::ops::arith::Sub>::sub
             at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/core/src/ops/arith.rs:217:1
   4: <&u64 as core::ops::arith::Sub<&u64>>::sub
             at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/core/src/internal_macros.rs:55:17
   5: scx_rusty::calc_util
             at ./rust-user/scx_rusty/src/main.rs:216:29
   6: scx_rusty::Tuner::step
             at ./rust-user/scx_rusty/src/main.rs:444:38
   7: scx_rusty::Scheduler::run
             at ./rust-user/scx_rusty/src/main.rs:1198:17
   8: scx_rusty::main
             at ./rust-user/scx_rusty/src/main.rs:1261:5
   9: core::ops::function::FnOnce::call_once
             at /rustc/475c71da0710fd1d40c046f9cee04b733b5b2b51/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Flip them to avoid the crash. Rusty now runs fine.

Signed-off-by: David Vernet <void@manifault.com>
2023-12-06 18:25:27 -06:00
David Vernet
eba9155a7f
README: Add scheds/ README's
There's a fairly comprehensive README in the kernel's tools/sched_ext
directory which describes each of the example schedulers. Let's pull it
into this repository, and split it into the various subdirectories
containing the kernele-examples/ schedulers, and the rust-user/
schedulers.

Signed-off-by: David Vernet <void@manifault.com>
2023-12-06 16:55:02 -06:00
Tejun Heo
8ee1bc706f scx_central: Implement fallback for missing BPF_F_TIMER_CPU_PIN support 2023-12-05 14:50:45 -10:00
David Vernet
13586dc2ab scx_simple: Don't vtime dispatch to SCX_DSQ_GLOBAL
SCX_DSQ_GLOBAL now does not support vtime dispatching. scx_simple uses
it to do vtime scheduling, so let's update it to create and use a
separate DSQ that it can both FIFO and PRIQ dispatch to.

Signed-off-by: David Vernet <void@manifault.com>
2023-12-04 18:06:47 -06:00
Tejun Heo
c42ba105ca scx_layered: Use p->thread_node to iterate threads in tp_cgroup_attach_task()
tp_cgroup_attach_task() walks p->thread_group to visit all member threads
and set tctx->refresh_layer. However, the upstream kernel has removed
p->thread_group recently in 8e1f385104ac ("kill task_struct->thread_group")
as it was mostly a duplicate of p->signal->thread_head list which goes
through p->thread_node.

Switch to iterate via p->thread_node instead, add a comment explaining why
it's using the cgroup TP instead of scx_ops.cgroup_move(), and make
iteration failure non-fatal as the iteration is racy.
2023-12-04 10:58:06 -10:00
Tejun Heo
d6742a9b1b scx_rusty: Add comment explaining spurious delete_elem failures 2023-12-04 10:51:05 -10:00
Tejun Heo
66150490a1 scx_rusty: Work around spurious task_ctx update failures
As in scx_layered, bpf_map_delete_elem() can fail due to recursion
protection triggering spuriously which can then lead to task_ctx creation
failure after PIDs wrap. Work around by dropping BPF_NOEXIST.
2023-12-03 15:46:29 -10:00
Tejun Heo
bf186f95f3 sync-to-kernel.sh: Drop unused shopt globstar 2023-12-03 15:35:15 -10:00
Tejun Heo
255f614615 sync-to-kernel.sh: Updated to sync to kernel and renamed accordingly
The scx repo is going to serve as the source of truth for sched_ext
schedulers. Reverse the sync direction and include syncing rust-user
schedulers too.
2023-12-03 15:34:26 -10:00
Tejun Heo
1a4734bb4c scx_flatcg: Drop unnecessary include 2023-12-03 14:22:59 -10:00
Tejun Heo
fcab460386 scx_utils: Bump version and publish
include directory structure has changed (a breaking change) and the doc had
a misleading error. Let's cut a new release.
2023-12-03 12:51:16 -10:00
Tejun Heo
d0ed7913b4 scheds: Rearrange include files to match kernel/tools/sched_ext/include
Build scripts are updated accordingly.
2023-12-03 12:47:23 -10:00
Tejun Heo
41a4f6407e build: Add dummy gnu/stubs.h to fix build on systems without 32bit glibc-dev
See comment in sched/include/bpf-compat/gnu/stubs.h for details.
2023-12-02 13:25:02 -10:00
Tejun Heo
0f1ed894bd build: rust projects now link against libbpf.a if provided 2023-12-02 06:41:26 -10:00
Tejun Heo
30f6a38573 build: Pipe down build options to scx_utils::BuildHelpers using env vars 2023-12-01 14:49:32 -10:00
Tejun Heo
6b9c392bf0 build: "meson install" works now 2023-12-01 13:37:28 -10:00
Tejun Heo
a55fc6893b build: Trigger cargo build on rust sub-projects from meson.build 2023-12-01 11:58:56 -10:00
Tejun Heo
01d8351616 rustu: Import scx_rusty and scx_layered from kernel tree 2023-11-30 13:13:41 -10:00
Tejun Heo
302ea57798 scheds: Remove now unnecessary ravg_read.rs.h and relocate sync script 2023-11-28 08:55:41 -10:00
Tejun Heo
68b6d37800 scx: Initial repo setup and import of example schedulers from kernel tree 2023-11-27 14:47:04 -10:00