Commit Graph

144 Commits

Author SHA1 Message Date
Andrea Righi
6d89eceb93 scx_rustland: dispatch tasks only on the global DSQ
Commit c6ada25 ("scx_rustland: use custom pcpu DSQ instead of
SCX_DSQ_LOCAL{_ON}") fixed the race issues with the cpumask, but it also
introduced performance regressions.

Until we figure out the reasons of the performance regressions, simplify
the dispatcher and go back at using only the global DSQ, relying on the
built-in idle cpu selection.

In this way we can still enforce task affinity properly
(`stress-ng --race-sched N` does not crash the scheduler) and we can
also provide a better level of system responsiveness (according to the
results of the stress tests done recently).

The idea of this change is to make the scheduler usable in certain
real-world scenarios (and as bug-free as possible), while we figure out
the performance regressions of the per-CPU DSQ approach, that will
likely be re-introduced later on in the future.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-23 13:24:12 +01:00
Andrea Righi
06b5ff3d2f scx_rustland: clarify the logic to determine interactive tasks
No functional change, simply rewrite the code a bit and update the
comment to clarify the logic to detect interactive tasks and apply the
priority boost.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-23 08:28:44 +01:00
Andrea Righi
ab1c4f66a8 scx_rustland: allow to disable the slice boost completely
Allow to specify `-b 0` to completely disable the slice boost logic and
fallback to standard vruntime-based scheduler with variable time slice.

In this way interactive tasks will not get over-prioritized over the
other tasks in the system.

Having this option can help to easily track down potential performance
regressions arising for over-prioritizing interactive tasks.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-23 00:34:06 +01:00
Andrea Righi
b4269452fc scx_userland: handle preemption events from higher sched_class
Make sure to re-schedule the user-space scheduler if it's preempted by a
task from a higher priority sched_class.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-23 00:34:06 +01:00
Andrea Righi
2426d1024f scx_rustland: increase max amount of enqueued tasks
As the scheduler is progressing towards a more stable and usable state,
it may be subject to heavy stress tests.

For this reason, bump up the limit of MAX_ENQUEUED_TASKS to 8192 in the
BPF component, to be able to sustain task-intensive stress tests,
reducing the risk of potential scheduling congestion conditions.

The downside is a negligible increase in the memory footprint of the BPF
component, that is worth the cost in order to have an improved scheduler
stability.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-21 15:47:35 +01:00
Andrea Righi
28bf96c78e scx_rustland: mitigate unevictable memory page faults
Page faults cannot happen when the user-space scheduler is running,
otherwise we may hit deadlock conditions: a kthread may need to run to
resolve the page fault, but the user-space scheduler is waiting on the
page fault to be resolved => deadlock.

We solved this problem (mostly) in commit 9708a80 ("scx_userland: use a
custom memory allocator to prevent page faults"), introducing a custom
allocator for the user-space scheduler that operates on a pre-allocated
mlocked memory buffer, but there is an exception that can still trigger
page faults: kcompactd.

When memory compaction is enabled, specifically with
vm.compact_unevictable_allowed=1 (which is often the default in many
distributions), kcompactd regularly attempts to compact all memory
zones, such that free memory is available in contiguous blocks where
feasible, including unevictable memory as well.

In the event that kcompactd remaps pages within the user-space
scheduler's address space, it can lead to page faults, resulting in a
potential deadlock.

To prevent this from happening automatically set
vm.compact_unevictable_allowed=0 when the scheduler is loaded and
restore the previous value when the scheduler in unloaded. In this way
we can prevent kcompactd from touching the unevictable memory associated
to the user-space scheduler.

Keep in mind that this is not a full bullet proof solution: something
else in the system may still set vm.compact_unevictable_allowed=1 while
the scheduler is running, re-enabling the risk of deadlock.

Ideally we would need a way to mark the user-space scheduler memory as
"really unevictable", or a proper kernel ABI to instruct kcompactd to
exclude certain tasks (or better, cgroups) from its proactive memory
compaction actions, but since then, this seems to be the best way to
mitigate this issue.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-21 15:47:35 +01:00
David Vernet
c6ada251ef scx_rustland: use custom pcpu DSQ instead of SCX_DSQ_LOCAL{_ON}
We still don't have a reliable and non-racy way to manage cpumasks from
the user-space scheduler, so it is quite hard for the scheduler to
enforce the proper CPU affinity behavior.

Despite checking the cpumask in the BPF part, tasks may still be
assigned to a CPU that they cannot use, triggering scheduler errors.

For example, it is really easy to crash the scheduler with a simple CPU
affinity stress test (`stress-ng --race-sched 8 --timeout 5`):

  14:51:28 [WARN] FAIL: SCX_DSQ_LOCAL[_ON] verdict target cpu 1 not allowed for stress-ng-race-[567048] (err=1024)

To prevent this issue from happening, create custom DSQ for each CPU
available in the system and use these per-CPU DSQs to dispatch all the
tasks processed by the user-space scheduler, including the user-space
scheduler itself.

Then consume the these DSQs from the .dispatch() callback of the
respective CPU, to transfer all the tasks to the consuming CPU's local
DSQ, preventing the cpumask race condition encountered using
SCX_DSQ_LOCAL_ON.

With this patch applied the `stress-ng --race-sched N` stress test can
be executed successfully (even with large values of N) without causing
the scheduler to crash.

Signed-off-by: David Vernet <void@manifault.com>
[ arighi: kick target cpu to improve responsiveness, update comments ]
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-21 15:47:35 +01:00
Jordan Rome
9f9a97a97f Update descriptions in cargo toml files 2024-01-19 18:19:46 -08:00
Andrea Righi
24ef0f6c00
Merge pull request #94 from sched-ext/scx-rustland-smt-improvements
scx-rustland: SMT improvements
2024-01-17 21:01:26 +01:00
Andrea Righi
be1cb8774b scx_rustland: improve SMT performance
The user-space scheduler dispatches tasks in batches, with the batch
size matching the number of idle CPUs.

Commit 791bdbe ("scx_rustland: introduce SMT support") changed the order
of idle CPUs, prioritizing dispatching tasks on the least busy cores
(those with the most idle CPUs) before moving on to busier cores (those
with the least idle CPUs).

While this approach works well for a small number of tasks, it can lead
to uneven performance as the number of tasks increases and all cores are
saturated. Such uneven performance can be attributed to SMT interactions
causing potential short lags and erratic system performance. In some
cases, disabling SMT entirely results in better system responsiveness.

To address this issue, instruct the scheduler to implicitly disable SMT
and consistently dispatch tasks only on the first (or last) CPU of each
core. This approach ensures an equal distribution of tasks among the
available cores, preventing SMT disturbances and aligning with non-SMT
performance, also when a significant amount of tasks are running.

Additionally, the unused sibling CPUs within each core can be used as
"spare" CPUs for the BPF dispatcher. This is particularly beneficial for
tasks that cannot be dispatched on the target CPU selected by the
scheduler, due to cpumask restrictions or congestion conditions.

Therefore, this new approach allows to enhance system responsiveness on
SMT systems, while simultaneously improving scheduler stability.

Some preliminary results on an AMD Ryzen 7 5800X 8-Cores (SMT enabled):
running my usual benchmark of measuring the fps of a videogame
(Counter-Strike 2) during a parallel kernel build-induced system
overload, shows an improvement of approximately 2x (from 8-10fps to
15-25fps vs 1-2fps with EEVDF).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-17 20:49:17 +01:00
Andrea Righi
f0c33320ab scx_rustland: avoid calling scx_bpf_kick_cpu() from update_idle()
Prior to commit 676bd88 ("bpf_rustland: do not dispatch the scheduler to
the global DSQ"), the user-space scheduler was dispatched using
SCX_DSQ_GLOBAL and we needed to explicitly kick idle CPUs from
update_idle() to ensure that at least one CPU was available to run the
user-space scheduler.

Now that we are using SCX_DSQ_LOCAL_ON|cpu to dispatch the user-space
scheduler, the target CPU is implicitly kicked. Therefore, the call to
scx_bpf_kick_cpu() within .update_idle() becomes redundant and we can
get rid of it.

Fixes: 676bd88 ("bpf_rustland: do not dispatch the scheduler to the global DSQ")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-17 20:49:17 +01:00
Tejun Heo
9089cc09bb
Merge pull request #92 from sched-ext/nest_callbacks
scx_nest: Set timer callback after cancelling
2024-01-17 09:27:22 -10:00
David Vernet
7a3fe759f2
scx_nest: Remove -D option for eager compaction
Now that scheduling BPF timers works correctly, we don't need this extra
logic to eagerly compact if a scheduling for compaction has happened a
few times in a row. Let's remove it.

Signed-off-by: David Vernet <void@manifault.com>
2024-01-16 14:08:36 -06:00
David Vernet
607119d8a4
scx_nest: Set timer callback after cancelling
In scx_nest, we use a per-cpu BPF timer to schedule compaction for a
primary core before it goes idle. If a task comes along that could use
that core, we cancel the callback with bpf_timer_cancel().
bpf_timer_cancel() drops a refcnt on the prog and nullifies the
callback, so if we want to schedule the callback again, we must use
bpf_timer_set_callback() to reset the prog. This patch does that.

Reported-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: David Vernet <void@manifault.com>
2024-01-16 14:01:39 -06:00
Andrea Righi
0b3c399519 scx_rustland: introduce dynamic slice boost
Update the slice boost dynamically, as a function of the amount of CPUs
in the system and the amount of tasks currently waiting to be
dispatched: as the amount of waiting tasks in the task_pool increases,
reduce the slice boost.

This adjustment ensures that the scheduler adheres more closely to a
pure vruntime-based policy as the amount of tasks contending the
available CPUs increases and it allows to sustain stress tests that are
spawning a massive amount of tasks.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-16 11:51:51 +01:00
Andrea Righi
791bdbec97 scx_rustland: introduce SMT support
Introduce a basic support of CPU topology awareness. With this change,
the scheduler will prioritize dispatching tasks to idle CPUs with fewer
busy SMT siblings, then, it will proceed to CPUs with more busy SMT
siblings, in ascending order.

To implement this, introduce a new CoreMapping abstraction, that
provides a mapping of the available core IDs in the system along with
their corresponding lists of CPU IDs. This, coupled with the
get_cpu_pid() method from the BpfScheduler abstraction, allows the
user-space scheduler to enforce the policy outlined above and improve
performance on SMT systems.

Keep in mind that this improvement is relevent only when the amount of
tasks running in the system is less than the amount of CPUs. As soon as
the amount of running tasks increases, they will be distributed across
all available CPUs and cores, thereby negating the advantages of SMT
isolation.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-16 11:33:35 +01:00
Andrea Righi
63209b865d scx_rustland: support aligned allocations in RustLandAllocator
Even if the current implementation of the user-space scheduler doesn't
require to allocate aligned memory, add a simple support to aligned
allocations in RustLandAllocator, in order to make it more generic and
potentially usable by other schedulers / components.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-15 13:44:33 +01:00
Andrea Righi
c593e3605e scx_rustland: report user-space scheduler page fault counter
Periodically report a page fault counter in the scheduler output. The
user-space scheduler should never trigger page faults, otherwise we may
experience deadlocks (that would trigger the sched-ext watchdog,
unloading the scheduler).

Reporting a page fault counter periodically to stdout can be really
helpful to debug potential issues with the custom allocator.

Moreover, group together also nr_sched_congested and
nr_failed_dispatches with nr_page_faults and use the sum of all these
counters to determine the healthy status of the user-space scheduler
(reporting it to stdout as well).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-14 22:07:37 +01:00
Andrea Righi
9708a80130 scx_userland: use a custom memory allocator to prevent page faults
To prevent potential deadlock conditions under heavy loads, any
scheduler that delegates scheduling decisions to user-space should avoid
triggering page faults.

To address this issue, replace the default Rust allocator with a custom
one (RustLandAllocator), designed to operate on a pre-allocated buffer.

This, coupled with the memory locking (via mlockall), prevents page
faults from happening during the execution of the user-space scheduler,
avoiding the deadlock condition.

This memory allocator is completely transparent to the user-space
scheduler code and it is applied automatically when the bpf module is
imported.

In the future we may decide to move this allocator to a more generic
place (scx_utils crate), so that also other user-space Rust schedulers
can use it.

This initial implementation of the RustLandAllocator is very simple: a
basic block-based allocator that uses an array to track the status of
each memory block (allocated or free).

This allocator can be improved in the future, but right now, despite its
simplicity, it shows a reasonable speed and efficiency in meeting memory
requests from the user-space scheduler, having to deal mostly with small
and uniformly sized allocations.

With this change in place scx_rustland survived more than 10hrs on a
heavily stressed system (with stress-ng and kernel builds running in a
loop):

 $ ps -o pid,rss,etime,cmd -p `pidof scx_rustland`
     PID   RSS     ELAPSED CMD
   34966 75840    10:00:44 ./build/scheds/rust/scx_rustland/debug/scx_rustland

Without this change it is possible to trigger the sched-ext watchdog
timeout in less than 5min, under the same system load conditions.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-14 22:07:37 +01:00
Andrea Righi
acc1d51560 scx_rustland: remove obsolete TODO note
Entries from TaskInfoMap associated to exiting tasks are already removed
via the BPF .exit_task() callback, so drop the obsolete TODO note and
replace it with a proper comment.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-11 20:47:36 +01:00
Andrea Righi
12d89e1d84 scx_rustland: add a troubleshooting section
Add a brief troubleshooting section to the command line help.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-11 18:14:46 +01:00
Andrea Righi
2157f638df scx_rustland: voluntary context switch boost
Improve priority boosting using voluntary context switches metric.

Overview
========

The current criteria to apply the time slice boost (option `-b`) is to
distinguish between newly created tasks and tasks that are already
running: in order to prioritize interactive applications (games,
multimedia, etc.) we apply a time slice usage penalty on newly created
tasks, indirectly boosting the priority of tasks that are already
running, which are likely to be the interactive applications that we
aim to prioritize.

Problem
=======

This approach works well when the background workload forks a bunch of
short-lived tasks (e.g., a parallel kernel build), but it fails to
properly classify CPU-intensive background tasks (i.e., video/3D
rendering, encryption, large data analysis, etc.), because these
applications, typically, do not generate many short-lived processes.

In presence of such workloads the time slice penalty is not enforced,
resulting in a lack of any boost for interactive applications.

Solution
========

A more effective critiria for distinguishing between interactive
applications and background CPU-intensive applications is to examine the
voluntary context switches: an application that periodically releases
the CPU voluntarily is very likely to be interactive.

Therefore, change the time slice boost logic to apply a bonus (scale down
the accounted used time slice) to tasks that show an increase in their
voluntary context switches counter over a time frame of 10 sec.

Based on experimental results, this simple heurstic appears to be quite
effective in classifying interactive tasks and prioritize them over
potential background CPU-intensive tasks.

Additionally, having a better criteria to identify interactive tasks
allow to prioritize also newly created tasks, thereby enhancing the
responsiveness of interactive shell sessions.

This always ensures the prompt execution of system commands, even when
the system is massively overloaded, unlike the previous time slice boost
logic, which made interactive shell sessions less responsive by
deprioritizing newly created tasks.

Results
=======

With this new logic in place it is possible to play a video game (e.g.,
Terraria) without experiencing any frame rate drop (60 fps), while a
parallel CPU stress test (`stress-ng -c 32`) is running in the
background. The same result can also be obtained with a parallel kernel
build (`make -j 32`). Thus, there is no regression compared to the
previous "ideal" test case.

Even when mixing both workloads (`make -j 16` + `stress-ng -c 16`),
Terraria can still be played without noticeable lag in the audio or
video, maintaining a consistent 60 fps.

In addition to that, shell commands are also very responsive.

Following, the results (average and standard deviation of 10 runs) of
two simple interactive shell commands, while both the `make -j 16` and
`stress-ng -c 16` workloads are running in background:

  avg time           "uname -r"       "ps axuw > /dev/null"
  =========================================================
  EEVDF                 11.1ms                     231.8ms
  scx_rustland           2.6ms                     212.0ms

  stdev              "uname -r"       "ps axuw > /dev/null"
  =========================================================
  EEVDF                   2.28                       23.41
  scx_rustland            0.70                        9.11

Tests conducted on a 8-cores laptop (11th Gen Intel i7-1195G7 @
4.800GHz) with 16GB of RAM.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-11 18:14:30 +01:00
Andrea Righi
1cf03770c7 scx_rustland: expose voluntary context switches to the scheduler
Provide the number of voluntary context switches (nvcsw) for each task
to the user-space scheduler.

This extra information can then be used by the scheduler to enhance its
decision-making process when scheduling tasks.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-11 14:10:39 +01:00
Tejun Heo
1395f14975
Update README.md
Embed the video and drop "live" from section title as it's not really live.
2024-01-10 14:47:33 -10:00
Tejun Heo
18f7fe8477 scx_flatcg: Fix fallout from direct dispatch API update
552b75a9c7 ("scx: Build fix after kernel update") updated scx_flatcg along
with other schedulers to use the new direct dispatching from
ops.select_cpu() mechanism. However, this was buggy for flatcg.

flatcg uses direct dispatch for two purposes - as an optimization when there
are idle cpus and to avoid dealing with custom CPU affinities in the
dispatch logic. While the former can be moved to ops.select_cpu(), the
latter can't as it should also apply to tasks which get enqueued without
preceding ops.select_cpu(), e.g., when the task gets requeued after an
attribute change or runs out of time slice. The API update incorrectly moved
both to ops.select_cpu() leading to futile retries of try_pick_next_cgroup()
and scheduling misbheaviors.

Fix it by separating out the two cases and only keeping the idle
optimization case in ops.select_cpu().

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-10 10:57:50 -10:00
Tejun Heo
c1f22ea073 scx_flatcg: Report pick_next_cgroup() race and fail counts
To improve visibility into failure mode. While at it, improve output
formatting.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-10 10:52:24 -10:00
Tejun Heo
ae50b155ca
Merge pull request #80 from sched-ext/scx-flatcg-mitigate-stall
scx_flatcg: introduce CGROUP_MAX_RETRIES
2024-01-10 09:49:09 -10:00
Andrea Righi
0609abdca6 scx_flatcg: introduce CGROUP_MAX_RETRIES
We may end up stalling for too long in fcg_dispatch() if
try_pick_next_cgroup() doesn't find another valid cgroup to pick. This
can be quite risky, considering that we are holding the rq lock in
dispatch().

This condition can be reproduced easily in our CI, where we can trigger
stalling softirq works:

[    4.972926] NOHZ tick-stop error: local softirq work is pending, handler #200!!!

Or rcu stalls:

[   47.731900] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   47.731900] rcu:     1-...!: (0 ticks this GP) idle=b29c/1/0x4000000000000000 softirq=2204/2204 fqs=0
[   47.731900] rcu:     3-...!: (0 ticks this GP) idle=db74/1/0x4000000000000000 softirq=2286/2286 fqs=0
[   47.731900] rcu:     (detected by 0, t=26002 jiffies, g=6029, q=54 ncpus=4)
[   47.731900] Sending NMI from CPU 0 to CPUs 1:

To mitigate this issue reduce the amount of try_pick_next_cgroup()
retries from BPF_MAX_LOOPS (8M) to CGROUP_MAX_RETRIES (1024).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-10 17:36:17 +01:00
Andrea Righi
0198d893ce scx_rustland: introduce time slice boost parameter
Introduce a parameter to prioritize active running tasks over newly
created tasks.

This option can be used to enhance interactive applications (e.g.,
games, audio/video, GUIs, etc.) that are concurrently running with
fork-intensive background workloads (such as a large parallel build for
example).

The boost value (which functions as a penalty) is applied to the time
slice attributed to newly generated tasks, increasing their vruntime
and, in an indirect manner, "boosting" the priority of all the other
concurrent active tasks.

The time slice boost parameter was applied in the live demo video [1] to
enhance the frames per second (fps) of a video game (Terraria), running
simultaneously with a parallel kernel build (`make -j 32`) on an 8-core
laptop (the value used in the video matches the existing setting of
running `scx_rustland -b 200`).

[1] https://www.youtube.com/watch?v=oCfVbz9jvVQ

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-10 17:32:29 +01:00
Andrea Righi
732ba4900b scx_rustland: avoid using SCX_ENQ_PREEMPT
With the introduction of a the dynamic time slice that scales down based
on the number of tasks in the system, there is no obvious benefit in
utilizing SCX_ENQ_PREEMPT to dispatch the user-space scheduler.

The reduced time slice as the task count increases already enhances the
user-space scheduler's opportunities to run and efficiently manage
scheduling tasks, even when the system is massively overloaded.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-10 17:32:29 +01:00
Andrea Righi
db9a29d618 scx_rustland: improve dynamic slice scaling
Move scaling after tasks are sent to the dispatcher: tasks are
dispatched based on the amount of idle CPUs, so checking for any
remaining tasks still sitting in the scheduler after dispatch gives a
better idea how busy the system is.

Moreover, do not scale the time slice based on nr_cpus (otherwise,
systems with a large amount of CPUs would rarely get any scaling at
all).

Instead, apply a scaling factor as a function of how many tasks are
still waiting in the scheduler: nr_scheduled / 2. This method scales
better as the number of CPUs increases.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-09 22:11:07 +01:00
Andrea Righi
1da2983804 scx_rustland: get rid of force_local
Now that we can dispatch directly from select_cpu() we can make the code
more compact and readable by removing the force_local logic.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-09 22:11:07 +01:00
Andrea Righi
6ead675fb6 scx_rustland: add a link to the live demo in the README
Update the README.md adding a link to a live demo video of the
scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-09 22:11:07 +01:00
Tejun Heo
942b0269b8 Bump versions
After updates to reflect the updated init and direct dispatch API, the
schedulers aren't compatible with older kernels. Bump versions and publish
releases.
2024-01-08 18:49:54 -10:00
Tejun Heo
552b75a9c7 scx: Build fix after kernel update
In the latest kernel, sched_ext API has changed in two areas:

- ops.prep_enable/cancel_enable/enable/disable() replaced with
  ops.init_task/enable/disable/exit_task().

- scx_bpf_dispatch() can now be called from ops.select_cpu(). Also,
  SCX_ENQ_LOCAL flag is removed. Instead, users can call
  scx_bpf_select_cpu_dfl() from ops.select_cpu() and use the @is_idle out
  param value to determine whether to dispatch directly.

This commit updates all schedules so that they build.

- Init functions renamed / merged / split.

- ops.select_cpu() is added to several schedulers and local direct
  disptching logic is moved there.

This is the minimum update which is need to make the schedulers build and
work. It needs further update to e.g. move vtime udpates to ops.enable().
2024-01-08 14:48:24 -10:00
Andrea Righi
1ea5aebfb4 scx_rustland: always consider slice_ns as maximum time slice
With the introduction of a the dynamic time slice that scales down based
on the number of tasks in the system, there is no need anymore to apply
a constant scaling factor to time slice to extend the range of the
allowed time slices.

Therefore, get rid of the static scaling and use slice_ns as the upper
limit for the time slice accounted to the tasks.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 19:22:38 +01:00
Andrea Righi
9b482f48f1 scx_rustland: determine the amount of cores via /proc/stat
libbpf_rs::num_possible_cpus() may take into account multi-threads
multi-cores information, that are not used efficiently by the scheduler
at the moment.

For simplicity rely on /proc/stat to determine the amount of CPUs that
can be used by the scheduler and provide a proper abstraction to access
this information from the bpf Rust module.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 19:11:25 +01:00
Andrea Righi
0d107d6220 scx_rustland: return the proper cpu value from get_task_cpu()
Fix the ternary operator expression to return the CPU id, instead of the
boolean result of the condition.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 19:10:59 +01:00
Andrea Righi
fa6915cc0a scx_rustland: simplify update_enqueued()
With the introduction of a variable time slice that scales down in
function of the amount of waiting tasks, the scheduler is able to handle
a steady stream of newly spawned tasks, without having to de-prioritize
them to guarantee a good level of system responsiveness.

Hence, the logic for de-prioritizing new tasks can be removed, as it
currently doesn't provide any measurable benefits. In fact, it even
proves counterproductive as it can implicitly slow down the interactive
performance of shell sessions when the system is overloaded with a
significant amount of CPU hogs (e.g, `stress-ng -c 128`).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 07:38:52 +01:00
Andrea Righi
bf98154ee1 scx_rustland: use dynamic time slice in the user-space scheduler
Implement a simple logic in the user-space scheduler to automatically
adjust the tasks' time slice: reduce the time slice by a scaling factor
of (nr_waiting / nr_cpus + 1), where nr_waiting is the amount of tasks
waiting in the scheduler and nr_cpus is the amount of CPUs in the
system.

Using a fine-grained time slice as the number of tasks in the system
grows, improves responsiveness of low-latency activities (e.g., audio,
video games), also in presence of other CPU-intensive tasks that are
concurrently running in the system.

On the other hand, extending the time slice when only a limited number
of tasks are active in the system contributes to an enhancement in the
overall system throughput and a reduced amount of context switches.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 07:38:52 +01:00
Andrea Righi
303c4ea548 scx_rustland: dynamic time slice support
Add to BpfScheduler() the new methods set_effective_slice_us() and
get_effective_slice_us().

These methods can be used by the user-space scheduler to dynamically
adjust (and retrieve) the effective time slice used to dispatch tasks
within the BPF dispatcher.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 07:35:31 +01:00
Andrea Righi
2a32d81859 scx_rustland: store default slice_ns in the scheduler class
Cache slice_ns into the main scheduler class to avoid accessing it via
self.bpf.skel.rodata().slice_ns every single time.

This also makes the scheduler code more clear and more abstracted from
the BPF details.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-07 16:14:51 +01:00
Andrea Righi
8ccbbdadee scx_userland: improve BPF logging
Always report task comm, nr_queued and nr_scheduled in the log messages.
Moreover, report also task name (comm) and cpu when possible.

All these extra information can be really helpful to trace and debug
scheduling issues.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-07 16:14:51 +01:00
Andrea Righi
295873ac41 scx_rustland: always dispatch per-CPU kthreads from enqueue
We allow tasks to bypass the user-space scheduler and be dispatched
directly using a shortcut in the enqueue path, if their running CPU is
immediately available or if the task is per-CPU kthread.

However, the shortcut is disabled if the user-space scheduler has some
pending activities to do (to avoid disrupting too much its decision).

In this case the shortcut is disabled also for per-CPU kthreads and that
may cause priority-inversion problems in the system, triggering some
stall of some per-CPU kthreads (such as rcuog/N) and short system
lockups, if the system is overloaded.

Prevent this by always enabing the dispatch shortcut for per-CPU
kthreads.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
0c3bdb16fe scx_rustland: prevent using SCX_DSQ_LOCAL_ON from enqueue()
When we fail to push a task to the queued BPF map we fallback to direct
dispatch, but we can't use SCX_DSQ_LOCAL_ON. So, make sure to use
SCX_DSQ_GLOBAL in this case to prevent scheduler crashes.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
05d997c539 scx_rustland: more robust CPU selection logic in the dispatcher
Instead of just trying the target CPU and the previously used CPU, we
could cycle among all the available CPUs (if both those CPUs cannot be
used), before using the global DSQ.

This allows to not de-prioritize too much tasks that can't be scheduled
on the CPU selected by the scheduler (or their previously used CPU), and
we can still dispatch them using SCX_DSQ_LOCAL_ON, like any other task.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
18a990ae82 scx_rustland: assign min_vruntime before time slice evaluation
Assign min_vruntime to the task before the weighted time slice is
evaluated, then add the time slice.

In this way we still ensure that the task's vruntime is in the range
(min_vruntime + 1, min_vruntime + max_slice_ns], but we don't nullify
the effect of the evaluated time slice if the starting vruntime of the
task is too small.

Also change update_enqueued() to return the evaluated weighted time
slice (that can be used in the future).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
92109c95a9 scx_rustland: small TaskTree.push() refactoring
Change TaskTree.push() to accept directly a Task object, rather than
each individual attribute. Moreover, Task attributes don't need to be
public, since both TaskTree and Task are only used locally.

This makes the code more elegant and more readable.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Jordan Rome
661ea57c5c bump scx_rusty and scx_layered
These were supposed to be bumped in this commit:
fed1dae9da
2024-01-04 13:57:29 -08:00
Andrea Righi
96f3eb42be
Merge pull request #68 from sched-ext/scx-rustland-refactoring
scx_rustland: refactoring
2024-01-04 20:42:30 +01:00