scx_rustland: introduce virtual time slice

Overview
========

Currently, a task's time slice is determined based on the total number
of tasks waiting to be scheduled: the more overloaded the system, the
shorter the time slice.

This approach can help reduce the average wait time of all tasks: they
progress more slowly, but uniformly, which provides smoother overall
system performance.
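
For reference, this is roughly the scaling rule in use today, shown here
as a stand-alone sketch of the effective_slice_ns() helper that this
patch removes (the free-function form and the NSEC_PER_MSEC constant are
simplified for illustration):

  const NSEC_PER_MSEC: u64 = 1_000_000;

  // The more tasks are waiting to be scheduled, the smaller the returned
  // time slice, clamped to a minimum of 0.25 ms.
  fn effective_slice_ns(slice_ns: u64, nr_scheduled: u64) -> u64 {
      let scaling = ((nr_scheduled + 1) / 2).max(1);
      (slice_ns / scaling).max(NSEC_PER_MSEC / 4)
  }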

However, under heavy system load, this approach can lead to very short
time slices distributed among all tasks, causing excessive context
switches that can badly affect soft real-time workloads.

Moreover, the scheduler tends to operate in a bursty manner (tasks are
queued and dispatched in bursts). This can also result in fluctuations
of longer and shorter time slices, depending on the number of tasks
still waiting in the scheduler's queue.

Such behavior can also negatively impact soft real-time workloads, such
as real-time audio processing.

Virtual time slice
==================

To mitigate this problem, introduce the concept of virtual time slice:
the idea is to evaluate the optimal time slice of a task, considering
the vruntime as a deadline for the task to complete its work before
releasing the CPU.

This is accomplished by calculating the difference between the task's
vruntime and the current global vruntime and using this value as the
task's time slice:

  task_slice = task_vruntime - min_vruntime

In this way, tasks that "promise" to release the CPU quickly (based on
their previous work pattern) get a much higher priority (due to
vruntime-based scheduling and the additional priority boost for being
classified as interactive), but they are also given a shorter time slice
to complete their work and fulfill their promise of rapidity.

At the same time, more CPU-intensive tasks are de-prioritized, but they
tend to receive a longer time slice, which reduces the number of context
switches that could hurt their performance.

In conclusion, latency-sensitive tasks get a high priority and a short
time slice (and they can preempt other tasks), while CPU-intensive tasks
get a low priority and a long time slice.
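
As a minimal sketch of this rule (an illustrative helper, not the
scheduler's actual code; it reuses the NSEC_PER_MSEC constant and the
0.25 ms lower bound that appear in the patch below):

  const NSEC_PER_MSEC: u64 = 1_000_000;

  // Tasks are dispatched in vruntime order, so task_vruntime >= min_vruntime.
  // The clamp prevents slices so short that context switch overhead dominates.
  fn virtual_time_slice_ns(task_vruntime: u64, min_vruntime: u64) -> u64 {
      (task_vruntime - min_vruntime).max(NSEC_PER_MSEC / 4)
  }

After dispatching a task, the scheduler also advances the global minimum
vruntime to that task's vruntime, so the next task's slice is measured
from the new baseline.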

Example
=======

Let's consider the following theoretical scenario:

 task | time
 -----+-----
   A  | 1
   B  | 3
   C  | 6
   D  | 6

In this case task A represents a short interactive task, tasks C and D
are CPU-intensive tasks, and task B is mainly interactive, but it also
requires some CPU time.

With a uniform time slice, scaled by the number of tasks, the
scheduling looks like this (assuming the time slice is 2):

 A B B C C D D A B C C D D C C D D
  |   |   |   | | |   |   |   |
  `---`---`---`-`-`---`---`---`----> 9 context switches

With the virtual time slice the scheduling changes to this:

 A B B C C C D A B C C C D D D D D
  |   |     | | | |     |
  `---`-----`-`-`-`-----`----------> 7 context switches

In the latter scenario, tasks do not receive the same time slice scaled
by the total number of tasks waiting to be scheduled. Instead, their
time slice is adjusted based on their previous CPU usage. Tasks that
used more CPU time are given longer slices and their processing time
tends to be packed together, reducing the amount of context switches.

Meanwhile, latency-sensitive tasks can still run as soon as they need
to, because they get a higher priority and can preempt other tasks.
However, they will get a short time slice, so tasks that were
incorrectly classified as interactive will still be forced to release
the CPU quickly.
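
To make the example concrete, here is a small self-contained sketch that
prints the slice each task would be granted; mapping the "time" column
above to accumulated vruntime (in ms) and starting from min_vruntime = 0
are assumptions made purely for illustration:

  const NSEC_PER_MSEC: u64 = 1_000_000;

  fn main() {
      let min_vruntime: u64 = 0;
      // (task, accumulated vruntime in ms), loosely following the table above.
      for (task, vruntime_ms) in [("A", 1u64), ("B", 3), ("C", 6), ("D", 6)] {
          let vruntime = vruntime_ms * NSEC_PER_MSEC;
          // Same rule as above: slice = vruntime - min_vruntime, clamped to 0.25 ms.
          let slice_ns = (vruntime - min_vruntime).max(NSEC_PER_MSEC / 4);
          println!("task {}: slice = {} ms", task, slice_ns as f64 / NSEC_PER_MSEC as f64);
      }
  }

Task A ends up with the shortest slice (1 ms), while C and D get 6 ms
each, which is what packs their processing time together in the second
diagram.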

Experimental results
====================

This patch has been tested on an AMD Ryzen 7 5800X (8 cores / 16 threads
with SMT), 16GB RAM, NVIDIA GeForce RTX 3070.

The test case involves the usual benchmark of playing a video game while
simultaneously overloading the system with a parallel kernel build
(`make -j32`).

The average frames per second (fps) reported by Steam is used as a
metric for measuring system responsiveness (the higher the better):

 Game                       |  before |  after  | delta  |
 ---------------------------+---------+---------+--------+
 Baldur's Gate 3            |  40 fps |  48 fps | +20.0% |
 Counter-Strike 2           |   8 fps |  15 fps | +87.5% |
 Cyberpunk 2077             |  41 fps |  46 fps | +12.2% |
 Terraria                   |  98 fps | 108 fps | +10.2% |
 Team Fortress 2            |  81 fps |  92 fps | +13.6% |
 WebGL demo (firefox) [1]   |  32 fps |  42 fps | +31.2% |
 ---------------------------+---------+---------+--------+

Apart from the massive boost with Counter-Strike 2 (that should be taken
with a grain of salt, considering the overall poor performance in both
cases), the virtual time slice seems to systematically provide a boost
in responsiveness of around +10-20% fps.

It also seems to largely prevent audio cracking issues when the system
is massively overloaded: no audio cracking was detected during the
entire run of these tests with the virtual deadline change applied.

[1] https://webglsamples.org/aquarium/aquarium.html

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
@@ -497,22 +497,8 @@ impl<'a> Scheduler<'a> {
-    // Return the target time slice, proportionally adjusted based on the total amount of tasks
-    // waiting to be scheduled (more tasks waiting => shorter time slice).
-    fn effective_slice_ns(&mut self, nr_scheduled: u64) -> u64 {
-        // Scale time slice as a function of nr_scheduled, but never scale below 250 us.
-        //
-        // The goal here is to adjust the time slice allocated to tasks based on the number of
-        // tasks currently awaiting scheduling. When the system is heavily loaded, shorter time
-        // slices are assigned to provide more opportunities for all tasks to receive CPU time.
-        let scaling = ((nr_scheduled + 1) / 2).max(1);
-        let slice_ns = (self.slice_ns / scaling).max(NSEC_PER_MSEC / 4);
-        slice_ns
-    }
-
     // Dispatch tasks from the task pool in order (sending them to the BPF dispatcher).
     fn dispatch_tasks(&mut self) {
-        let nr_scheduled = self.task_pool.tasks.len() as u64;
-
         // Dispatch only a batch of tasks equal to the amount of idle CPUs in the system.
         //
         // This allows to have more tasks sitting in the task pool, reducing the pressure on the
@@ -521,29 +507,59 @@ impl<'a> Scheduler<'a> {
         for _ in 0..self.nr_idle_cpus().max(1) {
             match self.task_pool.pop() {
                 Some(task) => {
+                    // Determine the task's virtual time slice.
+                    //
+                    // The goal is to evaluate the optimal time slice, considering the vruntime as
+                    // a deadline for the task to complete its work before releasing the CPU.
+                    //
+                    // This is accomplished by calculating the difference between the task's
+                    // vruntime and the global current vruntime and use this value as the task time
+                    // slice.
+                    //
+                    // In this way, tasks that "promise" to release the CPU quickly (based on
+                    // their previous work pattern) get a much higher priority (due to
+                    // vruntime-based scheduling and the additional priority boost for being
+                    // classified as interactive), but they are also given a shorter time slice
+                    // to complete their work and fulfill their promise of rapidity.
+                    //
+                    // At the same time tasks that are more CPU-intensive get de-prioritized, but
+                    // they will also tend to have a longer time slice available, reducing in this
+                    // way the amount of context switches that can negatively affect their
+                    // performance.
+                    //
+                    // In conclusion, latency-sensitive tasks get a high priority and a short time
+                    // slice (and they can preempt other tasks), CPU-intensive tasks get low
+                    // priority and a long time slice.
+                    //
+                    // Moreover, ensure that the time slice is never less than 0.25 ms to prevent
+                    // excessive penalty from assigning time slices that are too short and reduce
+                    // context switch overhead.
+                    let slice_ns = (task.vruntime - self.min_vruntime).max(NSEC_PER_MSEC / 4);
+
+                    // Update global minimum vruntime.
+                    self.min_vruntime = task.vruntime;
+
                     // Create a new task to dispatch.
                     let mut dispatched_task = DispatchedTask::new(&task.qtask);
 
-                    // Interactive tasks will be dispatched on the first CPU available and they are
-                    // allowed to preempt other tasks.
+                    dispatched_task.set_slice_ns(slice_ns);
+
                     if task.is_interactive {
+                        // Dispatch interactive tasks on the first CPU available.
                         dispatched_task.set_flag(RL_CPU_ANY);
+
+                        // Interactive tasks can preempt other tasks.
                         if !self.no_preemption {
                             dispatched_task.set_flag(RL_PREEMPT_CPU);
                         }
                     }
 
-                    // In full-user mode we skip the built-in idle selection logic, so simply
-                    // dispatch all the tasks on the first CPU available.
                     if self.full_user {
+                        // In full-user mode we skip the built-in idle selection logic, so simply
+                        // dispatch all the tasks on the first CPU available.
                         dispatched_task.set_flag(RL_CPU_ANY);
                     }
 
-                    // Assign a timeslice as a function of the amount of tasks that are waiting to
-                    // be scheduled.
-                    dispatched_task.set_slice_ns(self.effective_slice_ns(nr_scheduled));
-
                     // Send task to the BPF dispatcher.
                     match self.bpf.dispatch_task(&dispatched_task) {
                         Ok(_) => {}