scx_rustland: introduce virtual time slice

Overview
========

Currently, a task's time slice is determined based on the total number
of tasks waiting to be scheduled: the more overloaded the system, the
shorter the time slice.

This approach can help reduce the average wait time of all tasks: they
progress more slowly, but uniformly, which provides smoother overall
system performance.
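
For reference, this is roughly the scaling rule in use today, shown here
as a stand-alone sketch of the effective_slice_ns() helper that this
patch removes (the free-function form and the NSEC_PER_MSEC constant are
simplified for illustration):

  const NSEC_PER_MSEC: u64 = 1_000_000;

  // The more tasks are waiting to be scheduled, the smaller the returned
  // time slice, clamped to a minimum of 0.25 ms.
  fn effective_slice_ns(slice_ns: u64, nr_scheduled: u64) -> u64 {
      let scaling = ((nr_scheduled + 1) / 2).max(1);
      (slice_ns / scaling).max(NSEC_PER_MSEC / 4)
  }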

However, under heavy system load, this approach can lead to very short
time slices distributed among all tasks, causing excessive context
switches that can badly affect soft real-time workloads.

Moreover, the scheduler tends to operate in a bursty manner (tasks are
queued and dispatched in bursts). This can also result in fluctuations
of longer and shorter time slices, depending on the number of tasks
still waiting in the scheduler's queue.

Such behavior can also negatively impact soft real-time workloads, such
as real-time audio processing.

Virtual time slice
==================

To mitigate this problem, introduce the concept of virtual time slice:
the idea is to evaluate the optimal time slice of a task, considering
the vruntime as a deadline for the task to complete its work before
releasing the CPU.

This is accomplished by calculating the difference between the task's
vruntime and the current global vruntime and using this value as the
task's time slice:

  task_slice = task_vruntime - min_vruntime

In this way, tasks that "promise" to release the CPU quickly (based on
their previous work pattern) get a much higher priority (due to
vruntime-based scheduling and the additional priority boost for being
classified as interactive), but they are also given a shorter time slice
to complete their work and fulfill their promise of rapidity.

At the same time, more CPU-intensive tasks are de-prioritized, but they
tend to receive a longer time slice, which reduces the number of context
switches that could hurt their performance.

In conclusion, latency-sensitive tasks get a high priority and a short
time slice (and they can preempt other tasks), while CPU-intensive tasks
get a low priority and a long time slice.
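
As a minimal sketch of this rule (an illustrative helper, not the
scheduler's actual code; it reuses the NSEC_PER_MSEC constant and the
0.25 ms lower bound that appear in the patch below):

  const NSEC_PER_MSEC: u64 = 1_000_000;

  // Tasks are dispatched in vruntime order, so task_vruntime >= min_vruntime.
  // The clamp prevents slices so short that context switch overhead dominates.
  fn virtual_time_slice_ns(task_vruntime: u64, min_vruntime: u64) -> u64 {
      (task_vruntime - min_vruntime).max(NSEC_PER_MSEC / 4)
  }

After dispatching a task, the scheduler also advances the global minimum
vruntime to that task's vruntime, so the next task's slice is measured
from the new baseline.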

Example
=======

Let's consider the following theoretical scenario:

 task | time
 -----+-----
   A  | 1
   B  | 3
   C  | 6
   D  | 6

In this case task A represents a short interactive task, tasks C and D
are CPU-intensive tasks, and task B is mainly interactive, but it also
requires some CPU time.

With a uniform time slice, scaled by the number of tasks, the
scheduling looks like this (assuming the time slice is 2):

 A B B C C D D A B C C D D C C D D
  |   |   |   | | |   |   |   |
  `---`---`---`-`-`---`---`---`----> 9 context switches

With the virtual time slice the scheduling changes to this:

 A B B C C C D A B C C C D D D D D
  |   |     | | | |     |
  `---`-----`-`-`-`-----`----------> 7 context switches

In the latter scenario, tasks do not receive the same time slice scaled
by the total number of tasks waiting to be scheduled. Instead, their
time slice is adjusted based on their previous CPU usage. Tasks that
used more CPU time are given longer slices and their processing time
tends to be packed together, reducing the amount of context switches.

Meanwhile, latency-sensitive tasks can still run as soon as they need
to, because they get a higher priority and can preempt other tasks.
However, they will get a short time slice, so tasks that were
incorrectly classified as interactive will still be forced to release
the CPU quickly.
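
To make the example concrete, here is a small self-contained sketch that
prints the slice each task would be granted; mapping the "time" column
above to accumulated vruntime (in ms) and starting from min_vruntime = 0
are assumptions made purely for illustration:

  const NSEC_PER_MSEC: u64 = 1_000_000;

  fn main() {
      let min_vruntime: u64 = 0;
      // (task, accumulated vruntime in ms), loosely following the table above.
      for (task, vruntime_ms) in [("A", 1u64), ("B", 3), ("C", 6), ("D", 6)] {
          let vruntime = vruntime_ms * NSEC_PER_MSEC;
          // Same rule as above: slice = vruntime - min_vruntime, clamped to 0.25 ms.
          let slice_ns = (vruntime - min_vruntime).max(NSEC_PER_MSEC / 4);
          println!("task {}: slice = {} ms", task, slice_ns as f64 / NSEC_PER_MSEC as f64);
      }
  }

Task A ends up with the shortest slice (1 ms), while C and D get 6 ms
each, which is what packs their processing time together in the second
diagram.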

Experimental results
====================

This patch has been tested on an AMD Ryzen 7 5800X (8 cores / 16 threads
with SMT), 16GB RAM, NVIDIA GeForce RTX 3070.

The test case involves the usual benchmark of playing a video game while
simultaneously overloading the system with a parallel kernel build
(`make -j32`).

The average frames per second (fps) reported by Steam is used as a
metric for measuring system responsiveness (the higher the better):

 Game                       |  before |  after  | delta  |
 ---------------------------+---------+---------+--------+
 Baldur's Gate 3            |  40 fps |  48 fps | +20.0% |
 Counter-Strike 2           |   8 fps |  15 fps | +87.5% |
 Cyberpunk 2077             |  41 fps |  46 fps | +12.2% |
 Terraria                   |  98 fps | 108 fps | +10.2% |
 Team Fortress 2            |  81 fps |  92 fps | +13.6% |
 WebGL demo (firefox) [1]   |  32 fps |  42 fps | +31.2% |
 ---------------------------+---------+---------+--------+

Apart from the massive boost with Counter-Strike 2 (that should be taken
with a grain of salt, considering the overall poor performance in both
cases), the virtual time slice seems to systematically provide a boost
in responsiveness of around +10-20% fps.

It also seems to largely prevent audio cracking issues when the system
is massively overloaded: no audio cracking was detected during the
entire run of these tests with the virtual deadline change applied.

[1] https://webglsamples.org/aquarium/aquarium.html

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
@@ -497,22 +497,8 @@ impl<'a> Scheduler<'a> {
-    // Return the target time slice, proportionally adjusted based on the total amount of tasks
-    // waiting to be scheduled (more tasks waiting => shorter time slice).
-    fn effective_slice_ns(&mut self, nr_scheduled: u64) -> u64 {
-        // Scale time slice as a function of nr_scheduled, but never scale below 250 us.
-        //
-        // The goal here is to adjust the time slice allocated to tasks based on the number of
-        // tasks currently awaiting scheduling. When the system is heavily loaded, shorter time
-        // slices are assigned to provide more opportunities for all tasks to receive CPU time.
-        let scaling = ((nr_scheduled + 1) / 2).max(1);
-        let slice_ns = (self.slice_ns / scaling).max(NSEC_PER_MSEC / 4);
-        slice_ns
-    }
-
     // Dispatch tasks from the task pool in order (sending them to the BPF dispatcher).
     fn dispatch_tasks(&mut self) {
-        let nr_scheduled = self.task_pool.tasks.len() as u64;
-
         // Dispatch only a batch of tasks equal to the amount of idle CPUs in the system.
         //
         // This allows to have more tasks sitting in the task pool, reducing the pressure on the
@@ -521,29 +507,59 @@ impl<'a> Scheduler<'a> {
         for _ in 0..self.nr_idle_cpus().max(1) {
             match self.task_pool.pop() {
                 Some(task) => {
+                    // Determine the task's virtual time slice.
+                    //
+                    // The goal is to evaluate the optimal time slice, considering the vruntime as
+                    // a deadline for the task to complete its work before releasing the CPU.
+                    //
+                    // This is accomplished by calculating the difference between the task's
+                    // vruntime and the global current vruntime and use this value as the task time
+                    // slice.
+                    //
+                    // In this way, tasks that "promise" to release the CPU quickly (based on
+                    // their previous work pattern) get a much higher priority (due to
+                    // vruntime-based scheduling and the additional priority boost for being
+                    // classified as interactive), but they are also given a shorter time slice
+                    // to complete their work and fulfill their promise of rapidity.
+                    //
+                    // At the same time tasks that are more CPU-intensive get de-prioritized, but
+                    // they will also tend to have a longer time slice available, reducing in this
+                    // way the amount of context switches that can negatively affect their
+                    // performance.
+                    //
+                    // In conclusion, latency-sensitive tasks get a high priority and a short time
+                    // slice (and they can preempt other tasks), CPU-intensive tasks get low
+                    // priority and a long time slice.
+                    //
+                    // Moreover, ensure that the time slice is never less than 0.25 ms to prevent
+                    // excessive penalty from assigning time slices that are too short and reduce
+                    // context switch overhead.
+                    let slice_ns = (task.vruntime - self.min_vruntime).max(NSEC_PER_MSEC / 4);
+
+                    // Update global minimum vruntime.
+                    self.min_vruntime = task.vruntime;
+
                     // Create a new task to dispatch.
                     let mut dispatched_task = DispatchedTask::new(&task.qtask);
 
-                    // Interactive tasks will be dispatched on the first CPU available and they are
-                    // allowed to preempt other tasks.
+                    dispatched_task.set_slice_ns(slice_ns);
+
                     if task.is_interactive {
+                        // Dispatch interactive tasks on the first CPU available.
                         dispatched_task.set_flag(RL_CPU_ANY);
+
+                        // Interactive tasks can preempt other tasks.
                         if !self.no_preemption {
                             dispatched_task.set_flag(RL_PREEMPT_CPU);
                         }
                     }
 
-                    // In full-user mode we skip the built-in idle selection logic, so simply
-                    // dispatch all the tasks on the first CPU available.
                     if self.full_user {
+                        // In full-user mode we skip the built-in idle selection logic, so simply
+                        // dispatch all the tasks on the first CPU available.
                         dispatched_task.set_flag(RL_CPU_ANY);
                     }
 
-                    // Assign a timeslice as a function of the amount of tasks that are waiting to
-                    // be scheduled.
-                    dispatched_task.set_slice_ns(self.effective_slice_ns(nr_scheduled));
-
                     // Send task to the BPF dispatcher.
                     match self.bpf.dispatch_task(&dispatched_task) {
                         Ok(_) => {}