scx_rustland: voluntary context switch boost

Improve priority boosting using the voluntary context switch metric.

Overview
========

The current criterion for applying the time slice boost (option `-b`) is
to distinguish between newly created tasks and tasks that are already
running: to prioritize interactive applications (games, multimedia,
etc.) we apply a time slice usage penalty to newly created tasks,
indirectly boosting the priority of tasks that are already running,
which are likely to be the interactive applications that we aim to
prioritize.
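
For reference, the previous charging rule can be summarized by the
following sketch (the helper name and signature are illustrative and the
task-weight scaling is omitted; the real logic lives in
update_enqueued(), see the diff below):

  // Sketch of the old boost criteria: a brand new task is charged half
  // of the maximum boosted time slice up front, while a task that is
  // already running is only charged the CPU time it actually consumed.
  fn old_charged_slice(curr_runtime: u64, prev_runtime: u64,
                       slice_ns: u64, slice_boost: u64) -> u64 {
      let is_new = curr_runtime < prev_runtime || prev_runtime == 0;
      if is_new && slice_boost != 1 {
          // New task: apply the up-front time slice penalty.
          slice_ns * slice_boost / 2
      } else if is_new {
          // Boost disabled: charge the whole runtime accumulated so far.
          curr_runtime
      } else {
          // Old task: charge only the time slice actually used.
          curr_runtime - prev_runtime
      }
  }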

Problem
=======

This approach works well when the background workload forks a bunch of
short-lived tasks (e.g., a parallel kernel build), but it fails to
properly classify CPU-intensive background tasks (e.g., video/3D
rendering, encryption, large data analysis, etc.), because these
applications typically do not generate many short-lived processes.

In the presence of such workloads the time slice penalty is never
applied, so interactive applications receive no boost at all.

Solution
========

A more effective criterion for distinguishing between interactive
applications and CPU-intensive background applications is to examine
voluntary context switches: an application that periodically releases
the CPU voluntarily is very likely to be interactive.

Therefore, change the time slice boost logic to apply a bonus (scaling
down the accounted used time slice) to tasks that show an increase in
their voluntary context switch counter over a time frame of 10 sec.
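
Concretely, the new check (is_interactive_task() in the diff below)
requires at least one voluntary context switch per ~10 second window
since the last snapshot; a minimal, self-contained illustration (the
main() usage below is only an example):

  const NSEC_PER_SEC: u64 = 1_000_000_000;

  // A task counts as interactive if its voluntary context switch counter
  // increased at least once per ~10s of observed wall-clock time.
  fn is_interactive_task(delta_nvcsw: u64, delta_ns: u64) -> bool {
      delta_nvcsw / (delta_ns / (10 * NSEC_PER_SEC)).max(1) > 0
  }

  fn main() {
      // 3 voluntary switches over 5 seconds: interactive.
      assert!(is_interactive_task(3, 5 * NSEC_PER_SEC));
      // No voluntary switches over 20 seconds: not interactive.
      assert!(!is_interactive_task(0, 20 * NSEC_PER_SEC));
  }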

Based on experimental results, this simple heuristic appears to be quite
effective at classifying interactive tasks and prioritizing them over
potential CPU-intensive background tasks.

Additionally, having a better criterion to identify interactive tasks
also allows newly created tasks to be prioritized, thereby enhancing the
responsiveness of interactive shell sessions.

This ensures the prompt execution of system commands even when the
system is massively overloaded, unlike the previous time slice boost
logic, which made interactive shell sessions less responsive by
deprioritizing newly created tasks.
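
Putting the two rules together, the updated charging logic is roughly
equivalent to the following sketch (variable and helper names are
simplified from update_enqueued() in the diff below; the vruntime
clamping is omitted):

  // Sketch of the updated slice accounting: new tasks and interactive
  // tasks get their charged time scaled down by slice_boost, so their
  // vruntime advances more slowly and they are dispatched sooner.
  fn charged_slice(curr_runtime: u64, prev_runtime: u64, interactive: bool,
                   weight: u64, slice_boost: u64) -> u64 {
      // slice_boost must be >= 1 (the scheduler enforces this at startup).
      let slice_boost = slice_boost.max(1);
      let is_new = curr_runtime < prev_runtime || prev_runtime == 0;
      let mut slice = if is_new {
          // Newly created task: boost it, so freshly spawned shell
          // commands are dispatched promptly.
          curr_runtime / slice_boost
      } else {
          // Already running task: charge the time it actually used.
          curr_runtime - prev_runtime
      };
      if interactive {
          // Interactive task: scale down the accounted time slice.
          slice /= slice_boost;
      }
      // Scale the result by the task's priority (weight).
      slice * 100 / weight
  }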

Results
=======

With this new logic in place it is possible to play a video game (e.g.,
Terraria) without experiencing any frame rate drop (60 fps), while a
parallel CPU stress test (`stress-ng -c 32`) is running in the
background. The same result can also be obtained with a parallel kernel
build (`make -j 32`). Thus, there is no regression compared to the
previous "ideal" test case.

Even when mixing both workloads (`make -j 16` + `stress-ng -c 16`),
Terraria can still be played without noticeable lag in the audio or
video, maintaining a consistent 60 fps.

In addition, shell commands remain very responsive.

Below are the results (average and standard deviation of 10 runs) of
two simple interactive shell commands, while both the `make -j 16` and
`stress-ng -c 16` workloads are running in the background:

  avg time           "uname -r"       "ps axuw > /dev/null"
  =========================================================
  EEVDF                 11.1ms                     231.8ms
  scx_rustland           2.6ms                     212.0ms

  stdev (ms)         "uname -r"       "ps axuw > /dev/null"
  =========================================================
  EEVDF                   2.28                       23.41
  scx_rustland            0.70                        9.11

Tests were conducted on an 8-core laptop (11th Gen Intel i7-1195G7 @
4.800GHz) with 16GB of RAM.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Andrea Righi 2024-01-11 14:22:36 +01:00
parent 1cf03770c7
commit 2157f638df

@@ -17,7 +17,7 @@ use std::collections::HashMap;
use std::sync::atomic::AtomicBool;
use std::sync::atomic::Ordering;
use std::sync::Arc;
use std::time::{Duration, SystemTime};
use std::time::SystemTime;
use std::fs::File;
use std::io::{self, Read};
@@ -69,14 +69,16 @@ struct Opts {
/// Time slice boost: increasing this value enhances performance of interactive applications
/// (gaming, multimedia, GUIs, etc.), but may lead to decreased responsiveness of other tasks
/// in the system; set to 0 for maximum responsiveness in newly created tasks (e.g., max
/// responsiveness of shell sessions).
/// in the system.
///
/// WARNING: setting a large value can make the scheduler quite unpredictable and you may
/// experience temporary system stalls (before hitting the sched-ext watchdog timeout).
///
/// The default setting (no boost at all) prioritizes fairness and system stability.
#[clap(short = 'b', long, default_value = "1")]
/// Default time slice boost is 100, which means interactive tasks will get a 100x priority
/// boost to run with respect to non-interactive tasks.
///
/// Use "1" to disable time slice boost and fallback to the standard vruntime-based scheduling.
#[clap(short = 'b', long, default_value = "100")]
slice_boost: u64,
/// If specified, only tasks which have their scheduling policy set to
@@ -91,11 +93,17 @@ struct Opts {
debug: bool,
}
// Time constants.
const MSEC_PER_SEC: u64 = 1_000;
const NSEC_PER_SEC: u64 = 1_000_000_000;
// Basic item stored in the task information map.
#[derive(Debug)]
struct TaskInfo {
sum_exec_runtime: u64, // total cpu time used by the task
vruntime: u64, // total vruntime of the task
nvcsw: u64, // total amount of voluntary context switches
nvcsw_ts: u64, // timestamp of the previous nvcsw update
}
// Task information map: store total execution time and vruntime of each task in the system.
@@ -187,10 +195,10 @@ impl<'a> Scheduler<'a> {
info!("{} scheduler attached", SCHEDULER_NAME);
// Save the default time slice (in ns) in the scheduler class.
let slice_ns = opts.slice_us * 1000;
let slice_ns = opts.slice_us * MSEC_PER_SEC;
// Slice booster.
let slice_boost = opts.slice_boost;
// Slice booster (must be >= 1).
let slice_boost = opts.slice_boost.max(1);
// Scheduler task pool to sort tasks by vruntime.
let task_pool = TaskTree::new();
@@ -226,6 +234,14 @@ impl<'a> Scheduler<'a> {
idle_cpus
}
// Return current timestamp in ns.
fn now() -> u64 {
let ts = SystemTime::now()
.duration_since(SystemTime::UNIX_EPOCH)
.unwrap();
ts.as_nanos() as u64
}
// Update task's vruntime based on the information collected from the kernel and return the
// evaluated weighted time slice to the caller.
//
@@ -233,17 +249,15 @@ impl<'a> Scheduler<'a> {
fn update_enqueued(
task_info: &mut TaskInfo,
sum_exec_runtime: u64,
nvcsw: u64,
weight: u64,
min_vruntime: u64,
slice_ns: u64,
slice_boost: u64,
now: u64,
) {
// Determine if a task is new or old, based on their current runtime and previous runtime
// counters.
fn is_new_task(runtime_curr: u64, runtime_prev: u64) -> bool {
runtime_curr < runtime_prev || runtime_prev == 0
}
// Evaluate last time slot used by the task.
//
// NOTE: make sure to handle the case where the current sum_exec_runtime is less than the
// previous sum_exec_runtime. This can happen, for example, when a new task is created via
@@ -253,22 +267,37 @@ impl<'a> Scheduler<'a> {
// Consequently, the existing task_info slot is reused, containing the total run-time of
// the previous task (likely exceeding the current sum_exec_runtime). In such cases, simply
// use sum_exec_runtime as the time slice of the new task.
let slice = if is_new_task(sum_exec_runtime, task_info.sum_exec_runtime) {
// New task: check if we need to apply the time slice penalty.
if slice_boost != 1 {
// Assign to the new task an initial time slice of (slice_ns * slice_boost / 2)
// over a maximum allowed time slice of (slice_ns * slice_boost), that can be
// potentially reached applying the task's weight (see below).
slice_ns * slice_boost / 2
} else {
sum_exec_runtime
}
fn is_new_task(curr_runtime: u64, prev_runtime: u64) -> bool {
curr_runtime < prev_runtime || prev_runtime == 0
}
// Determine if a task is interactive, based on the amount of voluntary context switches
// per second.
//
// NOTE: we should probably make the (delta_nvcsw / delta_t) threshold a tunable, but for
// now let's assume that 1 voluntary context switch over the last 10 seconds is enough to
// assume that the task is interactive.
fn is_interactive_task(delta_nvcsw: u64, delta_ns: u64) -> bool {
delta_nvcsw / (delta_ns / (10 * NSEC_PER_SEC)).max(1) > 0
}
// Evaluate last time slot used by the task.
let mut slice = if is_new_task(sum_exec_runtime, task_info.sum_exec_runtime) {
// Give to newer tasks a priority boost, so we can prioritize responsiveness of
// interactive shell sessions.
sum_exec_runtime / slice_boost
} else {
// Old task: charge time slice normally.
sum_exec_runtime - task_info.sum_exec_runtime
// Scale the time slice by the task's priority (weight).
} * 100
/ weight;
};
// Apply the slice boost to interactive tasks.
if is_interactive_task(nvcsw - task_info.nvcsw, now - task_info.nvcsw_ts) {
slice /= slice_boost;
}
// Scale the time slice by the task's priority (weight).
slice = slice * 100 / weight;
// Make sure that the updated vruntime is in the range:
//
@@ -280,22 +309,22 @@ impl<'a> Scheduler<'a> {
//
// Moreover, limiting the accounted time slice to slice_ns, allows to prevent starving the
// current task for too long in the scheduler task pool.
let max_slice_ns =
if slice_boost > 1 && is_new_task(sum_exec_runtime, task_info.sum_exec_runtime) {
slice_ns * slice_boost
} else {
slice_ns
};
task_info.vruntime = min_vruntime + slice.clamp(1, max_slice_ns);
task_info.vruntime = min_vruntime + slice.clamp(1, slice_ns * 100 / weight);
// Update total task cputime.
task_info.sum_exec_runtime = sum_exec_runtime;
// Update voluntary context switches counter and timestamp.
if now - task_info.nvcsw_ts > 10 * NSEC_PER_SEC {
task_info.nvcsw = nvcsw;
task_info.nvcsw_ts = now;
}
}
// Drain all the tasks from the queued list, update their vruntime (Self::update_enqueued()),
// then push them all to the task pool (doing so will sort them by their vruntime).
fn drain_queued_tasks(&mut self) {
let slice_ns = self.bpf.get_effective_slice_us() * 1000;
let slice_ns = self.bpf.get_effective_slice_us() * MSEC_PER_SEC;
loop {
match self.bpf.dequeue_task() {
Ok(Some(task)) => {
@@ -315,16 +344,20 @@ impl<'a> Scheduler<'a> {
.or_insert_with_key(|&_pid| TaskInfo {
sum_exec_runtime: 0,
vruntime: self.min_vruntime,
nvcsw: 0,
nvcsw_ts: Self::now(),
});
// Update task information.
Self::update_enqueued(
task_info,
task.sum_exec_runtime,
task.nvcsw,
task.weight,
self.min_vruntime,
slice_ns,
self.slice_boost,
Self::now(),
);
// Insert task in the task pool (ordered by vruntime).
@@ -353,11 +386,11 @@ impl<'a> Scheduler<'a> {
// Dynamically adjust the time slice based on the amount of waiting tasks.
fn scale_slice_ns(&mut self) {
let nr_scheduled = self.task_pool.tasks.len() as u64;
let slice_us_max = self.slice_ns / 1000;
let slice_us_max = self.slice_ns / MSEC_PER_SEC;
// Scale time slice as a function of nr_scheduled, but never scale below 1 ms.
let scaling = (nr_scheduled / 2).max(1);
let slice_us = (slice_us_max / scaling).max(1000);
let slice_us = (slice_us_max / scaling).max(MSEC_PER_SEC);
// Apply new scaling.
self.bpf.set_effective_slice_us(slice_us);
@@ -514,20 +547,15 @@ impl<'a> Scheduler<'a> {
}
fn run(&mut self, shutdown: Arc<AtomicBool>) -> Result<()> {
let mut prev_ts = SystemTime::now();
let mut prev_ts = Self::now();
while !shutdown.load(Ordering::Relaxed) && self.bpf.read_bpf_exit_kind() == 0 {
let curr_ts = SystemTime::now();
let elapsed = curr_ts
.duration_since(prev_ts)
.unwrap_or_else(|_| Duration::from_secs(0));
// Call the main scheduler body.
self.schedule();
// Print scheduler statistics every second.
if elapsed > Duration::from_secs(1) {
// Print scheduler statistics.
let curr_ts = Self::now();
if curr_ts - prev_ts > NSEC_PER_SEC {
self.print_stats();
prev_ts = curr_ts;