scx-upstream

mirror of https://github.com/sched-ext/scx.git synced 2024-11-25 04:00:24 +00:00

Author	SHA1	Message	Date
Andrea Righi	f8a2445869	scx_bpfland: introduce performance/powersave primary domain The primary scheduling domain represents a group of CPUs in the system where the scheduler will initially attempt to assign tasks. Tasks will only be dispatched to CPUs within this primary domain until they are fully utilized, after which tasks may overflow to other available CPUs. The primary scheduling domain can be defined using the option `--primary-domain CPUMASK` (by default all the CPUs in the system are used as primary domain). This change introduces two new special values for the CPUMASK argument: - `performance`: automatically detect the fastest CPUs in the system and use them as primary scheduling domain, - `powersave`: automatically detect the slowest CPUs in the system and use them as primary scheduling domain. The current logic only supports creating two groups: fast and slow CPUs. The fast CPU group is created by excluding CPUs with the lowest frequency from the overall set, which means that within the fast CPU group, CPUs may have different maximum frequencies. When using the `performance` mode the fast CPUs will be used as primary domain, whereas in `powersave` mode, the slow CPUs will be used instead. This option is particularly useful in hybrid architectures (with P-cores and E-cores), as it allows the use of bpfland to prioritize task scheduling on either P-cores or E-cores, depending on the desired performance profile. 
Example:
 - Dell Precision 5480
 - CPU: 13th Gen Intel(R) Core(TM) i7-13800H
 - P-cores: 0-11 / max freq: 5.2GHz
 - E-cores: 12-19 / max freq: 4.0GHz

$ scx_bpfland --primary-domain performance

  0[||||||||| 24.5%]        10[|||||||| 22.8%]
  1[|||||| 14.9%]           11[||||||||||||| 36.9%]
  2[|||||| 16.2%]           12[ 0.0%]
  3[||||||||| 25.3%]        13[ 0.0%]
  4[||||||||||| 33.3%]      14[ 0.0%]
  5[|||| 9.9%]              15[ 0.0%]
  6[||||||||||| 31.5%]      16[ 0.0%]
  7[||||||| 17.4%]          17[ 0.0%]
  8[|||||||| 23.4%]         18[ 0.0%]
  9[||||||||| 26.1%]        19[ 0.0%]

Avg power consumption: 3.29W

$ scx_bpfland --primary-domain powersave

  0[| 2.5%]                 10[ 0.0%]
  1[ 0.0%]                  11[ 0.0%]
  2[ 0.0%]                  12[|||| 8.0%]
  3[ 0.0%]                  13[||||||||||||||||||||| 64.2%]
  4[ 0.0%]                  14[|||||||||| 29.6%]
  5[ 0.0%]                  15[||||||||||||||||| 52.5%]
  6[ 0.0%]                  16[||||||||| 24.7%]
  7[ 0.0%]                  17[|||||||||| 30.4%]
  8[ 0.0%]                  18[||||||| 22.4%]
  9[ 0.0%]                  19[||||| 12.4%]

Avg power consumption: 2.17W

(Info collected from htop and turbostat)

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-19 20:19:21 +02:00
Andrea Righi	174993f9d2	scx_bpfland: introduce cache awareness

While the system is not saturated the scheduler will use the following strategy to select the next CPU for a task:
 - pick the same CPU if it's a full-idle SMT core
 - pick any full-idle SMT core in the primary scheduling group that shares the same L2 cache
 - pick any full-idle SMT core in the primary scheduling group that shares the same L3 cache
 - pick the same CPU (ignoring SMT)
 - pick any idle CPU in the primary scheduling group that shares the same L2 cache
 - pick any idle CPU in the primary scheduling group that shares the same L3 cache
 - pick any idle CPU in the system

While the system is completely saturated (no idle CPUs available), tasks will be dispatched on the first CPU that becomes available.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-19 20:19:21 +02:00
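The selection order described in the cache-awareness commit can be modeled with a short Python sketch. This is an illustration only, not the scheduler's BPF code: the `Topology` container, the `pick_cpu` function, and the choice of the lowest-numbered candidate at each step are assumptions introduced here, and a core is assumed to be "full-idle" when all of its SMT siblings are idle.

```python
from dataclasses import dataclass


@dataclass
class Topology:
    """Hypothetical topology description (not the scheduler's own data)."""
    siblings: dict  # cpu -> set of CPUs on the same physical core (itself included)
    l2: dict        # cpu -> L2 cache id
    l3: dict        # cpu -> L3 cache id


def pick_cpu(prev_cpu, idle, primary, topo):
    """Return the CPU chosen for a task, or None when no CPU is idle.

    idle:    set of currently idle CPUs
    primary: set of CPUs in the primary scheduling domain
    """
    def full_idle(cpu):
        # Assumption: a core is full-idle when every SMT sibling is idle.
        return topo.siblings[cpu] <= idle

    def first(candidates):
        # Assumption: take the lowest-numbered candidate.
        return min(candidates) if candidates else None

    idle_primary = idle & primary
    full_idle_primary = {c for c in idle_primary if full_idle(c)}

    steps = [
        # 1. same CPU, if its whole core is idle
        lambda: prev_cpu if prev_cpu in idle and full_idle(prev_cpu) else None,
        # 2. and 3. full-idle core in the primary domain sharing L2, then L3
        lambda: first({c for c in full_idle_primary if topo.l2[c] == topo.l2[prev_cpu]}),
        lambda: first({c for c in full_idle_primary if topo.l3[c] == topo.l3[prev_cpu]}),
        # 4. same CPU, ignoring SMT
        lambda: prev_cpu if prev_cpu in idle else None,
        # 5. and 6. any idle CPU in the primary domain sharing L2, then L3
        lambda: first({c for c in idle_primary if topo.l2[c] == topo.l2[prev_cpu]}),
        lambda: first({c for c in idle_primary if topo.l3[c] == topo.l3[prev_cpu]}),
        # 7. any idle CPU in the system
        lambda: first(idle),
    ]
    for step in steps:
        cpu = step()
        if cpu is not None:
            return cpu
    return None  # saturated: the task waits for the first CPU to become available
```

For example, on a hypothetical machine with two SMT cores (CPUs 0-1 and 2-3) that share one L3 cache, a task that last ran on CPU 0 moves to CPU 2 when only CPUs 1, 2 and 3 are idle, because the fully idle core wins over the idle sibling of a busy core.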
Tejun Heo	c16b48d7b2	scheds/rust: Include Cargo.lock in the repo Binary packages are expected to include Cargo.lock in the repo so that the produced binaries match across different builds.	2024-08-15 23:08:35 -10:00
Andrea Righi	f9a994412d	scx_bpfland: introduce primary scheduling domain Allow to specify a primary scheduling domain via the new command line option `--primary-domain CPUMASK`, where CPUMASK can be a hex number of arbitrary length, representing the CPUs assigned to the domain. If this option is not specified the scheduler will use all the available CPUs in the system as primary domain (no behavior change). Otherwise, if a primary scheduling domain is defined, the scheduler will try to dispatch tasks only to the CPUs assigned to the primary domain, until these CPUs are saturated, at which point tasks may overflow to other available CPUs. This feature can be used to prioritize certain cores over others and it can be really effective in systems with heterogeneous cores (e.g., hybrid systems with P-cores and E-cores). == Example (hybrid architecture) == Hardware: - Dell Precision 5480 with 13th Gen Intel(R) Core(TM) i7-13800H - 6 P-cores 0..5 with 2 CPUs each (CPU from 0..11) - 8 E-cores 6..13 with 1 CPU each (CPU from 12..19) == Test == WebGL application (https://webglsamples.org/aquarium/aquarium.html): this allows to generate a steady workload in the system without over-saturating the CPUs. Use different scheduler configurations: - EEVDF (default) - scx_bpfland using P-cores only (--primary-domain 0x00fff) - scx_bpfland using E-cores only (--primary-domain 0xff000) Measure performance (fps) and power consumption (W). 
== Result ==

                  +-----+-----+------+-------+--------+
                  | min | max | avg  |       |        |
                  | fps | fps | fps  | stdev | power  |
+-----------------+-----+-----+------+-------+--------+
| EEVDF           |  28 |  34 | 31.0 |  1.73 |   3.5W |
| bpfland-p-cores |  33 |  34 | 33.5 |  0.29 |   3.5W |
| bpfland-e-cores |  25 |  26 | 25.5 |  0.29 |   2.2W |
+-----------------+-----+-----+------+-------+--------+

Using a primary scheduling domain of only P-cores with scx_bpfland achieves a more stable and predictable level of performance, with an average of 33.5 fps and an error of ±0.5 fps. In contrast, using EEVDF results in an average frame rate of 31.0 fps with an error of ±3.0 fps, indicating slightly less consistency, due to the fact that tasks are evenly distributed across all the cores in the system (both slow and fast cores). On the other hand, using a scheduling domain solely of E-cores with scx_bpfland results in a lower average frame rate (25.5 fps), though it maintains a stable performance (error of ±0.5 fps), but the power consumption is also reduced, averaging 2.2W, compared to 3.5W with either of the other configurations.

== Conclusion ==

In summary, with this change users have the flexibility to prioritize scheduling on performance cores for better performance and consistency, or prioritize energy efficient cores for reduced power consumption, on hybrid architectures. Moreover, this feature can also be used to minimize the number of cores used by the scheduler, until they reach full capacity. This capability can be useful for reducing power consumption even in homogeneous systems or for conducting scheduling experiments with smaller sets of cores, provided the system is not overcommitted.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-14 16:17:54 +02:00
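To make the `--primary-domain CPUMASK` values used in this test easier to read, the Python sketch below decodes a hexadecimal mask into CPU IDs. It assumes the usual convention that bit N of the mask selects CPU N, which matches the P-core and E-core masks quoted in the commit message; `cpus_from_mask` is a hypothetical helper written for this illustration and is not part of scx_bpfland.

```python
def cpus_from_mask(mask):
    """Return the CPU IDs whose bits are set in a hexadecimal CPU mask.

    Python integers have arbitrary precision, so the mask can be of any length.
    """
    value = int(mask, 16)  # base 16 accepts an optional "0x" prefix
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


# The two masks from the test above.
p_cores = cpus_from_mask("0x00fff")  # CPUs 0..11
e_cores = cpus_from_mask("0xff000")  # CPUs 12..19
```

Under that assumption, `0x00fff` covers the 12 CPUs of the 6 SMT P-cores and `0xff000` covers the 8 E-core CPUs.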
Andrea Righi	a6e977c70b	scx_bpfland: make output more compact Abbreviate the statistics reported to stdout and remove the slice_ms metric: this metric can be easily derived from slice_ns, slice_ns_min and nr_wait, which are already reported to stdout. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-14 16:17:54 +02:00
Andrea Righi	8656effa50	scx_bpfland: update copyright info Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-14 16:17:54 +02:00
Tejun Heo	63c4a0191f	Merge branch 'main' into topic/inlined-skeleton-members	2024-08-08 14:23:37 -10:00
Tejun Heo	cd6a4d72c7	Bump versions for 1.0.2 release	2024-08-08 14:10:16 -10:00
Tejun Heo	7c3ffe96e1	Unify crate dependency versions Different sub-projects are using different versions for the same crates. Synchronize them to the latest.	2024-08-08 13:26:47 -10:00
Andrea Righi	b87541a26e	scx_rustland_core: refactor idle CPU selection logic Use the same idle selection logic used in scx_bpfland also in scx_rustland_core. Also drop fifo_mode and always use the BPF idle selection logic by default as long as the system is not saturated, unless full_user is specified. This approach allows user-space schedulers aiming for maximum performance to leverage the BPF idle selection logic (bypassing user-space), while those seeking full control can enable full_user to bypass the BPF CPU idle selection logic and choose the target CPU for each task from user-space. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-07 08:10:53 +02:00
Andrea Righi	bee0d699ef	scx_bpfland: always re-align task's vruntime to the global vruntime Immediately re-align p->scx.dsq_vtime to the global vruntime (+/- slice lag) as soon as we are evaluating the task's vruntime. This allows the task's vruntime to rapidly chase the minimum global vruntime, so that tasks with a predominantly sleeping behavior pattern are not over-prioritized. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-08-02 13:11:25 +02:00
Andrea Righi	19854f1535	scx_bpfland: allow to specify negative values with --slice-us-lag Using negative values with --slice-us-lag can be useful to make performance more consistent and prioritize newly created tasks over the running tasks. Therefore, allow to specify negative values from the command line and also update the documentation of this option. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-26 09:10:18 +02:00
Andrea Righi	46ddca6bd5	scx_bpfland: report task time slice to stdout Periodically report to stdout samples of the effective time slice applied to tasks. While one could determine this metric by examining the max slice_ns and nr_waiting metrics, directly reporting it to stdout allows users to quickly identify what is happening and it provides a clearer overview of the scheduling behavior. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-22 22:01:49 +02:00
Andrea Righi	c1d93d2a00	scx_bpfland: drop kthread dispatches metric Dispatching per-CPU kthreads directly is disabled by default, reporting this metric can generate some confusion (since it is always 0), and even if local kthread dispatches are enabled, they should be still considered as regular direct dispatches (there is no difference in practice). Therefore, merge direct kthread dispatches into direct dispatches and drop the separate nr_kthread_dispatches metric. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-22 22:01:49 +02:00
Andrea Righi	a5f1d6b595	scx_bpfland: show average amount of tasks waiting to be dispatched Periodically report the average amount of tasks sitting in the priority and shared DSQs. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-22 22:01:45 +02:00
Andrea Righi	5908a985bc	scx_bpfland: adjust task time slice based on the amount of waiting tasks Scale the task's time slice based on the average amount of tasks that are currently waiting to be dispatched. Use a moving average for the amount of waiting tasks to smooth out potential spikes caused by temporary bursts of tasks piling in the wait queues. This was initially modeled in scx_rustland and it seems to work pretty well also in scx_bpfland now. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-22 21:53:25 +02:00
Andrea Righi	c4eb3ce7b4	scx_bpfland: introduce dynamic nvcsw threshold Instead of using a static value to classify tasks based on their average amount of voluntary context switches, try to periodically evaluate an optimal threshold, based on a global average of voluntary context switches among all the running tasks. Tasks with an average amount of voluntary context switches greater than the global average will be classified as interactive. The global average is evaluated as an exponentially weighted moving average (EWMA), as: avg(t) = avg(t - 1) * 0.75 + task_avg(t) * 0.25 This approach is more efficient than iterating through all tasks and it helps to prevent rapid fluctuations that may be caused by bursts of voluntary context switch events. The dynamic nvcsw threshold enables a more precise adjustment of the classification criteria to swiftly respond to global system changes: tasks can be quickly classified as interactive, but if the system experiences too many interactive events, the criteria for maintaining interactive status become stricter. This creates a natural selection process where only the most deserving tasks remain interactive. Additionally, introduce the new option `--nvcsw-max-thresh N`, which allows to extend or restrict the fluctuation range of the global average threshold for voluntary context switches. Tested-by: Piotr Gorski <piotrgorski@cachyos.org> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-18 19:03:25 +02:00
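An exponentially weighted moving average is a weighted sum of the previous average and the newest sample. The Python sketch below applies the 0.75 and 0.25 weights from the commit message and then uses the result as the interactive threshold. The function names `ewma_update` and `is_interactive` are hypothetical, and the sketch uses floating point for readability, which is an assumption and not a statement about how the BPF code computes it.

```python
def ewma_update(prev_avg, task_avg, weight=0.25):
    """Blend the previous global average with a new per-task sample.

    With weight=0.25 this is: avg(t) = avg(t - 1) * 0.75 + task_avg(t) * 0.25
    """
    return prev_avg * (1.0 - weight) + task_avg * weight


def is_interactive(task_avg, global_avg):
    """A task is interactive when its average exceeds the global average."""
    return task_avg > global_avg


# A burst of samples only moves the global average gradually.
global_avg = 8.0
for sample in (16.0, 16.0, 16.0):
    global_avg = ewma_update(global_avg, sample)
```

Because each sample contributes only a quarter of its value, a short burst of voluntary context switches raises the threshold slowly, which is the smoothing effect the commit message describes.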
Daniel Müller	565aec3662	rust: Update libbpf-rs & libbpf-cargo to 0.24 Update libbpf-rs & libbpf-cargo to 0.24. Among other things, generated skeletons now contain directly accessible map and program objects, no longer necessitating the use of accessor methods. As a result, the risk for mutability conflicts is reduced greatly. Signed-off-by: Daniel Müller <deso@posteo.net>	2024-07-16 11:48:52 -07:00
Tejun Heo	51334b5c4d	Bump versions for 1.0.1 release	2024-07-15 13:21:52 -10:00
Andrea Righi	8e7a526356	scx_bpfland: use nr_cpu_ids for consistency We always use nr_cpu_ids to represent the maximum CPU id returned by scx_bpf_nr_cpu_ids(). Replace cpu_max with nr_cpu_ids to be more consistent with the rest of the code. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 08:44:35 +02:00
Andrea Righi	33d06f653b	scx_bpfland: get rid of the MAX_CPUS hard-coded limit We can rely on scx_bpf_nr_cpu_ids() to create all the possible per-CPU DSQs, eliminating the need for the hard-coded limit MAX_CPUS. In this way scx_bpfland can support the same amount of CPUs that the kernel can handle. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 00:17:30 +02:00
Andrea Righi	b80ef7d8eb	scx_bpfland: optimize offline CPU handling Instead of constantly checking the need to drain tasks from the DSQs of the offline CPUs, provide an atomic flag to notify when there are tasks to be drained from the offline CPUs. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 00:17:23 +02:00
Andrea Righi	0530706710	scx_bpfland: properly initialize the nvcsw metrics Initialize the number of voluntary context switches metrics in the local task storage. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 00:16:10 +02:00
Andrea Righi	bf4ad23599	scx_bpfland: refine interactive tasks flood safeguard Refine the safeguard mechanism to avoid generating too many interactive tasks in the system, which could nullify the effect of the interactive/regular task classification. The safeguard mechanism operates by pausing the promotion of new tasks to interactive status during the task wake-up process, whenever the number of interactive tasks in the priority queue exceeds a specific limit (set to 4x the number of online CPUs). Halting the promotion of additional interactive tasks allows to prioritize those already classified as interactive, thereby preventing potential "bursts" of excessive interactive tasks in the system. This refines the mitigation already provided by commit `640bd562` ("scx_bpfland: prevent tasks from abusing interactive priority boost"). Fixes: `640bd562` ("scx_bpfland: prevent tasks from abusing interactive priority boost") Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 00:11:34 +02:00
Andrea Righi	eb1cf0e670	scx_bpfland: improve task time slice evaluation Always assign the maximum time slice if there are idle CPUs in the system. Otherwise, double the task's unused time slice to reward tasks that use less CPU time and at the same time refill the time slice of the tasks every time they're dispatched. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-14 23:24:24 +02:00
Tejun Heo	761ec142ce	Bump most versions to 1.0.0 sched_ext is about to be merged upstream. There are some compatibility breaking changes and we're making the current sched_ext/for-6.11 1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues") the baseline. Tag everything except scx_mitosis as 1.0.0. As scx_mitosis is still in early development and is currently temporarily disabled, only the patchlevel is bumped.	2024-07-12 11:34:14 -10:00
Andrea Righi	640bd562ff	scx_bpfland: prevent tasks from abusing interactive priority boost The priority boost for interactive tasks can be exploited to render the system nearly unresponsive by creating numerous tasks that constantly switch between wait/wakeup states. For example, stress tests like `hackbench -l 10000` can significantly degrade system responsiveness. To mitigate this, limit the number of interactive tasks added to the priority queue to 4x the number of online CPUs. This simple approach appears to be quite effective at identifying potential spam of "fake" interactive tasks, while still prioritizing legitimate interactive tasks. Additionally, periodically refresh the interactive status of the tasks based on their most recent average of voluntary context switches, preventing the interactive status from being too "sticky". Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 16:13:55 +02:00
Andrea Righi	1babb2b92d	scx_bpfland: prevent per-CPU kthreads starving other tasks Avoid dispatching per-CPU kthreads directly, since this may cause interactivity problems or unfairness, for example if there are too many softirqs being scheduled (e.g., in presence of high RX network traffic or when running certain stress tests, like hackbench). Moreover, in order to help with testing and benchmarks, introduce the option --local-kthread, that allows to restore the old behavior if enabled. Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 16:13:48 +02:00
Andrea Righi	c3ebdd338f	scx_bpfland: prevent slice delta overflow When updating the task vruntime, ensure the time slice delta is always a positive value. Failing to do so may cause the global vruntime to increase excessively due to overflows. Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 15:58:01 +02:00
Andrea Righi	f59aa52fe7	scx_bpfland: expose the amount of online CPUs Periodically report the amount of online CPUs to stdout. The online CPUs are initially evaluated looking at the online cpumask, then the value is updated in the .cpu_offline() / .cpu_online() callbacks. Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 15:58:01 +02:00
Andrea Righi	3a47b484af	scx_bpfland: report interactive tasks to stdout Keep track of the CPUs that are running interactive tasks and report their amount to stdout. Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 15:58:01 +02:00
Andrea Righi	1a1a16b9e9	scx_bpfland: fix typo in slice_ns definition The correct default value of slice_ns is 5ms, not 5s. This change doesn't really make any difference in practice, since these values are changed by the Rust part when the scheduler is started, but it's good to keep this aligned to the proper values for consistency. Tested-by: Piotr Gorski <lucjan.lucjanov@gmail.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-11 15:58:01 +02:00
Andrea Righi	995577762a	scx_bpfland: refill task time slice Every time we need to dispatch a task re-evaluate its time slice as: (unused_time_slice + min_time_slice) / 2 This allows to refill the time slice for tasks that haven't used much of their previously assigned time, improving fairness. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-06 14:07:24 +02:00
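The refill rule from this commit is a simple average of the unused slice and the minimum slice. A small Python sketch with worked values follows; `refill_slice` is a hypothetical name, the nanosecond units and integer division are assumptions, and the 1 ms minimum slice is an arbitrary value picked for the example.

```python
def refill_slice(unused_ns, min_slice_ns):
    """New time slice: (unused_time_slice + min_time_slice) / 2."""
    return (unused_ns + min_slice_ns) // 2


MIN_SLICE_NS = 1_000_000  # 1 ms, an arbitrary value for this example

# A task that burned its whole slice gets half of the minimum slice.
exhausted = refill_slice(0, MIN_SLICE_NS)
# A task that left 4 ms unused gets a larger slice next time.
frugal = refill_slice(4_000_000, MIN_SLICE_NS)
```

With these inputs the frugal task ends up with a longer slice than the exhausted one, which is the fairness effect the commit message refers to.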
Andrea Righi	6a64182ef2	scx_bpfland: always classify interactive tasks Make sure to always classify interactive tasks, even when the system is not fully utilized. This ensures that if the system suddenly becomes overloaded, we already know which tasks need to be dispatched to the priority DSQ. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-06 14:07:24 +02:00
Andrea Righi	8dd528abfd	scx_bpfland: pass enqueue flags when dispatching kthreads Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-06 14:07:10 +02:00
Andrea Righi	2bc8f800e7	scx_bpfland: report build id version Use the version string provided by scx_utils:build_id. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-05 09:29:29 +02:00
Andrea Righi	bdb31e98e2	scx_bpfland: show statistics in a more human-readable format Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-05 09:29:29 +02:00
Andrea Righi	f9d7844b2e	scx_bpfland: split direct dispatches and kthread dispatches Show separate statistics for direct dispatches and kthread direct dispatches. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-05 09:27:59 +02:00
Andrea Righi	cfe2ed063d	scx_bpfland: time-based starvation prevention Tasks are consumed from various DSQs in the following order: per-CPU DSQs => priority DSQ => shared DSQ Tasks in the shared DSQ may be starved by those in the priority DSQ, which in turn may be starved by tasks dispatched to any per-CPU DSQ. To mitigate this, record the timestamp of the last task scheduling event both from the priority DSQ and the shared DSQ. If the starvation threshold is exceeded without consuming a task, the scheduler will be forced to consume a task from the corresponding DSQ. The starvation threshold can be adjusted using the --starvation-thresh command line parameter (default is 5ms). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-04 10:52:39 +02:00
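The consumption order and the time-based safeguard from this commit can be sketched in Python as below. This is a simplified model under stated assumptions, not the scheduler's BPF logic: the `consume` function and the `last_ts` dictionary are hypothetical, starvation is checked for the priority DSQ before the shared DSQ, and the last-consumed timestamp is refreshed whenever a task is taken from that DSQ.

```python
from collections import deque

STARVATION_THRESH_NS = 5_000_000  # 5 ms, the default quoted in the commit


def consume(now_ns, per_cpu, priority, shared, last_ts,
            thresh_ns=STARVATION_THRESH_NS):
    """Pop the next task and return (dsq_name, task), or None if all are empty.

    last_ts maps "priority" and "shared" to the time a task was last
    consumed from that DSQ.
    """
    # Forced consumption: a non-empty DSQ that waited longer than the
    # threshold is served first (assumed check order: priority, then shared).
    for name, queue in (("priority", priority), ("shared", shared)):
        if queue and now_ns - last_ts[name] > thresh_ns:
            last_ts[name] = now_ns
            return name, queue.popleft()

    # Normal order: per-CPU DSQ => priority DSQ => shared DSQ.
    for name, queue in (("per_cpu", per_cpu), ("priority", priority),
                        ("shared", shared)):
        if queue:
            if name in last_ts:
                last_ts[name] = now_ns
            return name, queue.popleft()
    return None
```

In this model a steady stream of per-CPU dispatches can delay the priority and shared DSQs for at most the threshold before one of their tasks is served.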
Andrea Righi	9e0db4ae17	scx_bpfland: remove unnecessary RCU read protection There is no need to RCU protect the cpumask for the offline CPUs: it is created once when the scheduler is initialized and it's never deallocated. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-04 10:24:43 +02:00
Andrea Righi	cef6ca93cf	scx_bpfland: adjust default time slice to 5ms Reduce the default time slice down to 5ms for a faster reaction and system responsiveness when the system is overcommitted. This also helps to provide a more predictable level of performance. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-04 10:24:43 +02:00
Andrea Righi	7d15e3171c	scx_bpfland: ensure task time slice never exceeds the slice_ns limit Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-04 10:24:43 +02:00
Andrea Righi	e8a4d350ad	scx_bpfland: unify dispatching kthreads with direct CPU dispatches Always use direct CPU dispatch for kthreads, there is no need to treat kthreads in a special way, simply reuse direct CPU dispatch to prioritize them. Moreover, change direct CPU dispatches to use scx_bpf_dispatch_vtime(), since we may dispatch multiple tasks to the same per-CPU DSQ now. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-03 09:38:43 +02:00
Andrea Righi	d2231b0aed	scx_bpfland: drop built-in idle CPU selection logic Small refactoring of the idle CPU selection logic: - optimize idle CPU selection for tasks that can run on a single CPU - drop the built-in idle selection policy and completely rely on the custom one Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-03 08:54:17 +02:00
Andrea Righi	7c355f50b2	scx_bpfland: use the right cpumask to find any idle CPU We are incorrectly using the SMT idle cpumask to find any idle CPU, fix by using the generic idle cpumask. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-07-01 20:36:24 +02:00
Andrea Righi	ff7a518d28	scx_bpfland: support CPU hotplugging Implement CPU hotplugging in scx_bpfland without restarting the scheduler. The idle selection logic has been updated to consider online CPUs. Additionally, a cpumask for offline CPUs has been introduced. Tasks that have been dispatched to the DSQs associated with offline CPUs are consumed by the other CPUs that are still online. Moreover, the dependency on the Topology crate is temporarily dropped and instead, /sys/devices/system/cpu/smt/active is used to determine if SMT should be taken into account during idle selection. The Topology crate will be re-introduced later when scx_bpfland will gain more topology-aware capabilities. This fixes #406. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-30 23:04:13 +02:00
Andrea Righi	74175f5a49	scx_bpfland: properly integrate with meson build Properly honor the meson build `serialize` option. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-28 21:37:00 +02:00
Andrea Righi	7606b95150	scx_bpfland: introduce maximum time slice lag Introduce a tunable to set a limit of the minimum vruntime that is used when a task is dispatched, as: vtime_min = vtime_now - slice_lag_ns Increasing the time slice lag can make interactive tasks even more responsive at the cost of starving regular and newly created tasks. Default time slice lag is 0. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-27 17:28:42 +02:00
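One way to read the `vtime_min` formula is as a floor applied to a task's vruntime at dispatch time. The Python sketch below shows that reading with small numbers. The `dispatch_vtime` name is hypothetical, and treating the limit as a `max()` clamp is an interpretation of the commit message, not a copy of the BPF code.

```python
def dispatch_vtime(task_vtime, vtime_now, slice_lag_ns=0):
    """Clamp a task's vruntime so it is never older than vtime_min.

    vtime_min = vtime_now - slice_lag_ns
    """
    vtime_min = vtime_now - slice_lag_ns
    return max(task_vtime, vtime_min)


# A task that slept for a long time has a very old vruntime (100).
no_lag = dispatch_vtime(100, vtime_now=1000)                      # floor is 1000
with_lag = dispatch_vtime(100, vtime_now=1000, slice_lag_ns=300)  # floor is 700
```

A larger lag lets a long-sleeping task keep more of its accumulated credit, so it sorts ahead of regular and newly created tasks, which matches the trade-off the commit message describes.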
Andrea Righi	5a44329d45	scheds: introduce scx_bpfland Overview ======== This scheduler is derived from scx_rustland, but it is fully implemented in BPF with a minimal user-space Rust part to process command line options, collect metrics and log scheduling statistics. Unlike scx_rustland, all scheduling decisions are made by the BPF component. Motivation ========== The primary goal of this scheduler is to act as a performance baseline for comparison with scx_rustland, allowing for a better assessment of the overhead caused by kernel/user-space interactions. It can also be used to deploy prototypes initially tested in the scx_rustland scheduler. In fact, this scheduler is expected to outperform scx_rustland, due to the elimination of the kernel/user-space overhead. Scheduling policy ================= scx_bpfland is a vruntime-based sched_ext scheduler that prioritizes interactive workloads. Its scheduling policy closely mirrors scx_rustland, but it has been re-implemented in BPF with some small adjustments. Tasks are categorized as either interactive or regular based on their average rate of voluntary context switches per second: tasks that exceed a specific voluntary context switch threshold are classified as interactive. Interactive tasks are prioritized in a higher-priority DSQ, while regular tasks are placed in a lower-priority DSQ. Within each queue, tasks are sorted based on their weighted runtime, using the built-in scx vtime ordering capabilities (scx_bpf_dispatch_vtime()). Moreover, each task gets a time slice budget. When a task is dispatched, it receives a time slice equivalent to the remaining unused portion of its previously allocated time slice (with a minimum threshold applied). This gives latency-sensitive workloads more chances to exceed their time slice when needed to perform short bursts of CPU activity without being interrupted (e.g., real-time audio encoding / decoding workloads). 
Results ======= Initial test results indicate that this scheduler offers around a +5% improvement in frames-per-second (fps) compared to scx_rustland when using the benchmark "playing a video game while recompiling the kernel". This improvement was observed in games such as Cyberpunk 2077, Counter-Strike 2, and Baldur's Gate 3. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-06-27 17:28:42 +02:00

49 Commits