scx-upstream

mirror of https://github.com/sched-ext/scx.git synced 2024-11-25 04:00:24 +00:00

Author	SHA1	Message	Date
Andrea Righi	a155d5185d	scx_bpfland: rely on Topology to classify core types Rely on scx_utils::Topology to classify Big, Little and Turbo CPUs. Moreover, support the special keyword "all" with --primary-domain to include all the CPUs in the system (default). Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-28 00:23:55 +02:00
Andrea Righi	872e653cd2	scx_utils: introduce Turbo core type to Topology Integrate the logic used by scx_bpfland to detect turbo-boosted cores in Topology. Also change the logic to detect Big/Little cores based on base_frequency, instead of scaling_max_freq, otherwise turbo-boosted cores in homogeneous systems may be incorrectly classified as Big. Moreover, introduce the following new methods to Cpu to check for the core type: - is_turbo(): return true if the CPU is Turbo, false otherwise - is_big(): return true if the CPU is either Turbo or Big - is_little(): return true if the CPU is Little Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-28 00:09:08 +02:00
Andrea Righi	e0f49a338a	scx_bpfland: fix turbo boost domain nullifying primary domain limits When creating the turbo boost scheduling domain, we might use a full CPU mask (selecting all possible CPUs) to indicate "do not prioritize turbo boost CPUs" or when all CPUs have the same maximum frequency. This approach works when the primary domain also contains all the CPUs, as the complete overlap allows the CPU selection logic to ignore the turbo boost domain and start picking CPUs directly from the primary domain. However, if the primary domain doesn't include all CPUs, the two domains won't fully overlap, which can lead to the turbo boost domain incorrectly including all CPUs, thereby negating the restrictions set by the primary scheduling domain. To resolve this, an empty CPU mask should be used for the turbo boost domain when turbo boost CPUs aren't prioritized. If the turbo boost domain is empty, it should be entirely bypassed, and the selection should proceed directly to the primary domain. Reported-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-27 13:36:50 +02:00
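The bypass described in commit e0f49a338a can be sketched as follows. This is a minimal Python illustration of the idea only, not the scheduler's BPF code: the function name `pick_idle_cpu` and the use of plain sets as CPU masks are assumptions made for the example, and the later overflow to CPUs outside the primary domain is left out.

```python
from typing import Optional


def pick_idle_cpu(idle: set, turbo: set, primary: set) -> Optional[int]:
    """Pick an idle CPU, trying the turbo domain first and then the primary one.

    Hypothetical helper: CPU masks are modeled as Python sets of CPU ids.
    """
    # An empty turbo mask means "do not prioritize turbo boost CPUs":
    # the turbo domain is bypassed entirely.
    if turbo:
        candidates = idle & turbo
        if candidates:
            return min(candidates)
    # Fall through to the primary domain, which keeps its own restrictions.
    candidates = idle & primary
    return min(candidates) if candidates else None


# With a full turbo mask, an idle CPU outside the primary domain gets picked,
# which is the problem the commit describes.
print(pick_idle_cpu({2, 3}, {0, 1, 2, 3}, {0, 1}))
# With an empty turbo mask, the primary domain limits are respected.
print(pick_idle_cpu({2, 3}, set(), {0, 1}))
```

Representing "no turbo prioritization" as an empty set rather than a full set makes the two cases distinguishable without relying on the masks overlapping completely.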
Andrea Righi	a469f0f1ce	Merge pull request #561 from sched-ext/bpfland-fix-energy-profile-refresh scx_bpfland: prevent reading energy profile if not available	2024-08-25 18:31:34 +02:00
Andrea Righi	f8acd069f0	scx_bpfland: prevent reading energy profile if not available Avoid periodically reading the current performance profile from /sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference if it's not available (e.g., with older CPUs or kernels without cpufreq). This fixes issue #560. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-25 16:53:35 +02:00
Tejun Heo	43950c65bd	build: Use workspace to group rust sub-projects meson build script was building each rust sub-project under rust/ and scheds/rust/ separately. This means that each rust project is built independently which leads to a couple of problems - 1. There are a lot of shared dependencies but they have to be built over and over again for each project. 2. Concurrency management becomes sad - we either have to unleash multiple cargo builds at the same time possibly thrashing the system or build one by one. We've been trying to solve this from meson side in vain. Thankfully, in issue #546, @vimproved suggested using cargo workspace which makes the sub-projects share the same target directory and be built together by the same cargo instance while still allowing each project to behave independently for development and publishing purposes. Make the following changes: - Create two cargo workspaces - one under rust/, the other under scheds/rust/. Each contains all rust projects underneath it. - Don't let meson descend into rust/. These are libraries used by the rust schedulers. No need to build them from meson. Cargo will build them as needed. - Change the rust_scheds build target to invoke `cargo build` in scheds/rust/ and let cargo do its thing. - Remove per-scheduler meson.build files and instead generate custom_targets in scheds/rust/meson.build which invokes `cargo build -p $SCHED`. - This changes rust binary directory. Update README and meson-scripts/install_rust_user_scheds accordingly. - Remove per-scheduler Cargo.lock as scheds/rust/Cargo.lock is shared by all schedulers now. - Unify .gitignore handling. 
The following are build times on Ryzen 3975W: Before: executed in 165.93 secs; usr time 40.55 mins (fish 2.71 millis, external 40.55 mins); sys time 3.34 mins (fish 36.40 millis, external 3.34 mins). After: executed in 36.04 secs; usr time 336.42 secs (fish 0.00 millis, external 336.42 secs); sys time 36.65 secs (fish 43.95 millis, external 36.61 secs). Wallclock time is reduced 5x and CPU time 7x.	2024-08-25 00:47:58 -10:00
Tejun Heo	152a8471cc	scx_bpfland: When reporting stats, use interval deltas Three of the reported stats are cumulative. While they obviously can be processed into delta values, that holds for the other direction too and the cumulative values are difficult to make intuitive sense of. Report interval delta values instead. Note that a stats client can reliably build back cumulative values even under heavy system contention - the delta values reported between two consecutive reads are guaranteed to be correct regardless of the duration of the interval.	2024-08-24 23:14:57 -10:00
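The relationship between cumulative counters and interval deltas in commit 152a8471cc can be illustrated with a short Python sketch. The helper names `to_deltas` and `to_cumulative` are hypothetical and do not correspond to any scx_stats API; the sample values are made up.

```python
from itertools import accumulate


def to_deltas(cumulative: list) -> list:
    """Turn successive cumulative readings into per-interval deltas."""
    previous = [0] + cumulative[:-1]
    return [now - before for before, now in zip(previous, cumulative)]


def to_cumulative(deltas: list) -> list:
    """Rebuild the cumulative series by summing the deltas in order."""
    return list(accumulate(deltas))


readings = [5, 12, 12, 30]          # made-up cumulative counter samples
deltas = to_deltas(readings)        # what a delta-based reporter would emit
print(deltas)
print(to_cumulative(deltas) == readings)
```

Because each delta covers exactly the span between two consecutive reads, summing them recovers the cumulative value no matter how irregular the read intervals are.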
Tejun Heo	bd68e230b9	scx_bpfland: Convert to scx_stats Use scx_stats instead of prometheus for stats reporting. This has a few advantages: - Stats metadata can be defined more succinctly. - Natural support for nesting statistics which will be useful in making scheduler components composable. - Support for multiple programmable readers where each reader can use their own reading interval. - Built-in stats help message generation. - Openmetrics integration is still available through scx_stats/scripts/scxstats_to_openmetrics.py.	2024-08-24 23:14:55 -10:00
Tejun Heo	1bba713a29	Merge pull request #542 from sched-ext/htejun/scx_stats scx_stats, scx_rusty, scx_layered: Implement `--help-stats`	2024-08-24 15:38:36 -10:00
Andrea Righi	5a08855a86	scx_bpfland: always honor average nvcsw in lowlatency mode Keep evaluating the average number of voluntary context switches for each task when lowlatency mode is enabled, even when interactive tasks classification is disabled (via `-c 0`). The average nvcsw is also used in lowlatency mode to evaluate the proportional bonus to the tasks' deadline and it shouldn't be ignored when interactive tasks classification is disabled. Moreover, make sure that such bonus never exceeds the starvation threshold. Keep in mind that it is still possible to disable the periodic average nvcsw evaluation with `-c 0`, without specifying `--lowlatency`. Fixes: `6a22853` ("scx_bpfland: introduce --lowlatency option") Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-24 10:42:22 +02:00
Avraham Hollander	d6e27b59e7	Clean up scx_bpfland help info a bit	2024-08-23 18:55:04 -04:00
Tejun Heo	9e3b4e6db0	scx_stats: A bit of cleanups and renames	2024-08-23 09:09:02 -10:00
Andrea Righi	50684e4569	scx_bpfland: introduce Intel Turbo Boost awareness Make `--primary-domain auto` aware of turbo boosted CPUs and prioritize them over the primary scheduling domain when the energy model `balance_power` is used (typically when running on battery power with the "balanced" profile). With this change the scheduling hierarchy becomes the following: 1) CPUs in the turbo scheduling domain 2) CPUs in the primary scheduling domain 3) full-idle SMT CPUs 4) CPUs in the same L2 cache 5) CPUs in the same L3 cache 6) CPUs in the task's allowed domain And the idle selection logic is modified as follows: - In the turbo scheduling domain: - pick same full-idle SMT CPU - pick any other full-idle SMT CPU sharing the same L2 cache - pick any other full-idle SMT CPU sharing the same L3 cache - pick any other full-idle SMT CPU - pick same idle CPU - pick any other idle CPU sharing the same L2 cache - pick any other idle CPU sharing the same L3 cache - pick any other idle SMT CPU - In the primary scheduling domain: - pick same full-idle SMT CPU - pick any other full-idle SMT CPU sharing the same L2 cache - pick any other full-idle SMT CPU sharing the same L3 cache - pick any other full-idle SMT CPU - pick same idle CPU - pick any other idle CPU sharing the same L2 cache - pick any other idle CPU sharing the same L3 cache - pick any other idle SMT CPU - In the entire task domain: - pick any other idle CPU Keep in mind that the turbo domain will be evaluated only when the scheduler is started with `--primary-domain auto` and only when the `balance_power` energy profile is used. The turbo domain is always made using the subset of CPUs in the system with the highest max frequency. If such a subset can't be determined (for example if all the CPUs in the primary domain have the same frequency), the turbo domain will be ignored. 
Prioritizing turbo boosted CPUs can help to improve performance by forcing the governor to scale up their frequency, without increasing power consumption too much, because tasks will preferably be confined to a reduced number of cores. This change seems to improve performance, without increasing power consumption much, on Intel laptops while using the `balance_power` energy profile. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-23 19:49:08 +02:00
Andrea Righi	d958dd4482	scx_bpfland: introduce dynamic energy profile Introduce the new option `--primary-domain auto`. With this option the scheduler will dynamically adjust the primary scheduling domain at run-time, based on the current energy profile reported in /sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference. When the `power` energy profile is selected, the primary scheduling domain will prioritize E-cores. Alternatively, when the `performance` profile is selected, it will prioritize P-cores. For all the other energy profiles, all the CPUs in the system will be used. Note that this option is only relevant on hybrid architectures with P-cores and E-cores. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-23 19:49:01 +02:00
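The profile-to-domain mapping described in commit d958dd4482 can be sketched in Python as below. This is an illustration under stated assumptions, not the scheduler's Rust code: the function names are hypothetical, CPU sets are plain Python sets, and "all the CPUs in the system" is modeled as the union of P-core and E-core CPUs. Returning `None` when the sysfs file cannot be read mirrors the later fix for systems where the file is not available.

```python
from pathlib import Path
from typing import Optional

# Path quoted in the commit message.
EPP_PATH = "/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference"


def read_energy_profile(path: str = EPP_PATH) -> Optional[str]:
    """Return the current energy profile, or None if it is not available."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def primary_domain(profile: Optional[str], p_cores: set, e_cores: set) -> set:
    """Map an energy profile to the CPUs of the primary scheduling domain."""
    if profile == "power":
        return set(e_cores)
    if profile == "performance":
        return set(p_cores)
    # Any other profile (or no profile at all): use every CPU.
    return set(p_cores) | set(e_cores)


p_cpus, e_cpus = set(range(0, 12)), set(range(12, 20))
print(sorted(primary_domain("power", p_cpus, e_cpus)))
```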
Andrea Righi	6a2285398d	scx_bpfland: introduce --lowlatency option Introduce the new `--lowlatency` option, which enables switching between the default pure vruntime-based scheduling (more optimized for server workloads) and a deadline-based scheduling (better suited for low-latency workloads). When the low-latency mode is activated, a task's deadline is calculated as its vruntime, adjusted by a bonus proportional to the task's average number of voluntary context switches (the more voluntary context switches, the shorter the deadline). This feature enhances the prioritization of interactive tasks even more, proportionally to their average voluntary context switches, also within the two main global queues (priority / shared) and it helps keep interactive workloads responsive, even in the presence of heavy non-interactive background work. Low-latency mode helps prevent audio crackling even in the presence of a large number of short-lived tasks with pseudo-interactive behavior (e.g., hackbench) and it enables achieving approximately a +33% average frames-per-second (FPS) in the typical "gaming while building the kernel" benchmark. However, it can also amplify the de-prioritization of CPU-intensive tasks, making this option more suitable for specific low-latency scenarios. Therefore the low-latency mode is disabled by default and it can only be enabled via the `--lowlatency` option. Tested-by: Piotr Gorski (piotrgorski@cachyos.org) Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-22 13:26:19 +02:00
Andrea Righi	b0a8e4a91e	scx_bpfland: better time slice control Explicitly replenish the task's time slice from ops.dispatch() if the task still wants to run and no other task is selected. In this way the sched_ext core won't automatically re-schedule the task on the same CPU, implicitly assigning a time slice of SCX_SLICE_DFL. Moreover, instead of determining the task time slice in ops.enqueue(), refresh the time slice immediately before the task is started on its assigned CPU in ops.running(). This allows to use a more precise time slice, adjusted based on the actual amount of tasks that are currently waiting to be scheduled. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-22 09:23:37 +02:00
Tejun Heo	f726f0b73b	Version: Cargo.lock	2024-08-21 06:45:19 -10:00
Tejun Heo	4d1f0639d8	Version: v1.0.3	2024-08-21 06:42:11 -10:00
Andrea Righi	fedfee0bd6	scx_bpfland: drop unused variable With the global scx_utils::NR_CPU_IDS we don't need Topology anymore in init_primary_domain(), so drop the variable to fix the following build warning: warning: unused variable: `topo` --> src/main.rs:385:9 \| 385 \| topo: &Topology, \| ^^^^ help: if this is intentional, prefix it with an underscore: `_topo` \| = note: `#[warn(unused_variables)]` on by default Fixes: `1da249f` ("scx_utils::topology: Always use NR_CPU_IDS and NR_CPUS_POSSIBLE") Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-21 17:46:12 +02:00
Andrea Righi	9f7a11bba6	Merge pull request #528 from sched-ext/bpfland-turbo-boost scx_bpfland: properly classify Intel Turbo Boost CPUs	2024-08-21 17:40:25 +02:00
Tejun Heo	9c62019c81	Merge pull request #527 from sched-ext/htejun/scx_utils scx_utils::cpumask,topology: Misc updates	2024-08-20 22:25:25 -10:00
Andrea Righi	695e3b25b0	scx_bpfland: classify CPUs depending on their base frequency Use the base frequency, instead of maximum frequency, to classify fast and slow CPUs. This ensures accurate distinction between Intel Turbo Boost CPUs and genuinely faster CPUs when auto-detecting the primary scheduling domain. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-21 10:16:41 +02:00
Tejun Heo	1da249f063	scx_utils::topology: Always use NR_CPU_IDS and NR_CPUS_POSSIBLE Always use the LazyLock versions and drop the counterparts from Topology.	2024-08-20 21:57:56 -10:00
Andrea Righi	c85315d527	scx_bpfland: allow to completely disable interactive classification Tasks enqueued with SCX_ENQ_WAKEUP are immediately classified as interactive. However, if interactive tasks classification is disabled (via `-c 0`), we should avoid promoting them as interactive. This is particularly important because, with the nvcsw logic disabled, tasks can remain classified as interactive indefinitely and they will never be demoted to regular tasks. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-21 08:45:13 +02:00
Andrea Righi	a9f5aaa536	scx_bpfland: replace custom CpuMask with scx_utils::Cpumask Rely on scx_utils::Cpumask instead of re-implementing a custom struct to parse and manage CPU masks. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-21 07:21:52 +02:00
Andrea Righi	467d4b5ea4	scx_bpfland: get topology information from scx_utils::Topology Rely on scx_utils::Topology to get CPU and cache information, instead of re-implementing custom methods. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-20 10:16:02 +02:00
Andrea Righi	f8a2445869	scx_bpfland: introduce performance/powersave primary domain The primary scheduling domain represents a group of CPUs in the system where the scheduler will initially attempt to assign tasks. Tasks will only be dispatched to CPUs within this primary domain until they are fully utilized, after which tasks may overflow to other available CPUs. The primary scheduling domain can be defined using the option `--primary-domain CPUMASK` (by default all the CPUs in the system are used as primary domain). This change introduces two new special values for the CPUMASK argument: - `performance`: automatically detect the fastest CPUs in the system and use them as primary scheduling domain, - `powersave`: automatically detect the slowest CPUs in the system and use them as primary scheduling domain. The current logic only supports creating two groups: fast and slow CPUs. The fast CPU group is created by excluding CPUs with the lowest frequency from the overall set, which means that within the fast CPU group, CPUs may have different maximum frequencies. When using the `performance` mode the fast CPUs will be used as primary domain, whereas in `powersave` mode, the slow CPUs will be used instead. This option is particularly useful in hybrid architectures (with P-cores and E-cores), as it allows the use of bpfland to prioritize task scheduling on either P-cores or E-cores, depending on the desired performance profile. 
Example:
 - Dell Precision 5480
 - CPU: 13th Gen Intel(R) Core(TM) i7-13800H
 - P-cores: 0-11 / max freq: 5.2GHz
 - E-cores: 12-19 / max freq: 4.0GHz

$ scx_bpfland --primary-domain performance
  0[ 24.5%]   10[ 22.8%]
  1[ 14.9%]   11[ 36.9%]
  2[ 16.2%]   12[  0.0%]
  3[ 25.3%]   13[  0.0%]
  4[ 33.3%]   14[  0.0%]
  5[  9.9%]   15[  0.0%]
  6[ 31.5%]   16[  0.0%]
  7[ 17.4%]   17[  0.0%]
  8[ 23.4%]   18[  0.0%]
  9[ 26.1%]   19[  0.0%]
Avg power consumption: 3.29W

$ scx_bpfland --primary-domain powersave
  0[  2.5%]   10[  0.0%]
  1[  0.0%]   11[  0.0%]
  2[  0.0%]   12[  8.0%]
  3[  0.0%]   13[ 64.2%]
  4[  0.0%]   14[ 29.6%]
  5[  0.0%]   15[ 52.5%]
  6[  0.0%]   16[ 24.7%]
  7[  0.0%]   17[ 30.4%]
  8[  0.0%]   18[ 22.4%]
  9[  0.0%]   19[ 12.4%]
Avg power consumption: 2.17W

(Info collected from htop and turbostat) Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-19 20:19:21 +02:00
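The two-group split described in commit f8a2445869 (the fast group is everything except the CPUs with the lowest frequency) can be sketched in Python as follows. The function name `split_fast_slow` and the dictionary input are assumptions for the example; the frequencies are illustrative values in MHz, not measurements.

```python
def split_fast_slow(max_freq: dict) -> tuple:
    """Split CPUs into (fast, slow) sets from a {cpu_id: max_freq} mapping.

    Slow CPUs are those at the lowest frequency; every other CPU is fast,
    so the fast group may contain CPUs with different maximum frequencies.
    """
    lowest = min(max_freq.values())
    slow = {cpu for cpu, freq in max_freq.items() if freq == lowest}
    fast = set(max_freq) - slow
    return fast, slow


# Illustrative hybrid layout: two faster CPUs with different max
# frequencies and two slower CPUs.
fast, slow = split_fast_slow({0: 5200, 1: 5000, 2: 4000, 3: 4000})
print(sorted(fast), sorted(slow))
```

Note that on a system where every CPU reports the same frequency this sketch yields an empty fast group, so a caller would need to fall back to using all CPUs.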
Andrea Righi	174993f9d2	scx_bpfland: introduce cache awareness While the system is not saturated the scheduler will use the following strategy to select the next CPU for a task: - pick the same CPU if it's a full-idle SMT core - pick any full-idle SMT core in the primary scheduling group that shares the same L2 cache - pick any full-idle SMT core in the primary scheduling group that shares the same L3 cache - pick the same CPU (ignoring SMT) - pick any idle CPU in the primary scheduling group that shares the same L2 cache - pick any idle CPU in the primary scheduling group that shares the same L3 cache - pick any idle CPU in the system While the system is completely saturated (no idle CPUs available), tasks will be dispatched on the first CPU that becomes available. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-19 20:19:21 +02:00
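The selection order listed in commit 174993f9d2 can be modeled with a small Python sketch. It is only an illustration of the ordering: the `Cpu` record, the `pick_cpu` function, and the set-based idle and primary masks are hypothetical, and the real logic lives in BPF code that this sketch does not reproduce.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cpu:
    id: int
    core: int  # physical core id (SMT siblings share it)
    l2: int    # L2 cache id
    l3: int    # L3 cache id


def pick_cpu(prev: int, cpus: list, idle: set, primary: set) -> Optional[int]:
    """Return the first idle CPU matching the steps, in order, else None."""
    prev_cpu = next(c for c in cpus if c.id == prev)

    def full_idle(c: Cpu) -> bool:
        # A core is full-idle when all of its SMT siblings are idle.
        return all(s.id in idle for s in cpus if s.core == c.core)

    steps = [
        lambda c: c.id == prev and full_idle(c),
        lambda c: c.id in primary and c.l2 == prev_cpu.l2 and full_idle(c),
        lambda c: c.id in primary and c.l3 == prev_cpu.l3 and full_idle(c),
        lambda c: c.id == prev,
        lambda c: c.id in primary and c.l2 == prev_cpu.l2,
        lambda c: c.id in primary and c.l3 == prev_cpu.l3,
        lambda c: True,  # any idle CPU in the system
    ]
    for matches in steps:
        for c in cpus:
            if c.id in idle and matches(c):
                return c.id
    return None  # saturated: no idle CPU available


# Two SMT cores (CPUs 0-1 and 2-3), one L2 per core, one shared L3.
topo = [Cpu(0, 0, 0, 0), Cpu(1, 0, 0, 0), Cpu(2, 1, 1, 0), Cpu(3, 1, 1, 0)]
print(pick_cpu(0, topo, idle={1, 2, 3}, primary={0, 1, 2, 3}))
```

In the usage line, CPU 0 is busy, so its sibling CPU 1 is not part of a full-idle core; the sketch moves on to the full-idle core sharing the same L3 cache.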
Tejun Heo	c16b48d7b2	scheds/rust: Include Cargo.lock in the repo Binary packages are expected to include Cargo.lock in the repo so that the produced binaries match across different builds.	2024-08-15 23:08:35 -10:00
Andrea Righi	f9a994412d	scx_bpfland: introduce primary scheduling domain Allow to specify a primary scheduling domain via the new command line option `--primary-domain CPUMASK`, where CPUMASK can be a hex number of arbitrary length, representing the CPUs assigned to the domain. If this option is not specified the scheduler will use all the available CPUs in the system as primary domain (no behavior change). Otherwise, if a primary scheduling domain is defined, the scheduler will try to dispatch tasks only to the CPUs assigned to the primary domain, until these CPUs are saturated, at which point tasks may overflow to other available CPUs. This feature can be used to prioritize certain cores over others and it can be really effective in systems with heterogeneous cores (e.g., hybrid systems with P-cores and E-cores). == Example (hybrid architecture) == Hardware: - Dell Precision 5480 with 13th Gen Intel(R) Core(TM) i7-13800H - 6 P-cores 0..5 with 2 CPUs each (CPU from 0..11) - 8 E-cores 6..13 with 1 CPU each (CPU from 12..19) == Test == WebGL application (https://webglsamples.org/aquarium/aquarium.html): this allows to generate a steady workload in the system without over-saturating the CPUs. Use different scheduler configurations: - EEVDF (default) - scx_bpfland using P-cores only (--primary-domain 0x00fff) - scx_bpfland using E-cores only (--primary-domain 0xff000) Measure performance (fps) and power consumption (W). 
== Result ==
                  +-----+-----+------+-------+--------+
                  | min | max | avg  |       |        |
                  | fps | fps | fps  | stdev | power  |
+-----------------+-----+-----+------+-------+--------+
| EEVDF           |  28 |  34 | 31.0 |  1.73 |   3.5W |
| bpfland-p-cores |  33 |  34 | 33.5 |  0.29 |   3.5W |
| bpfland-e-cores |  25 |  26 | 25.5 |  0.29 |   2.2W |
+-----------------+-----+-----+------+-------+--------+
Using a primary scheduling domain of only P-cores with scx_bpfland achieves a more stable and predictable level of performance, with an average of 33.5 fps and an error of ±0.5 fps. In contrast, using EEVDF results in an average frame rate of 31.0 fps with an error of ±3.0 fps, indicating slightly less consistency, due to the fact that tasks are evenly distributed across all the cores in the system (both slow and fast cores). On the other hand, using a scheduling domain solely of E-cores with scx_bpfland results in a lower average frame rate (25.5 fps), though it maintains a stable performance (error of ±0.5 fps), but the power consumption is also reduced, averaging 2.2W, compared to 3.5W with either of the other configurations.
== Conclusion ==
In summary, with this change users have the flexibility to prioritize scheduling on performance cores for better performance and consistency, or prioritize energy efficient cores for reduced power consumption, on hybrid architectures. Moreover, this feature can also be used to minimize the number of cores used by the scheduler, until they reach full capacity. This capability can be useful for reducing power consumption even in homogeneous systems or for conducting scheduling experiments with smaller sets of cores, provided the system is not overcommitted. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-14 16:17:54 +02:00
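To see which CPUs a hex CPUMASK such as `0x00fff` or `0xff000` selects, the mask can be decoded bit by bit. The Python helper below is a hypothetical illustration (the scheduler itself parses masks in Rust); bit N of the mask is taken to stand for CPU N.

```python
def cpus_in_mask(mask: str) -> list:
    """Return the CPU ids whose bits are set in a hex mask string."""
    value = int(mask, 16)  # accepts an optional "0x" prefix
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


print(cpus_in_mask("0x00fff"))  # P-core CPUs in the example above
print(cpus_in_mask("0xff000"))  # E-core CPUs in the example above
```

Since Python integers are unbounded, the same helper handles hex masks of arbitrary length.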
Andrea Righi	a6e977c70b	scx_bpfland: make output more compact Abbreviate the statistics reported to stdout and remove the slice_ms metric: this metric can be easily derived from slice_ns, slice_ns_min and nr_wait, which is already reported to stdout. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-14 16:17:54 +02:00
Andrea Righi	8656effa50	scx_bpfland: update copyright info Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-14 16:17:54 +02:00
Tejun Heo	63c4a0191f	Merge branch 'main' into topic/inlined-skeleton-members	2024-08-08 14:23:37 -10:00
Tejun Heo	cd6a4d72c7	Bump versions for 1.0.2 release	2024-08-08 14:10:16 -10:00
Tejun Heo	7c3ffe96e1	Unify crate dependency versions Different sub-projects are using different versions for the same crates. Synchronize them to the latest.	2024-08-08 13:26:47 -10:00
Andrea Righi	b87541a26e	scx_rustland_core: refactor idle CPU selection logic Use the same idle selection logic used in scx_bpfland also in scx_rustland_core. Also drop fifo_mode and always use the BPF idle selection logic by default as long as the system is not saturated, unless full_user is specified. This approach allows user-space schedulers aiming for maximum performance to leverage the BPF idle selection logic (bypassing user-space), while those seeking full control can enable full_user to bypass the BPF CPU idle selection logic and choose the target CPU for each task from user-space. Signed-off-by: Andrea Righi <andrea.righi@linux.dev>	2024-08-07 08:10:53 +02:00
Andrea Righi	bee0d699ef	scx_bpfland: always re-align task's vruntime to the global vruntime Immediately re-align p->scx.dsq_vtime to the global vruntime (+/- slice lag) as soon as we are evaluating the task's vruntime. This allows the task to rapidly chase the minimum global vruntime, ensuring that tasks with a predominantly sleeping behavior pattern are not over-prioritized. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-08-02 13:11:25 +02:00
Andrea Righi	19854f1535	scx_bpfland: allow to specify negative values with --slice-us-lag Using negative values with --slice-us-lag can be useful to make performance more consistent and prioritize newly created tasks over the running tasks. Therefore, allow to specify negative values from the command line and also update the documentation of this option. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-26 09:10:18 +02:00
Andrea Righi	46ddca6bd5	scx_bpfland: report task time slice to stdout Periodically report to stdout samples of the effective time slice applied to tasks. While one could determine this metric by examining the max slice_ns and nr_waiting metrics, directly reporting it to stdout allows users to quickly identify what is happening and it provides a clearer overview of the scheduling behavior. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-22 22:01:49 +02:00
Andrea Righi	c1d93d2a00	scx_bpfland: drop kthread dispatches metric Dispatching per-CPU kthreads directly is disabled by default, reporting this metric can generate some confusion (since it is always 0), and even if local kthread dispatches are enabled, they should be still considered as regular direct dispatches (there is no difference in practice). Therefore, merge direct kthread dispatches into direct dispatches and drop the separate nr_kthread_dispatches metric. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-22 22:01:49 +02:00
Andrea Righi	a5f1d6b595	scx_bpfland: show average amount of tasks waiting to be dispatched Periodically report the average amount of tasks sitting in the priority and shared DSQs. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-22 22:01:45 +02:00
Andrea Righi	5908a985bc	scx_bpfland: adjust task time slice based on the amount of waiting tasks Scale the task's time slice based on the average amount of tasks that are currently waiting to be dispatched. Use a moving average for the amount of waiting tasks to smooth out potential spikes caused by temporary bursts of tasks piling in the wait queues. This was initially modeled in scx_rustland and it seems to work pretty well also in scx_bpfland now. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-22 21:53:25 +02:00
Andrea Righi	c4eb3ce7b4	scx_bpfland: introduce dynamic nvcsw threshold Instead of using a static value to classify tasks based on their average amount of voluntary context switches, try to periodically evaluate an optimal threshold, based on a global average of voluntary context switches among all the running tasks. Tasks with an average amount of voluntary context switches greater than the global average will be classified as interactive. The global average is evaluated as an exponentially weighted moving average (EWMA), as: avg(t) = avg(t - 1) * 0.75 + task_avg(t) * 0.25 This approach is more efficient than iterating through all tasks and it helps to prevent rapid fluctuations that may be caused by bursts of voluntary context switch events. The dynamic nvcsw threshold enables a more precise adjustment of the classification criteria to swiftly respond to global system changes: tasks can be quickly classified as interactive, but if the system experiences too many interactive events, the criteria for maintaining interactive status become stricter. This creates a natural selection process where only the most deserving tasks remain interactive. Additionally, introduce the new option `--nvcsw-max-thresh N`, which allows to extend or restrict the fluctuation range of the global average threshold for voluntary context switches. Tested-by: Piotr Gorski <piotrgorski@cachyos.org> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-18 19:03:25 +02:00
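The moving-average threshold in commit c4eb3ce7b4 can be sketched in a few lines of Python. This is a minimal illustration of an EWMA with weights 0.75 and 0.25, in which the two weighted terms are summed; the function names are hypothetical, and the clamping controlled by `--nvcsw-max-thresh` is not modeled.

```python
def update_global_avg(avg: float, task_avg: float, weight: float = 0.75) -> float:
    """One EWMA step: keep 75% of the old average, blend in 25% of the sample."""
    return avg * weight + task_avg * (1.0 - weight)


def is_interactive(task_avg: float, global_avg: float) -> bool:
    """A task is interactive when its nvcsw average exceeds the global one."""
    return task_avg > global_avg


global_avg = 8.0
for sample in (16.0, 16.0, 2.0):   # made-up per-task nvcsw averages
    global_avg = update_global_avg(global_avg, sample)
    print(round(global_avg, 3), is_interactive(sample, global_avg))
```

Each update costs a constant amount of work per sample, which is why this form avoids iterating over all tasks, and the 0.75 weight on the previous value damps short bursts.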
Daniel Müller	565aec3662	rust: Update libbpf-rs & libbpf-cargo to 0.24 Update libbpf-rs & libbpf-cargo to 0.24. Among other things, generated skeletons now contain directly accessible map and program objects, no longer necessitating the use of accessor methods. As a result, the risk for mutability conflicts is reduced greatly. Signed-off-by: Daniel Müller <deso@posteo.net>	2024-07-16 11:48:52 -07:00
Tejun Heo	51334b5c4d	Bump versions for 1.0.1 release	2024-07-15 13:21:52 -10:00
Andrea Righi	8e7a526356	scx_bpfland: use nr_cpu_ids for consistency We always use nr_cpu_ids to represent the maximum CPU id returned by scx_bpf_nr_cpu_ids(). Replace cpu_max with nr_cpu_ids to be more consistent with the rest of the code. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 08:44:35 +02:00
Andrea Righi	33d06f653b	scx_bpfland: get rid of the MAX_CPUS hard-coded limit We can rely on scx_bpf_nr_cpu_ids() to create all the possible per-CPU DSQs, eliminating the need for the hard-coded limit MAX_CPUS. In this way scx_bpfland can support the same amount of CPUs that the kernel can handle. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 00:17:30 +02:00
Andrea Righi	b80ef7d8eb	scx_bpfland: optimize offline CPU handling Instead of constantly checking the need to drain tasks from the DSQs of the offline CPUs, provide an atomic flag to notify when there are tasks to be drained from the offline CPUs. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 00:17:23 +02:00
Andrea Righi	0530706710	scx_bpfland: properly initialize the nvcsw metrics Initialize the number of voluntary context switches metrics in the local task storage. Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 00:16:10 +02:00
Andrea Righi	bf4ad23599	scx_bpfland: refine interactive tasks flood safeguard Refine the safeguard mechanism to avoid generating too many interactive tasks in the system, which could nullify the effect of the interactive/regular task classification. The safeguard mechanism operates by pausing the promotion of new tasks to interactive status during the task wake-up process, whenever the number of interactive tasks in the priority queue exceeds a specific limit (set to 4x the number of online CPUs). Halting the promotion of additional interactive tasks allows to prioritize those already classified as interactive, thereby preventing potential "bursts" of excessive interactive tasks in the system. This refines the mitigation already provided by commit `640bd562` ("scx_bpfland: prevent tasks from abusing interactive priority boost"). Fixes: `640bd562` ("scx_bpfland: prevent tasks from abusing interactive priority boost") Signed-off-by: Andrea Righi <righi.andrea@gmail.com>	2024-07-15 00:11:34 +02:00
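The safeguard in commit bf4ad23599 reduces to a simple threshold check, sketched below in Python. The function name and arguments are hypothetical; only the 4x factor and the "exceeds" condition come from the commit message.

```python
def may_promote(nr_interactive_queued: int, nr_online_cpus: int, factor: int = 4) -> bool:
    """Allow promotion to interactive unless the priority queue is flooded.

    Promotion is paused once the number of queued interactive tasks
    exceeds factor * nr_online_cpus.
    """
    return nr_interactive_queued <= factor * nr_online_cpus


print(may_promote(32, 8))   # at the limit: promotion still allowed
print(may_promote(33, 8))   # over the limit: promotion paused
```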

75 Commits