Commit Graph

488 Commits

Author SHA1 Message Date
Tejun Heo
970c04b43a compat: Drop support for missing sched_ext_ops.exit_dump_len
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop support for missing sched_ext_ops.exit_dump_len.
The open helper macros now check the existence of the field and abort if
missing.
2024-06-16 06:37:34 -10:00
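A field-existence check of this kind could be implemented by inspecting the
kernel's BTF with libbpf. The sketch below is illustrative only and is not the
actual compat code; struct_has_field() is a hypothetical helper name.

  /* Illustrative sketch (not the actual scx compat code): check via vmlinux
   * BTF whether a struct, e.g. sched_ext_ops, has a given field. */
  #include <stdbool.h>
  #include <string.h>
  #include <bpf/btf.h>

  static bool struct_has_field(const char *struct_name, const char *field)
  {
          struct btf *btf = btf__load_vmlinux_btf();
          const struct btf_type *t;
          const struct btf_member *m;
          bool found = false;
          int id, i;

          if (!btf)
                  return false;

          id = btf__find_by_name_kind(btf, struct_name, BTF_KIND_STRUCT);
          if (id < 0)
                  goto out;

          t = btf__type_by_id(btf, id);
          for (i = 0, m = btf_members(t); i < btf_vlen(t); i++, m++) {
                  if (!strcmp(btf__name_by_offset(btf, m->name_off), field)) {
                          found = true;
                          break;
                  }
          }
  out:
          btf__free(btf);
          return found;
  }

An open helper could then, for example, bail out when
struct_has_field("sched_ext_ops", "exit_dump_len") is false on an old kernel.
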
Tejun Heo
046bdfd5e0 compat: Drop support for missing sched_ext_ops.hotplug_seq
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop support for missing sched_ext_ops.hotplug_seq.
The open helper macros now check the existence of the field and abort if
missing.
2024-06-16 06:34:59 -10:00
Tejun Heo
dde2942125 compat: Drop __COMPAT_scx_bpf_cpuperf_*()
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_cpuperf_*(). The open helper
macros now check the existence of scx_bpf_cpuperf_cap() and abort if not.
2024-06-16 06:16:53 -10:00
Tejun Heo
13e8388e1e compat: Drop __COMPAT_HAS_CPUMASKS
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_HAS_CPUMASKS(). The open helper macros
now check the existence of scx_bpf_nr_cpu_ids() and abort if not.
2024-06-16 06:12:06 -10:00
Tejun Heo
66901e2b44 compat: Drop __COMPAT_scx_bpf_dump()
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_dump(). The open helper macros
now check the existence of scx_bpf_dump_bstr() and abort if not.

While at it, reorder the min requirement checks so that newly added ones are
up top to make testing easier.
2024-06-16 06:02:47 -10:00
Tejun Heo
0d8adf2260 compat: Drop __COMPAT_scx_bpf_exit()
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_exit(). The open helper macros
now check the existence of scx_bpf_exit_bstr() and abort if not.
2024-06-15 20:36:17 -10:00
Tejun Heo
5b5e5be906 compat: Drop __COMPAT_SCX_KICK_IDLE
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_SCX_KICK_IDLE. The open helper macros
now check the existence of SCX_KICK_IDLE and abort if not.
2024-06-15 20:24:15 -10:00
Tejun Heo
b730f35e68 scx/common.h: Improve SCX_BUG() macro
There's no guarantee that errno is set or contains relevant information when
SCX_BUG() is invoked. This sometimes leads to "task failed successfully"
messages:

  # ./scx_simple
  ../scheds/c/scx_simple.c:72 [scx panic]: Success
  SCX_OPS_SWITCH_PARTIAL missing, kernel too old?

While not critical, it's not great. Let's update it so that errno is printed
in parentheses when non-zero and the tag matches the macro name, so that
what's printed is the following:

  # ./scx_simple
  [SCX_BUG] ../scheds/c/scx_simple.c:72
  SCX_OPS_SWITCH_PARTIAL missing, kernel too old?
2024-06-15 20:17:32 -10:00
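A minimal sketch of what such a macro could look like (illustrative only, not
the exact scx/common.h definition):

  #include <errno.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Print a "[SCX_BUG] file:line" tag, append errno in parentheses only when
   * it is non-zero, then print the message on its own line and exit. */
  #define SCX_BUG(fmt, ...) do {                                          \
          fprintf(stderr, "[SCX_BUG] %s:%d", __FILE__, __LINE__);         \
          if (errno)                                                      \
                  fprintf(stderr, " (%s)", strerror(errno));              \
          fprintf(stderr, "\n" fmt "\n", ##__VA_ARGS__);                  \
          exit(1);                                                        \
  } while (0)
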
Tejun Heo
7c9aedaefe compat: Drop __COMPAT_scx_bpf_switch_all()
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_switch_all(). The open helper
macros now check the existence of SCX_OPS_SWITCH_PARTIAL and abort if not.
2024-06-15 20:03:37 -10:00
Tejun Heo
dd6255a601
Merge pull request #359 from sched-ext/htejun/cosmetic
common.bpf.h: Cosmetic changes
2024-06-15 06:42:00 -10:00
Andrea Righi
cb20a6f136 scx_rlfifo: dispatch all tasks on the first CPU available
With commit 786ec0c0 ("scx_rlfifo: schedule all tasks in user-space")
all the scheduling decisions are now happening in user-space. This also
bypasses the built-in idle selection logic, delegating the CPU selection
for each task to the user-space scheduler.

The easiest way to distribute tasks across the available CPUs is to
simply dispatch them to the first CPU available.

In this way the scheduler becomes usable in practical scenarios and at
the same time it also maintains its simplicity.

This allows all tasks to be spread across all the available CPUs.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-15 16:13:53 +02:00
Andrea Righi
786ec0c04a scx_rlfifo: schedule all tasks in user-space
Disable all the BPF optimization shortcuts by default and force all
tasks to be processed by the user-space scheduler.

Given that the primary goal of this scheduler is to offer a
straightforward and intuitive example for experimental purposes, this
change simplifies the process for individuals looking to experiment,
allowing them to apply changes to user-space code and quickly observe
the effects, without dealing with any in-kernel optimizations.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-15 16:07:39 +02:00
Andrea Righi
59f47d6659 scx_rlfifo: improve code readability
No functional change, just add some comments to better describe the
parameters used when initializing the main BpfScheduler object.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-15 16:05:28 +02:00
Tejun Heo
d3b34d1df7 scx_qmap: Rename central_timer to monitor_timer
The name was copied from scx_central.bpf.c and doesn't match what the timer
is used for in scx_qmap.bpf.c.
2024-06-14 16:07:20 -10:00
Tejun Heo
13abb6fd26 scx/common.bpf.h: Reorganize
Currently, the BPF declarations and generic helpers are in the same section.
Let's move the generic helpers down to their own section.
2024-06-14 15:36:00 -10:00
Tejun Heo
d7677e3e5c scx/common.bpf.h: Rename bpf_log2[l]() to u32/64_log2()
The bpf_ prefix is used for BPF API. Rename bpf_log2() to u32_log2() and
bpf_log2l() to u64_log2(). While at it, relocate them below compiler
directive helpers.
2024-06-14 15:22:39 -10:00
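For reference, helpers along these lines compute the floor of the base-2
logarithm (illustrative sketch, not necessarily the exact common.bpf.h
implementation):

  /* Illustrative log2 helpers; the u32/u64 typedefs are shown here only for
   * self-containment and normally come from the common headers. */
  typedef unsigned int u32;
  typedef unsigned long long u64;

  static inline u32 u32_log2(u32 v)
  {
          u32 r = 0;

          /* floor(log2(v)); returns 0 for v <= 1 */
          while (v >>= 1)
                  r++;
          return r;
  }

  static inline u32 u64_log2(u64 v)
  {
          u32 hi = v >> 32;

          return hi ? u32_log2(hi) + 32 : u32_log2((u32)v);
  }
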
Tejun Heo
5a2412c211 scx/common.bpf.h: Minor comment updates 2024-06-14 15:22:29 -10:00
Andrea Righi
8c6fe540eb scx_rustland: prevent excessive starvation when system is congested
Keep track of the maximum vruntime among all tasks and flush them if the
difference between the maximum and minimum vruntime exceeds slice_ns.

This helps to prevent excessive starvation, as every task is guaranteed
to be dispatched within the slice_ns time limit.

Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-14 20:09:19 +02:00
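The starvation guard boils down to a simple condition on the vruntime spread;
a hypothetical sketch (not the actual scx_rustland code):

  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical helper: flush the queued tasks once the spread between the
   * maximum and minimum vruntime exceeds one slice, so every task is
   * dispatched within roughly slice_ns worth of vruntime. */
  static bool should_flush_queue(uint64_t vtime_max, uint64_t vtime_min,
                                 uint64_t slice_ns)
  {
          return vtime_max - vtime_min > slice_ns;
  }
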
Changwoo Min
94a39f419f scx_lavd: add the design of core compaction
The core compaction seems to work great on various hardware. Now it is
time to document its design.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-14 11:53:52 +09:00
Changwoo Min
5068d75bf3
Merge pull request #351 from multics69/lavd-power-v2
scx_lavd: improve CPU frequency scaling
2024-06-14 09:29:10 +09:00
Tejun Heo
a3342810c7
Merge pull request #352 from dschatzberg/mitosis
common: Add css iter forward declares
2024-06-13 06:50:06 -10:00
Dan Schatzberg
114e4b644b common: Add css iter forward declares
These are used in mitosis, but they belong in common code so other
schedulers can do css iteration.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-06-12 15:02:48 -07:00
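The forward declarations in question presumably mirror the kernel's css
iterator kfuncs; a sketch of what they could look like in BPF code (the exact
signatures should be checked against the kernel sources):

  struct cgroup_subsys_state;
  struct bpf_iter_css;

  extern int bpf_iter_css_new(struct bpf_iter_css *it,
                              struct cgroup_subsys_state *start,
                              unsigned int flags) __ksym;
  extern struct cgroup_subsys_state *
  bpf_iter_css_next(struct bpf_iter_css *it) __ksym;
  extern void bpf_iter_css_destroy(struct bpf_iter_css *it) __ksym;
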
Changwoo Min
747bf2a7d7 scx_lavd: add the design of CPU frequency scaling
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-13 01:42:19 +09:00
Changwoo Min
2e74b86b4a scx_lavd: logging cpu performance target
Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-13 00:44:04 +09:00
Changwoo Min
e6348a11e9 scx_lavd: improve frequency scaling logic
The old CPU frequency scaling logic checks the task's CPU performance
target (i.e., target CPU frequency) every tick interval and updates it
immediately. In other words, it samples and updates a performance target
every tick interval, so the CPU frequency fluctuates every tick
interval, resulting in less steady performance.

Now, we take a different strategy. The key idea is to increase the
frequency as soon as possible when a task starts running, for quick
adaptation to load spikes. However, if necessary, the target is
decreased gradually every tick interval to avoid frequency fluctuations.

In my testing, it shows more stable performance in many workloads
(games, compilation).

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-12 23:40:40 +09:00
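A rough sketch of the described policy, with hypothetical names and constants
(not the scx_lavd implementation):

  typedef unsigned int u32;

  #define PERF_TARGET_MAX   1024  /* assumed scale of the performance target */
  #define PERF_DECAY_STEP     64  /* hypothetical per-tick decay */

  static u32 cpu_perf_target;

  /* When a task starts running, jump straight to the maximum target so the
   * CPU adapts quickly to load spikes. */
  static void on_task_start_running(void)
  {
          cpu_perf_target = PERF_TARGET_MAX;
  }

  /* On every tick, decay the target gradually toward what is currently
   * needed instead of dropping it immediately, avoiding fluctuations. */
  static void on_tick(u32 required_perf)
  {
          if (cpu_perf_target > required_perf + PERF_DECAY_STEP)
                  cpu_perf_target -= PERF_DECAY_STEP;
          else
                  cpu_perf_target = required_perf;
  }
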
Changwoo Min
753f333c09 scx_lavd: refactoring do_update_sys_stat()
Originally, do_update_sys_stat() simply calculated the system-wide CPU
utilization. Over time, it has evolved to collect all kinds of
system-wide, periodic statistics for decision-making, so it has become
bulky. Now, it is time to refactor it for readability. This commit does
not contain functional changes other than refactoring.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-12 21:15:25 +09:00
Changwoo Min
9d129f0afa scx_lavd: rename LAVD_CPU_UTIL_INTERVAL_NS to LAVD_SYS_STAT_INTERVAL_NS
The periodic CPU utilization routine does a lot of other work now. So we
rename LAVD_CPU_UTIL_INTERVAL_NS to LAVD_SYS_STAT_INTERVAL_NS.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-12 20:06:17 +09:00
Changwoo Min
7046b47b9c scx_lavd: properly calculate task's runtime after suspend/resume
When a device is suspended and resumed, the suspended duration is added
to a task's runtime if the task was running on the CPU. After the
resume, the task's runtime is incorrectly long and the scheduler starts
to recognize the system as being under heavy load. To avoid this
problem, the suspended duration is measured and subtracted from the
task's runtime.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-12 15:58:41 +09:00
Dan Schatzberg
b95cfb0772 mitosis: Fix build
The target wasn't dependent on the previous sched, so building all
schedulers ended up not building scx_mitosis, which broke the install
script.
2024-06-11 14:33:32 -07:00
Dan Schatzberg
9528d4603e
Merge pull request #339 from dschatzberg/mitosis
scheds: Add scx_mitosis scheduler
2024-06-11 16:50:25 -04:00
Dan Schatzberg
3b6e2dee20 scheds: Add scx_mitosis scheduler
scx_mitosis is a dynamic affinity scheduler which assigns cgroups to
cells and cells to discrete sets of CPUs. The number of cells is
dynamic, as is the CPU assignment. BPF mostly just does vtime scheduling
for each cell, tracks load, and responds to reconfiguration from
userspace. Userspace makes the decisions about how to assign cgroups to
cells and cells to CPUs.

This is not yet a complete scheduler; much of the userspace logic is a
placeholder as I experiment with better logic. I also want to add richer
scheduling semantics to userspace, e.g. so that cells can do more
"soft-affinity" rather than the strict partitioning implemented
currently.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-06-11 10:34:53 -07:00
David Vernet
1dbf874709
Merge pull request #341 from vax-r/rusty_data_races
scx_rusty: Eliminate the possibility of data races for domain min_vruntime
2024-06-11 12:04:40 -05:00
David Vernet
b50ba626cc
uei: Pass skel to RESIZE_ARRAY()
The RESIZE_ARRAY() macro assumes the presence of an in-scope "skel" variable.
This is bad practice and can cause issues in other macros that use it. Let's
update it to explicitly take a skel argument.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-11 10:15:26 -05:00
I Hsin Cheng
4e30bb9ccf scx_rusty: Eliminate the possibility of data races for domain min_vruntime
The READ_ONCE()/WRITE_ONCE() macros were added in commit 0932fde, so we
should be able to use them to avoid the possibility of data races on
domc->min_vruntime.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-06-11 10:57:03 +08:00
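For illustration, the access pattern looks roughly like the following (a
sketch with a commonly used simplified definition of the macros, not the exact
scx_rusty code):

  /* Simplified READ_ONCE()/WRITE_ONCE(): force a single, non-torn access
   * through a volatile pointer so the compiler can't re-read or cache it. */
  #define READ_ONCE(x)      (*(volatile typeof(x) *)&(x))
  #define WRITE_ONCE(x, v)  (*(volatile typeof(x) *)&(x) = (v))

  struct dom_ctx {
          unsigned long long min_vruntime;
  };

  /* Illustrative update rule only: read the shared field once, then publish
   * the new value with a single store. */
  static void update_dom_min_vruntime(struct dom_ctx *domc,
                                      unsigned long long new_vruntime)
  {
          if (READ_ONCE(domc->min_vruntime) < new_vruntime)
                  WRITE_ONCE(domc->min_vruntime, new_vruntime);
  }
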
Tejun Heo
30f27d99d9
Merge pull request #340 from sched-ext/htejun/layered-updates
scx_layered: Improve yield, preemption and other behaviors
2024-06-10 11:27:44 -10:00
Tejun Heo
9ec3594b4f scx_layered: Several fixes to address David's review
- pick_idle_cpu() was putting the idle_smtmask that it hadn't acquired.

- layered_enqueue() was unnecessarily entering the preemption path after finding
  an idle CPU.

- No need to test whether scx_bpf_get_idle_cpu/smtmask() return NULL. They
  never do.

- Relocate the cctx->yielding test into keep_running() from its caller.
2024-06-10 11:23:37 -10:00
Tejun Heo
92317aa2f9 Use __always_inline uniformly
Instead of using __attribute__((always_inline)), use the __always_inline
macro provided by BPF.
2024-06-10 11:23:26 -10:00
Changwoo Min
472ab945b8
scx_lavd: core compaction for low power consumption (#338)

When system-wide CPU utilization is low, it is very likely that all the
CPUs are running with very low utilization. That means all CPUs run at a
low clock frequency thanks to dynamic frequency scaling and very
frequently enter and exit C-states. That results in low performance
(i.e., low clock frequency) and high power consumption (i.e., frequent
P-/C-state transitions).

The idea of *core compaction* is to use fewer CPUs when system-wide CPU
utilization is low. The chosen cores (called "active cores") will run at
higher utilization and higher clock frequency, and the rest of the cores
(called "idle cores") will stay in a C-state for a much longer duration.
Thus, core compaction can achieve higher performance with lower power
consumption.

One potential problem of core compaction is latency spikes when all the
active cores are overloaded. A few techniques are incorporated to solve
this problem.

1) Limit each active CPU core's utilization to below a certain threshold (say 50%).

2) Do not use the core compaction when the system-wide utilization is
   moderate (say 50%).

3) Do not enforce the core compaction for kernel and pinned user-space
   tasks since they are manually optimized for performance.

In my experiments, under a wide range of system-wide CPU utilization
(5%-80%), core compaction reduces power consumption by 7-30% without
sacrificing average and 99th percentile tail latency.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
2024-06-08 09:25:27 +09:00
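As a toy illustration of how the active-core count could scale with
utilization under such a per-core cap (hypothetical numbers and names, not the
scx_lavd code):

  #define PER_CORE_UTIL_CAP    50  /* percent, per active core */
  #define COMPACTION_OFF_UTIL  50  /* percent, system-wide */

  static int nr_active_cores(int nr_cpus, int sys_util_pct)
  {
          int nr;

          /* moderate or higher load: don't compact, use every core */
          if (sys_util_pct >= COMPACTION_OFF_UTIL)
                  return nr_cpus;

          /* spread the total utilization so each active core stays below
           * the cap */
          nr = (sys_util_pct * nr_cpus + PER_CORE_UTIL_CAP - 1) / PER_CORE_UTIL_CAP;
          if (nr < 1)
                  nr = 1;
          if (nr > nr_cpus)
                  nr = nr_cpus;
          return nr;
  }
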
Tejun Heo
a165970ab9 scx_layered: Add migration statistic
Keep track of how frequent migrations are.
2024-06-07 11:49:39 -10:00
Tejun Heo
5b31d96c3d scx_layered: Implement "preempt_first" layer property
If set, tasks in the layer will try to preempt tasks on their previous CPUs
before trying to find idle CPUs.
2024-06-07 11:49:39 -10:00
Tejun Heo
ece3638664 scx_layered: Allow confined layers to preempt
There's no reason to prevent confined layers from preempting on the CPUs
that they are entitled to. Allow preemption for confined layers.
2024-06-07 11:49:39 -10:00
Tejun Heo
7c48814ed0 scx_layered: Prefer preempting the CPU the task was previously on
Currently, when preempting, searching for the candidate CPU always starts
from the RR preemption cursor. Let's first try the previous CPU the
preempting task was on, as that may have some locality benefits.
2024-06-07 11:49:38 -10:00
Tejun Heo
3db3257911 scx_layered: Find and kick an idle CPU from enqueue path
When a task is enqueued outside the wakeup path, ops.select_cpu() isn't
called, so we can end up in a situation where a newly enqueued task keeps
waiting in one of the DSQs while there are idle CPUs. Factor out the idle
CPU selection path into pick_idle_cpu() and call it from the enqueue path in
such cases. This problem is shared across schedulers and likely needs a more
generic solution in the future.
2024-06-07 11:49:38 -10:00
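Conceptually, the enqueue-side fallback could look something like the sketch
below (heavily simplified, assumes the scx common BPF headers; not the actual
scx_layered code):

  /* If there's an idle CPU the newly enqueued task can run on, kick it so
   * the task doesn't sit in a DSQ while CPUs stay idle. */
  static void maybe_kick_idle_cpu(struct task_struct *p)
  {
          s32 cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);

          if (cpu >= 0)
                  scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
  }
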
Tejun Heo
0f2d1ad2fa scx_layered: Implement a new layer parameter "yield_ignore"
yield(2) currently gives up the entire slice. Add a "yield_ignore" layer
parameter which can modulate the magnitude of yielding. At 1.0, yields
are completely ignored; at 0.5, only half of the full slice is given up,
and so on.
2024-06-07 11:49:38 -10:00
Tejun Heo
4aa8124b9c scx_layered: Add explicit yield() support
Currently, a task which yields is treated the same as a task which has
run out of its slice. As the budget charged to a task is calculated from
wall clock time, a repeatedly yielding task can stay at the top of the
queue for quite a while, hogging the CPU and spiking the number of
scheduling events.

Let's add explicit yield support. A yielding task is now always charged
the full slice and not allowed to keep running on the same CPU.
2024-06-07 11:49:38 -10:00
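In sched_ext terms, charging a yielding task its full slice can be expressed
by zeroing whatever is left of its slice; a sketch (not the exact scx_layered
code):

  /* Give up the remaining slice: the task is effectively charged the full
   * slice and has to go back through the scheduler before running again. */
  static void charge_full_slice_on_yield(struct task_struct *p)
  {
          p->scx.slice = 0;
  }
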
Tejun Heo
436cd7ba9e scx_layered: Make enqueue path comprehensive and handle CPU preemptions
The keep_running path relies on the implicit last task enqueue, which makes
the statistics a bit difficult to track. Let's make the enqueue path
comprehensive:

- Set SCX_OPS_ENQ_LAST and handle the last runnable task enqueue explicitly.

- Implement layered_cpu_release() to re-enqueue tasks from a CPU preempted
  by a higher pri sched class and handle the re-enqueued tasks explicitly in
  layered_enqueue().

- Add more statistics to track all enqueue operations.
2024-06-07 11:49:38 -10:00
Tejun Heo
4a0993ceab scx_layered: Allow long-running tasks to keep running on the same CPU
When a task exhausts its slice, layered currently doesn't make any effort
to keep it on the same CPU. It dispatches the next task to run and then
enqueues the running one. This leads to suboptimal behaviors, e.g. when
this happens to a task in a preempting layer, the task will most likely
find an idle CPU or a task to preempt and then migrate there, causing a
completely unnecessary migration.

This patch makes layered_dispatch() test whether the current task should
keep running on the CPU and, if so, skip dispatching to keep the task
running. This behavior depends on the implicit local DSQ enqueue
mechanism, which triggers when there are no other tasks to run.
2024-06-07 11:49:38 -10:00
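The dispatch-side check might be structured along these lines (illustrative
sketch; keep_running() stands in for scx_layered's actual policy):

  /* If the task that just ran out of slice should keep running, dispatch
   * nothing. With nothing else dispatched, sched_ext falls back to the
   * implicit local DSQ enqueue and keeps running @prev. */
  void BPF_STRUCT_OPS(layered_dispatch, s32 cpu, struct task_struct *prev)
  {
          if (prev && keep_running(prev))
                  return;

          /* ... otherwise consume from the layer DSQs as usual ... */
  }
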
Tejun Heo
200af60f2a scx_layered: Fix load failure due to scheduler_tick() -> sched_tick() rename
- scx_utils: Replace kfunc_exists() with ksym_exists() which doesn't care
  about the type of the symbol.

- scx_layered: Fix load failure on kernels >= v6.10-rc due to
  scheduler_tick() -> sched_tick() rename. Attach the tick fentry function to
  either scheduler_tick() or sched_tick().
2024-06-06 12:54:59 -10:00
Andrea Righi
8a3ee7b801 scx_rustland: never use a time slice that exceeds the default value
Make sure to never assign a time slice longer than the default time
slice, which can be used as an upper limit.

This seems to prevent potential stall conditions (reported by the
CachyOS community) when running CPU-intensive workloads, such as:

 [   68.062813] sched_ext: BPF scheduler "rustland" errored, disabling
 [   68.062831] sched_ext: runnable task stall (ollama_llama_se[3312] failed to run for 5.180s)
 [   68.062832]    scx_watchdog_workfn+0x154/0x1e0
 [   68.062837]    process_one_work+0x18e/0x350
 [   68.062839]    worker_thread+0x2fa/0x490
 [   68.062841]    kthread+0xd2/0x100
 [   68.062842]    ret_from_fork+0x34/0x50
 [   68.062844]    ret_from_fork_asm+0x1a/0x30

Fixes: 6f4cd853 ("scx_rustland: introduce virtual time slice")
Tested-by: SoulHarsh007 <harsh.peshwani@outlook.com>
Tested-by: Piotr Gorski <piotrgorski@cachyos.org>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-06 17:56:23 +02:00
Andrea Righi
6f4cd853f9 scx_rustland: introduce virtual time slice
Overview
========

Currently, a task's time slice is determined based on the total number
of tasks waiting to be scheduled: the more overloaded the system, the
shorter the time slice.

This approach can help reduce the average wait time of all tasks,
allowing them to progress more slowly but uniformly, thus providing
smoother overall system performance.

However, under heavy system load, this approach can lead to very short
time slices distributed among all tasks, causing excessive context
switches that can badly affect soft real-time workloads.

Moreover, the scheduler tends to operate in a bursty manner (tasks are
queued and dispatched in bursts). This can also result in fluctuations
of longer and shorter time slices, depending on the number of tasks
still waiting in the scheduler's queue.

Such behavior can also negatively impact soft real-time workloads,
such as real-time audio processing.

Virtual time slice
==================

To mitigate this problem, introduce the concept of virtual time slice:
the idea is to evaluate the optimal time slice of a task, considering
the vruntime as a deadline for the task to complete its work before
releasing the CPU.

This is accomplished by calculating the difference between the task's
vruntime and the global current vruntime and using this value as the
task's time slice:

  task_slice = task_vruntime - min_vruntime

In this way, tasks that "promise" to release the CPU quickly (based on
their previous work pattern) get a much higher priority (due to
vruntime-based scheduling and the additional priority boost for being
classified as interactive), but they are also given a shorter time slice
to complete their work and fulfill their promise of rapidity.

At the same time, tasks that are more CPU-intensive get de-prioritized,
but they will tend to have a longer time slice available, reducing in
this way the number of context switches that can negatively affect their
performance.

In conclusion, latency-sensitive tasks get a high priority and a short
time slice (and they can preempt other tasks), while CPU-intensive tasks
get a low priority and a long time slice.
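A minimal sketch of that computation (hypothetical names, not the actual
scx_rustland code), including the cap at the default slice used as an upper
limit:

  /* Derive the time slice from how far the task's vruntime is ahead of the
   * global minimum, capped at the default slice. */
  static unsigned long long task_slice_ns(unsigned long long task_vruntime,
                                          unsigned long long min_vruntime,
                                          unsigned long long slice_ns_default)
  {
          unsigned long long slice;

          if (task_vruntime <= min_vruntime)
                  return slice_ns_default;

          slice = task_vruntime - min_vruntime;
          return slice < slice_ns_default ? slice : slice_ns_default;
  }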

Example
=======

Let's consider the following theoretical scenario:

 task | time
 -----+-----
   A  | 1
   B  | 3
   C  | 6
   D  | 6

In this case task A represents a short interactive task, tasks C and D
are CPU-intensive tasks, and task B is mainly interactive but also
requires some CPU time.

With a uniform time slice, scaled based on the number of tasks, the
scheduling looks like this (assuming the time slice is 2):

 A B B C C D D A B C C D D C C D D
  |   |   |   | | |   |   |   |
  `---`---`---`-`-`---`---`---`----> 9 context switches

With the virtual time slice the scheduling changes to this:

 A B B C C C D A B C C C D D D D D
  |   |     | | | |     |
  `---`-----`-`-`-`-----`----------> 7 context switches

In the latter scenario, tasks do not receive the same time slice scaled
by the total number of tasks waiting to be scheduled. Instead, their
time slice is adjusted based on their previous CPU usage. Tasks that
used more CPU time are given longer slices and their processing time
tends to be packed together, reducing the number of context switches.

Meanwhile, latency-sensitive tasks can still be processed as soon as
they need to, because they get a higher priority and they can preempt
other tasks. However, they will get a short time slice, so tasks that
were incorrectly classified as interactive will still be forced to
release the CPU quickly.

Experimental results
====================

This patch has been tested on an 8-core AMD Ryzen 7 5800X processor
(16 threads with SMT), 16GB of RAM, and an NVIDIA GeForce RTX 3070.

The test case involves the usual benchmark of playing a video game while
simultaneously overloading the system with a parallel kernel build
(`make -j32`).

The average frames per second (fps) reported by Steam is used as a
metric for measuring system responsiveness (the higher the better):

 Game                       |  before |  after  | delta  |
 ---------------------------+---------+---------+--------+
 Baldur's Gate 3            |  40 fps |  48 fps | +20.0% |
 Counter-Strike 2           |   8 fps |  15 fps | +87.5% |
 Cyberpunk 2077             |  41 fps |  46 fps | +12.2% |
 Terraria                   |  98 fps | 108 fps | +10.2% |
 Team Fortress 2            |  81 fps |  92 fps | +13.6% |
 WebGL demo (firefox) [1]   |  32 fps |  42 fps | +31.2% |
 ---------------------------+---------+---------+--------+

Apart from the massive boost with Counter-Strike 2 (which should be taken
with a grain of salt, considering the overall poor performance in both
cases), the virtual time slice seems to systematically provide a boost
in responsiveness of around +10-20% fps.

It also seems to significantly prevent potential audio crackling issues
when the system is massively overloaded: no audio crackling was detected
during the entire run of these tests with the virtual deadline change
applied.

[1] https://webglsamples.org/aquarium/aquarium.html

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-04 23:01:13 +02:00