scx-upstream

mirror of https://github.com/sched-ext/scx.git synced 2024-12-01 23:07:11 +00:00

Author	SHA1	Message	Date
Andrea Righi	c730e0558f	ci: test the schedulers with the latest sched-ext kernel Instead of downloading a precompiled sched-ext enabled kernel from the Ubuntu ppa, fetch the latest kernel directly from the sched-ext git repository and recompile it on-the-fly using virtme-ng. This allows to get rid of the Ubuntu ppa dependency, take out from the equation potential Ubuntu-specific patches, and ensures testing all the schedulers with the most up-to-date sched-ext kernel (that should also help to detect potential kernel-related issues in advance). The downside is that the CI runs will take a bit longer now, because we are recompiling the kernel from scratch. However, the kernel built with virtme-ng is relatively quick to compile and includes all the sched-ext features required for testing. It's worth noting that this method aligns with the current sched-ext kernel CI, where we test only the in-kernel schedulers (as intended). This change allows to extend the test coverage, using the same kernel to test also the schedulers included in this repository. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-18 20:51:59 +01:00
David Vernet	dd07c442fc	Merge pull request #93 from sirlucjan/services-improvements Set log size to 10M	2024-01-17 17:43:17 -06:00
Piotr Gorski	8c61d38743	Drop unneeded default value Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>	2024-01-18 00:23:04 +01:00
Piotr Gorski	1abd319cae	Set log size to 10M Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>	2024-01-18 00:03:07 +01:00
Andrea Righi	24ef0f6c00	Merge pull request #94 from sched-ext/scx-rustland-smt-improvements scx-rustland: SMT improvements	2024-01-17 21:01:26 +01:00
Andrea Righi	be1cb8774b	scx_rustland: improve SMT performance The user-space scheduler dispatches tasks in batches, with the batch size matching the number of idle CPUs. Commit `791bdbe` ("scx_rustland: introduce SMT support") changed the order of idle CPUs, prioritizing dispatching tasks on the least busy cores (those with the most idle CPUs) before moving on to busier cores (those with the least idle CPUs). While this approach works well for a small number of tasks, it can lead to uneven performance as the number of tasks increases and all cores are saturated. Such uneven performance can be attributed to SMT interactions causing potential short lags and erratic system performance. In some cases, disabling SMT entirely results in better system responsiveness. To address this issue, instruct the scheduler to implicitly disable SMT and consistently dispatch tasks only on the first (or last) CPU of each core. This approach ensures an equal distribution of tasks among the available cores, preventing SMT disturbances and aligning with non-SMT performance, also when a significant amount of tasks are running. Additionally, the unused sibling CPUs within each core can be used as "spare" CPUs for the BPF dispatcher. This is particularly beneficial for tasks that cannot be dispatched on the target CPU selected by the scheduler, due to cpumask restrictions or congestion conditions. Therefore, this new approach allows to enhance system responsiveness on SMT systems, while simultaneously improving scheduler stability. Some preliminary results on an AMD Ryzen 7 5800X 8-Cores (SMT enabled): running my usual benchmark of measuring the fps of a videogame (Counter-Strike 2) during a parallel kernel build-induced system overload, shows an improvement of approximately 2x (from 8-10fps to 15-25fps vs 1-2fps with EEVDF). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-17 20:49:17 +01:00
Andrea Righi	f0c33320ab	scx_rustland: avoid calling scx_bpf_kick_cpu() from update_idle() Prior to commit `676bd88` ("bpf_rustland: do not dispatch the scheduler to the global DSQ"), the user-space scheduler was dispatched using SCX_DSQ_GLOBAL and we needed to explicitly kick idle CPUs from update_idle() to ensure that at least one CPU was available to run the user-space scheduler. Now that we are using SCX_DSQ_LOCAL_ON\|cpu to dispatch the user-space scheduler, the target CPU is implicitly kicked. Therefore, the call to scx_bpf_kick_cpu() within .update_idle() becomes redundant and we can get rid of it. Fixes: `676bd88` ("bpf_rustland: do not dispatch the scheduler to the global DSQ") Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-17 20:49:17 +01:00
Tejun Heo	9089cc09bb	Merge pull request #92 from sched-ext/nest_callbacks scx_nest: Set timer callback after cancelling	2024-01-17 09:27:22 -10:00
Andrea Righi	a900d76ceb	Merge pull request #91 from sched-ext/scx-rustland-dynamic-slice-boost scx_rustland: introduce dynamic slice boost	2024-01-16 21:51:39 +01:00
David Vernet	7a3fe759f2	scx_nest: Remove -D option for eager compaction Now that scheduling BPF timers works correctly, we don't need this extra logic to eagerly compact if a scheduling for compaction has happened a few times in a row. Let's remove it. Signed-off-by: David Vernet <void@manifault.com>	2024-01-16 14:08:36 -06:00
David Vernet	607119d8a4	scx_nest: Set timer callback after cancelling In scx_nest, we use a per-cpu BPF timer to schedule compaction for a primary core before it goes idle. If a task comes along that could use that core, we cancel the callback with bpf_timer_cancel(). bpf_timer_cancel() drops a refcnt on the prog and nullifies the callback, so if we want to schedule the callback again, we must use bpf_timer_set_callback() to reset the prog. This patch does that. Reported-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: David Vernet <void@manifault.com>	2024-01-16 14:01:39 -06:00
Tejun Heo	f28e5fb259	Merge pull request #88 from sirlucjan/systemd Add systemd services for scx schedulers	2024-01-16 07:29:44 -10:00
David Vernet	b8687a051e	Merge pull request #90 from sched-ext/scx-rustland-smt scx_rustland: introduce SMT support	2024-01-16 10:30:41 -06:00
Piotr Gorski	af1f344447	Allow to run from both /usr/sbin and /usr/bin Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>	2024-01-16 16:04:30 +01:00
Andrea Righi	0b3c399519	scx_rustland: introduce dynamic slice boost Update the slice boost dynamically, as a function of the amount of CPUs in the system and the amount of tasks currently waiting to be dispatched: as the amount of waiting tasks in the task_pool increases, reduce the slice boost. This adjustment ensures that the scheduler adheres more closely to a pure vruntime-based policy as the amount of tasks contending the available CPUs increases and it allows to sustain stress tests that are spawning a massive amount of tasks. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-16 11:51:51 +01:00
Andrea Righi	791bdbec97	scx_rustland: introduce SMT support Introduce a basic support of CPU topology awareness. With this change, the scheduler will prioritize dispatching tasks to idle CPUs with fewer busy SMT siblings, then, it will proceed to CPUs with more busy SMT siblings, in ascending order. To implement this, introduce a new CoreMapping abstraction, that provides a mapping of the available core IDs in the system along with their corresponding lists of CPU IDs. This, coupled with the get_cpu_pid() method from the BpfScheduler abstraction, allows the user-space scheduler to enforce the policy outlined above and improve performance on SMT systems. Keep in mind that this improvement is relevant only when the amount of tasks running in the system is less than the amount of CPUs. As soon as the amount of running tasks increases, they will be distributed across all available CPUs and cores, thereby negating the advantages of SMT isolation. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-16 11:33:35 +01:00
Piotr Gorski	c7678eb0e9	Adapting service names to scheduler names Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>	2024-01-16 10:26:25 +01:00
Piotr Gorski	d618a06d92	Add systemd services for scx schedulers Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>	2024-01-15 23:41:59 +01:00
Andrea Righi	09e7905ee0	Merge pull request #87 from sched-ext/scx-rustland-allocator scx_userland: use a custom memory allocator to prevent page faults	2024-01-15 16:21:17 +01:00
Andrea Righi	63209b865d	scx_rustland: support aligned allocations in RustLandAllocator Even if the current implementation of the user-space scheduler doesn't require to allocate aligned memory, add a simple support to aligned allocations in RustLandAllocator, in order to make it more generic and potentially usable by other schedulers / components. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-15 13:44:33 +01:00
Andrea Righi	c593e3605e	scx_rustland: report user-space scheduler page fault counter Periodically report a page fault counter in the scheduler output. The user-space scheduler should never trigger page faults, otherwise we may experience deadlocks (that would trigger the sched-ext watchdog, unloading the scheduler). Reporting a page fault counter periodically to stdout can be really helpful to debug potential issues with the custom allocator. Moreover, group together also nr_sched_congested and nr_failed_dispatches with nr_page_faults and use the sum of all these counters to determine the healthy status of the user-space scheduler (reporting it to stdout as well). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-14 22:07:37 +01:00
Andrea Righi	9708a80130	scx_userland: use a custom memory allocator to prevent page faults To prevent potential deadlock conditions under heavy loads, any scheduler that delegates scheduling decisions to user-space should avoid triggering page faults. To address this issue, replace the default Rust allocator with a custom one (RustLandAllocator), designed to operate on a pre-allocated buffer. This, coupled with the memory locking (via mlockall), prevents page faults from happening during the execution of the user-space scheduler, avoiding the deadlock condition. This memory allocator is completely transparent to the user-space scheduler code and it is applied automatically when the bpf module is imported. In the future we may decide to move this allocator to a more generic place (scx_utils crate), so that also other user-space Rust schedulers can use it. This initial implementation of the RustLandAllocator is very simple: a basic block-based allocator that uses an array to track the status of each memory block (allocated or free). This allocator can be improved in the future, but right now, despite its simplicity, it shows a reasonable speed and efficiency in meeting memory requests from the user-space scheduler, having to deal mostly with small and uniformly sized allocations. With this change in place scx_rustland survived more than 10hrs on a heavily stressed system (with stress-ng and kernel builds running in a loop): $ ps -o pid,rss,etime,cmd -p `pidof scx_rustland` PID RSS ELAPSED CMD 34966 75840 10:00:44 ./build/scheds/rust/scx_rustland/debug/scx_rustland Without this change it is possible to trigger the sched-ext watchdog timeout in less than 5min, under the same system load conditions. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-14 22:07:37 +01:00
Tejun Heo	930f92cb77	Merge pull request #86 from sched-ext/scx-rustland-remove-old-todo scx_rustland: remove obsolete TODO note	2024-01-11 09:49:23 -10:00
Andrea Righi	acc1d51560	scx_rustland: remove obsolete TODO note Entries from TaskInfoMap associated to exiting tasks are already removed via the BPF .exit_task() callback, so drop the obsolete TODO note and replace it with a proper comment. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-11 20:47:36 +01:00
Andrea Righi	e0bf2325c4	Merge pull request #85 from sched-ext/scx-rustland-voluntary-context-switch-boost scx_rustland: voluntary context switch boost	2024-01-11 19:32:52 +01:00
Andrea Righi	12d89e1d84	scx_rustland: add a troubleshooting section Add a brief troubleshooting section to the command line help. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-11 18:14:46 +01:00
Andrea Righi	2157f638df	scx_rustland: voluntary context switch boost Improve priority boosting using voluntary context switches metric. Overview ======== The current criteria to apply the time slice boost (option `-b`) is to distinguish between newly created tasks and tasks that are already running: in order to prioritize interactive applications (games, multimedia, etc.) we apply a time slice usage penalty on newly created tasks, indirectly boosting the priority of tasks that are already running, which are likely to be the interactive applications that we aim to prioritize. Problem ======= This approach works well when the background workload forks a bunch of short-lived tasks (e.g., a parallel kernel build), but it fails to properly classify CPU-intensive background tasks (i.e., video/3D rendering, encryption, large data analysis, etc.), because these applications, typically, do not generate many short-lived processes. In presence of such workloads the time slice penalty is not enforced, resulting in a lack of any boost for interactive applications. Solution ======== A more effective criterion for distinguishing between interactive applications and background CPU-intensive applications is to examine the voluntary context switches: an application that periodically releases the CPU voluntarily is very likely to be interactive. Therefore, change the time slice boost logic to apply a bonus (scale down the accounted used time slice) to tasks that show an increase in their voluntary context switches counter over a time frame of 10 sec. Based on experimental results, this simple heuristic appears to be quite effective in classifying interactive tasks and prioritize them over potential background CPU-intensive tasks. Additionally, having a better criterion to identify interactive tasks allows to prioritize also newly created tasks, thereby enhancing the responsiveness of interactive shell sessions. 
This always ensures the prompt execution of system commands, even when the system is massively overloaded, unlike the previous time slice boost logic, which made interactive shell sessions less responsive by deprioritizing newly created tasks. Results ======= With this new logic in place it is possible to play a video game (e.g., Terraria) without experiencing any frame rate drop (60 fps), while a parallel CPU stress test (`stress-ng -c 32`) is running in the background. The same result can also be obtained with a parallel kernel build (`make -j 32`). Thus, there is no regression compared to the previous "ideal" test case. Even when mixing both workloads (`make -j 16` + `stress-ng -c 16`), Terraria can still be played without noticeable lag in the audio or video, maintaining a consistent 60 fps. In addition to that, shell commands are also very responsive. Following, the results (average and standard deviation of 10 runs) of two simple interactive shell commands, while both the `make -j 16` and `stress-ng -c 16` workloads are running in background: avg time "uname -r" "ps axuw > /dev/null" ========================================================= EEVDF 11.1ms 231.8ms scx_rustland 2.6ms 212.0ms stdev "uname -r" "ps axuw > /dev/null" ========================================================= EEVDF 2.28 23.41 scx_rustland 0.70 9.11 Tests conducted on a 8-cores laptop (11th Gen Intel i7-1195G7 @ 4.800GHz) with 16GB of RAM. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-11 18:14:30 +01:00
Andrea Righi	1cf03770c7	scx_rustland: expose voluntary context switches to the scheduler Provide the number of voluntary context switches (nvcsw) for each task to the user-space scheduler. This extra information can then be used by the scheduler to enhance its decision-making process when scheduling tasks. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-11 14:10:39 +01:00
Tejun Heo	30c25ff30e	Merge pull request #84 from sched-ext/htejun-README-update Update README.md to include terraria video	2024-01-10 15:18:41 -10:00
Tejun Heo	331f28b775	Update README.md to include terraria video	2024-01-10 15:17:35 -10:00
David Vernet	90874df9ef	Merge pull request #83 from sched-ext/htejun-README-updates scx_rustland: Update README.md	2024-01-10 18:57:02 -06:00
Tejun Heo	1395f14975	Update README.md Embed the video and drop "live" from section title as it's not really live.	2024-01-10 14:47:33 -10:00
Tejun Heo	b32d73ae4e	Merge pull request #82 from sched-ext/htejun scx_flatcg: Fix fallout from direct dispatch API update	2024-01-10 11:25:36 -10:00
Tejun Heo	18f7fe8477	scx_flatcg: Fix fallout from direct dispatch API update `552b75a9c7` ("scx: Build fix after kernel update") updated scx_flatcg along with other schedulers to use the new direct dispatching from ops.select_cpu() mechanism. However, this was buggy for flatcg. flatcg uses direct dispatch for two purposes - as an optimization when there are idle cpus and to avoid dealing with custom CPU affinities in the dispatch logic. While the former can be moved to ops.select_cpu(), the latter can't as it should also apply to tasks which get enqueued without preceding ops.select_cpu(), e.g., when the task gets requeued after an attribute change or runs out of time slice. The API update incorrectly moved both to ops.select_cpu() leading to futile retries of try_pick_next_cgroup() and scheduling misbehaviors. Fix it by separating out the two cases and only keeping the idle optimization case in ops.select_cpu(). Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-10 10:57:50 -10:00
Tejun Heo	c1f22ea073	scx_flatcg: Report pick_next_cgroup() race and fail counts To improve visibility into failure mode. While at it, improve output formatting. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-10 10:52:24 -10:00
Tejun Heo	ae50b155ca	Merge pull request #80 from sched-ext/scx-flatcg-mitigate-stall scx_flatcg: introduce CGROUP_MAX_RETRIES	2024-01-10 09:49:09 -10:00
Tejun Heo	af06d3dd4b	Merge pull request #81 from sched-ext/scx-rustland-time-slice-boost scx_rustland: time slice boost	2024-01-10 08:15:47 -10:00
Andrea Righi	0609abdca6	scx_flatcg: introduce CGROUP_MAX_RETRIES We may end up stalling for too long in fcg_dispatch() if try_pick_next_cgroup() doesn't find another valid cgroup to pick. This can be quite risky, considering that we are holding the rq lock in dispatch(). This condition can be reproduced easily in our CI, where we can trigger stalling softirq works: [ 4.972926] NOHZ tick-stop error: local softirq work is pending, handler #200!!! Or rcu stalls: [ 47.731900] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 47.731900] rcu: 1-...!: (0 ticks this GP) idle=b29c/1/0x4000000000000000 softirq=2204/2204 fqs=0 [ 47.731900] rcu: 3-...!: (0 ticks this GP) idle=db74/1/0x4000000000000000 softirq=2286/2286 fqs=0 [ 47.731900] rcu: (detected by 0, t=26002 jiffies, g=6029, q=54 ncpus=4) [ 47.731900] Sending NMI from CPU 0 to CPUs 1: To mitigate this issue reduce the amount of try_pick_next_cgroup() retries from BPF_MAX_LOOPS (8M) to CGROUP_MAX_RETRIES (1024). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-10 17:36:17 +01:00
Andrea Righi	0198d893ce	scx_rustland: introduce time slice boost parameter Introduce a parameter to prioritize active running tasks over newly created tasks. This option can be used to enhance interactive applications (e.g., games, audio/video, GUIs, etc.) that are concurrently running with fork-intensive background workloads (such as a large parallel build for example). The boost value (which functions as a penalty) is applied to the time slice attributed to newly generated tasks, increasing their vruntime and, in an indirect manner, "boosting" the priority of all the other concurrent active tasks. The time slice boost parameter was applied in the live demo video [1] to enhance the frames per second (fps) of a video game (Terraria), running simultaneously with a parallel kernel build (`make -j 32`) on an 8-core laptop (the value used in the video matches the existing setting of running `scx_rustland -b 200`). [1] https://www.youtube.com/watch?v=oCfVbz9jvVQ Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-10 17:32:29 +01:00
Andrea Righi	732ba4900b	scx_rustland: avoid using SCX_ENQ_PREEMPT With the introduction of the dynamic time slice that scales down based on the number of tasks in the system, there is no obvious benefit in utilizing SCX_ENQ_PREEMPT to dispatch the user-space scheduler. The reduced time slice as the task count increases already enhances the user-space scheduler's opportunities to run and efficiently manage scheduling tasks, even when the system is massively overloaded. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-10 17:32:29 +01:00
Tejun Heo	be1b184b51	Merge pull request #78 from sched-ext/ci-unstable-ppa ci: temporarily switch to ppa:arighi/sched-ext-unstable	2024-01-09 11:49:44 -10:00
Andrea Righi	1c92458c4b	ci: temporarily switch to ppa:arighi/sched-ext-unstable Temporarily switch to the unstable sched-ext ppa, so that we can resume testing with the new kernel API. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-09 22:40:52 +01:00
Andrea Righi	9e782b9cd6	Merge pull request #77 from sched-ext/scx-rustland-update scx_rustland: small updates	2024-01-09 22:37:14 +01:00
Andrea Righi	db9a29d618	scx_rustland: improve dynamic slice scaling Move scaling after tasks are sent to the dispatcher: tasks are dispatched based on the amount of idle CPUs, so checking for any remaining tasks still sitting in the scheduler after dispatch gives a better idea how busy the system is. Moreover, do not scale the time slice based on nr_cpus (otherwise, systems with a large amount of CPUs would rarely get any scaling at all). Instead, apply a scaling factor as a function of how many tasks are still waiting in the scheduler: nr_scheduled / 2. This method scales better as the number of CPUs increases. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-09 22:11:07 +01:00
Andrea Righi	1da2983804	scx_rustland: get rid of force_local Now that we can dispatch directly from select_cpu() we can make the code more compact and readable by removing the force_local logic. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-09 22:11:07 +01:00
Andrea Righi	6ead675fb6	scx_rustland: add a link to the live demo in the README Update the README.md adding a link to a live demo video of the scheduler. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-09 22:11:07 +01:00
Tejun Heo	74923c6cdb	Merge pull request #76 from sched-ext/htejun Bump versions	2024-01-08 18:51:47 -10:00
Tejun Heo	942b0269b8	Bump versions After updates to reflect the updated init and direct dispatch API, the schedulers aren't compatible with older kernels. Bump versions and publish releases.	2024-01-08 18:49:54 -10:00
David Vernet	4ff504a65c	Merge pull request #75 from sched-ext/htejun scx: Build fix after kernel update	2024-01-08 21:22:20 -06:00
Tejun Heo	552b75a9c7	scx: Build fix after kernel update In the latest kernel, sched_ext API has changed in two areas: - ops.prep_enable/cancel_enable/enable/disable() replaced with ops.init_task/enable/disable/exit_task(). - scx_bpf_dispatch() can now be called from ops.select_cpu(). Also, SCX_ENQ_LOCAL flag is removed. Instead, users can call scx_bpf_select_cpu_dfl() from ops.select_cpu() and use the @is_idle out param value to determine whether to dispatch directly. This commit updates all schedulers so that they build. - Init functions renamed / merged / split. - ops.select_cpu() is added to several schedulers and local direct dispatching logic is moved there. This is the minimum update which is needed to make the schedulers build and work. It needs further update to e.g. move vtime updates to ops.enable().	2024-01-08 14:48:24 -10:00
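
Commit `db9a29d618` in the log above describes scaling the time slice by a factor of `nr_scheduled / 2`, where `nr_scheduled` is the number of tasks still waiting in the scheduler after dispatch. A minimal sketch of that arithmetic follows. It is written in Python for brevity (the scheduler itself is written in Rust), and it assumes the slice is divided by the factor. The function name, the `base_slice_ns` parameter, and the `min_slice_ns` floor are hypothetical and not taken from the repository.

```python
def scaled_slice_ns(base_slice_ns: int, nr_scheduled: int, min_slice_ns: int = 1000) -> int:
    """Shrink the time slice as more tasks remain queued after dispatch.

    Assumption: the slice is divided by the factor nr_scheduled / 2.
    The floor `min_slice_ns` is a hypothetical safeguard, not from the commit.
    """
    # Never scale by less than 1, so an idle system keeps the full slice.
    scale = max(nr_scheduled // 2, 1)
    return max(base_slice_ns // scale, min_slice_ns)


if __name__ == "__main__":
    for waiting in (0, 2, 8, 64):
        print(waiting, scaled_slice_ns(20_000_000, waiting))
```

Note that the factor depends only on the backlog, not on the CPU count, which matches the commit's stated reason for dropping `nr_cpus` from the calculation.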


465 Commits
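
Commit `9708a80130` describes RustLandAllocator as a basic block-based allocator that works on a pre-allocated buffer and uses an array to track whether each block is allocated or free. The bookkeeping can be sketched roughly as below. This is an illustration in Python, not the Rust implementation: the class name, method names, block size, and the first-fit search for a contiguous run are all assumptions made for the example.

```python
from typing import Optional


class BlockAllocator:
    """Illustrative fixed-size block allocator over a pre-allocated buffer."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        # The whole arena is reserved once, up front.
        self.buffer = bytearray(num_blocks * block_size)
        # One status flag per block: True = allocated, False = free.
        self.used = [False] * num_blocks

    def _blocks_for(self, size: int) -> int:
        # Ceiling division; a request always takes at least one block.
        return max(1, -(-size // self.block_size))

    def alloc(self, size: int) -> Optional[int]:
        """Return the byte offset of the first free run that fits, or None."""
        needed = self._blocks_for(size)
        run_start, run_len = 0, 0
        for i, busy in enumerate(self.used):
            if busy:
                run_len = 0
                continue
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len == needed:
                for j in range(run_start, run_start + needed):
                    self.used[j] = True
                return run_start * self.block_size
        return None  # arena exhausted or too fragmented

    def free(self, offset: int, size: int) -> None:
        """Mark the blocks backing a previous allocation as free again."""
        first = offset // self.block_size
        for j in range(first, first + self._blocks_for(size)):
            self.used[j] = False
```

Because the arena is sized once and never grows, allocation reduces to a scan of the status array. That is consistent with the commit's goal of avoiding page faults in the user-space scheduler, although locking the memory (the commit mentions `mlockall`) is outside the scope of this sketch.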