scx-upstream

mirror of https://github.com/sched-ext/scx.git synced 2024-11-25 20:20:23 +00:00

Author	SHA1	Message	Date
David Vernet	9ce481255b	Merge pull request #102 from sirlucjan/services-update systemd-services: replace ConditionPathExists with ConditionPathIsDirectory	2024-01-25 09:06:27 -06:00
Piotr Gorski	128fa63cc2	systemd-services: replace ConditionPathExists with ConditionPathIsDirectory Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>	2024-01-25 15:12:15 +01:00
David Vernet	911c3c03a2	Merge pull request #100 from sirlucjan/services-readme Add README.md for systemd services	2024-01-24 09:37:07 -06:00
Piotr Gorski	db5d7c53d8	Update descriptions Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>	2024-01-24 16:35:47 +01:00
Piotr Gorski	25cc69b3c4	Add README.md for systemd services Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>	2024-01-24 14:56:45 +01:00
Andrea Righi	83c2b414d6	Merge pull request #99 from sched-ext/rustland-fixes scx_rustland: fixes to improve scheduler stability	2024-01-23 13:51:28 +01:00
Andrea Righi	6d89eceb93	scx_rustland: dispatch tasks only on the global DSQ Commit `c6ada25` ("scx_rustland: use custom pcpu DSQ instead of SCX_DSQ_LOCAL{_ON}") fixed the race issues with the cpumask, but it also introduced performance regressions. Until we figure out the reasons for the performance regressions, simplify the dispatcher and go back to using only the global DSQ, relying on the built-in idle cpu selection. In this way we can still enforce task affinity properly (`stress-ng --race-sched N` does not crash the scheduler) and we can also provide a better level of system responsiveness (according to the results of the stress tests done recently). The idea of this change is to make the scheduler usable in certain real-world scenarios (and as bug-free as possible), while we figure out the performance regressions of the per-CPU DSQ approach, which will likely be re-introduced later on. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-23 13:24:12 +01:00
Andrea Righi	06b5ff3d2f	scx_rustland: clarify the logic to determine interactive tasks No functional change, simply rewrite the code a bit and update the comment to clarify the logic to detect interactive tasks and apply the priority boost. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-23 08:28:44 +01:00
Andrea Righi	ab1c4f66a8	scx_rustland: allow to disable the slice boost completely Allow specifying `-b 0` to completely disable the slice boost logic and fall back to a standard vruntime-based scheduler with variable time slice. In this way interactive tasks will not get over-prioritized over the other tasks in the system. Having this option can help to easily track down potential performance regressions arising from over-prioritizing interactive tasks. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-23 00:34:06 +01:00
Andrea Righi	b4269452fc	scx_userland: handle preemption events from higher sched_class Make sure to re-schedule the user-space scheduler if it's preempted by a task from a higher priority sched_class. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-23 00:34:06 +01:00
Andrea Righi	2426d1024f	scx_rustland: increase max amount of enqueued tasks As the scheduler is progressing towards a more stable and usable state, it may be subject to heavy stress tests. For this reason, bump up the limit of MAX_ENQUEUED_TASKS to 8192 in the BPF component, to be able to sustain task-intensive stress tests, reducing the risk of potential scheduling congestion conditions. The downside is a negligible increase in the memory footprint of the BPF component, that is worth the cost in order to have an improved scheduler stability. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-21 15:47:35 +01:00
Andrea Righi	28bf96c78e	scx_rustland: mitigate unevictable memory page faults Page faults cannot happen when the user-space scheduler is running, otherwise we may hit deadlock conditions: a kthread may need to run to resolve the page fault, but the user-space scheduler is waiting on the page fault to be resolved => deadlock. We solved this problem (mostly) in commit `9708a80` ("scx_userland: use a custom memory allocator to prevent page faults"), introducing a custom allocator for the user-space scheduler that operates on a pre-allocated mlocked memory buffer, but there is an exception that can still trigger page faults: kcompactd. When memory compaction is enabled, specifically with vm.compact_unevictable_allowed=1 (which is often the default in many distributions), kcompactd regularly attempts to compact all memory zones, such that free memory is available in contiguous blocks where feasible, including unevictable memory as well. In the event that kcompactd remaps pages within the user-space scheduler's address space, it can lead to page faults, resulting in a potential deadlock. To prevent this from happening, automatically set vm.compact_unevictable_allowed=0 when the scheduler is loaded and restore the previous value when the scheduler is unloaded. In this way we can prevent kcompactd from touching the unevictable memory associated with the user-space scheduler. Keep in mind that this is not a fully bulletproof solution: something else in the system may still set vm.compact_unevictable_allowed=1 while the scheduler is running, re-enabling the risk of deadlock. Ideally we would need a way to mark the user-space scheduler memory as "really unevictable", or a proper kernel ABI to instruct kcompactd to exclude certain tasks (or better, cgroups) from its proactive memory compaction actions, but until then, this seems to be the best way to mitigate this issue. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-21 15:47:35 +01:00
David Vernet	c6ada251ef	scx_rustland: use custom pcpu DSQ instead of SCX_DSQ_LOCAL{_ON} We still don't have a reliable and non-racy way to manage cpumasks from the user-space scheduler, so it is quite hard for the scheduler to enforce the proper CPU affinity behavior. Despite checking the cpumask in the BPF part, tasks may still be assigned to a CPU that they cannot use, triggering scheduler errors. For example, it is really easy to crash the scheduler with a simple CPU affinity stress test (`stress-ng --race-sched 8 --timeout 5`): 14:51:28 [WARN] FAIL: SCX_DSQ_LOCAL[_ON] verdict target cpu 1 not allowed for stress-ng-race-[567048] (err=1024) To prevent this issue from happening, create a custom DSQ for each CPU available in the system and use these per-CPU DSQs to dispatch all the tasks processed by the user-space scheduler, including the user-space scheduler itself. Then consume these DSQs from the .dispatch() callback of the respective CPU, to transfer all the tasks to the consuming CPU's local DSQ, preventing the cpumask race condition encountered using SCX_DSQ_LOCAL_ON. With this patch applied the `stress-ng --race-sched N` stress test can be executed successfully (even with large values of N) without causing the scheduler to crash. Signed-off-by: David Vernet <void@manifault.com> [ arighi: kick target cpu to improve responsiveness, update comments ] Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-21 15:47:35 +01:00
David Vernet	497229a590	Merge pull request #98 from jordalgo/cargo-toml	2024-01-20 11:18:18 -06:00
Jordan Rome	9f9a97a97f	Update descriptions in cargo toml files	2024-01-19 18:19:46 -08:00
David Vernet	0ac9d40e43	Merge pull request #97 from sirlucjan/services-fixes Set the correct value for sched-ext journald namespace	2024-01-19 14:46:24 -06:00
Piotr Gorski	9848ab4183	Increase log size to 25M Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>	2024-01-19 21:30:33 +01:00
Piotr Gorski	1a1290d54c	Simplify the location of the journal-sched-ext file Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>	2024-01-19 19:13:28 +01:00
Piotr Gorski	b6650fa4dc	Set the correct value for sched-ext journald namespace Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>	2024-01-19 18:22:47 +01:00
Andrea Righi	af11da2661	Merge pull request #95 from sched-ext/github-ci ci: test the schedulers with the latest sched-ext kernel	2024-01-18 21:16:27 +01:00
Andrea Righi	c730e0558f	ci: test the schedulers with the latest sched-ext kernel Instead of downloading a precompiled sched-ext enabled kernel from the Ubuntu ppa, fetch the latest kernel directly from the sched-ext git repository and recompile it on-the-fly using virtme-ng. This removes the Ubuntu ppa dependency, takes potential Ubuntu-specific patches out of the equation, and ensures that all the schedulers are tested with the most up-to-date sched-ext kernel (that should also help to detect potential kernel-related issues in advance). The downside is that the CI runs will take a bit longer now, because we are recompiling the kernel from scratch. However, the kernel built with virtme-ng is relatively quick to compile and includes all the sched-ext features required for testing. It's worth noting that this method aligns with the current sched-ext kernel CI, where we test only the in-kernel schedulers (as intended). This change extends the test coverage, using the same kernel to also test the schedulers included in this repository. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-18 20:51:59 +01:00
David Vernet	dd07c442fc	Merge pull request #93 from sirlucjan/services-improvements Set log size to 10M	2024-01-17 17:43:17 -06:00
Piotr Gorski	8c61d38743	Drop unneeded default value Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>	2024-01-18 00:23:04 +01:00
Piotr Gorski	1abd319cae	Set log size to 10M Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>	2024-01-18 00:03:07 +01:00
Andrea Righi	24ef0f6c00	Merge pull request #94 from sched-ext/scx-rustland-smt-improvements scx-rustland: SMT improvements	2024-01-17 21:01:26 +01:00
Andrea Righi	be1cb8774b	scx_rustland: improve SMT performance The user-space scheduler dispatches tasks in batches, with the batch size matching the number of idle CPUs. Commit `791bdbe` ("scx_rustland: introduce SMT support") changed the order of idle CPUs, prioritizing dispatching tasks on the least busy cores (those with the most idle CPUs) before moving on to busier cores (those with the least idle CPUs). While this approach works well for a small number of tasks, it can lead to uneven performance as the number of tasks increases and all cores are saturated. Such uneven performance can be attributed to SMT interactions causing potential short lags and erratic system performance. In some cases, disabling SMT entirely results in better system responsiveness. To address this issue, instruct the scheduler to implicitly disable SMT and consistently dispatch tasks only on the first (or last) CPU of each core. This approach ensures an equal distribution of tasks among the available cores, preventing SMT disturbances and aligning with non-SMT performance, also when a significant amount of tasks are running. Additionally, the unused sibling CPUs within each core can be used as "spare" CPUs for the BPF dispatcher. This is particularly beneficial for tasks that cannot be dispatched on the target CPU selected by the scheduler, due to cpumask restrictions or congestion conditions. Therefore, this new approach allows to enhance system responsiveness on SMT systems, while simultaneously improving scheduler stability. Some preliminary results on an AMD Ryzen 7 5800X 8-Cores (SMT enabled): running my usual benchmark of measuring the fps of a videogame (Counter-Strike 2) during a parallel kernel build-induced system overload, shows an improvement of approximately 2x (from 8-10fps to 15-25fps vs 1-2fps with EEVDF). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-17 20:49:17 +01:00
Andrea Righi	f0c33320ab	scx_rustland: avoid calling scx_bpf_kick_cpu() from update_idle() Prior to commit `676bd88` ("bpf_rustland: do not dispatch the scheduler to the global DSQ"), the user-space scheduler was dispatched using SCX_DSQ_GLOBAL and we needed to explicitly kick idle CPUs from update_idle() to ensure that at least one CPU was available to run the user-space scheduler. Now that we are using SCX_DSQ_LOCAL_ON|cpu to dispatch the user-space scheduler, the target CPU is implicitly kicked. Therefore, the call to scx_bpf_kick_cpu() within .update_idle() becomes redundant and we can get rid of it. Fixes: `676bd88` ("bpf_rustland: do not dispatch the scheduler to the global DSQ") Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-17 20:49:17 +01:00
Tejun Heo	9089cc09bb	Merge pull request #92 from sched-ext/nest_callbacks scx_nest: Set timer callback after cancelling	2024-01-17 09:27:22 -10:00
Andrea Righi	a900d76ceb	Merge pull request #91 from sched-ext/scx-rustland-dynamic-slice-boost scx_rustland: introduce dynamic slice boost	2024-01-16 21:51:39 +01:00
David Vernet	7a3fe759f2	scx_nest: Remove -D option for eager compaction Now that scheduling BPF timers works correctly, we don't need this extra logic to eagerly compact if a scheduling for compaction has happened a few times in a row. Let's remove it. Signed-off-by: David Vernet <void@manifault.com>	2024-01-16 14:08:36 -06:00
David Vernet	607119d8a4	scx_nest: Set timer callback after cancelling In scx_nest, we use a per-cpu BPF timer to schedule compaction for a primary core before it goes idle. If a task comes along that could use that core, we cancel the callback with bpf_timer_cancel(). bpf_timer_cancel() drops a refcnt on the prog and nullifies the callback, so if we want to schedule the callback again, we must use bpf_timer_set_callback() to reset the prog. This patch does that. Reported-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: David Vernet <void@manifault.com>	2024-01-16 14:01:39 -06:00
Tejun Heo	f28e5fb259	Merge pull request #88 from sirlucjan/systemd Add systemd services for scx schedulers	2024-01-16 07:29:44 -10:00
David Vernet	b8687a051e	Merge pull request #90 from sched-ext/scx-rustland-smt scx_rustland: introduce SMT support	2024-01-16 10:30:41 -06:00
Piotr Gorski	af1f344447	Allow to run from both /usr/sbin and /usr/bin Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>	2024-01-16 16:04:30 +01:00
Andrea Righi	0b3c399519	scx_rustland: introduce dynamic slice boost Update the slice boost dynamically, as a function of the amount of CPUs in the system and the amount of tasks currently waiting to be dispatched: as the amount of waiting tasks in the task_pool increases, reduce the slice boost. This adjustment ensures that the scheduler adheres more closely to a pure vruntime-based policy as the amount of tasks contending the available CPUs increases and it allows to sustain stress tests that are spawning a massive amount of tasks. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-16 11:51:51 +01:00
Andrea Righi	791bdbec97	scx_rustland: introduce SMT support Introduce basic support for CPU topology awareness. With this change, the scheduler will prioritize dispatching tasks to idle CPUs with fewer busy SMT siblings, then, it will proceed to CPUs with more busy SMT siblings, in ascending order. To implement this, introduce a new CoreMapping abstraction, that provides a mapping of the available core IDs in the system along with their corresponding lists of CPU IDs. This, coupled with the get_cpu_pid() method from the BpfScheduler abstraction, allows the user-space scheduler to enforce the policy outlined above and improve performance on SMT systems. Keep in mind that this improvement is relevant only when the amount of tasks running in the system is less than the amount of CPUs. As soon as the amount of running tasks increases, they will be distributed across all available CPUs and cores, thereby negating the advantages of SMT isolation. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-16 11:33:35 +01:00
Piotr Gorski	c7678eb0e9	Adapting service names to scheduler names Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>	2024-01-16 10:26:25 +01:00
Piotr Gorski	d618a06d92	Add systemd services for scx schedulers Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>	2024-01-15 23:41:59 +01:00
Andrea Righi	09e7905ee0	Merge pull request #87 from sched-ext/scx-rustland-allocator scx_userland: use a custom memory allocator to prevent page faults	2024-01-15 16:21:17 +01:00
Andrea Righi	63209b865d	scx_rustland: support aligned allocations in RustLandAllocator Even if the current implementation of the user-space scheduler doesn't require to allocate aligned memory, add a simple support to aligned allocations in RustLandAllocator, in order to make it more generic and potentially usable by other schedulers / components. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-15 13:44:33 +01:00
Andrea Righi	c593e3605e	scx_rustland: report user-space scheduler page fault counter Periodically report a page fault counter in the scheduler output. The user-space scheduler should never trigger page faults, otherwise we may experience deadlocks (that would trigger the sched-ext watchdog, unloading the scheduler). Reporting a page fault counter periodically to stdout can be really helpful to debug potential issues with the custom allocator. Moreover, group together also nr_sched_congested and nr_failed_dispatches with nr_page_faults and use the sum of all these counters to determine the healthy status of the user-space scheduler (reporting it to stdout as well). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-14 22:07:37 +01:00
Andrea Righi	9708a80130	scx_userland: use a custom memory allocator to prevent page faults To prevent potential deadlock conditions under heavy loads, any scheduler that delegates scheduling decisions to user-space should avoid triggering page faults. To address this issue, replace the default Rust allocator with a custom one (RustLandAllocator), designed to operate on a pre-allocated buffer. This, coupled with the memory locking (via mlockall), prevents page faults from happening during the execution of the user-space scheduler, avoiding the deadlock condition. This memory allocator is completely transparent to the user-space scheduler code and it is applied automatically when the bpf module is imported. In the future we may decide to move this allocator to a more generic place (scx_utils crate), so that also other user-space Rust schedulers can use it. This initial implementation of the RustLandAllocator is very simple: a basic block-based allocator that uses an array to track the status of each memory block (allocated or free). This allocator can be improved in the future, but right now, despite its simplicity, it shows a reasonable speed and efficiency in meeting memory requests from the user-space scheduler, having to deal mostly with small and uniformly sized allocations. With this change in place scx_rustland survived more than 10hrs on a heavily stressed system (with stress-ng and kernel builds running in a loop): $ ps -o pid,rss,etime,cmd -p `pidof scx_rustland` PID RSS ELAPSED CMD 34966 75840 10:00:44 ./build/scheds/rust/scx_rustland/debug/scx_rustland Without this change it is possible to trigger the sched-ext watchdog timeout in less than 5min, under the same system load conditions. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-14 22:07:37 +01:00
Tejun Heo	930f92cb77	Merge pull request #86 from sched-ext/scx-rustland-remove-old-todo scx_rustland: remove obsolete TODO note	2024-01-11 09:49:23 -10:00
Andrea Righi	acc1d51560	scx_rustland: remove obsolete TODO note Entries from TaskInfoMap associated to exiting tasks are already removed via the BPF .exit_task() callback, so drop the obsolete TODO note and replace it with a proper comment. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-11 20:47:36 +01:00
Andrea Righi	e0bf2325c4	Merge pull request #85 from sched-ext/scx-rustland-voluntary-context-switch-boost scx_rustland: voluntary context switch boost	2024-01-11 19:32:52 +01:00
Andrea Righi	12d89e1d84	scx_rustland: add a troubleshooting section Add a brief troubleshooting section to the command line help. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-11 18:14:46 +01:00
Andrea Righi	2157f638df	scx_rustland: voluntary context switch boost Improve priority boosting using voluntary context switches metric. Overview ======== The current criterion for applying the time slice boost (option `-b`) is to distinguish between newly created tasks and tasks that are already running: in order to prioritize interactive applications (games, multimedia, etc.) we apply a time slice usage penalty on newly created tasks, indirectly boosting the priority of tasks that are already running, which are likely to be the interactive applications that we aim to prioritize. Problem ======= This approach works well when the background workload forks a bunch of short-lived tasks (e.g., a parallel kernel build), but it fails to properly classify CPU-intensive background tasks (e.g., video/3D rendering, encryption, large data analysis, etc.), because these applications, typically, do not generate many short-lived processes. In the presence of such workloads the time slice penalty is not enforced, resulting in a lack of any boost for interactive applications. Solution ======== A more effective criterion for distinguishing between interactive applications and background CPU-intensive applications is to examine the voluntary context switches: an application that periodically releases the CPU voluntarily is very likely to be interactive. Therefore, change the time slice boost logic to apply a bonus (scale down the accounted used time slice) to tasks that show an increase in their voluntary context switches counter over a time frame of 10 sec. Based on experimental results, this simple heuristic appears to be quite effective in classifying interactive tasks and prioritizing them over potential background CPU-intensive tasks. Additionally, having a better criterion to identify interactive tasks also allows newly created tasks to be prioritized, thereby enhancing the responsiveness of interactive shell sessions.
This always ensures the prompt execution of system commands, even when the system is massively overloaded, unlike the previous time slice boost logic, which made interactive shell sessions less responsive by deprioritizing newly created tasks. Results ======= With this new logic in place it is possible to play a video game (e.g., Terraria) without experiencing any frame rate drop (60 fps), while a parallel CPU stress test (`stress-ng -c 32`) is running in the background. The same result can also be obtained with a parallel kernel build (`make -j 32`). Thus, there is no regression compared to the previous "ideal" test case. Even when mixing both workloads (`make -j 16` + `stress-ng -c 16`), Terraria can still be played without noticeable lag in the audio or video, maintaining a consistent 60 fps. In addition to that, shell commands are also very responsive. Below are the results (average and standard deviation of 10 runs) of two simple interactive shell commands, while both the `make -j 16` and `stress-ng -c 16` workloads are running in the background:
avg time        "uname -r"    "ps axuw > /dev/null"
=========================================================
EEVDF           11.1ms        231.8ms
scx_rustland     2.6ms        212.0ms

stdev           "uname -r"    "ps axuw > /dev/null"
=========================================================
EEVDF           2.28          23.41
scx_rustland    0.70           9.11
Tests conducted on an 8-core laptop (11th Gen Intel i7-1195G7 @ 4.800GHz) with 16GB of RAM. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-11 18:14:30 +01:00
Andrea Righi	1cf03770c7	scx_rustland: expose voluntary context switches to the scheduler Provide the number of voluntary context switches (nvcsw) for each task to the user-space scheduler. This extra information can then be used by the scheduler to enhance its decision-making process when scheduling tasks. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-11 14:10:39 +01:00
Tejun Heo	30c25ff30e	Merge pull request #84 from sched-ext/htejun-README-update Update README.md to include terraria video	2024-01-10 15:18:41 -10:00
Tejun Heo	331f28b775	Update README.md to include terraria video	2024-01-10 15:17:35 -10:00
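
Commit 9708a80130 above describes the initial RustLandAllocator as a basic block-based allocator that uses an array to track the status of each memory block (allocated or free) on top of a pre-allocated buffer. The following is a minimal sketch of that idea only, not the code from the repository: the names BlockArena, BLOCK_SIZE and NUM_BLOCKS, the sizes, and the first-fit search are assumptions made for illustration, and the sketch leaves out both the integration with Rust's global allocator and the mlockall memory locking mentioned in the commit message.

```rust
/// Hypothetical sizes, chosen only for illustration.
const BLOCK_SIZE: usize = 64;
const NUM_BLOCKS: usize = 16;

/// Minimal block-based arena: a fixed, pre-allocated buffer plus an array
/// that records whether each block is allocated (true) or free (false).
pub struct BlockArena {
    buffer: Vec<u8>,
    used: [bool; NUM_BLOCKS],
}

impl BlockArena {
    /// Allocate the whole backing buffer up front; no later allocation grows it.
    pub fn new() -> Self {
        BlockArena {
            buffer: vec![0u8; BLOCK_SIZE * NUM_BLOCKS],
            used: [false; NUM_BLOCKS],
        }
    }

    /// Number of blocks needed to hold `size` bytes (rounded up).
    fn blocks_for(size: usize) -> usize {
        (size + BLOCK_SIZE - 1) / BLOCK_SIZE
    }

    /// First-fit search for a run of contiguous free blocks large enough
    /// for `size` bytes. Returns the byte offset into the buffer, or None
    /// if the request is empty or no suitable run exists.
    pub fn alloc(&mut self, size: usize) -> Option<usize> {
        let needed = Self::blocks_for(size);
        if needed == 0 || needed > NUM_BLOCKS {
            return None;
        }
        let mut run = 0;
        for i in 0..NUM_BLOCKS {
            run = if self.used[i] { 0 } else { run + 1 };
            if run == needed {
                let start = i + 1 - needed;
                for slot in &mut self.used[start..=i] {
                    *slot = true;
                }
                return Some(start * BLOCK_SIZE);
            }
        }
        None
    }

    /// Mark the blocks covering `size` bytes at `offset` as free again.
    pub fn free(&mut self, offset: usize, size: usize) {
        let start = offset / BLOCK_SIZE;
        let end = (start + Self::blocks_for(size)).min(NUM_BLOCKS);
        if start < end {
            for slot in &mut self.used[start..end] {
                *slot = false;
            }
        }
    }

    /// Mutable view of a previously allocated region.
    pub fn bytes_mut(&mut self, offset: usize, size: usize) -> &mut [u8] {
        &mut self.buffer[offset..offset + size]
    }

    /// Count of blocks currently marked free.
    pub fn free_blocks(&self) -> usize {
        self.used.iter().filter(|u| !**u).count()
    }
}
```

In this sketch a 100-byte request occupies two 64-byte blocks, and freeing it makes both blocks available to later requests. Tracking block status in a fixed array keeps the bookkeeping itself free of dynamic allocation, which fits the commit's stated goal of avoiding page faults while the user-space scheduler runs.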


335 Commits