scx-upstream

mirror of https://github.com/sched-ext/scx.git synced 2024-12-04 08:17:11 +00:00

Author	SHA1	Message	Date
Tejun Heo	d790bdb14c	Merge pull request #138 from sirlucjan/0.1.7 Bump to 0.1.7	2024-02-12 22:08:10 -10:00
Piotr Gorski	cb11aecbd0	Bump to 0.1.7 Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>	2024-02-13 08:20:38 +01:00
David Vernet	37dde8ac74	Merge pull request #137 from sched-ext/scx-utils-fix-build scx_utils: use c_char to prevent build failures	2024-02-11 15:42:33 -06:00
Andrea Righi	f5a21198ad	scx_utils: use c_char to prevent build failures Use c_char to convert C strings, that is more portable across different architectures. This prevents a build failure on arm64 and ppc64el. Fixes: `d57a23f` ("rust/scx_utils: Add user_exit_info support") Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-11 21:42:52 +01:00
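
The portability issue behind the c_char fix above can be illustrated with a small sketch. `c_char` is `i8` on some targets (such as x86_64) and `u8` on others (such as arm64), so code that hard-codes one of them fails to build elsewhere. The helper name `c_chars_to_string` is hypothetical and is not taken from scx_utils; it only shows one portable way to convert a NUL-terminated C buffer.

```rust
use std::os::raw::c_char;

/// Convert a (possibly NUL-terminated) C character buffer into a String.
/// Written against `c_char`, so it compiles whether the platform's C `char`
/// is signed or unsigned. Hypothetical helper, for illustration only.
fn c_chars_to_string(buf: &[c_char]) -> String {
    let bytes: Vec<u8> = buf
        .iter()
        .take_while(|&&c| c != 0) // stop at the first NUL terminator
        .map(|&c| c as u8)        // reinterpret the C char as a raw byte
        .collect();
    String::from_utf8_lossy(&bytes).into_owned()
}

fn main() {
    // Build a fake C buffer; the cast keeps this portable across targets.
    let raw: Vec<c_char> = b"exit: ok\0\0\0".iter().map(|&b| b as c_char).collect();
    println!("{}", c_chars_to_string(&raw));
}
```

Going through raw bytes avoids unsafe pointer handling; real code reading kernel-provided buffers might prefer `CStr`, which is also parameterized on `c_char`.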
Andrea Righi	b5f3f7f8fe	Merge pull request #136 from sched-ext/rustland-perf-improvement scx_rustland: performance improvements	2024-02-11 17:06:53 +01:00
Andrea Righi	fc889c6995	scx_rustland: replace custom allocator with buddy-alloc Currently, the primary bottleneck in scx_rustland lies within its custom memory allocator, which is used to prevent page faults in the user-space scheduler. This is pretty evident looking at perf top: 39.95% scx_rustland [.] <scx_rustland::bpf::alloc::RustLandAllocator as core::alloc::global::GlobalAlloc>::alloc 3.41% [kernel] [k] _copy_from_user 3.20% [kernel] [k] __kmem_cache_alloc_node 2.59% [kernel] [k] __sys_bpf 2.30% [kernel] [k] __kmem_cache_free 1.48% libc.so.6 [.] syscall 1.45% [kernel] [k] __virt_addr_valid 1.42% scx_rustland [.] <scx_rustland::bpf::alloc::RustLandAllocator as core::alloc::global::GlobalAlloc>::dealloc 1.31% [kernel] [k] _copy_to_user 1.23% [kernel] [k] entry_SYSRETQ_unsafe_stack However, there's no need to reinvent the wheel here, rather than relying on an overly simplistic and inefficient allocator, we can rely on buddy-alloc [1], which is also capable of operating on a preallocated memory buffer. After switching to buddy-alloc, the performance profile under the same workload conditions looks like the following: 6.01% [kernel] [k] _copy_from_user 5.21% [kernel] [k] __kmem_cache_alloc_node 4.45% [kernel] [k] __sys_bpf 3.80% [kernel] [k] __kmem_cache_free 2.79% libc.so.6 [.] syscall 2.34% [kernel] [k] __virt_addr_valid 2.26% [kernel] [k] _copy_to_user 2.14% [kernel] [k] __check_heap_object 2.10% [kernel] [k] __check_object_size.part.0 2.02% [kernel] [k] entry_SYSRETQ_unsafe_stack With this change in place, the primary overhead is now moved to the bpf() syscall and the copies between kernel and user-space (this could potentially be optimized in the future using BPF ring buffers, instead of BPF FIFO queues). A better focus at the allocator overhead before vs after this change: [before] 39.95% scx_rustland [.] <scx_rustland::bpf::alloc::RustLandAllocator as core::alloc::global::GlobalAlloc>::alloc 1.42% scx_rustland [.] <scx_rustland::bpf::alloc::RustLandAllocator as core::alloc::global::GlobalAlloc>::dealloc [after] 1.50% scx_rustland [.] <scx_rustland::bpf::alloc::RustLandAllocator as core::alloc::global::GlobalAlloc>::alloc 0.76% scx_rustland [.] <scx_rustland::bpf::alloc::RustLandAllocator as core::alloc::global::GlobalAlloc>::dealloc [1] https://crates.io/crates/buddy-alloc Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-11 14:33:39 +01:00
Andrea Righi	ccf5946425	scx_rustland: speed up search by PID in tasks BTreeSet In order to prevent duplicate PIDs in the TaskTree (BTreeSet), we perform an O(N) search each time we add an item, to verify whether the PID already exists or not. Under heavy stress test conditions the O(N) complexity can have a potential impact on the overall performance. To mitigate this, introduce a HashMap that can be used to retrieve tasks by PID typically with a O(1) complexity. This could potentially degrade to O(N) in presence of hash collisions, but even in this case, accessing the hash map is still more efficient than scanning all the entries in the BTreeSet to search for the target PID. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-11 14:11:38 +01:00
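
The idea in the commit above, pairing the ordered BTreeSet with a HashMap keyed by PID, might look roughly like the following. This is a minimal sketch under assumptions: the `Task` fields, the `TaskTree` layout, and the `push`/`pop` method names are hypothetical and do not reproduce the actual scx_rustland code.

```rust
use std::collections::{BTreeSet, HashMap};

// Field order matters: the derived Ord compares vruntime first, then pid.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Task {
    vruntime: u64,
    pid: i32,
}

#[derive(Default)]
struct TaskTree {
    tasks: BTreeSet<Task>,      // ordered by (vruntime, pid)
    by_pid: HashMap<i32, Task>, // PID index, typically O(1) lookups
}

impl TaskTree {
    /// Insert a task, replacing any previous entry with the same PID
    /// without scanning the whole set.
    fn push(&mut self, pid: i32, vruntime: u64) {
        let task = Task { vruntime, pid };
        if let Some(old) = self.by_pid.insert(pid, task) {
            self.tasks.remove(&old); // O(log N) removal of the stale entry
        }
        self.tasks.insert(task);
    }

    /// Pop the task with the smallest vruntime.
    fn pop(&mut self) -> Option<Task> {
        let task = self.tasks.pop_first()?;
        self.by_pid.remove(&task.pid);
        Some(task)
    }
}

fn main() {
    let mut tree = TaskTree::default();
    tree.push(1, 30);
    tree.push(2, 20);
    tree.push(1, 10); // re-enqueue PID 1: the old entry is replaced
    while let Some(t) = tree.pop() {
        println!("pid={} vruntime={}", t.pid, t.vruntime);
    }
}
```

The map stores a copy of the task so the stale set entry can be located by its exact key; the cost is keeping both structures in sync on every insert and pop.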
Andrea Righi	7ce0d038e4	Merge pull request #133 from sched-ext/rustland-cpumask-gen-cnt scx_rustland: per-task cpumask generation counter	2024-02-10 19:07:02 +01:00
Andrea Righi	61d1ed338a	scx_rustland: per-task cpumask generation counter Introduce a per-task generation counter to check the validity of the cpumask at dispatch time. The logic is the following: - the cpumask generation number is incremented every time a task calls .set_cpumask() - when a task is enqueued the current generation number is stored in the queued_task_ctx and relayed to the user-space scheduler - the user-space scheduler can decide to dispatch the task on the CPU determined by the BPF layer in .select_cpu(), redirect the task to any other specific CPU, or redirect to the first CPU available (using NO_CPU) - task is then dispatched back to the BPF code along with its cpumask generation counter - at dispatch time the BPF code checks if the generation number is the same and it discards the dispatch attempt if the cpumask is not valid anymore (the task will be automatically re-enqueued by the sched-ext core code, potentially selecting another CPU / cpumask) - if the cpumask is valid, but the CPU selected by the user-space scheduler is invalid (according to the cpumask), the task will be transparently bounced by the BPF code to the shared DSQ (in this way the user-space code can be completely abstracted and dispatches that target invalid CPUs can be automatically fixed by the BPF layer) This solution can prevent stalls due to dispatches targeting invalid CPUs and it can also avoid redundant dispatch events, making the code more efficient and the cpumask interlocking more reliable. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-10 18:02:42 +01:00
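
The generation-counter flow described in the commit above can be modelled in plain Rust. This is only a user-space simulation under assumptions: the real logic lives in BPF code, and every name here (`BpfLayer`, `TaskCtx`, `QueuedTask`, `Outcome`, the 64-bit mask) is hypothetical.

```rust
use std::collections::HashMap;

const NO_CPU: i32 = -1; // "first CPU available", as named in the commit message

struct TaskCtx {
    cpumask: u64,     // bit N set => CPU N allowed (assumes at most 64 CPUs)
    cpumask_cnt: u64, // generation counter, bumped on every cpumask change
}

struct QueuedTask {
    pid: i32,
    cpu: i32,         // target CPU, or NO_CPU
    cpumask_cnt: u64, // generation observed at enqueue time
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Dispatched(i32), // sent to the requested CPU
    SharedDsq,       // bounced to the shared DSQ
    Discarded,       // stale cpumask: the task will be re-enqueued
}

struct BpfLayer {
    tasks: HashMap<i32, TaskCtx>,
}

impl BpfLayer {
    fn set_cpumask(&mut self, pid: i32, cpumask: u64) {
        if let Some(ctx) = self.tasks.get_mut(&pid) {
            ctx.cpumask = cpumask;
            ctx.cpumask_cnt += 1;
        }
    }

    /// Snapshot the current generation when the task is queued.
    fn enqueue(&self, pid: i32, cpu: i32) -> Option<QueuedTask> {
        let ctx = self.tasks.get(&pid)?;
        Some(QueuedTask { pid, cpu, cpumask_cnt: ctx.cpumask_cnt })
    }

    /// Validate the dispatch request coming back from user space.
    fn dispatch(&self, task: &QueuedTask) -> Outcome {
        let ctx = match self.tasks.get(&task.pid) {
            Some(ctx) => ctx,
            None => return Outcome::Discarded,
        };
        if ctx.cpumask_cnt != task.cpumask_cnt {
            return Outcome::Discarded; // cpumask changed since enqueue
        }
        let allowed = (0..64).contains(&task.cpu) && ctx.cpumask & (1u64 << task.cpu) != 0;
        if task.cpu == NO_CPU || !allowed {
            Outcome::SharedDsq
        } else {
            Outcome::Dispatched(task.cpu)
        }
    }
}

fn main() {
    let mut bpf = BpfLayer { tasks: HashMap::new() };
    bpf.tasks.insert(1, TaskCtx { cpumask: 0b0011, cpumask_cnt: 0 });
    let queued = bpf.enqueue(1, 1).unwrap();
    println!("{:?}", bpf.dispatch(&queued));
    bpf.set_cpumask(1, 0b0100); // invalidates the queued snapshot
    println!("{:?}", bpf.dispatch(&queued));
}
```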
David Vernet	1c00de9402	Merge pull request #129 from sched-ext/infeasible_weights Implement solution to infeasible weights problem	2024-02-09 16:23:56 -06:00
David Vernet	e627176d90	scx: Implement solution to infeasible weights problem As described in [0], there is an open problem in load balancing called the "infeasible weights" problem. Essentially, the problem boils down to the fact that a task with disproportionately high load can be granted more CPU time than it can actually consume per its duty cycle. This patch implements a solution to that problem, wherein we apply the algorithm described in this paper to adjust all infeasible weights in the system down to a feasible weight that gives them their full duty cycle, while allowing the remaining feasible tasks on the system to share the remaining compute capacity on the machine. [0]: https://drive.google.com/file/d/1fAoWUlmW-HTp6akuATVpMxpUpvWcGSAv/view?usp=drive_link Signed-off-by: David Vernet <void@manifault.com>	2024-02-09 16:23:12 -06:00
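
To make the infeasible weights problem concrete, here is a deliberately simplified sketch. It is not the algorithm from the referenced paper or from the patch: it assumes every task has a 100% duty cycle (so no task can use more than one CPU), that all weights are positive, and that there are more tasks than CPUs. The function name `adjust_infeasible_weights` is hypothetical. Under those assumptions, a weight is infeasible when its proportional share exceeds one CPU, and it is lowered to the value that yields exactly one CPU.

```rust
/// Lower every infeasible weight so that its owner gets exactly one CPU.
/// Simplified model: 100% duty cycle, positive weights, more tasks than CPUs.
fn adjust_infeasible_weights(weights: &[f64], nr_cpus: usize) -> Vec<f64> {
    let n = weights.len();
    if nr_cpus == 0 || n <= nr_cpus {
        return weights.to_vec(); // outside the assumptions: leave untouched
    }

    // Visit tasks from the heaviest to the lightest.
    let mut order: Vec<usize> = (0..n).collect();
    order.sort_by(|&a, &b| weights[b].partial_cmp(&weights[a]).unwrap());

    // `remaining` is the weight sum of the tasks not yet marked infeasible.
    let mut remaining: f64 = weights.iter().sum();
    let mut k = 0; // number of infeasible tasks found so far
    while k + 1 < nr_cpus {
        let w = weights[order[k]];
        // With k CPUs reserved, the task's share is w / remaining * (nr_cpus - k).
        if w * (nr_cpus - k) as f64 <= remaining {
            break; // share fits in one CPU: this task and all lighter ones are feasible
        }
        remaining -= w;
        k += 1;
    }

    let mut adjusted = weights.to_vec();
    if k > 0 {
        // Solve w' / (remaining + k * w') * nr_cpus = 1 for w'.
        let capped = remaining / (nr_cpus - k) as f64;
        for &i in &order[..k] {
            adjusted[i] = capped;
        }
    }
    adjusted
}

fn main() {
    // One very heavy task and three light ones on two CPUs.
    let adjusted = adjust_infeasible_weights(&[100.0, 1.0, 1.0, 1.0], 2);
    println!("{:?}", adjusted);
}
```

In the example, the heavy task would otherwise be entitled to nearly two CPUs; after adjustment its share is one CPU and the three light tasks split the other.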
Andrea Righi	a4ff395d68	Merge pull request #132 from sched-ext/rustland-fix-cpumask-stall scx_rustland: fix cpumask stall and prevent stuttery behavior	2024-02-09 00:20:02 +01:00
Andrea Righi	8e47602f00	scx_rustland: keep default CPU selection when idle Dispatch to the shared DSQ (NO_CPU) only when the assigned CPU is not idle anymore, otherwise maintain the same CPU that has been assigned by the BPF layer. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-08 22:48:07 +01:00
Andrea Righi	7085d57709	scx_rustland: kick user-space scheduler when a CPU is released When the system is not being fully utilized there may be delays in promptly awakening the user-space scheduler. This can happen, for example, when some CPU-intensive tasks are constantly dispatched bypassing the user-space scheduler (e.g., using SCX_DSQ_LOCAL) and other CPUs are completely idle. Under this condition update_idle() can fail to activate the user-space scheduler, because there are no pending events, and only the periodic timer will wake up the scheduler, potentially introducing lags of up to 1 sec. This can be reproduced, for example, by running a video game that doesn't use all the CPUs available in the system (e.g., Team Fortress 2). With this game it is pretty easy to notice sporadic lags that resolve after ~1 sec, when the periodic timer kicks the scheduler. To prevent this from happening, wake up the user-space scheduler as soon as a CPU is released, speculating on the fact that most of the time there will be another task ready to run. This can introduce a little more overhead in the scheduler (due to potential unnecessary wake up events), but it also prevents stuttery behaviors and it makes the system much smoother and more responsive, especially with video games. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-08 22:48:07 +01:00
Andrea Righi	cb82d91e0f	scx_rustland: use scx_bpf_dispatch_cancel() Use scx_bpf_dispatch_cancel() to invalidate dispatches on wrong per-CPU DSQ, due to cpumask race conditions, and redirect them to the shared DSQ. This prevents dispatching tasks to CPU that cannot be used according to the task's cpumask. With this applied the scheduler passed all the `stress-ng --race-sched` stress tests. Moreover, introduce a counter that is periodically reported to stdout as an additional statistic, that can be helpful for debugging. Link: https://github.com/sched-ext/sched_ext/pull/135 Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-08 22:48:07 +01:00
Andrea Righi	13e23e8cc9	scx_rustland: dump scheduler statistics before exiting Print all the scheduler statistics before exiting. Reporting the very last state of the scheduler can help to debug events that could trigger error conditions (such as page faults, scheduler congestions, etc.). While at it, fix also some minor coding style issues (tabs vs spaces). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-08 15:37:44 +01:00
David Vernet	c574598dc7	scx_rusty: Fix typos Signed-off-by: David Vernet <void@manifault.com>	2024-02-07 23:38:26 -06:00
Tejun Heo	73c68c6f4a	Merge pull request #131 from sched-ext/htejun scx: Update vmlinux to use SCX_KICK_IDLE	2024-02-07 07:04:00 -10:00
Tejun Heo	2062d1ad1f	scx: Add compat support for SCX_KICK_IDLE and use it for idle CPU wakeups SCX_KICK_IDLE is a new feature which isn't defined in older kernels. Add compat wrapper and use it for idle CPU wakeups. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-06 15:28:40 -10:00
Tejun Heo	17014e91fb	scx: Update vmlinux.h to receive SCX_KICK_IDLE Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-06 15:02:01 -10:00
Andrea Righi	55c9c92b81	Merge pull request #128 from sched-ext/ci-stderr ci: detect errors only from stderr	2024-02-05 18:22:34 +01:00
Tejun Heo	9eda031474	Merge pull request #126 from sirlucjan/cachyos-repo Add linux-sched-ext to CachyOS repo	2024-02-05 07:04:20 -10:00
Andrea Righi	67a53ba621	ci: detect errors only from stderr Search for potential errors only in the kernel logs and the scheduler stderr. In this way we can use "error keywords" in the scheduler's output without triggering false positives in the CI (see for example #127). NOTE: this works, because virtme-ng, when executed in verbose mode, sends the kernel messages to stderr (together with the command's stderr) and it channels the command's stdout to the stdout of the host. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-05 17:59:11 +01:00
David Vernet	95420fc7fc	Merge pull request #125 from sched-ext/reduce_ci_jobs ci: Only do CI runs for pull requests	2024-02-04 18:21:36 -06:00
Tejun Heo	86a4e7bd46	Merge pull request #124 from sched-ext/print_warning scx_userland: Print warning about poor performance	2024-02-04 13:40:20 -10:00
David Vernet	4108ece204	scx_userland: Increase scx_userland timeout This is meant to be an example scheduler that won't necessarily run well in production. Let's remove the 3 second timeout and use the system default of 30. Signed-off-by: David Vernet <void@manifault.com>	2024-02-04 16:23:18 -06:00
David Vernet	fc671aca49	ci: Only do CI runs for pull requests We're basically always running two CI jobs: one for a remote push, and another for when a PR is opened. These are essentially measuring the same thing, so let's save CI bandwidth and just do a PR run. This will hopefully make things a bit less noisy as well. Signed-off-by: David Vernet <void@manifault.com>	2024-02-04 16:12:46 -06:00
David Vernet	28a0b82be6	scx_userland: Print warning about poor performance Let's make it clear that this scheduler isn't expected to perform well, and instead point people to scx_rustland. Signed-off-by: David Vernet <void@manifault.com>	2024-02-04 16:02:57 -06:00
Piotr Gorski	52ccf1de57	Add linux-sched-ext to CachyOS repo Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>	2024-02-04 19:11:46 +01:00
Tejun Heo	a3b5941c3a	Merge pull request #122 from sched-ext/htejun common.bpf.h: Add kfunc prototype for scx_bpf_dispatch_cancel()	2024-02-04 07:23:42 -10:00
Tejun Heo	1ca3b8dca8	common.bpf.h: Add kfunc prototype for scx_bpf_dispatch_cancel() And relocate scx_bpf_dispatch_nr_slots() while at it. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-03 09:46:13 -10:00
Andrea Righi	b6eee3a5c4	Merge pull request #121 from sched-ext/rustland-duplicate-pids scx_rustland: prevent duplicate PIDs in the task BTreeSet	2024-02-03 17:55:43 +01:00
Andrea Righi	acb174aa51	scx_rustland: prevent duplicate PIDs in the task BTreeSet Items in the task BTreeSet are stored by pid and vruntime. Make sure that we never store multiple items with the same PID, so that re-enqueued tasks are not dispatched multiple times. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-03 14:46:39 +01:00
Tejun Heo	4631392031	Merge pull request #120 from jordalgo/c-headers Include libbpf_h path in c sched compilation	2024-02-02 17:22:15 -10:00
Jordan Rome	dd11d97cdb	Include libbpf_h path in c sched compilation If a user wants to use an external libbpf source (via libbpf_h opt) we need to also pass this to c sched compilation.	2024-02-02 19:46:24 -05:00
David Vernet	7cbcc16be9	Merge pull request #119 from sched-ext/htejun scheds/sync-to-kernel.sh: Drop most schedulers from sync	2024-02-02 14:20:02 -06:00
Tejun Heo	7ffdcc1984	Merge pull request #110 from sched-ext/scx-rustland-pcpu-dsq scx_rustland: per-CPU DSQs + global shared DSQ	2024-02-02 09:23:40 -10:00
Tejun Heo	c7ad3a71f9	scheds/sync-to-kernel.sh: Drop most schedulers from sync Only scx_simple/qmap are in the kernel tree now. Drop the rest from the sync script. Also update the sync script so that it can handle empty rust_scheds variable. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-02 09:08:30 -10:00
Tejun Heo	1326b8a539	Merge pull request #118 from sched-ext/docs_fix docs: Update OVERVIEW to match latest APIs	2024-02-02 06:43:08 -10:00
David Vernet	15487a95af	docs: Update OVERVIEW to match latest APIs Pierre Jacquet pointed out that our docs in the scx repo are out of date for the latest APIs. Let's update it so readers don't get confused. Signed-off-by: David Vernet <void@manifault.com>	2024-02-02 10:38:19 -06:00
Andrea Righi	681b3fd807	scx_rustland: more aggressive time slice scaling Allow to scale the effective time slice down to 250 us. This can help to maintain a good quality of the audio even when the system is overloaded by multiple CPU-intensive tasks. Moreover, always round up the time slice scaling factor to be a little more aggressive and prioritize at scaling the time slice, so that we can prioritize low latency tasks even more. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 16:40:59 +01:00
Andrea Righi	26d6d530f0	scx_rustland: enhance interactive task classification Evaluate the number of voluntary context switches per second (nvcsw/sec) for each task using an exponentially weighted moving average (EWMA) with weight 0.5, which allows interactive tasks to be classified with more accuracy. Using a simple average over a period of 10 sec can introduce small lags every 10 sec, as the statistics for the number of voluntary context switches are refreshed. This can result in interactive tasks taking a brief time to catch up before they are accurately classified as such, causing for example short audio cracks, small drops of 5-10 fps in games, etc. Using an EWMA smooths the average of nvcsw/sec, preventing short lags in the interactive tasks, while also avoiding incorrectly classifying as interactive those tasks that experience an isolated short burst of voluntary context switches. This patch has been tested with the usual test case of playing a videogame while running a parallel kernel build in the background. Without this patch the short lag every 10 sec is clearly noticeable; with this patch applied the game and audio run smoothly. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 16:40:59 +01:00
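
The averaging described in the commit above reduces to a one-line update. The sketch below assumes the common EWMA form `avg = w * sample + (1 - w) * avg` with `w = 0.5`; the function and constant names are hypothetical, and the exact formula used by scx_rustland may differ.

```rust
/// Weight given to the newest sample (0.5, as stated in the commit message).
const EWMA_WEIGHT: f64 = 0.5;

/// Fold a new nvcsw/sec sample into the running average.
fn ewma_update(avg: f64, sample: f64) -> f64 {
    EWMA_WEIGHT * sample + (1.0 - EWMA_WEIGHT) * avg
}

fn main() {
    // A mostly quiet task with one isolated burst of context switches.
    let samples = [0.0, 0.0, 200.0, 0.0, 0.0, 0.0];
    let mut avg = 0.0;
    for s in samples {
        avg = ewma_update(avg, s);
        println!("sample={s:>5} avg={avg:>7.2}");
    }
}
```

Each update halves the influence of older samples, so an isolated burst fades within a few periods instead of persisting until a fixed 10 sec window is refreshed.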
Andrea Righi	baeea306fc	scx_rustland: rely on the built-in idle selection logic Simplify the idle selection logic by relying only on the built-in idle selection performed in the BPF layer. When there are idle CPUs available in the system, tasks are dispatched directly by the BPF dispatcher without invoking the user-space scheduler. This allows to avoid the user-space overhead and get the best system performance when CPU resources are not overcommitted. Once the number of tasks exceeds the available CPUs, the user-space scheduler takes over. However, by this time, the system is already overcommitted, so there's little advantage in attempting to pinpoint the optimal idle CPU through the user-space scheduler. Instead, tasks can be executed on the first available CPU, consistently dispatching them to the shared DSQ. This allows to achieve the optimal performance both with system under-utilization and over-utilization. With this change in place the user-space scheduler won't dispatch tasks directly to specific CPUs, but we still want to keep this as a generic feature in the BPF layer, so that it can be potentially used in the future by this scheduler or even by other user-space schedulers (once the BPF layer will be moved to a more generic place). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 16:40:59 +01:00
Andrea Righi	b9e60f71ed	scx_rustland: usersched: code refactoring No functional change, just move code around to make it more readable. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 16:40:59 +01:00
Andrea Righi	d13ed5c025	scx_rustland: BPF: refine CPU dispatch logic When the user-space scheduler dispatches a task on a specific CPU, that CPU might not be valid, since the user-space doesn't have visibility of the task's cpumask. When this happens the BPF dispatcher (that has direct visibility of the cpumask) should automatically redirect the task to a valid CPU, but instead of bouncing the task on the shared DSQ, we should try to use the CPU assigned by the built-in idle selection logic. If this CPU is also not valid, then we can simply ignore the task, that has been de-queued and re-enqueued, since a valid CPU will be naturally re-selected at a later time. Moreover, avoid to kick any specific CPU when the task is dispatched to shared DSQ, since the task can be consumed on any CPU and the additional kick would simply add more overhead. Lastly, rename dsq_id_to_cpu() to dsq_to_cpu() and cpu_to_dsq_id() to cpu_to_dsq() for more clarity. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 16:38:17 +01:00
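
The fallback order described in the commit above (requested CPU, then the CPU picked by the built-in idle selection, then drop the dispatch) can be sketched as a pure function. This is a user-space model with hypothetical names (`resolve_target`, `Target`, a 64-bit cpumask); the real check runs in the BPF dispatcher.

```rust
const NO_CPU: i32 = -1; // request "first CPU available"

#[derive(Debug, PartialEq)]
enum Target {
    SharedDsq,      // consumable from any CPU, no specific CPU is kicked
    PerCpuDsq(i32), // per-CPU DSQ of the given CPU
    Ignored,        // no valid CPU: rely on the task being re-enqueued later
}

/// True if `cpu` is a valid index and is set in the task's cpumask.
fn cpu_allowed(cpumask: u64, cpu: i32) -> bool {
    (0..64).contains(&cpu) && cpumask & (1u64 << cpu) != 0
}

/// Pick the dispatch target for a task.
/// `requested` comes from the user-space scheduler, `bpf_selected` from the
/// built-in idle selection logic, `cpumask` is only visible to the BPF side.
fn resolve_target(requested: i32, bpf_selected: i32, cpumask: u64) -> Target {
    if requested == NO_CPU {
        Target::SharedDsq
    } else if cpu_allowed(cpumask, requested) {
        Target::PerCpuDsq(requested)
    } else if cpu_allowed(cpumask, bpf_selected) {
        Target::PerCpuDsq(bpf_selected) // transparent redirect to a valid CPU
    } else {
        Target::Ignored
    }
}

fn main() {
    let cpumask = 0b0101; // task may run on CPU 0 and CPU 2
    println!("{:?}", resolve_target(2, 0, cpumask));
    println!("{:?}", resolve_target(1, 0, cpumask));
    println!("{:?}", resolve_target(1, 3, cpumask));
    println!("{:?}", resolve_target(NO_CPU, 0, cpumask));
}
```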
Andrea Righi	45d8b54eb9	scx_rustland: re-introduce per-CPU DSQ + a global shared DSQ With commit `c6ada25` ("scx_rustland: use custom pcpu DSQ instead of SCX_DSQ_LOCAL{_ON}") we tried to introduce custom per-CPU DSQs, instead of using SCX_DSQ_LOCAL and SCX_DSQ_LOCAL_ON to dispatch tasks. This was required, because dispatching tasks using SCX_DSQ_LOCAL_ON doesn't provide a guarantee that the cpumask, checked at dispatch time to determine the validity of a target CPU, remains valid. This method solved the cpumask validity issue, but unfortunately it introduced a noticeable performance regression and a potential starvation issue (that were probably caused by the same problem): if a task is assigned to a CPU in select_cpu() and the scheduler decides to dispatch it on a different CPU, the task will be added to the new CPU's DSQ, but if no dispatch event happens there, the task may remain stuck in the per-CPU DSQ for a long time, triggering the sched-ext watchdog timeout that would kick out the scheduler, for example: 12:53:28 [WARN] FAIL: IPC:CSteamEngin[7217] failed to run for 6.482s (err=1026) 12:53:28 [INFO] Unregister RustLand scheduler Therefore, we reverted this change with `6d89ece` ("scx_rustland: dispatch tasks only on the global DSQ"), dispatching all the tasks to the global DSQ, completely delegating the kernel to distribute tasks among the available CPUs. This is not the ideal solution, because we still want to give the possibility to the user-space scheduler to assign tasks to specific CPUs. Therefore, re-introduce distinct per-CPU DSQs, but also provide a global shared DSQ. Tasks dispatched in the per-CPU DSQs are consumed from the dispatch() callback of their corresponding CPU, tasks dispatched in the global shared DSQ are consumed from any CPU. 
In this way the BPF layer is able to provide an interface that gives the flexibility to the user-space to dispatch a task on a specific CPU or on the first CPU available, depending on the particular scheduler's need. If an invalid CPU (according to the cpumask) is selected the BPF dispatcher will transparently redirect the task to a valid CPU, selected using the built-in idle selection logic. In the future we may want to improve this part, giving to the user-space the visibility of the cpumask, in order to pick a valid CPU in advance and in a proper synchronized way. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 00:33:35 +01:00
Andrea Righi	b5e846c538	scx_rustland: BPF: small refactoring No functional change, just some refactoring to make the code more clear. We have is_usersched_needed() and set_usersched_needed() that are doing different things (the former is checking if there are pending tasks for the scheduler, the latter is setting the usersched_needed flag to activate the dispatch of the user-space scheduler). Rename is_usersched_needed() to usersched_has_pending_tasks() to make the code more clear and understandable. Also move dispatch_user_scheduler() closer to the other dispatch-related helper functions. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 00:33:35 +01:00
Tejun Heo	439c3cdba7	Merge pull request #116 from sched-ext/htejun Add user_exit_info support to scx_utils and convert the rust scheds accordingly	2024-01-31 12:17:52 -10:00
Tejun Heo	6db362b27a	scx_rustland: Use scx_utils::user_exit_info Instead of the bespoke implementation. This also makes scx_rustland to print out debug dump if exists. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-31 11:44:15 -10:00
Tejun Heo	965926f393	scx_rusty: Use scx_utils::user_exit_info Instead of the bespoke implementation. This also makes scx_rusty to print out debug dump if exists. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-31 11:08:17 -10:00

366 Commits