JakeHillion/scx

mirror of https://github.com/JakeHillion/scx.git synced 2024-11-29 20:50:22 +00:00

Author	SHA1	Message	Date
Andrea Righi	2cd1d4b684	scx_rustland: introduce --full-user Introduce an option to send all scheduling events and actions to user-space, disabling any form of in-kernel optimization. Enabling this option will likely make the system less responsive (but more predictable in terms of performance) and it can be useful for debugging purposes. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-21 00:25:14 +01:00
David Vernet	ef8aa9ea31	add documentation Signed-off-by: David Vernet <void@manifault.com>	2024-02-20 14:57:09 -06:00
David Vernet	8aba090d4f	rust: Add topology module to utils crate scx_rusty has logic in the scheduler to inspect the host to automatically build scheduling domains across every L3 cache. This would be generically useful for many different types of schedulers, so let's add it to the scx_utils crate so it can be used by others. Signed-off-by: David Vernet <void@manifault.com>	2024-02-20 14:57:09 -06:00
Andrea Righi	7ff06a6ff0	scx_rustland: prevent misaligned pointer dereference The buffer used to store struct queued_task_ctx items fetched from the BPF ring buffer needs to be aligned to the architecture register size, otherwise we may hit misaligned pointer dereference issues, such as: thread 'main' panicked at src/bpf.rs:162:43: misaligned pointer dereference: address must be a multiple of 0x8 but is 0x56516a51e004 note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace Prevent this by making sure the buffer is always aligned to 64-bits. Fixes: `93dc615` ("scx_rustland: use a ring buffer for queued tasks") Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-20 19:08:38 +01:00
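A minimal sketch of the kind of alignment guarantee this commit describes, assuming a fixed-size byte buffer that is later reinterpreted through a pointer cast. The `AlignedBuffer` type and the `first_u64` helper are hypothetical names for illustration and are not taken from scx_rustland.

```rust
/// Hypothetical wrapper (not from scx_rustland): forces the byte buffer to
/// start at an address that is a multiple of 8.
#[repr(C, align(8))]
pub struct AlignedBuffer<const N: usize>(pub [u8; N]);

impl<const N: usize> AlignedBuffer<N> {
    /// Create a zero-filled, 8-byte aligned buffer.
    pub fn new() -> Self {
        AlignedBuffer([0u8; N])
    }

    /// Reinterpret the first 8 bytes as a native-endian u64.
    /// The pointer cast is only sound because of the `align(8)` attribute:
    /// a plain `[u8; N]` has alignment 1 and could start at any address.
    pub fn first_u64(&self) -> u64 {
        assert!(N >= 8, "buffer too small to hold a u64");
        let ptr = self.0.as_ptr() as *const u64;
        debug_assert_eq!(ptr as usize % std::mem::align_of::<u64>(), 0);
        unsafe { ptr.read() }
    }
}
```

Without the `align(8)` attribute the same cast could hit the "misaligned pointer dereference" panic quoted in the commit message, depending on where the buffer happens to be placed.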
Andrea Righi	93dc615653	scx_rustland: use a ring buffer for queued tasks Switch from a BPF_MAP_TYPE_QUEUE to a BPF_MAP_TYPE_RINGBUF to store the tasks that need to be processed by the user-space scheduler. A ring buffer allows to save a lot of memory copies and syscalls, since the memory is directly shared between the BPF and the user-space components. Performance profile before this change: 2.44% [kernel] [k] __memset 2.19% [kernel] [k] __sys_bpf 1.59% [kernel] [k] __kmem_cache_alloc_node 1.00% [kernel] [k] _copy_from_user After this change: 1.42% [kernel] [k] __memset 0.14% [kernel] [k] __sys_bpf 0.10% [kernel] [k] __kmem_cache_alloc_node 0.07% [kernel] [k] _copy_from_user Both the overhead of sys_bpf() and copy_from_user() are reduced by a factor of ~15x now (only the dispatch path is using sys_bpf() now). NOTE: despite being very effective, the current implementation is a bit of a hack. This is because the present ring buffer API exclusively permits consumption in a greedy manner, where multiple items can be consumed simultaneously. However, libbpf-rs does not provide precise information regarding the exact number of items consumed. By utilizing a more refined libbpf-rs API [1] we may be able to improve this code a bit. Moreover, libbpf-rs doesn't provide an API for the user_ring_buffer, so at the moment there's not a trivial way to apply the same change to the dispatched tasks. However, just with this change applied, the overhead of sys_bpf() and copy_from_user() is already minimal, so we won't get much benefits by changing the dispatch path to use a BPF ring buffer. [1] https://github.com/libbpf/libbpf-rs/pull/680 Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-20 12:30:22 +01:00
Andrea Righi	04685e633f	scx_rustland: avoid memory copies while accessing cpu_map Instead of using a BPF_MAP_TYPE_ARRAY to store which tasks are running on which CPU we can simply use a global array, mapped in the user-space address space. In this way we can avoid a lot of memory copies and call to sys_bpf(), significantly reducing the scheduler's overhead. Keep in mind that we don't need to be 100% correct while accessing this information, so we can accept some fuzziness in order to significantly reduce the scheduler's overhead. Performance profile before this change: 5.52% [kernel] [k] __sys_bpf 4.84% [kernel] [k] __kmem_cache_alloc_node 4.71% [kernel] [k] map_lookup_elem 4.10% [kernel] [k] _copy_from_user 3.51% [kernel] [k] bpf_map_copy_value 3.12% [kernel] [k] check_heap_object After this change: 2.20% [kernel] [k] __sys_bpf 1.91% [kernel] [k] map_lookup_and_delete_elem 1.60% [kernel] [k] __kmem_cache_alloc_node 1.10% [kernel] [k] _copy_from_user 0.12% [kernel] [k] check_heap_object n/a bpf_map_copy_value n/a map_lookup_elem With this change we can reduce the overhead of sys_bpf() by ~2x and the overhead of copy_from_user() by ~4x. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-20 12:30:16 +01:00
Andrea Righi	fc889c6995	scx_rustland: replace custom allocator with buddy-alloc Currently, the primary bottleneck in scx_rustland lies within its custom memory allocator, which is used to prevent page faults in the user-space scheduler. This is pretty evident looking at perf top: 39.95% scx_rustland [.] <scx_rustland::bpf::alloc::RustLandAllocator as core::alloc::global::GlobalAlloc>::alloc 3.41% [kernel] [k] _copy_from_user 3.20% [kernel] [k] __kmem_cache_alloc_node 2.59% [kernel] [k] __sys_bpf 2.30% [kernel] [k] __kmem_cache_free 1.48% libc.so.6 [.] syscall 1.45% [kernel] [k] __virt_addr_valid 1.42% scx_rustland [.] <scx_rustland::bpf::alloc::RustLandAllocator as core::alloc::global::GlobalAlloc>::dealloc 1.31% [kernel] [k] _copy_to_user 1.23% [kernel] [k] entry_SYSRETQ_unsafe_stack However, there's no need to reinvent the wheel here: rather than relying on an overly simplistic and inefficient allocator, we can rely on buddy-alloc [1], which is also capable of operating on a preallocated memory buffer. After switching to buddy-alloc, the performance profile under the same workload conditions looks like the following: 6.01% [kernel] [k] _copy_from_user 5.21% [kernel] [k] __kmem_cache_alloc_node 4.45% [kernel] [k] __sys_bpf 3.80% [kernel] [k] __kmem_cache_free 2.79% libc.so.6 [.] syscall 2.34% [kernel] [k] __virt_addr_valid 2.26% [kernel] [k] _copy_to_user 2.14% [kernel] [k] __check_heap_object 2.10% [kernel] [k] __check_object_size.part.0 2.02% [kernel] [k] entry_SYSRETQ_unsafe_stack With this change in place, the primary overhead is now moved to the bpf() syscall and the copies between kernel and user-space (this could potentially be optimized in the future using BPF ring buffers, instead of BPF FIFO queues). A closer look at the allocator overhead before vs after this change: [before] 39.95% scx_rustland [.] core::alloc::global::GlobalAlloc>::alloc 1.42% scx_rustland [.] core::alloc::global::GlobalAlloc>::dealloc [after] 1.50% scx_rustland [.] core::alloc::global::GlobalAlloc>::alloc 0.76% scx_rustland [.] core::alloc::global::GlobalAlloc>::dealloc [1] https://crates.io/crates/buddy-alloc Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-11 14:33:39 +01:00
Andrea Righi	ccf5946425	scx_rustland: speed up search by PID in tasks BTreeSet In order to prevent duplicate PIDs in the TaskTree (BTreeSet), we perform an O(N) search each time we add an item, to verify whether the PID already exists or not. Under heavy stress test conditions the O(N) complexity can have a potential impact on the overall performance. To mitigate this, introduce a HashMap that can be used to retrieve tasks by PID typically with a O(1) complexity. This could potentially degrade to O(N) in presence of hash collisions, but even in this case, accessing the hash map is still more efficient than scanning all the entries in the BTreeSet to search for the target PID. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-11 14:11:38 +01:00
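The idea of pairing the ordered set with a PID index can be sketched as follows. This is an illustrative sketch, not the scx_rustland code: the field layout of `Task`, the method names, and the choice to replace a stale entry when a PID is pushed twice are all assumptions.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical task record. The derived ordering compares fields in
/// declaration order, so tasks sort by vruntime first, then by pid.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Task {
    pub vruntime: u64,
    pub pid: i32,
}

/// Ordered set of tasks plus a pid index that replaces the O(N) scan.
#[derive(Default)]
pub struct TaskTree {
    tasks: BTreeSet<Task>,
    by_pid: HashMap<i32, Task>,
}

impl TaskTree {
    /// Insert a task; if the pid is already present, drop the stale entry
    /// first (assumed policy) so a pid is never stored twice.
    pub fn push(&mut self, pid: i32, vruntime: u64) {
        let task = Task { vruntime, pid };
        if let Some(old) = self.by_pid.insert(pid, task) {
            self.tasks.remove(&old);
        }
        self.tasks.insert(task);
    }

    /// Remove and return the task with the smallest vruntime.
    pub fn pop(&mut self) -> Option<Task> {
        let task = self.tasks.pop_first()?;
        self.by_pid.remove(&task.pid);
        Some(task)
    }

    /// Number of queued tasks.
    pub fn len(&self) -> usize {
        self.tasks.len()
    }
}
```

The duplicate check becomes a hash lookup followed by an O(log N) removal from the set, instead of a linear scan over every queued task.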
Andrea Righi	7ce0d038e4	Merge pull request #133 from sched-ext/rustland-cpumask-gen-cnt scx_rustland: per-task cpumask generation counter	2024-02-10 19:07:02 +01:00
Andrea Righi	61d1ed338a	scx_rustland: per-task cpumask generation counter Introduce a per-task generation counter to check the validity of the cpumask at dispatch time. The logic is the following:
- the cpumask generation number is incremented every time a task calls .set_cpumask()
- when a task is enqueued the current generation number is stored in the queued_task_ctx and relayed to the user-space scheduler
- the user-space scheduler can decide to dispatch the task on the CPU determined by the BPF layer in .select_cpu(), redirect the task to any other specific CPU, or redirect to the first CPU available (using NO_CPU)
- task is then dispatched back to the BPF code along with its cpumask generation counter
- at dispatch time the BPF code checks if the generation number is the same and it discards the dispatch attempt if the cpumask is not valid anymore (the task will be automatically re-enqueued by the sched-ext core code, potentially selecting another CPU / cpumask)
- if the cpumask is valid, but the CPU selected by the user-space scheduler is invalid (according to the cpumask), the task will be transparently bounced by the BPF code to the shared DSQ (in this way the user-space code can be completely abstracted and dispatches that target invalid CPUs can be automatically fixed by the BPF layer)
This solution can prevent stalls due to dispatches targeting invalid CPUs and it can also avoid redundant dispatch events, making the code more efficient and the cpumask interlocking more reliable. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-10 18:02:42 +01:00
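The steps above can be modelled in plain Rust as a small state machine. This is only a user-space simulation of the logic for illustration: the real checks live in the BPF component, and the `BpfLayer`, `TaskState`, and `Outcome` types, the 64-bit cpumask, and the value chosen for `NO_CPU` are all hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical marker meaning "first CPU available".
pub const NO_CPU: i32 = -1;

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// Generation mismatch: the cpumask changed after the task was enqueued.
    Discarded,
    /// Dispatched (or bounced) to the shared DSQ.
    SharedDsq,
    /// Dispatched to the per-CPU DSQ of this CPU.
    Cpu(u32),
}

struct TaskState {
    cpumask: u64,    // bit N set means CPU N is allowed (assumes <= 64 CPUs)
    generation: u64, // bumped on every cpumask change
}

#[derive(Default)]
pub struct BpfLayer {
    tasks: HashMap<i32, TaskState>,
}

impl BpfLayer {
    /// Model of .set_cpumask(): store the mask and bump the generation.
    pub fn set_cpumask(&mut self, pid: i32, cpumask: u64) {
        let state = self
            .tasks
            .entry(pid)
            .or_insert(TaskState { cpumask: 0, generation: 0 });
        state.cpumask = cpumask;
        state.generation += 1;
    }

    /// Model of enqueue: the generation to relay to the user-space scheduler.
    pub fn enqueue(&self, pid: i32) -> Option<u64> {
        self.tasks.get(&pid).map(|s| s.generation)
    }

    /// Model of dispatch: validate the generation first, then the target CPU.
    pub fn dispatch(&self, pid: i32, cpu: i32, generation: u64) -> Outcome {
        let state = match self.tasks.get(&pid) {
            Some(s) if s.generation == generation => s,
            _ => return Outcome::Discarded,
        };
        if cpu == NO_CPU {
            return Outcome::SharedDsq;
        }
        let allowed = (0..64).contains(&cpu) && (state.cpumask >> cpu) & 1 == 1;
        if allowed {
            Outcome::Cpu(cpu as u32)
        } else {
            // Valid cpumask but invalid CPU: bounce to the shared DSQ.
            Outcome::SharedDsq
        }
    }
}
```

In this model a dispatch that carries an old generation number is rejected before the target CPU is examined, while a dispatch with a current generation but a disallowed CPU is bounced to the shared DSQ.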
David Vernet	1c00de9402	Merge pull request #129 from sched-ext/infeasible_weights Implement solution to infeasible weights problem	2024-02-09 16:23:56 -06:00
David Vernet	e627176d90	scx: Implement solution to infeasible weights problem As described in [0], there is an open problem in load balancing called the "infeasible weights" problem. Essentially, the problem boils down to the fact that a task with disproportionately high load can be granted more CPU time than it can actually consume per its duty cycle. This patch implements a solution to that problem, wherein we apply the algorithm described in this paper to adjust all infeasible weights in the system down to a feasible weight that gives them their full duty cycle, while allowing the remaining feasible tasks on the system to share the remaining compute capacity on the machine. [0]: https://drive.google.com/file/d/1fAoWUlmW-HTp6akuATVpMxpUpvWcGSAv/view?usp=drive_link Signed-off-by: David Vernet <void@manifault.com>	2024-02-09 16:23:12 -06:00
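To make the problem concrete, here is a deliberately simplified sketch. It is not the algorithm from the referenced paper and not the scx_rusty implementation: it assumes every task has a 100% duty cycle, so a task can consume at most one full CPU, and a weight is "infeasible" when its proportional share would exceed one CPU. The function name and the use of `f64` weights are illustrative assumptions.

```rust
/// Simplified illustration (assumes a 100% duty cycle for every task).
/// Returns the weights in their original order, with every infeasible weight
/// lowered to the largest value that still entitles the task to exactly one CPU.
pub fn cap_infeasible_weights(weights: &[f64], nr_cpus: usize) -> Vec<f64> {
    // With no more tasks than CPUs each task can own a CPU: nothing to adjust.
    if nr_cpus == 0 || weights.len() <= nr_cpus {
        return weights.to_vec();
    }

    // Task indices sorted by weight, heaviest first.
    let mut order: Vec<usize> = (0..weights.len()).collect();
    order.sort_by(|&a, &b| weights[b].total_cmp(&weights[a]));

    let mut remaining: f64 = weights.iter().sum(); // weight of uncapped tasks
    let mut capped = 0; // tasks found infeasible so far

    // With `capped` tasks pinned to one CPU each, the next heaviest task is
    // infeasible if its share of the other CPUs exceeds one CPU:
    //   w / remaining * (nr_cpus - capped) > 1
    while capped < nr_cpus
        && weights[order[capped]] * (nr_cpus - capped) as f64 > remaining
    {
        remaining -= weights[order[capped]];
        capped += 1;
    }

    // The feasible tasks share (nr_cpus - capped) CPUs; a capped task gets the
    // weight that corresponds to exactly one CPU at that exchange rate.
    let cap = remaining / (nr_cpus - capped) as f64;
    let mut adjusted = weights.to_vec();
    for &idx in &order[..capped] {
        adjusted[idx] = cap;
    }
    adjusted
}
```

For example, with 2 CPUs and weights `[10, 1, 1]` the heavy task would be entitled to 10/12 of two CPUs, which is more than one CPU. Capping its weight to 2 gives it exactly one CPU (2/4 of two CPUs) and leaves the other CPU to the two light tasks.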
Andrea Righi	8e47602f00	scx_rustland: keep default CPU selection when idle Dispatch to the shared DSQ (NO_CPU) only when the assigned CPU is not idle anymore, otherwise maintain the same CPU that has been assigned by the BPF layer. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-08 22:48:07 +01:00
Andrea Righi	7085d57709	scx_rustland: kick user-space scheduler when a CPU is released When the system is not being fully utilized there may be delays in promptly awakening the user-space scheduler. This can happen, for example, when some CPU-intensive tasks are constantly dispatched bypassing the user-space scheduler (e.g., using SCX_DSQ_LOCAL) and other CPUs are completely idle. Under this condition update_idle() can fail to activate the user-space scheduler, because there are no pending events, and only the periodic timer will wake up the scheduler, potentially introducing lags of up to 1 sec. This can be reproduced, for example, by running a video game that doesn't use all the CPUs available in the system (e.g., Team Fortress 2). With this game it is pretty easy to notice sporadic lags that resolve after ~1 sec, when the periodic timer kicks the scheduler. To prevent this from happening, wake up the user-space scheduler immediately as soon as a CPU is released, speculating on the fact that most of the time there will be another task ready to run. This can introduce a little more overhead in the scheduler (due to potential unnecessary wake up events), but it also prevents stuttery behaviors and it makes the system much smoother and more responsive, especially with video games. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-08 22:48:07 +01:00
Andrea Righi	cb82d91e0f	scx_rustland: use scx_bpf_dispatch_cancel() Use scx_bpf_dispatch_cancel() to invalidate dispatches on wrong per-CPU DSQ, due to cpumask race conditions, and redirect them to the shared DSQ. This prevents dispatching tasks to CPU that cannot be used according to the task's cpumask. With this applied the scheduler passed all the `stress-ng --race-sched` stress tests. Moreover, introduce a counter that is periodically reported to stdout as an additional statistic, that can be helpful for debugging. Link: https://github.com/sched-ext/sched_ext/pull/135 Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-08 22:48:07 +01:00
Andrea Righi	13e23e8cc9	scx_rustland: dump scheduler statistics before exiting Print all the scheduler statistics before exiting. Reporting the very last state of the scheduler can help to debug events that could trigger error conditions (such as page faults, scheduler congestions, etc.). While at it, fix also some minor coding style issues (tabs vs spaces). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-08 15:37:44 +01:00
David Vernet	c574598dc7	scx_rusty: Fix typos Signed-off-by: David Vernet <void@manifault.com>	2024-02-07 23:38:26 -06:00
Tejun Heo	2062d1ad1f	scx: Add compat support for SCX_KICK_IDLE and use it for idle CPU wakeups SCX_KICK_IDLE is a new feature which isn't defined in older kernels. Add compat wrapper and use it for idle CPU wakeups. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-06 15:28:40 -10:00
Andrea Righi	acb174aa51	scx_rustland: prevent duplicate PIDs in the task BTreeSet Items in the task BTreeSet are stored by pid and vruntime. Make sure that we never store multiple items with the same PID, so that re-enqueued tasks are not dispatched multiple times. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-03 14:46:39 +01:00
Andrea Righi	681b3fd807	scx_rustland: more aggressive time slice scaling Allow the effective time slice to be scaled down to 250 us. This can help to maintain a good quality of the audio even when the system is overloaded by multiple CPU-intensive tasks. Moreover, always round up the time slice scaling factor, so that the time slice is scaled a little more aggressively and low-latency tasks are prioritized even more. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 16:40:59 +01:00
Andrea Righi	26d6d530f0	scx_rustland: enhance interactive task classification Evaluate the number of voluntary context switches per second (nvcsw/sec) for each task using an exponentially weighted moving average (EWMA) with weight 0.5, which allows interactive tasks to be classified more accurately. Using a simple average over a period of 10 sec can introduce small lags every 10 sec, as the statistics for the number of voluntary context switches are refreshed. This can result in interactive tasks taking a brief time to catch up in order to be accurately classified as such, causing for example short audio cracks, small drops of 5-10 fps in games, etc. Using an EWMA smooths the average of nvcsw/sec, preventing short lags in the interactive tasks, while also avoiding incorrectly classifying as interactive those tasks that may experience an isolated short burst of voluntary context switches. This patch has been tested with the usual test case of playing a video game while running a parallel kernel build in the background. Without this patch the short lag every 10 sec is clearly noticeable; with this patch applied the game and audio run smoothly. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 16:40:59 +01:00
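An EWMA with weight 0.5 simply averages the previous estimate with the newest sample. The sketch below shows the arithmetic; the `NvcswTracker` type, its fields, and the way the rate is derived from a cumulative counter are assumptions for illustration, not the scx_rustland code.

```rust
/// Exponentially weighted moving average: `weight` is the share given to the
/// newest sample, the rest goes to the previous average.
pub fn ewma(prev_avg: f64, sample: f64, weight: f64) -> f64 {
    weight * sample + (1.0 - weight) * prev_avg
}

/// Hypothetical per-task tracker of voluntary context switches per second.
#[derive(Default)]
pub struct NvcswTracker {
    avg: f64,        // smoothed nvcsw/sec
    last_nvcsw: u64, // cumulative counter seen at the previous update
}

impl NvcswTracker {
    /// Feed the task's cumulative nvcsw counter and the seconds elapsed since
    /// the previous update; returns the new smoothed rate.
    pub fn update(&mut self, nvcsw: u64, elapsed_secs: f64) -> f64 {
        let rate = nvcsw.saturating_sub(self.last_nvcsw) as f64 / elapsed_secs;
        self.last_nvcsw = nvcsw;
        self.avg = ewma(self.avg, rate, 0.5);
        self.avg
    }
}
```

With weight 0.5 an isolated burst is halved on every following update, so it fades quickly, while a sustained rate is reached within a few samples instead of waiting for a full averaging window to elapse.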
Andrea Righi	baeea306fc	scx_rustland: rely on the built-in idle selection logic Simplify the idle selection logic by relying only on the built-in idle selection performed in the BPF layer. When there are idle CPUs available in the system, tasks are dispatched directly by the BPF dispatcher without invoking the user-space scheduler. This allows to avoid the user-space overhead and get the best system performance when CPU resources are not overcommitted. Once the number of tasks exceeds the available CPUs, the user-space scheduler takes over. However, by this time, the system is already overcommitted, so there's little advantage in attempting to pinpoint the optimal idle CPU through the user-space scheduler. Instead, tasks can be executed on the first available CPU, consistently dispatching them to the shared DSQ. This allows to achieve the optimal performance both with system under-utilization and over-utilization. With this change in place the user-space scheduler won't dispatch tasks directly to specific CPUs, but we still want to keep this as a generic feature in the BPF layer, so that it can be potentially used in the future by this scheduler or even by other user-space schedulers (once the BPF layer will be moved to a more generic place). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 16:40:59 +01:00
Andrea Righi	b9e60f71ed	scx_rustland: usersched: code refactoring No functional change, just move code around to make it more readable. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 16:40:59 +01:00
Andrea Righi	d13ed5c025	scx_rustland: BPF: refine CPU dispatch logic When the user-space scheduler dispatches a task on a specific CPU, that CPU might not be valid, since the user-space doesn't have visibility of the task's cpumask. When this happens the BPF dispatcher (that has direct visibility of the cpumask) should automatically redirect the task to a valid CPU, but instead of bouncing the task on the shared DSQ, we should try to use the CPU assigned by the built-in idle selection logic. If this CPU is also not valid, then we can simply ignore the task, that has been de-queued and re-enqueued, since a valid CPU will be naturally re-selected at a later time. Moreover, avoid to kick any specific CPU when the task is dispatched to shared DSQ, since the task can be consumed on any CPU and the additional kick would simply add more overhead. Lastly, rename dsq_id_to_cpu() to dsq_to_cpu() and cpu_to_dsq_id() to cpu_to_dsq() for more clarity. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 16:38:17 +01:00
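The fallback order described in this commit can be expressed as a small pure function. This is a sketch of the decision only, under the assumptions that the cpumask fits in 64 bits and that both candidate CPUs are already known; `resolve_cpu`, `Target`, and `cpu_allowed` are hypothetical names and not part of the BPF code.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Target {
    /// Dispatch to the per-CPU DSQ of this CPU.
    Cpu(u32),
    /// Neither candidate is valid: ignore the dispatch and let the task be
    /// re-enqueued, so that a valid CPU is selected later.
    Skip,
}

/// True if `cpu` is set in the (assumed 64-bit) cpumask.
fn cpu_allowed(cpumask: u64, cpu: u32) -> bool {
    cpu < 64 && (cpumask >> cpu) & 1 == 1
}

/// Prefer the CPU requested by the user-space scheduler, then the CPU picked
/// by the built-in idle selection logic, otherwise skip the dispatch.
pub fn resolve_cpu(cpumask: u64, requested: u32, builtin: u32) -> Target {
    if cpu_allowed(cpumask, requested) {
        Target::Cpu(requested)
    } else if cpu_allowed(cpumask, builtin) {
        Target::Cpu(builtin)
    } else {
        Target::Skip
    }
}
```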
Andrea Righi	45d8b54eb9	scx_rustland: re-introduce per-CPU DSQ + a global shared DSQ With commit `c6ada25` ("scx_rustland: use custom pcpu DSQ instead of SCX_DSQ_LOCAL{_ON}") we tried to introduce custom per-CPU DSQs, instead of using SCX_DSQ_LOCAL and SCX_DSQ_LOCAL_ON to dispatch tasks. This was required, because dispatching tasks using SCX_DSQ_LOCAL_ON doesn't provide a guarantee that the cpumask, checked at dispatch time to determine the validity of a target CPU, remains valid. This method solved the cpumask validity issue, but unfortunately it introduced a noticeable performance regression and a potential starvation issue (that were probably caused by the same problem): if a task is assigned to a CPU in select_cpu() and the scheduler decides to dispatch it on a different CPU, the task will be added to the new CPU's DSQ, but if no dispatch event happens there, the task may remain stuck in the per-CPU DSQ for a long time, triggering the sched-ext watchdog timeout that would kick out the scheduler, for example: 12:53:28 [WARN] FAIL: IPC:CSteamEngin[7217] failed to run for 6.482s (err=1026) 12:53:28 [INFO] Unregister RustLand scheduler Therefore, we reverted this change with `6d89ece` ("scx_rustland: dispatch tasks only on the global DSQ"), dispatching all the tasks to the global DSQ, completely delegating the kernel to distribute tasks among the available CPUs. This is not the ideal solution, because we still want to give the possibility to the user-space scheduler to assign tasks to specific CPUs. Therefore, re-introduce distinct per-CPU DSQs, but also provide a global shared DSQ. Tasks dispatched in the per-CPU DSQs are consumed from the dispatch() callback of their corresponding CPU, tasks dispatched in the global shared DSQ are consumed from any CPU. 
In this way the BPF layer is able to provide an interface that gives the flexibility to the user-space to dispatch a task on a specific CPU or on the first CPU available, depending on the particular scheduler's need. If an invalid CPU (according to the cpumask) is selected the BPF dispatcher will transparently redirect the task to a valid CPU, selected using the built-in idle selection logic. In the future we may want to improve this part, giving to the user-space the visibility of the cpumask, in order to pick a valid CPU in advance and in a proper synchronized way. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 00:33:35 +01:00
Andrea Righi	b5e846c538	scx_rustland: BPF: small refactoring No functional change, just some refactoring to make the code more clear. We have is_usersched_needed() and set_usersched_needed() that are doing different things (the former is checking if there are pending tasks for the scheduler, the latter is setting the usersched_needed flag to activate the dispatch of the user-space scheduler). Rename is_usersched_needed() to usersched_has_pending_tasks() to make the code more clear and understandable. Also move dispatch_user_scheduler() closer to the other dispatch-related helper functions. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 00:33:35 +01:00
Tejun Heo	6db362b27a	scx_rustland: Use scx_utils::user_exit_info Instead of the bespoke implementation. This also makes scx_rustland print out the debug dump, if one exists. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-31 11:44:15 -10:00
Tejun Heo	965926f393	scx_rusty: Use scx_utils::user_exit_info Instead of the bespoke implementation. This also makes scx_rusty print out the debug dump, if one exists. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-31 11:08:17 -10:00
Tejun Heo	105dc36b8f	scx_layered: Use scx_utils::user_exit_info Instead of the bespoke implementation. This also makes scx_layered print out the debug dump, if one exists. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-31 10:54:20 -10:00
Tejun Heo	4ee8104a6d	Merge pull request #114 from dschatzberg/local_avoid_enqueue scx_layered: dispatch from select_cpu if possible	2024-01-31 08:33:26 -10:00
Dan Schatzberg	11e487c165	scx_layered: dispatch from select_cpu if possible If we are doing local dispatch, we can avoid enqueue() altogether by dispatching from select_cpu() Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>	2024-01-31 09:54:26 -08:00
Jordan Rome	1b3a9a1e72	[scx_layered] downgrade prometheus-client This library at version 22 is not available in fedora: https://src.fedoraproject.org/rpms/rust-prometheus-client Rather than bothering the maintainer, let's just downgrade here.	2024-01-31 04:36:01 -08:00
Dan Schatzberg	ab5635ff6d	scx_layered: Grab idle_smtmask a bit later This is a really minor optimization, but we don't need idle_smtmask to schedule pinned tasks, so defer it so the nr_cpus_allowed == 1 path is marginally faster. Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>	2024-01-29 08:16:37 -08:00
Dan Schatzberg	8c9e65d880	scx_layered: Remove unnecessary idle_cpumask idle_cpumask isn't used at all in pick_idle_cpu_from. The only need for these cpumasks is to check if prev_cpu is a wholly idle CPU (and we only do this when smt_enabled). idle_smtmask is sufficient for that check. Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>	2024-01-29 08:16:37 -08:00
Dan Schatzberg	142b6230b2	scx_layered: Fix AFFN_VIOL stat bump Prior to this patch, we only bumped LSTAT_AFFN_VIOL when the target cpu was idle, but in both cases it should be counted as AFFN_VIOL. Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>	2024-01-26 13:13:16 -08:00
Tejun Heo	988b7d13c1	Bump versions scx_exit_info change doesn't require code to be updated but breaks binary compatibility. Bump versions and cut a new release.	2024-01-25 09:01:23 -10:00
Tejun Heo	eb997a6e55	Merge pull request #101 from dschatzberg/openmetrics scx_layered: Add support for OpenMetrics format	2024-01-25 08:59:16 -10:00
Dan Schatzberg	7f9548eb34	scx_layered: Add support for OpenMetrics format Currently scx_layered outputs statistics periodically as info! logs. The format of this is largely unstructured and mostly suitable for running scx_layered interactively (e.g. observing its behavior on the command line or via logs after the fact). In order to run scx_layered at larger scale, it's desirable to have statistics output in some format that is amenable to being ingested into monitoring databases (e.g. Prometheus). This allows collection of stats across many machines. This commit adds a command line flag (-o) that outputs statistics to stdout in OpenMetrics format instead of the normal log mechanism. OpenMetrics has a public format specification (https://github.com/OpenObservability/OpenMetrics) and is in use by many projects. The library for producing OpenMetrics metrics is lightweight but does induce some changes. Primarily, metrics need to be pre-registered (see OpenMetricsStats::new()). Without -o, the output looks as before, for example:
```
19:39:54 [INFO] CPUs: online/possible=52/52 nr_cores=26
19:39:54 [INFO] Layered Scheduler Attached
19:39:56 [INFO] tot= 9912 local=76.71 open_idle= 0.00 affn_viol= 2.63 tctx_err=0 proc=21ms
19:39:56 [INFO] busy= 1.3 util= 65.2 load= 263.4 fallback_cpu= 1
19:39:56 [INFO] batch : util/frac= 49.7/ 76.3 load/frac= 252.0: 95.7 tasks= 458
19:39:56 [INFO] tot= 2842 local=45.04 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:56 [INFO] cpus= 2 [ 0, 2] 04000001 00000000
19:39:56 [INFO] immediate: util/frac= 0.0/ 0.0 load/frac= 0.0: 0.0 tasks= 0
19:39:56 [INFO] tot= 0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:56 [INFO] cpus= 50 [ 0, 50] fbfffffe 000fffff
19:39:56 [INFO] normal : util/frac= 15.4/ 23.7 load/frac= 11.4: 4.3 tasks= 556
19:39:56 [INFO] tot= 7070 local=89.43 open_idle= 0.00 preempt= 0.00 affn_viol= 3.69
19:39:56 [INFO] cpus= 50 [ 0, 50] fbfffffe 000fffff
19:39:58 [INFO] tot= 7091 local=84.91 open_idle= 0.00 affn_viol= 2.64 tctx_err=0 proc=21ms
19:39:58 [INFO] busy= 0.6 util= 31.2 load= 107.1 fallback_cpu= 1
19:39:58 [INFO] batch : util/frac= 18.3/ 58.5 load/frac= 93.9: 87.7 tasks= 589
19:39:58 [INFO] tot= 2011 local=60.67 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:58 [INFO] cpus= 2 [ 2, 2] 04000001 00000000
19:39:58 [INFO] immediate: util/frac= 0.0/ 0.0 load/frac= 0.0: 0.0 tasks= 0
19:39:58 [INFO] tot= 0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:58 [INFO] cpus= 50 [ 50, 50] fbfffffe 000fffff
19:39:58 [INFO] normal : util/frac= 13.0/ 41.5 load/frac= 13.2: 12.3 tasks= 650
19:39:58 [INFO] tot= 5080 local=94.51 open_idle= 0.00 preempt= 0.00 affn_viol= 3.68
19:39:58 [INFO] cpus= 50 [ 50, 50] fbfffffe 000fffff
^C19:39:59 [INFO] EXIT: BPF scheduler unregistered
```
With -o passed, the output is in OpenMetrics format:
```
19:40:08 [INFO] CPUs: online/possible=52/52 nr_cores=26
19:40:08 [INFO] Layered Scheduler Attached
# HELP total Total scheduling events in the period.
# TYPE total gauge
total 8489
# HELP local % that got scheduled directly into an idle CPU.
# TYPE local gauge
local 86.45305689716104
# HELP open_idle % of open layer tasks scheduled into occupied idle CPUs.
# TYPE open_idle gauge
open_idle 0.0
# HELP affn_viol % which violated configured policies due to CPU affinity restrictions.
# TYPE affn_viol gauge
affn_viol 2.332430203793144
# HELP tctx_err Failures to free task contexts.
# TYPE tctx_err gauge
tctx_err 0
# HELP proc_ms CPU time this binary has consumed during the period.
# TYPE proc_ms gauge
proc_ms 20
# HELP busy CPU busy % (100% means all CPUs were fully occupied).
# TYPE busy gauge
busy 0.5294061026085283
# HELP util CPU utilization % (100% means one CPU was fully occupied).
# TYPE util gauge
util 27.37195512782239
# HELP load Sum of weight * duty_cycle for all tasks.
# TYPE load gauge
load 81.55024768702126
# HELP layer_util CPU utilization of the layer (100% means one CPU was fully occupied).
# TYPE layer_util gauge
layer_util{layer_name="immediate"} 0.0
layer_util{layer_name="normal"} 19.340849995024997
layer_util{layer_name="batch"} 8.031105132797393
# HELP layer_util_frac Fraction of total CPU utilization consumed by the layer.
# TYPE layer_util_frac gauge
layer_util_frac{layer_name="batch"} 29.34063385422595
layer_util_frac{layer_name="immediate"} 0.0
layer_util_frac{layer_name="normal"} 70.65936614577405
# HELP layer_load Sum of weight * duty_cycle for tasks in the layer.
# TYPE layer_load gauge
layer_load{layer_name="immediate"} 0.0
layer_load{layer_name="normal"} 11.14363313258934
layer_load{layer_name="batch"} 70.40661455443191
# HELP layer_load_frac Fraction of total load consumed by the layer.
# TYPE layer_load_frac gauge
layer_load_frac{layer_name="normal"} 13.664744680306903
layer_load_frac{layer_name="immediate"} 0.0
layer_load_frac{layer_name="batch"} 86.33525531969309
# HELP layer_tasks Number of tasks in the layer.
# TYPE layer_tasks gauge
layer_tasks{layer_name="immediate"} 0
layer_tasks{layer_name="normal"} 490
layer_tasks{layer_name="batch"} 343
# HELP layer_total Number of scheduling events in the layer.
# TYPE layer_total gauge
layer_total{layer_name="normal"} 6711
layer_total{layer_name="batch"} 1778
layer_total{layer_name="immediate"} 0
# HELP layer_local % of scheduling events directly into an idle CPU.
# TYPE layer_local gauge
layer_local{layer_name="batch"} 69.79752530933632
layer_local{layer_name="immediate"} 0.0
layer_local{layer_name="normal"} 90.86574281031143
# HELP layer_open_idle % of scheduling events into idle CPUs occupied by other layers.
# TYPE layer_open_idle gauge
layer_open_idle{layer_name="immediate"} 0.0
layer_open_idle{layer_name="batch"} 0.0
layer_open_idle{layer_name="normal"} 0.0
# HELP layer_preempt % of scheduling events that preempted other tasks.
# TYPE layer_preempt gauge
layer_preempt{layer_name="normal"} 0.0
layer_preempt{layer_name="batch"} 0.0
layer_preempt{layer_name="immediate"} 0.0
# HELP layer_affn_viol % of scheduling events that violated configured policies due to CPU affinity restrictions.
# TYPE layer_affn_viol gauge
layer_affn_viol{layer_name="normal"} 2.950379973178364
layer_affn_viol{layer_name="batch"} 0.0
layer_affn_viol{layer_name="immediate"} 0.0
# HELP layer_cur_nr_cpus Current # of CPUs assigned to the layer.
# TYPE layer_cur_nr_cpus gauge
layer_cur_nr_cpus{layer_name="normal"} 50
layer_cur_nr_cpus{layer_name="batch"} 2
layer_cur_nr_cpus{layer_name="immediate"} 50
# HELP layer_min_nr_cpus Minimum # of CPUs assigned to the layer.
# TYPE layer_min_nr_cpus gauge
layer_min_nr_cpus{layer_name="normal"} 0
layer_min_nr_cpus{layer_name="batch"} 0
layer_min_nr_cpus{layer_name="immediate"} 0
# HELP layer_max_nr_cpus Maximum # of CPUs assigned to the layer.
# TYPE layer_max_nr_cpus gauge
layer_max_nr_cpus{layer_name="immediate"} 50
layer_max_nr_cpus{layer_name="normal"} 50
layer_max_nr_cpus{layer_name="batch"} 2
# EOF
^C19:40:11 [INFO] EXIT: BPF scheduler unregistered
```
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>	2024-01-25 09:59:49 -08:00
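For readers unfamiliar with the text layout shown in the commit message, the sketch below hand-renders a gauge family in the same `# HELP` / `# TYPE` / sample shape using only the standard library. It is not how scx_layered produces its output (the commit uses an OpenMetrics library with pre-registered metrics); `render_gauge` is a hypothetical helper, it only supports the single `layer_name` label, and it does no escaping of label values.

```rust
use std::fmt::Write;

/// Append one gauge family to `out`. Each sample is an optional layer name
/// plus a value; samples without a layer are emitted without labels.
pub fn render_gauge(
    out: &mut String,
    name: &str,
    help: &str,
    samples: &[(Option<&str>, f64)],
) {
    // Writing to a String cannot fail, so unwrap() is safe here.
    writeln!(out, "# HELP {name} {help}").unwrap();
    writeln!(out, "# TYPE {name} gauge").unwrap();
    for (layer, value) in samples {
        match layer {
            // {:?} keeps a trailing ".0" on whole floats, e.g. 0.0.
            Some(layer) => {
                writeln!(out, "{name}{{layer_name=\"{layer}\"}} {value:?}").unwrap()
            }
            None => writeln!(out, "{name} {value:?}").unwrap(),
        }
    }
}
```

A caller would render every family into one `String` and finish the exposition with a `# EOF` line, as in the sample output in the commit message.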
Andrea Righi	6d89eceb93	scx_rustland: dispatch tasks only on the global DSQ Commit `c6ada25` ("scx_rustland: use custom pcpu DSQ instead of SCX_DSQ_LOCAL{_ON}") fixed the race issues with the cpumask, but it also introduced performance regressions. Until we figure out the reasons for the performance regressions, simplify the dispatcher and go back to using only the global DSQ, relying on the built-in idle cpu selection. In this way we can still enforce task affinity properly (`stress-ng --race-sched N` does not crash the scheduler) and we can also provide a better level of system responsiveness (according to the results of the stress tests done recently). The idea of this change is to make the scheduler usable in certain real-world scenarios (and as bug-free as possible), while we figure out the performance regressions of the per-CPU DSQ approach, which will likely be re-introduced later on. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-23 13:24:12 +01:00
Andrea Righi	06b5ff3d2f	scx_rustland: clarify the logic to determine interactive tasks No functional change, simply rewrite the code a bit and update the comment to clarify the logic to detect interactive tasks and apply the priority boost. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-23 08:28:44 +01:00
Andrea Righi	ab1c4f66a8	scx_rustland: allow to disable the slice boost completely Allow specifying `-b 0` to completely disable the slice boost logic and fall back to a standard vruntime-based scheduler with a variable time slice. In this way interactive tasks will not get over-prioritized over the other tasks in the system. Having this option can help to easily track down potential performance regressions arising from over-prioritizing interactive tasks. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-23 00:34:06 +01:00
Andrea Righi	b4269452fc	scx_userland: handle preemption events from higher sched_class Make sure to re-schedule the user-space scheduler if it's preempted by a task from a higher priority sched_class. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-23 00:34:06 +01:00
Andrea Righi	2426d1024f	scx_rustland: increase max amount of enqueued tasks As the scheduler is progressing towards a more stable and usable state, it may be subject to heavy stress tests. For this reason, bump up the limit of MAX_ENQUEUED_TASKS to 8192 in the BPF component, to be able to sustain task-intensive stress tests, reducing the risk of potential scheduling congestion conditions. The downside is a negligible increase in the memory footprint of the BPF component, that is worth the cost in order to have an improved scheduler stability. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-21 15:47:35 +01:00
Andrea Righi	28bf96c78e	scx_rustland: mitigate unevictable memory page faults Page faults cannot happen when the user-space scheduler is running, otherwise we may hit deadlock conditions: a kthread may need to run to resolve the page fault, but the user-space scheduler is waiting on the page fault to be resolved => deadlock. We solved this problem (mostly) in commit `9708a80` ("scx_userland: use a custom memory allocator to prevent page faults"), introducing a custom allocator for the user-space scheduler that operates on a pre-allocated mlocked memory buffer, but there is an exception that can still trigger page faults: kcompactd. When memory compaction is enabled, specifically with vm.compact_unevictable_allowed=1 (which is often the default in many distributions), kcompactd regularly attempts to compact all memory zones, such that free memory is available in contiguous blocks where feasible, including unevictable memory as well. In the event that kcompactd remaps pages within the user-space scheduler's address space, it can lead to page faults, resulting in a potential deadlock. To prevent this from happening, automatically set vm.compact_unevictable_allowed=0 when the scheduler is loaded and restore the previous value when the scheduler is unloaded. In this way we can prevent kcompactd from touching the unevictable memory associated with the user-space scheduler. Keep in mind that this is not a fully bulletproof solution: something else in the system may still set vm.compact_unevictable_allowed=1 while the scheduler is running, re-enabling the risk of deadlock. Ideally we would need a way to mark the user-space scheduler memory as "really unevictable", or a proper kernel ABI to instruct kcompactd to exclude certain tasks (or better, cgroups) from its proactive memory compaction actions, but until then, this seems to be the best way to mitigate this issue. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-21 15:47:35 +01:00
David Vernet	c6ada251ef	scx_rustland: use custom pcpu DSQ instead of SCX_DSQ_LOCAL{_ON} We still don't have a reliable and non-racy way to manage cpumasks from the user-space scheduler, so it is quite hard for the scheduler to enforce the proper CPU affinity behavior. Despite checking the cpumask in the BPF part, tasks may still be assigned to a CPU that they cannot use, triggering scheduler errors. For example, it is really easy to crash the scheduler with a simple CPU affinity stress test (`stress-ng --race-sched 8 --timeout 5`): 14:51:28 [WARN] FAIL: SCX_DSQ_LOCAL[_ON] verdict target cpu 1 not allowed for stress-ng-race-[567048] (err=1024) To prevent this issue from happening, create a custom DSQ for each CPU available in the system and use these per-CPU DSQs to dispatch all the tasks processed by the user-space scheduler, including the user-space scheduler itself. Then consume these DSQs from the .dispatch() callback of the respective CPU, to transfer all the tasks to the consuming CPU's local DSQ, preventing the cpumask race condition encountered using SCX_DSQ_LOCAL_ON. With this patch applied the `stress-ng --race-sched N` stress test can be executed successfully (even with large values of N) without causing the scheduler to crash. Signed-off-by: David Vernet <void@manifault.com> [ arighi: kick target cpu to improve responsiveness, update comments ] Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-21 15:47:35 +01:00
Jordan Rome	9f9a97a97f	Update descriptions in cargo toml files	2024-01-19 18:19:46 -08:00
Andrea Righi	be1cb8774b	scx_rustland: improve SMT performance The user-space scheduler dispatches tasks in batches, with the batch size matching the number of idle CPUs. Commit `791bdbe` ("scx_rustland: introduce SMT support") changed the order of idle CPUs, prioritizing dispatching tasks on the least busy cores (those with the most idle CPUs) before moving on to busier cores (those with the least idle CPUs). While this approach works well for a small number of tasks, it can lead to uneven performance as the number of tasks increases and all cores are saturated. Such uneven performance can be attributed to SMT interactions causing potential short lags and erratic system performance. In some cases, disabling SMT entirely results in better system responsiveness. To address this issue, instruct the scheduler to implicitly disable SMT and consistently dispatch tasks only on the first (or last) CPU of each core. This approach ensures an equal distribution of tasks among the available cores, preventing SMT disturbances and aligning with non-SMT performance, even when a significant number of tasks are running. Additionally, the unused sibling CPUs within each core can be used as "spare" CPUs for the BPF dispatcher. This is particularly beneficial for tasks that cannot be dispatched on the target CPU selected by the scheduler, due to cpumask restrictions or congestion conditions. Therefore, this new approach enhances system responsiveness on SMT systems while simultaneously improving scheduler stability. Some preliminary results on an AMD Ryzen 7 5800X 8-Cores (SMT enabled): my usual benchmark, measuring the fps of a videogame (Counter-Strike 2) during a parallel kernel build-induced system overload, shows an improvement of approximately 2x (from 8-10fps to 15-25fps vs 1-2fps with EEVDF). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-17 20:49:17 +01:00
Andrea Righi	f0c33320ab	scx_rustland: avoid calling scx_bpf_kick_cpu() from update_idle() Prior to commit `676bd88` ("bpf_rustland: do not dispatch the scheduler to the global DSQ"), the user-space scheduler was dispatched using SCX_DSQ_GLOBAL and we needed to explicitly kick idle CPUs from update_idle() to ensure that at least one CPU was available to run the user-space scheduler. Now that we are using SCX_DSQ_LOCAL_ON|cpu to dispatch the user-space scheduler, the target CPU is implicitly kicked. Therefore, the call to scx_bpf_kick_cpu() within .update_idle() becomes redundant and we can get rid of it. Fixes: `676bd88` ("bpf_rustland: do not dispatch the scheduler to the global DSQ") Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-17 20:49:17 +01:00
Andrea Righi	0b3c399519	scx_rustland: introduce dynamic slice boost Update the slice boost dynamically, as a function of the number of CPUs in the system and the number of tasks currently waiting to be dispatched: as the number of waiting tasks in the task_pool increases, reduce the slice boost. This adjustment ensures that the scheduler adheres more closely to a pure vruntime-based policy as the number of tasks contending for the available CPUs increases, and it allows the scheduler to sustain stress tests that spawn a massive number of tasks. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-16 11:51:51 +01:00
Andrea Righi	791bdbec97	scx_rustland: introduce SMT support Introduce basic support for CPU topology awareness. With this change, the scheduler will prioritize dispatching tasks to idle CPUs with fewer busy SMT siblings, then it will proceed to CPUs with more busy SMT siblings, in ascending order. To implement this, introduce a new CoreMapping abstraction that provides a mapping of the available core IDs in the system along with their corresponding lists of CPU IDs. This, coupled with the get_cpu_pid() method from the BpfScheduler abstraction, allows the user-space scheduler to enforce the policy outlined above and improve performance on SMT systems. Keep in mind that this improvement is relevant only when the number of tasks running in the system is less than the number of CPUs. As soon as the number of running tasks increases, they will be distributed across all available CPUs and cores, thereby negating the advantages of SMT isolation. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-16 11:33:35 +01:00
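
The policy described in `791bdbec97` (idle CPUs on cores with fewer busy SMT siblings first) can be illustrated with a minimal sketch. This is not the code from the commit: the struct below only borrows the `CoreMapping` name, and `from_pairs`, `idle_cpus_by_busy_siblings` and the `is_busy` predicate are hypothetical names chosen for illustration. In the real scheduler the busy check would presumably rely on `get_cpu_pid()`, which is not modelled here.

```rust
use std::collections::BTreeMap;

/// Illustrative stand-in for the CoreMapping idea: core ID -> CPU IDs on that core.
/// (Hypothetical layout; the actual struct in scx_rustland may differ.)
pub struct CoreMapping {
    cores: BTreeMap<u32, Vec<u32>>,
}

impl CoreMapping {
    /// Build the mapping from (cpu_id, core_id) pairs supplied by the caller.
    pub fn from_pairs(pairs: &[(u32, u32)]) -> Self {
        let mut cores: BTreeMap<u32, Vec<u32>> = BTreeMap::new();
        for &(cpu, core) in pairs {
            cores.entry(core).or_default().push(cpu);
        }
        // Keep each core's CPU list sorted and free of duplicates.
        for cpus in cores.values_mut() {
            cpus.sort_unstable();
            cpus.dedup();
        }
        Self { cores }
    }

    /// Return the idle CPUs, ordered so that CPUs on cores with fewer busy
    /// siblings come first (ascending number of busy siblings per core).
    /// `is_busy` is an assumed predicate telling whether a CPU runs a task.
    pub fn idle_cpus_by_busy_siblings(&self, is_busy: impl Fn(u32) -> bool) -> Vec<u32> {
        let mut per_core: Vec<(usize, Vec<u32>)> = self
            .cores
            .values()
            .map(|cpus| {
                let busy = cpus.iter().filter(|&&c| is_busy(c)).count();
                let idle: Vec<u32> = cpus.iter().copied().filter(|&c| !is_busy(c)).collect();
                (busy, idle)
            })
            .collect();
        // Stable sort: cores with the same busy count keep core-ID order.
        per_core.sort_by_key(|(busy, _)| *busy);
        per_core.into_iter().flat_map(|(_, idle)| idle).collect()
    }
}
```

For example, with two cores holding CPUs {0, 2} and {1, 3} and only CPU 2 busy, calling `idle_cpus_by_busy_siblings(|c| c == 2)` lists CPUs 1 and 3 (fully idle core) before CPU 0.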


123 Commits