Commit Graph

397 Commits

Author SHA1 Message Date
Andrea Righi
5cf113f058 scx_rustland_core: provide DispatchedTask API methods
Provide distinct methods to set the target CPU and the per-task time
slice to dispatched tasks.

Moreover, provide a constructor to create a DispatchedTask from a
QueuedTask (this makes it possible to automatically bounce a task from the
scheduler to the BPF dispatcher without having to take care of setting
the individual task's attributes).

This also makes it possible to keep most of the attributes of DispatchedTask
private; in particular, it hides cpumask_cnt, which should only be
used internally between the BPF and the user-space component.
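
For illustration, a minimal self-contained sketch of the API shape described
above (field and method names here are simplified stand-ins, not the crate's
actual definitions):

```rust
// Simplified stand-ins for the scx_rustland_core types described above.
struct QueuedTask {
    pid: i32,
    cpu: i32,
    cpumask_cnt: u64,
}

struct DispatchedTask {
    pid: i32,
    cpu: i32,
    cpumask_cnt: u64, // internal to the BPF <-> user-space protocol
    slice_ns: u64,    // 0 = use the global time slice
}

impl DispatchedTask {
    // Constructor from a QueuedTask: attributes are carried over automatically.
    fn new(task: &QueuedTask) -> Self {
        Self {
            pid: task.pid,
            cpu: task.cpu,
            cpumask_cnt: task.cpumask_cnt,
            slice_ns: 0,
        }
    }

    fn set_cpu(&mut self, cpu: i32) {
        self.cpu = cpu;
    }

    fn set_slice_ns(&mut self, slice_ns: u64) {
        self.slice_ns = slice_ns;
    }
}

fn main() {
    let queued = QueuedTask { pid: 1234, cpu: 2, cpumask_cnt: 7 };
    let mut task = DispatchedTask::new(&queued);
    task.set_cpu(queued.cpu);     // keep the CPU chosen in .select_cpu()
    task.set_slice_ns(5_000_000); // custom 5 ms time slice
    println!("pid={} cpu={} slice_ns={}", task.pid, task.cpu, task.slice_ns);
}
```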

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-03-03 15:49:37 +01:00
Andrea Righi
e10f8a2d8e scx_rustland_core: introduce per-task time slice
Provide a way to set a different time slice per-task, by adding a new
attribute slice_ns to the DispatchedTask struct.

This attribute determines the time slice assigned to the task; if it is
set to 0, the global time slice (either the default one or the
effective one, if set) will be used.

At the same time, remove the payload attribute, which is basically unused
(scx_rustland uses it to send the task's vruntime to the BPF dispatcher
for debugging purposes, but it's not very useful anymore at this point).

In the future we may introduce a proper interface to attach a custom
payload to each task.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-03-03 15:06:56 +01:00
Jordan Rome
499924ead8 Add libbpf as a submodule
This is to potentially reduce issues with folks
using different versions of libbpf at runtime.

This also:
- makes static linking of libbpf the default
- adds steps in `meson setup` to fetch libbpf and make it
2024-03-01 12:39:35 -08:00
Andrea Righi
0d1c6555a4 scx_rustland_core: generate source files in-tree
There is no need to generate source code in a temporary directory with
RustLandBuilder(); we can simply generate code in-tree and exclude the
generated source files via .gitignore.

Having the generated source files in-tree can help to debug potential
build issues (and it also allows us to drop the tempfile crate
dependency).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
Andrea Righi
2ac1a5924f scx_rustland_core: introduce RustLandBuilder()
Introduce a wrapper around scx_utils::BpfBuilder that can be used to build
the BPF component provided by scx_rustland_core.

The source of the BPF component (main.bpf.c) is included in the crate
as an array of bytes; its content is then unpacked into a temporary file
to perform the build.

The RustLandBuilder() helper is also used to generate bpf.rs (which
implements the low-level user-space Rust connector to the BPF
component).

Schedulers based on scx_rustland_core can simply use RustLandBuilder(),
to build the backend provided by scx_rustland_core.
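
For example, a scheduler's build.rs could boil down to something like the
following sketch (the exact constructor and builder method names are
assumptions, not necessarily the crate's actual API):

```rust
// build.rs of a scheduler based on scx_rustland_core: unpack the embedded
// main.bpf.c, compile it and generate the bpf.rs connector (method names
// are illustrative).
fn main() {
    scx_rustland_core::RustLandBuilder::new()
        .expect("failed to initialize RustLandBuilder")
        .build()
        .expect("failed to build the scx_rustland_core BPF backend");
}
```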

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
Andrea Righi
e23426e299 scx_rustland_core: introduce method bpf.update_tasks()
Introduce a helper function to update the counter of queued and
scheduled tasks (used to notify the BPF component if the user-space
scheduler still has some pending work to do).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
Andrea Righi
00e25530bc scx_rlfifo: simple user-space FIFO scheduler written in Rust
Implement a FIFO scheduler as an example usage of scx_rustland_core.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
Andrea Righi
cf43129d89 scx_rustland: update documentation
scx_rustland has significantly evolved since its original design.

With the introduction of scx_rustland_core and the inclusion of the
scx_rlfifo example, scx_rustland's focus can shift from merely being an
"easy-to-read Rust scheduler template" to being a fully functional
scheduler.

For this reason, update the README and documentation to reflect its
revised design, objectives, and intended use cases.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
Andrea Righi
871a6c10f9 scx_rustland_core: include scx_rustland backend
Move the BPF component of scx_rustland to scx_rustland_core and make it
available to other user-space schedulers.

NOTE: main.bpf.c and bpf.rs are not pre-compiled in the
scx_rustland_core crate; they need to be included in the user-space
scheduler's source code in order to be compiled/linked properly.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
Andrea Righi
416d6a940f rust: introduce scx_rustland_core crate
Introduce a separate crate (scx_rustland_core) that can be used to
implement sched-ext schedulers in Rust that run in user-space.

This commit only provides the basic layout for the new crate and the
abstraction to the custom allocator.

In general, any scheduler that has a user-space component needs to use
the custom allocator to prevent potential deadlock conditions caused by
page faults (a kthread needs to run to resolve the page fault, but the
scheduler is blocked waiting for the user-space page fault to be
resolved => deadlock).

However, we don't necessarily want to enforce this constraint on all the
existing Rust schedulers; some of them may do all user-space allocations
in safe paths, hence the separate scx_rustland_core crate.

Merging this code in scx_utils would force all the Rust schedulers to
use the custom allocator.

In a future commit the scx_rustland backend will be moved to
scx_rustland_core, making it a totally generic BPF scheduler framework
that can be used to implement user-space schedulers in Rust.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-28 17:49:44 +01:00
David Vernet
8b04a2687f
rusty: Use new infeasible crate
Now that we have a new 'infeasible' crate that abstracts the logic for
implementing the infeasible weights solution, let's update rusty to use
it.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-26 10:51:54 -06:00
David Vernet
87eab38506
rustland: Update rustland to use topology.rs
The new topology crate allows us to replace the custom rustland topology
logic with the logic in the topology crate itself.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-23 13:09:06 -06:00
David Vernet
43624a87ce
rusty: Use new topology crate
Now that we have this new Topology crate, let's update Rusty to use it
instead of using the old one.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-23 10:39:55 -06:00
Tejun Heo
4dc77f8ddf
Merge pull request #149 from davemarchevsky/davemarchevsky_nice_equals
scx_layered: Add MATCH_NICE_EQUALS match kind
2024-02-22 06:38:17 -10:00
Dave Marchevsky
9f510f18cd scx_layered: Add MATCH_NICE_EQUALS match kind
I have a use case where specific nice values are used to bucket tasks
into groups that are handled separately by different `scx_layered`
policies, with no implications of relative priority between niceness X,
X + 1, X - 1, etc. In other words, nicevals are used as simple tags in
this scenario.

If we wanted to treat a specific niceness this way e.g. `11`, we could
do so with AND'd MATCH_NICE_{ABOVE,BELOW} like so:

```json
  "matches" : [
    [
      {
        "NiceAbove": 10
      },
      {
        "NiceBelow": 12
      }
    ]
  ],
```

But this is unnecessarily verbose and doesn't communicate the intent of
the match very well. Adding a `NiceEquals` matcher simplifies the
config and makes intent obvious:

```json
  "matches" : [
    [
      {
        "NiceEquals": 11
      }
    ]
  ],
```

This PR adds support for such a matcher.

Also, rename `layer_match.nice_above_or_below` to just
`layer_match.nice`, as the former doesn't describe the newly-added
matcher's use of the field. It's still obvious that `layer_match.nice`
is relevant to MATCH_NICE_{ABOVE, BELOW, EQUALS}.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
2024-02-22 04:08:07 -08:00
David Vernet
615b594e1c
layered: Don't refresh cpumasks before attaching
As mentioned in the previous commit, for some reason we're sometimes
(non-deterministically) not seeing the updated cpumask / layer values in
BPF if we initialize the cpumasks here before attaching. Let's undo this
for now to avoid observing buggy behavior, until we figure it out.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-21 19:19:45 -06:00
David Vernet
68d317079a
Revert "layered: Set layered cpumask in scheduler init call"
This reverts commit 56ff3437a2.

For some reason we seem to be non-deterministically failing to see the
updated layer values in BPF if we initialize before attaching. Let's
just undo this specific part so that we're no longer blocked by this
breakage, and we can figure it out async.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-21 19:17:19 -06:00
David Vernet
31df8fbd09
layered: Consume from layer with cpumask in layered_dispatch
Currently, in layered_dispatch, we do the following:

1. Iterate over all preempt=true layers, and first try to consume from
   them.

2. Iterate over all confined layers, and consume from them if we find a
   layer with a cpumask that contains the consuming CPU.

3. Iterate over all grouped and open layers and consume from them in
   ordered sequence.

In (2), we're only iterating over confined layers, but we should also be
iterating over grouped layers. Otherwise, despite a consuming CPU being
allocated to a specific grouped layer, the CPU will consume from
whichever grouped or open layer has a task that's ready to run.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-21 15:38:23 -06:00
David Vernet
56ff3437a2
layered: Set layered cpumask in scheduler init call
In layered_init, we're currently setting all bits in every layer's
cpumask, and then asynchronously updating the cpumasks at a later time to
reflect their actual values at runtime. Now that we're updating the
layered code to initialize the cpumasks before we attach the scheduler,
we can instead have the init path actually refresh and initialize the
cpumasks directly.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-21 15:38:23 -06:00
David Vernet
1f834e7f94
layered: Initialize layers before attaching scheduler
We currently have a bug in layered wherein we could fail to propagate
layer updates from user space to kernel space if a layer is never
adjusted after it's first initialized. For example, in the following
configuration:

[
	{
		"name": "workload.slice",
		"comment": "main workload slice",
		"matches": [
			[
				{
					"CgroupPrefix": "workload.slice/"
				}
			]
		],
		"kind": {
			"Grouped": {
				"cpus_range": [30, 30],
				"util_range": [
					0.0,
					1.0
				],
				"preempt": false
			}
		}
	},
	{
		"name": "normal",
		"comment": "the rest",
		"matches": [
			[]
		],
		"kind": {
			"Grouped": {
				"cpus_range": [2, 2],
				"util_range": [
					0.0,
					1.0
				],
				"preempt": false
			}
		}
	}
]

Both layers are static, and need only be resized a single time, so the
configuration would never be propagated to the kernel due to us never
calling update_bpf_layer_cpumask(). Let's instead have the
initialization propagate changes to the skeleton before we attach the
scheduler.

This has the advantage both of fixing the bug mentioned above, where a
static configuration is never propagated to the kernel, and of avoiding
a short period right after the scheduler is first attached in which we
don't make optimal scheduling decisions because the layer resizing has
not yet been propagated.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-21 15:38:21 -06:00
Tejun Heo
22d635c385
Merge pull request #141 from jordalgo/rusty-logging
Add libbpf logging to rust schedulers
2024-02-20 13:52:39 -10:00
Andrea Righi
80de48ec83 scx_rustland: introduce --builtin-idle
Add a command line option to enable/disable the sched-ext built-in idle
selection logic in the user-space scheduler.

With this option the user-space scheduler will try to dispatch tasks on
the CPU selected during the .select_cpu() phase (using the built-in idle
selection logic).

Without this option the user-space scheduler will try to dispatch tasks
to the first CPU available.

The former can be useful to improve throughput, since tasks are more
likely to stick on the same CPU, while the latter can provide better
system responsiveness, especially when the system is significantly busy.

Given that, by default, tasks can be dispatched directly bypassing the
user-space scheduler if an idle CPU is found during .select_cpu(), the
user-space scheduler is primarily engaged only when the system is busy
(no idle CPUs are available). Under these circumstances, it is typically
more efficient to dispatch tasks on the first available CPU. Hence, the
default behavior is to ignore built-in idle selection logic in the
user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-21 00:25:14 +01:00
Andrea Righi
e487d71032 scx_rustland: simplify CPU selection by relying on built-in idle selection
Checking if a CPU is idle or busy in the user-space scheduler is a bit
redundant, considering that we also rely on the built-in idle selection
logic in the BPF part.

Therefore get rid of the additional idle selection logic in the
user-space scheduler and rely on the built-in idle selection.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-21 00:25:14 +01:00
Andrea Righi
2cd1d4b684 scx_rustland: introduce --full-user
Introduce an option to send all scheduling events and actions to
user-space, disabling any form of in-kernel optimization.

Enabling this option will likely make the system less responsive (but
more predictable in terms of performance) and it can be useful for
debugging purposes.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-21 00:25:14 +01:00
Jordan Rome
7c32acece0 Add libbpf logging to the rust schedulers
This is to get better logs when failing to load, attach, etc.
2024-02-20 15:17:10 -08:00
David Vernet
ef8aa9ea31
add documentation
Signed-off-by: David Vernet <void@manifault.com>
2024-02-20 14:57:09 -06:00
David Vernet
8aba090d4f
rust: Add topology module to utils crate
scx_rusty has logic in the scheduler to inspect the host to
automatically build scheduling domains across every L3 cache. This would
be generically useful for many different types of schedulers, so let's
add it to the scx_utils crate so it can be used by others.

Signed-off-by: David Vernet <void@manifault.com>
2024-02-20 14:57:09 -06:00
Andrea Righi
7ff06a6ff0 scx_rustland: prevent misaligned pointer dereference
The buffer used to store struct queued_task_ctx items fetched from the
BPF ring buffer needs to be aligned to the architecture register size,
otherwise we may hit misaligned pointer dereference issues, such as:

  thread 'main' panicked at src/bpf.rs:162:43:
  misaligned pointer dereference: address must be a multiple of 0x8 but is 0x56516a51e004
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Prevent this by making sure the buffer is always aligned to 64-bits.
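
One simple way to obtain such an aligned buffer (sizes here are illustrative)
is to back the byte buffer with u64 storage:

```rust
// Back the byte buffer with u64 elements so the start address is always
// 8-byte aligned, regardless of what the allocator returns for plain u8.
const BUFSIZE: usize = 64 * 1024;

fn main() {
    let mut storage = vec![0u64; BUFSIZE / std::mem::size_of::<u64>()];
    let buf: &mut [u8] = unsafe {
        std::slice::from_raw_parts_mut(storage.as_mut_ptr() as *mut u8, BUFSIZE)
    };
    // Reading a queued_task_ctx copied into `buf` through an 8-byte-aligned
    // pointer no longer trips the misaligned pointer dereference check.
    assert_eq!(buf.as_ptr() as usize % 8, 0);
}
```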

Fixes: 93dc615 ("scx_rustland: use a ring buffer for queued tasks")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-20 19:08:38 +01:00
Andrea Righi
93dc615653 scx_rustland: use a ring buffer for queued tasks
Switch from a BPF_MAP_TYPE_QUEUE to a BPF_MAP_TYPE_RINGBUF to store the
tasks that need to be processed by the user-space scheduler.

A ring buffer saves a lot of memory copies and syscalls, since
the memory is directly shared between the BPF and the user-space
components.
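
As a rough sketch, consuming the queued tasks through libbpf-rs's
RingBufferBuilder looks roughly like this (the map handle and the parsing of
queued_task_ctx are assumed and left out):

```rust
use std::time::Duration;

use libbpf_rs::RingBufferBuilder;

// `queued_map` is the BPF ring buffer map handle taken from the generated
// skeleton; how it is obtained is omitted here.
fn poll_queued_tasks(queued_map: &libbpf_rs::Map) -> Result<(), libbpf_rs::Error> {
    let mut builder = RingBufferBuilder::new();
    builder.add(queued_map, |data: &[u8]| {
        // `data` points directly into the memory shared with the BPF side:
        // parse a queued_task_ctx out of it and hand it to the scheduler.
        0 // returning non-zero stops consumption
    })?;
    let rb = builder.build()?;

    // Drain whatever is currently queued without blocking.
    rb.poll(Duration::from_millis(0))?;
    Ok(())
}
```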

Performance profile before this change:

  2.44%  [kernel]  [k] __memset
  2.19%  [kernel]  [k] __sys_bpf
  1.59%  [kernel]  [k] __kmem_cache_alloc_node
  1.00%  [kernel]  [k] _copy_from_user

After this change:

  1.42%  [kernel]  [k] __memset
  0.14%  [kernel]  [k] __sys_bpf
  0.10%  [kernel]  [k] __kmem_cache_alloc_node
  0.07%  [kernel]  [k] _copy_from_user

The overhead of both sys_bpf() and copy_from_user() is now reduced by a
factor of ~15x (only the dispatch path still uses sys_bpf()).

NOTE: despite being very effective, the current implementation is a bit
of a hack. This is because the present ring buffer API exclusively
permits consumption in a greedy manner, where multiple items can be
consumed simultaneously. However, libbpf-rs does not provide precise
information regarding the exact number of items consumed. By utilizing a
more refined libbpf-rs API [1] we may be able to improve this code a
bit.

Moreover, libbpf-rs doesn't provide an API for the user_ring_buffer, so
at the moment there's not a trivial way to apply the same change to the
dispatched tasks.

However, just with this change applied, the overhead of sys_bpf() and
copy_from_user() is already minimal, so we won't get much benefit by
changing the dispatch path to use a BPF ring buffer.

[1] https://github.com/libbpf/libbpf-rs/pull/680

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-20 12:30:22 +01:00
Andrea Righi
04685e633f scx_rustland: avoid memory copies while accessing cpu_map
Instead of using a BPF_MAP_TYPE_ARRAY to store which tasks are running
on which CPU we can simply use a global array, mapped in the user-space
address space.

In this way we can avoid a lot of memory copies and calls to sys_bpf(),
significantly reducing the scheduler's overhead.

Keep in mind that we don't need to be 100% correct while accessing this
information, so we can accept some fuzziness in order to significantly
reduce the scheduler's overhead.

Performance profile before this change:

   5.52%  [kernel]  [k] __sys_bpf
   4.84%  [kernel]  [k] __kmem_cache_alloc_node
   4.71%  [kernel]  [k] map_lookup_elem
   4.10%  [kernel]  [k] _copy_from_user
   3.51%  [kernel]  [k] bpf_map_copy_value
   3.12%  [kernel]  [k] check_heap_object

After this change:

   2.20%  [kernel]  [k] __sys_bpf
   1.91%  [kernel]  [k] map_lookup_and_delete_elem
   1.60%  [kernel]  [k] __kmem_cache_alloc_node
   1.10%  [kernel]  [k] _copy_from_user
   0.12%  [kernel]  [k] check_heap_object
                    n/a bpf_map_copy_value
                    n/a map_lookup_elem

With this change we can reduce the overhead of sys_bpf() by ~2x and
the overhead of copy_from_user() by ~4x.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-20 12:30:16 +01:00
Andrea Righi
fc889c6995 scx_rustland: replace custom allocator with buddy-alloc
Currently, the primary bottleneck in scx_rustland lies within its custom
memory allocator, which is used to prevent page faults in the user-space
scheduler.

This is pretty evident looking at perf top:

  39.95%  scx_rustland             [.] <scx_rustland::bpf::alloc::RustLandAllocator as core::alloc::global::GlobalAlloc>::alloc
   3.41%  [kernel]                 [k] _copy_from_user
   3.20%  [kernel]                 [k] __kmem_cache_alloc_node
   2.59%  [kernel]                 [k] __sys_bpf
   2.30%  [kernel]                 [k] __kmem_cache_free
   1.48%  libc.so.6                [.] syscall
   1.45%  [kernel]                 [k] __virt_addr_valid
   1.42%  scx_rustland             [.] <scx_rustland::bpf::alloc::RustLandAllocator as core::alloc::global::GlobalAlloc>::dealloc
   1.31%  [kernel]                 [k] _copy_to_user
   1.23%  [kernel]                 [k] entry_SYSRETQ_unsafe_stack

However, there's no need to reinvent the wheel here: rather than relying
on an overly simplistic and inefficient allocator, we can rely on
buddy-alloc [1], which is also capable of operating on a preallocated
memory buffer.

After switching to buddy-alloc, the performance profile under the same
workload conditions looks like the following:

   6.01%  [kernel]                 [k] _copy_from_user
   5.21%  [kernel]                 [k] __kmem_cache_alloc_node
   4.45%  [kernel]                 [k] __sys_bpf
   3.80%  [kernel]                 [k] __kmem_cache_free
   2.79%  libc.so.6                [.] syscall
   2.34%  [kernel]                 [k] __virt_addr_valid
   2.26%  [kernel]                 [k] _copy_to_user
   2.14%  [kernel]                 [k] __check_heap_object
   2.10%  [kernel]                 [k] __check_object_size.part.0
   2.02%  [kernel]                 [k] entry_SYSRETQ_unsafe_stack

With this change in place, the primary overhead is now moved to the
bpf() syscall and the copies between kernel and user-space (this could
potentially be optimized in the future using BPF ring buffers, instead
of BPF FIFO queues).

A closer look at the allocator overhead before vs. after this change:

 [before]
 39.95%  scx_rustland  [.] core::alloc::global::GlobalAlloc>::alloc
  1.42%  scx_rustland  [.] core::alloc::global::GlobalAlloc>::dealloc

 [after]
  1.50%  scx_rustland  [.] core::alloc::global::GlobalAlloc>::alloc
  0.76%  scx_rustland  [.] core::alloc::global::GlobalAlloc>::dealloc

[1] https://crates.io/crates/buddy-alloc

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-11 14:33:39 +01:00
Andrea Righi
ccf5946425 scx_rustland: speed up search by PID in tasks BTreeSet
In order to prevent duplicate PIDs in the TaskTree (BTreeSet), we
perform an O(N) search each time we add an item, to verify whether the
PID already exists or not.

Under heavy stress test conditions the O(N) complexity can have a
potential impact on the overall performance.

To mitigate this, introduce a HashMap that can be used to retrieve tasks
by PID, typically with O(1) complexity. This could potentially degrade
to O(N) in the presence of hash collisions, but even in this case, accessing
the hash map is still more efficient than scanning all the entries in
the BTreeSet to search for the target PID.
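
Conceptually, the data structure looks like this (a simplified sketch, not
the actual scx_rustland code):

```rust
use std::collections::{BTreeSet, HashMap};

// Tasks are ordered by (vruntime, pid); the HashMap gives O(1) lookup by PID
// so duplicates can be detected without scanning the whole set.
#[derive(Default)]
struct TaskTree {
    tasks: BTreeSet<(u64, i32)>,    // (vruntime, pid)
    vruntime_of: HashMap<i32, u64>, // pid -> vruntime of the queued entry
}

impl TaskTree {
    fn push(&mut self, pid: i32, vruntime: u64) {
        // If the PID is already queued, drop the stale entry first, so that a
        // re-enqueued task is never stored (and dispatched) twice.
        if let Some(old) = self.vruntime_of.insert(pid, vruntime) {
            self.tasks.remove(&(old, pid));
        }
        self.tasks.insert((vruntime, pid));
    }

    fn pop(&mut self) -> Option<(u64, i32)> {
        let first = *self.tasks.iter().next()?;
        self.tasks.remove(&first);
        self.vruntime_of.remove(&first.1);
        Some(first)
    }
}
```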

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-11 14:11:38 +01:00
Andrea Righi
7ce0d038e4
Merge pull request #133 from sched-ext/rustland-cpumask-gen-cnt
scx_rustland: per-task cpumask generation counter
2024-02-10 19:07:02 +01:00
Andrea Righi
61d1ed338a scx_rustland: per-task cpumask generation counter
Introduce a per-task generation counter to check the validity of the
cpumask at dispatch time.

The logic is the following:

 - the cpumask generation number is incremented every time a task
   calls .set_cpumask()

 - when a task is enqueued the current generation number is stored in
   the queued_task_ctx and relayed to the user-space scheduler

 - the user-space scheduler can decide to dispatch the task on the CPU
   determined by the BPF layer in .select_cpu(), redirect the task to
   any other specific CPU, or redirect to the first CPU available (using
   NO_CPU)

 - task is then dispatched back to the BPF code along with its cpumask
   generation counter

 - at dispatch time the BPF code checks if the generation number is the
   same and it discards the dispatch attempt if the cpumask is not valid
   anymore (the task will be automatically re-enqueued by the sched-ext
   core code, potentially selecting another CPU / cpumask)

 - if the cpumask is valid, but the CPU selected by the user-space
   scheduler is invalid (according to the cpumask), the task will be
   transparently bounced by the BPF code to the shared DSQ (in this way
   the user-space code can be completely abstracted and dispatches that
   target invalid CPUs can be automatically fixed by the BPF layer)

This solution can prevent stalls due to dispatches targeting invalid
CPUs and it can also avoid redundant dispatch events, making the code
more efficient and the cpumask interlocking more reliable.
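
The dispatch-time check itself is trivial; in pseudo-Rust form (the real
check lives in the BPF code, so this is only an illustration of the idea):

```rust
struct DispatchedTask {
    cpumask_cnt: u64, // generation number observed when the task was enqueued
}

// Discard the dispatch if .set_cpumask() ran after the task was enqueued:
// the CPU chosen by the user-space scheduler may no longer be allowed, and
// the sched-ext core will re-enqueue the task anyway.
fn dispatch_is_valid(task: &DispatchedTask, current_cpumask_cnt: u64) -> bool {
    task.cpumask_cnt == current_cpumask_cnt
}
```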

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-10 18:02:42 +01:00
David Vernet
1c00de9402
Merge pull request #129 from sched-ext/infeasible_weights
Implement solution to infeasible weights problem
2024-02-09 16:23:56 -06:00
David Vernet
e627176d90
scx: Implement solution to infeasible weights problem
As described in [0], there is an open problem in load balancing called
the "infeasible weights" problem. Essentially, the problem boils down to
the fact that a task with disproportionately high load can be granted
more CPU time than it can actually consume per its duty cycle.

This patch implements a solution to that problem, wherein we apply the
algorithm described in this paper to adjust all infeasible weights in
the system down to a feasible weight that gives them their full duty
cycle, while allowing the remaining feasible tasks on the system to
share the remaining compute capacity on the machine.
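
A sketch of the general fixed-point idea (not necessarily the exact algorithm
used by the crate): repeatedly cap tasks whose proportional share exceeds
their duty cycle and redistribute the remaining capacity among the rest.

```rust
/// `weights[i]` is the task weight, `duty[i]` the maximum fraction of one CPU
/// the task can actually consume, `capacity` the total CPU capacity
/// (e.g. nr_cpus as f64). Returns the adjusted (feasible) allocation.
fn feasible_alloc(weights: &[f64], duty: &[f64], mut capacity: f64) -> Vec<f64> {
    let mut alloc = vec![0.0; weights.len()];
    let mut active: Vec<usize> = (0..weights.len()).collect();

    loop {
        let w_sum: f64 = active.iter().map(|&i| weights[i]).sum();
        if w_sum <= 0.0 {
            return alloc;
        }
        // Tasks whose proportional share exceeds what they can consume.
        let infeasible: Vec<usize> = active
            .iter()
            .copied()
            .filter(|&i| weights[i] / w_sum * capacity > duty[i])
            .collect();
        if infeasible.is_empty() {
            // The remaining tasks split the remaining capacity by weight.
            for &i in &active {
                alloc[i] = weights[i] / w_sum * capacity;
            }
            return alloc;
        }
        // Cap infeasible tasks at their duty cycle and redistribute the rest.
        for &i in &infeasible {
            alloc[i] = duty[i];
            capacity -= duty[i];
        }
        active.retain(|i| !infeasible.contains(i));
        if active.is_empty() {
            return alloc;
        }
    }
}
```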

[0]: https://drive.google.com/file/d/1fAoWUlmW-HTp6akuATVpMxpUpvWcGSAv/view?usp=drive_link

Signed-off-by: David Vernet <void@manifault.com>
2024-02-09 16:23:12 -06:00
Andrea Righi
8e47602f00 scx_rustland: keep default CPU selection when idle
Dispatch to the shared DSQ (NO_CPU) only when the assigned CPU is not
idle anymore; otherwise keep the same CPU that has been assigned by
the BPF layer.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-08 22:48:07 +01:00
Andrea Righi
7085d57709 scx_rustland: kick user-space scheduler when a CPU is released
When the system is not being fully utilized there may be delays in
promptly awakening the user-space scheduler.

This can happen for example, when some CPU-intensive tasks are
constantly dispatched bypassing the user-space scheduler (e.g., using
SCX_DSQ_LOCAL) and other CPUs are completely idle.

Under this condition the update_idle() callback can fail to activate the
user-space scheduler, because there are no pending events, and only the
periodic timer will wake up the scheduler, potentially introducing lags
of up to 1 sec.

This can be reproduced, for example, by running a video game that doesn't
use all the CPUs available in the system (e.g., Team Fortress 2). With
this game it is pretty easy to notice sporadic lags that resolve only
after ~1sec, due to the periodic timer kicking the scheduler.

To prevent this from happening wake up the user-space scheduler
immediately as soon as a CPU is released, speculating on the fact that
most of the time there will be always another task ready to run.

This can introduce a little more overhead in the scheduler (due to
potential unnecessary wake up events), but it also prevents stuttery
behaviors and it makes the system much more smooth and responsive,
especially with video games.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-08 22:48:07 +01:00
Andrea Righi
cb82d91e0f scx_rustland: use scx_bpf_dispatch_cancel()
Use scx_bpf_dispatch_cancel() to invalidate dispatches to the wrong per-CPU
DSQ, due to cpumask race conditions, and redirect them to the shared
DSQ.

This prevents dispatching tasks to CPUs that cannot be used according to
the task's cpumask.

With this applied the scheduler passed all the `stress-ng --race-sched`
stress tests.

Moreover, introduce a counter that is periodically reported to stdout as
an additional statistic, which can be helpful for debugging.

Link: https://github.com/sched-ext/sched_ext/pull/135
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-08 22:48:07 +01:00
Andrea Righi
13e23e8cc9 scx_rustland: dump scheduler statistics before exiting
Print all the scheduler statistics before exiting. Reporting the very
last state of the scheduler can help to debug events that could trigger
error conditions (such as page faults, scheduler congestions, etc.).

While at it, also fix some minor coding style issues (tabs vs spaces).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-08 15:37:44 +01:00
David Vernet
c574598dc7
scx_rusty: Fix typos
Signed-off-by: David Vernet <void@manifault.com>
2024-02-07 23:38:26 -06:00
Tejun Heo
2062d1ad1f scx: Add compat support for SCX_KICK_IDLE and use it for idle CPU wakeups
SCX_KICK_IDLE is a new feature which isn't defined in older kernels. Add a
compat wrapper and use it for idle CPU wakeups.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-06 15:28:40 -10:00
Andrea Righi
acb174aa51 scx_rustland: prevent duplicate PIDs in the task BTreeSet
Items in the task BTreeSet are stored by pid and vruntime. Make sure
that we never store multiple items with the same PID, so that
re-enqueued tasks are not dispatched multiple times.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-03 14:46:39 +01:00
Andrea Righi
681b3fd807 scx_rustland: more aggressive time slice scaling
Allow scaling the effective time slice down to 250 us. This can help to
maintain good audio quality even when the system is overloaded
by multiple CPU-intensive tasks.

Moreover, always round up the time slice scaling factor to be a little
more aggressive and prioritize scaling the time slice, so that we can
prioritize low-latency tasks even more.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-01 16:40:59 +01:00
Andrea Righi
26d6d530f0 scx_rustland: enhance interactive task classification
Evaluate the number of voluntary context switches per second (nvcsw/sec)
for each task using an exponentially weighted moving average (EWMA) with
weight 0.5, which allows interactive tasks to be classified with more
accuracy.

Using a simple average over a period of 10 sec can introduce
small lags every 10 sec, as the statistics for the number of voluntary
context switches are refreshed. This can result in interactive tasks
taking a brief time to catch up in order to be accurately classified as
such, causing for example short audio cracks, a small drop of 5-10 fps in
games, etc.

Using an EWMA smooths the average of nvcsw/sec, preventing short
lags in the interactive tasks, while also preventing tasks that may
experience an isolated short burst of voluntary context switches from
being incorrectly classified as interactive.

This patch has been tested with the usual test case of playing a
videogame while running a parallel kernel build in the background.

Without this patch the short lag every 10 sec is clearly noticeable;
with this patch applied the game and audio run smoothly.
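
For reference, a minimal sketch of the EWMA update with weight 0.5 (variable
names and sample values are illustrative):

```rust
// EWMA with weight 0.5: the new sample and the history contribute equally,
// so the average reacts quickly while isolated bursts are damped.
fn update_avg_nvcsw(avg: f64, sample: f64) -> f64 {
    0.5 * sample + 0.5 * avg
}

fn main() {
    let mut avg = 0.0;
    for sample in [120.0, 130.0, 5.0, 125.0] {
        avg = update_avg_nvcsw(avg, sample);
        println!("nvcsw/sec sample={sample:6.1} avg={avg:6.1}");
    }
}
```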

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-01 16:40:59 +01:00
Andrea Righi
baeea306fc scx_rustland: rely on the built-in idle selection logic
Simplify the idle selection logic by relying only on the built-in idle
selection performed in the BPF layer.

When there are idle CPUs available in the system, tasks are dispatched
directly by the BPF dispatcher without invoking the user-space
scheduler. This avoids the user-space overhead and provides the best
system performance when CPU resources are not overcommitted.

Once the number of tasks exceeds the available CPUs, the user-space
scheduler takes over. However, by this time, the system is already
overcommitted, so there's little advantage in attempting to pinpoint the
optimal idle CPU through the user-space scheduler. Instead, tasks can be
executed on the first available CPU, consistently dispatching them to
the shared DSQ.

This achieves optimal performance both with system
under-utilization and over-utilization.

With this change in place the user-space scheduler won't dispatch tasks
directly to specific CPUs, but we still want to keep this as a generic
feature in the BPF layer, so that it can be potentially used in the
future by this scheduler or even by other user-space schedulers (once
the BPF layer is moved to a more generic place).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-01 16:40:59 +01:00
Andrea Righi
b9e60f71ed scx_rustland: usersched: code refactoring
No functional change, just move code around to make it more readable.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-01 16:40:59 +01:00
Andrea Righi
d13ed5c025 scx_rustland: BPF: refine CPU dispatch logic
When the user-space scheduler dispatches a task on a specific CPU, that
CPU might not be valid, since the user-space doesn't have visibility of
the task's cpumask.

When this happens the BPF dispatcher (that has direct visibility of the
cpumask) should automatically redirect the task to a valid CPU, but
instead of bouncing the task on the shared DSQ, we should try to use the
CPU assigned by the built-in idle selection logic.

If this CPU is also not valid, then we can simply ignore the task, which
has been de-queued and re-enqueued, since a valid CPU will be naturally
re-selected at a later time.

Moreover, avoid kicking any specific CPU when the task is dispatched to
the shared DSQ, since the task can be consumed on any CPU and the additional
kick would simply add more overhead.

Lastly, rename dsq_id_to_cpu() to dsq_to_cpu() and cpu_to_dsq_id() to
cpu_to_dsq() for more clarity.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-01 16:38:17 +01:00
Andrea Righi
45d8b54eb9 scx_rustland: re-introduce per-CPU DSQ + a global shared DSQ
With commit c6ada25 ("scx_rustland: use custom pcpu DSQ instead of
SCX_DSQ_LOCAL{_ON}") we tried to introduce custom per-CPU DSQs, instead
of using SCX_DSQ_LOCAL and SCX_DSQ_LOCAL_ON to dispatch tasks.

This was required, because dispatching tasks using SCX_DSQ_LOCAL_ON
doesn't provide a guarantee that the cpumask, checked at dispatch time
to determine the validity of a target CPU, remains valid.

This method solved the cpumask validity issue, but unfortunately it
introduced a noticeable performance regression and a potential
starvation issue (that were probably caused by the same problem): if a
task is assigned to a CPU in select_cpu() and the scheduler decides to
dispatch it on a different CPU, the task will be added to the new CPU's
DSQ, but if no dispatch event happens there, the task may remain stuck
in the per-CPU DSQ for a long time, triggering the sched-ext watchdog
timeout that would kick out the scheduler, for example:

  12:53:28 [WARN] FAIL: IPC:CSteamEngin[7217] failed to run for 6.482s (err=1026)
  12:53:28 [INFO] Unregister RustLand scheduler

Therefore, we reverted this change with 6d89ece ("scx_rustland: dispatch
tasks only on the global DSQ"), dispatching all the tasks to the global
DSQ, completely delegating to the kernel the distribution of tasks among
the available CPUs.

This is not the ideal solution, because we still want to give the
user-space scheduler the possibility to assign tasks to specific
CPUs.

Therefore, re-introduce distinct per-CPU DSQs, but also provide a global
shared DSQ. Tasks dispatched in the per-CPU DSQs are consumed from the
dispatch() callback of their corresponding CPU, tasks dispatched in the
global shared DSQ are consumed from any CPU.

In this way the BPF layer is able to provide an interface that gives
the flexibility to the user-space to dispatch a task on a specific CPU
or on the first CPU available, depending on the particular scheduler's
need.

If an invalid CPU (according to the cpumask) is selected the BPF
dispatcher will transparently redirect the task to a valid CPU, selected
using the built-in idle selection logic.

In the future we may want to improve this part, giving user-space
visibility of the cpumask, in order to pick a valid CPU in advance and
in a properly synchronized way.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-01 00:33:35 +01:00
Andrea Righi
b5e846c538 scx_rustland: BPF: small refactoring
No functional change, just some refactoring to make the code more clear.

We have is_usersched_needed() and set_usersched_needed() that are doing
different things (the former is checking if there are pending tasks for
the scheduler, the latter is setting the usersched_needed flag to
activate the dispatch of the user-space scheduler).

Rename is_usersched_needed() to usersched_has_pending_tasks() to make
the code more clear and understandable.

Also move dispatch_user_scheduler() closer to the other dispatch-related
helper functions.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-02-01 00:33:35 +01:00
Tejun Heo
6db362b27a scx_rustland: Use scx_utils::user_exit_info
Instead of the bespoke implementation. This also makes scx_rustland print
out the debug dump if it exists.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-31 11:44:15 -10:00
Tejun Heo
965926f393 scx_rusty: Use scx_utils::user_exit_info
Instead of the bespoke implementation. This also makes scx_rusty print
out the debug dump if it exists.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-31 11:08:17 -10:00
Tejun Heo
105dc36b8f scx_layered: Use scx_utils::user_exit_info
Instead of the bespoke implementation. This also makes scx_layered print
out the debug dump if it exists.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-31 10:54:20 -10:00
Tejun Heo
4ee8104a6d
Merge pull request #114 from dschatzberg/local_avoid_enqueue
scx_layered: dispatch from select_cpu if possible
2024-01-31 08:33:26 -10:00
Dan Schatzberg
11e487c165 scx_layered: dispatch from select_cpu if possible
If we are doing local dispatch, we can avoid enqueue() altogether by
dispatching from select_cpu()

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-31 09:54:26 -08:00
Jordan Rome
1b3a9a1e72 [scx_layered] downgrade prometheus-client
This library at version 22 is not available in Fedora:
https://src.fedoraproject.org/rpms/rust-prometheus-client

Rather than bothering the maintainer, let's just downgrade here.
2024-01-31 04:36:01 -08:00
Dan Schatzberg
ab5635ff6d scx_layered: Grab idle_smtmask a bit later
This is a really minor optimization, but we don't need idle_smtmask to
schedule pinned tasks, so defer it so the nr_cpus_allowed == 1 path is
marginally faster.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-29 08:16:37 -08:00
Dan Schatzberg
8c9e65d880 scx_layered: Remove unnecessary idle_cpumask
idle_cpumask isn't used at all in pick_idle_cpu_from. The only need for
these cpumasks is to check if prev_cpu is a wholly idle CPU (and we only
do this when smt_enabled). idle_smtmask is sufficient for that check.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-29 08:16:37 -08:00
Dan Schatzberg
142b6230b2 scx_layered: Fix AFFN_VIOL stat bump
Prior to this patch, we only bumped LSTAT_AFFN_VIOL when the target cpu
was idle, but in both cases it should be counted as AFFN_VIOL.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-26 13:13:16 -08:00
Tejun Heo
988b7d13c1 Bump versions
scx_exit_info change doesn't require code to be updated but breaks binary
compatibility. Bump versions and cut a new release.
2024-01-25 09:01:23 -10:00
Tejun Heo
eb997a6e55
Merge pull request #101 from dschatzberg/openmetrics
scx_layered: Add support for OpenMetrics format
2024-01-25 08:59:16 -10:00
Dan Schatzberg
7f9548eb34 scx_layered: Add support for OpenMetrics format
Currently scx_layered outputs statistics periodically as info! logs. The
format of this is largely unstructured and mostly suitable for running
scx_layered interactively (e.g. observing its behavior on the command
line or via logs after the fact).

In order to run scx_layered at larger scale, it's desirable to have
statistics output in some format that is amenable to being ingested into
monitoring databases (e.g. Prometheus). This allows collection of
stats across many machines.

This commit adds a command line flag (-o) that outputs statistics to
stdout in OpenMetrics format instead of the normal log mechanism.
OpenMetrics has a public format
specification (https://github.com/OpenObservability/OpenMetrics) and is
in use by many projects.

The library for producing OpenMetrics metrics is lightweight but does
induce some changes. Primarily, metrics need to be pre-registered (see
OpenMetricsStats::new()).
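
A minimal sketch of the pre-register-then-encode flow with a recent
prometheus-client (the metric name and value are illustrative, and the real
OpenMetricsStats registers many more metrics):

```rust
use prometheus_client::encoding::text::encode;
use prometheus_client::metrics::gauge::Gauge;
use prometheus_client::registry::Registry;

fn main() {
    // Metrics must be registered up front, before any values are recorded.
    let mut registry = Registry::default();
    let total: Gauge = Gauge::default();
    registry.register("total", "Total scheduling events in the period", total.clone());

    // Later, on every stats refresh, just update the registered metrics...
    total.set(8489);

    // ...and encode the whole registry in OpenMetrics text format.
    let mut buf = String::new();
    encode(&mut buf, &registry).unwrap();
    print!("{buf}");
}
```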

Without -o, the output looks as before, for example:

```
19:39:54 [INFO] CPUs: online/possible=52/52 nr_cores=26
19:39:54 [INFO] Layered Scheduler Attached
19:39:56 [INFO] tot=   9912 local=76.71 open_idle= 0.00 affn_viol= 2.63 tctx_err=0 proc=21ms
19:39:56 [INFO] busy=  1.3 util=   65.2 load=    263.4 fallback_cpu=  1
19:39:56 [INFO]   batch    : util/frac=   49.7/ 76.3 load/frac=    252.0: 95.7 tasks=   458
19:39:56 [INFO]              tot=   2842 local=45.04 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:56 [INFO]              cpus=  2 [  0,  2] 04000001 00000000
19:39:56 [INFO]   immediate: util/frac=    0.0/  0.0 load/frac=      0.0:  0.0 tasks=     0
19:39:56 [INFO]              tot=      0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:56 [INFO]              cpus= 50 [  0, 50] fbfffffe 000fffff
19:39:56 [INFO]   normal   : util/frac=   15.4/ 23.7 load/frac=     11.4:  4.3 tasks=   556
19:39:56 [INFO]              tot=   7070 local=89.43 open_idle= 0.00 preempt= 0.00 affn_viol= 3.69
19:39:56 [INFO]              cpus= 50 [  0, 50] fbfffffe 000fffff
19:39:58 [INFO] tot=   7091 local=84.91 open_idle= 0.00 affn_viol= 2.64 tctx_err=0 proc=21ms
19:39:58 [INFO] busy=  0.6 util=   31.2 load=    107.1 fallback_cpu=  1
19:39:58 [INFO]   batch    : util/frac=   18.3/ 58.5 load/frac=     93.9: 87.7 tasks=   589
19:39:58 [INFO]              tot=   2011 local=60.67 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:58 [INFO]              cpus=  2 [  2,  2] 04000001 00000000
19:39:58 [INFO]   immediate: util/frac=    0.0/  0.0 load/frac=      0.0:  0.0 tasks=     0
19:39:58 [INFO]              tot=      0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:58 [INFO]              cpus= 50 [ 50, 50] fbfffffe 000fffff
19:39:58 [INFO]   normal   : util/frac=   13.0/ 41.5 load/frac=     13.2: 12.3 tasks=   650
19:39:58 [INFO]              tot=   5080 local=94.51 open_idle= 0.00 preempt= 0.00 affn_viol= 3.68
19:39:58 [INFO]              cpus= 50 [ 50, 50] fbfffffe 000fffff
^C19:39:59 [INFO] EXIT: BPF scheduler unregistered
```

With -o passed, the output is in OpenMetrics format:

```
19:40:08 [INFO] CPUs: online/possible=52/52 nr_cores=26
19:40:08 [INFO] Layered Scheduler Attached
 # HELP total Total scheduling events in the period.
 # TYPE total gauge
total 8489
 # HELP local % that got scheduled directly into an idle CPU.
 # TYPE local gauge
local 86.45305689716104
 # HELP open_idle % of open layer tasks scheduled into occupied idle CPUs.
 # TYPE open_idle gauge
open_idle 0.0
 # HELP affn_viol % which violated configured policies due to CPU affinity restrictions.
 # TYPE affn_viol gauge
affn_viol 2.332430203793144
 # HELP tctx_err Failures to free task contexts.
 # TYPE tctx_err gauge
tctx_err 0
 # HELP proc_ms CPU time this binary has consumed during the period.
 # TYPE proc_ms gauge
proc_ms 20
 # HELP busy CPU busy % (100% means all CPUs were fully occupied).
 # TYPE busy gauge
busy 0.5294061026085283
 # HELP util CPU utilization % (100% means one CPU was fully occupied).
 # TYPE util gauge
util 27.37195512782239
 # HELP load Sum of weight * duty_cycle for all tasks.
 # TYPE load gauge
load 81.55024768702126
 # HELP layer_util CPU utilization of the layer (100% means one CPU was fully occupied).
 # TYPE layer_util gauge
layer_util{layer_name="immediate"} 0.0
layer_util{layer_name="normal"} 19.340849995024997
layer_util{layer_name="batch"} 8.031105132797393
 # HELP layer_util_frac Fraction of total CPU utilization consumed by the layer.
 # TYPE layer_util_frac gauge
layer_util_frac{layer_name="batch"} 29.34063385422595
layer_util_frac{layer_name="immediate"} 0.0
layer_util_frac{layer_name="normal"} 70.65936614577405
 # HELP layer_load Sum of weight * duty_cycle for tasks in the layer.
 # TYPE layer_load gauge
layer_load{layer_name="immediate"} 0.0
layer_load{layer_name="normal"} 11.14363313258934
layer_load{layer_name="batch"} 70.40661455443191
 # HELP layer_load_frac Fraction of total load consumed by the layer.
 # TYPE layer_load_frac gauge
layer_load_frac{layer_name="normal"} 13.664744680306903
layer_load_frac{layer_name="immediate"} 0.0
layer_load_frac{layer_name="batch"} 86.33525531969309
 # HELP layer_tasks Number of tasks in the layer.
 # TYPE layer_tasks gauge
layer_tasks{layer_name="immediate"} 0
layer_tasks{layer_name="normal"} 490
layer_tasks{layer_name="batch"} 343
 # HELP layer_total Number of scheduling events in the layer.
 # TYPE layer_total gauge
layer_total{layer_name="normal"} 6711
layer_total{layer_name="batch"} 1778
layer_total{layer_name="immediate"} 0
 # HELP layer_local % of scheduling events directly into an idle CPU.
 # TYPE layer_local gauge
layer_local{layer_name="batch"} 69.79752530933632
layer_local{layer_name="immediate"} 0.0
layer_local{layer_name="normal"} 90.86574281031143
 # HELP layer_open_idle % of scheduling events into idle CPUs occupied by other layers.
 # TYPE layer_open_idle gauge
layer_open_idle{layer_name="immediate"} 0.0
layer_open_idle{layer_name="batch"} 0.0
layer_open_idle{layer_name="normal"} 0.0
 # HELP layer_preempt % of scheduling events that preempted other tasks. #
 # TYPE layer_preempt gauge
layer_preempt{layer_name="normal"} 0.0
layer_preempt{layer_name="batch"} 0.0
layer_preempt{layer_name="immediate"} 0.0
 # HELP layer_affn_viol % of scheduling events that violated configured policies due to CPU affinity restrictions.
 # TYPE layer_affn_viol gauge
layer_affn_viol{layer_name="normal"} 2.950379973178364
layer_affn_viol{layer_name="batch"} 0.0
layer_affn_viol{layer_name="immediate"} 0.0
 # HELP layer_cur_nr_cpus Current  # of CPUs assigned to the layer.
 # TYPE layer_cur_nr_cpus gauge
layer_cur_nr_cpus{layer_name="normal"} 50
layer_cur_nr_cpus{layer_name="batch"} 2
layer_cur_nr_cpus{layer_name="immediate"} 50
 # HELP layer_min_nr_cpus Minimum  # of CPUs assigned to the layer.
 # TYPE layer_min_nr_cpus gauge
layer_min_nr_cpus{layer_name="normal"} 0
layer_min_nr_cpus{layer_name="batch"} 0
layer_min_nr_cpus{layer_name="immediate"} 0
 # HELP layer_max_nr_cpus Maximum  # of CPUs assigned to the layer.
 # TYPE layer_max_nr_cpus gauge
layer_max_nr_cpus{layer_name="immediate"} 50
layer_max_nr_cpus{layer_name="normal"} 50
layer_max_nr_cpus{layer_name="batch"} 2
 # EOF
^C19:40:11 [INFO] EXIT: BPF scheduler unregistered
```

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-01-25 09:59:49 -08:00
Andrea Righi
6d89eceb93 scx_rustland: dispatch tasks only on the global DSQ
Commit c6ada25 ("scx_rustland: use custom pcpu DSQ instead of
SCX_DSQ_LOCAL{_ON}") fixed the race issues with the cpumask, but it also
introduced performance regressions.

Until we figure out the reasons for the performance regressions, simplify
the dispatcher and go back to using only the global DSQ, relying on the
built-in idle cpu selection.

In this way we can still enforce task affinity properly
(`stress-ng --race-sched N` does not crash the scheduler) and we can
also provide a better level of system responsiveness (according to the
results of the stress tests done recently).

The idea of this change is to make the scheduler usable in certain
real-world scenarios (and as bug-free as possible), while we figure out
the performance regressions of the per-CPU DSQ approach, which will
likely be re-introduced later on in the future.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-23 13:24:12 +01:00
Andrea Righi
06b5ff3d2f scx_rustland: clarify the logic to determine interactive tasks
No functional change, simply rewrite the code a bit and update the
comment to clarify the logic to detect interactive tasks and apply the
priority boost.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-23 08:28:44 +01:00
Andrea Righi
ab1c4f66a8 scx_rustland: allow to disable the slice boost completely
Allow specifying `-b 0` to completely disable the slice boost logic and
fall back to a standard vruntime-based scheduler with a variable time slice.

In this way interactive tasks will not get over-prioritized over the
other tasks in the system.

Having this option can help to easily track down potential performance
regressions arising for over-prioritizing interactive tasks.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-23 00:34:06 +01:00
Andrea Righi
b4269452fc scx_userland: handle preemption events from higher sched_class
Make sure to re-schedule the user-space scheduler if it's preempted by a
task from a higher priority sched_class.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-23 00:34:06 +01:00
Andrea Righi
2426d1024f scx_rustland: increase max amount of enqueued tasks
As the scheduler is progressing towards a more stable and usable state,
it may be subject to heavy stress tests.

For this reason, bump up the limit of MAX_ENQUEUED_TASKS to 8192 in the
BPF component, to be able to sustain task-intensive stress tests,
reducing the risk of potential scheduling congestion conditions.

The downside is a negligible increase in the memory footprint of the BPF
component, which is worth the cost in order to have improved scheduler
stability.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-21 15:47:35 +01:00
Andrea Righi
28bf96c78e scx_rustland: mitigate unevictable memory page faults
Page faults must not happen when the user-space scheduler is running,
otherwise we may hit deadlock conditions: a kthread may need to run to
resolve the page fault, but the user-space scheduler is waiting on the
page fault to be resolved => deadlock.

We solved this problem (mostly) in commit 9708a80 ("scx_userland: use a
custom memory allocator to prevent page faults"), introducing a custom
allocator for the user-space scheduler that operates on a pre-allocated
mlocked memory buffer, but there is an exception that can still trigger
page faults: kcompactd.

When memory compaction is enabled, specifically with
vm.compact_unevictable_allowed=1 (which is often the default in many
distributions), kcompactd regularly attempts to compact all memory
zones, such that free memory is available in contiguous blocks where
feasible, including unevictable memory as well.

In the event that kcompactd remaps pages within the user-space
scheduler's address space, it can lead to page faults, resulting in a
potential deadlock.

To prevent this from happening, automatically set
vm.compact_unevictable_allowed=0 when the scheduler is loaded and
restore the previous value when the scheduler is unloaded. In this way
we can prevent kcompactd from touching the unevictable memory associated
to the user-space scheduler.
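
A minimal sketch of how the knob could be saved, cleared and restored from
the user-space scheduler (requires root; error handling reduced to a Result):

```rust
use std::fs;
use std::io;

const KNOB: &str = "/proc/sys/vm/compact_unevictable_allowed";

// Save the current value and disable compaction of unevictable memory while
// the scheduler is running.
fn disable_unevictable_compaction() -> io::Result<String> {
    let old = fs::read_to_string(KNOB)?;
    fs::write(KNOB, "0")?;
    Ok(old)
}

// Restore the previous value when the scheduler is unloaded.
fn restore_unevictable_compaction(old: &str) -> io::Result<()> {
    fs::write(KNOB, old)
}
```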

Keep in mind that this is not a fully bulletproof solution: something
else in the system may still set vm.compact_unevictable_allowed=1 while
the scheduler is running, re-enabling the risk of deadlock.

Ideally we would need a way to mark the user-space scheduler memory as
"really unevictable", or a proper kernel ABI to instruct kcompactd to
exclude certain tasks (or better, cgroups) from its proactive memory
compaction actions, but until then, this seems to be the best way to
mitigate this issue.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-21 15:47:35 +01:00
David Vernet
c6ada251ef scx_rustland: use custom pcpu DSQ instead of SCX_DSQ_LOCAL{_ON}
We still don't have a reliable and non-racy way to manage cpumasks from
the user-space scheduler, so it is quite hard for the scheduler to
enforce the proper CPU affinity behavior.

Despite checking the cpumask in the BPF part, tasks may still be
assigned to a CPU that they cannot use, triggering scheduler errors.

For example, it is really easy to crash the scheduler with a simple CPU
affinity stress test (`stress-ng --race-sched 8 --timeout 5`):

  14:51:28 [WARN] FAIL: SCX_DSQ_LOCAL[_ON] verdict target cpu 1 not allowed for stress-ng-race-[567048] (err=1024)

To prevent this issue from happening, create a custom DSQ for each CPU
available in the system and use these per-CPU DSQs to dispatch all the
tasks processed by the user-space scheduler, including the user-space
scheduler itself.

Then consume these DSQs from the .dispatch() callback of the
respective CPU, to transfer all the tasks to the consuming CPU's local
DSQ, preventing the cpumask race condition encountered using
SCX_DSQ_LOCAL_ON.

With this patch applied the `stress-ng --race-sched N` stress test can
be executed successfully (even with large values of N) without causing
the scheduler to crash.

Signed-off-by: David Vernet <void@manifault.com>
[ arighi: kick target cpu to improve responsiveness, update comments ]
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-21 15:47:35 +01:00
Jordan Rome
9f9a97a97f Update descriptions in cargo toml files 2024-01-19 18:19:46 -08:00
Andrea Righi
be1cb8774b scx_rustland: improve SMT performance
The user-space scheduler dispatches tasks in batches, with the batch
size matching the number of idle CPUs.

Commit 791bdbe ("scx_rustland: introduce SMT support") changed the order
of idle CPUs, prioritizing dispatching tasks on the least busy cores
(those with the most idle CPUs) before moving on to busier cores (those
with the least idle CPUs).

While this approach works well for a small number of tasks, it can lead
to uneven performance as the number of tasks increases and all cores are
saturated. Such uneven performance can be attributed to SMT interactions
causing potential short lags and erratic system performance. In some
cases, disabling SMT entirely results in better system responsiveness.

To address this issue, instruct the scheduler to implicitly disable SMT
and consistently dispatch tasks only on the first (or last) CPU of each
core. This approach ensures an equal distribution of tasks among the
available cores, preventing SMT disturbances and aligning with non-SMT
performance, also when a significant amount of tasks are running.

Additionally, the unused sibling CPUs within each core can be used as
"spare" CPUs for the BPF dispatcher. This is particularly beneficial for
tasks that cannot be dispatched on the target CPU selected by the
scheduler, due to cpumask restrictions or congestion conditions.

Therefore, this new approach enhances system responsiveness on
SMT systems, while simultaneously improving scheduler stability.

Some preliminary results on an AMD Ryzen 7 5800X 8-Cores (SMT enabled):
running my usual benchmark of measuring the fps of a videogame
(Counter-Strike 2) during a parallel kernel build-induced system
overload, shows an improvement of approximately 2x (from 8-10fps to
15-25fps vs 1-2fps with EEVDF).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-17 20:49:17 +01:00
Andrea Righi
f0c33320ab scx_rustland: avoid calling scx_bpf_kick_cpu() from update_idle()
Prior to commit 676bd88 ("bpf_rustland: do not dispatch the scheduler to
the global DSQ"), the user-space scheduler was dispatched using
SCX_DSQ_GLOBAL and we needed to explicitly kick idle CPUs from
update_idle() to ensure that at least one CPU was available to run the
user-space scheduler.

Now that we are using SCX_DSQ_LOCAL_ON|cpu to dispatch the user-space
scheduler, the target CPU is implicitly kicked. Therefore, the call to
scx_bpf_kick_cpu() within .update_idle() becomes redundant and we can
get rid of it.

Fixes: 676bd88 ("bpf_rustland: do not dispatch the scheduler to the global DSQ")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-17 20:49:17 +01:00
Andrea Righi
0b3c399519 scx_rustland: introduce dynamic slice boost
Update the slice boost dynamically, as a function of the number of CPUs
in the system and the number of tasks currently waiting to be
dispatched: as the number of waiting tasks in the task_pool increases,
reduce the slice boost.

This adjustment ensures that the scheduler adheres more closely to a
pure vruntime-based policy as the number of tasks contending for the
available CPUs increases, and it allows the scheduler to sustain stress tests that are
spawning a massive amount of tasks.
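
Purely as an illustration of the shape of such an adjustment (the names and
the exact formula below are assumptions, not the scheduler's actual math):

```rust
// Illustrative only: shrink the slice boost as the number of tasks waiting
// in the task_pool grows relative to the number of CPUs.
fn effective_slice_boost(base_boost: u64, nr_cpus: u64, nr_waiting: u64) -> u64 {
    if nr_waiting <= nr_cpus {
        base_boost
    } else {
        // More waiting tasks -> smaller boost -> closer to pure vruntime.
        (base_boost * nr_cpus / nr_waiting).max(1)
    }
}
```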

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-16 11:51:51 +01:00
Andrea Righi
791bdbec97 scx_rustland: introduce SMT support
Introduce basic support for CPU topology awareness. With this change,
the scheduler will prioritize dispatching tasks to idle CPUs with fewer
busy SMT siblings, then, it will proceed to CPUs with more busy SMT
siblings, in ascending order.

To implement this, introduce a new CoreMapping abstraction, that
provides a mapping of the available core IDs in the system along with
their corresponding lists of CPU IDs. This, coupled with the
get_cpu_pid() method from the BpfScheduler abstraction, allows the
user-space scheduler to enforce the policy outlined above and improve
performance on SMT systems.

Keep in mind that this improvement is relevant only when the amount of
tasks running in the system is less than the amount of CPUs. As soon as
the amount of running tasks increases, they will be distributed across
all available CPUs and cores, thereby negating the advantages of SMT
isolation.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-16 11:33:35 +01:00
Andrea Righi
63209b865d scx_rustland: support aligned allocations in RustLandAllocator
Even if the current implementation of the user-space scheduler doesn't
require allocating aligned memory, add simple support for aligned
allocations in RustLandAllocator, in order to make it more generic and
potentially usable by other schedulers / components.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-15 13:44:33 +01:00
Andrea Righi
c593e3605e scx_rustland: report user-space scheduler page fault counter
Periodically report a page fault counter in the scheduler output. The
user-space scheduler should never trigger page faults, otherwise we may
experience deadlocks (that would trigger the sched-ext watchdog,
unloading the scheduler).

Reporting a page fault counter periodically to stdout can be really
helpful to debug potential issues with the custom allocator.
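
One way to read such a counter for the scheduler process itself is via
/proc/self/stat (illustrative sketch; the actual scx_rustland
implementation may obtain it differently):

  use std::fs;

  // minflt is field 10 and majflt is field 12 of /proc/self/stat.
  fn page_faults() -> std::io::Result<(u64, u64)> {
      let stat = fs::read_to_string("/proc/self/stat")?;
      // Skip past the comm field, which may contain spaces and parentheses.
      let rest = &stat[stat.rfind(')').unwrap_or(0) + 1..];
      let fields: Vec<&str> = rest.split_whitespace().collect();
      let minflt = fields.get(7).and_then(|s| s.parse().ok()).unwrap_or(0);
      let majflt = fields.get(9).and_then(|s| s.parse().ok()).unwrap_or(0);
      Ok((minflt, majflt))
  }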

Moreover, group nr_sched_congested and nr_failed_dispatches together
with nr_page_faults and use the sum of all these counters to determine
the health status of the user-space scheduler (reporting it to stdout
as well).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-14 22:07:37 +01:00
Andrea Righi
9708a80130 scx_userland: use a custom memory allocator to prevent page faults
To prevent potential deadlock conditions under heavy loads, any
scheduler that delegates scheduling decisions to user-space should avoid
triggering page faults.

To address this issue, replace the default Rust allocator with a custom
one (RustLandAllocator), designed to operate on a pre-allocated buffer.

This, coupled with the memory locking (via mlockall), prevents page
faults from happening during the execution of the user-space scheduler,
avoiding the deadlock condition.

This memory allocator is completely transparent to the user-space
scheduler code and it is applied automatically when the bpf module is
imported.

In the future we may decide to move this allocator to a more generic
place (scx_utils crate), so that also other user-space Rust schedulers
can use it.

This initial implementation of the RustLandAllocator is very simple: a
basic block-based allocator that uses an array to track the status of
each memory block (allocated or free).

This allocator can be improved in the future, but right now, despite its
simplicity, it shows a reasonable speed and efficiency in meeting memory
requests from the user-space scheduler, having to deal mostly with small
and uniformly sized allocations.
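
As a rough illustration of the general idea, here is a deliberately
oversimplified bump-style sketch over a pre-allocated arena (NOT the
block-based RustLandAllocator described above); assuming the arena has
been locked in RAM via mlockall(), serving allocations from it cannot
trigger page faults:

  use std::alloc::{GlobalAlloc, Layout};
  use std::ptr::addr_of_mut;
  use std::sync::atomic::{AtomicUsize, Ordering};

  const ARENA_SIZE: usize = 32 * 1024 * 1024;

  #[repr(align(4096))]
  struct Arena([u8; ARENA_SIZE]);

  static mut ARENA: Arena = Arena([0; ARENA_SIZE]);
  static NEXT: AtomicUsize = AtomicUsize::new(0);

  struct ArenaAllocator;

  unsafe impl GlobalAlloc for ArenaAllocator {
      unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
          loop {
              let cur = NEXT.load(Ordering::Relaxed);
              // The arena base is page aligned, so aligning the offset is
              // enough for any alignment <= 4096.
              let start = (cur + layout.align() - 1) & !(layout.align() - 1);
              let end = start + layout.size();
              if end > ARENA_SIZE {
                  return std::ptr::null_mut(); // arena exhausted
              }
              if NEXT
                  .compare_exchange(cur, end, Ordering::Relaxed, Ordering::Relaxed)
                  .is_ok()
              {
                  return (addr_of_mut!(ARENA) as *mut u8).add(start);
              }
          }
      }

      unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {
          // A bump allocator never frees; the real allocator tracks
          // per-block state so memory can be reused.
      }
  }

  // Installing it (not enabled here):
  // #[global_allocator]
  // static GLOBAL: ArenaAllocator = ArenaAllocator;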

With this change in place scx_rustland survived more than 10hrs on a
heavily stressed system (with stress-ng and kernel builds running in a
loop):

 $ ps -o pid,rss,etime,cmd -p `pidof scx_rustland`
     PID   RSS     ELAPSED CMD
   34966 75840    10:00:44 ./build/scheds/rust/scx_rustland/debug/scx_rustland

Without this change it is possible to trigger the sched-ext watchdog
timeout in less than 5min, under the same system load conditions.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-14 22:07:37 +01:00
Andrea Righi
acc1d51560 scx_rustland: remove obsolete TODO note
Entries from TaskInfoMap associated to exiting tasks are already removed
via the BPF .exit_task() callback, so drop the obsolete TODO note and
replace it with a proper comment.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-11 20:47:36 +01:00
Andrea Righi
12d89e1d84 scx_rustland: add a troubleshooting section
Add a brief troubleshooting section to the command line help.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-11 18:14:46 +01:00
Andrea Righi
2157f638df scx_rustland: voluntary context switch boost
Improve priority boosting by using the voluntary context switch metric.

Overview
========

The current criterion to apply the time slice boost (option `-b`) is to
distinguish between newly created tasks and tasks that are already
running: in order to prioritize interactive applications (games,
multimedia, etc.) we apply a time slice usage penalty on newly created
tasks, indirectly boosting the priority of tasks that are already
running, which are likely to be the interactive applications that we
aim to prioritize.

Problem
=======

This approach works well when the background workload forks a bunch of
short-lived tasks (e.g., a parallel kernel build), but it fails to
properly classify CPU-intensive background tasks (e.g., video/3D
rendering, encryption, large data analysis, etc.), because these
applications, typically, do not generate many short-lived processes.

In the presence of such workloads the time slice penalty is not
enforced, resulting in a lack of any boost for interactive applications.

Solution
========

A more effective criterion for distinguishing between interactive
applications and background CPU-intensive applications is to examine the
voluntary context switches: an application that periodically releases
the CPU voluntarily is very likely to be interactive.

Therefore, change the time slice boost logic to apply a bonus (scale down
the accounted used time slice) to tasks that show an increase in their
voluntary context switches counter over a time frame of 10 sec.
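
A minimal sketch of this heuristic (hypothetical names and thresholds,
not the actual scx_rustland code):

  const NVCSW_WINDOW_NS: u64 = 10_000_000_000; // 10 s observation window

  struct TaskInfo {
      nvcsw: u64,          // last observed voluntary context switch count
      nvcsw_ts: u64,       // timestamp of the last observation (ns)
      interactive: bool,   // classification for the current window
  }

  // Refresh the classification and return the time slice to account to
  // the task: interactive tasks get their used slice scaled down.
  fn account_slice(ti: &mut TaskInfo, now: u64, cur_nvcsw: u64, used_slice: u64, boost: u64) -> u64 {
      if now.saturating_sub(ti.nvcsw_ts) >= NVCSW_WINDOW_NS {
          ti.interactive = cur_nvcsw > ti.nvcsw;
          ti.nvcsw = cur_nvcsw;
          ti.nvcsw_ts = now;
      }
      if ti.interactive {
          used_slice / boost.max(1)
      } else {
          used_slice
      }
  }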

Based on experimental results, this simple heuristic appears to be quite
effective in classifying interactive tasks and prioritizing them over
potential background CPU-intensive tasks.

Additionally, having a better criterion to identify interactive tasks
allows us to also prioritize newly created tasks, thereby enhancing the
responsiveness of interactive shell sessions.

This always ensures the prompt execution of system commands, even when
the system is massively overloaded, unlike the previous time slice boost
logic, which made interactive shell sessions less responsive by
deprioritizing newly created tasks.

Results
=======

With this new logic in place it is possible to play a video game (e.g.,
Terraria) without experiencing any frame rate drop (60 fps), while a
parallel CPU stress test (`stress-ng -c 32`) is running in the
background. The same result can also be obtained with a parallel kernel
build (`make -j 32`). Thus, there is no regression compared to the
previous "ideal" test case.

Even when mixing both workloads (`make -j 16` + `stress-ng -c 16`),
Terraria can still be played without noticeable lag in the audio or
video, maintaining a consistent 60 fps.

In addition to that, shell commands are also very responsive.

Following are the results (average and standard deviation of 10 runs)
of two simple interactive shell commands, while both the `make -j 16`
and `stress-ng -c 16` workloads are running in the background:

  avg time           "uname -r"       "ps axuw > /dev/null"
  =========================================================
  EEVDF                 11.1ms                     231.8ms
  scx_rustland           2.6ms                     212.0ms

  stdev              "uname -r"       "ps axuw > /dev/null"
  =========================================================
  EEVDF                   2.28                       23.41
  scx_rustland            0.70                        9.11

Tests conducted on an 8-core laptop (11th Gen Intel i7-1195G7 @
4.800GHz) with 16GB of RAM.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-11 18:14:30 +01:00
Andrea Righi
1cf03770c7 scx_rustland: expose voluntary context switches to the scheduler
Provide the number of voluntary context switches (nvcsw) for each task
to the user-space scheduler.

This extra information can then be used by the scheduler to enhance its
decision-making process when scheduling tasks.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-11 14:10:39 +01:00
Tejun Heo
1395f14975
Update README.md
Embed the video and drop "live" from section title as it's not really live.
2024-01-10 14:47:33 -10:00
Andrea Righi
0198d893ce scx_rustland: introduce time slice boost parameter
Introduce a parameter to prioritize active running tasks over newly
created tasks.

This option can be used to enhance interactive applications (e.g.,
games, audio/video, GUIs, etc.) that are concurrently running with
fork-intensive background workloads (such as a large parallel build for
example).

The boost value (which functions as a penalty) is applied to the time
slice attributed to newly generated tasks, increasing their vruntime
and, in an indirect manner, "boosting" the priority of all the other
concurrent active tasks.
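
Conceptually (hypothetical sketch, not the actual implementation), the
accounting looks like this:

  // The charged slice of a newly created task is multiplied by the boost
  // value, pushing its vruntime further ahead and indirectly favoring
  // tasks that are already running.
  fn charged_slice(slice_used: u64, slice_boost: u64, is_new_task: bool) -> u64 {
      if is_new_task {
          slice_used.saturating_mul(slice_boost.max(1))
      } else {
          slice_used
      }
  }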

The time slice boost parameter was applied in the live demo video [1] to
enhance the frames per second (fps) of a video game (Terraria), running
simultaneously with a parallel kernel build (`make -j 32`) on an 8-core
laptop (the value used in the video matches the existing setting of
running `scx_rustland -b 200`).

[1] https://www.youtube.com/watch?v=oCfVbz9jvVQ

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-10 17:32:29 +01:00
Andrea Righi
732ba4900b scx_rustland: avoid using SCX_ENQ_PREEMPT
With the introduction of the dynamic time slice that scales down based
on the number of tasks in the system, there is no obvious benefit in
utilizing SCX_ENQ_PREEMPT to dispatch the user-space scheduler.

The reduced time slice as the task count increases already enhances the
user-space scheduler's opportunities to run and efficiently manage
scheduling tasks, even when the system is massively overloaded.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-10 17:32:29 +01:00
Andrea Righi
db9a29d618 scx_rustland: improve dynamic slice scaling
Move scaling after tasks are sent to the dispatcher: tasks are
dispatched based on the number of idle CPUs, so checking for any
remaining tasks still sitting in the scheduler after dispatch gives a
better idea of how busy the system is.

Moreover, do not scale the time slice based on nr_cpus (otherwise,
systems with a large number of CPUs would rarely get any scaling at
all).

Instead, apply a scaling factor as a function of how many tasks are
still waiting in the scheduler: nr_scheduled / 2. This method scales
better as the number of CPUs increases.
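
One plausible reading of this scaling, as a hypothetical sketch (the
real code may guard differently against a zero backlog):

  // Shrink the effective slice by a factor derived from the backlog left
  // after dispatching; the "+ 1" only avoids a division by zero.
  fn scale_slice(slice_ns: u64, nr_scheduled: u64) -> u64 {
      slice_ns / (nr_scheduled / 2 + 1)
  }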

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-09 22:11:07 +01:00
Andrea Righi
1da2983804 scx_rustland: get rid of force_local
Now that we can dispatch directly from select_cpu() we can make the code
more compact and readable by removing the force_local logic.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-09 22:11:07 +01:00
Andrea Righi
6ead675fb6 scx_rustland: add a link to the live demo in the README
Update the README.md adding a link to a live demo video of the
scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-09 22:11:07 +01:00
Tejun Heo
942b0269b8 Bump versions
After updates to reflect the updated init and direct dispatch API, the
schedulers aren't compatible with older kernels. Bump versions and publish
releases.
2024-01-08 18:49:54 -10:00
Tejun Heo
552b75a9c7 scx: Build fix after kernel update
In the latest kernel, sched_ext API has changed in two areas:

- ops.prep_enable/cancel_enable/enable/disable() replaced with
  ops.init_task/enable/disable/exit_task().

- scx_bpf_dispatch() can now be called from ops.select_cpu(). Also,
  SCX_ENQ_LOCAL flag is removed. Instead, users can call
  scx_bpf_select_cpu_dfl() from ops.select_cpu() and use the @is_idle out
  param value to determine whether to dispatch directly.

This commit updates all schedulers so that they build.

- Init functions renamed / merged / split.

- ops.select_cpu() is added to several schedulers and local direct
  dispatching logic is moved there.

This is the minimum update which is needed to make the schedulers build
and work. It needs further updates to e.g. move vtime updates to
ops.enable().
2024-01-08 14:48:24 -10:00
Andrea Righi
1ea5aebfb4 scx_rustland: always consider slice_ns as maximum time slice
With the introduction of the dynamic time slice that scales down based
on the number of tasks in the system, there is no need anymore to apply
a constant scaling factor to the time slice to extend the range of the
allowed time slices.

Therefore, get rid of the static scaling and use slice_ns as the upper
limit for the time slice accounted to the tasks.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 19:22:38 +01:00
Andrea Righi
9b482f48f1 scx_rustland: determine the amount of cores via /proc/stat
libbpf_rs::num_possible_cpus() may take into account multi-threads
multi-cores information, that are not used efficiently by the scheduler
at the moment.

For simplicity rely on /proc/stat to determine the number of CPUs that
can be used by the scheduler and provide a proper abstraction to access
this information from the bpf Rust module.
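
For example, a possible (illustrative, not the actual helper) way to
count the usable CPUs from /proc/stat:

  use std::fs;

  // Count the per-CPU "cpuN" lines, skipping the aggregate "cpu " line.
  fn nr_cpus_from_procstat() -> std::io::Result<usize> {
      let stat = fs::read_to_string("/proc/stat")?;
      Ok(stat
          .lines()
          .filter(|l| l.starts_with("cpu") && !l.starts_with("cpu "))
          .count())
  }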

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 19:11:25 +01:00
Andrea Righi
0d107d6220 scx_rustland: return the proper cpu value from get_task_cpu()
Fix the ternary operator expression to return the CPU id, instead of the
boolean result of the condition.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 19:10:59 +01:00
Andrea Righi
fa6915cc0a scx_rustland: simplify update_enqueued()
With the introduction of a variable time slice that scales down as a
function of the number of waiting tasks, the scheduler is able to
handle a steady stream of newly spawned tasks, without having to
de-prioritize them to guarantee a good level of system responsiveness.

Hence, the logic for de-prioritizing new tasks can be removed, as it
currently doesn't provide any measurable benefits. In fact, it even
proves counterproductive, as it can implicitly slow down the
interactive performance of shell sessions when the system is overloaded
with a significant number of CPU hogs (e.g., `stress-ng -c 128`).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 07:38:52 +01:00
Andrea Righi
bf98154ee1 scx_rustland: use dynamic time slice in the user-space scheduler
Implement a simple logic in the user-space scheduler to automatically
adjust the tasks' time slice: reduce the time slice by a scaling factor
of (nr_waiting / nr_cpus + 1), where nr_waiting is the number of tasks
waiting in the scheduler and nr_cpus is the number of CPUs in the
system.
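
In other words (illustrative helper, assuming integer arithmetic as in
the description above):

  fn task_slice(slice_ns: u64, nr_waiting: u64, nr_cpus: u64) -> u64 {
      slice_ns / (nr_waiting / nr_cpus + 1)
  }

  // e.g. with slice_ns = 20ms and 32 tasks waiting on 8 CPUs:
  // 20_000_000 / (32 / 8 + 1) = 4ms per task.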

Using a fine-grained time slice as the number of tasks in the system
grows improves the responsiveness of low-latency activities (e.g.,
audio, video games), also in the presence of other CPU-intensive tasks
that are concurrently running in the system.

On the other hand, extending the time slice when only a limited number
of tasks are active in the system contributes to an enhancement in the
overall system throughput and a reduced amount of context switches.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 07:38:52 +01:00
Andrea Righi
303c4ea548 scx_rustland: dynamic time slice support
Add to BpfScheduler() the new methods set_effective_slice_us() and
get_effective_slice_us().

These methods can be used by the user-space scheduler to dynamically
adjust (and retrieve) the effective time slice used to dispatch tasks
within the BPF dispatcher.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-08 07:35:31 +01:00
Andrea Righi
2a32d81859 scx_rustland: store default slice_ns in the scheduler class
Cache slice_ns into the main scheduler class to avoid accessing it via
self.bpf.skel.rodata().slice_ns every single time.

This also makes the scheduler code more clear and more abstracted from
the BPF details.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-07 16:14:51 +01:00
Andrea Righi
8ccbbdadee scx_userland: improve BPF logging
Always report task comm, nr_queued and nr_scheduled in the log messages.
Moreover, report also task name (comm) and cpu when possible.

All this extra information can be really helpful to trace and debug
scheduling issues.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-07 16:14:51 +01:00
Andrea Righi
295873ac41 scx_rustland: always dispatch per-CPU kthreads from enqueue
We allow tasks to bypass the user-space scheduler and be dispatched
directly using a shortcut in the enqueue path, if their running CPU is
immediately available or if the task is a per-CPU kthread.

However, the shortcut is disabled if the user-space scheduler has some
pending activities to do (to avoid disrupting too much its decision).

In this case the shortcut is also disabled for per-CPU kthreads, which
may cause priority-inversion problems in the system, triggering stalls
of some per-CPU kthreads (such as rcuog/N) and short system lockups, if
the system is overloaded.

Prevent this by always enabling the dispatch shortcut for per-CPU
kthreads.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
0c3bdb16fe scx_rustland: prevent using SCX_DSQ_LOCAL_ON from enqueue()
When we fail to push a task to the queued BPF map we fallback to direct
dispatch, but we can't use SCX_DSQ_LOCAL_ON. So, make sure to use
SCX_DSQ_GLOBAL in this case to prevent scheduler crashes.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
05d997c539 scx_rustland: more robust CPU selection logic in the dispatcher
Instead of just trying the target CPU and the previously used CPU, we
could cycle among all the available CPUs (if neither of those CPUs can
be used), before using the global DSQ.

This avoids overly de-prioritizing tasks that can't be scheduled on the
CPU selected by the scheduler (or their previously used CPU), and we
can still dispatch them using SCX_DSQ_LOCAL_ON, like any other task.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
18a990ae82 scx_rustland: assign min_vruntime before time slice evaluation
Assign min_vruntime to the task before the weighted time slice is
evaluated, then add the time slice.

In this way we still ensure that the task's vruntime is in the range
(min_vruntime + 1, min_vruntime + max_slice_ns], but we don't nullify
the effect of the evaluated time slice if the starting vruntime of the
task is too small.
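
A hypothetical sketch of the ordering described above (illustrative
names, not the actual update_enqueued() code):

  // Clamp the task's vruntime to the global minimum first, then charge
  // the weighted slice, capped to the maximum slice.
  fn update_vruntime(task_vruntime: &mut u64, min_vruntime: u64, weighted_slice: u64, max_slice_ns: u64) {
      if *task_vruntime < min_vruntime {
          *task_vruntime = min_vruntime;
      }
      *task_vruntime += weighted_slice.min(max_slice_ns);
  }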

Also change update_enqueued() to return the evaluated weighted time
slice (that can be used in the future).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Andrea Righi
92109c95a9 scx_rustland: small TaskTree.push() refactoring
Change TaskTree.push() to accept directly a Task object, rather than
each individual attribute. Moreover, Task attributes don't need to be
public, since both TaskTree and Task are only used locally.

This makes the code more elegant and more readable.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-06 11:06:53 +01:00
Jordan Rome
661ea57c5c bump scx_rusty and scx_layered
These were supposed to be bumped in this commit:
fed1dae9da
2024-01-04 13:57:29 -08:00
Andrea Righi
96f3eb42be
Merge pull request #68 from sched-ext/scx-rustland-refactoring
scx_rustland: refactoring
2024-01-04 20:42:30 +01:00
Andrea Righi
7813992896 scx_rustland: introduce nr_failed_dispatches
Introduce a new counter to report the amount of failed dispatches: if
the scheduler designates a target CPU for a task, and both the chosen
CPU and the previously utilized one are unavailable when the task is
dispatched, the task will be sent to the global DSQ, and the counter
will be incremented.

Also mark all the methods to access these statistics counters as
optional. In the future we may also provide a "verbose" option and show
these statistics only when the scheduler runs in verbose mode.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-04 17:36:06 +01:00
Andrea Righi
796a7ebc0e scx_rustland: provide an abstraction layer for the BPF component
Move the code responsible for interfacing with the BPF component into
its own module and provide high-level abstractions for the user-space
scheduler, hiding all the internal BPF implementation details.

This makes the user-space scheduler code much more readable and it
allows potential developers/contributors that want to focus on the pure
scheduling details to modify the scheduler in a generic way, without
having to worry about the internal BPF details.

In the future we may even decide to provide the BPF abstraction as a
separate crate, that could be used as a baseline to implement user-space
schedulers in Rust.

API overview
============

The main BPF interface is provided by BpfScheduler(). When this object
is initialized it will take care of registering and initializing the BPF
component.

Then the scheduler can use the BpfScheduler() instance to receive tasks
(in the form of QueuedTask object) and dispatch tasks (in the form of
DispatchedTask objects), using respectively the methods dequeue_task()
and dispatch_task().

The CPU ownership map can be accessed using the method get_cpu_pid();
this also allows keeping track of the idle and busy CPUs, with the
corresponding PIDs associated to them.

BPF counters and statistics can be accessed using the methods
nr_*_mut(), in particular nr_queued_mut() and nr_scheduled_mut() can be
updated to notify the BPF component if the user-space scheduler has some
pending work to do or not.

Finally the methods read_bpf_exit_kind() and report_bpf_exit_kind() can
be used respectively to read the exit code and exit message from the BPF
component, when the scheduler is unregistered.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-04 16:49:09 +01:00
Jordan Rome
5bacefcdbe Add README files for each rust scheduler
This is because each scheduler has its own Rust crate and it's better
if each one has an associated README.

https://crates.io/crates/scx_layered
2024-01-04 07:35:44 -08:00
Andrea Righi
7c11837a61 scx_rustland: make dispatcher more robust
We always try to use the current CPU (from the .dispatch() callback) to
run the user-space scheduler itself, and if the current CPU is not
usable (according to the cpumask) we just re-use the previously used
CPU.

However, if the previously used CPU is also not usable, we may trigger
the following error:

 sched_ext: runtime error (SCX_DSQ_LOCAL[_ON] verdict target cpu 4 not allowed for scx_rustland[256201])

Potentially this can also happen with any task, so improve the dispatch
logic as follows:

 - dispatch on the target CPU, if usable
 - otherwise dispatch on the previously used CPU, if usable
 - otherwise dispatch on the global DSQ

Moreover, rename dispatch_on_cpu() -> dispatch_task() for better
clarity.

This should be enough to handle all the possible decisions made by the
user-space scheduler, making the dispatcher more robust.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-04 10:21:40 +01:00
Andrea Righi
69c1dfc03c scx_rustland: remove unnecessary scx_bpf_dispatch_nr_slots() check
In the dispatch callback we can dispatch tasks to any CPU, according to
the scheduler decisions, so there's no reason to check for the available
dispatch slots in the current CPU only, to determine if we need to stop
dispatching tasks.

Since the scheduler is aware of the idle state of the CPUs (via the CPU
ownership map) it has all the information to automatically regulate the
flow of dispatched tasks and not overflow the dispatch slots, therefore
it is safe to remove this check.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-04 09:41:54 +01:00
Andrea Righi
6b1e7d927d scx_rustland: update comments and documentation in the BPF part
No functional change, only a little polishing, including updates to
comments and documentation to align with the latest changes in the code.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-04 09:40:49 +01:00
Andrea Righi
bb1c32d395 scx_rustland: avoid bypassing the scheduler with pending activities
While bypassing the user-space scheduler can provide some benefit in
reducing the scheduling overhead, doing so underneath the scheduler
while it is actively making decisions may disrupt its work and have a
negative effect on the overall system performance.

For this reason, activate the logic to bypass the user-space scheduler
only when it has no pending work.

This change makes the scheduler much more reliable: for example, on an
8-core system it is otherwise really easy to trigger short lockups or
even the sched-ext watchdog that kicks out the scheduler, by running
the following stress test:

  $ stress-ng -c 128

With this change applied the system remains reasonably responsive and
the scheduler is never disabled by the sched-ext watchdog.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 22:54:14 +01:00
Andrea Righi
5d15d34777 scx_rustland: charge additional time slice to new tasks
Instead of accounting (max_slice_ns / 2) to the vruntime of all the new
tasks, add that to their regular weighted time delta, as an additional
penalty.

This allows to distinguish new CPU intensive tasks vs new less CPU
intensive tasks, and prioritize the latter over the former.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 22:54:10 +01:00
Andrea Righi
8820af8d36 scx_rustland: enable user-space scheduler to preempt other tasks
Use SCX_ENQ_PREEMPT to dispatch the user-space scheduler. This can help
to mitigate starvation in the presence of many CPU hogs (way more than
the number of available CPUs) running in the system, by giving the
scheduler more chances to drain the tasks that may be starving in a
waiting state.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 22:54:00 +01:00
Andrea Righi
5d9182d9c3 scx_rustland: prioritize interactive workloads
The current implementation of the user-space scheduler is strongly
prioritizing newly created tasks by setting their initial vruntime to
(min_vruntime + 1); this prioritization places them ahead of other tasks
waiting to run.

While this approach is efficient for processing short-lived tasks, it
makes the scheduler vulnerable to fork-bomb attacks and significantly
penalizes interactive workloads (e.g., "foreground" applications), in
particular in the presence of background applications that are spawning
multiple tasks, such as parallel builds.

Instead of prioritizing newly created tasks, do the opposite and account
(max_slice_ns / 2) to their initial vruntime, to make sure they are not
scheduled before the other tasks that are already waiting for the CPU in
the current scheduler run.

This allows to mitigate potential fork-bomb attacks and it strongly
improves the responsiveness of interactive applications (such as UI,
audio/video streams, gaming, etc.).

With this change applied, under certain conditions, scx_rustland can
even outperform the default Linux scheduler.

For example, with a parallel kernel build (make -j32) running in the
background, I can play Terraria with a constant rate of ~30-40 fps,
while the default Linux scheduler can handle only ~20-30 fps under the
same conditions.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 18:28:54 +01:00
Andrea Righi
50b5f6e8c6 scx_rustland: do not update exiting tasks statistics
Avoid updating task information for tasks that are exiting, as they
won't be used by the user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 09:10:20 +01:00
Andrea Righi
b7a9d3775a scx_rustland: schedule non-cpu intensive kthreads normally
With commit a7677fd ("scx_rustland: bypass user-space scheduler for
short-lived kthreads") we were trying to mitigate a problem that was
actually introduced by using the wrong formula to evaluate weighted
vruntime, see commit 2900b20 ("scx_rustland: evaluate the proper
vruntime delta").

Reverting that (pseudo-)optimization doesn't seem to introduce any
performance/latency regression and it makes the code more elegant,
therefore drop it.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-03 07:46:01 +01:00
Andrea Righi
a09482f0ef scx_rustland: notify user-space scheduler about exiting tasks
Instead of implementing a garbage collector to periodically free up
exiting tasks' resources, implement a proper synchronous mechanism to
notify the user-space scheduler about the exiting tasks from the BPF
component, using the .disable() callback.

When the user-space scheduler receives a queued task with a negative CPU
number, it can then release all the resources associated with that task
(which currently includes only the entry in the TaskInfoMap for now).

This allows to get rid of the TaskInfoMap periodic garbage collector
routine, save a lot of syscalls in procfs (used to check if the pids
were still alive), and improve the overall scheduler performance.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-02 12:57:27 +01:00
Andrea Righi
280796c4bd scx_rustland: small code refactoring
No functional change, make the user-space scheduler code a bit more
readable and more Rust idiomatic.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 19:47:30 +01:00
Andrea Righi
2900b208fe scx_rustland: evaluate the proper vruntime delta
The formula used to evaluate the weighted time delta is not correct:
it's not considering the weight as a percentage. Fix this by using the
proper formula.

Moreover, also take into account the task weight when evaluating the
maximum time delta to account in vruntime, and make sure that we never
charge a task more than slice_ns.

This helps to prevent starvation of low priority tasks.
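
One plausible reading of the corrected accounting (illustrative sketch;
the real code may also weight the upper bound):

  // weight == 100 is the default priority; higher weights are charged
  // less vruntime, and the charge is never larger than slice_ns.
  fn weighted_delta(delta_ns: u64, weight: u64, slice_ns: u64) -> u64 {
      (delta_ns * 100 / weight.max(1)).min(slice_ns)
  }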

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 19:47:30 +01:00
Andrea Righi
90e92ace2d scx_rustland: prevent starvation handling short-lived tasks properly
Prevent newly created short-lived tasks from starving the other tasks
sitting in the user-space scheduler.

This can be done by setting an initial vruntime of (min_vruntime + 1)
to newly scheduled tasks, instead of min_vruntime: this ensures a
progressing global vruntime during each scheduler run, providing a
priority boost to newer tasks (that is still beneficial for potential
short-lived tasks) while also preventing excessive starvation of the
other tasks sitting in the user-space scheduler, waiting to be
dispatched.

Without this change it is really easy to create a stall condition
simply by forking a bunch of short-lived tasks in a busy loop; with
this change applied the scheduler can properly handle a constant flow
of newly created short-lived tasks, without introducing any stall.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 16:58:28 +01:00
Andrea Righi
676bd88ada bpf_rustland: do not dispatch the scheduler to the global DSQ
Never dispatch the user-space scheduler to the global DSQ, while all
the other tasks are dispatched to the local per-CPU DSQ.

Since tasks are consumed from the local DSQ first and then from the
global DSQ, we may end up starving the scheduler if we dispatch only
this one on the global DSQ.

In fact it is really easy to trigger a stall with a workload that
triggers many context switches in the system, for example (on an 8-core
system):

 $ stress-ng --cpu 32 --iomix 4 --vm 2 --vm-bytes 128M --fork 4 --timeout 30s

 ...
 09:28:11 [WARN] EXIT: scx_rustland[1455943] failed to run for 5.275s
 09:28:11 [INFO] Unregister RustLand scheduler

To prevent this from happening also dispatch the user-space scheduler on
the local DSQ, using the current CPU where .dispatch() is called, if
possible, or the previously used CPU otherwise.

Apply the same logic when the scheduler is congested: dispatch on the
previously used CPU using the local DSQ.

In this way all tasks will always get the same "dispatch priority" and
we can prevent the scheduler starvation issue.

Note that with this change in place dispatch_global() is never used and
we can get rid of it.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 14:17:23 +01:00
Andrea Righi
0fc46b2be2 scx_rustland: remove SCX_ENQ_LAST check in is_task_cpu_available()
With commit 49f2e7c ("scx_rustland: enable SCX_OPS_ENQ_LAST") we have
enabled SCX_OPS_ENQ_LAST, which seems to save some unnecessary
user-space scheduler activations when the system is mostly idle.

We are also checking for SCX_ENQ_LAST in the enqueue flags, which
apparently is not needed: we can achieve the same behavior by dropping
this check.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 14:17:23 +01:00
Andrea Righi
840260141d scx_rustland: never account more than slice_ns to vruntime
In any case make sure that we never account more than the maximum
slice_ns to a task's vruntime.

This helps to prevent starving a task for too long in the user-space
scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 14:17:23 +01:00
Andrea Righi
61c77b7d87 scx_rustland: clean up old entries in the task map
The user-space scheduler maintains an internal hash map of task
information (indexed by pid). Tasks are only added to this hash map and
never removed. After running the scheduler for a while we may
experience a performance degradation, because the hash map keeps
growing.

Therefore, implement a garbage collection mechanism to remove the old
entries from the task map (periodically removing pids that don't exist
anymore).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 14:17:23 +01:00
Andrea Righi
27739065bc scx_rustland: rename variable id -> pos for better clarity
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-01-01 14:17:23 +01:00
Andrea Righi
1cdcb8af60 scx_rustland: show the CPU where the scheduler is running
In the scheduler statistics reported periodically to stdout, instead of
showing "pid=0" for the CPU where the scheduler is running (like an idle
CPU), show "[self]".

This helps to identify exactly where the user-space scheduler is running
(when and where it migrates, etc.).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-31 17:03:30 +01:00
Andrea Righi
a7677fdf28 scx_rustland: bypass user-space scheduler for short-lived kthreads
Bypass the user-space scheduler for kthreads that still have more than
half of their runtime budget.

As they are likely to release the CPU soon, granting them a substantial
priority boost can enhance the overall system performance.

In the event that one of these kthreads turns into a CPU hog, it will
deplete its runtime budget and therefore it will be scheduled like
any other normal task through the user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-31 16:40:05 +01:00
Andrea Righi
405a11308e scx_rustland: always use dispatch_on_cpu() when possible
Use dispatch_on_cpu() when possible, so that all tasks dispatched by
the user-space scheduler get the same priority, instead of having some
of them dispatched to the global DSQ and others dispatched to the
per-CPU DSQ.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-31 16:08:31 +01:00
Andrea Righi
49f2e7ce06 scx_rustland: enable SCX_OPS_ENQ_LAST
Make sure the scheduler is not activated if we are dealing with the
last running task.

This allows us to consistently reduce scx_rustland CPU usage on systems
that are mostly idle (and avoid unnecessary power consumption).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-31 16:06:45 +01:00
Andrea Righi
0522219bea scx_rustland: prevent dispatching multiple tasks on the same idle cpu
When a task is dispatched we always try to pick the previously used CPU
(if idle) to minimize the migration overhead. Alternatively, if such CPU
is not available, we pick any other idle CPU in the system.

However, we don't update the list of idle CPUs as we dispatch tasks,
therefore we may end up sending multiple tasks to the same idle CPU (if
their previously used CPU is the same) and we may even skip some idle
CPUs completely.

Change this logic to make sure that we never dispatch multiple tasks to
the same idle CPU, by updating the list of idle CPUs as we send tasks to
the BPF dispatcher.

This also avoids dispatching tasks with a closely matched vruntime to
the same CPU, thereby negating the advantages of the vruntime ordering.
With this change in place, we ensure that tasks with a similar vruntime
are dispatched to different CPUs, leading to significant improvements in
latency performance.
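
The gist of the change, as a hypothetical sketch of the user-space
selection step (illustrative names only):

  use std::collections::BTreeSet;

  // Consume CPUs from the idle set as tasks are dispatched, so two tasks
  // are never sent to the same idle CPU in the same batch.
  fn pick_and_reserve(idle_cpus: &mut BTreeSet<u32>, prev_cpu: u32) -> Option<u32> {
      // Prefer the previously used CPU to limit migrations, otherwise
      // take any other idle CPU; in both cases reserve it.
      let cpu = if idle_cpus.contains(&prev_cpu) {
          prev_cpu
      } else {
          *idle_cpus.iter().next()?
      };
      idle_cpus.remove(&cpu);
      Some(cpu)
  }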

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-31 09:37:39 +01:00
Andrea Righi
38145f8dc9 scx_rustland: check CPU selection validity
When the scheduler decides to assign a different CPU to the task always
make sure the assignment is valid according to the task cpumask. If it's
not valid simply dispatch the task to the global DSQ.

This prevents the scheduler from exiting with errors like this:

  09:11:02 [WARN] EXIT: SCX_DSQ_LOCAL[_ON] verdict target cpu 7 not allowed for gcc[440718]

In the future we may want to move this check directly into the
user-space scheduler, but for now let's keep this check in the BPF
dispatcher as a quick fix.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-30 10:40:46 +01:00
Andrea Righi
1a2c9f5fd4 scx_rustland: improve scheduler's idle CPU selection
The current CPU selection logic in the scheduler presents some
inefficiencies.

When a task is drained from the BPF queue, the scheduler immediately
checks whether the CPU previously assigned to the task is still idle,
assigning it if it is. Otherwise, it iterates through available CPUs,
always starting from CPU #0, and selects the first idle one without
updating its state. This approach is consistently applied to the entire
batch of tasks drained from the BPF queue, resulting in all of them
being assigned to the same idle CPU (also with a higher likelihood of
allocation to lower CPU ids rather than higher ones).

While dispatching a batch of tasks to the same idle CPU is not
necessarily problematic, a fairer distribution among the list of idle
CPUs would be preferable.

Therefore change the CPU selection logic to distribute tasks equally
among the idle CPUs, still maintaining the preference for the previously
used one. Additionally, apply the CPU selection logic just before tasks
are dispatched, rather than assigning a CPU when tasks are drained from
the BPF queue. This adjustment is important, because tasks may linger in
the scheduler's internal structures for a bit and the idle state of the
CPUs in the system may change during that period.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-30 10:34:08 +01:00
Andrea Righi
e90bc923f9 scx_rustland: introduce nr_waiting concept
We want to activate the user-space scheduler only when there are pending
tasks that require scheduling actions.

To do so we keep track of the queued tasks via nr_queued, that is
incremented in .enqueue() when a task is sent to the user-space
scheduler and decremented in .dispatch() when a task is dispatched.

However, we may trigger an imbalance if the same pid is sent to the
scheduler multiple times (because the scheduler stores all the tasks by
their unique pid).

When this happens nr_queued is never decremented back to 0, leading the
user-space scheduler to constantly spin, even if there's no activity to
do.

To prevent this from happening split nr_queued into nr_queued and
nr_scheduled. The former will be updated by the BPF component every
time a task is sent to the scheduler, and it's up to the user-space
scheduler to reset the counter when the queue is fully drained. The
latter is maintained by the user-space scheduler and represents the
number of tasks that are still being processed by the scheduler and are
waiting to be dispatched.

The sum of nr_queued + nr_scheduled will be called nr_waiting and we can
rely on this metric to determine if the user-space scheduler has some
pending work to do or not.
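
Roughly, the bookkeeping looks like this (hypothetical names, only for
illustration):

  struct Counters {
      nr_queued: u64,     // bumped by the BPF component on every enqueue
      nr_scheduled: u64,  // maintained by the user-space scheduler
  }

  impl Counters {
      // nr_waiting drives the decision to (re)activate the scheduler.
      fn nr_waiting(&self) -> u64 {
          self.nr_queued + self.nr_scheduled
      }

      // Once the queue has been fully drained, user space resets
      // nr_queued and keeps nr_scheduled equal to the tasks still held
      // in its pool.
      fn queue_drained(&mut self, tasks_in_pool: u64) {
          self.nr_queued = 0;
          self.nr_scheduled = tasks_in_pool;
      }
  }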

This change makes scx_rustland more reliable and it strongly reduces
the CPU usage of the user-space scheduler by eliminating a lot of
unnecessary activations.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-29 21:15:04 +01:00
Andrea Righi
d67dfe50f9 scx_rustland: treat the CPU running the user-space scheduler as idle
Considering the CPU where the user-space scheduler is running as busy
doesn't provide any benefit, as the scheduler consistently dispatches a
number of tasks equal to the number of idle CPUs and then yields
(therefore its own CPU should be considered idle).

This also allows to reduce the overall user-space scheduler CPU
utilization, especially when the system is mostly idle, without
introducing any measurable performance regression.

Measuring the average CPU utilization of a (mostly) idle system over a
time period of 60 sec:

 - without this patch: 5.41% avg cpu util
 - with this patch:   2.26% avg cpu util

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-29 21:14:58 +01:00
Andrea Righi
cc17780c24 scx_rustland: add documentation to scheds/rust/README.md
Add documentation for scx_rustland to the README.md files of the Rust
schedulers.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-29 09:13:54 +01:00
Andrea Righi
6df4d7e0c6 scx_rustland: introduce an update_idle() callback
Move the logic to activate the userspace scheduler to an update_idle()
callback, which is called when the CPU is about to go idle.

This disables the built-in idle tracking mechanism, so it allows to rely
completely on the internal CPU ownership logic (via get_cpu_owner() and
set_cpu_owner()) and it also allows to share the idle state with the
user-space scheduler via the BPF_MAP_TYPE_ARRAY cpu_map.

Moreover, when the user-space scheduler is activated, kick the idle cpu
to trigger immediate dispatch and avoid bubbles in the scheduling
pipeline.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-28 14:41:08 +01:00
Andrea Righi
1baae38e7f Revert "scx_rustland: always dispatch kthreads on the local CPU"
This reverts commit 9237e1d ("scx_rustland: always dispatch kthreads on
the local CPU").

Do not always prioritize all kthreads: we may have unbound workqueue
workers that can consume a lot of CPU cycles (e.g., encryption workers),
so we definitely want to apply the scheduling policy to those.

Therefore, restore the old behavior to prioritize only per-CPU kthreads.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-28 14:40:03 +01:00
Andrea Righi
9237e1d835 scx_rustland: always dispatch kthreads on the local CPU
Adding extra overhead to any kthread can potentially slow down the
entire system, so make sure this never happens by dispatching all
kthreads directly on the same local CPU (not just the per-CPU kthreads),
bypassing the user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-27 14:15:46 +01:00
Andrea Righi
f0ece7af6b scx_rustland: wake-up user-space scheduler when a CPU is released
Trigger the user-space scheduler only upon a task's CPU release event
(avoiding its activation during each enqueue event) and only if there
are tasks waiting to be processed by the user-space scheduler.

This should save unnecessary calls to the user-space scheduler, reducing
the overall overhead of the scheduler.

Moreover, rename nr_enqueues to nr_queued and store the amount of tasks
currently queued to the user-space scheduler (that are waiting to be
dispatched).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-27 14:15:46 +01:00
Andrea Righi
7d01be9568 scx_rustland: provide get/set_cpu_owner()
Provide the following primitives to get and set CPU ownership in the BPF
part. This improves code readability and these primitives can be used by
the BPF part as a baseline to implement a better CPU idle tracking in
the future.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-27 14:15:39 +01:00
Andrea Righi
cd7e1c6248 scx_rustland: clarify BPF / user-space interlocking
BPF doesn't have a full memory model yet, and while strict atomicity
might not be necessary in this context, it is advisable to enhance
clarity in the interlocking model.

To achieve this, provide the following primitives to operate on
usersched_needed:

  static void set_usersched_needed(void)

  static bool test_and_clear_usersched_needed(void)

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-26 14:28:24 +01:00
Andrea Righi
e038a530ae scx_rustland: dispatch tasks in batch
Dispatch tasks in a batch equal to the amount of idle CPUs in the
system.
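
For illustration (hypothetical helper, not the actual code), the
batching boils down to:

  // Drain at most as many tasks as there are idle CPUs in each dispatch
  // round, leaving the rest in the scheduler's task pool.
  fn dispatch_batch(task_pool: &mut Vec<u32>, nr_idle_cpus: usize) -> Vec<u32> {
      let n = nr_idle_cpus.min(task_pool.len());
      task_pool.drain(..n).collect()
  }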

This allows to reduce the pressure on the dispatcher queues, improving
the effectiveness of the scheduler (by having more tasks sitting in the
scheduler task pool) and mitigating potential priority inversion issues.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-23 10:44:03 +01:00
Andrea Righi
4d98862674 scx_rustland: expose CPU information to the user-space scheduler
Provide an interface for the BPF dispatcher and user-space scheduler to
share CPU information. This information can empower the user-space
scheduler to make more informed decisions and enable the implementation
of a broader range of scheduling policies.

With this change the BPF dispatcher provides a CPU map (one entry per
CPU) that stores the pid that is running on each CPU (0 if the CPU is
idle). The CPU map is updated by the BPF dispatcher in the .running()
and .stopping() callbacks.

The dispatcher then sends to the user-space scheduler a suggestion of
the candidate CPU for each task that needs to run (that is always the
previously used CPU), along with all the task's information.

The user-space scheduler can decide to confirm the selected CPU or to
choose a different one, using all the shared CPU information.

Lastly, the selected CPU is communicated back to the dispatcher along
with all the task's information and the BPF dispatcher takes care of
executing the task on the selected CPU, eventually triggering a
migration.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-23 10:38:56 +01:00
Andrea Righi
968ac80a3f scx_rustland: handle graceful vs non-graceful exit
Do not report an exit error message if it's empty. Moreover, distinguish
between a graceful exit vs a non-graceful exit.

In general, try to follow the behavior of user_exit_info.h for the C
schedulers.

NOTE: in the future the whole exit handling probably can be moved to a
more generic place (scx_utils) to prevent code duplication across
schedulers and also to prevent small inconsistencies like this one.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-22 19:44:14 +01:00
Andrea Righi
f7f0e3236c scx_rustland: rename from scx_rustlite
Rename scx_rustlite to scx_rustland to better represent the mirroring of
scx_userland (in C), but implemented in Rust.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-22 00:20:14 +01:00
Andrea Righi
086c6dffc8 scx_rustlite: simple user-space scheduler written in Rust
This scheduler is made of a BPF component (dispatcher) that implements
the low level sched-ext functionalities and a user-space counterpart
(scheduler), written in Rust, that implements the actual scheduling
policy.

The main goal of this scheduler is to be easy to read and well
documented, so that newcomers (e.g., students, researchers, junior
devs, etc.) can use this as a template to quickly experiment with
scheduling theory.

For this reason the design of this scheduler is mostly focused on
simplicity and code readability.

Moreover, the BPF dispatcher is completely agnostic of the particular
scheduling policy implemented by the user-space scheduler. For this
reason developers that are willing to use this scheduler to experiment
with scheduling policies should be able to simply modify the Rust
component, without having to deal with any internal kernel / BPF
details.

Future improvements:

 - Transfer the responsibility of determining the CPU for executing a
   particular task to the user-space scheduler.

   Right now this logic is still fully implemented in the BPF part and
   the user-space scheduler can only decide the order of execution of
   the tasks, that significantly restricts the scheduling policies that
   can be implemented in the user-space scheduler.

 - Experiment with the possibility of sending tasks from the user-space
   scheduler to the BPF dispatcher using a batch size, instead of
   draining the task queue completely and sending all the tasks at once
   every single time.

   A batch size should help to reduce the overhead and it should also
   help to reduce the wakeups of the user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2023-12-21 18:53:30 +01:00
Jordan Rome
e9a9d32ab6 Restructure scheds folder names
- combine c and kernel-examples as it's confusing to have both
- rename 'rust-user' and 'c-user' to just 'rust' and 'c', which is simpler
- update and fix sync-to-kernel.sh
2023-12-17 13:14:31 -08:00