JakeHillion/scx

mirror of https://github.com/JakeHillion/scx.git synced 2024-12-03 06:17:11 +00:00

Author	SHA1	Message	Date
David Vernet	c71946e16a	Merge pull request #162 from jordalgo/cargo-gate Gate cargo build options behind 'enable_rust'	2024-02-26 11:07:26 -06:00
David Vernet	8b04a2687f	rusty: Use new infeasible crate Now that we have a new 'infeasible' crate that abstracts the logic for implementing the infeasible weights solution, let's update rusty to use it. Signed-off-by: David Vernet <void@manifault.com>	2024-02-26 10:51:54 -06:00
David Vernet	b1dc37889f	infeasible: Add a new infeasible crate for load balancing We want to spare every scheduler implementation from having to implement the solution to the infeasible weights problem, but we also want to keep enough flexibility that not every program has to have the same partition of scheduling domains, etc. To enable this, a new infeasible crate is added which encapsulates all of the logic for taking a duty cycle and weight and performing the necessary math to adjust for infeasibility. Signed-off-by: David Vernet <void@manifault.com>	2024-02-26 10:51:52 -06:00
Jordan Rome	9cedac45ee	Gate cargo build options behind 'enable_rust' This avoids requiring cargo when building the C schedulers.	2024-02-25 17:51:43 -08:00
David Vernet	869b1c2a4c	Merge pull request #160 from sched-ext/ci-exclude-example-scheds ci: exclude scx_qmap and scx_userland from testing	2024-02-24 11:55:30 -06:00
Andrea Righi	d1cfe1765d	ci: exclude scx_qmap and scx_userland from testing These two schedulers are provided mostly as examples / PoC, so we should exclude them from our periodic testing, to prevent triggering false positives in our CI. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-24 10:24:06 +01:00
Tejun Heo	77ee1cd41f	Merge pull request #158 from sched-ext/htejun sync-to-kernel.sh: Sync scx_central and scx_flatcg	2024-02-23 14:21:56 -10:00
Tejun Heo	7e578bb034	sync-to-kernel.sh: Sync scx_central and scx_flatcg They got added back as example schedulers in the kernel tree.	2024-02-23 14:21:03 -10:00
Tejun Heo	be9708aade	Merge pull request #157 from sched-ext/htejun common: Cosmetic change for consistency	2024-02-23 10:55:44 -10:00
Tejun Heo	b997b4ad17	common: Cosmetic change for consistency	2024-02-23 10:55:00 -10:00
David Vernet	70960ab162	Merge pull request #155 from sched-ext/rustland_refactor scx_rustland: cpu topology refactoring	2024-02-23 14:27:48 -06:00
Tejun Heo	948b2f0fc8	Merge pull request #156 from sched-ext/htejun scheds/sync-to-kernel.sh: Warn and skip if destination file is missing instead of failing	2024-02-23 09:32:38 -10:00
Tejun Heo	c789b4dd66	scheds/sync-to-kernel.sh: Warn and skip if destination file is missing instead of failing We aren't gonna sync all headers to the kernel tree.	2024-02-23 09:28:41 -10:00
David Vernet	87eab38506	rustland: Update rustland to use topology.rs The new topology crate allows us to replace the custom rustland topology logic with the logic in the topology crate itself. Signed-off-by: David Vernet <void@manifault.com>	2024-02-23 13:09:06 -06:00
David Vernet	f85cde7038	topology: Add maps to cores and cpus from root Topology object For convenience, let's provide callers with a way to easily look up cores and CPUs from the root topology object. Signed-off-by: David Vernet <void@manifault.com>	2024-02-23 13:09:02 -06:00
David Vernet	9904d03289	Merge pull request #154 from sched-ext/topology_refactor Topology refactor	2024-02-23 11:02:33 -06:00
David Vernet	43624a87ce	rusty: Use new topology crate Now that we have this new Topology crate, let's update Rusty to use it instead of using the old one. Signed-off-by: David Vernet <void@manifault.com>	2024-02-23 10:39:55 -06:00
David Vernet	c5a3b83bbd	topology: Add new topology crate The topology.rs crate is insufficiently generic, and reflects implementation details of scx_rusty more than it provides generic use cases for modeling a host's topology. This adds a new topology2.rs crate that will replace topology.rs. We have this as an intermediate commit so that we don't bundle updating scx_rusty with adding this crate. Signed-off-by: David Vernet <void@manifault.com>	2024-02-23 10:39:00 -06:00
David Vernet	608df7f96f	Merge pull request #152 from sched-ext/topology_refactor Add cpumask.rs crate	2024-02-22 18:07:55 -06:00
David Vernet	a2f531e429	topology: Refactor topology.rs to use cpumask crate Now that we have cpumask.rs, we can remove some logic from topology.rs and have it create and use Cpumasks. Signed-off-by: David Vernet <void@manifault.com>	2024-02-22 17:25:44 -06:00
David Vernet	bd15eb8e41	cpumask: Add new Cpumask crate Let's add a Cpumask trait that schedulers can use to avoid all having to deal directly with BitVec and the like. Signed-off-by: David Vernet <void@manifault.com>	2024-02-22 17:24:26 -06:00
Tejun Heo	4dc77f8ddf	Merge pull request #149 from davemarchevsky/davemarchevsky_nice_equals scx_layered: Add MATCH_NICE_EQUALS match kind	2024-02-22 06:38:17 -10:00
Dave Marchevsky	9f510f18cd	scx_layered: Add MATCH_NICE_EQUALS match kind I have a use case where specific nice values are used to bucket tasks into groups that are handled separately by different `scx_layered` policies, with no implications of relative priority between niceness X, X + 1, X - 1, etc. In other words, nicevals are used as simple tags in this scenario. If we wanted to treat a specific niceness this way, e.g. `11`, we could do so with AND'd MATCH_NICE_{ABOVE,BELOW} like so: `"matches": [[{ "NiceAbove": 10 }, { "NiceBelow": 12 }]]`. But this is unnecessarily verbose and doesn't communicate the intent of the match very well. Adding a `NiceEquals` matcher simplifies the config and makes intent obvious: `"matches": [[{ "NiceEquals": 11 }]]`. This PR adds support for such a matcher. Also, rename `layer_match.nice_above_or_below` to just `layer_match.nice`, as the former doesn't describe the newly-added matcher's use of the field. It's still obvious that `layer_match.nice` is relevant to MATCH_NICE_{ABOVE, BELOW, EQUALS}. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>	2024-02-22 04:08:07 -08:00
Tejun Heo	b77996f9c5	Merge pull request #150 from sched-ext/refresh_after_attach Refresh after attach	2024-02-21 21:51:02 -10:00
David Vernet	615b594e1c	layered: Don't refresh cpumasks before attaching As mentioned in the previous commit, for some reason we're sometimes (non-deterministically) not seeing the updated cpumask / layer values in BPF if we initialize the cpumasks here before attaching. Let's undo this for now to avoid observing buggy behavior, until we figure it out. Signed-off-by: David Vernet <void@manifault.com>	2024-02-21 19:19:45 -06:00
David Vernet	68d317079a	Revert "layered: Set layered cpumask in scheduler init call" This reverts commit `56ff3437a2`. For some reason we seem to be non-deterministically failing to see the updated layer values in BPF if we initialize before attaching. Let's just undo this specific part so that we can unblock this being broken, and we can figure it out async. Signed-off-by: David Vernet <void@manifault.com>	2024-02-21 19:17:19 -06:00
David Vernet	ebce76b0cb	Merge pull request #148 from sched-ext/layered_fixes layered: Fix static configuration, and dispatch for Grouped layers	2024-02-21 16:12:12 -06:00
David Vernet	31df8fbd09	layered: Consume from layer with cpumask in layered_dispatch Currently, in layered_dispatch, we do the following: 1. Iterate over all preempt=true layers, and first try to consume from them. 2. Iterate over all confined layers, and consume from them if we find a layer with a cpumask that contains the consuming CPU. 3. Iterate over all grouped and open layers and consume from them in ordered sequence. In (2), we're only iterating over confined layers, but we should also be iterating over grouped layers. Otherwise, despite a consuming CPU being allocated to a specific grouped layer, the CPU will consume from whichever grouped or open layer has a task that's ready to run. Signed-off-by: David Vernet <void@manifault.com>	2024-02-21 15:38:23 -06:00
David Vernet	56ff3437a2	layered: Set layered cpumask in scheduler init call In layered_init, we're currently setting all bits in every layer's cpumask, and then asynchronously updating the cpumasks at a later time to reflect their actual values at runtime. Now that we're updating the layered code to initialize the cpumasks before we attach the scheduler, we can instead have the init path actually refresh and initialize the cpumasks directly. Signed-off-by: David Vernet <void@manifault.com>	2024-02-21 15:38:23 -06:00
David Vernet	1f834e7f94	layered: Initialize layers before attaching scheduler We currently have a bug in layered wherein we could fail to propagate layer updates from user space to kernel space if a layer is never adjusted after it's first initialized. For example, in the following configuration: [ { "name": "workload.slice", "comment": "main workload slice", "matches": [ [ { "CgroupPrefix": "workload.slice/" } ] ], "kind": { "Grouped": { "cpus_range": [30, 30], "util_range": [ 0.0, 1.0 ], "preempt": false } } }, { "name": "normal", "comment": "the rest", "matches": [ [] ], "kind": { "Grouped": { "cpus_range": [2, 2], "util_range": [ 0.0, 1.0 ], "preempt": false } } } ] Both layers are static, and need only be resized a single time, so the configuration would never be propagated to the kernel due to us never calling update_bpf_layer_cpumask(). Let's instead have the initialization propagate changes to the skeleton before we attach the scheduler. This has the advantage both of fixing the bug mentioned above where a static configuration is never propagated to the kernel, and that we don't have a short period when the scheduler is first attached where we don't make optimal scheduling decisions due to the layer resizing not being propagated. Signed-off-by: David Vernet <void@manifault.com>	2024-02-21 15:38:21 -06:00
David Vernet	59bd206559	Merge pull request #146 from sched-ext/rusty_topology	2024-02-21 08:33:25 -06:00
David Vernet	49065de8df	scx: Demote panic! to warn! in topology crate We currently panic! if we're building a Topology that detects more than two siblings on a physical core. This can and will likely happen on multi-socket machines. Given that we're planning to add support for detecting NUMA nodes soon, let's just demote the panic! to a warn!. Signed-off-by: David Vernet <void@manifault.com>	2024-02-20 21:01:17 -06:00
Tejun Heo	22d635c385	Merge pull request #141 from jordalgo/rusty-logging Add libbpf logging to rust schedulers	2024-02-20 13:52:39 -10:00
Tejun Heo	4c63c7884c	Merge pull request #145 from sched-ext/rustland-cmdline-options scx_rustland: additional command line options	2024-02-20 13:51:59 -10:00
Andrea Righi	80de48ec83	scx_rustland: introduce --builtin-idle Add a command line option to enable/disable the sched-ext built-in idle selection logic in the user-space scheduler. With this option the user-space scheduler will try to dispatch tasks on the CPU selected during the .select_cpu() phase (using the built-in idle selection logic). Without this option the user-space scheduler will try to dispatch tasks to the first CPU available. The former can be useful to improve throughput, since tasks are more likely to stick on the same CPU, while the latter can provide better system responsiveness, especially when the system is significantly busy. Given that, by default, tasks can be dispatched directly bypassing the user-space scheduler if an idle CPU is found during .select_cpu(), the user-space scheduler is primarily engaged only when the system is busy (no idle CPUs are available). Under these circumstances, it is typically more efficient to dispatch tasks on the first available CPU. Hence, the default behavior is to ignore built-in idle selection logic in the user-space scheduler. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-21 00:25:14 +01:00
Andrea Righi	e487d71032	scx_rustland: simplify CPU selection by relying on built-in idle selection Checking if a CPU is idle or busy in the user-space scheduler is a bit redundant, considering that we also rely on the built-in idle selection logic in the BPF part. Therefore get rid of the additional idle selection logic in the user-space scheduler and rely on the built-in idle selection. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-21 00:25:14 +01:00
Andrea Righi	2cd1d4b684	scx_rustland: introduce --full-user Introduce an option to send all scheduling events and actions to user-space, disabling any form of in-kernel optimization. Enabling this option will likely make the system less responsive (but more predictable in terms of performance) and it can be useful for debugging purposes. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-21 00:25:14 +01:00
Jordan Rome	7c32acece0	Add libbpf logging to the rust schedulers This is to get better logs when failing to load, attach, etc.	2024-02-20 15:17:10 -08:00
David Vernet	8814286c40	Merge pull request #143 from sched-ext/rusty_topology rust: Add topology module to utils crate	2024-02-20 15:15:56 -06:00
David Vernet	ef8aa9ea31	add documentation Signed-off-by: David Vernet <void@manifault.com>	2024-02-20 14:57:09 -06:00
David Vernet	8aba090d4f	rust: Add topology module to utils crate scx_rusty has logic in the scheduler to inspect the host to automatically build scheduling domains across every L3 cache. This would be generically useful for many different types of schedulers, so let's add it to the scx_utils crate so it can be used by others. Signed-off-by: David Vernet <void@manifault.com>	2024-02-20 14:57:09 -06:00
Tejun Heo	e9f7477770	Merge pull request #144 from sched-ext/rustland-fix-mem-alignment scx_rustland: prevent misaligned pointer dereference	2024-02-20 08:15:38 -10:00
Andrea Righi	7ff06a6ff0	scx_rustland: prevent misaligned pointer dereference The buffer used to store struct queued_task_ctx items fetched from the BPF ring buffer needs to be aligned to the architecture register size, otherwise we may hit misaligned pointer dereference issues, such as: `thread 'main' panicked at src/bpf.rs:162:43: misaligned pointer dereference: address must be a multiple of 0x8 but is 0x56516a51e004 note: run with RUST_BACKTRACE=1 environment variable to display a backtrace`. Prevent this by making sure the buffer is always aligned to 64 bits. Fixes: `93dc615` ("scx_rustland: use a ring buffer for queued tasks") Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-20 19:08:38 +01:00
Tejun Heo	3535e55985	Merge pull request #142 from sched-ext/rustland-ringbuf scx_rustland: improve kernel/user-space communication	2024-02-20 06:30:25 -10:00
Andrea Righi	93dc615653	scx_rustland: use a ring buffer for queued tasks Switch from a BPF_MAP_TYPE_QUEUE to a BPF_MAP_TYPE_RINGBUF to store the tasks that need to be processed by the user-space scheduler. A ring buffer saves a lot of memory copies and syscalls, since the memory is directly shared between the BPF and the user-space components. Performance profile before this change: 2.44% [kernel] [k] __memset; 2.19% [kernel] [k] __sys_bpf; 1.59% [kernel] [k] __kmem_cache_alloc_node; 1.00% [kernel] [k] _copy_from_user. After this change: 1.42% [kernel] [k] __memset; 0.14% [kernel] [k] __sys_bpf; 0.10% [kernel] [k] __kmem_cache_alloc_node; 0.07% [kernel] [k] _copy_from_user. The overhead of both sys_bpf() and copy_from_user() is reduced by a factor of ~15x now (only the dispatch path is using sys_bpf() now). NOTE: despite being very effective, the current implementation is a bit of a hack. This is because the present ring buffer API exclusively permits consumption in a greedy manner, where multiple items can be consumed simultaneously. However, libbpf-rs does not provide precise information regarding the exact number of items consumed. By utilizing a more refined libbpf-rs API [1] we may be able to improve this code a bit. Moreover, libbpf-rs doesn't provide an API for the user_ring_buffer, so at the moment there's not a trivial way to apply the same change to the dispatched tasks. However, just with this change applied, the overhead of sys_bpf() and copy_from_user() is already minimal, so we won't get much benefit by changing the dispatch path to use a BPF ring buffer. [1] https://github.com/libbpf/libbpf-rs/pull/680 Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-20 12:30:22 +01:00
Andrea Righi	04685e633f	scx_rustland: avoid memory copies while accessing cpu_map Instead of using a BPF_MAP_TYPE_ARRAY to store which tasks are running on which CPU we can simply use a global array, mapped in the user-space address space. In this way we can avoid a lot of memory copies and calls to sys_bpf(), significantly reducing the scheduler's overhead. Keep in mind that we don't need to be 100% correct while accessing this information, so we can accept some fuzziness in order to significantly reduce the scheduler's overhead. Performance profile before this change: 5.52% [kernel] [k] __sys_bpf; 4.84% [kernel] [k] __kmem_cache_alloc_node; 4.71% [kernel] [k] map_lookup_elem; 4.10% [kernel] [k] _copy_from_user; 3.51% [kernel] [k] bpf_map_copy_value; 3.12% [kernel] [k] check_heap_object. After this change: 2.20% [kernel] [k] __sys_bpf; 1.91% [kernel] [k] map_lookup_and_delete_elem; 1.60% [kernel] [k] __kmem_cache_alloc_node; 1.10% [kernel] [k] _copy_from_user; 0.12% [kernel] [k] check_heap_object; n/a bpf_map_copy_value; n/a map_lookup_elem. With this change we can reduce the overhead of sys_bpf() by ~2x and the overhead of copy_from_user() by ~4x. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-20 12:30:16 +01:00
Tejun Heo	a193fd01d1	Merge pull request #140 from sched-ext/update-readme README: drop reference to old ABI /sys/kernel/debug/sched/ext	2024-02-19 05:45:57 -10:00
Andrea Righi	9bbc66b537	README: drop reference to old ABI /sys/kernel/debug/sched/ext We have moved generic sched-ext status files outside of debugfs, so that kernels without SCHED_DEBUG enabled can also access this information. However, the README.md file is still showing the old ABI in an example. Replace it with the new sysfs interface to avoid confusion. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-19 14:14:18 +01:00
Tejun Heo	d790bdb14c	Merge pull request #138 from sirlucjan/0.1.7 Bump to 0.1.7	2024-02-12 22:08:10 -10:00
Piotr Gorski	cb11aecbd0	Bump to 0.1.7 Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>	2024-02-13 08:20:38 +01:00

464 Commits