JakeHillion/scx

mirror of https://github.com/JakeHillion/scx.git synced 2024-11-26 11:30:22 +00:00

Author	SHA1	Message	Date
Tejun Heo	3e7ef35649	Merge pull request #250 from multics69/lavd-issue-234 scx_lavd: replenish time slice at ops.running() only when necessary	2024-04-29 09:01:04 -10:00
Tejun Heo	5b7b7d5193	Merge pull request #247 from multics69/lavd-issue-244 scx_lavd: always inline submit_task_ctx to make the verifier happy	2024-04-29 07:53:38 -10:00
Changwoo Min	5f63e0ca30	scx_lavd: replenish time slice at ops.running() only when necessary The current code replenishes the task's time slice whenever the task becomes ops.running(). However, there is a case where such behavior can starve the other tasks, causing the watchdog timeout error. One such case (if not the only one) is when a running task is preempted by a higher scheduler class (e.g., RT, DL). In such a case, the task will transition in a cycle of ops.running() -> ops.stopping() -> ops.running() -> etc. Whenever it starts running again, it will be placed at the head of the local DSQ and ops.running() will renew its time slice. Hence, in the worst case, the task can run forever since its time slice is never exhausted. The fix is to assign the time slice only once by checking whether the time slice has already been calculated. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-04-29 12:13:31 +09:00
Andrea Righi	cabde30736	scx_utils: bump up version to 0.8.0 Bump up scx-utils version to provide the new scx_utils::TopologyMap. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-28 21:01:16 +02:00
Andrea Righi	5effb4fc4c	scx_rustland: bump up version to 0.0.5 Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-28 12:01:38 +02:00
Andrea Righi	0785246ee2	scx_rustland: provide --version option Provide a command line option to print the version of the scheduler and the scx_rustland_core crate. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-28 12:01:38 +02:00
Andrea Righi	fb2f5c240e	scx_rustland_core: bump up version to 0.3 Given that rustland_core now supports task preemption and it has been tested successfully, it's worthwhile to cut a new version of the crate. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-28 12:01:38 +02:00
Andrea Righi	905960f752	scx_lavd: use c_char consistently In Rust c_char can be aliased to i8 or u8, depending on the particular target architecture. For example, trying to build scx_lavd on ppc64 triggers the following error: error[E0308]: mismatched types --> src/main.rs:200:38 | 200 | let c_tx_cm: *const c_char = (&tx.comm as *const [i8; 17]) as *const i8; | ------------- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `*const u8`, found `*const i8` | | | expected due to this | = note: expected raw pointer `*const u8` found raw pointer `*const i8` To fix this, consistently use c_char instead of assuming it corresponds to i8. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-27 17:21:19 +02:00
Changwoo Min	f470b1aa13	scx_lavd: always inline submit_task_ctx to make the verifier happy In _some_ kernel versions, loading scx_lavd fails with an error of "bpf_rcu_read_unlock is missing". The usage of bpf_rcu_read_lock/unlock() in proc_dump_all_tasks() is correct but the bpf verifier still thinks bpf_rcu_read_unlock() is missing. The most plausible reason so far is that the problematic kernel does not have commit 6fceea0fa59f ("bpf: Transfer RCU lock state between subprog calls"), failing inter-procedural analysis between proc_dump_all_tasks() and submit_task_ctx(). Thus, we force inline submit_task_ctx() (no inter-procedural analysis by the verifier is necessary) for the time being. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-04-28 00:11:38 +09:00
Changwoo Min	d0d0a18b10	scx_lavd: fix copyright information Correct the copyright and author information Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-04-26 16:36:58 +09:00
Andrea Righi	973aded5a8	Merge pull request #238 from sched-ext/rustland-reduce-topology-overhead scx_rustland: reduce overhead by caching host topology	2024-04-24 22:24:23 +02:00
David Vernet	5ba137e8c9	layered: Make layered backwards compat with cpufreq Only the very newest kernels support scx_bpf_cpuperf_set(). Let's update scx_layered to accommodate older kernels as well. Signed-off-by: David Vernet <void@manifault.com>	2024-04-24 14:01:51 -05:00
Tejun Heo	9a9b4dd23e	Merge pull request #239 from hodgesds/cpufreq_helpers Add CPU frequency related helpers and extend scx_layered	2024-04-24 07:22:15 -10:00
Andrea Righi	5302ff1cdc	scx_rustland: use TopologyMap for efficient CPU topology iteration Looking at perf top it seems that the scheduler can spend a significant amount of time iterating over the CPU topology/cpumask information, especially when the system is running a large number of tasks: 2.57% scx_rustland [.] <scx_utils::cpumask::CpumaskIntoIterator as core::iter::traits::iterator::Iterator>::next Considering that scx_rustland doesn't support CPU hotplugging yet (it requires a full restart to properly handle CPU hotplug events), we can completely avoid this overhead by caching a TopologyMap object at the beginning, when the scheduler starts, instead of constantly re-evaluating the CPU topology information. This reduces the scheduler overhead by ~5% CPU utilization under heavy load conditions (from ~65% -> ~60%, according to top). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-24 17:08:06 +02:00
Daniel Hodges	32e97bf4d5	Adds CPU frequency related helpers and extend scx_layered This change adds `scx_bpf_cpuperf_cap`, `scx_bpf_cpuperf_cur` and `scx_bpf_cpuperf_set` definitions that were recently introduced into [`sched_ext`](https://github.com/sched-ext/sched_ext/pull/180). It adds a `perf` field to `scx_layered` to allow for controlling performance per layer. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>	2024-04-24 07:27:52 -07:00
David Vernet	a8daf372b2	Merge pull request #241 from sched-ext/cpumask_efficient topology: Don't allocate on calls to span()	2024-04-24 09:21:15 -05:00
David Vernet	24c248eebb	layered: Add support for filtering on process name If a library creates threads, those threads will often have the same name. If two different processes of different priority both use a library, it may be that we want the library's threads in each process to be put into different layers. To support this, let's add the ability to filter not only by task name, but also by process name via the task thread group leader's comm. Tested by creating two executables named "foo" and "bar", which both spawn a bunch of tasks named "exp_worker" that spin until being interrupted. With this config: https://pastebin.com/Uz2phzxQ, the tasks were correctly matched to the expected layers. Signed-off-by: David Vernet <void@manifault.com>	2024-04-23 23:12:37 -05:00
David Vernet	c187c65702	topology: Don't allocate on calls to span() We're currently cloning cpumasks returned by calls to {Core, Cache, Node, Topology}::span(). If a caller needs to clone it, they can. Let's not penalize the callers that just want to query the underlying cpumask. Signed-off-by: David Vernet <void@manifault.com>	2024-04-23 22:59:42 -05:00
David Vernet	a998fb7d01	layered: Clarify f: and file: prefix behavior Some people have expressed confusion at this behavior. Let's be a bit more explicit in the documentation. Signed-off-by: David Vernet <void@manifault.com>	2024-04-23 20:39:28 -05:00
Andrea Righi	fbe9a80af8	scx_rustland: introduce --no-preemption Provide a run-time option to disable task preemption. This option can be used to improve the throughput of the CPU-intensive tasks while still providing a good level of responsiveness in the system. By default preemption is enabled, to provide a higher level of responsiveness to the interactive tasks. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-23 07:13:30 +02:00
Andrea Righi	0ffaaac6db	scx_rustland: enable preemption Use the new scx_rustland_core dispatch flag RL_PREEMPT_CPU to allow interactive tasks to preempt other tasks with scx_rustland. If the built-in idle selection logic is enforced (option `-i`), the scheduler prioritizes keeping tasks on the target CPU designated by this logic. With preemption enabled, these tasks have a higher likelihood of reusing their cached working set, potentially improving performance. Alternatively, when tasks are dispatched to the first available CPU (default behavior), interactive tasks benefit from running more promptly by kicking out other tasks before their assigned time slice expires. This potentially allows the default time slice to be increased in the future, to improve the overall throughput in the system and, at the same time, still maintain a good level of responsiveness, because interactive tasks are now able to run pretty much immediately, independently of the remaining time slice of the other tasks that are contending for the CPUs in the system. = Results = Measuring the performance of the usual benchmark "playing a video game while running a parallel kernel build in background" seems to give around 2-10% boost in the fps with preemption enabled, depending on the particular video game. Results were obtained running a `make -j32` kernel build on an AMD Ryzen 7 5800X 8-Cores 16GB RAM, while testing video games such as Baldur's Gate 3 (with a solid +10% fps), Counter Strike 2 (around +5%) and Team Fortress 2 (+2% boost). Moreover, some WebGL applications (such as https://webglsamples.org/aquarium/aquarium.html) seem to benefit even more with preemption enabled, providing up to a +15% fps boost. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-23 07:13:30 +02:00
Andrea Righi	6d2aac1591	scx_rustland_core: introduce dispatch flags Reserve some bits of the `cpu` attribute of a task to store special dispatch flags. Initially, let's introduce just RL_CPU_ANY to replace the special value NO_CPU, indicating that the task can be dispatched on any CPU, specifically the first CPU that becomes available. This makes it possible to keep the CPU value assigned by the builtin idle selection logic, which can potentially be used later for further optimizations. Moreover, having the possibility to specify dispatch flags gives more flexibility and allows new scheduling features to be mapped to such flags. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-23 07:13:30 +02:00
takase1121	3e12676ca2	scheds-rust: add explanation for chaining schedulers	2024-04-23 08:30:38 +08:00
takase1121	5d20f89a87	scheds-rust: build rust schedulers in sequence	2024-04-23 08:06:27 +08:00
David Vernet	5f1eac85ff	layered: Fix init_task When I transitioned layered to using task local storage, I messed up initializing the task ctx, not realizing we previously had a separate variable that was initializing the hashmap entry. We need to initialize the task's layer to -11, and also set refresh_layer to 1. Signed-off-by: David Vernet <void@manifault.com>	2024-04-18 09:44:32 -05:00
David Vernet	45589cd0f7	lavd: Fix a few typos Noticed a few typos. Let's fix em up Signed-off-by: David Vernet <void@manifault.com>	2024-04-17 08:17:52 -05:00
David Vernet	ffced1f615	rusty: Remove explicit padding As of libbpf-rs 0.23.0 (which contains commit `9d9e979fcf`), libbpf-rs now generates Rust structs that honor padding. We can therefore remove the custom padding in scx_rusty's struct pcpu_ctx. For example, here is the generated pub struct pcpu_ctx: pub struct pcpu_ctx { pub dom_rr_cur: u32, pub dom_id: u32, pub nr_node_doms: u32, pub node_doms: [u32; 64], pub __pad_268: [u8; 52], } And here is the matching struct in the BPF object file: struct pcpu_ctx { u32 dom_rr_cur; /* 0 4 */ u32 dom_id; /* 4 4 */ u32 nr_node_doms; /* 8 4 */ u32 node_doms[64]; /* 12 256 */ /* size: 320, cachelines: 5, members: 4 */ /* padding: 52 */ } __attribute__((__aligned__(64))); Signed-off-by: David Vernet <void@manifault.com>	2024-04-12 13:52:13 -05:00
David Vernet	e032ee7cc0	rusty: Add lookup_pcpu_ctx() helper Getting rid of more boilerplate Signed-off-by: David Vernet <void@manifault.com>	2024-04-11 19:30:23 -05:00
David Vernet	885a9fd7da	rusty: Make lookup_task_ctx() static It doesn't need to be a global prog. Let's make it static. Signed-off-by: David Vernet <void@manifault.com>	2024-04-11 19:30:23 -05:00
David Vernet	0ff73754cf	rusty: Add create_save_cpumask() helper We have a lot of boilerplate code where we create a cpumask, initialize it, and then bpf_kptr_xchg() it into the map. In an effort to slightly reduce the amount of boilerplate, let's create a helper that can alleviate some of it. Signed-off-by: David Vernet <void@manifault.com>	2024-04-11 19:30:21 -05:00
David Vernet	e27d5b4e67	rusty: Fix a few random issues There are some random issues in the code, like unused variables, and bad print formatters. I'm not sure why the compiler isn't consistently complaining, but let's fix them. Signed-off-by: David Vernet <void@manifault.com>	2024-04-11 19:21:02 -05:00
David Vernet	31cc2dccb9	rusty: Allocate DSQ on appropriate NUMA node In scx_rusty, now that we have a complete view of the host's topology thanks to the Topology crate, we can update our calls to scx_bpf_create_dsq() to create the DSQ on the NUMA node of the domain. It's unclear how much this will end up mattering for performance in the typical case, but we might as well do the right thing given that host topology is static, and we have the information. Signed-off-by: David Vernet <void@manifault.com>	2024-04-11 00:01:25 -05:00
Dan Schatzberg	6eefc8c27f	Fix error typo ENONET means "Machine is not on the network" - this was supposed to be ENOENT "No such file or directory"	2024-04-10 15:28:05 -04:00
Changwoo Min	f53c29759e	scx_lavd: support preemption (in some scenarios) (#224) * scx-lavd: preemption of a lower-priority task using kick cpu When a task is enqueued to the global queue, the scheduler checks if there is a lower priority task than the enqueued task. If so, it kicks out the lower-priority task, hoping the newly enqueued task or another higher-priority task runs on the kicked CPU. Kicking another CPU is expensive as an IPI is involved, so the scheduler judiciously kicks the CPU when its benefit (i.e., priority gap) is clear enough. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-04-09 14:25:53 +09:00
David Vernet	9a8ed8ab44	Merge pull request #218 from sched-ext/rusty_hotplug Gracefully handle hotplug in scx_rusty	2024-04-04 16:03:59 -05:00
Andrea Righi	17a30bddc9	scx_rustland_core: bump up version to 0.2 Bump up the version of the crate and update dependencies. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-04 22:44:55 +02:00
David Vernet	622b61dd2f	rusty: Support restarting rusty on hotplug events The scx_rusty scheduler does not support hotplug, and expects a static host topology throughout its runtime. Though the kernel does have support for detecting hotplug events, we currently don't detect this in the kernel, nor surface it to user space when it happens. Now that we have scx_bpf_exit(), we can gracefully exit the kernel in the event of a hotplug, and communicate to user space that it should restart the scheduler. This patch adds that support to scx_rusty. Note that this assumes that we're running on a recent enough kernel that has scx_bpf_exit(). If it doesn't, then we instead just error out of the kernel scheduler and exit the application. Signed-off-by: David Vernet <void@manifault.com>	2024-04-04 14:52:48 -05:00
Tejun Heo	ba52cc131b	scx_lavd: Add .gitignore	2024-04-04 07:15:37 -10:00
Tejun Heo	a60737a6bf	Merge pull request #207 from sched-ext/api-updates scx: Apply API updates from sched_ext	2024-04-02 14:26:42 -10:00
Tejun Heo	b925bdf94d	Cargo.toml: Update libbpf-rs/cargo dependencies to 0.23 and drop patch.crates-io sections New versions of libbpf-rs and libbpf-cargo are now available with all the needed features. Update the dependencies and drop the patch sections.	2024-04-02 11:19:39 -10:00
Tejun Heo	6f81409df4	Bump versions - scx_utils bumped from 0.6.0 to 0.7.0. - Repo and rust schedulers get a PATCH level bump.	2024-04-02 10:58:50 -10:00
Tejun Heo	f3e20ae9b3	scx_rustland: Apply API updates and add --exit-dump-len option to scx_rustland	2024-04-02 10:30:56 -10:00
David Vernet	5088328f9e	rusty: Check LOCAL_DSQ length for WAKE_SYNC In rusty_select_cpu(), if a task is WAKE_SYNC, we'll currently migrate the task to that CPU if there are any idle cores on the system. As in [0], this condition is insufficient, as there could be idle cores elsewhere on the system, but still tasks piled up on a single local DSQ. Let's add a condition that the local DSQ has to be empty in order to apply the WAKE_SYNC migration. Before patch: [void@maniforge src]$ hackbench Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks) Each sender will pass 100 messages of 100 bytes Time: 0.433 With patch: [void@maniforge src]$ hackbench Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks) Each sender will pass 100 messages of 100 bytes Time: 0.035 Signed-off-by: David Vernet <void@manifault.com>	2024-04-02 15:17:32 -05:00
Tejun Heo	dfa978d166	scx_lavd: Apply API updates	2024-04-02 10:08:02 -10:00
Tejun Heo	0c07f382b1	scx_rusty: Apply API updates	2024-04-02 10:07:54 -10:00
Tejun Heo	59bbd800c1	compat: Implement scx_utils::compat and fix up scx_layered Implement scx_utils::compat to match C's scx/compat.h and update scx_layered. Other rust scheds are still broken.	2024-04-02 07:08:56 -10:00
Changwoo Min	3a3bd2a750	scx_lavd: increase the upper bound of ineligible duration Change the upper bound of ineligible duration (LAVD_ELIGIBLE_TIME_MAX). The updated (2x increased) upper bound reflects the distribution of tasks' eligible_delta_ns better. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-30 22:59:06 +09:00
Changwoo Min	8efaf0c4c2	scx_lavd: improve the accuracy of task's run_freq Change the calculation of the run_frequence using the wait_period from the last time the task yielded CPU to this time when the task is running. The old implementation measures the time interval between the last stopping and the current running and increases run_freq without reason. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-30 22:55:17 +09:00
Changwoo Min	fe3efb8ce2	scx_lavd: rename last_{start/stop/wait/wake}_clk for consistency Change the last_{start/stop/wait/wake}_clk in task_ctx to last_{running/stopping/quiescent/runnable}_clk, matching with state transition names. In addition, add comments and reorder fields in task_ctx for readability. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-30 10:13:20 +09:00
Changwoo Min	3ba10a8d4f	scx_lavd: accumulate consecutive runnings When a task runs more than once (running <->stopping) within one runnable-quiescent transition, accumulate runtime of multiple runnings for statistics. This helps to get the task's runtime per schedule when supposing that a huge time slice is given, which is what we want to collect for scheduling decisions. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-29 17:19:30 +09:00
Changwoo Min	7b99ed9c5c	scx_lavd: drop runtime_boost using slice_boost_prio Remove runtime_boost using slice_boost_prio. Without slice_boost_prio, the scheduler collects the exact time slice. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-29 16:31:03 +09:00
Changwoo Min	5629189527	scx_lavd: change update_stat_for_*() for consistency Let's change the function names of update_stat_for_*() to follow their callers for consistency and less confusion. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-29 14:49:06 +09:00
Changwoo Min	04c9e7fe9d	Merge pull request #201 from multics69/perf-vdeadline01 scx_lavd: fix merge conflicts between PR 197 and 199	2024-03-28 14:15:00 +09:00
Changwoo Min	0ea1aab070	scx_lavd: fix merge conflicts Merge branch 'perf-vdeadline01' of github.com:sched-ext/scx into perf-vdeadline01	2024-03-28 13:49:19 +09:00
Tejun Heo	340938025f	Merge pull request #200 from sched-ext/layered_delete layered: Use TLS map instead of hash map	2024-03-27 17:09:20 -10:00
Changwoo Min	60472db845	Merge pull request #197 from multics69/perf-vdeadline01 scx_lavd: improve virtual deadline calculation	2024-03-28 11:44:54 +09:00
Changwoo Min	67f41c7d83	scx_lavd: bug fix: slice_boost should be updated before adjusted runtime The run_time_boosted_ns calculation requires updated slice_boost_prio, so updating slice_boost_prio should be done before updating run_time_boosted_ns. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-28 11:21:42 +09:00
David Vernet	e857dd90ab	layered: Use TLS map instead of hash map In scx_layered, we're using a BPF_MAP_TYPE_HASH map (indexed by pid) rather than a BPF_MAP_TYPE_TASK_STORAGE, to track local storage for a task. As far as I can tell, there's no reason we need to be doing this. We never access the map from user space, and we're even passing a struct task_struct * to a helper subprog to look up the task context rather than only doing it by pid. Using a hashmap is error prone for this because we end up having to manually track lifecycles for entries in the map rather than relying on BPF to do it for us. For example, BPF will automatically free a task's entry from the map when it exits. Let's just use TLS here rather than a hashmap to avoid issues from this (e.g. we've observed the scheduler getting evicted because we're accessing a stale map entry after a task has been destroyed). Reported-by: Valentin Andrei <vandrei@meta.com> Signed-off-by: David Vernet <void@manifault.com>	2024-03-27 20:14:27 -05:00
Changwoo Min	31157ebc81	scx-lavd: make the comments in update_sys_cpu_load() clear The current description is a bit confusing, so update the comments for clarity. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-28 06:45:57 +09:00
Tejun Heo	129d99f542	scx_lavd: Remove custom task state tracking transit_task_stat() is now tracking the same runnable, running, stopping, quiescent transitions that sched_ext core already tracks and always returns %true. Let's remove it.	2024-03-26 12:23:19 -10:00
Tejun Heo	d7ec05e017	scx_lavd: Call update_stat_for_enq() from lavd_runnable() LAVD_TASK_STAT_ENQ is tracking a subset of runnable task state transitions - the ones which end up calling ops.enqueue(). However, what it is trying to track is a task becoming runnable so that its load can be added to the cpu's load sum. Move the LAVD_TASK_STAT_ENQ state transition and update_stat_for_enq() invocation to ops.runnable() which is called for all runnable transitions. Note that when all the methods are invoked, the invocation order would be ops.select_cpu(), runnable() and then enqueue(). So, this change moves update_stat_for_enq() invocation before calc_when_to_run() for put_global_rq(). update_stat_for_enq() updates taskc->load_actual which is consumed by calc_greedy_ratio() and thus affects calc_when_to_run(). Before this patch, calc_greedy_ratio() would use load_actual which doesn't reflect the last running period. After this patch, the latest running period will be reflected when the task gets queued to the global queue. The difference is unlikely to matter but it'd probably make sense to make it more consistent (e.g. do it at the end of quiescent transition). After this change, transit_task_stat() doesn't detect any invalid transitions.	2024-03-26 12:23:19 -10:00
Tejun Heo	625bb84bc4	scx_lavd: Move load subtraction to quiescent state transition scx_lavd tracks task state transitions and updates statistics on each valid transition. However, there's an asymmetry between the runnable/running and stopping/quiescent transitions. In the former, the runnable and running transitions are accounted separately in update_stat_for_enq() and update_stat_for_run(), respectively. However, in the latter, the two transitions are combined together in update_stat_for_stop(). This asymmetry leads to incorrect accounting. For example, a task's load should be added to the cpu's load sum when the task gets enqueued and subtracted when the task is no longer runnable (quiescent). The former is accounted correctly from update_stat_for_enq() but the latter is done whenever the task stops. A task can transit between running and stopping multiple times before becoming quiescent, so the asymmetry can end up subtracting the load of a task which is still running from the cpu's load sum. This patch: - introduces LAVD_TASK_STAT_QUIESCENT and updates transit_task_stat() so that it can handle all valid state transitions including the multiple back and forth transitions between two pairs - QUIESCENT <-> ENQ and RUNNING <-> STOPPING. - restores the symmetry by moving load adjustments part from update_stat_for_stop() to new update_stat_for_quiescent(). This removes a good chunk of ignored transitions. The next patch will take care of the rest.	2024-03-26 12:23:19 -10:00
Tejun Heo	dd40377f03	scx_lavd: Drop unnecessary `extern crate`s Since https://doc.rust-lang.org/edition-guide/rust-2018/path-changes.html, extern crate declarations aren't necessary. Let's drop them.	2024-03-26 12:23:19 -10:00
David Vernet	602ec5ada3	layered: Make helper functions static lookup_task_ctx(), lookup_task_ctx_may_fail(), and lookup_layer() currently don't have the static keyword, so BPF may treat them as a global function. We don't actually want these to be global, so let's make them static to avoid confusing the verifier. Signed-off-by: David Vernet <void@manifault.com>	2024-03-26 15:08:32 -05:00
Changwoo Min	83169481a6	scx_lavd: improve latency criticality to latency priority mapping The old approach maps [0, maximum latency criticality] to [-boost range, boost range). This approach is easily affected by one outlier maximum value and suffers from integer truncation error. The new approach divides the range into two -- [minimum latency criticality, average latency criticality) and [average latency criticality, maximum latency criticality] -- and maps them into [boost range/2, 0) and [0, -boost range/2), respectively. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-25 22:13:41 +09:00
Changwoo Min	2b5d3c1300	scx_lavd: change sched_prio_to_latency_weight to a more skewed one Replace the latency weight array with a more skewed one, which is the inverse of sched_prio_to_slice_weight. It turns out that the more skewed one works better under highly CPU-overloaded cases since it gives a longer deadline to non-latency-critical tasks. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-21 14:01:44 +09:00
Changwoo Min	9c12b607ca	scx_lavd: increase LAVD_LC_RUNTIME_MAX for improved lat_prio As the calculated runtime increases by considering the number of full-time slice consumption, increase the upper bound (LAVD_LC_RUNTIME_MAX) of runtime to be considered in latency calculation. Also, add LAVD_SLICE_BOOST_MAX_PRIO to avoid slice_boost_prio dropping to zero suddenly. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-21 10:59:13 +09:00
Changwoo Min	32570789d8	scx_lavd: improve the accuracy of runtime per schedule Take slice_boost_prio -- how many times a full time slice was consumed -- into consideration in calculating run_time_ns (runtime per schedule). This improves the accuracy, especially when a task is overscheduled and its time slice is reduced for enforcing fairness. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-21 10:32:09 +09:00
Changwoo Min	b37370bb35	scx_lavd: entail two invalid task state transitions Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-20 00:15:47 +09:00
Changwoo Min	8860f26ff4	scx_lavd: add a sanity check if runtime is negative Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-20 00:15:37 +09:00
Changwoo Min	fa2282363b	scx_lavd: more explanation about sched_prio_to_latency_weight Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-19 21:31:37 +09:00
Changwoo Min	24bddad9b4	scx_lavd: fix a typo Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-19 21:19:55 +09:00
Changwoo Min	512c4e794f	scx_lavd: fix potential CPU stall in lavd_select_cpu() Returning prev_cpu after picking an idle CPU will cause the idle CPU to stall because the idle core was already punched out from the idle mask by the scx core, so it is no longer idle from the scx core's point of view. This fix conducts the idle core selection at the last step so it never returns prev_cpu after picking the idle core. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-19 00:46:46 +09:00
Changwoo Min	e41c674fae	scx_lavd: remove redundant latency calculation at calc_latency_weight() Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-19 00:46:15 +09:00
Changwoo Min	865269f438	scx_lavd: remove unnecessary condition check at slice_fully_consumed() Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-19 00:46:15 +09:00
Changwoo Min	c2b1a10e17	scx_lavd: remove unnecessary condition check at update_stat_for_stop() Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-19 00:46:15 +09:00
Changwoo Min	a27b509452	scx_lavd: use is_wakeup_ef() in checking wait flag Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-19 00:46:15 +09:00
Changwoo Min	419ccae8db	scx_lavd: improve the clarity of the task state transition Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-19 00:46:01 +09:00
Changwoo Min	66e15285ea	scx_lavd: move scx_bpf_error() calls to get_cpu_ctx{_id}() Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-19 00:46:01 +09:00
Changwoo Min	0fc5591bf6	scx_lavd: add a utility func, {try_}get_task_ctx() get_task_ctx() and try_get_task_ctx() were added for common error handling for task lookup failure. Since idle "swapper" task is not under sched_ext, try_get_task_ctx() is added for the case such that idle task can be searched. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-19 00:46:01 +09:00
Changwoo Min	97b4d9ce5a	scx_lavd: remove unnecessary condition check in is_wakeup_wf() We don't need to test SCX_WAKE_SYNC because SCX_WAKE_SYNC should only be set when SCX_WAKE_TTWU is set. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-19 00:46:01 +09:00
Changwoo Min	47e7238b13	scx_lavd: improve the description of fairness Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-19 00:45:37 +09:00
Changwoo Min	670c1b5b92	scx_lavd: print one scheduling decision by default Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-19 00:30:41 +09:00
Changwoo Min	315e5b3fe2	scx_lavd: remove unnecessary arg from put_local_rq() cpu_id is unused and not necessary in put_local_rq(), so it is removed. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-19 00:30:26 +09:00
Changwoo Min	ead7d55c5c	scx_lavd: replace num_cpus with scx_utils::Topology This removes the external crate dependency and avoids the known bugs in the num_cpus crate. Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-19 00:30:26 +09:00
Changwoo Min	17bce169e7	scx_lavd: fix formatting issues in main.rs and main.bpf.c Signed-off-by: Changwoo Min <changwoo@igalia.com>	2024-03-19 00:30:26 +09:00
Changwoo Min	fb73520990	scx_lavd: add scx_lavd to the meson build	2024-03-16 10:55:37 +09:00
Changwoo Min	6ab3928a0d	scx_lavd: add scx_lavd (Latency-criticality Aware Virtual Deadline) scheduler scx_lavd is a BPF scheduler that implements an LAVD (Latency-criticality Aware Virtual Deadline) scheduling algorithm. While LAVD is new and still evolving, its core ideas are 1) measuring how much a task is latency critical and 2) leveraging the task's latency-criticality information in making various scheduling decisions (e.g., task's deadline, time slice, etc.). As the name implies, LAVD is based on the foundation of deadline scheduling. This scheduler consists of the BPF part and the rust part. The BPF part makes all the scheduling decisions; the rust part loads the BPF code and conducts other chores (e.g., printing sampled scheduling decisions).	2024-03-16 10:31:07 +09:00
David Vernet	35b7dc95d0	rusty: Fix up the scheduler description There were a few issues, e.g. us still mentioning the infeasible weights problem, and arguments using underscores despite clap rendering them with dashes. Let's fix them up. Signed-off-by: David Vernet <void@manifault.com>	2024-03-14 11:21:03 -05:00
David Vernet	4520514fe8	rusty: Account for disabled but offline CPUs As described in https://bugzilla.kernel.org/show_bug.cgi?id=218109, https://github.com/sched-ext/scx/issues/147 and https://github.com/sched-ext/sched_ext/issues/69, AMD chips can sometimes report fully disabled CPUs as offline, which causes us to count them when looking at /sys/devices/system/cpu/possible. Additionally, systems can have holes in their active CPU maps. For example, a system with CPUs 0, 1, 2, 3 possible, may have only 0 and 2 active. To address this, we need to do a few things: 1. Update topology.rs to be clear that it's returning the number of _possible_ CPUs in the system. Also update Topology to only record online CPUs when creating its span and iterating over sysfs when creating domains. It was previously trying to record when a CPU was online, but this was actually broken as the topology directory isn't present in sysfs when the CPU is offline. 2. Schedulers should not be relying on nr_possible_cpus for anything other than interacting with per-CPU data (e.g. for stats extraction), or e.g. verifying maximum sizes of statically sized arrays in BPF. It should _not_ be used for e.g. performing load calculations, etc. With that said, we'll also need to update schedulers to not rely on the nr_possible_cpus figure being exported by the topology crate. We do that for rusty in this patch, but don't fix any of the others other than updating how they call topology.rs. 3. Account for the fact that LLC IDs may be non-contiguous. For example, if there is a single core in an LLC, then if we assign LLC IDs to domains, then the domain IDs won't be contiguous. This doesn't fit our current model which is used by e.g. infeasible_weights.rs. We'll update some of the code in rusty to accommodate this, but we'll need to do more. 4. Update schedulers to properly reset themselves in the event of a hotplug event. We'll take care of that in a follow-on change.
Signed-off-by: David Vernet <void@manifault.com>	2024-03-14 11:15:28 -05:00
David Vernet	2b8a3ea984	rusty: Iterate over domains, not IDs If a CPU is offline, it could cause an LLC to go offline, which could cause us to have non-contiguous domain IDs. Right now, a few places in code assume contiguous domain IDs, such as in the infeasible weights crate. Let's update domain.rs and load_balance.rs to do the right thing. We'll fix the others later. Signed-off-by: David Vernet <void@manifault.com>	2024-03-14 11:02:01 -05:00
David Vernet	4e9cf5181e	rusty: Fix domain weight() function We were looking at the domain cpumask length, instead of its weight. Correct the oversight. Signed-off-by: David Vernet <void@manifault.com>	2024-03-14 11:02:01 -05:00
David Vernet	bc0336d727	cpumask: Add bitwise ops for cpumask We implement functions or(), and(), and xor() for cpumasks, but we should also implement the bitwise ops for those operations in case people prefer that syntax. Signed-off-by: David Vernet <void@manifault.com>	2024-03-14 11:02:01 -05:00
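As a minimal sketch of what the commit above describes, the following shows method-style `or()`, `and()`, and `xor()` alongside the equivalent `std::ops` operator implementations. `ToyMask` is a hypothetical stand-in backed by a single `u64`; it is not the cpumask type from this repository, whose internals are not shown here.

```rust
use std::ops::{BitAnd, BitOr, BitXor};

/// Hypothetical toy mask covering at most 64 CPUs (illustration only).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ToyMask(pub u64);

impl ToyMask {
    /// Union of two masks.
    pub fn or(self, other: ToyMask) -> ToyMask {
        ToyMask(self.0 | other.0)
    }
    /// Intersection of two masks.
    pub fn and(self, other: ToyMask) -> ToyMask {
        ToyMask(self.0 & other.0)
    }
    /// Symmetric difference of two masks.
    pub fn xor(self, other: ToyMask) -> ToyMask {
        ToyMask(self.0 ^ other.0)
    }
}

// Operator forms that delegate to the methods above, so `a | b`
// and `a.or(b)` always agree.
impl BitOr for ToyMask {
    type Output = ToyMask;
    fn bitor(self, rhs: ToyMask) -> ToyMask {
        self.or(rhs)
    }
}

impl BitAnd for ToyMask {
    type Output = ToyMask;
    fn bitand(self, rhs: ToyMask) -> ToyMask {
        self.and(rhs)
    }
}

impl BitXor for ToyMask {
    type Output = ToyMask;
    fn bitxor(self, rhs: ToyMask) -> ToyMask {
        self.xor(rhs)
    }
}
```

With these impls, callers can write `a | b`, `a & b`, or `a ^ b` wherever they previously called the named methods.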
David Vernet	583696f940	topology: Include last CPU in online We're iterating from min..max cpu in cpus_online(), but that's not inclusive of the max CPU. Let's also include that so we don't think that last CPU is offline. Signed-off-by: David Vernet <void@manifault.com>	2024-03-14 11:01:52 -05:00
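The off-by-one described in the commit above comes down to Rust's range syntax: `min..max` excludes the upper bound, while `min..=max` includes it. A small sketch follows; `cpu_ids` is a hypothetical helper for illustration, not the function in topology.rs.

```rust
/// Hypothetical helper: collect CPU IDs from `min` through `max`,
/// including `max` itself.
pub fn cpu_ids(min: usize, max: usize) -> Vec<usize> {
    // `min..max` would stop at `max - 1`; `min..=max` also yields `max`.
    (min..=max).collect()
}

/// The half-open variant, shown only for comparison: it omits `max`.
pub fn cpu_ids_exclusive(min: usize, max: usize) -> Vec<usize> {
    (min..max).collect()
}
```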
Andrea Righi	2cd3929475	scx_rustland: mitigate sub-optimal performance with offline CPUs Most of the schedulers assume that the amount of possible CPUs in the system represents the actual number of CPUs available. This is not always true: some CPUs may be offline or certain CPU models (AMD CPUs for example) may include unavailable CPUs in this number. This can lead to sub-optimal performance or even errors in the scheduler (see for example [1][2]). Ideally, we need to attack this issue in a more generic way, such as having a proper API provided by a C library, that can be used by all schedulers and the topology Rust module (scx_utils crate). But for now, let's try to mitigate most of the common sub-optimal cases separately inside each scheduler. For rustland we can apply some mitigations both in select_cpu() (for the BPF part) and in the user-space part: - the former is fixed in the sched-ext kernel by commit 94dc0c01b957 ("scx: Use cpu_online_mask when resetting idle masks"). However, adding an extra check `cpu < num_possible_cpus` in select_cpu(), allows to properly support AMD CPUs, even with kernels that don't have the cpu_online_mask fix yet (this doesn't always guarantee the validity of cpu, but it should be enough to mitigate the majority of the potential sub-optimal cases, without introducing any significant overhead) - the latter can be fixed relying on topology.span(), instead of topology.nr_cpus(), to count the amount of available CPUs in the system. [1] https://github.com/sched-ext/sched_ext/issues/69 [2] https://github.com/sched-ext/scx/issues/147 Link: `94dc0c01b9` Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-03-14 10:19:31 +01:00
David Vernet	3cda1bc690	Merge pull request #187 from sched-ext/layered-updates scx_layered: Make config json assume default values for unspecified fields	2024-03-13 17:15:18 -05:00
Tejun Heo	76fb0fdd8f	scx_layered: Make config json assume default values for unspecified fields This makes writing configs easier and allows introducing new fields without breaking existing configs.	2024-03-13 11:10:38 -10:00
Tejun Heo	6048992ca7	Merge pull request #185 from sched-ext/layered-updates scx_layered: Implement layer properties `exclusive` and `min_exec_us`	2024-03-13 09:59:37 -10:00
Tejun Heo	60b346c1fc	scx_layered: Add more comments	2024-03-13 09:56:28 -10:00
David Vernet	91cb5ce8ab	Merge pull request #178 from sched-ext/multi_numa_rusty rusty: Implement NUMA-aware load balancing	2024-03-12 15:50:27 -05:00
David Vernet	c8d841d50b	rusty: Add comments + use VecDeque Given the complexity of migrating load between nodes (we're doing four nested loops), we should add a comment explaining what we're doing. This commit does that. In addition, we use a VecDeque to store (and then restore) push nodes and push domains so that we can re-add them to their respective lists in load-sorted order rather than reverse-load-sorted order. This allows us to avoid having to do unnecessary right-shifts every time a push object is re-added to its containing list. Signed-off-by: David Vernet <void@manifault.com>	2024-03-12 13:49:14 -07:00
Tejun Heo	a9457a408e	scx_layered: stat reporting updates	2024-03-12 10:48:21 -10:00
Tejun Heo	a642fc873b	scx_layered: Fix stat reporting GSTAT_TASK_CTX_FREE_FAILED should report total while EXCL_* should report delta pct. Fix them.	2024-03-12 10:25:51 -10:00
David Vernet	03f68092ee	rusty: Fix a few remaining issues Fixing alignment, moving a couple bail! calls around, and adding a missing break from move_between_nodes() that lets us bail out of a loop early. Signed-off-by: David Vernet <void@manifault.com>	2024-03-12 12:44:38 -07:00
Tejun Heo	58cbc5361d	scx_layered: warn if omitted stats aren't zero	2024-03-12 09:29:31 -10:00
Tejun Heo	37006d1bc1	scx_layered: Use saturating sub when reading system stats, other misc changes Sometimes io_wait time goes in the wrong direction. Use saturating sub.	2024-03-12 06:14:06 -10:00
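To illustrate the saturating subtraction mentioned in the commit above, here is a minimal sketch using the standard `u64::saturating_sub`. The helper name `counter_delta` is hypothetical and not taken from scx_layered.

```rust
/// Hypothetical helper: delta between two samples of a counter that is
/// expected to grow but occasionally moves backwards.
pub fn counter_delta(prev: u64, curr: u64) -> u64 {
    // A plain `curr - prev` panics in debug builds (and wraps in release
    // builds) when `curr < prev`; saturating_sub clamps the result to 0.
    curr.saturating_sub(prev)
}
```

A backwards-moving sample then reads as a delta of zero instead of a panic or a huge wrapped value.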
Tejun Heo	342a4946af	scx_layered: Better pct formatting when printing stats	2024-03-11 22:18:03 -10:00
Tejun Heo	be2102775b	scx_layered: Implement min_exec_us option which can be used to penalize tasks which wake up very frequently without doing much.	2024-03-11 22:13:11 -10:00
Tejun Heo	0c62b24993	scx_layered: Implement exclusive property A task in an exclusive grouped or open layer occupies a whole core - the sibling CPU is kept idle.	2024-03-11 18:27:16 -10:00
David Vernet	24d798c2ff	rusty: Use a flat list of NumaNodes during LB As Tejun pointed out in review, the disadvantage of using push/pull/balanced lists is that if the domains inside the nodes are balanced, we won't be able to push load between them. I'd originally done it that way both as an optimization, but also to allow me to iterate over the lists of pushable and pullable domains mutably. That was addressed in the prior commit, but the nodes themselves were still put into 3 buckets. I think this is generally just a cleaner way of doing things, so let's just collapse the nodes into a flat list as well. This prevents us from having to coalesce the lists, std::mem::swap them, etc. Signed-off-by: David Vernet <void@manifault.com>	2024-03-11 21:04:10 -07:00
David Vernet	829b1d3ced	rusty: Don't use multiple SortedVec's in struct NumaNode Tejun pointed out that a possible issue exists in the current implementation, wherein if you have two NUMA nodes that are imbalanced, but their domains are internally balanced, we'll fail to migrate between them if all nodes are in the balanced_nodes list. To address this, let's just use a single global list for all types of domains, and do checking internally for imbalances. The reason it was done this way in the first place was to allow me to mutably iterate over both vectors in a nested loop. The way around that is to just use loop {} and push/pop domains from the list. We could do the same thing for the NUMA nodes themselves, which are also in 3 separate lists in the LoadBalancer. We'll do that in a subsequent commit. Signed-off-by: David Vernet <void@manifault.com>	2024-03-11 21:04:10 -07:00
David Vernet	3d2507e6f2	rusty: Add separate flag for x NUMA greedy task stealing In scx_rusty, a CPU that is going to go idle will attempt to steal tasks from remote domains when its domain has no tasks to run, and a remote domain has at least greedy_threshold enqueued tasks. This stealing is temporary, but of course has a cost in that the CPU that's stealing the task may cause it to suffer from cache misses, or in the case of multi-node machines, remote NUMA accesses and working sets split across multiple domains. Given the higher cost of x NUMA work stealing, let's add a separate flag that lets users tune the threshold for doing cross NUMA greedy task stealing. Signed-off-by: David Vernet <void@manifault.com>	2024-03-11 21:02:23 -07:00
Tejun Heo	76cc337d78	scx_layered: Add exclusive option to Open and Grouped layers Actual implementation isn't done yet.	2024-03-11 12:07:03 -10:00
Jordan Rome	54fe1c954e	Merge pull request #179 from jordalgo/bpftool Fetch and build bpftool by default	2024-03-11 17:54:29 -04:00
Andrea Righi	bd2c18afd5	Revert "scx_rustland_core: use new consume_raw() libbpf-rs API" In order to use the new consume_raw() API we need to depend on a version of libbpf-rs that is not released yet. Apparently adding such dependency may introduce a potential dependency conflict with libbpf-sys. Therefore, revert this change and go back to the previous consume() API. Once a new version of libbpf-rs is out we can update all our dependencies to use the new libbpf-rs and re-apply this patch to scx_rustland_core. Fixes: `7c8c5fd` ("scx_rustland_core: use new consume_raw() libbpf-rs API") Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-03-11 21:54:21 +01:00
Jordan Rome	ffc7b7dc4a	Fetch and build bpftool by default This pairs with the new default behavior to fetch and build libbpf and is mostly being used so we can use the latest bpftool and libbpf.	2024-03-11 10:00:01 -07:00
Andrea Righi	b7c06b9ed9	Merge pull request #181 from sched-ext/rustland-interactive-tuning scx_rustland: interactive tuning	2024-03-10 19:31:00 +01:00
Andrea Righi	155444e1c0	scx_rustland: set default time slice to 5ms In line with rustland's focus on prioritizing interactive tasks, set the default base time slice to 5ms. This helps mitigate potential audio crackling issues or system lags when the system is overloaded or under memory pressure (i.e., https://github.com/sched-ext/scx/issues/96#issuecomment-1978154324). A downside of this change is that it may introduce regressions in the throughput of CPU-intensive workloads, but in such scenarios rustland may not be the optimal choice and alternative schedulers may be preferred. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-03-10 14:46:11 +01:00
Andrea Righi	0a7161cbc7	scx_rustland: limit range of task weight Some high-priority tasks may have a weight too high, that can potentially disrupt the slice boost optimization logic, causing interactive tasks to be less responsive. In line with rustland's focus on prioritizing interactive tasks, prevent giving too much CPU bandwidth to such high-priority tasks by limiting the maximum task weight to 1000. This allows to maintain a good level of system responsiveness even in presence of tasks with a really high priority. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-03-10 14:39:29 +01:00
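The weight cap described in the commit above can be sketched as a simple upper clamp. The 1000 limit comes from the commit message; the constant and function names are hypothetical, and how scx_rustland applies the limit internally is not shown here.

```rust
/// Upper bound on task weight, as stated in the commit message.
pub const MAX_WEIGHT: u64 = 1000;

/// Hypothetical helper: cap a task weight so that very high-priority
/// tasks cannot exceed MAX_WEIGHT.
pub fn limit_weight(weight: u64) -> u64 {
    weight.min(MAX_WEIGHT)
}
```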
Andrea Righi	7c8c5fdd48	scx_rustland_core: use new consume_raw() libbpf-rs API Use the new consume_raw() API provided by libbpf-rs with https://github.com/libbpf/libbpf-rs/pull/680. This allows to be more precise and efficient at processing tasks consumed from the BPF ring buffer. NOTE: the new consume_raw() API is not available yet in any official release of the libbpf-rs crate, but cargo allows to pick versions directly from git. This slightly increases the build time of scx_rustland_core and the schedulers based on this crate (since we need to recompile libbpf-rs from source), but we can re-add a proper versioned dependency once the libbpf-rs is out. TODO: this new API also offers the possibility to consume multiple items from the BPF ring buffer with a single call to consume_raw(). This could be investigated and implemented as a potential future enhancement. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-03-10 09:55:17 +01:00
David Vernet	1c3168d2a4	topology: Don't assume unique core IDs The current topology.rs crate assumes that all cores have unique core IDs in a system. This need not be the case, such as in certain Intel Xeon processors which reuse core IDs in different NUMA nodes. Let's update the crate to assume unique core IDs only per socket. Signed-off-by: David Vernet <void@manifault.com>	2024-03-08 15:13:46 -06:00
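A minimal sketch of why the commit above matters: when core IDs repeat across sockets, keying on the core ID alone undercounts cores, while keying on the (socket, core) pair does not. Both functions and the input layout are hypothetical and do not mirror the data structures in topology.rs.

```rust
use std::collections::HashSet;

/// Count distinct cores from one `(socket_id, core_id)` pair per CPU,
/// treating a core ID as unique only within its socket.
pub fn count_cores(cpus: &[(usize, usize)]) -> usize {
    cpus.iter()
        .copied()
        .collect::<HashSet<(usize, usize)>>()
        .len()
}

/// Flawed variant for comparison: assumes core IDs are globally unique.
pub fn count_cores_by_id_only(cpus: &[(usize, usize)]) -> usize {
    cpus.iter()
        .map(|&(_socket, core)| core)
        .collect::<HashSet<usize>>()
        .len()
}
```

For a two-socket system where each socket has cores 0 and 1, the first function reports four cores and the second reports only two.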
David Vernet	26a94b1b14	rusty: Add debug! logging to load_balance.rs We removed the debug!() output that was previously present in main.rs. Let's add more debug!() output that helps debug the current LB hierarchy. Signed-off-by: David Vernet <void@manifault.com>	2024-03-08 15:13:46 -06:00
David Vernet	0d0b101398	rusty: Add load balancing statistics to rusty The scx_rusty load balancer is currently no longer exporting statistics such as domain load averages, load sums, etc. Now that we're also balancing by NUMA, we'll need a way to hierarchically illustrate load balancing statistics. This patch adds support for that. Signed-off-by: David Vernet <void@manifault.com> updating stats printing Signed-off-by: David Vernet <void@manifault.com>	2024-03-08 15:13:36 -06:00
David Vernet	0871a9525d	rusty: Add direct_greedy_numa flag Users may want to toggle whether tasks can be temporarily sent to idle CPUs on remote NUMA nodes. By default, we want it to be disabled as a task spanning multiple NUMA nodes will end up having its working set spanning both nodes, which is probably not desirable. However, in case a workload really wants to encourage work conservation, let's add a flag that allows them to toggle it. Signed-off-by: David Vernet <void@manifault.com>	2024-03-08 15:12:00 -06:00
David Vernet	d0ebfb85ef	rusty: Disable direct greedy stealing between NUMA nodes scx_rusty currently pushes tasks to idle cores if the direct greedy threshold is exceeded, even if the core is on a remote NUMA node. This behavior is probably not desired in most scenarios. The worst that will happen if a task is pushed to an idle core in the same node is some L3 cache miss traffic, but for multiple NUMA nodes, it could cause the task to have its working set span multiple nodes. Let's disable direct greedy work stealing across NUMA nodes. A future commit will add a flag that's disabled by default, and lets users turn this on if they really want to encourage work conservation. Signed-off-by: David Vernet <void@manifault.com>	2024-03-08 15:11:59 -06:00
David Vernet	db152cfbe8	rusty: Implement NUMA-aware load balancing Right now, scx_rusty has no notion of domains spanning NUMA nodes, and makes no distinction when making load balancing decisions, or work stealing. This can cause problems on multi-NUMA machines, as load balancing and work stealing across NUMA nodes has significantly different cost from across L3 cache boundaries. In order to better support multi-NUMA machines, this commit adds another layer to the rusty load balancer, which balances across NUMA nodes using a different cost function from balancing across domains. Load balancing now takes place over the span of two passes: 1. In the first pass, we fix imbalances across NUMA nodes by moving tasks between domains across those NUMA node boundaries. We require a load imbalance of at least 17% in order to move load at this stage. The ratio of load imbalance we attempt to adjust (50%) and the maximum amount of load we're allowed to push out of a domain (50%) is still the same as when balancing between domains inside a NUMA node, but this is easy to tune with the current setup. 2. Once we've balanced across NUMA nodes, we iterate over all nodes and balance between the domains within each NUMA node. The cost function here is the same as what it has been thus far: we require at least a 5% imbalance in order to trigger load balancing. There are a few additional changes / improvements to load balancing in this commit: 1. NUMA nodes and domains are now ordered according to their load by using SortedVec objects. We were previously using BTreeMap keyed by load, but this was suboptimal due to the fact that it doesn't allow duplicate entries. 2. We're no longer exporting load balancing statistics as a vector of data such as load sums, averages, and imbalances. This is instead all encapsulated in the load balancing hierarchy we setup in lb.load_balance(). These statistics are not yet exported, but they will be in a subsequent commit. 
One of the issues with this commit is that it does introduce some almost-identical logic that somehow begs to be deduplicated. For example, when we balance between NUMA nodes, the logic for iterating over push nodes and pushing to pull nodes is very similar to the logic of iterating over push domains and pull domains when balancing within a node. It may be that this can be improved.

The following are some benchmarks run on an Intel Xeon Gold 6138 (2 x 40 core processor):

kcompile
--------
On Commit a27648c74210 ("afs: Fix setting of mtime when creating a file/dir/symlink"):

1. make allyesconfig
2. make -j $(nproc) built-in.a
3. make -j clean
4. goto 2

Runtime
-------
         o-----------o-----------o----------o
         | scx_rusty | CFS       | Delta    |
---------o-----------o-----------o----------o
Mean     | 562.688s  | 566.085s  | -.6%     |
---------o-----------o-----------o----------o
Variance | 0.54387   | 0.72431   | -24.9%   |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG| Delta    |
---------o-----------o-----------o----------o
Mean     | 562.688s  | 563.209s  | -.092%   |
---------o-----------o-----------o----------o
Variance | 0.54387   | 0.42038   | 29.38%   |
---------o-----------o-----------o----------o

scx_rusty with NUMA awareness clearly beats CFS, but only barely beats scx_rusty without it. This isn't necessarily super surprising given that this is kcompile, which has very poor front-end CPU locality. Further experimentation with toggling the cost function for performing migrations may improve this further.

CPU util
--------
         o-----------o-----------o----------o
         | scx_rusty | CFS       | Delta    |
---------o-----------o-----------o----------o
Mean     | 7654.25%  | 7551.67%  | 1.11%    |
---------o-----------o-----------o----------o
Variance | 165.35714 | 158.3333  | 4.436%   |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG| Delta    |
---------o-----------o-----------o----------o
Mean     | 7654.25%  | 7641.57%  | 0.1659%  |
---------o-----------o-----------o----------o
Variance | 165.35714 | 1230.619  | -86.5%   |
---------o-----------o-----------o----------o

As expected, CPU util is quite a bit higher with scx_rusty than it is with CFS. Further experiments that could be interesting are always enabling direct-greedy stealing between domains within a NUMA node, and then comparing rusty NUMA and rusty ORIG. rusty NUMA prevents stealing between NUMA nodes, so this would show whether the locality introduced by NUMA awareness appropriately offsets the loss of work conservation.

Major PFs
---------
         o-----------o-----------o----------o
         | scx_rusty | CFS       | Delta    |
---------o-----------o-----------o----------o
Mean     | 5332      | 3950      | 36.566%  |
---------o-----------o-----------o----------o
Variance | 6975.5    | 5986.333  | 16.5237% |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG| Delta    |
---------o-----------o-----------o----------o
Mean     | 5332      | 5336.5    | -.084%   |
---------o-----------o-----------o----------o
Variance | 6975.5    | 955.5     | 630.03%  |
---------o-----------o-----------o----------o

Also as expected, major page faults are far higher with scx_rusty than with CFS. This is expected even with NUMA awareness, given that scx_rusty is still less sticky than CFS. Further experiments that could be interesting are tuning the threshold for which we perform x NUMA migrations to try and keep this value even lower.
The rate of major page faults between rusty NUMA and rusty ORIG were very close, though rusty NUMA was a bit lower. Signed-off-by: David Vernet <void@manifault.com>	2024-03-08 15:11:17 -06:00
David Vernet	0b1c3713b2	rusty: Remove lb_apply_weight param from lb_step() Let's just query self.tuner.fully_utilized directly and save a few lines of code. Signed-off-by: David Vernet <void@manifault.com>	2024-03-08 15:11:17 -06:00
David Vernet	758f762058	rusty: Move LoadBalancer out of rusty.rs More cleanup of scx_rusty. Let's move the LoadBalancer out of rusty.rs and into its own file. It will soon be extended quite a bit to support multi-NUMA and other multivariate LB cost functions, so it's time to clean things up and split it out. Signed-off-by: David Vernet <void@manifault.com>	2024-03-08 15:11:17 -06:00
David Vernet	94f75bcec6	rusty: Refactor Tuner and DomainGroup out of rusty.rs rusty.rs is growing a bit unwieldy. We're going to want to update its load balancing logic somewhat significantly to account for multi-NUMA and other cost functions, so let's start cleaning the code up so that things are more logically segmented and easier to work with. To start, we move the Tuner and DomainGroup/Domain objects into their own modules. Signed-off-by: David Vernet <void@manifault.com>	2024-03-08 15:10:37 -06:00
Andrea Righi	be5e51dfaa	scx_rlfifo: print a performance warning banner scx_rlfifo is provided as a simple example to show how to use scx_rustland_core and it's not supposed to be used in a real production environment. To prevent performance bug reports print an explicit warning when it's started to clarify the goal of this scheduler. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-03-05 19:36:17 +01:00
Andrea Righi	fe19754132	scx_rlfifo: replace 1ms sleep with sched_yield() Small improvement to make the scheduler a bit more responsive, without introducing too much complexity or too much CPU overhead. This can be achieved by replacing a sleep of 1ms with a sched_yield() every time that the scheduler has finished to dispatch all the queued tasks. This also makes the code a bit smaller and easier to read. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-03-05 18:42:24 +01:00
Andrea Righi	5cf113f058	scx_rustland_core: provide DispatchedTask API methods Provide distinct methods to set the target CPU and the per-task time slice to dispatched tasks. Moreover, also provide a constructor to create a DispatchedTask from a QueuedTask (this allows to automatically bounce a task from the scheduler to the BPF dispatcher without having to take care of setting the individual task's attributes). This also allows to make most of the attributes of DispatchedTask private, especially it allows to hide cpumask_cnt, that should be only used internally between the BPF and the user-space component. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-03-03 15:49:37 +01:00
Andrea Righi	e10f8a2d8e	scx_rustland_core: introduce per-task time slice Provide a way to set a different time slice per-task, by adding a new attribute slice_ns to the DispatchedTask struct. This attribute determines the time slice assigned to the task, if it is set to 0 then the global time slice (either the default one or the effective one, if set) will be used. At the same time, remove the payload attribute, that is basically unused (scx_rustland uses it to send the task's vruntime to the BPF dispatcher for debugging purposes, but it's not very useful anymore at this point). In the future we may introduce a proper interface to attach a custom payload to each task. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-03-03 15:06:56 +01:00
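The fallback rule in the commit above (a `slice_ns` of 0 means "use the global time slice") can be sketched as follows. The function name and signature are hypothetical; only the zero-means-global behavior is taken from the commit message.

```rust
/// Hypothetical helper: pick the time slice to apply to a dispatched task.
/// A per-task value of 0 falls back to the global time slice.
pub fn effective_slice_ns(task_slice_ns: u64, global_slice_ns: u64) -> u64 {
    if task_slice_ns == 0 {
        global_slice_ns
    } else {
        task_slice_ns
    }
}
```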
Jordan Rome	499924ead8	Add libbpf as a submodule This is to potentially reduce issues with folks using different versions of libbpf at runtime. This also: - makes static linking of libbpf the default - adds steps in `meson setup` to fetch libbpf and make it	2024-03-01 12:39:35 -08:00
Andrea Righi	0d1c6555a4	scx_rustland_core: generate source files in-tree There is no need to generate source code in a temporary directory with RustLandBuilder(), we can simply generate code in-tree and exclude the generated source files from .gitignore. Having the generated source files in-tree can help to debug potential build issues (and it also allows to drop the tempfile crate dependency). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-28 17:49:44 +01:00
Andrea Righi	2ac1a5924f	scx_rustland_core: introduce RustLandBuilder() Introduce a wrapper to scx_utils::BpfBuilder that can be used to build the BPF component provided by scx_rustland_core. The source of the BPF components (main.bpf.c) is included in the crate as an array of bytes, the content is then unpacked in a temporary file to perform the build. The RustLandBuilder() helper is also used to generate bpf.rs (that implements the low-level user-space Rust connector to the BPF component). Schedulers based on scx_rustland_core can simply use RustLandBuilder(), to build the backend provided by scx_rustland_core. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-28 17:49:44 +01:00
Andrea Righi	e23426e299	scx_rustland_core: introduce method bpf.update_tasks() Introduce a helper function to update the counter of queued and scheduled tasks (used to notify the BPF component if the user-space scheduler has still some pending work to do). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-28 17:49:44 +01:00
Andrea Righi	00e25530bc	scx_rlfifo: simple user-space FIFO scheduler written in Rust Implement a FIFO scheduler as an example usage of scx_rustland_core. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-28 17:49:44 +01:00
Andrea Righi	cf43129d89	scx_rustland: update documentation scx_rustland has significantly evolved since its original design. With the introduction of scx_rustland_core and the inclusion of the scx_rlfifo example, scx_rustland's focus can be shifted from solely being an "easy-to-read Rust scheduler template" to a fully functional scheduler. For this reason, update the README and documentation to reflect its revised design, objectives, and intended use cases. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-28 17:49:44 +01:00
Andrea Righi	871a6c10f9	scx_rustland_core: include scx_rustland backend Move the BPF component of scx_rustland to scx_rustland_core and make it available to other user-space schedulers. NOTE: main.bpf.c and bpf.rs are not pre-compiled in the scx_rustland_core crate, they need to be included in the user-space scheduler's source code in order to be compiled/linked properly. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-28 17:49:44 +01:00
Andrea Righi	416d6a940f	rust: introduce scx_rustland_core crate Introduce a separate crate (scx_rustland_core) that can be used to implement sched-ext schedulers in Rust that run in user-space. This commit only provides the basic layout for the new crate and the abstraction to the custom allocator. In general, any scheduler that has a user-space component needs to use the custom allocator to prevent potential deadlock conditions, caused by page faults (a kthread needs to run to resolve the page fault, but the scheduler is blocked waiting for the user-space page fault to be resolved => deadlock). However, we don't want to necessarily enforce this constraint to all the existing Rust schedulers, some of them may do all user-space allocations in safe paths, hence the separate scx_rustland_core crate. Merging this code in scx_utils would force all the Rust schedulers to use the custom allocator. In a future commit the scx_rustland backend will be moved to scx_rustland_core, making it a totally generic BPF scheduler framework that can be used to implement user-space schedulers in Rust. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-28 17:49:44 +01:00
David Vernet	8b04a2687f	rusty: Use new infeasible crate Now that we have a new 'infeasible' crate that abstracts the logic for implementing the infeasible weights solution, let's update rusty to use it. Signed-off-by: David Vernet <void@manifault.com>	2024-02-26 10:51:54 -06:00
David Vernet	87eab38506	rustland: Update rustland to use topology.rs The new topology crate allows us to replace the custom rustland topology logic with the logic in the topology crate itself. Signed-off-by: David Vernet <void@manifault.com>	2024-02-23 13:09:06 -06:00
David Vernet	43624a87ce	rusty: Use new topology crate Now that we have this new Topology crate, let's update Rusty to use it instead of using the old one. Signed-off-by: David Vernet <void@manifault.com>	2024-02-23 10:39:55 -06:00
Tejun Heo	4dc77f8ddf	Merge pull request #149 from davemarchevsky/davemarchevsky_nice_equals scx_layered: Add MATCH_NICE_EQUALS match kind	2024-02-22 06:38:17 -10:00
Dave Marchevsky	9f510f18cd	scx_layered: Add MATCH_NICE_EQUALS match kind I have a usecase where specific nice values are used to bucket tasks into groups that are handled separately by different `scx_layered` policies, with no implications of relative priority between niceness X, X + 1, X - 1, etc. In other words, nicevals are used as simple tags in this scenario. If we wanted to treat a specific niceness this way e.g. `11`, we could do so with AND'd MATCH_NICE_{ABOVE,BELOW} like so: ```json "matches" : [ [ { "NiceAbove": 10 }, { "NiceBelow": 12 }, ], ], ``` But this is unnecessarily verbose and doesn't communicate the intent of the match very well. Adding a `NiceEquals` matcher simplifies the config and makes intent obvious: ```json "matches" : [ [ { "NiceEquals": 11 }, ], ], ``` This PR adds support for such a matcher. Also, rename `layer_match.nice_above_or_below` to just `layer_match.nice`, as the former doesn't describe the newly-added matcher's use of the field. It's still obvious that `layer_match.nice` is relevant to MATCH_NICE_{ABOVE, BELOW, EQUALS}. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>	2024-02-22 04:08:07 -08:00
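The equivalence claimed in the commit above can be sketched in Rust. This assumes `NiceAbove` and `NiceBelow` are strict comparisons, which is inferred from the commit's own example (above 10 AND below 12 selects exactly 11). The enum and functions are hypothetical and are not the scx_layered implementation.

```rust
/// Hypothetical model of the nice-based match kinds.
#[derive(Clone, Copy, Debug)]
pub enum NiceMatch {
    NiceAbove(i32),
    NiceBelow(i32),
    NiceEquals(i32),
}

impl NiceMatch {
    /// Whether a task with the given nice value satisfies this matcher.
    /// Strictness of Above/Below is an assumption (see lead-in).
    pub fn matches(self, nice: i32) -> bool {
        match self {
            NiceMatch::NiceAbove(n) => nice > n,
            NiceMatch::NiceBelow(n) => nice < n,
            NiceMatch::NiceEquals(n) => nice == n,
        }
    }
}

/// All matchers in an AND group must match.
pub fn all_match(group: &[NiceMatch], nice: i32) -> bool {
    group.iter().all(|m| m.matches(nice))
}
```

Under that assumption, the group `[NiceAbove(10), NiceBelow(12)]` and the group `[NiceEquals(11)]` accept exactly the same nice values.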
David Vernet	615b594e1c	layered: Don't refresh cpumasks before attaching As mentioned in the previous commit, for some reason we're sometimes (non-deterministically) not seeing the updated cpumask / layer values in BPF if we initialize the cpumasks here before attaching. Let's undo this for now so as to avoid observing buggy behavior, until we figure it out. Signed-off-by: David Vernet <void@manifault.com>	2024-02-21 19:19:45 -06:00
David Vernet	68d317079a	Revert "layered: Set layered cpumask in scheduler init call" This reverts commit `56ff3437a2`. For some reason we seem to be non-deterministically failing to see the updated layer values in BPF if we initialize before attaching. Let's just undo this specific part so that we can unblock this being broken, and we can figure it out async. Signed-off-by: David Vernet <void@manifault.com>	2024-02-21 19:17:19 -06:00
David Vernet	31df8fbd09	layered: Consume from layer with cpumask in layered_dispatch Currently, in layered_dispatch, we do the following: 1. Iterate over all preempt=true layers, and first try to consume from them. 2. Iterate over all confined layers, and consume from them if we find a layer with a cpumask that contains the consuming CPU. 3. Iterate over all grouped and open layers and consume from them in ordered sequence. In (2), we're only iterating over confined layers, but we should also be iterating over grouped layers. Otherwise, despite a consuming CPU being allocated to a specific grouped layer, the CPU will consume from whichever grouped or open layer has a task that's ready to run. Signed-off-by: David Vernet <void@manifault.com>	2024-02-21 15:38:23 -06:00
David Vernet	56ff3437a2	layered: Set layered cpumask in scheduler init call In layered_init, we're currently setting all bits in every layers' cpumask, and then asynchronously updating the cpumasks at later time to reflect their actual values at runtime. Now that we're updating the layered code to initialize the cpumasks before we attach the scheduler, we can instead have the init path actually refresh and initialize the cpumasks directly. Signed-off-by: David Vernet <void@manifault.com>	2024-02-21 15:38:23 -06:00
David Vernet	1f834e7f94	layered: Initialize layers before attaching scheduler We currently have a bug in layered wherein we could fail to propagate layer updates from user space to kernel space if a layer is never adjusted after it's first initialized. For example, in the following configuration: [ { "name": "workload.slice", "comment": "main workload slice", "matches": [ [ { "CgroupPrefix": "workload.slice/" } ] ], "kind": { "Grouped": { "cpus_range": [30, 30], "util_range": [ 0.0, 1.0 ], "preempt": false } } }, { "name": "normal", "comment": "the rest", "matches": [ [] ], "kind": { "Grouped": { "cpus_range": [2, 2], "util_range": [ 0.0, 1.0 ], "preempt": false } } } ] Both layers are static, and need only be resized a single time, so the configuration would never be propagated to the kernel due to us never calling update_bpf_layer_cpumask(). Let's instead have the initialization propagate changes to the skeleton before we attach the scheduler. This has the advantage both of fixing the bug mentioned above where a static configuration is never propagated to the kernel, and that we don't have a short period when the scheduler is first attached where we don't make optimal scheduling decisions due to the layer resizing not being propagated. Signed-off-by: David Vernet <void@manifault.com>	2024-02-21 15:38:21 -06:00
Tejun Heo	22d635c385	Merge pull request #141 from jordalgo/rusty-logging Add libbpf logging to rust schedulers	2024-02-20 13:52:39 -10:00
Andrea Righi	80de48ec83	scx_rustland: introduce --builtin-idle Add a command line option to enable/disable the sched-ext built-in idle selection logic in the user-space scheduler. With this option the user-space scheduler will try to dispatch tasks on the CPU selected during the .select_cpu() phase (using the built-in idle selection logic). Without this option the user-space scheduler will try to dispatch tasks to the first CPU available. The former can be useful to improve throughput, since tasks are more likely to stick on the same CPU, while the latter can provide better system responsiveness, especially when the system is significantly busy. Given that, by default, tasks can be dispatched directly bypassing the user-space scheduler if an idle CPU is found during .select_cpu(), the user-space scheduler is primarily engaged only when the system is busy (no idle CPUs are available). Under these circumstances, it is typically more efficient to dispatch tasks on the first available CPU. Hence, the default behavior is to ignore built-in idle selection logic in the user-space scheduler. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-21 00:25:14 +01:00
Andrea Righi	e487d71032	scx_rustland: simplify CPU selection by relying on built-in idle selection Checking if a CPU is idle or busy in the user-space scheduler is a bit redundant, considering that we also rely on the built-in idle selection logic in the BPF part. Therefore get rid of the additional idle selection logic in the user-space scheduler and rely on the built-in idle selection. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-21 00:25:14 +01:00
Andrea Righi	2cd1d4b684	scx_rustland: introduce --full-user Introduce an option to send all scheduling events and actions to user-space, disabling any form of in-kernel optimization. Enabling this option will likely make the system less responsive (but more predictable in terms of performance) and it can be useful for debugging purposes. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-21 00:25:14 +01:00
Jordan Rome	7c32acece0	Add libbpf logging to the rust schedulers This is to get better logs when failing to load, attach, etc.	2024-02-20 15:17:10 -08:00
David Vernet	ef8aa9ea31	add documentation Signed-off-by: David Vernet <void@manifault.com>	2024-02-20 14:57:09 -06:00
David Vernet	8aba090d4f	rust: Add topology module to utils crate scx_rusty has logic in the scheduler to inspect the host to automatically build scheduling domains across every L3 cache. This would be generically useful for many different types of schedulers, so let's add it to the scx_utils crate so it can be used by others. Signed-off-by: David Vernet <void@manifault.com>	2024-02-20 14:57:09 -06:00
Andrea Righi	7ff06a6ff0	scx_rustland: prevent misaligned pointer dereference The buffer used to store struct queued_task_ctx items fetched from the BPF ring buffer needs to be aligned to the architecture register size, otherwise we may hit misaligned pointer dereference issues, such as: thread 'main' panicked at src/bpf.rs:162:43: misaligned pointer dereference: address must be a multiple of 0x8 but is 0x56516a51e004 note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace Prevent this by making sure the buffer is always aligned to 64-bits. Fixes: `93dc615` ("scx_rustland: use a ring buffer for queued tasks") Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-20 19:08:38 +01:00
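One way to guarantee the 64-bit alignment described in the commit above is a wrapper type with an explicit alignment attribute. The sketch below is illustrative only: `AlignedBuffer`, `QueuedTask`, `BUF_SIZE` and `parse_task` are hypothetical names, and the three-field struct is a stand-in, not the real `queued_task_ctx` layout.

```rust
/// Hypothetical size of the scratch buffer, in bytes.
const BUF_SIZE: usize = 64;

/// Wrapper forcing 8-byte (64-bit) alignment on the raw byte buffer.
#[repr(C, align(8))]
struct AlignedBuffer([u8; BUF_SIZE]);

/// Hypothetical stand-in for the real queued_task_ctx layout.
#[repr(C)]
#[derive(Debug, Clone, Copy)]
struct QueuedTask {
    pid: i32,
    cpu: i32,
    weight: u64,
}

/// Copies `bytes` into an aligned buffer and reads a QueuedTask from its start.
fn parse_task(bytes: &[u8]) -> Option<QueuedTask> {
    let size = std::mem::size_of::<QueuedTask>();
    if bytes.len() < size || size > BUF_SIZE {
        return None;
    }
    let mut buf = AlignedBuffer([0u8; BUF_SIZE]);
    buf.0[..size].copy_from_slice(&bytes[..size]);
    let ptr = buf.0.as_ptr() as *const QueuedTask;
    // The buffer is 8-byte aligned, which satisfies QueuedTask's alignment.
    debug_assert_eq!(ptr as usize % std::mem::align_of::<QueuedTask>(), 0);
    // SAFETY: `ptr` is aligned, points to `size` initialized bytes, and
    // QueuedTask is plain data for which every bit pattern is valid.
    Some(unsafe { *ptr })
}
```

Dereferencing a pointer into a plain `[u8; N]` or `Vec<u8>` carries no such guarantee, which is the kind of situation the panic quoted in the commit message reports.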
Andrea Righi	93dc615653	scx_rustland: use a ring buffer for queued tasks Switch from a BPF_MAP_TYPE_QUEUE to a BPF_MAP_TYPE_RINGBUF to store the tasks that need to be processed by the user-space scheduler. A ring buffer allows to save a lot of memory copies and syscalls, since the memory is directly shared between the BPF and the user-space components. Performance profile before this change: 2.44% [kernel] [k] __memset 2.19% [kernel] [k] __sys_bpf 1.59% [kernel] [k] __kmem_cache_alloc_node 1.00% [kernel] [k] _copy_from_user After this change: 1.42% [kernel] [k] __memset 0.14% [kernel] [k] __sys_bpf 0.10% [kernel] [k] __kmem_cache_alloc_node 0.07% [kernel] [k] _copy_from_user Both the overhead of sys_bpf() and copy_from_user() are reduced by a factor of ~15x now (only the dispatch path is using sys_bpf() now). NOTE: despite being very effective, the current implementation is a bit of a hack. This is because the present ring buffer API exclusively permits consumption in a greedy manner, where multiple items can be consumed simultaneously. However, libbpf-rs does not provide precise information regarding the exact number of items consumed. By utilizing a more refined libbpf-rs API [1] we may be able to improve this code a bit. Moreover, libbpf-rs doesn't provide an API for the user_ring_buffer, so at the moment there's not a trivial way to apply the same change to the dispatched tasks. However, just with this change applied, the overhead of sys_bpf() and copy_from_user() is already minimal, so we won't get much benefits by changing the dispatch path to use a BPF ring buffer. [1] https://github.com/libbpf/libbpf-rs/pull/680 Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-20 12:30:22 +01:00
Andrea Righi	04685e633f	scx_rustland: avoid memory copies while accessing cpu_map Instead of using a BPF_MAP_TYPE_ARRAY to store which tasks are running on which CPU we can simply use a global array, mapped in the user-space address space. In this way we can avoid a lot of memory copies and call to sys_bpf(), significantly reducing the scheduler's overhead. Keep in mind that we don't need to be 100% correct while accessing this information, so we can accept some fuzziness in order to significantly reduce the scheduler's overhead. Performance profile before this change: 5.52% [kernel] [k] __sys_bpf 4.84% [kernel] [k] __kmem_cache_alloc_node 4.71% [kernel] [k] map_lookup_elem 4.10% [kernel] [k] _copy_from_user 3.51% [kernel] [k] bpf_map_copy_value 3.12% [kernel] [k] check_heap_object After this change: 2.20% [kernel] [k] __sys_bpf 1.91% [kernel] [k] map_lookup_and_delete_elem 1.60% [kernel] [k] __kmem_cache_alloc_node 1.10% [kernel] [k] _copy_from_user 0.12% [kernel] [k] check_heap_object n/a bpf_map_copy_value n/a map_lookup_elem With this change we can reduce the overhead of sys_bpf() by ~2x and the overhead of copy_from_user() by ~4x. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-20 12:30:16 +01:00
Andrea Righi	fc889c6995	scx_rustland: replace custom allocator with buddy-alloc Currently, the primary bottleneck in scx_rustland lies within its custom memory allocator, which is used to prevent page faults in the user-space scheduler. This is pretty evident looking at perf top: 39.95% scx_rustland [.] <scx_rustland::bpf::alloc::RustLandAllocator as core::alloc::global::GlobalAlloc>::alloc 3.41% [kernel] [k] _copy_from_user 3.20% [kernel] [k] __kmem_cache_alloc_node 2.59% [kernel] [k] __sys_bpf 2.30% [kernel] [k] __kmem_cache_free 1.48% libc.so.6 [.] syscall 1.45% [kernel] [k] __virt_addr_valid 1.42% scx_rustland [.] <scx_rustland::bpf::alloc::RustLandAllocator as core::alloc::global::GlobalAlloc>::dealloc 1.31% [kernel] [k] _copy_to_user 1.23% [kernel] [k] entry_SYSRETQ_unsafe_stack However, there's no need to reinvent the wheel here, rather than relying on an overly simplistic and inefficient allocator, we can rely on buddy-alloc [1], which is also capable of operating on a preallocated memory buffer. After switching to buddy-alloc, the performance profile under the same workload conditions looks like the following: 6.01% [kernel] [k] _copy_from_user 5.21% [kernel] [k] __kmem_cache_alloc_node 4.45% [kernel] [k] __sys_bpf 3.80% [kernel] [k] __kmem_cache_free 2.79% libc.so.6 [.] syscall 2.34% [kernel] [k] __virt_addr_valid 2.26% [kernel] [k] _copy_to_user 2.14% [kernel] [k] __check_heap_object 2.10% [kernel] [k] __check_object_size.part.0 2.02% [kernel] [k] entry_SYSRETQ_unsafe_stack With this change in place, the primary overhead is now moved to the bpf() syscall and the copies between kernel and user-space (this could potentially be optimized in the future using BPF ring buffers, instead of BPF FIFO queues). A better focus at the allocator overhead before vs after this change: [before] 39.95% scx_rustland [.] core::alloc::global::GlobalAlloc>::alloc 1.42% scx_rustland [.] core::alloc::global::GlobalAlloc>::dealloc [after] 1.50% scx_rustland [.] 
core::alloc::global::GlobalAlloc>::alloc 0.76% scx_rustland [.] core::alloc::global::GlobalAlloc>::dealloc [1] https://crates.io/crates/buddy-alloc Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-11 14:33:39 +01:00
Andrea Righi	ccf5946425	scx_rustland: speed up search by PID in tasks BTreeSet In order to prevent duplicate PIDs in the TaskTree (BTreeSet), we perform an O(N) search each time we add an item, to verify whether the PID already exists or not. Under heavy stress test conditions the O(N) complexity can have a potential impact on the overall performance. To mitigate this, introduce a HashMap that can be used to retrieve tasks by PID typically with a O(1) complexity. This could potentially degrade to O(N) in presence of hash collisions, but even in this case, accessing the hash map is still more efficient than scanning all the entries in the BTreeSet to search for the target PID. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-11 14:11:38 +01:00
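The idea in the commit above, pairing the ordered set with a hash map keyed by PID, can be sketched as follows. The `Task` and `TaskTree` types here are hypothetical simplifications, and replacing the old entry when a PID is pushed again is an assumption about the policy, not something the commit message states.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical task entry; the derived ordering compares vruntime first, then pid.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Task {
    vruntime: u64,
    pid: i32,
}

#[derive(Default)]
struct TaskTree {
    /// Tasks ordered by (vruntime, pid).
    tasks: BTreeSet<Task>,
    /// PID index, so duplicates are found without scanning `tasks`.
    by_pid: HashMap<i32, Task>,
}

impl TaskTree {
    /// Inserts a task. If the PID is already present, the stale entry is
    /// looked up through the map and removed from the set first.
    fn push(&mut self, pid: i32, vruntime: u64) {
        let task = Task { vruntime, pid };
        if let Some(old) = self.by_pid.insert(pid, task) {
            self.tasks.remove(&old);
        }
        self.tasks.insert(task);
    }

    /// Pops the task with the smallest vruntime and drops it from the index.
    fn pop(&mut self) -> Option<Task> {
        let task = self.tasks.pop_first()?;
        self.by_pid.remove(&task.pid);
        Some(task)
    }
}
```

The duplicate check becomes a hash lookup plus an O(log N) set removal, instead of a linear scan over every queued task.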
Andrea Righi	7ce0d038e4	Merge pull request #133 from sched-ext/rustland-cpumask-gen-cnt scx_rustland: per-task cpumask generation counter	2024-02-10 19:07:02 +01:00
Andrea Righi	61d1ed338a	scx_rustland: per-task cpumask generation counter Introduce a per-task generation counter to check the validity of the cpumask at dispatch time. The logic is the following: - the cpumask generation number is incremented every time a task calls .set_cpumask() - when a task is enqueued the current generation number is stored in the queued_task_ctx and relayed to the user-space scheduler - the user-space scheduler can decide to dispatch the task on the CPU determined by the BPF layer in .select_cpu(), redirect the task to any other specific CPU, or redirect to the first CPU available (using NO_CPU) - task is then dispatched back to the BPF code along with its cpumask generation counter - at dispatch time the BPF code checks if the generation number is the same and it discards the dispatch attempt if the cpumask is not valid anymore (the task will be automatically re-enqueued by the sched-ext core code, potentially selecting another CPU / cpumask) - if the cpumask is valid, but the CPU selected by the user-space scheduler is invalid (according to the cpumask), the task will be transparently bounced by the BPF code to the shared DSQ (in this way the user-space code can be completely abstracted and dispatches that target invalid CPUs can be automatically fixed by the BPF layer) This solution can prevent stalls due to dispatches targeting invalid CPUs and it can also avoid redundant dispatch events, making the code more efficient and the cpumask interlocking more reliable. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-10 18:02:42 +01:00
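The dispatch-time check listed in the commit above can be modeled in plain Rust. This is a sketch of the decision logic only, not the BPF code: `TaskCtx`, `Dispatch` and the 64-bit mask are hypothetical, and `None` stands in for `NO_CPU`.

```rust
/// Hypothetical outcome of a dispatch attempt.
#[derive(Debug, PartialEq, Eq)]
enum Dispatch {
    /// Generation mismatch: the dispatch is dropped and the task is re-enqueued.
    Discarded,
    /// cpumask is current, but no usable target CPU: bounce to the shared DSQ.
    SharedDsq,
    /// cpumask is current and the target CPU is allowed.
    Cpu(u32),
}

/// Hypothetical per-task state kept on the BPF side.
struct TaskCtx {
    /// Bit i set means CPU i is allowed (at most 64 CPUs in this sketch).
    cpumask: u64,
    /// Generation counter, bumped on every cpumask change.
    cpumask_cnt: u64,
}

impl TaskCtx {
    /// Models .set_cpumask(): update the mask and bump the generation.
    fn set_cpumask(&mut self, mask: u64) {
        self.cpumask = mask;
        self.cpumask_cnt += 1;
    }

    /// `cnt` is the generation recorded at enqueue time and relayed back by
    /// the user-space scheduler together with its CPU choice.
    fn dispatch(&self, cpu: Option<u32>, cnt: u64) -> Dispatch {
        if cnt != self.cpumask_cnt {
            return Dispatch::Discarded;
        }
        match cpu {
            Some(c) if c < 64 && self.cpumask & (1u64 << c) != 0 => Dispatch::Cpu(c),
            _ => Dispatch::SharedDsq,
        }
    }
}
```

A dispatch carrying a stale generation is discarded outright, while a dispatch with a current generation but a disallowed CPU is redirected, matching the two failure cases the commit distinguishes.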
David Vernet	1c00de9402	Merge pull request #129 from sched-ext/infeasible_weights Implement solution to infeasible weights problem	2024-02-09 16:23:56 -06:00
David Vernet	e627176d90	scx: Implement solution to infeasible weights problem As described in [0], there is an open problem in load balancing called the "infeasible weights" problem. Essentially, the problem boils down to the fact that a task with disproportionately high load can be granted more CPU time than it can actually consume per its duty cycle. This patch implements a solution to that problem, wherein we apply the algorithm described in this paper to adjust all infeasible weights in the system down to a feasible weight that gives them their full duty cycle, while allowing the remaining feasible tasks on the system to share the remaining compute capacity on the machine. [0]: https://drive.google.com/file/d/1fAoWUlmW-HTp6akuATVpMxpUpvWcGSAv/view?usp=drive_link Signed-off-by: David Vernet <void@manifault.com>	2024-02-09 16:23:12 -06:00
Andrea Righi	8e47602f00	scx_rustland: keep default CPU selection when idle Dispatch to the shared DSQ (NO_CPU) only when the assigned CPU is not idle anymore, otherwise maintain the same CPU that has been assigned by the BPF layer. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-08 22:48:07 +01:00
Andrea Righi	7085d57709	scx_rustland: kick user-space scheduler when a CPU is released When the system is not being fully utilized there may be delays in promptly awakening the user-space scheduler. This can happen for example, when some CPU-intensive tasks are constantly dispatched bypassing the user-space scheduler (e.g., using SCX_DSQ_LOCAL) and other CPUs are completely idle. Under this condition the update_idle() can fail to activate the user-space scheduler, because there are no pending events, and only the periodic timer will wake up the scheduler, potentially introducing lags of up to 1 sec. This can be reproduced, for example, running a video game that doesn't use all the CPUs available in the system (i.e., Team Fortress 2). With this game it is pretty easy to notice sporadic lags that are resumed after ~1sec, due to the periodic timer kicking scheduler. To prevent this from happening wake up the user-space scheduler immediately as soon as a CPU is released, speculating on the fact that most of the time there will be always another task ready to run. This can introduce a little more overhead in the scheduler (due to potential unnecessary wake up events), but it also prevents stuttery behaviors and it makes the system much more smooth and responsive, especially with video games. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-08 22:48:07 +01:00
Andrea Righi	cb82d91e0f	scx_rustland: use scx_bpf_dispatch_cancel() Use scx_bpf_dispatch_cancel() to invalidate dispatches on wrong per-CPU DSQ, due to cpumask race conditions, and redirect them to the shared DSQ. This prevents dispatching tasks to CPU that cannot be used according to the task's cpumask. With this applied the scheduler passed all the `stress-ng --race-sched` stress tests. Moreover, introduce a counter that is periodically reported to stdout as an additional statistic, that can be helpful for debugging. Link: https://github.com/sched-ext/sched_ext/pull/135 Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-08 22:48:07 +01:00
Andrea Righi	13e23e8cc9	scx_rustland: dump scheduler statistics before exiting Print all the scheduler statistics before exiting. Reporting the very last state of the scheduler can help to debug events that could trigger error conditions (such as page faults, scheduler congestions, etc.). While at it, fix also some minor coding style issues (tabs vs spaces). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-08 15:37:44 +01:00
David Vernet	c574598dc7	scx_rusty: Fix typos Signed-off-by: David Vernet <void@manifault.com>	2024-02-07 23:38:26 -06:00
Tejun Heo	2062d1ad1f	scx: Add compat support for SCX_KICK_IDLE and use it for idle CPU wakeups SCX_KICK_IDLE is a new feature which isn't defined in older kernels. Add compat wrapper and use it for idle CPU wakeups. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-06 15:28:40 -10:00
Andrea Righi	acb174aa51	scx_rustland: prevent duplicate PIDs in the task BTreeSet Items in the task BTreeSet are stored by pid and vruntime. Make sure that we never store multiple items with the same PID, so that re-enqueued tasks are not dispatched multiple times. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-03 14:46:39 +01:00
Andrea Righi	681b3fd807	scx_rustland: more aggressive time slice scaling Allow to scale the effective time slice down to 250 us. This can help to maintain a good quality of the audio even when the system is overloaded by multiple CPU-intensive tasks. Moreover, always round up the time slice scaling factor, making the scaling a little more aggressive, so that we can prioritize low latency tasks even more. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 16:40:59 +01:00
Andrea Righi	26d6d530f0	scx_rustland: enhance interactive task classification Evaluate the number of voluntary context switches per second (nvcsw/sec) for each task using an exponentially weighted moving average (EWMA) with weight 0.5, that allows to classify interactive tasks with more accuracy. Using a simple average over a period of time of 10 sec can introduce small lags every 10 sec, as the statistics for the number of voluntary context switches are refreshed. This can result in interactive tasks taking a brief time to catch up in order to be accurately classified as such, causing for example short audio cracks, small drop of 5-10 fps in games, etc. Using an EWMA allows to smooth the average of nvcsw/sec, preventing short lags in the interactive tasks, while also preventing to incorrectly classify as interactive tasks that may experience an isolated short burst of voluntary context switches. This patch has been tested with the usual test case of playing a videogame while running a parallel kernel build in the background. Without this patch the short lag every 10 sec is clearly noticeable, with this patch applied the game and audio run smoothly. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 16:40:59 +01:00
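For reference, an exponentially weighted moving average with weight 0.5 can be written as below. This is the textbook form, shown as a sketch; the exact expression used in scx_rustland may differ, and `ewma` / `ewma_series` are hypothetical helper names.

```rust
/// One EWMA update: new_avg = w * sample + (1 - w) * old_avg, with w = 0.5.
fn ewma(old_avg: f64, sample: f64) -> f64 {
    const WEIGHT: f64 = 0.5;
    WEIGHT * sample + (1.0 - WEIGHT) * old_avg
}

/// Folds a series of nvcsw/sec samples into a single average, starting from 0.
fn ewma_series(samples: &[f64]) -> f64 {
    samples.iter().fold(0.0, |avg, &s| ewma(avg, s))
}
```

With this weight each new sample moves the average halfway toward itself, so one isolated burst is halved again on every following quiet sample, and the statistic never resets in one step the way a fixed 10 sec window does.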
Andrea Righi	baeea306fc	scx_rustland: rely on the built-in idle selection logic Simplify the idle selection logic by relying only on the built-in idle selection performed in the BPF layer. When there are idle CPUs available in the system, tasks are dispatched directly by the BPF dispatcher without invoking the user-space scheduler. This allows to avoid the user-space overhead and get the best system performance when CPU resources are not overcommitted. Once the number of tasks exceeds the available CPUs, the user-space scheduler takes over. However, by this time, the system is already overcommitted, so there's little advantage in attempting to pinpoint the optimal idle CPU through the user-space scheduler. Instead, tasks can be executed on the first available CPU, consistently dispatching them to the shared DSQ. This allows to achieve the optimal performance both with system under-utilization and over-utilization. With this change in place the user-space scheduler won't dispatch tasks directly to specific CPUs, but we still want to keep this as a generic feature in the BPF layer, so that it can be potentially used in the future by this scheduler or even by other user-space schedulers (once the BPF layer will be moved to a more generic place). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 16:40:59 +01:00
Andrea Righi	b9e60f71ed	scx_rustland: usersched: code refactoring No functional change, just move code around to make it more readable. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 16:40:59 +01:00
Andrea Righi	d13ed5c025	scx_rustland: BPF: refine CPU dispatch logic When the user-space scheduler dispatches a task on a specific CPU, that CPU might not be valid, since the user-space doesn't have visibility of the task's cpumask. When this happens the BPF dispatcher (that has direct visibility of the cpumask) should automatically redirect the task to a valid CPU, but instead of bouncing the task on the shared DSQ, we should try to use the CPU assigned by the built-in idle selection logic. If this CPU is also not valid, then we can simply ignore the task, that has been de-queued and re-enqueued, since a valid CPU will be naturally re-selected at a later time. Moreover, avoid to kick any specific CPU when the task is dispatched to shared DSQ, since the task can be consumed on any CPU and the additional kick would simply add more overhead. Lastly, rename dsq_id_to_cpu() to dsq_to_cpu() and cpu_to_dsq_id() to cpu_to_dsq() for more clarity. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 16:38:17 +01:00
Andrea Righi	45d8b54eb9	scx_rustland: re-introduce per-CPU DSQ + a global shared DSQ With commit `c6ada25` ("scx_rustland: use custom pcpu DSQ instead of SCX_DSQ_LOCAL{_ON}") we tried to introduce custom per-CPU DSQs, instead of using SCX_DSQ_LOCAL and SCX_DSQ_LOCAL_ON to dispatch tasks. This was required, because dispatching tasks using SCX_DSQ_LOCAL_ON doesn't provide a guarantee that the cpumask, checked at dispatch time to determine the validity of a target CPU, remains valid. This method solved the cpumask validity issue, but unfortunately it introduced a noticeable performance regression and a potential starvation issue (that were probably caused by the same problem): if a task is assigned to a CPU in select_cpu() and the scheduler decides to dispatch it on a different CPU, the task will be added to the new CPU's DSQ, but if no dispatch event happens there, the task may remain stuck in the per-CPU DSQ for a long time, triggering the sched-ext watchdog timeout that would kick out the scheduler, for example: 12:53:28 [WARN] FAIL: IPC:CSteamEngin[7217] failed to run for 6.482s (err=1026) 12:53:28 [INFO] Unregister RustLand scheduler Therefore, we reverted this change with `6d89ece` ("scx_rustland: dispatch tasks only on the global DSQ"), dispatching all the tasks to the global DSQ, completely delegating the kernel to distribute tasks among the available CPUs. This is not the ideal solution, because we still want to give the possibility to the user-space scheduler to assign tasks to specific CPUs. Therefore, re-introduce distinct per-CPU DSQs, but also provide a global shared DSQ. Tasks dispatched in the per-CPU DSQs are consumed from the dispatch() callback of their corresponding CPU, tasks dispatched in the global shared DSQ are consumed from any CPU. 
In this way the BPF layer is able to provide an interface that gives the flexibility to the user-space to dispatch a task on a specific CPU or on the first CPU available, depending on the particular scheduler's need. If an invalid CPU (according to the cpumask) is selected the BPF dispatcher will transparently redirect the task to a valid CPU, selected using the built-in idle selection logic. In the future we may want to improve this part, giving to the user-space the visibility of the cpumask, in order to pick a valid CPU in advance and in a proper synchronized way. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 00:33:35 +01:00
Andrea Righi	b5e846c538	scx_rustland: BPF: small refactoring No functional change, just some refactoring to make the code more clear. We have is_usersched_needed() and set_usersched_needed() that are doing different things (the former is checking if there are pending tasks for the scheduler, the latter is setting the usersched_needed flag to activate the dispatch of the user-space scheduler). Rename is_usersched_needed() to usersched_has_pending_tasks() to make the code more clear and understandable. Also move dispatch_user_scheduler() closer to the other dispatch-related helper functions. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-02-01 00:33:35 +01:00
Tejun Heo	6db362b27a	scx_rustland: Use scx_utils::user_exit_info Instead of the bespoke implementation. This also makes scx_rustland to print out debug dump if exists. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-31 11:44:15 -10:00
Tejun Heo	965926f393	scx_rusty: Use scx_utils::user_exit_info Instead of the bespoke implementation. This also makes scx_rusty to print out debug dump if exists. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-31 11:08:17 -10:00
Tejun Heo	105dc36b8f	scx_layered: Use scx_utils::user_exit_info Instead of the bespoke implementation. This also makes scx_layered to print out debug dump if exists. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-31 10:54:20 -10:00
Tejun Heo	4ee8104a6d	Merge pull request #114 from dschatzberg/local_avoid_enqueue scx_layered: dispatch from select_cpu if possible	2024-01-31 08:33:26 -10:00
Dan Schatzberg	11e487c165	scx_layered: dispatch from select_cpu if possible If we are doing local dispatch, we can avoid enqueue() altogether by dispatching from select_cpu() Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>	2024-01-31 09:54:26 -08:00
Jordan Rome	1b3a9a1e72	[scx_layered] downgrade prometheus-client This library at version 22 is not available in fedora: https://src.fedoraproject.org/rpms/rust-prometheus-client Rather than bothering the maintainer, let's just downgrade here.	2024-01-31 04:36:01 -08:00
Dan Schatzberg	ab5635ff6d	scx_layered: Grab idle_smtmask a bit later This is a really minor optimization, but we don't need idle_smtmask to schedule pinned tasks, so defer it so the nr_cpus_allowed == 1 path is marginally faster. Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>	2024-01-29 08:16:37 -08:00
Dan Schatzberg	8c9e65d880	scx_layered: Remove unnecessary idle_cpumask idle_cpumask isn't used at all in pick_idle_cpu_from. The only need for these cpumasks is to check if prev_cpu is a wholly idle CPU (and we only do this when smt_enabled). idle_smtmask is sufficient for that check. Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>	2024-01-29 08:16:37 -08:00
Dan Schatzberg	142b6230b2	scx_layered: Fix AFFN_VIOL stat bump Prior to this patch, we only bumped LSTAT_AFFN_VIOL when the target cpu was idle, but in both cases it should be counted as AFFN_VIOL. Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>	2024-01-26 13:13:16 -08:00
Tejun Heo	988b7d13c1	Bump versions scx_exit_info change doesn't require code to be updated but breaks binary compatibility. Bump versions and cut a new release.	2024-01-25 09:01:23 -10:00
Tejun Heo	eb997a6e55	Merge pull request #101 from dschatzberg/openmetrics scx_layered: Add support for OpenMetrics format	2024-01-25 08:59:16 -10:00
Dan Schatzberg	7f9548eb34	scx_layered: Add support for OpenMetrics format Currently scx_layered outputs statistics periodically as info! logs. The format of this is largely unstructured and mostly suitable for running scx_layered interactively (e.g. observing its behavior on the command line or via logs after the fact). In order to run scx_layered at larger scale, it's desirable to have statistics output in some format that is amenable to being ingested into monitoring databases (e.g. Prometheus). This allows collection of stats across many machines. This commit adds a command line flag (-o) that outputs statistics to stdout in OpenMetrics format instead of the normal log mechanism. OpenMetrics has a public format specification (https://github.com/OpenObservability/OpenMetrics) and is in use by many projects. The library for producing OpenMetrics metrics is lightweight but does induce some changes. Primarily, metrics need to be pre-registered (see OpenMetricsStats::new()). Without -o, the output looks as before, for example: ``` 19:39:54 [INFO] CPUs: online/possible=52/52 nr_cores=26 19:39:54 [INFO] Layered Scheduler Attached 19:39:56 [INFO] tot= 9912 local=76.71 open_idle= 0.00 affn_viol= 2.63 tctx_err=0 proc=21ms 19:39:56 [INFO] busy= 1.3 util= 65.2 load= 263.4 fallback_cpu= 1 19:39:56 [INFO] batch : util/frac= 49.7/ 76.3 load/frac= 252.0: 95.7 tasks= 458 19:39:56 [INFO] tot= 2842 local=45.04 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00 19:39:56 [INFO] cpus= 2 [ 0, 2] 04000001 00000000 19:39:56 [INFO] immediate: util/frac= 0.0/ 0.0 load/frac= 0.0: 0.0 tasks= 0 19:39:56 [INFO] tot= 0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00 19:39:56 [INFO] cpus= 50 [ 0, 50] fbfffffe 000fffff 19:39:56 [INFO] normal : util/frac= 15.4/ 23.7 load/frac= 11.4: 4.3 tasks= 556 19:39:56 [INFO] tot= 7070 local=89.43 open_idle= 0.00 preempt= 0.00 affn_viol= 3.69 19:39:56 [INFO] cpus= 50 [ 0, 50] fbfffffe 000fffff 19:39:58 [INFO] tot= 7091 local=84.91 open_idle= 0.00 
affn_viol= 2.64 tctx_err=0 proc=21ms 19:39:58 [INFO] busy= 0.6 util= 31.2 load= 107.1 fallback_cpu= 1 19:39:58 [INFO] batch : util/frac= 18.3/ 58.5 load/frac= 93.9: 87.7 tasks= 589 19:39:58 [INFO] tot= 2011 local=60.67 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00 19:39:58 [INFO] cpus= 2 [ 2, 2] 04000001 00000000 19:39:58 [INFO] immediate: util/frac= 0.0/ 0.0 load/frac= 0.0: 0.0 tasks= 0 19:39:58 [INFO] tot= 0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00 19:39:58 [INFO] cpus= 50 [ 50, 50] fbfffffe 000fffff 19:39:58 [INFO] normal : util/frac= 13.0/ 41.5 load/frac= 13.2: 12.3 tasks= 650 19:39:58 [INFO] tot= 5080 local=94.51 open_idle= 0.00 preempt= 0.00 affn_viol= 3.68 19:39:58 [INFO] cpus= 50 [ 50, 50] fbfffffe 000fffff ^C19:39:59 [INFO] EXIT: BPF scheduler unregistered ``` With -o passed, the output is in OpenMetrics format: ``` 19:40:08 [INFO] CPUs: online/possible=52/52 nr_cores=26 19:40:08 [INFO] Layered Scheduler Attached # HELP total Total scheduling events in the period. # TYPE total gauge total 8489 # HELP local % that got scheduled directly into an idle CPU. # TYPE local gauge local 86.45305689716104 # HELP open_idle % of open layer tasks scheduled into occupied idle CPUs. # TYPE open_idle gauge open_idle 0.0 # HELP affn_viol % which violated configured policies due to CPU affinity restrictions. # TYPE affn_viol gauge affn_viol 2.332430203793144 # HELP tctx_err Failures to free task contexts. # TYPE tctx_err gauge tctx_err 0 # HELP proc_ms CPU time this binary has consumed during the period. # TYPE proc_ms gauge proc_ms 20 # HELP busy CPU busy % (100% means all CPUs were fully occupied). # TYPE busy gauge busy 0.5294061026085283 # HELP util CPU utilization % (100% means one CPU was fully occupied). # TYPE util gauge util 27.37195512782239 # HELP load Sum of weight * duty_cycle for all tasks. # TYPE load gauge load 81.55024768702126 # HELP layer_util CPU utilization of the layer (100% means one CPU was fully occupied). 
# TYPE layer_util gauge layer_util{layer_name="immediate"} 0.0 layer_util{layer_name="normal"} 19.340849995024997 layer_util{layer_name="batch"} 8.031105132797393 # HELP layer_util_frac Fraction of total CPU utilization consumed by the layer. # TYPE layer_util_frac gauge layer_util_frac{layer_name="batch"} 29.34063385422595 layer_util_frac{layer_name="immediate"} 0.0 layer_util_frac{layer_name="normal"} 70.65936614577405 # HELP layer_load Sum of weight * duty_cycle for tasks in the layer. # TYPE layer_load gauge layer_load{layer_name="immediate"} 0.0 layer_load{layer_name="normal"} 11.14363313258934 layer_load{layer_name="batch"} 70.40661455443191 # HELP layer_load_frac Fraction of total load consumed by the layer. # TYPE layer_load_frac gauge layer_load_frac{layer_name="normal"} 13.664744680306903 layer_load_frac{layer_name="immediate"} 0.0 layer_load_frac{layer_name="batch"} 86.33525531969309 # HELP layer_tasks Number of tasks in the layer. # TYPE layer_tasks gauge layer_tasks{layer_name="immediate"} 0 layer_tasks{layer_name="normal"} 490 layer_tasks{layer_name="batch"} 343 # HELP layer_total Number of scheduling events in the layer. # TYPE layer_total gauge layer_total{layer_name="normal"} 6711 layer_total{layer_name="batch"} 1778 layer_total{layer_name="immediate"} 0 # HELP layer_local % of scheduling events directly into an idle CPU. # TYPE layer_local gauge layer_local{layer_name="batch"} 69.79752530933632 layer_local{layer_name="immediate"} 0.0 layer_local{layer_name="normal"} 90.86574281031143 # HELP layer_open_idle % of scheduling events into idle CPUs occupied by other layers. # TYPE layer_open_idle gauge layer_open_idle{layer_name="immediate"} 0.0 layer_open_idle{layer_name="batch"} 0.0 layer_open_idle{layer_name="normal"} 0.0 # HELP layer_preempt % of scheduling events that preempted other tasks. 
# TYPE layer_preempt gauge layer_preempt{layer_name="normal"} 0.0 layer_preempt{layer_name="batch"} 0.0 layer_preempt{layer_name="immediate"} 0.0 # HELP layer_affn_viol % of scheduling events that violated configured policies due to CPU affinity restrictions. # TYPE layer_affn_viol gauge layer_affn_viol{layer_name="normal"} 2.950379973178364 layer_affn_viol{layer_name="batch"} 0.0 layer_affn_viol{layer_name="immediate"} 0.0 # HELP layer_cur_nr_cpus Current # of CPUs assigned to the layer. # TYPE layer_cur_nr_cpus gauge layer_cur_nr_cpus{layer_name="normal"} 50 layer_cur_nr_cpus{layer_name="batch"} 2 layer_cur_nr_cpus{layer_name="immediate"} 50 # HELP layer_min_nr_cpus Minimum # of CPUs assigned to the layer. # TYPE layer_min_nr_cpus gauge layer_min_nr_cpus{layer_name="normal"} 0 layer_min_nr_cpus{layer_name="batch"} 0 layer_min_nr_cpus{layer_name="immediate"} 0 # HELP layer_max_nr_cpus Maximum # of CPUs assigned to the layer. # TYPE layer_max_nr_cpus gauge layer_max_nr_cpus{layer_name="immediate"} 50 layer_max_nr_cpus{layer_name="normal"} 50 layer_max_nr_cpus{layer_name="batch"} 2 # EOF ^C19:40:11 [INFO] EXIT: BPF scheduler unregistered ``` Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>	2024-01-25 09:59:49 -08:00
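The text exposition shown in the commit above (a `# HELP` line, a `# TYPE` line, one sample per label set, and a final `# EOF`) can be sketched with the standard library alone. This is only an illustration of the output shape, not the code or the library that scx_layered uses; the `Gauge` struct and the `render` function are hypothetical names introduced here, and every value is rendered as a float for simplicity.

```rust
/// Hypothetical gauge family: a name, a help string, and samples that
/// optionally carry a `layer_name` label. Not the scx_layered types.
struct Gauge<'a> {
    name: &'a str,
    help: &'a str,
    /// (optional layer_name label, value)
    samples: Vec<(Option<&'a str>, f64)>,
}

/// Render all gauge families in an OpenMetrics-style text layout.
fn render(gauges: &[Gauge<'_>]) -> String {
    let mut out = String::new();
    for g in gauges {
        out.push_str(&format!("# HELP {} {}\n", g.name, g.help));
        out.push_str(&format!("# TYPE {} gauge\n", g.name));
        for (layer, value) in &g.samples {
            match layer {
                // `{:?}` keeps a trailing ".0" on whole numbers (0.0, not 0).
                Some(l) => out.push_str(&format!(
                    "{}{{layer_name=\"{}\"}} {:?}\n",
                    g.name, l, value
                )),
                None => out.push_str(&format!("{} {:?}\n", g.name, value)),
            }
        }
    }
    // The exposition is terminated by an EOF marker.
    out.push_str("# EOF\n");
    out
}

fn main() {
    let gauges = vec![
        Gauge {
            name: "busy",
            help: "CPU busy % (100% means all CPUs were fully occupied).",
            samples: vec![(None, 0.5)],
        },
        Gauge {
            name: "layer_tasks",
            help: "Number of tasks in the layer.",
            samples: vec![(Some("normal"), 490.0), (Some("batch"), 343.0)],
        },
    ];
    print!("{}", render(&gauges));
}
```

Declaring every family up front, as `main` does here, loosely mirrors the pre-registration requirement the commit message mentions.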
Andrea Righi	6d89eceb93	scx_rustland: dispatch tasks only on the global DSQ Commit `c6ada25` ("scx_rustland: use custom pcpu DSQ instead of SCX_DSQ_LOCAL{_ON}") fixed the race issues with the cpumask, but it also introduced performance regressions. Until we figure out the reasons for the performance regressions, simplify the dispatcher and go back to using only the global DSQ, relying on the built-in idle cpu selection. In this way we can still enforce task affinity properly (`stress-ng --race-sched N` does not crash the scheduler) and we can also provide a better level of system responsiveness (according to the results of the stress tests done recently). The idea of this change is to make the scheduler usable in certain real-world scenarios (and as bug-free as possible), while we figure out the performance regressions of the per-CPU DSQ approach, which will likely be reintroduced later. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-23 13:24:12 +01:00
Andrea Righi	06b5ff3d2f	scx_rustland: clarify the logic to determine interactive tasks No functional change, simply rewrite the code a bit and update the comment to clarify the logic to detect interactive tasks and apply the priority boost. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-23 08:28:44 +01:00
Andrea Righi	ab1c4f66a8	scx_rustland: allow to disable the slice boost completely Allow specifying `-b 0` to completely disable the slice boost logic and fall back to a standard vruntime-based scheduler with variable time slice. In this way interactive tasks will not get over-prioritized over the other tasks in the system. Having this option can help to easily track down potential performance regressions arising from over-prioritizing interactive tasks. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-23 00:34:06 +01:00
Andrea Righi	b4269452fc	scx_userland: handle preemption events from higher sched_class Make sure to re-schedule the user-space scheduler if it's preempted by a task from a higher priority sched_class. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-23 00:34:06 +01:00
Andrea Righi	2426d1024f	scx_rustland: increase max amount of enqueued tasks As the scheduler is progressing towards a more stable and usable state, it may be subject to heavy stress tests. For this reason, bump up the limit of MAX_ENQUEUED_TASKS to 8192 in the BPF component, to be able to sustain task-intensive stress tests, reducing the risk of potential scheduling congestion conditions. The downside is a negligible increase in the memory footprint of the BPF component, that is worth the cost in order to have an improved scheduler stability. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-21 15:47:35 +01:00
Andrea Righi	28bf96c78e	scx_rustland: mitigate unevictable memory page faults Page faults cannot happen when the user-space scheduler is running, otherwise we may hit deadlock conditions: a kthread may need to run to resolve the page fault, but the user-space scheduler is waiting on the page fault to be resolved => deadlock. We solved this problem (mostly) in commit `9708a80` ("scx_userland: use a custom memory allocator to prevent page faults"), introducing a custom allocator for the user-space scheduler that operates on a pre-allocated mlocked memory buffer, but there is an exception that can still trigger page faults: kcompactd. When memory compaction is enabled, specifically with vm.compact_unevictable_allowed=1 (which is often the default in many distributions), kcompactd regularly attempts to compact all memory zones, such that free memory is available in contiguous blocks where feasible, including unevictable memory as well. In the event that kcompactd remaps pages within the user-space scheduler's address space, it can lead to page faults, resulting in a potential deadlock. To prevent this from happening, automatically set vm.compact_unevictable_allowed=0 when the scheduler is loaded and restore the previous value when the scheduler is unloaded. In this way we can prevent kcompactd from touching the unevictable memory associated with the user-space scheduler. Keep in mind that this is not a fully bulletproof solution: something else in the system may still set vm.compact_unevictable_allowed=1 while the scheduler is running, re-enabling the risk of deadlock. Ideally we would need a way to mark the user-space scheduler memory as "really unevictable", or a proper kernel ABI to instruct kcompactd to exclude certain tasks (or better, cgroups) from its proactive memory compaction actions, but until then, this seems to be the best way to mitigate this issue. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-21 15:47:35 +01:00
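The save, override, and restore pattern described in the commit above can be sketched as a small RAII guard. This is a minimal illustration under stated assumptions, not the scx_rustland implementation: `SysctlGuard` is a hypothetical name, and the demo in `main` operates on a scratch file in the temporary directory so that it runs without root. A real scheduler would point it at `/proc/sys/vm/compact_unevictable_allowed`.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical guard: remembers the previous value of a sysctl-like file
/// and writes it back when dropped.
struct SysctlGuard {
    path: PathBuf,
    saved: String,
}

impl SysctlGuard {
    /// Save the current value found at `path`, then write `value` to it.
    fn set(path: impl AsRef<Path>, value: &str) -> io::Result<Self> {
        let path = path.as_ref().to_path_buf();
        let saved = fs::read_to_string(&path)?.trim().to_string();
        fs::write(&path, value)?;
        Ok(Self { path, saved })
    }
}

impl Drop for SysctlGuard {
    fn drop(&mut self) {
        // Best effort: a failed restore cannot be reported from drop().
        let _ = fs::write(&self.path, &self.saved);
    }
}

fn main() -> io::Result<()> {
    // Stand-in for /proc/sys/vm/compact_unevictable_allowed (assumption:
    // a scratch file is used so the sketch does not need privileges).
    let path = std::env::temp_dir().join("compact_unevictable_allowed.demo");
    fs::write(&path, "1\n")?;
    {
        let _guard = SysctlGuard::set(&path, "0")?;
        // The scheduler would run here with compaction of unevictable
        // pages disabled.
        println!("while loaded: {}", fs::read_to_string(&path)?.trim());
    }
    println!("after unload: {}", fs::read_to_string(&path)?.trim());
    fs::remove_file(&path)
}
```

As the commit message notes, this only mitigates the problem: nothing stops another process from rewriting the sysctl while the guard is alive.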
David Vernet	c6ada251ef	scx_rustland: use custom pcpu DSQ instead of SCX_DSQ_LOCAL{_ON} We still don't have a reliable and non-racy way to manage cpumasks from the user-space scheduler, so it is quite hard for the scheduler to enforce the proper CPU affinity behavior. Despite checking the cpumask in the BPF part, tasks may still be assigned to a CPU that they cannot use, triggering scheduler errors. For example, it is really easy to crash the scheduler with a simple CPU affinity stress test (`stress-ng --race-sched 8 --timeout 5`): 14:51:28 [WARN] FAIL: SCX_DSQ_LOCAL[_ON] verdict target cpu 1 not allowed for stress-ng-race-[567048] (err=1024) To prevent this issue from happening, create a custom DSQ for each CPU available in the system and use these per-CPU DSQs to dispatch all the tasks processed by the user-space scheduler, including the user-space scheduler itself. Then consume these DSQs from the .dispatch() callback of the respective CPU, to transfer all the tasks to the consuming CPU's local DSQ, preventing the cpumask race condition encountered using SCX_DSQ_LOCAL_ON. With this patch applied the `stress-ng --race-sched N` stress test can be executed successfully (even with large values of N) without causing the scheduler to crash. Signed-off-by: David Vernet <void@manifault.com> [ arighi: kick target cpu to improve responsiveness, update comments ] Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-21 15:47:35 +01:00


428 Commits