scx-upstream

mirror of https://github.com/sched-ext/scx.git synced 2024-12-04 16:27:12 +00:00

Author	SHA1	Message	Date
David Vernet	b50ba626cc	uei: Pass skel to RESIZE_ARRAY() The RESIZE_ARRAY() macro assumes the presence of an in-scope "skel" variable. This is bad practice and can cause issues in other macros that use it. Let's update it to explicitly take a skel argument. Signed-off-by: David Vernet <void@manifault.com>	2024-06-11 10:15:26 -05:00
Tejun Heo	92317aa2f9	Use __always_inline uniformly Instead of using __attribute__((always_inline)) use the __always_inline macro provided by BPF.	2024-06-10 11:23:26 -10:00
I Hsin Cheng	5881c61a5e	scx_central: Provide backward compatibility Newer sched_ext kernel versions set the scheduler to schedule all tasks within the system by default. However, some users are still running older kernel versions. Therefore, we call "__COMPAT_scx_bpf_switch_all()" to move all tasks to the "SCHED_EXT" class so that scx_central also schedules all tasks by default on older kernels.	2024-05-24 15:12:34 +08:00
I Hsin Cheng	e605b067c6	scx_flatcg: Correct content error in comment A's share in the hierarchy should be 100/(200+100) = 1/3, whereas the comment used 200/(200+100), which doesn't equal 1/3. Correct the mistake by changing "200" to "100".	2024-05-21 13:27:26 +08:00
Tejun Heo	0181df54b5	Merge pull request #303 from sched-ext/simple_comment simple: Add comment explaining use of SHARED_DSQ	2024-05-20 06:45:13 -10:00
David Vernet	0dda4badd5	simple: Add comment explaining use of SHARED_DSQ scx_simple is a basic scheduler that does either basic vtime, or global FIFO, scheduling. At first glance, it may be confusing why we create a separate DSQ rather than just using SCX_DSQ_GLOBAL. Let's add a comment explaining the reason for this, so that users that are going over scx_simple as an example scheduler don't get confused. Signed-off-by: David Vernet <void@manifault.com>	2024-05-20 08:48:31 -05:00
Andrea Righi	c5a4a01994	scx_simple: re-add __COMPAT_scx_bpf_switch_all() Although newer kernels default to switching-all, some users might still be using the scheduler with older kernels. Therefore, ensure all tasks are moved to the SCHED_EXT class by calling __COMPAT_scx_bpf_switch_all() during init, so that scx_simple can still operate on these older kernels as well. Fixes: `cf66e58` ("Sync from kernel (670bdab6073)") Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-05-20 04:50:16 +02:00
David Vernet	b421cee59e	Merge pull request #291 from sched-ext/htejun/sync-kernel Sync from kernel (73f4013eb1eb)	2024-05-17 20:43:00 -05:00
Tejun Heo	ab25992416	Add missing skel.attach() calls The C SCX_OPS_ATTACH() and Rust scx_ops_attach() macros were not calling .attach() and were only attaching the struct_ops. This meant that all non-struct_ops BPF programs contained in the skels were never attached, which breaks e.g. scx_layered. Let's fix it by adding an .attach() invocation to the attach macros.	2024-05-17 14:33:04 -10:00
Tejun Heo	e26fba9255	Sync from kernel (73f4013eb1eb) This pulls in the support for dump ops.	2024-05-17 01:57:36 -10:00
vax-r	093a08356e	Fix typo Fix "expermentation" to "experimentation".	2024-05-09 12:10:55 +08:00
Tejun Heo	c77d101655	scheds/c: Sync to the new conventions Sync with the in-kernel-tree example schedulers.	2024-04-29 10:13:46 -10:00
Tejun Heo	cf66e58118	Sync from kernel (670bdab6073) And fix build breakage in scx_utils due to an enum type rename.	2024-04-29 09:58:19 -10:00
David Vernet	eed338ef25	simple: Invoke __COMPAT_scx_bpf_switch_all(); scx_simple no longer supports running in "partial" mode, with only certain tasks using scx_simple. When this option was removed, we also removed the call to scx_bpf_switch_all(). While switching-all is the default behavior for newer kernels, let's add __COMPAT_scx_bpf_switch_all() so that scx_simple can work on older kernels as well. Signed-off-by: David Vernet <void@manifault.com>	2024-04-16 11:09:44 -05:00
Andrea Righi	eca7ecd24e	build: introduce kernel_headers build option If we try to cross-build scx on builders with older versions of the system's Linux headers (such as those provided by linux-libc-headers in older releases of Ubuntu), we may hit build failures due to the different kernel ABI, such as: error: invalid use of undefined type ‘struct btf_enum64’ To address this, introduce a new build option called "kernel_headers" that allows specifying a custom path for the kernel headers required during the build process. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-04-04 10:53:36 +02:00
Tejun Heo	348fe53256	Sync from kernel Synchronize stragglers. - Bug fix in __COMPAT_read_enum(). - A cosmetic difference in scx_qmap.bpf.c. - Stray 'p' when calling getopt() in scx_simple.c. After this the kernel tree and scx repo are in sync.	2024-04-02 11:29:50 -10:00
Tejun Heo	891df57b98	scheds/c: Fix up C schedulers Fix up the remaining C schedulers after the recent API and header updates. Also drop stray -p from usage help message from some schedulers.	2024-03-29 12:15:45 -10:00
Tejun Heo	9447cb27b2	scx: Sync from kernel, some schedulers are broken Sync from kernel to receive new vmlinux.h and the updates to common headers. This includes the following updates: - scx_bpf_switch_all() is replaced by SCX_OPS_SWITCH_PARTIAL flag. - sched_ext_ops.exit_dump_len added to allow customizing dump buffer size. - scx_bpf_exit() added. - Common headers updated to provide backward compatibility in a way which hides most complexities from scheduler implementations. scx_simple, qmap, central and flatcg are updated accordingly. Other schedulers are broken for the moment.	2024-03-29 12:14:26 -10:00
Dave Marchevsky	3b7f33ea1b	scx_flatcg: Keep cgroup rb nodes stashed The flatcg scheduler uses a rb_node type - struct cgv_node - to keep track of vtime. On cgroup init, a cgv_node is created and stashed in a hashmap - cgv_node_stash - for later use. In cgroup_enqueued and try_pick_next_cgroup, the node is inserted into the rbtree, which required removing it from the stash before this patch's changes. This patch makes cgv_node refcounted, which allows keeping it in the stash for the entirety of the cgroup's lifetime. Unnecessary bpf_kptr_xchg's and other boilerplate can be removed as a result. Note that in addition to bpf_refcount patches, which have been upstream for quite some time, this change depends on a more recent series [0]. [0]: https://lore.kernel.org/bpf/20231107085639.3016113-1-davemarchevsky@fb.com/ Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>	2024-03-11 08:35:25 -07:00
Tejun Heo	2062d1ad1f	scx: Add compat support for SCX_KICK_IDLE and use it for idle CPU wakeups SCX_KICK_IDLE is a new feature which isn't defined in older kernels. Add compat wrapper and use it for idle CPU wakeups. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-06 15:28:40 -10:00
David Vernet	4108ece204	scx_userland: Increase scx_userland timeout This is meant to be an example scheduler that won't necessarily run well in production. Let's remove the 3-second timeout and use the system default of 30 seconds. Signed-off-by: David Vernet <void@manifault.com>	2024-02-04 16:23:18 -06:00
David Vernet	28a0b82be6	scx_userland: Print warning about poor performance Let's make it clear that this scheduler isn't expected to perform well, and instead point people to scx_rustland. Signed-off-by: David Vernet <void@manifault.com>	2024-02-04 16:02:57 -06:00
David Vernet	7a3fe759f2	scx_nest: Remove -D option for eager compaction Now that scheduling BPF timers works correctly, we don't need this extra logic to eagerly compact if compaction has been scheduled a few times in a row. Let's remove it. Signed-off-by: David Vernet <void@manifault.com>	2024-01-16 14:08:36 -06:00
David Vernet	607119d8a4	scx_nest: Set timer callback after cancelling In scx_nest, we use a per-cpu BPF timer to schedule compaction for a primary core before it goes idle. If a task comes along that could use that core, we cancel the callback with bpf_timer_cancel(). bpf_timer_cancel() drops a refcnt on the prog and nullifies the callback, so if we want to schedule the callback again, we must use bpf_timer_set_callback() to reset the prog. This patch does that. Reported-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: David Vernet <void@manifault.com>	2024-01-16 14:01:39 -06:00
Tejun Heo	18f7fe8477	scx_flatcg: Fix fallout from direct dispatch API update `552b75a9c7` ("scx: Build fix after kernel update") updated scx_flatcg along with other schedulers to use the new direct dispatching from ops.select_cpu() mechanism. However, this was buggy for flatcg. flatcg uses direct dispatch for two purposes - as an optimization when there are idle cpus and to avoid dealing with custom CPU affinities in the dispatch logic. While the former can be moved to ops.select_cpu(), the latter can't, as it should also apply to tasks which get enqueued without preceding ops.select_cpu(), e.g., when the task gets requeued after an attribute change or runs out of time slice. The API update incorrectly moved both to ops.select_cpu(), leading to futile retries of try_pick_next_cgroup() and scheduling misbehaviors. Fix it by separating out the two cases and only keeping the idle optimization case in ops.select_cpu(). Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-10 10:57:50 -10:00
Tejun Heo	c1f22ea073	scx_flatcg: Report pick_next_cgroup() race and fail counts To improve visibility into failure mode. While at it, improve output formatting. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-10 10:52:24 -10:00
Andrea Righi	0609abdca6	scx_flatcg: introduce CGROUP_MAX_RETRIES We may end up stalling for too long in fcg_dispatch() if try_pick_next_cgroup() doesn't find another valid cgroup to pick. This can be quite risky, considering that we are holding the rq lock in dispatch(). This condition can be reproduced easily in our CI, where we can trigger stalled softirq work: [ 4.972926] NOHZ tick-stop error: local softirq work is pending, handler #200!!! or RCU stalls: [ 47.731900] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 47.731900] rcu: 1-...!: (0 ticks this GP) idle=b29c/1/0x4000000000000000 softirq=2204/2204 fqs=0 [ 47.731900] rcu: 3-...!: (0 ticks this GP) idle=db74/1/0x4000000000000000 softirq=2286/2286 fqs=0 [ 47.731900] rcu: (detected by 0, t=26002 jiffies, g=6029, q=54 ncpus=4) [ 47.731900] Sending NMI from CPU 0 to CPUs 1: To mitigate this issue, reduce the number of try_pick_next_cgroup() retries from BPF_MAX_LOOPS (8M) to CGROUP_MAX_RETRIES (1024). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2024-01-10 17:36:17 +01:00
Tejun Heo	552b75a9c7	scx: Build fix after kernel update In the latest kernel, the sched_ext API has changed in two areas: - ops.prep_enable/cancel_enable/enable/disable() replaced with ops.init_task/enable/disable/exit_task(). - scx_bpf_dispatch() can now be called from ops.select_cpu(). Also, the SCX_ENQ_LOCAL flag is removed. Instead, users can call scx_bpf_select_cpu_dfl() from ops.select_cpu() and use the @is_idle out param value to determine whether to dispatch directly. This commit updates all schedulers so that they build. - Init functions renamed / merged / split. - ops.select_cpu() is added to several schedulers and local direct dispatching logic is moved there. This is the minimum update needed to make the schedulers build and work. Further updates are needed to e.g. move vtime updates to ops.enable().	2024-01-08 14:48:24 -10:00
David Vernet	e8978ebe23	scx_userland: Introduce ops.update_idle() callback We can sometimes hit scenarios in the scx_userland scheduler where there is work to be done in user space, but we incorrectly fail to run the user space scheduler. In order to avoid this, we can use global variables that are set from both BPF and user space. The BPF-side variable reflects when one or more tasks have been enqueued, and the user space-side variable reflects when user space has received tasks but has not yet dispatched them. In the ops.update_idle() callback, we can check these variables and send a resched IPI to a core to ensure that the user-space scheduler is always scheduled when there's work to be done. Signed-off-by: David Vernet <void@manifault.com>	2024-01-02 16:29:19 -06:00
Andrea Righi	dbc8e23980	scx_userland: flush stdout when printing stats Periodically flush stdout to help follow the scheduler's progress during testing. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-29 15:53:12 +01:00
Andrea Righi	614a1ff901	scx_flatcg: flush stdout when printing stats Periodically flush stdout to help follow the scheduler's progress during testing. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>	2023-12-29 15:53:12 +01:00
David Vernet	eb7b3c99f0	Merge pull request #40 from sched-ext/ci scx: Add CI action that builds schedulers for PRs	2023-12-18 21:17:47 -06:00
David Vernet	4523b10e45	scx: Add CI action that builds schedulers for PRs When Ubuntu ships with sched_ext, we can also maybe test loading the schedulers (not sure if the runners can run as root though). For now, we should at least have a CI job that lets us verify that the schedulers can _build_. To that end, this patch adds a basic CI action that builds the schedulers. As is, this is a bit brittle in that we're having to manually download and install a few dependencies. I don't see a better way for now without hosting our own runners with our own containers, but that's a bigger investment. For now, hopefully this will get us _some_ coverage. Signed-off-by: David Vernet <void@manifault.com>	2023-12-18 21:12:50 -06:00
David Vernet	318c06fa9c	nest: Skip out of idle cpu selection on exec() path The core sched code calls select_task_rq() in a few places: the task wakeup path (typical path), the fork() path, and the exec() path. For nest scheduling, we don't want to select a core from the nest on the exec() path. If we were previously able to find an idle core, we would have found it on the fork() path, so we don't gain much by checking on the exec() path. In fact, it's actually harmful, because we could incorrectly blow up the primary nest unnecessarily by bumping the same task between multiple cores for no reason. Let's just opt out of select_task_rq calls on the exec() path. Suggested-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: David Vernet <void@manifault.com>	2023-12-18 13:51:15 -06:00
David Vernet	ab0e36f9ce	scx_nest: Apply r_impatient if no task is found in primary nest Julia pointed out that our current implementation of r_impatient is incorrect. r_impatient is meant to be a mechanism for more aggressively growing the primary nest if a task repeatedly isn't able to find a core. Right now, we trigger r_impatient if we're not able to find an attached or previous core in the primary nest, but we _should_ be triggering it only if we're unable to find _any_ core in the primary nest. Fixing the implementation to do this drastically decreases how aggressively we grow the primary nest when r_impatient is in effect. Reported-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: David Vernet <void@manifault.com>	2023-12-18 11:05:36 -06:00
Jordan Rome	e9a9d32ab6	Restructure scheds folder names - combine c and kernel-examples as it's confusing to have both - rename 'rust-user' and 'c-user' to just 'rust' and 'c', which is simpler - update and fix sync-to-kernel.sh	2023-12-17 13:14:31 -08:00

36 Commits
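Commit 0609abdca6 in the listing above caps how many times fcg_dispatch() retries try_pick_next_cgroup(). It lowers the budget from BPF_MAX_LOOPS (8M) to CGROUP_MAX_RETRIES (1024) because the loop runs with the rq lock held. The following is a minimal user-space C sketch of that bounded-retry idea, not scx_flatcg's BPF code. The helpers try_pick_stub() and dispatch_bounded() are hypothetical stand-ins introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Retry cap named in commit 0609abdca6 (previously BPF_MAX_LOOPS, ~8M). */
#define CGROUP_MAX_RETRIES 1024

/*
 * Hypothetical stand-in for try_pick_next_cgroup(): fails while
 * *failures_left is positive (decrementing it on each failure) and
 * succeeds once it reaches zero.
 */
static bool try_pick_stub(int *failures_left)
{
	if (*failures_left > 0) {
		(*failures_left)--;
		return false;
	}
	return true;
}

/*
 * Hypothetical dispatch loop: attempts the pick at most CGROUP_MAX_RETRIES
 * times. Returns the number of attempts needed, or -1 if the retry budget
 * ran out before a pick succeeded.
 */
static int dispatch_bounded(int failures_before_success)
{
	int failures_left = failures_before_success;

	assert(failures_before_success >= 0);
	for (int attempt = 0; attempt < CGROUP_MAX_RETRIES; attempt++) {
		if (try_pick_stub(&failures_left))
			return attempt + 1;
	}
	return -1;
}
```

In this illustration, a dispatch that never finds a valid cgroup gives up after 1024 attempts rather than running for millions of iterations. That bounds how long the simulated critical section lasts.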